Tuesday, February 18, 2014

Cassandra SSTableSimpleUnsortedWriter and Non-Compact Storage

Patrick Callaghan of DataStax created a sample project showing how to bulk load data into Cassandra (thanks, Patrick!).
https://github.com/PatrickCallaghan/datastax-bulkload-example
The project writes a 'marker' for every row: a cell that carries values only for the clustering column(s) of the primary key and an empty string for the column value.
I asked him:
Why do you need the cql3 row marker in BulkLoadTransactions.java? I haven't seen any reference to this pattern before.
Is it mandatory? What happens if you don't add it?
He replied:
The row marker is an important part of the difference between compact and non-compact tables. I am creating a non-compact table and this requires a marker for the different clustering columns.
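To make the cell layout concrete, here is a minimal sketch that models, in plain Java, the internal cells behind one CQL row of a non-compact table. It does not call Cassandra at all: the class RowMarkerSketch, the Cell type, and the sample column names are hypothetical, and the empty final name component on the marker is my assumption about how the composite name is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowMarkerSketch {

    /** One internal cell: a composite name (list of components) plus a value. */
    static final class Cell {
        final List<String> name;
        final String value;

        Cell(List<String> name, String value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() {
            return name + " -> \"" + value + "\"";
        }
    }

    /**
     * Models the cells for one CQL row of a non-compact table:
     * first the row marker, then one cell per data column.
     */
    static List<Cell> cellsForRow(List<String> clusteringValues, Map<String, String> dataColumns) {
        List<Cell> cells = new ArrayList<>();

        // Row marker: clustering values only, an empty column-name component
        // (assumption), and an empty string as the value.
        List<String> markerName = new ArrayList<>(clusteringValues);
        markerName.add("");
        cells.add(new Cell(markerName, ""));

        // Regular cells: clustering values followed by the data column's name.
        for (Map.Entry<String, String> column : dataColumns.entrySet()) {
            List<String> cellName = new ArrayList<>(clusteringValues);
            cellName.add(column.getKey());
            cells.add(new Cell(cellName, column.getValue()));
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical row: one clustering column and two data columns.
        Map<String, String> dataColumns = new LinkedHashMap<>();
        dataColumns.put("amount", "9.99");
        dataColumns.put("merchant", "acme");

        for (Cell cell : cellsForRow(Arrays.asList("tx-1001"), dataColumns)) {
            System.out.println(cell);
        }
    }
}
```

In a real loader, each modelled cell would correspond to one column added to the writer, with the name built by a CompositeType comparator. The point of the sketch is only that the marker is an extra cell per row, written in addition to the data cells.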
Q: How much space do I save using Compact Storage?
A: Non-Compact Storage adds 2 bytes of overhead per internal cell. The comparator used for these cells is a CompositeType instead of a single-component comparator like UTF8Type.
Q: When can I use Compact storage?
A: You can use Compact Storage if your table has a compound primary key (more than one column in the PK) and only one data column, or if it has a single-column primary key.
Q: Is using Compact Storage recommended?
A: No. Non-Compact is the default option for new tables.
   1. The overhead is further diminished by sstable compression, which is enabled by default since Cassandra 1.1.0.
   2. Collections require CompositeType comparators, so Non-Compact Storage is strongly suggested if you want to be able to evolve your table with collections in the future.
   3. If your table uses a compound primary key, Compact Storage means you can't evolve the table to have more than one data column.
Q: What's the risk of not marking rows with the cql3 marker in SSTable files?
A: I don't know. I wouldn't want to be the first to find out :)
I did bulk load rows without the marker into a Non-Compact Storage table in a PoC and it worked well, but I wouldn't want to try it in production.

Bottom line, which I found surprising since it's almost undocumented: if you use a table with a compound primary key, it is best practice (maybe even required) to add an empty 'marker' cell to every row when using SSTableSimpleUnsortedWriter.