Abstract
Cassandra is a distributed storage system that spreads data across many servers located in geographically separate data centers. The data is highly available with no single point of failure. In many ways Cassandra resembles a database, but it does not support a full relational data model; instead it follows a simpler data model that gives applications dynamic control over data layout.
It has been designed to run on cheap infrastructure while remaining efficient for both writes and reads. Multi-datacenter support was part of the design from the very beginning, and Cassandra is at its best when it comes to cross-data-center replication. It is a good fit for an application with hundreds of millions of users spread across the globe, whose data also needs to be present in geographically distributed data centers to minimize latency.
Interesting Design Beneath
Management of data
The mechanism Cassandra uses to store data is very similar to a log-structured merge tree; it is not a B-tree. The storage engine writes to disk sequentially, in append mode, storing data contiguously. Cassandra works extremely well with SSDs (solid-state drives), minimizing wear on the drive by keeping I/O operations to a minimum.
Throughput and latency are both key performance factors when working with Cassandra. Because operations are performed in parallel, latency and throughput are largely independent of each other. Writes are very efficient. With a conventional engine, a small random write means reading an SSD sector, modifying it, and writing the entire sector back; Cassandra's log-structured design avoids those seeks and erase/rewrite block cycles, because updates never overwrite a row in place on disk. The design also integrates nicely with the operating system's page cache: since Cassandra does not modify data in place, the dirty pages that would otherwise have to be flushed out are never generated.
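A minimal sketch of the append-only idea in plain Java (not Cassandra's storage engine; the file name and record format are made up): an update is simply written as a new record at the end of the log, never by seeking into the file and modifying old bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrative only: an "update" is appended as a new record;
// nothing already on disk is sought out and modified in place.
public class AppendOnlyLog {
    private final Path log;

    public AppendOnlyLog(Path log) {
        this.log = log;
    }

    public void write(String rowKey, String value) throws IOException {
        String record = rowKey + "=" + value + System.lineSeparator();
        Files.write(log, record.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        AppendOnlyLog log = new AppendOnlyLog(Paths.get("data.log"));
        log.write("user:42", "name=Alice");
        log.write("user:42", "name=Alicia"); // an update is just another append
    }
}
```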
Write Mechanism
Consistency is one of Cassandra's important features and deserves appreciation. It refers to how up to date and synchronized a row is across all of its replicas. Cassandra uses a hinted handoff mechanism, which allows it to offer full write availability and improves consistency once a replica recovers from a temporary outage.
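A rough sketch of the idea behind hinted handoff (hypothetical classes, not Cassandra internals): when a replica cannot be reached, the coordinator keeps a hint for it locally and replays the missed writes once the replica is reachable again.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical coordinator-side sketch of hinted handoff.
public class HintedHandoffSketch {
    // Hints stored per downed replica, keyed by replica address.
    private final Map<String, Queue<String>> hints = new HashMap<>();

    /** Called when a write (mutation) cannot reach the given replica. */
    public void storeHint(String replica, String mutation) {
        hints.computeIfAbsent(replica, r -> new ArrayDeque<>()).add(mutation);
    }

    /** Called when the replica is seen alive again: replay everything it missed. */
    public void replayHints(String replica) {
        Queue<String> missed = hints.remove(replica);
        if (missed == null) {
            return;
        }
        for (String mutation : missed) {
            send(replica, mutation); // re-deliver the missed write
        }
    }

    private void send(String replica, String mutation) {
        System.out.println("sending to " + replica + ": " + mutation);
    }
}
```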
Cassandra's write path is built around MemTables and SSTables (Sorted String Tables). When a write occurs, the data is first stored in an in-memory structure called the MemTable. Cassandra can manage the memory allocated to MemTables itself, or it can be configured explicitly. Writes perform better than in a relational database because tables in Cassandra are not related, so no additional work is needed to maintain integrity between them. While writing to the MemTable, Cassandra also appends the write to a commit log on disk. Once the data in a MemTable reaches a certain size, it is flushed to an SSTable. MemTables and SSTables are maintained per table. SSTables are immutable, so the updates for a row may be stored across multiple SSTables.
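A heavily simplified sketch of this write path (hypothetical classes, local files only): append to a commit log for durability, insert into a sorted in-memory MemTable, and flush to an immutable SSTable file once a threshold is reached.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, heavily simplified sketch of the MemTable/SSTable write path.
public class WritePathSketch {
    private final Path commitLog = Paths.get("commitlog.txt");
    private final TreeMap<String, String> memTable = new TreeMap<>(); // sorted by key
    private static final int FLUSH_THRESHOLD = 1000;                  // rows per flush
    private int sstableCount = 0;

    public void write(String key, String value) throws IOException {
        // 1. Durability first: append the write to the commit log.
        String record = key + "=" + value + System.lineSeparator();
        Files.write(commitLog, record.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // 2. Apply the write to the in-memory, sorted MemTable.
        memTable.put(key, value);

        // 3. When the MemTable is large enough, flush it to an immutable SSTable.
        if (memTable.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    private void flush() throws IOException {
        Path sstable = Paths.get("sstable-" + (sstableCount++) + ".txt");
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : memTable.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        Files.write(sstable, sb.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW); // written once, never modified
        memTable.clear();
    }
}
```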
Read Mechanism
Reads in Cassandra are parallel and low latency. Before doing any disk I/O, a Bloom filter tells Cassandra how likely it is that an SSTable contains data for the requested partition key. When a read request for a row comes to a node, Cassandra checks the Bloom filter of each SSTable and only reads the files that may actually hold the row. Cassandra offers a leveled compaction strategy, which keeps data from being fragmented across many SSTables, and it also provides caching, in the form of a partition key cache as well as a row cache, which allows much faster access to data.
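As an illustration of the Bloom filter idea (using Google's Guava library rather than Cassandra's own implementation), a membership test that can rule out an SSTable without touching disk:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Illustration of the Bloom filter idea using Guava; Cassandra has its own
// implementation, but the principle is the same.
public class BloomFilterDemo {
    public static void main(String[] args) {
        // Expect ~100,000 partition keys with a 1% false-positive rate.
        BloomFilter<String> keysInSSTable = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000, 0.01);

        keysInSSTable.put("user:42");
        keysInSSTable.put("user:43");

        // "false" means the key is definitely NOT in this SSTable: skip the disk read.
        // "true" means it *might* be there, so the SSTable must actually be read.
        System.out.println(keysInSSTable.mightContain("user:42"));  // true
        System.out.println(keysInSSTable.mightContain("user:999")); // almost certainly false
    }
}
```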
Features to admire in Cassandra
Order of magnitude more data per node in the cluster
With a typical Java-based system like Cassandra, we can allocate at most around 8 GB of heap memory; going beyond that can cause the garbage collector to pause, even in the young generation. Cassandra's design has been thought through very carefully to minimize this kind of heap pressure and the fragmentation of data.
In Cassandra 1.1 the Bloom filter, partition key cache, partition summary, and compression offsets were all stored in heap memory. Of these four components, only the partition key cache has a fixed size; the other three grow in direct proportion to the data. So, to make better use of memory, in version 2.0 the Bloom filter, partition key cache, partition summary, and compression offsets were moved off-heap.
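In Java terms, "off-heap" simply means memory the garbage collector does not manage. A tiny illustration of the concept (not Cassandra's actual allocation code) using a direct ByteBuffer:

```java
import java.nio.ByteBuffer;

// Illustration of off-heap allocation in Java: direct buffers live outside
// the garbage-collected heap, so growing them does not add GC pressure.
public class OffHeapDemo {
    public static void main(String[] args) {
        // 64 MB allocated outside the Java heap.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        offHeap.putLong(0, 123456789L);          // write at an absolute offset
        System.out.println(offHeap.getLong(0));  // read it back
        System.out.println(offHeap.isDirect());  // true
    }
}
```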
Smoother compaction throttling
Compaction throttling allows setting a target rate for how much I/O the compaction system is allowed to use (the compaction_throughput_mb_per_sec setting). Tuning the compaction throttle keeps compactions from starving reads and writes. The throttling was smoothed out further in version 1.2.5.
No manual token management
Originally, when a node was added to the cluster, the bootstrap process would try to pick its token by bisecting the range of the most heavily loaded node in the cluster. This mechanism never worked as expected, and in practice the best way to handle it was to assign a token manually for the node to join the cluster. To solve the problem, token-range bisection was removed and replaced with virtual nodes. Virtual nodes give each node a large number of small partition ranges distributed throughout the cluster, and they use consistent hashing to distribute the data, which removes the need to generate and assign tokens by hand (a sketch of the idea follows the list below).
- Balancing is taken care of automatically: data is distributed evenly when a new node joins the cluster.
- When a node fails, its data is distributed evenly among the remaining nodes in the cluster.
- Rebuilding a dead node is potentially faster, because data from the other nodes is sent to the replacement node incrementally instead of waiting for a validation pass.
- Finally, we are freed from calculating and assigning tokens to each node.
- Concurrent streaming.
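A minimal sketch of consistent hashing with virtual nodes (illustrative only; Cassandra uses its own partitioners and token allocation, and the node addresses below are made up): each physical node is hashed onto the ring many times, and a key is owned by the first virtual node at or after its token.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Illustrative consistent-hashing ring with virtual nodes (not Cassandra's partitioner).
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> physical node
    private final int vnodesPerNode;

    public ConsistentHashRing(int vnodesPerNode) {
        this.vnodesPerNode = vnodesPerNode;
    }

    public void addNode(String node) {
        // Each physical node is placed on the ring many times (virtual nodes),
        // so its ownership is split into many small ranges.
        for (int i = 0; i < vnodesPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String nodeFor(String key) {
        long token = hash(key);
        // The owner is the first virtual node at or after the key's token,
        // wrapping around to the start of the ring if necessary.
        Map.Entry<Long, String> owner = ring.ceilingEntry(token);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue();
    }

    private static long hash(String s) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(256); // 256 vnodes per node
        ring.addNode("10.0.0.1");
        ring.addNode("10.0.0.2");
        ring.addNode("10.0.0.3");
        System.out.println(ring.nodeFor("user:42"));
    }
}
```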
Consistency
Cassandra offers eventual consistency, but you can opt for tunable consistency. Eventual consistency goes hand in hand with concurrency, meaning operations are always going on at the same time across the cluster. Like any distributed system, Cassandra is subject to the CAP theorem (Consistency, Availability, and Partition tolerance): it cannot fully satisfy all three at the same time, so the trade-off can be tuned to our needs. It also offers serial consistency, but using serial consistency means we may need to compromise on performance.
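As an example of tunable consistency with the DataStax Java driver (a sketch; the contact point, keyspace, and users table are placeholder names), the consistency level can be chosen per statement:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Sketch of tunable consistency with the DataStax Java driver;
// "127.0.0.1", "demo" and the users table are placeholder names.
public class TunableConsistencyDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");

        // QUORUM: a majority of replicas must respond to the read.
        SimpleStatement read = new SimpleStatement(
                "SELECT * FROM users WHERE user_id = 42");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        ResultSet rows = session.execute(read);
        System.out.println(rows.one());

        // ONE: a single replica is enough; lower latency, weaker consistency.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO users (user_id, name) VALUES (43, 'Alice')");
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        cluster.close();
    }
}
```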
Triggers
Having triggers is a bonus point for using Cassandra where immediate consistency is required. A trigger is written by implementing the ITrigger interface, which basically deals with RowMutation and ColumnFamily objects.
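A minimal trigger sketch, assuming the Cassandra 2.0-era ITrigger signature (the interface differs between versions); it only logs the update and returns no additional mutations:

```java
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.Collections;

import org.apache.cassandra.db.ColumnFamily;
import org.apache.cassandra.db.RowMutation;
import org.apache.cassandra.triggers.ITrigger;

// Sketch of a trigger against the Cassandra 2.0-era API (the signature may
// differ in other versions). It just notes that an update arrived.
public class AuditTrigger implements ITrigger {
    @Override
    public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update) {
        // Log that an update arrived; a real trigger could build and return
        // extra RowMutation objects here (e.g. writes to an audit table).
        System.out.println("trigger fired for a partition key of "
                + key.remaining() + " bytes");
        return Collections.emptyList();
    }
}
```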
Data model
Cassandra has a rich data model: it is essentially a schema-optional, column-oriented model. A keyspace is the container for your application data. Inside a keyspace there are one or more column families. A column family is a set of columns identified by a row key. Cassandra does not enforce relationships between column families the way a relational database does: there are no foreign keys, and joins at query time are not supported, which helps Cassandra perform much faster.
Working with Cassandra
Working with Cassandra has become much simpler and easier compared to the Thrift-based APIs, which exposed Cassandra's internal storage structure.
CQL
CQL is a structured query language for querying Cassandra. It is based on a partitioned row-store data model with tunable consistency. Tables can be created, altered, and dropped at runtime without blocking updates and queries. It does not support joins or subqueries.
For more information: http://www.datastax.com/documentation/cql/3.0/webhelp/index.html
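As an illustration, a few CQL statements executed through the DataStax Java driver (the demo keyspace and users table are made-up names; the same statements could be run from cqlsh):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Sketch of basic CQL usage via the DataStax Java driver;
// the "demo" keyspace and "users" table are placeholder names.
public class CqlDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Schema can be created and changed at runtime.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.users ("
                + "user_id int PRIMARY KEY, name text)");

        // No joins or subqueries: queries address a single table by its key.
        session.execute("INSERT INTO demo.users (user_id, name) VALUES (42, 'Alice')");
        ResultSet rs = session.execute("SELECT name FROM demo.users WHERE user_id = 42");
        for (Row row : rs) {
            System.out.println(row.getString("name"));
        }

        cluster.close();
    }
}
```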
Java Drivers
There are many Java APIs available in the open: Pelops, Kundera, Hector, Thrift, Virgil, the DataStax Java driver, and Astyanax, to name a few. We are going to focus mostly on Astyanax. Astyanax evolved at Netflix as a solution to problems they faced while interacting with Cassandra, and it can be described as a refactored version of Hector. It has a few advantages:
- Ease of use of the API
- Composite column support
- Connection pooling
- Latency awareness
- Good documentation
- A chunked object store that breaks a large object into multiple chunks, stores one row per chunk, parallelizes the upload, and retries a small chunk instead of the entire object (see the sketch at the end of this section)
Astyanax currently supports Cassandra 1.2; an official release with support for Cassandra 2.0 is yet to come.
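The chunked-object idea can be sketched in plain Java (this is not Astyanax's actual object-store API; storeChunk below is a made-up stand-in for a single-row write):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustration of chunked object storage: split a large object into chunks,
// write one "row" per chunk in parallel, and retry only the failed chunk.
public class ChunkedUploadSketch {
    private static final int CHUNK_SIZE = 64 * 1024; // 64 KB per chunk
    private static final int MAX_RETRIES = 3;

    public static void upload(String objectKey, byte[] object) throws InterruptedException {
        List<byte[]> chunks = split(object);
        ExecutorService pool = Executors.newFixedThreadPool(4); // parallel upload
        for (int i = 0; i < chunks.size(); i++) {
            final int chunkIndex = i;
            final byte[] chunk = chunks.get(i);
            pool.submit(() -> writeWithRetry(objectKey + ":" + chunkIndex, chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static List<byte[]> split(byte[] object) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < object.length; offset += CHUNK_SIZE) {
            int end = Math.min(offset + CHUNK_SIZE, object.length);
            chunks.add(Arrays.copyOfRange(object, offset, end));
        }
        return chunks;
    }

    private static void writeWithRetry(String rowKey, byte[] chunk) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                storeChunk(rowKey, chunk); // one row per chunk
                return;
            } catch (RuntimeException e) {
                // Only this small chunk is retried, never the whole object.
            }
        }
    }

    // Made-up stand-in for a single-row write through a Cassandra client.
    private static void storeChunk(String rowKey, byte[] chunk) {
        System.out.println("stored " + chunk.length + " bytes under row " + rowKey);
    }
}
```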