Abstract
Cassandra is a distributed storage system that spreads data across many servers located in geographically separate data centers. The data is highly available with no single point of failure. In many ways Cassandra resembles a database, but it does not support a full relational data model; instead it follows a simpler data model that gives applications dynamic control over data layout.
It has been designed to run on cheap infrastructure while remaining efficient for both writes and reads. Multi-datacenter support was part of the design from the very beginning, and Cassandra is at its best when it comes to cross-data-center replication. It is a good fit for an application with hundreds of millions of users spread across the globe, whose data also needs to be present in geographically distributed data centers to minimize latency.
Interesting Design Beneath
Management of data
The mechanism Cassandra uses to store data is very similar to a log-structured merge tree; it is not a B-tree. The storage engine writes to disk sequentially, in append mode, storing data contiguously. Cassandra works extremely well with SSDs (solid-state drives), minimizing wear on the drive by keeping I/O operations to a minimum.
Throughput and latency are both key performance factors when working with Cassandra. Because operations are performed in parallel, latency and throughput are largely independent of each other. Writes are very efficient. With a conventional engine, a small random write means reading an SSD sector, modifying it, and writing the entire sector back; Cassandra's log-structured design avoids those seeks and erase/rewrite block cycles, because updates never overwrite a row in place on disk. The design also integrates nicely with the operating system's page cache: since Cassandra does not modify data in place, the dirty pages that would otherwise have to be flushed out are never generated.
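A minimal sketch of the append-only idea in plain Java (not Cassandra's storage engine; the file name and record format are made up): an update is simply written as a new record at the end of the log, never by seeking into the file and modifying old bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrative only: an "update" is appended as a new record;
// nothing already on disk is sought out and modified in place.
public class AppendOnlyLog {
    private final Path log;

    public AppendOnlyLog(Path log) {
        this.log = log;
    }

    public void write(String rowKey, String value) throws IOException {
        String record = rowKey + "=" + value + System.lineSeparator();
        Files.write(log, record.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        AppendOnlyLog log = new AppendOnlyLog(Paths.get("data.log"));
        log.write("user:42", "name=Alice");
        log.write("user:42", "name=Alicia"); // an update is just another append
    }
}
```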
Write Mechanism
Consistency is one of Cassandra's important features and deserves appreciation. It refers to how up to date and synchronized a row is across all of its replicas. Cassandra uses a hinted handoff mechanism, which allows it to offer full write availability and improves consistency once a replica recovers from a temporary outage.
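A rough sketch of the idea behind hinted handoff (hypothetical classes, not Cassandra internals): when a replica cannot be reached, the coordinator keeps a hint for it locally and replays the missed writes once the replica is reachable again.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical coordinator-side sketch of hinted handoff.
public class HintedHandoffSketch {
    // Hints stored per downed replica, keyed by replica address.
    private final Map<String, Queue<String>> hints = new HashMap<>();

    /** Called when a write (mutation) cannot reach the given replica. */
    public void storeHint(String replica, String mutation) {
        hints.computeIfAbsent(replica, r -> new ArrayDeque<>()).add(mutation);
    }

    /** Called when the replica is seen alive again: replay everything it missed. */
    public void replayHints(String replica) {
        Queue<String> missed = hints.remove(replica);
        if (missed == null) {
            return;
        }
        for (String mutation : missed) {
            send(replica, mutation); // re-deliver the missed write
        }
    }

    private void send(String replica, String mutation) {
        System.out.println("sending to " + replica + ": " + mutation);
    }
}
```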
Cassandra's write path is built around MemTables and SSTables (Sorted String Tables). When a write occurs, the data is first stored in an in-memory structure called the MemTable. Cassandra can manage the memory allocated to MemTables itself, or it can be configured explicitly. Writes perform better than in a relational database because tables in Cassandra are not related, so no additional work is needed to maintain integrity between them. While writing to the MemTable, Cassandra also appends the write to a commit log on disk. Once the data in a MemTable reaches a certain size, it is flushed to an SSTable. MemTables and SSTables are maintained per table. SSTables are immutable, so the updates for a row may be stored across multiple SSTables.
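A heavily simplified sketch of this write path (hypothetical classes, local files only): append to a commit log for durability, insert into a sorted in-memory MemTable, and flush to an immutable SSTable file once a threshold is reached.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, heavily simplified sketch of the MemTable/SSTable write path.
public class WritePathSketch {
    private final Path commitLog = Paths.get("commitlog.txt");
    private final TreeMap<String, String> memTable = new TreeMap<>(); // sorted by key
    private static final int FLUSH_THRESHOLD = 1000;                  // rows per flush
    private int sstableCount = 0;

    public void write(String key, String value) throws IOException {
        // 1. Durability first: append the write to the commit log.
        String record = key + "=" + value + System.lineSeparator();
        Files.write(commitLog, record.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // 2. Apply the write to the in-memory, sorted MemTable.
        memTable.put(key, value);

        // 3. When the MemTable is large enough, flush it to an immutable SSTable.
        if (memTable.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    private void flush() throws IOException {
        Path sstable = Paths.get("sstable-" + (sstableCount++) + ".txt");
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : memTable.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        Files.write(sstable, sb.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW); // written once, never modified
        memTable.clear();
    }
}
```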
Read Mechanism
Reads in Cassandra are parallel and low latency. Before doing any disk I/O, a Bloom filter tells Cassandra how likely it is that an SSTable contains data for the requested partition key. When a read request for a row comes to a node, Cassandra checks the Bloom filter of each SSTable and only reads the files that may actually hold the row. Cassandra offers a leveled compaction strategy, which keeps data from being fragmented across many SSTables, and it also provides caching, in the form of a partition key cache as well as a row cache, which allows much faster access to data.
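As an illustration of the Bloom filter idea (using Google's Guava library rather than Cassandra's own implementation), a membership test that can rule out an SSTable without touching disk:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Illustration of the Bloom filter idea using Guava; Cassandra has its own
// implementation, but the principle is the same.
public class BloomFilterDemo {
    public static void main(String[] args) {
        // Expect ~100,000 partition keys with a 1% false-positive rate.
        BloomFilter<String> keysInSSTable = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000, 0.01);

        keysInSSTable.put("user:42");
        keysInSSTable.put("user:43");

        // "false" means the key is definitely NOT in this SSTable: skip the disk read.
        // "true" means it *might* be there, so the SSTable must actually be read.
        System.out.println(keysInSSTable.mightContain("user:42"));  // true
        System.out.println(keysInSSTable.mightContain("user:999")); // almost certainly false
    }
}
```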
Features to admire in Cassandra
Order of magnitude more data per node in the cluster
With a typical Java-based system like Cassandra, we can allocate at most around 8 GB of heap memory; going beyond that can cause the garbage collector to pause, even in the young generation. Cassandra's design has been thought through very carefully to minimize this kind of heap pressure and the fragmentation of data.
In Cassandra 1.1 the Bloom filter, partition key cache, partition summary, and compression offsets were all stored in heap memory. Of these four components, only the partition key cache has a fixed size; the other three grow in direct proportion to the data. So, to make better use of memory, in version 2.0 the Bloom filter, partition key cache, partition summary, and compression offsets were moved off-heap.
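In Java terms, "off-heap" simply means memory the garbage collector does not manage. A tiny illustration of the concept (not Cassandra's actual allocation code) using a direct ByteBuffer:

```java
import java.nio.ByteBuffer;

// Illustration of off-heap allocation in Java: direct buffers live outside
// the garbage-collected heap, so growing them does not add GC pressure.
public class OffHeapDemo {
    public static void main(String[] args) {
        // 64 MB allocated outside the Java heap.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        offHeap.putLong(0, 123456789L);          // write at an absolute offset
        System.out.println(offHeap.getLong(0));  // read it back
        System.out.println(offHeap.isDirect());  // true
    }
}
```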
Smoother compaction throttling
Compaction throttling allows setting a target rate for how much I/O the compaction system is allowed to use (the compaction_throughput_mb_per_sec setting). Tuning the compaction throttle keeps compactions from starving reads and writes. The throttling was smoothed out further in version 1.2.5.
No manual token management
Originally, when a node was added to the cluster, the bootstrap process would try to pick its token by bisecting the range of the most heavily loaded node in the cluster. This mechanism never worked as expected, and in practice the best way to handle it was to assign a token manually for the node to join the cluster. To solve the problem, token-range bisection was removed and replaced with virtual nodes. Virtual nodes give each node a large number of small partition ranges distributed throughout the cluster, and they use consistent hashing to distribute the data, which removes the need to generate and assign tokens by hand (a sketch of the idea follows the list below).
- Balancing is taken care of automatically: data is distributed evenly when a new node joins the cluster.
- When a node fails, its data is distributed evenly among the remaining nodes in the cluster.
- Rebuilding a dead node is potentially faster, because data from the other nodes is sent to the replacement node incrementally instead of waiting for a validation pass.
- Finally, we are freed from calculating and assigning tokens to each node.
- Concurrent streaming.
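A minimal sketch of consistent hashing with virtual nodes (illustrative only; Cassandra uses its own partitioners and token allocation, and the node addresses below are made up): each physical node is hashed onto the ring many times, and a key is owned by the first virtual node at or after its token.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Illustrative consistent-hashing ring with virtual nodes (not Cassandra's partitioner).
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> physical node
    private final int vnodesPerNode;

    public ConsistentHashRing(int vnodesPerNode) {
        this.vnodesPerNode = vnodesPerNode;
    }

    public void addNode(String node) {
        // Each physical node is placed on the ring many times (virtual nodes),
        // so its ownership is split into many small ranges.
        for (int i = 0; i < vnodesPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String nodeFor(String key) {
        long token = hash(key);
        // The owner is the first virtual node at or after the key's token,
        // wrapping around to the start of the ring if necessary.
        Map.Entry<Long, String> owner = ring.ceilingEntry(token);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue();
    }

    private static long hash(String s) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(256); // 256 vnodes per node
        ring.addNode("10.0.0.1");
        ring.addNode("10.0.0.2");
        ring.addNode("10.0.0.3");
        System.out.println(ring.nodeFor("user:42"));
    }
}
```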
Consistency
Cassandra offers eventual consistency, but you can opt for tunable consistency. Eventual consistency goes hand in hand with concurrency, meaning operations are always going on at the same time across the cluster. Like any distributed system, Cassandra is subject to the CAP theorem (Consistency, Availability, and Partition tolerance): it cannot fully satisfy all three at the same time, so the trade-off can be tuned to our needs. It also offers serial consistency, but using serial consistency means we may need to compromise on performance.
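As an example of tunable consistency with the DataStax Java driver (a sketch; the contact point, keyspace, and users table are placeholder names), the consistency level can be chosen per statement:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Sketch of tunable consistency with the DataStax Java driver;
// "127.0.0.1", "demo" and the users table are placeholder names.
public class TunableConsistencyDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");

        // QUORUM: a majority of replicas must respond to the read.
        SimpleStatement read = new SimpleStatement(
                "SELECT * FROM users WHERE user_id = 42");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        ResultSet rows = session.execute(read);
        System.out.println(rows.one());

        // ONE: a single replica is enough; lower latency, weaker consistency.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO users (user_id, name) VALUES (43, 'Alice')");
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        cluster.close();
    }
}
```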
Triggers
Having triggers is a bonus point for using Cassandra where immediate consistency is required. A trigger is written by implementing the ITrigger interface, which basically deals with RowMutation and ColumnFamily objects.
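A minimal trigger sketch, assuming the Cassandra 2.0-era ITrigger signature (the interface differs between versions); it only logs the update and returns no additional mutations:

```java
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.Collections;

import org.apache.cassandra.db.ColumnFamily;
import org.apache.cassandra.db.RowMutation;
import org.apache.cassandra.triggers.ITrigger;

// Sketch of a trigger against the Cassandra 2.0-era API (the signature may
// differ in other versions). It just notes that an update arrived.
public class AuditTrigger implements ITrigger {
    @Override
    public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update) {
        // Log that an update arrived; a real trigger could build and return
        // extra RowMutation objects here (e.g. writes to an audit table).
        System.out.println("trigger fired for a partition key of "
                + key.remaining() + " bytes");
        return Collections.emptyList();
    }
}
```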
Data model
Cassandra has a rich data model: it is essentially a schema-optional, column-oriented model. A keyspace is the container for your application data. Inside a keyspace there are one or more column families. A column family is a set of columns identified by a row key. Cassandra does not enforce relationships between column families the way a relational database does: there are no foreign keys, and joins at query time are not supported, which helps Cassandra perform much faster.
Working with Cassandra
Working with Cassandra has become much simpler and easier compared to the Thrift-based APIs, which exposed Cassandra's internal storage structure.
CQL
CQL is a structured query language for querying Cassandra. It is based on a partitioned row-store data model with tunable consistency. Tables can be created, altered, and dropped at runtime without blocking updates and queries. It does not support joins or subqueries.
For more information: http://www.datastax.com/documentation/cql/3.0/webhelp/index.html
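As an illustration, a few CQL statements executed through the DataStax Java driver (the demo keyspace and users table are made-up names; the same statements could be run from cqlsh):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Sketch of basic CQL usage via the DataStax Java driver;
// the "demo" keyspace and "users" table are placeholder names.
public class CqlDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Schema can be created and changed at runtime.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.users ("
                + "user_id int PRIMARY KEY, name text)");

        // No joins or subqueries: queries address a single table by its key.
        session.execute("INSERT INTO demo.users (user_id, name) VALUES (42, 'Alice')");
        ResultSet rs = session.execute("SELECT name FROM demo.users WHERE user_id = 42");
        for (Row row : rs) {
            System.out.println(row.getString("name"));
        }

        cluster.close();
    }
}
```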
Java Drivers
There are many Java APIs available in the open: Pelops, Kundera, Hector, Thrift, Virgil, the DataStax Java driver, and Astyanax, to name a few. We are going to focus mostly on Astyanax. Astyanax evolved at Netflix as a solution to problems they faced while interacting with Cassandra, and it can be described as a refactored version of Hector. It has a few advantages:
- Ease of use of the API
- Composite column support
- Connection pooling
- Latency awareness
- Good documentation
- A chunked object store that breaks a large object into multiple chunks, stores one row per chunk, parallelizes the upload, and retries a small chunk instead of the entire object (see the sketch at the end of this section)
Astyanax currently supports Cassandra 1.2; an official release with support for Cassandra 2.0 is yet to come.
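The chunked-object idea can be sketched in plain Java (this is not Astyanax's actual object-store API; storeChunk below is a made-up stand-in for a single-row write):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustration of chunked object storage: split a large object into chunks,
// write one "row" per chunk in parallel, and retry only the failed chunk.
public class ChunkedUploadSketch {
    private static final int CHUNK_SIZE = 64 * 1024; // 64 KB per chunk
    private static final int MAX_RETRIES = 3;

    public static void upload(String objectKey, byte[] object) throws InterruptedException {
        List<byte[]> chunks = split(object);
        ExecutorService pool = Executors.newFixedThreadPool(4); // parallel upload
        for (int i = 0; i < chunks.size(); i++) {
            final int chunkIndex = i;
            final byte[] chunk = chunks.get(i);
            pool.submit(() -> writeWithRetry(objectKey + ":" + chunkIndex, chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static List<byte[]> split(byte[] object) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < object.length; offset += CHUNK_SIZE) {
            int end = Math.min(offset + CHUNK_SIZE, object.length);
            chunks.add(Arrays.copyOfRange(object, offset, end));
        }
        return chunks;
    }

    private static void writeWithRetry(String rowKey, byte[] chunk) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                storeChunk(rowKey, chunk); // one row per chunk
                return;
            } catch (RuntimeException e) {
                // Only this small chunk is retried, never the whole object.
            }
        }
    }

    // Made-up stand-in for a single-row write through a Cassandra client.
    private static void storeChunk(String rowKey, byte[] chunk) {
        System.out.println("stored " + chunk.length + " bytes under row " + rowKey);
    }
}
```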