Monday, December 9, 2013

Understanding the architecture of Cassandra

What is Cassandra?

Some of us know Cassandra and some of us may not. Cassandra is a high-performance, fault-tolerant, highly scalable, distributed database management system. It is not relational; it can be described as a post-relational database solution. It can serve as a real-time data store for online/transactional applications, and it can also be used as a read-intensive database for business intelligence systems.

Overview of Cassandra Architecture

Cassandra was designed with the idea that failure is the key to success, i.e. hardware will fail, systems will crash, and the data still has to survive. It is a peer-to-peer database management system with no concept of master or slave. Because all nodes are treated the same, you can read from or write to any Cassandra node in the cluster. Data is partitioned across the nodes, and the system stays fault tolerant by replicating each piece of data to more than one node, using a replication strategy that you can configure.
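To make partitioning and replication a little more concrete, here is a minimal sketch in Python. It is not Cassandra's actual code: the Ring class, the node names, the use of MD5 and the "walk clockwise to the next nodes" rule are all assumptions chosen for illustration. The idea it shows is that a key is hashed to a position on a ring, and the copies go to the node that owns that position plus the nodes that follow it.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used here only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical helper: places keys on a ring of nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Each node owns one token; sort them so the ring can be walked clockwise.
        self.tokens = sorted((token(node), node) for node in nodes)

    def replicas(self, key):
        """Return the nodes that would hold `key`: the owner plus the next nodes clockwise."""
        positions = [position for position, _ in self.tokens]
        # First node whose token is >= the key's token, wrapping around at the end.
        start = bisect.bisect_left(positions, token(key)) % len(self.tokens)
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("user:42")
print(owners)  # three distinct node names; which ones depends on the hash
```

Any node can compute the same list for the same key, which is why a client can talk to any node and still have the request reach the right replicas.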

Nodes in Cassandra communicate with each other through a gossip protocol, which exchanges state information across the cluster at regular intervals.
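The gossip idea can be sketched in a few lines of Python. This is a toy simulation, not Cassandra's protocol: the merge and gossip_round functions, the heartbeat counters and the node names are hypothetical. Each node keeps its own view of the cluster, and in every round it swaps views with one randomly chosen peer, keeping the newest information for each node.

```python
import random


def merge(local, remote):
    """Combine two views of the cluster, keeping the highest heartbeat seen per node."""
    merged = dict(local)
    for node, heartbeat in remote.items():
        if heartbeat > merged.get(node, -1):
            merged[node] = heartbeat
    return merged


def gossip_round(views, rng):
    """Every node exchanges its view with one randomly chosen peer."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        combined = merge(views[name], views[peer])
        views[name] = combined
        views[peer] = dict(combined)


# Each node starts out knowing only its own heartbeat counter.
views = {name: {name: 1} for name in ["node-a", "node-b", "node-c", "node-d"]}
rng = random.Random(7)  # fixed seed so the run is repeatable

rounds = 0
while any(len(view) < len(views) for view in views.values()) and rounds < 100:
    gossip_round(views, rng)
    rounds += 1

print(rounds, views["node-a"])
```

No node ever talks to every other node in one go, yet after a handful of rounds every node has heard about every other node.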

To ensure durability, every write to Cassandra is first appended to a commit log. The data is then written to an in-memory data structure called the memtable. Once the memtable is full, its contents are flushed to disk as an SSTable.
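The write path above can be sketched as a small Python class. This is a simplified model under stated assumptions: the Table class and its memtable_limit parameter are made up for the example, "disk" is just a list in memory, and the memtable counts as full after a fixed number of keys rather than a size in bytes.

```python
class Table:
    """Hypothetical model of the commit log -> memtable -> SSTable write path."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # append-only record of writes, kept for durability
        self.memtable = {}    # recent writes held in memory
        self.sstables = []    # immutable, sorted snapshots standing in for files on disk

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. memtable is full: write it out

    def flush(self):
        # An SSTable is sorted by key and never modified after it is written.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # The flushed data is now safe on "disk", so its log entries can go.
        self.commit_log.clear()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


table = Table(memtable_limit=3)
for key, value in [("k1", "a"), ("k2", "b"), ("k3", "c"), ("k4", "d")]:
    table.write(key, value)

print(len(table.sstables), table.memtable)  # prints: 1 {'k4': 'd'}
```

The first three writes fill the memtable and are flushed together as one SSTable; the fourth write sits in the new memtable and in the commit log until the next flush.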

The data is contained within a schema modeled on Google Bigtable: a row-oriented, column-structured design. It has the concept of a keyspace, which is similar to a database in a relational system. The column family is the core object for managing data and is very similar to a table in a relational database management system, but its schema is more flexible and dynamic in nature. A row in a column family is indexed by its key, and other columns can be indexed as well.
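One way to picture this data model is as nested dictionaries. The sketch below is only a mental model in Python, not how Cassandra stores anything: the users column family, the sample rows and the build_index helper are invented for the example. It shows rows looked up by key, rows in the same column family carrying different columns, and an index built on a non-key column.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "users": {
        "user:1": {"name": "Asha", "city": "Pune"},
        "user:2": {"name": "Ben", "city": "Oslo", "twitter": "@ben"},
        "user:3": {"name": "Chen", "city": "Pune"},
    }
}


def build_index(column_family, column):
    """Hypothetical secondary index: column value -> list of row keys."""
    index = {}
    for row_key, columns in column_family.items():
        if column in columns:
            index.setdefault(columns[column], []).append(row_key)
    return index


users = keyspace["users"]
by_city = build_index(users, "city")

print(users["user:2"])   # look up a row by its key
print(by_city["Pune"])   # prints: ['user:1', 'user:3']
```

Note that only the second row has a twitter column; nothing forces the other rows to define it, which is the flexibility mentioned above.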


Why would you use Cassandra?

1. Scales from gigabytes to petabytes.
2. Performance grows linearly as nodes are added.
3. No single point of failure.
4. Data is distributed, and replication is easy to set up.
5. Can run across multiple data centers and in the cloud.
6. No separate caching layer is needed.
7. Tunable data consistency.
8. Flexible schema design.
9. SQL-like query language (CQL).
10. Supports the key programming languages and runs on commodity hardware and software.
11. Data compression with no performance penalty.
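
Point 7, tunable consistency, is easier to see with a little arithmetic. The sketch below is plain Python with hypothetical function names, not a Cassandra API. It assumes the usual rule of thumb: if the number of replicas that must acknowledge a write plus the number of replicas consulted on a read is greater than the replication factor, the read is guaranteed to touch at least one replica that has the latest write.

```python
def quorum(replication_factor):
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when the read set and the write set must overlap in at least one replica."""
    return write_replicas + read_replicas > replication_factor


rf = 3
q = quorum(rf)

print(q)                                  # prints: 2
print(read_sees_latest_write(rf, q, q))   # quorum writes + quorum reads: True
print(read_sees_latest_write(rf, 1, 1))   # one replica each way: False (faster, may be stale)
print(read_sees_latest_write(rf, rf, 1))  # write to all, read from one: True
```

Choosing smaller numbers trades consistency for speed and availability; choosing larger ones does the opposite. That choice can be made per operation, which is what "tunable" means here.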
