Monday, December 9, 2013

Data Consistency in Cassandra

Understanding Data Consistency in Cassandra

To understand data consistency in Cassandra, we need to understand the points below.

1. Overview of Cassandra reads/writes.
2. Details of how writes are performed in Cassandra.
3. The CAP theorem.
4. Tunable data consistency.
5. Choosing a data consistency strategy for writes.
6. Choosing a data consistency strategy for reads.
7. Example.


-Overview of Cassandra reads/writes.

Cassandra has a peer-to-peer architecture; it has no concept of master or slave. All nodes in the cluster are treated the same, whether they sit in different racks or in different datacenters, so data can be written to any node and read from any node. Cassandra automatically takes care of partitioning and replicating the data throughout the cluster, across racks and datacenters.
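
To make the idea of automatic partitioning and replication concrete, here is a minimal sketch, assuming a simple hash ring in which each row is stored on the next few nodes clockwise from the row key's position. The names token and replicas_for are hypothetical, and the MD5-based placement is a simplification for illustration, not Cassandra's actual partitioner or replication strategy.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a position on the ring (toy hash, for illustration only)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(row_key: str, nodes: list, replication_factor: int) -> list:
    """Return the nodes that would hold row_key in this toy ring."""
    # Order the nodes by their position on the ring.
    ring = sorted(nodes, key=token)
    tokens = [token(node) for node in ring]
    # The first replica is the first node clockwise from the key's position.
    start = bisect_right(tokens, token(row_key)) % len(ring)
    # The remaining replicas are the next nodes clockwise.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]
    print(replicas_for("user:42", nodes, replication_factor=3))
```

Because the placement depends only on the key and the set of nodes, any node can work out where a row lives, which is why a client can send its request to any node.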


-Details of how writes are performed in Cassandra.

Writes in Cassandra are widely regarded as extremely fast. When data is written to Cassandra, it is first persisted in a commit log for durability. The same data is then written to an in-memory table called the memtable. Once the memtable is full, its data is flushed to disk into files called SSTables. Writes in Cassandra are atomic at the row level: either all columns of the row are written or updated, or none are. RDBMS-style transactions are not supported. An independent benchmark has recorded that Cassandra has
a. 4x better writes.
b. 2x better reads.
c. 12x better in reads/updates.
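
The write path described above (commit log, then memtable, then SSTable) can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions: the class ToyNode and its memtable_limit parameter are hypothetical, the "commit log" is a Python list rather than a file, and the memtable is considered full after a fixed number of rows.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write (stands in for the on-disk log)
        self.memtable = {}     # in-memory table of recent writes, keyed by row key
        self.sstables = []     # flushed tables, oldest first (stand in for files on disk)
        self.memtable_limit = memtable_limit  # hypothetical "full" threshold, in rows

    def write(self, row_key, columns):
        # 1. Persist the write in the commit log for durability.
        self.commit_log.append((row_key, dict(columns)))
        # 2. Apply the same data to the memtable.
        self.memtable.setdefault(row_key, {}).update(columns)
        # 3. Once the memtable is full, flush it to a new SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(self.memtable)
        self.memtable = {}

    def read(self, row_key):
        # Merge SSTables from oldest to newest, then the memtable on top.
        merged = {}
        for table in self.sstables:
            merged.update(table.get(row_key, {}))
        merged.update(self.memtable.get(row_key, {}))
        return merged


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    node.write("user:1", {"name": "Ann"})
    node.write("user:2", {"name": "Bob"})   # second row fills the memtable and triggers a flush
    node.write("user:1", {"city": "Oslo"})
    print(node.read("user:1"))
```

A write only appends to the log and updates memory, which is one reason writes are cheap; the cost is that a read may have to merge data from the memtable and several SSTables.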


-The CAP theorem

In a distributed database system you can have only two of the following three things.

-Consistency: you can have strong consistency, which means always reading and writing the latest copy of the data.
-Availability: you can have strong availability of the data, which means that if a node holding the data goes down, other nodes that hold the data can still serve the request.
-Partition tolerance: messages between some of the nodes can be lost and the system still operates well.

Cassandra is known for strong availability and partition tolerance, while providing tunable data consistency.


-Tunable data consistency

In Cassandra you have the flexibility to choose the data consistency you need, strong or eventual. The consistency level can be set on a per-operation basis.

-Data consistency strategy for writes

There are various strategies for writes.
1. Any: A write must succeed on any available node.
2. One: A write must succeed on at least one replica node responsible for that row.
3. Quorum: A write must succeed on a quorum of replica nodes, which is determined by (replicationFactor / 2) + 1, with the division rounded down.
4. Local_Quorum: A write must succeed on a quorum of replica nodes in the same data center as the coordinator node.
5. Each_Quorum: A write must succeed on a quorum of replica nodes in each data center.
6. All: A write must succeed on all replica nodes for a row key.
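
The levels above translate into a number of replica acknowledgements the coordinator waits for. The sketch below works that number out for a given replication factor per data center. The function names quorum and required_acks and the rf_per_dc mapping are hypothetical helpers for illustration, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """(replicationFactor / 2) + 1, using integer division."""
    return replication_factor // 2 + 1


def required_acks(level: str, rf_per_dc: dict, local_dc: str) -> int:
    """Number of acknowledgements needed for a write at the given level.

    rf_per_dc maps a data center name to its replication factor;
    local_dc is the data center of the coordinator node.
    """
    total_rf = sum(rf_per_dc.values())
    if level in ("ANY", "ONE"):
        # For ANY, the single acknowledgement may be a stored hint.
        return 1
    if level == "QUORUM":
        return quorum(total_rf)
    if level == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])
    if level == "EACH_QUORUM":
        # A quorum is needed in every data center.
        return sum(quorum(rf) for rf in rf_per_dc.values())
    if level == "ALL":
        return total_rf
    raise ValueError("unknown consistency level: " + level)


if __name__ == "__main__":
    rf = {"dc1": 3, "dc2": 3}
    for level in ("ANY", "ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
        print(level, required_acks(level, rf, local_dc="dc1"))
```

With a replication factor of 3 in each of two data centers, Quorum needs 4 acknowledgements, Local_Quorum needs 2, Each_Quorum needs 2 in each data center (4 in total), and All needs 6.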


-Hinted Handoffs

Hinted handoff is a mechanism Cassandra uses when a write is sent to all replicas for a row. If some replica nodes are unavailable, a hint is stored on one of the live nodes so that the downed nodes can be updated with the row once they are available again. If no replica nodes are available, using the Any consistency level instructs the coordinator node to store the hint together with the row data, which is passed to a replica node once it is available.
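
As a rough illustration of the idea, the following toy simulation stores a hint for each replica that is down at write time and replays it when the replica returns. The Replica and Coordinator classes are hypothetical and keep everything in memory; real hints are stored durably and expire after a configured window, which this sketch ignores.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # list of (target replica, row key, value)

    def write(self, row_key, value):
        """Write to every replica; store a hint for each one that is down."""
        acks = 0
        for replica in self.replicas:
            if replica.up:
                replica.data[row_key] = value
                acks += 1
            else:
                self.hints.append((replica, row_key, value))
        return acks

    def replay_hints(self):
        """Deliver stored hints to replicas that have come back up."""
        remaining = []
        for replica, row_key, value in self.hints:
            if replica.up:
                replica.data[row_key] = value
            else:
                remaining.append((replica, row_key, value))
        self.hints = remaining


if __name__ == "__main__":
    a, b, c = Replica("a"), Replica("b"), Replica("c")
    coordinator = Coordinator([a, b, c])
    c.up = False
    print(coordinator.write("user:1", "Ann"))  # acknowledgements from live replicas
    c.up = True
    coordinator.replay_hints()
    print(c.data)
```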


-Data consistency strategy for Reads

There are various strategies for reads in Cassandra.
1. One: Returns the result from the closest replica node holding the data.
2. Quorum: Returns the result with the most recent timestamp from a quorum of servers.
3. Local_Quorum: Returns the result with the most recent timestamp from a quorum of servers in the same data center as the coordinator node.
4. Each_Quorum: Returns the result with the most recent timestamp from a quorum of servers in each data center.
5. All: Returns the result with the most recent timestamp after all replica nodes for the key have responded.
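
When combining a write level with a read level, a commonly used rule of thumb is that reads see the latest write whenever the replicas read plus the replicas written exceed the replication factor, because the two sets must then overlap in at least one node. The post itself does not state this rule, so treat the sketch below as a supplementary illustration; the function name is hypothetical.

```python
def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    quorum = rf // 2 + 1
    print("QUORUM write + QUORUM read:", is_strongly_consistent(quorum, quorum, rf))
    print("ONE write + ONE read:", is_strongly_consistent(1, 1, rf))
    print("ALL write + ONE read:", is_strongly_consistent(1, rf, rf))
```

With a replication factor of 3, Quorum writes with Quorum reads (2 + 2 > 3) and All writes with One reads (3 + 1 > 3) satisfy the rule, while One writes with One reads (1 + 1) do not and give eventual consistency.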


-Read Repair

To ensure data consistency while reading, Cassandra performs read repair. Suppose I read data that is stale on one of the nodes. Cassandra compares the copies returned by the replicas and writes the most recent data to the node holding the stale copy, so the next time a request reaches that node it returns the latest data.
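
The following is a minimal sketch of that idea, assuming each replica is a plain dictionary mapping a row key to a (timestamp, value) pair. The function read_with_repair is hypothetical and repairs every contacted replica synchronously, which is a simplification of what Cassandra does.

```python
def read_with_repair(replicas, row_key):
    """Return the newest value for row_key and update stale replicas.

    replicas is a list of dicts mapping row key -> (timestamp, value).
    """
    # Collect each replica's copy (None if the replica has no copy).
    responses = [(replica, replica.get(row_key)) for replica in replicas]
    copies = [copy for _, copy in responses if copy is not None]
    if not copies:
        return None
    # The copy with the highest timestamp wins.
    latest = max(copies, key=lambda copy: copy[0])
    # Repair: push the winning copy to every replica that is stale or missing it.
    for replica, copy in responses:
        if copy != latest:
            replica[row_key] = latest
    return latest[1]


if __name__ == "__main__":
    node_a = {"user:1": (1, "old address")}
    node_b = {"user:1": (2, "new address")}
    node_c = {}
    print(read_with_repair([node_a, node_b, node_c], "user:1"))
    print(node_a, node_c)
```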


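The USING CONSISTENCY clause can be used to specify the consistency level for the operation being executed. Note that this clause was removed in CQL 3, where the consistency level is set through the client driver or the cqlsh CONSISTENCY command instead.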

