The Cassandra NoSQL Distributed Database

  • Cassandra is an open source, distributed, decentralised, elastically scalable, highly available and fault-tolerant database.

Distributed & Decentralised

  • The Cassandra database uses a shared-nothing architecture. It has no central controller and no notion of primary/secondary replicas.

  • Other databases such as MySQL, Bigtable and MongoDB follow the primary/secondary scheme, in which a failed primary can be replaced through an automated leader election process.

Elastic Scalability

  • Cassandra, by contrast, is a truly masterless database in which all nodes are the same. Replication is peer-to-peer. It is written in Java, with data compression built in.

  • Nodes can be added to and removed from the cluster, allowing for elastic scalability.

Easy to Maintain

  • Creating and maintaining custom data shards by hand is very messy. Cassandra uses an approach similar to key-based sharding to distribute data across nodes, and it does this automatically.

  • Hence, setting up many nodes is not much different from setting up just one node. Almost no configuration is required.

  • There is no single point of failure; reads and writes occur across multiple geographically dispersed data centres, cloud availability zones and regions.
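
The key-based distribution of data across nodes can be illustrated with a toy hash ring. This is only a sketch of the general consistent-hashing idea, not Cassandra's actual partitioner: the `Ring` class, the `token` helper, the node names and the use of MD5 are all assumptions made for the example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A toy hash ring: each node owns the keys that hash up to its token."""

    def __init__(self, nodes):
        # Sorted list of (token, node_name) pairs.
        self._ring = sorted((token(n), n) for n in nodes)

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (token(node), node))

    def node_for(self, key: str) -> str:
        tokens = [t for t, _ in self._ring]
        # First node whose token is >= the key's token, wrapping around the ring.
        i = bisect.bisect_left(tokens, token(key)) % len(self._ring)
        return self._ring[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]

before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")  # hypothetical new node joining the cluster
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

In this sketch, adding a node only reassigns the keys that fall into the new node's range; every other key stays where it was, which is why nodes can join or leave without manual re-sharding.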

Tunable Consistency

  • Cassandra is frequently called “eventually consistent”. It trades some consistency in order to achieve high availability.

  • “Tunable Consistency” is a more accurate term for Cassandra because it allows you to decide the level of consistency you require in balance with the level of availability.

Consistency Levels

Consistency is commonly classified into the following levels:

  • Strict Consistency: This is the most stringent level of consistency. It requires every read to return the most recently written value. Across geographies, however, this is hard to achieve: all nodes would have to share the same global clock to timestamp all operations, regardless of node location, the user requesting the transaction or the number of such transactions.

  • Causal Consistency: This is a slightly weaker level of consistency that does not rely on a global clock or timestamps. Instead, it attempts to determine the cause of events and derives consistency from their order. This means that writes that are potentially related must be read in sequence.

  • Weak (eventual) Consistency: Here, all updates propagate to all replicas, but this takes some time, so the replicas become consistent only eventually.

Replication Factor

The replication factor sets the number of nodes in the cluster that store a copy of the data, that is, the number of nodes you want each update to propagate to.

Tuning at the Transaction Level

  • The replication factor is set per keyspace, while the consistency level is specified by the client on every operation, allowing the client to decide how many replicas in the cluster must acknowledge the write operation or respond to a read operation in order for it to be considered successful.

  • Cassandra has pushed this decision to the client at the time of the transaction / operation, instead of having a configurable global setting.
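
The trade-off between consistency and availability can be made concrete with the usual quorum arithmetic: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The sketch below is plain arithmetic, not a driver call; `ONE`, `QUORUM` and `ALL` are Cassandra consistency levels, while the function names are assumptions made for the example.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def reads_see_latest_write(read_level: str, write_level: str, replication_factor: int) -> bool:
    """True when the read set and the write set must share at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


rf = 3
for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    overlap = reads_see_latest_write(read_level, write_level, rf)
    print(f"read={read_level:<6} write={write_level:<6} overlap guaranteed: {overlap}")
```

With a replication factor of 3, reading and writing at `ONE` favours availability and latency, whereas `QUORUM` on both sides tolerates one unavailable replica while still guaranteeing that reads and writes overlap.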

Hybrid Cloud & Multicloud Deployment

  • Not only can you deploy Cassandra across multiple data centres, but these centres can be from multiple cloud service providers.

  • In a hybrid cloud architecture, Cassandra can replicate data from a traditional on-premises data centre to data centres in the cloud.

  • With its easy deployment, networking becomes the challenge rather than the Cassandra database.

Other Characteristics

  • It bases its distribution design on Amazon’s Dynamo, with a query language similar to SQL.

  • It uses the gossip protocol to maintain and keep in sync a list of nodes that are alive or dead.
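
How gossip spreads cluster membership can be sketched with a small simulation. This is a simplified model under stated assumptions, not Cassandra's implementation: each node keeps a map of node name to heartbeat counter, and in every round it exchanges that map with one randomly chosen peer. The function names, node names and the fixed random seed are all hypothetical, and failure detection is not modelled.

```python
import random


def merge(local: dict, remote: dict) -> None:
    """Keep the highest heartbeat seen so far for every node."""
    for node, heartbeat in remote.items():
        if heartbeat > local.get(node, -1):
            local[node] = heartbeat


def gossip_round(views: dict, rng: random.Random) -> None:
    """Each node exchanges its view with one randomly chosen peer."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merge(views[peer], views[name])  # push local state to the peer
        merge(views[name], views[peer])  # pull the peer's state back


rng = random.Random(0)  # fixed seed so the run is repeatable
names = [f"node-{i}" for i in range(8)]
# Initially every node only knows its own heartbeat.
views = {name: {name: 1} for name in names}

rounds = 0
while any(len(view) < len(names) for view in views.values()) and rounds < 100:
    gossip_round(views, rng)
    rounds += 1

converged = all(len(view) == len(names) for view in views.values())
print(f"converged={converged} after {rounds} rounds")
```

No node needs a central registry in this model: membership information spreads peer to peer, which matches the masterless design described earlier.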

The original paper that introduces Cassandra can be found here