DS201 Study Notes: Cassandra Write Path and Read Path


Study notes from DS201: Foundations of Apache Cassandra™ and DataStax Enterprise.

Cassandra's storage engine is built around sequential, append-only writes and reads that combine in-memory and on-disk data, which is what supports its high write and read performance.

Write Path

Data write operations in Cassandra follow these steps:

  1. Writing to the Commit Log: When Cassandra receives a data change request, it first appends the change to the commit log. The commit log is stored on disk (HDD or SSD) and plays a critical role in ensuring data durability. If a node crashes before its in-memory data has been flushed to disk, Cassandra replays the commit log on restart to recover that data.

  2. Writing to the Memtable: In parallel with writing to the commit log, data is also written to the Memtable in RAM (Random Access Memory). The Memtable is an in-memory data structure used for high-speed processing of data additions and updates. It holds the latest version of the data and provides fast read/write access.

  3. Flushing to SSTable (Sorted String Table): When the data in the Memtable reaches a certain size or a certain amount of time has passed, Cassandra writes the Memtable contents to an SSTable on disk (flush). SSTables are immutable files whose contents are sorted by key, which enables efficient read operations. Because SSTables are never modified after they are written, later updates to the same data land in newer SSTables. SSTables are Cassandra’s persistent data storage.

This process allows Cassandra to achieve fast writes while ensuring data durability.
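The three write-path steps above can be sketched as a toy model. This is a minimal illustration, not how Cassandra is implemented: the `ToyNode` class, the `flush_threshold` parameter, the JSON file formats, and the file names are all hypothetical, and truncating the log after a flush is a simplification of how real commit log segments are managed.

```python
import json
import os
import tempfile


class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, data_dir, flush_threshold=3):
        self.data_dir = data_dir
        self.flush_threshold = flush_threshold  # hypothetical: flush after N keys
        self.commit_log_path = os.path.join(data_dir, "commitlog.jsonl")
        self.memtable = {}   # in-memory, holds the latest value per key
        self.sstables = []   # paths of flushed files, oldest first

    def write(self, key, value):
        # Step 1: append the change to the on-disk commit log for durability.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Step 2: apply the change to the in-memory memtable.
        self.memtable[key] = value
        # Step 3: flush once the memtable reaches the (assumed) size limit.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable contents, sorted by key, to a new file
        # that is never modified afterwards.
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sstables.append(path)
        self.memtable.clear()
        # Simplification: the flushed data is now persistent, so empty the log.
        open(self.commit_log_path, "w").close()

    def replay_commit_log(self):
        # After a crash, rebuild the memtable from the commit log.
        if not os.path.exists(self.commit_log_path):
            return
        with open(self.commit_log_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.memtable[entry["key"]] = entry["value"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        node = ToyNode(data_dir, flush_threshold=3)
        for key, value in [("b", 2), ("a", 1), ("c", 3)]:
            node.write(key, value)
        print(node.sstables)  # one flushed file; the memtable is empty again
```

Every write costs only a sequential append plus an in-memory update, which is why the design favors write throughput. The sorting work is deferred to flush time.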

Read Path

Data read operations in Cassandra follow these steps:

  1. Searching the Memtable: When a read request arrives, Cassandra first checks the Memtable in memory. Since the Memtable contains the most recently written data, anything found here is retrieved at high speed without disk access.

  2. Checking the Bloom Filter: Before reading an SSTable from disk, Cassandra consults that SSTable's in-memory Bloom filter. A Bloom filter is a probabilistic data structure that quickly determines whether a partition key might exist in an SSTable. It can return false positives but never false negatives, so when it reports that the key is absent, Cassandra skips that SSTable entirely, minimizing disk I/O.

  3. Searching and Merging SSTables: Cassandra reads the SSTables whose Bloom filters indicate the data may be present. Since data for one partition may be spread across the Memtable and multiple SSTables, Cassandra merges the results and uses write timestamps to keep the latest version of each value.

  4. Loading into Cache: To speed up subsequent reads, Cassandra can populate in-memory caches after a disk read. The Key Cache stores the position of a partition within an SSTable so that the index lookup can be skipped next time, and the Row Cache (when enabled) stores the row data itself so that repeated requests are served from memory instead of disk.
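To make the Bloom filter check in the read path more concrete, here is a minimal sketch of the data structure. The `BloomFilter` class, its `num_bits` and `num_hashes` parameters, and the use of SHA-256 are illustrative assumptions. Cassandra's own implementation differs, and this only shows the principle: a negative answer is definite, while a positive answer may be a false positive.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        # SHA-256 is a stand-in hash chosen for this sketch.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was definitely never added.
        # True means the key may have been added.
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    sstable_filter = BloomFilter()
    for partition_key in ["user:1", "user:2", "user:3"]:
        sstable_filter.add(partition_key)
    # A key that was added always passes the check.
    print(sstable_filter.might_contain("user:2"))
```

The filter trades a small amount of memory per SSTable for the ability to rule out most unnecessary disk reads. A larger `num_bits` lowers the false-positive rate at the cost of more memory.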

These optimized Write Path and Read Path mechanisms enable Cassandra to achieve high throughput and low-latency data operations.
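The merge step of the read path, where data for the same row is found in the Memtable and in several SSTables, can be sketched as a "latest timestamp wins" merge. The `merge_row` function and the `{column: (timestamp, value)}` layout are hypothetical simplifications. Deletions (tombstones) and tie-breaking between equal timestamps are deliberately left out.

```python
def merge_row(sources):
    """Merge fragments of one row, keeping the newest value per column.

    sources: iterable of dicts shaped like {column: (timestamp, value)},
    for example one fragment from the memtable and one per SSTable.
    """
    merged = {}
    for fragment in sources:
        for column, (timestamp, value) in fragment.items():
            # Keep this value only if it is newer than what we already have.
            if column not in merged or timestamp > merged[column][0]:
                merged[column] = (timestamp, value)
    return {column: value for column, (_, value) in merged.items()}


if __name__ == "__main__":
    memtable_row = {"email": (30, "new@example.com")}
    older_sstable_row = {"name": (10, "Alice"), "email": (10, "old@example.com")}
    newer_sstable_row = {"name": (20, "Alicia")}

    row = merge_row([memtable_row, older_sstable_row, newer_sstable_row])
    print(row)  # {'email': 'new@example.com', 'name': 'Alicia'}
```

Because the newest timestamp wins regardless of where a value was found, the order in which the Memtable and SSTables are visited does not change the result.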