DS201 Study Notes: Cassandra Compaction

DS201 course study notes on Cassandra compaction: its purpose, tombstones, and the difference between minor and major compaction.

Study notes from DS201: Foundations of Apache Cassandra™ and DataStax Enterprise.

Compaction

To achieve high write performance, Cassandra does not overwrite existing data in place. Writes go first to the commit log and an in-memory Memtable; when a Memtable is flushed, its contents are written to disk as a new, immutable SSTable (Sorted String Table). As a result, data with the same key may be scattered across multiple SSTables, and deleted data may remain physically on disk.

Compaction is the process of merging these SSTables, removing unnecessary data, and optimizing data placement to improve database performance and efficiently utilize disk space.

Main Purposes of Compaction

  • Data deduplication: Removes old data, keeping only the latest version for each key.
  • Physical deletion of removed data: Physically deletes data marked with “Tombstones” (deletion markers).
  • SSTable consolidation: Merges many small SSTables into fewer, larger SSTables to improve read performance.
  • Disk space reclamation: Frees disk space by removing unnecessary data.
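The deduplication and consolidation steps above can be sketched as a small merge routine. This is a minimal illustration, not Cassandra's implementation: the in-memory `SSTable` type and the `compact` function are hypothetical names, each key holds a single value with a write timestamp, and tombstone handling is left out.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical in-memory model: each "SSTable" maps a key to
# (write_timestamp, value). Real SSTables are immutable on-disk files.
Cell = Tuple[int, str]
SSTable = Dict[str, Cell]


def compact(sstables: Iterable[SSTable]) -> SSTable:
    """Merge SSTables, keeping only the newest cell per key (last write wins)."""
    merged: SSTable = {}
    for table in sstables:
        for key, cell in table.items():
            current = merged.get(key)
            # Keep the cell with the highest write timestamp.
            if current is None or cell[0] > current[0]:
                merged[key] = cell
    # Emit keys in sorted order, mirroring the sorted layout of an SSTable.
    return dict(sorted(merged.items()))


older = {"user:1": (100, "alice"), "user:2": (100, "bob")}
newer = {"user:1": (200, "alice_v2"), "user:3": (150, "carol")}
result = compact([older, newer])
```

Here two input tables with four cells collapse into one table with three cells: the stale version of `user:1` is dropped, which is the source of both the read speed-up and the reclaimed disk space.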

About Tombstones

In Cassandra, when a delete operation is performed, the data is not immediately physically deleted. Instead, a Tombstone marker indicating that the data is “deleted” is written to the SSTable. Once the Tombstone is older than the table's gc_grace_seconds setting (864000 seconds, or 10 days, by default), a later compaction can physically remove it along with the data it shadows. The grace period gives the deletion time to reach every replica before the marker disappears.
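The gc_grace_seconds rule can be illustrated with a short sketch. It assumes a simplified model in which a cell is a `(write_time_seconds, value)` pair and a value of `None` stands in for a Tombstone; the names `is_purgeable` and `drop_expired_tombstones` are hypothetical. Real Cassandra also checks that the Tombstone does not shadow data in SSTables outside the current compaction, which this sketch omits.

```python
from typing import Dict, Optional, Tuple

GC_GRACE_SECONDS = 864_000  # default gc_grace_seconds (10 days)

# Hypothetical model: (write_time_seconds, value); value None marks a Tombstone.
Cell = Tuple[int, Optional[str]]


def is_purgeable(cell: Cell, now: int, gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """A Tombstone may be dropped only once it is older than the grace period."""
    write_time, value = cell
    return value is None and now - write_time > gc_grace


def drop_expired_tombstones(
    cells: Dict[str, Cell], now: int, gc_grace: int = GC_GRACE_SECONDS
) -> Dict[str, Cell]:
    """Keep live cells and Tombstones that are still inside the grace period."""
    return {k: c for k, c in cells.items() if not is_purgeable(c, now, gc_grace)}


now = 1_000_000
cells = {
    "a": (0, None),        # Tombstone older than the grace period: purged
    "b": (900_000, None),  # recent Tombstone: must be kept
    "c": (0, "live"),      # live data: always kept
}
remaining = drop_expired_tombstones(cells, now)
```

Only `a` is removed. The Tombstone on `b` survives this compaction so that replicas that have not yet seen the delete can still learn about it.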

Types of Compaction

Cassandra offers several compaction strategies. Independent of the strategy in use, compaction runs are described here in general terms as “minor compaction” and “major compaction.”

Minor Compaction

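Minor compaction runs automatically in the background. It is typically triggered after a Memtable is flushed to a new SSTable, once the table's compaction strategy finds that its conditions are met (e.g., enough SSTables of similar size have accumulated).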

  • Target: Relatively new SSTables or SSTables selected by a specific compaction strategy.
  • Processing: Merges the selected SSTables, processing duplicate data and Tombstones. This reduces the number of SSTables searched during reads and optimizes disk space.
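As one illustration of a size-based trigger, the sketch below groups SSTables into buckets of similar size and selects any bucket that has grown large enough, in the spirit of SizeTieredCompactionStrategy. It is a simplified approximation: the function names are hypothetical, sizes are plain integers (say, MB), and the defaults mirror the strategy's `bucket_low` (0.5), `bucket_high` (1.5), and `min_threshold` (4) options while ignoring others such as `max_threshold` and `min_sstable_size`.

```python
from typing import List


def size_tiered_buckets(
    sizes: List[int], bucket_low: float = 0.5, bucket_high: float = 1.5
) -> List[List[int]]:
    """Group SSTable sizes into buckets whose members are close to the bucket average."""
    buckets: List[List[int]] = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket is similar enough: start a new one.
            buckets.append([size])
    return buckets


def pick_compaction_candidates(
    sizes: List[int], min_threshold: int = 4
) -> List[List[int]]:
    """Return the buckets holding at least min_threshold SSTables."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]


sstable_sizes_mb = [10, 12, 11, 9, 100, 110, 1000]
buckets = size_tiered_buckets(sstable_sizes_mb)
candidates = pick_compaction_candidates(sstable_sizes_mb)
```

With these sizes, only the four small SSTables form a bucket that reaches the threshold, so they alone would be merged; the larger files are left untouched until enough peers of similar size exist.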

Major Compaction

Major compaction is a large-scale compaction that merges all SSTables of a table on a single node. It runs per node, not across the cluster as a whole.

  • Target: All SSTables of the selected table (or keyspace) on the node where it is run.
  • Processing: Combines the SSTables into far fewer files (a single new SSTable under SizeTieredCompactionStrategy), eliminating duplicate data and physically deleting Tombstones that are older than gc_grace_seconds. This can free significant disk space and improve read performance.
  • Impact: Consumes significant disk I/O, CPU, and temporary disk space, increasing system load. Unlike minor compaction, it is not triggered automatically; an operator starts it manually (e.g., with nodetool compact), usually during a maintenance window.

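Cassandra provides various compaction strategies (e.g., SizeTieredCompactionStrategy, LeveledCompactionStrategy, TimeWindowCompactionStrategy), and it is important to select a strategy that matches the workload: SizeTieredCompactionStrategy generally suits write-heavy workloads, LeveledCompactionStrategy read-heavy workloads, and TimeWindowCompactionStrategy time-series data that expires via TTL. The strategy is configured per table.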
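Because the strategy is a table-level option, switching it comes down to a CQL `ALTER TABLE ... WITH compaction = {...}` statement. The helper below is a small sketch that only builds that statement as a string; the function name `alter_compaction_cql` and the `shop.orders` keyspace and table are made up for illustration, and strategy-specific sub-options are not covered.

```python
KNOWN_STRATEGIES = {
    "SizeTieredCompactionStrategy",
    "LeveledCompactionStrategy",
    "TimeWindowCompactionStrategy",
}


def alter_compaction_cql(keyspace: str, table: str, strategy: str) -> str:
    """Build a CQL statement that sets a table's compaction strategy."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown compaction strategy: {strategy}")
    # Doubled braces produce the literal { } of the CQL map.
    return (
        f"ALTER TABLE {keyspace}.{table} "
        f"WITH compaction = {{'class': '{strategy}'}};"
    )


# Hypothetical keyspace and table names.
statement = alter_compaction_cql("shop", "orders", "LeveledCompactionStrategy")
```

The resulting string can be run in cqlsh or passed to a driver session. Changing the strategy on an existing table causes its SSTables to be reorganized under the new strategy, so it is worth planning such a change for a quiet period.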