Study notes from DS201: Foundations of Apache Cassandra™ and DataStax Enterprise.
Partition
A partition in Cassandra is a fundamental concept for distributing and storing data. A Cassandra table consists of one or more partitions, and each partition is identified by a unique partition key.
The partition key serves as the basis for determining which node in the cluster stores the data. Since data is physically distributed based on the partition key value, proper partition key design is directly tied to Cassandra’s performance and scalability.
PRIMARY KEY and Partition Key
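When defining a table in CQL (Cassandra Query Language), you specify a PRIMARY KEY. The first element of the primary key is the partition key. It is usually a single column, but it can span several columns when they are wrapped in an extra pair of parentheses, as in PRIMARY KEY ((a, b), c). Any remaining primary key columns are clustering columns.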
Example:
Here is the definition of the videos table:
cqlsh:killrvideo> DESCRIBE TABLE videos;
CREATE TABLE killrvideo.videos (
video_id timeuuid PRIMARY KEY,
added_date timestamp,
title text
) WITH additional_write_policy = '99p'
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND cdc = false
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND memtable = 'default'
AND crc_check_chance = 1.0
AND default_time_to_live = 0
AND extensions = {}
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair = 'BLOCKING'
AND speculative_retry = '99p';
In the table definition above, video_id is declared with an inline PRIMARY KEY, which is equivalent to writing PRIMARY KEY (video_id). Because it is the only primary key column, video_id is the partition key for this table.
Token
Cassandra hashes the partition key to produce a token, and the token determines which node in the cluster stores the data. The hashing is done by the partitioner; the default, Murmur3Partitioner, produces signed 64-bit tokens.
You can check the token value of a specific partition key using the token() function:
cqlsh:killrvideo> SELECT token(video_id), video_id FROM videos;
 system.token(video_id) | video_id
------------------------+--------------------------------------
   -7805440677194688247 | 245e8024-14bd-11e5-9743-8238354b7e32

(1 rows)
The first column of the result, system.token(video_id), holds the token for each video_id. This is the hash of the partition key that Cassandra uses internally to decide which node stores the row.
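The token-to-node lookup can be pictured with a small sketch. This is a simplified model, not Cassandra's implementation: the four-node ring, the node names, and the helper functions toy_token and owner are all hypothetical, and toy_token uses MD5 as a stand-in hash rather than Murmur3. It assumes each node owns the range of tokens that ends at its own token.

```python
import bisect
import hashlib

# Hypothetical 4-node ring: each node owns the token range ending at its token.
RING = [
    (-4611686018427387904, "node1"),
    (0, "node2"),
    (4611686018427387904, "node3"),
    (9223372036854775807, "node4"),
]
TOKENS = [token for token, _ in RING]


def toy_token(partition_key: str) -> int:
    """Stand-in hash (NOT Murmur3): map a key to a signed 64-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner(token: int) -> str:
    """Return the node with the first ring token >= the given token (wrapping)."""
    idx = bisect.bisect_left(TOKENS, token)
    return RING[idx % len(RING)][1]


# The token shown by cqlsh above falls in the first range of this toy ring.
print(owner(-7805440677194688247))  # node1
print(owner(toy_token("245e8024-14bd-11e5-9743-8238354b7e32")))
```

In this model, adding a node simply inserts another token into the ring, which is why only the neighbouring range has to move. Real clusters typically assign many tokens per node (virtual nodes) and also place replicas, which the sketch leaves out.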
Designing Proper Partition Keys
Proper partition key design is critical for maximizing Cassandra’s performance.
- Even data distribution: Choose partition key columns with high cardinality (many distinct values) and uniform access patterns so that data spreads evenly across the cluster.
- Avoiding hotspots: Avoid designs in which reads or writes concentrate on a small number of partitions, because the nodes that own those partitions become “hotspots”.
- Query efficiency: Reads are efficient when the query supplies the full partition key, because the request can be routed directly to the nodes that own that partition. Choose partition keys based on the queries the table must serve.
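The effect of cardinality on distribution can be illustrated with a rough simulation. It rests on stated assumptions: a hypothetical four-node cluster, a toy placement rule (MD5 hash modulo the node count) standing in for the partitioner and token ring, and a made-up low-cardinality 'status' column with only two values.

```python
import hashlib
import uuid
from collections import Counter

NODES = 4  # hypothetical cluster size


def toy_node(partition_key: str) -> int:
    """Toy placement (not Cassandra's): hash the key and pick a node index."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def rows_per_node(keys) -> Counter:
    """Count how many rows land on each node for a sequence of partition keys."""
    return Counter(toy_node(key) for key in keys)


# High cardinality: every row has its own key, like video_id.
high = rows_per_node(str(uuid.UUID(int=i)) for i in range(10_000))

# Low cardinality: 10,000 rows share two key values (hypothetical 'status').
low = rows_per_node("active" if i % 10 else "deleted" for i in range(10_000))

print("high cardinality:", sorted(high.items()))
print("low cardinality: ", sorted(low.items()))
```

With unique keys, the rows spread across all four nodes in roughly equal shares. With two key values, all rows land on at most two nodes and at least 9,000 of them sit on a single node, which is the hotspot pattern described above.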