DS201 Study Notes: Cassandra Clustering Columns

DS201 course study notes on Cassandra clustering columns: PRIMARY KEY structure, sort ordering, and efficient range queries.

Study notes from DS201: Foundations of Apache Cassandra™ and DataStax Enterprise.

Clustering Column

A clustering column in a Cassandra table is a column that is part of the PRIMARY KEY but is not the partition key. Clustering columns determine how data with the same partition key is sorted and stored on disk.

Primary Key Structure

A Cassandra PRIMARY KEY consists of two parts:

  1. Partition Key: The first column (or multiple columns enclosed in parentheses) of the primary key becomes the partition key. It determines which node in the cluster stores the data.
  2. Clustering Columns: The columns following the partition key become clustering columns. Data with the same partition key (i.e., data within the same partition) is sorted and stored on disk in the order of the clustering columns.

Example: Creating a Table with Clustering Columns

The following example creates a videos_by_tag table:

CREATE TABLE videos_by_tag (
    tag text,
    video_id uuid,
    added_date timestamp,
    title text,
    PRIMARY KEY ((tag), added_date, video_id)
) WITH CLUSTERING ORDER BY(added_date DESC);

The PRIMARY KEY ((tag), added_date, video_id) in this table definition means:

  • tag: The partition key. Data is distributed across nodes in the cluster based on the value of tag.
  • added_date, video_id: The clustering columns. Data with the same tag is sorted by added_date in descending order (DESC), and when added_date is the same, it is sorted by video_id.

WITH CLUSTERING ORDER BY(added_date DESC) explicitly specifies the sort order of the added_date column as descending. This ensures that the most recent videos are positioned at the beginning of the partition.

Verifying After Data Insertion

After inserting data, running SELECT * FROM videos_by_tag; shows that rows with the same partition key (tag) are grouped together and sorted by added_date in descending order.

cqlsh:killrvideo> SELECT * FROM videos_by_tag;

 tag       | added_date                      | video_id                             | title
-----------+---------------------------------+--------------------------------------+------------------------------
  datastax | 2013-10-16 09:00:00.000000+0000 | 4845ed97-14bd-11e5-8a40-8338255b7e33 |              DataStax Studio
  datastax | 2013-04-16 09:00:00.000000+0000 | 5645f8bd-14bd-11e5-af1a-8638355b8e3a | What is DataStax Enterprise?
 cassandra | 2014-01-29 09:00:00.000000+0000 | 1645ea59-14bd-11e5-a993-8138354b7e31 |            Cassandra History
 cassandra | 2013-03-17 09:00:00.000000+0000 | 3452f7de-14bd-11e5-855e-8738355b7e3a |              Cassandra Intro
 cassandra | 2012-04-03 09:00:00.000000+0000 | 245e8024-14bd-11e5-9743-8238356b7e32 |             Cassandra & SSDs

(5 rows)

From this result, you can confirm that rows with tag of datastax and cassandra are grouped respectively, and within each group, they are sorted by added_date from newest to oldest.

Importance of Clustering Columns

  • Efficient range queries: Since data is sorted by clustering columns, range queries within the same partition key (e.g., retrieving the latest 10 videos for a specific tag) are very efficient.
  • Data model flexibility: Clustering columns play an important role in “denormalization,” where multiple tables are created for the same data to support different access patterns (similar to materialized views).

Proper design of partition keys and clustering columns is one of the most important elements of Cassandra data model design.