Apache Cassandra
Masterless, AP-leaning, linearly-scalable wide-column store. Built for write-heavy, multi-DC, always-on workloads where eventual consistency is acceptable and the data model is known up front.
Wide-Column · NoSQL · Distributed · AP (Tunable) · v5.0 GA
As of 2026-06-08 · Cassandra 5.0 GA (Sep 2024)
Cassandra is the right answer when your access patterns are known, your dataset is sharded by a natural partition key, you write 5–10x more than you read, and you cannot tolerate a single-region failure. It is the wrong answer when you need joins, ad-hoc queries, transactions, or strong consistency by default. The hardest lesson at scale isn't operating the cluster, it's resisting the urge to query it like Postgres. Every team that gets burned (Discord at 177 nodes, Netflix's first time-series cluster) gets burned on the same thing: partitioning. Cassandra 5.0's SAI plus tries plus UCS finally close some of the read-side ergonomic gaps, but the data-model discipline does not change.
Best default choices
1 · Overview
Apache Cassandra is a partitioned, leaderless, eventually-consistent distributed database. It was born at Facebook in 2008 from the marriage of Amazon's Dynamo paper (partitioning, replication, gossip) and Google's Bigtable paper (column-family data model, SSTables, memtables). The shape of the system is unusual: every node is identical, every node accepts writes, every node accepts reads, and there is no leader, no shard manager, no central catalog, and no coordination service like ZooKeeper. The ring is the catalog.
The promise is operational: you can lose a node, a rack, a data center, or a region without an outage and without manual failover. The price is data-modeling: you must know your queries before you create your tables, because Cassandra cannot do joins, cannot do ad-hoc aggregations efficiently, and cannot index arbitrary columns without paying for it.
Where Cassandra sits in the CAP space
By default, Cassandra is an AP system: it prefers availability over consistency during a partition. But it is tunable. You set the consistency level per query, and the system trades availability for consistency at the request level. QUORUM reads + QUORUM writes give you strong consistency at the cost of tolerating fewer node failures. ONE reads + ONE writes give you the cheapest, lowest-latency path but no read-your-writes guarantee.
The math is simple and worth memorizing: if W + R > RF (write nodes + read nodes > replication factor), reads see the latest write. Most production deployments run RF=3 with LOCAL_QUORUM (W=2, R=2), giving strong consistency within a DC and async replication across DCs.
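The overlap rule above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names `quorum` and `reads_see_latest_write` are illustrative and not part of any Cassandra API.

```python
def quorum(rf: int) -> int:
    """Smallest majority of rf replicas: floor(rf / 2) + 1."""
    return rf // 2 + 1


def reads_see_latest_write(w: int, r: int, rf: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    return w + r > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # QUORUM writes + QUORUM reads overlap on at least one replica.
    print("QUORUM/QUORUM:", reads_see_latest_write(q, q, rf))
    # ONE writes + ONE reads may land on disjoint replicas.
    print("ONE/ONE:", reads_see_latest_write(1, 1, rf))
```

With RF=3 the quorum is 2, so W=2 and R=2 always share at least one replica, while W=1 and R=1 can miss each other entirely.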
What Cassandra is good at
Writes. Cassandra writes are append-only to a commit log and an in-memory memtable. There is no read-before-write, no random disk I/O, no row-level lock. A write at LOCAL_QUORUM in a healthy cluster lands in single-digit milliseconds, and a well-provisioned node can sustain tens of thousands of writes per second. Apple has clusters doing millions of operations per second. Uber publicly stated their largest clusters do 1M+ writes/sec.
What Cassandra is not good at
Anything that requires looking across partitions. Joins, aggregations, secondary lookups, ad-hoc filters, transactional updates spanning multiple rows. The system has lightweight transactions (Paxos-based, slow), counter columns (commutative, with caveats), materialized views (experimental and disabled by default), and secondary indexes (rewritten in 5.0 as SAI), but every one of these is a workaround for the same constraint: the partition key is the only first-class lookup axis.
2 · Architecture
Cassandra's architecture is a ring of identical nodes. The token ring is a 64-bit hash space (Murmur3 by default). Each node owns one or more contiguous slices of that ring, called token ranges. When you write a row, the partition key is hashed to a token, the token falls inside one node's range, and that node is the primary replica. The next N-1 nodes clockwise on the ring hold the additional replicas, where N is the replication factor.
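The placement rule described above (hash the key, find the owning range, walk clockwise for the remaining replicas) can be sketched in a few lines. This is a simplified illustration, not Cassandra's implementation: it assumes one token per node (no vnodes), ignores racks and data centers, and uses MD5 from the standard library as a stand-in for Murmur3. The `Ring` class and `token` function are hypothetical names.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Map a key to a signed 64-bit token (MD5 stand-in, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    def __init__(self, node_tokens: dict[str, int]):
        # Sort (token, node) pairs so the ring can be binary-searched.
        self._ring = sorted((t, n) for n, t in node_tokens.items())
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str, rf: int) -> list[str]:
        """Owner of the key's token range, then the next rf-1 nodes clockwise."""
        # A node with token T owns the range (previous token, T].
        start = bisect.bisect_left(self._tokens, token(partition_key))
        count = min(rf, len(self._ring))
        size = len(self._ring)
        return [self._ring[(start + i) % size][1] for i in range(count)]


if __name__ == "__main__":
    ring = Ring({"a": -6 * 10**18, "b": -2 * 10**18,
                 "c": 2 * 10**18, "d": 6 * 10**18})
    print(ring.replicas("user:42", rf=3))
```

The modulo wrap-around is what makes the structure a ring: a key whose token is larger than every node token belongs to the node with the smallest token.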
The Ring Topology
The Storage Engine (LSM)
Each node runs an LSM tree per table. Writes flow through three stages:
- CommitLog: append-only durability log on disk. fsync per write batch (or periodic, depending on config). This is what makes writes durable before they hit the SSTable.
- Memtable: in-memory sorted structure (skiplist by default; a trie-based implementation is available in 5.0+). Per-table. Flushed to disk when it fills.
- SSTable: immutable on-disk file. Sorted by token, then clustering key. Has a Bloom filter, partition index, and partition summary. Compacted in the background to reclaim deleted/tombstoned space.
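The three stages above can be modeled as a toy in-memory structure. This is a sketch for intuition only: `TinyLSM` is a hypothetical class, everything lives in Python lists and dicts rather than on disk, and there is no Bloom filter, index, or compaction.

```python
from typing import Optional


class TinyLSM:
    def __init__(self, memtable_limit: int = 3):
        self.commit_log: list[tuple[str, str]] = []       # append-only durability log
        self.memtable: dict[str, str] = {}                # in-memory buffer
        self.sstables: list[list[tuple[str, str]]] = []   # immutable runs, newest last
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))   # 1. append to the commit log
        self.memtable[key] = value             # 2. insert into the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when the memtable fills

    def flush(self) -> None:
        if not self.memtable:
            return
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs replay

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # newest SSTable wins
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

Note that `write` never reads existing data, which is the property that keeps the write path cheap; the cost shows up in `read`, which may have to look through several SSTables.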
Coordinator and Replicas
Any node can be a coordinator. When a client connects (typically via a driver that knows the topology), the driver picks a coordinator close to the data. The coordinator hashes the partition key, identifies the N replica nodes, and fans out the request to them. For writes, it waits for the consistency level number of acks. For reads, it queries the closest replica(s) and uses digest checks against the others.
Gossip and Failure Detection
There is no master, so nodes discover each other via gossip. Once a second, each node exchanges state with a small random subset of peers. State propagates probabilistically in O(log N) rounds. Failure detection uses an accrual detector: nodes track inter-gossip intervals and compute a suspicion score (phi) that a peer is dead. The default phi threshold is 8. Occasional false positives are tolerable because the consequence is only that the coordinator skips a replica, not data loss.
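To make the phi score concrete, here is a minimal sketch of an accrual detector. It assumes heartbeat gaps follow an exponential distribution, so phi is the negative base-10 logarithm of the probability that a gap this long occurs by chance. The class name and methods are hypothetical, and the real detector's windowing and tuning details are not reproduced here.

```python
import math
from typing import Optional


class PhiAccrualDetector:
    def __init__(self, threshold: float = 8.0):
        self.threshold = threshold
        self.intervals: list[float] = []
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat and the gap since the previous one."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: grows the longer the peer stays silent."""
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(gap > elapsed) = exp(-elapsed / mean); phi = -log10 of that.
        return elapsed / (mean * math.log(10))

    def is_suspected(self, now: float) -> bool:
        return self.phi(now) > self.threshold


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for second in range(10):          # ten heartbeats, one per second
        detector.heartbeat(float(second))
    for silence in (1, 10, 20):
        t = 9.0 + silence
        print(silence, round(detector.phi(t), 2), detector.is_suspected(t))
```

Under this model, with one-second heartbeats and a threshold of 8, a peer is convicted after roughly 18 seconds of silence. A long GC pause on either side inflates the gap in the same way, which is how pauses turn into false positives.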
3 · Core Concepts
Partition Key, Clustering Key, Primary Key
The primary key is composite: PRIMARY KEY ((partition_key), clustering_key_1, clustering_key_2). The partition key chooses the node. Clustering keys sort rows within the partition. A query without a partition key restriction is a full cluster scan, which Cassandra rejects unless you explicitly add ALLOW FILTERING.
This is the single most important design constraint. The Discord war story comes down to this: they partitioned messages by channel_id. A few channels in popular Discord servers had hundreds of millions of messages. A query for "last 50 messages in channel X" was fast, but compaction on that partition, repair on that partition, and any read that touched the partition were all hot. They eventually re-partitioned with a bucket, using (channel_id, bucket_id) as the partition key, to bound partition size.
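A bucketed partition key of the kind described above can be derived from the message timestamp. The sketch below assumes a fixed 10-day bucket width, which is an illustrative choice rather than a recommendation; `bucket_for`, `partition_key`, and `buckets_between` are hypothetical helper names.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 10 * 24 * 3600  # assumed bucket width: 10 days


def bucket_for(ts: datetime, bucket_seconds: int = BUCKET_SECONDS) -> int:
    """Return the time-bucket number that a timestamp falls into."""
    return int(ts.timestamp()) // bucket_seconds


def partition_key(channel_id: int, ts: datetime) -> tuple[int, int]:
    """Composite partition key (channel_id, bucket_id)."""
    return (channel_id, bucket_for(ts))


def buckets_between(start: datetime, end: datetime) -> list[int]:
    """Buckets a reader must visit, newest first, to cover a time range."""
    return list(range(bucket_for(end), bucket_for(start) - 1, -1))


if __name__ == "__main__":
    now = datetime(2024, 1, 20, tzinfo=timezone.utc)
    earlier = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(partition_key(42, now))
    print(buckets_between(earlier, now))
```

The cost of bounding partition size this way is visible in `buckets_between`: a "last 50 messages" read starts at the newest bucket and may have to step back through older buckets, one partition read each, until it has enough rows.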
Replication Factor (RF) and Replication Strategy
RF is per-keyspace. The two strategies that matter are SimpleStrategy (single DC, dev only) and NetworkTopologyStrategy (production, RF per DC). NTS uses snitches (rack-aware or DC-aware) to place replicas on different racks within a DC, so a rack failure doesn't kill a quorum. A typical multi-DC production keyspace: { 'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west': 3, 'eu-west': 3 }.
Consistency Levels
| Level | Meaning | When to Use |
|---|---|---|
| ANY | Hint to any node (incl. hinted handoff). Lowest durability. | Fire-and-forget telemetry where loss is OK. |
| ONE | 1 replica acks. | Edge writes, observability, max throughput. |
| LOCAL_ONE | 1 replica in the local DC. | Same as ONE but DC-pinned. Default for many fast paths. |
| QUORUM | ⌈(RF+1)/2⌉ replicas across all DCs. | Strong consistency across DCs. Expensive: hits cross-region latency. |
| LOCAL_QUORUM | Quorum within the local DC. | The 95% production default. Strong within DC, async across DCs. |
| EACH_QUORUM | Quorum in every DC. | Cross-region critical writes (rare). Worst latency. |
| ALL | Every replica acks. | Almost never. One node down = write fails. |
| SERIAL / LOCAL_SERIAL | Paxos-based linearizable consistency. | Lightweight transactions only. 4x slower than QUORUM. |
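The table above translates into a concrete number of acknowledgements once the replication factor per DC is known. The following sketch computes that number for the common levels; `required_acks` is a hypothetical helper, the `"*"` key is a convention of this example meaning "replicas in any DC", and ANY and the serial levels are left out.

```python
def quorum(n: int) -> int:
    """Majority of n replicas."""
    return n // 2 + 1


def required_acks(cl: str, rf_by_dc: dict[str, int], local_dc: str) -> dict[str, int]:
    """Acks the coordinator waits for, keyed by DC ('*' = any DC)."""
    total = sum(rf_by_dc.values())
    if cl == "ONE":
        return {"*": 1}
    if cl == "LOCAL_ONE":
        return {local_dc: 1}
    if cl == "QUORUM":
        return {"*": quorum(total)}            # majority of all replicas
    if cl == "LOCAL_QUORUM":
        return {local_dc: quorum(rf_by_dc[local_dc])}
    if cl == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in rf_by_dc.items()}
    if cl == "ALL":
        return {"*": total}
    raise ValueError(f"unsupported consistency level: {cl}")


if __name__ == "__main__":
    rf = {"us-east": 3, "us-west": 3, "eu-west": 3}
    for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"):
        print(level, required_acks(level, rf, local_dc="us-east"))
```

With three DCs at RF=3 each, QUORUM needs 5 of 9 replicas, so at least some acks must cross a region boundary, while LOCAL_QUORUM needs only 2 replicas in the coordinator's own DC.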
Tombstones and gc_grace_seconds
Deletes in Cassandra don't remove data. They write a tombstone — a marker that says "this row/cell was deleted at time T." Tombstones must propagate to all replicas before they can be safely garbage collected, because if one replica missed the delete and the tombstone disappeared too soon, a re-replication would resurrect the data.
The window is gc_grace_seconds, default 10 days. Tombstones can only be purged during compaction after this window. Two pathologies follow:
- Read amplification: A partition with many tombstones forces the read path to scan past them. The default threshold is 100K tombstones per query — past this, the query fails. Past 1K, you get warnings in the log.
- Resurrection: If a node is offline longer than gc_grace_seconds and tombstones get purged on the live nodes, the offline node's old data will re-appear when it rejoins. The fix is to run repair before gc_grace expires.
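Both pathologies above reduce to comparing elapsed time against gc_grace_seconds. The sketch below encodes the two checks; the function names are hypothetical, all times are plain epoch seconds, and the resurrection check is deliberately conservative (it flags any outage longer than the grace window, regardless of what was deleted).

```python
DAY = 24 * 3600
GC_GRACE_SECONDS = 10 * DAY  # the 10-day default cited above


def can_purge(tombstone_written_at: int, now: int,
              gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """A tombstone may be dropped by compaction only after gc_grace has passed."""
    return now - tombstone_written_at > gc_grace


def risks_resurrection(downtime_seconds: int,
                       gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """A node that was down longer than gc_grace may hold data whose
    tombstones have already been purged elsewhere."""
    return downtime_seconds > gc_grace


if __name__ == "__main__":
    print(can_purge(tombstone_written_at=0, now=9 * DAY))
    print(can_purge(tombstone_written_at=0, now=11 * DAY))
    print(risks_resurrection(downtime_seconds=12 * DAY))
```

A node that fails the second check is usually safer to wipe and re-bootstrap than to rejoin with its old data.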
Compaction Strategies
| Strategy | Best For | Write Amp | Read Amp | Gotcha |
|---|---|---|---|---|
| STCS (Size-Tiered) | Write-heavy, low read needs. Default in 3.x/4.x. | Low (~1-3x) | High (key may exist in any SSTable) | "Big SSTable problem": a 4GB SSTable needs 3 more 4GB peers to compact. Can sit uncompacted for months. |
| LCS (Leveled) | Read-heavy, predictable latency. Wide rows. | High (~10x per level) | Low (1 SSTable per level) | Writes can fall behind compaction during bulk loads. SSDs only. |
| TWCS (Time-Window) | Time-series with TTL. Append-only. | Low | Medium | Mixing TTL'd and non-TTL'd data inside one window blocks tombstone drops. Bucket size matters. |
| UCS (Unified, 5.0+) | General-purpose. Replaces all of the above with one tunable. | Tunable | Tunable | Newer, fewer production scars. Most teams will migrate over the next 1-2 years. |
Snitches and Topology Awareness
Snitches tell the coordinator which DC and rack each node belongs to. GossipingPropertyFileSnitch is the production default. Replicas are placed on different racks to survive rack failure. Cross-DC traffic is minimized because the coordinator prefers local replicas for reads.
4 · Execution Model
Write Path
Read Path
Anti-Entropy: Hinted Handoff, Read Repair, Anti-Entropy Repair
Three mechanisms keep replicas in sync:
| Mechanism | When It Runs | What It Fixes | Cost |
|---|---|---|---|
| Hinted handoff | Immediate, while a replica is down | Writes that couldn't reach a down node | Cheap, bounded (3h default) |
| Read repair | On every read at QUORUM or higher | Drift detected via digest mismatch | Adds latency to that read; small |
| Anti-entropy repair (Merkle tree) | Scheduled, nodetool repair | Drift over time (e.g., after long outages) | Expensive. Must run before gc_grace expires. |
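The Merkle-tree comparison behind anti-entropy repair can be illustrated with a small sketch. It assumes a toy token space of non-negative integers split into equal-width ranges, and it compares a root digest first and leaf digests only on mismatch. The function names are hypothetical, and real repair builds deeper trees per token range and streams the differing data between replicas.

```python
import hashlib


def leaf_hashes(rows: dict[int, str], num_ranges: int, token_space: int) -> list[str]:
    """Hash the rows of each equal-width token range into one digest per range."""
    width = token_space // num_ranges
    buckets = [hashlib.sha256() for _ in range(num_ranges)]
    for tok in sorted(rows):
        idx = min(tok // width, num_ranges - 1)
        buckets[idx].update(f"{tok}={rows[tok]};".encode("utf-8"))
    return [b.hexdigest() for b in buckets]


def root_hash(leaves: list[str]) -> str:
    """Fold leaf digests pairwise until a single root digest remains."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last node on odd levels
        level = [hashlib.sha256((a + b).encode("utf-8")).hexdigest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]


def ranges_to_stream(a: dict[int, str], b: dict[int, str],
                     num_ranges: int = 4, token_space: int = 1000) -> list[int]:
    """Indexes of token ranges whose contents differ between two replicas."""
    la = leaf_hashes(a, num_ranges, token_space)
    lb = leaf_hashes(b, num_ranges, token_space)
    if root_hash(la) == root_hash(lb):
        return []  # replicas agree; nothing to stream
    return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]


if __name__ == "__main__":
    replica_a = {10: "x", 300: "y", 800: "z"}
    replica_b = {10: "x", 300: "stale", 800: "z"}
    print(ranges_to_stream(replica_a, replica_b))
```

The point of the tree is that replicas exchange digests rather than data: only the ranges whose digests differ need to be streamed, which is why repair cost scales with drift plus the fixed cost of hashing everything once.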
5 · Feature Deep Dive (12 features)
Twelve core capabilities, with PE-grade nuance: what each feature buys you, where it hurts, and the production-grade caveat.
5.1 · Masterless Ring Topology
// Every node equal. No primary, no shard manager, no ZooKeeper.
Every node knows the full ring topology via gossip. Any node can coordinate any request. There is no leader election, no split-brain protocol (because there is no leader), no failover dance. The trade-off: there is no global ordering, no global transactions, and no efficient cross-partition consistency.
Where it bites: Schema changes are eventually consistent. DDL during a network partition can leave nodes with divergent schemas. Always run schema changes from a single node, with no in-flight repairs.
5.2 · Linear Scalability
// Throughput grows ~linearly with node count.
Double the nodes, roughly double the throughput. This is true because the partition key spreads writes uniformly across the ring and each node owns roughly 1/N of the data and traffic. Netflix has clusters with thousands of nodes; Apple operates clusters of 1000+ nodes.
Where it bites: Linearity assumes uniform partition load. One hot partition breaks linearity entirely — Discord's hot-partition story is exactly this. Also: gossip and schema propagation are O(N), so very large clusters (1000+) start to feel gossip pressure.
5.3 · Peer-to-Peer Replication
// All replicas equal. Writes go to all. Reads from any quorum.
There is no primary replica per partition. The coordinator fans writes out to all N replicas in parallel. There is no replication lag in the leader-follower sense; the only lag is between when a replica acks and when slower replicas catch up via hinted handoff or repair.
Where it bites: Conflict resolution is last-write-wins (LWW) by timestamp. If two clients write the same cell at nearly the same time and their clocks are skewed, the write with the lower timestamp silently disappears, even if it actually happened later. Keep clocks tightly synchronized with NTP and consider monotonic client-side timestamps for high-write hot paths.
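The last-write-wins rule is short enough to state as code. In this sketch a cell is a (timestamp, value) pair and the winner is simply the larger pair; the `Cell` type and `resolve` function are illustrative, and breaking timestamp ties by comparing values is an assumption of this example.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    timestamp_us: int   # writer-supplied timestamp, microseconds
    value: str


def resolve(a: Cell, b: Cell) -> Cell:
    """Last-write-wins: higher timestamp wins; ties fall back to value order."""
    return max(a, b)   # tuple comparison: timestamp first, then value


if __name__ == "__main__":
    # Client A writes first, but its clock runs 500 microseconds fast.
    from_a = Cell(timestamp_us=1_000 + 500, value="from-A")
    # Client B writes 200 microseconds later with an accurate clock.
    from_b = Cell(timestamp_us=1_200, value="from-B")
    print(resolve(from_a, from_b))
```

In the demo, the write that actually happened later (`from-B`) loses because its timestamp is lower. No error is raised anywhere, which is what makes skew-induced loss hard to detect.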
5.4 · High-Speed Sequential Writes
// CommitLog append + memtable insert. No read-before-write.
The LSM design means writes are sequential disk I/O (commit log append) plus an in-memory insert (memtable). There is no random I/O on the write path, no row lock, no read-before-write. Single-node write throughput on modern NVMe approaches the CommitLog fsync rate.
Where it bites: The cost is paid later, at compaction. Write-heavy clusters with sloppy compaction strategy fill disks with redundant data. Storage planning needs to account for 2-3x raw size during compaction, plus tombstone overhead.
5.5 · Column-Family Data Structure
// Wide-row model. Rows can have different columns. Sparse cells are free.
Each partition can hold up to 2 billion cells (clustering key + column combinations). A partition is identified by its partition key; cells are addressed by (clustering keys, column name). Sparse cells (NULL) are not stored. Schemas can evolve by adding columns without rewriting rows.
Where it bites: The "wide row" pattern is a footgun. A partition with 1M clustering rows works; one with 100M kills compaction. The recommended max is around 100MB per partition or low millions of clustering rows.
5.6 · Cassandra Query Language (CQL)
// SQL-shaped surface, NoSQL semantics underneath.
CQL looks like SQL: SELECT, INSERT, UPDATE, DELETE, CREATE TABLE. What it does not look like under the hood: there are no joins, no subqueries, no aggregates across partitions (without ALLOW FILTERING), no foreign keys, no constraints. Every query must hit the partition key or it is a full-cluster scan.
Where it bites: Developers fresh from SQL write queries Cassandra rejects (or worse, accepts with ALLOW FILTERING). Code review needs a rule: no query without a fully-specified partition key.
5.7 · Tunable Consistency Levels
// Trade availability for consistency per query.
Set CL on every read and every write. Use ONE for cheap eventually-consistent reads, LOCAL_QUORUM for strong local consistency, EACH_QUORUM for strong global consistency. The math W + R > RF determines whether you read your own writes. Few mainstream databases expose the CAP trade-off this granularly.
Where it bites: Teams default to QUORUM when they should use LOCAL_QUORUM, paying 100+ ms of cross-region latency on every read. Or they default to ONE for everything and get stale reads after writes. The default for almost all production workloads should be LOCAL_QUORUM/LOCAL_QUORUM.
5.8 · Multi-Data Center Replication
// NetworkTopologyStrategy + GossipingPropertyFileSnitch = N-DC live.
Replicas are placed per DC according to the keyspace strategy. With a LOCAL_* consistency level, cross-DC replication is asynchronous from the client's point of view: the coordinator sends the write to every DC but acknowledges as soon as the local DC meets the consistency level. LOCAL_QUORUM keeps cross-DC latency off the request path. EACH_QUORUM puts it on the path.
Where it bites: Bootstrapping a new DC at scale is slow and bandwidth-heavy. Streaming TB of data across regions can take days. Plan dual-write windows and use nodetool rebuild with rate limits.
5.9 · Fault-Tolerant Routing
// Token-aware driver bypasses the coordinator hop.
Modern drivers (DataStax, Scylla) are token-aware: they hash the partition key client-side and route directly to a replica node. This eliminates one network hop on the write/read path. The coordinator role still exists for fan-out, but the driver picks a replica as the coordinator.
Where it bites: Outdated drivers or naive load balancers (round-robin without token awareness) double your tail latency by routing to a random node that proxies to the correct one. Always use token-aware load balancing in production.
5.10 · Active-Active Global Replication
// Multiple DCs accept writes simultaneously. LWW resolves conflicts.
Cassandra supports active-active across DCs without external machinery. Each DC accepts writes locally with LOCAL_QUORUM and replicates asynchronously to other DCs. Conflicts are resolved by timestamp (LWW). This is unique to Cassandra at this scale — Postgres and MySQL require external tools (Bucardo, Tungsten) and a lot of operational care.
Where it bites: LWW is brutal. If two DCs both write the same row inside the cross-DC replication window, the later write wins and the earlier write is lost. For anything resembling a counter or shopping cart, you need CRDTs at the application layer, not LWW.
5.11 · Row-Level Time-to-Live (TTL)
// Per-column TTL. Expired data becomes a tombstone.
You can set TTL on inserts (INSERT ... USING TTL 86400) or per-column updates. When TTL expires, the cell becomes a tombstone and is eventually compacted out. Excellent for time-series, session stores, rate-limit counters, ephemeral data.
Where it bites: TTL'd data turns into tombstones, and tombstones cost reads. If you have a TTL of 30 days but gc_grace_seconds of 10 days, you still pay tombstone cost for 10 days after each expiry. Combine TTL with TWCS (or UCS in 5.0+) so that whole SSTables expire together, no compaction needed.
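The timing in the paragraph above is simple addition: a cell expires at write time plus TTL, and only becomes purgeable gc_grace_seconds after that. The sketch below also shows why a whole SSTable can be dropped only when every cell in it has passed that point. Function names are hypothetical, and times are plain epoch seconds.

```python
from typing import Iterable, Optional, Tuple

DAY = 24 * 3600


def expires_at(written_at: int, ttl: int) -> int:
    """Moment the cell stops being readable and turns into a tombstone."""
    return written_at + ttl


def purgeable_at(written_at: int, ttl: int, gc_grace: int) -> int:
    """Earliest moment compaction may physically remove the expired cell."""
    return expires_at(written_at, ttl) + gc_grace


def sstable_fully_droppable(cells: Iterable[Tuple[int, Optional[int]]],
                            now: int, gc_grace: int) -> bool:
    """cells: (written_at, ttl) pairs; ttl=None means the cell never expires."""
    return all(ttl is not None and now > purgeable_at(written_at, ttl, gc_grace)
               for written_at, ttl in cells)


if __name__ == "__main__":
    # TTL of 30 days with gc_grace of 10 days: purgeable 40 days after the write.
    print(purgeable_at(0, 30 * DAY, 10 * DAY) // DAY)
    window = [(0, 30 * DAY), (DAY, 30 * DAY)]
    print(sstable_fully_droppable(window, now=42 * DAY, gc_grace=10 * DAY))
```

A single cell without a TTL makes `sstable_fully_droppable` return False forever, which is the arithmetic behind the advice not to mix TTL'd and non-TTL'd data in the same time window.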
5.12 · Secondary Indexing (now SAI in 5.0)
// Legacy 2i: scatter-gather across all nodes, slow. SAI: per-SSTable, fast.
Legacy secondary indexes (2i) are local per-node, meaning a query that is not restricted to a partition has to scatter-gather across all nodes. They are workable for partition-restricted queries and poor for cluster-wide queries, especially on very low-cardinality columns. Cassandra 5.0's Storage-Attached Index (SAI) is per-SSTable, much faster for varied predicates, and supports more types (incl. vectors for ANN search).
Where it bites: Even with SAI, secondary indexes are a query convenience, not a substitute for a properly designed denormalized table. Discord built dedicated lookup tables for every access pattern rather than relying on indexes.
6 · Trade-Offs (10 rows)
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Eventual consistency (AP) over strong consistency (CP) | Multi-DC live writes, no outages during partitions, <10ms write latency. | No read-your-writes by default, no serializability, last-write-wins resolves conflicts. | Counter columns lose increments under partition; shopping cart merges silently overwrite; account balance reads can lag. | LOCAL_QUORUM/LOCAL_QUORUM gets you strong within a DC. CRDTs or LWT cover the rest. Never use Cassandra for ledgers. |
| Write speed via LSM over read speed via B-tree | Writes are 5-10x faster than B-tree DBs. No read-before-write. Append-only disk pattern. | Reads may touch multiple SSTables. Read amplification grows with sloppy compaction. | Read-heavy access patterns on partitions with many SSTables (STCS endgame) start exceeding 50ms p99. | If your read:write ratio is 5:1 or higher, consider Postgres or DynamoDB. Cassandra shines at 1:5 and beyond. |
| Schema-on-write rigidity over schema flexibility | Predictable storage, no surprises at read time. Type safety in CQL. | Cannot adapt to unknown access patterns. New query = new table = backfill. | Product team adds a new dashboard filter ("show me by author") and you discover Cassandra can't do it without a new denormalized table. | Build the data model from queries backward. The Netflix/Discord pattern: one table per query. SAI in 5.0 narrows this gap but does not close it. |
| Partition-key locality over relational joins | Single-partition reads are O(1) to find, sequential to read. Predictable p99. | No joins. Cross-partition queries are full-cluster scatter-gather or impossible. | Reports, ad-hoc analytics, anything requiring "find all X where Y." You need Spark or a search index (Solr, Elasticsearch). | Pair Cassandra with a streaming pipeline (Kafka → Flink → ES/Pinot) for the analytical access patterns it cannot serve. |
| Linear horizontal scale over single-node performance | Add nodes, get throughput. Apple runs 300K nodes. No vertical ceiling. | Per-node throughput is mediocre vs. a tuned Postgres or Scylla node. Hardware utilization is poor. | Below ~5 nodes, you're paying coordination overhead for capacity you don't need. Above 1000 nodes, gossip pressure starts to bite. | Discord went from 177 Cassandra nodes to 72 Scylla nodes for the same workload. Cassandra trades single-node efficiency for ring simplicity. |
| JVM/GC operational complexity over native binary simplicity | Mature JVM tooling, JFR, JMX metrics, decade of GC tuning lore. | GC pauses (especially with G1) cause tail latency spikes and gossip false-positives. Tuning is a job. | Hot partitions trigger massive object allocation, which triggers stop-the-world GC, which triggers gossip flaps, which triggers cascading retries. | The single biggest reason teams migrate to ScyllaDB. C++/Seastar = no GC. Discord cited this explicitly as their motivation. |
| Tombstone-based deletes over in-place deletes | Deletes are O(1) writes. Replication can happen at any time. Resurrection-safe within gc_grace. | Deletes are not free. Tombstones consume disk and slow reads until compaction. | Queue-as-table anti-pattern: heavy DELETE traffic creates a tombstone wall that fails reads (default cap: 100K tombstones/query). | If your access pattern is delete-heavy, you've designed wrong. Use TTL or model as immutable append-only. Cassandra is not a message queue. |
| Tunable consistency over fixed semantics | Per-query control. Reads at ONE, writes at QUORUM, depending on the path. | Easy to misuse. Three CL knobs (read, write, serial) per call site. Defaults rarely match the workload. | A new team copies an old service's QUORUM read pattern and pays 60ms cross-region per call. | Codify CL in a shared client wrapper. Don't let app code set CL ad-hoc. Make LOCAL_QUORUM the unmissable default. |
| Active-active multi-DC over leader-follower | Every region accepts writes. RTO ≈ 0 on regional failover. No leader election. | Conflict resolution is LWW. Application must be conflict-aware for non-monotonic data. | Two regions write the same row inside the replication window. The later timestamp wins; the earlier write is gone. | For anything resembling a counter or set, use Cassandra counters carefully or CRDTs at the app layer. Don't trust LWW for anything mutable. |
| Operational predictability over operational simplicity | Once tuned, runs for years. Netflix has clusters from 2014 still running. | Getting tuned is hard. Repair, compaction strategy, GC, snitches, vnodes — every dimension wants attention. | First production incident on a self-managed cluster. Usually compaction or repair starvation. Nobody's first Cassandra runbook is good. | Buy managed (Astra, Instaclustr) until you have 2+ SREs who own the cluster as a product. The DIY savings vanish in the first incident. |
7 · Use Cases (7 rows)
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| User viewing history / time-series per-user | Netflix CompressedVH | Append-heavy writes (9:1 write:read), wide rows compressed per user, predictable p99 reads of last-N records. | 98% of streaming data; hundreds of clusters; tens of thousands of nodes; PBs; millions of TPS. | Postgres can't shard linearly to PBs. DynamoDB lacks multi-region active-active at the same cost. HBase needs HDFS+ZooKeeper. |
| Chat message storage | Discord (until 2022) → Scylla | Insert-heavy, partition by (channel_id, bucket), clustering by message_id (TimeUUID) for tail reads. | Trillions of messages; 177 nodes peak before Scylla migration. | MongoDB: working set must fit in RAM. Postgres: doesn't scale writes linearly. The Scylla move was JVM/hot-partition pain, not the model. |
| Trip event ingestion, marketplace state, telemetry | Uber Cassandra-as-a-Service (on Odin/Mesos) | Multi-region active-active, single-digit ms writes for trip events, multi-DC replication for global availability. | Largest clusters: 1M+ writes/sec, ~100K reads/sec; tens of millions of QPS across the fleet. | MySQL/Schemaless: built in-house but limited to certain access patterns. DynamoDB: vendor lock-in, multi-region cost prohibitive. |
| iCloud metadata, iMessage, Siri data | Apple (largest Cassandra deployment on earth) | Petabyte multi-DC storage, masterless for zero-downtime ops, integrated with FoundationDB for the transactional layer. | ~300K nodes, 100+ PB, thousands of clusters, millions of QPS. | Nothing else operates at this scale with this operational pattern. Apple has internal forks and Cassandra committers. |
| Personalization stack: user taste profiles, playlist metadata | Spotify Discover Weekly / Daily Mix | Kafka → Storm → Cassandra. Real-time event-driven taste-profile updates. Bulk loads from HDFS via hdfs2cass. | 100+ Cassandra clusters; petabytes of user attributes and entity metadata. | HBase: Spotify uses it but for different patterns. Postgres: doesn't handle the write throughput. Cassandra fits the wide-row, time-stamped, low-coupling pattern. |
| Security event logging, telepresence state | Cisco (firewalls, Webex collaboration) | Continuous high-volume event ingest, time-windowed retention via TTL+TWCS, multi-tenant isolation. | Multi-PB across hundreds of clusters; billions of events/day. | Elasticsearch: too expensive at this volume for cold data. Hadoop: not real-time. Kafka: not a long-term store. |
| IoT telemetry, smart-grid metering, vehicle tracking | Various (utilities, automotive) | TWCS-compacted time-series, per-device partition with bucketed clustering, TTL for retention. | Millions of devices, billions of points/day, multi-year retention. | InfluxDB/Timescale: scale ceiling per node. Cassandra scales linearly to fleet size. |
8 · Limitations
| Limitation | Severity | Workaround | Workaround Cost |
|---|---|---|---|
| No JOINs across tables | High | Denormalize: one table per query pattern. App-side fan-out for the rare cross-partition need. | 2-5x storage. Higher write amplification (write the same data into N tables). |
| No multi-row ACID transactions | Critical (for financial/ledger) | Lightweight transactions (Paxos) for single-partition CAS. App-layer sagas or external coordinator for multi-row. | LWT is ~4x slower than CL=QUORUM. Sagas add complexity and partial-failure semantics. |
| Wide-partition limit (~100MB / millions of clustering rows) | High | Bucket the partition key with a composite: ((user_id, day_bucket)). | Reads spanning buckets become multi-partition fan-outs. Bucket sizing is workload-specific tuning. |
| Tombstone cap (default 100K per query) | High | Avoid delete-heavy patterns. Use TTL + TWCS so whole SSTables expire. Tune tombstone_warn_threshold. | Forces design discipline. Some legitimate patterns (queue-as-table) become non-viable. |
| Secondary indexes (2i) don't scale across cluster | Medium (5.0 SAI mitigates) | Build inverted-index tables manually, or use SAI in 5.0+ for per-SSTable indexing. | Manual: doubles writes. SAI: still recent, limited operator scars in production. |
| JVM GC pauses affect tail latency | High | Tune G1 or ZGC. Avoid hot partitions. Use off-heap memtables and bloom filters. | GC tuning is specialist work. Even tuned, p99.99 occasionally spikes 100ms+. Scylla migration eliminates this entirely. |
| Repair operation is expensive and easy to neglect | Critical | Use Cassandra Reaper or built-in incremental repair. Schedule to complete within gc_grace_seconds. | Adds I/O and CPU load. Mis-scheduled repair causes data resurrection (zombie data). |
| Counter columns have anomaly windows under failure | Medium | For exact counts, use append-only event log + batch aggregation (Spark/Flink) instead of counters. | Higher storage cost, batch latency for the count, but accuracy. |
| Bootstrapping a new DC is slow and bandwidth-heavy | Medium | nodetool rebuild with throttling. Dual-write window. Zero-copy streaming in 4.0+. | Days to weeks for multi-TB clusters. Network costs in cloud non-trivial. |
| Schema changes are eventually consistent | Medium | Run DDL from a single node. Avoid DDL during repairs or network instability. Validate convergence with nodetool describecluster. | Schema-disagreement bugs are silent and only surface on read. |
9 · Fault Tolerance
| Dimension | Behavior | Operational Reality |
|---|---|---|
| Replication model | Leaderless, peer-to-peer. All replicas accept writes. RF per keyspace per DC. | Production default: RF=3 per DC. RF=1 only for dev. RF=2 is a trap (no quorum survives a single failure). |
| Failure detection | Accrual failure detector (phi-based). Gossip every second. Default phi threshold = 8. | Detection in 5-30 seconds. Can false-positive under GC pauses; tune phi_convict_threshold higher if seeing flapping. |
| Failover mechanism | None needed. Coordinator skips down replicas and meets CL with remaining replicas. | Single-node failure: invisible to clients at CL=LOCAL_QUORUM with RF=3. Two-node failure: writes fail at LOCAL_QUORUM unless RF≥5. |
| RTO (typical) | 0 for single-node failure. Seconds for rack/AZ failure (gossip propagation). | Apple, Netflix: clients see no impact on single-node loss. Token-aware drivers reroute instantly. |
| RPO (typical) | 0 at CL=QUORUM in healthy state. Up to commit-log fsync window (typically 10s) on simultaneous multi-node failure. | commitlog_sync: periodic is durable on the fsync interval. commitlog_sync: batch is durable per write but slower. Most prod runs periodic at 10s. |
| Split-brain behavior | None. There is no leader to lose. Both sides of a partition accept writes; conflicts resolved by LWW timestamp at reconciliation. | The strength of the model. Network partitions don't cause outages or data loss, but they do cause write divergence that LWW will silently resolve. |
| Blast radius of single-node failure | ~1/N of writes briefly route around the failed node (hinted handoff). No data loss with RF≥3. | The point of the architecture. Apple's 300K-node fleet sees nodes die routinely with zero customer impact. |
| Cross-region failover story | None needed. Active-active. If us-east loses 1 of its 3 replicas, LOCAL_QUORUM continues serving from the remaining 2. Other DCs unaffected. | Uber: lost a region during the 2017 us-east-1 outage with no production impact on Cassandra workloads. |
| Data loss scenarios | (1) Lose all RF replicas before repair (RF=3, lose 3 nodes simultaneously holding same range). (2) Tombstone resurrection from a node down longer than gc_grace_seconds. | Both are avoidable: rack-aware placement prevents (1); regular repair prevents (2). Apple/Netflix have run for a decade without significant data loss. |
11 · Replication
| Dimension | Behavior | Operational Reality |
|---|---|---|
| Replication topology | Leaderless. Every replica accepts writes. NetworkTopologyStrategy places replicas per DC, rack-aware. | Production default: NTS with RF=3 per DC. Replicas on different racks within a DC to survive AZ failure. |
| Sync vs async | Intra-DC: sync to local quorum (LOCAL_QUORUM). Inter-DC: async by default; the coordinator sends the write to remote DC replicas as well but returns as soon as the local CL is met. | EACH_QUORUM makes inter-DC sync (writes wait for quorum in every DC). Almost never used in production due to latency cost. |
| Replication factor (default / max) | Default: RF=3 per DC. Max: bounded by node count per DC. Common in production: 3 to 5. | RF=3 is the canonical choice. RF=5 used by some financial workloads to tolerate 2 failures at QUORUM. RF=1 = dev only. |
| Consistency level options | Per-query: ANY, ONE, TWO, THREE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, LOCAL_ONE, SERIAL, LOCAL_SERIAL. | ~95% of production traffic uses LOCAL_QUORUM. SERIAL only for LWT (compare-and-set). ALL is a trap; one slow replica slows every request that touches its token range, and one down replica fails them. |
| Replication lag (typical) | Intra-DC: sub-millisecond. Inter-DC: bounded by network RTT (50-300ms). Hinted handoff stores writes for a replica that is offline for up to 3 hours (the default hint window) and replays them when it returns; longer outages need repair. | Uber Mesos-Cassandra paper: cross-DC p99 replication lag ~47ms. Netflix runs similar numbers globally. |
| Conflict resolution | Last-write-wins (LWW) by cell timestamp. Higher timestamp wins; ties broken by value comparison. | The major footgun. Clock skew + concurrent writes = silent data loss. Use NTP, monotonic timestamps, or app-layer CRDTs for mutable data. |
| Cross-region replication | Native via NTS. No external tools, no log shipping. Each region is a first-class DC in the keyspace. | The crown jewel of Cassandra. This is why Netflix and Uber chose it. Postgres needs Bucardo or BDR; Cassandra needs nothing. |
| Replication during partition | Each side of the partition continues accepting local writes. Reads succeed if local CL is met. On heal: hinted handoff and read repair reconcile via LWW. | This is the AP behavior. Some writes will be lost (LWW) but no node refuses traffic. The opposite trade-off from Postgres or Spanner. |
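The consistency-level trade-offs in this table reduce to simple arithmetic on the replication factor. The sketch below works through it; the function names are illustrative, not part of any driver API.

```python
def quorum(rf: int) -> int:
    """Replicas that must acknowledge at QUORUM: a strict majority of RF."""
    return rf // 2 + 1


def tolerated_failures(rf: int, required_acks: int) -> int:
    """How many replicas of a token range can be down while requests still succeed."""
    return rf - required_acks


def reads_see_latest_write(rf: int, read_acks: int, write_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > rf


for rf in (3, 5):
    q = quorum(rf)
    print(
        f"RF={rf}: QUORUM={q}, "
        f"tolerates {tolerated_failures(rf, q)} down at QUORUM, "
        f"{tolerated_failures(rf, rf)} down at ALL, "
        f"QUORUM/QUORUM overlap={reads_see_latest_write(rf, q, q)}, "
        f"ONE/ONE overlap={reads_see_latest_write(rf, 1, 1)}"
    )
```

This is where the "RF=5 tolerates 2 failures at QUORUM" figure comes from, and why ONE reads with ONE writes give no read-your-writes guarantee: with RF=3, 1 + 1 is not greater than 3, so the read may land on a replica the write has not reached yet.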
12 · Better Usage Patterns
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Bucket the partition key | Partition by entity ID alone (user_id, channel_id), then watch popular entities create wide-partition pathology. | Composite partition key: ((user_id, day_bucket)). Bound partition size at design time, not at incident time. | Discord's degraded tail latencies (p99 reads of 40-125ms) were hot partitions on popular channels. Buckets sized for the busiest entities bound that risk before it becomes an incident. |
| Use TWCS (now UCS) for time-series with TTL | Default STCS on TTL'd time-series data. Tombstones never properly expire; reads accumulate amplification. | TWCS with bucket size = expected query window. SSTables age together; whole files drop when window expires. No compaction needed for cleanup. | Cassandra 5.0's UCS makes this less manual but the principle stands. Time-windowed compaction is essential for any TTL workload. |
| One table per query pattern | Try to make one table serve many access patterns. End up using ALLOW FILTERING or scatter-gather queries. | Denormalize. Build a table per access pattern. Write to all of them on every insert (via batched logged writes or app-layer fan-out). | Storage is cheap; latency is not. Netflix, Apple, and Discord all follow this rule. The 2-5x storage cost is the price of predictable p99. |
| Codify CL in client libraries | Let app code set CL ad-hoc per call. Inconsistent CL across services for the same logical operation. | Wrap the driver. Expose ergonomic methods like readStrong(), readFast(), writeLocal() that hide the CL choice. | The CL surface is too sharp for casual use. A shared wrapper makes the right thing easy and the wrong thing hard. |
| Use token-aware load balancing | Default round-robin driver. Every request goes through one extra coordinator hop. | Enable TokenAwarePolicy in the driver. Hash the partition key client-side and connect directly to a replica. | One fewer hop = ~30% lower latency on simple queries. Free if you upgrade the driver config; no server changes needed. |
| Schedule repair, don't run it on a whim | Run nodetool repair ad-hoc or skip it because "the cluster looks fine." | Use Cassandra Reaper (or its successors) to schedule incremental repair that completes within gc_grace_seconds for every keyspace. | Skipping repair = zombie data after node outages. The cluster looks fine until a node comes back from a long outage and resurrects deleted data. |
| Use logged batches only for multi-partition atomicity | Use BATCH for performance reasons ("batch up writes for speed"). It does the opposite. | BATCH is for atomicity across partitions, not throughput. For throughput, use async writes with concurrency control client-side. | A logged BATCH first writes the whole batch to a batchlog on other nodes, then applies the mutations to the replicas, then clears the batchlog. Slower than individual writes, not faster. |
| Avoid counter columns for anything you care about | Use counter columns to track "real-time" view counts, balances, anything that must be accurate. | Write an append-only event log of increments. Aggregate via Spark/Flink for accurate counts. Use counters only for advisory metrics. | Counters have anomaly windows during failure. Two simultaneous +1s on a partitioned cluster can result in one +1. The math doesn't recover. |
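The "async writes with concurrency control client-side" advice in the batch row can be sketched with a semaphore that caps in-flight requests. This is a driver-agnostic illustration: `write_row` is a hypothetical stand-in for one asynchronous driver write (here it only sleeps), and the `max_in_flight` value is an arbitrary assumption to tune against your own cluster.

```python
import asyncio


async def write_row(row: dict) -> None:
    """Hypothetical stand-in for a single async driver write; it only simulates latency."""
    await asyncio.sleep(0.001)


async def write_all(rows: list, max_in_flight: int = 32) -> int:
    """Write every row individually, never exceeding max_in_flight concurrent writes.

    Returns the peak number of writes that were in flight at once.
    """
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def guarded(row: dict) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # blocks once max_in_flight writes are outstanding
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await write_row(row)
            finally:
                in_flight -= 1

    await asyncio.gather(*(guarded(row) for row in rows))
    return peak


rows = [{"user_id": i, "day_bucket": "2026-06-08"} for i in range(200)]
peak_in_flight = asyncio.run(write_all(rows, max_in_flight=16))
print(f"wrote {len(rows)} rows, peak in flight: {peak_in_flight}")
```

Each write goes to its own partition's replicas independently, so throughput scales with the cap rather than with batch size, and the cap keeps a burst from overwhelming coordinators.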
13 · Advanced / Next-Gen Alternatives
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| ScyllaDB | C++/Seastar shared-nothing architecture. No JVM, no GC pauses. Shard-per-core. CQL-compatible drop-in. Discord cited 40-125ms→15ms read p99, 5-70ms→5ms write p99, 177→72 nodes. | Production | Medium. CQL-compatible; need to re-tune sstable format, rewrite some operational tooling, run a dual-write window. | JVM GC tail-latency pain is the dominant operational issue. Hardware bill is a concern at scale. |
| Amazon Keyspaces (managed Cassandra) | Serverless Cassandra-compatible. Auto-scaling, no node operations, pay-per-request. | Production | Low to medium. CQL-compatible at v3.11 level; some features missing (materialized views, batch). Pricing model differs sharply from self-managed. | Small to medium teams without dedicated SRE. Variable workloads where you don't want to pay for peak 24/7. |
| DataStax Astra | Managed Cassandra-as-a-service with newer Cassandra versions, vector search built-in, Stargate API gateway. | Production | Low. Same Cassandra; just managed. Easy on-ramp; lock-in via Stargate APIs if used. | Want managed Cassandra without leaving Cassandra. AI/vector use cases that need ANN. |
| DynamoDB | Fully managed, serverless, multi-region active-active (Global Tables). Smaller ops surface; predictable cost. | Production | High. Different data model (single-table design), different query semantics, AWS lock-in. Migration takes months for non-trivial workloads. | AWS-native shops without multi-cloud needs. Spiky workloads benefiting from on-demand billing. |
| CockroachDB / YugabyteDB | Distributed SQL. ACID, joins, secondary indexes, Postgres wire compatibility. Multi-region with serializable transactions. | Production | High. Different paradigm entirely. Worth it if you actually needed ACID and joins, in which case Cassandra was the wrong choice originally. | OLTP workloads with relational access patterns. Netflix migrated some Cassandra workloads to CockroachDB in 2020 for exactly this reason. |
| FoundationDB (with Record Layer) | ACID key-value with strong serializability. Apple uses it alongside Cassandra for the CloudKit transactional layer where Cassandra falls short. | Production | High. Lower-level API; needs schema/record layer on top. Operations are very different from Cassandra. | Need transactional semantics + Cassandra-style horizontal scale. Multi-tenant systems with strict isolation requirements. |
| Cassandra 5.0+ itself (SAI, UCS, Trie, Vector) | Not a replacement, an upgrade. SAI replaces 2i with per-SSTable indexes. UCS replaces STCS/LCS/TWCS. Trie memtables reduce memory. Vector search via HNSW for AI. | Emerging (GA Sep 2024) | Low. In-place upgrade. Most existing data models continue to work; new features are opt-in. | Already on Cassandra. Want better secondary indexing, AI/vector use cases, or fewer compaction-strategy decisions. |
14 · Production Stories
Six companies, deep enough to learn from each. Each story carries the optimization or lesson that's interview-worthy.
Netflix 98% of streaming data · 10Ks of nodes · PBs · millions of TPS
Netflix is the canonical Cassandra reference. After the 2008 on-prem outage that drove them to AWS, they needed a database that could span regions natively without manual failover. Cassandra was the choice in 2012, and by 2014 they had become its largest external committer and operator. Today Cassandra holds the majority of their streaming data: viewing history, bookmarks, billing state, customer metadata, recommendation features.
The Time-Series Redesign
The first major architectural lesson came from viewing history. Netflix processes over 140 million hours of viewing per day. Each play generates several viewing records. The initial design stored each user's full viewing history in a single Cassandra partition (one row per user, growing forever). It worked beautifully at first — fast partition reads, simple model — but as power users accumulated tens of thousands of records, partitions grew beyond the practical 100MB threshold.
The redesigned architecture introduced three optimizations:
- CompressedVH: compress the per-user partition with delta encoding + LZ4. Drastically reduced storage and read time for users with moderate histories.
- Live vs. Historical split: recent activity in a hot, small partition (live cluster). Historical data rolled up and moved to a separate cold cluster with different compaction.
- Chunked partitions for power users: users above a threshold get bucketed by (user_id, chunk_id) so no single partition grows unbounded.
Result, per the Netflix tech blog: ~6x reduction in data size, ~13x reduction in Cassandra maintenance time, ~5x reduction in read latency, ~1.5x reduction in write latency.
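The chunked-partition idea above can be sketched as a mapping from a record's position in a user's history to a (user_id, chunk_id) partition key. This is an illustrative assumption, not Netflix's published scheme: the function names and the `chunk_size` threshold are hypothetical.

```python
def chunk_for_record(record_index: int, chunk_size: int) -> int:
    """chunk_id for the record at this position in the user's history."""
    return record_index // chunk_size


def chunk_partition_keys(user_id: str, record_count: int, chunk_size: int) -> list:
    """All (user_id, chunk_id) partition keys needed to read the user's full history."""
    if record_count <= chunk_size:
        return [(user_id, 0)]  # ordinary users stay in a single partition
    chunk_count = -(-record_count // chunk_size)  # ceiling division
    return [(user_id, chunk_id) for chunk_id in range(chunk_count)]


# A moderate user fits in one partition; a power user fans out across three.
print(chunk_partition_keys("user-a", 500, chunk_size=1000))
print(chunk_partition_keys("user-b", 2500, chunk_size=1000))
```

The point of the design is that partition size is bounded by `chunk_size` regardless of how much any one user watches, at the cost of a multi-partition read for the heaviest users.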
Operational Optimizations
Apple ~300K nodes · 100+ PB · thousands of clusters · millions of QPS
Apple has the largest known Cassandra deployment on earth. The fleet has grown from ~75K nodes in 2015 to ~300K nodes by 2022. It backs significant portions of iCloud, iMessage, Siri, Maps, iTunes, and iAd. Apple employs Cassandra committers and has contributed extensively to upstream.
The Hybrid with FoundationDB
Apple's most interesting architectural decision is what they did NOT solve with Cassandra. After acquiring FoundationDB in 2015, they built CloudKit on top of FoundationDB's Record Layer for the transactional, strongly-consistent metadata layer. Cassandra remained for the high-throughput, eventually-consistent paths.
The split is instructive: Cassandra handles the volume (exabytes of mostly-immutable user data); FoundationDB handles the correctness (the "this user owns these records, in this state, with this revision" metadata). Each iCloud user gets a logical FoundationDB record store; the actual blob/metadata storage spans Cassandra and other systems.
The Cluster-Size Philosophy
Apple's approach is many-small-clusters, not one-giant-cluster. ~27K nodes in their largest single cluster, but ~300K total across thousands of clusters. The reasoning (publicly inferred from talks): gossip cost grows with cluster size; schema convergence gets harder; blast radius of operational mistakes grows. Multiple smaller clusters mean fewer hot spots, fewer noisy neighbors, and the ability to upgrade or replace one cluster without affecting the rest.
Operational Optimizations
Uber 1M+ writes/sec · tens of millions of QPS · 6+ years in prod
Uber adopted Cassandra to back real-time marketplace state: trip events, driver telemetry, marketplace pricing signals. The defining characteristic of Uber's workload is bursty multi-region writes that cannot tolerate regional failure, with read patterns that are usually local to the trip or driver.
The Mesos + Cassandra Architecture (2016)
Uber's now-famous talk from 2016 described running Cassandra in containers on Mesos across multiple data centers. The reasoning was agility: dev teams could provision Cassandra clusters declaratively, without bespoke operational work per cluster. The bare-metal-vs-Mesos overhead measured at 5-10%, which they accepted in exchange for orchestration consistency. Peak measured throughput: 1M+ writes/sec on the largest clusters with cross-DC p99 replication latency ~47ms.
The Odin + Cassandra Framework (2020+)
Uber later replaced the Mesos approach with Odin, their internal stateful control plane. The Cassandra-on-Odin framework provides one-click operations (seed node selection, rolling restart, capacity adjustment, node replacement, decommission). Application teams consume Cassandra as a service through a forked Go/Java client with enhanced observability.
Operational Optimizations
Spotify 100+ Cassandra clusters · PBs · personalization stack
Spotify's Cassandra usage is unusual: it's primarily a personalization and analytics-feature store, not a primary OLTP store. The pipeline is Kafka → Storm (later Flink) → Cassandra. Cassandra holds the materialized user-taste profile and entity (artist, track, playlist) metadata that feeds Discover Weekly, Daily Mix, and the home feed.
The hdfs2cass Bulk Load
Many of Spotify's Cassandra writes come not from real-time events but from Hadoop batch jobs. Spotify open-sourced hdfs2cass, a tool that writes SSTables directly from HDFS and bulk-loads them into Cassandra, bypassing the coordinator write path. This is essential for periodic recomputation of recommendation features — you don't want a billion writes per second to flood your live cluster.
The Data Model Evolution
Spotify's initial design used two column families: user attributes and entity attributes, both as key-value bags. As the personalization features grew more sophisticated, they refactored toward strongly-typed schemas per access pattern, with separate tables for different freshness requirements (short-TTL real-time features vs. long-TTL batch-computed features).
Operational Optimizations
Discord trillions of messages · 177 → 72 nodes (Cassandra → ScyllaDB)
Discord is the most instructive Cassandra story because it ends with a migration off. The reason isn't that Cassandra failed; it's that ScyllaDB delivered the same model with better operational characteristics. The lessons apply equally to anyone running Cassandra at scale.
The Journey: MongoDB → Cassandra → ScyllaDB
2015: Discord stored messages on MongoDB replicas. By 100M messages, the working set exceeded RAM and latency became unpredictable. 2017: they were running Cassandra on 12 nodes for what was then billions of messages. The data model: partition by (channel_id, bucket), where bucket is a fixed time window, and cluster by message_id (a time-ordered Snowflake ID). This works beautifully when channels are uniform in size.
2022: the messages cluster had grown to 177 Cassandra nodes holding trillions of messages. The latency story had degraded: p99 reads at 40-125ms, p99 writes at 5-70ms. The cause: hot partitions on the largest channels, GC pauses from JVM allocation pressure, and compaction never quite keeping up.
The Hot Partition Problem
Because the partition key is driven by channel_id, a large Discord server's busiest channel concentrated its traffic on one set of replicas. A query for "recent messages in this channel" could touch dozens of SSTables. Compaction on that partition lagged because the partition was always growing. GC kicked in to clean up the allocations, pausing the JVM and causing gossip false-positives.
The Migration
The plan: dual-write to Cassandra and Scylla, backfill the historical trillions of messages via the ScyllaDB Spark Migrator, then cut reads over. Their data services were rewritten in Rust and coalesce concurrent requests for the same hot partition into a single database query, which reduces per-request pressure on the database. The migration was estimated at 3 months with the Spark Migrator; after rewriting the migrator in Rust, they finished in 9 days.
One technical hiccup: the migration stuck at 99.9999% due to large tombstone ranges in the last token ranges. They forced a token-range compaction to flush the tombstones, then completed the migration.
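The request coalescing used by Discord's data services can be illustrated with a small single-flight sketch. Discord's implementation is in Rust; this Python version only shows the idea, and `Coalescer` and `fetch_recent_messages` are hypothetical names, with the fetch function simulating a database read.

```python
import asyncio


class Coalescer:
    """Share one in-flight fetch among all concurrent callers asking for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._in_flight = {}  # key -> asyncio.Task for the query currently running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the query; later callers reuse it.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # shield() keeps one cancelled caller from cancelling the shared query.
        return await asyncio.shield(task)


db_calls = 0


async def fetch_recent_messages(channel_id: str) -> list:
    """Hypothetical database read; counts calls and simulates query latency."""
    global db_calls
    db_calls += 1
    await asyncio.sleep(0.01)
    return [f"{channel_id}:msg-{i}" for i in range(3)]


async def main() -> list:
    coalescer = Coalescer(fetch_recent_messages)
    # 100 users open the same busy channel at the same moment.
    return await asyncio.gather(*(coalescer.get("channel-42") for _ in range(100)))


results = asyncio.run(main())
print(f"{len(results)} callers served by {db_calls} database query")
```

The hot partition still exists, but the database sees one read per burst instead of one read per user, which is what relieves the pressure.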
Results
Cisco multi-PB · billions of events/day · security + collaboration
Cisco uses Cassandra in two distinct lines of business: security monitoring (firewalls, threat intelligence, NGFW logs) and collaboration (Webex, telepresence state). Both are write-heavy time-series workloads with strict retention requirements (regulatory and operational), and both need to span Cisco's global infrastructure.
Security Event Pipeline
Cisco's security products ingest billions of events per day across customers' deployed appliances and cloud services. The events feed both real-time threat analysis (which uses other systems) and a long-term forensic store (Cassandra). The Cassandra schema is typically partitioned by (customer_id, event_type, day_bucket) with clustering by event timestamp. TWCS keeps tombstone cost bounded as data expires.
Webex/Collaboration State
Cisco's collaboration platforms store meeting metadata, presence, telemetry, and chat state. Cassandra's multi-region active-active model fits the need: regional meetings should not depend on a far-away primary, and presence updates have a natural per-user partition key.
Operational Optimizations
15 · Top Production Use Cases
The six dominant use case patterns where Cassandra is the right answer. Each is concrete: who runs it this way, what the partition key looks like, and the gotcha you'll meet first.
15.1 · High-Volume Time-Series
// Financial ticks, infra monitoring, app metrics.
Partition pattern: ((metric_id, day_bucket), timestamp). Bucket by day or hour depending on rate.
Compaction: TWCS (pre-5.0) or UCS (5.0+). Time window = bucket size.
Driving property: sequential writes, append-only, deterministic TTL-based expiration.
Gotcha: Mixing TTL'd and non-TTL'd data in the same table blocks tombstone drops.
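A bucketed partition key means the client must compute the bucket on write and enumerate buckets on read. A minimal sketch, assuming UTC day buckets formatted as ISO dates (the format and the helper names are illustrative choices, not a Cassandra convention):

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts: datetime) -> str:
    """Bucket component of the partition key for a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).date().isoformat()


def buckets_for_range(start: datetime, end: datetime) -> list:
    """Every day bucket a query over [start, end] must read, oldest first."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets


start = datetime(2026, 6, 6, 23, 0, tzinfo=timezone.utc)
end = datetime(2026, 6, 8, 1, 0, tzinfo=timezone.utc)

# A 26-hour window spans three day buckets, so it costs three partition reads.
print(day_bucket(start))
print(buckets_for_range(start, end))
```

Bucket size is therefore a read-side decision as much as a write-side one: smaller buckets bound partition size more tightly but multiply the partitions a range query touches.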
15.2 · IoT Sensor Telemetry
// Smart grids, vehicle trackers, manufacturing.
Partition pattern: ((device_id, week_bucket), reading_timestamp).
Replication: RF=3 per region, often single-region per product line to control cost.
Driving property: millions of devices, billions of writes/day, multi-year retention.
Gotcha: Device fleets are not uniform. A few "chatty" devices create hot partitions; bucket aggressively.
15.3 · Real-Time Activity / Event Tracking
// E-commerce clickstream, app impressions.
Partition pattern: ((user_id, day_bucket), event_timestamp); secondary tables for product-keyed queries.
Architecture: Kafka → stream processor → Cassandra at LOCAL_QUORUM.
Driving property: high write throughput, recent-activity reads dominate, analytical reads via streaming to a lake.
Gotcha: Analytical queries ("how many clicks on product X in the last hour?") are not Cassandra's job. Stream out.
15.4 · Chat / Messaging
// Discord, Webex, in-app chat, multi-device sync.
Partition pattern: ((channel_id, bucket), message_timeuuid). Bucket to bound hot partitions.
Driving property: low-latency tail reads of recent messages, persistent history, fan-out friendly.
Gotcha: The Discord hot-partition story. Popular rooms WILL overwhelm one partition. Plan bucketing from day one.
15.5 · Fraud Detection
// Aggregate transaction history for instant pattern checks.
Partition pattern: ((account_id), txn_timestamp) for last-N-transaction reads; separate aggregation in Flink/Spark.
Driving property: append-only history with single-digit-millisecond latency for the most recent N transactions.
Gotcha: Counter columns lie under failure. Use append-only event logs + downstream aggregation for accurate counts.
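One concrete way counters drift is a client retry after a timeout: an increment is not idempotent, while an event row keyed by a unique ID is. The sketch below simulates that single failure mode with in-memory stand-ins; the event IDs and function names are hypothetical, and the dict plays the role of a table whose primary key is the event ID.

```python
def apply_counter_increments(deliveries: list) -> int:
    """Counter-style update: every delivered increment is applied, retries included."""
    total = 0
    for _event_id, delta in deliveries:
        total += delta
    return total


def aggregate_event_log(deliveries: list) -> int:
    """Event-log style: rows are upserted by event ID, so a retry overwrites itself."""
    rows = {}
    for event_id, delta in deliveries:
        rows[event_id] = delta  # same primary key, same row
    return sum(rows.values())


# txn-2 timed out on the client and was retried, but the first attempt had landed.
deliveries = [("txn-1", 1), ("txn-2", 1), ("txn-2", 1), ("txn-3", 1)]

counter_total = apply_counter_increments(deliveries)
log_total = aggregate_event_log(deliveries)
print(counter_total, log_total)  # the counter over-counts; the log stays exact
```

The event log costs more storage and pushes aggregation downstream, but the aggregate can always be recomputed from the rows, which a drifted counter cannot offer.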
15.6 · Global User Profile / Preferences
// User settings, auth state, preferences across regions.
Partition pattern: ((user_id), attribute_name) with sparse columns per attribute.
Driving property: multi-region active-active, low-latency reads in every region, no failover required.
Gotcha: Concurrent updates from two regions to the same attribute resolve via LWW. For non-monotonic data (counters, sets), use CRDT-style merge at the app layer.
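A CRDT-style merge at the app layer can be as small as a grow-only counter: each region increments only its own slot, and replicas merge by taking the per-region maximum. This is a generic G-Counter sketch, not a Cassandra feature; the region names and helper names are illustrative.

```python
def merge_gcounter(a: dict, b: dict) -> dict:
    """Merge two G-Counter states by taking the per-region maximum."""
    return {region: max(a.get(region, 0), b.get(region, 0)) for region in a.keys() | b.keys()}


def gcounter_value(state: dict) -> int:
    """The counter's value is the sum of every region's slot."""
    return sum(state.values())


# Both regions start from the same state, then each records one local increment
# while the regions cannot see each other's writes.
shared = {"us-east": 3, "eu-west": 2}
us_east_view = {**shared, "us-east": 4}
eu_west_view = {**shared, "eu-west": 3}

merged = merge_gcounter(us_east_view, eu_west_view)
print(gcounter_value(merged))  # both increments survive the merge

# LWW on the whole value would keep one region's view and lose the other increment.
print(gcounter_value(us_east_view))
```

The merge is commutative and idempotent, so it gives the same answer no matter which region reconciles first or how often, which is the property LWW lacks for this kind of data.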