Apache Cassandra

Masterless, AP-leaning, linearly-scalable wide-column store. Built for write-heavy, multi-DC, always-on workloads where eventual consistency is acceptable and the data model is known up front.

Wide-Column NoSQL Distributed AP (Tunable) v5.0 GA

As of 2026-06-08 · Cassandra 5.0 GA (Sep 2024)

PE Verdict

Cassandra is the right answer when your access patterns are known, your dataset is sharded by a natural partition key, you write 5–10x more than you read, and you cannot tolerate a single-region failure. It is the wrong answer when you need joins, ad-hoc queries, transactions, or strong consistency by default. The hardest lesson at scale isn't operating the cluster, it's resisting the urge to query it like Postgres. Every team that gets burned (Discord at 177 nodes, Netflix's first time-series cluster) gets burned on the same thing: partitioning. Cassandra 5.0's SAI plus tries plus UCS finally close some of the read-side ergonomic gaps, but the data-model discipline does not change.

Best default choices

Use for AP-scale writesPick Cassandra for write-heavy, multi-DC, always-on workloads with known access patterns Model by partition keyDesign tables from queries; avoid low-cardinality keys, wide partitions, and ALLOW FILTERING Respect LSM costsCompaction, tombstones, repair, and SSTable counts decide read p99 more than the happy path LOCAL_QUORUM by defaultUse RF=3 per DC with LOCAL_QUORUM for most production paths and test conflict behavior explicitly

1 · Overview

Apache Cassandra is a partitioned, leaderless, eventually-consistent distributed database. It was born at Facebook in 2008 from the marriage of Amazon's Dynamo paper (partitioning, replication, gossip) and Google's Bigtable paper (column-family data model, SSTables, memtables). The shape of the system is unusual: every node is identical, every node accepts writes, every node accepts reads, and there is no leader, no shard manager, no central catalog, and no coordination service like ZooKeeper. The ring is the catalog.

The promise is operational: you can lose a node, a rack, a data center, or a region without an outage and without manual failover. The price is data-modeling: you must know your queries before you create your tables, because Cassandra cannot do joins, cannot do ad-hoc aggregations efficiently, and cannot index arbitrary columns without paying for it.

The mental model · Cassandra is a globally-distributed sorted hash map. The partition key picks the node. The clustering key sorts rows within the partition. Reads inside a partition are fast and sequential. Reads across partitions are expensive scatter-gathers. Every design decision flows from this.

Where Cassandra sits in the CAP space

By default, Cassandra is an AP system: it prefers availability over consistency during a partition. But it is tunable. You set the consistency level per query, and the system trades availability for consistency at the request level. QUORUM reads + QUORUM writes give you strong consistency at the cost of tolerating fewer node failures. ONE reads + ONE writes give you the cheapest, lowest-latency path but no read-your-writes guarantee.

The math is simple and worth memorizing: if W + R > RF (write nodes + read nodes > replication factor), reads see the latest write. Most production deployments run RF=3 with LOCAL_QUORUM (W=2, R=2), giving strong consistency within a DC and async replication across DCs.

What Cassandra is good at

Writes. Cassandra writes are append-only to a commit log and an in-memory memtable. There is no read-before-write, no random disk I/O, no row-level lock. A write at LOCAL_QUORUM in a healthy cluster lands in single-digit milliseconds and the cluster can sustain hundreds of thousands of writes per second per node. Apple has clusters doing millions of operations per second. Uber publicly stated their largest clusters do 1M+ writes/sec.

What Cassandra is not good at

Anything that requires looking across partitions. Joins, aggregations, secondary lookups, ad-hoc filters, transactional updates spanning multiple rows. The system has lightweight transactions (Paxos-based, slow), counter columns (commutative, with caveats), materialized views (deprecated-by-default), and secondary indexes (rewritten in 5.0 as SAI), but every one of these is a workaround for the same constraint: the partition key is the only first-class lookup axis.

2 · Architecture

Cassandra's architecture is a ring of identical nodes. The token ring is a 64-bit hash space (Murmur3 by default). Each node owns one or more contiguous slices of that ring, called token ranges. When you write a row, the partition key is hashed to a token, the token falls inside one node's range, and that node is the primary replica. The next N-1 nodes clockwise on the ring hold the additional replicas, where N is the replication factor.

The Ring Topology

The Storage Engine (LSM)

Each node runs an LSM tree per table. Writes flow through three stages:

CommitLog — append-only durability log on disk. fsync per write batch (or periodic, depending on config). This is what makes writes durable before they hit the SSTable.
Memtable — in-memory sorted structure (skiplist pre-5.0, trie in 5.0+). Per-table. Flushed to disk when it fills.
SSTable — immutable on-disk file. Sorted by token, then clustering key. Has a Bloom filter, partition index, and partition summary. Compacted in the background to reclaim deleted/tombstoned space.

Coordinator and Replicas

Any node can be a coordinator. When a client connects (typically via a driver that knows the topology), the driver picks a coordinator close to the data. The coordinator hashes the partition key, identifies the N replica nodes, and fans out the request to them. For writes, it waits for the consistency level number of acks. For reads, it queries the closest replica(s) and uses digest checks against the others.

Gossip and Failure Detection

There is no master, so nodes discover each other via gossip. Once a second, each node exchanges state with a small random subset of peers. State propagates probabilistically in O(log N) rounds. Failure detection uses an accrual detector: nodes track inter-gossip intervals and compute a confidence score (phi) that a peer is dead. The default phi threshold is 8. The system can tolerate Byzantine-ish behavior in failure detection because the consequence of a false positive is only that the coordinator skips a replica, not data loss.

3 · Core Concepts

Partition Key, Clustering Key, Primary Key

The primary key is composite: PRIMARY KEY ((partition_key), clustering_key_1, clustering_key_2). The partition key chooses the node. Clustering keys sort rows within the partition. A query without a partition key restriction is a full cluster scan, which Cassandra refuses by default (or warns and runs as ALLOW FILTERING).

This is the single most important design constraint. The Discord war story comes down to this: they partitioned messages by channel_id. A few channels (a popular Discord server) had hundreds of millions of messages. A query for "last 50 messages in channel X" was fast — but compaction on that partition, repair on that partition, and any read that touched the partition were all hot. They eventually re-partitioned with a bucket (channel_id, bucket_id) to bound partition size.

Replication Factor (RF) and Replication Strategy

RF is per-keyspace. The two strategies that matter are SimpleStrategy (single DC, dev only) and NetworkTopologyStrategy (production, RF per DC). NTS uses snitches (rack-aware or DC-aware) to place replicas on different racks within a DC, so a rack failure doesn't kill a quorum. A typical multi-DC production keyspace: { 'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west': 3, 'eu-west': 3 }.

Consistency Levels

Level	Meaning	When to Use
`ANY`	Hint to any node (incl. hinted handoff). Lowest durability.	Fire-and-forget telemetry where loss is OK.
`ONE`	1 replica acks.	Edge writes, observability, max throughput.
`LOCAL_ONE`	1 replica in the local DC.	Same as ONE but DC-pinned. Default for many fast paths.
`QUORUM`	⌈(RF+1)/2⌉ replicas across all DCs.	Strong consistency across DCs. Expensive: hits cross-region latency.
`LOCAL_QUORUM`	Quorum within the local DC.	The 95% production default. Strong within DC, async across DCs.
`EACH_QUORUM`	Quorum in every DC.	Cross-region critical writes (rare). Worst latency.
`ALL`	Every replica acks.	Almost never. One node down = write fails.
`SERIAL / LOCAL_SERIAL`	Paxos-based linearizable consistency.	Lightweight transactions only. 4x slower than QUORUM.

Tombstones and gc_grace_seconds

Deletes in Cassandra don't remove data. They write a tombstone — a marker that says "this row/cell was deleted at time T." Tombstones must propagate to all replicas before they can be safely garbage collected, because if one replica missed the delete and the tombstone disappeared too soon, a re-replication would resurrect the data.

The window is gc_grace_seconds, default 10 days. Tombstones can only be purged during compaction after this window. Two pathologies follow:

Read amplification: A partition with many tombstones forces the read path to scan past them. The default threshold is 100K tombstones per query — past this, the query fails. Past 1K, you get warnings in the log.
Resurrection: If a node is offline longer than gc_grace_seconds and tombstones get purged on the live nodes, the offline node's old data will re-appear when it rejoins. The fix is to run repair before gc_grace expires.

Compaction Strategies

Strategy	Best For	Write Amp	Read Amp	Gotcha
`STCS` (Size-Tiered)	Write-heavy, low read needs. Default in 3.x/4.x.	Low (~1-3x)	High (key may exist in any SSTable)	"Big SSTable problem": a 4GB SSTable needs 3 more 4GB peers to compact. Can sit uncompacted for months.
`LCS` (Leveled)	Read-heavy, predictable latency. Wide rows.	High (~10x per level)	Low (1 SSTable per level)	Writes can fall behind compaction during bulk loads. SSDs only.
`TWCS` (Time-Window)	Time-series with TTL. Append-only.	Low	Medium	Mixing TTL'd and non-TTL'd data inside one window blocks tombstone drops. Bucket size matters.
`UCS` (Unified, 5.0+)	General-purpose. Replaces all of the above with one tunable.	Tunable	Tunable	Newer, fewer production scars. Most teams will migrate over the next 1-2 years.

Snitches and Topology Awareness

Snitches tell the coordinator which DC and rack each node belongs to. GossipingPropertyFileSnitch is the production default. Replicas are placed on different racks to survive rack failure. Cross-DC traffic is minimized because the coordinator prefers local replicas for reads.

4 · Execution Model

Write Path

Client → Coordinator. Driver picks a coordinator near the data (token-aware load balancing). The coordinator hashes the partition key.

Coordinator → N replicas. Fan-out to all replicas in parallel. For CL=LOCAL_QUORUM with RF=3 per DC, it waits for 2 of 3 local acks.

Replica node: CommitLog append. fsync per write batch (configurable). This is the durability boundary.

Replica node: Memtable write. Skiplist (or trie in 5.0+) insertion. No disk I/O on the hot path.

Replica → Coordinator ack. CL counted. Coordinator returns success to client when threshold met.

Async: Memtable flush. When memtable hits threshold (size or time), it flushes to a new SSTable. CommitLog segments covering flushed data are deleted.

Async: Hinted handoff. If a replica was down, the coordinator stores a hint and replays it when the node returns (up to 3 hours by default).

Read Path

Client → Coordinator. Same routing as writes.

Coordinator → Replicas. For CL=LOCAL_QUORUM, queries one closest replica fully and the others for a digest (hash of the data) to verify consistency.

Replica: Bloom filter check. For each SSTable on disk, ask the Bloom filter "could this key be here?" Skip SSTables where the answer is no.

Replica: Partition summary → Partition index → SSTable. Two-level index lookup, then read the actual partition data. Cassandra also merges from the memtable and the row cache (if enabled).

Replica: Merge versions. If the partition exists in multiple SSTables (and the memtable), merge by timestamp and apply tombstones. This is where wide partitions and high SSTable counts hurt.

Coordinator: Digest comparison. If digests from replicas disagree, trigger read repair (synchronous fix-up) before returning.

Coordinator → Client. Returns the latest version.

Where reads get slow · Wide partitions (100MB+) force the engine to seek through deep indexes. Many small SSTables (STCS endgame) force many Bloom filter checks. High tombstone counts force scanning past them. The fix is always upstream in the data model, not in the query.

Anti-Entropy: Hinted Handoff, Read Repair, Anti-Entropy Repair

Three mechanisms keep replicas in sync:

Mechanism	When It Runs	What It Fixes	Cost
Hinted handoff	Immediate, while a replica is down	Writes that couldn't reach a down node	Cheap, bounded (3h default)
Read repair	On every read at QUORUM or higher	Drift detected via digest mismatch	Adds latency to that read; small
Anti-entropy repair (Merkle tree)	Scheduled, `nodetool repair`	Drift over time (e.g., after long outages)	Expensive. Must run before gc_grace expires.

5 · Feature Deep Dive (12 features)

The capabilities the user asked about, with PE-grade nuance: what each feature buys you, where it hurts, and the production-grade caveat.

5.1 · Masterless Ring Topology

// Every node equal. No primary, no shard manager, no ZooKeeper.

Every node knows the full ring topology via gossip. Any node can coordinate any request. There is no leader election, no split-brain protocol (because there is no leader), no failover dance. The trade-off: there is no global ordering, no global transactions, and no efficient cross-partition consistency.

Where it bites: Schema changes are eventually consistent. DDL during a network partition can leave nodes with divergent schemas. Always run schema changes from a single node, with no in-flight repairs.

5.2 · Linear Scalability

// Throughput grows ~linearly with node count.

Double the nodes, roughly double the throughput. This is true because the partition key spreads writes uniformly across the ring and each node owns roughly 1/N of the data and traffic. Netflix has clusters with thousands of nodes; Apple operates clusters of 1000+ nodes.

Where it bites: Linearity assumes uniform partition load. One hot partition breaks linearity entirely — Discord's hot-partition story is exactly this. Also: gossip and schema propagation are O(N), so very large clusters (1000+) start to feel gossip pressure.

5.3 · Peer-to-Peer Replication

// All replicas equal. Writes go to all. Reads from any quorum.

There is no primary replica per partition. The coordinator fans writes out to all N replicas in parallel. There is no replication lag in the leader-follower sense; the only lag is between when a replica acks and when slower replicas catch up via hinted handoff or repair.

Where it bites: Conflict resolution is last-write-wins (LWW) by timestamp. If two clients write the same cell at the same millisecond on different nodes with clock skew, the loser's write silently disappears. Use coarse NTP and consider monotonic timestamps for high-write hot paths.

5.4 · High-Speed Sequential Writes

// CommitLog append + memtable insert. No read-before-write.

The LSM design means writes are sequential disk I/O (commit log append) plus an in-memory insert (memtable). There is no random I/O on the write path, no row lock, no read-before-write. Single-node write throughput on modern NVMe approaches the CommitLog fsync rate.

Where it bites: The cost is paid later, at compaction. Write-heavy clusters with sloppy compaction strategy fill disks with redundant data. Storage planning needs to account for 2-3x raw size during compaction, plus tombstone overhead.

5.5 · Column-Family Data Structure

// Wide-row model. Rows can have different columns. Sparse cells are free.

Each row in a table can hold up to 2 billion cells (clustering key + column combinations). A row is identified by partition key; cells are addressed by (clustering keys, column name). Sparse cells (NULL) are not stored. Schemas can evolve by adding columns without rewriting rows.

Where it bites: The "wide row" pattern is a footgun. A partition with 1M clustering rows works; one with 100M kills compaction. The recommended max is around 100MB per partition or low millions of clustering rows.

5.6 · Cassandra Query Language (CQL)

// SQL-shaped surface, NoSQL semantics underneath.

CQL looks like SQL: SELECT, INSERT, UPDATE, DELETE, CREATE TABLE. What it does not look like under the hood: there are no joins, no subqueries, no aggregates across partitions (without ALLOW FILTERING), no foreign keys, no constraints. Every query must hit the partition key or it is a full-cluster scan.

Where it bites: Developers fresh from SQL write queries Cassandra rejects (or worse, accepts with ALLOW FILTERING). Code review needs a rule: no query without a fully-specified partition key.

5.7 · Tunable Consistency Levels

// Trade availability for consistency per query.

Set CL on every read and every write. Use ONE for cheap eventually-consistent reads, LOCAL_QUORUM for strong local consistency, EACH_QUORUM for strong global consistency. The math W + R > RF determines whether you read your own writes. Cassandra is the only mainstream DB that exposes CAP this granularly.

Where it bites: Teams default to QUORUM when they should use LOCAL_QUORUM, paying 100+ ms of cross-region latency on every read. Or they default to ONE for everything and get phantom reads after writes. The default for almost all production workloads should be LOCAL_QUORUM/LOCAL_QUORUM.

5.8 · Multi-Data Center Replication

// NetworkTopologyStrategy + GossipingPropertyFileSnitch = N-DC live.

Replicas are placed per DC according to the keyspace strategy. Cross-DC writes happen asynchronously by default (the local DC acks the write, then the coordinator forwards to remote DC replicas). LOCAL_QUORUM keeps cross-DC latency off the request path. EACH_QUORUM puts it on the path.

Where it bites: Bootstrapping a new DC at scale is slow and bandwidth-heavy. Streaming TB of data across regions can take days. Plan dual-write windows and use nodetool rebuild with rate limits.

5.9 · Fault-Tolerant Routing

// Token-aware driver bypasses the coordinator hop.

Modern drivers (DataStax, Scylla) are token-aware: they hash the partition key client-side and route directly to a replica node. This eliminates one network hop on the write/read path. The coordinator role still exists for fan-out, but the driver picks a replica as the coordinator.

Where it bites: Outdated drivers or naive load balancers (round-robin without token awareness) double your tail latency by routing to a random node that proxies to the correct one. Always use token-aware load balancing in production.

5.10 · Active-Active Global Replication

// Multiple DCs accept writes simultaneously. LWW resolves conflicts.

Cassandra supports active-active across DCs without external machinery. Each DC accepts writes locally with LOCAL_QUORUM and replicates asynchronously to other DCs. Conflicts are resolved by timestamp (LWW). This is unique to Cassandra at this scale — Postgres and MySQL require external tools (Bucardo, Tungsten) and a lot of operational care.

Where it bites: LWW is brutal. If two DCs both write the same row in the partition window of cross-DC replication, the later write wins and the earlier write is lost. For anything resembling a counter or shopping cart, you need CRDTs at the application layer, not LWW.

5.11 · Row-Level Time-to-Live (TTL)

// Per-column TTL. Expired data becomes a tombstone.

You can set TTL on inserts (INSERT ... USING TTL 86400) or per-column updates. When TTL expires, the cell becomes a tombstone and is eventually compacted out. Excellent for time-series, session stores, rate-limit counters, ephemeral data.

Where it bites: TTL'd data turns into tombstones, and tombstones cost reads. If you have a TTL of 30 days but gc_grace_seconds of 10 days, you still pay tombstone cost for 10 days after each expiry. Combine TTL with TWCS (or UCS in 5.0+) so that whole SSTables expire together, no compaction needed.

5.12 · Secondary Indexing (now SAI in 5.0)

// Legacy 2i: scatter-gather across all nodes, slow. SAI: per-SSTable, fast.

Legacy secondary indexes (2i) are local per-node, meaning a query has to scatter-gather across all nodes. They are fine for high-cardinality, partition-restricted queries; useless for low-cardinality or cluster-wide queries. Cassandra 5.0's Storage-Attached Index (SAI) is per-SSTable, much faster for varied predicates, supports more types (incl. vectors for ANN search via HNSW).

Where it bites: Even with SAI, secondary indexes are not a substitute for a properly designed denormalized table. They are a query convenience, not a substitute for data modeling. Discord built dedicated lookup tables for every access pattern rather than relying on indexes.

6 · Trade-Offs (10 rows)

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Eventual consistency (AP) over strong consistency (CP)	Multi-DC live writes, no outages during partitions, <10ms write latency.	No read-your-writes by default, no serializability, last-write-wins resolves conflicts.	Counter columns lose increments under partition; shopping cart merges silently overwrite; account balance reads can lag.	LOCAL_QUORUM/LOCAL_QUORUM gets you strong within a DC. CRDTs or LWT cover the rest. Never use Cassandra for ledgers.
Write speed via LSM over read speed via B-tree	Writes are 5-10x faster than B-tree DBs. No read-before-write. Append-only disk pattern.	Reads may touch multiple SSTables. Read amplification grows with sloppy compaction.	Read-heavy access patterns on partitions with many SSTables (STCS endgame) start exceeding 50ms p99.	If your read:write ratio is 5:1 or higher, consider Postgres or DynamoDB. Cassandra shines at 1:5 and beyond.
Schema-on-write rigidity over schema flexibility	Predictable storage, no surprises at read time. Type safety in CQL.	Cannot adapt to unknown access patterns. New query = new table = backfill.	Product team adds a new dashboard filter ("show me by author") and you discover Cassandra can't do it without a new denormalized table.	Build the data model from queries backward. The Netflix/Discord pattern: one table per query. SAI in 5.0 narrows this gap but does not close it.
Partition-key locality over relational joins	Single-partition reads are O(1) to find, sequential to read. Predictable p99.	No joins. Cross-partition queries are full-cluster scatter-gather or impossible.	Reports, ad-hoc analytics, anything requiring "find all X where Y." You need Spark or a search index (Solr, Elasticsearch).	Pair Cassandra with a streaming pipeline (Kafka → Flink → ES/Pinot) for the analytical access patterns it cannot serve.
Linear horizontal scale over single-node performance	Add nodes, get throughput. Apple runs 300K nodes. No vertical ceiling.	Per-node throughput is mediocre vs. a tuned Postgres or Scylla node. Hardware utilization is poor.	Below ~5 nodes, you're paying coordination overhead for capacity you don't need. Above 1000 nodes, gossip pressure starts to bite.	Discord went from 177 Cassandra nodes to 72 Scylla nodes for the same workload. Cassandra trades single-node efficiency for ring simplicity.
JVM/GC operational complexity over native binary simplicity	Mature JVM tooling, JFR, JMX metrics, decade of GC tuning lore.	GC pauses (especially with G1) cause tail latency spikes and gossip false-positives. Tuning is a job.	Hot partitions trigger massive object allocation, which triggers stop-the-world GC, which triggers gossip flaps, which triggers cascading retries.	The single biggest reason teams migrate to ScyllaDB. C++/Seastar = no GC. Discord cited this explicitly as their motivation.
Tombstone-based deletes over in-place deletes	Deletes are O(1) writes. Replication can happen at any time. Resurrection-safe within gc_grace.	Deletes are not free. Tombstones consume disk and slow reads until compaction.	Queue-as-table anti-pattern: heavy DELETE traffic creates a tombstone wall that fails reads (default cap: 100K tombstones/query).	If your access pattern is delete-heavy, you've designed wrong. Use TTL or model as immutable append-only. Cassandra is not a message queue.
Tunable consistency over fixed semantics	Per-query control. Reads at ONE, writes at QUORUM, depending on the path.	Easy to misuse. Three CL knobs (read, write, serial) per call site. Defaults rarely match the workload.	A new team copies an old service's QUORUM read pattern and pays 60ms cross-region per call.	Codify CL in a shared client wrapper. Don't let app code set CL ad-hoc. Make LOCAL_QUORUM the unmissable default.
Active-active multi-DC over leader-follower	Every region accepts writes. RTO ≈ 0 on regional failover. No leader election.	Conflict resolution is LWW. Application must be conflict-aware for non-monotonic data.	Two regions write the same row inside the replication window. The later timestamp wins; the earlier write is gone.	For anything resembling a counter or set, use Cassandra counters carefully or CRDTs at the app layer. Don't trust LWW for anything mutable.
Operational predictability over operational simplicity	Once tuned, runs for years. Netflix has clusters from 2014 still running.	Getting tuned is hard. Repair, compaction strategy, GC, snitches, vnodes — every dimension wants attention.	First production incident on a self-managed cluster. Usually compaction or repair starvation. Nobody's first Cassandra runbook is good.	Buy managed (Astra, Instaclustr) until you have 2+ SREs who own the cluster as a product. The DIY savings vanish in the first incident.

7 · Use Cases (7 rows)

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
User viewing history / time-series per-user	Netflix CompressedVH	Append-heavy writes (9:1 write:read), wide rows compressed per user, predictable p99 reads of last-N records.	98% of streaming data; hundreds of clusters; tens of thousands of nodes; PBs; millions of TPS.	Postgres can't shard linearly to PBs. DynamoDB lacks multi-region active-active at the same cost. HBase needs HDFS+ZooKeeper.
Chat message storage	Discord (until 2022) → Scylla	Insert-heavy, partition by (channel_id, bucket), clustering by message_id (TimeUUID) for tail reads.	Trillions of messages; 177 nodes peak before Scylla migration.	MongoDB: working set must fit in RAM. Postgres: doesn't scale writes linearly. The Scylla move was JVM/hot-partition pain, not the model.
Trip event ingestion, marketplace state, telemetry	Uber Cassandra-as-a-Service (on Odin/Mesos)	Multi-region active-active, single-digit ms writes for trip events, multi-DC replication for global availability.	Largest clusters: 1M+ writes/sec, ~100K reads/sec; tens of millions of QPS across the fleet.	MySQL/Schemaless: built in-house but limited to certain access patterns. DynamoDB: vendor lock-in, multi-region cost prohibitive.
iCloud metadata, iMessage, Siri data	Apple (largest Cassandra deployment on earth)	Petabyte multi-DC storage, masterless for zero-downtime ops, integrated with FoundationDB for the transactional layer.	~300K nodes, 100+ PB, thousands of clusters, millions of QPS.	Nothing else operates at this scale with this operational pattern. Apple has internal forks and Cassandra committers.
Personalization stack: user taste profiles, playlist metadata	Spotify Discover Weekly / Daily Mix	Kafka → Storm → Cassandra. Real-time event-driven taste-profile updates. Bulk loads from HDFS via hdfs2cass.	100+ Cassandra clusters; petabytes of user attributes and entity metadata.	HBase: Spotify uses it but for different patterns. Postgres: doesn't handle the write throughput. Cassandra fits the wide-row, time-stamped, low-coupling pattern.
Security event logging, telepresence state	Cisco (firewalls, Webex collaboration)	Continuous high-volume event ingest, time-windowed retention via TTL+TWCS, multi-tenant isolation.	Multi-PB across hundreds of clusters; billions of events/day.	Elasticsearch: too expensive at this volume for cold data. Hadoop: not real-time. Kafka: not a long-term store.
IoT telemetry, smart-grid metering, vehicle tracking	Various (utilities, automotive)	TWCS-compacted time-series, per-device partition with bucketed clustering, TTL for retention.	Millions of devices, billions of points/day, multi-year retention.	InfluxDB/Timescale: scale ceiling per node. Cassandra scales linearly to fleet size.

8 · Limitations

Limitation	Severity	Workaround	Workaround Cost
No JOINs across tables	High	Denormalize: one table per query pattern. App-side fan-out for the rare cross-partition need.	2-5x storage. Higher write amplification (write the same data into N tables).
No multi-row ACID transactions	Critical (for financial/ledger)	Lightweight transactions (Paxos) for single-partition CAS. App-layer sagas or external coordinator for multi-row.	LWT is ~4x slower than CL=QUORUM. Sagas add complexity and partial-failure semantics.
Wide-partition limit (~100MB / millions of clustering rows)	High	Bucket the partition key: `PARTITION KEY ((user_id, day_bucket))`.	Reads spanning buckets become multi-partition fan-outs. Bucket sizing is workload-specific tuning.
Tombstone cap (default 100K per query)	High	Avoid delete-heavy patterns. Use TTL + TWCS so whole SSTables expire. Tune `tombstone_warn_threshold`.	Forces design discipline. Some legitimate patterns (queue-as-table) become non-viable.
Secondary indexes (2i) don't scale across cluster	Medium (5.0 SAI mitigates)	Build inverted-index tables manually, or use SAI in 5.0+ for per-SSTable indexing.	Manual: doubles writes. SAI: still recent, limited operator scars in production.
JVM GC pauses affect tail latency	High	Tune G1 or ZGC. Avoid hot partitions. Use off-heap memtables and bloom filters.	GC tuning is specialist work. Even tuned, p99.99 occasionally spikes 100ms+. Scylla migration eliminates this entirely.
Repair operation is expensive and easy to neglect	Critical	Use Cassandra Reaper or built-in incremental repair. Schedule to complete within gc_grace_seconds.	Adds I/O and CPU load. Mis-scheduled repair causes data resurrection (zombie data).
Counter columns have anomaly windows under failure	Medium	For exact counts, use append-only event log + batch aggregation (Spark/Flink) instead of counters.	Higher storage cost, batch latency for the count, but accuracy.
Bootstrapping a new DC is slow and bandwidth-heavy	Medium	`nodetool rebuild` with throttling. Dual-write window. Zero-copy streaming in 4.0+.	Days to weeks for multi-TB clusters. Network costs in cloud non-trivial.
Schema changes are eventually consistent	Medium	Run DDL from a single node. Avoid DDL during repairs or network instability. Validate convergence with `nodetool describecluster`.	Schema-disagreement bugs are silent and only surface on read.

9 · Fault Tolerance

Dimension	Behavior	Operational Reality
Replication model	Leaderless, peer-to-peer. All replicas accept writes. RF per keyspace per DC.	Production default: RF=3 per DC. RF=1 only for dev. RF=2 is a trap (no quorum survives a single failure).
Failure detection	Accrual failure detector (phi-based). Gossip every second. Default phi threshold = 8.	Detection in 5-30 seconds. Can false-positive under GC pauses; tune phi_convict_threshold higher if seeing flapping.
Failover mechanism	None needed. Coordinator skips down replicas and meets CL with remaining replicas.	Single-node failure: invisible to clients at CL=LOCAL_QUORUM with RF=3. Two-node failure: writes fail at LOCAL_QUORUM unless RF≥5.
RTO (typical)	0 for single-node failure. Seconds for rack/AZ failure (gossip propagation).	Apple, Netflix: clients see no impact on single-node loss. Token-aware drivers reroute instantly.
RPO (typical)	0 at CL=QUORUM in healthy state. Up to commit-log fsync window (typically 10s) on simultaneous multi-node failure.	periodic_commitlog_sync: durable on fsync interval. batch: durable per write but slower. Most prod runs periodic at 10s.
Split-brain behavior	None. There is no leader to lose. Both sides of a partition accept writes; conflicts resolved by LWW timestamp at reconciliation.	The strength of the model. Network partitions don't cause outages or data loss, but they do cause write divergence that LWW will silently resolve.
Blast radius of single-node failure	~1/N of writes briefly route around the failed node (hinted handoff). No data loss with RF≥3.	The point of the architecture. Apple's 300K-node fleet sees nodes die routinely with zero customer impact.
Cross-region failover story	None needed. Active-active. If us-east loses ⅓ replicas, LOCAL_QUORUM continues serving from the remaining 2. Other DCs unaffected.	Uber: lost a region during the 2017 us-east-1 outage with no production impact on Cassandra workloads.
Data loss scenarios	(1) Lose all RF replicas before repair (RF=3, lose 3 nodes simultaneously holding same range). (2) Tombstone resurrection from a node down longer than gc_grace_seconds.	Both are avoidable: rack-aware placement prevents (1); regular repair prevents (2). Apple/Netflix have run for a decade without significant data loss.

10 · Sharding

Dimension	Behavior	Operational Reality
Sharding model	Consistent hashing on the partition key. Murmur3 hash → 64-bit token ring.	Hash-based; not range. You cannot do range scans across partition keys, only within a partition's clustering keys.
Shard key constraints	Partition key must produce uniform distribution. High-cardinality required (millions+ distinct values).	The #1 design error. Low-cardinality partition keys (status, region, day) create hot partitions and node imbalance.
Rebalancing mechanism	vnodes (virtual nodes): each physical node owns ~256 token ranges by default. Streaming on add/remove redistributes ranges automatically.	Pre-vnode Cassandra required manual token assignment. 256 vnodes/node is the modern default; tune to 8-16 for very large clusters to reduce gossip cost.
Rebalancing cost / impact	Streaming during bootstrap/decommission is bandwidth-heavy but doesn't block reads/writes. Repair after streaming is the long pole.	Adding a node to a 100-node cluster: hours of streaming, then hours of repair. Netflix automates this; smaller teams often suffer through it.
Hot-shard behavior	Cassandra cannot rebalance an oversubscribed partition. One node's box is on fire while the rest are idle.	The Discord story. Channel hot-partitions caused cascading 200ms latency, GC pauses, gossip flaps, and ultimately the Scylla migration.
Maximum shards (practical)	Largest documented production cluster: Apple, ~27K nodes per single cluster; total fleet ~300K nodes across thousands of clusters.	Single cluster sweet spot: 50-500 nodes. Past 1000 nodes, gossip pressure and schema convergence start to feel painful. Multiple clusters preferred over one giant one.
Resharding without downtime?	Yes (node add/remove). No (changing partition key requires new table + backfill).	Adding capacity: routine. Changing the data model: a multi-week migration with dual-writes and ETL. This is the worst pain point of the model.
Cross-shard query support	None first-class. `ALLOW FILTERING` permits cluster-wide scans but is operationally hostile. Spark Cassandra Connector is the production answer.	For analytics across partitions: stream Cassandra to a data lake (S3/parquet) or query via Spark. Don't ALLOW FILTERING in OLTP code paths.

11 · Replication

Dimension	Behavior	Operational Reality
Replication topology	Leaderless. Every replica accepts writes. NetworkTopologyStrategy places replicas per DC, rack-aware.	Production default: NTS with RF=3 per DC. Replicas on different racks within a DC to survive AZ failure.
Sync vs async	Intra-DC: sync to local quorum (LOCAL_QUORUM). Inter-DC: async by default; the coordinator returns once local CL is met, then forwards to remote DC replicas.	EACH_QUORUM makes inter-DC sync (writes wait for quorum in every DC). Almost never used in production due to latency cost.
Replication factor (default / max)	Default: RF=3 per DC. Max: bounded by node count per DC. Common in production: 3 to 5.	RF=3 is the canonical choice. RF=5 used by some financial workloads to tolerate 2 failures at QUORUM. RF=1 = dev only.
Consistency level options	Per-query: ANY, ONE, TWO, THREE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, LOCAL_ONE, SERIAL, LOCAL_SERIAL.	~95% of production traffic uses LOCAL_QUORUM. SERIAL only for LWT (compare-and-set). ALL is a trap; one slow replica = whole cluster slow.
Replication lag (typical)	Intra-DC: sub-millisecond. Inter-DC: bounded by network RTT (50-300ms). Hinted handoff backfills offline replicas within 3 hours.	Uber Mesos-Cassandra paper: cross-DC p99 replication lag ~47ms. Netflix runs similar numbers globally.
Conflict resolution	Last-write-wins (LWW) by cell timestamp. Higher timestamp wins; ties broken by value comparison.	The major footgun. Clock skew + concurrent writes = silent data loss. Use NTP, monotonic timestamps, or app-layer CRDTs for mutable data.
Cross-region replication	Native via NTS. No external tools, no log shipping. Each region is a first-class DC in the keyspace.	The crown jewel of Cassandra. This is why Netflix and Uber chose it. Postgres needs Bucardo or BDR; Cassandra needs nothing.
Replication during partition	Each side of the partition continues accepting local writes. Reads succeed if local CL is met. On heal: hinted handoff and read repair reconcile via LWW.	This is the AP behavior. Some writes will be lost (LWW) but no node refuses traffic. The opposite trade-off from Postgres or Spanner.

12 · Better Usage Patterns (8 rows)

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Bucket the partition key	Partition by entity ID alone (user_id, channel_id), then watch popular entities create wide-partition pathology.	Composite partition key: `((user_id, day_bucket))`. Bound partition size at design time, not at incident time.	Discord's 200ms tail latencies were hot partitions on popular channels. Bucketing would have bought them years before Scylla migration.
Use TWCS (now UCS) for time-series with TTL	Default STCS on TTL'd time-series data. Tombstones never properly expire; reads accumulate amplification.	TWCS with bucket size = expected query window. SSTables age together; whole files drop when window expires. No compaction needed for cleanup.	Cassandra 5.0's UCS makes this less manual but the principle stands. Time-windowed compaction is essential for any TTL workload.
One table per query pattern	Try to make one table serve many access patterns. End up using ALLOW FILTERING or scatter-gather queries.	Denormalize. Build a table per access pattern. Write to all of them on every insert (via batched logged writes or app-layer fan-out).	Storage is cheap; latency is not. Netflix, Apple, and Discord all follow this rule. The 2-5x storage cost is the price of predictable p99.
Codify CL in client libraries	Let app code set CL ad-hoc per call. Inconsistent CL across services for the same logical operation.	Wrap the driver. Expose ergonomic methods like `readStrong()`, `readFast()`, `writeLocal()` that hide the CL choice.	The CL surface is too sharp for casual use. A shared wrapper makes the right thing easy and the wrong thing hard.
Use token-aware load balancing	Default round-robin driver. Every request goes through one extra coordinator hop.	Enable TokenAwarePolicy in the driver. Hash the partition key client-side and connect directly to a replica.	One fewer hop = ~30% lower latency on simple queries. Free if you upgrade the driver config; no server changes needed.
Schedule repair, don't run it on a whim	Run `nodetool repair` ad-hoc or skip it because "the cluster looks fine."	Cassandra Reaper (or its successors) to schedule incremental repair that completes within gc_grace_seconds for every keyspace.	Skipping repair = zombie data after node outages. The cluster looks fine until a node comes back from a long outage and resurrects deleted data.
Use batch logged writes only for multi-partition atomicity	Use BATCH for performance reasons ("batch up writes for speed"). It does the opposite.	BATCH is for atomicity across partitions, not throughput. For throughput, use async writes with concurrency control client-side.	Logged BATCH writes to a batch log table on the coordinator, then forwards to replicas. Slower than individual writes, not faster.
Avoid counter columns for anything you care about	Use counter columns to track "real-time" view counts, balances, anything that must be accurate.	Write an append-only event log of increments. Aggregate via Spark/Flink for accurate counts. Use counters only for advisory metrics.	Counters have anomaly windows during failure. Two simultaneous +1s on a partitioned cluster can result in one +1. The math doesn't recover.

13 · Advanced / Next-Gen Alternatives

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
ScyllaDB	C++/Seastar shared-nothing architecture. No JVM, no GC pauses. Shard-per-core. CQL-compatible drop-in. Discord cited 40-125ms→15ms read p99, 5-70ms→5ms write p99, 177→72 nodes.	Production	Medium. CQL-compatible; need to re-tune sstable format, rewrite some operational tooling, run a dual-write window.	JVM GC tail-latency pain is the dominant operational issue. Hardware bill is a concern at scale.
Amazon Keyspaces (managed Cassandra)	Serverless Cassandra-compatible. Auto-scaling, no node operations, pay-per-request.	Production	Low to medium. CQL-compatible at v3.11 level; some features missing (materialized views, batch). Pricing model differs sharply from self-managed.	Small to medium teams without dedicated SRE. Variable workloads where you don't want to pay for peak 24/7.
DataStax Astra	Managed Cassandra-as-a-service with newer Cassandra versions, vector search built-in, Stargate API gateway.	Production	Low. Same Cassandra; just managed. Easy on-ramp; lock-in via Stargate APIs if used.	Want managed Cassandra without leaving Cassandra. AI/vector use cases that need ANN.
DynamoDB	Fully managed, serverless, multi-region active-active (Global Tables). Smaller ops surface; predictable cost.	Production	High. Different data model (single-table design), different query semantics, AWS lock-in. Migration takes months for non-trivial workloads.	AWS-native shops without multi-cloud needs. Spiky workloads benefiting from on-demand billing.
CockroachDB / YugabyteDB	Distributed SQL. ACID, joins, secondary indexes, Postgres wire compatibility. Multi-region with serializable transactions.	Production	High. Different paradigm entirely. Worth it if you actually needed ACID and joins, in which case Cassandra was the wrong choice originally.	OLTP workloads with relational access patterns. Netflix migrated some Cassandra workloads to CockroachDB in 2020 for exactly this reason.
FoundationDB (with Record Layer)	ACID key-value with strong serializability. Apple uses it alongside Cassandra for the CloudKit transactional layer where Cassandra falls short.	Production	High. Lower-level API; needs schema/record layer on top. Operations are very different from Cassandra.	Need transactional semantics + Cassandra-style horizontal scale. Multi-tenant systems with strict isolation requirements.
Cassandra 5.0+ itself (SAI, UCS, Trie, Vector)	Not a replacement, an upgrade. SAI replaces 2i with per-SSTable indexes. UCS replaces STCS/LCS/TWCS. Trie memtables reduce memory. Vector search via HNSW for AI.	Emerging (GA Sep 2024)	Low. In-place upgrade. Most existing data models continue to work; new features are opt-in.	Already on Cassandra. Want better secondary indexing, AI/vector use cases, or fewer compaction-strategy decisions.

14 · Production Stories

Six companies, deep enough to learn from each. Each story carries the optimization or lesson that's interview-worthy.

Netflix 98% of streaming data · 10Ks of nodes · PBs · millions of TPS

Netflix is the canonical Cassandra reference. After the 2008 on-prem outage that drove them to AWS, they needed a database that could span regions natively without manual failover. Cassandra was the choice in 2012, and by 2014 they had become its largest external committer and operator. Today Cassandra holds the majority of their streaming data: viewing history, bookmarks, billing state, customer metadata, recommendation features.

The Time-Series Redesign

The first major architectural lesson came from viewing history. Netflix processes over 140 million hours of viewing per day. Each play generates several viewing records. The initial design stored each user's full viewing history in a single Cassandra partition (one row per user, growing forever). It worked beautifully at first — fast partition reads, simple model — but as power users accumulated tens of thousands of records, partitions grew beyond the practical 100MB threshold.

The redesigned architecture introduced three optimizations:

CompressedVH: compress the per-user partition with delta encoding + LZ4. Drastically reduced storage and read time for users with moderate histories.
Live vs. Historical split: recent activity in a hot, small partition (live cluster). Historical data rolled up and moved to a separate cold cluster with different compaction.
Chunked partitions for power users: users above a threshold get bucketed by (user_id, chunk_id) so no single partition grows unbounded.

Result, per the Netflix tech blog: ~6x reduction in data size, ~13x reduction in Cassandra maintenance time, ~5x reduction in read latency, ~1.5x reduction in write latency.

Operational Optimizations

Topology

Hundreds of clusters, AWS-native, multi-region (us-east, us-west, eu-west). RF=3 per region, LOCAL_QUORUM by default.

Backup

Priam (Netflix's open-source Cassandra sidecar) handles backup, restore, token management, configuration.

Tail-latency tools

Internal "Cassandra Health Service" (Mantis-powered). Per-cluster p99 read/write tracked at coordinator level.

Workload split

Some workloads migrated to CockroachDB in 2020 where ACID semantics mattered more than write throughput.

The Netflix lesson · Wide partitions are inevitable if your access pattern is "all of one user's activity." Design the bucketing strategy from day one. The 6x storage reduction came from compression; the 5x latency win came from bounding partition size.

Apple ~300K nodes · 100+ PB · thousands of clusters · millions of QPS

Apple has the largest known Cassandra deployment on earth. The fleet has grown from ~75K nodes in 2015 to ~300K nodes by 2022. It backs significant portions of iCloud, iMessage, Siri, Maps, iTunes, and iAd. Apple employs Cassandra committers and has contributed extensively to upstream.

The Hybrid with FoundationDB

Apple's most interesting architectural decision is what they did NOT solve with Cassandra. After acquiring FoundationDB in 2015, they built CloudKit on top of FoundationDB's Record Layer for the transactional, strongly-consistent metadata layer. Cassandra remained for the high-throughput, eventually-consistent paths.

The split is instructive: Cassandra handles the volume (exabytes of mostly-immutable user data); FoundationDB handles the correctness (the "this user owns these records, in this state, with this revision" metadata). Each iCloud user gets a logical FoundationDB record store; the actual blob/metadata storage spans Cassandra and other systems.

The Cluster-Size Philosophy

Apple's approach is many-small-clusters, not one-giant-cluster. ~27K nodes in their largest single cluster, but ~300K total across thousands of clusters. The reasoning (publicly inferred from talks): gossip cost grows with cluster size; schema convergence gets harder; blast radius of operational mistakes grows. Multiple smaller clusters mean fewer hot spots, fewer noisy neighbors, and the ability to upgrade or replace one cluster without affecting the rest.

Operational Optimizations

Custom fork

Apple maintains internal patches against upstream Cassandra. Some land in OSS; some stay internal for years.

Hardware

Primarily on bare metal in Apple data centers, with some AWS clusters. Lots of small nodes (vertical scaling not the model).

Multi-tenancy

Logical isolation via keyspaces + ACLs; physical isolation via separate clusters for noisy or high-priority workloads.

Failure pattern

Node failures are routine and invisible. Hardware loss is part of weekly operations, not an incident.

The Apple lesson · The "many small clusters" pattern is the right answer at extreme scale. One giant cluster is operationally fragile. Pair Cassandra with another system (FoundationDB) for the transactional surface; don't ask Cassandra to do both.

Uber 1M+ writes/sec · tens of millions of QPS · 6+ years in prod

Uber adopted Cassandra to back real-time marketplace state: trip events, driver telemetry, marketplace pricing signals. The defining characteristic of Uber's workload is bursty multi-region writes that cannot tolerate regional failure, with read patterns that are usually local to the trip or driver.

The Mesos + Cassandra Architecture (2016)

Uber's now-famous talk from 2016 described running Cassandra in containers on Mesos across multiple data centers. The reasoning was agility: dev teams could provision Cassandra clusters declaratively, without bespoke operational work per cluster. The bare-metal-vs-Mesos overhead measured at 5-10%, which they accepted in exchange for orchestration consistency. Peak measured throughput: 1M+ writes/sec on the largest clusters with cross-DC p99 replication latency ~47ms.

The Odin + Cassandra Framework (2020+)

Uber later replaced the Mesos approach with Odin, their internal stateful control plane. The Cassandra-on-Odin framework provides one-click operations (seed node selection, rolling restart, capacity adjustment, node replacement, decommission). Application teams consume Cassandra as a service through a forked Go/Java client with enhanced observability.

Operational Optimizations

Topology

Multi-region clusters with replicas spanning regions. Single-DC failures invisible to applications.

Client library

Forked DataStax drivers with service-discovery integration, custom tracing/observability hooks, no hardcoded endpoints.

Hot partition mitigation

Bucketed partition keys for driver-state and high-traffic trip events. Backpressure on coordinators to shed load during traffic spikes.

Cache tier

Redis in front of Cassandra for the hottest reads. Reduces tail latency under load; falls through to Cassandra on miss.

The Uber lesson · Cassandra-as-a-service is the right org pattern past ~5 clusters. Build the platform once; let product teams consume it through a typed client wrapper that enforces the right defaults (CL, timeout, retry policy).

Spotify 100+ Cassandra clusters · PBs · personalization stack

Spotify's Cassandra usage is unusual: it's primarily a personalization and analytics-feature store, not a primary OLTP store. The pipeline is Kafka → Storm (later Flink) → Cassandra. Cassandra holds the materialized user-taste profile and entity (artist, track, playlist) metadata that feeds Discover Weekly, Daily Mix, and the home feed.

The hdfs2cass Bulk Load

Many of Spotify's Cassandra writes come not from real-time events but from Hadoop batch jobs. Spotify open-sourced hdfs2cass, a tool that writes SSTables directly from HDFS and bulk-loads them into Cassandra, bypassing the coordinator write path. This is essential for periodic recomputation of recommendation features — you don't want a billion writes per second to flood your live cluster.

The Data Model Evolution

Spotify's initial design used two column families: user attributes and entity attributes, both as key-value bags. As the personalization features grew more sophisticated, they refactored toward strongly-typed schemas per access pattern, with separate tables for different freshness requirements (short-TTL real-time features vs. long-TTL batch-computed features).

Operational Optimizations

Topology

100+ clusters, mostly on GCP. Per-feature isolation: recommendation features, playlist metadata, listening history each get their own clusters.

Write path

Two distinct paths: real-time (Kafka → Storm/Flink → Cassandra at LOCAL_QUORUM) and batch (Spark → hdfs2cass → SSTables).

TTL strategy

Heavy use of TTL for short-lived features (recent skips, current session). TWCS for those tables to keep tombstones contained.

Read patterns

Always partition-keyed by user_id or entity_id. No ALLOW FILTERING in serving paths.

The Spotify lesson · Cassandra is excellent as the serving layer of a streaming feature store. Combine with Kafka + a stream processor for fresh features and a batch pipeline (Spark) for backfills. The two write paths look different but feed the same tables.

Discord trillions of messages · 177 → 72 nodes (Cassandra → ScyllaDB)

Discord is the most instructive Cassandra story because it ends with a migration off. The reason isn't that Cassandra failed; it's that ScyllaDB delivered the same model with better operational characteristics. The lessons apply equally to anyone running Cassandra at scale.

The Journey: MongoDB → Cassandra → ScyllaDB

2015: Discord stored messages on MongoDB replicas. By 100M messages, the working set exceeded RAM and latency became unpredictable. 2017: they migrated to Cassandra with 12 nodes for what was then billions of messages. The data model: partition by channel_id, cluster by message_id (TimeUUID). This works beautifully when channels are uniform in size.

2022: the messages cluster had grown to 177 Cassandra nodes holding trillions of messages. The latency story had degraded: p99 reads at 40-125ms, p99 writes at 5-70ms. The cause: hot partitions on the largest channels, GC pauses from JVM allocation pressure, and compaction never quite keeping up.

The Hot Partition Problem

Partitioning by channel_id meant a large Discord server's busiest channel had all of its messages on one set of replicas. A query for "recent messages in this channel" could touch dozens of SSTables. Compaction on that partition lagged because the partition was always growing. GC kicked in to clean up the allocations, pausing the JVM and causing gossip false-positives.

The Migration

The plan: dual-write to Cassandra and Scylla, backfill the historical trillions of messages via the ScyllaDB Spark Migrator, then cut reads over. Their data services were rewritten in Rust (using fearless concurrency to coalesce hot-partition traffic) to reduce per-request pressure on the database. The migration was estimated at 3 months; using Rust-based concurrency control, they finished in 9 days.

One technical hiccup: the migration stuck at 99.9999% due to large tombstone ranges in the last token ranges. They forced a token-range compaction to flush the tombstones, then completed the migration.

Results

Node count

177 Cassandra nodes → 72 ScyllaDB nodes. ~60% reduction.

Read p99

40-125ms → 15ms. Some Discord reports show 200ms → 5ms on hottest paths.

Write p99

5-70ms (variable) → 5ms (predictable).

Operational change

GC pauses eliminated (no JVM). Hot partitions still exist but Scylla's shard-per-core architecture handles them better than Cassandra's coordinator/replica pattern.

The Discord lesson · Hot partitions are the dominant operational pain at scale. Bucketed partition keys would have bought years. The JVM is the tail-latency villain you can't fully tame; Scylla wins on this axis specifically. The model (CQL, ring, leaderless) was right; the runtime was holding it back.

Cisco multi-PB · billions of events/day · security + collaboration

Cisco uses Cassandra in two distinct lines of business: security monitoring (firewalls, threat intelligence, NGFW logs) and collaboration (Webex, telepresence state). Both are write-heavy time-series workloads with strict retention requirements (regulatory and operational), and both need to span Cisco's global infrastructure.

Security Event Pipeline

Cisco's security products ingest billions of events per day across customers' deployed appliances and cloud services. The events feed both real-time threat analysis (which uses other systems) and a long-term forensic store (Cassandra). The Cassandra schema is typically partitioned by (customer_id, event_type, day_bucket) with clustering by event timestamp. TWCS keeps tombstone cost bounded as data expires.

Webex/Collaboration State

Cisco's collaboration platforms store meeting metadata, presence, telemetry, and chat state. Cassandra's multi-region active-active model fits the need: regional meetings should not depend on a far-away primary, and presence updates have a natural per-user partition key.

Operational Optimizations

Topology

Multi-region clusters per product line. RF=3 per region with LOCAL_QUORUM for the latency-sensitive paths.

Retention

Heavy TTL usage tied to customer regulatory requirements (90 days, 1 year, 7 years depending on contract). TWCS to bound tombstone overhead.

Schema strategy

Per-customer keyspace isolation in some products; logical partitioning in others. Multi-tenancy enforced at the schema level.

Integration

Cassandra feeds downstream analytics (Spark, Elasticsearch) for query patterns Cassandra cannot serve directly.

The Cisco lesson · Multi-tenant SaaS workloads with strict retention and per-region isolation map cleanly onto Cassandra. TWCS + customer-bucketed partition keys + LOCAL_QUORUM is a working recipe for "audit log at PB scale across regions."

15 · Top Production Use Cases

The six dominant use case patterns where Cassandra is the right answer. Each is concrete: who runs it this way, what the partition key looks like, and the gotcha you'll meet first.

15.1 · High-Volume Time-Series

// Financial ticks, infra monitoring, app metrics.

Partition pattern: ((metric_id, day_bucket), timestamp). Bucket by day or hour depending on rate.

Compaction: TWCS (pre-5.0) or UCS (5.0+). Time window = bucket size.

Driving property: sequential writes, append-only, deterministic TTL-based expiration.

Gotcha: Mixing TTL'd and non-TTL'd data in the same table blocks tombstone drops.

15.2 · IoT Sensor Telemetry

// Smart grids, vehicle trackers, manufacturing.

Partition pattern: ((device_id, week_bucket), reading_timestamp).

Replication: RF=3 per region, often single-region per product line to control cost.

Driving property: millions of devices, billions of writes/day, multi-year retention.

Gotcha: Device fleets are not uniform. A few "chatty" devices create hot partitions; bucket aggressively.

15.3 · Real-Time Activity / Event Tracking

// E-commerce clickstream, app impressions.

Partition pattern: ((user_id, day_bucket), event_timestamp); secondary tables for product-keyed queries.

Architecture: Kafka → stream processor → Cassandra at LOCAL_QUORUM.

Driving property: high write throughput, recent-activity reads dominate, analytical reads via streaming to a lake.

Gotcha: Analytical queries ("how many clicks on product X in the last hour?") are not Cassandra's job. Stream out.

15.4 · Chat / Messaging

// Discord, Webex, in-app chat, multi-device sync.

Partition pattern: ((channel_id, bucket), message_timeuuid). Bucket to bound hot partitions.

Driving property: low-latency tail reads of recent messages, persistent history, fan-out friendly.

Gotcha: The Discord hot-partition story. Popular rooms WILL overwhelm one partition. Plan bucketing from day one.

15.5 · Fraud Detection

// Aggregate transaction history for instant pattern checks.

Partition pattern: ((account_id), txn_timestamp) for last-N-transaction reads; separate aggregation in Flink/Spark.

Driving property: append-only history with sub-millisecond latency for the most recent N transactions.

Gotcha: Counter columns lie under failure. Use append-only event logs + downstream aggregation for accurate counts.

15.6 · Global User Profile / Preferences

// User settings, auth state, preferences across regions.

Partition pattern: ((user_id), attribute_name) with sparse columns per attribute.

Driving property: multi-region active-active, low-latency reads in every region, no failover required.

Gotcha: Concurrent updates from two regions to the same attribute resolve via LWW. For non-monotonic data (counters, sets), use CRDT-style merge at the app layer.

Best default choices

Search this guide

1 · Overview

Where Cassandra sits in the CAP space

What Cassandra is good at

What Cassandra is not good at

2 · Architecture

The Ring Topology

The Storage Engine (LSM)

Coordinator and Replicas

Gossip and Failure Detection

3 · Core Concepts

Partition Key, Clustering Key, Primary Key

Replication Factor (RF) and Replication Strategy

Consistency Levels

Tombstones and gc_grace_seconds

Compaction Strategies

Snitches and Topology Awareness

4 · Execution Model

Write Path

Read Path

Anti-Entropy: Hinted Handoff, Read Repair, Anti-Entropy Repair

5 · Feature Deep Dive (12 features)

5.1 · Masterless Ring Topology

5.2 · Linear Scalability

5.3 · Peer-to-Peer Replication

5.4 · High-Speed Sequential Writes

5.5 · Column-Family Data Structure

5.6 · Cassandra Query Language (CQL)

5.7 · Tunable Consistency Levels

5.8 · Multi-Data Center Replication

5.9 · Fault-Tolerant Routing

5.10 · Active-Active Global Replication

5.11 · Row-Level Time-to-Live (TTL)

5.12 · Secondary Indexing (now SAI in 5.0)

6 · Trade-Offs (10 rows)

7 · Use Cases (7 rows)

8 · Limitations

9 · Fault Tolerance

10 · Sharding

11 · Replication

12 · Better Usage Patterns (8 rows)

13 · Advanced / Next-Gen Alternatives

14 · Production Stories

Netflix 98% of streaming data · 10Ks of nodes · PBs · millions of TPS

The Time-Series Redesign

Operational Optimizations

Apple ~300K nodes · 100+ PB · thousands of clusters · millions of QPS

The Hybrid with FoundationDB

The Cluster-Size Philosophy

Operational Optimizations

Uber 1M+ writes/sec · tens of millions of QPS · 6+ years in prod

The Mesos + Cassandra Architecture (2016)

The Odin + Cassandra Framework (2020+)

Operational Optimizations

Spotify 100+ Cassandra clusters · PBs · personalization stack

The hdfs2cass Bulk Load

The Data Model Evolution

Operational Optimizations

Discord trillions of messages · 177 → 72 nodes (Cassandra → ScyllaDB)

The Journey: MongoDB → Cassandra → ScyllaDB

The Hot Partition Problem

The Migration

Results

Cisco multi-PB · billions of events/day · security + collaboration

Security Event Pipeline

Webex/Collaboration State

Operational Optimizations

15 · Top Production Use Cases

15.1 · High-Volume Time-Series

15.2 · IoT Sensor Telemetry

15.3 · Real-Time Activity / Event Tracking

15.4 · Chat / Messaging

15.5 · Fraud Detection

15.6 · Global User Profile / Preferences