DynamoDB vs Cassandra vs PostgreSQL
Three databases that solve different problems and get mistakenly compared because they all store rows. Deep research edition.
As of 2026-05-16. Framing: "When would you pick each." Grounded in 2024-2025 post-mortems and recent benchmarks.
Pick Postgres unless you have a specific reason not to. Pick DynamoDB when you want zero ops and your access patterns are known and frozen, accepting AWS regional dependency risk. Pick Cassandra only when you genuinely need active-active multi-region writes and have the ops headcount to operate it (most teams that pick Cassandra in 2026 should have picked ScyllaDB or DynamoDB instead, as Discord's 177-to-72 node migration demonstrated).
Headline Benchmark Numbers (2024-2025 Independent Tests)
- DynamoDB: 4K-40K ops/sec range (highly variable by access pattern), single-digit ms p99 in steady state, ~$1.25 per million requests at typical sizing
- Cassandra: 80K-106K ops/sec on mixed workloads (3-node baseline), p99 reads 40-125ms historical (Discord), p99 writes 5-70ms historical
- PostgreSQL: ~16K ops/sec write-heavy OLTP single-node, ~1.8x MySQL on writes; modern hardware pushes 100K+ TPS on well-tuned r6i.32xlarge
Best default choices
1. Trade-Offs
DynamoDB
Use when access patterns are known, low-latency key-value lookups matter, AWS-native operations are preferred, and zero database ops is worth the modeling constraints.
| # | Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|---|
| 01 | Partition key drives all access patterns | Predictable sub-10ms p99 reads at any scale, no query planner surprises | Cannot do ad-hoc queries, every new access pattern requires a new GSI or table | PM asks for a new query shape post-launch and the answer is "we need a GSI or rebuild the table" | Single-table design with overloaded composite keys is the escape, but most teams discover it after building 5 separate tables and now need a migration. |
| 02 | 3000 RCU / 1000 WCU per-partition ceiling | Forces good key design, prevents one workload from starving the table | Hot key kills you even if table-level capacity is unused | Viral content concentrates all traffic on one partition key, you throttle at 5% of provisioned capacityAWS docs: adaptive capacity and split-for-heat take 5-10min to kick in; dropped writes during the lag window | The 1000 WCU per-partition limit applies even in on-demand mode and is not raisable. Pre-shard via write-suffix pattern before viral moments, not during. |
| 03 | Eventual consistency by default, strong reads cost 2x | Cheap reads at scale, ~5ms p99 with eventual | Read-after-write surprises if you forget the ConsistentRead flag, plus 2x cost when you need it | User updates profile, refreshes page, sees stale data, files a bug. You add ConsistentRead everywhere and the bill doubles. | GSIs are always eventually consistent, no flag will fix that. If you need read-after-write on a secondary access pattern, you need to query the base table. |
| 04 | On-demand pricing vs provisioned | Zero capacity planning, scales to any spike | Roughly 7x the cost of saturated provisioned capacity | Workload becomes steady-state and you're paying 7x what you could be. Or stays spiky and provisioned would have throttled. | The right answer is usually provisioned + auto-scaling once you know your traffic shape. On-demand is the right starting default and the wrong long-term choice. |
| 05 | 400KB item size hard limit | Forces externalization of blobs, keeps per-partition storage manageable | Cannot store medium-sized documents, must split or offload to S3 | Product launches a feature that stores a 1MB JSON blob and you're now juggling two-system consistency between DDB and S3 | Most teams paper over this with "store S3 key in DDB" but never solve the consistency problem (S3 write succeeds, DDB write fails, orphan blob). Build the cleanup job before you ship. |
| 06 | No native joins, no cross-table transactions across regions | Predictable latency, no query planner to explain when things go slow | Joins become application-level multi-get, transactions across Global Tables require app coordination | Compliance asks for a report across users + orders + refunds, and now you're scanning three tables and merging in Lambda | TransactWriteItems works within a single region and table, up to 100 items. Cross-region needs sagas. This is the silent gotcha that derails most "let's go multi-region" projects. |
| 07 | Vendor lock-in plus implicit regional dependency on us-east-1 | Tight integration with Lambda, IAM, CloudWatch, no ops burden | Cannot move without rewriting data access layer; AWS control plane dependencies cascade through DDB | Oct 19-20, 2025: DNS race condition in DDB us-east-1 took down DDB regionally, then cascaded to EC2 (DWFM lease failures), Lambda, NLB. 15 hours of degraded service.Source: AWS post-mortem, Oct 2025. Race between DNS Planner and DNS Enactor wrote empty DNS record for dynamodb.us-east-1.amazonaws.com | "Fully managed" doesn't mean "no failure modes you need to plan for." Most teams treat DDB as a black box and have zero runbook for regional DDB unavailability. Build the multi-region failover plan even if you don't need it today. |
| 08 | Streams provide CDC, but with 24h retention | Native change capture into Lambda, Kinesis, EventBridge | If your consumer falls behind 24 hours, data is gone | Lambda consumer fails silently overnight, you discover Monday morning and the changes are unrecoverable | Use Kinesis Data Streams sink (longer retention) for anything important. The 24h limit is fine for triggers but never for critical pipelines. |
Cassandra
Choose only when you need active-active multi-region writes, massive write scale, tunable consistency, and you have the operational depth to own compaction, repair, and clock-skew risk.
| # | Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|---|
| 01 | Tunable consistency per operation (ONE / QUORUM / ALL / LOCAL_*) | Latency and consistency become per-query knobs, not architectural commitments | Every developer must understand R+W>RF formula, every team gets it wrong at least once | Reads default to ONE, writes default to ONE, team thinks they have consistency, race conditions appear in production | QUORUM/QUORUM is the right default for most workloads. LOCAL_QUORUM for multi-DC. Anyone who codes against CL=ONE for writes is shipping a bug. |
| 02 | Peer-to-peer with no leader, every node accepts writes | No failover step, no leader election lag, true active-active multi-DC | Last-writer-wins via timestamps, silent data loss on clock skew | NTP drifts on one node, writes from that node lose every conflict, days pass before someone notices | This is the single biggest operational risk in Cassandra and it's not in any vendor pitch. Monitor clock skew like your job depends on it, because it does. |
| 03 | LSM tree storage, write-optimized via commit log + memtable | Sub-ms writes, massive write throughput per node | Read amplification (multiple SSTables), compaction is a permanent ops concern | Compaction falls behind during ingest spikes, reads slow down, eventually disks fill up | Compaction strategy choice (STCS vs LCS vs TWCS) is one of the highest-leverage decisions and most teams pick wrong. Time-series should be TWCS, no exceptions. |
| 04 | JVM-based runtime with garbage collection | Mature JVM ecosystem, broad tooling, well-understood operational model | GC pauses create p99 tail latency variance; "super long consecutive GC pauses" require manual node reboot | Discord 2022: p99 read latency 40-125ms, p99 writes 5-70ms, frequent on-call paging for GC-induced cluster instabilityDiscord migrated 177-node Cassandra cluster to 72-node ScyllaDB cluster (C++, no GC); p99 reads improved to 15ms, p99 writes to 5ms | If you're picking Cassandra in 2026 over ScyllaDB, you should have a specific reason (ecosystem maturity, existing skills, multi-cloud) because the GC-pause tail latency tax is real and operationally expensive. |
| 05 | Tombstones for deletes (gravestone records) | Distributed deletes work eventually-consistently without coordination | Reads must scan tombstones until compaction reaps them (gc_grace_seconds, default 10 days) | Queue-like workload (insert, read, delete) tombstones reads to 10x slower as old tombstones pile up | Cassandra is famously the wrong choice for queue patterns. If your workload has a high delete:insert ratio, you're going to have a bad time, regardless of tuning. Discord's migration was held up by tombstone-heavy token ranges in the last 0.0001%. |
| 06 | CQL looks like SQL but isn't | Familiar surface for relational developers, fast onboarding | Familiarity is a trap, developers write Cassandra anti-patterns thinking they're writing SQL | JOIN-less, GROUP BY-less, the first cross-table query someone tries fails and they redesign the schema | L7-level red flag in interviews: a candidate who treats CQL as SQL with limitations. Real signal: candidate who treats it as a key-value API with a SQL-shaped wrapper. |
| 07 | Operational complexity is high (repair, compaction, GC tuning) | Total control over performance, no managed-service ceiling | Needs 2-3 dedicated SREs minimum at production scale | You hit a JVM GC pause issue at 50TB and the team has nobody who knows the JVM well enoughDiscord described their Cassandra ops as "high-toil" with "unpredictable latency and frequent on-call incidents" before migrating | DataStax Astra or ScyllaDB Cloud removes most of this. If you're picking Cassandra in 2026 self-hosted, you're picking a 3-engineer operational commitment. |
| 08 | Lightweight transactions (LWT) via Paxos | Compare-and-set semantics on top of an eventually-consistent store | 4 round trips, ~10x slower than regular writes, scoped to single partition | Developer sprinkles LWT for "safety", throughput drops 80%, cluster CPU pegs at 100% | LWT is a tactical escape valve, not a strategy. Cassandra 5.x ships Accord (paxos-derived strict-serializable transactions); watch this space if you've been blocked by transactions. |
PostgreSQL
Default to Postgres for most product and OLTP systems where relational modeling, ACID transactions, SQL flexibility, extensions, and operational familiarity matter.
| # | Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|---|
| 01 | Single-leader replication, vertical scaling primary | Strong consistency, ACID transactions, no clock-skew failure modes | Write throughput capped by single primary, eventual cliff requires sharding | Write QPS approaches 30K and you're researching Citus on a deadline | Modern hardware pushes the single-node ceiling much higher than most teams realize (100K+ TPS on a properly tuned r6i.32xlarge with NVMe). Don't shard until you've actually tuned. |
| 02 | MVCC via row versions, no read locks | Readers never block writers, writers never block readers | VACUUM is permanent ops concern; XID wraparound is a documented production-killing failure mode | Mailchimp/Mandrill 2016: autovacuum fell behind on a busy shard, XID wraparound protection halted writes, ~40 hours of outage. Sentry 2024: same root cause, emergency manual vacuuming required.Postgres uses a 32-bit XID counter; database refuses writes at ~2.1B transactions to prevent corruption. Alert at 1.5B XID age. | XID wraparound is the most under-appreciated failure mode in Postgres. It develops over weeks with no visible symptoms then takes the database offline completely. Set up the alerts before you need them. |
| 03 | Streaming replication (physical) + logical replication | Read replicas for scale, logical for selective replication and major version upgrades | Physical replicas read-only and version-locked, logical has known caveats (DDL, sequences, large transactions) | Logical replication breaks on a schema change and you're debugging replication slot growth at 3am | For major version upgrades, logical replication is the only zero-downtime path and it's still terrifying. Practice the cutover on a copy first. |
| 04 | Connection-per-process model | Strong isolation, easy debugging per session | Each connection costs ~10MB RAM, max_connections becomes a hard ceiling at ~500-1000 | Microservices proliferate, each maintains its own pool, you hit the connection wall | PgBouncer in transaction mode is mandatory for any non-trivial deployment. Without it you're capping your architecture at a few hundred concurrent backend connections. |
| 05 | Native partitioning, native sharding via Citus extension | Tiered scaling story: partition first, shard later when needed | Sharding is a retrofit, query planner has no native distributed-query concept | Cross-shard JOIN performance degrades to scatter-gather, application has to be sharding-aware | Citus works when data partitions naturally on a tenant_id-like key and 95% of queries are tenant-scoped. It breaks on global aggregations. Plan the data model for sharding from day one if you suspect you'll need it. |
| 06 | Extension ecosystem (PostGIS, pgvector, TimescaleDB, pg_partman) | One database for OLTP, GIS, vector search, time-series, queues | Each extension is a maintenance commitment, version-coupling across extensions | Postgres major upgrade requires waiting for 4 extensions to support the new version | Extensions are Postgres's superpower and its hidden cost. The "use one tool" gain is real until you have 6 extensions, then upgrade cadence becomes the bottleneck. |
| 07 | Synchronous replication available but optional | Trade durability for latency on a per-transaction basis | Default async means a failover can lose committed transactions | Primary AZ fails, replica promoted, 30 seconds of writes are gone, audit team is unhappy | synchronous_commit=remote_apply is the strongest setting and costs 2-5ms latency per write. For financial workloads it's non-negotiable. Most teams don't even know it exists. |
| 08 | Index proliferation is cheap, except when it isn't | Add indexes liberally to speed up queries | Each index slows down writes, increases bloat, fights for buffer cache | Write performance silently degrades as the indexes-per-table count creeps to 12+ | Use pg_stat_user_indexes to find unused indexes quarterly. Most production systems carry 30%+ dead-weight indexes that nobody dares to drop because "they might be used." |
2. Use Cases
DynamoDB
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Shopping cart and session store | Amazon retail (origin use case) | Sub-10ms p99 with zero ops overhead across regions | Trillions of items, peak ~89M req/sec during Prime Day | Cassandra would need a dedicated SRE team; Postgres would not scale writes past single primary |
| Ad-tech bid storage and counters | Mid-size DSP, 200K QPS sustained | Predictable p99 under 15ms, no GC pauses | 50TB hot data, 3M writes/sec spike during traffic peaks | Cassandra GC tail latency at 99.9% would breach the 100ms bid-window SLO; Postgres caps below 1M writes/sec |
| Mobile app backend (user profiles, prefs, sync state) | Snapchat-style, Tinder-style apps | Per-user lookup latency, global tables for region affinity | 100M+ MAU, ~5K QPS per region | Postgres single-leader latency cross-region would force read replicas plus app-layer routing |
| IoT device state and telemetry ingest | Fleet management, ~1M devices reporting every 30s | High write throughput with auto-scaling, TTL for retention | ~33K writes/sec sustained, time-series data with 90-day TTL | Postgres TimescaleDB cheaper at low scale but doesn't auto-scale through traffic spikes |
| Serverless event-sourced workload | Lambda + DDB Streams pattern | Native Lambda trigger, IAM integration, zero connection-pool concerns | ~10K events/sec with downstream fanout | Postgres would require RDS Proxy and connection limits would still cap concurrency |
| Gaming leaderboards and player state | Mobile games, competitive titles | Single-digit ms read on per-player profile lookup | 50M players, 200K concurrent at peak | Redis loses persistence guarantees; Cassandra adds operational tax for no latency gain |
Cassandra
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Time-series metrics and observability storage | Netflix Atlas (1 trillion requests/day on Cassandra) | Massive write throughput with linear horizontal scaling | Billions of writes/day, multi-PB datasets | Postgres TimescaleDB caps below 1M inserts/sec; DDB cost prohibitive at this scale (~$1.25/M requests vs ~$0.0001/M on Cassandra hardware) |
| Multi-region active-active write workload | Global IoT platforms, sync-heavy mobile apps | Local-DC writes with no failover, async cross-DC convergence | 10+ regions, sub-50ms write latency in each | DDB Global Tables exist but lose to Cassandra on per-key conflict resolution flexibility; Postgres has no native active-active story |
| User activity / event log storage (historical context) | Discord (177 nodes at peak before ScyllaDB migration) | Append-heavy workload, channel-partitioned by Snowflake ID | Trillions of messages, ~177 Cassandra nodes pre-migration | Postgres write ceiling; DDB cost; though Discord ultimately concluded ScyllaDB was the better trade-off |
| Recommendation system feature store | Spotify-style music streaming, news ranking | Wide-row reads (all features for a user) with low-latency lookup | 500M users, 1000+ features per user, 10K QPS | Postgres row width and per-query latency wouldn't sustain it; DDB item-size limit (400KB) kills the wide-row pattern |
| Fraud detection feature aggregation | Payment processors, large e-commerce | Real-time aggregation across time windows with eventual consistency tolerance | Tens of TB, sub-100ms feature retrieval | Redis loses persistence at scale; DDB throughput limits push cost too high |
PostgreSQL
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Transactional system of record | Most B2B SaaS (Stripe hybrid, Notion, Linear) | ACID transactions, mature ecosystem, schema flexibility | 10s of TB, sub-10K TPS sustained (well within single-node ceiling) | DDB lacks cross-table transactions across regions; Cassandra has no real ACID story |
| Multi-tenant SaaS with tenant isolation | Notion, Linear, Heroku, Citus customers | Per-tenant query patterns, schema-per-tenant or RLS isolation | 10K-100K tenants, varied query shapes per tenant | NoSQL forces denormalization that breaks tenant-isolated reporting; Citus extends Postgres for sharding when needed |
| Geospatial workloads | Uber (early), delivery platforms, ride-sharing | PostGIS extension provides production-grade spatial indexing | Millions of geo-queries per minute | DDB lacks spatial primitives; Cassandra has no geo-index that's production-credible |
| Vector search for AI applications | RAG pipelines, embedding storage, Pinecone alternatives | pgvector provides HNSW indexes, mature SQL surrounds it | 10M-100M embeddings with metadata filtering | Dedicated vector DBs are faster but force a second store and a sync problem |
| Analytics-adjacent OLTP | Internal admin tools, reporting backends | Window functions, CTEs, JSONB, full SQL for ad-hoc analysis | 1-10TB, complex queries | Cassandra and DDB cannot do ad-hoc aggregation; warehouse is overkill for this scale |
| Job queue / task scheduling | Hatchet, Trigger.dev, Sidekiq alternatives | SELECT ... FOR UPDATE SKIP LOCKED makes Postgres a credible queue, single source of truth | 10K-100K jobs/sec with retries and dead-letter | Dedicated brokers (Redis, SQS) lose the durability and queryability of Postgres-backed queues |
3. Limitations
| Limitation | DynamoDB | Cassandra | PostgreSQL |
|---|---|---|---|
| Max item / row size | 400KB hard limit High | 2GB partition recommended, ~100MB row practical Med | 1.6TB row, 1GB per field practical Med |
| Per-key throughput ceiling | 3000 RCU / 1000 WCU per partition (not raisable, even on-demand) Critical | Per-node ceiling depends on hardware, no logical per-key cap Med | Single writer per row, no logical cap (single-node ceiling caps writes) Med |
| Query flexibility | Only by partition key + sort key range Critical | Only by partition key + clustering key range, no joins High | Full SQL, only limited by query planner Med |
| Cross-key transactions | TransactWriteItems up to 100 items, single region only High | No multi-partition transactions; LWT single-partition; Accord coming in 5.x Critical | Full ACID across the entire database Med |
| Schema evolution | Schema-on-read, but GSI changes are slow Med | ALTER TABLE works but can be expensive at scale High | Online DDL works, some operations still lock Med |
| Backup and restore | PITR up to 35 days, full restore is slow Med | Snapshot-based, ops-heavy, no PITR native High | WAL-based PITR, mature tooling Med |
| Multi-region write story | Global Tables, LWW conflict resolution Med | Native active-active multi-DC Med | No native active-active, requires BDR or external tools Critical |
| Cost predictability at scale | Spike-driven cost cliffs on on-demand (~$1.25/M requests) High | Hardware + ops headcount fixed cost (~$0.0001/M on equivalent hw) Med | Compute + storage fixed cost Med |
| Connection scaling | HTTP-based, no connection pool needed Med | Native driver pool, ~10K connections per node Med | Process-per-connection caps at ~500-1000 without PgBouncer High |
| Documented catastrophic failure mode | Regional DNS race condition (Oct 2025 us-east-1, ~15h outage) Critical | Clock skew silent data loss; GC pauses requiring manual node reboot High | XID wraparound (Sentry 2024, Mailchimp 2016 ~40h outage) Critical |
4. Fault Tolerance
| Dimension | DynamoDB | Cassandra | PostgreSQL |
|---|---|---|---|
| Replication model | 3x sync across 3 AZs, leader-based, automatic | Tunable RF (typically 3), peer-to-peer, no leader | Single primary, async streaming replicas (sync optional) |
| Failure detection | AWS control plane, sub-30s (when control plane is healthy) | Gossip protocol, ~10-30s detection | External (Patroni, repmgr, RDS), 30-60s typical |
| Failover mechanism | Automatic, transparent to client | No failover needed (no leader), client retries to next replica | External orchestrator promotes a replica, DNS/proxy update |
| RTO (typical) | Sub-second to client, ~60s control-plane recovery (but see Oct 2025: ~3h DDB recovery, ~15h full ecosystem) | Sub-second (just retry to another coordinator) | 30-120s for managed (RDS), 5-30s for tuned Patroni setups |
| RPO (typical) | 0 for AZ failure (sync replication) | 0 with CL=ALL writes, otherwise potential loss on coordinator failure pre-replication | 0 with synchronous_commit, seconds with async (default) |
| Split-brain behavior | Prevented by AWS control plane quorum | Possible during partition; LWW resolves on heal, can lose data via clock skew | Possible if failover is misconfigured; STONITH or fencing required |
| Blast radius of single-node failure | Single partition unavailable ~30-60s, no data loss | ~RF replicas absorb load, no client-visible impact at QUORUM | If primary: full write unavailability until promotion; if replica: degraded read capacity |
| Regional / control-plane failure | Documented: Oct 2025 DNS race condition took down DDB us-east-1 for ~3h, cascaded to EC2/Lambda/NLB for ~15h total | Multi-DC native; LOCAL_QUORUM keeps regional impact local | No native multi-region story; depends on tooling and orchestrator design |
| Cross-region failover story | Global Tables, active-active, no failover needed | Multi-DC native, LOCAL_QUORUM keeps reads/writes local | No native active-active; requires logical replication or BDR |
| Silent data loss vector | Operator error on Global Tables (LWW resolution) | Clock skew + LWW: drifting NTP loses every write conflict, undetected for days | XID wraparound: dev-week-scale silent buildup, then total write outage (Mailchimp ~40h) |
6. Replication
| Dimension | DynamoDB | Cassandra | PostgreSQL |
|---|---|---|---|
| Replication topology | Leader-follower per partition, single leader | Leaderless (Dynamo paper lineage), peer-to-peer | Single primary, cascading replicas supported |
| Sync vs async | Sync within region (3 AZs), async cross-region (Global Tables) | Depends on consistency level: sync at QUORUM/ALL, async at ONE | Async by default; sync, remote_apply, remote_write configurable per transaction |
| Replication factor | 3, not configurable | Configurable, typically 3 per DC; per-keyspace setting | Configurable number of replicas, no theoretical cap |
| Consistency level options | Eventually consistent (default), strongly consistent (2x cost, in-region only, not on GSIs) | ONE / QUORUM / LOCAL_QUORUM / ALL / SERIAL (LWT) — tunable per query | Read-your-writes via primary; replicas have configurable max staleness |
| Replication lag (typical) | Single-digit ms within region, ~1s cross-region (Global Tables) | Sub-second within DC, 100ms-seconds cross-DC depending on network | 10-100ms in-region async; less than 1ms sync remote_write |
| Conflict resolution | Last-writer-wins by timestamp (Global Tables) | Last-writer-wins by timestamp; vulnerable to clock skew silent loss | No multi-master, so no conflict resolution; BDR extension adds LWW with column-level resolution |
| Cross-region replication | Global Tables: active-active, eventually consistent | Native multi-DC with NetworkTopologyStrategy | Logical replication across regions; no native active-active |
| Replication during partition | Minority AZ writes fail, majority continues | Both sides accept writes at CL=ONE; conflicts resolved on heal (LWW) | Async: writes continue on primary, replica falls behind. Sync: writes block |
| Replication failure modes | Control-plane DNS dependency (Oct 2025 incident) | Hinted handoff overflow under sustained replica down; tombstone tsunami on repair | Replication slot growth (logical) fills disk; physical replica fall-behind |
7. Better Usage Patterns
DynamoDB
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Single-table design | Create one table per entity (Users, Orders, Products) like a relational schema | One table with overloaded PK/SK and entity-type prefix (USER#123, ORDER#456) | Co-located access patterns become single queries; new access patterns become key-design decisions, not migrations |
| Write sharding for hot keys | Use the natural key and hope adaptive capacity catches up (it won't, in time) | Suffix the write key with a random 1-N shard (user_id#shard_0..N), aggregate on read | Pre-shards distribute load before adaptive capacity has a chance to throttle; eliminates flash-spike risk (the 5-10min adaptive capacity lag is a documented kill window) |
| GSI projection design | Project ALL on every GSI "just in case" | Project only attributes the query uses, plus the keys | GSIs store and bill independently; ALL projections double or triple your storage and write cost |
| Provisioned + auto-scaling vs on-demand | Stay on on-demand forever because "it just works" | Move to provisioned with auto-scaling once traffic is predictable (after ~3 months) | Provisioned saturated is 6-7x cheaper than on-demand for the same throughput; auto-scaling handles the variance |
| Conditional writes for idempotency | Use UpdateItem without conditions, rely on retries being idempotent at the app layer | ConditionExpression on every retryable write (attribute_not_exists, version-based) | Prevents the double-write bug that appears once a year in production and ruins your week |
| Multi-region failure planning | Treat managed = invulnerable, no runbook for regional DDB outage | Build read-failover to a second region via Global Tables; explicit chaos test against regional control-plane failure | Oct 2025 us-east-1 outage was a wake-up call. Most DDB customers had no plan and were down for hours waiting for AWS |
Cassandra
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Query-first table design | Design one normalized schema, then try to query it different ways | One table per query pattern, denormalize ruthlessly, write to all tables on update | Cassandra has no query optimizer; the table layout IS the query plan. Wrong table design means linear scans or impossible queries |
| Compaction strategy selection | Use default STCS for everything | STCS for general, LCS for read-heavy, TWCS for time-series with TTL | Wrong strategy causes 10x read amplification or storage bloat; TWCS for time-series is the difference between working and not |
| Clock synchronization | Trust the OS NTP defaults | Run chrony or dedicated NTP with sub-ms accuracy, monitor drift as an SLO | LWW conflict resolution silently loses writes when clocks drift; this is the #1 silent data loss vector in Cassandra |
| Consistency level selection | Use ONE/ONE for performance "because Cassandra is eventually consistent" | QUORUM/QUORUM as default, LOCAL_QUORUM in multi-DC, drop to ONE only with explicit reason | R+W>RF is the correctness threshold; teams that don't enforce it ship race conditions to production |
| Tombstone management | Use Cassandra as a queue (insert, read, delete pattern) | Use TTL for transient data, design tables append-only where possible, monitor tombstone count | Tombstone scans degrade reads exponentially; Discord's migration was held up at 99.9999% by exactly this issue |
| Request coalescing for hot partitions | Let multiple users requesting the same data hit the database independently | Add a request coalescing layer (Discord built theirs in Rust) — many users for the same key share one DB read | Hot partition reads multiplied by user concurrency is what killed Discord's Cassandra cluster; coalescing reduces effective fanout |
PostgreSQL
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Connection pooling | Each app/service manages its own pool, hit max_connections wall | PgBouncer in transaction mode in front of every Postgres deployment | Postgres caps at ~500-1000 connections without it; PgBouncer multiplexes thousands of clients onto tens of backend connections |
| VACUUM and autovacuum tuning | Leave autovacuum at defaults, blame Postgres when things slow down | Tune autovacuum_vacuum_scale_factor per-table for high-churn tables; monitor pg_stat_user_tables.n_dead_tup and XID age aggressively | Default autovacuum settings are conservative; high-churn tables accumulate dead tuples faster than vacuum runs. Mailchimp's ~40h outage and Sentry's emergency vacuuming both traced to this pattern |
| XID wraparound monitoring | Trust autovacuum to handle it, alert on slow queries instead | Alert on SELECT max(age(datfrozenxid)) exceeding 1.5B (well before the 2B emergency threshold) | XID wraparound develops over days/weeks with zero symptoms then halts all writes. This single alert prevents the worst Postgres failure mode |
| Index strategy | Add indexes for every slow query, never remove any | Quarterly pg_stat_user_indexes audit, drop indexes with zero scans, use partial indexes for selective conditions | Each unused index slows down every write and bloats the buffer cache; typical prod systems carry 30%+ dead-weight indexes |
| Long-running transactions | Wrap big batch jobs in a single transaction "for atomicity" | Chunk into smaller transactions with checkpoint table; use advisory locks for coordination | Long transactions block VACUUM from cleaning dead tuples, cause bloat to spiral, accelerate XID age. Duffel's 2021 outage traced to exactly this pattern (DDL statements without timeouts) |
| Replica usage for reads | Send all reads to primary "for consistency" | Route reads to replicas with explicit staleness tolerance; primary handles writes and strongly-consistent reads | Read-replica routing typically removes 60-80% of primary load; teams that don't do this scale vertically until they can't |
8. Advanced / Next-Gen Alternatives
DynamoDB
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| ScyllaDB / ScyllaDB Cloud (Alternator) | Wire-compatible DDB API at lower cost, no per-partition WCU ceiling, multi-cloud | Production at scale | Low for basic API, high if using Streams or Transactions | Cost at scale becomes prohibitive ($50K+/mo), regional dependency risk became unacceptable post-Oct-2025, or multi-cloud mandate |
| Aurora DSQL | Strongly consistent multi-region writes with SQL surface and ACID | GA but young | High — full schema rewrite, SQL access patterns | You need multi-region writes AND ACID transactions, and can wait out the maturity curve |
| FoundationDB | Strict serializability with multi-key transactions across the entire dataset | Production (Apple iCloud, Snowflake) | Very high — different data model, requires layer design | Correctness ceiling matters more than ops cost (financial, ledger, regulated workloads) |
| TigerBeetle (for financial) | Strict serializable double-entry accounting at 1M+ TPS | Early production | Very high — purpose-built, not a general database | You're building a ledger / payment system and DDB's eventual consistency on indexes becomes a correctness problem |
Cassandra
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| ScyllaDB | Same data model and CQL, 3-10x throughput per node, no JVM GC pauses (Discord: 177→72 nodes, p99 reads 40-125ms→15ms) | Production at scale (Discord, Numberly) | Low — wire-compatible with Cassandra driver | Tail latency or operational cost of Cassandra has become unacceptable, you want to keep the data model. This is the default modern answer for self-hosted wide-column. |
| Cassandra 5.x with Accord (transactions) | Native multi-partition strict-serializable transactions via Accord protocol | Emerging (5.x GA, Accord ongoing) | None if you're already on Cassandra | You want Cassandra's scale but have been blocked by lack of multi-partition transactions |
| YugabyteDB / CockroachDB | Strongly consistent distributed SQL with global transactions | Production | High — different query model (SQL vs CQL), strong vs eventual consistency reshapes app logic | You need horizontal scale AND ACID AND SQL — Cassandra was the wrong original choice |
| DynamoDB (managed) | Removes the entire ops burden (compaction, repair, GC tuning) | Production | High — different API, vendor lock-in, regional dependency risk | You picked Cassandra for scale but the 3-SRE ops cost exceeds the value; you're already AWS-native and accept the trade-off |
PostgreSQL
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Citus (extension) | Horizontal sharding for Postgres, keeps SQL surface | Production at scale | Medium — requires distribution-key design, some query rewrites | Single-primary write ceiling reached, data partitions naturally by tenant or workspace |
| Neon / Aurora serverless | Separates compute from storage, instant branching, auto-scale | Production | Low — wire-compatible Postgres | Variable workload, dev/prod parity matters, want to scale compute independent of storage |
| CockroachDB / YugabyteDB | Horizontal scale with strong consistency, Postgres-wire-compatible | Production | Medium — wire compat but different operational model and some SQL features differ | Need active-active multi-region with strong consistency, Postgres can't provide it natively |
| OrioleDB / Postgres 17+ undo-log storage | Eliminates VACUUM via undo-log MVCC, reduces bloat, eliminates XID wraparound risk | Emerging | Low at maturity — Postgres-compatible | Watch this space; if it lands stable, the entire VACUUM tuning burden becomes optional, not mandatory. This is the most exciting Postgres direction in a decade. |