NoSQL · Managed · AWS

Amazon DynamoDB

Principal Engineer deep-dive: architecture, every major feature, eight mandatory trade-off tables, and seven production case studies from operators running it at billion-item scale.

As of 2026-06-08 · DynamoDB 2022 paper + post-2024 features (MRSC GA Jun 2025, optimized stream discovery Jul 2025)

PE Verdict

DynamoDB is the right answer when your access patterns are known and your scale is unforgiving. The architecture (Multi-Paxos under shared-nothing partitions, MemDS-backed routing, "constant work" everywhere) buys you single-digit-ms p99 at any QPS with zero ops headcount, paid for with rigid query shapes, an item-size ceiling, and a cost curve that turns hostile if you treat it like a relational database.

The two most important decisions before adopting it: single-table vs multi-table (decided once, lived with forever) and MREC vs MRSC Global Tables (cross-region strong consistency now exists; it costs write latency and locks you to exactly three regions). Get those two wrong and the operational savings disappear into application-layer plumbing.

Best default choices

Overview

DynamoDB is a fully-managed, serverless, key-value and document NoSQL database from AWS. Born out of Amazon.com's 2004 holiday-season scaling pain (the original Dynamo paper, 2007) and re-built from the ground up as a multi-tenant managed service in 2012, it shares almost nothing with the original Dynamo except the name. Internally it is leader-based, range-partitioned, and synchronous via Multi-Paxos, the architectural opposite of the leaderless quorum-based Dynamo of the paper.

What you actually get

  • Predictable p99 latency. Single-digit milliseconds for point reads, regardless of dataset size (1KB or 100TB). This is the marketing promise and it is genuinely engineered, not aspirational.
  • Horizontal scale without an ops team. Partitions split automatically on size (10GB) or heat (3000 RCU / 1000 WCU). No vacuum, no rebalancing playbook, no on-call escalation for "the database is full."
  • Pay-per-request OR pay-for-provisioned. On-demand mode autoscales to traffic; provisioned mode is 3-7x cheaper at steady state and lets you cap spend.
  • Built-in multi-region. Global Tables with active-active replication (MREC, ~1s lag) or strong consistency (MRSC, GA June 2025, RPO=0 with cross-region write latency penalty).
  • CDC out of the box. DynamoDB Streams (24h retention) or Kinesis Data Streams adapter (365d retention) emit every change for Lambda or external pipelines.

What it explicitly does not give you

  • Ad-hoc queries. Every access pattern must be designed for. There is no SELECT * FROM users WHERE city = ? AND last_login > ? unless you built a GSI for it.
  • Server-side joins. The recommended workaround is single-table design with overloaded keys. The unrecommended workaround is doing joins in your application layer (slow, expensive, easy to get wrong).
  • Items above 400KB. Hard limit. Store the blob in S3, store the S3 key in DDB.
  • Full-text search. Pair with OpenSearch or pgvector. DDB's contains is a filter expression, not an index.
  • Free analytics. Scans are linear in table size and consume capacity. Stream the table to S3, query with Athena.
PE One-Liner

DynamoDB rewards engineers who know their access patterns at design time and punishes those who don't. The penalty for an unplanned new query pattern is a new GSI or a table migration, not a query optimizer's overnight surprise.

Architecture

DynamoDB is a five-tier shared-nothing service: Request Routers (stateless front door), MemDS (metadata cache), Storage Nodes (the actual B-tree replicas), Auto Admin (control plane), and S3 archival (for WAL durability beyond replica disks). The bones haven't changed since the 2018 re:Invent talk; what's been added is adaptive capacity, hot-item isolation, on-demand mode, and now multi-region strong consistency.

Application / SDK REQUEST ROUTER (RR) TIER · stateless · auto-scaled • Authenticates via IAM / Signature v4 • Resolves partition key → storage-node owner via MemDS lookup (constant-work pattern: always calls MemDS, even on cache hit) • Routes write/strong-consistent-read to leader replica; eventually-consistent-read to any replica MEMDS · partition metadata cache • In-memory distributed B-tree of (table, key-range → replica-set) • Constant work: RR queries MemDS every request to keep cache warm • Updated by Auto Admin on splits, failovers, region adds AUTO ADMIN · control plane • Partition splits (size >10GB, sustained heat >3000 RCU / 1000 WCU) • Leader election, replica replacement (spins up log replica first for fast quorum) • Cross-AZ placement, GSI propagation, Global Tables replication setup STORAGE NODES · 3 replicas per partition · cross-AZ • Multi-Paxos for leader election + consensus (stable leader, full Paxos only on leader change) • Local WAL + B-tree storage engine; storage replicas serve reads, log replicas only persist WAL • Write quorum = 2 of 3; ack to client after majority WAL persistence Replica 1 (AZ-a) LEADER · WAL · B-tree Replica 2 (AZ-b) FOLLOWER · WAL · B-tree Replica 3 (AZ-c) FOLLOWER · WAL · B-tree Log Replica (on failure) WAL only · fast quorum recovery S3 ARCHIVAL • Continuous WAL backup • PITR (35-day window) • On-demand backups • Source of truth for replica reconstruction DynamoDB · 5-tier shared-nothing architecture (2022 USENIX paper + 2024–2025 additions)

The Five Tiers

1. Request Router (RR)

Stateless HTTP front door. Authenticates the request, resolves the partition key against MemDS, then routes to the correct storage node. Auto-scales horizontally with traffic.

Key insight: RRs do not cache routing info indefinitely. They consult MemDS on every request (constant-work pattern), which means MemDS handles steady traffic, not a recovery thundering herd.

2. MemDS (Metadata Service)

Distributed in-memory B-tree of partition metadata: which key range lives on which replica set, who's the current leader, what replicas are healthy. Sits between RRs and storage nodes.

Constant-work pattern: even on cache hits, the RR queries MemDS. This keeps MemDS warm and avoids the cold-cache catastrophe during failover.

3. Storage Nodes

Where the actual data lives. Each partition is replicated 3-way across AZs in a region. Uses Multi-Paxos for leader election and log replication. B-tree on disk; WAL for durability.

Two replica types: storage replicas (WAL + B-tree, can serve reads) and log replicas (WAL only, cheap to spin up for fast quorum recovery during failover).

4. Auto Admin (Control Plane)

Handles partition splits (size-based at 10GB, heat-based on sustained load), replica placement across AZs, leader elections, GSI backfill, Global Tables setup, and PITR/backup orchestration.

When a storage node dies, Auto Admin spins up a log replica first (much faster than a full storage replica) to restore quorum, then rebuilds the storage replica in the background.

5. S3 Archival

WAL records are continuously shipped to S3 for cross-AZ durability beyond the 3-replica set. This is the basis for PITR (35-day window) and on-demand backups.

S3 is also the source of truth for rebuilding replicas after correlated AZ failures. The "11 nines of durability" pitch is grounded here, not in the in-region replica count.

Cross-Cutting: Constant Work Pattern

The architectural through-line. Whether cache hit or miss, partition hot or cold, request rate high or low — internal subsystems do the same amount of work. Costs more on average; eliminates the bimodal failure mode where a system that "usually does little work" collapses when conditions change.

This is why DynamoDB has predictable latency. It is also why DynamoDB is expensive.

Core Concepts

Eight ideas that, if you internalize them, eliminate 80% of common DDB mistakes. Most production incidents trace back to misunderstanding one of these.

Concept What It Is Why It Matters (PE Lens)
Table Top-level container of items. No schema beyond required key attributes. No fixed columns. Items in the same table can have wildly different attribute sets. Schemalessness is a footgun if you don't enforce shape in code. Use a tagged-union approach with a type attribute on every item to keep multi-entity tables debuggable.
Item A row. A collection of attributes. Hard 400KB size cap including attribute names. Identified uniquely by its primary key. The 400KB cap is the #1 source of mid-life-cycle migrations. If you might ever store images, embeddings, or PDFs, store them in S3 and hold the key in DDB.
Attribute A key-value pair. Scalar (String, Number, Binary, Bool, Null), set (StringSet, NumberSet, BinarySet), or document (List, Map). Attribute names are stored on disk every item. Long attribute names compound at scale. userPreferenceLastUpdatedTimestamp at 35 bytes × 50M items = 1.75GB of pure attribute-name overhead, paid for in RCU/WCU on every read/write.
Primary Key (PK) Either a single partition key or a composite of (partition key, sort key). Required, immutable, indexed automatically. Determines physical placement. The single most consequential schema decision in DDB. Cannot be changed without rebuilding the table. Get it right by enumerating access patterns first, naming the table second.
Partition Physical storage unit. Hosts a contiguous range of partition-key hashes. 3-way replicated across AZs. Hard ceilings: 10GB storage, 3000 RCU, 1000 WCU. These ceilings exist on every individual partition, not just the table total. A hot partition key (e.g., userId = "global") hits the 3000 RCU wall while the rest of your table is idle.
RCU / WCU Read Capacity Unit = 1 strongly-consistent read of a 4KB item/sec (or 2 eventually-consistent, or 0.5 transactional). Write Capacity Unit = 1 write of a 1KB item/sec. Rounded up. The 4KB read / 1KB write asymmetry is intentional and surprising. A 4.1KB read costs 2 RCU. Trim attribute names and break large items apart; you save real money.
GSI (Global Secondary Index) A separately-stored, asynchronously-replicated alternate-key index. Different partition/sort key than base. Own RCU/WCU. Eventually consistent only. GSI writes consume base-table WCU during propagation, so a write to a base item with 4 GSIs costs 5x writes. GSIs are tables with extra steps, not just an index.
LSI (Local Secondary Index) An alternate sort key on the same partition key. Created at table-creation time only. Shares partition-level capacity with the base table. Supports strongly-consistent reads. LSIs cap the item collection per partition key at 10GB (hard) — the partition can no longer split. Most production teams avoid LSIs entirely for this reason.

Execution Model

How a request actually flows through DynamoDB. Understanding this is how you reason about latency, why strongly-consistent reads cost double, and where partial failures hide.

Write Path

PutItem / UpdateItem / DeleteItem

1. Client sends request to nearest DynamoDB endpoint (HTTPS, signed with SigV4).

2. Request Router authenticates, then asks MemDS: "who owns the partition for hash(userId#42)?"

3. RR routes to the leader replica of that partition (followers reject writes).

4. Leader appends to its local WAL, then sends the WAL entry to the two followers via Multi-Paxos.

5. Once a quorum of 2 of 3 replicas has durably persisted the WAL entry, the leader acks the client.

6. Leader applies the change to its B-tree storage; followers apply asynchronously after the ack.

7. WAL is shipped to S3 archive in the background (durability beyond the 3 replicas).

8. If the item has any GSIs, the change is propagated asynchronously to GSI partitions (separate Paxos groups). This is the eventually-consistent nature of GSIs.

Eventually-Consistent Read Path (default)

GetItem (ConsistentRead=false)

1. Client sends request. RR resolves partition via MemDS.

2. RR routes to any replica in the partition's replica set (often the closest by latency).

3. Replica responds from its local B-tree. If the replica is a follower that hasn't applied the latest WAL entry yet, the client sees a slightly stale value.

4. Typical p99 latency: 2-5ms. Cost: 0.5 RCU per 4KB.

Strongly-Consistent Read Path

GetItem (ConsistentRead=true)

1. Client sends request with ConsistentRead: true. RR resolves partition.

2. RR must route to the leader replica. No other replica is authoritative because Multi-Paxos guarantees the leader has all committed log entries.

3. Leader serves the read from its B-tree (already has all acked writes applied).

4. Typical p99 latency: 5-10ms (~5ms higher than eventual). Cost: 1 RCU per 4KB (2x eventual).

Failure mode: If the leader is unreachable or in the middle of re-election (typically 5-15 seconds), strongly-consistent reads return errors. Eventually-consistent reads keep working from followers.

Query Path (range scan within a partition key)

Query on (PK, SK BETWEEN x AND y)

1. Query is bound to a single partition key. Routed to that partition's owner (any replica for eventual, leader for strong).

2. Replica B-tree-scans the sort-key range within that partition. No cross-partition fan-out.

3. Up to 1MB of results returned per call; client paginates with LastEvaluatedKey.

4. Cost: sum of all item sizes in the page, divided by 4KB, rounded up.

Scan Path (full-table scan)

Scan — the path you should not take

1. RR fans out to every partition in the table.

2. Each partition returns up to 1MB of items.

3. Total cost: every 4KB of every item in the table = RCU. A 1TB table = ~250M RCU per scan.

4. Use ParallelScan with TotalSegments + Segment to distribute load. Filter expressions are applied after the read; they don't reduce cost.

5. Default to never. Use Athena over an S3 export instead.

Transaction Path (TransactWriteItems / TransactGetItems)

ACID across up to 100 items / 4MB

1. Client sends a batch of conditional writes (up to 100 items, up to 4MB total).

2. DynamoDB runs a two-phase commit coordinator across the partitions hosting those items.

3. Phase 1 (prepare): each partition leader confirms it can apply the change without conflict, locks the items.

4. Phase 2 (commit): all partitions apply the change atomically. If any prepare fails, all roll back.

5. Cost: 2x WCU for writes, 2x RCU for reads. p99 latency 10-30ms typical.

Caveats: Transactions cannot span regions in Global Tables (single-region only). The MRSC consistency mode does not support transaction APIs.

Feature Deep-Dive

Every major feature you listed, plus a few PE-relevant ones you didn't (capacity modes, transaction APIs, GSI propagation lag). Each is treated as a row in a table that captures what it does, what it costs, and where it bites.

Feature What It Actually Does Cost / Constraint PE Watch-Out
Single-digit ms latency p99 of 2-5ms for eventually-consistent reads, 5-10ms for strongly-consistent reads, 5-10ms for writes — at any table size, any RCU/WCU. Cost of constant-work pattern: idle clusters cost almost as much as busy ones. Latency is at the API boundary, not your application. Network round-trip from your service to DDB endpoint adds 1-5ms inside AWS, 20-80ms from outside.
Seamless scalability Automatic partition splits on size (10GB) or heat (sustained >3000 RCU or >1000 WCU). No downtime, no client-side reconfiguration. Splits take 5-30 minutes during which the hot partition is throttled. Adaptive Capacity bridges the gap. The "seamless" pitch hides a 5-30 minute throttle window. For predictable launches (Disney+, Super Bowl ads), pre-warm partitions by writing high WCU to the table for 20+ minutes before launch.
DynamoDB Accelerator (DAX) In-memory write-through cache cluster fronting your table. Two caches: item cache (per-item, 5-min default TTL) and query cache (per-query-result, 5-min default). 3+ node cluster, ~$0.04-0.30 per node-hour. 24/7 cost. Eventually-consistent only; strongly-consistent reads pass through to DDB. DAX does NOT track Global Tables replication. Writes from another region invalidate nothing. Direct writes that bypass DAX (via raw DDB SDK) silently leave DAX stale until TTL.
Key-value & document store JSON-like nested documents (Map, List, StringSet, NumberSet) as attribute values, all addressable in update expressions (SET data.address.city = :city). 400KB item ceiling still applies to the whole document. Nested-attribute updates count full item size for WCU. Deep nesting is a billing trap: updating one field in a deep object still rewrites and re-charges for the entire 4KB block.
Global Tables (MREC) Async multi-active replication across regions. Local writes ack immediately, replicate to other regions in <1s typical (p99 in seconds). Last-Writer-Wins conflict resolution. 2x WCU per write (counted as replicated WCU / rWCU). Cross-region data transfer fees. LWW is silent. A write in us-east-1 and a write in eu-west-1 to the same key, ~1s apart, will have the later-timestamp version preserved without warning. Build idempotency on the application side.
Global Tables (MRSC) NEW Jun 2025 Strongly-consistent active-active replication across exactly 3 regions (or 2 + 1 witness). RPO = 0 during regional failures. Multi-region journal for strong consistency. Higher write latency (cross-region round-trip; ~30-100ms+ depending on region distance). Strongly-consistent reads also cross-region. Transactions not supported. MRSC locks you to 3 regions for the lifetime of the table. Cannot convert MRSC ↔ MREC after creation. Pick the right mode at table creation or rebuild.
Serverless operations No instances to size, patch, or back up. No VPC for DDB itself (DAX needs one). No connection pooling — every request is independent HTTPS. Per-request fees on top of capacity (on-demand) or 24/7 provisioned capacity charge. You still own capacity planning (provisioned), partition-key design, and access-pattern enumeration. "Serverless" eliminates ops on the box, not engineering on the schema.
Built-in security Encryption at rest (AWS-owned KMS by default, customer-managed KMS opt-in). TLS in transit. IAM resource-level + attribute-level policies. VPC endpoints (PrivateLink) for traffic isolation. KMS adds ~10-30ms to cold reads if using CMK with low usage. PITR + backups also encrypted. Fine-grained access (deny user X from item where userId != $caller) is doable via IAM conditions on dynamodb:LeadingKeys, but is non-obvious and easy to get wrong. Audit early.
Multi-AZ availability Every partition is 3-way replicated synchronously across 3 AZs in the region. Auto-failover on AZ loss within seconds. Always-on cost of 3 replicas (priced into storage rate). "Multi-AZ" is the default and not optional. There is no single-AZ DynamoDB mode. This is why DDB doesn't have a "high-availability" upsell — every table is HA.
Continuous backups (PITR) Point-in-time recovery to any second in the last 35 days. Restored to a new table. ~$0.20/GB/month for PITR storage. Restore time scales with table size. Restore creates a new table, not in-place. App needs a swap pattern (alias / config flip / reroute). Tabletop the failover before you need it.
DynamoDB Streams 24-hour CDC log of item-level changes. Ordered per partition key. Consumed by Lambda or via KCL. Four view types: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES. One stream shard per table partition. Shard limit = partition count. 24h retention only. 1:1 mapping of partitions to stream shards means a hot partition = hot shard = single Lambda concurrency bottleneck. Use Parallelization Factor (up to 10) for higher concurrency per shard. For >24h retention or pure event-source, use Kinesis Data Streams adapter.
Time to Live (TTL) Mark an attribute as TTL; DDB asynchronously deletes items past that epoch timestamp. Deletes also flow through Streams (with userIdentity: dynamodb). Free — no WCU charged for TTL deletes. Best-effort timing: usually within 48h of expiry, not instant. If your billing logic depends on TTL having actually deleted by a deadline, you will be sad. Treat TTL as a janitor, not a scheduler. Use a TTL stream consumer if you need a hard event at expiry.
Strong Consistency option Per-read flag (ConsistentRead: true) that routes the read to the partition leader, guaranteeing all acked writes are visible. 2x RCU cost. ~5-10ms higher p99. Unavailable on GSI. Returns errors during ~5-15s leader re-election windows. See dedicated Strong Consistency section below for what "the leader" actually is and what happens to "read replicas" in this mode.
ACID transactions TransactWriteItems / TransactGetItems: up to 100 items / 4MB across any partitions in a region. Two-phase commit with conditional checks. 2x WCU/RCU. Single-region only. Not supported under MRSC Global Tables. p99 10-30ms. Use sparingly. A 100-item txn touching 100 different partitions multiplies p99 latency variance and quintuples the probability of a conditional-check failure.
Query operation B-tree range scan within a single partition key. Supports KeyConditionExpression on PK (equality only) + SK (range). 1MB result page cap; paginate via LastEvaluatedKey. Costs sum-of-page-size RCU. Filter expressions run AFTER the page read; they reduce returned rows, not RCU consumed. Push selectivity into the SK / GSI design, not into FilterExpression.
Scan operation Full-table read. ParallelScan supports segmenting work across N workers. Reads every byte = pays RCU for every byte. Filter does not reduce cost. If you're scanning regularly, your access patterns are wrong or your tool is wrong. Either redesign with the right GSI or move analytics to Athena over S3 export.
Primary Keys (PK / PK+SK) PK alone = simple key (UUID, userId). PK+SK = composite key supporting Query, item collections, time-series. PK determines partition placement; SK determines storage order within the partition. Both PK and SK are immutable after item creation. Schema decision for the life of the table. The single most consequential design decision. Composite keys with overloaded SK values (PROFILE#, ORDER#2024-06-01, POST#) are the foundation of single-table design.
Global Secondary Index (GSI) Independent table replicating from base. Different PK/SK. Own RCU/WCU. Up to 20 per table (soft). Every base write replicates to all GSIs; GSI writes consume base-table WCU. Eventually consistent only. ~1-2s lag typical. GSI write amplification is real: 4 GSIs = 5x write cost. Sparse indexes (project only items with the indexed attribute) are the lever for cost. Use them.
Local Secondary Index (LSI) Same PK as base, different SK. Created at table creation only (can't add later). Supports strongly-consistent reads. Shares partition capacity with base. Caps each item collection at 10GB. The partition can no longer split when LSI present. Most production teams avoid LSI for the 10GB cap. The only reason to use one: you genuinely need strongly-consistent reads on an alternate sort key for a bounded data set.
Capacity Modes Provisioned (you set RCU/WCU, optional auto-scaling 0-N) or On-Demand (per-request pricing, scales instantly). On-demand is ~7x the per-request cost of provisioned at full utilization. Provisioned charges 24/7 regardless of usage. Switch limit: once every 24h per table. Common pattern: launch on On-Demand to discover traffic shape, switch to Provisioned + auto-scaling once stable.

Strong Consistency Mode — What Actually Happens to Replicas

People reach for the term "read replicas" because that's the Postgres mental model. DynamoDB does not have read replicas in the Postgres sense. It has three storage replicas per partition, all maintained synchronously via Multi-Paxos, and one of them is currently the leader. Strong consistency is not "skipping the read replica" — it's "you must hit the leader."

The replica topology, concretely

Client GetItem(ConsistentRead=true) Request Router LEADER REPLICA · AZ-a ✓ Has all acked writes (Paxos) ✓ Serves strongly-consistent reads ✓ Serves writes ✓ Serves eventually-consistent reads Holds leader lease (~few seconds) Renews via Paxos heartbeat FOLLOWER · AZ-b ⚠ May lag a few ms behind leader ✗ Does NOT serve strong reads ✗ Does NOT serve writes ✓ Serves eventually-consistent reads Receives WAL from leader, applies asynchronously to B-tree FOLLOWER · AZ-c ⚠ May lag a few ms behind leader ✗ Does NOT serve strong reads ✗ Does NOT serve writes ✓ Serves eventually-consistent reads Receives WAL from leader, applies asynchronously to B-tree ConsistentRead=true → leader only followers refuse strong reads Cost of forcing leader-only reads: 2x RCU · 5-10ms higher p99 · returns errors during ~5-15s leader re-election · unavailable across regions (until MRSC) · unavailable on GSI ever

What happens in each failure mode

Scenario Strongly-Consistent Read Behavior Eventually-Consistent Read Behavior
Steady state, all replicas healthy Routes to leader. Sees latest acked write. Adds ~5-10ms vs eventual. Routes to any replica (often closest). p99 2-5ms. May see up to ~1s stale data in the worst case.
Leader replica fails Returns 500 / InternalServerError for ~5-15s during Paxos leader election. Reads briefly fail. Continues working from healthy followers. No interruption.
Network partition isolates leader from majority Leader steps down (loses lease). New leader elected from the majority side. Reads on the minority side fail. Minority-side replicas still serve eventually-consistent reads (potentially stale). Majority side normal.
AZ failure (1 of 3 replicas down) If leader was in failed AZ, ~10s blip then new leader elected in healthy AZ. If leader survived, no impact. No interruption — reads route to surviving replicas.
GSI read Not available at all. ConsistentRead=true is silently ignored on GSI; you always get eventually-consistent. GSIs are replicated asynchronously. Default. Lag ~1-2s typical, can spike under heavy base-table write load.
Cross-region read (MREC Global Tables) Strong consistency is scoped to a single region's leader. A strongly-consistent read in us-east-1 does not see writes from eu-west-1 until they replicate (~1s). Same — eventually-consistent within region, replication-lag stale across regions.
Cross-region read (MRSC Global Tables, Jun 2025) Strong consistency across all 3 MRSC regions. Reads see latest committed write from any region. Cost: writes require cross-region quorum, adding 30-100+ms latency. Available, lower latency than strong MRSC reads. Same eventually-consistent semantics within and across regions.
The PE Mental Model

Think of strong consistency in DynamoDB as "give me the most-recent-committed-write, even if I have to talk to the one specific replica that knows." The price is paid in three currencies: money (2x RCU), latency (~5ms penalty), and availability (errors during leader re-elections). All three matter under failure conditions.

For 80% of read paths, eventual consistency is correct and you're paying a tax for safety you don't need. The 20% where strong consistency is genuinely required: inventory check before sale, balance check before debit, configuration read after a known recent write. Be deliberate about which paths get the flag.

Common Misuse: Read-Your-Own-Writes via Strong Consistency

If your reason for ConsistentRead=true is "I just wrote it and want to see my own write back," there is a cheaper fix: cache the write locally in the client. You wrote it, you already know its content. Don't pay 2x RCU to ask DDB to confirm what you just told it.

Trade-Offs

Ten distinct trade-offs at PE depth. Each row is something where you genuinely give up X to get Y; "fast" is not a trade-off, "predictable latency at the cost of bimodal expensive-when-idle pricing" is.

Trade-Off What You Gain What You Give Up When It Bites You PE Nuance
Partition key drives every access pattern Predictable p99 latency under 10ms at any scale — no query planner uncertainty. All access patterns must be enumerated before writing code. Ad-hoc analytics impossible without separate pipeline. The first time PM asks for a new query post-launch and the answer is "we need to add a GSI" or "we need a new table." Single-table design with overloaded keys is the only escape, and most teams discover it too late. Every new access pattern is a schema decision, not a query decision.
Multi-Paxos under shared-nothing partitions Linearizable writes, strongly-consistent reads available, automatic failover in seconds. Strongly-consistent reads route to leader only; 2x RCU; ~5-10ms latency adder; brief errors during leader re-election. Auto Admin re-elects a leader after AZ event — strongly-consistent reads on that partition return 500s for 5-15s. Eventually-consistent reads keep working. The leader-only-for-strong-reads property is why DDB Streams ordering is per-partition-key (not table-wide) and why GSIs cannot ever offer strong consistency.
Constant-work pattern over efficient-when-idle No bimodal failure: a quiet system and a busy system behave identically. Recovery from outages is graceful. Idle clusters cost almost as much as busy ones. Provisioned capacity charges 24/7 regardless of utilization. Dev/staging tables on Provisioned mode at 100 RCU/100 WCU still cost ~$80/mo even with zero traffic. Multiply by every environment. This is the deliberate engineering choice that makes DDB latency predictable. It is also why everyone says "DDB is expensive" — you're paying for predictability, not for usage.
400KB item size cap Bounded read/write cost per request. Predictable WCU consumption. Network round-trip stays small. Cannot store images, large documents, embeddings, or attachments inline. Forces two-system designs. The day someone adds a base64-encoded thumbnail to your item and writes start failing with ValidationException. Store blobs in S3 with the S3 key in DDB. Now you own consistency across two systems: orphan blobs after failed writes, missing blobs on race conditions. Use S3 lifecycle + DDB conditional writes to manage it.
GSI write amplification Multiple access patterns from one base table; no application-layer maintenance. Every GSI multiplies write cost. 4 GSIs = 5x WCU per base write. Adding a "new filtering capability" GSI to an existing high-write table — costs quintuple overnight, and the GSI backfill itself eats hours-to-days of additional WCU. Sparse GSIs (project only items where the GSI attribute exists) and partial GSIs (project subsets of attributes) are the levers. Use them to keep GSI write cost ≤2x base.
Eventually-consistent reads by default 2x throughput from same RCU; reads served from any replica; survives leader failures. Up to ~1s of staleness in the worst case; reads of just-written data may return old value. A user updates their profile and immediately reads it back — sees the old value. PM files a "data not saving" bug. You explain eventual consistency. Most read-after-write paths don't need strong consistency: cache the write locally in the client. The legitimate strong-read use cases are inventory decrements, balance checks, and config-read-after-known-recent-write.
Provisioned vs On-Demand capacity Provisioned is ~7x cheaper at full utilization; On-Demand handles unpredictable spikes without ceremony. Provisioned throttles unpredictable spikes; On-Demand bills hostilely for steady-state traffic. Switch limited to once / 24h per table. Black Friday on a Provisioned table that didn't auto-scale fast enough — throttles all morning. Or a steady 50K QPS workload on On-Demand — $30K/mo bill instead of $4K. Pattern: launch on On-Demand to learn traffic shape, switch to Provisioned + auto-scaling once stable. Set min capacity at the actual minimum; auto-scale up; ceiling at the budget cap.
Hard partition ceilings (3000 RCU / 1000 WCU) Predictable per-partition performance; multi-tenancy guarantees fairness across customers. No matter how much total table capacity, a single partition key cannot exceed these limits sustainably. Viral content (single video, single hot product, celebrity user) generates a hot key — table-wide RCU is at 5% but the hot partition is throttling. Adaptive Capacity and Split-for-Heat fix this in ~5-30 minutes. For predictable hotness (Super Bowl ad, product launch), write-shard the key at design time: append a random 0-N suffix, fan out on write, gather on read.
DAX caching layer Reads from µs (vs ms); reduced RCU cost; protects DDB from read storms. 3+ node 24/7 cluster cost; eventually-consistent only; opaque to Global Tables replication; cache coherency held by 5-min TTL only. You write to DDB directly from another service (skipping DAX) and your DAX-fronted reads return stale data until the next TTL expiry. DAX is a write-through cache, but only for writes that go through it. The simplest mistake is having two write paths: one through DAX, one direct to DDB. Pick one.
Global Tables: MREC vs MRSC MREC: low latency local writes, eventual cross-region. MRSC (Jun 2025): RPO=0, strong consistency across 3 regions. MREC: LWW conflict resolution silently overwrites; MRSC: cross-region write latency (30-100+ms); locked to 3 regions; no transactions. MREC: dual-write race between us-east-1 and eu-west-1 silently loses the earlier write. MRSC: writes appear slow until you realize it's cross-region quorum, not a bug. Mode is fixed at table creation; cannot convert MREC ↔ MRSC after creation. Pick deliberately. MRSC is the right choice for financial / compliance workloads; MREC is right for most read-heavy global apps.

Use Cases

Eight production use cases with named operators, scale dimensions, and the specific property that ruled out the obvious alternative. Each maps to a customer story expanded in the Case Studies section.

Use Case Company / Scenario Driving Property Scale Dimension Why Not Alternative
Media bookmarks & watchlists with cross-device sync Disney+ (Hulu, ESPN+) Single-digit ms reads for "resume from where I paused" across globally-replicated devices, with billions of writes/day at unpredictable launch shape. 3B+ daily content API requests at launch; bookmarks + watchlists across >100M subscribers globally. Cassandra would have required a 5-engineer ops team; Aurora Global wouldn't have handled the write skew and active-active replication out of the box.
Catalog, subscriptions, rentals, playback tracking Amazon Prime Video Predictable p99 across heterogeneous workloads (catalog browsing, entitlement checks, real-time playback events) without per-workload tuning. ~200M Prime members globally; catalog spans 100Ks of titles; playback events at peak Super Bowl traffic. RDS Aurora hits write-scaling walls at this throughput; a relational design forces denormalization that defeats the schema benefits.
Real-time ML feature store for personalization Netflix Sub-10ms feature lookups during request-time inference; CDC fan-out to downstream personalization services via Streams. 260M+ subscribers; ~700 microservices; trillions of Kafka messages/day; ML model serving needs feature reads in <5ms. Cassandra (used elsewhere at Netflix) is heavier ops; Redis lacks durability story for the source of truth.
High-frequency GPS tracking across cities Lyft (ride-tracking system) Burst write capacity for thousands of concurrent rides, each writing GPS coordinates every few seconds; multi-city geographic partitioning. Millions of rides daily across hundreds of cities; sustained >100K writes/sec at peak. Self-managed Cassandra cost engineering headcount Lyft didn't want to allocate; PostGIS doesn't scale write-side to this throughput.
Multi-region active-active for resilience-mandated apps Capital One (mainframe migration to serverless) Application failover time reduced 99% via active-active multi-region; financial-grade durability and audit. Cross-region read/write workload from US East to US West with zero RPO target. Aurora Global Database is multi-region but not active-active (failover required); Aurora Multi-Master was single-region. Only DDB Global Tables (now MRSC) offered both.
Massive learning-content state for an education app Duolingo Storage for 30B+ learning objects with spiky read patterns (lesson exercises) and predictable p99. ~50M DAU; 24K read units/sec, 3.3K write units/sec sustained; 31B+ items in a single table. Self-managed databases cost 3-5 engineers; the cost of an ops team exceeded the cost of DDB at their scale until very late.
Web-scale session storage with web/mobile lookups Airbnb (session store, real-time lookups) Read-heavy workload (every page view = session lookup) with global users and low p99 SLO. Hundreds of millions of sessions across regions; every request reads session state. Redis Cluster fronted by ElastiCache trades durability for speed; the persistence + consistent multi-region story is harder than DDB.
Event-driven fan-out for serverless architectures Mid-to-large serverless shops (broad pattern) Streams → Lambda turns every item change into an asynchronous trigger, removing the need for a separate event bus for many use cases. Order-update fan-outs in e-commerce, audit-log fan-outs in fintech, denormalization fan-outs in CQRS architectures. Kafka + a separate CDC connector is more powerful but adds infra. Streams is "good enough" CDC with zero infra for <24h retention needs.
Transactional financial ledgers, multi-region Critical payment systems (ISO 20022 platforms) Strongly-consistent reads on transaction state within a region, plus Global Tables for cross-region status replication, with audit-grade durability. Tens of thousands of transactions/sec; legal/regulatory durability requirements. Stronger alternatives like TigerBeetle are early-stage; DDB + Global Tables is the conservative, vendor-supported path.

Limitations

The hard ceilings and architectural constraints. Severity reflects how often they actually block real production workloads, not whether they're documented.

Limitation Severity Workaround Workaround Cost
400KB item size limit Critical Store payload in S3, store S3 key + metadata in DDB. Two-system consistency burden, ~$0.0004/GET on hot reads, orphan-cleanup logic, eventual consistency between DDB and S3.
1MB query/scan result page limit Medium Paginate via LastEvaluatedKey; tune page size with Limit parameter. Multi-page latency hides; clients must handle cursors correctly to avoid skipped items on backend changes.
No server-side joins or relational queries High Single-table design with overloaded keys + composite SK values; or denormalize and tolerate duplication. Schema rigidity; "new join" is a schema migration. App-layer joins double round trips.
Per-partition 3000 RCU / 1000 WCU ceiling High Write sharding (suffix-randomized PK); rely on Adaptive Capacity for natural skew. Write sharding complicates reads (fan-out gather); Adaptive Capacity takes 5-30 min to react to viral spikes.
GSI cannot be strongly consistent High Read base item for strong consistency; use GSI only for eventually-acceptable lookups. Double-read pattern; or hybrid index where authoritative state lives on base item, GSI used for discovery only.
LSI caps item collection at 10GB (no split) Critical Avoid LSIs entirely; use GSI even though it costs strong consistency on the alternate key. The Authoritative-strong-read-on-alt-key use case has no clean answer in DDB. Re-architect or accept eventual consistency.
Transactions limited to 100 items / 4MB / single region Medium Saga pattern with compensating actions for multi-region or larger txns; outbox pattern with Streams. Saga complexity; compensating actions are application logic, not DB-enforced; outbox needs idempotent consumers.
DDB Streams retention is 24h only Medium Use Kinesis Data Streams adapter (365d retention) or persist Stream output to S3. Kinesis adds infra; S3-backed replay is one-off ETL effort. Lambda failures past 24h = unrecoverable events without one of these.
1:1 mapping of partitions to Stream shards High Increase Stream Parallelization Factor (up to 10 concurrent batches per shard); or accept hot-shard limits. Per-shard ordering is preserved only at Parallelization Factor 1. Higher values lose ordering for the sake of throughput.
Scan is linear, expensive, and unaware of indexes High Export to S3 via DDB Export-to-S3 (PITR-backed); query with Athena. Or design proper GSI. Export takes minutes; data is point-in-time, not live. Athena queries add seconds-to-minutes latency.
On-demand cost is ~7x provisioned per request Medium Switch to Provisioned + Auto Scaling once traffic shape is understood. Once per 24h switch limit; if you switch wrong, you wait. Auto Scaling reacts in minutes, not seconds (insufficient for flash sales).
DAX only supports eventually-consistent reads Medium Strong reads bypass DAX entirely (pass through to DDB). Use DAX only for eventually-consistent traffic. Mixed-consistency paths split traffic: cache hit rate drops; reasoning about consistency becomes per-call.
MRSC locks to exactly 3 regions, no transactions High For 2-region active-active: use MREC + application idempotency. For multi-region txns: saga across regions. Mode is set at creation, not changeable. Picking wrong = rebuild table. No reasonable mitigation for needing both MRSC and transactions.
No full-text search High Stream to OpenSearch via Lambda; or use DDB contains filter (still scan-based, doesn't reduce RCU). Two-system sync; OpenSearch ops; eventual consistency between search and source.

Fault Tolerance

Single-tech layout. Each dimension is the behavior of the system and the operational reality of running it through that behavior.

Dimension Behavior Operational Reality
Replication model 3x synchronous via Multi-Paxos across 3 AZs within a region. WAL persisted on majority (2 of 3) before client ack. You have no knobs. There is no single-replica DDB. Every table is 3-way replicated whether you need it or not.
Failure detection Multi-Paxos heartbeats from leader; lease-based mechanism (leader periodically renews lease via Paxos). Detection is sub-second internally, but the lease expiry is what triggers re-election (typically ~5-10s).
Failover mechanism Automatic via Auto Admin. New leader elected from healthy replicas. RR routing tables updated via MemDS within seconds. Client sees brief errors on strongly-consistent reads/writes; eventually-consistent reads keep working. SDK retries (default exponential backoff) usually mask it.
RTO (typical) Single-partition leader failure: ~10-15s blip for that partition. AZ failure: seconds for unaffected partitions, ~30s for partitions whose leader was in the failed AZ. Effective RTO from app perspective is the SDK retry budget. With default retries, most reads recover transparently.
RPO (typical) Within-region: 0 (synchronous WAL quorum). Cross-region with MREC: ~1s replication lag. Cross-region with MRSC (Jun 2025): 0. RPO=0 within a region holds because WAL persists to 2 of 3 replicas before ack. The only data-loss scenario is a correlated multi-AZ outage that loses 2+ replicas before WAL shipped to S3.
Split-brain behavior Multi-Paxos prevents split-brain by quorum requirement. Minority side cannot achieve quorum, so cannot accept writes. A network partition that isolates the leader from the majority causes leader to lose its lease; new leader elected on majority side. Minority side serves only eventually-consistent reads.
Blast radius of single-node failure A storage node hosts many partitions. Its failure affects only those partitions, each for ~10-15s. Other partitions (on other nodes) untouched. This is why DDB doesn't have "the DB is down" outages — failures are partition-scoped, not table-scoped or service-scoped.
Cross-region failover story (MREC) App must route traffic to surviving region (DNS / Route 53 / app-level). Data is already there via async replication; in-flight writes may be lost (RPO ~1s). Requires app-side resilience: idempotent writes, retry budgets, Route 53 health checks. App's failover playbook is non-trivial.
Cross-region failover story (MRSC) RPO=0 — all writes durable across all 3 regions before ack. App reroutes traffic; reads continue with strong consistency from surviving region. The trade is write latency: every write pays the cross-region quorum cost (typically 30-100+ms depending on region geography). For some workloads this is fine; for hot loops it's prohibitive.
Data loss scenarios Documented: simultaneous loss of 2+ replicas in different AZs before WAL ships to S3 archive (rare but theoretically possible). S3 archival is the durability floor. Even if all 3 in-region replicas are lost, the WAL exists in S3 with 11 nines of durability. The recovery is slow (hours) but data is intact.
Recovery from logical corruption PITR restores any table to any second in the last 35 days. Restored to a new table. Restore is slow at large scale (TB-sized tables take hours). App must swap to restored table (alias / config / new endpoint). Test the swap before you need it.

Sharding

How DynamoDB physically distributes data. This is the substrate that makes "single-digit ms at any scale" possible — and the source of hot-partition pain when access patterns skew.

Dimension Behavior Operational Reality
Sharding model Hash partitioning on the partition key. MD5(PK) determines which partition. Range partitioning within a partition by sort key. You cannot control which partition a key lands on. You can only control PK design (cardinality, distribution).
Shard key constraints PK is set at item creation and immutable. Max 2KB for the PK string. Required on every item. Changing PK = rewriting the table. Plan PK like you plan a database schema in a relational world: it's a permanent decision.
Rebalancing mechanism Automatic via Auto Admin. Two triggers: Split-for-Size (partition exceeds 10GB) and Split-for-Heat (sustained high traffic on a partition). Both happen transparently. No client-side change needed. No throttling during the split itself, though the partition may have been throttling beforehand.
Rebalancing cost / impact Split takes 5-30 minutes. During the split, the source partition handles all traffic; the new partition is being populated. The split is invisible to clients but the lead-up is not: traffic was hitting the per-partition cap (3000 RCU / 1000 WCU), so the partition was throttling for ~5-30 minutes before Auto Admin reacted.
Hot-shard behavior Adaptive Capacity boosts the hot partition's effective limits within minutes. Split-for-Heat splits the partition for longer-term relief. Adaptive Capacity has a 5-30 minute reaction time. For sustained viral hotness, this is fine. For flash events (Super Bowl ad, breaking news), it's too slow.
Maximum shards (practical) Effectively unlimited. Tables with 100K+ partitions exist in production. Cost scales linearly. The limit is operational, not technical: at very high partition counts, scans become absurd, GSI propagation gets distributed across more shards (good for write fan-out), and Streams shard count balloons.
Resharding without downtime? For capacity scaling: yes, automatic. For PK schema change: no — must rebuild the table. "Rebuild" pattern: dual-write to old + new tables for migration window, backfill via DDB Streams or full-table scan, cut over reads, retire old table. Several weeks of work for a large table.
Cross-shard query support Scan (every partition, expensive) or GSI (same item replicated in a separate partition space by GSI key). Scan is the wrong answer in production. GSI is the right answer, but adds write amplification. A third option: stream to S3 + Athena for offline analytical queries.
Per-partition limits 10GB storage, 3000 RCU/sec, 1000 WCU/sec. These are hard ceilings before adaptive capacity kicks in. Adaptive capacity can boost a partition above these limits temporarily, but Split-for-Heat is the durable fix. For workloads that consistently exceed: write-shard the PK.
Partition isolation Each partition is independent. A hot partition's throttling does not affect other partitions in the same table. This is the basis for the "predictable latency at scale" pitch. Multi-tenant tables work because hot-tenant traffic doesn't cascade to other tenants.

Replication

In-region (Multi-Paxos, synchronous) and cross-region (MREC async or MRSC sync). Each dimension covered for both modes.

Dimension In-Region Cross-Region (MREC) Cross-Region (MRSC, Jun 2025)
Topology Leader-follower per partition, 3 replicas (1 leader + 2 followers) across AZs. Multi-leader (multi-active). Every region has its own leader for the partition; replication is asynchronous between regions. Multi-leader with a multi-region journal coordinating writes. Strong consistency guaranteed across the 3 regions.
Sync vs async Synchronous via Multi-Paxos. Writes ack only after WAL persisted to quorum (2 of 3). Asynchronous. Local write acks immediately; replication to other regions happens in background. Synchronous across regions. Writes ack after multi-region journal commit.
Replication factor 3 (not configurable). One leader + 2 followers across 3 AZs. Configurable: pick any 1-N regions. Each region has its own 3x replication internally. Fixed at 3 regions, or 2 regions + 1 witness region. Cannot be 2-region or 4+region.
Consistency level options Strong (leader read, 1 RCU) or eventually (any replica, 0.5 RCU). Per-read flag. Within each region: same as in-region. Across regions: eventual only. Strong consistency across regions (per-read flag, 1 RCU). Eventual cross-region also available.
Replication lag (typical) WAL propagated to followers within ms; followers apply to B-tree within ms after. p99 lag ≤1s. p50 cross-region lag: <1s. p99: a few seconds. Spikes possible during cross-region network events. Lag is part of write commit — no async lag visible to readers. Writers absorb the latency (typically 30-100+ms depending on regions).
Conflict resolution None needed — single leader per partition. Last-Writer-Wins (LWW) by item-level timestamp. Silent; no notification of the loser. No conflicts possible — multi-region journal serializes writes globally.
Cross-region replication N/A — in-region only. Active-active, eventually consistent. Standard Global Tables since 2017. Active-active, strongly consistent. GA June 2025.
Replication during partition Minority side rejects writes (no quorum). Majority side continues. New leader elected from majority. Each region operates independently during a cross-region partition. Writes accepted locally; conflicts resolved via LWW once partition heals. Regional partition: surviving 2 regions form quorum, writes continue. Isolated region cannot accept writes (RPO=0 means it cannot diverge).
What happens during region failure N/A — region-level failure is a cross-region concern. Surviving regions continue to serve traffic. App must reroute. ~1s of in-flight writes may be lost (RPO). Surviving regions continue with strong consistency intact. Zero RPO. App reroutes seamlessly.
Cost multiplier Baseline. Replicated WCU: 1 rWCU per region per write. 3-region MREC table = 3x WCU cost. Same as MREC (per-region rWCU). Latency, not cost, is the trade.

Better Usage Patterns

Eight patterns that separate teams who run DDB well from teams who run it expensively. Each row is something that compounds at scale — get it right and it pays for itself many times over.

Pattern What Most Teams Do Wrong The Better Way Why It Matters
Single-table design (when access patterns are stable) Create one table per entity type (Users, Orders, OrderItems, Posts) then discover DDB doesn't do joins. Enumerate access patterns first. Build one table with PK/SK overloaded with type prefixes (USER#42, ORDER#2024-06-01#abc). Item collections retrieve heterogeneous data in one query. Cuts reads-per-screen from N round-trips to 1. Compounds at scale: latency, cost, error rate all improve. Hard to retrofit once you're 10 tables deep.
Write-shard hot partition keys at design time Use productId as PK and discover one viral product takes 80% of writes. Adaptive Capacity reacts in 5-30 minutes. For known-hot keys, append a 0-N suffix (PRODUCT#42#0 through PRODUCT#42#9). Fan-out on write to random suffix; gather on read across all N. Trades read complexity for write headroom. The right call when you have known viral content or known celebrity entities. Use sparingly — adds N-way fan-out reads everywhere.
Sparse and partial GSIs Project every attribute into the GSI. Pay full GSI cost on every base write. Sparse GSI: only items with the indexed attribute land in the GSI. Partial projection: only the subset of attributes you'll actually query. Often 50-80% cost reduction. GSI cost can be the dominant line item on a DDB bill. Sparse projection commonly drops GSI WCU by half. Re-evaluate quarterly.
Provisioned + Auto Scaling over On-Demand for steady workloads Run On-Demand "forever" because it's easier. Pay 7x for steady-state traffic. Launch On-Demand to learn traffic shape. Switch to Provisioned once you know p50/p95 RCU/WCU. Set auto-scaling target around 70% utilization. For a workload with $30K/mo On-Demand bill, switching to Provisioned typically lands at $4-5K. Free engineering 20-30 hours; ROI in days.
Read-modify-write via conditional expressions instead of optimistic locking in app Read item, mutate in app, write back, hope nothing changed in between. Race conditions in production. Use UpdateItem with ConditionExpression: "version = :expected_version" and SET version = version + 1. Atomic at DDB. Removes a class of race condition entirely. Failed condition is a clear retry signal; the alternative is silent data corruption.
TTL for transient data + cleanup-via-Streams for side effects Run a nightly Lambda that scans the table and deletes old items. Pays full scan RCU on every run. Set TTL attribute at write time. Free deletes. If side effects needed (e.g., publish "session expired" event), consume the TTL delete via Streams. Saves 100% of the cleanup cost. TTL deletes are zero-cost. Streams give you the event trigger if you need it.
Conditional writes for idempotency in event-driven systems Lambda processes an SQS message twice (at-least-once delivery), writes the same record twice, double-counts. Include event_id in the item. Use ConditionExpression: "attribute_not_exists(event_id)". Duplicate event is a no-op. Idempotency-at-the-database is the cleanest layer to put it. Removes the need for application-layer dedup tracking.
BatchWriteItem and BatchGetItem with parallel partition spread Loop over items one-at-a-time with PutItem. Pay full per-request fee on every item. Batch up to 25 writes / 100 reads. Distribute across partitions in the batch (sequential PKs are an anti-pattern; randomize order to avoid throttling one partition). ~25% per-request fee savings; parallel partition spread avoids batch-internal hot spots. Critical for high-throughput ETL.
Use DDB Export to S3 + Athena for analytics, never Scan "Run an analytics query" turns into Scan with a filter expression. Pays for every byte in the table. Export to S3 (PITR-backed, no RCU consumed). Query with Athena over Parquet. Or stream via Kinesis Firehose for near-real-time. Eliminates the worst-cost path. Export-to-S3 is essentially free (no RCU); Athena queries are pennies. A 1TB DDB scan is $130; the equivalent Athena query is $5.
Pre-warm partitions before known-hot launches Empty table for product launch. First wave of traffic exceeds single-partition limits; Auto Admin reacts too late. Write synthetic data to the table at high WCU for ~20-30 minutes before launch. Forces Auto Admin to pre-split the table. This is the Disney+ launch playbook. Without it, the launch hour is a cascading throttle event. With it, the table absorbs the surge.

Advanced / Next-Gen Alternatives

What to watch, what to migrate to, and what does it better for specific workloads. DDB has remarkable longevity (14 years and still architecturally fresh), but some workloads have outgrown its model.

Successor / Alternative What It Improves Maturity Migration Cost When To Consider
Amazon Aurora DSQL Multi-region active-active with strong consistency AND full SQL. PostgreSQL-compatible. Eliminates the "I need DDB scaling but I also need joins" tension. GA 2025 High — different data model (relational), but PostgreSQL skills transfer. Requires schema design from scratch. When your access patterns are diverse and unpredictable, or when ad-hoc analytics is core to the workload. Also when you have PostgreSQL expertise on the team and adding DDB ops mindshare is a tax.
Spanner / CockroachDB / YugabyteDB Distributed SQL with multi-region strong consistency. Full SQL with joins, transactions, indexes. CockroachDB and Yuga are open-source / self-hostable. Production High — SQL schema design, ORM compatibility usually requires care. Spanner is GCP-only. Multi-region OLTP with strict serializability. Financial workloads where MRSC's lack of transactions is a non-starter.
FoundationDB ACID transactions across the entire keyspace (not capped at 100 items). Layered data model (record layer, doc layer, queue layer) on a strict-serializable ordered KV core. Production (Snowflake uses it for metadata; Apple uses it for iCloud) Very high — self-hosted; significant ops investment; no managed offering. When you need DDB-style key-value semantics PLUS unbounded transactions PLUS open source. Significant ops team required.
TigerBeetle Purpose-built for financial ledgers: double-entry accounting primitives, deterministic execution, strict serializability at >1M txn/sec. Emerging (1.0 released 2024) Very high — replaces the ledger entirely; opinionated data model; no SQL. When your DDB use case is specifically a financial ledger and eventual consistency on GSI / LWW conflict resolution starts to read as a correctness problem.
ScyllaDB Cassandra-compatible with DDB-compatible API. Self-hosted or managed. Often 2-10x lower TCO at high QPS due to C++ shard-per-core architecture. Production (Discord migrated trillions of messages from Cassandra to ScyllaDB) Medium-high — different ops model; managed offering exists; DDB-compatible API lowers app-side change. When DDB cost crosses ~$200K/year and you have the ops headcount to consider self-managing. Or when DDB's hard partition limits no longer fit your scale.
Redis / KeyDB / Valkey (with persistence) Sub-millisecond latency (vs single-digit ms in DDB). Richer data structures (sorted sets, hyperloglog, streams). Lower per-op cost at high throughput. Production Medium — different consistency model (often eventual or weak), but app-side patterns translate well. When you specifically need µs latency and your data fits in memory. DDB fronts a database; Redis fronts a cache (with durability options that work but are not the strongest in the industry).
Event-sourcing + materialized views pattern Architectural rather than technological. Append-only log (Kafka, Kinesis, EventStore) plus query-side materialized views. Solves DDB's "every new query is a schema change" problem. Production (Netflix, LinkedIn, many event-sourced fintechs) High — different conceptual model. Pays back at 5+ access patterns or when audit requirements are strict. When you find yourself adding GSIs constantly, or when audit / replay capability matters more than ms-latency reads. Often complements DDB rather than replaces it.

When To Use Which

Decision matrices for the high-stakes choices: query vs scan, GSI vs LSI, on-demand vs provisioned, MREC vs MRSC, DAX vs ElastiCache, and DDB vs alternatives.

Query vs Scan

SituationUseWhy
You know the partition keyQueryBound to a single partition. Single B-tree traversal. Cheap and fast.
You know a sort-key rangeQuery with KeyConditionExpressionRange-scan within the partition; native B-tree operation.
You need to filter by an attribute (not the SK)Add a GSI on that attribute, then Query the GSIFilterExpression on a Scan or Query reads then discards; pays full RCU.
You need cross-partition analyticsExport to S3 + AthenaScan is linear in table size; Athena over Parquet is cheap and parallelizable.
One-off table dump for migrationParallelScan with 8-16 segmentsScan is fine for one-time work; just budget the RCU explicitly.

GSI vs LSI

PropertyGSILSI
Partition keyDifferent from baseSame as base
Sort keyAny attribute (including same as base SK)Different from base SK
Can be added later?YesNo (creation-time only)
Strongly-consistent reads?NoYes
10GB collection cap?No (partitions split)Yes (cannot split)
Own RCU/WCU?Yes (independent)No (shared with base)
PE RecommendationDefault. Use this.Only when you genuinely need strongly-consistent reads on an alternate SK AND know the collection will stay <10GB.

On-Demand vs Provisioned

Workload ShapeUseReason
Unknown / brand-new launchOn-DemandEliminates the "guess the WCU" problem at launch. Pay for what you use.
Steady-state, predictableProvisioned + Auto Scaling~7x cheaper than On-Demand. Auto-scaling handles drift.
Spiky, unpredictableOn-DemandSpikes that exceed Provisioned + Auto-Scale's reaction time throttle. On-Demand absorbs.
Spiky but with known peakProvisioned with high ceilingPre-provision enough headroom for the known peak; cost is per-WCU regardless of utilization.
Dev / staging / lower environmentsOn-Demand or low ProvisionedLow traffic. Don't pay 24/7 Provisioned for a near-idle table.
Multi-region (Global Tables)Either, applied consistently across all regionsMode must match across replicas. Mixing modes is not supported.

Global Tables: MREC vs MRSC (New June 2025)

ConcernMREC (Eventually-Consistent)MRSC (Strongly-Consistent)
Write latencyLocal-region only (~5-10ms)Cross-region quorum (~30-100+ms)
RPO (data loss on regional failure)~1s0
Region countAny number, add/remove anytimeExactly 3 (or 2 + 1 witness), fixed
Conflict resolutionLast-Writer-Wins, silentNone possible — serialized
TransactionsSupported (single-region)Not supported at all
CostrWCU per regionSame rWCU model
Mode change after creationCannot convert to MRSCCannot convert to MREC
Use whenRead-heavy global apps, social/media, watchlists, sessionsFinancial transactions, inventory, anything where stale-read causes incorrect business decisions

DAX vs ElastiCache (Redis/Memcached)

PropertyDAXElastiCache
Backed by DDB APIYes (drop-in)No (different protocol)
Cache invalidationAutomatic on DAX-routed writes (write-through)App must invalidate explicitly
Eventually-consistent onlyYesN/A (independent of DDB)
Item cache + Query cacheBoth, separatelyJust KV; app builds query semantics
Supports TransactWriteItemsNoN/A
Aware of Global TablesNoNo
Use whenRead-heavy DDB workload with eventually-consistent semantics; same-region traffic only.Need custom caching logic (TTL, eviction, data structures); already using Redis elsewhere; need cross-region replication of cache.

Use DynamoDB vs Don't

WorkloadVerdictWhy
Known access patterns, key-value or simple-query lookups, >1K QPSDDBThe sweet spot. p99 latency, ops savings, predictable cost.
OLTP with complex joins and ad-hoc reportingAurora / Postgres / Aurora DSQLDDB will fight you on every new access pattern.
Full-text search / fuzzy matchOpenSearch (stream from DDB if needed)DDB contains is filter-based, not indexed.
Graph traversal (friend-of-friend, recommendation graphs)Neptune / TigerGraphDDB can store edges but cannot efficiently traverse them.
Time-series (high-cardinality metrics)Timestream / TimescaleDB / InfluxDB / PrometheusDDB works for time-bounded read patterns but lacks compression and downsampling.
Multi-region active-active with strong consistency AND transactionsSpanner / Aurora DSQLMRSC doesn't support transactions; this is a hard wall.
Sub-millisecond reads at >500K QPSRedis / Valkey / ScyllaDBDDB's p99 floor is ~2-5ms; below that, in-memory KV or shard-per-core is faster.
Embedded / single-host workloadSQLite / RocksDBDDB is multi-tenant managed service overhead you don't need.
Cost-sensitive at large scale ($200K+/yr)Consider self-managed ScyllaDB / CassandraDDB markup gets noticeable at scale; ops team headcount becomes the right trade.

Top Production Use Case Patterns

Seven canonical patterns. Each is "what DDB was specifically engineered to do well" — the workloads where it consistently outperforms alternatives on cost, latency, and ops headcount combined.

Shopping Carts & E-Commerce Catalogs

Transient cart items, product detail lookups, inventory state. This is literally the workload DDB was built to solve for Amazon.com in 2004.

PK = userId for carts (per-user partition), PK = productId for catalog. Sparse GSIs for "products in category X." On-demand for catalog browse, Provisioned for cart writes.

User Session Management

Billions of mobile/web sessions, auth tokens, JWT-refresh state. Predictable cost, fast invalidation via TTL.

PK = sessionId, TTL on expiration timestamp. Often paired with DAX for the read-heavy "is this session valid?" path. Global Tables MREC for multi-region apps.

Real-Time Leaderboards & In-Game State

Player inventory, matchmaking, in-game achievements, scoring metrics. Sub-10ms reads even at peak event traffic.

PK = playerId, SK = type (PROFILE, INVENTORY#, ACHIEVEMENT#). Atomic increments via UpdateExpression. Streams to fan-out leaderboard changes to a cached top-N.

Media Bookmarking & Watchlists

"Resume where you left off" — the bookmark timestamp written every few seconds during playback, read on every session start across any device.

PK = userId, SK = titleId. Global Tables for cross-device sync via region replication. Disney+, Hulu, Prime Video all use this pattern.

High-Frequency IoT Event Ingestion

Telemetry from millions of devices, unpredictable arrival, predictable storage cost.

PK = deviceId, SK = timestamp. TTL for retention windows. Streams to fan-out to anomaly detection. Lyft GPS tracking, Careem location data fit here.

Event-Driven Architectures (DDB → Lambda)

Streams trigger Lambda on every change. Becomes the de-facto event bus for many CQRS and saga implementations.

DDB write = command, Stream + Lambda = event projection. Idempotency via event_id in the item with attribute_not_exists condition.

Transactional Financial Ledgers

Low-latency audited entries, strongly-consistent reads for balance checks, Global Tables MRSC for cross-region durability (RPO=0).

TransactWriteItems for atomic debit/credit pairs. MRSC since Jun 2025 closes the cross-region story. Saga pattern for multi-step financial workflows.

Production Case Studies

Seven operators running DDB at billion-item / millions-of-QPS scale. Each story includes the specific access pattern, scale dimensions, and the optimizations they did along the way. These are the playbooks worth stealing.

Disney+ 3B daily content API requests · 10M+ launch-day signups · 100M+ global subscribers

Disney+ launched November 12, 2019 onto AWS, with DynamoDB as the source of truth for watchlists, bookmarks, recommendations, and caching of computed personalization output. On day one the platform absorbed 3 billion requests to the content APIs. Watchlists are backed by Global Tables (MREC at the time), so users on a phone in São Paulo see additions made from a TV in Mumbai within a second.

The bookmark architecture is the interesting one. The Disney+ player writes a bookmark (current playback timestamp) to a Kinesis stream every few seconds. A consumer ingests from Kinesis and writes to a regional DynamoDB Global Table replica. On session start, the player calls the Content API which reads the bookmark from the local Global Table replica, returning the resume position with single-digit-ms latency. Kinesis decouples the high-write-rate ingest from the read-from-anywhere serving — DDB doesn't see the raw firehose, just the consumer's compacted writes.

Optimizations Disney+ did:

  • Pre-partitioned tables before launch. They didn't know the exact traffic shape but knew it would be enormous. They ran high-WCU synthetic writes for ~20-30 minutes against empty tables, forcing Auto Admin to split partitions ahead of real traffic. This is the launch playbook.
  • Switched between On-Demand and Provisioned during launch week. On-Demand for the unknown ramp, Provisioned + Auto Scaling once they understood the steady-state.
  • Replicated popular-title metadata across many partitions by suffix-randomizing the PK for top content (write-sharding), so a single viral title couldn't hot-spot one partition.
  • Used Kinesis between high-frequency writes and DDB instead of writing directly. Kinesis absorbs spikes; the DDB consumer rate is controlled.

Amazon Prime Video ~200M Prime members · catalog scale 100Ks of titles · playback tracking globally

Prime Video uses DynamoDB across the full content lifecycle: catalog management, subscription state, rental and purchase records, playback bookmarks, entitlement checks, and order fulfillment. The workload mix is the interesting part — heterogeneous patterns (browsing, account state, playback events) all sharing the same database family without per-workload tuning.

Entitlement checks happen on every play attempt and need single-digit-ms latency: "does this user own / subscribe to this title?" Catalog browsing is read-heavy and amenable to caching. Playback events are write-heavy with spiky patterns around premiere drops and live sports. Prime Video runs all of these on DDB without separate database families, leaning on DDB's predictable per-partition performance to keep tail latencies bounded across workloads.

Optimizations Prime Video did:

  • Heterogeneous access patterns isolated to different tables rather than forced into one (the opposite of pure single-table design). Entitlements, catalog, and playback events are separate tables, each optimized for its access pattern. The discipline is to avoid the single-table-everything trap when workloads are genuinely different.
  • Adaptive Capacity heavily relied on for premieres when traffic surges by 10-100x for specific titles. DDB Streams fan out to downstream personalization and recommendation pipelines.
  • Global Tables MREC for cross-region read locality — Prime Video users in Europe read entitlements from eu-west-1 replicas, not US East round-trips.

Netflix 260M+ subscribers · ~700 microservices · ML feature lookups at request time

Netflix runs a polyglot persistence layer — Cassandra for the bulk of streaming state, DDB for specific workloads where its predictable latency and zero-ops overhead outweighed Cassandra's lower per-request cost. DDB powers user-data collection and ML feature storage for personalization, with DynamoDB Streams enabling real-time CDC into the broader Netflix event mesh (Kafka, Flink, Iceberg).

The PE-relevant pattern: Netflix doesn't use DDB everywhere. They use it where the ops cost of running Cassandra would exceed the markup of DDB at that workload's scale. For low-throughput-but-latency-sensitive workloads (personalization features, A/B test assignment, member status checks), DDB wins. For high-throughput high-fanout workloads (the actual streaming session state, viewer history), Cassandra wins because they already own the operational expertise.

Optimizations Netflix did:

  • Workload-by-workload sizing decisions, not a blanket "use DDB" or "use Cassandra" mandate. The ops cost / managed-service cost trade is re-evaluated per workload.
  • DDB Streams into Kafka via Kinesis-DDB adapter to feed the larger Netflix event mesh, so DDB writes become events in Flink-processed pipelines.
  • Chaos engineering against DDB-backed services — Chaos Monkey kills the service instances and verifies retry behavior under failover. The "DDB just works" claim is verified, not assumed.

Lyft millions of rides/day · >100K GPS writes/sec at peak · multiple cities concurrent

Lyft uses DynamoDB across multiple data stores, but the canonical example is the ride-tracking system that stores GPS coordinates for every ride. Each driver app pushes location every few seconds during an active ride; each rider app polls location to render the live-on-map experience. The combination is heavy on both read and write, and skewed by city density (NYC writes dwarf small-city writes).

The right access pattern here is: PK = rideId (gives per-ride isolation), SK = timestamp (gives time-ordered range). Queries are bounded — "give me the last 30 seconds of this ride" — and never cross-ride. This is a textbook DDB-favorable pattern: known PK, time-range SK, predictable cost per query.

Optimizations Lyft did:

  • Per-ride partitioning means even at city-scale traffic, the load distributes across thousands of rideId partitions; no hot key.
  • TTL for ride history after a retention window — the live tracking table doesn't accumulate cold data; analytical history goes to S3.
  • Kinesis between driver-side ingest and DDB writes to absorb burst patterns (every driver app sending every N seconds) before DDB sees the smoothed-out stream.

Capital One 99% reduction in application failover time · multi-region active-active financial workload

Capital One moved a critical application from mainframe to a serverless DDB-backed architecture with multi-region active-active. The driving requirement was application failover time: their previous architecture took an order of magnitude longer to recover from regional incidents than acceptable. DynamoDB Global Tables (active-active replication) was the only AWS option offering true active-active multi-region writes at the time (Aurora Multi-Master was single-region; Aurora Global was active-passive with ~1 minute failover).

Capital One's team did a deliberate trade-off analysis comparing DDB Global Tables, Aurora Global Database, and Aurora Multi-Master. Aurora Global was the early favorite because it required less application change, but it wasn't active-active. Aurora Multi-Master was active-active but single-region. DDB was the only one with active-active multi-region — and that single property drove the decision.

More recently (re:Invent 2025), Capital One demonstrated MELT (Metrics, Errors, Logs, Traces) approaches to multi-region recovery and discussed using Aurora DSQL and DynamoDB MRSC as the next generation of active-active patterns. MRSC closes the RPO gap that MREC leaves open.

Optimizations Capital One did:

  • Reduced failover time by 99% by going active-active — there is no failover, just a routing change.
  • Streams + Lambda fan-out pattern documented in their engineering blog, with explicit warnings about hot-shard behavior (1:1 partition-to-shard mapping creates bottlenecks under skew).
  • Now evaluating MRSC for workloads where MREC's eventual consistency and silent LWW conflict resolution is a correctness risk.

Duolingo 52.7M DAU (end of 2025) · 31B+ items · 24K read units/sec, 3.3K write units/sec sustained

Duolingo stores its learner progress, exercise responses, and ML feature data in DynamoDB. The driving constraint was simple: at the scale of tens of billions of learning objects, running their own database would have cost a 3-5 engineer ops team. DDB's predictable cost per access and zero-ops promise was cheaper than the equivalent headcount.

The interesting Duolingo nuance is the data pipeline integration. Their personalization ML (Birdbrain) pulls learning history from DDB, runs predictions, writes back updated state. DDB also stores TTS audio metadata for Amazon Polly integrations. The pattern is: DDB as the system-of-record for fast lookups, S3 as the immutable data lake, EMR + Spark for offline ML training, Amazon EBS for temporary working storage.

Optimizations Duolingo did:

  • Moved Birdbrain v2 from nightly batch to sub-minute async processing — abandoned the rigid 24-hour batch cycle; DDB writes feed the pipeline within minutes. Streams + Lambda underneath.
  • DDB as feature store for inference, S3 as training data store. The split avoids paying DDB rates for cold historical data while keeping inference reads on the fast path.
  • Composite key design — PK = userId, SK = (lesson, exercise, timestamp). Item collections per user, queried with single-partition Query operations during inference.

Airbnb hundreds of millions of sessions · global multi-region traffic · every page-view = lookup

Airbnb uses DynamoDB for user session storage and as a backing store for fast, real-time web lookups across the platform. The read pattern is brutally consistent: every authenticated request reads session state, so the workload is read-dominant by orders of magnitude. The global user base means cross-region latency matters — a user in Tokyo cannot wait for a US-East round-trip to validate their session.

Session storage is the textbook DDB use case: known PK (sessionId), simple GetItem, predictable cost per session, TTL for cleanup. The hard part is doing it at the scale Airbnb operates — many regions, hundreds of millions of active sessions, read latency budget under 10ms. Global Tables answers the multi-region question; DAX (or app-level caching) answers the read-load question.

Optimizations Airbnb (and similar session-store users) typically do:

  • TTL on session expiry for free cleanup. No nightly scan.
  • Global Tables MREC for cross-region session sync. A user who logs in from Tokyo and lands in a request handled by ap-southeast-1 reads from the local replica.
  • DAX or application-layer cache in front for the read path. Session lookups are bursty (every page load); caching dramatically reduces DDB RCU consumption.
  • Compact session items — short attribute names, S3-pointer for large session payloads (preferences, recently viewed listings). Avoid the 400KB ceiling drift.

Compiled 2026-06-08 · DynamoDB 2022 USENIX paper + AWS docs (as of Jun 2026) + customer case studies · Generated with the tech-tradeoffs-analyzer skill

Citations: AWS DynamoDB Developer Guide, AWS Database Blog, Capital One Tech blog, Disney+ engineering blog (Achintya Ashok), Lyft case study, Duolingo case study, AWS re:Invent 2024-2025 (DAT440), DynamoDB 2022 USENIX paper (Sorenson et al.)