Distributed SQL: Five-Way Trade-Off Analysis
CockroachDB, YugabyteDB, TiDB, Amazon Aurora DSQL, and Google Cloud Spanner. The architectural decisions that separate them, and the ones interview committees actually probe.
Distributed SQL Strongly Consistent PE DepthAs of 2026-06-02. Aurora DSQL GA was May 27, 2025; CockroachDB 24.3 retired the open Core edition in November 2024; Spanner Omni reached preview in 2026.
These five collapse into three architectural philosophies, not five. Spanner and Aurora DSQL trust physical time (TrueTime, EC2 Time Sync) to avoid coordination; CockroachDB and YugabyteDB use Raft + HLC and treat clock skew as the enemy to be bounded; TiDB uses a single centralized timestamp oracle (PD) Percolator-style. The right question is never "which one scales better." It is: who owns the operational pain (you vs. the cloud), what is the migration story for your existing wire protocol, and what does a partition do to your write SLO. If you cannot answer those three for your workload, you are picking on vibes.
Best default choices
1. Trade-Offs
One table per technology. The trade-offs themselves diverge, so a matrix would flatten the most important distinctions. Each row is a concrete X-for-Y decision baked into the design, not a feature.
CockroachDB
Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.
Raft per range · HLC · Pebble · CCL| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| SERIALIZABLE isolation by default | Strongest correctness floor of any SQL DB, no anomaly catalog to memorize | Higher abort rate and retry burden than READ COMMITTED, especially on hot rows | Inventory decrement on a top SKU during a sale, every concurrent decrement is a transaction restart | The 24.x line did add READ COMMITTED as opt-in, but it is a behavior trap: most teams enable it cluster-wide and lose the correctness story that justified picking Cockroach in the first place |
| HLC instead of GPS-disciplined time | Runs on any cloud, any laptop, no special hardware dependency | Uncertainty interval (default 500ms max_offset) forces "uncertainty restarts" when reads land inside it | NTP drift on a noisy VM pushes one node 600ms off, the node shuts itself down rather than risk a serializability violation | This is why AWS Time Sync adoption matters for Cockroach in practice. Tightening max_offset reduces restart amplification, but the safe default trades latency for safety |
| Per-range Raft groups (MultiRaft) | Range-level placement, replication factor, and lease control without resharding | Hundreds of thousands of Raft groups per node, each with its own heartbeat traffic | Idle clusters burn CPU on heartbeats, dense node failures cause Raft thundering herds on leader re-election | The MultiRaft optimization batches heartbeats, but the engineering cost shows up in Raft tuning flags most teams never touch and Cockroach support has to walk them through |
| Postgres wire compatibility (partial) | Drop-in for psql, pgx, JDBC, most ORMs, lift-and-shift from PG looks easy | No PG extensions (PostGIS, pgcrypto, pgvector via separate path), different query planner, divergent EXPLAIN output | Existing app uses LISTEN/NOTIFY, advisory locks, or PL/pgSQL with side effects, none of which port | "Postgres-compatible" is a marketing surface. Real test is your slowest 100 queries on production data, not the schema migration |
| Geo-partitioning by row | GDPR/data-residency at row granularity, no app-level routing | Locality constraints become a schema concern, cross-region joins fan out to wide-area latency | A "show me my orders" query joins a US-pinned user to an EU-pinned order row, p99 jumps from 5ms to 180ms | The PE move is to forbid cross-partition queries at the schema level via constraints, not policy. Otherwise developers will write them, and the wide-area cost is invisible until on-call |
| Closed-source license (CCL, post-Nov 2024) | Backed development, dedicated support, all enterprise features in one bundle | No Apache-licensed escape hatch, BSL converts to ASLv2 after four years (slow), $10M revenue cap on free tier with annual renewal | Acquired company is now over the revenue threshold and Cockroach Labs' pricing posture has shifted, your only options are pay, fork an older BSL version, or migrate | This is the single biggest reason to evaluate YugabyteDB alongside. The architecture is similar enough that the differentiator becomes license, ecosystem trust, and operational maturity |
| Range-based key partitioning | Range scans are fast, ORDER BY on PK is free, time-series append patterns work well | Monotonic primary keys (auto-incrementing IDs, timestamps) create a hot last range that does not auto-split until it is already too late | An event-logging table with timestamp PK takes 80% of cluster writes on a single range during a traffic spike | Use hash-sharded indexes or UUID PKs by default. The "PRIMARY KEY (created_at, id)" pattern is the most common production foot-gun on Cockroach |
| Leaseholder reads bypass Raft | Single-replica reads at p99 sub-10ms, no consensus round-trip on the read path | Lease transfer on node loss adds tail latency, follower reads are stale by design (default 4.8s) | Node graceful drain or rebalance causes a brief lease-transfer storm, p99 reads spike to 100ms+ for ~30s | Follower reads are the right tool for analytics-on-OLTP, but using them for user-facing reads while claiming "strong consistency" is the most common postmortem item on Cockroach deployments |
YugabyteDB
Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.
Raft per tablet · HLC · DocDB · Apache 2.0| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Forked PostgreSQL backend (YSQL) on top of DocDB | Highest PG compatibility of the five, real triggers, FKs, stored procs, most extensions | Forked PG 11.2 base (Enhanced Compatibility Mode in v2024.1 closes some gaps), upgrade cadence lags upstream Postgres by years | Application depends on a PG 15+ feature (MERGE, range types with multiranges), and Yugabyte's parity is "soon" | The fork is both the gift and the curse. You inherit PG semantics for free but also inherit PG's per-connection backend process model, which is why connection pooling is non-optional |
| Two-layer architecture (query / storage) | Storage layer (DocDB) is consistent at the doc level, query layer is stateless and scales independently | Two services to operate, two release notes to read, query layer per-connection memory dominates at high connection counts | Spiky workload at 5000 concurrent connections, YSQL processes consume RAM the storage layer doesn't need | YB-TServer + YB-Master split mirrors TiDB's separation but at finer granularity. Operators frequently misallocate resources between them on first deployment |
| HLC + bounded clock skew (similar to CRDB) | No special hardware, runs on any cloud, on-prem, or laptop | Same uncertainty interval problem as Cockroach, depends on NTP discipline | One node's NTP daemon misconfigured, divergence triggers tablet leader bouncing | YugabyteDB's leader lease mechanism is stricter than Cockroach's: new leader must wait out old lease, which costs availability during failover but prevents split-brain reads |
| Apache 2.0 license, genuinely open core | Real fork-and-go option, no revenue gate, no telemetry mandate | Smaller commercial ecosystem than Cockroach Labs, narrower bench of vendor SREs available for hire | You need a deep performance investigation on a Saturday and the public Slack is the fastest path, not a support contract | The license is a real competitive moat against CRDB now, especially for procurement-sensitive enterprises. Don't dismiss it as ideological |
| Hash sharding by default for tables, range for indexes | No monotonic-PK hot-shard problem out of the box, uniform write distribution | Range scans on a hash-sharded PK degenerate to scatter-gather across tablets | A "list my last 50 orders by created_at DESC" query goes from O(1 tablet) to O(N tablets) when N grows | This is the inverse of CRDB's default. You explicitly opt into range sharding via SPLIT INTO, which is the right default for OLTP but trips up engineers migrating from PG with timestamp PKs |
| xCluster async replication for two-region active-active | Sub-second async replication for low-write-latency multi-region without the cost of synchronous Paxos across regions | No unique constraint enforcement across universes, sequences conflict, conflict resolution is last-writer-wins at WAL level | Two active universes both accept INSERTs with the same auto-increment ID, indexes silently diverge from main table | The doc literally tells you to use UUIDs and skip uniqueness constraints in active-active xCluster. If that constraint is unacceptable, you don't actually want async, you want sync sync sync and to pay the latency |
| Read-from-follower with bounded staleness | Geo-local reads with explicit staleness bound (e.g., "no more than 30s old") | Application must opt in per-session, mixed-staleness queries inside one txn are easy to write incorrectly | An analytics dashboard reads followers and shows totals that don't match the OLTP view of the same data, customer reports "the numbers are wrong" | Bounded-staleness is the most underused feature across all five. Most teams either run everything strong or build a separate replica, both more expensive than tuning a staleness bound per query class |
| Geo-partitioning via PostgreSQL partition syntax | Uses native PG declarative partitioning + tablespaces, familiar to DBAs | No FK references on partitioned tables, all the usual PG partition-pruning gotchas, schema changes ripple across partition tables | A regulated multi-region schema needs FKs between user and orders, you discover the limit during the prod cutover | Tablespace pinning is more transparent than CRDB's zone configs but less expressive. Choose based on whether your team thinks in PG idioms or zone constraints |
TiDB
Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.
3-component · PD timestamp oracle · TiKV · MySQL wire| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| MySQL wire protocol, not Postgres | Drop-in migration for the largest installed base of OLTP on the planet, every MySQL driver works | Inherits MySQL's looser type system, lock semantics, and SQL dialect quirks, no SERIALIZABLE | Application relied on MySQL repeatable-read gap locks, TiDB's snapshot isolation behaves differently and you get phantom rows | This is the only one of the five with MySQL heritage. If you are not migrating from MySQL or building net-new and standardizing on MySQL ecosystem, this is rarely the right pick |
| Centralized Placement Driver (PD) issues global timestamps | Percolator-style 2PC works, no clock-skew concerns, simple mental model | PD is a global bottleneck on transaction start, TSO request latency is on the critical path | Cross-region cluster where PD leader is one continent away, every transaction pays a continental round-trip just to begin | PD itself is Raft-replicated for HA, but only the leader serves TSO. Multi-region TiDB is fundamentally region-local with cross-region cluster-to-cluster replication, not true multi-master like Spanner or DSQL |
| Compute / storage separation (TiDB / TiKV / TiFlash) | Scale stateless SQL layer (TiDB) without rebalancing data, TiFlash columnar replica gives HTAP on the same cluster | Three independent services to operate, each with its own scaling story and failure modes | TiKV is healthy but TiDB layer is OOMing on a bad query, the cluster looks "fine" to PD while users see errors | The three-tier architecture is the cleanest separation of concerns of the five, but it is also the highest operational surface area. Plan for a dedicated TiDB SRE if self-hosted |
| Optimistic concurrency by default (pessimistic is opt-in) | Better throughput when contention is low, no lock manager overhead | High-contention rows produce abort storms, retry burden moves to the app | Inventory counter with thousands of concurrent decrements, OCC retry loop spikes CPU on TiDB and TiKV both | The recommendation has moved to "use pessimistic by default in 8.x" for OLTP. Teams that adopted OCC in 4.x and never re-evaluated still see the abort storms in 2026 |
| 96MB regions, automatic split and merge | Fine-grained rebalancing, PD can move hot regions in seconds | Tens of thousands of regions per node, Raft heartbeat overhead similar to CRDB's MultiRaft problem | A 10TB cluster has ~100K regions, region-related metadata on PD becomes hot, scheduling decisions lag | The "hibernate region" optimization makes idle regions free, but the math still pushes you to bigger region sizes (256MB+) for very large clusters |
| TiFlash columnar replica for HTAP | Real-time analytics on fresh OLTP data without ETL, single SQL endpoint, optimizer picks engine per query | 2x storage cost when TiFlash is enabled, optimizer mistakes can route analytics queries to TiKV and OLTP to TiFlash | Analyst writes a heavy GROUP BY without a hint, optimizer picks TiKV, scan starves OLTP for 10 minutes | HTAP is the most-marketed and least-understood feature. Real customers split workloads by isolation_read_engines session var anyway. The "no ETL" promise is true; the "no operational discipline" implication is not |
| Apache 2.0 license, real open source | Free fork path, large Chinese-tech adoption (ByteDance, Pinduoduo at hundreds-of-TB scale) gives real production validation | Documentation and ecosystem still feel Mandarin-first in places, support tier expectations vary by region | You need authoritative answers fast and the best blog post on your symptom is on a Chinese tech blog | The PingCAP commercial roadmap (TiDB Cloud Premium / Essential, late 2025) is moving the lift-and-shift complexity into managed service. Worth evaluating against Aurora DSQL for that exact reason |
| Two-phase commit across regions for cross-region transactions | Real distributed ACID, not "eventually consistent multi-master" | 2PC latency = at least 2 cross-region RTTs, prepared but uncommitted transactions block | A coordinator failure during 2PC leaves transactions in prepared state, you discover them in alerts hours later | The cluster-to-cluster (CDC-based) async replication path is what most production multi-region TiDB looks like, not true multi-region 2PC. Frame the choice as "sync within region, async across" honestly |
Amazon Aurora DSQL
Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.
Disaggregated · OCC · EC2 Time Sync · Serverless| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| OCC, cross-region coordination only at COMMIT | Reads and writes execute locally on Query Processor, no per-statement WAN latency, multi-region writes sub-second | Conflicting transactions abort at commit time, retry burden lives in the app | Two regions both write to the same logical row, both look successful until COMMIT, one gets an OCC abort with no prior warning | This is the single biggest architectural bet in DSQL. It pays off when contention is low and writes are partition-friendly, and falls over when you have hot rows, in which case Aurora DSQL is the wrong tool, not "needs tuning" |
| EC2 Time Sync as the clock substrate | Microsecond-bounded physical time across AWS without owning atomic clocks, like Spanner's TrueTime without the on-prem hardware | AWS-only forever, no path to multi-cloud, time-sync infra is AWS-internal | Compliance or strategic mandate forces multi-cloud, DSQL cannot follow | This is why DSQL exists as a separate service from Aurora Postgres. The Time Sync dependency is intrinsic to the architecture, not an implementation detail |
| 3000-row, 10MiB, 5-minute transaction cap | Prevents head-of-line blocking at Adjudicator and Journal, predictable system-wide latency | Batch jobs, large ETL, schema migrations on big tables cannot run as single transactions | A nightly batch that worked on Aurora Postgres has to be re-architected as N chunked transactions with idempotency, app changes scope balloons | The 3000-row limit is the most honest design statement of any of the five. It is not a quota, it is a contract: DSQL refuses to be your batch processor |
| No foreign keys, no triggers, no stored procs, no JSONB (as of 2026) | Forces all referential integrity and business logic into application code, makes the database genuinely stateless behavior-wise | Most existing Postgres apps require substantial rewrite, ORM features that assume FK constraints break silently | You migrate, FKs no-op, an orphan row is inserted, you discover it in a customer report next quarter | This is the bet that "the database is for data, the app is for logic." Defensible architecturally, brutal in migration. The 2025 add of views and unique indexes signals AWS hears the feedback |
| Serverless with scale-to-zero | No instance sizing, no idle cost, no PG Bouncer (no connection pooler needed), 10K default connections per cluster | DPU billing is hard to forecast, query plan changes can 3x cost without a code change | A subquery rewrite during a refactor turns into 3x DPU consumption, monthly bill spikes, no committed-use discount to fall back on | The CloudWatch DPU breakdown (ComputeDPU, ReadDPU, WriteDPU, MultiRegionWriteDPU) is the most important observability surface. Treat it like a SLO, not just a billing metric |
| Active-active multi-region, no leader | 99.999% multi-region availability SLA, no failover step, both endpoints accept writes concurrently with strong consistency | 2x multi-region write DPU cost (writes replicate to both regions), cross-region latency surfaces as commit latency | A 150ms RTT between regions makes every commit pay the full RTT, write throughput per connection drops accordingly | The cost model means "active-active" is not free in DSQL. Many teams will discover their workload is fundamentally region-local and they were paying for global by accident |
| Postgres 16 subset, not full Postgres | Familiar wire protocol, drivers, basic SQL, EXPLAIN works | No extensions (no pgvector, no PostGIS, no pgcrypto), no sequences, no temp tables | App uses UUIDs from `gen_random_uuid()` (works), then needs PostGIS spatial join (doesn't), and you discover during integration | The "Postgres compatible" claim is the same shape as Cockroach's. Test on your actual query workload, not the wire protocol |
| GA'd May 2025, still maturing the feature gap | Aggressive roadmap, new features monthly (views, unique indexes, PrivateLink, FIS integration in 2025) | Roadmap-driven decision making, features assumed at decision time may not arrive when promised | You promised stakeholders FK support "soon" based on AWS signal, two quarters later it is still not there | The aggressive launch cadence is genuine, but PE-grade decision making requires assuming today's feature set, not the roadmap. Aurora DSQL in mid-2026 is not the Aurora DSQL of late 2027 |
Google Cloud Spanner
Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.
TrueTime · Paxos · 2PC · External Consistency| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| TrueTime (GPS + atomic clocks) for external consistency | The strongest correctness guarantee available in any commercial SQL DB, transactions appear to execute in real-time order globally | Commit-wait pays the TrueTime uncertainty bound (typically a few ms) on every cross-machine write | Tight-loop write benchmarks look slower than Cockroach because commit-wait is unhidden | The commit-wait is what makes the system honest. Cockroach's uncertainty restarts are the same bound paid as a restart cost instead. Spanner amortizes it as wait time, Cockroach amortizes it as retry rate. Same physics, different UX |
| Paxos per split, leader-based writes | Proven, deployed inside Google for 15+ years, the most validated distributed consensus implementation outside Raft | Leader-based writes mean cross-region writes have one fast region (the leader) and one slow (followers) | A multi-region instance has unbalanced write latencies by design, EU writers see 50ms while US writers see 5ms | Spanner's "regional vs multi-regional" config choice is more important than its TrueTime story for actual workload economics. Pick wrong and you pay 5x for the wrong shape of consistency |
| GoogleSQL dialect (PostgreSQL dialect now available) | Two SQL surfaces, PG dialect supports most ORMs and existing PG apps, GQL graph language for Spanner Graph | Both dialects diverge from upstream Postgres, schema features (interleaved tables) are Spanner-specific | You design schemas using interleaved tables for performance, later cannot port back to vanilla Postgres without redesign | The PG dialect is a recent (2022+) compatibility layer, not the native language. For new builds, GoogleSQL is the better-supported and more honest choice unless multi-DB compat is a hard requirement |
| Snapshot reads at any timestamp from any replica | Globally consistent read-only transactions at the local replica without leader round-trip | Replica "safe time" can lag if it has not seen recent writes, reads may block waiting for safe time to advance | A snapshot read at "now" on a replica that is behind blocks for hundreds of ms during a write-heavy moment | Stale reads ("stale" by a few ms intentionally) are the production pattern. "Strong reads" (current timestamp) are correctness theater for most read paths |
| Multi-region SLA of 99.999% | The highest published SLA in the category, Google takes the operational ownership | Premium pricing, multi-region instances cost ~3x regional, processing-unit billing rewards continuous low utilization | A spiky workload pays for steady-state PUs that mostly idle | The Editions split (Standard / Enterprise / Enterprise Plus) means the SLA you actually get depends on a procurement decision, not just the deployment topology. Read the SLA per edition |
| GCP-only (until Spanner Omni in 2026 preview) | Deepest possible integration with Vertex AI, BigQuery (Zero-ETL), Pub/Sub, IAM | Multi-cloud is now possible via Spanner Omni preview but is not Spanner in production parity | You commit to Spanner, your company gets acquired by an AWS-only shop, migration is multi-quarter | Spanner Omni's software-defined TrueTime is the most interesting 2026 development in this space. Whether it reaches feature parity with managed Spanner is the bet that decides multi-cloud distributed SQL for the next 5 years |
| 2PC across Paxos groups for multi-split transactions | Real ACID across the entire schema, not just intra-shard | 2PC + Paxos + commit-wait stacks: cross-split transactions are visibly slower | A transaction that touches 5 splits pays 5x the prepare overhead, OLTP latency p99 climbs from 8ms to 40ms | The interleaved tables feature is the workaround: keep child rows physically co-located with parent so common transactions stay single-split. Most Spanner experts will say "schema design is the only optimization that matters" and they are right |
| Multi-model (relational + graph + vector + full-text) | Single system for OLTP + GraphRAG + semantic search, no ETL between specialized DBs | Each new model is "good enough" rather than best-in-class against Neo4j, Pinecone, Elasticsearch | You picked Spanner Graph for the consolidation story, later your data scientists need Cypher-only features that GQL doesn't have yet | The multi-model story is a real differentiator, but maturity per model varies. Spanner Graph GA'd in 2025, vector indexes for PG dialect GA'd recently. Treat each model as its own production-readiness check |
2. Use Cases
Real workloads with the driving property that ruled out the obvious alternative. Generic "scalability" is not a driving property.
CockroachDB
Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Multi-region SaaS control plane | DoorDash, Netflix, Shipt have publicly used CRDB for control-plane storage | SERIALIZABLE isolation across regions without a custom 2PC layer | Tens of TB, 10K+ QPS sustained, p99 under 25ms in-region | Postgres + sharding requires building the distributed-transaction story you would get for free |
| Geo-residency compliance (GDPR, India DPDP) | Mid-size fintech with EU and India users in one logical app | Row-level pinning to a region, with one schema and one connection string | 1-2TB per region, 5K QPS, hard residency boundary | YugabyteDB is the close competitor; CRDB wins on Postgres-extension expectation and zone-config expressivity |
| Lift-and-shift from Postgres at scale | Existing PG monolith hitting single-primary write ceiling | Postgres wire compatibility plus horizontal write scaling without re-architecting the app | Outgrew a 96-vCPU primary, sustained 30K writes/sec, 5TB and growing | Aurora Postgres scales reads via read replicas but not writes, sharding via Vitess is MySQL-only |
| Inventory and ordering systems with strong correctness | E-commerce with thousands of concurrent decrement transactions on hot SKUs | Default SERIALIZABLE prevents oversell without app-side locking gymnastics | 5K orders/sec at peak, hot SKU contention is the design hazard | MongoDB and Cassandra cannot give the correctness guarantee, would require app-level inventory service |
| Hybrid on-prem and cloud deployments | Regulated industries (banking, healthcare) wanting cloud agility plus on-prem fallback | Same engine on bare metal, EKS, GKE, no vendor lock to a cloud provider | Multi-cluster, 50+ nodes per cluster, mixed deployment environments | Spanner is GCP-only (Spanner Omni preview as of 2026), DSQL is AWS-only |
| Distributed audit logs with point-in-time recovery | Compliance-heavy workloads needing tamper-evident history | Time-travel queries (AS OF SYSTEM TIME) for free, MVCC retention configurable | Years of history retained, queries against 30-day-old snapshots in production | Postgres needs separate WAL archiving + restore choreography, distributed competitors lack the AS OF clause |
YugabyteDB
Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| PostgreSQL replacement for write-scale-out | SaaS hitting PG vertical scaling limit, ORM and stored procs in use | Highest PG feature parity of the five, real triggers and FKs and extensions | 5-20TB, 10K+ writes/sec, 100s of stored procedures to preserve | CockroachDB lacks triggers and most extensions, Aurora DSQL is a far thinner PG subset |
| Multi-cloud distributed SQL | Enterprise mandate to avoid AWS / GCP lock-in on the system of record | Apache 2.0 license, runs identically on EKS, GKE, AKS, bare metal | Hundreds of nodes across two clouds, active-passive xCluster between | Spanner is GCP-only, DSQL is AWS-only, CockroachDB now requires a paid Enterprise license above $10M revenue |
| Financial systems with geo-partitioning | Bank with strict per-jurisdiction data residency | Row-level geo-partitioning via standard PG declarative partitioning syntax | Multiple regulatory regions, single schema, single connection string | App-level routing with PG is operationally tractable but auditable-by-design is harder to demonstrate |
| Telco subscriber data store | Mobile carrier with regional billing systems, 99.99% SLA | Synchronous replication across 3 AZs, follower reads for nearby low-latency lookup | Hundreds of millions of subscribers, sub-10ms p99 reads, sustained MGT writes | Cassandra gives availability but no SQL, MongoDB lacks the strong-consistency guarantees needed for billing |
| Active-active two-region apps | SaaS needing low write latency in both US and EU regions | xCluster bidirectional async replication, accept eventual conflict resolution to gain local write speed | Sub-5ms in-region writes, ~1s cross-region replication lag | Spanner multi-region pays commit-wait for true active-active; DSQL pays full RTT at commit; both more expensive |
| Edge-adjacent IoT or device data | Manufacturing or IoT with regional clusters synced asynchronously | xCluster supports many-to-one fan-in for centralized analytics | 30+ edge clusters streaming to a central reporting cluster | Building this on Postgres requires logical replication + custom conflict handling per table |
TiDB
Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| MySQL replacement at scale | ByteDance, Pinduoduo, Shopee have used TiDB for MySQL replacement at hundreds of TB | MySQL wire compatibility, horizontal scale, no Vitess operational burden | Hundreds of TB, 100K+ QPS sustained, MySQL ecosystem fully reusable | Vitess preserves MySQL but inherits sharding complexity, AWS RDS MySQL caps out at the single-primary ceiling |
| HTAP on operational data | E-commerce running analytics on fresh OLTP via TiFlash columnar replica | Real-time analytics without an ETL pipeline, single SQL endpoint routes per query | Operational DB also serving dashboards with sub-minute freshness | Separating OLTP and OLAP into PG + Snowflake adds an ETL pipeline and 5-minute-plus freshness gap |
| Game backend state | Mobile and console game studios needing transactional state at scale | MySQL compatibility (most game backends are MySQL-first), strong consistency for in-game economy | Millions of concurrent sessions, transactional inventory and currency operations | Cassandra and Redis cannot give the transactional guarantees, MySQL needs sharding |
| Financial settlement and reconciliation | Fintech with MySQL legacy, regulatory pressure for ACID across larger data sets | Strong consistency with MySQL semantics, audit-friendly | 10s of TB, sustained pessimistic-transaction throughput | Aurora DSQL has no triggers or FKs and a 3000-row transaction cap; both are dealbreakers for batch reconciliation |
| SaaS multi-tenant OLTP | Vertical SaaS replacing per-tenant MySQL shards with one logical TiDB cluster | Eliminate per-tenant shard sprawl, simplify connection management, single SQL plane | 1000s of tenants, GB-to-TB per tenant, MySQL drivers everywhere | PG-based options force MySQL→PG migration on top of distribution; TiDB removes that variable |
| Asia-Pacific deployments with PingCAP support | Companies with China and SEA presence valuing native-language support | Strong support presence and ecosystem in APAC, mature TiDB Cloud offering on Alibaba Cloud and AWS | Regional teams, mixed Mandarin/English ops | Spanner is GCP, DSQL is AWS, CRDB has limited APAC commercial bench |
Amazon Aurora DSQL
Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Greenfield serverless apps on AWS | New product launches on Lambda + API Gateway with unpredictable traffic | Scale-to-zero billing, no connection pooler, no instance sizing | Variable load from 0 to thousands of QPS, no capacity planning | Aurora Serverless v2 has a minimum cost floor, Aurora Postgres requires instance sizing |
| Active-active multi-region without operational handoff | Robinhood and ADP publicly cited as launch customers for global SLAs | 99.999% multi-region availability with single logical endpoint per region | Two-region active-active, sustained 4x faster than Spanner per AWS GA claims (verify on your workload) | Spanner is GCP-only, multi-region CRDB requires you to operate the cluster |
| Microservices ledger or transactional state | Microservice owning a small slice of state, no batch jobs, well-bounded transactions | 3000-row transaction limit is fine for service-shaped workloads, OCC scales without locking | 10s of GB per service, low-contention writes, microservice-shaped access | DynamoDB would work but loses SQL, joins, and ad-hoc query flexibility |
| Multi-tenant SaaS with tenant-isolated workloads | SaaS where tenant rows are independent, no cross-tenant transactions | OCC works when conflicts are tenant-scoped, scale-to-zero per tenant inactive period | Many small tenants, low cross-tenant locking concerns | RDS Postgres requires upfront capacity for the worst-case tenant |
| Event-driven applications with bursty traffic | Workflow orchestration, retry queues, scheduled-job dispatchers | Per-DPU billing means quiet hours are nearly free, no provisioned floor to pay for | Workload spikes 100x between idle and peak, no traditional sizing answer | Provisioned Aurora pays for peak 24/7, Lambda + DynamoDB loses SQL ergonomics |
| Net-new applications choosing strict AWS-native | AWS-first orgs with IAM, KMS, VPC, IaC pipelines already standardized | IAM auth (token-based, no DB passwords), CloudWatch DPU metrics, AWS FIS for failover testing | Standard AWS-resident stack, single-cloud strategic mandate | Self-managed CRDB or YugabyteDB on EKS doubles operational scope |
Google Cloud Spanner
Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Global financial systems | Banks and fintechs with multi-region ledgers, e.g., trading and payment platforms | External consistency, the only commercially deployed system with this guarantee | Trillions of rows, multi-region with 5-nines SLA, strict audit requirements | None of the others guarantee external consistency in the formal sense, only strong consistency |
| Ad-tech and large-scale OLTP | Google's own AdWords backend (F1) is the canonical use case | Strong consistency at petabyte scale with predictable latency | Petabytes, millions of QPS, single logical schema across global footprint | No alternative has demonstrated this scale in production |
| Multi-model consolidation | Companies consolidating off Neo4j + Elasticsearch + Pinecone + Postgres into one platform | Single ACID-compliant store across relational, graph, vector, and full-text | 10s of TB, mixed workload patterns, ETL elimination as a strategic goal | Each specialized DB still wins on its niche; Spanner wins on consolidation when "good enough" per model suffices |
| Gaming leaderboards and global state | Mobile game backends with single global leaderboard and player state | Globally consistent ordering of writes, single source of truth across regions | 10M+ concurrent players, sub-second cross-region read freshness | Cassandra and Redis cannot give cross-region ordering, only locality-based eventual consistency |
| Spanner Graph for fraud detection | Financial fraud detection with graph traversal on fresh transaction data | Combined SQL + GQL graph queries on ACID-compliant data, no ETL to a separate graph DB | Billions of edges, real-time scoring on OLTP transactions | Neo4j is the graph specialist but requires ETL from the OLTP store, adding latency and consistency risk |
| Multi-cloud strategic deployments (Spanner Omni) | Companies committed to Spanner but with non-GCP infrastructure obligations | Spanner Omni preview brings the engine to on-prem, AWS, Azure (preview-quality, not production) | Mid-size deployments, evaluation as of 2026, not strategic-bet-the-company | If GCP is on the table, native Spanner is always cheaper and more mature |
3. Limitations
Matrix layout. Rows are the categories that bite at scale; cells describe how each technology is constrained, with severity coded by color.
| Limitation | CockroachDB | YugabyteDB | TiDB | Aurora DSQL | Spanner |
|---|---|---|---|---|---|
| Transaction size | ~64MB practical (raft command size), no hard row count MEDIUM | ~512MB tablet-level, soft transaction limits via configs MEDIUM | 10GB hard limit in default mode, 100GB with "large transactions" enabled MEDIUM | 3000 rows, 10MiB, 5 minutes hard CRITICAL | 20K mutations per commit, 100MB max HIGH |
| SQL feature completeness vs upstream | Postgres subset, no LISTEN/NOTIFY, no advisory locks, no extensions HIGH | PG 11.2 base, most features supported but cadence lags PG releases by years MEDIUM | MySQL 5.7+ subset, no triggers in many versions, no events MEDIUM | No FKs (no enforcement), no triggers, no stored procs, no JSONB, no temp tables, no sequences, no extensions CRITICAL | GoogleSQL is a separate dialect, PG dialect is partial, interleaved tables non-portable HIGH |
| Multi-region write latency | Cross-region Raft quorum at write time, ~RTT to nearest 2 regions HIGH | Same as CRDB for sync, sub-second for xCluster async (with consistency caveats) HIGH | Cross-region 2PC pays multi-RTT, most multi-region TiDB is cluster-to-cluster async HIGH | RTT at commit, sub-statement local execution, OCC abort at commit MEDIUM | TrueTime commit-wait + Paxos, leader-side fast / follower-side slow MEDIUM |
| Hot-row contention behavior | SERIALIZABLE → transaction restart, can cascade HIGH | Similar restart pattern, slightly better via pessimistic locking option MEDIUM | OCC default storms aborts, pessimistic mode mitigates HIGH | OCC abort at commit, no early failure detection CRITICAL | Pessimistic locking + 2PC, lock contention but no retry storm MEDIUM |
| Operational complexity (self-managed) | Single binary, hundreds of knobs, requires CRDB-experienced SRE HIGH | Two services (Master, TServer), three for HA, requires PG + RocksDB expertise HIGH | Three core services (TiDB, TiKV, PD) + TiFlash, highest surface area CRITICAL | N/A — fully managed, no self-hosted option N/A | N/A managed, Spanner Omni is preview-only as of 2026 N/A |
| License / vendor lock-in | CCL closed source, $10M revenue gate for free Enterprise, telemetry mandatory on free tier HIGH | Apache 2.0, genuinely open LOW | Apache 2.0, genuinely open LOW | AWS-only, no multi-cloud path CRITICAL | GCP-only natively (Omni preview opens this), Google-managed proprietary HIGH |
| Cost predictability | Per-vCPU licensing on Enterprise, predictable but not cheap MEDIUM | Compute + storage on managed, free self-hosted LOW | Managed pricing similar to CRDB Cloud, self-hosted is the cost win MEDIUM | DPU billing notoriously hard to forecast, query changes can 3x cost CRITICAL | Processing-unit billing predictable but expensive, multi-region is 3x regional HIGH |
| Analytics / OLAP capability | Not designed for OLAP, doc says so, follower reads help but limited HIGH | Same, no columnar engine, recommend separate OLAP store HIGH | TiFlash columnar replica gives real HTAP, the only one with built-in OLAP LOW | No analytics workload, OLTP only by design CRITICAL | Spanner Data Boost separates analytics traffic from OLTP, BigQuery Zero-ETL closes the gap MEDIUM |
| Cold-start / scaling-up latency | Manual node provisioning, minutes to add capacity MEDIUM | Manual node provisioning, similar to CRDB MEDIUM | Manual node provisioning, PD scheduling needs time to rebalance MEDIUM | Serverless, sub-60s cluster provision, scales automatically LOW | Autoscaling (Enterprise+ tier) automatic, processing-unit adjustments seconds LOW |
| Maximum cluster size in practice | Hundreds of nodes published, Cockroach Labs ref customers in low hundreds MEDIUM | Hundreds of nodes published, comparable to CRDB MEDIUM | Thousands of TiKV nodes at ByteDance scale LOW | Single cluster, 256TiB max storage as of Nov 2025 (raised from 128TiB) MEDIUM | Petabyte-scale validated by Google's internal F1 deployment LOW |
4. Fault Tolerance
Replication model is the headline; failure detection time and split-brain behavior are what determine your incident postmortem template.
| Dimension | CockroachDB | YugabyteDB | TiDB | Aurora DSQL | Spanner |
|---|---|---|---|---|---|
| Replication model | Raft per range, default RF=3, leaseholder serves reads | Raft per tablet, default RF=3, leader-lease-protected reads | Raft per region in TiKV, default RF=3, leader serves reads | Disaggregated, Journal-replicated across 3 AZs, no traditional consensus | Paxos per split, RF configurable (default 3 zones, multi-region adds more) |
| Failure detection | Liveness heartbeats, ~9s for unavailability, configurable | Tablet peer heartbeats, ~3s detection | PD heartbeats, ~10s for region leader replacement | Control plane health checks, transparent to client | TrueTime + Paxos heartbeats, sub-10s detection in practice |
| Failover mechanism | Raft re-election, leaseholder moves, no client config change | Raft re-election + leader-lease wait, no client change | Raft re-election in TiKV + PD rebalance, no client change | Automatic re-routing to healthy QP/Journal/Adjudicator, invisible to client | Paxos re-election, automatic, no client change |
| RTO (typical) | ~9s for node loss, 30s+ for AZ-wide event due to lease churn | ~3-10s for node loss, depending on lease parameters | ~10-30s for TiKV node loss, PD election adds latency for region | Sub-second for AZ failure (multi-AZ active by design) | Zero RTO for zone failure (Spanner Omni doc claims, native Spanner sub-second) |
| RPO (typical) | Zero (synchronous Raft commit) | Zero for sync replication, seconds for xCluster async | Zero (synchronous Raft commit) | Zero (Journal acknowledges before client sees commit) | Zero (Paxos quorum commit) |
| Split-brain behavior | Minority partition rejects writes, leaseholder check on reads | Leader-lease prevents two leaders, minority rejects writes | Minority partition rejects writes, PD election quorum-protected | No leader, conflicts surface as OCC aborts at commit | Paxos quorum, minority side cannot make progress |
| Single-node failure blast radius | Hundreds of ranges may lose leaseholder, ~9s of localized read latency | Tablets where node was leader briefly unavailable for writes | Regions where node was leader briefly unavailable for writes | One QP loss is invisible (request re-routes), one AZ loss is sub-second | Splits where node was leader unavailable briefly, others unaffected |
| Cross-region failover | Manual via zone-config changes, can be fully automated with survival goals | Automatic for sync clusters, manual cutover for xCluster async | Manual cluster-to-cluster failover for the typical async multi-region setup | No failover step — multi-region is active-active, traffic shifts naturally | Automatic for multi-region instances (Enterprise Plus tier) |
| Data loss scenarios | Simultaneous loss of quorum in a range (3-AZ simultaneous failure) | Same as CRDB for sync, xCluster async can lose unreplicated writes | Quorum loss in TiKV region, or PD majority loss for cluster metadata | Multi-AZ Journal quorum loss; multi-region cluster pair both lost | Multi-region quorum loss (rare given replica count and geographic spread) |
| Published availability SLA | No managed-service SLA on self-hosted; CockroachDB Cloud Advanced has tier-specific SLAs | YugabyteDB Aeon (managed) has tier SLAs; self-hosted depends on operator | TiDB Cloud Dedicated 99.99%, self-hosted depends on operator | 99.99% single-region, 99.999% multi-region | 99.99% regional, 99.999% multi-regional (Enterprise Plus) |
6. Replication
Topology, sync vs async, and the cross-region story. The first three are Raft-similar; the last two diverge.
| Dimension | CockroachDB | YugabyteDB | TiDB | Aurora DSQL | Spanner |
|---|---|---|---|---|---|
| Replication topology | Single leader per range (leaseholder), Raft followers | Single leader per tablet, Raft followers, read replicas opt-in | Single leader per region in TiKV, Raft followers, TiFlash columnar replica | No leader; Journal replicates writes, Adjudicator resolves conflicts | Single Paxos leader per split, witness replicas for quorum |
| Sync vs async | Sync within Raft group, async to non-voting replicas (follower-reads) | Sync within Raft group, async via xCluster for cross-region | Sync within TiKV Raft group, async via CDC-based cluster replication | Sync across AZs, sync across regions at commit (multi-region peered) | Sync via Paxos, both regional and multi-regional are synchronous |
| Replication factor (default / max) | 3 default, configurable per zone | 3 default, configurable via tablespace per fault domain | 3 default, configurable per region/label | 3 across AZs fixed (multi-region adds full second cluster) | 3 default per region, multi-region adds zones for additional copies |
| Consistency level options | SERIALIZABLE (default), READ COMMITTED (opt-in 24.x+), follower reads with bounded staleness | SERIALIZABLE, SNAPSHOT, READ COMMITTED, follower reads with bounded staleness | SNAPSHOT (default in OCC), READ COMMITTED via pessimistic txn | Snapshot isolation, linearizable reads | External consistency (default), bounded staleness reads, exact staleness reads |
| Replication lag (typical) | Sub-ms within Raft group, configurable seconds for follower reads | Sub-ms within Raft, ~sub-second for xCluster async | Sub-ms within Raft, ~seconds for cross-cluster CDC | Zero (all writes durable at commit before ack) | Sub-ms commit-wait dominates, sub-second cross-region |
| Conflict resolution | Raft serializes all writes, no conflicts to resolve | Same as CRDB for sync; xCluster async uses last-writer-wins at WAL level | 2PC commit ordering, no conflicts in sync path | OCC, conflicting writer aborts at commit time | 2PC with TrueTime ordering, no conflicts in commit path |
| Cross-region replication | Sync via Raft (high latency) or async via Change Data Capture | Sync (high latency) or async via xCluster (LWW conflicts allowed) | Mostly async cluster-to-cluster, 2PC available but rarely deployed multi-region | Sync at commit, multi-region peered clusters are first-class | Sync via Paxos with TrueTime, multi-region is first-class |
| Replication during partition | Minority Raft side rejects writes, majority side serves | Same, minority Raft side blocked, leader-lease prevents stale reads | Minority TiKV region blocked, PD must have majority | OCC means partitioned writes will conflict at commit, transactions abort | Paxos minority blocked, majority continues |
7. Better Usage Patterns
Five patterns per tech. The anti-pattern half of each row is the one that shows up in code review; the better-way half is the one experienced operators recommend.
CockroachDB
Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Primary key design | Auto-increment INT or timestamp PK migrated from Postgres unchanged | UUID or hash-sharded composite key from day one, never monotonic | Avoids the single-hot-range write bottleneck that becomes a Saturday-night oncall |
| Transaction retries | Catch generic exceptions, retry once, hope it works | Use the official retry wrappers (savepoint cockroach_restart), exponential backoff, surface aborts as a metric | SERIALIZABLE means retries are part of the normal control flow, not an exception path |
| Follower reads | Treat them as "free strong reads" for read-heavy workloads | Use AS OF SYSTEM TIME or follower-read clause explicitly, document the staleness bound per query class | Stale reads with no documented staleness will be the root cause of a data-correctness postmortem within 6 months |
| Geo-partitioning | Set up survival goals and zone configs once, forget the schema-level enforcement | Add CHECK constraints that prevent cross-region rows from being inserted into the wrong partition | Stops "compliance accidents" at the schema layer, not in code review |
| Schema change cadence | Manual psql ALTER TABLE during a maintenance window | Use online schema changes (DDL is non-blocking by design in CRDB), but stage in non-prod with the same row counts | CRDB does online DDL well; the failures are usually about under-tested constraints, not the DDL itself |
| Connection management | Use a single pgbouncer pool sized for the largest node | Use SQL-aware proxy (cockroach-sql-proxy or CockroachDB Cloud's load balancer), sized per-node not per-cluster | Cross-node connection pooling adds tail latency from lease redirects; SQL-aware routing keeps queries node-local |
YugabyteDB
Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Hash vs range sharding | Accept hash default for everything, then complain range scans are slow | Use HASH for OLTP point lookups, RANGE explicitly for time-series and ordered queries | The default is right for 80% of cases; the other 20% need an explicit ASC PRIMARY KEY (col, col) SPLIT INTO ... clause |
| Connection pooling | Direct PG connections from app, hundreds of YSQL backends each consuming RAM | YugabyteDB Smart Driver + connection pool, target 10-50 active connections per node | YSQL inherits Postgres' per-connection process model; without pooling, RAM and context switches kill the cluster |
| xCluster active-active | Use auto-increment PKs and unique constraints, "fix it later" | UUIDs for PKs, no unique constraints on indexed columns, application-level conflict reconciliation logic | The docs literally warn about this; conflicts are silent and you find them in customer reports |
| Geo-partitioning + FKs | Try to enforce FKs across partitioned tables, hit the partition + FK limit | Co-locate referenced rows in the same geo-partition, denormalize across regions if needed | The schema-level limitation is real; pretending it isn't leads to insertion failures or silent missing FK semantics |
| Bounded staleness reads | Run all reads strong, build a "read replica" via xCluster for analytics | SET yb_read_from_followers = true with explicit staleness; mix in same session as needed | Avoids building a parallel infrastructure for what is a per-query parameter; saves substantial cost and complexity |
| YB-Master sizing | Treat YB-Master like a metadata service that needs minimal resources | Size YB-Master for the catalog operations of your cluster; CPU spikes during heavy DDL or split storms | An underpowered Master becomes the bottleneck during incident recovery, exactly when you can't afford it |
TiDB
Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| OCC vs pessimistic | Leave the default and assume MySQL-compatible behavior | Use pessimistic locking (set transaction default in 5.0+) for OLTP, OCC only for low-contention bulk paths | The default has shifted historically; teams who adopted on 4.x default OCC and never re-evaluated are still seeing abort storms in 2026 |
| TiFlash usage | Add TiFlash, hope the optimizer picks it for analytics queries | Use isolation_read_engines session variable per workload class, force OLTP to TiKV and OLAP to TiFlash explicitly | Optimizer mistakes are the #1 HTAP problem in TiDB; manual routing per session class avoids them |
| Auto-incrementing PK | Use AUTO_INCREMENT or AUTO_RANDOM without thinking | AUTO_RANDOM for OLTP tables to avoid the last-region hot-shard problem, AUTO_INCREMENT only for analytics-style tables | TiDB allocates AUTO_INCREMENT in node-local batches but the last-region hot-spot is still real for sequential reads |
| PD placement | Place PD on the same nodes as TiKV for cost reasons | Dedicated PD nodes in their own AZ, sized for the TSO throughput your cluster needs | PD is on the critical path for every transaction; co-locating with TiKV creates cascading failures during a single node loss |
| Region size tuning | Stay on 96MB default for the largest clusters | Increase region size to 256MB or 512MB at >10TB scale to reduce PD metadata overhead | Region count grows with data size; PD scheduling and TiKV heartbeat overhead grows linearly with region count |
| Cross-region deployment | Stretch a single cluster across regions and accept high write latency | Separate cluster per region with CDC-based async replication (TiCDC), accept eventual cross-region | The vast majority of production multi-region TiDB is region-local with async replication; pretending otherwise produces unusable write latency |
Amazon Aurora DSQL
Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Transaction shape | Migrate batch jobs as single transactions, hit the 3000-row limit in production | Chunk into 1000-row batches with idempotency keys, design every job to be resumable | The 3000-row limit is intentional; pretending it's a quota leads to constant retries instead of correct chunking |
| Foreign key replacement | Skip FKs, hope the ORM enforces them | Application-level referential integrity with explicit consistency checks at transaction boundaries | Without FKs, the database cannot prevent orphan rows; the application must own this invariant explicitly |
| Cost monitoring | Watch the monthly bill, react when it spikes | Treat per-query DPU consumption as a SLO; alert on Compute/Read/Write DPU breakdowns separately | A subquery refactor can 3x DPU consumption silently; only per-query observability catches it before the bill |
| Index strategy | Add indexes the same way as Postgres, hope query plans work | Use covering indexes aggressively to enable index-only scans; EXPLAIN ANALYZE VERBOSE is mandatory before deploy | Storage is cheap, DPU is not; covering indexes that eliminate table scans pay for themselves in days |
| OCC conflict handling | Catch OCC abort, retry once, give up | Exponential backoff, jitter, partition-aware retries, observability on conflict-prone keys | OCC aborts are signal, not noise; high abort rate on a key means redesign, not retry harder |
| Multi-region cost | Enable multi-region for "availability" without measuring write cost impact | Multi-region only when active-active is the actual requirement; single-region multi-AZ is 99.99% and cheaper | Multi-region writes 2x DPU; many workloads enable it for marketing reasons and pay continuously |
Google Cloud Spanner
Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Schema design | Treat Spanner like Postgres, design schemas without interleaved tables | Use INTERLEAVE IN PARENT for child rows that are always accessed with the parent (orders within users) | Interleaved tables keep co-accessed rows physically together, eliminating 2PC for the common case |
| Read patterns | Default to "strong reads" for everything, pay TrueTime commit-wait on every read | Use stale reads with bounded staleness (e.g., 10s) for non-critical paths, strong only when ordering matters | Stale reads can be served from any replica without quorum, lowering p99 and cost simultaneously |
| Primary key design | Use auto-incrementing INT64 or timestamp PK from Postgres habit | UUIDv4 or hash-prefixed PK to avoid hotspot on monotonic inserts | Like all range-partitioned systems, monotonic PKs create a hot last-split that takes all writes |
| Cross-region instance choice | Pick multi-region instance for "availability," accept the 3x cost | Regional instance with cross-region read replicas (where available) when writes are region-local | Multi-region is for true global writes; regional+replica is cheaper for read-anywhere/write-here workloads |
| Edition choice | Default to Standard, hit a feature gap later (graph, vector, geo-partitioning) | Pick Enterprise or Enterprise Plus from day one if any multi-model or geo-partition feature is in the roadmap | Edition upgrades are possible but per-feature licensing makes the math hard to undo |
| Connection management | Use the default Spanner client session count without tuning | Enable multiplexed sessions (default in 2025+ clients) for high-concurrency workloads, sized to actual QPS | Session exhaustion is a quiet failure mode; multiplexed sessions eliminate it in 80% of cases |
8. Advanced / Next-Gen Alternatives
Where each technology might be replaced, augmented, or obviated. Three rows per tech, weighted to the alternatives that an L7 review would mention.
CockroachDB
Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| YugabyteDB | Apache 2.0 license, higher PG feature parity, no revenue gate | Production at scale | Medium, Postgres-compatible but DDL and tuning differ | License change forced re-evaluation, or PG ecosystem features (extensions, triggers) are required |
| Aurora DSQL | Eliminates self-managed operations on AWS, scale-to-zero billing | GA May 2025 | High, requires substantial app refactor (no FKs, no triggers, txn limits) | AWS-only strategy, greenfield apps, workload tolerant of OCC and DSQL limits |
| Postgres + Citus (now Microsoft-owned) | True Postgres with horizontal sharding via Citus; lower license risk than CRDB Enterprise | Mature, Azure-blessed | Medium, app-side awareness of distribution columns | Multi-tenant SaaS where tenant is the natural shard key, want real Postgres |
| Vitess (MySQL-shaped) | If you can switch to MySQL, Vitess is more mature and broadly deployed for horizontal scale | Mature (YouTube, Slack, GitHub) | Very high if migrating from PG; near-zero if already on MySQL | Existing MySQL workload outgrowing single-primary, OSS-only mandate |
YugabyteDB
Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| CockroachDB | Slightly more mature commercial support and ecosystem in North America | Production | Medium, similar architecture but DDL and tuning differ | Procurement requires a US-headquartered vendor with tier-1 enterprise support |
| Aurora DSQL | Serverless economics, no cluster ops | GA | High, PG feature gap is wider than YB's | AWS-native, greenfield, smaller feature surface acceptable |
| Spanner with PG dialect | External consistency, true multi-region writes, multi-model in one engine | Production | High, GCP-specific and edition-tier-dependent | GCP commitment exists, correctness guarantees are a procurement requirement |
| Postgres + pglogical / Patroni / Citus stack | Real upstream Postgres, no fork to track | Mature | Medium, requires assembling sharding + HA from parts | Postgres-purist team unwilling to accept any fork, accepting higher operational burden |
TiDB
Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Vitess | Preserves MySQL semantics more strictly, more mature in North American hyperscalers | Mature | Medium, both are MySQL-protocol, but Vitess sharding is app-aware | Existing MySQL with complex stored procs and triggers TiDB doesn't fully support |
| PlanetScale (Vitess + serverless) | Serverless MySQL with branching, no cluster ops | Production | Low for MySQL apps that fit PlanetScale's deferred schema model | Want managed serverless on MySQL ecosystem, OK with PlanetScale's foreign-key constraints |
| MySQL HeatWave (Oracle) | HTAP via in-memory analytics engine, deep MySQL semantic parity | Production | Low for OCI users, medium otherwise | HTAP need with strict MySQL parity, willing to use Oracle Cloud Infrastructure |
| SingleStore (formerly MemSQL) | Row + columnar in one store, MySQL wire protocol, real-time analytics | Production | Medium, MySQL wire but different SQL semantics | Mixed OLTP + sub-second analytics, want a more analytics-leaning HTAP than TiDB |
Amazon Aurora DSQL
Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Aurora Postgres Limitless Database | True Postgres compatibility (FKs, triggers, JSONB, extensions) with horizontal sharding | Production | Low for traditional PG apps, requires shard-key thinking | Need full Postgres features but want AWS-native horizontal scale; multi-region not the priority |
| CockroachDB Cloud / YugabyteDB Aeon on AWS | More mature feature set, multi-cloud option | Production | Low to medium | DSQL feature gaps (FKs, triggers, batch transactions) are dealbreakers |
| DynamoDB with PartiQL | If you can model as key-value, DynamoDB is mature, cheaper, and has no transaction-row limits | Mature | High, data model change required | Workload is actually KV-shaped and SQL was used out of habit, not necessity |
| Spanner Omni on AWS (preview) | External consistency and multi-cloud, the only credible alternative for DSQL's correctness story | Preview 2026 | High, GCP-tied tooling and dialect | Multi-cloud strategic, willing to accept preview-stage software |
Google Cloud Spanner
Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Spanner Omni | Same engine, multi-cloud (on-prem, AWS, Azure) | Preview as of 2026 | Low within Spanner family | Multi-cloud mandate or GCP-exit risk hedge; willing to track preview features |
| CockroachDB / YugabyteDB | Multi-cloud today, lower cost at small-to-mid scale, no GCP dependency | Production | Medium to high, depending on Spanner-specific features used | Cost optimization, multi-cloud, willing to operate the cluster yourself |
| TigerBeetle (for financial workloads) | Single-purpose double-entry accounting at sub-ms p99, deterministic and dedicated to ledger correctness | Emerging, early production | High, dedicated financial schema, separate from general OLTP | Pure financial-ledger workload where Spanner's general-purpose overhead doesn't earn its keep |
| FoundationDB + layers | Lower-level, transactional KV at extreme scale with stronger isolation guarantees | Production (Apple, Snowflake) | Very high, SQL is not native, build layers yourself | Platform-engineering bet, building your own DB-as-a-service on top of a strong KV substrate |