Distributed SQL: Five-Way Trade-Off Analysis

CockroachDB, YugabyteDB, TiDB, Amazon Aurora DSQL, and Google Cloud Spanner. The architectural decisions that separate them, and the ones interview committees actually probe.

Distributed SQL Strongly Consistent PE Depth

As of 2026-06-02. Aurora DSQL GA was May 27, 2025; CockroachDB 24.3 retired the open Core edition in November 2024; Spanner Omni reached preview in 2026.

PE Verdict

These five collapse into three architectural philosophies, not five. Spanner and Aurora DSQL trust physical time (TrueTime, EC2 Time Sync) to avoid coordination; CockroachDB and YugabyteDB use Raft + HLC and treat clock skew as the enemy to be bounded; TiDB uses a single centralized timestamp oracle (PD) Percolator-style. The right question is never "which one scales better." It is: who owns the operational pain (you vs. the cloud), what is the migration story for your existing wire protocol, and what does a partition do to your write SLO. If you cannot answer those three for your workload, you are picking on vibes.

Best default choices

CockroachDBDefault when PostgreSQL compatibility, self-managed portability, and serializable correctness matter YugabyteDBDefault for Apache 2.0 distributed SQL with PostgreSQL APIs and hash-sharded OLTP scale TiDBDefault when MySQL compatibility and HTAP-style scale-out matter more than Postgres semantics Aurora DSQLDefault for AWS-native serverless PostgreSQL-compatible multi-region workloads SpannerDefault for cloud-managed external consistency, global scale, and TrueTime-backed correctness

1. Trade-Offs

One table per technology. The trade-offs themselves diverge, so a matrix would flatten the most important distinctions. Each row is a concrete X-for-Y decision baked into the design, not a feature.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

Raft per range · HLC · Pebble · CCL

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
SERIALIZABLE isolation by default	Strongest correctness floor of any SQL DB, no anomaly catalog to memorize	Higher abort rate and retry burden than READ COMMITTED, especially on hot rows	Inventory decrement on a top SKU during a sale, every concurrent decrement is a transaction restart	The 24.x line did add READ COMMITTED as opt-in, but it is a behavior trap: most teams enable it cluster-wide and lose the correctness story that justified picking Cockroach in the first place
HLC instead of GPS-disciplined time	Runs on any cloud, any laptop, no special hardware dependency	Uncertainty interval (default 500ms max_offset) forces "uncertainty restarts" when reads land inside it	NTP drift on a noisy VM pushes one node 600ms off, the node shuts itself down rather than risk a serializability violation	This is why AWS Time Sync adoption matters for Cockroach in practice. Tightening max_offset reduces restart amplification, but the safe default trades latency for safety
Per-range Raft groups (MultiRaft)	Range-level placement, replication factor, and lease control without resharding	Hundreds of thousands of Raft groups per node, each with its own heartbeat traffic	Idle clusters burn CPU on heartbeats, dense node failures cause Raft thundering herds on leader re-election	The MultiRaft optimization batches heartbeats, but the engineering cost shows up in Raft tuning flags most teams never touch and Cockroach support has to walk them through
Postgres wire compatibility (partial)	Drop-in for psql, pgx, JDBC, most ORMs, lift-and-shift from PG looks easy	No PG extensions (PostGIS, pgcrypto, pgvector via separate path), different query planner, divergent EXPLAIN output	Existing app uses LISTEN/NOTIFY, advisory locks, or PL/pgSQL with side effects, none of which port	"Postgres-compatible" is a marketing surface. Real test is your slowest 100 queries on production data, not the schema migration
Geo-partitioning by row	GDPR/data-residency at row granularity, no app-level routing	Locality constraints become a schema concern, cross-region joins fan out to wide-area latency	A "show me my orders" query joins a US-pinned user to an EU-pinned order row, p99 jumps from 5ms to 180ms	The PE move is to forbid cross-partition queries at the schema level via constraints, not policy. Otherwise developers will write them, and the wide-area cost is invisible until on-call
Closed-source license (CCL, post-Nov 2024)	Backed development, dedicated support, all enterprise features in one bundle	No Apache-licensed escape hatch, BSL converts to ASLv2 after four years (slow), $10M revenue cap on free tier with annual renewal	Acquired company is now over the revenue threshold and Cockroach Labs' pricing posture has shifted, your only options are pay, fork an older BSL version, or migrate	This is the single biggest reason to evaluate YugabyteDB alongside. The architecture is similar enough that the differentiator becomes license, ecosystem trust, and operational maturity
Range-based key partitioning	Range scans are fast, ORDER BY on PK is free, time-series append patterns work well	Monotonic primary keys (auto-incrementing IDs, timestamps) create a hot last range that does not auto-split until it is already too late	An event-logging table with timestamp PK takes 80% of cluster writes on a single range during a traffic spike	Use hash-sharded indexes or UUID PKs by default. The "PRIMARY KEY (created_at, id)" pattern is the most common production foot-gun on Cockroach
Leaseholder reads bypass Raft	Single-replica reads at p99 sub-10ms, no consensus round-trip on the read path	Lease transfer on node loss adds tail latency, follower reads are stale by design (default 4.8s)	Node graceful drain or rebalance causes a brief lease-transfer storm, p99 reads spike to 100ms+ for ~30s	Follower reads are the right tool for analytics-on-OLTP, but using them for user-facing reads while claiming "strong consistency" is the most common postmortem item on Cockroach deployments

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

Raft per tablet · HLC · DocDB · Apache 2.0

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Forked PostgreSQL backend (YSQL) on top of DocDB	Highest PG compatibility of the five, real triggers, FKs, stored procs, most extensions	Forked PG 11.2 base (Enhanced Compatibility Mode in v2024.1 closes some gaps), upgrade cadence lags upstream Postgres by years	Application depends on a PG 15+ feature (MERGE, range types with multiranges), and Yugabyte's parity is "soon"	The fork is both the gift and the curse. You inherit PG semantics for free but also inherit PG's per-connection backend process model, which is why connection pooling is non-optional
Two-layer architecture (query / storage)	Storage layer (DocDB) is consistent at the doc level, query layer is stateless and scales independently	Two services to operate, two release notes to read, query layer per-connection memory dominates at high connection counts	Spiky workload at 5000 concurrent connections, YSQL processes consume RAM the storage layer doesn't need	YB-TServer + YB-Master split mirrors TiDB's separation but at finer granularity. Operators frequently misallocate resources between them on first deployment
HLC + bounded clock skew (similar to CRDB)	No special hardware, runs on any cloud, on-prem, or laptop	Same uncertainty interval problem as Cockroach, depends on NTP discipline	One node's NTP daemon misconfigured, divergence triggers tablet leader bouncing	YugabyteDB's leader lease mechanism is stricter than Cockroach's: new leader must wait out old lease, which costs availability during failover but prevents split-brain reads
Apache 2.0 license, genuinely open core	Real fork-and-go option, no revenue gate, no telemetry mandate	Smaller commercial ecosystem than Cockroach Labs, narrower bench of vendor SREs available for hire	You need a deep performance investigation on a Saturday and the public Slack is the fastest path, not a support contract	The license is a real competitive moat against CRDB now, especially for procurement-sensitive enterprises. Don't dismiss it as ideological
Hash sharding by default for tables, range for indexes	No monotonic-PK hot-shard problem out of the box, uniform write distribution	Range scans on a hash-sharded PK degenerate to scatter-gather across tablets	A "list my last 50 orders by created_at DESC" query goes from O(1 tablet) to O(N tablets) when N grows	This is the inverse of CRDB's default. You explicitly opt into range sharding via SPLIT INTO, which is the right default for OLTP but trips up engineers migrating from PG with timestamp PKs
xCluster async replication for two-region active-active	Sub-second async replication for low-write-latency multi-region without the cost of synchronous Paxos across regions	No unique constraint enforcement across universes, sequences conflict, conflict resolution is last-writer-wins at WAL level	Two active universes both accept INSERTs with the same auto-increment ID, indexes silently diverge from main table	The doc literally tells you to use UUIDs and skip uniqueness constraints in active-active xCluster. If that constraint is unacceptable, you don't actually want async, you want sync sync sync and to pay the latency
Read-from-follower with bounded staleness	Geo-local reads with explicit staleness bound (e.g., "no more than 30s old")	Application must opt in per-session, mixed-staleness queries inside one txn are easy to write incorrectly	An analytics dashboard reads followers and shows totals that don't match the OLTP view of the same data, customer reports "the numbers are wrong"	Bounded-staleness is the most underused feature across all five. Most teams either run everything strong or build a separate replica, both more expensive than tuning a staleness bound per query class
Geo-partitioning via PostgreSQL partition syntax	Uses native PG declarative partitioning + tablespaces, familiar to DBAs	No FK references on partitioned tables, all the usual PG partition-pruning gotchas, schema changes ripple across partition tables	A regulated multi-region schema needs FKs between user and orders, you discover the limit during the prod cutover	Tablespace pinning is more transparent than CRDB's zone configs but less expressive. Choose based on whether your team thinks in PG idioms or zone constraints

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

3-component · PD timestamp oracle · TiKV · MySQL wire

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
MySQL wire protocol, not Postgres	Drop-in migration for the largest installed base of OLTP on the planet, every MySQL driver works	Inherits MySQL's looser type system, lock semantics, and SQL dialect quirks, no SERIALIZABLE	Application relied on MySQL repeatable-read gap locks, TiDB's snapshot isolation behaves differently and you get phantom rows	This is the only one of the five with MySQL heritage. If you are not migrating from MySQL or building net-new and standardizing on MySQL ecosystem, this is rarely the right pick
Centralized Placement Driver (PD) issues global timestamps	Percolator-style 2PC works, no clock-skew concerns, simple mental model	PD is a global bottleneck on transaction start, TSO request latency is on the critical path	Cross-region cluster where PD leader is one continent away, every transaction pays a continental round-trip just to begin	PD itself is Raft-replicated for HA, but only the leader serves TSO. Multi-region TiDB is fundamentally region-local with cross-region cluster-to-cluster replication, not true multi-master like Spanner or DSQL
Compute / storage separation (TiDB / TiKV / TiFlash)	Scale stateless SQL layer (TiDB) without rebalancing data, TiFlash columnar replica gives HTAP on the same cluster	Three independent services to operate, each with its own scaling story and failure modes	TiKV is healthy but TiDB layer is OOMing on a bad query, the cluster looks "fine" to PD while users see errors	The three-tier architecture is the cleanest separation of concerns of the five, but it is also the highest operational surface area. Plan for a dedicated TiDB SRE if self-hosted
Optimistic concurrency by default (pessimistic is opt-in)	Better throughput when contention is low, no lock manager overhead	High-contention rows produce abort storms, retry burden moves to the app	Inventory counter with thousands of concurrent decrements, OCC retry loop spikes CPU on TiDB and TiKV both	The recommendation has moved to "use pessimistic by default in 8.x" for OLTP. Teams that adopted OCC in 4.x and never re-evaluated still see the abort storms in 2026
96MB regions, automatic split and merge	Fine-grained rebalancing, PD can move hot regions in seconds	Tens of thousands of regions per node, Raft heartbeat overhead similar to CRDB's MultiRaft problem	A 10TB cluster has ~100K regions, region-related metadata on PD becomes hot, scheduling decisions lag	The "hibernate region" optimization makes idle regions free, but the math still pushes you to bigger region sizes (256MB+) for very large clusters
TiFlash columnar replica for HTAP	Real-time analytics on fresh OLTP data without ETL, single SQL endpoint, optimizer picks engine per query	2x storage cost when TiFlash is enabled, optimizer mistakes can route analytics queries to TiKV and OLTP to TiFlash	Analyst writes a heavy GROUP BY without a hint, optimizer picks TiKV, scan starves OLTP for 10 minutes	HTAP is the most-marketed and least-understood feature. Real customers split workloads by isolation_read_engines session var anyway. The "no ETL" promise is true; the "no operational discipline" implication is not
Apache 2.0 license, real open source	Free fork path, large Chinese-tech adoption (ByteDance, Pinduoduo at hundreds-of-TB scale) gives real production validation	Documentation and ecosystem still feel Mandarin-first in places, support tier expectations vary by region	You need authoritative answers fast and the best blog post on your symptom is on a Chinese tech blog	The PingCAP commercial roadmap (TiDB Cloud Premium / Essential, late 2025) is moving the lift-and-shift complexity into managed service. Worth evaluating against Aurora DSQL for that exact reason
Two-phase commit across regions for cross-region transactions	Real distributed ACID, not "eventually consistent multi-master"	2PC latency = at least 2 cross-region RTTs, prepared but uncommitted transactions block	A coordinator failure during 2PC leaves transactions in prepared state, you discover them in alerts hours later	The cluster-to-cluster (CDC-based) async replication path is what most production multi-region TiDB looks like, not true multi-region 2PC. Frame the choice as "sync within region, async across" honestly

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

Disaggregated · OCC · EC2 Time Sync · Serverless

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
OCC, cross-region coordination only at COMMIT	Reads and writes execute locally on Query Processor, no per-statement WAN latency, multi-region writes sub-second	Conflicting transactions abort at commit time, retry burden lives in the app	Two regions both write to the same logical row, both look successful until COMMIT, one gets an OCC abort with no prior warning	This is the single biggest architectural bet in DSQL. It pays off when contention is low and writes are partition-friendly, and falls over when you have hot rows, in which case Aurora DSQL is the wrong tool, not "needs tuning"
EC2 Time Sync as the clock substrate	Microsecond-bounded physical time across AWS without owning atomic clocks, like Spanner's TrueTime without the on-prem hardware	AWS-only forever, no path to multi-cloud, time-sync infra is AWS-internal	Compliance or strategic mandate forces multi-cloud, DSQL cannot follow	This is why DSQL exists as a separate service from Aurora Postgres. The Time Sync dependency is intrinsic to the architecture, not an implementation detail
3000-row, 10MiB, 5-minute transaction cap	Prevents head-of-line blocking at Adjudicator and Journal, predictable system-wide latency	Batch jobs, large ETL, schema migrations on big tables cannot run as single transactions	A nightly batch that worked on Aurora Postgres has to be re-architected as N chunked transactions with idempotency, app changes scope balloons	The 3000-row limit is the most honest design statement of any of the five. It is not a quota, it is a contract: DSQL refuses to be your batch processor
No foreign keys, no triggers, no stored procs, no JSONB (as of 2026)	Forces all referential integrity and business logic into application code, makes the database genuinely stateless behavior-wise	Most existing Postgres apps require substantial rewrite, ORM features that assume FK constraints break silently	You migrate, FKs no-op, an orphan row is inserted, you discover it in a customer report next quarter	This is the bet that "the database is for data, the app is for logic." Defensible architecturally, brutal in migration. The 2025 add of views and unique indexes signals AWS hears the feedback
Serverless with scale-to-zero	No instance sizing, no idle cost, no PG Bouncer (no connection pooler needed), 10K default connections per cluster	DPU billing is hard to forecast, query plan changes can 3x cost without a code change	A subquery rewrite during a refactor turns into 3x DPU consumption, monthly bill spikes, no committed-use discount to fall back on	The CloudWatch DPU breakdown (ComputeDPU, ReadDPU, WriteDPU, MultiRegionWriteDPU) is the most important observability surface. Treat it like a SLO, not just a billing metric
Active-active multi-region, no leader	99.999% multi-region availability SLA, no failover step, both endpoints accept writes concurrently with strong consistency	2x multi-region write DPU cost (writes replicate to both regions), cross-region latency surfaces as commit latency	A 150ms RTT between regions makes every commit pay the full RTT, write throughput per connection drops accordingly	The cost model means "active-active" is not free in DSQL. Many teams will discover their workload is fundamentally region-local and they were paying for global by accident
Postgres 16 subset, not full Postgres	Familiar wire protocol, drivers, basic SQL, EXPLAIN works	No extensions (no pgvector, no PostGIS, no pgcrypto), no sequences, no temp tables	App uses UUIDs from `gen_random_uuid()` (works), then needs PostGIS spatial join (doesn't), and you discover during integration	The "Postgres compatible" claim is the same shape as Cockroach's. Test on your actual query workload, not the wire protocol
GA'd May 2025, still maturing the feature gap	Aggressive roadmap, new features monthly (views, unique indexes, PrivateLink, FIS integration in 2025)	Roadmap-driven decision making, features assumed at decision time may not arrive when promised	You promised stakeholders FK support "soon" based on AWS signal, two quarters later it is still not there	The aggressive launch cadence is genuine, but PE-grade decision making requires assuming today's feature set, not the roadmap. Aurora DSQL in mid-2026 is not the Aurora DSQL of late 2027

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

TrueTime · Paxos · 2PC · External Consistency

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
TrueTime (GPS + atomic clocks) for external consistency	The strongest correctness guarantee available in any commercial SQL DB, transactions appear to execute in real-time order globally	Commit-wait pays the TrueTime uncertainty bound (typically a few ms) on every cross-machine write	Tight-loop write benchmarks look slower than Cockroach because commit-wait is unhidden	The commit-wait is what makes the system honest. Cockroach's uncertainty restarts are the same bound paid as a restart cost instead. Spanner amortizes it as wait time, Cockroach amortizes it as retry rate. Same physics, different UX
Paxos per split, leader-based writes	Proven, deployed inside Google for 15+ years, the most validated distributed consensus implementation outside Raft	Leader-based writes mean cross-region writes have one fast region (the leader) and one slow (followers)	A multi-region instance has unbalanced write latencies by design, EU writers see 50ms while US writers see 5ms	Spanner's "regional vs multi-regional" config choice is more important than its TrueTime story for actual workload economics. Pick wrong and you pay 5x for the wrong shape of consistency
GoogleSQL dialect (PostgreSQL dialect now available)	Two SQL surfaces, PG dialect supports most ORMs and existing PG apps, GQL graph language for Spanner Graph	Both dialects diverge from upstream Postgres, schema features (interleaved tables) are Spanner-specific	You design schemas using interleaved tables for performance, later cannot port back to vanilla Postgres without redesign	The PG dialect is a recent (2022+) compatibility layer, not the native language. For new builds, GoogleSQL is the better-supported and more honest choice unless multi-DB compat is a hard requirement
Snapshot reads at any timestamp from any replica	Globally consistent read-only transactions at the local replica without leader round-trip	Replica "safe time" can lag if it has not seen recent writes, reads may block waiting for safe time to advance	A snapshot read at "now" on a replica that is behind blocks for hundreds of ms during a write-heavy moment	Stale reads ("stale" by a few ms intentionally) are the production pattern. "Strong reads" (current timestamp) are correctness theater for most read paths
Multi-region SLA of 99.999%	The highest published SLA in the category, Google takes the operational ownership	Premium pricing, multi-region instances cost ~3x regional, processing-unit billing rewards continuous low utilization	A spiky workload pays for steady-state PUs that mostly idle	The Editions split (Standard / Enterprise / Enterprise Plus) means the SLA you actually get depends on a procurement decision, not just the deployment topology. Read the SLA per edition
GCP-only (until Spanner Omni in 2026 preview)	Deepest possible integration with Vertex AI, BigQuery (Zero-ETL), Pub/Sub, IAM	Multi-cloud is now possible via Spanner Omni preview but is not Spanner in production parity	You commit to Spanner, your company gets acquired by an AWS-only shop, migration is multi-quarter	Spanner Omni's software-defined TrueTime is the most interesting 2026 development in this space. Whether it reaches feature parity with managed Spanner is the bet that decides multi-cloud distributed SQL for the next 5 years
2PC across Paxos groups for multi-split transactions	Real ACID across the entire schema, not just intra-shard	2PC + Paxos + commit-wait stacks: cross-split transactions are visibly slower	A transaction that touches 5 splits pays 5x the prepare overhead, OLTP latency p99 climbs from 8ms to 40ms	The interleaved tables feature is the workaround: keep child rows physically co-located with parent so common transactions stay single-split. Most Spanner experts will say "schema design is the only optimization that matters" and they are right
Multi-model (relational + graph + vector + full-text)	Single system for OLTP + GraphRAG + semantic search, no ETL between specialized DBs	Each new model is "good enough" rather than best-in-class against Neo4j, Pinecone, Elasticsearch	You picked Spanner Graph for the consolidation story, later your data scientists need Cypher-only features that GQL doesn't have yet	The multi-model story is a real differentiator, but maturity per model varies. Spanner Graph GA'd in 2025, vector indexes for PG dialect GA'd recently. Treat each model as its own production-readiness check

2. Use Cases

Real workloads with the driving property that ruled out the obvious alternative. Generic "scalability" is not a driving property.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Multi-region SaaS control plane	DoorDash, Netflix, Shipt have publicly used CRDB for control-plane storage	SERIALIZABLE isolation across regions without a custom 2PC layer	Tens of TB, 10K+ QPS sustained, p99 under 25ms in-region	Postgres + sharding requires building the distributed-transaction story you would get for free
Geo-residency compliance (GDPR, India DPDP)	Mid-size fintech with EU and India users in one logical app	Row-level pinning to a region, with one schema and one connection string	1-2TB per region, 5K QPS, hard residency boundary	YugabyteDB is the close competitor; CRDB wins on Postgres-extension expectation and zone-config expressivity
Lift-and-shift from Postgres at scale	Existing PG monolith hitting single-primary write ceiling	Postgres wire compatibility plus horizontal write scaling without re-architecting the app	Outgrew a 96-vCPU primary, sustained 30K writes/sec, 5TB and growing	Aurora Postgres scales reads via read replicas but not writes, sharding via Vitess is MySQL-only
Inventory and ordering systems with strong correctness	E-commerce with thousands of concurrent decrement transactions on hot SKUs	Default SERIALIZABLE prevents oversell without app-side locking gymnastics	5K orders/sec at peak, hot SKU contention is the design hazard	MongoDB and Cassandra cannot give the correctness guarantee, would require app-level inventory service
Hybrid on-prem and cloud deployments	Regulated industries (banking, healthcare) wanting cloud agility plus on-prem fallback	Same engine on bare metal, EKS, GKE, no vendor lock to a cloud provider	Multi-cluster, 50+ nodes per cluster, mixed deployment environments	Spanner is GCP-only (Spanner Omni preview as of 2026), DSQL is AWS-only
Distributed audit logs with point-in-time recovery	Compliance-heavy workloads needing tamper-evident history	Time-travel queries (AS OF SYSTEM TIME) for free, MVCC retention configurable	Years of history retained, queries against 30-day-old snapshots in production	Postgres needs separate WAL archiving + restore choreography, distributed competitors lack the AS OF clause

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
PostgreSQL replacement for write-scale-out	SaaS hitting PG vertical scaling limit, ORM and stored procs in use	Highest PG feature parity of the five, real triggers and FKs and extensions	5-20TB, 10K+ writes/sec, 100s of stored procedures to preserve	CockroachDB lacks triggers and most extensions, Aurora DSQL is a far thinner PG subset
Multi-cloud distributed SQL	Enterprise mandate to avoid AWS / GCP lock-in on the system of record	Apache 2.0 license, runs identically on EKS, GKE, AKS, bare metal	Hundreds of nodes across two clouds, active-passive xCluster between	Spanner is GCP-only, DSQL is AWS-only, CockroachDB now requires a paid Enterprise license above $10M revenue
Financial systems with geo-partitioning	Bank with strict per-jurisdiction data residency	Row-level geo-partitioning via standard PG declarative partitioning syntax	Multiple regulatory regions, single schema, single connection string	App-level routing with PG is operationally tractable but auditable-by-design is harder to demonstrate
Telco subscriber data store	Mobile carrier with regional billing systems, 99.99% SLA	Synchronous replication across 3 AZs, follower reads for nearby low-latency lookup	Hundreds of millions of subscribers, sub-10ms p99 reads, sustained MGT writes	Cassandra gives availability but no SQL, MongoDB lacks the strong-consistency guarantees needed for billing
Active-active two-region apps	SaaS needing low write latency in both US and EU regions	xCluster bidirectional async replication, accept eventual conflict resolution to gain local write speed	Sub-5ms in-region writes, ~1s cross-region replication lag	Spanner multi-region pays commit-wait for true active-active; DSQL pays full RTT at commit; both more expensive
Edge-adjacent IoT or device data	Manufacturing or IoT with regional clusters synced asynchronously	xCluster supports many-to-one fan-in for centralized analytics	30+ edge clusters streaming to a central reporting cluster	Building this on Postgres requires logical replication + custom conflict handling per table

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
MySQL replacement at scale	ByteDance, Pinduoduo, Shopee have used TiDB for MySQL replacement at hundreds of TB	MySQL wire compatibility, horizontal scale, no Vitess operational burden	Hundreds of TB, 100K+ QPS sustained, MySQL ecosystem fully reusable	Vitess preserves MySQL but inherits sharding complexity, AWS RDS MySQL caps out at the single-primary ceiling
HTAP on operational data	E-commerce running analytics on fresh OLTP via TiFlash columnar replica	Real-time analytics without an ETL pipeline, single SQL endpoint routes per query	Operational DB also serving dashboards with sub-minute freshness	Separating OLTP and OLAP into PG + Snowflake adds an ETL pipeline and 5-minute-plus freshness gap
Game backend state	Mobile and console game studios needing transactional state at scale	MySQL compatibility (most game backends are MySQL-first), strong consistency for in-game economy	Millions of concurrent sessions, transactional inventory and currency operations	Cassandra and Redis cannot give the transactional guarantees, MySQL needs sharding
Financial settlement and reconciliation	Fintech with MySQL legacy, regulatory pressure for ACID across larger data sets	Strong consistency with MySQL semantics, audit-friendly	10s of TB, sustained pessimistic-transaction throughput	Aurora DSQL has no triggers or FKs and a 3000-row transaction cap; both are dealbreakers for batch reconciliation
SaaS multi-tenant OLTP	Vertical SaaS replacing per-tenant MySQL shards with one logical TiDB cluster	Eliminate per-tenant shard sprawl, simplify connection management, single SQL plane	1000s of tenants, GB-to-TB per tenant, MySQL drivers everywhere	PG-based options force MySQL→PG migration on top of distribution; TiDB removes that variable
Asia-Pacific deployments with PingCAP support	Companies with China and SEA presence valuing native-language support	Strong support presence and ecosystem in APAC, mature TiDB Cloud offering on Alibaba Cloud and AWS	Regional teams, mixed Mandarin/English ops	Spanner is GCP, DSQL is AWS, CRDB has limited APAC commercial bench

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Greenfield serverless apps on AWS	New product launches on Lambda + API Gateway with unpredictable traffic	Scale-to-zero billing, no connection pooler, no instance sizing	Variable load from 0 to thousands of QPS, no capacity planning	Aurora Serverless v2 has a minimum cost floor, Aurora Postgres requires instance sizing
Active-active multi-region without operational handoff	Robinhood and ADP publicly cited as launch customers for global SLAs	99.999% multi-region availability with single logical endpoint per region	Two-region active-active, sustained 4x faster than Spanner per AWS GA claims (verify on your workload)	Spanner is GCP-only, multi-region CRDB requires you to operate the cluster
Microservices ledger or transactional state	Microservice owning a small slice of state, no batch jobs, well-bounded transactions	3000-row transaction limit is fine for service-shaped workloads, OCC scales without locking	10s of GB per service, low-contention writes, microservice-shaped access	DynamoDB would work but loses SQL, joins, and ad-hoc query flexibility
Multi-tenant SaaS with tenant-isolated workloads	SaaS where tenant rows are independent, no cross-tenant transactions	OCC works when conflicts are tenant-scoped, scale-to-zero per tenant inactive period	Many small tenants, low cross-tenant locking concerns	RDS Postgres requires upfront capacity for the worst-case tenant
Event-driven applications with bursty traffic	Workflow orchestration, retry queues, scheduled-job dispatchers	Per-DPU billing means quiet hours are nearly free, no provisioned floor to pay for	Workload spikes 100x between idle and peak, no traditional sizing answer	Provisioned Aurora pays for peak 24/7, Lambda + DynamoDB loses SQL ergonomics
Net-new applications choosing strict AWS-native	AWS-first orgs with IAM, KMS, VPC, IaC pipelines already standardized	IAM auth (token-based, no DB passwords), CloudWatch DPU metrics, AWS FIS for failover testing	Standard AWS-resident stack, single-cloud strategic mandate	Self-managed CRDB or YugabyteDB on EKS doubles operational scope

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Global financial systems	Banks and fintechs with multi-region ledgers, e.g., trading and payment platforms	External consistency, the only commercially deployed system with this guarantee	Trillions of rows, multi-region with 5-nines SLA, strict audit requirements	None of the others guarantee external consistency in the formal sense, only strong consistency
Ad-tech and large-scale OLTP	Google's own AdWords backend (F1) is the canonical use case	Strong consistency at petabyte scale with predictable latency	Petabytes, millions of QPS, single logical schema across global footprint	No alternative has demonstrated this scale in production
Multi-model consolidation	Companies consolidating off Neo4j + Elasticsearch + Pinecone + Postgres into one platform	Single ACID-compliant store across relational, graph, vector, and full-text	10s of TB, mixed workload patterns, ETL elimination as a strategic goal	Each specialized DB still wins on its niche; Spanner wins on consolidation when "good enough" per model suffices
Gaming leaderboards and global state	Mobile game backends with single global leaderboard and player state	Globally consistent ordering of writes, single source of truth across regions	10M+ concurrent players, sub-second cross-region read freshness	Cassandra and Redis cannot give cross-region ordering, only locality-based eventual consistency
Spanner Graph for fraud detection	Financial fraud detection with graph traversal on fresh transaction data	Combined SQL + GQL graph queries on ACID-compliant data, no ETL to a separate graph DB	Billions of edges, real-time scoring on OLTP transactions	Neo4j is the graph specialist but requires ETL from the OLTP store, adding latency and consistency risk
Multi-cloud strategic deployments (Spanner Omni)	Companies committed to Spanner but with non-GCP infrastructure obligations	Spanner Omni preview brings the engine to on-prem, AWS, Azure (preview-quality, not production)	Mid-size deployments, evaluation as of 2026, not strategic-bet-the-company	If GCP is on the table, native Spanner is always cheaper and more mature

3. Limitations

Matrix layout. Rows are the categories that bite at scale; cells describe how each technology is constrained, with severity coded by color.

Limitation	CockroachDB	YugabyteDB	TiDB	Aurora DSQL	Spanner
Transaction size	~64MB practical (raft command size), no hard row count MEDIUM	~512MB tablet-level, soft transaction limits via configs MEDIUM	10GB hard limit in default mode, 100GB with "large transactions" enabled MEDIUM	3000 rows, 10MiB, 5 minutes hard CRITICAL	20K mutations per commit, 100MB max HIGH
SQL feature completeness vs upstream	Postgres subset, no LISTEN/NOTIFY, no advisory locks, no extensions HIGH	PG 11.2 base, most features supported but cadence lags PG releases by years MEDIUM	MySQL 5.7+ subset, no triggers in many versions, no events MEDIUM	No FKs (no enforcement), no triggers, no stored procs, no JSONB, no temp tables, no sequences, no extensions CRITICAL	GoogleSQL is a separate dialect, PG dialect is partial, interleaved tables non-portable HIGH
Multi-region write latency	Cross-region Raft quorum at write time, ~RTT to nearest 2 regions HIGH	Same as CRDB for sync, sub-second for xCluster async (with consistency caveats) HIGH	Cross-region 2PC pays multi-RTT, most multi-region TiDB is cluster-to-cluster async HIGH	RTT at commit, sub-statement local execution, OCC abort at commit MEDIUM	TrueTime commit-wait + Paxos, leader-side fast / follower-side slow MEDIUM
Hot-row contention behavior	SERIALIZABLE → transaction restart, can cascade HIGH	Similar restart pattern, slightly better via pessimistic locking option MEDIUM	OCC default storms aborts, pessimistic mode mitigates HIGH	OCC abort at commit, no early failure detection CRITICAL	Pessimistic locking + 2PC, lock contention but no retry storm MEDIUM
Operational complexity (self-managed)	Single binary, hundreds of knobs, requires CRDB-experienced SRE HIGH	Two services (Master, TServer), three for HA, requires PG + RocksDB expertise HIGH	Three core services (TiDB, TiKV, PD) + TiFlash, highest surface area CRITICAL	N/A — fully managed, no self-hosted option N/A	N/A managed, Spanner Omni is preview-only as of 2026 N/A
License / vendor lock-in	CCL closed source, $10M revenue gate for free Enterprise, telemetry mandatory on free tier HIGH	Apache 2.0, genuinely open LOW	Apache 2.0, genuinely open LOW	AWS-only, no multi-cloud path CRITICAL	GCP-only natively (Omni preview opens this), Google-managed proprietary HIGH
Cost predictability	Per-vCPU licensing on Enterprise, predictable but not cheap MEDIUM	Compute + storage on managed, free self-hosted LOW	Managed pricing similar to CRDB Cloud, self-hosted is the cost win MEDIUM	DPU billing notoriously hard to forecast, query changes can 3x cost CRITICAL	Processing-unit billing predictable but expensive, multi-region is 3x regional HIGH
Analytics / OLAP capability	Not designed for OLAP, doc says so, follower reads help but limited HIGH	Same, no columnar engine, recommend separate OLAP store HIGH	TiFlash columnar replica gives real HTAP, the only one with built-in OLAP LOW	No analytics workload, OLTP only by design CRITICAL	Spanner Data Boost separates analytics traffic from OLTP, BigQuery Zero-ETL closes the gap MEDIUM
Cold-start / scaling-up latency	Manual node provisioning, minutes to add capacity MEDIUM	Manual node provisioning, similar to CRDB MEDIUM	Manual node provisioning, PD scheduling needs time to rebalance MEDIUM	Serverless, sub-60s cluster provision, scales automatically LOW	Autoscaling (Enterprise+ tier) automatic, processing-unit adjustments seconds LOW
Maximum cluster size in practice	Hundreds of nodes published, Cockroach Labs ref customers in low hundreds MEDIUM	Hundreds of nodes published, comparable to CRDB MEDIUM	Thousands of TiKV nodes at ByteDance scale LOW	Single cluster, 256TiB max storage as of Nov 2025 (raised from 128TiB) MEDIUM	Petabyte-scale validated by Google's internal F1 deployment LOW

4. Fault Tolerance

Replication model is the headline; failure detection time and split-brain behavior are what determine your incident postmortem template.

Dimension	CockroachDB	YugabyteDB	TiDB	Aurora DSQL	Spanner
Replication model	Raft per range, default RF=3, leaseholder serves reads	Raft per tablet, default RF=3, leader-lease-protected reads	Raft per region in TiKV, default RF=3, leader serves reads	Disaggregated, Journal-replicated across 3 AZs, no traditional consensus	Paxos per split, RF configurable (default 3 zones, multi-region adds more)
Failure detection	Liveness heartbeats, ~9s for unavailability, configurable	Tablet peer heartbeats, ~3s detection	PD heartbeats, ~10s for region leader replacement	Control plane health checks, transparent to client	TrueTime + Paxos heartbeats, sub-10s detection in practice
Failover mechanism	Raft re-election, leaseholder moves, no client config change	Raft re-election + leader-lease wait, no client change	Raft re-election in TiKV + PD rebalance, no client change	Automatic re-routing to healthy QP/Journal/Adjudicator, invisible to client	Paxos re-election, automatic, no client change
RTO (typical)	~9s for node loss, 30s+ for AZ-wide event due to lease churn	~3-10s for node loss, depending on lease parameters	~10-30s for TiKV node loss, PD election adds latency for region	Sub-second for AZ failure (multi-AZ active by design)	Zero RTO for zone failure (Spanner Omni doc claims, native Spanner sub-second)
RPO (typical)	Zero (synchronous Raft commit)	Zero for sync replication, seconds for xCluster async	Zero (synchronous Raft commit)	Zero (Journal acknowledges before client sees commit)	Zero (Paxos quorum commit)
Split-brain behavior	Minority partition rejects writes, leaseholder check on reads	Leader-lease prevents two leaders, minority rejects writes	Minority partition rejects writes, PD election quorum-protected	No leader, conflicts surface as OCC aborts at commit	Paxos quorum, minority side cannot make progress
Single-node failure blast radius	Hundreds of ranges may lose leaseholder, ~9s of localized read latency	Tablets where node was leader briefly unavailable for writes	Regions where node was leader briefly unavailable for writes	One QP loss is invisible (request re-routes), one AZ loss is sub-second	Splits where node was leader unavailable briefly, others unaffected
Cross-region failover	Manual via zone-config changes, can be fully automated with survival goals	Automatic for sync clusters, manual cutover for xCluster async	Manual cluster-to-cluster failover for the typical async multi-region setup	No failover step — multi-region is active-active, traffic shifts naturally	Automatic for multi-region instances (Enterprise Plus tier)
Data loss scenarios	Simultaneous loss of quorum in a range (3-AZ simultaneous failure)	Same as CRDB for sync, xCluster async can lose unreplicated writes	Quorum loss in TiKV region, or PD majority loss for cluster metadata	Multi-AZ Journal quorum loss; multi-region cluster pair both lost	Multi-region quorum loss (rare given replica count and geographic spread)
Published availability SLA	No managed-service SLA on self-hosted; CockroachDB Cloud Advanced has tier-specific SLAs	YugabyteDB Aeon (managed) has tier SLAs; self-hosted depends on operator	TiDB Cloud Dedicated 99.99%, self-hosted depends on operator	99.99% single-region, 99.999% multi-region	99.99% regional, 99.999% multi-regional (Enterprise Plus)

Divergence to flag: DSQL and Spanner publish 99.999% multi-region. That number is reachable architecturally on CRDB and YugabyteDB too, but only if you operate the cluster to that bar yourself. The SLA you contractually receive vs. the SLA your engineering can achieve are different questions, and procurement frequently confuses them.

5. Sharding

Hash, range, or directory; how rebalancing happens and what the hot-shard escape hatch looks like.

Dimension	CockroachDB	YugabyteDB	TiDB	Aurora DSQL	Spanner
Sharding model	Range-based on PK by default, hash-sharded indexes available	Hash by default for tables, range for indexes (opt-in for tables via SPLIT)	Range-based (Regions, ~96MB), automatic split and merge	Storage cloning + automatic sharding under the hood, not user-visible	Range-based (splits), automatic dynamic partitioning
Shard key constraints	Primary key, any data type; monotonic keys are an anti-pattern	Primary key, hash-distributed by default avoids monotonic hot-shard	Implicit (auto-generated row IDs) or user-defined cluster index	No user-visible shard key; system manages distribution	Primary key, can use UUID or co-located (interleaved) keys
Rebalancing mechanism	Background range splits at ~512MB or load threshold, leases move independently	Tablet split + load-based leader rebalancing managed by YB-Master	PD continuously rebalances regions across TiKV nodes, latency-aware	Hidden, automatic, no user controls	Splits move automatically based on load and size; admin can pre-split
Rebalancing cost / impact	Transparent to client, can cause brief read-latency spikes during lease transfer	Transparent, leader-lease wait briefly extends write unavailability	Transparent, PD throttles to limit impact, can be paused	Invisible to client by design	Transparent, briefly blocks writes to the moving split
Hot-shard behavior	Auto-split if load exceeds threshold; monotonic PK is the common foot-gun	Hash sharding avoids the monotonic hot-shard problem by default	Auto-split, PD migrates leader off hot region; high-write hot rows still abort	OCC means hot rows abort on commit; no "hot shard" in the traditional sense	Auto-split, interleaved tables reduce hot-spot risk by co-locating logical units
Maximum shards (practical)	Hundreds of thousands of ranges per cluster, validated in published deployments	Hundreds of thousands of tablets per cluster	~100K regions per node feasible with hibernated regions	Not user-visible; AWS manages scale to 256TiB cluster	Petabyte scale in production at Google (F1), no published upper bound
Resharding without downtime	Yes, automatic via range splits, manual via ALTER PRIMARY KEY (online schema change)	Yes, automatic via tablet splits, online tablespace moves for geo-partitioning	Yes, fully automatic, no SQL needed	Yes, completely transparent	Yes, automatic, no user action required
Cross-shard query support	Distributed SQL planner fans out and aggregates, full SQL support across shards	Distributed SQL planner from forked PG, full SQL across tablets	Distributed SQL with TiFlash + TiKV coordination, full SQL	Full SQL, distribution invisible	Full SQL with 2PC for cross-split, GoogleSQL or PG dialect

6. Replication

Topology, sync vs async, and the cross-region story. The first three are Raft-similar; the last two diverge.

Dimension	CockroachDB	YugabyteDB	TiDB	Aurora DSQL	Spanner
Replication topology	Single leader per range (leaseholder), Raft followers	Single leader per tablet, Raft followers, read replicas opt-in	Single leader per region in TiKV, Raft followers, TiFlash columnar replica	No leader; Journal replicates writes, Adjudicator resolves conflicts	Single Paxos leader per split, witness replicas for quorum
Sync vs async	Sync within Raft group, async to non-voting replicas (follower-reads)	Sync within Raft group, async via xCluster for cross-region	Sync within TiKV Raft group, async via CDC-based cluster replication	Sync across AZs, sync across regions at commit (multi-region peered)	Sync via Paxos, both regional and multi-regional are synchronous
Replication factor (default / max)	3 default, configurable per zone	3 default, configurable via tablespace per fault domain	3 default, configurable per region/label	3 across AZs fixed (multi-region adds full second cluster)	3 default per region, multi-region adds zones for additional copies
Consistency level options	SERIALIZABLE (default), READ COMMITTED (opt-in 24.x+), follower reads with bounded staleness	SERIALIZABLE, SNAPSHOT, READ COMMITTED, follower reads with bounded staleness	SNAPSHOT (default in OCC), READ COMMITTED via pessimistic txn	Snapshot isolation, linearizable reads	External consistency (default), bounded staleness reads, exact staleness reads
Replication lag (typical)	Sub-ms within Raft group, configurable seconds for follower reads	Sub-ms within Raft, ~sub-second for xCluster async	Sub-ms within Raft, ~seconds for cross-cluster CDC	Zero (all writes durable at commit before ack)	Sub-ms commit-wait dominates, sub-second cross-region
Conflict resolution	Raft serializes all writes, no conflicts to resolve	Same as CRDB for sync; xCluster async uses last-writer-wins at WAL level	2PC commit ordering, no conflicts in sync path	OCC, conflicting writer aborts at commit time	2PC with TrueTime ordering, no conflicts in commit path
Cross-region replication	Sync via Raft (high latency) or async via Change Data Capture	Sync (high latency) or async via xCluster (LWW conflicts allowed)	Mostly async cluster-to-cluster, 2PC available but rarely deployed multi-region	Sync at commit, multi-region peered clusters are first-class	Sync via Paxos with TrueTime, multi-region is first-class
Replication during partition	Minority Raft side rejects writes, majority side serves	Same, minority Raft side blocked, leader-lease prevents stale reads	Minority TiKV region blocked, PD must have majority	OCC means partitioned writes will conflict at commit, transactions abort	Paxos minority blocked, majority continues

7. Better Usage Patterns

Five patterns per tech. The anti-pattern half of each row is the one that shows up in code review; the better-way half is the one experienced operators recommend.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Primary key design	Auto-increment INT or timestamp PK migrated from Postgres unchanged	UUID or hash-sharded composite key from day one, never monotonic	Avoids the single-hot-range write bottleneck that becomes a Saturday-night oncall
Transaction retries	Catch generic exceptions, retry once, hope it works	Use the official retry wrappers (savepoint cockroach_restart), exponential backoff, surface aborts as a metric	SERIALIZABLE means retries are part of the normal control flow, not an exception path
Follower reads	Treat them as "free strong reads" for read-heavy workloads	Use AS OF SYSTEM TIME or follower-read clause explicitly, document the staleness bound per query class	Stale reads with no documented staleness will be the root cause of a data-correctness postmortem within 6 months
Geo-partitioning	Set up survival goals and zone configs once, forget the schema-level enforcement	Add CHECK constraints that prevent cross-region rows from being inserted into the wrong partition	Stops "compliance accidents" at the schema layer, not in code review
Schema change cadence	Manual psql ALTER TABLE during a maintenance window	Use online schema changes (DDL is non-blocking by design in CRDB), but stage in non-prod with the same row counts	CRDB does online DDL well; the failures are usually about under-tested constraints, not the DDL itself
Connection management	Use a single pgbouncer pool sized for the largest node	Use SQL-aware proxy (cockroach-sql-proxy or CockroachDB Cloud's load balancer), sized per-node not per-cluster	Cross-node connection pooling adds tail latency from lease redirects; SQL-aware routing keeps queries node-local

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Hash vs range sharding	Accept hash default for everything, then complain range scans are slow	Use HASH for OLTP point lookups, RANGE explicitly for time-series and ordered queries	The default is right for 80% of cases; the other 20% need an explicit ASC PRIMARY KEY (col, col) SPLIT INTO ... clause
Connection pooling	Direct PG connections from app, hundreds of YSQL backends each consuming RAM	YugabyteDB Smart Driver + connection pool, target 10-50 active connections per node	YSQL inherits Postgres' per-connection process model; without pooling, RAM and context switches kill the cluster
xCluster active-active	Use auto-increment PKs and unique constraints, "fix it later"	UUIDs for PKs, no unique constraints on indexed columns, application-level conflict reconciliation logic	The docs literally warn about this; conflicts are silent and you find them in customer reports
Geo-partitioning + FKs	Try to enforce FKs across partitioned tables, hit the partition + FK limit	Co-locate referenced rows in the same geo-partition, denormalize across regions if needed	The schema-level limitation is real; pretending it isn't leads to insertion failures or silent missing FK semantics
Bounded staleness reads	Run all reads strong, build a "read replica" via xCluster for analytics	SET yb_read_from_followers = true with explicit staleness; mix in same session as needed	Avoids building a parallel infrastructure for what is a per-query parameter; saves substantial cost and complexity
YB-Master sizing	Treat YB-Master like a metadata service that needs minimal resources	Size YB-Master for the catalog operations of your cluster; CPU spikes during heavy DDL or split storms	An underpowered Master becomes the bottleneck during incident recovery, exactly when you can't afford it

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
OCC vs pessimistic	Leave the default and assume MySQL-compatible behavior	Use pessimistic locking (set transaction default in 5.0+) for OLTP, OCC only for low-contention bulk paths	The default has shifted historically; teams who adopted on 4.x default OCC and never re-evaluated are still seeing abort storms in 2026
TiFlash usage	Add TiFlash, hope the optimizer picks it for analytics queries	Use isolation_read_engines session variable per workload class, force OLTP to TiKV and OLAP to TiFlash explicitly	Optimizer mistakes are the #1 HTAP problem in TiDB; manual routing per session class avoids them
Auto-incrementing PK	Use AUTO_INCREMENT or AUTO_RANDOM without thinking	AUTO_RANDOM for OLTP tables to avoid the last-region hot-shard problem, AUTO_INCREMENT only for analytics-style tables	TiDB allocates AUTO_INCREMENT in node-local batches but the last-region hot-spot is still real for sequential reads
PD placement	Place PD on the same nodes as TiKV for cost reasons	Dedicated PD nodes in their own AZ, sized for the TSO throughput your cluster needs	PD is on the critical path for every transaction; co-locating with TiKV creates cascading failures during a single node loss
Region size tuning	Stay on 96MB default for the largest clusters	Increase region size to 256MB or 512MB at >10TB scale to reduce PD metadata overhead	Region count grows with data size; PD scheduling and TiKV heartbeat overhead grows linearly with region count
Cross-region deployment	Stretch a single cluster across regions and accept high write latency	Separate cluster per region with CDC-based async replication (TiCDC), accept eventual cross-region	The vast majority of production multi-region TiDB is region-local with async replication; pretending otherwise produces unusable write latency

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Transaction shape	Migrate batch jobs as single transactions, hit the 3000-row limit in production	Chunk into 1000-row batches with idempotency keys, design every job to be resumable	The 3000-row limit is intentional; pretending it's a quota leads to constant retries instead of correct chunking
Foreign key replacement	Skip FKs, hope the ORM enforces them	Application-level referential integrity with explicit consistency checks at transaction boundaries	Without FKs, the database cannot prevent orphan rows; the application must own this invariant explicitly
Cost monitoring	Watch the monthly bill, react when it spikes	Treat per-query DPU consumption as a SLO; alert on Compute/Read/Write DPU breakdowns separately	A subquery refactor can 3x DPU consumption silently; only per-query observability catches it before the bill
Index strategy	Add indexes the same way as Postgres, hope query plans work	Use covering indexes aggressively to enable index-only scans; EXPLAIN ANALYZE VERBOSE is mandatory before deploy	Storage is cheap, DPU is not; covering indexes that eliminate table scans pay for themselves in days
OCC conflict handling	Catch OCC abort, retry once, give up	Exponential backoff, jitter, partition-aware retries, observability on conflict-prone keys	OCC aborts are signal, not noise; high abort rate on a key means redesign, not retry harder
Multi-region cost	Enable multi-region for "availability" without measuring write cost impact	Multi-region only when active-active is the actual requirement; single-region multi-AZ is 99.99% and cheaper	Multi-region writes 2x DPU; many workloads enable it for marketing reasons and pay continuously

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Schema design	Treat Spanner like Postgres, design schemas without interleaved tables	Use INTERLEAVE IN PARENT for child rows that are always accessed with the parent (orders within users)	Interleaved tables keep co-accessed rows physically together, eliminating 2PC for the common case
Read patterns	Default to "strong reads" for everything, pay TrueTime commit-wait on every read	Use stale reads with bounded staleness (e.g., 10s) for non-critical paths, strong only when ordering matters	Stale reads can be served from any replica without quorum, lowering p99 and cost simultaneously
Primary key design	Use auto-incrementing INT64 or timestamp PK from Postgres habit	UUIDv4 or hash-prefixed PK to avoid hotspot on monotonic inserts	Like all range-partitioned systems, monotonic PKs create a hot last-split that takes all writes
Cross-region instance choice	Pick multi-region instance for "availability," accept the 3x cost	Regional instance with cross-region read replicas (where available) when writes are region-local	Multi-region is for true global writes; regional+replica is cheaper for read-anywhere/write-here workloads
Edition choice	Default to Standard, hit a feature gap later (graph, vector, geo-partitioning)	Pick Enterprise or Enterprise Plus from day one if any multi-model or geo-partition feature is in the roadmap	Edition upgrades are possible but per-feature licensing makes the math hard to undo
Connection management	Use the default Spanner client session count without tuning	Enable multiplexed sessions (default in 2025+ clients) for high-concurrency workloads, sized to actual QPS	Session exhaustion is a quiet failure mode; multiplexed sessions eliminate it in 80% of cases

8. Advanced / Next-Gen Alternatives

Where each technology might be replaced, augmented, or obviated. Three rows per tech, weighted to the alternatives that an L7 review would mention.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
YugabyteDB	Apache 2.0 license, higher PG feature parity, no revenue gate	Production at scale	Medium, Postgres-compatible but DDL and tuning differ	License change forced re-evaluation, or PG ecosystem features (extensions, triggers) are required
Aurora DSQL	Eliminates self-managed operations on AWS, scale-to-zero billing	GA May 2025	High, requires substantial app refactor (no FKs, no triggers, txn limits)	AWS-only strategy, greenfield apps, workload tolerant of OCC and DSQL limits
Postgres + Citus (now Microsoft-owned)	True Postgres with horizontal sharding via Citus; lower license risk than CRDB Enterprise	Mature, Azure-blessed	Medium, app-side awareness of distribution columns	Multi-tenant SaaS where tenant is the natural shard key, want real Postgres
Vitess (MySQL-shaped)	If you can switch to MySQL, Vitess is more mature and broadly deployed for horizontal scale	Mature (YouTube, Slack, GitHub)	Very high if migrating from PG; near-zero if already on MySQL	Existing MySQL workload outgrowing single-primary, OSS-only mandate

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
CockroachDB	Slightly more mature commercial support and ecosystem in North America	Production	Medium, similar architecture but DDL and tuning differ	Procurement requires a US-headquartered vendor with tier-1 enterprise support
Aurora DSQL	Serverless economics, no cluster ops	GA	High, PG feature gap is wider than YB's	AWS-native, greenfield, smaller feature surface acceptable
Spanner with PG dialect	External consistency, true multi-region writes, multi-model in one engine	Production	High, GCP-specific and edition-tier-dependent	GCP commitment exists, correctness guarantees are a procurement requirement
Postgres + pglogical / Patroni / Citus stack	Real upstream Postgres, no fork to track	Mature	Medium, requires assembling sharding + HA from parts	Postgres-purist team unwilling to accept any fork, accepting higher operational burden

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Vitess	Preserves MySQL semantics more strictly, more mature in North American hyperscalers	Mature	Medium, both are MySQL-protocol, but Vitess sharding is app-aware	Existing MySQL with complex stored procs and triggers TiDB doesn't fully support
PlanetScale (Vitess + serverless)	Serverless MySQL with branching, no cluster ops	Production	Low for MySQL apps that fit PlanetScale's deferred schema model	Want managed serverless on MySQL ecosystem, OK with PlanetScale's foreign-key constraints
MySQL HeatWave (Oracle)	HTAP via in-memory analytics engine, deep MySQL semantic parity	Production	Low for OCI users, medium otherwise	HTAP need with strict MySQL parity, willing to use Oracle Cloud Infrastructure
SingleStore (formerly MemSQL)	Row + columnar in one store, MySQL wire protocol, real-time analytics	Production	Medium, MySQL wire but different SQL semantics	Mixed OLTP + sub-second analytics, want a more analytics-leaning HTAP than TiDB

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Aurora Postgres Limitless Database	True Postgres compatibility (FKs, triggers, JSONB, extensions) with horizontal sharding	Production	Low for traditional PG apps, requires shard-key thinking	Need full Postgres features but want AWS-native horizontal scale; multi-region not the priority
CockroachDB Cloud / YugabyteDB Aeon on AWS	More mature feature set, multi-cloud option	Production	Low to medium	DSQL feature gaps (FKs, triggers, batch transactions) are dealbreakers
DynamoDB with PartiQL	If you can model as key-value, DynamoDB is mature, cheaper, and has no transaction-row limits	Mature	High, data model change required	Workload is actually KV-shaped and SQL was used out of habit, not necessity
Spanner Omni on AWS (preview)	External consistency and multi-cloud, the only credible alternative for DSQL's correctness story	Preview 2026	High, GCP-tied tooling and dialect	Multi-cloud strategic, willing to accept preview-stage software

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Spanner Omni	Same engine, multi-cloud (on-prem, AWS, Azure)	Preview as of 2026	Low within Spanner family	Multi-cloud mandate or GCP-exit risk hedge; willing to track preview features
CockroachDB / YugabyteDB	Multi-cloud today, lower cost at small-to-mid scale, no GCP dependency	Production	Medium to high, depending on Spanner-specific features used	Cost optimization, multi-cloud, willing to operate the cluster yourself
TigerBeetle (for financial workloads)	Single-purpose double-entry accounting at sub-ms p99, deterministic and dedicated to ledger correctness	Emerging, early production	High, dedicated financial schema, separate from general OLTP	Pure financial-ledger workload where Spanner's general-purpose overhead doesn't earn its keep
FoundationDB + layers	Lower-level, transactional KV at extreme scale with stronger isolation guarantees	Production (Apple, Snowflake)	Very high, SQL is not native, build layers yourself	Platform-engineering bet, building your own DB-as-a-service on top of a strong KV substrate

Best default choices

Search and compare

1. Trade-Offs

CockroachDB

YugabyteDB

TiDB

Amazon Aurora DSQL

Google Cloud Spanner

2. Use Cases

CockroachDB

YugabyteDB

TiDB

Amazon Aurora DSQL

Google Cloud Spanner

3. Limitations

4. Fault Tolerance

5. Sharding

6. Replication

7. Better Usage Patterns

CockroachDB

YugabyteDB

TiDB

Amazon Aurora DSQL

Google Cloud Spanner

8. Advanced / Next-Gen Alternatives

CockroachDB

YugabyteDB

TiDB

Amazon Aurora DSQL

Google Cloud Spanner