Distributed SQL: Five-Way Trade-Off Analysis

CockroachDB, YugabyteDB, TiDB, Amazon Aurora DSQL, and Google Cloud Spanner. The architectural decisions that separate them, and the ones interview committees actually probe.

Distributed SQL Strongly Consistent PE Depth

As of 2026-06-02. Aurora DSQL GA was May 27, 2025; CockroachDB 24.3 retired the open Core edition in November 2024; Spanner Omni reached preview in 2026.

PE Verdict

These five collapse into three architectural philosophies, not five. Spanner and Aurora DSQL trust physical time (TrueTime, EC2 Time Sync) to avoid coordination; CockroachDB and YugabyteDB use Raft + HLC and treat clock skew as the enemy to be bounded; TiDB uses a single centralized timestamp oracle (PD) Percolator-style. The right question is never "which one scales better." It is: who owns the operational pain (you vs. the cloud), what is the migration story for your existing wire protocol, and what does a partition do to your write SLO. If you cannot answer those three for your workload, you are picking on vibes.

Best default choices

1. Trade-Offs

One table per technology. The trade-offs themselves diverge, so a matrix would flatten the most important distinctions. Each row is a concrete X-for-Y decision baked into the design, not a feature.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

Raft per range · HLC · Pebble · CCL
Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
SERIALIZABLE isolation by defaultStrongest correctness floor of any SQL DB, no anomaly catalog to memorizeHigher abort rate and retry burden than READ COMMITTED, especially on hot rowsInventory decrement on a top SKU during a sale, every concurrent decrement is a transaction restartThe 24.x line did add READ COMMITTED as opt-in, but it is a behavior trap: most teams enable it cluster-wide and lose the correctness story that justified picking Cockroach in the first place
HLC instead of GPS-disciplined timeRuns on any cloud, any laptop, no special hardware dependencyUncertainty interval (default 500ms max_offset) forces "uncertainty restarts" when reads land inside itNTP drift on a noisy VM pushes one node 600ms off, the node shuts itself down rather than risk a serializability violationThis is why AWS Time Sync adoption matters for Cockroach in practice. Tightening max_offset reduces restart amplification, but the safe default trades latency for safety
Per-range Raft groups (MultiRaft)Range-level placement, replication factor, and lease control without reshardingHundreds of thousands of Raft groups per node, each with its own heartbeat trafficIdle clusters burn CPU on heartbeats, dense node failures cause Raft thundering herds on leader re-electionThe MultiRaft optimization batches heartbeats, but the engineering cost shows up in Raft tuning flags most teams never touch and Cockroach support has to walk them through
Postgres wire compatibility (partial)Drop-in for psql, pgx, JDBC, most ORMs, lift-and-shift from PG looks easyNo PG extensions (PostGIS, pgcrypto, pgvector via separate path), different query planner, divergent EXPLAIN outputExisting app uses LISTEN/NOTIFY, advisory locks, or PL/pgSQL with side effects, none of which port"Postgres-compatible" is a marketing surface. Real test is your slowest 100 queries on production data, not the schema migration
Geo-partitioning by rowGDPR/data-residency at row granularity, no app-level routingLocality constraints become a schema concern, cross-region joins fan out to wide-area latencyA "show me my orders" query joins a US-pinned user to an EU-pinned order row, p99 jumps from 5ms to 180msThe PE move is to forbid cross-partition queries at the schema level via constraints, not policy. Otherwise developers will write them, and the wide-area cost is invisible until on-call
Closed-source license (CCL, post-Nov 2024)Backed development, dedicated support, all enterprise features in one bundleNo Apache-licensed escape hatch, BSL converts to ASLv2 after four years (slow), $10M revenue cap on free tier with annual renewalAcquired company is now over the revenue threshold and Cockroach Labs' pricing posture has shifted, your only options are pay, fork an older BSL version, or migrateThis is the single biggest reason to evaluate YugabyteDB alongside. The architecture is similar enough that the differentiator becomes license, ecosystem trust, and operational maturity
Range-based key partitioningRange scans are fast, ORDER BY on PK is free, time-series append patterns work wellMonotonic primary keys (auto-incrementing IDs, timestamps) create a hot last range that does not auto-split until it is already too lateAn event-logging table with timestamp PK takes 80% of cluster writes on a single range during a traffic spikeUse hash-sharded indexes or UUID PKs by default. The "PRIMARY KEY (created_at, id)" pattern is the most common production foot-gun on Cockroach
Leaseholder reads bypass RaftSingle-replica reads at p99 sub-10ms, no consensus round-trip on the read pathLease transfer on node loss adds tail latency, follower reads are stale by design (default 4.8s)Node graceful drain or rebalance causes a brief lease-transfer storm, p99 reads spike to 100ms+ for ~30sFollower reads are the right tool for analytics-on-OLTP, but using them for user-facing reads while claiming "strong consistency" is the most common postmortem item on Cockroach deployments

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

Raft per tablet · HLC · DocDB · Apache 2.0
Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
Forked PostgreSQL backend (YSQL) on top of DocDBHighest PG compatibility of the five, real triggers, FKs, stored procs, most extensionsForked PG 11.2 base (Enhanced Compatibility Mode in v2024.1 closes some gaps), upgrade cadence lags upstream Postgres by yearsApplication depends on a PG 15+ feature (MERGE, range types with multiranges), and Yugabyte's parity is "soon"The fork is both the gift and the curse. You inherit PG semantics for free but also inherit PG's per-connection backend process model, which is why connection pooling is non-optional
Two-layer architecture (query / storage)Storage layer (DocDB) is consistent at the doc level, query layer is stateless and scales independentlyTwo services to operate, two release notes to read, query layer per-connection memory dominates at high connection countsSpiky workload at 5000 concurrent connections, YSQL processes consume RAM the storage layer doesn't needYB-TServer + YB-Master split mirrors TiDB's separation but at finer granularity. Operators frequently misallocate resources between them on first deployment
HLC + bounded clock skew (similar to CRDB)No special hardware, runs on any cloud, on-prem, or laptopSame uncertainty interval problem as Cockroach, depends on NTP disciplineOne node's NTP daemon misconfigured, divergence triggers tablet leader bouncingYugabyteDB's leader lease mechanism is stricter than Cockroach's: new leader must wait out old lease, which costs availability during failover but prevents split-brain reads
Apache 2.0 license, genuinely open coreReal fork-and-go option, no revenue gate, no telemetry mandateSmaller commercial ecosystem than Cockroach Labs, narrower bench of vendor SREs available for hireYou need a deep performance investigation on a Saturday and the public Slack is the fastest path, not a support contractThe license is a real competitive moat against CRDB now, especially for procurement-sensitive enterprises. Don't dismiss it as ideological
Hash sharding by default for tables, range for indexesNo monotonic-PK hot-shard problem out of the box, uniform write distributionRange scans on a hash-sharded PK degenerate to scatter-gather across tabletsA "list my last 50 orders by created_at DESC" query goes from O(1 tablet) to O(N tablets) when N growsThis is the inverse of CRDB's default. You explicitly opt into range sharding via SPLIT INTO, which is the right default for OLTP but trips up engineers migrating from PG with timestamp PKs
xCluster async replication for two-region active-activeSub-second async replication for low-write-latency multi-region without the cost of synchronous Paxos across regionsNo unique constraint enforcement across universes, sequences conflict, conflict resolution is last-writer-wins at WAL levelTwo active universes both accept INSERTs with the same auto-increment ID, indexes silently diverge from main tableThe doc literally tells you to use UUIDs and skip uniqueness constraints in active-active xCluster. If that constraint is unacceptable, you don't actually want async, you want sync sync sync and to pay the latency
Read-from-follower with bounded stalenessGeo-local reads with explicit staleness bound (e.g., "no more than 30s old")Application must opt in per-session, mixed-staleness queries inside one txn are easy to write incorrectlyAn analytics dashboard reads followers and shows totals that don't match the OLTP view of the same data, customer reports "the numbers are wrong"Bounded-staleness is the most underused feature across all five. Most teams either run everything strong or build a separate replica, both more expensive than tuning a staleness bound per query class
Geo-partitioning via PostgreSQL partition syntaxUses native PG declarative partitioning + tablespaces, familiar to DBAsNo FK references on partitioned tables, all the usual PG partition-pruning gotchas, schema changes ripple across partition tablesA regulated multi-region schema needs FKs between user and orders, you discover the limit during the prod cutoverTablespace pinning is more transparent than CRDB's zone configs but less expressive. Choose based on whether your team thinks in PG idioms or zone constraints

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

3-component · PD timestamp oracle · TiKV · MySQL wire
Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
MySQL wire protocol, not PostgresDrop-in migration for the largest installed base of OLTP on the planet, every MySQL driver worksInherits MySQL's looser type system, lock semantics, and SQL dialect quirks, no SERIALIZABLEApplication relied on MySQL repeatable-read gap locks, TiDB's snapshot isolation behaves differently and you get phantom rowsThis is the only one of the five with MySQL heritage. If you are not migrating from MySQL or building net-new and standardizing on MySQL ecosystem, this is rarely the right pick
Centralized Placement Driver (PD) issues global timestampsPercolator-style 2PC works, no clock-skew concerns, simple mental modelPD is a global bottleneck on transaction start, TSO request latency is on the critical pathCross-region cluster where PD leader is one continent away, every transaction pays a continental round-trip just to beginPD itself is Raft-replicated for HA, but only the leader serves TSO. Multi-region TiDB is fundamentally region-local with cross-region cluster-to-cluster replication, not true multi-master like Spanner or DSQL
Compute / storage separation (TiDB / TiKV / TiFlash)Scale stateless SQL layer (TiDB) without rebalancing data, TiFlash columnar replica gives HTAP on the same clusterThree independent services to operate, each with its own scaling story and failure modesTiKV is healthy but TiDB layer is OOMing on a bad query, the cluster looks "fine" to PD while users see errorsThe three-tier architecture is the cleanest separation of concerns of the five, but it is also the highest operational surface area. Plan for a dedicated TiDB SRE if self-hosted
Optimistic concurrency by default (pessimistic is opt-in)Better throughput when contention is low, no lock manager overheadHigh-contention rows produce abort storms, retry burden moves to the appInventory counter with thousands of concurrent decrements, OCC retry loop spikes CPU on TiDB and TiKV bothThe recommendation has moved to "use pessimistic by default in 8.x" for OLTP. Teams that adopted OCC in 4.x and never re-evaluated still see the abort storms in 2026
96MB regions, automatic split and mergeFine-grained rebalancing, PD can move hot regions in secondsTens of thousands of regions per node, Raft heartbeat overhead similar to CRDB's MultiRaft problemA 10TB cluster has ~100K regions, region-related metadata on PD becomes hot, scheduling decisions lagThe "hibernate region" optimization makes idle regions free, but the math still pushes you to bigger region sizes (256MB+) for very large clusters
TiFlash columnar replica for HTAPReal-time analytics on fresh OLTP data without ETL, single SQL endpoint, optimizer picks engine per query2x storage cost when TiFlash is enabled, optimizer mistakes can route analytics queries to TiKV and OLTP to TiFlashAnalyst writes a heavy GROUP BY without a hint, optimizer picks TiKV, scan starves OLTP for 10 minutesHTAP is the most-marketed and least-understood feature. Real customers split workloads by isolation_read_engines session var anyway. The "no ETL" promise is true; the "no operational discipline" implication is not
Apache 2.0 license, real open sourceFree fork path, large Chinese-tech adoption (ByteDance, Pinduoduo at hundreds-of-TB scale) gives real production validationDocumentation and ecosystem still feel Mandarin-first in places, support tier expectations vary by regionYou need authoritative answers fast and the best blog post on your symptom is on a Chinese tech blogThe PingCAP commercial roadmap (TiDB Cloud Premium / Essential, late 2025) is moving the lift-and-shift complexity into managed service. Worth evaluating against Aurora DSQL for that exact reason
Two-phase commit across regions for cross-region transactionsReal distributed ACID, not "eventually consistent multi-master"2PC latency = at least 2 cross-region RTTs, prepared but uncommitted transactions blockA coordinator failure during 2PC leaves transactions in prepared state, you discover them in alerts hours laterThe cluster-to-cluster (CDC-based) async replication path is what most production multi-region TiDB looks like, not true multi-region 2PC. Frame the choice as "sync within region, async across" honestly

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

Disaggregated · OCC · EC2 Time Sync · Serverless
Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
OCC, cross-region coordination only at COMMITReads and writes execute locally on Query Processor, no per-statement WAN latency, multi-region writes sub-secondConflicting transactions abort at commit time, retry burden lives in the appTwo regions both write to the same logical row, both look successful until COMMIT, one gets an OCC abort with no prior warningThis is the single biggest architectural bet in DSQL. It pays off when contention is low and writes are partition-friendly, and falls over when you have hot rows, in which case Aurora DSQL is the wrong tool, not "needs tuning"
EC2 Time Sync as the clock substrateMicrosecond-bounded physical time across AWS without owning atomic clocks, like Spanner's TrueTime without the on-prem hardwareAWS-only forever, no path to multi-cloud, time-sync infra is AWS-internalCompliance or strategic mandate forces multi-cloud, DSQL cannot followThis is why DSQL exists as a separate service from Aurora Postgres. The Time Sync dependency is intrinsic to the architecture, not an implementation detail
3000-row, 10MiB, 5-minute transaction capPrevents head-of-line blocking at Adjudicator and Journal, predictable system-wide latencyBatch jobs, large ETL, schema migrations on big tables cannot run as single transactionsA nightly batch that worked on Aurora Postgres has to be re-architected as N chunked transactions with idempotency, app changes scope balloonsThe 3000-row limit is the most honest design statement of any of the five. It is not a quota, it is a contract: DSQL refuses to be your batch processor
No foreign keys, no triggers, no stored procs, no JSONB (as of 2026)Forces all referential integrity and business logic into application code, makes the database genuinely stateless behavior-wiseMost existing Postgres apps require substantial rewrite, ORM features that assume FK constraints break silentlyYou migrate, FKs no-op, an orphan row is inserted, you discover it in a customer report next quarterThis is the bet that "the database is for data, the app is for logic." Defensible architecturally, brutal in migration. The 2025 add of views and unique indexes signals AWS hears the feedback
Serverless with scale-to-zeroNo instance sizing, no idle cost, no PG Bouncer (no connection pooler needed), 10K default connections per clusterDPU billing is hard to forecast, query plan changes can 3x cost without a code changeA subquery rewrite during a refactor turns into 3x DPU consumption, monthly bill spikes, no committed-use discount to fall back onThe CloudWatch DPU breakdown (ComputeDPU, ReadDPU, WriteDPU, MultiRegionWriteDPU) is the most important observability surface. Treat it like a SLO, not just a billing metric
Active-active multi-region, no leader99.999% multi-region availability SLA, no failover step, both endpoints accept writes concurrently with strong consistency2x multi-region write DPU cost (writes replicate to both regions), cross-region latency surfaces as commit latencyA 150ms RTT between regions makes every commit pay the full RTT, write throughput per connection drops accordinglyThe cost model means "active-active" is not free in DSQL. Many teams will discover their workload is fundamentally region-local and they were paying for global by accident
Postgres 16 subset, not full PostgresFamiliar wire protocol, drivers, basic SQL, EXPLAIN worksNo extensions (no pgvector, no PostGIS, no pgcrypto), no sequences, no temp tablesApp uses UUIDs from `gen_random_uuid()` (works), then needs PostGIS spatial join (doesn't), and you discover during integrationThe "Postgres compatible" claim is the same shape as Cockroach's. Test on your actual query workload, not the wire protocol
GA'd May 2025, still maturing the feature gapAggressive roadmap, new features monthly (views, unique indexes, PrivateLink, FIS integration in 2025)Roadmap-driven decision making, features assumed at decision time may not arrive when promisedYou promised stakeholders FK support "soon" based on AWS signal, two quarters later it is still not thereThe aggressive launch cadence is genuine, but PE-grade decision making requires assuming today's feature set, not the roadmap. Aurora DSQL in mid-2026 is not the Aurora DSQL of late 2027

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

TrueTime · Paxos · 2PC · External Consistency
Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
TrueTime (GPS + atomic clocks) for external consistencyThe strongest correctness guarantee available in any commercial SQL DB, transactions appear to execute in real-time order globallyCommit-wait pays the TrueTime uncertainty bound (typically a few ms) on every cross-machine writeTight-loop write benchmarks look slower than Cockroach because commit-wait is unhiddenThe commit-wait is what makes the system honest. Cockroach's uncertainty restarts are the same bound paid as a restart cost instead. Spanner amortizes it as wait time, Cockroach amortizes it as retry rate. Same physics, different UX
Paxos per split, leader-based writesProven, deployed inside Google for 15+ years, the most validated distributed consensus implementation outside RaftLeader-based writes mean cross-region writes have one fast region (the leader) and one slow (followers)A multi-region instance has unbalanced write latencies by design, EU writers see 50ms while US writers see 5msSpanner's "regional vs multi-regional" config choice is more important than its TrueTime story for actual workload economics. Pick wrong and you pay 5x for the wrong shape of consistency
GoogleSQL dialect (PostgreSQL dialect now available)Two SQL surfaces, PG dialect supports most ORMs and existing PG apps, GQL graph language for Spanner GraphBoth dialects diverge from upstream Postgres, schema features (interleaved tables) are Spanner-specificYou design schemas using interleaved tables for performance, later cannot port back to vanilla Postgres without redesignThe PG dialect is a recent (2022+) compatibility layer, not the native language. For new builds, GoogleSQL is the better-supported and more honest choice unless multi-DB compat is a hard requirement
Snapshot reads at any timestamp from any replicaGlobally consistent read-only transactions at the local replica without leader round-tripReplica "safe time" can lag if it has not seen recent writes, reads may block waiting for safe time to advanceA snapshot read at "now" on a replica that is behind blocks for hundreds of ms during a write-heavy momentStale reads ("stale" by a few ms intentionally) are the production pattern. "Strong reads" (current timestamp) are correctness theater for most read paths
Multi-region SLA of 99.999%The highest published SLA in the category, Google takes the operational ownershipPremium pricing, multi-region instances cost ~3x regional, processing-unit billing rewards continuous low utilizationA spiky workload pays for steady-state PUs that mostly idleThe Editions split (Standard / Enterprise / Enterprise Plus) means the SLA you actually get depends on a procurement decision, not just the deployment topology. Read the SLA per edition
GCP-only (until Spanner Omni in 2026 preview)Deepest possible integration with Vertex AI, BigQuery (Zero-ETL), Pub/Sub, IAMMulti-cloud is now possible via Spanner Omni preview but is not Spanner in production parityYou commit to Spanner, your company gets acquired by an AWS-only shop, migration is multi-quarterSpanner Omni's software-defined TrueTime is the most interesting 2026 development in this space. Whether it reaches feature parity with managed Spanner is the bet that decides multi-cloud distributed SQL for the next 5 years
2PC across Paxos groups for multi-split transactionsReal ACID across the entire schema, not just intra-shard2PC + Paxos + commit-wait stacks: cross-split transactions are visibly slowerA transaction that touches 5 splits pays 5x the prepare overhead, OLTP latency p99 climbs from 8ms to 40msThe interleaved tables feature is the workaround: keep child rows physically co-located with parent so common transactions stay single-split. Most Spanner experts will say "schema design is the only optimization that matters" and they are right
Multi-model (relational + graph + vector + full-text)Single system for OLTP + GraphRAG + semantic search, no ETL between specialized DBsEach new model is "good enough" rather than best-in-class against Neo4j, Pinecone, ElasticsearchYou picked Spanner Graph for the consolidation story, later your data scientists need Cypher-only features that GQL doesn't have yetThe multi-model story is a real differentiator, but maturity per model varies. Spanner Graph GA'd in 2025, vector indexes for PG dialect GA'd recently. Treat each model as its own production-readiness check

2. Use Cases

Real workloads with the driving property that ruled out the obvious alternative. Generic "scalability" is not a driving property.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
Multi-region SaaS control planeDoorDash, Netflix, Shipt have publicly used CRDB for control-plane storageSERIALIZABLE isolation across regions without a custom 2PC layerTens of TB, 10K+ QPS sustained, p99 under 25ms in-regionPostgres + sharding requires building the distributed-transaction story you would get for free
Geo-residency compliance (GDPR, India DPDP)Mid-size fintech with EU and India users in one logical appRow-level pinning to a region, with one schema and one connection string1-2TB per region, 5K QPS, hard residency boundaryYugabyteDB is the close competitor; CRDB wins on Postgres-extension expectation and zone-config expressivity
Lift-and-shift from Postgres at scaleExisting PG monolith hitting single-primary write ceilingPostgres wire compatibility plus horizontal write scaling without re-architecting the appOutgrew a 96-vCPU primary, sustained 30K writes/sec, 5TB and growingAurora Postgres scales reads via read replicas but not writes, sharding via Vitess is MySQL-only
Inventory and ordering systems with strong correctnessE-commerce with thousands of concurrent decrement transactions on hot SKUsDefault SERIALIZABLE prevents oversell without app-side locking gymnastics5K orders/sec at peak, hot SKU contention is the design hazardMongoDB and Cassandra cannot give the correctness guarantee, would require app-level inventory service
Hybrid on-prem and cloud deploymentsRegulated industries (banking, healthcare) wanting cloud agility plus on-prem fallbackSame engine on bare metal, EKS, GKE, no vendor lock to a cloud providerMulti-cluster, 50+ nodes per cluster, mixed deployment environmentsSpanner is GCP-only (Spanner Omni preview as of 2026), DSQL is AWS-only
Distributed audit logs with point-in-time recoveryCompliance-heavy workloads needing tamper-evident historyTime-travel queries (AS OF SYSTEM TIME) for free, MVCC retention configurableYears of history retained, queries against 30-day-old snapshots in productionPostgres needs separate WAL archiving + restore choreography, distributed competitors lack the AS OF clause

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
PostgreSQL replacement for write-scale-outSaaS hitting PG vertical scaling limit, ORM and stored procs in useHighest PG feature parity of the five, real triggers and FKs and extensions5-20TB, 10K+ writes/sec, 100s of stored procedures to preserveCockroachDB lacks triggers and most extensions, Aurora DSQL is a far thinner PG subset
Multi-cloud distributed SQLEnterprise mandate to avoid AWS / GCP lock-in on the system of recordApache 2.0 license, runs identically on EKS, GKE, AKS, bare metalHundreds of nodes across two clouds, active-passive xCluster betweenSpanner is GCP-only, DSQL is AWS-only, CockroachDB now requires a paid Enterprise license above $10M revenue
Financial systems with geo-partitioningBank with strict per-jurisdiction data residencyRow-level geo-partitioning via standard PG declarative partitioning syntaxMultiple regulatory regions, single schema, single connection stringApp-level routing with PG is operationally tractable but auditable-by-design is harder to demonstrate
Telco subscriber data storeMobile carrier with regional billing systems, 99.99% SLASynchronous replication across 3 AZs, follower reads for nearby low-latency lookupHundreds of millions of subscribers, sub-10ms p99 reads, sustained MGT writesCassandra gives availability but no SQL, MongoDB lacks the strong-consistency guarantees needed for billing
Active-active two-region appsSaaS needing low write latency in both US and EU regionsxCluster bidirectional async replication, accept eventual conflict resolution to gain local write speedSub-5ms in-region writes, ~1s cross-region replication lagSpanner multi-region pays commit-wait for true active-active; DSQL pays full RTT at commit; both more expensive
Edge-adjacent IoT or device dataManufacturing or IoT with regional clusters synced asynchronouslyxCluster supports many-to-one fan-in for centralized analytics30+ edge clusters streaming to a central reporting clusterBuilding this on Postgres requires logical replication + custom conflict handling per table

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
MySQL replacement at scaleByteDance, Pinduoduo, Shopee have used TiDB for MySQL replacement at hundreds of TBMySQL wire compatibility, horizontal scale, no Vitess operational burdenHundreds of TB, 100K+ QPS sustained, MySQL ecosystem fully reusableVitess preserves MySQL but inherits sharding complexity, AWS RDS MySQL caps out at the single-primary ceiling
HTAP on operational dataE-commerce running analytics on fresh OLTP via TiFlash columnar replicaReal-time analytics without an ETL pipeline, single SQL endpoint routes per queryOperational DB also serving dashboards with sub-minute freshnessSeparating OLTP and OLAP into PG + Snowflake adds an ETL pipeline and 5-minute-plus freshness gap
Game backend stateMobile and console game studios needing transactional state at scaleMySQL compatibility (most game backends are MySQL-first), strong consistency for in-game economyMillions of concurrent sessions, transactional inventory and currency operationsCassandra and Redis cannot give the transactional guarantees, MySQL needs sharding
Financial settlement and reconciliationFintech with MySQL legacy, regulatory pressure for ACID across larger data setsStrong consistency with MySQL semantics, audit-friendly10s of TB, sustained pessimistic-transaction throughputAurora DSQL has no triggers or FKs and a 3000-row transaction cap; both are dealbreakers for batch reconciliation
SaaS multi-tenant OLTPVertical SaaS replacing per-tenant MySQL shards with one logical TiDB clusterEliminate per-tenant shard sprawl, simplify connection management, single SQL plane1000s of tenants, GB-to-TB per tenant, MySQL drivers everywherePG-based options force MySQL→PG migration on top of distribution; TiDB removes that variable
Asia-Pacific deployments with PingCAP supportCompanies with China and SEA presence valuing native-language supportStrong support presence and ecosystem in APAC, mature TiDB Cloud offering on Alibaba Cloud and AWSRegional teams, mixed Mandarin/English opsSpanner is GCP, DSQL is AWS, CRDB has limited APAC commercial bench

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
Greenfield serverless apps on AWSNew product launches on Lambda + API Gateway with unpredictable trafficScale-to-zero billing, no connection pooler, no instance sizingVariable load from 0 to thousands of QPS, no capacity planningAurora Serverless v2 has a minimum cost floor, Aurora Postgres requires instance sizing
Active-active multi-region without operational handoffRobinhood and ADP publicly cited as launch customers for global SLAs99.999% multi-region availability with single logical endpoint per regionTwo-region active-active, sustained 4x faster than Spanner per AWS GA claims (verify on your workload)Spanner is GCP-only, multi-region CRDB requires you to operate the cluster
Microservices ledger or transactional stateMicroservice owning a small slice of state, no batch jobs, well-bounded transactions3000-row transaction limit is fine for service-shaped workloads, OCC scales without locking10s of GB per service, low-contention writes, microservice-shaped accessDynamoDB would work but loses SQL, joins, and ad-hoc query flexibility
Multi-tenant SaaS with tenant-isolated workloadsSaaS where tenant rows are independent, no cross-tenant transactionsOCC works when conflicts are tenant-scoped, scale-to-zero per tenant inactive periodMany small tenants, low cross-tenant locking concernsRDS Postgres requires upfront capacity for the worst-case tenant
Event-driven applications with bursty trafficWorkflow orchestration, retry queues, scheduled-job dispatchersPer-DPU billing means quiet hours are nearly free, no provisioned floor to pay forWorkload spikes 100x between idle and peak, no traditional sizing answerProvisioned Aurora pays for peak 24/7, Lambda + DynamoDB loses SQL ergonomics
Net-new applications choosing strict AWS-nativeAWS-first orgs with IAM, KMS, VPC, IaC pipelines already standardizedIAM auth (token-based, no DB passwords), CloudWatch DPU metrics, AWS FIS for failover testingStandard AWS-resident stack, single-cloud strategic mandateSelf-managed CRDB or YugabyteDB on EKS doubles operational scope

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
Global financial systemsBanks and fintechs with multi-region ledgers, e.g., trading and payment platformsExternal consistency, the only commercially deployed system with this guaranteeTrillions of rows, multi-region with 5-nines SLA, strict audit requirementsNone of the others guarantee external consistency in the formal sense, only strong consistency
Ad-tech and large-scale OLTPGoogle's own AdWords backend (F1) is the canonical use caseStrong consistency at petabyte scale with predictable latencyPetabytes, millions of QPS, single logical schema across global footprintNo alternative has demonstrated this scale in production
Multi-model consolidationCompanies consolidating off Neo4j + Elasticsearch + Pinecone + Postgres into one platformSingle ACID-compliant store across relational, graph, vector, and full-text10s of TB, mixed workload patterns, ETL elimination as a strategic goalEach specialized DB still wins on its niche; Spanner wins on consolidation when "good enough" per model suffices
Gaming leaderboards and global stateMobile game backends with single global leaderboard and player stateGlobally consistent ordering of writes, single source of truth across regions10M+ concurrent players, sub-second cross-region read freshnessCassandra and Redis cannot give cross-region ordering, only locality-based eventual consistency
Spanner Graph for fraud detectionFinancial fraud detection with graph traversal on fresh transaction dataCombined SQL + GQL graph queries on ACID-compliant data, no ETL to a separate graph DBBillions of edges, real-time scoring on OLTP transactionsNeo4j is the graph specialist but requires ETL from the OLTP store, adding latency and consistency risk
Multi-cloud strategic deployments (Spanner Omni)Companies committed to Spanner but with non-GCP infrastructure obligationsSpanner Omni preview brings the engine to on-prem, AWS, Azure (preview-quality, not production)Mid-size deployments, evaluation as of 2026, not strategic-bet-the-companyIf GCP is on the table, native Spanner is always cheaper and more mature

3. Limitations

Matrix layout. Rows are the categories that bite at scale; cells describe how each technology is constrained, with severity coded by color.

Limitation CockroachDB YugabyteDB TiDB Aurora DSQL Spanner
Transaction size ~64MB practical (raft command size), no hard row count MEDIUM ~512MB tablet-level, soft transaction limits via configs MEDIUM 10GB hard limit in default mode, 100GB with "large transactions" enabled MEDIUM 3000 rows, 10MiB, 5 minutes hard CRITICAL 20K mutations per commit, 100MB max HIGH
SQL feature completeness vs upstream Postgres subset, no LISTEN/NOTIFY, no advisory locks, no extensions HIGH PG 11.2 base, most features supported but cadence lags PG releases by years MEDIUM MySQL 5.7+ subset, no triggers in many versions, no events MEDIUM No FKs (no enforcement), no triggers, no stored procs, no JSONB, no temp tables, no sequences, no extensions CRITICAL GoogleSQL is a separate dialect, PG dialect is partial, interleaved tables non-portable HIGH
Multi-region write latency Cross-region Raft quorum at write time, ~RTT to nearest 2 regions HIGH Same as CRDB for sync, sub-second for xCluster async (with consistency caveats) HIGH Cross-region 2PC pays multi-RTT, most multi-region TiDB is cluster-to-cluster async HIGH RTT at commit, sub-statement local execution, OCC abort at commit MEDIUM TrueTime commit-wait + Paxos, leader-side fast / follower-side slow MEDIUM
Hot-row contention behavior SERIALIZABLE → transaction restart, can cascade HIGH Similar restart pattern, slightly better via pessimistic locking option MEDIUM OCC default storms aborts, pessimistic mode mitigates HIGH OCC abort at commit, no early failure detection CRITICAL Pessimistic locking + 2PC, lock contention but no retry storm MEDIUM
Operational complexity (self-managed) Single binary, hundreds of knobs, requires CRDB-experienced SRE HIGH Two services (Master, TServer), three for HA, requires PG + RocksDB expertise HIGH Three core services (TiDB, TiKV, PD) + TiFlash, highest surface area CRITICAL N/A — fully managed, no self-hosted option N/A N/A managed, Spanner Omni is preview-only as of 2026 N/A
License / vendor lock-in CCL closed source, $10M revenue gate for free Enterprise, telemetry mandatory on free tier HIGH Apache 2.0, genuinely open LOW Apache 2.0, genuinely open LOW AWS-only, no multi-cloud path CRITICAL GCP-only natively (Omni preview opens this), Google-managed proprietary HIGH
Cost predictability Per-vCPU licensing on Enterprise, predictable but not cheap MEDIUM Compute + storage on managed, free self-hosted LOW Managed pricing similar to CRDB Cloud, self-hosted is the cost win MEDIUM DPU billing notoriously hard to forecast, query changes can 3x cost CRITICAL Processing-unit billing predictable but expensive, multi-region is 3x regional HIGH
Analytics / OLAP capability Not designed for OLAP, doc says so, follower reads help but limited HIGH Same, no columnar engine, recommend separate OLAP store HIGH TiFlash columnar replica gives real HTAP, the only one with built-in OLAP LOW No analytics workload, OLTP only by design CRITICAL Spanner Data Boost separates analytics traffic from OLTP, BigQuery Zero-ETL closes the gap MEDIUM
Cold-start / scaling-up latency Manual node provisioning, minutes to add capacity MEDIUM Manual node provisioning, similar to CRDB MEDIUM Manual node provisioning, PD scheduling needs time to rebalance MEDIUM Serverless, sub-60s cluster provision, scales automatically LOW Autoscaling (Enterprise+ tier) automatic, processing-unit adjustments seconds LOW
Maximum cluster size in practice Hundreds of nodes published, Cockroach Labs ref customers in low hundreds MEDIUM Hundreds of nodes published, comparable to CRDB MEDIUM Thousands of TiKV nodes at ByteDance scale LOW Single cluster, 256TiB max storage as of Nov 2025 (raised from 128TiB) MEDIUM Petabyte-scale validated by Google's internal F1 deployment LOW

4. Fault Tolerance

Replication model is the headline; failure detection time and split-brain behavior are what determine your incident postmortem template.

Dimension CockroachDB YugabyteDB TiDB Aurora DSQL Spanner
Replication model Raft per range, default RF=3, leaseholder serves reads Raft per tablet, default RF=3, leader-lease-protected reads Raft per region in TiKV, default RF=3, leader serves reads Disaggregated, Journal-replicated across 3 AZs, no traditional consensus Paxos per split, RF configurable (default 3 zones, multi-region adds more)
Failure detection Liveness heartbeats, ~9s for unavailability, configurable Tablet peer heartbeats, ~3s detection PD heartbeats, ~10s for region leader replacement Control plane health checks, transparent to client TrueTime + Paxos heartbeats, sub-10s detection in practice
Failover mechanism Raft re-election, leaseholder moves, no client config change Raft re-election + leader-lease wait, no client change Raft re-election in TiKV + PD rebalance, no client change Automatic re-routing to healthy QP/Journal/Adjudicator, invisible to client Paxos re-election, automatic, no client change
RTO (typical) ~9s for node loss, 30s+ for AZ-wide event due to lease churn ~3-10s for node loss, depending on lease parameters ~10-30s for TiKV node loss, PD election adds latency for region Sub-second for AZ failure (multi-AZ active by design) Zero RTO for zone failure (Spanner Omni doc claims, native Spanner sub-second)
RPO (typical) Zero (synchronous Raft commit) Zero for sync replication, seconds for xCluster async Zero (synchronous Raft commit) Zero (Journal acknowledges before client sees commit) Zero (Paxos quorum commit)
Split-brain behavior Minority partition rejects writes, leaseholder check on reads Leader-lease prevents two leaders, minority rejects writes Minority partition rejects writes, PD election quorum-protected No leader, conflicts surface as OCC aborts at commit Paxos quorum, minority side cannot make progress
Single-node failure blast radius Hundreds of ranges may lose leaseholder, ~9s of localized read latency Tablets where node was leader briefly unavailable for writes Regions where node was leader briefly unavailable for writes One QP loss is invisible (request re-routes), one AZ loss is sub-second Splits where node was leader unavailable briefly, others unaffected
Cross-region failover Manual via zone-config changes, can be fully automated with survival goals Automatic for sync clusters, manual cutover for xCluster async Manual cluster-to-cluster failover for the typical async multi-region setup No failover step — multi-region is active-active, traffic shifts naturally Automatic for multi-region instances (Enterprise Plus tier)
Data loss scenarios Simultaneous loss of quorum in a range (3-AZ simultaneous failure) Same as CRDB for sync, xCluster async can lose unreplicated writes Quorum loss in TiKV region, or PD majority loss for cluster metadata Multi-AZ Journal quorum loss; multi-region cluster pair both lost Multi-region quorum loss (rare given replica count and geographic spread)
Published availability SLA No managed-service SLA on self-hosted; CockroachDB Cloud Advanced has tier-specific SLAs YugabyteDB Aeon (managed) has tier SLAs; self-hosted depends on operator TiDB Cloud Dedicated 99.99%, self-hosted depends on operator 99.99% single-region, 99.999% multi-region 99.99% regional, 99.999% multi-regional (Enterprise Plus)
Divergence to flag: DSQL and Spanner publish 99.999% multi-region. That number is reachable architecturally on CRDB and YugabyteDB too, but only if you operate the cluster to that bar yourself. The SLA you contractually receive vs. the SLA your engineering can achieve are different questions, and procurement frequently confuses them.

5. Sharding

Hash, range, or directory; how rebalancing happens and what the hot-shard escape hatch looks like.

Dimension CockroachDB YugabyteDB TiDB Aurora DSQL Spanner
Sharding model Range-based on PK by default, hash-sharded indexes available Hash by default for tables, range for indexes (opt-in for tables via SPLIT) Range-based (Regions, ~96MB), automatic split and merge Storage cloning + automatic sharding under the hood, not user-visible Range-based (splits), automatic dynamic partitioning
Shard key constraints Primary key, any data type; monotonic keys are an anti-pattern Primary key, hash-distributed by default avoids monotonic hot-shard Implicit (auto-generated row IDs) or user-defined cluster index No user-visible shard key; system manages distribution Primary key, can use UUID or co-located (interleaved) keys
Rebalancing mechanism Background range splits at ~512MB or load threshold, leases move independently Tablet split + load-based leader rebalancing managed by YB-Master PD continuously rebalances regions across TiKV nodes, latency-aware Hidden, automatic, no user controls Splits move automatically based on load and size; admin can pre-split
Rebalancing cost / impact Transparent to client, can cause brief read-latency spikes during lease transfer Transparent, leader-lease wait briefly extends write unavailability Transparent, PD throttles to limit impact, can be paused Invisible to client by design Transparent, briefly blocks writes to the moving split
Hot-shard behavior Auto-split if load exceeds threshold; monotonic PK is the common foot-gun Hash sharding avoids the monotonic hot-shard problem by default Auto-split, PD migrates leader off hot region; high-write hot rows still abort OCC means hot rows abort on commit; no "hot shard" in the traditional sense Auto-split, interleaved tables reduce hot-spot risk by co-locating logical units
Maximum shards (practical) Hundreds of thousands of ranges per cluster, validated in published deployments Hundreds of thousands of tablets per cluster ~100K regions per node feasible with hibernated regions Not user-visible; AWS manages scale to 256TiB cluster Petabyte scale in production at Google (F1), no published upper bound
Resharding without downtime Yes, automatic via range splits, manual via ALTER PRIMARY KEY (online schema change) Yes, automatic via tablet splits, online tablespace moves for geo-partitioning Yes, fully automatic, no SQL needed Yes, completely transparent Yes, automatic, no user action required
Cross-shard query support Distributed SQL planner fans out and aggregates, full SQL support across shards Distributed SQL planner from forked PG, full SQL across tablets Distributed SQL with TiFlash + TiKV coordination, full SQL Full SQL, distribution invisible Full SQL with 2PC for cross-split, GoogleSQL or PG dialect

6. Replication

Topology, sync vs async, and the cross-region story. The first three are Raft-similar; the last two diverge.

Dimension CockroachDB YugabyteDB TiDB Aurora DSQL Spanner
Replication topology Single leader per range (leaseholder), Raft followers Single leader per tablet, Raft followers, read replicas opt-in Single leader per region in TiKV, Raft followers, TiFlash columnar replica No leader; Journal replicates writes, Adjudicator resolves conflicts Single Paxos leader per split, witness replicas for quorum
Sync vs async Sync within Raft group, async to non-voting replicas (follower-reads) Sync within Raft group, async via xCluster for cross-region Sync within TiKV Raft group, async via CDC-based cluster replication Sync across AZs, sync across regions at commit (multi-region peered) Sync via Paxos, both regional and multi-regional are synchronous
Replication factor (default / max) 3 default, configurable per zone 3 default, configurable via tablespace per fault domain 3 default, configurable per region/label 3 across AZs fixed (multi-region adds full second cluster) 3 default per region, multi-region adds zones for additional copies
Consistency level options SERIALIZABLE (default), READ COMMITTED (opt-in 24.x+), follower reads with bounded staleness SERIALIZABLE, SNAPSHOT, READ COMMITTED, follower reads with bounded staleness SNAPSHOT (default in OCC), READ COMMITTED via pessimistic txn Snapshot isolation, linearizable reads External consistency (default), bounded staleness reads, exact staleness reads
Replication lag (typical) Sub-ms within Raft group, configurable seconds for follower reads Sub-ms within Raft, ~sub-second for xCluster async Sub-ms within Raft, ~seconds for cross-cluster CDC Zero (all writes durable at commit before ack) Sub-ms commit-wait dominates, sub-second cross-region
Conflict resolution Raft serializes all writes, no conflicts to resolve Same as CRDB for sync; xCluster async uses last-writer-wins at WAL level 2PC commit ordering, no conflicts in sync path OCC, conflicting writer aborts at commit time 2PC with TrueTime ordering, no conflicts in commit path
Cross-region replication Sync via Raft (high latency) or async via Change Data Capture Sync (high latency) or async via xCluster (LWW conflicts allowed) Mostly async cluster-to-cluster, 2PC available but rarely deployed multi-region Sync at commit, multi-region peered clusters are first-class Sync via Paxos with TrueTime, multi-region is first-class
Replication during partition Minority Raft side rejects writes, majority side serves Same, minority Raft side blocked, leader-lease prevents stale reads Minority TiKV region blocked, PD must have majority OCC means partitioned writes will conflict at commit, transactions abort Paxos minority blocked, majority continues

7. Better Usage Patterns

Five patterns per tech. The anti-pattern half of each row is the one that shows up in code review; the better-way half is the one experienced operators recommend.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
Primary key designAuto-increment INT or timestamp PK migrated from Postgres unchangedUUID or hash-sharded composite key from day one, never monotonicAvoids the single-hot-range write bottleneck that becomes a Saturday-night oncall
Transaction retriesCatch generic exceptions, retry once, hope it worksUse the official retry wrappers (savepoint cockroach_restart), exponential backoff, surface aborts as a metricSERIALIZABLE means retries are part of the normal control flow, not an exception path
Follower readsTreat them as "free strong reads" for read-heavy workloadsUse AS OF SYSTEM TIME or follower-read clause explicitly, document the staleness bound per query classStale reads with no documented staleness will be the root cause of a data-correctness postmortem within 6 months
Geo-partitioningSet up survival goals and zone configs once, forget the schema-level enforcementAdd CHECK constraints that prevent cross-region rows from being inserted into the wrong partitionStops "compliance accidents" at the schema layer, not in code review
Schema change cadenceManual psql ALTER TABLE during a maintenance windowUse online schema changes (DDL is non-blocking by design in CRDB), but stage in non-prod with the same row countsCRDB does online DDL well; the failures are usually about under-tested constraints, not the DDL itself
Connection managementUse a single pgbouncer pool sized for the largest nodeUse SQL-aware proxy (cockroach-sql-proxy or CockroachDB Cloud's load balancer), sized per-node not per-clusterCross-node connection pooling adds tail latency from lease redirects; SQL-aware routing keeps queries node-local

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
Hash vs range shardingAccept hash default for everything, then complain range scans are slowUse HASH for OLTP point lookups, RANGE explicitly for time-series and ordered queriesThe default is right for 80% of cases; the other 20% need an explicit ASC PRIMARY KEY (col, col) SPLIT INTO ... clause
Connection poolingDirect PG connections from app, hundreds of YSQL backends each consuming RAMYugabyteDB Smart Driver + connection pool, target 10-50 active connections per nodeYSQL inherits Postgres' per-connection process model; without pooling, RAM and context switches kill the cluster
xCluster active-activeUse auto-increment PKs and unique constraints, "fix it later"UUIDs for PKs, no unique constraints on indexed columns, application-level conflict reconciliation logicThe docs literally warn about this; conflicts are silent and you find them in customer reports
Geo-partitioning + FKsTry to enforce FKs across partitioned tables, hit the partition + FK limitCo-locate referenced rows in the same geo-partition, denormalize across regions if neededThe schema-level limitation is real; pretending it isn't leads to insertion failures or silent missing FK semantics
Bounded staleness readsRun all reads strong, build a "read replica" via xCluster for analyticsSET yb_read_from_followers = true with explicit staleness; mix in same session as neededAvoids building a parallel infrastructure for what is a per-query parameter; saves substantial cost and complexity
YB-Master sizingTreat YB-Master like a metadata service that needs minimal resourcesSize YB-Master for the catalog operations of your cluster; CPU spikes during heavy DDL or split stormsAn underpowered Master becomes the bottleneck during incident recovery, exactly when you can't afford it

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
OCC vs pessimisticLeave the default and assume MySQL-compatible behaviorUse pessimistic locking (set transaction default in 5.0+) for OLTP, OCC only for low-contention bulk pathsThe default has shifted historically; teams who adopted on 4.x default OCC and never re-evaluated are still seeing abort storms in 2026
TiFlash usageAdd TiFlash, hope the optimizer picks it for analytics queriesUse isolation_read_engines session variable per workload class, force OLTP to TiKV and OLAP to TiFlash explicitlyOptimizer mistakes are the #1 HTAP problem in TiDB; manual routing per session class avoids them
Auto-incrementing PKUse AUTO_INCREMENT or AUTO_RANDOM without thinkingAUTO_RANDOM for OLTP tables to avoid the last-region hot-shard problem, AUTO_INCREMENT only for analytics-style tablesTiDB allocates AUTO_INCREMENT in node-local batches but the last-region hot-spot is still real for sequential reads
PD placementPlace PD on the same nodes as TiKV for cost reasonsDedicated PD nodes in their own AZ, sized for the TSO throughput your cluster needsPD is on the critical path for every transaction; co-locating with TiKV creates cascading failures during a single node loss
Region size tuningStay on 96MB default for the largest clustersIncrease region size to 256MB or 512MB at >10TB scale to reduce PD metadata overheadRegion count grows with data size; PD scheduling and TiKV heartbeat overhead grows linearly with region count
Cross-region deploymentStretch a single cluster across regions and accept high write latencySeparate cluster per region with CDC-based async replication (TiCDC), accept eventual cross-regionThe vast majority of production multi-region TiDB is region-local with async replication; pretending otherwise produces unusable write latency

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
Transaction shapeMigrate batch jobs as single transactions, hit the 3000-row limit in productionChunk into 1000-row batches with idempotency keys, design every job to be resumableThe 3000-row limit is intentional; pretending it's a quota leads to constant retries instead of correct chunking
Foreign key replacementSkip FKs, hope the ORM enforces themApplication-level referential integrity with explicit consistency checks at transaction boundariesWithout FKs, the database cannot prevent orphan rows; the application must own this invariant explicitly
Cost monitoringWatch the monthly bill, react when it spikesTreat per-query DPU consumption as a SLO; alert on Compute/Read/Write DPU breakdowns separatelyA subquery refactor can 3x DPU consumption silently; only per-query observability catches it before the bill
Index strategyAdd indexes the same way as Postgres, hope query plans workUse covering indexes aggressively to enable index-only scans; EXPLAIN ANALYZE VERBOSE is mandatory before deployStorage is cheap, DPU is not; covering indexes that eliminate table scans pay for themselves in days
OCC conflict handlingCatch OCC abort, retry once, give upExponential backoff, jitter, partition-aware retries, observability on conflict-prone keysOCC aborts are signal, not noise; high abort rate on a key means redesign, not retry harder
Multi-region costEnable multi-region for "availability" without measuring write cost impactMulti-region only when active-active is the actual requirement; single-region multi-AZ is 99.99% and cheaperMulti-region writes 2x DPU; many workloads enable it for marketing reasons and pay continuously

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
Schema designTreat Spanner like Postgres, design schemas without interleaved tablesUse INTERLEAVE IN PARENT for child rows that are always accessed with the parent (orders within users)Interleaved tables keep co-accessed rows physically together, eliminating 2PC for the common case
Read patternsDefault to "strong reads" for everything, pay TrueTime commit-wait on every readUse stale reads with bounded staleness (e.g., 10s) for non-critical paths, strong only when ordering mattersStale reads can be served from any replica without quorum, lowering p99 and cost simultaneously
Primary key designUse auto-incrementing INT64 or timestamp PK from Postgres habitUUIDv4 or hash-prefixed PK to avoid hotspot on monotonic insertsLike all range-partitioned systems, monotonic PKs create a hot last-split that takes all writes
Cross-region instance choicePick multi-region instance for "availability," accept the 3x costRegional instance with cross-region read replicas (where available) when writes are region-localMulti-region is for true global writes; regional+replica is cheaper for read-anywhere/write-here workloads
Edition choiceDefault to Standard, hit a feature gap later (graph, vector, geo-partitioning)Pick Enterprise or Enterprise Plus from day one if any multi-model or geo-partition feature is in the roadmapEdition upgrades are possible but per-feature licensing makes the math hard to undo
Connection managementUse the default Spanner client session count without tuningEnable multiplexed sessions (default in 2025+ clients) for high-concurrency workloads, sized to actual QPSSession exhaustion is a quiet failure mode; multiplexed sessions eliminate it in 80% of cases

8. Advanced / Next-Gen Alternatives

Where each technology might be replaced, augmented, or obviated. Three rows per tech, weighted to the alternatives that an L7 review would mention.

CockroachDB

Use when PostgreSQL compatibility, serializable-by-default correctness, and cloud portability matter more than minimizing retry and operational tuning burden.

Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
YugabyteDBApache 2.0 license, higher PG feature parity, no revenue gateProduction at scaleMedium, Postgres-compatible but DDL and tuning differLicense change forced re-evaluation, or PG ecosystem features (extensions, triggers) are required
Aurora DSQLEliminates self-managed operations on AWS, scale-to-zero billingGA May 2025High, requires substantial app refactor (no FKs, no triggers, txn limits)AWS-only strategy, greenfield apps, workload tolerant of OCC and DSQL limits
Postgres + Citus (now Microsoft-owned)True Postgres with horizontal sharding via Citus; lower license risk than CRDB EnterpriseMature, Azure-blessedMedium, app-side awareness of distribution columnsMulti-tenant SaaS where tenant is the natural shard key, want real Postgres
Vitess (MySQL-shaped)If you can switch to MySQL, Vitess is more mature and broadly deployed for horizontal scaleMature (YouTube, Slack, GitHub)Very high if migrating from PG; near-zero if already on MySQLExisting MySQL workload outgrowing single-primary, OSS-only mandate

YugabyteDB

Best when Apache 2.0 licensing, PostgreSQL APIs, Raft tablets, and default hash sharding fit a high-throughput OLTP migration.

Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
CockroachDBSlightly more mature commercial support and ecosystem in North AmericaProductionMedium, similar architecture but DDL and tuning differProcurement requires a US-headquartered vendor with tier-1 enterprise support
Aurora DSQLServerless economics, no cluster opsGAHigh, PG feature gap is wider than YB'sAWS-native, greenfield, smaller feature surface acceptable
Spanner with PG dialectExternal consistency, true multi-region writes, multi-model in one engineProductionHigh, GCP-specific and edition-tier-dependentGCP commitment exists, correctness guarantees are a procurement requirement
Postgres + pglogical / Patroni / Citus stackReal upstream Postgres, no fork to trackMatureMedium, requires assembling sharding + HA from partsPostgres-purist team unwilling to accept any fork, accepting higher operational burden

TiDB

Choose when MySQL compatibility, TiKV scale-out, and HTAP adjacency outweigh the complexity of PD, region scheduling, and transaction-mode choices.

Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
VitessPreserves MySQL semantics more strictly, more mature in North American hyperscalersMatureMedium, both are MySQL-protocol, but Vitess sharding is app-awareExisting MySQL with complex stored procs and triggers TiDB doesn't fully support
PlanetScale (Vitess + serverless)Serverless MySQL with branching, no cluster opsProductionLow for MySQL apps that fit PlanetScale's deferred schema modelWant managed serverless on MySQL ecosystem, OK with PlanetScale's foreign-key constraints
MySQL HeatWave (Oracle)HTAP via in-memory analytics engine, deep MySQL semantic parityProductionLow for OCI users, medium otherwiseHTAP need with strict MySQL parity, willing to use Oracle Cloud Infrastructure
SingleStore (formerly MemSQL)Row + columnar in one store, MySQL wire protocol, real-time analyticsProductionMedium, MySQL wire but different SQL semanticsMixed OLTP + sub-second analytics, want a more analytics-leaning HTAP than TiDB

Amazon Aurora DSQL

Best for AWS-native teams that want serverless distributed SQL with PostgreSQL compatibility and are comfortable trading control for managed time-sync assumptions.

Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
Aurora Postgres Limitless DatabaseTrue Postgres compatibility (FKs, triggers, JSONB, extensions) with horizontal shardingProductionLow for traditional PG apps, requires shard-key thinkingNeed full Postgres features but want AWS-native horizontal scale; multi-region not the priority
CockroachDB Cloud / YugabyteDB Aeon on AWSMore mature feature set, multi-cloud optionProductionLow to mediumDSQL feature gaps (FKs, triggers, batch transactions) are dealbreakers
DynamoDB with PartiQLIf you can model as key-value, DynamoDB is mature, cheaper, and has no transaction-row limitsMatureHigh, data model change requiredWorkload is actually KV-shaped and SQL was used out of habit, not necessity
Spanner Omni on AWS (preview)External consistency and multi-cloud, the only credible alternative for DSQL's correctness storyPreview 2026High, GCP-tied tooling and dialectMulti-cloud strategic, willing to accept preview-stage software

Google Cloud Spanner

Use when external consistency, managed global scale, and TrueTime-backed semantics justify Google Cloud lock-in and Spanner-specific schema design.

Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
Spanner OmniSame engine, multi-cloud (on-prem, AWS, Azure)Preview as of 2026Low within Spanner familyMulti-cloud mandate or GCP-exit risk hedge; willing to track preview features
CockroachDB / YugabyteDBMulti-cloud today, lower cost at small-to-mid scale, no GCP dependencyProductionMedium to high, depending on Spanner-specific features usedCost optimization, multi-cloud, willing to operate the cluster yourself
TigerBeetle (for financial workloads)Single-purpose double-entry accounting at sub-ms p99, deterministic and dedicated to ledger correctnessEmerging, early productionHigh, dedicated financial schema, separate from general OLTPPure financial-ledger workload where Spanner's general-purpose overhead doesn't earn its keep
FoundationDB + layersLower-level, transactional KV at extreme scale with stronger isolation guaranteesProduction (Apple, Snowflake)Very high, SQL is not native, build layers yourselfPlatform-engineering bet, building your own DB-as-a-service on top of a strong KV substrate