PostgreSQL

The world's most extensible relational database. Single-leader by design, multi-everything by extension, operationally honest about its limits.

RDBMS ACID MVCC Single-leader writes PG 18 (Sept 2025), 18.1 (Nov 2025)

As of 2026-06-07 · Grounded in PostgreSQL 17/18 reality

PE Verdict

PostgreSQL is the default boring choice for OLTP, and that boring is its superpower. Pick it unless you have a concrete reason not to: extreme write scale (then Citus, DSQL, or Cockroach), embedded use (SQLite), or pure document mutation patterns (DynamoDB/Mongo for ledger-style append). The two hidden cliffs every L7 candidate must know cold: XID wraparound (32-bit txid counter, vacuum is not optional, idle transactions stop vacuum) and process-per-connection (5-10MB per backend, max_connections beyond ~500 is malpractice without PgBouncer in transaction mode).

Best default choices

PostgreSQL firstDefault to Postgres for OLTP until write scale, topology, or latency requirements force a different system Index by workloadUse B-tree by default; reach for GIN, GiST, BRIN, partial, and covering indexes deliberately Partition with intentUse declarative partitioning for pruning, retention, and hot/cold data, not as generic scale magic Pool, vacuum, observeUse PgBouncer, tune autovacuum, and design around XID wraparound and connection limits

00. Architecture & Core Concepts

PostgreSQL is a process-per-connection, single-writer database that earned its reputation by being correct first and fast second. Every feature below is grounded in three primitives: the WAL, MVCC, and the shared buffer cache.

Architecture Model

Process-per-connection

Postmaster forks a backend process per client. 5-10 MB RAM each. Postmaster also manages background workers (autovacuum launcher, WAL writer, checkpointer, bgwriter, archiver, walsender, walreceiver).

Storage Engine

Heap + WAL

Tables stored as 8KB pages in heap files. All writes go to WAL first (synchronous fsync at commit by default), then dirty pages flushed to heap by checkpointer. No pluggable storage engine in core; pluggable table AM exists since PG12 but heap is still the only practical choice.

Concurrency Model

MVCC, no read locks

Readers never block writers, writers never block readers. Each tuple has xmin/xmax. Old versions remain until vacuum reclaims them. Snapshot isolation by default. Serializable available via SSI (predicate locks).

Durability

WAL + fsync

Write-Ahead Log is the source of truth. On crash, replay from last checkpoint. Replication is WAL shipping (physical) or WAL decoding (logical). synchronous_commit controls the fsync/ack trade-off per transaction.

ACID Semantics in Postgres

Property	How Postgres Implements It	PE Nuance
Atomicity	Per-transaction WAL records. On commit, write COMMIT record and fsync. On abort or crash, roll forward to last commit, discard the rest.	Subtransactions (SAVEPOINT) implemented via virtual subxid. >64 nested subxids per transaction overflows to `pg_subtrans` and degrades performance sharply. Avoid in hot paths.
Consistency	Constraints checked at row level (CHECK, NOT NULL, UNIQUE, FK). Deferrable constraints checked at commit.	FK checks acquire row-level SHARE lock on referenced table. Mass updates on a parent table can lock-contend with child inserts. Cascading FKs at scale are a known anti-pattern.
Isolation	Default: READ COMMITTED. Snapshot taken per statement. Available: REPEATABLE READ (snapshot per transaction), SERIALIZABLE (true serializability via SSI with predicate locks).	SERIALIZABLE has measurable overhead and can produce serialization failures the app must retry. Most production workloads run RC; use RR or SER deliberately for ledger correctness.
Durability	`synchronous_commit = on` fsyncs WAL before ACK. `fsync = on` at checkpoint. `full_page_writes` protects against torn pages.	PG 18 enabled data checksums by default for new clusters. Pre-18 clusters without checksums silently corrupt on torn writes; turn on or accept the risk.

MVCC Mechanics (and the wraparound cliff)

Production landmine. Every write transaction consumes one of 2 billion usable XIDs. If autovacuum cannot freeze tuples fast enough, the cluster shuts down to prevent data loss. Multiple unicorns have taken full outages here. Monitor age(datfrozenxid) per database; alert at 200M, escalate at 1B. Set idle_in_transaction_session_timeout in production; one forgotten transaction can stop all vacuum work.

Concept	Mechanism	Operational Implication
Tuple visibility	Every tuple carries `xmin` (inserter txid) and `xmax` (deleter/updater txid). A snapshot defines which xids are "in the past" and therefore visible.	UPDATE never overwrites in place. It writes a new tuple version, sets xmax on the old one. Dead tuples must be cleaned by VACUUM. Heavy update workloads create bloat fast.
HOT updates	Heap-Only Tuple: if no indexed column changes and the new tuple fits on the same page, no index entries are added. Old tuple chained to new via `ctid`.	Set `fillfactor = 70-80` on heavy-update tables to reserve space for HOT. Wrong indexes (covering a frequently-updated column) kill HOT and inflate index bloat.
Transaction ID wraparound	XIDs are 32-bit. PostgreSQL must freeze tuples (mark as visible to all future transactions) before they age past 2B. `autovacuum_freeze_max_age` = 200M by default forces an anti-wraparound vacuum.	Wraparound is one of two ways a Postgres cluster can become read-only in production (the other is disk full). Long-running transactions block vacuum because their snapshot might still need the old tuples.
Visibility horizon	The oldest live snapshot in the cluster sets the floor. VACUUM cannot remove tuples newer than that horizon, even if they are otherwise dead.	One idle-in-transaction connection from a misbehaving worker can prevent VACUUM from reclaiming anything cluster-wide. Use `pg_stat_activity` to monitor; kill idle txns older than minutes.
Table bloat	VACUUM marks dead tuple space as reusable but does not shrink the heap file. VACUUM FULL rewrites the table but takes ACCESS EXCLUSIVE lock.	For online bloat reduction, use `pg_repack` or `pg_squeeze`. PG 19 is targeting native REPACK CONCURRENTLY. Heavy UPDATE tables can balloon 2-4x without intervention.

WAL & Synchronous Commit Modes

synchronous_commit Value	What Gets Confirmed Before ACK	Durability Guarantee	When To Use
`off`	Nothing waits. WAL written async by walwriter (~200ms later).	Up to `wal_writer_delay` + commit_delay of committed data can be lost on crash.	Bulk import jobs where loss is recoverable from source.
`local`	WAL fsynced to local disk.	Local crash safe. Replicas may not have it.	Single-node OLTP, or when sync to replica is too slow to tolerate.
`remote_write`	WAL received by sync standby's OS (not yet fsynced).	Survives primary OS crash if standby is alive. Loses if both crash before standby fsync.	Cross-region with strong replica, willing to trade tiny window for latency.
`on` (default)	WAL fsynced on primary AND on sync standby.	Survives single-node loss with zero data loss.	Default for HA primary with sync standby.
`remote_apply`	WAL applied (visible to readers) on sync standby.	Strongest. Read-after-write consistency across replicas.	When app reads from replicas and cannot tolerate replica lag. Costs RTT per commit.

01. Index Landscape (B-Tree, GIN, GiST, BRIN, SP-GiST, HNSW)

PostgreSQL ships with the richest core index portfolio of any open-source database. Picking the wrong type is the single most common cause of "why is this query slow." Below is the operational decision matrix.

Index Type	Backing Structure	Best For	Worst For	Size Profile	PE Nuance
B-Tree (default)	Lehman-Yao B+ tree, sorted, leaf-to-leaf links	Equality, range, ORDER BY, unique constraints. PG 17+ supports parallel index build; PG 18 adds skip-scan for multicolumn indexes.	JSONB internal keys, array contents, full-text, geometric overlap.	Roughly 25-30% of indexed column data + overhead	Index-only scans require visibility map up to date (vacuum dependency). Bloat appears even on read-mostly tables if old txids hold versions.
GIN (Generalized Inverted)	Inverted lists: value → list of TIDs containing it	JSONB containment (`@>`), full-text search (`tsvector`), arrays, trigrams (`pg_trgm` for LIKE/ILIKE)	High-cardinality unique values, range queries, sorting.	Large. 2-5x size of B-tree for the same column. Write amplification because each indexed term inserts.	Use `jsonb_path_ops` (smaller, containment-only) vs `jsonb_ops` (larger, supports key-exists ops). GIN updates are not in-place; use `fastupdate` + pending list flush tuning.
GiST (Generalized Search Tree)	Extensible balanced tree with user-defined consistency functions	Geometric data (PostGIS uses it), range types (`tsrange`, `int4range`), exclusion constraints, nearest-neighbor.	Standard equality on scalars (B-tree wins).	Moderate, depends on opclass.	Powers PostGIS spatial indexes and range-based exclusion constraints (`EXCLUDE USING GIST`). One of Postgres's most underused features.
BRIN (Block Range INdex)	Min/max summary per range of heap blocks	Append-only data with natural correlation between physical order and column value (time-series, log tables).	Any non-correlated column; UPDATE-heavy tables.	Tiny. 1000-10000x smaller than B-tree for the same column.	The right answer for a 10TB events table with `created_at` column when rows are inserted in time order. Useless on a randomly-ordered table.
SP-GiST (Space-Partitioned GiST)	Non-balanced partitioned trees (quad-trees, k-d trees, suffix trees)	Phone numbers, IP ranges, text prefix search, non-uniform spatial data.	Range queries on overlap.	Moderate.	Rarely used in practice. Niche but powerful for IP routing tables (`inet` type).
Hash	Linear hash	Equality-only lookups on large keys where B-tree size matters.	Range, sort, multicolumn.	Comparable to B-tree.	WAL-logged and replication-safe since PG10. Still rarely beats B-tree in practice; B-tree wins by ecosystem familiarity.
HNSW (via pgvector)	Hierarchical Navigable Small World graph	Vector similarity search for RAG, semantic search, recommendation.	Exact matches, high-dimensional vectors above 2000 dims (PG limit).	Large. ~450-500 MB for 500K vectors at 1536 dims.	pgvector 0.8+ fixed filtered-search via iterative scans. HNSW supports CREATE INDEX CONCURRENTLY. For larger scale, pgvectorscale adds DiskANN. Live in core extensions but version-pin carefully.

02. Declarative Partitioning & Partition Pruning

PG 10 introduced declarative partitioning to replace the older trigger-and-inheritance pattern. Three strategies are supported; pruning happens at planning time (always) and execution time (when bind parameters are involved).

Strategy	Mechanism	Sweet Spot	Operational Pain	PE Nuance
RANGE	`PARTITION BY RANGE (created_at)`. Each partition holds a contiguous interval.	Time-series data, where retention policy = drop oldest partition with `DETACH` + `DROP`.	Must pre-create future partitions or use `pg_partman`. Inserts to a non-existent partition error out.	The most common pattern. `DETACH CONCURRENTLY` (PG14+) removes a partition without blocking the parent.
LIST	`PARTITION BY LIST (region)`. Each partition holds a set of values.	Multi-tenant tables with bounded tenant count, regulatory data residency.	New tenants require new partitions. Doesn't help write throughput across the same partition.	For multi-tenant SaaS with thousands of tenants, schema-based isolation via Citus or per-tenant tables often beats LIST partitioning.
HASH	`PARTITION BY HASH (user_id) PARTITIONS 16`. Distributes rows by hash modulo.	Spread write hotspots across files, parallel maintenance vacuum.	No pruning for range queries. Pruning only works on equality on the partition key.	Often misused as a poor man's sharding. It is not sharding: all partitions live on the same node. Use for I/O parallelism, not horizontal scale.

Partition Pruning Behavior

Stage	When It Runs	What It Eliminates	Gotcha
Constraint exclusion (legacy)	Planning time, for inheritance-based partitioning only. Default off in modern PG.	Partitions whose CHECK constraint contradicts WHERE clause.	Slow with hundreds of partitions; declarative partitioning replaces it.
Plan-time partition pruning	Always, for declarative partitioning when WHERE has constants.	Partitions whose bounds cannot match the predicate.	Type mismatch kills it. `WHERE created_at = '2025-01-01'` may not prune if column is `timestamptz` and literal is interpreted as `date`. Always cast explicitly.
Run-time partition pruning	Execution time, for prepared statements with `$1` parameters or nested-loop joins.	Partitions excluded only when the bind value is known.	EXPLAIN without ANALYZE will show all partitions; EXPLAIN ANALYZE shows the pruned set.
Partition-wise joins (PG11+)	Planning time, when joining two tables partitioned the same way.	Cross-partition join work; each pair of partitions joined independently and in parallel.	Requires `enable_partitionwise_join = on` (off by default). Cost-bounded; planner may decline at scale.
Partition-wise aggregates (PG11+)	Planning time, when GROUP BY aligns with partition key.	Cross-partition aggregation; each partition aggregated independently.	Requires `enable_partitionwise_aggregate = on`. Big win for partitioned dashboards.

Hard limit you must know. Postgres has no global indexes on partitioned tables. Every UNIQUE constraint on a partitioned table must include the partition key. PRIMARY KEY on (user_id) when partitioning by created_at is impossible in core PG. This forces UUIDs or composite keys in time-partitioned data.

03. Trade-Offs

Each row below is something where you give up X to get Y. Sortable by clicking column headers.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
MVCC for concurrency	Readers never block writers. Snapshot isolation by default. Time-travel debugging via `xmin`/`xmax`.	Vacuum is mandatory. UPDATE creates new tuple versions. Heavy UPDATE workloads bloat 2-4x without tuning.	High-churn tables under update storms. Long transactions or replication slots stalling vacuum. XID wraparound at 2B txns.	The autovacuum_vacuum_scale_factor of 0.2 default is too lax for a 1B-row table (vacuums when 200M dead tuples accumulate). Lower it to 0.02-0.05 per table for hot ones.
ACID + strong constraints	Foreign keys, CHECK, unique, deferrable constraints. Catch bad data at the database boundary.	FK checks acquire SHARE locks on parent rows. Cascading deletes can be O(table) under contention.	Hot-spot writes on a referenced parent table (e.g. `users`) with millions of child FKs.	Disable FK validation on bulk import (`NOT VALID`, then `VALIDATE CONSTRAINT` later). At extreme scale, some teams move FK enforcement to the app layer to eliminate the lock.
Process-per-connection isolation	Fault-isolated backends; one crashed backend doesn't take down the cluster. Strong observability per connection.	5-10 MB RAM per connection. Connection establishment is expensive (fork + IPC setup).	Microservice fleets opening direct connections. 50 services × 20 pooled conns each = 1000 connections before traffic.	PgBouncer in transaction-pool mode is non-negotiable above 200 active connections. Watch out for prepared statements breaking in transaction mode; use `PREPARE` at the pooler-aware level or disable client-side prepared statements.
WAL-based durability	Crash recovery is straightforward (replay from checkpoint). Native PITR. Free replication.	Every write goes to disk twice (WAL + heap). Full-page writes on first dirty after checkpoint amplify WAL volume by 8KB.	Write-heavy workloads on slow disks. Frequent checkpoints (low `checkpoint_timeout`) cause WAL volume blowup.	Tune `max_wal_size` and `checkpoint_completion_target = 0.9` to spread fsyncs. Long checkpoint cycles (15-30 min) reduce full-page write overhead.
Synchronous commit with replica	Zero RPO. Survives single-node loss without data loss.	Commit latency = max(local fsync, network round-trip + remote fsync). Adds 1-50ms depending on replica distance.	Cross-region sync replica. One replica failure stalls all commits until `synchronous_standby_names` is reconfigured.	Use `FIRST 1 (replica1, replica2, replica3)` quorum to allow any one of N replicas to ack. Avoid sync replica across continents; use `remote_write` if you must.
Rich index ecosystem (7+ types)	Right tool for each query shape. JSONB containment, geospatial, full-text, vector all in one DB.	Each index multiplies write cost. Heavy GIN indexes have 2-5x write amplification.	Tables with 5+ GIN/GiST indexes under sustained insert load.	Use partial indexes (`WHERE active = true`) and expression indexes to keep size small. Routinely audit with `pg_stat_user_indexes`; drop unused ones.
JSONB schema flexibility	Schema-on-read without leaving SQL. Containment queries via GIN. Hybrid relational + document.	Any field update rewrites the entire JSONB column. GIN range/sort is limited vs MongoDB multi-key indexes.	Hot-update fields buried inside large JSONB documents (e.g. counters, timestamps).	Promote frequently-updated or frequently-filtered keys to typed columns. JSONB is for the long tail of attributes, not the hot path.
Cost-based query planner	Reasonable plans for joins, subqueries, CTEs. Window functions and recursive CTEs are first-class.	Planner relies on statistics. Bad estimates produce catastrophic plans (Cartesian joins, mismatched index choice).	Wide multi-table joins with correlated predicates. Skewed data distributions. Stale stats after bulk load.	Use `CREATE STATISTICS` for multi-column correlations. Run `ANALYZE` after bulk loads. Force plan with `pg_hint_plan` only as last resort; usually a stats fix beats a hint.
Declarative partitioning	Native range/list/hash. Pruning at plan and run time. Detach/drop for retention.	No global indexes. UNIQUE must include partition key. Planning cost grows with partition count (still O(N) lookup despite optimizations).	Tables with >5000 partitions, or unique constraints that don't fit partition key.	Practical sweet spot is 50-1000 partitions. PG 12+ has `enable_partition_pruning` on by default, but the planning cost still grows. Keep partition count bounded.
Single-leader writes	No write conflicts. Linearizable on primary. Failure modes are well understood.	Writes don't scale beyond the largest instance. No native multi-master. No native sharding.	Write throughput above ~50K TPS sustained, or workloads that exceed single-instance NVMe bandwidth.	Most teams hit the read limit (offload to replicas) before the write limit. When the write wall arrives: Citus (extension), Aurora DSQL (managed distributed SQL), or rewrite the access pattern.
Extension ecosystem	PostGIS, TimescaleDB, pgvector, Citus, pg_stat_statements, pg_partman extend core capabilities massively.	Version pinning across PG major, extension major, and managed-provider support. Major upgrades can lag.	Aurora and RDS lag on new extension versions. Some extensions (pgvector, TimescaleDB) ship months after core PG releases.	Audit your extension surface area before adopting; each extension is a hard dependency on its maintainer staying alive and tracking PG major versions.
Logical replication via pgoutput	Selective replication (per-table, per-row, per-column). Major-version upgrades with pg_createsubscriber. Cross-database CDC.	No DDL replication. No sequence values replicated. Initial sync is single-threaded per table.	Schema migrations during a logical-replication-driven upgrade. Sequences end up reset on subscriber.	Use Debezium or similar with logical decoding for CDC to Kafka. PG18 adds `publish_generated_columns = stored`. Sequences and large objects remain manual sync.

04. Use Cases

Real-world deployments at scale. The Driving Property column is the property that made Postgres the right answer, not a list of features.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
OLTP system of record	Stripe, Instagram (pre-shard era), Apple (migrated from Oracle), GitLab	Strong ACID + relational integrity + mature replication + extension ecosystem	Tens of TB per shard, 10-50K TPS per primary	MySQL is comparable but loses on JSONB, CTEs, partial indexes. Oracle adds cost without correctness gains.
Multi-tenant SaaS	Notion, Heroku Postgres, Supabase, schema-per-tenant on Citus	Row-level security, schema-based isolation, FK guarantees per tenant	10K-100K tenants per cluster, hundreds of millions of rows	DynamoDB lacks joins/FKs needed for tenant data integrity. MongoDB lacks SQL ecosystem for analytics on tenant data.
Geospatial workloads	OpenStreetMap (PostgreSQL + PostGIS), Mapbox tile generation, ride-hail H3 lookups	PostGIS is the open-source spatial gold standard. GiST indexes on geometries.	Billions of geometries, sub-100ms point-in-polygon at scale	Dedicated spatial databases (Oracle Spatial) cost more without functional advantage. SQLite + SpatiaLite is fine for embedded but not for serving.
Time-series & metrics	Timescale-backed apps, log aggregation, IoT pipelines	Native partitioning + BRIN indexes + Timescale's automatic compression and continuous aggregates	Millions of inserts/sec via TimescaleDB hypertables, TBs of compressed data	InfluxDB and VictoriaMetrics are leaner but lose SQL flexibility. Prometheus is for metrics, not arbitrary time-series queries.
Hybrid document + relational	Reddit (PG with JSON column), internal Amazon teams, content platforms	JSONB for the schemaless 90% of attributes + columns for the 10% that need indexing and joins	Single tables with millions of varied-shape documents, joined to typed reference data	MongoDB lacks joins and ACID across collections. DynamoDB has document model but query patterns must be pre-baked.
Vector search / RAG store	Anthropic, Supabase Vector, Neon RAG demos, internal corp search	pgvector + HNSW colocate embeddings with metadata; one DB for filter + similarity	1M-100M vectors, p99 <10ms for HNSW similarity + WHERE filter (iterative scan in pgvector 0.8+)	Pinecone/Weaviate add operational and consistency burden for embeddings that are already related to relational data.
Real-time analytics on operational data	Microsoft (Citus runs Azure Cosmos DB for PG), customer-facing dashboards, SaaS metric backends	Citus distributed planner runs SQL across shards. Read-after-write available. CDC from production DB into the same warehouse.	10s of nodes, hundreds of TB, sub-second p99 on aggregations over billions of rows	Snowflake / BigQuery for warehousing, but they aren't OLTP. Citus serves both, when shaped well.
Financial ledger / event log	Banking platforms, accounting tools, billing systems	Serializable isolation (SSI) + deferrable constraints + exclusion constraints for double-entry invariants	Append-heavy, millions of events/day, audit trail required	NoSQL lacks transactional integrity for ledger invariants. CockroachDB is competitive but newer; Postgres has 25 years of operator experience.

05. Limitations

The honest constraint list. Severity reflects what an L7 reviewer would flag in a design review.

Limitation	Severity	Workaround	Workaround Cost
No native horizontal sharding	Critical	Citus extension, application-side sharding via FDW, or migrate to Aurora DSQL / CockroachDB / YugabyteDB.	Citus changes the SQL surface (distribution key in every query). Application sharding adds routing layer. Distributed SQL replacements are full rewrites of the data layer.
XID wraparound at 2B transactions	Critical	Aggressive autovacuum config + monitoring of `age(datfrozenxid)`. Pre-PG14 needed full freeze rewrites; PG14+ improved with dynamic_shared_memory but still requires vigilance.	Monitoring overhead. Manual VACUUM FREEZE windows. No clean upgrade path to 64-bit xids in upstream PG (some forks like PG Pro have it).
Process-per-connection model	High	PgBouncer or pgcat in transaction-pool mode in front of every cluster.	Transaction mode breaks session-level features (advisory locks, prepared statements without server-side caching, SET LOCAL across statements). Two-network-hop latency overhead.
No global indexes on partitioned tables	High	Include partition key in every unique constraint. Use UUIDs as PK if natural keys cannot include partition column.	Schema redesign. Lose unique business-key enforcement at the database level.
Logical replication restrictions	High	Manage DDL changes manually across publisher/subscriber. Use Debezium or PG18 `publish_generated_columns`. Sync sequence values manually before failover.	Operational complexity. Failover playbooks must include sequence resets and DDL sync.
Replication slot retention can fill disk	High	Set `max_slot_wal_keep_size` (PG13+) to cap retention. Monitor `pg_replication_slots` for stalled subscribers.	If cap hits, the subscriber loses sync and must be re-bootstrapped. Choose between disk-full outage and re-sync.
Table bloat from UPDATE/DELETE	Medium	Tune autovacuum per table. Use pg_repack or pg_squeeze for online rebuild. Set fillfactor for HOT.	pg_repack needs 2x disk space transiently. Long transactions still block bloat recovery.
Major-version upgrades	Medium	pg_upgrade (in-place, requires downtime). pg_createsubscriber + logical replication (low downtime, PG17+). Blue-green via cloud provider.	pg_upgrade was a multi-hour outage on multi-TB clusters until PG18 added statistics preservation. Still requires extension version alignment.
Auto-failover not in core	Medium	Patroni, repmgr, pg_auto_failover, or managed services (RDS, Aurora, Cloud SQL).	External Raft/Consul/etcd dependency. Misconfigured Patroni can produce split-brain on partitions.
Query planner blind spots	Medium	`CREATE STATISTICS` for multi-column correlations. `pg_hint_plan` extension for forced plans. Manual rewrites for join order.	Statistics maintenance is per-table; planner can still mis-estimate. Hints are anti-pattern at scale; they freeze your team to old plans.
OLAP performance vs columnar engines	Medium	Citus columnar storage tables. Or offload aggregations to ClickHouse / DuckDB / Snowflake via FDW or CDC.	Adds a second system. Sync lag and pipeline complexity. Tradeoff is real once scans exceed billions of rows.

06. Fault Tolerance

PostgreSQL's HA story is pluggable, not built-in. Core PG ships replication primitives; orchestration is your problem unless you use a managed service.

Dimension	Behavior	Operational Reality
Replication model	Single primary, N hot standbys via physical streaming. Logical replication is per-table pub/sub on top of WAL decoding.	Cascading replicas supported (standby of a standby). One primary writes; readers can fan out to N replicas with read-your-writes only via `synchronous_commit = remote_apply`.
Failure detection	Not in core. Heartbeats and quorum live in external orchestrator (Patroni + etcd/Consul/ZK, repmgr, AWS RDS automation).	Misconfigured orchestrator is the most common cause of Postgres production outages. Patroni's split-brain prevention relies on TTL-based locks; a paused etcd can promote a stale replica.
Failover mechanism	Promote a standby to primary (`pg_promote()` or trigger file). Old primary must be fenced and re-cloned.	Switchover is online (~5-30s). Failover from a hard crash takes longer because the orchestrator has to confirm the primary is dead before promoting.
RTO (typical)	Patroni cluster: 30-60s. RDS Multi-AZ: 60-120s. Aurora Postgres: ~30s. Cloud SQL HA: 60-90s.	RTO includes DNS update or proxy failover (PgBouncer/HAProxy). DNS TTL is often the bottleneck; use connection-pool-level reconnect, not DNS.
RPO (typical)	Zero with sync replication. Up to seconds (network RTT) with async. Logical: depends on subscriber lag.	Sync replication doubles commit latency. Most teams run async within an AZ for read replicas; one sync replica in another AZ for HA.
Split-brain behavior	Core PG cannot prevent split-brain on its own. Two primaries accepting writes is a corruption event.	Orchestrator must fence (STONITH) old primary. Patroni does this via DCS locks; managed services use control-plane fencing. Production-bug scenarios: orchestrator restart racing with primary recovery.
Blast radius of single-node failure	Primary failure: writes pause for RTO. Replica failure: reduced read capacity. Sync replica failure: writes block unless `synchronous_standby_names` is reconfigured.	A failed sync replica can hang the cluster. Always configure `FIRST 1 (s1, s2)` or `ANY 1 (s1, s2)` quorum.
Cross-region failover	Async streaming to cross-region replica. Manual promotion (in core) or orchestrator-driven.	Cross-region RPO is bounded by replication lag and network RTT. Aurora Global Database makes this turnkey but costs more. DIY across regions requires careful WAL retention.
Data loss scenarios	Torn pages without data checksums. WAL slot drop with subscriber behind. Wraparound shutdown. fsync lying (some cloud storage in past).	PG18 enables data checksums by default for new clusters. Pre-18 clusters running without checksums on EBS/managed disks should turn them on (requires `pg_checksums` tool offline).
Backup / PITR	pg_basebackup or pg_dump for base. WAL archiving via `archive_command` or pgBackRest, WAL-G, Barman.	WAL archiving is the most under-tested system in Postgres operations. Restore drills must be quarterly. Many post-mortems start with "we found out the archive was broken when we needed it."

07. Sharding

PostgreSQL ships with no native sharding. Partitioning is not sharding (everything lives on one node). The sharding story is the extension/ecosystem story.

Dimension	Native Postgres	Citus (extension)	Aurora DSQL / Distributed SQL replacements
Sharding model	None. Partitioning is intra-node only.	Hash on chosen distribution column. Schema-based sharding for multi-tenant.	Transparent. Postgres wire compatible. Storage layer is distributed (DSQL uses Aurora's shared storage; CockroachDB/Yugabyte use Raft-replicated ranges).
Shard key constraints	N/A.	Single column. Distributed tables must be queried with the distribution column for single-shard routing; otherwise, queries fan out to all shards.	None visible to the app (transparent), but the storage layer still partitions by key range or hash. Cross-partition queries cost more.
Rebalancing mechanism	N/A.	Online shard rebalancer. Uses logical replication under the hood to move shards. Trigger with `rebalance_table_shards()`.	Automatic in the storage layer. DSQL handles re-partitioning without operator action. Cockroach uses range splits; Yugabyte uses tablet splits.
Rebalancing cost / impact	N/A.	Online, low impact, but consumes CPU and WAL. Move one shard at a time to avoid network saturation.	Background, mostly invisible to apps. Throughput dip during split, no downtime.
Hot-shard behavior	N/A; one node handles everything.	One worker takes the heat. Common fix: use schema-based sharding for outsized tenants, or pre-split via composite distribution column.	Less of a problem; storage and compute decoupled. Aurora DSQL routes hot ranges to multiple writers.
Maximum shards (practical)	N/A.	32-64 worker nodes is the sweet spot. Beyond that, coordinator becomes the bottleneck; control plane and metadata grow nonlinearly.	DSQL is multi-region multi-active by design; thousands of nodes possible. CockroachDB clusters routinely run 50-500 nodes.
Resharding without downtime?	N/A.	Yes via online rebalancer + logical replication. Application changes only if you redistribute by a new column.	Yes, transparent to the app. No application changes ever.
Cross-shard query support	N/A. FDW can join across remote Postgres instances but pushes minimal predicates.	Full distributed planner: hash joins, repartition joins, broadcast joins. Pushdown is sophisticated but join shapes matter.	Full SQL across the cluster. Cross-partition joins are slower; latency is a function of how many partitions you touch.
Cross-shard transactions	N/A; or 2PC via `PREPARE TRANSACTION` and FDW with extreme care.	Distributed transactions via 2PC. Read committed by default; serializable possible at coordinator level.	Full distributed transactions. DSQL uses bounded-clock-based MVCC. CockroachDB/Yugabyte use Raft + serializable.

The Citus tradeoff in one line. Citus is Postgres with sharding bolted on. You get distributed planning, online rebalancing, and the Postgres operator skill set. You give up some SQL surface (correlated subqueries, certain CTE shapes have historically been restricted, though PG17 + Citus 13 improved this) and you must shape every query around the distribution column.

08. Replication

Postgres has two flavors: physical streaming (byte-for-byte WAL replay, whole cluster) and logical (table-level pub/sub via WAL decoding). They solve different problems.

Dimension	Physical / Streaming	Logical (pgoutput)
Replication topology	Leader → 1..N followers. Cascading supported. All-or-nothing cluster replica.	Publisher → Subscriber. Many-to-one and one-to-many. Per-table, per-row (since PG15), per-column.
Sync vs async	Per-transaction via `synchronous_commit` setting. Can be async, remote_write, on (sync), remote_apply.	Async only. Subscriber applies in WAL order; no sync guarantee back to publisher.
Replication factor (default / max)	Default 1 primary + 1 standby. Practical max ~5 replicas per primary before WAL sender pressure.	Many subscribers can connect to one publisher. Each holds its own replication slot.
Consistency level options	Default eventual on followers. `remote_apply` gives read-after-write across the cluster.	Eventually consistent only. Subscriber lag is bounded only by network and apply throughput.
Replication lag (typical)	<100ms within a region. Cross-region: 50-500ms depending on RTT. Apply lag negligible if standby keeps up.	Higher than physical because logical replication is single-process apply per subscription. PG14+ supports parallel apply for streaming large transactions.
Conflict resolution	N/A (single leader, no conflicts possible).	Built-in but limited. Conflicting writes on subscriber (e.g. unique violation) stop the replication stream. Resolution is manual: edit on subscriber, advance slot. pgEdge/BDR adds full conflict resolution.
Cross-region replication	Supported via streaming over WAN. Use async for cross-region; sync hurts commit latency.	Supported. Often used as the cross-region mechanism for selective data movement (e.g. ship orders_to_warehouse).
Replication during network partition	Replicas fall behind silently. Primary keeps writing unless `synchronous_standby_names` requires this replica to ack; then writes hang.	Subscriber falls behind. Publisher's replication slot retains WAL; risk: disk fills up. Cap with `max_slot_wal_keep_size`.
What it cannot replicate	Nothing (it's full WAL).	DDL is not replicated. Sequence values are not replicated to subscriber. Large objects via lo_* APIs are not replicated. TRUNCATE is replicated since PG11.
Use case fit	HA, read scaling, full-cluster disaster recovery, PITR base.	Major-version upgrade (pg_createsubscriber in PG17), CDC to other databases, selective replication across schemas, multi-master via pgEdge.

09. Better Usage Patterns

Patterns most engineers miss, anti-patterns that show up in code review, optimizations that compound at scale.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Set `idle_in_transaction_session_timeout`	Leave it at 0 (no timeout). A forgotten BEGIN holds the vacuum horizon open indefinitely.	Set to 60s-300s in production. Apps that need long transactions can override per session.	Single most common cause of "autovacuum isn't working." One idle-in-tx connection stops bloat reclamation cluster-wide.
Use partial indexes	Index the full column even when most queries filter on a subset.	`CREATE INDEX ON orders(created_at) WHERE status = 'pending';` Order of magnitude smaller and faster for the hot query.	Smaller index fits in buffer cache; index-only scans more likely. Common 5-10x improvement.
BRIN for append-only time-series	Default to B-tree on `created_at`, eat the 30%+ table-size index.	BRIN: `CREATE INDEX ON events USING brin(created_at);` Tiny (KB instead of GB), perfect for range scans on correlated data.	10TB table with B-tree on timestamp = 2-3TB index. Same table with BRIN = a few hundred MB. Queries comparable for range predicates.
PgBouncer in transaction mode	App opens 50-200 connections per service replica directly. `max_connections` raised to 1000 to compensate.	PgBouncer in transaction-pool mode in front of every cluster. App pool size 10-20 per replica; pooler manages the backend.	Postgres backend processes cost 5-10MB each. 1000 backends = 5-10GB just for connection overhead, before any work. Pooler removes this almost entirely.
Don't put hot updates in JSONB	Store all counters and timestamps inside a JSONB document because schema flexibility.	Promote frequently-updated keys (counters, last_modified, active flag) to typed columns. Keep the long tail in JSONB.	JSONB UPDATE rewrites the entire column. A 10KB JSONB document mutated 10K times/day = 100MB/day of WAL.
Tune fillfactor for HOT updates	Leave fillfactor at default 100. Every UPDATE on indexed columns causes new index entries.	Set fillfactor 70-80 on heavy-UPDATE tables. Reserves space for new tuple versions on the same page, enabling HOT (heap-only-tuple) updates.	HOT updates skip index updates entirely. 30-50% reduction in WAL and bloat for update-heavy tables.
Use COPY for bulk loads	Loop over INSERT in the app. 50K rows takes 30 minutes.	`COPY tablename FROM STDIN` via the protocol. Or batched multi-row INSERTs (200-1000 per statement). 50K rows takes seconds.	10-100x faster. Bypasses per-statement parser/planner overhead. Used by every Postgres bulk-load library.
CREATE INDEX CONCURRENTLY in production	Plain `CREATE INDEX` on a live table takes ACCESS EXCLUSIVE lock; blocks reads and writes for minutes to hours.	Always `CREATE INDEX CONCURRENTLY` in production. Two-pass build; takes longer but doesn't block writers.	One missed CONCURRENTLY on a large table = production incident. Failures leave invalid indexes that must be cleaned up.
Monitor replication slot lag	Create a logical replication slot, lose track of it. Subscriber dies. WAL grows until disk is full.	Always alert on `pg_replication_slots.confirmed_flush_lsn` lag > threshold. Set `max_slot_wal_keep_size` as a backstop.	Orphaned slot is one of the top three causes of "Postgres disk full at 3 AM." Cap with the GUC; subscriber can be re-bootstrapped.
Use UUIDv7 over UUIDv4 for PKs	UUIDv4 random PKs. B-tree leaf pages fragment. 30-50% more page splits, slower bulk insert.	PG18 adds `uuidv7()` natively. Timestamp-ordered, still globally unique, friendly to B-tree clustering.	Significant index size and insert throughput improvement. Existing UUIDv4 PKs can stay; new tables should default to UUIDv7.
Use logical replication row filters	Replicate entire tables when only a subset is needed downstream.	`CREATE PUBLICATION pub_us FOR TABLE orders WHERE (region = 'us');` Reduces subscriber WAL volume and storage.	5-10x reduction in subscriber data volume in geo-sharded apps. Avoids cross-region bandwidth charges.
Declarative partitioning over inheritance	Use old PG9.x style table inheritance + triggers because "we already had it set up that way."	PG10+ declarative partitioning: `PARTITION BY RANGE / LIST / HASH`. Pruning is faster and built into the planner.	Inheritance-based partitioning had per-statement trigger overhead and limited pruning. Declarative is faster and the only path to partition-wise joins.
Avoid `SELECT *` in materialized views	`CREATE MATERIALIZED VIEW v AS SELECT * FROM huge_table`. Refresh takes minutes.	Project only needed columns. Use REFRESH MATERIALIZED VIEW CONCURRENTLY (requires unique index).	Refresh time scales with row size. Concurrent refresh holds different locks and doesn't block readers.

10. Advanced / Next-Gen Alternatives

Successors, adjacent technologies that do certain things better, and architectural patterns that obviate the original Postgres limitation.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Citus (PG extension)	Horizontal sharding via hash distribution. Distributed query planner. Schema-based sharding for multi-tenant. Stays SQL-compatible.	Production	Medium. Every query must consider the distribution column; some SQL shapes restricted.	You've hit single-node write limits but want to keep Postgres tooling, drivers, and SQL.
Aurora DSQL (AWS)	Transparent distributed Postgres with active-active multi-region writes. Bounded-clock-based MVCC. Fully serverless. Postgres wire compatible.	GA, growing feature set	Medium-high. Feature gaps vs core PG (views, triggers, foreign keys still in flight as of 2025-2026).	Greenfield apps that need multi-region writes without conflict resolution. Existing apps with mature dependence on PG features should wait.
CockroachDB	Distributed SQL with Postgres wire protocol. Serializable by default. Raft-replicated ranges, automatic resharding.	Production	Medium-high. PG dialect is mostly compatible but not identical; extensions and procedural code don't transfer.	Multi-region OLTP at scale where strong consistency is non-negotiable.
YugabyteDB	Postgres-compatible distributed SQL with reused PG query layer on top of distributed storage. Higher Postgres fidelity than Cockroach.	Production	Medium. Same shape as Cockroach migration but more PG features available.	Need horizontal scale + maximum PG compatibility (extensions, syntax).
Neon / Aurora Serverless v2	Storage / compute separation. Branching (Neon). Pay-per-use compute. Faster cold-start cluster scale.	Production (Aurora), GA (Neon)	Low. Existing PG applications run unchanged.	Variable workloads, dev/test environments needing branching, multi-tenant SaaS where individual tenants are bursty.
TimescaleDB	Time-series optimizations: automatic compression, continuous aggregates, hypertables. Native columnar storage option.	Production	Low. Tables become hypertables; queries unchanged.	Time-series, IoT, observability workloads where retention and compression dominate cost.
pgvector + pgvectorscale	Vector similarity search inside Postgres. HNSW for ANN. pgvectorscale adds DiskANN, statistical binary quantization.	Production (pgvector 0.8+)	Low. Add column + index. Use existing PG ops.	RAG, semantic search, recommendation. Always evaluate before adding a dedicated vector DB.
ClickHouse / DuckDB FDW or CDC	Columnar analytics. 10-100x faster on large-scan aggregations than row-store Postgres.	Production	Medium. Adds a second system. CDC pipeline (Debezium or PG logical replication) keeps it warm.	Analytical queries scan billions of rows. Postgres for OLTP, ClickHouse for OLAP. Materialized view is no longer enough.
pgEdge / BDR (Bi-Directional Replication)	Multi-master Postgres. Conflict resolution. Each node accepts writes.	Production (pgEdge); BDR is commercial	High. Schema design must avoid conflicting unique constraints; sequences need careful planning.	Multi-region active-active writes without leaving Postgres. Niche; most workloads don't need this.
Native REPACK CONCURRENTLY (PG19 target)	Native online table repack without pg_repack/pg_squeeze external dependency.	Patch preview as of PGConf.dev 2026	None. It will land in PG19.	Plan for it. Reduces operational tooling, simplifies bloat management.
OrioleDB (alternative table AM)	Pluggable storage engine with undo log instead of MVCC tuple chains. Smaller bloat footprint, no vacuum dependency. Postgres-compatible.	Early	High. Replaces table storage engine; not yet a drop-in.	Watch closely. Long-term answer to MVCC bloat if/when it lands in core or matures as an extension.

11. In Production: Company Deep Dives

Five real-world deployments grounded in published engineering blogs, conference talks, and primary source material. Each one earned its place at scale by leaning hard on Postgres's strengths and engineering around the limits.

Company	What's on Postgres	Scale Signal	Architecture Shape	Defining Move
OpenAI	ChatGPT user metadata, conversations, API platform data	800M users, >10x load growth in 12 months, millions of QPS, ~50 read replicas	Single-primary, unsharded, multi-region replicas on Azure PG Flexible Server	Refused to shard. Hardened the single primary instead.
Instagram	User graph, photos, comments, likes, friendships	27M users at sharding cutover (2012), now powering Meta-scale Instagram	Schema-based logical sharding inside Postgres; thousands of logical shards on fewer physical hosts	Decoupled logical shard from physical host via a lookup map; sharding became a config change.
Reddit	ThingDB (Links, Comments, Accounts, Subreddits as KV); media metadata store	850M monthly users (2023); media metadata at 100K reads/sec, p50 2.6ms, p99 17ms	Old: ThingDB schemaless KV on PG. New (2024): Aurora PG, time-partitioned via pg_partman, JSONB payload	Used Postgres as a key-value store on purpose. Chose AWS Aurora PG over Cassandra for debuggability.
Robinhood	Crypto trading ledger, transaction auditing, user data, core OLTP	Hypergrowth 2021 crypto volume; primary CPU bottlenecked, replica lag at risk	Django + Aurora Postgres with cohort-based functional sharding; data lakehouse for analytics	Sharded by customer cohort (new vs. existing users); 1000x query speedups via EXPLAIN-driven index work.
Apple	Internal tooling, build and release metadata, transactional catalogs (App Store ecosystem)	Not publicly disclosed in engineering blogs	Less publicly documented than the others; FoundationDB + Cassandra power iCloud/CloudKit, Postgres serves many internal services	Contributor presence in core PG community (committers and core team members). Pattern: long-running migration from Oracle to PG across internal services.

OpenAI — Refusing to Shard at 800M Users

Source: Bohan Zhang's Scaling PostgreSQL to power 800 million ChatGPT users (OpenAI Engineering, Jan 22 2026), and his talks at PGConf.Dev 2025 / POSETTE 2025.

Topology

1 primary + ~50 replicas

Azure PG Flexible Server. Replicas span multiple regions. No sharding within any production cluster. Multiple separate clusters for workload isolation.

Workload Shape

Read-heavy, ms p99

"Low double-digit millisecond p99 client-side latency, five-nines availability." One SEV-0 in 12 months (ImageGen launch, write traffic 10x in a week).

Strategic Bet

Don't shard PG

New write-heavy workloads go to Azure Cosmos DB by default. PG kept as the read-heavy system of record. New tables are not allowed in PG; they must land in the sharded system.

The architectural philosophy. OpenAI's argument is that the decision to shard should come from workload patterns, not user counts. Their workload is overwhelmingly read-heavy; sharding existing apps would touch hundreds of endpoints and take months to years. The single-primary architecture has runway, and they're protecting it with operational discipline.

Optimization	The Problem	What They Did	Impact
Connection pooling with PgBouncer	Azure cap of 5,000 connections per instance. Connection storms had exhausted all connections in past incidents.	PgBouncer pods in Kubernetes deployments, transaction/statement pooling mode. Co-located proxy, clients, and replicas in the same region.	Avg connection time dropped from 50ms to 5ms (10x). Postgres backend count stays bounded regardless of client connection count.
Cache locking (leasing)	Cache-miss storms after Redis or app-cache failure pushed enormous read load directly to PG, saturating CPU.	Cache leasing: only one reader on a missed key fetches from PG and repopulates the cache. All other concurrent misses wait.	Eliminated thundering-herd reads on cache misses. Prevented cascading degradation cycles that had caused past SEVs.
Workload isolation across clusters	A new feature with an inefficient query would consume CPU on the shared primary and degrade unrelated critical traffic.	High-priority vs low-priority workloads routed to separate clusters. Different products on separate clusters. Replica-level isolation within clusters too.	Noisy-neighbor problem contained at cluster boundary. One bad query no longer impacts unrelated products.
Multi-layer rate limiting	Retry storms and expensive query spikes (e.g. ORM-generated 12-way joins) had triggered cascading failures.	Rate limits at app, connection pooler, proxy, and query layers. ORM extended to support per-query-digest blocking for targeted load shedding.	Specific bad query digests can be blocked in minutes during incidents. Recovery from query-driven SEVs is targeted, not blunt.
Schema change discipline	Schema changes that trigger a full table rewrite (e.g. column type changes) take minutes to hours and block writes.	Hard 5-second timeout on all schema changes. Only lightweight ops permitted (add/drop columns that don't rewrite). CONCURRENTLY required for index ops. New tables forbidden on PG (they go to Cosmos DB).	Schema changes are predictable and safe. Backfills are rate-limited and can take weeks but never cause production impact.
Cascading replication	WAL streaming to ~50 replicas saturates primary network bandwidth and CPU. Replica lag becomes unstable as count grows.	Intermediate replicas (replicas of replicas) relay WAL downstream. Collaboration with Azure PG team to make failover safe under this topology.	Path to scale beyond ~100 replicas without overloading the primary. Still in testing as of the published blog.
Killing the 12-table join	A specific ORM-generated query joining 12 tables would spike CPU during traffic bursts and cause SEVs.	Audited ORM-generated SQL. Broke the join apart; moved the complex join logic to the application layer.	Eliminated a recurring SEV source. Reinforced the rule: review every query the ORM emits in hot paths.
idle_in_transaction_session_timeout	Long-running idle transactions held the MVCC horizon open, blocking autovacuum from reclaiming dead tuples cluster-wide.	Set the GUC explicitly in production. Connection pooler tuned with matching idle timeouts.	Autovacuum can keep up with bloat. The "one forgotten worker" failure mode is closed.
Migrating writes off PG	MVCC and write amplification make PG inefficient for write-heavy hot paths.	Shardable write-heavy workloads moved to Azure Cosmos DB. Application bugs producing redundant writes are fixed. Lazy writes smooth spikes.	Primary write capacity protected. Cosmos DB absorbs the workloads that don't fit PG's MVCC model.

Instagram — Schema-Based Logical Sharding

Source: Instagram Engineering blog Sharding & IDs at Instagram; Mike Krieger's SFPUG 2012 and AirBnB Tech Talk presentations.

Trigger Point

photosdb > 60 GB

Django + PG started on a single LA machine "less powerful than a MacBook Pro." When the photos table exceeded 60GB, sharding was no longer optional.

Sharding Model

Thousands of logical shards

Logical shards = Postgres schemas. Mapped to fewer physical machines via a config table. Moving capacity = moving schemas; no app changes.

ID Generation

Snowflake-in-Postgres

64-bit IDs generated by a PL/pgSQL function inside each shard: 41-bit timestamp + 13-bit logical shard id + 10-bit per-shard sequence (1024 IDs/ms/shard).

The design insight. Instagram evaluated NoSQL alternatives and rejected them; their pain point was capacity expansion, not data-model flexibility. By making the logical shard a Postgres schema (a namespace inside a database), they got pg_dump/restore as their shard movement tool, kept all SQL ergonomics, and made rebalancing a config-file edit.

Optimization	The Problem	What They Did	Impact
Logical-shard / physical-host decoupling	Range-based sharding (A-H on shard 0, I-Z on shard 1) is brittle. When a shard fills, you have to split mid-key, which means redistributing rows.	Pre-create thousands of logical shards (PG schemas) up front. Map them many-to-one to physical hosts. Capacity expansion = move some logical shards to new hosts and update the map.	Shard rebalancing has no application code path. Done via pg_dump + pg_restore + config update. Zero disk-IO spike from re-bucketing.
Modular shard key on user_id	Need to find any user's data quickly without scatter-gather across shards.	`user_id % NUM_LOGICAL_SHARDS` → logical shard id. Look up logical → physical via static map. All related data (photos, comments, likes) shard by the same user_id.	Single-shard lookups for all per-user reads. No cross-shard joins for the primary access pattern.
In-database ID generator	SERIAL primary keys collide across shards. UUIDs are unsortable and hurt B-tree clustering. External ID services (Twitter Snowflake) add operational burden.	PL/pgSQL function in each shard generates 64-bit IDs: timestamp (41 bits, sortable) + logical shard id (13 bits) + auto-increment sequence (10 bits). 1024 IDs per ms per logical shard.	Globally unique, sortable, no external ID service. The shard id is embedded in the ID itself, so you can route a lookup by ID alone.
Pre-split many logical shards	Once you bake the modular hash into application code, changing the divisor is painful (you'd have to re-bucket every row).	Choose a large number of logical shards (thousands) from day one, even if all of them fit on one host initially. Never change the divisor; only change the physical map.	Decades of expansion runway without code changes. The constraint is now physical hosts, not the sharding scheme.
Same-host colocation early on	Adding many physical machines too early wastes hardware.	Start with all logical shards on one physical host. Move shards to new hosts only when load demands it.	Incremental scale-out. Pay only for what's needed. Two engineers managed the original infra.

Reddit — ThingDB and the Aurora Media Metadata Rebuild

Sources: Reddit Engineering blog (Jianyi Yi, 2024) and the historical reddit-archive wiki on the original ThingDB architecture.

ThingDB (legacy)

PG as a KV store

For Links, Comments, Accounts, Subreddits. Two tables per "thing type": a thing table (IDs, indexed flags) and a data table (free-form key/value rows). Schemaless on purpose.

Media Metadata (2024)

100K reads/sec

AWS Aurora Postgres. p50 2.6ms, p90 4.7ms, p99 17ms. Range-partitioned by post_id via pg_partman. Metadata payload stored as JSONB.

Chose PG over Cassandra

Debugging ease

"Ad-hoc queries and debugging are simpler in Postgres than in Cassandra." Avoided Cassandra denormalization pain when new query patterns appeared.

The choice that surprised people. Reddit picked Aurora Postgres over Cassandra in 2024 for a 100K-reads/sec metadata store, after years of Cassandra in the stack. The cited reasons: easier debugging on incident calls, no rigid denormalization, and SQL flexibility for ad-hoc analytics. The 50TB-by-2030 projection was deemed manageable with partitioning.

Optimization	The Problem	What They Did	Impact
ThingDB: PG as KV store	Reddit's domain objects (subreddits, comments, accounts) have fluid attribute sets. Adding a column to a 100M-row table is expensive.	Two tables per thing type: a thing table with indexed flags and an opaque data table with one row per (thing_id, key, value) tuple. Schemaless via convention.	Schema flexibility for the core data model without leaving PG. Read patterns served by careful indexing on the thing table.
Range partitioning by post_id	Media metadata grows monotonically. Most reads target recent posts. A flat table would have an ever-growing hot index.	pg_partman creates monthly partitions on post_id (post_id increases roughly with time). Most queries hit one or two recent partitions.	Hot working set fits in buffer cache. Index on the recent partition is small enough to stay resident. Older partitions can be archived cheaply.
pg_cron for partition management	Forgetting to create future partitions means inserts to a non-existent partition fail.	pg_cron schedules a job that calls pg_partman to create future partitions ahead of time, when the spare-partition count drops below a threshold.	Zero-touch partition lifecycle. No 2AM pages from "future partition missing."
JSONB for metadata payload	Media metadata fields vary by media type (image vs. video vs. embed). Modeling each as columns explodes the schema and adds NULL bloat.	Metadata fields stored in a single JSONB column. Effectively converts the table into a key-value store with PG semantics.	No joins. Schema-on-read for new media types without DDL. Reads still benefit from PG indexing on the partition key and any promoted JSONB paths.
PgBouncer in front of Aurora	Service replicas opening direct connections to Aurora would exhaust connection slots and spike CPU per backend process.	PgBouncer between the API service and Aurora. Transaction-pool mode.	Application can run with hundreds of effective client connections while keeping backend process count bounded.
Dual-write / dual-read migration	Migrating 100K-RPS read traffic from old fragmented stores to a unified new store cannot involve downtime.	Phase 1: dual write to old + new. Phase 2: backfill old → new. Phase 3: dual read with comparison and gap repair. Phase 4: ramp read traffic. Phase 5: deprecate old.	Live migration of a high-traffic critical path with no user-visible impact. Race condition (backfill overwriting newer writes) caught and fixed during phase 3.

Robinhood — Crypto Shard and the Aurora Stack

Sources: Scaling Robinhood Crypto Systems (Robinhood Newsroom, Mahipal/Chai/Zhang); Data Lake at Robinhood; load testing engineering posts.

Trigger

2021 crypto volume surge

Crypto trading volume exploded after the 2018 Robinhood Crypto launch. Primary database CPU and replica lag at risk of degrading the trading experience.

Stack

Django + Aurora Postgres

OLTP on PG via AWS RDS / Aurora. Apache Hudi + S3 for the data lakehouse. Kafka for event streams. Elasticsearch and InfluxDB for adjacent workloads.

Sharding Approach

Cohort-based functional shard

Spun up an entirely new shard, isolated from the existing one, that received newly-registered customers. Existing customers stayed on the original shard.

The pragmatic call. Robinhood's crypto team chose cohort-based functional sharding (new users on a new shard) rather than automated tooling like Citus. The trade-off was speed of execution under hypergrowth pressure: a new shard is decisively isolated, with a known blast radius, and required no application-wide refactoring. This is the right call when you're sprinting through a traffic crisis.

Optimization	The Problem	What They Did	Impact
Cohort-based functional sharding	Crypto trading volume risked saturating primary DB CPU and replica lag SLOs.	Spun up a fully isolated new shard. New customer registrations were routed to it. Existing customers stayed where they were.	Immediate write-load relief on the original shard. Limited blast radius (a problem on shard A doesn't affect shard B).
EXPLAIN ANALYZE-driven index work	Several expensive queries dominated CPU. Some had been running since legacy code paths and were never re-validated.	Per-query EXPLAIN ANALYZE audits. Identified missing indexes. Added covering indexes for the hot query shapes.	Some queries saw 1000x speedups. CPU pressure measurably reduced; system headroom grew.
Eliminate expensive JOINs	Some critical query plans involved multi-table joins driven by normalized schemas.	Identified the most expensive joins via plan analysis. Redesigned data access patterns to avoid them or denormalized where the cost was justified.	Hot-path query latency dropped. CPU per query reduced; throughput improved.
Mobile-platform parity	Android, iOS, and Web each had subtly different query paths to the same data, so DB optimizations had to be implemented three times.	Unified the front-end platforms to share the same backend access patterns. One optimization, three platforms.	Faster iteration on DB-driven improvements. No more "iOS got the optimization, Android didn't."
Load and Fault team	Engineering teams launched features without confident understanding of how they'd behave at target QPS, leading to surprise outages.	Dedicated team grew from 3 to 8 engineers. Runs recurring load tests with explicit QPS targets per service. Validates latency and success-rate SLOs before traffic ramp.	"Scaling confidently": team launches new products with quantified scaling envelopes, not vibes.
Multi-system data layer separation	One database trying to serve OLTP, time-series, search, and analytics is a recipe for noisy neighbors.	Postgres / Aurora for OLTP. Elasticsearch for search. InfluxDB for time-series. Hudi + S3 for the lakehouse. CDC pipelines keep them in sync.	Each system runs in its competence zone. Postgres serves the trading-ledger workload it's best at without being asked to do analytics.

Apple — Internal Services and the Quiet PG Footprint

Apple is famously quiet about internal infrastructure. The publicly documented Apple data infrastructure story emphasizes FoundationDB (acquired 2015, open-sourced 2018) and Cassandra for iCloud and CloudKit. Apple's PostgreSQL footprint is real and large, but it is much less publicly documented than the other four companies in this section. What follows separates what is on the public record from what is reasonable inference.

Public Record

Strong PG community presence

Apple employs PostgreSQL committers and core team members. Apple is one of the larger PG-using enterprises, with widespread internal use of PG for transactional metadata catalogs, internal tooling, and operational data.

iCloud / CloudKit

FoundationDB + Cassandra

Public papers and the open-sourced FoundationDB Record Layer confirm that iCloud's user-data sync, Photos, iCloud Drive, and CloudKit run on FoundationDB and Cassandra, not PG. Treat that as known.

Inferred (with caveats)

App Store, internal tooling

App Store transactional catalogs and internal developer/build tooling are widely understood in industry to run on Postgres clusters, but Apple has not published architecture blogs detailing it the way Instagram, Reddit, Robinhood, and OpenAI have.

Honest limitation. The depth of detail available for Apple's PostgreSQL deployments is materially less than the other four companies. Anything claiming specific tuning parameters, replica counts, or partitioning schemes for Apple's internal PG is speculation. Below are patterns publicly attributable to Apple's PG community work, not internal architecture details.

Pattern	Evidence Type	What's Known / Reasonable Inference	Caveat
HA Postgres at scale	Community presence; common enterprise PG patterns	Apple operates large HA PG clusters with managed orchestration (Patroni-class or in-house equivalent), streaming replicas, sync commit for HA, async for read scale.	Specific topology, replica counts, and orchestration software not publicly disclosed.
Oracle-to-PostgreSQL migration over time	Industry pattern; Apple's broader history of Oracle reduction	Apple has migrated significant workloads off Oracle to PostgreSQL across multiple internal product lines over the last decade. This is consistent with the broader enterprise PG adoption pattern.	No specific Apple engineering blog confirms the scope or timeline.
PostgreSQL community contributions	Public; verifiable via the PostgreSQL git history and PGCon talks	Apple-employed engineers contribute to PG core. Areas have included parallel query, partitioning improvements, query planner work, and ongoing performance contributions.	Individual contributor attribution is public; Apple's strategic emphasis is not formally published.
Internal tooling and metadata catalogs	Industry knowledge; not from Apple engineering blogs	Apple is widely understood to use PG for many internal services where ACID, joins, FKs, and PG's extension ecosystem matter (operational metadata, build/release tooling, transactional catalogs around the App Store ecosystem).	This is consensus inference, not first-party documentation. Treat with appropriate skepticism in any design review.
FoundationDB for user-data scale	Public engineering papers and the open-sourced Record Layer	For workloads that need horizontal scale, multi-tenancy at the billions-of-databases level, and strict serializable transactions, Apple uses FoundationDB. CloudKit is the prime example.	This is the documented path for iCloud-class workloads, not the PG path.

Cross-Cutting Patterns from the Five

Pattern	Who Did It	Why It Matters
PgBouncer in production is non-negotiable	OpenAI, Reddit (implicit in Aurora setup), Robinhood, Instagram (later)	Process-per-connection puts a hard ceiling on backend count. Every team that runs PG at scale puts a transaction-mode pooler in front of it.
Functional or schema-based sharding before generic sharding tools	Instagram (schema-based), Robinhood (cohort-based), OpenAI (cluster-level isolation)	Citus is powerful but adds operational and SQL complexity. The simpler answer (separate clusters, separate schemas, separate cohorts) often wins for years.
Range partitioning on time-correlated keys	Reddit (post_id), and effectively every time-series PG user	When the access pattern is biased to recent data, time-correlated range partitioning keeps the working set small enough to live in buffer cache.
JSONB as the long-tail attribute container	Reddit (media metadata payload), implicitly common across all five	Promote hot fields to typed columns; keep the long tail in JSONB. The hybrid approach beats both pure relational and pure document modeling for evolving schemas.
Aurora Postgres for managed scale	Reddit (media metadata), Robinhood	Storage-compute separation removes the most operationally painful aspects of self-managed PG. Trade-off: extension version lag and Aurora-specific feature gaps.
Refuse to shard until forced	OpenAI (explicitly), Instagram (delayed it as long as feasible)	Sharding has a long tail of complexity costs. The first answer should always be: optimize the single-primary harder, push reads to replicas, and migrate writable workloads to systems that fit better.
Isolate workloads at the cluster boundary	OpenAI (most explicit), Robinhood (cohort shards)	One bad query should not be able to degrade unrelated traffic. The cheapest way to enforce that is at the cluster boundary, not via in-cluster fairness controls.

Best default choices

Search this guide

00. Architecture & Core Concepts

Architecture Model

Storage Engine

Concurrency Model

Durability

ACID Semantics in Postgres

MVCC Mechanics (and the wraparound cliff)

WAL & Synchronous Commit Modes

01. Index Landscape (B-Tree, GIN, GiST, BRIN, SP-GiST, HNSW)

02. Declarative Partitioning & Partition Pruning

Partition Pruning Behavior

03. Trade-Offs

04. Use Cases

05. Limitations

06. Fault Tolerance

07. Sharding

08. Replication

09. Better Usage Patterns

10. Advanced / Next-Gen Alternatives

11. In Production: Company Deep Dives

OpenAI — Refusing to Shard at 800M Users

Topology

Workload Shape

Strategic Bet

Instagram — Schema-Based Logical Sharding

Trigger Point

Sharding Model

ID Generation

Reddit — ThingDB and the Aurora Media Metadata Rebuild

ThingDB (legacy)

Media Metadata (2024)

Chose PG over Cassandra

Robinhood — Crypto Shard and the Aurora Stack

Trigger

Stack

Sharding Approach

Apple — Internal Services and the Quiet PG Footprint

Public Record

iCloud / CloudKit

Inferred (with caveats)

Cross-Cutting Patterns from the Five