Kafka vs Kinesis vs Pulsar

Three distributed logs, three storage philosophies, three operational regimes. The cells below explain when each one stops being the right answer.

Streaming · Head-to-Head

As of 2026-06-01 · Kafka 4.3.0 · Kinesis OD-Advantage · Pulsar 4.0 + Oxia

PE Verdict

Storage coupling is the master variable. Kafka couples (highest throughput, painful rebalance), Pulsar separates (elastic, three-tier ops cost), Kinesis hides (predictable bill, no tuning surface).

The decision rarely turns on throughput. It turns on topic count, headcount, and how many AZs you're willing to pay AWS to cross.

Best default choices

Kafka: Default for portable event logs, CDC, replay, Connect/Flink ecosystems, and sustained high throughput.
Kinesis: Best AWS-native default when low operations, IAM, Lambda, Firehose, and managed scaling matter most.
Pulsar: Use for multi-tenancy, many topics, native geo-replication, tiered storage, and queue-plus-stream semantics.
Managed first: Prefer MSK, Kinesis, Confluent Cloud, or StreamNative before staffing a self-managed streaming platform.

01 · Trade-Offs

One row per distinct trade-off. A trade-off is something where you give up X to get Y. "Fast" is not a trade-off.

Kafka 4.3 · Trade-Offs

Trade-Off	What You Gain	What You Give Up	When It Bites	PE Nuance
Coupled compute + storage on broker	Highest throughput-per-dollar at sustained load; zero-copy sendfile path; locality wins for hot reads.	Adding capacity requires moving partition data; rebalancing is a planned ops event.	A failed broker on a 50TB cluster takes hours to fully re-replicate. Cruise Control automates it, but the bytes still move.	Tiered storage (KIP-405) softens but does not remove this. Hot tier still on broker SSD; rebalance only copies hot tier.
Pull-based consumer model	Free backpressure; consumer controls rate; replay from arbitrary offset is trivial.	Baseline poll latency adds 1-5ms on idle/low-volume topics.	Low-QPS topic with sub-5ms SLO. Tune `fetch.min.bytes` and `fetch.max.wait.ms`.	Long-polling masks the latency for most workloads. Pull is the correct default; push systems pay this cost in flow-control complexity instead.
Partition as the parallelism unit	Total order per partition; clear ownership; one consumer per partition gives easy correctness.	Cannot scale parallelism above the partition count without re-keying everything.	Under-partitioned topic at year 2 needs full producer re-key migration to add consumers.	Over-partition early. Cost is per-partition overhead (controller load, replication threads); cost of under-partitioning is offline migration.
KRaft replaces ZooKeeper (4.0+)	Single-system operation; partition ceiling jumped from ~200K to 1M+; faster failover.	Newer code path than ZK; controller bugs at extreme partition counts still surfacing through 4.x series.	Edge cases at >500K partitions where controller fetch saturates. KIP-1219 (4.3) added tuning knobs.	4.0 removed ZooKeeper mode and the ZK migration path entirely; ZK clusters must migrate to KRaft on a 3.x bridge release first. Upgrade story is cleaner now, but 4.x is still the maturing release line.
Transactional EOS	Atomic multi-partition writes; read-process-write pipelines without app-level dedup.	~20% throughput overhead; coordinator state to manage; longer recovery on coordinator failover.	Hung transactions blocking partition progress. KIP-890 (partition verification) closed the worst patterns.	EOS is "exactly-once within Kafka", not "across systems". The application is still responsible for idempotency at every external sink.
Tiered storage (KIP-405)	Unbounded retention at S3 prices; broker disk stays small; rebalance becomes cheap.	Cold-read latency increases 10x or more; remote log manager adds operational surface.	Incident replay of week-old data saturates broker fetch threads and blocks live traffic.	Per-topic config. Only enable on high-retention topics. KIP-1235 (4.3) fixed the default ISR for the metadata topic, which was a foot-gun before.
ISR durability model	Tunable durability via `min.insync.replicas`; strong by default with RF=3.	Write availability tied to ISR size; an AZ blip can shrink ISR below min and reject writes.	Regional network event drops a follower below min.insync; producers see write rejections on healthy partitions.	Canonical config is RF=3, min.insync.replicas=2, unclean.leader.election.enable=false. The "lose availability vs lose data" knob is here.
JVM-based broker	Massive ecosystem; mature tools; well-understood tuning; every APM vendor speaks JMX.	GC tuning matters; PageCache discipline required; off-heap allocations need deliberate config.	G1 pauses at high throughput cause producer timeouts. PageCache thrashing under multi-topic load.	Provision 2x active topic working set as page cache headroom. Consider ZGC for low-pause requirements. Redpanda exists because of this.
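
The ISR durability row can be made concrete with a small model. This is a minimal sketch, not broker code: `PartitionState` and `accepts_write` are hypothetical names, and the model only captures the documented rule that `acks=all` writes are rejected once the ISR falls below `min.insync.replicas`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replication state."""
    replication_factor: int
    min_insync_replicas: int
    isr_size: int  # replicas currently in sync, including the leader


def accepts_write(state: PartitionState, acks: str) -> bool:
    """Return True if a produce request with the given acks would be accepted."""
    if state.isr_size < 1:
        return False  # no in-sync leader to write to
    if acks == "all":
        # acks=all is rejected once the ISR is smaller than min.insync.replicas
        return state.isr_size >= state.min_insync_replicas
    # acks=0 and acks=1 only need a live leader
    return acks in ("0", "1")


healthy = PartitionState(replication_factor=3, min_insync_replicas=2, isr_size=3)
one_down = PartitionState(replication_factor=3, min_insync_replicas=2, isr_size=2)
az_blip = PartitionState(replication_factor=3, min_insync_replicas=2, isr_size=1)

for name, state in [("healthy", healthy), ("one_down", one_down), ("az_blip", az_blip)]:
    print(name, accepts_write(state, "all"))
```

With the canonical RF=3 / min.insync.replicas=2 config, losing one follower keeps writes flowing; losing two trades availability for durability, which is the knob the table describes.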

Kinesis Data Streams · Trade-Offs

Trade-Off	What You Gain	What You Give Up	When It Bites	PE Nuance
Fully managed serverless	Zero broker operation; auto-scaling without ops review; AWS-grade availability SLA.	Per-GB markup; no protocol-level optimization; no visibility into broker behavior.	Sustained 10TB/hour, where cost exceeds self-managed Kafka by 3-5x even including headcount.	TCO-with-headcount wins for Kinesis below ~5-10 TB/day. Above that, the per-GB premium dwarfs the on-call savings.
Shard as capacity + billing unit	Predictable per-shard limits: 1 MB/s in, 2 MB/s out, 1000 records/s. Easy math.	Hard ceiling. No oversubscription. Cannot tune around hot keys.	A skewed partition key throws `ProvisionedThroughputExceededException` with no application-level recourse.	Pick partition key with care. High-cardinality random suffix is the workaround, then group downstream.
On-Demand auto-scale to 2x prior 30d peak	Survives normal traffic spikes without manual resharding; "set and forget" sizing.	Does not handle cold-start spikes >2x or 5x sudden ramps; scale-up window is ~15 min.	Black Friday cold-start; new product launch traffic; viral event. Writes throttle silently.	Pre-warm by writing synthetic traffic before known events. Or stay on provisioned mode for known-spiky workloads.
AWS-only deep integration	Firehose to S3/Redshift in one click; Lambda event source mapping; native KMS, IAM, CloudWatch.	Vendor lock-in; cross-cloud replication requires custom plumbing.	Multi-cloud strategy mandates or post-M&A integration with non-AWS shop.	KCL is open source. Some portability via Kafka-on-Kinesis tools but operationally expensive. Build assuming you stay on AWS.
At-least-once delivery semantics	Simpler producer/consumer code; no transactional state to manage; smaller surface.	No exactly-once. Dedup is your job, in your consumer, every time.	Payments, financial event sourcing, idempotency-critical pipelines. Retry storms generate duplicates.	Canonical pattern: DynamoDB conditional write keyed on the Kinesis SequenceNumber. Costs a DDB call per record.
365-day max retention	Long enough for most audit, replay, and ML feature-store windows. Tiered retrieval pricing.	Hard ceiling. Cannot retain forever. Long-term retrieval is charged separately per GB.	Regulatory 7-year retention; ML retraining on multi-year data; full event-sourcing replay.	Pair Kinesis with Firehose-to-S3 archive for true long-term. Treat Kinesis as the working window, S3 as the cold tier.
Enhanced Fan-Out (EFO)	Push-based; per-consumer dedicated 2 MB/s; sub-second latency; bypasses shared shard egress.	Per-consumer-hour charge; 20 consumer cap on Standard, 50 on OD-Advantage.	Many-consumer fan-out (ML feature store with N readers, multi-team analytics).	Pool consumers behind a single EFO subscription; fan out further downstream via SNS or in-app routing.
KCL/KPL client libraries	Best-in-class checkpoint management, lease balancing, automatic resharding handling.	JVM-heavy; complex to operate; KCL 2.x checkpoint semantics non-trivial.	Non-JVM language teams. Debugging lease races. Checkpoint races on consumer restart.	KCL 3.x improvements ongoing. For simple workloads consider direct SDK + manual checkpointing in DynamoDB.
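
The random-suffix workaround for hot partition keys can be sketched as follows. This is an illustration under stated assumptions: shards are assumed to split the 128-bit MD5 hash space evenly, the digest is read as a big-endian integer, and `shard_for_key`, `salted_key`, and `unsalted_key` are hypothetical helpers rather than AWS SDK calls.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis maps MD5(partition key) into a 128-bit hash key space


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def salted_key(base_key: str, salt: int) -> str:
    """Append a salt so one hot key spreads over several shards."""
    return f"{base_key}#{salt}"


def unsalted_key(key: str) -> str:
    """Recover the original key so consumers can re-group downstream."""
    return key.rsplit("#", 1)[0]


SHARDS = 8
hot_tenant = "tenant-42"

# Without salting, every record for the hot tenant lands on one shard.
unsalted_shards = {shard_for_key(hot_tenant, SHARDS) for _ in range(16)}

# With 16 salt values (a producer would usually pick the salt at random),
# the same tenant's records spread across shards.
salted_shards = {shard_for_key(salted_key(hot_tenant, s), SHARDS) for s in range(16)}

print(len(unsalted_shards), len(salted_shards))
```

The cost named in the table shows up in `unsalted_key`: every consumer has to strip the salt and regroup before it can reason about per-tenant order.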

Pulsar 4.0 · Trade-Offs

Trade-Off	What You Gain	What You Give Up	When It Bites	PE Nuance
Stateless brokers + BookKeeper storage	Brokers scale in seconds; storage scales independently; failure isolation between layers.	Three components to operate (brokers, bookies, Oxia).	Upgrade choreography. Bookie under-replication alerts. Cross-component performance debugging.	Elasticity win is real. Ops headcount is ~1.5x Kafka's. Decision is whether you want to pay for that elasticity.
Segment-based replication via BookKeeper ensemble	Topic data spreads across many bookies; no broker-pinned partition; faster recovery from bookie loss.	Write path adds bookie quorum hop; 2-3ms latency penalty vs Kafka's direct local write.	Sub-5ms p99 SLOs. Kafka's tuned local SSD path is hard to beat for raw latency.	Worth it for elasticity and multi-tenancy. Not worth it for single-tenant low-latency workloads.
Native multi-tenancy (tenant / namespace / topic)	1M+ topics per cluster; per-namespace quotas, retention, replication, authentication.	Mental model complexity; up-front design of tenant/namespace boundaries.	Teams skipping namespace design end up with messy isolation 18 months in.	SaaS platforms get isolation free. Single-tenant workloads pay the complexity tax without the benefit.
Both streaming and queueing in one system	Four subscription modes (Exclusive, Shared, Failover, Key_Shared) cover queue and stream patterns.	API complexity; mode is per-subscription; misconfiguration is silent.	Subscription mode mismatch produces either out-of-order delivery or single-consumer bottleneck.	Key_Shared for ordered parallelism. Shared for queue-style work. Exclusive for "consumer group" semantics. Document the choice per topic.
Push-based consumer with flow control	Lower idle latency than pull; broker manages dispatch rate; less network chatter.	Backpressure is the consumer's job via `receiverQueueSize`; getting it wrong is silent.	Slow consumer with high receiverQueueSize causes broker memory pressure and topic backlog.	Tune receiverQueueSize to working set, not "max". Monitor broker dispatcher metrics, not just consumer lag.
Native geo-replication	Configured at namespace level; async or sync; survives full-region loss with no MirrorMaker.	Instance-level configuration store (extra coordination layer) at the global scope.	WAN brownout causes replication lag spikes; rare-path failover code seldom drilled.	Async is the default and correct choice. Sync only for low-write financial workloads where DC loss must not lose committed data.
Oxia replaces ZooKeeper (4.0+)	Million-topic ceiling; faster metadata operations; cleaner ops than ZK.	Newest piece in the stack. Less production-tested than ZK or KRaft.	Split-brain edge cases not yet fully mapped at the longest tail of failure scenarios.	Run 5-node Oxia ensemble minimum. Pin versions carefully through the 4.x line until major-version stability is proven.
Tiered offload (S3 / GCS / Azure)	Unbounded retention; offload entire ledgers transparently; reads pass through broker via offloader.	Cold-read latency penalty; offloader plugin tuning surface; per-namespace policy.	Year-old data replay during incident pulls hard on bookies plus S3 simultaneously.	Schedule offload during off-peak windows. Size broker managed-ledger cache for hot working set.
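
The difference between Shared and Key_Shared dispatch is easier to see in a toy model. The sketch below is not Pulsar client code: `dispatch_shared` and `dispatch_key_shared` are hypothetical functions, and `zlib.crc32` stands in for the broker's real key hashing.

```python
import zlib
from collections import defaultdict
from itertools import cycle


def dispatch_shared(messages, consumers):
    """Shared: round-robin dispatch; one key's messages can reach several consumers."""
    assignment = defaultdict(list)
    for consumer, msg in zip(cycle(consumers), messages):
        assignment[consumer].append(msg)
    return assignment


def dispatch_key_shared(messages, consumers):
    """Key_Shared: all messages with the same key go to the same consumer."""
    assignment = defaultdict(list)
    for key, payload in messages:
        idx = zlib.crc32(key.encode("utf-8")) % len(consumers)  # stand-in hash
        assignment[consumers[idx]].append((key, payload))
    return assignment


def consumers_per_key(assignment):
    """Count how many distinct consumers saw each key."""
    seen = defaultdict(set)
    for consumer, msgs in assignment.items():
        for key, _ in msgs:
            seen[key].add(consumer)
    return {key: len(owners) for key, owners in seen.items()}


messages = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3), ("b", 3)]
consumers = ["c1", "c2", "c3"]

shared = consumers_per_key(dispatch_shared(messages, consumers))
key_shared = consumers_per_key(dispatch_key_shared(messages, consumers))
print(shared, key_shared)
```

In the Shared run each key is handled by three consumers, so per-key order is lost downstream; in the Key_Shared run each key has exactly one owner, which is the ordered-parallelism property the table recommends.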

02 · Use Cases

Real deployments with named drivers. The Driving Property is the one thing that pinned the choice; everything else is consequence.

Kafka · Use Cases

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Event sourcing + CDC backbone	LinkedIn, Stripe, Shopify	Unbounded retention + sub-10ms p99 at sustained TB/hour	~7 trillion events/day at LinkedIn; 100s of MB/s sustained	Kinesis: 365-day retention ceiling. Pulsar: ecosystem still catching up on CDC connectors (Debezium is Kafka-native).
Real-time analytics pipelines	Netflix, Pinterest, Uber	Sub-10ms producer-to-consumer p99 at 100+ MB/s sustained	1.6M msg/s at tier-1 banks; 100s of brokers per cluster	Kinesis: 70-200ms floor disqualifies. Pulsar: viable but Flink/Spark integration leans Kafka.
Microservice async backbone	Uber, Airbnb, Amazon Prime	Connect ecosystem + 17 language clients + community tooling	1000+ topics, 100s of services, cross-team contracts	Kinesis: cross-region support requires custom plumbing. Pulsar: client coverage gaps for non-JVM polyglot orgs.
Log aggregation	Datadog ingest, internal logging pipelines	Cost-per-GB at TB/hour ingest with multi-day retention	PB-scale ingest; 7-30 day retention	Kinesis: per-GB cost untenable at this volume. Pulsar: team familiarity tax, no clear architectural win.
Financial event sourcing with EOS	Robinhood-style trade pipelines, ledger systems	Multi-partition atomic writes via transactional producer	10s of K msg/s with strict idempotency	Kinesis: at-least-once requires app-layer dedup table per consumer. Pulsar: dedup is simpler but no multi-partition atomicity.
ML feature pipelines	Uber Michelangelo, Spotify ML, Pinterest	Replay-from-offset + Connect ecosystem + Streams DSL	1000s of feature topics; weeks of replay window	Kinesis: replay window capped; KCL checkpoint races on consumer restart. Pulsar: viable but tooling gap on ML side.
Stream processing input layer	Flink jobs, Spark Streaming, Kafka Streams	Native partition model is the de facto stream processing primitive	100s of topics feeding 10s of processors	Kinesis: Flink Kinesis connector works but partition mapping is awkward. Pulsar: Flink connector mature but ecosystem leans Kafka.

Kinesis · Use Cases

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Mobile / IoT ingest	AWS Mobile SDK shops, consumer IoT vendors	Cognito auth + zero broker ops + AWS-native PKI	100K-1M devices; bursty per-device traffic	Kafka: PKI/auth integration for device fleets is a project. Pulsar: same plus newer to mobile SDK landscape.
CloudWatch Logs streaming	Enterprise AWS shops, observability pipelines	Native CloudWatch subscription filter integration	Account-wide log volume; TB/day	Kafka: integration is build-it-yourself. Pulsar: same.
Clickstream into Redshift / S3	E-commerce on AWS, content platforms	Firehose 1-click sink to S3/Redshift/OpenSearch	10K-100K events/sec sustained	Kafka: at <1TB/day TCO including headcount loses by 3-4x. Pulsar: same TCO problem at smaller scale.
Lambda fan-out trigger	Serverless-first AWS shops	Native event source mapping with parallelization factor	1000s of invocations/sec per stream	Kafka: MSK Lambda trigger exists but brittle on rebalance. Pulsar: no native Lambda trigger.
Startup MVP with streaming need	Early-stage product with no streaming engineer on staff	0.05 FTE ops cost; ship before raising Series A	<100 GB/day; 5-10 microservices	Kafka: needs a dedicated platform engineer. Pulsar: same plus less hiring pool.
Bounded retention compliance	Healthcare, payments, audit trails on AWS	365-day retention out of the box; HIPAA / PCI-eligible	Regulatory window; not throughput-driven	Kafka: same retention possible via tiered storage but compliance docs are your problem. Pulsar: similar.
Real-time ML inference pipeline	SageMaker-based inference pipelines	Native SageMaker + Lambda + DynamoDB integration	10K-100K inference req/sec	Kafka: SageMaker doesn't integrate natively; you build the bridge. Pulsar: same.

Pulsar · Use Cases

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Multi-tenant SaaS event platform	Yahoo Japan, Tencent, Splunk Observability	Native tenant/namespace isolation at 100K+ topics	1M+ topics per cluster; 1000s of tenants	Kafka: topic ceiling + isolation needs ACL gymnastics or cluster sprawl. Kinesis: no topic abstraction at this scale.
Geo-replicated event bus 3+ regions	Iterable, Verizon Media (formerly Yahoo)	First-class async geo-replication configured per namespace	3-5 active regions; cross-region failover drilled	Kafka: MirrorMaker 2 works but adds an ops layer with its own failure modes. Kinesis: no primitive; build custom Lambda fan-out.
IoT platforms with millions of device topics	Comcast video platforms, large telco IoT	Million-topic ceiling; one topic per device or session	10M+ devices; topic-per-device	Kafka: partition ceiling and per-topic overhead make this uneconomic. Kinesis: stream-per-device is API-call-limited.
Unified queue + stream workloads	Iterable email/SMS infrastructure, marketing platforms	Shared subscription for queue semantics plus offset replay for stream	1000s of queues with stream replay needs	Kafka: no queue semantics natively (Share Groups in 4.x still maturing). Kinesis: no queue semantics; replay only via shard iterators within the retention window.
Financial transaction processing	Tencent Billing (migrated from Kafka)	Multi-region sync replication with strong durability per ledger	Per-tenant ledger isolation at 100s of tenants	Kafka: multi-tenancy + sync geo-replication forces cluster-per-tenant sprawl. Kinesis: no cross-region primitive.
AI agent platforms with per-session memory	Emerging agentic AI platforms; per-agent or per-session topic isolation	Topic-per-session at >100K sessions; isolation + lifecycle per namespace	100K-1M concurrent sessions; short-lived topics	Kafka: per-topic broker overhead at million scale. Kinesis: stream-per-session is API-cost-prohibitive.
Mixed retention per tenant	Analytics SaaS with tier-based retention SLAs	Per-namespace retention policy; trial users get 7d, enterprise gets 365d	10s of tiers; 1000s of customers	Kafka: retention is topic-level; would require topic-per-tier. Kinesis: retention is stream-level with same problem.

03 · Limitations

Cross-tech limitation matrix. Where each system is constrained, how badly, and what the workaround costs you. Each cell opens with a severity rating (Low, Medium, High, or Critical).

Limitation Axis	Kafka 4.3	Kinesis	Pulsar 4.0
Topic / partition ceiling	Medium ~1M partitions per KRaft cluster (4.3 tested). Controller load grows with partition count. Workaround: cluster sharding (multi-cluster). Cost: ops surface multiplies.	High Stream is the unit, not topic. Per-account shard limits (20K in US-East/West/Ireland, 6K elsewhere). Workaround: limit increase ticket. Cost: AWS support latency.	Low Designed for 1M+ unique topics per cluster (Oxia-backed). Workaround: N/A at most scales.
p99 latency floor	Low ~5ms achievable end-to-end at sustained 100MB/s+. Workaround: N/A. This is the floor.	High ~70-200ms end-to-end typical. EFO improves read, not write. Workaround: EFO + KPL aggregation. Cost: per-EFO-consumer fee.	Medium ~5-15ms typical. Bookie quorum hop adds 2-3ms vs Kafka. Workaround: tune ensemble + Qack. Cost: reduced durability margin.
Multi-region cost	High MirrorMaker 2 ops burden + cross-region egress. Workaround: Confluent Cluster Linking. Cost: vendor license.	Critical No native cross-region replication. Workaround: Lambda cross-region forwarder. Cost: custom code + per-record Lambda invocation.	Low Native per-namespace geo-replication. Workaround: N/A.
Hot key / partition skew	Medium Hot partition saturates one broker leader. Workaround: custom partitioner + key salting. Cost: downstream re-grouping.	High 1 MB/s hard cap per shard, no exception. Workaround: random suffix + reshard. Cost: reshard latency, downstream complexity.	Medium Hot topic concentrates broker load. Workaround: Key_Shared + manual bundle split. Cost: tuning effort.
Vendor lock-in	Low Open source protocol; many distributions. Workaround: N/A.	Critical AWS-only; proprietary API and SDKs. Workaround: Kafka-on-Kinesis proxy. Cost: performance hit, build effort.	Low Open source; KoP for Kafka client compatibility. Workaround: N/A.
Ecosystem breadth	Low 17+ official clients; every APM, ETL, and stream processor speaks Kafka. Workaround: N/A.	Medium Strong AWS-native, thin elsewhere. Workaround: KCL + custom adapters. Cost: integration work.	Medium 6 official clients; growing but smaller than Kafka. Workaround: KoP proxy for Kafka clients. Cost: translation hop.
Operational FTE cost	High ~0.5-1 FTE at 10 TB/day self-managed. Workaround: Confluent Cloud or MSK. Cost: 2-3x infra premium.	Low ~0.05 FTE; AWS owns ops. Workaround: N/A.	High ~0.75-1.5 FTE for the three components. Workaround: StreamNative Cloud. Cost: managed-service premium.
Storage retention ceiling	Low Unbounded via tiered storage. Workaround: N/A.	High 365-day hard ceiling. Workaround: Firehose to S3 archive. Cost: dual ingest pipeline.	Low Unbounded via tiered offload. Workaround: N/A.
Throughput ceiling	Low Linearly scales with brokers. ~605 MB/s on commodity cloud nodes. Workaround: N/A.	Medium OD: 2 GB/s in, 4 GB/s out per stream (with limit raise). Workaround: stream sharding at app level. Cost: routing logic.	Low Comparable to Kafka in tuned benchmarks; bookie scaling is independent. Workaround: N/A.
Exactly-once semantics	Low Native EOS within Kafka via transactional producer + read_committed. Workaround: N/A.	Critical At-least-once only. Workaround: DDB conditional write on SequenceNumber. Cost: per-record DDB call.	Low Producer-side dedup per namespace. No multi-partition atomicity but simpler model. Workaround: N/A for single-partition.
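
The at-least-once workaround in the exactly-once row amounts to a put-if-absent check before processing. A minimal sketch, assuming an in-memory set in place of the DynamoDB conditional write: `DedupTable` and `process_once` are hypothetical names, and keying on `(shard_id, sequence_number)` is this sketch's assumption.

```python
class DedupTable:
    """In-memory stand-in for a table that supports conditional put-if-absent."""

    def __init__(self):
        self._seen = set()

    def put_if_absent(self, key) -> bool:
        """Return True if the key was newly recorded, False if already present."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def process_once(records, table, handler) -> int:
    """Run handler only for records whose (shard, sequence number) is unseen."""
    processed = 0
    for shard_id, sequence_number, payload in records:
        if table.put_if_absent((shard_id, sequence_number)):
            handler(payload)
            processed += 1
    return processed


delivered = [
    ("shard-0", "100", "debit:10"),
    ("shard-0", "101", "credit:5"),
    ("shard-0", "100", "debit:10"),  # redelivered after a consumer restart
]
applied = []
count = process_once(delivered, DedupTable(), applied.append)
print(count, applied)
```

Recording the key before running the handler means a crash between the two steps drops that record; a real pipeline would either make the handler itself idempotent or write the marker and the side effect in one conditional operation.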

04 · Fault Tolerance

How each system survives the failure modes that show up in real on-call rotations.

Dimension	Kafka 4.3	Kinesis	Pulsar 4.0
Replication model	Leader-follower per partition, ISR-based. Default RF=3 with min.insync.replicas=2.	Hidden. AWS-internal multi-AZ; documented as "synchronously replicated across 3 AZs". No tuning surface.	BookKeeper ensemble per ledger. Write quorum and ack quorum configurable per namespace (e.g., E=3 W=3 A=2).
Failure detection	KRaft controller heartbeats; replica.lag.time.max.ms drives ISR shrink.	AWS-internal. You see PutRecords failures and ProvisionedThroughputExceeded, not node-level failure.	ZK/Oxia session timeout for broker; bookie health checks; broker dispatcher heartbeat to client.
Failover mechanism	Controller elects new partition leader from ISR. Producer metadata refresh triggers reconnect.	Transparent to clients. SDK retries are automatic; shard reassignment is invisible.	Topic ownership reassigned to surviving broker (stateless, instant). Bookie failure triggers BookKeeper auto-recovery.
RTO (typical)	10-30s for partition leader failover. Longer if controller is the failing node.	Seconds for shard reassignment. RTO from AWS-internal failures is not user-visible.	Sub-second for broker topic ownership transfer (stateless). Seconds-to-minutes for bookie quorum reformation.
RPO (typical)	Zero with acks=all + min.insync.replicas=2. Up to last unacked batch otherwise.	Zero for acknowledged PutRecord calls. AWS commits synchronously across 3 AZs.	Zero with E=3 W=3 A=2. Configurable per namespace.
Split-brain behavior	KRaft Raft prevents split-brain by design. ZK era could see brief stale-leader windows.	N/A at user level. AWS-internal concern.	Oxia leader lease prevents broker-side split-brain. BookKeeper ledger fencing prevents storage-side split-brain.
Blast radius of single-node failure	All partitions led by that broker re-elect. Followers continue serving. Cross-broker work for new leaders.	Single-shard or single-AZ failure invisible to users. AWS handles it.	Broker failure: topics shift to others (cheap). Bookie failure: only ledgers in its ensemble affected (narrow blast radius).
Cross-region failover story	Not built-in. MirrorMaker 2 or Cluster Linking required. Failover is an app-layer concern.	Not built-in. Cross-region replication requires Lambda forwarder pattern.	Native via geo-replication. Configure async or sync at the namespace level. Failover is consumer-side reconnect.
Data loss scenarios	unclean.leader.election.enable=true + ISR shrink to one node + that node fails = data loss. Mitigate with min.insync.replicas=2 and unclean.leader.election.enable=false.	Producer not retrying on ProvisionedThroughputExceeded; SDK default has bounded retry. Data loss possible if app drops the retry.	Quorum loss in a ledger (e.g., 2 of 3 bookies down) blocks writes until recovery; data loss only with E=2 W=2 A=1 and double failure.
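
The E/W/A notation used in the Pulsar column reduces to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the figures describe the worst case at the moment an ack is returned, before any remaining write-quorum copies land.

```python
def validate_quorum(ensemble: int, write_quorum: int, ack_quorum: int) -> None:
    """BookKeeper requires E >= W >= A >= 1."""
    if not (ensemble >= write_quorum >= ack_quorum >= 1):
        raise ValueError(f"invalid quorum E={ensemble} W={write_quorum} A={ack_quorum}")


def bookie_losses_survived_at_ack(ensemble: int, write_quorum: int, ack_quorum: int) -> int:
    """An acked entry is on at least A bookies, so it survives losing A - 1 of them."""
    validate_quorum(ensemble, write_quorum, ack_quorum)
    return ack_quorum - 1


def slow_bookies_tolerated_per_write(ensemble: int, write_quorum: int, ack_quorum: int) -> int:
    """A write is acked after A of W bookies respond, so W - A can lag or fail."""
    validate_quorum(ensemble, write_quorum, ack_quorum)
    return write_quorum - ack_quorum


print(bookie_losses_survived_at_ack(3, 3, 2), slow_bookies_tolerated_per_write(3, 3, 2))
```

For the E=3 W=3 A=2 example in the table, an acknowledged entry survives one bookie loss immediately, and one slow bookie does not delay the ack.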

05 · Sharding

The unit of parallelism, how it's keyed, and what changing it costs.

Dimension	Kafka 4.3	Kinesis	Pulsar 4.0
Sharding model	Hash partitioning by key (default murmur2). Custom partitioners allowed.	Hash range over MD5(partition key) mapped to shard hash space.	Hash by routing key (Java/native client). For non-partitioned topics, single-broker ownership.
Shard key constraints	Any byte sequence. Null key = sticky partitioning (one partition per batch since KIP-480; round-robin in older clients). Same key = same partition (ordering guarantee).	Up to 256 char string. Optional ExplicitHashKey to override. Same key = same shard.	Any byte sequence. Routing modes: Round-robin, Single-partition, Custom.
Rebalancing mechanism	Manual partition reassignment via admin tool or Cruise Control. Cooperative rebalancing for consumer groups (KIP-429).	Shard splits and merges via API call on provisioned (e.g., UpdateShardCount); automatic on OD, triggered by traffic.	Broker bundle redistribution is automatic. BookKeeper auto-recovery rebalances ledger segments on bookie failure.
Rebalancing cost / impact	Multi-hour to multi-day for TB-scale partition moves. Network and disk I/O intensive. Cruise Control rate-limits.	Shard split takes ~30s; merge ~30s. Some records may be processed twice during the transition.	Broker bundle moves are seconds (stateless brokers). Bookie ledger rebalance is background, no traffic impact.
Hot-shard behavior	Hot partition saturates leader broker. Producer latency on that partition spikes; others unaffected.	Hard 1 MB/s ingest cap. Excess returns ProvisionedThroughputExceededException. No oversubscription.	Hot bundle migrates to less-loaded broker via load shedding (configurable thresholds).
Maximum shards (practical)	~1M partitions per cluster (KRaft, 4.3). Per-partition cost grows with replication threads + file handles.	20K shards per account in tier-1 regions; 6K elsewhere. Per-stream practical limit ~1000s.	Topic partitions effectively unbounded at cluster level; bundles are the load-balancing unit, not partitions.
Resharding without downtime?	Adding partitions: yes, online (but it changes the key-to-partition mapping for existing keys, so per-key ordering breaks across the change). Reducing: no, requires migration.	Yes. UpdateShardCount API on provisioned; automatic on OD. Brief duplicate processing during transition.	Yes. Increase partition count via admin API. Decrease requires topic migration.
Cross-shard query support	No, by design. Streams DSL provides app-layer joins. Cross-partition transactions via EOS.	No. Consumer pulls per-shard; aggregation is app-side.	No native cross-partition queries. Pulsar Functions provides per-message routing/aggregation.
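
Why adding partitions disturbs keyed ordering follows directly from hash-modulo placement. A minimal sketch, with `zlib.crc32` standing in for Kafka's murmur2 and `partition_for` / `moved_keys` as hypothetical helpers:

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hash-modulo placement; crc32 is a stand-in for the real partitioner hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count: int, new_count: int):
    """Keys whose partition changes when the partition count changes."""
    return [k for k in keys if partition_for(k, old_count) != partition_for(k, new_count)]


keys = [f"user-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=6, new_count=12)
print(f"{len(moved)} of {len(keys)} keys change partition going from 6 to 12")
```

Every moved key now has history in its old partition and new records in a different one, so a consumer can no longer rely on one partition for that key's full order. That is the migration cost the over-partition-early advice is trying to avoid.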

06 · Replication

How each system makes durability promises and what those promises cost.

Dimension	Kafka 4.3	Kinesis	Pulsar 4.0
Replication topology	Leader-follower per partition. Each partition has one leader; followers pull from leader.	Hidden multi-AZ replication. From outside: single logical stream.	Leaderless quorum at the BookKeeper layer. Broker writes to multiple bookies in parallel.
Sync vs async	Sync to ISR (acks=all). Async beyond ISR (lagging followers catch up).	Sync across 3 AZs (documented). Async user-side via consumer apps.	Sync to write quorum (W bookies). Ack returned after A bookies confirm.
Replication factor (default / max)	Default 3; max bounded by broker count.	3 AZs, fixed, not user-configurable.	Default ensemble=3, write=3, ack=2; max bounded by bookie count per namespace.
Consistency level options	acks=0 (fire-and-forget), acks=1 (leader-only), acks=all (ISR). min.insync.replicas sets the minimum ISR size that acks=all writes require.	Single option: synchronous PutRecord ack after multi-AZ commit.	Per-namespace: configure E (ensemble), W (write quorum), A (ack quorum). Highly tunable.
Replication lag (typical)	Sub-ms within AZ; 1-10ms cross-AZ for ISR followers.	Not exposed. Inferred from PutRecord latency (~10-20ms typical).	Single-digit ms for write quorum ack.
Conflict resolution	Single-writer (leader) eliminates conflicts. Producer idempotence (KIP-98) handles retry dedup.	N/A. Single ordered stream per shard.	Single-writer per topic via broker ownership. Producer idempotence via dedup ID.
Cross-region replication	MirrorMaker 2 (async, eventual). Cluster Linking (Confluent commercial) for log-level mirroring.	Not native. Build custom via Lambda or third-party tools.	Native, per-namespace. Async by default; sync available at extra latency cost.
Replication during partition	ISR shrinks to surviving brokers. Writes succeed until ISR drops below min.insync.replicas, then reject.	AWS-internal; user-visible result is potential throttling, not split-brain.	If write quorum unreachable, writes block until BookKeeper opens new ensemble. Auto-recovery is automatic.

07 · Better Usage Patterns

What most teams do wrong, the right way to do it, and why the difference matters at scale.

Kafka · Patterns

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Partition count sizing	Pick a small number (say 6 or 12) and hope to scale up later.	Start at 2-3x your forecasted peak consumer count; over-provision early.	Adding partitions changes key-to-partition mapping. Re-keying old data is offline work; doing it right once saves a migration project.
Cross-AZ consumer fetch	Consumers default to fetching from the leader; ~50% of fetches cross AZ.	Enable rack-aware fetch from followers (KIP-392). Set `client.rack` on consumers and `replica.selector.class` to the rack-aware selector on brokers.	Cross-AZ traffic is often the largest line item in a Kafka AWS bill. Rack-aware fetch can cut consumer egress by 60-70%.
Tiered storage segment sizing	Leave `segment.bytes` at the default (1 GB).	Tune segment.bytes per topic to match offload cadence; smaller for fast-moving topics, larger for archival.	Offload happens at segment roll. Wrong size = either too-frequent S3 churn (cost) or slow promotion to cold tier (disk pressure).
KRaft controller quorum size	Run 3 controllers (the documentation minimum).	Run 5 controllers for production. Tolerates 2 simultaneous failures.	Controller availability is the single most critical dependency in KRaft. 3-node quorum has zero margin during a planned restart plus an unplanned failure.
Topic creation policy	Leave `auto.create.topics.enable` on; let producers create topics ad hoc.	Disable auto-create. Provision topics via CI with explicit config (RF, partitions, retention).	Auto-created topics get broker defaults you rarely want (often RF=1). Explicit creation is auditable; topic configs become reviewable infra.
Page cache sizing	Provision broker memory based on heap + headroom; treat page cache as automatic.	Provision 2x active topic working set as page cache, separate from JVM heap.	Kafka reads from page cache; cache misses go to disk and trash latency. Right-sizing the OS cache is more important than heap tuning.
Consumer rebalance protocol	Use eager rebalancing (the legacy default).	Use cooperative rebalancing (KIP-429). Set `partition.assignment.strategy` accordingly.	Eager rebalance is stop-the-world for the consumer group. Cooperative is incremental. For groups with many consumers, the difference is minutes of consumer downtime per rebalance.
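
The controller quorum advice rests on majority arithmetic, sketched below. The helper names are hypothetical; the only assumption is that the quorum needs a strict majority of voters to make progress.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def failures_tolerated(voters: int) -> int:
    """How many voters can be down while a majority remains."""
    return voters - majority(voters)


def quorum_survives(voters: int, planned_down: int, unplanned_down: int) -> bool:
    """True if a majority is still up with the given nodes out of service."""
    return voters - planned_down - unplanned_down >= majority(voters)


for n in (3, 5):
    print(n, majority(n), failures_tolerated(n))
```

A rolling restart takes one controller down on purpose. With 3 voters, a single unplanned failure in that window loses the quorum; with 5 voters the quorum holds.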

Kinesis · Patterns

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Capacity mode selection	Pick On-Demand Standard by default.	Use OD-Advantage if you have 3+ EFO consumers or sustained fan-out workloads. 60% lower data rates.	OD-Advantage removes per-stream fixed charge and per-EFO-consumer-shard-hour cost. Crosses over to cheaper at modest consumer counts.
Cold-start handling	Trust OD auto-scale to handle a known launch event.	Pre-warm 15-30 min before with synthetic traffic at expected load, or pre-provision shards for the launch window.	OD scales to 2x prior 30-day peak with a ~15 min ramp. Cold-start spikes exceed this. Synthetic warming is the documented AWS workaround.
Partition key strategy	Use user_id or tenant_id as partition key for "ordering".	Salt the key with a random suffix; re-group downstream by the unsalted key for ordering.	Hot tenants saturate one shard. Salting distributes load. Ordering can be reconstructed downstream by event timestamp + unsalted key.
Consumer library	Poll directly with GetRecords from a custom consumer.	Use KCL 2.x or 3.x. Let it handle lease management, checkpointing, and resharding.	GetRecords has retry, throttling, and shard-iterator semantics that are non-trivial. KCL handles them correctly. Custom consumers fail on resharding.
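Producer batching	Call PutRecord once per record.	Use PutRecords with up to 500 records or 5 MB per call. Add KPL aggregation when the per-shard records/s limit is the bottleneck.	Each PutRecord call pays a full HTTP round trip for one record. PutRecords amortizes that round trip across the batch, but every record still counts against the 1,000 records/s per-shard limit. KPL aggregation packs many user records into one Kinesis record, and that is what raises effective records/s per shard.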
Checkpoint cadence	Checkpoint after every record (correctness-first instinct).	Checkpoint after successful batch processing, with a max time bound. Tune for replay tolerance.	Per-record checkpoint hammers DynamoDB (KCL state). Batch checkpoint trades a few seconds of replay window for 100x cost reduction.
Retention sizing	Set max retention (365 days) "just in case".	Set retention to the minimum your replay use case requires. Archive to S3 via Firehose for the rest.	Long-term retention is billed per GB-month. At 1 TB/day with 365-day retention, the storage bill exceeds the ingest bill within months.
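
The partition-key and producer-batching rows can be sketched together. This is a minimal illustration under stated assumptions, not a production producer: the `#` separator, the bucket count, and the helper names are hypothetical, and the two caps are the per-call figures quoted in the table. The record shape (`Data`, `PartitionKey`) follows the entries that boto3's `put_records` accepts, so each yielded batch could be passed as its `Records` argument.

```python
import random
from typing import Dict, Iterable, Iterator, List

MAX_RECORDS = 500              # per-call record cap quoted in the table
MAX_BYTES = 5 * 1024 * 1024    # per-call payload cap quoted in the table


def salted_key(key: str, buckets: int, rng: random.Random) -> str:
    """Append a random bucket suffix so one hot key spreads over several shards."""
    return f"{key}#{rng.randrange(buckets)}"


def unsalted_key(salted: str) -> str:
    """Recover the original key so consumers can re-group downstream."""
    return salted.rsplit("#", 1)[0]


def chunk_records(records: Iterable[Dict]) -> Iterator[List[Dict]]:
    """Group records into batches that respect both per-call caps."""
    batch: List[Dict] = []
    size = 0
    for rec in records:
        rec_size = len(rec["Data"]) + len(rec["PartitionKey"].encode("utf-8"))
        # Flush before adding a record that would break either cap.
        if batch and (len(batch) == MAX_RECORDS or size + rec_size > MAX_BYTES):
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += rec_size
    if batch:
        yield batch


if __name__ == "__main__":
    rng = random.Random(0)
    recs = [{"Data": b"x" * 100, "PartitionKey": salted_key("tenant-42", 8, rng)}
            for _ in range(1200)]
    print([len(b) for b in chunk_records(recs)])
```

Salting gives up in-stream ordering for the unsalted key, so the consumer side has to re-group by `unsalted_key` and sort by event timestamp, as the table describes. Chunking only reduces request count; each record in a batch still counts toward the shard's per-record limit.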

Pulsar · Patterns

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Tenant / namespace hierarchy	Treat tenant and namespace as bureaucratic; put everything in `public/default`.	Plan the hierarchy upfront. One tenant per business domain, namespaces by lifecycle (prod / staging) or by retention class.	Quotas, retention, replication, auth are all per-namespace. Skipping the design means flat permissions, no per-team quotas, and a reorganization project later.
Subscription mode selection	Default to Shared subscription and lose ordering.	Use Key_Shared when you need ordered parallel processing. Exclusive for a single ordered consumer. Failover for active-standby consumers; on a partitioned topic it is the closest match to Kafka consumer-group semantics.	Mode mismatch is silent. Shared subscription on an ordered workload produces interleaved messages downstream and breaks invariants days later.
Retention policy scope	Set retention per topic, repeated across hundreds of topics.	Set retention at the namespace level. Group topics by retention class into namespaces.	Per-topic retention is a maintenance burden and a config-drift surface. Namespace-level retention scales to thousands of topics.
Oxia ensemble sizing	Run 3-node Oxia in production, mirroring the ZK recipe.	Run 5-node Oxia for prod. Keep the prod ensemble separate from dev/staging.	Oxia is new; its long tail of production failure modes is still being mapped. A 5-node ensemble tolerates a rolling restart plus an unplanned failure. Shared dev/prod ensembles cross-pollinate incidents.
Tiered offload threshold	Leave default offload policy off, then hit bookie capacity limits.	Enable tiered offload for any namespace with >30-day retention. Tune `offloadAfterElapsedMs` per namespace.	Bookies are expensive storage. Tiered offload to S3 is 10x cheaper per GB. Tuning offload threshold is the difference between bookie sprawl and a small bookie fleet.
Broker managed-ledger cache sizing	Leave cache size at default; treat broker memory as JVM heap problem.	Size `managedLedgerCacheSizeMB` to your hot working set, not topic count.	Cache misses go to bookies. Bookie load = network + disk hit. Right-sized cache absorbs the hot read path entirely.
BookKeeper ensemble configuration	Set E=W=A=3 "for safety" on every namespace.	Tune E (ensemble) larger than W (write quorum) for elasticity. E=5 W=3 A=2 spreads ledgers wider without slowing writes.	Larger E means smaller per-bookie blast radius. The gap between W and A (ack quorum) trades write latency against durability. Bigger isn't always safer; it's a per-namespace trade-off.
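
The E/W/A row can be checked with a few lines of arithmetic. This is a sketch, assuming the standard BookKeeper constraint that ensemble size >= write quorum >= ack quorum; the class and property names are hypothetical, not Pulsar or BookKeeper API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quorum:
    ensemble: int   # E: bookies a ledger's entries are striped across
    write: int      # W: bookies each entry is written to
    ack: int        # A: acknowledgements required before a write is confirmed

    def __post_init__(self) -> None:
        if not (self.ensemble >= self.write >= self.ack >= 1):
            raise ValueError("expected E >= W >= A >= 1")

    @property
    def entry_share_per_bookie(self) -> float:
        """Fraction of a ledger's entries stored on any single bookie."""
        return self.write / self.ensemble

    @property
    def slow_bookies_tolerated(self) -> int:
        """Write-set bookies that can lag without delaying the ack."""
        return self.write - self.ack


if __name__ == "__main__":
    for q in (Quorum(3, 3, 3), Quorum(5, 3, 2)):
        print(q, q.entry_share_per_bookie, q.slow_bookies_tolerated)
```

With E=W=A=3, every bookie holds every entry and the slowest bookie sets write latency. With E=5 W=3 A=2, each bookie holds 60% of a ledger's entries and one slow bookie in the write set does not delay the ack, at the price of only two confirmed copies at ack time.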

08 · Advanced / Next-Gen Alternatives

Where each system might be displaced, what the successor improves, and when migration is worth the cost.

Kafka · Successors / Alternatives

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
AutoMQ (diskless Kafka on S3)	10x infra cost reduction; eliminates cross-AZ replication tax; seconds-scale elasticity; Kafka API compatible.	Emerging	Low: protocol-compatible; client code unchanged. Operational re-learning required.	Cloud Kafka where cross-AZ traffic dominates the bill; teams comfortable accepting an extra 10-30ms of tail latency in exchange for lower cost.
WarpStream (S3-direct)	BYOC architecture (data plane in customer VPC, control plane managed); zero inter-AZ; Kafka API compatible.	Emerging	Low: client-compatible. Higher latency floor (~400ms p99) limits use cases.	Log aggregation, analytics ingest, cost-sensitive workloads where latency budget is generous.
Redpanda (C++ Kafka-compatible)	No JVM/GC pauses; sub-ms tail latency; thread-per-core architecture; single binary deployment.	Production	Low: Kafka API compatible. Operational differences (no external ZK or separate controller quorum to run because Raft is built in; different tuning model).	Sub-ms latency requirements; teams that want fewer moving parts; smaller-scale deployments where C++ ops is feasible.
Bufstream (Buf's S3-direct)	Protocol-compatible; S3-native storage; integrated schema management via Buf Schema Registry.	Early	Low: protocol-compatible. Newest entrant; ecosystem still small.	Greenfield deployments where schema-first matters; teams already using Buf for Protobuf.

Kinesis · Successors / Alternatives

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
MSK Serverless	Kafka protocol + AWS-managed scaling. Same "no ops" story with broader ecosystem.	Production	Medium: protocol migration plus client SDK changes; integration patterns shift from KPL/KCL to Kafka clients.	Hitting Kinesis limits (latency, retention) but committed to AWS; want managed Kafka without operating MSK.
Confluent Cloud on AWS	Fully managed Kafka with Schema Registry, ksqlDB, Connect, Stream Governance. More features than MSK.	Production	Medium: protocol migration; commercial license; PrivateLink configuration.	Teams needing the full Confluent feature set (Cluster Linking, governance) on AWS; willing to pay 2-3x infra premium.
AWS Firehose alone	Removes the stream abstraction entirely for ingest-to-S3 pipelines. Simpler, cheaper.	Production	Low: drop the Kinesis layer for pure ingest pipelines.	You had Kinesis only to land in S3 via Firehose anyway. Skip the intermediate.
EventBridge	Event-routing semantics with content-based filtering. Better fit for fan-out to multiple services.	Production	Medium: API and conceptual change from log to event bus.	Workloads that look more like async RPC than streaming. Microservices integration.

Pulsar · Successors / Alternatives

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
StreamNative Cloud	Fully managed Pulsar; removes three-component ops burden.	Production	Low: same Pulsar, hosted.	You want Pulsar's architecture without operating it. Same trade as Confluent Cloud for Kafka.
Pulsar Functions (embedded stream processing)	Lightweight Functions for stateless processing without a separate Flink/Spark cluster.	Production	None (additive, not replacement).	Simple stream transforms (filter, route, enrich) where dragging in Flink is overkill.
Kafka-on-Pulsar (KoP)	Run Kafka protocol on Pulsar storage. Migrate clients incrementally.	Emerging	Low for protocol; medium for operational learning of Pulsar internals.	Want Pulsar's multi-tenancy and elasticity, but cannot rewrite Kafka clients yet.
Astra Streaming (DataStax)	Managed Pulsar with native Cassandra integration. Streaming + wide-column store in one console.	Production	Low.	Workloads that span streaming and wide-column persistence; existing DataStax customers.