Apache Kafka — PE-Grade Deep Dive

Distributed append-only log that pretends to be a queue but isn't one. What it actually is, how partitions and storage really work, how every feature gets used in production, and where the 2024-2026 diskless wave is taking the project.

Current: Apache Kafka 4.3.0 (May 22, 2026) KRaft only since 4.0 Tiered Storage GA Share Groups GA (4.2) As of 2026-06-04
PE Verdict

Kafka is the durable, partitioned, append-only log that the rest of the streaming world is built on, and the partition is the only abstraction that matters. Choose partition count and partition key carefully at topic creation, because both are permanent. Beyond that, the most expensive thing about Kafka in 2026 is not the brokers, it is cross-AZ traffic, and the diskless wave (Tiered Storage, KIP-1150, AutoMQ, WarpStream) exists specifically to attack that bill. Reach for Kafka when you need replayable ordered streams with high fan-out; reach for SQS, RabbitMQ, or Share Groups when you actually need a queue.

Best default choices

01What Kafka Actually Is

Kafka is a distributed, partitioned, replicated, append-only commit log. That phrase is exact and every word does work. Most teams describe it as "a message queue" or "a streaming platform"; both descriptions are wrong in ways that lead to wrong design decisions.

Mental Model

Picture a write-ahead log from a database, but exposed as a network service, sharded across machines, and never truncated until you tell it to. Producers append byte ranges to the end. Consumers read from any offset, at any speed, and the broker remembers nothing about who has read what. That is the entire data plane.

Why teams reach for it

  • Decoupling. Producers do not know consumers exist. Add a third consumer six months later and nothing upstream changes.
  • Replay. A consumer that crashed at offset 4 trillion comes back and resumes at offset 4 trillion. A new consumer can read the entire log from offset 0 if retention permits.
  • Fan-out at constant producer cost. One write, N consumer groups read. The producer pays the same cost for one downstream or fifty.
  • Ordered partitioned streams. All records with the same key land in the same partition and are delivered in production order to a single consumer.
  • Backpressure absorption. Producers and consumers are decoupled in time. A 10x consumer slowdown does not propagate back to producers; the log just grows.
  • Durability with high write throughput. Sequential append to disk plus page cache means a single broker sustains hundreds of MB/sec of writes with replication-factor-3 durability.

When Kafka is the wrong tool

Anti-Use-Cases

RPC. Kafka is not a request/response system. If you need a reply within 5ms, do not use Kafka.

Per-message ack / DLQ semantics. Until KIP-932 Share Groups went GA in Kafka 4.2, Kafka could not natively do "ack one message, retry that one, send the next one to DLQ." Most teams still want SQS or RabbitMQ for this. Share Groups is a real option now, but it is a 2026 capability, not a default mental model.

Workflows with priorities. Kafka has no notion of message priority. Every message in a partition waits behind the one in front of it.

Small, low-throughput pub/sub. Below ~5k msg/sec sustained, a managed SQS+SNS or NATS deployment is operationally cheaper.

Long-lived stateful work assignment. If you need to assign a "job" to one worker and reassign on failure, that is a job queue, not a log.

What ships under the "Kafka" name

Apache Kafka in 2026 is not a single binary. It is a family of components, each with distinct operational profiles:

ComponentWhat It DoesRuns WhereProduction Reality
Broker Stores partitions, serves produce/fetch RPCs JVM, on dedicated hosts (or pods) The thing you actually operate. Usually 3 to 100+ nodes.
Controller (KRaft) Cluster metadata, partition leader election JVM, often co-located with brokers in small clusters Since Kafka 4.0, ZooKeeper is gone. Controllers run a Raft quorum (typically 3 or 5 nodes).
Kafka Streams Library for stateful stream processing Embedded in your application JVM Java/Scala only. State stored in local RocksDB plus changelog topics.
Kafka Connect Pluggable framework for source and sink connectors Separate JVM cluster (workers) Hundreds of connectors exist (JDBC, S3, Debezium, Elastic). Operationally similar to running another Kafka.
Schema Registry Stores Avro/Protobuf/JSON schemas, enforces compatibility Confluent or Apicurio service (not Apache Kafka core) Optional but standard. Without it, schema evolution becomes tribal knowledge.
MirrorMaker 2 Cross-cluster replication Runs as a Connect cluster For DR and active-active. Offsets do not translate cleanly across clusters.
ksqlDB / Flink SQL on Kafka topics Separate cluster Flink is the 2026 default for serious streaming SQL; ksqlDB is Confluent's offering.
PE Take

The Kafka mental error most teams make is treating it as one product. Adopting "Kafka" in practice means adopting at least Broker + Schema Registry + Connect, and probably Streams or Flink. Each is a separate operational surface with its own runbooks, version skew rules, and failure modes. Budget accordingly.

02Core Concepts: How Kafka Actually Works

2.1 Partitions: the only abstraction that matters

A topic is just a name. The real unit is the partition. A topic with 12 partitions is 12 independent ordered logs. Everything Kafka does, ordering, parallelism, replication, sharding, traces back to partitions.

Topic: orders (partitions = 4) partition 0 [ off0 | off1 | off2 | off3 | off4 | ... ] ← leader on broker-1 partition 1 [ off0 | off1 | off2 | off3 | ... ] ← leader on broker-2 partition 2 [ off0 | off1 | off2 | off3 | off4 | off5 ] ← leader on broker-3 partition 3 [ off0 | off1 | off2 | ... ] ← leader on broker-1 Producer with key="cust-42" → hash(key) % 4 = 2 → always partition 2 Producer with key="cust-99" → hash(key) % 4 = 0 → always partition 0 Producer with key=null → sticky round-robin across partitions

Key → partition mapping

The default partitioner does murmur2(key) % numPartitions. Two consequences that bite:

  • Partition count is permanent. Adding partitions changes the modulo, which means a key that landed on partition 2 yesterday lands on partition 5 tomorrow. Ordering guarantee per key is broken across the resize event. Most teams pick partition count once and never change it.
  • Key cardinality drives skew. If 90% of traffic carries the same key (a single hot customer, a single popular video), that key hammers one partition and one broker. The other brokers idle.

What ordering actually guarantees

Ordering Contract

Kafka guarantees ordering within a partition, and only within a partition. If you write records A, B, C to partition 0, every consumer reading partition 0 sees them in that order. If A goes to partition 0 and B goes to partition 1, you get no ordering guarantee between A and B.

This is why partition key is the most important decision in a Kafka design. The key defines what your ordering domain is. Customer ID as the key means "ordered per customer." Order ID as the key means "ordered per order, but customer events can interleave." Get this wrong and you cannot fix it without reprocessing.

Parallelism math

Maximum consumer parallelism within a consumer group equals partition count. 12 partitions = at most 12 consumers doing useful work. A 13th consumer in the group sits idle. This is why "partition count" debates exist:

DecisionToo Few PartitionsToo Many Partitions
Effect on consumer throughput Capped. Cannot scale consumers past partition count. Unlimited (within reason). Add consumers freely.
Effect on broker memory Low. Fewer open file handles, fewer index files in memory. High. Each partition costs ~10-20MB of broker resources (open files, index pages, controller metadata).
Effect on rebalance time Fast. Few partitions to reassign. Slow with classic protocol. KIP-848 in Kafka 4.0+ mostly fixes this.
Effect on EOS overhead Low. Each transaction touches few partitions. High. Transactions across many partitions multiply control-plane traffic.
Effect on producer batching Better. More records hit the same partition, larger batches. Worse. Same producer rate spread thinner across partitions, smaller batches, lower compression ratio.
PE Take: partition count heuristic

Target throughput per partition: ~10 MB/s sustained sweet spot. Per-broker partition ceiling: ~4000 partitions per broker before metadata overhead degrades performance. KRaft pushed the cluster-wide ceiling to ~1.9M partitions, but individual broker limits did not move proportionally. Sanity check: pick partition count = max(target throughput / 10 MB/s, target consumer count) and round up to a power of two so future resizes are cleaner.

2.2 Storage: log segments, page cache, and the magic of sequential writes

The reason Kafka is fast is not exotic. It is that disks are very fast at sequential append, and Kafka does almost nothing but sequential append.

What a partition looks like on disk

/var/lib/kafka/data/orders-0/ ├── 00000000000000000000.log ← segment 1, holds offsets 0..1006 ├── 00000000000000000000.index ← sparse offset → byte position ├── 00000000000000000000.timeindex ← sparse timestamp → offset ├── 00000000000000001007.log ← segment 2, holds offsets 1007..N (ACTIVE) ├── 00000000000000001007.index ├── 00000000000000001007.timeindex ├── 00000000000000001007.snapshot ← producer epoch state (for idempotence/EOS) └── leader-epoch-checkpoint ← who was leader at which offset

The partition directory is a sequence of segment files. Each .log file is named by the base offset of the first record it contains. The currently-being-written segment is the active segment. Once a segment rolls (by size, default 1 GB, or time, default 7 days), it becomes immutable, and a new active segment opens.

Why segments exist

  • Deletion is cheap. Retention deletes a whole segment file via unlink, which is constant-time. There is no scanning, no compaction of individual records (for retention deletion).
  • Index files stay small. Each segment has its own index, so the index never grows past ~10 MB by default.
  • Tiered storage offload is per-segment. Sealed segments are the unit that gets uploaded to S3.

How reads find the right byte

The .index file is sparse, one entry every log.index.interval.bytes (default 4 KB). For offset lookup:

  1. Binary search the segment list by filename (the filename is the base offset).
  2. Binary search the .index file to find the nearest indexed offset ≤ target.
  3. Scan forward in the .log file from that byte position until the exact offset is found.

A 1 GB segment with 4 KB index interval has ~262,144 index entries at 8 bytes each, around 2 MB. The default 10 MB index ceiling handles segments up to 5 GB. This is also why arbitrary-offset reads stay cheap, you binary-search a small index, then scan a few KB.

Page cache is the secret weapon

Why Kafka feels free

Kafka does not maintain an application-level cache. It writes to the OS page cache and lets the kernel decide when to flush. Consumers reading recent data hit the page cache, not disk. This is why a Kafka broker can sustain 1 GB/s of read throughput while disks are doing only 100 MB/s, the readers are hitting RAM that the kernel happens to have populated with recently-written log data.

Sized correctly (page cache > working set of recently-written data), Kafka is bound by network, not disk. Sized incorrectly (working set falls out of page cache, consumers do random offset jumps), you hit disk and throughput collapses.

Zero-copy reads (sendfile)

When a consumer asks for records, Kafka does not copy the bytes from kernel page cache into a JVM heap buffer and back to a socket. It uses the sendfile() syscall (or its NIO equivalent), which pipes bytes from page cache directly to the network socket. The JVM never touches the data. This is the second reason Kafka feels free, and the reason adding TLS or message-level encryption hurts (sendfile no longer applies, you pay one full copy per record).

Compaction vs deletion

Cleanup PolicyWhat It DoesWhen To UseGotcha
delete (default) Removes whole segments older than retention time/size Event streams, telemetry, audit logs Retention is per-topic; running out of disk and reducing retention does not delete data until segments age out.
compact Retains the latest value per key; old values for the same key are garbage-collected Materialized views, state changelogs, last-known-state topics Tombstones (null values) trigger deletion, but only after delete.retention.ms. Compaction runs on a background thread and is not real-time.
compact,delete Compaction with an additional time/size ceiling Long-lived state with an absolute "purge after N years" policy The two policies interact in ways that surprise people; deletion can remove the latest value for a key if it is too old.

Tiered storage (GA since Kafka 3.9, mature in 4.x)

Hot segments stay on broker local disk. Sealed segments older than local.retention.ms get uploaded to remote object storage (S3, GCS, Azure Blob) by the Remote Log Manager. Consumers reading old data fetch transparently from remote.

[ producer ] ↓ ┌──────────────────────┐ │ broker local disk │ │ ─ active segment │ ←── tail reads (page cache) │ ─ recent sealed segs │ └──────┬───────────────┘ │ async upload (Remote Log Manager) ↓ ┌──────────────────────┐ │ S3 / GCS / Azure │ ←── historical reads │ ─ old sealed segs │ (paid in object-storage GET cost) └──────────────────────┘
Tiered Storage gotchas

Does not work on compacted topics. Hard limitation, enforced at config time.

Replay-from-zero now pays for S3 GETs. A consumer that rewinds 30 days will trigger that many days of S3 reads, which can spike your bill.

Brokers still read the segment, decompress it, and ship to the consumer. Tiered storage cuts your disk cost, not your network cost. Cross-AZ replication and consumer fetch traffic are unchanged.

2.3 Push vs Pull: why Kafka is pull-based

Kafka brokers do not push records to consumers. Consumers ask. This is the opposite of RabbitMQ, ActiveMQ, and most queue-style brokers, and the choice has structural consequences.

Push model (RabbitMQ-style) Pull model (Kafka) broker → "here, take this" consumer → "got anything past offset 4823?" broker → "and this" broker → "here are 500 records, offsets 4823-5322" broker → "still more" consumer → "got anything past offset 5322?" consumer: "STOP, I am drowning" broker → "no, wait 500ms... here are 50 more" broker → tracks per-consumer state broker → tracks NOTHING about the consumer

What pull buys Kafka

  • Consumer-controlled backpressure. A slow consumer reads slowly. The broker neither buffers nor cares. There is no "consumer crashed and now I have unacked messages to track" problem.
  • Broker is stateless about consumers. The broker does not track who has read what. Consumers commit offsets to a special compacted topic (__consumer_offsets). Brokers serve fetches; that is the entire consumer protocol.
  • Reads are batched aggressively. A consumer asks for "up to 50 MB or wait 500 ms" in one round-trip, gets a large batch, decompresses it locally, processes the whole thing, then asks for the next batch. One TCP round-trip moves thousands of records.
  • Replay is trivial. "Read from offset 0" is the same operation as "read from the tail." The broker has no idea you are reading old data vs new data.
  • Multiple independent consumer groups for free. Each group tracks its own offset. Adding a new consumer group adds zero load on the producer side and zero state on the broker.

What pull costs

  • Idle latency floor. If a consumer polls every 500 ms and a record arrives at t=10 ms, the consumer sees it ~490 ms later in the worst case. Long-polling (fetch.max.wait.ms) cuts this, but the floor exists.
  • Wasted polls when topic is idle. Consumers poll on a schedule even when nothing is happening. Kafka mitigates this with long-poll (broker holds the fetch request until data arrives or timeout), but you still pay the connection cost.
  • No fan-out push to many endpoints. You cannot point Kafka at 10,000 mobile devices and have it push notifications. Each consumer needs to maintain a persistent connection and poll. This is what services like SNS, Pusher, and WebSockets are for.

The long-poll behavior

A consumer fetch carries two parameters: fetch.min.bytes (default 1 byte) and fetch.max.wait.ms (default 500 ms). The broker holds the fetch until enough bytes are available OR the timeout expires. If you set min bytes to 1 MB and wait to 500 ms, the broker either returns 1 MB quickly or returns whatever it has after 500 ms. This is the knob for tuning between latency (small min bytes, short wait) and throughput (large min bytes, longer wait).

PE Take: when push would help

For request-response patterns or "deliver to this client now" semantics, pull is wrong. Kafka is not trying to solve that problem. If your design needs broker-initiated delivery, you are either using the wrong tool or building a layer (WebSocket gateway, push notification service) in front of Kafka. The 2026 emergence of Share Groups (KIP-932) doesn't change this; share groups are still pull, they just relax the partition-to-consumer mapping.

2.4 Offsets and consumer position

An offset is just a 64-bit integer per partition. Consumers commit offsets back to Kafka via the __consumer_offsets internal topic (which is compacted; latest committed offset per (group, topic, partition) tuple wins). Three commit modes:

Commit ModeWhat It DoesFailure BehaviorWhen To Use
enable.auto.commit=true Background commit every auto.commit.interval.ms (default 5s) Consumer crash → up to 5 seconds of records reprocessed. At-least-once with a wide window. Logging, metrics, anything where duplicates are cheap.
commitSync() Blocking commit after each batch At-least-once, tight window. Consumer crash before commit → reprocess the current batch. Production default. Pair with idempotent or transactional downstream writes.
commitAsync() Non-blocking commit Faster but commits can fail silently. Need callback to handle errors. High-throughput consumers where the occasional retry on shutdown is acceptable.
Transactional sendOffsetsToTransaction Offset commit atomically tied to a produce transaction Exactly-once across consume-process-produce within Kafka. Kafka Streams and any consume-transform-produce pipeline needing EOS.

03Architecture

Kafka is a two-tier distributed system: a data plane of brokers that store partitions and serve produce/fetch RPCs, and a control plane of KRaft controllers running a Raft quorum to manage cluster metadata. Since Kafka 4.0 (March 2025), ZooKeeper is gone; KRaft is the only mode. Clients (producers, consumers, Connect workers, Streams apps) talk to brokers directly over a binary protocol on TCP.

3.1 Cluster Topology

┌──────────────────────────────────────┐ │ CONTROL PLANE (KRaft) │ │ ┌────────┐ ┌────────┐ ┌────────┐ │ │ │ ctrl-1 │◄─┤ ctrl-2 │─►│ ctrl-3 │ │ ← Raft quorum │ └────┬───┘ └────────┘ └────────┘ │ (tolerates 1 failure) │ │ active │ └───────┼──────────────────────────────┘ │ metadata log replicated to all brokers │ via __cluster_metadata topic ┌──────────────────────┼──────────────────────────────────┐ │ ▼ │ │ DATA PLANE (Brokers) │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ broker-1 │ │ broker-2 │ │ broker-3 │ │ │ │ ─ part 0 L │ │ ─ part 0 F │ │ ─ part 0 F │ │ L = Leader │ │ ─ part 1 F │ │ ─ part 1 L │ │ ─ part 1 F │ │ F = Follower │ │ ─ part 2 F │ │ ─ part 2 F │ │ ─ part 2 L │ │ │ │ ─ __cm rep │ │ ─ __cm rep │ │ ─ __cm rep │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ └───────▲─────────────────▲─────────────────▲────────────┘ │ │ │ ┌───────┴────────┐ ┌──────┴─────────┐ ┌────┴────────────┐ │ Producers │ │ Consumers │ │ Connect / │ │ (clients) │ │ (clients) │ │ Streams / Admin │ └────────────────┘ └────────────────┘ └──────────────────┘

Three node roles, two of which can be co-located:

RoleWhat It DoesQuorum / CountDeployment Reality
Controller Raft quorum managing cluster metadata: topic configs, partition assignments, leader elections, ACLs. Replicates the metadata log to all brokers. 3 or 5 nodes (odd for quorum). Tolerates ⌊(N-1)/2⌋ failures. In small clusters (<20 brokers), often runs combined with brokers (process.roles=broker,controller). In large clusters, runs on dedicated nodes for isolation.
Broker Stores partitions on local disk, serves produce/fetch RPCs, replicates partitions, executes leader/follower roles per partition. 3 to 100+ nodes. Scales horizontally with partition count. The thing you spend money on. CPU is rarely the bottleneck; disk IO, network bandwidth, and page cache size dominate sizing.
Client Producer, consumer, Streams app, Connect worker, Admin client. Discovers cluster via bootstrap servers, then refreshes metadata directly from brokers. Unbounded. Each client maintains a connection pool to brokers it actively reads/writes from. Metadata refresh is on-demand (cache miss) plus periodic (default 5 min).

3.2 Broker Internals

A single broker process is structured around two thread pools, a request channel, and a set of long-lived service threads.

┌─────────────────────────────────────────┐ │ KAFKA BROKER PROCESS │ │ │ client ──TCP──►│ ┌──────────────┐ │ │ │ Network │ num.network.threads │ │ │ Thread Pool │ (default 3) │ │ │ (Acceptors + │ ─ TLS handshake │ │ │ Processors) │ ─ Parse request hdr │ │ └──────┬───────┘ │ │ │ enqueue │ │ ┌──────▼───────────┐ │ │ │ REQUEST CHANNEL │ bounded queue │ │ │ (in-memory queue)│ per priority │ │ └──────┬───────────┘ │ │ │ dequeue │ │ ┌──────▼───────────┐ │ │ │ I/O Thread Pool │ num.io.threads │ │ │ (Request Handler)│ (default 8) │ │ └──┬──┬──┬──┬──────┘ │ │ │ │ │ └──► Other (ACL, txn,...) │ │ │ │ │ │ │ │ │ └─────► Fetch handler │ │ │ │ ↓ │ │ │ │ ┌─────────┐ │ │ │ │ │ Log Mgr │ sendfile() ─►│ to client │ │ │ └─────────┘ │ │ │ │ │ │ │ └────────► Produce handler │ │ │ ↓ │ │ │ ┌─────────────┐ │ │ │ │ Log Mgr │ │ │ │ │ (append to │ │ │ │ │ active seg)│ │ │ │ └──────┬──────┘ │ │ │ │ written to page │ │ │ │ cache, then ack │ │ │ ▼ │ │ │ ┌─────────────┐ │ │ │ │ Replica Mgr │ ── follower │ │ │ │ (notify fwd)│ fetch │ │ │ └─────────────┘ │ │ │ │ │ └────────► Metadata handler │ │ (cached metadata image) │ └─────────────────────────────────────────┘

The two thread pools

  • Network threads (num.network.threads, default 3). Read bytes off the socket, parse the request envelope, hand the request to the request channel. Write responses back to the socket. They never touch the log. Sized roughly to network connection count; under-sizing causes TCP backpressure on clients.
  • I/O threads (num.io.threads, default 8). Dequeue requests, execute the actual work: log append, log read, metadata lookup, replica fetch. These do the disk and page-cache work. Under-sizing causes request queue time to balloon while CPU sits idle on network threads.

Key service threads (not in the request path)

  • ReplicaFetcher threads (num.replica.fetchers, default 1 per source broker). Followers pull data from leaders via long-poll fetches. Increase when many partitions move during a rebalance.
  • LogCleaner threads. Background compaction for compacted topics. Reads sealed segments, merges them, writes a new segment with latest-value-per-key. Throttle to avoid disk contention with active produces.
  • RemoteLogManager threads (Tiered Storage). Upload sealed segments to S3, fetch on miss. Sized via remote.log.manager.task.thread.pool.size.
  • GroupCoordinator. Hosts consumer-group and share-group state. Each broker is the coordinator for a subset of groups (mapped via hash of group.id).
  • TransactionCoordinator. Manages transactional producer state (PID, epoch, commit/abort markers). Each broker hosts a subset.
PE Take: sizing thread pools

The default 3 network / 8 IO is for a dev broker with light load. For production: scale num.network.threads to roughly min(num_cores / 2, 8) and num.io.threads to roughly num_cores. Watch the RequestQueueTimeMs JMX metric: sustained high values mean IO threads can't keep up; sustained low values with high NetworkProcessorAvgIdlePercent means you have too many network threads. The two metrics tell you which pool is the bottleneck.

3.3 KRaft Control Plane in Detail

The control plane runs a Raft quorum over the __cluster_metadata topic. All cluster metadata (topic creates, partition reassignments, ACL changes, broker registrations) is an append to this log. The active controller is the Raft leader; passive controllers replicate the log. Brokers replicate the log too (read-only) so they always have an up-to-date metadata image without going through ZooKeeper-style watches.

WRITE PATH READ PATH (e.g., CreateTopic) (e.g., broker checks partition state) AdminClient Broker │ │ │ CreateTopicsRequest │ (no RPC; cached locally) ▼ ▼ Active Controller MetadataImage (in-memory snapshot) │ │ │ append to │ updated by replay of │ __cluster_metadata │ __cluster_metadata ▼ │ Raft log replicated │ to follower controllers │ │ │ │ majority commit │ ▼ │ All brokers replay ▲ the new metadata records ─────────┘

What lives in the metadata log

  • Topic definitions (name, partition count, configs)
  • Partition assignments (which brokers host which replicas)
  • ISR membership changes
  • Leader epoch changes
  • Broker registrations, broker fencing/unfencing
  • ACLs, client quotas, dynamic configs
  • Feature flags (e.g. share.version)

Metadata snapshots

The metadata log is compacted via snapshots. When the log grows past a threshold, controllers serialize a full state snapshot to a file (00000000000000XXXXX.checkpoint) and prune the log behind it. New controllers bootstrap from the latest snapshot plus the tail of the log. Without this, the metadata log would grow unbounded.

3.4 Partition Placement and Rack Awareness

When a topic is created, the controller assigns its partitions and replicas to brokers. Two constraints drive placement:

  • Even leader spread. Leaders are distributed across brokers so that produce/fetch load is balanced. Round-robin within the broker list.
  • Rack-aware replica placement. If broker.rack is set on each broker (typically to the AZ identifier in cloud deployments), the controller places replicas on different racks. This is what makes RF=3 actually survive an AZ failure.
Rack awareness gotcha

broker.rack is not set by default. A multi-AZ cluster without it can end up with all 3 replicas of a partition in the same AZ, defeating the AZ-failure tolerance you thought you had. Always set this at broker startup, validate via kafka-topics.sh --describe, and run a controlled AZ-fault test before declaring the cluster production-ready.

3.5 Client-Side Architecture

Clients are not thin wrappers around a TCP socket. The Java client is a substantial state machine with its own background threads, buffers, and metadata cache.

Producer internals

application broker │ │ │ send(record, callback) │ ▼ │ ┌──────────────────────────────────────────────────────────────────┐ │ │ KafkaProducer (main thread) │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │ │ │ Serializer │→ │ Partitioner │→ │ Record Accumulator │ │ │ │ │ (key, value) │ │ (key→ part) │ │ (per-partition batch)│ │ │ │ └──────────────┘ └──────────────┘ └──────────┬───────────┘ │ │ │ │ │ │ │ ┌─────────────────────┘ │ │ │ │ batched (linger.ms / batch.size) │ │ │ ┌─────────────────────────▼───────────────────────────────────┐ │ │ │ │ Sender (I/O thread) │ │ │ │ │ ─ Group batches by destination broker │ │ │ │ │ ─ ProduceRequest (one per broker, multi-partition) │──┼──►│ │ │ ─ Wait for ProduceResponse, invoke callbacks │◄─┼───│ │ └──────────────────────────────────────────────────────────────┘ │ │ └────────────────────────────────────────────────────────────────────┘ │

The producer's Sender thread is separate from your application thread. send() returns immediately after enqueueing into the RecordAccumulator. The Sender batches per partition, groups batches per destination broker, sends one ProduceRequest per broker per round-trip, and invokes the callback when the broker acks. This is why throughput depends on linger.ms and batch.size: they govern how long the Sender waits to accumulate a worthwhile batch.

Consumer internals

┌───────────────────────────────────────┐ │ KafkaConsumer │ │ │ poll() ───────►│ ┌──────────────────────────────┐ │ │ │ Fetcher │ │ │ │ ─ async FetchRequest to │──TCP─►│ broker │ │ each assigned partition │ │ (long-poll) │ │ ─ buffer responses │◄──TCP─│ │ └────────────┬─────────────────┘ │ │ │ │ │ ┌────────────▼─────────────────┐ │ │ │ Deserializer (key, value) │ │ │ └────────────┬─────────────────┘ │ │ │ │ │ ▼ │ │ ConsumerRecords │ │ │ │ Background heartbeat thread │ │ ─ ConsumerGroupHeartbeat (KIP-848) │ │ ─ Liveness + assignment reconciliation │ └──────────────────────────────────────────┘

The consumer is not multi-threaded for record processing; poll() is single-threaded and your code processes the returned batch before the next poll(). But the heartbeat (KIP-848 protocol) and prefetch (Fetcher) run on background threads to overlap I/O with processing. If your processing takes longer than max.poll.interval.ms, you're kicked from the group, even if heartbeats are healthy.

3.6 The Wire Protocol

Kafka has its own binary protocol over TCP. Not HTTP. Not gRPC. The protocol is request/response, length-prefixed, with versioned message schemas. Every API (Produce, Fetch, Metadata, OffsetCommit, JoinGroup, etc.) is a distinct request type with its own request and response schema, identified by an API key.

AspectBehaviorOperational Reality
Framing 4-byte length prefix + request header + request body. Same shape for response. Wireshark dissector exists; kafka-protocol tools can decode for debugging.
Versioning Each API has independently versioned request/response schemas. Clients and brokers negotiate the highest mutually-supported version per API at connection time (ApiVersions request). This is why clients and brokers can be 1-2 major versions apart and still work. Client < 2.1 cannot talk to broker 4.0+.
Connection multiplexing One TCP connection per (client, broker) pair. Multiple in-flight requests per connection (pipelined). Default cap: 5 in-flight on the producer (max.in.flight.requests.per.connection). Pipelining is what makes batched produce throughput fast. Setting in-flight to 1 (sometimes seen for "ordering safety") cripples throughput.
Authentication SASL (PLAIN, SCRAM, GSSAPI/Kerberos, OAUTHBEARER) inside the protocol; TLS at the transport layer. mTLS also supported. SASL + TLS is the production standard. OAUTHBEARER + OIDC is the 2024-2026 trend for cloud-native deployments.
Compression negotiation Producer-side, per-batch. Broker stores the compressed batch as-is (since 0.11) and forwards it compressed to consumers. Decompression happens at the consumer. Broker CPU spent on compression is near-zero. Consumer CPU does the work. Compression at producer is one of the highest-leverage tuning knobs.

04Execution Model

What happens, step by step, when bytes move through Kafka. Five flows matter: producer write, broker request handling, replication, consumer fetch, and consumer-group coordination. Understanding these is what separates "I use Kafka" from "I can debug Kafka."

4.1 Producer Write Path

From producer.send(record) in your application code to a durable, replicated commit:

Step 1: Application thread calls send() │ ▼ Step 2: Serialize key and value (using configured serializers) │ ▼ Step 3: Partitioner picks a partition ─ If key present: murmur2(key) % numPartitions ─ If key absent: sticky round-robin ─ Custom partitioner if configured │ ▼ Step 4: Record is placed in the RecordAccumulator's per-partition deque ─ Returns to the application immediately (async) │ ▼ Step 5: Sender thread (background) drains batches when: ─ batch.size reached, OR ─ linger.ms elapsed since first record in batch, OR ─ buffer is full and must flush │ ▼ Step 6: Sender groups batches by destination broker ─ One ProduceRequest per broker, may contain multiple partitions ─ Pipelined up to max.in.flight.requests.per.connection │ ▼ Step 7: Broker network thread reads, places on request channel │ ▼ Step 8: Broker I/O thread handles the request ─ Validates idempotence (PID + sequence number) ─ Validates transactional state (if transactional) ─ Appends to partition's active segment (in page cache) ─ Updates LEO (Log End Offset) │ ▼ Step 9: Broker waits for acks (if acks=all) ─ Followers are continuously fetching; they pull this batch next ─ Each follower acks its fetch with its new LEO ─ Leader updates HW (High Watermark) = min(ISR LEOs) │ ▼ Step 10: Once HW >= produce batch's last offset, broker sends ProduceResponse ─ acks=0: broker doesn't reply (fire-and-forget) ─ acks=1: reply after leader local write ─ acks=all: reply after ISR replication (HW advance) │ ▼ Step 11: Producer's Sender receives response, invokes user callback ─ If retriable error: re-queue the batch (idempotence prevents dups) ─ If fatal: invoke callback with exception

The idempotence trick

With enable.idempotence=true (default since 3.0), each producer gets a Producer ID (PID) and assigns sequence numbers per partition. The broker tracks (PID, partition) → last-seen sequence number. A retried batch with the same sequence is recognized as a duplicate and silently deduped. This is what makes retries=Integer.MAX_VALUE safe.

The transactional trick

With a transactional.id set, the producer registers with a TransactionCoordinator. Each transaction gets a monotonic epoch. beginTransaction / send / commitTransaction writes a special marker into the partitions involved. Consumers with isolation.level=read_committed skip aborted-transaction records and wait for the commit marker before exposing committed ones (this is the "Last Stable Offset" / LSO).

4.2 Broker Request Handling, in Detail

Every request (Produce, Fetch, Metadata, OffsetCommit, etc.) follows the same network-thread → request-channel → IO-thread path. The processing of Produce and Fetch is what we usually care about.

Produce request handling

network thread: read bytes → parse ProduceRequest envelope → enqueue on request channel │ └─→ io thread: ┌─ verify partition leader is this broker (NOT_LEADER_FOR_PARTITION otherwise) ├─ verify min.insync.replicas can be met ├─ for each (topic, partition, batch): │ ─ idempotence check (PID+seq) │ ─ transaction state check │ ─ append batch to active segment (write to page cache) │ ─ update LEO │ ─ update partition's offset index if log.index.interval.bytes crossed ├─ if acks=0: respond immediately (no reply, actually) ├─ if acks=1: respond now (after local write) ├─ if acks=all: register a "purgatory" entry │ ─ waits for HW to advance to cover this batch │ ─ replica fetcher acks from followers will eventually trigger this └─ send response on network thread

Fetch request handling (the zero-copy magic)

network thread: read bytes → parse FetchRequest → enqueue on request channel │ └─→ io thread: ┌─ for each (topic, partition): │ ─ verify partition leader (or follower if KIP-392 fetch-from-follower) │ ─ compute byte range to return: from requested_offset to min(HW, max_bytes) │ ─ obtain file handle and byte range in the segment .log file ├─ if total_bytes < fetch.min.bytes: │ ─ register in fetch purgatory; wait up to fetch.max.wait.ms │ ─ wake up when more data arrives or timeout └─ assemble FetchResponse: for each partition: include the (file, offset, length) tuple network thread: write FetchResponse header to socket for each partition's data: call sendfile(socket_fd, file_fd, offset, length) ─ kernel copies from page cache directly to socket ─ JVM heap is never touched ─ no userspace memory copy at all
Why this matters in practice

The fetch path is the reason Kafka is a 1+ GB/s/broker system on commodity hardware. The data path is: page cache → kernel TCP buffer → NIC. The JVM is in the control path (deciding what to fetch) but never in the data path. Adding TLS forces the kernel to encrypt the bytes, which requires a copy out of page cache into a kernel-encrypted buffer; sendfile no longer applies, and throughput drops 30-50%. Same for any message-level encryption you bolt on.

4.3 Replication Protocol

Replication is followers pulling from leaders via the same Fetch API consumers use. There's no separate replication protocol; followers are just specialized consumers reading their own broker's partitions.

LEADER (broker-1) FOLLOWER (broker-2) ───────────────── ───────────────── LEO = 10000 LEO = 9950 HW = 9950 HW = 9950 ISR = {1, 2, 3} (replicating from leader) ── Fetch(partition=0, offset=9950) ──► ◄── batches[9950..10000] ──────────── (writes to local log, page cache) LEO = 10000 ── Fetch(partition=0, offset=10000) ──► (the new fetch offset 10000 is the follower's ack of receipt) LEO = 10500 (new produces) HW = 10000 (advances to min(ISR LEOs)) ◄── batches[10000..10500] ──── (returned in this fetch)

Key invariants

  • LEO (Log End Offset): the next offset that will be written. Each replica has its own LEO.
  • HW (High Watermark): the highest offset known to be replicated to all ISR members. Consumers can only read up to HW (this is what gives Kafka its "no read uncommitted" guarantee).
  • ISR (In-Sync Replicas): replicas whose LEO is within replica.lag.time.max.ms of the leader. Falling behind drops you from ISR; catching up adds you back.
  • Leader Epoch: a monotonically increasing number that increments on every leader change. Used to detect and recover from divergent logs after a leader change (KIP-101). Without leader epochs, a leader change could silently lose committed data on a follower with stale data; with them, the follower can detect and truncate its log.

What happens when a follower falls behind

  1. Follower's LEO stops advancing (slow disk, GC pause, network).
  2. Leader's ISR shrink condition triggers: follower has been > replica.lag.time.max.ms (30s default) behind.
  3. Controller is notified; ISR is updated in __cluster_metadata.
  4. If ISR size drops below min.insync.replicas: produce with acks=all starts failing with NotEnoughReplicasException.
  5. When the follower catches back up, ISR is expanded; produces resume.

Fetch from follower (KIP-392)

Same fetch protocol, but consumers can request reads from a follower in the same AZ. The follower serves data only up to its locally-known HW (which may lag the leader's by milliseconds). The consumer specifies client.rack; the leader returns a redirect on the first fetch ("read from broker X, same rack as you"). Subsequent fetches go directly to the follower.

4.4 Consumer Group Coordination (KIP-848)

Consumer group rebalancing is the trickiest part of Kafka's execution model. KIP-848 (Kafka 4.0 GA) replaced the classic stop-the-world protocol with a server-driven, incremental one.

The classic protocol (legacy, deprecated)

Member joins → all members JoinGroup request │ ▼ Leader elected (one of the members) │ ▼ ALL MEMBERS STOP CONSUMING (revoke all partitions) │ ▼ Leader computes assignment, sends back via SyncGroup │ ▼ Members resume with new assignment Problem: any slow member blocks the entire group. 100-partition group with 10 members can spend 100+ seconds in rebalance.

KIP-848 (Kafka 4.0+, GA)

Member joins → ConsumerGroupHeartbeat request to broker (group coordinator) │ ▼ GROUP COORDINATOR (broker side) computes the new target assignment │ ▼ Coordinator returns DELTA in heartbeat response: ─ partitions to revoke ─ partitions to add (after others revoke) │ ▼ Each member acks the delta in its next heartbeat │ ▼ Members whose assignment did NOT change keep consuming uninterrupted Members with revoked partitions release them; consumer that's gaining waits for revocation ack, then starts on the new partition │ ▼ 100-partition group with 10 members rebalances in ~5 seconds Slow members no longer block fast ones

Heartbeat protocol

Members send ConsumerGroupHeartbeat RPCs every group.consumer.heartbeat.interval.ms (default 5s). The session timeout (group.consumer.session.timeout.ms, default 45s) is server-side, no longer client-configurable. Missing 9 consecutive heartbeats removes you from the group. The heartbeat is also the carrier for assignment reconciliation, both directions of state flow on a single RPC.

4.5 Transactions: End-to-End Flow

Transactional producers coordinate with a TransactionCoordinator (a per-broker service, sharded by transactional.id). Each transaction passes through these states:

app producer client TxnCoord partition leaders │ │ │ │ │ │── InitProducerId ───►│ │ │ │ (gets PID + epoch) │ │ │ │◄── PID + epoch ──────│ │ │ │ │── beginTransaction ──►│ │ │ │── AddPartitionsToTxn ►│ │ │ │ (each partition you'll write to) │ │ │ │── send(record) ──────►│── ProduceRequest ─────────────────────────►│ │ │ (with PID, epoch, sequence) │ │ │ (writes to log; │ │ records are │ │ in log but │ │ consumers │ │ with │ │ read_committed │ │ DON'T see them │ │ yet) │ │ │── commitTransaction ─►│── EndTxn(COMMIT) ────►│ │ │ │── WriteTxnMarkers ─►│ │ │ (commit markers in │ │ │ each partition) │ │ │ │ read_committed consumers │ can now see the records │ (LSO advances past commit marker)

Why this works

  • The TransactionCoordinator writes its own log (__transaction_state), so coordinator state survives broker restart.
  • Markers are written into the actual data partitions, so any consumer reading those partitions sees the boundary.
  • read_committed consumers maintain an LSO (Last Stable Offset) per partition: the highest offset whose containing transaction has been committed or aborted. Records past LSO are not exposed.
  • A crashed transactional producer is fenced on restart: the new instance bumps the epoch, the TransactionCoordinator aborts any open transaction from the old epoch. This is how zombie producers are prevented from committing stale data.
PE Take: latency cost of read_committed

Read-committed consumers wait for the LSO to advance, which happens only when a transaction commits or aborts. A long-running transaction (say, 30 seconds) holds back the LSO for all consumers of every partition it touches. If your transactional pipelines run alongside latency-sensitive consumers on the same topic, set transaction.timeout.ms tight (5-15 seconds), and design transactions to be small. Otherwise, your consumers experience their tail latency being controlled by your slowest transaction.

05Feature Usage: Working Code, Production Defaults

Every feature below is shown with the configuration most production deployments end up at, not the documentation default. Defaults are tuned for "small dev cluster, learn the API"; production tuning is different.

5.1Pub/Sub: Producer and Consumer

Use when You need durable, ordered, replayable messaging with high fan-out and per-key ordering.
Avoid when You need per-message ack with retry semantics, message priorities, or RPC-style request/response.

Producer (idempotent, production defaults)

import org.apache.kafka.clients.producer.*;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Durability and ordering
props.put("acks", "all");                              // wait for ISR ack, default since 3.0
props.put("enable.idempotence", "true");                  // no duplicates on retry, default since 3.0
props.put("max.in.flight.requests.per.connection", 5); // safe with idempotence
props.put("retries", Integer.MAX_VALUE);                  // rely on delivery.timeout.ms instead
props.put("delivery.timeout.ms", 120000);                // 2 min total deadline per record

// Throughput
props.put("compression.type", "lz4");                    // zstd is often better for archival
props.put("linger.ms", 20);                              // wait up to 20ms to batch
props.put("batch.size", 65536);                          // 64 KB per partition batch
props.put("buffer.memory", 67108864);                    // 64 MB total buffer

try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    // Key drives partition assignment. Same key → same partition → ordered.
    ProducerRecord<String, String> record =
        new ProducerRecord<>("orders", orderId, payload);

    producer.send(record, (metadata, exception) -> {
        if (exception != null) {
            log.error("Send failed for {}", orderId, exception);
        } else {
            log.debug("Sent {} → partition {} offset {}",
                orderId, metadata.partition(), metadata.offset());
        }
    });
}
Producer Gotchas

acks=1 looks faster but is a footgun. The leader writes locally and acks before replication. A leader crash between local write and replication loses the record silently. The throughput delta vs acks=all is often <10% with idempotence on; the durability delta is enormous.

linger.ms=0 starves your batches. The producer sends one record per request. Throughput plummets, broker load multiplies. 5-20ms is the sweet spot for most workloads.

Setting retries=0 to "fail fast" defeats acks=all. Transient leader elections will drop records that would have succeeded on retry. Use delivery.timeout.ms to bound total time instead.

Consumer (KIP-848 new protocol, manual commit)

import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.*;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("group.id", "order-processor");
props.put("group.protocol", "consumer");  // KIP-848 protocol (Kafka 4.0+)
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

// Manual commit, read only committed records (skip aborted transactions)
props.put("enable.auto.commit", "false");
props.put("isolation.level", "read_committed");

// Tune throughput vs latency
props.put("max.poll.records", 500);            // records per poll()
props.put("max.poll.interval.ms", 300000);    // processing deadline, exceed → kicked from group
props.put("fetch.min.bytes", 1048576);          // wait for 1 MB of data
props.put("fetch.max.wait.ms", 500);           // or 500 ms, whichever first

// Fetch from same-AZ follower to cut cross-AZ traffic (KIP-392)
props.put("client.rack", "us-east-1a");

try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("orders"));

    while (!Thread.currentThread().isInterrupted()) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));

        for (ConsumerRecord<String, String> record : records) {
            // Process. If processOrder throws, the loop bails and we DO NOT commit.
            // Next poll will re-deliver from the last committed offset. At-least-once.
            processOrder(record.value());
        }

        if (!records.isEmpty()) {
            consumer.commitSync();  // commit after successful processing
        }
    }
}
PE Take: KIP-848 is the only-rebalance-protocol-you-should-be-using in 2026

The classic rebalance protocol is "stop the world": every consumer drops its partitions, the coordinator computes a new assignment, every consumer picks up. A 100-partition group with 10 consumers spent 100+ seconds rebalancing on a single member leave. KIP-848 moves the assignment to the broker-side group coordinator and makes it incremental, the same 100-partition group rebalances in single-digit seconds, and consumers whose assignment did not change keep processing. Set group.protocol=consumer on all new consumers. The classic protocol is in Phase 1 of deprecation as of Kafka 4.3.

5.2Kafka Streams: Stateful Processing

Use when You need stateful per-key processing (joins, aggregations, windows) on Kafka topics, in JVM, with state co-located with your application.
Avoid when Non-JVM stack, cross-organization shared infra (use Flink), or your "stream processing" is actually a batch job with a cron schedule.

Mental model

KStream is a stream of records (insert log). KTable is a changelog interpreted as a materialized view (latest value per key). GlobalKTable is a KTable fully replicated to every Streams instance (no co-partitioning needed). Joins between these have different semantics and different cost.

Example: enrich orders with customer data, aggregate by region in 5-min windows

import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/streams");

// KIP-1071: new server-side Streams rebalance protocol (GA in Kafka 4.2)
props.put(StreamsConfig.GROUP_PROTOCOL_CONFIG, "streams");

StreamsBuilder builder = new StreamsBuilder();

// Source: stream of orders, keyed by customerId
KStream<String, Order> orders = builder.stream("orders",
    Consumed.with(Serdes.String(), orderSerde));

// Materialized table of customers, keyed by customerId
KTable<String, Customer> customers = builder.table("customers",
    Consumed.with(Serdes.String(), customerSerde));

// Stream-table join: enrich each order with customer attributes.
// Requires orders and customers co-partitioned on the same key.
KStream<String, EnrichedOrder> enriched = orders
    .leftJoin(customers, (order, customer) -> new EnrichedOrder(order, customer));

// Aggregate revenue per region in 5-minute tumbling windows
KTable<Windowed<String>, RevenueAgg> revenue = enriched
    .selectKey((k, v) -> v.region())
    .groupByKey(Grouped.with(Serdes.String(), enrichedOrderSerde))
    .windowedBy(TimeWindows.ofSizeAndGrace(
        Duration.ofMinutes(5), Duration.ofSeconds(30)))
    .aggregate(
        RevenueAgg::new,
        (region, order, agg) -> agg.add(order.amount()),
        Materialized.<String, RevenueAgg, WindowStore<Bytes, byte[]>>
            as("revenue-store")
            .withValueSerde(revenueAggSerde));

revenue.toStream()
       .map((windowedKey, agg) -> KeyValue.pair(windowedKey.key(),
            new RevenueResult(windowedKey.window().startTime(), agg.total())))
       .to("revenue-by-region-5min", Produced.with(Serdes.String(), revenueResultSerde));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

What's actually happening under the hood

  • Local state in RocksDB. The revenue-store is an embedded RocksDB instance per partition assigned to this Streams instance. Reads and writes are local; the disk is yours.
  • Changelog topic for durability. Every update to local state is also written to a changelog topic (Streams creates this for you with the name {application-id}-{store-name}-changelog). On restart or rebalance, state is restored by replaying the changelog. This is why "Streams state" survives a host failure.
  • Co-partitioning required for joins. If orders has 12 partitions and customers has 6, the join is rejected at topology build time. Both must have the same partition count and same partitioner.
  • EOS v2 uses transactions. Read input → update state + write changelog + write output, all in one Kafka transaction. Crashes leave no half-state.
Streams Gotchas

Repartitioning is automatic and silent. selectKey followed by an aggregation triggers an internal repartition topic. This doubles your network and storage cost for that pipeline. Check the topology before deploying.

Changelog topics are real topics. They count against your partition budget, broker storage, and replication traffic. A complex Streams app can quietly add hundreds of partitions to your cluster.

State restoration is slow on first start. Restoring a multi-GB RocksDB store from a changelog topic takes minutes. Use standby.replicas=1 to keep a warm replica so failover does not wait for restoration.

5.3Kafka Connect: Source and Sink Pipelines

Use when You need to move data between Kafka and external systems (databases, S3, search, warehouses) without writing custom integration code.
Avoid when You need significant per-record transformation logic. Use Streams or Flink for that, and let Connect do raw I/O only.

Architecture in one paragraph

Connect runs as a separate JVM cluster of "worker" nodes. Connectors (plugins) define source and sink logic. Tasks are the parallel work units within a connector. State (offsets, configs, status) is persisted to Kafka topics (connect-offsets, connect-configs, connect-status). The REST API on each worker exposes connector lifecycle. Workers coordinate via the same KIP-848 protocol used by consumers.

Source: Postgres CDC via Debezium with SMTs

POST http://connect-worker:8083/connectors
Content-Type: application/json

{
  "name": "orders-pg-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",  // CDC is single-task per database

    "database.hostname": "pg-primary.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db-pw:debezium}",
    "database.dbname": "orders_db",
    "topic.prefix": "pg.orders_db",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",

    "table.include.list": "public.orders,public.order_items",
    "snapshot.mode": "initial",

    // Single Message Transforms (SMTs)
    "transforms": "unwrap,route,filter",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "pg.orders_db.public.(.*)",
    "transforms.route.replacement": "cdc.orders.$1",
    "transforms.filter.type": "org.apache.kafka.connect.transforms.Filter",
    "transforms.filter.predicate": "is-not-system-table",

    // Exactly-once support (Kafka Connect 3.3+)
    "exactly.once.support": "required",

    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}

Sink: write enriched orders to S3 in Parquet, partitioned by date

POST http://connect-worker:8083/connectors

{
  "name": "orders-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "8",    // parallel tasks, capped at partition count
    "topics": "enriched-orders",

    "s3.bucket.name": "data-lake-prod",
    "s3.region": "us-east-1",
    "topics.dir": "raw/kafka",

    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "parquet.codec": "snappy",

    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "3600000",         // 1 hour partitions
    "path.format": "'dt'=YYYY-MM-dd/'hr'=HH",
    "locale": "en-US",
    "timezone": "UTC",
    "timestamp.extractor": "Record",

    "flush.size": "100000",                     // 100k records per S3 file
    "rotate.interval.ms": "600000",              // or 10 min, whichever first

    "behavior.on.null.values": "ignore",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "orders-s3-sink-dlq",
    "errors.deadletterqueue.context.headers.enable": "true"
  }
}
PE Take: SMTs are a smell when they grow

SMTs (Single Message Transforms) are great for "rename a field, drop a column, route by regex." They become a liability when you find yourself chaining 6 of them with custom predicates. At that point, write the transformation in Streams or Flink between source-connector and sink-connector. SMTs are not testable in isolation, not version-controlled with your application code, and not observable beyond connector-level metrics. Reach for them, then walk back to a real processing layer when complexity creeps.

5.4Schema Registry: Schemas as a Service

Use when You have more than two services exchanging data through Kafka. Schema discipline at the platform level prevents the "everyone sends slightly different JSON" failure mode.
Avoid when Single-team monorepo where producer and consumer ship together. The overhead is real and the benefit is marginal at that scale.

What it actually does

Schema Registry stores Avro/Protobuf/JSON Schemas, assigns each one an integer schema ID, and serves them via REST. The serializer in the producer registers the schema (or looks it up) and prepends the 5-byte ID to every record (1 magic byte + 4-byte schema ID + payload). The consumer reads the ID, fetches the schema, deserializes. Without Schema Registry, every consumer needs the schema baked in. With it, schema becomes a runtime concern.

Compatibility modes

ModeWhat It EnforcesWhat You Can ChangeWhen To Use
BACKWARD (default) New schema can read data written with old schema Add optional fields, remove fields Upgrade consumers first, then producers. The standard rolling-upgrade direction.
FORWARD Old schema can read data written with new schema Add required fields, remove optional fields Upgrade producers first, then consumers. Rare in practice.
FULL Both forward and backward compatible Only add or remove optional fields with defaults When you genuinely do not know which side upgrades first.
NONE No checks Anything Don't.
*_TRANSITIVE Compatibility against all prior versions, not just the latest Same as above, applied across history Mature systems where consumers may be pinned to old versions.

Producer with Avro and Schema Registry

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.generic.GenericRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", KafkaAvroSerializer.class.getName());

props.put("schema.registry.url", "http://schema-registry:8081");

// PE production defaults:
props.put("auto.register.schemas", "false");   // register via CI, not at runtime
props.put("use.latest.version", "true");       // always serialize with the registered latest
props.put("latest.compatibility.strict", "true"); // fail at startup if schemas drift

try (Producer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
    GenericRecord order = new GenericData.Record(orderSchema);
    order.put("orderId", "ord-7821");
    order.put("customerId", "cust-42");
    order.put("amount", 29.99);

    producer.send(new ProducerRecord<>("orders", "ord-7821", order));
}

Schema evolution: add an optional field

// V1 schema (already registered)
{
  "type": "record", "name": "Order",
  "fields": [
    {"name": "orderId", "type": "string"},
    {"name": "customerId", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}

// V2 schema: add couponCode (optional, default null) → BACKWARD compatible
{
  "type": "record", "name": "Order",
  "fields": [
    {"name": "orderId", "type": "string"},
    {"name": "customerId", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "couponCode", "type": ["null", "string"], "default": null}
  ]
}

# Register via CLI (use in CI, not at runtime)
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data @order-v2.json \
  http://schema-registry:8081/subjects/orders-value/versions
Schema Registry Gotchas

auto.register.schemas=true at runtime is a security hole. A buggy producer can register a half-broken schema and break every consumer. Always register schemas in CI/CD against a staging registry first.

Subject naming strategy matters. Default is TopicNameStrategy ({topic}-value), which means one schema per topic. TopicRecordNameStrategy allows multiple record types per topic. Pick once, do not change, the strategy is baked into producer/consumer code.

Removing a field is FORWARD compatible, not BACKWARD. Many teams default to BACKWARD and then get surprised when "I just removed an unused field" breaks them.

5.5Transactions: Exactly-Once Semantics

Use when You are doing consume-process-produce within Kafka and a duplicate output would be a correctness problem (financial ledger, inventory, dedup-sensitive analytics).
Avoid when Your downstream is non-Kafka. EOS stops at the Kafka boundary. Writing to an external DB, S3, or HTTP endpoint inside a Kafka transaction does not make that write transactional.

What Kafka transactions actually guarantee

Transactional producers write all-or-nothing across multiple partitions and topics, and (in consume-process-produce) atomically commit consumer offsets and produced records together. Consumers with isolation.level=read_committed skip records from aborted transactions. The guarantee scope is Kafka-to-Kafka only.

Consume-process-produce with EOS

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;

// Producer config
Properties prodProps = new Properties();
prodProps.put("bootstrap.servers", "broker:9092");
prodProps.put("transactional.id", "order-enricher-instance-1");  // unique per process
prodProps.put("enable.idempotence", "true");                       // required for txn
prodProps.put("transaction.timeout.ms", 60000);

// Consumer config
Properties consProps = new Properties();
consProps.put("bootstrap.servers", "broker:9092");
consProps.put("group.id", "order-enricher");
consProps.put("isolation.level", "read_committed");
consProps.put("enable.auto.commit", "false");  // MUST be false for EOS

Producer<String, EnrichedOrder> producer = new KafkaProducer<>(prodProps);
Consumer<String, Order> consumer = new KafkaConsumer<>(consProps);

producer.initTransactions();
consumer.subscribe(List.of("orders"));

while (true) {
    ConsumerRecords<String, Order> records = consumer.poll(Duration.ofSeconds(1));
    if (records.isEmpty()) continue;

    try {
        producer.beginTransaction();

        for (ConsumerRecord<String, Order> record : records) {
            EnrichedOrder enriched = enrich(record.value());
            producer.send(new ProducerRecord<>("enriched-orders",
                record.key(), enriched));
        }

        // Critical: commit offsets WITHIN the transaction, not via consumer.commitSync()
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, Order>> partRecs = records.records(tp);
            long lastOffset = partRecs.get(partRecs.size() - 1).offset();
            offsets.put(tp, new OffsetAndMetadata(lastOffset + 1));
        }
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());

        producer.commitTransaction();

    } catch (ProducerFencedException | OutOfOrderSequenceException e) {
        // Producer is fenced (another instance took over) or fatal error. Bail.
        producer.close();
        throw e;
    } catch (KafkaException e) {
        // Transient error. Abort and reprocess.
        producer.abortTransaction();
    }
}
EOS Boundary Gotchas

Side effects break EOS. If your enrich() function calls an external HTTP API, that call happens regardless of whether the transaction commits or aborts. The API may be called twice (re-poll after abort). EOS does not extend across the Kafka boundary.

transactional.id must be unique and stable per logical producer. Reusing IDs across instances causes ProducerFencedException. Generating new ones per restart defeats fencing entirely.

Read-committed adds latency. Consumers wait for the LSO (last stable offset), not the high watermark. Long-running transactions stall all consumers reading that partition.

EOS in Kafka Streams is the cleanest path. If your whole pipeline is consume-from-Kafka-and-produce-to-Kafka, use Streams with processing.guarantee=exactly_once_v2. The transactional wiring is automatic.

5.6Tiered Storage: Hot Local, Cold S3

Use when Long retention is a requirement (compliance, replay, slow consumers) and you can tolerate higher latency for old reads.
Avoid when The topic is compacted. Tiered storage is not supported for log-compacted topics.

Broker-side config

# server.properties
remote.log.storage.system.enable=true
remote.log.storage.manager.class.name=org.apache.kafka.server.log.remote.storage.RemoteLogManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.storage.TopicBasedRemoteLogMetadataManager

# S3-backed plugin (Aiven's TS plugin is the de facto open-source choice)
remote.log.storage.manager.class.path=/opt/kafka/plugins/tiered-storage-s3/*
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.s3.S3Storage
rsm.config.storage.s3.bucket.name=kafka-tiered-storage-prod
rsm.config.storage.s3.region=us-east-1
rsm.config.chunk.size=4194304    # 4 MB chunks
rsm.config.compression.enabled=true

# Quota (KIP-956): cap remote fetch/upload throughput
remote.log.manager.copy.max.bytes.per.second=104857600   # 100 MB/s upload
remote.log.manager.fetch.max.bytes.per.second=209715200  # 200 MB/s fetch

Per-topic enablement

# Enable tiered storage on a topic, keep 1 day local, 90 days total
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics --entity-name events \
  --alter \
  --add-config remote.storage.enable=true,\
local.retention.ms=86400000,\
retention.ms=7776000000

# Dynamically disable and re-enable (KIP-950, KRaft mode only)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type topics --entity-name events \
  --alter --add-config remote.storage.enable=false
PE Take: Tiered Storage solves disk cost, not network cost

Tiered Storage cuts the broker EBS bill. It does nothing about cross-AZ replication (still 3x replication factor on hot data) or cross-AZ consumer fetch traffic. In a multi-AZ AWS deployment, networking is typically 50-80% of total Kafka spend. Tiered Storage attacks the smaller line item. If cross-AZ is your real problem, look at KIP-392 (Fetch From Follower) for consumer-side savings, then consider truly diskless options (AutoMQ, Aiven Inkless, the WarpStream-IBM stack) for replication-side savings. See the Advanced section for the full diskless landscape.

5.7KRaft: Operating Kafka Without ZooKeeper

Use when You are running Kafka 4.0 or later. KRaft is the only mode that exists.
Avoid when You are on Kafka 2.x and your migration path is constrained. Go to 3.9 (bridge release) first, migrate metadata, then upgrade to 4.x.

What changed

ZooKeeper was Kafka's external metadata service for 14 years. KRaft (Kafka Raft) replaces it with an internal Raft quorum running on dedicated "controller" nodes (or co-located with brokers in small clusters). Metadata lives in a single internal topic, __cluster_metadata, replicated via Raft. Controller failover is now sub-second instead of tens of seconds. Partition scale ceiling moved from ~200k to ~1.9M per cluster.

Bootstrap a fresh KRaft cluster

# Step 1: generate a cluster UUID once, share across all nodes
KAFKA_CLUSTER_ID="$(./bin/kafka-storage.sh random-uuid)"

# Step 2: format storage on each node with that UUID
./bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties

# Step 3: start brokers
./bin/kafka-server-start.sh config/kraft/server.properties

Combined mode (broker + controller in one process)

# config/kraft/server.properties
process.roles=broker,controller
node.id=1

# All controller nodes listed here, on every node
controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093
controller.listener.names=CONTROLLER

listeners=PLAINTEXT://host1:9092,CONTROLLER://host1:9093
advertised.listeners=PLAINTEXT://host1:9092

log.dirs=/var/lib/kafka/data
num.partitions=3
default.replication.factor=3
min.insync.replicas=2

Operational verbs that changed

TaskZooKeeper Mode (legacy)KRaft Mode
Inspect cluster metadata zkCli.sh ls /brokers kafka-metadata-quorum.sh --describe status
List topics kafka-topics.sh --zookeeper ... kafka-topics.sh --bootstrap-server ... (only)
Add a controller Restart with new zookeeper.connect kafka-metadata-quorum.sh add-controller (dynamic)
Force a controller election Delete ZK node, hope for the best Built-in via metadata quorum; election is automatic

5.8Share Groups (Queues for Kafka): KIP-932

Use when You want queue semantics on Kafka: per-message ack, redelivery, more consumers than partitions, poison-message handling. GA since Kafka 4.2.
Avoid when You need strict ordering per key (share groups do not preserve partition-to-consumer mapping), or EOS (not yet supported with share groups).

Why this exists

For 14 years, the rule was "partition = unit of parallelism, max consumers per group = partition count." Want 1000 concurrent workers? Make 1000 partitions. That model fights real workloads where you want to scale consumers freely without committing to a partition count upfront.

Share Groups break the rule. A share group lets many consumers cooperatively consume from the same partition with per-record acknowledgment. The whole topic becomes a queue. You lose strict per-key ordering. You gain elastic scaling and SQS-like semantics, inside Kafka.

Share group consumer

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.consumer.AcknowledgeType;

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("group.id", "image-processors");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

// Per-record ack mode
props.put("share.acknowledgement.mode", "explicit");

try (ShareConsumer<String, byte[]> consumer = new KafkaShareConsumer<>(props)) {
    consumer.subscribe(List.of("image-jobs"));

    while (true) {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(2));

        for (ConsumerRecord<String, byte[]> record : records) {
            try {
                processImage(record.value());
                // Acknowledge as ACCEPT: remove from delivery set
                consumer.acknowledge(record, AcknowledgeType.ACCEPT);
            } catch (TransientException e) {
                // Release: redeliver to another consumer, increment delivery count
                consumer.acknowledge(record, AcknowledgeType.RELEASE);
            } catch (PoisonException e) {
                // Reject: send to dead-letter handling; will not be redelivered
                consumer.acknowledge(record, AcknowledgeType.REJECT);
            }
        }

        consumer.commitSync();  // flush ack state to share coordinator
    }
}

Topic-side: enable share group on a topic

# Cluster-wide feature flag (one-time, KRaft)
kafka-features.sh --bootstrap-server broker:9092 \
  upgrade --feature share.version=1

# Configure share-group settings (delivery count threshold, lock timeout)
kafka-configs.sh --bootstrap-server broker:9092 \
  --entity-type groups --entity-name image-processors \
  --alter --add-config \
group.share.delivery.count.limit=5,\
group.share.record.lock.duration.ms=30000

# Inspect
kafka-share-groups.sh --bootstrap-server broker:9092 --describe \
  --group image-processors
PE Take: Share Groups vs SQS vs Consumer Groups

Share Groups are the right answer for job-queue patterns where you already have Kafka and want to avoid running SQS or RabbitMQ alongside it. They are not the right answer when (a) you need strict per-key ordering, (b) you need EOS, or (c) the workload is naturally low-throughput and SQS would cost less. Treat Share Groups as a feature you adopt deliberately, not as the new default for consumers.

06Trade-Offs

Each row is a deliberate engineering choice baked into Kafka's design. Several of these have been challenged by next-gen alternatives (see Section 11), but the trade-offs are still load-bearing in any cluster you run today.

Apache Kafka
Trade-Off What You Gain What You Give Up When It Bites You PE Nuance
Partition count is permanent Constant-time key-to-partition routing, simple consumer assignment, ordered streams per key Resizing breaks the modulo, key K moves to a different partition, per-key ordering broken across resize Year 2 when traffic 4x and your 12-partition topic caps consumer parallelism at 12 workers; you cannot add more without breaking ordering Most teams overshoot intentionally: pick partition count = 2-4x current need, accept slightly worse batching now to avoid the rewrite later. Powers of 2 simplify any future migration via a partition-doubling strategy with parallel consumers.
Pull consumer model Consumer-driven backpressure, broker carries zero per-consumer state, batched fetches, trivial replay Idle latency floor (consumers poll on a clock), wasted polls when topic is quiet, no broker-initiated push Mobile push notifications, low-volume real-time UI updates, anything requiring sub-100ms delivery to thousands of endpoints The Kafka philosophy is "fan out at the edge, not at the broker." If you need push to many endpoints, put a WebSocket gateway between Kafka and clients. The gateway pulls; clients get push. Do not try to make Kafka itself push.
Ordering only within a partition Linear scalability of writes and reads across many partitions; no global coordination Cross-partition ordering is impossible. Two events with different keys have no defined order. "User signup" on partition A, "user first login" on partition B, login arrives before signup record is visible; downstream state machine breaks The fix is partition key design, not Kafka feature flags. Choose the key as the entity whose ordering you actually care about. A single global ordering is impossible at Kafka's scale, so accept partition-level ordering as the contract and design around it.
acks=all + min.insync.replicas=2 for durability No data loss on single-broker failure; ISR contract enforces durability before ack Write availability requires ≥2 healthy replicas. Lose 2 of 3 brokers in a 3-AZ setup, writes block. AZ outage in a 3-AZ deployment with RF=3 and min.isr=2; one AZ down leaves 2 replicas, surviving an additional broker loss in either remaining AZ blocks writes There is no free lunch between "never lose data" and "always accept writes." min.isr=2 with RF=3 chooses durability. min.isr=1 chooses availability but allows a single ack-then-leader-dies scenario to lose committed records. Pick consciously; do not let the default surprise you in an incident.
Cross-AZ replication 3x for durability Survives full-AZ failure with no data loss Every 1 GB produced generates 2 GB of cross-AZ replication traffic, plus consumer fetch traffic spread across AZs Cloud bill audit: ~70% of total Kafka spend is network egress, not compute or storage KIP-392 (fetch-from-follower) cuts consumer-side cross-AZ. KIP-881 (rack-aware assignors) makes it stick. The replication traffic itself is not negotiable in stock Kafka. This is the central reason the diskless wave (AutoMQ, WarpStream, KIP-1150) exists.
EOS stops at the Kafka boundary Strong exactly-once across consume-process-produce within Kafka; ideal for Streams Side effects to external systems (HTTP, DB, S3) are not transactional with Kafka commits "Process order, charge customer via Stripe, produce confirmation event" - transaction aborts after Stripe charge, charge stays, event does not. Customer charged twice on retry. The classic answer is the outbox pattern: write the side-effect intent into a DB transaction with your business write, then a CDC connector publishes to Kafka. Kafka does not solve this for you. Pretending EOS extends across systems causes correctness incidents.
Stateless brokers about consumers Broker scales to millions of consumer connections without per-consumer state No native per-message ack, no per-consumer retry, no priority delivery; consumer-state tracking is in __consumer_offsets only You want SQS semantics. Build them yourself with manual offset management, DLQ topics, and consumer-side retry. Or, since Kafka 4.2, Share Groups (KIP-932). Share Groups GA in 4.2 is the first real answer Kafka has to this trade-off. Until you have Share Groups, the long-running pattern is: separate retry topics with exponential backoff, DLQ topic for terminal failures, idempotent consumer logic.
Page cache + zero-copy reads Effectively free reads of recent data (RAM speeds); 1 GB/s+ per broker is realistic TLS or client-side encryption breaks zero-copy; throughput drops 30-50% Compliance mandate adds TLS, throughput regresses, capacity plan invalidated Plan for TLS in capacity sizing from day 1, not as an afterthought. The throughput delta is real and reproducible. Modern CPUs with AES-NI mitigate some of it, but the sendfile fast path is genuinely gone.
Stateful Streams via local RocksDB + changelog topics State co-located with compute, sub-ms state access, survives failure via changelog replay State restore on rebalance is slow (minutes for multi-GB stores); changelog topics double storage and replication cost Streams app restarts after deploy, takes 8 minutes to restore RocksDB stores, downstream consumers see processing gap Use num.standby.replicas=1 to keep a warm replica per partition. The replica restores in parallel; failover is instant. Costs ~2x state storage but eliminates the restore-time gap. Mandatory for any Streams app with strict SLAs.

07Use Cases

Apache Kafka
Use Case Company / Scenario Driving Property Scale Dimension Why Not Alternative
Event sourcing as the system of record LinkedIn (original use case), large fintechs replacing batch ETL with event-driven architectures Replayable durable log with ordered per-key streams and 7+ day retention; new consumers can rebuild state from offset 0 LinkedIn runs 7+ trillion messages/day across its Kafka fleet; multi-PB retained data; 100+ MB/s per topic sustained Traditional message queues (RabbitMQ, SQS) delete messages on ack. Replay requires a separate store. Kafka removes that store.
CDC pipeline from operational DB to data lake Mid-size SaaS replacing nightly ETL with Debezium + Kafka + S3 sink Sub-minute end-to-end latency for OLTP → lake replication with at-least-once semantics 2000+ tables across 30 Postgres instances, 50K events/sec sustained, 24-hour replay window Direct DB replication (logical slots → consumer) has no fan-out; adding a second consumer doubles replication slot load. Kafka decouples.
Real-time analytics aggregation Ad-tech firm computing per-minute click attribution and revenue rollups via Kafka Streams + Flink Stateful windowed aggregation with exactly-once semantics; p99 end-to-end latency under 5 seconds at 200K events/sec 200K events/sec peak, 5-minute tumbling windows, 30-day retention for late-arriving data Spark Streaming has micro-batch latency floor (seconds). Flink is the alternative, but Kafka Streams keeps state co-located which simplifies ops at this scale.
Microservice decoupling backbone E-commerce platform with 60+ services exchanging order, inventory, fulfillment, and payment events Service can be deployed independently of consumers; new services consume historical events to bootstrap state 15K events/sec sustained across ~400 topics, 50+ producer services, 200+ consumer services Synchronous REST creates a brittle service mesh: one slow service stalls the entire request chain. Kafka absorbs the impedance mismatch.
Application log and metric aggregation Datadog-style observability pipeline before storage in TSDB/object store Massive write throughput, lossy-acceptable durability, slow consumers do not block fast producers 1M+ log events/sec, 30-minute retention, ~100 consumer services tailing Kinesis is similar in concept but has 1000 records/sec per shard cap and 24-hour retention default. Kafka is more elastic at this volume.
Materialized views as Kafka topics Banking ledger projecting account balances from a transactions topic via Kafka Streams Compacted topic gives "latest balance per account" with key-based access; rebuildable from source-of-truth transactions 200M+ accounts, 50K transactions/sec, compacted topic ~80 GB, materialized on every Streams instance Traditional DB-backed projection requires explicit cache invalidation. Compacted topic + Streams = self-healing projection that survives any service restart.
Fraud detection feature pipeline Payment processor enriching transaction stream with rolling per-user features for ML inference Stateful joins against per-user history with millisecond-scale processing latency 30K transactions/sec, 5-minute and 1-hour windows per user, ~10M active users in working set Batch feature stores (Feast in batch mode) miss real-time signals. Online feature stores layered on Kafka Streams give both.

08Limitations

Apache Kafka
Limitation Severity Workaround Workaround Cost
Partition count is irreversible without breaking key ordering Critical Over-provision at topic creation (4x expected need). For existing topics, create a new topic with more partitions and run a dual-write/dual-consume migration. Over-provisioning wastes broker resources (open files, controller metadata). Migration takes weeks of careful dual-write, dual-consume, cutover.
Cross-AZ replication traffic dominates cloud cost Critical KIP-392 fetch-from-follower (consumer-side), KIP-881 rack-aware assignors, ultimately move to diskless (AutoMQ, WarpStream, Aiven Inkless). KIP-392/881 fix consumer side only; replication traffic remains. Diskless migrations require platform-level architecture changes and may sacrifice low-latency.
No native per-message ack / DLQ until Kafka 4.2 High Manual offset management, retry topics with backoff, DLQ topic for terminal failures. Or upgrade to 4.2+ and use Share Groups. Pre-4.2: 100+ lines of boilerplate per consumer to do what SQS gives for free. Share Groups GA: clean, but requires upgrade and learning a new API.
Compacted topics do not work with Tiered Storage High Use delete retention with very long ms, accept the disk cost; or store latest-value state in an external KV store and emit change events to a non-compacted Kafka topic. External KV duplicates the "latest value per key" semantics Kafka offers natively. Long retention with delete policy uses brokers as cheap object storage.
EOS does not extend beyond Kafka High Outbox pattern: write side-effect intent inside a DB transaction with the business write, CDC publishes to Kafka. Requires CDC infrastructure (Debezium), an outbox table per service, careful schema design. Real engineering effort.
Hot partition cannot be split without re-keying High Re-key to a more granular partition key (customerId → customerId+shardId), then repartition. Or salt the hot key across multiple partitions and merge downstream. Re-keying breaks consumer offset continuity. Salting requires downstream consumers to handle multi-partition gather, breaking the simple "one partition = one consumer = one ordering domain" model.
TLS + zero-copy are mutually exclusive Medium Accept the 30-50% throughput drop, size capacity accordingly. Or terminate TLS at a sidecar (Envoy) and run Kafka cleartext on a private network. Capacity: ~30-50% more brokers. Sidecar approach adds operational complexity and a privileged-network assumption.
No priority queues Medium Separate topics per priority, consumer reads from high-priority topic first with backpressure-aware polling. Manual priority logic in consumer; "starvation" risk for low-priority topics if high-priority is saturated.
Connector framework is a separate cluster to operate Medium Run Connect as a managed service (Confluent, Aiven), or use lightweight alternatives (Debezium Server, MirrorMaker 2 standalone). Managed Connect adds vendor cost; standalone alternatives lack the distributed task rebalancing of full Connect.
Schema Registry is not Apache Kafka Medium Confluent Schema Registry (Confluent Community License) is most common; Apicurio (Apache 2.0) is the open-source alternative. License sensitivity in some orgs; Apicurio is fully OSS but has smaller ecosystem support and fewer integrations.

09Fault Tolerance

Single-tech deep dive: dimension, behavior, and the operational reality you discover on the on-call rotation.

Dimension Behavior Operational Reality
Replication model Leader-follower per partition. Replication factor configurable per topic (default 3 in production). One broker is the leader for a partition; followers fetch from leader. Replication factor 3 with min.insync.replicas=2 is the production baseline. RF=2 is a footgun: lose one broker and writes block immediately.
Failure detection Brokers send heartbeats to the active controller. Follower lag tracked by leader; follower drops from ISR if it lags beyond replica.lag.time.max.ms (default 30s). KRaft (Kafka 4.0+) detects broker failure in single-digit seconds vs ZooKeeper's tens-of-seconds. ISR fluctuations under load are normal and not always failures, look for sustained ISR shrinkage, not transient.
Failover mechanism Controller elects a new leader from the ISR (in-sync replica set). Clients receive metadata update via fetch error and refresh routing. No client-side failover code needed. Sub-second leader election in KRaft. Producer/consumer clients seamlessly retry. The pause is short enough that p99 latency users often do not notice.
RTO (typical) Broker failure: 1-3 seconds for in-cluster failover (KRaft). AZ failure: same, as long as ≥1 ISR replica is in a surviving AZ. Region failure: minutes to hours (depends on MM2/Cluster Linking setup). RTO for in-region failures is excellent. Cross-region DR is a separate engineering effort, not a feature you flip on. Most teams accept 5-10 minutes of writes lost in true region failure.
RPO (typical) RPO=0 for in-region failures with acks=all and min.isr=2. Cross-region: bounded by MirrorMaker 2 lag (typically seconds, can spike to minutes under load). "Zero data loss" requires acks=all AND min.isr ≥ 2. acks=1 with min.isr=2 still loses data if the leader crashes before replicating.
Split-brain behavior Prevented by Raft quorum (controllers) and epoch fencing (partition leaders). Old leaders cannot ack writes after losing leadership; clients retry against the new leader. KRaft eliminated the historical ZooKeeper split-brain class. Eligible Leader Replicas (KIP-966) further reduce edge cases around minority-partition leader election.
Blast radius of single-node failure For each partition the failed broker led, sub-second leader re-election to a follower. Partitions where the failed broker was a follower lose one ISR replica. If the failed broker led many partitions, the surviving brokers see a thundering herd of leader-promotion. Plan for transient latency spike. Spread leaders evenly via auto.leader.rebalance.enable=true.
Cross-region failover story Active-passive: MirrorMaker 2 or Cluster Linking replicates to a DR region. Consumer offsets do not translate cleanly across clusters; tools like RemoteClusterUtils help. Active-active: rare in stock Kafka, common in Confluent's stretched clusters. Cross-region DR remains a manual cutover for most. The hard part is not the data, it's the consumer offset translation and the producer client redirection. Practice the failover in staging quarterly or you will get it wrong in prod.
Data loss scenarios (1) acks=1 + leader crash before replication. (2) Unclean leader election enabled + all ISR lost simultaneously + out-of-sync replica promoted. (3) Disk failure on a broker with replication factor 1 (unusual but possible on test topics). Default of unclean.leader.election.enable=false trades availability for durability: if all ISR is lost, the partition becomes unavailable until ISR recovers. Most teams accept this. Flipping it to true is a deliberate decision to lose data on rare double-failures.
Controller quorum failure (KRaft) Controller quorum (typically 3 or 5 nodes) tolerates ⌊(N-1)/2⌋ failures. Loss of majority blocks metadata changes (cannot create topics, elect leaders), but existing leaders continue serving reads/writes. Data plane survives controller-quorum loss. Control plane does not. This is a more graceful degradation than ZooKeeper, where the same loss class would block everything.
PE Take: the durability config matrix

For any non-trivial Kafka deployment, the durability triad is: acks=all, min.insync.replicas=2, replication.factor=3, and unclean.leader.election.enable=false. This combination gives RPO=0 in single-AZ failure scenarios. It also means a 2-AZ failure can block writes, which is the trade you make. Any deviation from this should be a deliberate, documented choice with a written justification.

10Sharding

Dimension Behavior Operational Reality
Sharding model Hash-based (default DefaultPartitioner uses murmur2(key) % N), with sticky round-robin for null keys. Custom partitioners implement org.apache.kafka.clients.producer.Partitioner. Custom partitioners are a smell. Use only for known-hot keys (salt high-volume key K across multiple partitions) and document the consumer-side gather. The default partitioner is what you should be using for ~95% of topics.
Shard key constraints Any byte array. Typically a string. Determined per-record by the producer. Immutable once written; record key cannot change without writing a new record. Pick the key as the entity whose ordering matters: customer, user, order ID. Composite keys (region:userId) work but increase cardinality, which is good for spread, bad for hot-key concentration.
Rebalancing mechanism No automatic rebalancing of partitions across brokers. kafka-reassign-partitions.sh generates and executes a plan to move partitions. Cruise Control or Confluent's Self-Balancing Clusters automate this. This is the single biggest operational gap in stock Kafka. Cruise Control is open-source, mature, and effectively mandatory for clusters with more than ~10 brokers. Without it, broker additions sit empty until you write a reassignment plan by hand.
Rebalancing cost / impact Partition reassignment copies entire partition data from source broker to destination. Replication factor × partition size of network traffic. Throttleable via --throttle flag (KB/s). Moving a 100 GB partition with RF=3 produces 300 GB of cross-broker traffic. Cluster-wide rebalances take hours-to-days. Throttle aggressively (~50-100 MB/s per broker) to avoid impacting production traffic.
Hot-shard behavior No adaptive splitting. A single key K with 90% of traffic concentrates on one partition. The leader broker for that partition becomes the bottleneck. The fix is application-level: salt the hot key across multiple partitions (key = "K-0", "K-1", ..., "K-9"), and have consumers fan-in. Document loudly because it breaks the "one key = one partition" mental model.
Maximum shards (practical) KRaft cluster limit: ~1.9M partitions across all brokers. Per-broker practical limit: ~4000 partitions (controller metadata overhead, open file handles). "Make a topic with 10,000 partitions" is rarely the right answer. A single broker hosting that many partitions burns memory on indices and degrades all topics. Prefer fewer, larger topics or hierarchical topic naming for routing.
Resharding without downtime? Adding partitions is online (kafka-topics.sh --alter --partitions N), but breaks key-to-partition mapping for new records. Shrinking partitions is not supported at all. "Adding partitions" should be treated as "creating a new topic." If key ordering matters (it usually does), the right move is: create new topic, dual-write, dual-consume, cut over, retire old topic. Weeks of work.
Cross-shard query support None native. Consumer reads multiple partitions; ordering is per-partition only. Kafka Streams provides cross-partition aggregation via repartition topics (automatic). If you find yourself wanting "join data from partition 0 with data from partition 3," you have a downstream system problem, not a Kafka problem. Process the join in Streams (with repartition) or pump into a real database for cross-key queries.
Partition assignment to brokers At topic creation, Kafka assigns partition leaders round-robin across brokers, with replicas placed on different brokers (and ideally different racks if broker.rack is set). Rack awareness is mandatory for multi-AZ. Without broker.rack set, Kafka cannot guarantee replicas land in different AZs and you lose the AZ-failure-tolerance assumption you may be operating under.

11Replication

Dimension Behavior Operational Reality
Replication topology Leader-follower per partition. Single leader per partition at any time; followers fetch from leader and apply changes in order. This is the only topology Kafka supports. Multi-leader exists only via Confluent's Cluster Linking or in cross-cluster setups (MM2 active-active). Stock open-source Kafka is leader-follower.
Sync vs async Synchronous within ISR (In-Sync Replicas). With acks=all, producer waits for all ISR members to ack. Out-of-ISR replicas catch up asynchronously. "Sync replication" is sync only to the ISR, not to all replicas. If a follower falls behind, it's removed from ISR and the leader stops waiting for it. The set is dynamic and changes under load.
Replication factor (default / max) Configurable per topic. Production default 3. Maximum bounded by broker count. RF=1 means no replication; RF=N requires N brokers. RF=3 is the answer for almost every production workload. RF=5 for ultra-critical (regulatory archives, financial event logs). RF=2 is wrong; lose one broker, writes block.
Consistency level options Producer-side: acks=0 (fire-and-forget), acks=1 (leader ack only), acks=all (all ISR ack). Consumer-side: isolation.level=read_uncommitted or read_committed. Production producers should use acks=all + enable.idempotence=true (latter is default since 3.0). Consumers should use read_committed when downstream of a transactional producer.
Replication lag (typical) Within-region: sub-second under normal load, single-digit ms for low-throughput topics. Cross-region (MM2): seconds to tens of seconds, sensitive to inter-region network. ISR lag spikes during broker restarts, partition rebalances, and disk pressure. Alert on sustained lag > 5s; transient blips are normal.
Conflict resolution No conflicts within a cluster: single leader per partition, all writes serialized. Cross-cluster (active-active with MM2): no automatic resolution. Both sides accept writes; downstream consumers see both. Active-active cross-cluster setups require application-level conflict design (last-writer-wins is a lie at scale; you need CRDTs or domain-specific reconciliation). Most teams avoid active-active and run active-passive.
Cross-region replication MirrorMaker 2 (Apache, open-source) or Confluent Cluster Linking (Confluent-only, more efficient). Both are async, both have consumer-offset translation challenges. MM2 runs as a Connect cluster; operate it as carefully as your main Kafka. Cluster Linking is more efficient (broker-to-broker direct) but locks you into Confluent. For DR, MM2 is almost always the right answer.
Replication during partition (network) Followers in the minority side fall out of ISR. Leader continues serving writes if min.insync.replicas can still be met. If not, leader rejects writes with NotEnoughReplicasException. This is the partition-tolerance behavior of Kafka: CP-leaning. Availability is sacrificed when min.isr cannot be met. This is the right default; the alternative (accept writes that may not be durable) is silently dangerous.
Fetch-from-follower (KIP-392) Consumers can be configured to fetch from same-AZ followers via client.rack + replica.selector.class=RackAwareReplicaSelector. Reduces cross-AZ consumer fetch traffic dramatically. This single config change cuts ~33% of total cross-AZ traffic in a 3-AZ deployment with one consumer group, more with multiple groups. Inexplicably underused; should be default on every multi-AZ cluster.
Replication cost (network) Every 1 GB produced generates (RF-1) × 1 GB of replication traffic. With RF=3, every produced GB costs 2 GB of replication. If replicas span AZs (they should), that 2 GB is cross-AZ. In AWS at $0.02/GB cross-AZ, 100 MB/s of produce traffic costs ~$5K/month just in replication networking. Multiply by your fleet. This is the single largest line item in Kafka cloud bills and the entire motivation for the diskless wave.

12Better Usage Patterns

Where PE judgment pays off. Patterns most teams miss, anti-patterns visible on code review, optimizations that compound at scale.

Apache Kafka
Pattern What Most Teams Do Wrong The Better Way Why It Matters
Partition count selection Pick the default of 1, or copy the partition count from another topic, with no thought to throughput or consumer parallelism. Compute partition count = max(target throughput MB/s ÷ 10, target consumer count) and round up to a power of two. Document the decision in the topic config. Partition count is permanent for ordering purposes. Wrong now = rewrite later. Right now = 5 minutes of math.
Partition key selection Use whatever ID is convenient (orderId, eventId), get unordered semantics across the entity that actually matters (customer, account, user). The key is the ordering domain. Pick the entity whose events must arrive in order to downstream consumers. customerId > orderId for an orders topic. Wrong key = correctness bugs in downstream stateful processing. Detecting these in production is expensive; preventing them is free.
Producer durability config Leave defaults. Or set acks=1 because "it's faster" without measuring. Explicit acks=all + enable.idempotence=true + min.insync.replicas=2 on the broker. Document the choice in your service config. Defaults change. Explicit config survives version upgrades and is greppable when the durability question comes up.
Consumer commit strategy Use enable.auto.commit=true and discover later that batch reprocessing on restart is undefined. enable.auto.commit=false, commit after successful batch processing. For EOS pipelines, use transactional commit via sendOffsetsToTransaction. Auto-commit is a footgun: it commits on a clock, not after processing. Restarts cause reprocessing of records you thought were done.
Use KIP-848 rebalance protocol Leave consumers on the classic protocol "because it's been working" - and pay 30-100s rebalances per consumer leave/join. Set group.protocol=consumer on all consumers from Kafka 4.0 forward. Test in staging, then roll out by consumer group. The classic protocol is in Phase 1 deprecation in Kafka 4.3. KIP-848 cuts rebalance time by 5-20x and survives slow members gracefully.
Enable fetch-from-follower Run multi-AZ Kafka without setting client.rack on consumers; pay full cross-AZ traffic cost for fetches. Set client.rack on consumer (matching the broker rack of the same AZ) and replica.selector.class=RackAwareReplicaSelector on broker. Cuts ~33% of cross-AZ traffic with a 5-minute config change. The most underused dollar-saving config in Kafka.
Use compacted topics for state Treat Kafka as append-only and run a separate KV store for "current state per entity." Use compacted topic with key=entityId for state. Consumers materialize current state by replaying. Survives any consumer restart. Eliminates an external state store and its operational burden. The compacted topic IS the source of truth.
Schema-first contracts Send JSON blobs with no schema, evolve fields by adding optional ones and hoping consumers handle it. Avro or Protobuf with Schema Registry from day 1. Compatibility=BACKWARD by default. Register schemas in CI, never at runtime. "Hope for the best" schema evolution is the leading cause of cross-team Kafka incidents. Schema Registry catches breakage at deploy time, not at runtime.
Standby replicas for Kafka Streams Set num.standby.replicas=0 (default), then suffer 5-10 minute state restore on every restart. Set num.standby.replicas=1. Warm replica restores changelog in parallel. Failover takes seconds. State restore is the longest tail on Streams availability. Standbys eliminate it at ~2x storage cost.
Topic naming conventions Free-for-all naming, topic name carries no metadata, no clue about lifecycle or ownership. Convention: {domain}.{entity}.{event-type}.{version} (e.g. orders.orders.placed.v1). Encode versioning into the name; new versions = new topics. Hundreds of topics in mature deployments. Naming convention is the difference between findability and chaos.
Throttle partition reassignments Run a reassignment to add a new broker, take down the cluster with replication traffic, get paged at 3 AM. Always set --throttle on kafka-reassign-partitions.sh. 50-100 MB/s per broker is a safe ceiling for production. Unthrottled reassignment can saturate the network and degrade producer/consumer latencies for hours.

13Advanced / Next-Gen Alternatives

The 2024-2026 streaming landscape changed more than the previous decade combined. Cross-AZ replication economics are the forcing function. WarpStream was acquired by Confluent (Sep 2024). Confluent was acquired by IBM for $11B (closed March 17, 2026). AutoMQ remains independent. Aiven's Inkless is upstreaming as KIP-1150. The "diskless Kafka on object storage" category is now real, in production at Grab, JD.com, Tencent.

Apache Kafka
Successor / Alternative What It Improves Maturity Migration Cost When To Consider
AutoMQ Diskless Kafka with pluggable WAL (S3, EBS, NFS). 100% Apache Kafka API compatible. Eliminates cross-AZ replication via S3 as the durability layer. Reported 10x cost reduction for high-throughput workloads. Production at scale (Grab, JD.com, Tencent; Apache 2.0 license) Low. Drop-in protocol compatible. AutoMQ Linking preserves consumer offsets during migration. Cross-AZ replication is the largest line item in your Kafka cloud bill, and you want to stay open-source. Independent vendor, no IBM/Confluent lock-in.
WarpStream (IBM/Confluent) Rewrite of Kafka in Go. Stateless agents write directly to S3. Zero local disk on the data plane. Excellent cost for log/observability workloads. Production (proprietary, now IBM-owned) Low for ingest (Kafka-protocol agents); higher for full migration due to feature gaps (no transactions, no compacted topics). Logs and observability pipelines where 400+ ms latency is acceptable. Already an IBM/Confluent customer. Concerned about IBM lock-in.
Apache Kafka with KIP-1150 (Aiven Inkless) Diskless topics as an upstream Apache Kafka feature. Some topics are diskless (object-storage-backed), others remain classic. Same cluster. Emerging / in-development (Aiven driving the KIP; upstream target Apache 2.0) Eventually low (upstream feature). Currently requires Aiven Inkless or commitment to the KIP timeline. Long-term bet on stock Apache Kafka. Want diskless economics for some topics without abandoning the broader Kafka ecosystem.
Redpanda C++ rewrite, thread-per-core architecture (Seastar). Best-in-class single-node performance, mature tiered storage in 25.x. Strong Kafka protocol compatibility. Production (BSL 1.1 license, source-available, converts to Apache 2.0 after 4 years) Low for protocol; license requires due diligence (BSL prohibits competitive managed-service offerings). Performance-per-node is the dominant concern. Latency-sensitive workloads. License is acceptable.
Apache Pulsar Compute/storage separation via BookKeeper. Multi-tenancy first-class. Native geo-replication, native queue mode, namespaces and policies. Production (Apache 2.0) High. Different APIs, different operational model, ZooKeeper still in the loop. Greenfield deployment with strong multi-tenancy requirements. Want native queue and stream in one system. Not migrating existing Kafka.
Apache Flink True stream processing with stateful operators, exactly-once at scale, watermarks and event-time semantics, broader language support (Java, Scala, Python, SQL). Production at every scale (Apache 2.0) Medium-high. Different deployment model (Flink cluster + checkpointing storage). API is more powerful but steeper. Complex stateful pipelines outgrow Kafka Streams. Cross-organization streaming platform. Need windowing semantics beyond Streams' offering.
NATS JetStream Lightweight messaging, simpler ops, native subjects (no partitions), stream replay, consumer ack semantics. Production (Apache 2.0) High - different conceptual model, different client libraries. Edge or IoT workloads, smaller-scale messaging where Kafka's operational footprint is overkill. Mesh/multi-cluster federation.
PE Take: the 2026 decision matrix

Self-managed Kafka in the cloud is structurally expensive because of cross-AZ replication. The diskless category fixes this and is real now. If you are starting fresh in 2026, the decision is no longer "Kafka vs not-Kafka," it is "classic Kafka vs diskless Kafka." For an existing Kafka deployment with a 6-figure cloud bill, AutoMQ migration is the highest-ROI infrastructure project most teams will see this year, and the one most teams will postpone forever because the existing pipeline is "working." The forcing function is usually an EBS or cross-AZ bill audit.

For latency-sensitive workloads, classic Kafka or Redpanda still win, diskless trades 50-400 ms additional latency for 10x cost reduction. Choose deliberately.

14References

Apache Kafka — PE-Grade Deep Dive · As of 2026-06-04 · Current Kafka 4.3.0