Apache Kafka — PE-Grade Deep Dive
Distributed append-only log that pretends to be a queue but isn't one. What it actually is, how partitions and storage really work, how every feature gets used in production, and where the 2024-2026 diskless wave is taking the project.
Kafka is the durable, partitioned, append-only log that the rest of the streaming world is built on, and the partition is the only abstraction that matters. Choose partition count and partition key carefully at topic creation, because both are permanent. Beyond that, the most expensive thing about Kafka in 2026 is not the brokers, it is cross-AZ traffic, and the diskless wave (Tiered Storage, KIP-1150, AutoMQ, WarpStream) exists specifically to attack that bill. Reach for Kafka when you need replayable ordered streams with high fan-out; reach for SQS, RabbitMQ, or Share Groups when you actually need a queue.
Best default choices
01What Kafka Actually Is
Kafka is a distributed, partitioned, replicated, append-only commit log. That phrase is exact and every word does work. Most teams describe it as "a message queue" or "a streaming platform"; both descriptions are wrong in ways that lead to wrong design decisions.
Picture a write-ahead log from a database, but exposed as a network service, sharded across machines, and never truncated until you tell it to. Producers append byte ranges to the end. Consumers read from any offset, at any speed, and the broker remembers nothing about who has read what. That is the entire data plane.
Why teams reach for it
- Decoupling. Producers do not know consumers exist. Add a third consumer six months later and nothing upstream changes.
- Replay. A consumer that crashed at offset 4 trillion comes back and resumes at offset 4 trillion. A new consumer can read the entire log from offset 0 if retention permits.
- Fan-out at constant producer cost. One write, N consumer groups read. The producer pays the same cost for one downstream or fifty.
- Ordered partitioned streams. All records with the same key land in the same partition and are delivered in production order to a single consumer.
- Backpressure absorption. Producers and consumers are decoupled in time. A 10x consumer slowdown does not propagate back to producers; the log just grows.
- Durability with high write throughput. Sequential append to disk plus page cache means a single broker sustains hundreds of MB/sec of writes with replication-factor-3 durability.
When Kafka is the wrong tool
RPC. Kafka is not a request/response system. If you need a reply within 5ms, do not use Kafka.
Per-message ack / DLQ semantics. Until KIP-932 Share Groups went GA in Kafka 4.2, Kafka could not natively do "ack one message, retry that one, send the next one to DLQ." Most teams still want SQS or RabbitMQ for this. Share Groups is a real option now, but it is a 2026 capability, not a default mental model.
Workflows with priorities. Kafka has no notion of message priority. Every message in a partition waits behind the one in front of it.
Small, low-throughput pub/sub. Below ~5k msg/sec sustained, a managed SQS+SNS or NATS deployment is operationally cheaper.
Long-lived stateful work assignment. If you need to assign a "job" to one worker and reassign on failure, that is a job queue, not a log.
What ships under the "Kafka" name
Apache Kafka in 2026 is not a single binary. It is a family of components, each with distinct operational profiles:
| Component | What It Does | Runs Where | Production Reality |
|---|---|---|---|
| Broker | Stores partitions, serves produce/fetch RPCs | JVM, on dedicated hosts (or pods) | The thing you actually operate. Usually 3 to 100+ nodes. |
| Controller (KRaft) | Cluster metadata, partition leader election | JVM, often co-located with brokers in small clusters | Since Kafka 4.0, ZooKeeper is gone. Controllers run a Raft quorum (typically 3 or 5 nodes). |
| Kafka Streams | Library for stateful stream processing | Embedded in your application JVM | Java/Scala only. State stored in local RocksDB plus changelog topics. |
| Kafka Connect | Pluggable framework for source and sink connectors | Separate JVM cluster (workers) | Hundreds of connectors exist (JDBC, S3, Debezium, Elastic). Operationally similar to running another Kafka. |
| Schema Registry | Stores Avro/Protobuf/JSON schemas, enforces compatibility | Confluent or Apicurio service (not Apache Kafka core) | Optional but standard. Without it, schema evolution becomes tribal knowledge. |
| MirrorMaker 2 | Cross-cluster replication | Runs as a Connect cluster | For DR and active-active. Offsets do not translate cleanly across clusters. |
| ksqlDB / Flink | SQL on Kafka topics | Separate cluster | Flink is the 2026 default for serious streaming SQL; ksqlDB is Confluent's offering. |
The Kafka mental error most teams make is treating it as one product. Adopting "Kafka" in practice means adopting at least Broker + Schema Registry + Connect, and probably Streams or Flink. Each is a separate operational surface with its own runbooks, version skew rules, and failure modes. Budget accordingly.
02Core Concepts: How Kafka Actually Works
2.1 Partitions: the only abstraction that matters
A topic is just a name. The real unit is the partition. A topic with 12 partitions is 12 independent ordered logs. Everything Kafka does, ordering, parallelism, replication, sharding, traces back to partitions.
Key → partition mapping
The default partitioner does murmur2(key) % numPartitions. Two consequences that bite:
- Partition count is permanent. Adding partitions changes the modulo, which means a key that landed on partition 2 yesterday lands on partition 5 tomorrow. Ordering guarantee per key is broken across the resize event. Most teams pick partition count once and never change it.
- Key cardinality drives skew. If 90% of traffic carries the same key (a single hot customer, a single popular video), that key hammers one partition and one broker. The other brokers idle.
What ordering actually guarantees
Kafka guarantees ordering within a partition, and only within a partition. If you write records A, B, C to partition 0, every consumer reading partition 0 sees them in that order. If A goes to partition 0 and B goes to partition 1, you get no ordering guarantee between A and B.
This is why partition key is the most important decision in a Kafka design. The key defines what your ordering domain is. Customer ID as the key means "ordered per customer." Order ID as the key means "ordered per order, but customer events can interleave." Get this wrong and you cannot fix it without reprocessing.
Parallelism math
Maximum consumer parallelism within a consumer group equals partition count. 12 partitions = at most 12 consumers doing useful work. A 13th consumer in the group sits idle. This is why "partition count" debates exist:
| Decision | Too Few Partitions | Too Many Partitions |
|---|---|---|
| Effect on consumer throughput | Capped. Cannot scale consumers past partition count. | Unlimited (within reason). Add consumers freely. |
| Effect on broker memory | Low. Fewer open file handles, fewer index files in memory. | High. Each partition costs ~10-20MB of broker resources (open files, index pages, controller metadata). |
| Effect on rebalance time | Fast. Few partitions to reassign. | Slow with classic protocol. KIP-848 in Kafka 4.0+ mostly fixes this. |
| Effect on EOS overhead | Low. Each transaction touches few partitions. | High. Transactions across many partitions multiply control-plane traffic. |
| Effect on producer batching | Better. More records hit the same partition, larger batches. | Worse. Same producer rate spread thinner across partitions, smaller batches, lower compression ratio. |
Target throughput per partition: ~10 MB/s sustained sweet spot. Per-broker partition ceiling: ~4000 partitions per broker before metadata overhead degrades performance. KRaft pushed the cluster-wide ceiling to ~1.9M partitions, but individual broker limits did not move proportionally. Sanity check: pick partition count = max(target throughput / 10 MB/s, target consumer count) and round up to a power of two so future resizes are cleaner.
2.2 Storage: log segments, page cache, and the magic of sequential writes
The reason Kafka is fast is not exotic. It is that disks are very fast at sequential append, and Kafka does almost nothing but sequential append.
What a partition looks like on disk
The partition directory is a sequence of segment files. Each .log file is named by the base offset of the first record it contains. The currently-being-written segment is the active segment. Once a segment rolls (by size, default 1 GB, or time, default 7 days), it becomes immutable, and a new active segment opens.
Why segments exist
- Deletion is cheap. Retention deletes a whole segment file via unlink, which is constant-time. There is no scanning, no compaction of individual records (for retention deletion).
- Index files stay small. Each segment has its own index, so the index never grows past ~10 MB by default.
- Tiered storage offload is per-segment. Sealed segments are the unit that gets uploaded to S3.
How reads find the right byte
The .index file is sparse, one entry every log.index.interval.bytes (default 4 KB). For offset lookup:
- Binary search the segment list by filename (the filename is the base offset).
- Binary search the
.indexfile to find the nearest indexed offset ≤ target. - Scan forward in the
.logfile from that byte position until the exact offset is found.
A 1 GB segment with 4 KB index interval has ~262,144 index entries at 8 bytes each, around 2 MB. The default 10 MB index ceiling handles segments up to 5 GB. This is also why arbitrary-offset reads stay cheap, you binary-search a small index, then scan a few KB.
Page cache is the secret weapon
Kafka does not maintain an application-level cache. It writes to the OS page cache and lets the kernel decide when to flush. Consumers reading recent data hit the page cache, not disk. This is why a Kafka broker can sustain 1 GB/s of read throughput while disks are doing only 100 MB/s, the readers are hitting RAM that the kernel happens to have populated with recently-written log data.
Sized correctly (page cache > working set of recently-written data), Kafka is bound by network, not disk. Sized incorrectly (working set falls out of page cache, consumers do random offset jumps), you hit disk and throughput collapses.
Zero-copy reads (sendfile)
When a consumer asks for records, Kafka does not copy the bytes from kernel page cache into a JVM heap buffer and back to a socket. It uses the sendfile() syscall (or its NIO equivalent), which pipes bytes from page cache directly to the network socket. The JVM never touches the data. This is the second reason Kafka feels free, and the reason adding TLS or message-level encryption hurts (sendfile no longer applies, you pay one full copy per record).
Compaction vs deletion
| Cleanup Policy | What It Does | When To Use | Gotcha |
|---|---|---|---|
delete (default) |
Removes whole segments older than retention time/size | Event streams, telemetry, audit logs | Retention is per-topic; running out of disk and reducing retention does not delete data until segments age out. |
compact |
Retains the latest value per key; old values for the same key are garbage-collected | Materialized views, state changelogs, last-known-state topics | Tombstones (null values) trigger deletion, but only after delete.retention.ms. Compaction runs on a background thread and is not real-time. |
compact,delete |
Compaction with an additional time/size ceiling | Long-lived state with an absolute "purge after N years" policy | The two policies interact in ways that surprise people; deletion can remove the latest value for a key if it is too old. |
Tiered storage (GA since Kafka 3.9, mature in 4.x)
Hot segments stay on broker local disk. Sealed segments older than local.retention.ms get uploaded to remote object storage (S3, GCS, Azure Blob) by the Remote Log Manager. Consumers reading old data fetch transparently from remote.
Does not work on compacted topics. Hard limitation, enforced at config time.
Replay-from-zero now pays for S3 GETs. A consumer that rewinds 30 days will trigger that many days of S3 reads, which can spike your bill.
Brokers still read the segment, decompress it, and ship to the consumer. Tiered storage cuts your disk cost, not your network cost. Cross-AZ replication and consumer fetch traffic are unchanged.
2.3 Push vs Pull: why Kafka is pull-based
Kafka brokers do not push records to consumers. Consumers ask. This is the opposite of RabbitMQ, ActiveMQ, and most queue-style brokers, and the choice has structural consequences.
What pull buys Kafka
- Consumer-controlled backpressure. A slow consumer reads slowly. The broker neither buffers nor cares. There is no "consumer crashed and now I have unacked messages to track" problem.
- Broker is stateless about consumers. The broker does not track who has read what. Consumers commit offsets to a special compacted topic (
__consumer_offsets). Brokers serve fetches; that is the entire consumer protocol. - Reads are batched aggressively. A consumer asks for "up to 50 MB or wait 500 ms" in one round-trip, gets a large batch, decompresses it locally, processes the whole thing, then asks for the next batch. One TCP round-trip moves thousands of records.
- Replay is trivial. "Read from offset 0" is the same operation as "read from the tail." The broker has no idea you are reading old data vs new data.
- Multiple independent consumer groups for free. Each group tracks its own offset. Adding a new consumer group adds zero load on the producer side and zero state on the broker.
What pull costs
- Idle latency floor. If a consumer polls every 500 ms and a record arrives at t=10 ms, the consumer sees it ~490 ms later in the worst case. Long-polling (
fetch.max.wait.ms) cuts this, but the floor exists. - Wasted polls when topic is idle. Consumers poll on a schedule even when nothing is happening. Kafka mitigates this with long-poll (broker holds the fetch request until data arrives or timeout), but you still pay the connection cost.
- No fan-out push to many endpoints. You cannot point Kafka at 10,000 mobile devices and have it push notifications. Each consumer needs to maintain a persistent connection and poll. This is what services like SNS, Pusher, and WebSockets are for.
The long-poll behavior
A consumer fetch carries two parameters: fetch.min.bytes (default 1 byte) and fetch.max.wait.ms (default 500 ms). The broker holds the fetch until enough bytes are available OR the timeout expires. If you set min bytes to 1 MB and wait to 500 ms, the broker either returns 1 MB quickly or returns whatever it has after 500 ms. This is the knob for tuning between latency (small min bytes, short wait) and throughput (large min bytes, longer wait).
For request-response patterns or "deliver to this client now" semantics, pull is wrong. Kafka is not trying to solve that problem. If your design needs broker-initiated delivery, you are either using the wrong tool or building a layer (WebSocket gateway, push notification service) in front of Kafka. The 2026 emergence of Share Groups (KIP-932) doesn't change this; share groups are still pull, they just relax the partition-to-consumer mapping.
2.4 Offsets and consumer position
An offset is just a 64-bit integer per partition. Consumers commit offsets back to Kafka via the __consumer_offsets internal topic (which is compacted; latest committed offset per (group, topic, partition) tuple wins). Three commit modes:
| Commit Mode | What It Does | Failure Behavior | When To Use |
|---|---|---|---|
enable.auto.commit=true |
Background commit every auto.commit.interval.ms (default 5s) |
Consumer crash → up to 5 seconds of records reprocessed. At-least-once with a wide window. | Logging, metrics, anything where duplicates are cheap. |
commitSync() |
Blocking commit after each batch | At-least-once, tight window. Consumer crash before commit → reprocess the current batch. | Production default. Pair with idempotent or transactional downstream writes. |
commitAsync() |
Non-blocking commit | Faster but commits can fail silently. Need callback to handle errors. | High-throughput consumers where the occasional retry on shutdown is acceptable. |
Transactional sendOffsetsToTransaction |
Offset commit atomically tied to a produce transaction | Exactly-once across consume-process-produce within Kafka. | Kafka Streams and any consume-transform-produce pipeline needing EOS. |
03Architecture
Kafka is a two-tier distributed system: a data plane of brokers that store partitions and serve produce/fetch RPCs, and a control plane of KRaft controllers running a Raft quorum to manage cluster metadata. Since Kafka 4.0 (March 2025), ZooKeeper is gone; KRaft is the only mode. Clients (producers, consumers, Connect workers, Streams apps) talk to brokers directly over a binary protocol on TCP.
3.1 Cluster Topology
Three node roles, two of which can be co-located:
| Role | What It Does | Quorum / Count | Deployment Reality |
|---|---|---|---|
| Controller | Raft quorum managing cluster metadata: topic configs, partition assignments, leader elections, ACLs. Replicates the metadata log to all brokers. | 3 or 5 nodes (odd for quorum). Tolerates ⌊(N-1)/2⌋ failures. | In small clusters (<20 brokers), often runs combined with brokers (process.roles=broker,controller). In large clusters, runs on dedicated nodes for isolation. |
| Broker | Stores partitions on local disk, serves produce/fetch RPCs, replicates partitions, executes leader/follower roles per partition. | 3 to 100+ nodes. Scales horizontally with partition count. | The thing you spend money on. CPU is rarely the bottleneck; disk IO, network bandwidth, and page cache size dominate sizing. |
| Client | Producer, consumer, Streams app, Connect worker, Admin client. Discovers cluster via bootstrap servers, then refreshes metadata directly from brokers. | Unbounded. | Each client maintains a connection pool to brokers it actively reads/writes from. Metadata refresh is on-demand (cache miss) plus periodic (default 5 min). |
3.2 Broker Internals
A single broker process is structured around two thread pools, a request channel, and a set of long-lived service threads.
The two thread pools
- Network threads (
num.network.threads, default 3). Read bytes off the socket, parse the request envelope, hand the request to the request channel. Write responses back to the socket. They never touch the log. Sized roughly to network connection count; under-sizing causes TCP backpressure on clients. - I/O threads (
num.io.threads, default 8). Dequeue requests, execute the actual work: log append, log read, metadata lookup, replica fetch. These do the disk and page-cache work. Under-sizing causes request queue time to balloon while CPU sits idle on network threads.
Key service threads (not in the request path)
- ReplicaFetcher threads (
num.replica.fetchers, default 1 per source broker). Followers pull data from leaders via long-poll fetches. Increase when many partitions move during a rebalance. - LogCleaner threads. Background compaction for compacted topics. Reads sealed segments, merges them, writes a new segment with latest-value-per-key. Throttle to avoid disk contention with active produces.
- RemoteLogManager threads (Tiered Storage). Upload sealed segments to S3, fetch on miss. Sized via
remote.log.manager.task.thread.pool.size. - GroupCoordinator. Hosts consumer-group and share-group state. Each broker is the coordinator for a subset of groups (mapped via hash of group.id).
- TransactionCoordinator. Manages transactional producer state (PID, epoch, commit/abort markers). Each broker hosts a subset.
The default 3 network / 8 IO is for a dev broker with light load. For production: scale num.network.threads to roughly min(num_cores / 2, 8) and num.io.threads to roughly num_cores. Watch the RequestQueueTimeMs JMX metric: sustained high values mean IO threads can't keep up; sustained low values with high NetworkProcessorAvgIdlePercent means you have too many network threads. The two metrics tell you which pool is the bottleneck.
3.3 KRaft Control Plane in Detail
The control plane runs a Raft quorum over the __cluster_metadata topic. All cluster metadata (topic creates, partition reassignments, ACL changes, broker registrations) is an append to this log. The active controller is the Raft leader; passive controllers replicate the log. Brokers replicate the log too (read-only) so they always have an up-to-date metadata image without going through ZooKeeper-style watches.
What lives in the metadata log
- Topic definitions (name, partition count, configs)
- Partition assignments (which brokers host which replicas)
- ISR membership changes
- Leader epoch changes
- Broker registrations, broker fencing/unfencing
- ACLs, client quotas, dynamic configs
- Feature flags (e.g.
share.version)
Metadata snapshots
The metadata log is compacted via snapshots. When the log grows past a threshold, controllers serialize a full state snapshot to a file (00000000000000XXXXX.checkpoint) and prune the log behind it. New controllers bootstrap from the latest snapshot plus the tail of the log. Without this, the metadata log would grow unbounded.
3.4 Partition Placement and Rack Awareness
When a topic is created, the controller assigns its partitions and replicas to brokers. Two constraints drive placement:
- Even leader spread. Leaders are distributed across brokers so that produce/fetch load is balanced. Round-robin within the broker list.
- Rack-aware replica placement. If
broker.rackis set on each broker (typically to the AZ identifier in cloud deployments), the controller places replicas on different racks. This is what makes RF=3 actually survive an AZ failure.
broker.rack is not set by default. A multi-AZ cluster without it can end up with all 3 replicas of a partition in the same AZ, defeating the AZ-failure tolerance you thought you had. Always set this at broker startup, validate via kafka-topics.sh --describe, and run a controlled AZ-fault test before declaring the cluster production-ready.
3.5 Client-Side Architecture
Clients are not thin wrappers around a TCP socket. The Java client is a substantial state machine with its own background threads, buffers, and metadata cache.
Producer internals
The producer's Sender thread is separate from your application thread. send() returns immediately after enqueueing into the RecordAccumulator. The Sender batches per partition, groups batches per destination broker, sends one ProduceRequest per broker per round-trip, and invokes the callback when the broker acks. This is why throughput depends on linger.ms and batch.size: they govern how long the Sender waits to accumulate a worthwhile batch.
Consumer internals
The consumer is not multi-threaded for record processing; poll() is single-threaded and your code processes the returned batch before the next poll(). But the heartbeat (KIP-848 protocol) and prefetch (Fetcher) run on background threads to overlap I/O with processing. If your processing takes longer than max.poll.interval.ms, you're kicked from the group, even if heartbeats are healthy.
3.6 The Wire Protocol
Kafka has its own binary protocol over TCP. Not HTTP. Not gRPC. The protocol is request/response, length-prefixed, with versioned message schemas. Every API (Produce, Fetch, Metadata, OffsetCommit, JoinGroup, etc.) is a distinct request type with its own request and response schema, identified by an API key.
| Aspect | Behavior | Operational Reality |
|---|---|---|
| Framing | 4-byte length prefix + request header + request body. Same shape for response. | Wireshark dissector exists; kafka-protocol tools can decode for debugging. |
| Versioning | Each API has independently versioned request/response schemas. Clients and brokers negotiate the highest mutually-supported version per API at connection time (ApiVersions request). | This is why clients and brokers can be 1-2 major versions apart and still work. Client < 2.1 cannot talk to broker 4.0+. |
| Connection multiplexing | One TCP connection per (client, broker) pair. Multiple in-flight requests per connection (pipelined). Default cap: 5 in-flight on the producer (max.in.flight.requests.per.connection). |
Pipelining is what makes batched produce throughput fast. Setting in-flight to 1 (sometimes seen for "ordering safety") cripples throughput. |
| Authentication | SASL (PLAIN, SCRAM, GSSAPI/Kerberos, OAUTHBEARER) inside the protocol; TLS at the transport layer. mTLS also supported. | SASL + TLS is the production standard. OAUTHBEARER + OIDC is the 2024-2026 trend for cloud-native deployments. |
| Compression negotiation | Producer-side, per-batch. Broker stores the compressed batch as-is (since 0.11) and forwards it compressed to consumers. Decompression happens at the consumer. | Broker CPU spent on compression is near-zero. Consumer CPU does the work. Compression at producer is one of the highest-leverage tuning knobs. |
04Execution Model
What happens, step by step, when bytes move through Kafka. Five flows matter: producer write, broker request handling, replication, consumer fetch, and consumer-group coordination. Understanding these is what separates "I use Kafka" from "I can debug Kafka."
4.1 Producer Write Path
From producer.send(record) in your application code to a durable, replicated commit:
The idempotence trick
With enable.idempotence=true (default since 3.0), each producer gets a Producer ID (PID) and assigns sequence numbers per partition. The broker tracks (PID, partition) → last-seen sequence number. A retried batch with the same sequence is recognized as a duplicate and silently deduped. This is what makes retries=Integer.MAX_VALUE safe.
The transactional trick
With a transactional.id set, the producer registers with a TransactionCoordinator. Each transaction gets a monotonic epoch. beginTransaction / send / commitTransaction writes a special marker into the partitions involved. Consumers with isolation.level=read_committed skip aborted-transaction records and wait for the commit marker before exposing committed ones (this is the "Last Stable Offset" / LSO).
4.2 Broker Request Handling, in Detail
Every request (Produce, Fetch, Metadata, OffsetCommit, etc.) follows the same network-thread → request-channel → IO-thread path. The processing of Produce and Fetch is what we usually care about.
Produce request handling
Fetch request handling (the zero-copy magic)
The fetch path is the reason Kafka is a 1+ GB/s/broker system on commodity hardware. The data path is: page cache → kernel TCP buffer → NIC. The JVM is in the control path (deciding what to fetch) but never in the data path. Adding TLS forces the kernel to encrypt the bytes, which requires a copy out of page cache into a kernel-encrypted buffer; sendfile no longer applies, and throughput drops 30-50%. Same for any message-level encryption you bolt on.
4.3 Replication Protocol
Replication is followers pulling from leaders via the same Fetch API consumers use. There's no separate replication protocol; followers are just specialized consumers reading their own broker's partitions.
Key invariants
- LEO (Log End Offset): the next offset that will be written. Each replica has its own LEO.
- HW (High Watermark): the highest offset known to be replicated to all ISR members. Consumers can only read up to HW (this is what gives Kafka its "no read uncommitted" guarantee).
- ISR (In-Sync Replicas): replicas whose LEO is within
replica.lag.time.max.msof the leader. Falling behind drops you from ISR; catching up adds you back. - Leader Epoch: a monotonically increasing number that increments on every leader change. Used to detect and recover from divergent logs after a leader change (KIP-101). Without leader epochs, a leader change could silently lose committed data on a follower with stale data; with them, the follower can detect and truncate its log.
What happens when a follower falls behind
- Follower's LEO stops advancing (slow disk, GC pause, network).
- Leader's
ISR shrinkcondition triggers: follower has been >replica.lag.time.max.ms(30s default) behind. - Controller is notified; ISR is updated in
__cluster_metadata. - If ISR size drops below
min.insync.replicas: produce withacks=allstarts failing withNotEnoughReplicasException. - When the follower catches back up, ISR is expanded; produces resume.
Fetch from follower (KIP-392)
Same fetch protocol, but consumers can request reads from a follower in the same AZ. The follower serves data only up to its locally-known HW (which may lag the leader's by milliseconds). The consumer specifies client.rack; the leader returns a redirect on the first fetch ("read from broker X, same rack as you"). Subsequent fetches go directly to the follower.
4.4 Consumer Group Coordination (KIP-848)
Consumer group rebalancing is the trickiest part of Kafka's execution model. KIP-848 (Kafka 4.0 GA) replaced the classic stop-the-world protocol with a server-driven, incremental one.
The classic protocol (legacy, deprecated)
KIP-848 (Kafka 4.0+, GA)
Heartbeat protocol
Members send ConsumerGroupHeartbeat RPCs every group.consumer.heartbeat.interval.ms (default 5s). The session timeout (group.consumer.session.timeout.ms, default 45s) is server-side, no longer client-configurable. Missing 9 consecutive heartbeats removes you from the group. The heartbeat is also the carrier for assignment reconciliation, both directions of state flow on a single RPC.
4.5 Transactions: End-to-End Flow
Transactional producers coordinate with a TransactionCoordinator (a per-broker service, sharded by transactional.id). Each transaction passes through these states:
Why this works
- The TransactionCoordinator writes its own log (
__transaction_state), so coordinator state survives broker restart. - Markers are written into the actual data partitions, so any consumer reading those partitions sees the boundary.
read_committedconsumers maintain an LSO (Last Stable Offset) per partition: the highest offset whose containing transaction has been committed or aborted. Records past LSO are not exposed.- A crashed transactional producer is fenced on restart: the new instance bumps the epoch, the TransactionCoordinator aborts any open transaction from the old epoch. This is how zombie producers are prevented from committing stale data.
Read-committed consumers wait for the LSO to advance, which happens only when a transaction commits or aborts. A long-running transaction (say, 30 seconds) holds back the LSO for all consumers of every partition it touches. If your transactional pipelines run alongside latency-sensitive consumers on the same topic, set transaction.timeout.ms tight (5-15 seconds), and design transactions to be small. Otherwise, your consumers experience their tail latency being controlled by your slowest transaction.
05Feature Usage: Working Code, Production Defaults
Every feature below is shown with the configuration most production deployments end up at, not the documentation default. Defaults are tuned for "small dev cluster, learn the API"; production tuning is different.
5.1Pub/Sub: Producer and Consumer
Producer (idempotent, production defaults)
import org.apache.kafka.clients.producer.*;
import java.util.Properties;
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Durability and ordering
props.put("acks", "all"); // wait for ISR ack, default since 3.0
props.put("enable.idempotence", "true"); // no duplicates on retry, default since 3.0
props.put("max.in.flight.requests.per.connection", 5); // safe with idempotence
props.put("retries", Integer.MAX_VALUE); // rely on delivery.timeout.ms instead
props.put("delivery.timeout.ms", 120000); // 2 min total deadline per record
// Throughput
props.put("compression.type", "lz4"); // zstd is often better for archival
props.put("linger.ms", 20); // wait up to 20ms to batch
props.put("batch.size", 65536); // 64 KB per partition batch
props.put("buffer.memory", 67108864); // 64 MB total buffer
try (Producer<String, String> producer = new KafkaProducer<>(props)) {
// Key drives partition assignment. Same key → same partition → ordered.
ProducerRecord<String, String> record =
new ProducerRecord<>("orders", orderId, payload);
producer.send(record, (metadata, exception) -> {
if (exception != null) {
log.error("Send failed for {}", orderId, exception);
} else {
log.debug("Sent {} → partition {} offset {}",
orderId, metadata.partition(), metadata.offset());
}
});
}
acks=1 looks faster but is a footgun. The leader writes locally and acks before replication. A leader crash between local write and replication loses the record silently. The throughput delta vs acks=all is often <10% with idempotence on; the durability delta is enormous.
linger.ms=0 starves your batches. The producer sends one record per request. Throughput plummets, broker load multiplies. 5-20ms is the sweet spot for most workloads.
Setting retries=0 to "fail fast" defeats acks=all. Transient leader elections will drop records that would have succeeded on retry. Use delivery.timeout.ms to bound total time instead.
Consumer (KIP-848 new protocol, manual commit)
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.*;
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("group.id", "order-processor");
props.put("group.protocol", "consumer"); // KIP-848 protocol (Kafka 4.0+)
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// Manual commit, read only committed records (skip aborted transactions)
props.put("enable.auto.commit", "false");
props.put("isolation.level", "read_committed");
// Tune throughput vs latency
props.put("max.poll.records", 500); // records per poll()
props.put("max.poll.interval.ms", 300000); // processing deadline, exceed → kicked from group
props.put("fetch.min.bytes", 1048576); // wait for 1 MB of data
props.put("fetch.max.wait.ms", 500); // or 500 ms, whichever first
// Fetch from same-AZ follower to cut cross-AZ traffic (KIP-392)
props.put("client.rack", "us-east-1a");
try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
consumer.subscribe(List.of("orders"));
while (!Thread.currentThread().isInterrupted()) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records) {
// Process. If processOrder throws, the loop bails and we DO NOT commit.
// Next poll will re-deliver from the last committed offset. At-least-once.
processOrder(record.value());
}
if (!records.isEmpty()) {
consumer.commitSync(); // commit after successful processing
}
}
}
The classic rebalance protocol is "stop the world": every consumer drops its partitions, the coordinator computes a new assignment, every consumer picks up. A 100-partition group with 10 consumers spent 100+ seconds rebalancing on a single member leave. KIP-848 moves the assignment to the broker-side group coordinator and makes it incremental, the same 100-partition group rebalances in single-digit seconds, and consumers whose assignment did not change keep processing. Set group.protocol=consumer on all new consumers. The classic protocol is in Phase 1 of deprecation as of Kafka 4.3.
5.2Kafka Streams: Stateful Processing
Mental model
KStream is a stream of records (insert log). KTable is a changelog interpreted as a materialized view (latest value per key). GlobalKTable is a KTable fully replicated to every Streams instance (no co-partitioning needed). Joins between these have different semantics and different cost.
Example: enrich orders with customer data, aggregate by region in 5-min windows
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/streams");
// KIP-1071: new server-side Streams rebalance protocol (GA in Kafka 4.2)
props.put(StreamsConfig.GROUP_PROTOCOL_CONFIG, "streams");
StreamsBuilder builder = new StreamsBuilder();
// Source: stream of orders, keyed by customerId
KStream<String, Order> orders = builder.stream("orders",
Consumed.with(Serdes.String(), orderSerde));
// Materialized table of customers, keyed by customerId
KTable<String, Customer> customers = builder.table("customers",
Consumed.with(Serdes.String(), customerSerde));
// Stream-table join: enrich each order with customer attributes.
// Requires orders and customers co-partitioned on the same key.
KStream<String, EnrichedOrder> enriched = orders
.leftJoin(customers, (order, customer) -> new EnrichedOrder(order, customer));
// Aggregate revenue per region in 5-minute tumbling windows
KTable<Windowed<String>, RevenueAgg> revenue = enriched
.selectKey((k, v) -> v.region())
.groupByKey(Grouped.with(Serdes.String(), enrichedOrderSerde))
.windowedBy(TimeWindows.ofSizeAndGrace(
Duration.ofMinutes(5), Duration.ofSeconds(30)))
.aggregate(
RevenueAgg::new,
(region, order, agg) -> agg.add(order.amount()),
Materialized.<String, RevenueAgg, WindowStore<Bytes, byte[]>>
as("revenue-store")
.withValueSerde(revenueAggSerde));
revenue.toStream()
.map((windowedKey, agg) -> KeyValue.pair(windowedKey.key(),
new RevenueResult(windowedKey.window().startTime(), agg.total())))
.to("revenue-by-region-5min", Produced.with(Serdes.String(), revenueResultSerde));
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
What's actually happening under the hood
- Local state in RocksDB. The
revenue-storeis an embedded RocksDB instance per partition assigned to this Streams instance. Reads and writes are local; the disk is yours. - Changelog topic for durability. Every update to local state is also written to a changelog topic (Streams creates this for you with the name
{application-id}-{store-name}-changelog). On restart or rebalance, state is restored by replaying the changelog. This is why "Streams state" survives a host failure. - Co-partitioning required for joins. If
ordershas 12 partitions andcustomershas 6, the join is rejected at topology build time. Both must have the same partition count and same partitioner. - EOS v2 uses transactions. Read input → update state + write changelog + write output, all in one Kafka transaction. Crashes leave no half-state.
Repartitioning is automatic and silent. selectKey followed by an aggregation triggers an internal repartition topic. This doubles your network and storage cost for that pipeline. Check the topology before deploying.
Changelog topics are real topics. They count against your partition budget, broker storage, and replication traffic. A complex Streams app can quietly add hundreds of partitions to your cluster.
State restoration is slow on first start. Restoring a multi-GB RocksDB store from a changelog topic takes minutes. Use standby.replicas=1 to keep a warm replica so failover does not wait for restoration.
5.3Kafka Connect: Source and Sink Pipelines
Architecture in one paragraph
Connect runs as a separate JVM cluster of "worker" nodes. Connectors (plugins) define source and sink logic. Tasks are the parallel work units within a connector. State (offsets, configs, status) is persisted to Kafka topics (connect-offsets, connect-configs, connect-status). The REST API on each worker exposes connector lifecycle. Workers coordinate via the same KIP-848 protocol used by consumers.
Source: Postgres CDC via Debezium with SMTs
POST http://connect-worker:8083/connectors
Content-Type: application/json
{
"name": "orders-pg-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1", // CDC is single-task per database
"database.hostname": "pg-primary.internal",
"database.port": "5432",
"database.user": "debezium",
"database.password": "${file:/secrets/db-pw:debezium}",
"database.dbname": "orders_db",
"topic.prefix": "pg.orders_db",
"plugin.name": "pgoutput",
"slot.name": "debezium_orders",
"table.include.list": "public.orders,public.order_items",
"snapshot.mode": "initial",
// Single Message Transforms (SMTs)
"transforms": "unwrap,route,filter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "pg.orders_db.public.(.*)",
"transforms.route.replacement": "cdc.orders.$1",
"transforms.filter.type": "org.apache.kafka.connect.transforms.Filter",
"transforms.filter.predicate": "is-not-system-table",
// Exactly-once support (Kafka Connect 3.3+)
"exactly.once.support": "required",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter.schema.registry.url": "http://schema-registry:8081"
}
}
Sink: write enriched orders to S3 in Parquet, partitioned by date
POST http://connect-worker:8083/connectors
{
"name": "orders-s3-sink",
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "8", // parallel tasks, capped at partition count
"topics": "enriched-orders",
"s3.bucket.name": "data-lake-prod",
"s3.region": "us-east-1",
"topics.dir": "raw/kafka",
"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
"parquet.codec": "snappy",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "3600000", // 1 hour partitions
"path.format": "'dt'=YYYY-MM-dd/'hr'=HH",
"locale": "en-US",
"timezone": "UTC",
"timestamp.extractor": "Record",
"flush.size": "100000", // 100k records per S3 file
"rotate.interval.ms": "600000", // or 10 min, whichever first
"behavior.on.null.values": "ignore",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "orders-s3-sink-dlq",
"errors.deadletterqueue.context.headers.enable": "true"
}
}
SMTs (Single Message Transforms) are great for "rename a field, drop a column, route by regex." They become a liability when you find yourself chaining 6 of them with custom predicates. At that point, write the transformation in Streams or Flink between source-connector and sink-connector. SMTs are not testable in isolation, not version-controlled with your application code, and not observable beyond connector-level metrics. Reach for them, then walk back to a real processing layer when complexity creeps.
5.4Schema Registry: Schemas as a Service
What it actually does
Schema Registry stores Avro/Protobuf/JSON Schemas, assigns each one an integer schema ID, and serves them via REST. The serializer in the producer registers the schema (or looks it up) and prepends the 5-byte ID to every record (1 magic byte + 4-byte schema ID + payload). The consumer reads the ID, fetches the schema, deserializes. Without Schema Registry, every consumer needs the schema baked in. With it, schema becomes a runtime concern.
Compatibility modes
| Mode | What It Enforces | What You Can Change | When To Use |
|---|---|---|---|
BACKWARD (default) |
New schema can read data written with old schema | Add optional fields, remove fields | Upgrade consumers first, then producers. The standard rolling-upgrade direction. |
FORWARD |
Old schema can read data written with new schema | Add required fields, remove optional fields | Upgrade producers first, then consumers. Rare in practice. |
FULL |
Both forward and backward compatible | Only add or remove optional fields with defaults | When you genuinely do not know which side upgrades first. |
NONE |
No checks | Anything | Don't. |
*_TRANSITIVE |
Compatibility against all prior versions, not just the latest | Same as above, applied across history | Mature systems where consumers may be pinned to old versions. |
Producer with Avro and Schema Registry
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.generic.GenericRecord;
Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", KafkaAvroSerializer.class.getName());
props.put("schema.registry.url", "http://schema-registry:8081");
// PE production defaults:
props.put("auto.register.schemas", "false"); // register via CI, not at runtime
props.put("use.latest.version", "true"); // always serialize with the registered latest
props.put("latest.compatibility.strict", "true"); // fail at startup if schemas drift
try (Producer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
GenericRecord order = new GenericData.Record(orderSchema);
order.put("orderId", "ord-7821");
order.put("customerId", "cust-42");
order.put("amount", 29.99);
producer.send(new ProducerRecord<>("orders", "ord-7821", order));
}
Schema evolution: add an optional field
// V1 schema (already registered)
{
"type": "record", "name": "Order",
"fields": [
{"name": "orderId", "type": "string"},
{"name": "customerId", "type": "string"},
{"name": "amount", "type": "double"}
]
}
// V2 schema: add couponCode (optional, default null) → BACKWARD compatible
{
"type": "record", "name": "Order",
"fields": [
{"name": "orderId", "type": "string"},
{"name": "customerId", "type": "string"},
{"name": "amount", "type": "double"},
{"name": "couponCode", "type": ["null", "string"], "default": null}
]
}
# Register via CLI (use in CI, not at runtime)
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data @order-v2.json \
http://schema-registry:8081/subjects/orders-value/versions
auto.register.schemas=true at runtime is a security hole. A buggy producer can register a half-broken schema and break every consumer. Always register schemas in CI/CD against a staging registry first.
Subject naming strategy matters. Default is TopicNameStrategy ({topic}-value), which means one schema per topic. TopicRecordNameStrategy allows multiple record types per topic. Pick once, do not change, the strategy is baked into producer/consumer code.
Removing a field is FORWARD compatible, not BACKWARD. Many teams default to BACKWARD and then get surprised when "I just removed an unused field" breaks them.
5.5Transactions: Exactly-Once Semantics
What Kafka transactions actually guarantee
Transactional producers write all-or-nothing across multiple partitions and topics, and (in consume-process-produce) atomically commit consumer offsets and produced records together. Consumers with isolation.level=read_committed skip records from aborted transactions. The guarantee scope is Kafka-to-Kafka only.
Consume-process-produce with EOS
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
// Producer config
Properties prodProps = new Properties();
prodProps.put("bootstrap.servers", "broker:9092");
prodProps.put("transactional.id", "order-enricher-instance-1"); // unique per process
prodProps.put("enable.idempotence", "true"); // required for txn
prodProps.put("transaction.timeout.ms", 60000);
// Consumer config
Properties consProps = new Properties();
consProps.put("bootstrap.servers", "broker:9092");
consProps.put("group.id", "order-enricher");
consProps.put("isolation.level", "read_committed");
consProps.put("enable.auto.commit", "false"); // MUST be false for EOS
Producer<String, EnrichedOrder> producer = new KafkaProducer<>(prodProps);
Consumer<String, Order> consumer = new KafkaConsumer<>(consProps);
producer.initTransactions();
consumer.subscribe(List.of("orders"));
while (true) {
ConsumerRecords<String, Order> records = consumer.poll(Duration.ofSeconds(1));
if (records.isEmpty()) continue;
try {
producer.beginTransaction();
for (ConsumerRecord<String, Order> record : records) {
EnrichedOrder enriched = enrich(record.value());
producer.send(new ProducerRecord<>("enriched-orders",
record.key(), enriched));
}
// Critical: commit offsets WITHIN the transaction, not via consumer.commitSync()
Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
for (TopicPartition tp : records.partitions()) {
List<ConsumerRecord<String, Order>> partRecs = records.records(tp);
long lastOffset = partRecs.get(partRecs.size() - 1).offset();
offsets.put(tp, new OffsetAndMetadata(lastOffset + 1));
}
producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException e) {
// Producer is fenced (another instance took over) or fatal error. Bail.
producer.close();
throw e;
} catch (KafkaException e) {
// Transient error. Abort and reprocess.
producer.abortTransaction();
}
}
Side effects break EOS. If your enrich() function calls an external HTTP API, that call happens regardless of whether the transaction commits or aborts. The API may be called twice (re-poll after abort). EOS does not extend across the Kafka boundary.
transactional.id must be unique and stable per logical producer. Reusing IDs across instances causes ProducerFencedException. Generating new ones per restart defeats fencing entirely.
Read-committed adds latency. Consumers wait for the LSO (last stable offset), not the high watermark. Long-running transactions stall all consumers reading that partition.
EOS in Kafka Streams is the cleanest path. If your whole pipeline is consume-from-Kafka-and-produce-to-Kafka, use Streams with processing.guarantee=exactly_once_v2. The transactional wiring is automatic.
5.6Tiered Storage: Hot Local, Cold S3
Broker-side config
# server.properties
remote.log.storage.system.enable=true
remote.log.storage.manager.class.name=org.apache.kafka.server.log.remote.storage.RemoteLogManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.storage.TopicBasedRemoteLogMetadataManager
# S3-backed plugin (Aiven's TS plugin is the de facto open-source choice)
remote.log.storage.manager.class.path=/opt/kafka/plugins/tiered-storage-s3/*
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.s3.S3Storage
rsm.config.storage.s3.bucket.name=kafka-tiered-storage-prod
rsm.config.storage.s3.region=us-east-1
rsm.config.chunk.size=4194304 # 4 MB chunks
rsm.config.compression.enabled=true
# Quota (KIP-956): cap remote fetch/upload throughput
remote.log.manager.copy.max.bytes.per.second=104857600 # 100 MB/s upload
remote.log.manager.fetch.max.bytes.per.second=209715200 # 200 MB/s fetch
Per-topic enablement
# Enable tiered storage on a topic, keep 1 day local, 90 days total
kafka-configs.sh --bootstrap-server broker:9092 \
--entity-type topics --entity-name events \
--alter \
--add-config remote.storage.enable=true,\
local.retention.ms=86400000,\
retention.ms=7776000000
# Dynamically disable and re-enable (KIP-950, KRaft mode only)
kafka-configs.sh --bootstrap-server broker:9092 \
--entity-type topics --entity-name events \
--alter --add-config remote.storage.enable=false
Tiered Storage cuts the broker EBS bill. It does nothing about cross-AZ replication (still 3x replication factor on hot data) or cross-AZ consumer fetch traffic. In a multi-AZ AWS deployment, networking is typically 50-80% of total Kafka spend. Tiered Storage attacks the smaller line item. If cross-AZ is your real problem, look at KIP-392 (Fetch From Follower) for consumer-side savings, then consider truly diskless options (AutoMQ, Aiven Inkless, the WarpStream-IBM stack) for replication-side savings. See the Advanced section for the full diskless landscape.
5.7KRaft: Operating Kafka Without ZooKeeper
What changed
ZooKeeper was Kafka's external metadata service for 14 years. KRaft (Kafka Raft) replaces it with an internal Raft quorum running on dedicated "controller" nodes (or co-located with brokers in small clusters). Metadata lives in a single internal topic, __cluster_metadata, replicated via Raft. Controller failover is now sub-second instead of tens of seconds. Partition scale ceiling moved from ~200k to ~1.9M per cluster.
Bootstrap a fresh KRaft cluster
# Step 1: generate a cluster UUID once, share across all nodes
KAFKA_CLUSTER_ID="$(./bin/kafka-storage.sh random-uuid)"
# Step 2: format storage on each node with that UUID
./bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties
# Step 3: start brokers
./bin/kafka-server-start.sh config/kraft/server.properties
Combined mode (broker + controller in one process)
# config/kraft/server.properties
process.roles=broker,controller
node.id=1
# All controller nodes listed here, on every node
controller.quorum.voters=1@host1:9093,2@host2:9093,3@host3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://host1:9092,CONTROLLER://host1:9093
advertised.listeners=PLAINTEXT://host1:9092
log.dirs=/var/lib/kafka/data
num.partitions=3
default.replication.factor=3
min.insync.replicas=2
Operational verbs that changed
| Task | ZooKeeper Mode (legacy) | KRaft Mode |
|---|---|---|
| Inspect cluster metadata | zkCli.sh ls /brokers |
kafka-metadata-quorum.sh --describe status |
| List topics | kafka-topics.sh --zookeeper ... |
kafka-topics.sh --bootstrap-server ... (only) |
| Add a controller | Restart with new zookeeper.connect |
kafka-metadata-quorum.sh add-controller (dynamic) |
| Force a controller election | Delete ZK node, hope for the best | Built-in via metadata quorum; election is automatic |
5.8Share Groups (Queues for Kafka): KIP-932
Why this exists
For 14 years, the rule was "partition = unit of parallelism, max consumers per group = partition count." Want 1000 concurrent workers? Make 1000 partitions. That model fights real workloads where you want to scale consumers freely without committing to a partition count upfront.
Share Groups break the rule. A share group lets many consumers cooperatively consume from the same partition with per-record acknowledgment. The whole topic becomes a queue. You lose strict per-key ordering. You gain elastic scaling and SQS-like semantics, inside Kafka.
Share group consumer
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.consumer.AcknowledgeType;
Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("group.id", "image-processors");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
// Per-record ack mode
props.put("share.acknowledgement.mode", "explicit");
try (ShareConsumer<String, byte[]> consumer = new KafkaShareConsumer<>(props)) {
consumer.subscribe(List.of("image-jobs"));
while (true) {
ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(2));
for (ConsumerRecord<String, byte[]> record : records) {
try {
processImage(record.value());
// Acknowledge as ACCEPT: remove from delivery set
consumer.acknowledge(record, AcknowledgeType.ACCEPT);
} catch (TransientException e) {
// Release: redeliver to another consumer, increment delivery count
consumer.acknowledge(record, AcknowledgeType.RELEASE);
} catch (PoisonException e) {
// Reject: send to dead-letter handling; will not be redelivered
consumer.acknowledge(record, AcknowledgeType.REJECT);
}
}
consumer.commitSync(); // flush ack state to share coordinator
}
}
Topic-side: enable share group on a topic
# Cluster-wide feature flag (one-time, KRaft)
kafka-features.sh --bootstrap-server broker:9092 \
upgrade --feature share.version=1
# Configure share-group settings (delivery count threshold, lock timeout)
kafka-configs.sh --bootstrap-server broker:9092 \
--entity-type groups --entity-name image-processors \
--alter --add-config \
group.share.delivery.count.limit=5,\
group.share.record.lock.duration.ms=30000
# Inspect
kafka-share-groups.sh --bootstrap-server broker:9092 --describe \
--group image-processors
Share Groups are the right answer for job-queue patterns where you already have Kafka and want to avoid running SQS or RabbitMQ alongside it. They are not the right answer when (a) you need strict per-key ordering, (b) you need EOS, or (c) the workload is naturally low-throughput and SQS would cost less. Treat Share Groups as a feature you adopt deliberately, not as the new default for consumers.
06Trade-Offs
Each row is a deliberate engineering choice baked into Kafka's design. Several of these have been challenged by next-gen alternatives (see Section 11), but the trade-offs are still load-bearing in any cluster you run today.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Partition count is permanent | Constant-time key-to-partition routing, simple consumer assignment, ordered streams per key | Resizing breaks the modulo, key K moves to a different partition, per-key ordering broken across resize | Year 2 when traffic 4x and your 12-partition topic caps consumer parallelism at 12 workers; you cannot add more without breaking ordering | Most teams overshoot intentionally: pick partition count = 2-4x current need, accept slightly worse batching now to avoid the rewrite later. Powers of 2 simplify any future migration via a partition-doubling strategy with parallel consumers. |
| Pull consumer model | Consumer-driven backpressure, broker carries zero per-consumer state, batched fetches, trivial replay | Idle latency floor (consumers poll on a clock), wasted polls when topic is quiet, no broker-initiated push | Mobile push notifications, low-volume real-time UI updates, anything requiring sub-100ms delivery to thousands of endpoints | The Kafka philosophy is "fan out at the edge, not at the broker." If you need push to many endpoints, put a WebSocket gateway between Kafka and clients. The gateway pulls; clients get push. Do not try to make Kafka itself push. |
| Ordering only within a partition | Linear scalability of writes and reads across many partitions; no global coordination | Cross-partition ordering is impossible. Two events with different keys have no defined order. | "User signup" on partition A, "user first login" on partition B, login arrives before signup record is visible; downstream state machine breaks | The fix is partition key design, not Kafka feature flags. Choose the key as the entity whose ordering you actually care about. A single global ordering is impossible at Kafka's scale, so accept partition-level ordering as the contract and design around it. |
| acks=all + min.insync.replicas=2 for durability | No data loss on single-broker failure; ISR contract enforces durability before ack | Write availability requires ≥2 healthy replicas. Lose 2 of 3 brokers in a 3-AZ setup, writes block. | AZ outage in a 3-AZ deployment with RF=3 and min.isr=2; one AZ down leaves 2 replicas, surviving an additional broker loss in either remaining AZ blocks writes | There is no free lunch between "never lose data" and "always accept writes." min.isr=2 with RF=3 chooses durability. min.isr=1 chooses availability but allows a single ack-then-leader-dies scenario to lose committed records. Pick consciously; do not let the default surprise you in an incident. |
| Cross-AZ replication 3x for durability | Survives full-AZ failure with no data loss | Every 1 GB produced generates 2 GB of cross-AZ replication traffic, plus consumer fetch traffic spread across AZs | Cloud bill audit: ~70% of total Kafka spend is network egress, not compute or storage | KIP-392 (fetch-from-follower) cuts consumer-side cross-AZ. KIP-881 (rack-aware assignors) makes it stick. The replication traffic itself is not negotiable in stock Kafka. This is the central reason the diskless wave (AutoMQ, WarpStream, KIP-1150) exists. |
| EOS stops at the Kafka boundary | Strong exactly-once across consume-process-produce within Kafka; ideal for Streams | Side effects to external systems (HTTP, DB, S3) are not transactional with Kafka commits | "Process order, charge customer via Stripe, produce confirmation event" - transaction aborts after Stripe charge, charge stays, event does not. Customer charged twice on retry. | The classic answer is the outbox pattern: write the side-effect intent into a DB transaction with your business write, then a CDC connector publishes to Kafka. Kafka does not solve this for you. Pretending EOS extends across systems causes correctness incidents. |
| Stateless brokers about consumers | Broker scales to millions of consumer connections without per-consumer state | No native per-message ack, no per-consumer retry, no priority delivery; consumer-state tracking is in __consumer_offsets only |
You want SQS semantics. Build them yourself with manual offset management, DLQ topics, and consumer-side retry. Or, since Kafka 4.2, Share Groups (KIP-932). | Share Groups GA in 4.2 is the first real answer Kafka has to this trade-off. Until you have Share Groups, the long-running pattern is: separate retry topics with exponential backoff, DLQ topic for terminal failures, idempotent consumer logic. |
| Page cache + zero-copy reads | Effectively free reads of recent data (RAM speeds); 1 GB/s+ per broker is realistic | TLS or client-side encryption breaks zero-copy; throughput drops 30-50% | Compliance mandate adds TLS, throughput regresses, capacity plan invalidated | Plan for TLS in capacity sizing from day 1, not as an afterthought. The throughput delta is real and reproducible. Modern CPUs with AES-NI mitigate some of it, but the sendfile fast path is genuinely gone. |
| Stateful Streams via local RocksDB + changelog topics | State co-located with compute, sub-ms state access, survives failure via changelog replay | State restore on rebalance is slow (minutes for multi-GB stores); changelog topics double storage and replication cost | Streams app restarts after deploy, takes 8 minutes to restore RocksDB stores, downstream consumers see processing gap | Use num.standby.replicas=1 to keep a warm replica per partition. The replica restores in parallel; failover is instant. Costs ~2x state storage but eliminates the restore-time gap. Mandatory for any Streams app with strict SLAs. |
07Use Cases
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Event sourcing as the system of record | LinkedIn (original use case), large fintechs replacing batch ETL with event-driven architectures | Replayable durable log with ordered per-key streams and 7+ day retention; new consumers can rebuild state from offset 0 | LinkedIn runs 7+ trillion messages/day across its Kafka fleet; multi-PB retained data; 100+ MB/s per topic sustained | Traditional message queues (RabbitMQ, SQS) delete messages on ack. Replay requires a separate store. Kafka removes that store. |
| CDC pipeline from operational DB to data lake | Mid-size SaaS replacing nightly ETL with Debezium + Kafka + S3 sink | Sub-minute end-to-end latency for OLTP → lake replication with at-least-once semantics | 2000+ tables across 30 Postgres instances, 50K events/sec sustained, 24-hour replay window | Direct DB replication (logical slots → consumer) has no fan-out; adding a second consumer doubles replication slot load. Kafka decouples. |
| Real-time analytics aggregation | Ad-tech firm computing per-minute click attribution and revenue rollups via Kafka Streams + Flink | Stateful windowed aggregation with exactly-once semantics; p99 end-to-end latency under 5 seconds at 200K events/sec | 200K events/sec peak, 5-minute tumbling windows, 30-day retention for late-arriving data | Spark Streaming has micro-batch latency floor (seconds). Flink is the alternative, but Kafka Streams keeps state co-located which simplifies ops at this scale. |
| Microservice decoupling backbone | E-commerce platform with 60+ services exchanging order, inventory, fulfillment, and payment events | Service can be deployed independently of consumers; new services consume historical events to bootstrap state | 15K events/sec sustained across ~400 topics, 50+ producer services, 200+ consumer services | Synchronous REST creates a brittle service mesh: one slow service stalls the entire request chain. Kafka absorbs the impedance mismatch. |
| Application log and metric aggregation | Datadog-style observability pipeline before storage in TSDB/object store | Massive write throughput, lossy-acceptable durability, slow consumers do not block fast producers | 1M+ log events/sec, 30-minute retention, ~100 consumer services tailing | Kinesis is similar in concept but has 1000 records/sec per shard cap and 24-hour retention default. Kafka is more elastic at this volume. |
| Materialized views as Kafka topics | Banking ledger projecting account balances from a transactions topic via Kafka Streams | Compacted topic gives "latest balance per account" with key-based access; rebuildable from source-of-truth transactions | 200M+ accounts, 50K transactions/sec, compacted topic ~80 GB, materialized on every Streams instance | Traditional DB-backed projection requires explicit cache invalidation. Compacted topic + Streams = self-healing projection that survives any service restart. |
| Fraud detection feature pipeline | Payment processor enriching transaction stream with rolling per-user features for ML inference | Stateful joins against per-user history with millisecond-scale processing latency | 30K transactions/sec, 5-minute and 1-hour windows per user, ~10M active users in working set | Batch feature stores (Feast in batch mode) miss real-time signals. Online feature stores layered on Kafka Streams give both. |
08Limitations
| Limitation | Severity | Workaround | Workaround Cost |
|---|---|---|---|
| Partition count is irreversible without breaking key ordering | Critical | Over-provision at topic creation (4x expected need). For existing topics, create a new topic with more partitions and run a dual-write/dual-consume migration. | Over-provisioning wastes broker resources (open files, controller metadata). Migration takes weeks of careful dual-write, dual-consume, cutover. |
| Cross-AZ replication traffic dominates cloud cost | Critical | KIP-392 fetch-from-follower (consumer-side), KIP-881 rack-aware assignors, ultimately move to diskless (AutoMQ, WarpStream, Aiven Inkless). | KIP-392/881 fix consumer side only; replication traffic remains. Diskless migrations require platform-level architecture changes and may sacrifice low-latency. |
| No native per-message ack / DLQ until Kafka 4.2 | High | Manual offset management, retry topics with backoff, DLQ topic for terminal failures. Or upgrade to 4.2+ and use Share Groups. | Pre-4.2: 100+ lines of boilerplate per consumer to do what SQS gives for free. Share Groups GA: clean, but requires upgrade and learning a new API. |
| Compacted topics do not work with Tiered Storage | High | Use delete retention with very long ms, accept the disk cost; or store latest-value state in an external KV store and emit change events to a non-compacted Kafka topic. | External KV duplicates the "latest value per key" semantics Kafka offers natively. Long retention with delete policy uses brokers as cheap object storage. |
| EOS does not extend beyond Kafka | High | Outbox pattern: write side-effect intent inside a DB transaction with the business write, CDC publishes to Kafka. | Requires CDC infrastructure (Debezium), an outbox table per service, careful schema design. Real engineering effort. |
| Hot partition cannot be split without re-keying | High | Re-key to a more granular partition key (customerId → customerId+shardId), then repartition. Or salt the hot key across multiple partitions and merge downstream. | Re-keying breaks consumer offset continuity. Salting requires downstream consumers to handle multi-partition gather, breaking the simple "one partition = one consumer = one ordering domain" model. |
| TLS + zero-copy are mutually exclusive | Medium | Accept the 30-50% throughput drop, size capacity accordingly. Or terminate TLS at a sidecar (Envoy) and run Kafka cleartext on a private network. | Capacity: ~30-50% more brokers. Sidecar approach adds operational complexity and a privileged-network assumption. |
| No priority queues | Medium | Separate topics per priority, consumer reads from high-priority topic first with backpressure-aware polling. | Manual priority logic in consumer; "starvation" risk for low-priority topics if high-priority is saturated. |
| Connector framework is a separate cluster to operate | Medium | Run Connect as a managed service (Confluent, Aiven), or use lightweight alternatives (Debezium Server, MirrorMaker 2 standalone). | Managed Connect adds vendor cost; standalone alternatives lack the distributed task rebalancing of full Connect. |
| Schema Registry is not Apache Kafka | Medium | Confluent Schema Registry (Confluent Community License) is most common; Apicurio (Apache 2.0) is the open-source alternative. | License sensitivity in some orgs; Apicurio is fully OSS but has smaller ecosystem support and fewer integrations. |
09Fault Tolerance
Single-tech deep dive: dimension, behavior, and the operational reality you discover on the on-call rotation.
| Dimension | Behavior | Operational Reality |
|---|---|---|
| Replication model | Leader-follower per partition. Replication factor configurable per topic (default 3 in production). One broker is the leader for a partition; followers fetch from leader. | Replication factor 3 with min.insync.replicas=2 is the production baseline. RF=2 is a footgun: lose one broker and writes block immediately. |
| Failure detection | Brokers send heartbeats to the active controller. Follower lag tracked by leader; follower drops from ISR if it lags beyond replica.lag.time.max.ms (default 30s). |
KRaft (Kafka 4.0+) detects broker failure in single-digit seconds vs ZooKeeper's tens-of-seconds. ISR fluctuations under load are normal and not always failures, look for sustained ISR shrinkage, not transient. |
| Failover mechanism | Controller elects a new leader from the ISR (in-sync replica set). Clients receive metadata update via fetch error and refresh routing. No client-side failover code needed. | Sub-second leader election in KRaft. Producer/consumer clients seamlessly retry. The pause is short enough that p99 latency users often do not notice. |
| RTO (typical) | Broker failure: 1-3 seconds for in-cluster failover (KRaft). AZ failure: same, as long as ≥1 ISR replica is in a surviving AZ. Region failure: minutes to hours (depends on MM2/Cluster Linking setup). | RTO for in-region failures is excellent. Cross-region DR is a separate engineering effort, not a feature you flip on. Most teams accept 5-10 minutes of writes lost in true region failure. |
| RPO (typical) | RPO=0 for in-region failures with acks=all and min.isr=2. Cross-region: bounded by MirrorMaker 2 lag (typically seconds, can spike to minutes under load). | "Zero data loss" requires acks=all AND min.isr ≥ 2. acks=1 with min.isr=2 still loses data if the leader crashes before replicating. |
| Split-brain behavior | Prevented by Raft quorum (controllers) and epoch fencing (partition leaders). Old leaders cannot ack writes after losing leadership; clients retry against the new leader. | KRaft eliminated the historical ZooKeeper split-brain class. Eligible Leader Replicas (KIP-966) further reduce edge cases around minority-partition leader election. |
| Blast radius of single-node failure | For each partition the failed broker led, sub-second leader re-election to a follower. Partitions where the failed broker was a follower lose one ISR replica. | If the failed broker led many partitions, the surviving brokers see a thundering herd of leader-promotion. Plan for transient latency spike. Spread leaders evenly via auto.leader.rebalance.enable=true. |
| Cross-region failover story | Active-passive: MirrorMaker 2 or Cluster Linking replicates to a DR region. Consumer offsets do not translate cleanly across clusters; tools like RemoteClusterUtils help. Active-active: rare in stock Kafka, common in Confluent's stretched clusters. | Cross-region DR remains a manual cutover for most. The hard part is not the data, it's the consumer offset translation and the producer client redirection. Practice the failover in staging quarterly or you will get it wrong in prod. |
| Data loss scenarios | (1) acks=1 + leader crash before replication. (2) Unclean leader election enabled + all ISR lost simultaneously + out-of-sync replica promoted. (3) Disk failure on a broker with replication factor 1 (unusual but possible on test topics). | Default of unclean.leader.election.enable=false trades availability for durability: if all ISR is lost, the partition becomes unavailable until ISR recovers. Most teams accept this. Flipping it to true is a deliberate decision to lose data on rare double-failures. |
| Controller quorum failure (KRaft) | Controller quorum (typically 3 or 5 nodes) tolerates ⌊(N-1)/2⌋ failures. Loss of majority blocks metadata changes (cannot create topics, elect leaders), but existing leaders continue serving reads/writes. | Data plane survives controller-quorum loss. Control plane does not. This is a more graceful degradation than ZooKeeper, where the same loss class would block everything. |
For any non-trivial Kafka deployment, the durability triad is: acks=all, min.insync.replicas=2, replication.factor=3, and unclean.leader.election.enable=false. This combination gives RPO=0 in single-AZ failure scenarios. It also means a 2-AZ failure can block writes, which is the trade you make. Any deviation from this should be a deliberate, documented choice with a written justification.
11Replication
| Dimension | Behavior | Operational Reality |
|---|---|---|
| Replication topology | Leader-follower per partition. Single leader per partition at any time; followers fetch from leader and apply changes in order. | This is the only topology Kafka supports. Multi-leader exists only via Confluent's Cluster Linking or in cross-cluster setups (MM2 active-active). Stock open-source Kafka is leader-follower. |
| Sync vs async | Synchronous within ISR (In-Sync Replicas). With acks=all, producer waits for all ISR members to ack. Out-of-ISR replicas catch up asynchronously. |
"Sync replication" is sync only to the ISR, not to all replicas. If a follower falls behind, it's removed from ISR and the leader stops waiting for it. The set is dynamic and changes under load. |
| Replication factor (default / max) | Configurable per topic. Production default 3. Maximum bounded by broker count. RF=1 means no replication; RF=N requires N brokers. | RF=3 is the answer for almost every production workload. RF=5 for ultra-critical (regulatory archives, financial event logs). RF=2 is wrong; lose one broker, writes block. |
| Consistency level options | Producer-side: acks=0 (fire-and-forget), acks=1 (leader ack only), acks=all (all ISR ack). Consumer-side: isolation.level=read_uncommitted or read_committed. |
Production producers should use acks=all + enable.idempotence=true (latter is default since 3.0). Consumers should use read_committed when downstream of a transactional producer. |
| Replication lag (typical) | Within-region: sub-second under normal load, single-digit ms for low-throughput topics. Cross-region (MM2): seconds to tens of seconds, sensitive to inter-region network. | ISR lag spikes during broker restarts, partition rebalances, and disk pressure. Alert on sustained lag > 5s; transient blips are normal. |
| Conflict resolution | No conflicts within a cluster: single leader per partition, all writes serialized. Cross-cluster (active-active with MM2): no automatic resolution. Both sides accept writes; downstream consumers see both. | Active-active cross-cluster setups require application-level conflict design (last-writer-wins is a lie at scale; you need CRDTs or domain-specific reconciliation). Most teams avoid active-active and run active-passive. |
| Cross-region replication | MirrorMaker 2 (Apache, open-source) or Confluent Cluster Linking (Confluent-only, more efficient). Both are async, both have consumer-offset translation challenges. | MM2 runs as a Connect cluster; operate it as carefully as your main Kafka. Cluster Linking is more efficient (broker-to-broker direct) but locks you into Confluent. For DR, MM2 is almost always the right answer. |
| Replication during partition (network) | Followers in the minority side fall out of ISR. Leader continues serving writes if min.insync.replicas can still be met. If not, leader rejects writes with NotEnoughReplicasException. |
This is the partition-tolerance behavior of Kafka: CP-leaning. Availability is sacrificed when min.isr cannot be met. This is the right default; the alternative (accept writes that may not be durable) is silently dangerous. |
| Fetch-from-follower (KIP-392) | Consumers can be configured to fetch from same-AZ followers via client.rack + replica.selector.class=RackAwareReplicaSelector. Reduces cross-AZ consumer fetch traffic dramatically. |
This single config change cuts ~33% of total cross-AZ traffic in a 3-AZ deployment with one consumer group, more with multiple groups. Inexplicably underused; should be default on every multi-AZ cluster. |
| Replication cost (network) | Every 1 GB produced generates (RF-1) × 1 GB of replication traffic. With RF=3, every produced GB costs 2 GB of replication. If replicas span AZs (they should), that 2 GB is cross-AZ. | In AWS at $0.02/GB cross-AZ, 100 MB/s of produce traffic costs ~$5K/month just in replication networking. Multiply by your fleet. This is the single largest line item in Kafka cloud bills and the entire motivation for the diskless wave. |
12Better Usage Patterns
Where PE judgment pays off. Patterns most teams miss, anti-patterns visible on code review, optimizations that compound at scale.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Partition count selection | Pick the default of 1, or copy the partition count from another topic, with no thought to throughput or consumer parallelism. | Compute partition count = max(target throughput MB/s ÷ 10, target consumer count) and round up to a power of two. Document the decision in the topic config. | Partition count is permanent for ordering purposes. Wrong now = rewrite later. Right now = 5 minutes of math. |
| Partition key selection | Use whatever ID is convenient (orderId, eventId), get unordered semantics across the entity that actually matters (customer, account, user). | The key is the ordering domain. Pick the entity whose events must arrive in order to downstream consumers. customerId > orderId for an orders topic. | Wrong key = correctness bugs in downstream stateful processing. Detecting these in production is expensive; preventing them is free. |
| Producer durability config | Leave defaults. Or set acks=1 because "it's faster" without measuring. | Explicit acks=all + enable.idempotence=true + min.insync.replicas=2 on the broker. Document the choice in your service config. | Defaults change. Explicit config survives version upgrades and is greppable when the durability question comes up. |
| Consumer commit strategy | Use enable.auto.commit=true and discover later that batch reprocessing on restart is undefined. | enable.auto.commit=false, commit after successful batch processing. For EOS pipelines, use transactional commit via sendOffsetsToTransaction. | Auto-commit is a footgun: it commits on a clock, not after processing. Restarts cause reprocessing of records you thought were done. |
| Use KIP-848 rebalance protocol | Leave consumers on the classic protocol "because it's been working" - and pay 30-100s rebalances per consumer leave/join. | Set group.protocol=consumer on all consumers from Kafka 4.0 forward. Test in staging, then roll out by consumer group. |
The classic protocol is in Phase 1 deprecation in Kafka 4.3. KIP-848 cuts rebalance time by 5-20x and survives slow members gracefully. |
| Enable fetch-from-follower | Run multi-AZ Kafka without setting client.rack on consumers; pay full cross-AZ traffic cost for fetches. |
Set client.rack on consumer (matching the broker rack of the same AZ) and replica.selector.class=RackAwareReplicaSelector on broker. |
Cuts ~33% of cross-AZ traffic with a 5-minute config change. The most underused dollar-saving config in Kafka. |
| Use compacted topics for state | Treat Kafka as append-only and run a separate KV store for "current state per entity." | Use compacted topic with key=entityId for state. Consumers materialize current state by replaying. Survives any consumer restart. | Eliminates an external state store and its operational burden. The compacted topic IS the source of truth. |
| Schema-first contracts | Send JSON blobs with no schema, evolve fields by adding optional ones and hoping consumers handle it. | Avro or Protobuf with Schema Registry from day 1. Compatibility=BACKWARD by default. Register schemas in CI, never at runtime. | "Hope for the best" schema evolution is the leading cause of cross-team Kafka incidents. Schema Registry catches breakage at deploy time, not at runtime. |
| Standby replicas for Kafka Streams | Set num.standby.replicas=0 (default), then suffer 5-10 minute state restore on every restart. |
Set num.standby.replicas=1. Warm replica restores changelog in parallel. Failover takes seconds. |
State restore is the longest tail on Streams availability. Standbys eliminate it at ~2x storage cost. |
| Topic naming conventions | Free-for-all naming, topic name carries no metadata, no clue about lifecycle or ownership. | Convention: {domain}.{entity}.{event-type}.{version} (e.g. orders.orders.placed.v1). Encode versioning into the name; new versions = new topics. |
Hundreds of topics in mature deployments. Naming convention is the difference between findability and chaos. |
| Throttle partition reassignments | Run a reassignment to add a new broker, take down the cluster with replication traffic, get paged at 3 AM. | Always set --throttle on kafka-reassign-partitions.sh. 50-100 MB/s per broker is a safe ceiling for production. |
Unthrottled reassignment can saturate the network and degrade producer/consumer latencies for hours. |
13Advanced / Next-Gen Alternatives
The 2024-2026 streaming landscape changed more than the previous decade combined. Cross-AZ replication economics are the forcing function. WarpStream was acquired by Confluent (Sep 2024). Confluent was acquired by IBM for $11B (closed March 17, 2026). AutoMQ remains independent. Aiven's Inkless is upstreaming as KIP-1150. The "diskless Kafka on object storage" category is now real, in production at Grab, JD.com, Tencent.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| AutoMQ | Diskless Kafka with pluggable WAL (S3, EBS, NFS). 100% Apache Kafka API compatible. Eliminates cross-AZ replication via S3 as the durability layer. Reported 10x cost reduction for high-throughput workloads. | Production at scale (Grab, JD.com, Tencent; Apache 2.0 license) | Low. Drop-in protocol compatible. AutoMQ Linking preserves consumer offsets during migration. | Cross-AZ replication is the largest line item in your Kafka cloud bill, and you want to stay open-source. Independent vendor, no IBM/Confluent lock-in. |
| WarpStream (IBM/Confluent) | Rewrite of Kafka in Go. Stateless agents write directly to S3. Zero local disk on the data plane. Excellent cost for log/observability workloads. | Production (proprietary, now IBM-owned) | Low for ingest (Kafka-protocol agents); higher for full migration due to feature gaps (no transactions, no compacted topics). | Logs and observability pipelines where 400+ ms latency is acceptable. Already an IBM/Confluent customer. Concerned about IBM lock-in. |
| Apache Kafka with KIP-1150 (Aiven Inkless) | Diskless topics as an upstream Apache Kafka feature. Some topics are diskless (object-storage-backed), others remain classic. Same cluster. | Emerging / in-development (Aiven driving the KIP; upstream target Apache 2.0) | Eventually low (upstream feature). Currently requires Aiven Inkless or commitment to the KIP timeline. | Long-term bet on stock Apache Kafka. Want diskless economics for some topics without abandoning the broader Kafka ecosystem. |
| Redpanda | C++ rewrite, thread-per-core architecture (Seastar). Best-in-class single-node performance, mature tiered storage in 25.x. Strong Kafka protocol compatibility. | Production (BSL 1.1 license, source-available, converts to Apache 2.0 after 4 years) | Low for protocol; license requires due diligence (BSL prohibits competitive managed-service offerings). | Performance-per-node is the dominant concern. Latency-sensitive workloads. License is acceptable. |
| Apache Pulsar | Compute/storage separation via BookKeeper. Multi-tenancy first-class. Native geo-replication, native queue mode, namespaces and policies. | Production (Apache 2.0) | High. Different APIs, different operational model, ZooKeeper still in the loop. | Greenfield deployment with strong multi-tenancy requirements. Want native queue and stream in one system. Not migrating existing Kafka. |
| Apache Flink | True stream processing with stateful operators, exactly-once at scale, watermarks and event-time semantics, broader language support (Java, Scala, Python, SQL). | Production at every scale (Apache 2.0) | Medium-high. Different deployment model (Flink cluster + checkpointing storage). API is more powerful but steeper. | Complex stateful pipelines outgrow Kafka Streams. Cross-organization streaming platform. Need windowing semantics beyond Streams' offering. |
| NATS JetStream | Lightweight messaging, simpler ops, native subjects (no partitions), stream replay, consumer ack semantics. | Production (Apache 2.0) | High - different conceptual model, different client libraries. | Edge or IoT workloads, smaller-scale messaging where Kafka's operational footprint is overkill. Mesh/multi-cluster federation. |
Self-managed Kafka in the cloud is structurally expensive because of cross-AZ replication. The diskless category fixes this and is real now. If you are starting fresh in 2026, the decision is no longer "Kafka vs not-Kafka," it is "classic Kafka vs diskless Kafka." For an existing Kafka deployment with a 6-figure cloud bill, AutoMQ migration is the highest-ROI infrastructure project most teams will see this year, and the one most teams will postpone forever because the existing pipeline is "working." The forcing function is usually an EBS or cross-AZ bill audit.
For latency-sensitive workloads, classic Kafka or Redpanda still win, diskless trades 50-400 ms additional latency for 10x cost reduction. Choose deliberately.
14References
- Apache Kafka 4.3.0 Release Announcement (May 22, 2026)
- Apache Kafka 4.2.0 Release Announcement (Feb 17, 2026) - Share Groups GA, KIP-1071 GA
- Apache Kafka 4.0 Release Announcement (Mar 18, 2025) - ZooKeeper removal
- KIP-848: Next-Generation Consumer Rebalance Protocol
- KIP-932: Queues for Kafka (Share Groups)
- Kafka Tiered Storage Operations (4.1)
- Kafka Tiered Storage GA Release Notes
- The Path Forward for Saving Cross-AZ Replication Costs (KIP-1150, KIP-1176, KIP-1183)
- Kafka Alternatives Compared (2026) - AutoMQ vs Confluent vs WarpStream vs Redpanda vs MSK
- AutoMQ vs WarpStream: Diskless Kafka Platforms Compared (2026)
- Rise of Diskless Kafka - Industry Trend Analysis 2026
- The Hidden Cloud Cost in Your Kafka Bill (Cross-AZ Replication Economics)
- KIP-392 Fetch From Follower: Why It Saves 50% of Kafka Cloud Cost
- How KIP-881 and KIP-392 Reduce Inter-AZ Networking Costs
- Kafka Storage Internals: Segments, Indexing, and Retention
- Deep Dive Into Kafka Storage Internals (Strimzi)
- KIP-848 on Confluent Blog
- Apache Kafka 4.0 Upgrade Guide
- Jack Vanlightly: A Fork in the Road - Deciding Kafka's Diskless Future
- Kafka Queue Semantics Now GA with Share Consumer API