Message Brokers: SQS vs Kafka vs RabbitMQ vs Redis Pub/Sub vs ActiveMQ
A Principal Engineer trade-off analysis across five systems that are commonly listed together and almost never interchangeable.
Category sweep / head-to-head. As of 2026-05-27. Reflects Kafka 4.0 (KRaft-only), RabbitMQ 4.x (quorum queues default, mirrored classic queues removed), SQS FIFO high-throughput mode, and ActiveMQ Classic + Artemis.
These five are not competitors in the same race. Kafka is a durable, replayable log for high-fanout event streaming. SQS is a zero-ops managed work queue you reach for first on AWS. RabbitMQ is the flexible routing broker for complex topologies and per-message workflows. ActiveMQ is the JMS broker you keep for enterprise Java estates. Redis Pub/Sub is a fire-and-forget signal bus with no persistence, and treating it as a queue is the single most common production mistake in this set. Pick by durability requirement and ops budget first, throughput second.
1. Trade-Offs
One table per technology. A trade-off means giving up X to get Y.
Amazon SQS
Best default for AWS-native background jobs, async task buffering, and teams that value zero broker operations over deep routing or replay control.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Fully managed for zero broker ownership | No clusters to patch, scale, or page on. Standard queues have effectively unlimited throughput | No control over internals, no on-prem option, vendor lock to AWS | When you need a multi-cloud or on-prem deployment and discover the queue semantics do not port | The real product is the absence of an on-call rotation, not the queue. That is worth more than most teams price it. |
| Standard queue throughput for at-least-once and out-of-order delivery | Near-unlimited horizontal scale, lowest cost per message | Duplicates and reordering are guaranteed to happen, not just possible | When a consumer is not idempotent and a duplicate double-charges a customer | Idempotency is not optional with standard queues, it is the contract. Design the dedup key before you write the consumer. |
| FIFO queue ordering for throughput ceiling | Exactly-once processing and strict per-group ordering | Default 300 TPS without batching; high-throughput mode reaches up to 70,000 TPS per API action in select regions | When a single message group ID becomes a hot path and serializes everything behind it | FIFO throughput scales with the number of message group IDs, not raw volume. One group ID is a single-threaded bottleneck. |
| Visibility timeout for delivery retry safety | Automatic redelivery if a consumer crashes mid-processing | You must size the timeout to the slowest plausible processing time or you get duplicate work | When processing occasionally takes 90s but the timeout is 30s, so three workers grab the same message | Heartbeat-extend the visibility timeout for long jobs rather than setting a giant fixed value that delays retries on real failures. |
| Pull-based polling for consumer simplicity | Consumers control their own rate, natural backpressure, no broker push state | Short polling burns empty receives and money; you must tune long polling | When a team leaves short polling on and gets a surprise bill from millions of empty ReceiveMessage calls | Long polling (WaitTimeSeconds=20) is almost always correct. The default of 0 is a cost trap. |
| 14-day max retention for storage simplicity | No infinite-growth risk, predictable storage behavior | Cannot use SQS as an event store or replay log | When you want to reprocess last month's events and they no longer exist | If you need replay, you needed Kafka or Kinesis. SQS is a work queue, not a log. |
| 256 KB message size cap for payload predictability | Consistent low-latency delivery, no large-object handling in the broker | Large payloads need the extended client library and an S3 side-channel | When someone sends a 5 MB document, the send is rejected for exceeding the size limit, and the producer has no error path for it | The claim-check pattern (payload in S3, pointer in SQS) is the standard escape, but it adds a second consistency surface. |
| DLQ redrive for poison-message isolation | Failed messages move aside after N retries instead of blocking the queue | You own monitoring and redriving the DLQ; it is not automatic recovery | When nobody alarms on DLQ depth and 40K failed orders sit silently for a week | A DLQ with no alarm is a data-loss timer. Alarm on ApproximateNumberOfMessagesVisible for every DLQ. |
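
The "design the dedup key before you write the consumer" advice can be sketched without any AWS dependency. This is a minimal illustration, not a production implementation: the `dedup_key` field, the `IdempotentConsumer` class, and the in-process set are all assumptions made for the example. A real consumer would keep seen keys in a durable store with a conditional write.

```python
from typing import Callable, Dict, List, Set


class IdempotentConsumer:
    """Runs a handler at most once per dedup key, even on repeated delivery."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        # Assumption: stands in for a durable store (for example a table
        # with a unique constraint). An in-process set is lost on restart.
        self._seen: Set[str] = set()

    def handle(self, message: Dict) -> bool:
        """Return True if processed, False if skipped as a duplicate."""
        key = message["dedup_key"]  # hypothetical field set by the producer
        if key in self._seen:
            return False
        self._handler(message)
        # Record the key only after the handler succeeds, so a crash inside
        # the handler leads to a retry instead of a silently skipped message.
        self._seen.add(key)
        return True


charges: List[int] = []
consumer = IdempotentConsumer(lambda m: charges.append(m["amount"]))

# The same logical message delivered twice, as a standard queue may do.
deliveries = [
    {"dedup_key": "order-42", "amount": 100},
    {"dedup_key": "order-42", "amount": 100},
    {"dedup_key": "order-43", "amount": 250},
]
results = [consumer.handle(m) for m in deliveries]
```

In this sketch the handler and the key write are not atomic. If the side effect itself must never repeat, the dedup record and the side effect would need to commit together, for example in one database transaction.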
Apache Kafka
Choose when the system needs a durable replayable event log, high fan-out consumers, retained history, and stream processing over ordered partitions.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Durable partitioned log for replay and high fanout | Multiple independent consumer groups read the same data, replay from any offset, millions of msg/s | Operational complexity far above a managed queue, even with KRaft removing ZooKeeper | When a 3-person team adopts self-managed Kafka and spends a quarter on rebalances and disk tuning instead of features | Kafka's superpower is that consumption does not delete data. If you do not need replay or fanout, you are paying the complexity tax for nothing. |
| Partition count for parallelism | Throughput and consumer concurrency scale with partitions | Producer throughput is best between core-count and ~100 partitions and drops off sharply past ~1,000 per cluster on modest hardware | When someone creates a 10,000-partition topic to be safe and tail latency explodes | Partition count is the hardest thing to change after launch. Increasing it breaks key-based ordering. Size it from target throughput, not optimism. |
| Ordering guaranteed only within a partition | Strict per-key ordering when you partition by key | No total order across the topic; cross-partition ordering is your problem | When events for one aggregate land in different partitions because the key was null and ordering silently breaks | "Ordered" in Kafka always means "per partition." Anyone who says Kafka gives global ordering has not run it in production. |
| Consumer-group rebalancing for elastic scaling | Add or remove consumers and partitions redistribute automatically | Classic rebalance is stop-the-world; the whole group pauses during reassignment | When autoscaling triggers frequent rebalances and every scale event stalls processing | KIP-848 (GA in Kafka 4.0) replaces stop-the-world with incremental rebalancing. If you are on an older protocol, this is a real reason to upgrade. |
| ISR + acks=all for durability | No data loss as long as one in-sync replica survives | Write latency rises; with acks=all every write waits for the ISR | When a team sets acks=1 for speed, loses the leader, and silently drops the unreplicated tail | acks=all + min.insync.replicas=2 on RF=3 is the correct durability floor. acks=1 trades correctness for latency, and most teams do not realize they made that trade. |
| Pull-based consumption with retained offsets | Consumers replay, rewind, and control their own pace | Offset management is your responsibility; commit too early and you lose messages on crash | When auto-commit fires before processing finishes and a crash drops in-flight messages | Commit offsets after processing, not before. Auto-commit is at-most-once dressed up as convenience. |
| Log retention by time or size for storage control | Tune how far back you can replay vs disk spend | Long retention means large disks; tiered storage helps but adds a dependency | When retention is set to infinite "just in case" and the cluster runs out of disk during a traffic spike | Tiered storage (offloading old segments to S3) is the modern answer to the retention-vs-cost tension. Use it before buying bigger disks. |
| Exactly-once semantics via idempotent producer + transactions | End-to-end exactly-once within Kafka, no duplicate processing | Transactions add latency and coordinator overhead; only holds inside the Kafka boundary | When a team assumes EOS extends to their external database write and it does not | Kafka EOS is exactly-once within Kafka. The moment you write to an external system, you are back to at-least-once unless you build idempotency there too. |
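
The two partitioning rows above (ordering holds only per partition, and increasing partition count breaks key-based ordering) follow from how a key maps to a partition. The sketch below is an illustration under stated assumptions: `partition_for` is a hypothetical helper, and CRC32 stands in for the client's real hash (the Java client's default partitioner uses murmur2), so the partition numbers will not match a real cluster.

```python
import zlib
from typing import Dict, List, Optional


def partition_for(key: Optional[bytes], num_partitions: int) -> Optional[int]:
    """Map a key to a partition. None means no key-based placement applies."""
    if key is None:
        # With a null key the producer is free to spread records across
        # partitions, so per-entity ordering is not preserved.
        return None
    # Assumption: CRC32 is a stand-in hash chosen only for this example.
    return zlib.crc32(key) % num_partitions


keys: List[bytes] = [f"account-{i}".encode() for i in range(1000)]

# Same key, same partition count: always the same partition.
before: Dict[bytes, int] = {k: partition_for(k, 6) for k in keys}

# After growing the topic from 6 to 12 partitions, many keys map elsewhere.
after: Dict[bytes, int] = {k: partition_for(k, 12) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Any key in `moved` now has older events in one partition and newer events in another, which is why the partition count is better sized from target throughput up front than adjusted later.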
RabbitMQ
Use for AMQP routing, per-message workflows, request/reply patterns, priority/dead-letter behavior, and broker-managed delivery semantics.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Rich routing (exchanges, bindings, topics) for topology flexibility | Fanout, topic, header, and direct routing without consumer-side filtering | Routing logic lives in broker config, which can become an undocumented dependency | When a binding change in production silently reroutes traffic and no one can find where the config lives | RabbitMQ's routing is its real differentiator over SQS and Kafka. It is also the thing that becomes tribal knowledge if you do not version the topology as code. |
| Quorum queues for data safety | Raft-replicated durability, predictable failover, higher and more stable throughput than the old mirrored queues | All data persisted to disk before ack; needs fast disks and higher prefetch to perform | When you put short-lived RPC reply traffic on a quorum queue and pay the replication cost for data you will discard in 200ms | Quorum queues are the default replicated type in RabbitMQ 4.x. Mirrored classic queues were removed in 4.0. If a migration guide still mentions them, it is stale. |
| Push-based delivery with prefetch for low latency | Messages pushed to consumers immediately, very low per-message latency | A misconfigured prefetch overwhelms a slow consumer or starves a fast one | When prefetch is unlimited and one consumer grabs the whole backlog, defeating the point of multiple consumers | Prefetch (QoS) is the single most important tuning knob in RabbitMQ. Default-unlimited is wrong for almost every real workload. |
| Per-message acknowledgment for processing safety | Fine-grained redelivery; an unacked message returns to the queue on consumer death | Ack bookkeeping overhead; forgotten acks leak memory and stall queues | When a consumer never acks (bug or auto-ack misuse) and the queue grows until the broker OOMs | Auto-ack is at-most-once. Manual ack after processing is the correct default, and it is the inverse of Kafka's offset model. |
| In-memory-first design for throughput | Very fast when the working set fits in RAM | Performance degrades hard when queues grow long and the broker pages to disk | When consumers fall behind, the queue hits millions of messages, and the broker enters flow control and throttles publishers | RabbitMQ is happiest with short queues. Deep backlogs are an anti-pattern; if your queues are routinely millions deep, you wanted a log (Kafka), not a broker. |
| Streams (available since 3.9) for replayable log-style workloads | Append-only, replayable, high-throughput log inside RabbitMQ | Different mental model and client API than queues; not a drop-in | When a team keeps event history in a classic queue, discovers it cannot replay, and learns that Streams was the right structure | Streams narrow the gap with Kafka for single-cluster log use cases, but Kafka's ecosystem and horizontal scale still win at the high end. |
| Single logical broker for operational simplicity | Easier to reason about than a partitioned distributed log | Vertical scaling limits; clustering is for HA, not linear throughput scale | When you expect adding nodes to multiply throughput and instead just get more replicas of the same ceiling | RabbitMQ clusters for availability, not for sharding throughput. If you need to scale writes by adding nodes, the model is Kafka's, not RabbitMQ's. |
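
To make the routing row concrete, here is a small sketch of AMQP-style topic matching, where `*` matches exactly one dot-separated word and `#` matches zero or more. It is a simplified model for reasoning about a topology, not the broker's implementation. The `BINDINGS` table, queue names, and `route` helper are hypothetical, and keeping such a table in version control is one way to apply the "topology as code" advice.

```python
from typing import Dict, List


def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP-style topic pattern matches a routing key."""
    return _match(pattern.split("."), routing_key.split("."))


def _match(pattern: List[str], words: List[str]) -> bool:
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head == "#":
        # '#' may absorb zero or more words, so try every possible split.
        return any(_match(rest, words[i:]) for i in range(len(words) + 1))
    if not words:
        return False
    if head == "*" or head == words[0]:
        return _match(rest, words[1:])
    return False


# Hypothetical topology: queue name -> binding patterns on one topic exchange.
BINDINGS: Dict[str, List[str]] = {
    "billing": ["orders.*.created", "orders.*.refunded"],
    "audit": ["orders.#"],
    "eu-fulfilment": ["orders.eu.*"],
}


def route(routing_key: str) -> List[str]:
    """List the queues that would receive a message with this routing key."""
    return sorted(
        queue
        for queue, patterns in BINDINGS.items()
        if any(topic_matches(p, routing_key) for p in patterns)
    )


print(route("orders.eu.created"))
print(route("orders.us.shipped"))
```

A table like this can be unit-tested before a binding change reaches production, which addresses the "silently reroutes traffic" failure described above.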
Redis Pub/Sub
Use classic Pub/Sub only for ephemeral live signals where losing messages during disconnects is acceptable; choose Redis Streams for durable queue-like work.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Fire-and-forget delivery for minimal latency | Sub-millisecond fanout, often a few hundred microseconds, dead simple API | At-most-once delivery, zero persistence, zero acknowledgment | When a subscriber reconnects after a blip and every message sent during the gap is gone forever | This is the defining property. Redis Pub/Sub is a signal bus, not a queue. Any message you cannot afford to lose does not belong here. |
| No persistence for raw speed | No disk I/O, no storage growth, lowest possible overhead | A message with no live subscriber is dropped silently, no error returned | When the publisher sends to a channel with zero subscribers and assumes success | PUBLISH returns the number of clients that received the message. If you do not check it, you have no idea whether anyone heard you. |
| Broadcast to all subscribers for simple fanout | Every subscriber gets every message, trivial fanout | No consumer groups, so you cannot distribute work across consumers | When a team tries to load-balance jobs across workers and every worker processes every job | Pub/Sub fans out; it does not load-balance. Work distribution needs Redis Streams consumer groups or a real queue. |
| In-memory buffering for slow subscribers | Brief tolerance for consumers slightly behind | Slow subscribers cause Redis to buffer in memory; sustained slowness causes backpressure or disconnects | When one slow consumer's output buffer grows until Redis hits client-output-buffer-limit and force-closes it | A slow Pub/Sub subscriber is a memory-pressure incident waiting to happen. There is no disk overflow, only RAM and a kill threshold. |
| Shared Redis instance for infra reuse | No new system to operate if you already run Redis | Pub/Sub traffic competes with cache/data traffic on the same instance | When a Pub/Sub message storm starves the cache workload sharing the node | Co-locating Pub/Sub with your cache couples two unrelated failure domains. For anything serious, isolate it. |
| Cluster-mode fanout via sharded Pub/Sub | Scales fanout across a Redis Cluster with SPUBLISH/SSUBSCRIBE | Sharded Pub/Sub confines messages to a shard; classic Pub/Sub broadcasts cluster-wide and adds inter-node traffic | When classic Pub/Sub on a large cluster floods every node with every message and saturates the bus | On Redis Cluster, use sharded Pub/Sub (Redis 7+) or your fanout cost grows with node count, not just subscriber count. |
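
The at-most-once behavior in the first two rows can be modeled in a few lines. This is an in-memory simulation, not a Redis client: `InMemoryPubSub` and `Subscriber` are hypothetical classes, and a `connected` flag stands in for a dropped connection (a real client would also have to resubscribe after reconnecting). The one property borrowed from Redis is that publishing returns the number of receivers.

```python
from typing import Dict, List


class Subscriber:
    def __init__(self, name: str) -> None:
        self.name = name
        self.connected = True
        self.received: List[str] = []


class InMemoryPubSub:
    """Toy model of fire-and-forget fanout with no buffering or replay."""

    def __init__(self) -> None:
        self._channels: Dict[str, List[Subscriber]] = {}

    def subscribe(self, channel: str, subscriber: Subscriber) -> None:
        self._channels.setdefault(channel, []).append(subscriber)

    def publish(self, channel: str, message: str) -> int:
        """Deliver to currently connected subscribers; return how many got it."""
        delivered = 0
        for sub in self._channels.get(channel, []):
            if sub.connected:
                sub.received.append(message)
                delivered += 1
        # Nothing is stored: anyone not connected right now never sees it.
        return delivered


bus = InMemoryPubSub()
alice, bob = Subscriber("alice"), Subscriber("bob")
bus.subscribe("scores", alice)
bus.subscribe("scores", bob)

n1 = bus.publish("scores", "1-0")
bob.connected = False            # network blip
n2 = bus.publish("scores", "2-0")
bob.connected = True             # reconnects, but the gap is not replayed
n3 = bus.publish("scores", "2-1")
n4 = bus.publish("nobody-listening", "hello")
```

Here `n4` is 0 and no error is raised, which is the reason a publisher that cares about delivery has to inspect the return value rather than assume success.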
ActiveMQ (Classic & Artemis)
Keep or choose for JMS compatibility, enterprise Java estates, and migration paths where protocol support matters more than modern cloud-native ergonomics.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Full JMS compliance for enterprise Java fit | Drop-in for JMS 1.1/2.0 apps, transactions, durable subscriptions, broad protocol support (AMQP, MQTT, STOMP, OpenWire) | JMS-centric mental model; less natural outside the Java/Jakarta ecosystem | When a polyglot org standardizes on it and the non-Java teams fight the JMS abstractions | ActiveMQ's reason to exist in 2026 is an existing JMS estate. For greenfield non-JMS work, the other four are usually a better starting point. |
| Classic vs Artemis as two different brokers under one name | Classic is battle-tested and stable; Artemis is the higher-performance non-blocking successor | They differ in I/O model, persistence, HA, and addressing; "ActiveMQ" alone is ambiguous | When a design doc says "ActiveMQ" and ops deploys Classic while the architect assumed Artemis throughput | Always name the flavor. Artemis (the donated HornetQ codebase) uses an append-only journal and asynchronous architecture; Classic routes everything through OpenWire and KahaDB. |
| Complex-broker / simple-consumer model for routing in the broker | Broker handles routing, consumer state, redelivery, so consumers stay thin | Broker is the bottleneck and the stateful component to scale and protect | When broker-side state becomes the scaling ceiling and you cannot just add stateless consumers to go faster | This is the philosophical opposite of Kafka's simple-broker/complex-consumer design. It is why ActiveMQ tops out far below Kafka on raw throughput. |
| KahaDB / journal persistence for guaranteed delivery | Durable messages with JMS persistence and transactional guarantees | Disk-bound throughput; Classic's KahaDB indexing can become a hotspot under load | When message volume outgrows the journal's write path and latency climbs unpredictably | Artemis's journal-only, index-light design is the main reason it sustains higher throughput than Classic at scale. |
| Moderate throughput ceiling for messaging-not-streaming | Solid for moderate enterprise workloads with low latency | Not built for millions of msg/s; sustained high-throughput single-broker work favors Artemis, and Kafka beyond that | When someone benchmarks Classic against Kafka for a firehose workload and is surprised it cannot keep up | Artemis can rival Kafka in some benchmarks for specific workloads, but Kafka's partitioned horizontal scale-out is a different class for sustained streaming. |
| Network-of-brokers / store-and-forward for topology reach | Federate brokers across sites for geographic distribution | Configuration complexity, message-loop risks, and harder failure reasoning | When a misconfigured network of brokers creates message loops or duplicate delivery across sites | Powerful for legacy WAN topologies, but it is exactly the kind of bespoke configuration that becomes unmaintainable. Prefer simpler HA pairs where you can. |
2. Use Cases
One table per technology. The Driving Property is the single thing that ruled out the alternative.
Amazon SQS
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Decoupling microservices on AWS | Any AWS-native service team offloading async work | Zero broker ops, integrates with Lambda/SNS/EventBridge out of the box | From near-zero to unlimited standard-queue throughput | Kafka would add a cluster to run; the team has no platform team for it |
| Order-processing pipeline with retries | E-commerce checkout backend | Visibility timeout + DLQ give automatic retry and poison isolation | Thousands of orders/s during peak, bursty | Redis Pub/Sub would lose orders on any consumer blip; SQS retries them |
| Buffering in front of a rate-limited downstream | Service calling a third-party API with strict quotas | Consumers pull at their own pace, natural backpressure | Smooths 10x spikes into steady downstream load | A push broker would overwhelm the rate-limited dependency |
| Strict per-entity ordering for financial events | Ledger or inventory updates needing exactly-once per account | FIFO queue: exactly-once, strict per-message-group ordering | Up to 70K TPS per API action in high-throughput mode (select regions) | Standard queues reorder and duplicate; Kafka needs self-managed ops |
| Fan-out work distribution to idempotent workers | Image/video transcoding job farm | At-least-once delivery with horizontal worker scaling | Millions of jobs/day across an autoscaling fleet | RabbitMQ would need a broker to operate; SQS is fully managed |
Apache Kafka
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Central event backbone with many consumers | LinkedIn (Kafka's origin), large event-driven orgs | Durable log lets N independent consumer groups read the same stream | Trillions of messages/day across thousands of topics | SQS deletes on consume; you cannot have many independent readers of one stream |
| Stream processing and real-time analytics | Clickstream / telemetry pipelines feeding Flink or Kafka Streams | Ordered, replayable partitions with high sustained throughput | Millions of events/s, sub-second end-to-end | RabbitMQ degrades on deep backlogs; it is not a streaming substrate |
| Event sourcing and CDC | Database change-data-capture via Debezium into Kafka | Immutable ordered log is the system of record for changes | Full change history retained, replayable from offset 0 | No queue retains an infinite replayable history by design |
| Log/metric aggregation pipeline | Centralized observability ingestion | High write throughput with tiered storage for cheap retention | TB/day ingest, weeks of retention | ActiveMQ's broker-centric model tops out well below this volume |
| Decoupling at scale with replay safety | Large microservice estate needing reprocessing capability | Consumers rewind offsets to reprocess after a bug fix | Hundreds of services, replay of days of events | SQS's 14-day cap and delete-on-consume prevent arbitrary replay |
RabbitMQ
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Complex routing to many queue topologies | Workflow engine routing by message attributes | Exchanges/bindings route without consumer-side filtering | Many queues, tens of thousands of msg/s per node | Kafka has no native content-based routing; you would filter consumer-side |
| RPC / request-reply messaging | Synchronous service calls over a broker | Per-message ack, reply-to queues, correlation IDs | Low-latency request/response at moderate volume | Kafka's log model is awkward for short-lived reply semantics |
| Task queues with priority and per-message TTL | Background job system with priorities | Priority queues, message TTL, dead-lettering built in | Moderate throughput, rich per-message semantics | SQS lacks native priority; Kafka lacks per-message TTL and priority |
| Reliable workflows needing data safety | Financial or order workflows on quorum queues | Raft-replicated quorum queues, predictable failover | ~30K msg/s on a quorum queue with 1KB messages, 3-node replication | Redis Pub/Sub has no durability; this work cannot tolerate loss |
| Multi-protocol integration hub | Bridging AMQP, MQTT, and STOMP clients | Native multi-protocol support in one broker | Mixed IoT + backend client fleet | SQS and Kafka do not natively span MQTT/STOMP/AMQP |
Redis Pub/Sub
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Live UI updates and presence | Real-time dashboards, live cursors, "user is typing" | Sub-ms fanout, loss of one update is harmless | Many subscribers, transient state | A durable broker is overkill; the next update overwrites the last anyway |
| Cache invalidation signals | Multi-node app invalidating local caches | Instant broadcast to all app nodes | Cluster-wide signal fanout | SQS/Kafka latency and ops overhead are unjustified for a fire-and-forget ping |
| Chat where missed messages are acceptable | Ephemeral chat / low-stakes notifications | Lowest-latency broadcast, simplicity | High message rate, loss-tolerant | If history matters, you need Streams or a queue; here it does not |
| Cross-process eventing within one app | Coordinating workers already sharing a Redis instance | No new infra, trivial API | Internal signaling, low durability need | Standing up Kafka for an internal ping is disproportionate |
| Real-time leaderboard / score broadcast | Gaming presence and live score push | Sub-millisecond fanout to connected players | Many concurrent connected clients | Durability is unneeded; only currently-connected players matter |
ActiveMQ (Classic & Artemis)
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Legacy enterprise Java messaging | Bank or insurer with a large JMS-based estate | Full JMS compliance, transactional messaging, durable subscriptions | Moderate enterprise volumes, strict reliability | Kafka is not JMS; porting decades of JMS app code is a multi-year project |
| Multi-protocol broker for mixed clients | Integration layer spanning AMQP, MQTT, STOMP, OpenWire | One broker speaks many protocols natively | Heterogeneous client fleet | SQS is AWS-API-only; Redis Pub/Sub is single-protocol |
| IoT ingestion over MQTT into JMS backend | Industrial telemetry into enterprise systems | MQTT front, JMS back, in one broker (esp. Artemis) | Many devices, moderate per-device rate | Kafka needs an MQTT bridge; ActiveMQ handles it inline |
| Guaranteed-delivery point-to-point queues | Order or transaction handoff between services | Persistent queues with redelivery and transactions | Moderate throughput, zero-loss requirement | Redis Pub/Sub cannot guarantee delivery at all |
| Managed broker on AWS without re-architecting | Lift-and-shift JMS app to Amazon MQ | Managed ActiveMQ keeps existing JMS code unchanged | Existing enterprise workload moved to cloud | Moving to SQS/Kafka would require rewriting the messaging layer |
3. Limitations
Matrix layout: rows are limitation categories, columns are technologies. Severity reflects how hard the constraint bites a common workload.
| Limitation Axis | SQS | Kafka | RabbitMQ | Redis Pub/Sub | ActiveMQ |
|---|---|---|---|---|---|
| Durability / persistence | Medium: Durable but 14-day max retention, no replay | Medium: Durable; disk cost scales with retention | Medium: Durable via quorum queues; not built for deep backlogs | Critical: None. At-most-once, messages lost if no live subscriber | Medium: Durable via journal; disk-bound throughput |
| Replay / history | High: Delete-on-consume, no replay | Medium: Full replay is a core strength, gated only by retention | High: Queues delete on ack; replay only via Streams | Critical: No history at all | High: Queues consume-and-remove; no offset replay |
| Throughput ceiling | Medium: Standard near-unlimited; FIFO capped per group ID | Medium: Millions/s; partition-count tuning required | High: Per-node ceiling; clusters add HA not linear scale | Medium: Very high fanout but classic Pub/Sub floods cluster nodes | High: Broker-centric; tops out below Kafka, Classic below Artemis |
| Ordering guarantees | Medium: Standard queues give none; FIFO orders per group ID only | Medium: Per-partition only, never global | Medium: Per-queue FIFO; lost across competing consumers/requeues | High: Best-effort only, no guarantee under reconnect | Medium: Per-destination; message groups for strict ordering |
| Consumer work distribution | Medium: Competing consumers native; scales well | Medium: Bounded by partition count, not consumer count | Medium: Competing consumers native; prefetch tuning critical | Critical: No consumer groups; every subscriber gets every message | Medium: Queues distribute; topics broadcast |
| Message size | Medium: 256 KB cap; large payloads need S3 claim-check | Medium: Default ~1 MB; large messages hurt throughput | Medium: Large messages strain memory-first design | Medium: Large payloads buffer in RAM, raising pressure | Medium: Large messages need blob/stream handling, hit journal |
| Operational burden | Medium: Lowest in the set; fully managed | High: Highest; even KRaft needs partition/disk/rebalance expertise | High: Cluster, quorum, prefetch, flow-control tuning | Medium: Low if Redis already run; couples failure domains | High: Two flavors, HA config, journal tuning, network-of-brokers |
| Cross-region / geo | Medium: Regional service; cross-region is app-level | High: MirrorMaker 2 is async, operationally heavy | High: Federation/shovel async, not seamless | High: No native geo Pub/Sub durability | High: Network-of-brokers works but is loop-prone |
4. Fault Tolerance
Matrix layout. RTO is time to restore service, RPO is tolerable data loss.
| Dimension | SQS | Kafka | RabbitMQ | Redis Pub/Sub | ActiveMQ |
|---|---|---|---|---|---|
| Replication model | Redundant across multiple AZs in-region, managed by AWS | Leader-follower per partition, ISR-based, RF configurable (3 typical) | Quorum queues: Raft across odd node count (3/5) | None for Pub/Sub messages; Redis HA replicates keyspace, not in-flight pub/sub | Shared-store or replicated journal (Artemis); primary/backup pairs |
| Failure detection | Transparent, AWS-internal | Controller heartbeats; ISR shrink on lag (seconds) | Raft election on leader loss; net-tick timeouts | Sentinel/Cluster gossip for the node, not for pub/sub delivery | Lock/heartbeat on shared store or replication link |
| Failover mechanism | Automatic, invisible to clients | Automatic leader re-election from ISR | Automatic Raft leader election, predictable | Replica promotion for the node; in-flight pub/sub messages are lost | Backup broker takes the store lock and activates |
| RTO (typical) | ~0, no client-visible outage | Seconds for leader re-election | Seconds for quorum re-election | Seconds for node failover; pub/sub gap during it | Seconds to tens of seconds for backup activation |
| RPO (typical) | 0 for committed messages | 0 with acks=all + min.insync.replicas=2; non-zero with acks=1 | 0 for quorum queues with majority intact | Total loss of in-flight pub/sub messages by design | 0 with replicated/shared store; loss possible on async config |
| Split-brain behavior | N/A, managed; no operator-visible split-brain | Quorum (KRaft) prevents it; minority controllers step down | Raft majority rule; minority partition rejects writes | Cluster minority stops serving; risk of lost writes on partition | Store lock prevents dual-primary on shared store; replication needs care |
| Blast radius, single-node loss | None visible; AWS absorbs it | Partitions led by that broker re-elect; brief per-partition pause | Queues with leader on that node re-elect; brief unavailability | Subscribers on that node disconnect; their in-flight messages lost | Destinations on that broker unavailable until backup activates |
| Cross-region failover | App-level; queue is regional | MirrorMaker 2, async, manual cutover, offset translation needed | Federation/shovel, async, manual | No native cross-region pub/sub durability | Network-of-brokers, async, complex |
| Data loss scenarios | Effectively none for committed messages within retention | acks=1 + leader loss; or all ISR lost simultaneously | Loss only if majority of quorum nodes lost at once | Any disconnect, slow consumer, or no-subscriber publish | Async journal flush + crash; misconfigured non-persistent delivery |
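
The Kafka RPO cell ("0 with acks=all + min.insync.replicas=2; non-zero with acks=1") can be reasoned through as a small decision function. This is a simplified model of the documented producer and broker settings, not broker code: `produce_result` and `survives_leader_loss` are hypothetical helpers, and the model only considers the loss of the leader alone.

```python
def produce_result(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Classify a produce request as 'accepted' or 'rejected' in this model."""
    if acks not in ("0", "1", "all"):
        raise ValueError(f"unknown acks setting: {acks!r}")
    # min.insync.replicas is only enforced for acks=all: when the in-sync
    # replica set is too small, the write is refused instead of acknowledged.
    if acks == "all" and isr_size < min_insync_replicas:
        return "rejected"
    return "accepted"


def survives_leader_loss(acks: str, isr_size: int) -> bool:
    """True if an acknowledged write is guaranteed to exist on another
    in-sync replica when the leader alone fails."""
    # acks=all waits for the whole ISR, so a second in-sync replica has the
    # record. acks=1 and acks=0 return before followers have caught up.
    return acks == "all" and isr_size >= 2


# With acks=all and min.insync.replicas=2, every accepted write is safe
# against loss of the leader, whatever the ISR size was at the time.
safe_floor = all(
    survives_leader_loss("all", isr)
    for isr in (1, 2, 3)
    if produce_result("all", isr, min_insync_replicas=2) == "accepted"
)

# With acks=1 the write is accepted even when only the leader is in sync.
fast_but_lossy = (
    produce_result("1", isr_size=1, min_insync_replicas=2) == "accepted"
    and not survives_leader_loss("1", isr_size=1)
)
```

The design point is that the durable configuration prefers refusing writes when the ISR shrinks, trading availability for a zero RPO, while `acks=1` keeps accepting writes and accepts the loss window.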
6. Replication
How copies are kept and what consistency you get. The Redis Pub/Sub row is the cautionary one: keyspace replicates, in-flight pub/sub does not.
| Dimension | SQS | Kafka | RabbitMQ | Redis Pub/Sub | ActiveMQ |
|---|---|---|---|---|---|
| Replication topology | Managed multi-AZ; opaque to user | Leader-follower, single leader per partition | Quorum queues: Raft leader-follower | Primary-replica for keyspace; pub/sub itself not replicated | Primary-backup; shared store or journal replication |
| Sync vs async | Synchronous within region (managed) | Sync to ISR with acks=all; async cross-region (MM2) | Sync to Raft majority before ack | Async replication by default (can lose recent writes) | Configurable; replicated journal can be sync |
| Replication factor | Managed, not user-configurable | Configurable per topic; RF=3 typical | Quorum: odd count, 3 or 5 typical | Configurable replicas per primary; irrelevant to pub/sub delivery | Typically one backup; Artemis supports replication groups |
| Consistency options | At-least-once (standard); exactly-once (FIFO) | Tunable via acks (0/1/all) + min.insync.replicas | Strong for quorum queues (majority ack) | At-most-once for pub/sub, no tuning | Persistent (durable) vs non-persistent per message |
| Replication lag (typical) | Not exposed; effectively negligible | Sub-second in-region for healthy ISR | Sub-second within a healthy cluster | Async, sub-second but can lose the unreplicated tail on failover | Low on sync journal replication; higher if async |
| Conflict resolution | N/A; single-region authoritative | No conflicts; single leader per partition is authoritative | No conflicts; Raft leader authoritative | Last-write-wins on keyspace; N/A for pub/sub | Single active broker authoritative; no multi-master conflicts |
| Cross-region replication | Not native; app-level fan-out | MirrorMaker 2, async, with offset translation | Federation / shovel, async | CRDT-based active-active in Redis Enterprise, not OSS pub/sub | Network-of-brokers, async store-and-forward |
| Behavior during partition | Managed; no operator-visible partition handling | Minority leaders step down; majority continues with ISR | Minority side rejects writes; majority continues | Minority stops serving; messages during the gap are lost | Store lock prevents dual-primary; backup waits |
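
The replication factor and quorum rows reduce to simple arithmetic. A minimal sketch of how many node losses each setting tolerates before writes stop; the function names are illustrative and do not come from any broker's API:

```python
def raft_majority(nodes: int) -> int:
    """Smallest number of members that forms a Raft majority."""
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return nodes // 2 + 1


def raft_tolerated_failures(nodes: int) -> int:
    """Members a Raft-replicated queue can lose and still accept writes."""
    return nodes - raft_majority(nodes)


def kafka_tolerated_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas a partition can lose while acks=all producers still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"quorum queue with {n} members: majority={raft_majority(n)}, "
              f"tolerates {raft_tolerated_failures(n)} failure(s)")
    print("Kafka RF=3, min.insync.replicas=2: tolerates",
          kafka_tolerated_failures(3, 2), "replica loss for acks=all writes")
```

Four members tolerate the same single failure as three, which is why the table recommends odd counts.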
7. Better Usage Patterns
Where PE depth shows. The anti-patterns that show up in code review and the corrections that compound at scale. One table per technology.
Amazon SQS
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Idempotent consumers | Assume standard queue delivers once and skip dedup | Design a dedup key and make every consumer idempotent from day one | Standard queues will deliver duplicates; idempotency is the contract, not an optimization |
| Long polling by default | Leave WaitTimeSeconds at 0 and burn empty receives | Set WaitTimeSeconds=20 (the maximum) on every consumer | Short polling multiplies API cost and CPU for no benefit; long polling recovers that spend at no cost |
| DLQ with alarms | Configure a DLQ but never alarm on its depth | CloudWatch alarm on DLQ message count, treat non-zero as an incident | A silent DLQ is a data-loss timer; failed messages rot unnoticed |
| Visibility timeout heartbeating | Set one giant fixed timeout to cover the slowest job | Use ChangeMessageVisibility to extend during long processing | A huge fixed timeout delays retries on genuine failures; heartbeating keeps fast recovery |
| Claim-check for large payloads | Try to cram big blobs in just under the message size limit (1 MiB since August 2025, previously 256 KiB) | Store payload in S3, put the pointer in SQS (extended client) | Keeps the queue fast and within limits; the broker stays a control plane, not a data store |
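
The first two rows of the table above combine naturally into one consumer loop. A minimal sketch, assuming `sqs` behaves like a boto3 SQS client; the function name `poll_once`, the in-memory `seen` set, and the use of `MessageId` as the dedup key are illustrative assumptions, not a prescribed design:

```python
from typing import Any, Callable, MutableSet


def poll_once(sqs: Any, queue_url: str, handle: Callable[[str], None],
              seen: MutableSet[str]) -> int:
    """Receive one batch with long polling and handle each message at most once.

    `seen` stands in for a durable dedup store (for example a table with a
    unique key). An in-memory set is only good enough for a demo.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # largest batch a single receive can return
        WaitTimeSeconds=20,      # long polling: wait up to 20 s before returning empty
    )
    processed = 0
    for message in response.get("Messages", []):
        # Assumption: MessageId is the dedup key. A business key carried in the
        # body (order ID, payment ID) is usually the stronger choice.
        dedup_key = message["MessageId"]
        if dedup_key not in seen:
            handle(message["Body"])
            seen.add(dedup_key)
            processed += 1
        # Delete only after the work is done (or known to be a duplicate), so a
        # crash mid-processing leads to redelivery rather than loss.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
    return processed
```

In real use the client would come from `boto3.client("sqs")` and the function would run in a loop. If `handle` raises, the message is not deleted and becomes visible again after the visibility timeout, which is the at-least-once behavior the table describes.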
Apache Kafka
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Partition count sizing | Pick a round number, or over-provision 10K partitions "to be safe" | Size from target throughput and consumer parallelism; leave headroom but not 10x | Too few caps throughput; too many inflates latency and recovery time. Adding partitions later remaps keys and breaks per-key ordering |
| Durability floor | Set acks=1 for speed without realizing the loss window | acks=all + min.insync.replicas=2 on RF=3 for anything that matters | acks=1 silently drops the unreplicated tail on leader loss; most teams never knew they opted out of durability |
| Commit after processing | Use auto-commit and lose in-flight messages on crash | Disable auto-commit; commit offsets only after successful processing | Auto-commit can commit offsets for records that were polled but not yet processed, so a crash skips them; committing after processing gives real at-least-once |
| Keyed partitioning for ordering | Send with null keys then wonder why per-entity order breaks | Partition by the entity key so all its events land in one partition | Kafka only orders within a partition; the key is how you buy per-entity ordering |
| Tiered storage for retention | Buy ever-bigger broker disks to extend retention | Enable tiered storage, offload old segments to object storage | Decouples retention from broker disk; cheaper long replay without giant local volumes |
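
The partition sizing and keyed partitioning rows can be made concrete with a little arithmetic. A rough sketch under stated assumptions: the per-partition throughput figures are numbers you measure yourself, the 1.5x headroom is an arbitrary default, and the CRC32 hash is a stand-in for illustration (Kafka's default partitioner uses murmur2, so real partition numbers will differ):

```python
import math
import zlib


def partitions_needed(target_mb_s: float, producer_mb_s_per_partition: float,
                      consumer_mb_s_per_partition: float,
                      headroom: float = 1.5) -> int:
    """Partition count from measured per-partition throughput plus modest headroom."""
    by_producer = target_mb_s / producer_mb_s_per_partition
    by_consumer = target_mb_s / consumer_mb_s_per_partition
    # The slower side (usually the consumer) sets the floor.
    return math.ceil(max(by_producer, by_consumer) * headroom)


def partition_for(key: str, num_partitions: int) -> int:
    """Map an entity key to a partition. Stand-in hash, not Kafka's murmur2."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    count = partitions_needed(target_mb_s=100,
                              producer_mb_s_per_partition=10,
                              consumer_mb_s_per_partition=5)
    print("partitions:", count)
    # The same key always lands in the same partition for a fixed count...
    print("order-42 ->", partition_for("order-42", count))
    # ...but the mapping generally shifts if the partition count changes.
    print("order-42 after resize ->", partition_for("order-42", count + 6))
```

The second function shows why the partition count is hard to change later: the key-to-partition mapping depends on the count, so events for one entity can straddle two partitions after a resize.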
RabbitMQ
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Prefetch (QoS) tuning | Leave prefetch unlimited; one consumer grabs the whole backlog | Set a bounded prefetch sized to processing time and consumer count | Unlimited prefetch defeats load balancing and risks consumer-side memory blowups |
| Quorum queues for durable work | Still using (or referencing) mirrored classic queues | Use quorum queues; mirrored classic queues were removed in 4.0 | Quorum queues give predictable failover and better throughput; classic mirroring is gone |
| Manual ack after processing | Use auto-ack and lose messages when a consumer dies mid-work | Manual ack only after the work is durably done | Auto-ack is at-most-once; manual ack is the safety guarantee RabbitMQ is chosen for |
| Keep queues short | Let backlogs grow into the millions and hit flow control | Scale consumers to keep queues near-empty; use Streams for log-style retention | Deep queues trigger paging and publisher throttling; RabbitMQ is a broker, not a log store |
| Topology as code | Click exchanges and bindings into existence in the UI | Declare exchanges/queues/bindings via definitions or IaC | Undocumented broker config becomes tribal knowledge and a rerouting hazard |
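
Bounded prefetch, quorum queues, and manual acks fit in one small consumer setup. A minimal sketch, assuming `channel` behaves like a pika blocking channel; `start_worker`, the default prefetch of 20, and the choice to nack without requeue on failure are illustrative assumptions to be tuned per workload:

```python
from typing import Any, Callable


def start_worker(channel: Any, queue: str, work: Callable[[bytes], None],
                 prefetch: int = 20) -> None:
    """Register a consumer with bounded prefetch and manual acknowledgements."""
    # Quorum queues must be durable; the queue type is fixed at declaration time.
    channel.queue_declare(queue=queue, durable=True,
                          arguments={"x-queue-type": "quorum"})
    # Cap unacknowledged deliveries per consumer so the backlog is shared.
    channel.basic_qos(prefetch_count=prefetch)

    def on_message(ch, method, properties, body):
        try:
            work(body)
        except Exception:
            # Assumption: failed messages are not retried in place. With
            # requeue=False a configured dead-letter exchange, if any, takes them.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            # Ack only after the work has completed.
            ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue, on_message_callback=on_message,
                          auto_ack=False)
```

After `start_worker(...)`, a real pika consumer would call `channel.start_consuming()`. Without a dead-letter exchange configured on the queue, a nack with `requeue=False` discards the message, so pair this with a DLX for anything that matters.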
Redis Pub/Sub
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Use Streams when you need a queue | Reach for Pub/Sub as a task queue, then lose jobs | Use Redis Streams (consumer groups, ACK, PEL) for any durable work | Pub/Sub is at-most-once with no groups; Streams give at-least-once and work distribution |
| Hybrid Streams + Pub/Sub | Pick one and lose either durability or latency | Persist to a Stream for durability, fire a Pub/Sub message as the low-latency trigger | Reconnecting clients catch up from the Stream; you get both speed and recoverability |
| Check PUBLISH return value | Publish and assume someone received it | Treat the returned subscriber count as a delivery signal; alarm on zero | A no-subscriber publish is a silent drop; the count is your only feedback |
| Sharded Pub/Sub on Cluster | Use classic Pub/Sub on a large cluster and flood every node | Use SPUBLISH/SSUBSCRIBE (Redis 7+) to confine messages to a shard | Classic cluster-wide fanout scales cost with node count, not subscriber count |
| Isolate the failure domain | Run Pub/Sub on the same instance as the cache | Dedicate an instance for messaging when it matters | A pub/sub storm should not be able to starve your cache workload |
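
The hybrid pattern and the PUBLISH return check can share one publish path. A minimal sketch, assuming `r` behaves like a redis-py client; `publish_event` and the returned dictionary shape are illustrative, not part of any library:

```python
from typing import Any, Dict


def publish_event(r: Any, stream: str, channel: str,
                  fields: Dict[str, str]) -> Dict[str, Any]:
    """Append the event to a Stream, then send a Pub/Sub wake-up carrying its ID."""
    # Durable record first: reconnecting clients catch up from the Stream.
    entry_id = r.xadd(stream, fields)
    # Low-latency trigger second: PUBLISH returns the number of subscribers reached.
    receivers = r.publish(channel, entry_id)
    # Zero receivers means nobody heard the trigger. The event itself is still
    # safe in the Stream, so this is a signal to alarm on, not a lost message.
    return {"entry_id": entry_id, "receivers": receivers,
            "unheard": receivers == 0}
```

Subscribers treat the Pub/Sub message as a hint and read the actual payload from the Stream. One caveat worth checking for your deployment: on Redis Cluster with classic Pub/Sub, the returned count covers only subscribers connected to the node that handled the PUBLISH, so zero is a weaker signal there.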
ActiveMQ (Classic & Artemis)
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Choose Artemis for new builds | Default to Classic out of habit for greenfield work | Start on Artemis unless you have a hard Classic dependency | Artemis's non-blocking journal architecture sustains higher throughput, and it has long been positioned as Classic's successor |
| Name the flavor in design docs | Write "ActiveMQ" and let ops guess Classic vs Artemis | Specify Classic or Artemis explicitly with the assumed perf profile | The two differ in I/O, HA, and addressing; ambiguity causes capacity mismatches |
| Persistent vs non-persistent intent | Leave delivery mode at the default and assume durability | Set persistent delivery explicitly for anything that must survive a crash | Non-persistent messages are lost on broker restart; the default is not always what you want |
| Prefer HA pairs over networks of brokers | Build a sprawling network of brokers for "scale" | Use primary/backup HA pairs; reserve broker networks for genuine geo needs | Networks of brokers are loop-prone and hard to reason about; simpler HA is more reliable |
| Consider Amazon MQ for managed ops | Self-host and absorb all patching/HA burden | Run managed ActiveMQ (Amazon MQ) to keep JMS code but drop the ops load | Keeps the existing JMS investment while removing the broker on-call rotation |
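
Stating persistence intent explicitly matters most for non-JMS clients. A minimal sketch, assuming `conn` behaves like a stomp.py connection and that the broker's STOMP transport honors the `persistent` header (worth verifying against your Classic or Artemis version); `send_with_intent` is a hypothetical helper:

```python
from typing import Any, Dict


def send_with_intent(conn: Any, destination: str, body: str,
                     must_survive_restart: bool) -> Dict[str, str]:
    """Send a message with the persistence decision spelled out at the call site."""
    # Assumption: the broker reads the STOMP "persistent" header to choose
    # between persistent and non-persistent delivery.
    headers = {"persistent": "true" if must_survive_restart else "false"}
    conn.send(destination=destination, body=body, headers=headers)
    return headers
```

Forcing every caller to pass `must_survive_restart` turns durability into a reviewed decision rather than an inherited default. A JMS producer would express the same intent by setting the delivery mode explicitly.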
8. Advanced / Next-Gen Alternatives
Successors, adjacent tech that does a specific job better, and patterns that obviate the original need. One table per technology.
Amazon SQS
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Amazon Kinesis / MSK | Replay, ordered streaming, multiple independent readers | Production | Medium, different consumption model | When you need replay or fanout that a delete-on-consume queue cannot give |
| EventBridge | Content-based routing, schema registry, SaaS event integrations | Production | Low for new event flows | When routing and event-driven choreography matter more than raw queueing |
| SNS + SQS fanout | One-to-many delivery to multiple queues | Production | Low, additive | When multiple consumers each need their own durable copy of an event |
Apache Kafka
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Redpanda | Kafka-API-compatible, C++/Raft, ~10x lower tail latency in vendor benchmarks, no JVM/ZooKeeper | Production | Low, wire-compatible with Kafka clients | When tail latency and operational simplicity matter and you want to keep Kafka clients |
| WarpStream / Kafka-on-S3 designs | Object-storage-backed, near-zero inter-AZ cost, stateless brokers | Emerging | Medium, Kafka-compatible API but different latency profile | When inter-AZ network cost dominates and you can tolerate higher latency |
| Apache Pulsar | Separated compute/storage (BookKeeper), native multi-tenancy and tiered storage | Production | High, different architecture and client model | When you need strong multi-tenancy and independent scaling of storage vs serving |
RabbitMQ
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| RabbitMQ Streams | Append-only replayable log inside RabbitMQ for high-throughput log workloads | Production (4.x) | Low if staying on RabbitMQ; different client API | When you need log-style retention/replay without leaving RabbitMQ for Kafka |
| NATS / NATS JetStream | Lighter footprint, simpler ops, fast pub/sub plus optional persistence | Production | Medium, different protocol and semantics | When you want cloud-native simplicity and do not need AMQP's rich routing |
| Apache Kafka | Horizontal write scale-out and replay that a queue pinned to a single leader node cannot match | Production | High, different mental model | When backlogs are routinely millions deep or you need true streaming scale |
Redis Pub/Sub
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Redis Streams | Persistence, consumer groups, ACK, replay, at-least-once delivery | Production | Low, same Redis instance, new commands | The moment you need durability or work distribution. This is the default upgrade path |
| NATS / NATS Core | Purpose-built lightweight pub/sub with clustering and optional JetStream durability | Production | Medium, new system | When pub/sub is a first-class need rather than a Redis side-feature |
| Kafka / SQS for durable work | Real durability, retries, and at-least-once semantics | Production | Medium to high | When you discover Pub/Sub was never a safe place for work that cannot be lost |
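
The "new commands" in the Redis Streams row amount to a read-then-ack loop. A minimal sketch, assuming `r` behaves like a redis-py client using the default (RESP2) response shape and that the consumer group already exists; `consume_batch` and the 5-second block are illustrative choices:

```python
from typing import Any, Callable, Dict


def consume_batch(r: Any, stream: str, group: str, consumer: str,
                  handle: Callable[[Dict], None], count: int = 10) -> int:
    """Read new entries for this consumer group and ACK each one after handling."""
    # ">" asks for entries never delivered to any consumer in the group.
    batches = r.xreadgroup(group, consumer, {stream: ">"},
                           count=count, block=5000)
    acked = 0
    for _stream_name, entries in batches or []:
        for entry_id, fields in entries:
            handle(fields)
            # ACK only after handling. If handle() raises, the entry stays in
            # the pending entries list (PEL) and can be claimed and retried.
            r.xack(stream, group, entry_id)
            acked += 1
    return acked
```

The group can be created once with `r.xgroup_create(stream, group, id="0", mkstream=True)`. This is the at-least-once, work-distributing behavior that Pub/Sub cannot provide.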
ActiveMQ (Classic & Artemis)
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| ActiveMQ Artemis | Non-blocking architecture, higher throughput, JMS 2.0; long positioned as Classic's successor | Production | Medium; client code often low-effort, but tooling/runbooks/HA differ | The default target for any Classic estate planning a 3+ year horizon |
| Apache Kafka | Massive horizontal throughput and replay for streaming workloads | Production | High, not JMS, requires rewrite | When the workload is really streaming/analytics, not enterprise JMS messaging |
| Amazon MQ (managed) | Removes broker ops while keeping ActiveMQ/JMS compatibility | Production | Low, same broker, managed control plane | When you want to keep JMS code but shed the operational burden |