Message Brokers: SQS vs Kafka vs RabbitMQ vs Redis Pub/Sub vs ActiveMQ

A Principal Engineer trade-off analysis across five systems that are commonly listed together and almost never interchangeable.

CATEGORY SWEEP / HEAD-TO-HEAD

As of 2026-05-27. Reflects Kafka 4.0 (KRaft-only), RabbitMQ 4.x (quorum queues default, mirrored classic queues removed), SQS FIFO high-throughput mode, and ActiveMQ Classic + Artemis.

PE Verdict

These five are not competitors in the same race. Kafka is a durable, replayable log for high-fanout event streaming. SQS is a zero-ops managed work queue you reach for first on AWS. RabbitMQ is the flexible routing broker for complex topologies and per-message workflows. ActiveMQ is the JMS broker you keep for enterprise Java estates. Redis Pub/Sub is a fire-and-forget signal bus with no persistence, and treating it as a queue is the single most common production mistake in this set. Pick by durability requirement and ops budget first, throughput second.

Read this before the tables: Redis Pub/Sub provides at-most-once delivery with no persistence. If no subscriber is connected, the message is gone permanently. It belongs in this comparison only because teams reach for it as a queue, then lose data. Where a row asks about durability, replay, or consumer groups, Redis Pub/Sub will read "N/A by design" and the honest answer is "use Redis Streams instead." That gap is the most important signal in this entire analysis.

Best default choices

SQS: Default AWS work queue and async job buffer
Kafka: Durable replayable event log and high-fanout streams
RabbitMQ: Complex routing, per-message workflows, and AMQP semantics
Redis Pub/Sub: Ephemeral live signals only; use Streams for durable queueing
ActiveMQ: JMS estates and enterprise Java compatibility

1. Trade-Offs

One table per technology. Each row is a trade-off: what you give up in exchange for what you gain.

Amazon SQS

Best default for AWS-native background jobs, async task buffering, and teams that value zero broker operations over deep routing or replay control.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Fully managed for zero broker ownership	No clusters to patch, scale, or page on. Standard queues have effectively unlimited throughput	No control over internals, no on-prem option, vendor lock to AWS	When you need a multi-cloud or on-prem deployment and discover the queue semantics do not port	The real product is the absence of an on-call rotation, not the queue. That is worth more than most teams price it.
Standard queue throughput for at-least-once and out-of-order delivery	Near-unlimited horizontal scale, lowest cost per message	Duplicates and reordering are guaranteed to happen, not just possible	When a consumer is not idempotent and a duplicate double-charges a customer	Idempotency is not optional with standard queues, it is the contract. Design the dedup key before you write the consumer.
FIFO queue ordering for throughput ceiling	Exactly-once processing and strict per-group ordering	Default 300 TPS without batching; high-throughput mode reaches up to 70,000 TPS per API action in select regions	When a single message group ID becomes a hot path and serializes everything behind it	FIFO throughput scales with the number of message group IDs, not raw volume. One group ID is a single-threaded bottleneck.
Visibility timeout for delivery retry safety	Automatic redelivery if a consumer crashes mid-processing	You must size the timeout to the slowest plausible processing time or you get duplicate work	When processing occasionally takes 90s but the timeout is 30s, so three workers grab the same message	Heartbeat-extend the visibility timeout for long jobs rather than setting a giant fixed value that delays retries on real failures.
Pull-based polling for consumer simplicity	Consumers control their own rate, natural backpressure, no broker push state	Short polling burns empty receives and money; you must tune long polling	When a team leaves short polling on and gets a surprise bill from millions of empty ReceiveMessage calls	Long polling (WaitTimeSeconds=20) is almost always correct. The default of 0 is a cost trap.
14-day max retention for storage simplicity	No infinite-growth risk, predictable storage behavior	Cannot use SQS as an event store or replay log	When you want to reprocess last month's events and they no longer exist	If you need replay, you needed Kafka or Kinesis. SQS is a work queue, not a log.
256 KB message size cap for payload predictability	Consistent low-latency delivery, no large-object handling in the broker	Large payloads need the extended client library and an S3 side-channel	When someone sends a 5 MB document, SQS rejects the send with a size error, and the producer drops that error on the floor	The claim-check pattern (payload in S3, pointer in SQS) is the standard escape, but it adds a second consistency surface.
DLQ redrive for poison-message isolation	Failed messages move aside after N retries instead of blocking the queue	You own monitoring and redriving the DLQ; it is not automatic recovery	When nobody alarms on DLQ depth and 40K failed orders sit silently for a week	A DLQ with no alarm is a data-loss timer. Alarm on ApproximateNumberOfMessagesVisible for every DLQ.
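
The idempotency contract from the standard-queue row can be sketched roughly as follows. This is an in-memory illustration, not SQS client code: `IdempotentConsumer` and the `dedup_key`, `customer`, and `amount_cents` fields are hypothetical names, and a real consumer would presumably keep the seen-key set in a durable store with a conditional write rather than in process memory.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Applies the side effect for each dedup key once, however often it is delivered."""
    seen: set = field(default_factory=set)       # stand-in for a durable dedup store
    charges: list = field(default_factory=list)  # the side effect we must not repeat

    def handle(self, message: dict) -> bool:
        key = message["dedup_key"]               # hypothetical field, chosen by the producer
        if key in self.seen:
            return False                         # duplicate delivery: acknowledge, do nothing
        self.charges.append((message["customer"], message["amount_cents"]))
        self.seen.add(key)                       # record the key only after the side effect
        return True


consumer = IdempotentConsumer()
# At-least-once delivery: the same order arrives twice, and out of order.
deliveries = [
    {"dedup_key": "order-2", "customer": "bob", "amount_cents": 500},
    {"dedup_key": "order-1", "customer": "ann", "amount_cents": 1200},
    {"dedup_key": "order-2", "customer": "bob", "amount_cents": 500},
]
results = [consumer.handle(m) for m in deliveries]
print(results)           # the third delivery is recognized as a duplicate
print(consumer.charges)  # bob is charged once
```

The point of the sketch is the ordering of decisions: the dedup key is part of the message design, so it has to exist before the consumer is written.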

Apache Kafka

Choose when the system needs a durable replayable event log, high fan-out consumers, retained history, and stream processing over ordered partitions.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Durable partitioned log for replay and high fanout	Multiple independent consumer groups read the same data, replay from any offset, millions of msg/s	Operational complexity far above a managed queue, even with KRaft removing ZooKeeper	When a 3-person team adopts self-managed Kafka and spends a quarter on rebalances and disk tuning instead of features	Kafka's superpower is that consumption does not delete data. If you do not need replay or fanout, you are paying the complexity tax for nothing.
Partition count for parallelism	Throughput and consumer concurrency scale with partitions	Producer throughput is best between core-count and ~100 partitions and drops off sharply past ~1,000 per cluster on modest hardware	When someone creates a 10,000-partition topic to be safe and tail latency explodes	Partition count is the hardest thing to change after launch. Increasing it breaks key-based ordering. Size it from target throughput, not optimism.
Ordering guaranteed only within a partition	Strict per-key ordering when you partition by key	No total order across the topic; cross-partition ordering is your problem	When events for one aggregate land in different partitions because the key was null and ordering silently breaks	"Ordered" in Kafka always means "per partition." Anyone who says Kafka gives global ordering has not run it in production.
Consumer-group rebalancing for elastic scaling	Add or remove consumers and partitions redistribute automatically	Classic rebalance is stop-the-world; the whole group pauses during reassignment	When autoscaling triggers frequent rebalances and every scale event stalls processing	KIP-848 (GA in Kafka 4.0) replaces stop-the-world with incremental rebalancing. If you are on an older protocol, this is a real reason to upgrade.
ISR + acks=all for durability	No data loss as long as one in-sync replica survives	Write latency rises; with acks=all every write waits for the ISR	When a team sets acks=1 for speed, loses the leader, and silently drops the unreplicated tail	acks=all + min.insync.replicas=2 on RF=3 is the correct durability floor. acks=1 trades correctness for latency, and most teams do not realize they made that trade.
Pull-based consumption with retained offsets	Consumers replay, rewind, and control their own pace	Offset management is your responsibility; commit too early and you lose messages on crash	When auto-commit fires before processing finishes and a crash drops in-flight messages	Commit offsets after processing, not before. Auto-commit is at-most-once dressed up as convenience.
Log retention by time or size for storage control	Tune how far back you can replay vs disk spend	Long retention means large disks; tiered storage helps but adds a dependency	When retention is set to infinite "just in case" and the cluster runs out of disk during a traffic spike	Tiered storage (offloading old segments to S3) is the modern answer to the retention-vs-cost tension. Use it before buying bigger disks.
Exactly-once semantics via idempotent producer + transactions	End-to-end exactly-once within Kafka, no duplicate processing	Transactions add latency and coordinator overhead; only holds inside the Kafka boundary	When a team assumes EOS extends to their external database write and it does not	Kafka EOS is exactly-once within Kafka. The moment you write to an external system, you are back to at-least-once unless you build idempotency there too.
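
The commit-ordering nuance in the offsets row is easier to see in a toy model. The sketch below does not use a Kafka client; `consume`, `crash_at`, and the list-based log are hypothetical stand-ins, assuming a single partition and a worker that restarts from its last committed offset.

```python
def consume(log, start_offset, commit_before_processing, crash_at=None):
    """Read the log from start_offset and return (processed, committed_offset).

    crash_at is the offset at which the hypothetical worker dies mid-processing.
    """
    processed = []
    committed = start_offset
    for offset in range(start_offset, len(log)):
        if commit_before_processing:
            committed = offset + 1          # offset stored before the work is done
        if offset == crash_at:
            return processed, committed     # crash: work for this record never finished
        processed.append(log[offset])
        if not commit_before_processing:
            committed = offset + 1          # offset stored only after the work is done
    return processed, committed


log = ["a", "b", "c", "d"]

# Commit first, then process: the worker crashes on "c" and restarts.
first, committed = consume(log, 0, commit_before_processing=True, crash_at=2)
second, _ = consume(log, committed, commit_before_processing=True)
commit_early = first + second    # "c" is skipped after the restart (at-most-once)

# Process first, then commit: same crash, same restart.
first, committed = consume(log, 0, commit_before_processing=False, crash_at=2)
second, _ = consume(log, committed, commit_before_processing=False)
commit_late = first + second     # "c" is retried after the restart (at-least-once)

print(commit_early, commit_late)
```

In the commit-late variant a crash after processing but before the commit would replay that record, which is why at-least-once consumers still need idempotent handlers.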

RabbitMQ

Use for AMQP routing, per-message workflows, request/reply patterns, priority/dead-letter behavior, and broker-managed delivery semantics.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Rich routing (exchanges, bindings, topics) for topology flexibility	Fanout, topic, header, and direct routing without consumer-side filtering	Routing logic lives in broker config, which can become an undocumented dependency	When a binding change in production silently reroutes traffic and no one can find where the config lives	RabbitMQ's routing is its real differentiator over SQS and Kafka. It is also the thing that becomes tribal knowledge if you do not version the topology as code.
Quorum queues for data safety	Raft-replicated durability, predictable failover, higher and more stable throughput than the old mirrored queues	All data persisted to disk before ack; needs fast disks and higher prefetch to perform	When you put short-lived RPC reply traffic on a quorum queue and pay the replication cost for data you will discard in 200ms	Quorum queues are the default replicated type in RabbitMQ 4.x. Mirrored classic queues were removed in 4.0. If a migration guide still mentions them, it is stale.
Push-based delivery with prefetch for low latency	Messages pushed to consumers immediately, very low per-message latency	A misconfigured prefetch overwhelms a slow consumer or starves a fast one	When prefetch is unlimited and one consumer grabs the whole backlog, defeating the point of multiple consumers	Prefetch (QoS) is the single most important tuning knob in RabbitMQ. Default-unlimited is wrong for almost every real workload.
Per-message acknowledgment for processing safety	Fine-grained redelivery; an unacked message returns to the queue on consumer death	Ack bookkeeping overhead; forgotten acks leak memory and stall queues	When a consumer never acks (bug or auto-ack misuse) and the queue grows until the broker OOMs	Auto-ack is at-most-once. Manual ack after processing is the correct default, and it is the inverse of Kafka's offset model.
In-memory-first design for throughput	Very fast when the working set fits in RAM	Performance degrades hard when queues grow long and the broker pages to disk	When consumers fall behind, the queue hits millions of messages, and the broker enters flow control and throttles publishers	RabbitMQ is happiest with short queues. Deep backlogs are an anti-pattern; if your queues are routinely millions deep, you wanted a log (Kafka), not a broker.
Streams (available since 3.9) for replayable log-style workloads	Append-only, replayable, high-throughput log inside RabbitMQ	Different mental model and client API than queues; not a drop-in	When a team uses a classic queue for event history and discovers it cannot replay; Streams was the right structure	Streams narrow the gap with Kafka for single-cluster log use cases, but Kafka's ecosystem and horizontal scale still win at the high end.
Single logical broker for operational simplicity	Easier to reason about than a partitioned distributed log	Vertical scaling limits; clustering is for HA, not linear throughput scale	When you expect adding nodes to multiply throughput and instead just get more replicas of the same ceiling	RabbitMQ clusters for availability, not for sharding throughput. If you need to scale writes by adding nodes, the model is Kafka's, not RabbitMQ's.
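
The prefetch row describes one consumer swallowing the whole backlog. A rough model of that behavior, assuming consumers connect one after another and the broker pushes as many messages as each consumer's window allows, might look like this. `connect_in_order` and the worker names are hypothetical; no RabbitMQ client is involved.

```python
from collections import deque


def connect_in_order(backlog, consumers, prefetch):
    """Model push delivery: each consumer, on connecting, is sent messages up to its window.

    prefetch=None models an unlimited window. Returns (unacked per consumer, messages left queued).
    """
    queue = deque(backlog)
    unacked = {}
    for name in consumers:                       # consumers connect one after another
        window = len(queue) if prefetch is None else min(prefetch, len(queue))
        unacked[name] = [queue.popleft() for _ in range(window)]
    return unacked, len(queue)


backlog = list(range(100))

unlimited, left_unlimited = connect_in_order(backlog, ["worker-a", "worker-b"], prefetch=None)
bounded, left_bounded = connect_in_order(backlog, ["worker-a", "worker-b"], prefetch=10)

print(len(unlimited["worker-a"]), len(unlimited["worker-b"]), left_unlimited)
print(len(bounded["worker-a"]), len(bounded["worker-b"]), left_bounded)
```

With a bounded window the remaining messages stay in the queue, so whichever worker acks first is the one that gets the next message.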

Redis Pub/Sub

Use classic Pub/Sub only for ephemeral live signals where losing messages during disconnects is acceptable; choose Redis Streams for durable queue-like work.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Fire-and-forget delivery for minimal latency	Sub-millisecond fanout, often a few hundred microseconds, dead simple API	At-most-once delivery, zero persistence, zero acknowledgment	When a subscriber reconnects after a blip and every message sent during the gap is gone forever	This is the defining property. Redis Pub/Sub is a signal bus, not a queue. Any message you cannot afford to lose does not belong here.
No persistence for raw speed	No disk I/O, no storage growth, lowest possible overhead	A message with no live subscriber is dropped silently, no error returned	When the publisher sends to a channel with zero subscribers and assumes success	PUBLISH returns the number of clients that received the message. If you do not check it, you have no idea whether anyone heard you.
Broadcast to all subscribers for simple fanout	Every subscriber gets every message, trivial fanout	No consumer groups, so you cannot distribute work across consumers	When a team tries to load-balance jobs across workers and every worker processes every job	Pub/Sub fans out; it does not load-balance. Work distribution needs Redis Streams consumer groups or a real queue.
In-memory buffering for slow subscribers	Brief tolerance for consumers slightly behind	Slow subscribers cause Redis to buffer in memory; sustained slowness causes backpressure or disconnects	When one slow consumer's output buffer grows until Redis hits client-output-buffer-limit and force-closes it	A slow Pub/Sub subscriber is a memory-pressure incident waiting to happen. There is no disk overflow, only RAM and a kill threshold.
Shared Redis instance for infra reuse	No new system to operate if you already run Redis	Pub/Sub traffic competes with cache/data traffic on the same instance	When a Pub/Sub message storm starves the cache workload sharing the node	Co-locating Pub/Sub with your cache couples two unrelated failure domains. For anything serious, isolate it.
Cluster-mode fanout via sharded Pub/Sub	Scales fanout across a Redis Cluster with SPUBLISH/SSUBSCRIBE	Sharded Pub/Sub confines messages to a shard; classic Pub/Sub broadcasts cluster-wide and adds inter-node traffic	When classic Pub/Sub on a large cluster floods every node with every message and saturates the bus	On Redis Cluster, use sharded Pub/Sub (Redis 7+) or your fanout cost grows with node count, not just subscriber count.
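
The first three rows of this table (no persistence, a receiver count as the only feedback, fanout instead of work distribution) can be reproduced with a few lines of in-memory code. `TinyPubSub` is a hypothetical stand-in written for this illustration; it is not a Redis client and only mimics the semantics described above.

```python
from collections import defaultdict


class TinyPubSub:
    """In-memory model of fire-and-forget pub/sub: no storage, no acks, no consumer groups."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # channel -> list of live subscriber inboxes

    def subscribe(self, channel):
        inbox = []
        self.subscribers[channel].append(inbox)
        return inbox

    def unsubscribe(self, channel, inbox):
        # Remove by identity, since two inboxes can hold equal contents.
        self.subscribers[channel] = [s for s in self.subscribers[channel] if s is not inbox]

    def publish(self, channel, message):
        receivers = self.subscribers[channel]
        for inbox in receivers:
            inbox.append(message)              # every live subscriber gets its own copy
        return len(receivers)                  # like PUBLISH: how many clients received it


bus = TinyPubSub()
lost = bus.publish("jobs", "job-1")            # nobody is listening: returns 0, message is gone
worker_a = bus.subscribe("jobs")
worker_b = bus.subscribe("jobs")
delivered = bus.publish("jobs", "job-2")       # both workers receive job-2: fanout, not load-balancing
bus.unsubscribe("jobs", worker_b)              # worker_b drops its connection
bus.publish("jobs", "job-3")
worker_b = bus.subscribe("jobs")               # reconnecting does not recover job-3

print(lost, delivered, worker_a, worker_b)
```

A publisher that ignores the return value cannot distinguish "two workers heard this" from "nobody did".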

ActiveMQ (Classic & Artemis)

Keep or choose for JMS compatibility, enterprise Java estates, and migration paths where protocol support matters more than modern cloud-native ergonomics.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Full JMS compliance for enterprise Java fit	Drop-in for JMS 1.1/2.0 apps, transactions, durable subscriptions, broad protocol support (AMQP, MQTT, STOMP, OpenWire)	JMS-centric mental model; less natural outside the Java/Jakarta ecosystem	When a polyglot org standardizes on it and the non-Java teams fight the JMS abstractions	ActiveMQ's reason to exist in 2026 is an existing JMS estate. For greenfield non-JMS work, the other four are usually a better starting point.
Classic vs Artemis as two different brokers under one name	Classic is battle-tested and stable; Artemis is the higher-performance non-blocking successor	They differ in I/O model, persistence, HA, and addressing; "ActiveMQ" alone is ambiguous	When a design doc says "ActiveMQ" and ops deploys Classic while the architect assumed Artemis throughput	Always name the flavor. Artemis (the donated HornetQ codebase) uses an append-only journal and asynchronous architecture; Classic routes everything through OpenWire and KahaDB.
Complex-broker / simple-consumer model for routing in the broker	Broker handles routing, consumer state, redelivery, so consumers stay thin	Broker is the bottleneck and the stateful component to scale and protect	When broker-side state becomes the scaling ceiling and you cannot just add stateless consumers to go faster	This is the philosophical opposite of Kafka's simple-broker/complex-consumer design. It is why ActiveMQ tops out far below Kafka on raw throughput.
KahaDB / journal persistence for guaranteed delivery	Durable messages with JMS persistence and transactional guarantees	Disk-bound throughput; Classic's KahaDB indexing can become a hotspot under load	When message volume outgrows the journal's write path and latency climbs unpredictably	Artemis's journal-only, index-light design is the main reason it sustains higher throughput than Classic at scale.
Moderate throughput ceiling for messaging-not-streaming	Solid for moderate enterprise workloads with low latency	Not built for millions of msg/s; sustained high-throughput single-broker work favors Artemis, and Kafka beyond that	When someone benchmarks Classic against Kafka for a firehose workload and is surprised it cannot keep up	Artemis can rival Kafka in some benchmarks for specific workloads, but Kafka's partitioned horizontal scale-out is a different class for sustained streaming.
Network-of-brokers / store-and-forward for topology reach	Federate brokers across sites for geographic distribution	Configuration complexity, message-loop risks, and harder failure reasoning	When a misconfigured network of brokers creates message loops or duplicate delivery across sites	Powerful for legacy WAN topologies, but it is exactly the kind of bespoke configuration that becomes unmaintainable. Prefer simpler HA pairs where you can.

2. Use Cases

One table per technology. The Driving Property is the single thing that ruled out the alternative.

Amazon SQS

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Decoupling microservices on AWS	Any AWS-native service team offloading async work	Zero broker ops, integrates with Lambda/SNS/EventBridge out of the box	From near-zero to unlimited standard-queue throughput	Kafka would add a cluster to run; the team has no platform team for it
Order-processing pipeline with retries	E-commerce checkout backend	Visibility timeout + DLQ give automatic retry and poison isolation	Thousands of orders/s during peak, bursty	Redis Pub/Sub would lose orders on any consumer blip; SQS retries them
Buffering in front of a rate-limited downstream	Service calling a third-party API with strict quotas	Consumers pull at their own pace, natural backpressure	Smooths 10x spikes into steady downstream load	A push broker would overwhelm the rate-limited dependency
Strict per-entity ordering for financial events	Ledger or inventory updates needing exactly-once per account	FIFO queue: exactly-once, strict per-message-group ordering	Up to 70K TPS per API action in high-throughput mode (select regions)	Standard queues reorder and duplicate; Kafka needs self-managed ops
Fan-out work distribution to idempotent workers	Image/video transcoding job farm	At-least-once delivery with horizontal worker scaling	Millions of jobs/day across an autoscaling fleet	RabbitMQ would need a broker to operate; SQS is fully managed

Apache Kafka

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Central event backbone with many consumers	LinkedIn (Kafka's origin), large event-driven orgs	Durable log lets N independent consumer groups read the same stream	Trillions of messages/day across thousands of topics	SQS deletes on consume; you cannot have many independent readers of one stream
Stream processing and real-time analytics	Clickstream / telemetry pipelines feeding Flink or Kafka Streams	Ordered, replayable partitions with high sustained throughput	Millions of events/s, sub-second end-to-end	RabbitMQ degrades on deep backlogs; it is not a streaming substrate
Event sourcing and CDC	Database change-data-capture via Debezium into Kafka	Immutable ordered log is the system of record for changes	Full change history retained, replayable from offset 0	No queue retains an infinite replayable history by design
Log/metric aggregation pipeline	Centralized observability ingestion	High write throughput with tiered storage for cheap retention	TB/day ingest, weeks of retention	ActiveMQ's broker-centric model tops out well below this volume
Decoupling at scale with replay safety	Large microservice estate needing reprocessing capability	Consumers rewind offsets to reprocess after a bug fix	Hundreds of services, replay of days of events	SQS's 14-day cap and delete-on-consume prevent arbitrary replay
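
The driving properties in this table (independent consumer groups and replay by rewinding an offset) come from one design choice: reading does not delete. A minimal single-partition sketch of that idea follows; `EventLog`, `poll`, and `seek` are hypothetical names for this illustration and do not correspond to a Kafka client API.

```python
class EventLog:
    """Append-only log where reading never deletes; each group tracks its own offset."""

    def __init__(self):
        self.records = []
        self.offsets = {}                      # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1           # offset assigned to the new record

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        self.offsets[group] = offset           # rewind or skip without touching the data


log = EventLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

billing = log.poll("billing")        # first group reads everything
analytics = log.poll("analytics")    # second group reads the same records independently
log.seek("billing", 0)               # after a bug fix, billing rewinds to the start
replayed = log.poll("billing")

print(billing, analytics, replayed)
```

A delete-on-consume queue cannot offer either behavior, because the first successful read removes the record for everyone.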

RabbitMQ

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Complex routing to many queue topologies	Workflow engine routing by message attributes	Exchanges/bindings route without consumer-side filtering	Many queues, tens of thousands of msg/s per node	Kafka has no native content-based routing; you would filter consumer-side
RPC / request-reply messaging	Synchronous service calls over a broker	Per-message ack, reply-to queues, correlation IDs	Low-latency request/response at moderate volume	Kafka's log model is awkward for short-lived reply semantics
Task queues with priority and per-message TTL	Background job system with priorities	Priority queues, message TTL, dead-lettering built in	Moderate throughput, rich per-message semantics	SQS lacks native priority; Kafka lacks per-message TTL and priority
Reliable workflows needing data safety	Financial or order workflows on quorum queues	Raft-replicated quorum queues, predictable failover	~30K msg/s on a quorum queue with 1KB messages, 3-node replication	Redis Pub/Sub has no durability; this work cannot tolerate loss
Multi-protocol integration hub	Bridging AMQP, MQTT, and STOMP clients	Native multi-protocol support in one broker	Mixed IoT + backend client fleet	SQS and Kafka do not natively span MQTT/STOMP/AMQP
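
The routing use case relies on AMQP topic patterns, where `*` stands for exactly one dot-separated word and `#` for zero or more. The sketch below reimplements that matching rule in plain Python to show what the broker decides on each publish. `topic_matches`, `route`, and the binding and queue names are hypothetical; a real deployment declares bindings on the broker instead.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Match a routing key against a topic pattern ('*' = one word, '#' = zero or more)."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)                 # pattern exhausted: key must be too
        if p[i] == "#":
            # '#' may absorb zero or more words; try every possible split.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(bindings: dict, routing_key: str) -> list:
    """Return the queues whose binding pattern matches, in binding order."""
    return [queue for pattern, queue in bindings.items() if topic_matches(pattern, routing_key)]


bindings = {
    "order.*.created": "fulfilment",   # hypothetical queue names
    "order.#": "audit",
    "#.failed": "alerts",
}
print(route(bindings, "order.eu.created"))
print(route(bindings, "payment.card.failed"))
```

Keeping a table like `bindings` in version control is one way to avoid the tribal-knowledge problem that broker-side routing tends to create.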

Redis Pub/Sub

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Live UI updates and presence	Real-time dashboards, live cursors, "user is typing"	Sub-ms fanout, loss of one update is harmless	Many subscribers, transient state	A durable broker is overkill; the next update overwrites the last anyway
Cache invalidation signals	Multi-node app invalidating local caches	Instant broadcast to all app nodes	Cluster-wide signal fanout	SQS/Kafka latency and ops overhead are unjustified for a fire-and-forget ping
Chat where missed messages are acceptable	Ephemeral chat / low-stakes notifications	Lowest-latency broadcast, simplicity	High message rate, loss-tolerant	If history matters, you need Streams or a queue; here it does not
Cross-process eventing within one app	Coordinating workers already sharing a Redis instance	No new infra, trivial API	Internal signaling, low durability need	Standing up Kafka for an internal ping is disproportionate
Real-time leaderboard / score broadcast	Gaming presence and live score push	Sub-millisecond fanout to connected players	Many concurrent connected clients	Durability is unneeded; only currently-connected players matter

ActiveMQ (Classic & Artemis)

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Legacy enterprise Java messaging	Bank or insurer with a large JMS-based estate	Full JMS compliance, transactional messaging, durable subscriptions	Moderate enterprise volumes, strict reliability	Kafka is not JMS; porting decades of JMS app code is a multi-year project
Multi-protocol broker for mixed clients	Integration layer spanning AMQP, MQTT, STOMP, OpenWire	One broker speaks many protocols natively	Heterogeneous client fleet	SQS is AWS-API-only; Redis Pub/Sub is single-protocol
IoT ingestion over MQTT into JMS backend	Industrial telemetry into enterprise systems	MQTT front, JMS back, in one broker (esp. Artemis)	Many devices, moderate per-device rate	Kafka needs an MQTT bridge; ActiveMQ handles it inline
Guaranteed-delivery point-to-point queues	Order or transaction handoff between services	Persistent queues with redelivery and transactions	Moderate throughput, zero-loss requirement	Redis Pub/Sub cannot guarantee delivery at all
Managed broker on AWS without re-architecting	Lift-and-shift JMS app to Amazon MQ	Managed ActiveMQ keeps existing JMS code unchanged	Existing enterprise workload moved to cloud	Moving to SQS/Kafka would require rewriting the messaging layer

3. Limitations

Matrix layout: rows are limitation categories, columns are technologies. Severity reflects how hard the constraint bites a common workload.

Limitation Axis	SQS	Kafka	RabbitMQ	Redis Pub/Sub	ActiveMQ
Durability / persistence	Medium: Durable but 14-day max retention, no replay	Medium: Durable; disk cost scales with retention	Medium: Durable via quorum queues; not built for deep backlogs	Critical: None. At-most-once, messages lost if no live subscriber	Medium: Durable via journal; disk-bound throughput
Replay / history	High: Delete-on-consume, no replay	Medium: Full replay is a core strength, gated only by retention	High: Queues delete on ack; replay only via Streams	Critical: No history at all	High: Queues consume-and-remove; no offset replay
Throughput ceiling	Medium: Standard near-unlimited; FIFO capped per group ID	Medium: Millions/s; partition-count tuning required	High: Per-node ceiling; clusters add HA not linear scale	Medium: Very high fanout but classic Pub/Sub floods cluster nodes	High: Broker-centric; tops out below Kafka, Classic below Artemis
Ordering guarantees	Medium: Standard: none. FIFO: per group ID only	Medium: Per-partition only, never global	Medium: Per-queue FIFO; lost across competing consumers/requeues	High: Best-effort only, no guarantee under reconnect	Medium: Per-destination; message groups for strict ordering
Consumer work distribution	Medium: Competing consumers native; scales well	Medium: Bounded by partition count, not consumer count	Medium: Competing consumers native; prefetch tuning critical	Critical: No consumer groups; every subscriber gets every message	Medium: Queues distribute; topics broadcast
Message size	Medium: 256 KB cap; large payloads need S3 claim-check	Medium: Default ~1 MB; large messages hurt throughput	Medium: Large messages strain memory-first design	Medium: Large payloads buffer in RAM, raising pressure	Medium: Large messages need blob/stream handling, hit journal
Operational burden	Medium: Lowest in the set; fully managed	High: Highest; even KRaft needs partition/disk/rebalance expertise	High: Cluster, quorum, prefetch, flow-control tuning	Medium: Low if Redis already run; couples failure domains	High: Two flavors, HA config, journal tuning, network-of-brokers
Cross-region / geo	Medium: Regional service; cross-region is app-level	High: MirrorMaker 2 is async, operationally heavy	High: Federation/shovel async, not seamless	High: No native geo Pub/Sub durability	High: Network-of-brokers works but is loop-prone
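
The claim-check workaround named in the message-size row can be sketched as below. This is an illustration under stated assumptions: a plain `dict` stands in for the object store, a `list` stands in for the queue, payloads are text, and `send_with_claim_check` and `receive_with_claim_check` are hypothetical helpers rather than functions of any AWS library.

```python
import json
import uuid

MAX_BODY_BYTES = 256 * 1024   # the 256 KB cap cited in the table


def send_with_claim_check(payload: str, blob_store: dict, queue: list) -> str:
    """Enqueue the payload inline if it fits, otherwise store it and enqueue a pointer."""
    body = json.dumps({"inline": payload})
    if len(body.encode("utf-8")) > MAX_BODY_BYTES:
        key = f"payloads/{uuid.uuid4()}"       # hypothetical object key scheme
        blob_store[key] = payload              # write the blob first ...
        body = json.dumps({"claim_check": key})  # ... then enqueue only the pointer
    queue.append(body)
    return body


def receive_with_claim_check(blob_store: dict, queue: list) -> str:
    """Dequeue one message and resolve the pointer if the payload was offloaded."""
    envelope = json.loads(queue.pop(0))
    if "claim_check" in envelope:
        return blob_store[envelope["claim_check"]]
    return envelope["inline"]


blob_store, queue = {}, []
small = "hello"
large = "x" * (5 * 1024 * 1024)                # a 5 MB document

send_with_claim_check(small, blob_store, queue)
send_with_claim_check(large, blob_store, queue)
sizes = [len(body.encode("utf-8")) for body in queue]

first = receive_with_claim_check(blob_store, queue)
second = receive_with_claim_check(blob_store, queue)
print(sizes, len(blob_store), first == small, second == large)
```

The second consistency surface shows up as soon as the blob write and the enqueue can fail independently: an orphaned blob is harmless, but a pointer to a missing blob is a poison message.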

4. Fault Tolerance

Matrix layout. RTO is time to restore service, RPO is tolerable data loss.

Dimension	SQS	Kafka	RabbitMQ	Redis Pub/Sub	ActiveMQ
Replication model	Redundant across multiple AZs in-region, managed by AWS	Leader-follower per partition, ISR-based, RF configurable (3 typical)	Quorum queues: Raft across odd node count (3/5)	None for Pub/Sub messages; Redis HA replicates keyspace, not in-flight pub/sub	Shared-store or replicated journal (Artemis); primary/backup pairs
Failure detection	Transparent, AWS-internal	Controller heartbeats; ISR shrink on lag (seconds)	Raft election on leader loss; net-tick timeouts	Sentinel/Cluster gossip for the node, not for pub/sub delivery	Lock/heartbeat on shared store or replication link
Failover mechanism	Automatic, invisible to clients	Automatic leader re-election from ISR	Automatic Raft leader election, predictable	Replica promotion for the node; in-flight pub/sub messages are lost	Backup broker takes the store lock and activates
RTO (typical)	~0, no client-visible outage	Seconds for leader re-election	Seconds for quorum re-election	Seconds for node failover; pub/sub gap during it	Seconds to tens of seconds for backup activation
RPO (typical)	0 for committed messages	0 with acks=all + min.insync.replicas=2; non-zero with acks=1	0 for quorum queues with majority intact	Total loss of in-flight pub/sub messages by design	0 with replicated/shared store; loss possible on async config
Split-brain behavior	N/A, managed; no operator-visible split-brain	Quorum (KRaft) prevents it; minority controllers step down	Raft majority rule; minority partition rejects writes	Cluster minority stops serving; risk of lost writes on partition	Store lock prevents dual-primary on shared store; replication needs care
Blast radius, single-node loss	None visible; AWS absorbs it	Partitions led by that broker re-elect; brief per-partition pause	Queues with leader on that node re-elect; brief unavailability	Subscribers on that node disconnect; their in-flight messages lost	Destinations on that broker unavailable until backup activates
Cross-region failover	App-level; queue is regional	MirrorMaker 2, async, manual cutover, offset translation needed	Federation/shovel, async, manual	No native cross-region pub/sub durability	Network-of-brokers, async, complex
Data loss scenarios	Effectively none for committed messages within retention	acks=1 + leader loss; or all ISR lost simultaneously	Loss only if majority of quorum nodes lost at once	Any disconnect, slow consumer, or no-subscriber publish	Async journal flush + crash; misconfigured non-persistent delivery
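
The quorum arithmetic behind several cells in this matrix (Raft majorities for quorum queues and KRaft, and the acks=all plus min.insync.replicas floor for Kafka) is short enough to write down. The helpers below are a back-of-the-envelope sketch with hypothetical names; the Kafka check assumes every surviving replica is still in the ISR.

```python
def majority(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes a majority-based group can lose and still make progress."""
    return nodes - majority(nodes)


def kafka_accepts_writes(replication_factor: int, min_insync_replicas: int,
                         failed_replicas: int) -> bool:
    """With acks=all, writes succeed while the ISR is at least min.insync.replicas.

    Assumption: all surviving replicas are in sync.
    """
    in_sync = replication_factor - failed_replicas
    return in_sync >= min_insync_replicas


for n in (3, 4, 5):
    print(n, "nodes tolerate", tolerated_failures(n), "failure(s)")

print(kafka_accepts_writes(3, 2, failed_replicas=1))
print(kafka_accepts_writes(3, 2, failed_replicas=2))
```

Four nodes tolerate no more failures than three, which is the reason the table recommends odd node counts.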

5. Sharding / Partitioning

How each system divides load across nodes. "Sharding" here means horizontal partitioning of message storage or routing.

Dimension	SQS	Kafka	RabbitMQ	Redis Pub/Sub	ActiveMQ
Sharding model	Internal partitions, AWS-managed; FIFO partitions by message group ID	Explicit partitions per topic; hash on key or round-robin	No native sharding; queues pinned to nodes (sharding plugin exists)	Cluster hash slots; sharded Pub/Sub keys channel to a slot	No native partitioning; scale via multiple destinations/brokers
Shard key constraints	FIFO group ID chosen by producer; standard is keyless	Partition key set at produce time; null key = round-robin	N/A natively; consistent-hash exchange approximates it	Channel name hashes to a slot (sharded Pub/Sub)	N/A; message groups pin a group to one consumer for ordering
Rebalancing mechanism	Automatic partition allocation by AWS as load rises	Partition reassignment tool; consumer-group rebalance (KIP-848 incremental)	Manual queue placement / policy; no automatic data rebalance	Cluster slot migration (CLUSTER SETSLOT)	Manual; add brokers/destinations and rebalance clients
Rebalancing cost / impact	Transparent, no client impact	Partition moves copy data and add network load; throttle to limit impact	Moving a queue means draining/recreating; disruptive	Slot migration is online but adds latency during move	Operationally manual, can require client reconnection
Hot-shard behavior	FIFO: a hot group ID serializes; standard spreads automatically	Hot partition (skewed key) overloads one broker; no auto-split	Hot queue saturates its host node; manual resharding	Hot slot concentrates on one node; classic Pub/Sub floods all nodes	Hot destination saturates its broker
Maximum shards (practical)	Managed; no operator-facing partition limit	KRaft handles far more than ZooKeeper's ~200K; per-cluster practical limits remain (latency rises past ~1K-4K active partitions on modest clusters)	Thousands of queues per node before management overhead bites	16,384 hash slots fixed in Cluster	Many destinations, but broker memory/threads bound it
Resharding without downtime?	Yes, automatic and invisible	Adding partitions is online but breaks key-to-partition mapping and ordering	No clean online resharding of a queue's data	Yes, online slot migration	Adding brokers is online; existing destinations do not auto-redistribute
Cross-shard consumption	N/A; consumers poll the queue, AWS hides partitions	Consumer group spans all partitions of a topic transparently	Consume per-queue; cross-queue is app-level	Sharded Pub/Sub confines to a slot; classic spans cluster	Per-destination; composite destinations aggregate
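
The Kafka cell under "Resharding without downtime?" can be illustrated with a small sketch. It is a simplified model, not Kafka's partitioner: `zlib.crc32` stands in for the client's real hash (Kafka's default partitioner uses murmur2), and the function names and key format are hypothetical. The modulo effect is the same for any hash: when the partition count changes, most keys move to a different partition, so per-key ordering across the change is lost.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition: hash the key, take the remainder.

    Assumption: crc32 is a stand-in for the real client hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose partition changes when the count changes."""
    moved = sum(
        1 for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    )
    return moved / len(keys)


if __name__ == "__main__":
    keys = [f"order-{i}" for i in range(10_000)]  # hypothetical entity keys
    print(f"keys remapped going 6 -> 8 partitions: {moved_fraction(keys, 6, 8):.0%}")
```

For a uniform hash, going from 6 to 8 partitions should remap roughly three quarters of keys, because a key keeps its partition only when its hash modulo 24 is below 6. This is why partition count is worth sizing up front rather than growing casually.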

6. Replication

How copies are kept and what consistency you get. The Redis Pub/Sub row is the cautionary one: keyspace replicates, in-flight pub/sub does not.

Dimension	SQS	Kafka	RabbitMQ	Redis Pub/Sub	ActiveMQ
Replication topology	Managed multi-AZ; opaque to user	Leader-follower, single leader per partition	Quorum queues: Raft leader-follower	Primary-replica for keyspace; pub/sub itself not replicated	Primary-backup; shared store or journal replication
Sync vs async	Synchronous within region (managed)	Sync to ISR with acks=all; async cross-region (MM2)	Sync to Raft majority before ack	Async replication by default (can lose recent writes)	Configurable; replicated journal can be sync
Replication factor	Managed, not user-configurable	Configurable per topic; RF=3 typical	Quorum: odd count, 3 or 5 typical	Configurable replicas per primary; irrelevant to pub/sub delivery	Typically one backup; Artemis supports replication groups
Consistency options	At-least-once (standard); exactly-once processing within the 5-minute deduplication window (FIFO)	Tunable via acks (0/1/all) + min.insync.replicas	Strong for quorum queues (majority ack)	At-most-once for pub/sub, no tuning	Persistent (durable) vs non-persistent per message
Replication lag (typical)	Not exposed; effectively negligible	Sub-second in-region for healthy ISR	Sub-second within a healthy cluster	Async, sub-second but can lose the unreplicated tail on failover	Low on sync journal replication; higher if async
Conflict resolution	N/A; single-region authoritative	No conflicts; single leader per partition is authoritative	No conflicts; Raft leader authoritative	Last-write-wins on keyspace; N/A for pub/sub	Single active broker authoritative; no multi-master conflicts
Cross-region replication	Not native; app-level fan-out	MirrorMaker 2, async, with offset translation	Federation / shovel, async	CRDT-based active-active in Redis Enterprise, not OSS pub/sub	Network-of-brokers, async store-and-forward
Behavior during partition	Managed; no operator-visible partition handling	Minority leaders step down; majority continues with ISR	Minority side rejects writes; majority continues	Minority stops serving; messages during the gap are lost	Store lock prevents dual-primary; backup waits
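
The "Sync vs async" and "Behavior during partition" rows reduce to two small rules, sketched here as a simplified model. The function names are hypothetical and the model ignores leader election, timeouts, and unclean leader election. It only shows when a write is accepted: Kafka with acks=all needs the in-sync replica set to be at least min.insync.replicas, and a Raft-based quorum queue needs a majority of members reachable.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a Raft-style majority."""
    return members // 2 + 1


def kafka_write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Simplified model of whether a Kafka produce request is accepted.

    With acks="all" the write is rejected when the ISR is smaller than
    min.insync.replicas. With acks="0" or "1" that setting is not enforced,
    so the write is accepted as long as a leader exists.
    """
    if isr_size < 1:
        return False  # no leader, nothing can accept the write
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True


def quorum_write_accepted(total_members: int, reachable_members: int) -> bool:
    """A quorum queue confirms a publish only once a majority has it."""
    return reachable_members >= majority(total_members)


if __name__ == "__main__":
    # RF=3, min.insync.replicas=2, one replica left in the ISR.
    print(kafka_write_accepted("all", isr_size=1, min_insync_replicas=2))
    print(kafka_write_accepted("1", isr_size=1, min_insync_replicas=2))
    # Three-member quorum queue, minority side of a partition.
    print(quorum_write_accepted(total_members=3, reachable_members=1))
```

The second call is the case to watch: with acks=1 the write is accepted onto a single replica, which is the unreplicated tail that is lost if that leader fails.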

7. Better Usage Patterns

This is where PE depth shows: the anti-patterns that surface in code review, and the corrections that compound at scale. One table per technology.

Amazon SQS

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Idempotent consumers	Assume standard queue delivers once and skip dedup	Design a dedup key and make every consumer idempotent from day one	Standard queues will deliver duplicates; idempotency is the contract, not an optimization
Long polling by default	Leave WaitTimeSeconds at 0 and burn empty receives	Set WaitTimeSeconds=20 on every consumer	Short polling multiplies API cost and CPU for no benefit; this is free money left on the table
DLQ with alarms	Configure a DLQ but never alarm on its depth	CloudWatch alarm on DLQ message count, treat non-zero as an incident	A silent DLQ is a data-loss timer; failed messages rot unnoticed
Visibility timeout heartbeating	Set one giant fixed timeout to cover the slowest job	Use ChangeMessageVisibility to extend during long processing	A huge fixed timeout delays retries on genuine failures; heartbeating keeps fast recovery
Claim-check for large payloads	Try to cram big blobs near the per-message size limit (256 KB historically)	Store payload in S3, put the pointer in SQS (extended client)	Keeps the queue fast and within limits; the broker stays a control plane, not a data store
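
The first two rows of the SQS table can be combined into one consumer loop, sketched below under stated assumptions. `drain_once`, `seen_ids`, and `FakeSqs` are hypothetical names. The `sqs` argument is anything shaped like a boto3 SQS client; the calls used are `receive_message` (with `WaitTimeSeconds` and `MaxNumberOfMessages`) and `delete_message`. Using `MessageId` as the dedup key only covers redelivery of the same message; a business-level key is usually the better choice when producers may retry sends.

```python
def drain_once(sqs, queue_url, handler, seen_ids):
    """Long-poll one batch, run each message at most once, then delete it.

    `seen_ids` is a plain set here; in practice it would need to be a
    durable store shared by all consumers (assumption, not shown).
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        WaitTimeSeconds=20,      # long polling instead of empty short polls
        MaxNumberOfMessages=10,  # largest batch a single receive returns
    )
    processed = 0
    for message in response.get("Messages", []):
        dedup_key = message["MessageId"]
        if dedup_key not in seen_ids:
            handler(message["Body"])
            seen_ids.add(dedup_key)
            processed += 1
        # Delete only after the work is done or known to be a duplicate.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return processed


class FakeSqs:
    """In-memory stand-in so the sketch runs without AWS credentials."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.deleted = []
        self.last_receive_kwargs = None

    def receive_message(self, **kwargs):
        self.last_receive_kwargs = kwargs
        batch, self.messages = self.messages, []
        return {"Messages": batch} if batch else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.deleted.append(ReceiptHandle)


if __name__ == "__main__":
    fake = FakeSqs([
        {"MessageId": "m-1", "ReceiptHandle": "rh-1", "Body": "resize:42"},
        {"MessageId": "m-1", "ReceiptHandle": "rh-2", "Body": "resize:42"},  # duplicate delivery
    ])
    handled = []
    drain_once(fake, "https://example.invalid/queue", handled.append, set())
    print(handled)
```

If `handler` raises, the message is not deleted and becomes visible again after the visibility timeout, which is the retry path that a DLQ redrive policy then bounds.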

Apache Kafka

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Partition count sizing	Pick a round number, or over-provision 10K partitions "to be safe"	Size from target throughput and consumer parallelism; leave headroom but not 10x	Too few caps throughput; too many spikes latency and recovery time. It is hard to change later without breaking ordering
Durability floor	Set acks=1 for speed without realizing the loss window	acks=all + min.insync.replicas=2 on RF=3 for anything that matters	acks=1 silently drops the unreplicated tail on leader loss; most teams never knew they opted out of durability
Commit after processing	Use auto-commit and lose in-flight messages on crash	Disable auto-commit; commit offsets only after successful processing	Auto-commit is at-most-once disguised as convenience; manual commit gives real at-least-once
Keyed partitioning for ordering	Send with null keys then wonder why per-entity order breaks	Partition by the entity key so all its events land in one partition	Kafka only orders within a partition; the key is how you buy per-entity ordering
Tiered storage for retention	Buy ever-bigger broker disks to extend retention	Enable tiered storage, offload old segments to object storage	Decouples retention from broker disk; cheaper long replay without giant local volumes
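
The "Partition count sizing" row can be turned into arithmetic. This is a rough sketch of a common rule of thumb, not an official formula: take the larger of the partition counts needed to hit target throughput on the produce side and on the consume side, then add modest headroom. The function name, the 1.5x default, and the example throughput figures are all assumptions; per-partition throughput has to be measured on your own cluster and workload.

```python
import math


def partitions_needed(
    target_mb_s: float,
    producer_mb_s_per_partition: float,
    consumer_mb_s_per_partition: float,
    headroom: float = 1.5,
) -> int:
    """Estimate a partition count from measured per-partition throughput.

    The slower side (produce or consume) sets the floor; headroom covers
    growth without reaching for a 10x over-provision.
    """
    for_producers = target_mb_s / producer_mb_s_per_partition
    for_consumers = target_mb_s / consumer_mb_s_per_partition
    return math.ceil(max(for_producers, for_consumers) * headroom)


if __name__ == "__main__":
    # Hypothetical measurements: 10 MB/s produce and 5 MB/s consume per partition.
    print(partitions_needed(100, 10, 5))
```

In this example the consumer side dominates, which is typical when per-message processing is heavier than the produce path. The partition count also caps consumer parallelism within a group, so the result should be checked against the number of consumers you expect to run.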

RabbitMQ

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Prefetch (QoS) tuning	Leave prefetch unlimited; one consumer grabs the whole backlog	Set a bounded prefetch sized to processing time and consumer count	Unlimited prefetch defeats load balancing and risks consumer-side memory blowups
Quorum queues for durable work	Still using (or referencing) mirrored classic queues	Use quorum queues; mirrored classic queues were removed in 4.0	Quorum queues give predictable failover and better throughput; classic mirroring is gone
Manual ack after processing	Use auto-ack and lose messages when a consumer dies mid-work	Manual ack only after the work is durably done	Auto-ack is at-most-once; manual ack is the safety guarantee RabbitMQ is chosen for
Keep queues short	Let backlogs grow into the millions and hit flow control	Scale consumers to keep queues near-empty; use Streams for log-style retention	Deep queues trigger paging and publisher throttling; RabbitMQ is a broker, not a log store
Topology as code	Click exchanges and bindings into existence in the UI	Declare exchanges/queues/bindings via definitions or IaC	Undocumented broker config becomes tribal knowledge and a rerouting hazard
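
The "Manual ack after processing" row can be sketched as a consumer callback. It assumes the pika client's callback shape `(channel, method, properties, body)` and its `basic_ack` / `basic_nack` channel methods; `make_on_message` and `RecordingChannel` are hypothetical names, and the recording channel exists only so the sketch runs without a broker. Rejecting without requeue is one policy choice, useful when a dead-letter exchange is configured; requeueing is the alternative when failures are expected to be transient.

```python
from types import SimpleNamespace


def make_on_message(work):
    """Build a callback that acks only after `work` finishes successfully."""

    def on_message(channel, method, properties, body):
        try:
            work(body)
        except Exception:
            # Work failed: reject without requeue so a poison message does
            # not loop forever (it dead-letters if a DLX is configured).
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            # Work is done: only now tell the broker it can forget the message.
            channel.basic_ack(delivery_tag=method.delivery_tag)

    return on_message


class RecordingChannel:
    """Stand-in for a broker channel that records ack/nack calls."""

    def __init__(self):
        self.calls = []

    def basic_ack(self, delivery_tag):
        self.calls.append(("ack", delivery_tag))

    def basic_nack(self, delivery_tag, requeue):
        self.calls.append(("nack", delivery_tag, requeue))


if __name__ == "__main__":
    channel = RecordingChannel()
    callback = make_on_message(lambda body: None)
    callback(channel, SimpleNamespace(delivery_tag=1), None, b"job")
    print(channel.calls)
```

With a real pika channel, this callback would typically be paired with a bounded `basic_qos(prefetch_count=...)` and `auto_ack=False` on `basic_consume`, which ties it to the prefetch row above.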

Redis Pub/Sub

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Use Streams when you need a queue	Reach for Pub/Sub as a task queue, then lose jobs	Use Redis Streams (consumer groups, ACK, PEL) for any durable work	Pub/Sub is at-most-once with no groups; Streams give at-least-once and work distribution
Hybrid Streams + Pub/Sub	Pick one and lose either durability or latency	Persist to a Stream for durability, fire a Pub/Sub message as the low-latency trigger	Reconnecting clients catch up from the Stream; you get both speed and recoverability
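Check PUBLISH return value	Publish and assume someone received it	Treat the returned subscriber count as a delivery signal; alarm on zero. On Redis Cluster the count covers only subscribers connected to the node that handled the PUBLISH, so read it as a lower bound	A no-subscriber publish is a silent drop; the count is your only feedback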
Sharded Pub/Sub on Cluster	Use classic Pub/Sub on a large cluster and flood every node	Use SPUBLISH/SSUBSCRIBE (Redis 7+) to confine messages to a shard	Classic cluster-wide fanout scales cost with node count, not subscriber count
Isolate the failure domain	Run Pub/Sub on the same instance as the cache	Dedicate an instance for messaging when it matters	A pub/sub storm should not be able to starve your cache workload
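
The "Hybrid Streams + Pub/Sub" row can be sketched as a two-step publish. The client is duck-typed on the redis-py methods `xadd(name, fields)` and `publish(channel, message)`; `publish_durable`, `FakeRedis`, and the stream and channel names are hypothetical, and the fake exists only so the sketch runs without a Redis server. The ordering is the point: the Stream entry is written first, so the Pub/Sub message is only a wake-up hint.

```python
def publish_durable(redis_client, stream, channel, fields):
    """Persist to a Stream, then send a Pub/Sub notification carrying the entry ID.

    Returns (entry_id, receivers). Zero receivers is not data loss here:
    a subscriber that was offline can catch up by reading the Stream.
    """
    entry_id = redis_client.xadd(stream, fields)          # durable record first
    receivers = redis_client.publish(channel, entry_id)   # low-latency trigger second
    return entry_id, receivers


class FakeRedis:
    """In-memory stand-in that records calls in order."""

    def __init__(self, subscribers=0):
        self.subscribers = subscribers
        self.log = []

    def xadd(self, name, fields):
        self.log.append(("xadd", name, dict(fields)))
        return f"{len(self.log)}-0"  # fake entry ID in the ms-seq shape

    def publish(self, channel, message):
        self.log.append(("publish", channel, message))
        return self.subscribers


if __name__ == "__main__":
    client = FakeRedis(subscribers=0)
    entry_id, receivers = publish_durable(
        client, "jobs", "jobs:wake", {"task": "resize", "id": "42"}
    )
    print(entry_id, receivers)
```

Consumers would read the Stream through a consumer group and acknowledge entries there; the Pub/Sub channel only shortens the time before they look.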

ActiveMQ (Classic & Artemis)

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Choose Artemis for new builds	Default to Classic out of habit for greenfield work	Start on Artemis unless you have a hard Classic dependency	Artemis's non-blocking journal architecture sustains higher throughput, and Artemis is the designated next major version
Name the flavor in design docs	Write "ActiveMQ" and let ops guess Classic vs Artemis	Specify Classic or Artemis explicitly with the assumed perf profile	The two differ in I/O, HA, and addressing; ambiguity causes capacity mismatches
Persistent vs non-persistent intent	Leave delivery mode at the default and assume durability	Set persistent delivery explicitly for anything that must survive a crash	Non-persistent messages are lost on broker restart; the default is not always what you want
Prefer HA pairs over networks of brokers	Build a sprawling network of brokers for "scale"	Use primary/backup HA pairs; reserve broker networks for genuine geo needs	Networks of brokers are loop-prone and hard to reason about; simpler HA is more reliable
Consider Amazon MQ for managed ops	Self-host and absorb all patching/HA burden	Run managed ActiveMQ (Amazon MQ) to keep JMS code but drop the ops load	Keeps the existing JMS investment while removing the broker on-call rotation

8. Advanced / Next-Gen Alternatives

Successors, adjacent tech that does a specific job better, and patterns that obviate the original need. One table per technology.

Amazon SQS

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Amazon Kinesis / MSK	Replay, ordered streaming, multiple independent readers	Production	Medium, different consumption model	When you need replay or fanout that a delete-on-consume queue cannot give
EventBridge	Content-based routing, schema registry, SaaS event integrations	Production	Low for new event flows	When routing and event-driven choreography matter more than raw queueing
SNS + SQS fanout	One-to-many delivery to multiple queues	Production	Low, additive	When multiple consumers each need their own durable copy of an event

Apache Kafka

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Redpanda	Kafka-API-compatible, C++/Raft, single binary with no JVM; ~10x lower tail latency in vendor benchmarks	Production	Low, wire-compatible with Kafka clients	When tail latency and operational simplicity matter and you want to keep Kafka clients
WarpStream / Kafka-on-S3 designs	Object-storage-backed, near-zero inter-AZ cost, stateless brokers	Emerging	Medium, Kafka-compatible API but different latency profile	When inter-AZ network cost dominates and you can tolerate higher latency
Apache Pulsar	Separated compute/storage (BookKeeper), native multi-tenancy and tiered storage	Production	High, different architecture and client model	When you need strong multi-tenancy and independent scaling of storage vs serving

RabbitMQ

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
RabbitMQ Streams	Append-only replayable log inside RabbitMQ for high-throughput log workloads	Production (4.x)	Low if staying on RabbitMQ; different client API	When you need log-style retention/replay without leaving RabbitMQ for Kafka
NATS / NATS JetStream	Lighter footprint, simpler ops, fast pub/sub plus optional persistence	Production	Medium, different protocol and semantics	When you want cloud-native simplicity and do not need AMQP's rich routing
Apache Kafka	Horizontal write scale-out and replay that a single-leader queue cannot match	Production	High, different mental model	When backlogs are routinely millions deep or you need true streaming scale

Redis Pub/Sub

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Redis Streams	Persistence, consumer groups, ACK, replay, at-least-once delivery	Production	Low, same Redis instance, new commands	The moment you need durability or work distribution. This is the default upgrade path
NATS / NATS Core	Purpose-built lightweight pub/sub with clustering and optional JetStream durability	Production	Medium, new system	When pub/sub is a first-class need rather than a Redis side-feature
Kafka / SQS for durable work	Real durability, retries, and at-least-once semantics	Production	Medium to high	When you discover Pub/Sub was never a safe place for work that cannot be lost

ActiveMQ (Classic & Artemis)

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
ActiveMQ Artemis	Non-blocking architecture, higher throughput, JMS 2.0, the designated next major version	Production	Medium; client code often low-effort, but tooling/runbooks/HA differ	The default target for any Classic estate planning a 3+ year horizon
Apache Kafka	Massive horizontal throughput and replay for streaming workloads	Production	High, not JMS, requires rewrite	When the workload is really streaming/analytics, not enterprise JMS messaging
Amazon MQ (managed)	Removes broker ops while keeping ActiveMQ/JMS compatibility	Production	Low, same broker, managed control plane	When you want to keep JMS code but shed the operational burden
