Message Brokers: SQS vs Kafka vs RabbitMQ vs Redis Pub/Sub vs ActiveMQ

A Principal Engineer trade-off analysis across five systems that are commonly listed together and almost never interchangeable.

CATEGORY SWEEP / HEAD-TO-HEAD

As of 2026-05-27. Reflects Kafka 4.0 (KRaft-only), RabbitMQ 4.x (quorum queues default, mirrored classic queues removed), SQS FIFO high-throughput mode, and ActiveMQ Classic + Artemis.

PE Verdict

These five are not competitors in the same race. Kafka is a durable, replayable log for high-fanout event streaming. SQS is a zero-ops managed work queue you reach for first on AWS. RabbitMQ is the flexible routing broker for complex topologies and per-message workflows. ActiveMQ is the JMS broker you keep for enterprise Java estates. Redis Pub/Sub is a fire-and-forget signal bus with no persistence, and treating it as a queue is the single most common production mistake in this set. Pick by durability requirement and ops budget first, throughput second.

Read this before the tables: Redis Pub/Sub provides at-most-once delivery with no persistence. If no subscriber is connected, the message is gone permanently. It belongs in this comparison only because teams reach for it as a queue, then lose data. Where a row asks about durability, replay, or consumer groups, Redis Pub/Sub will read "N/A by design" and the honest answer is "use Redis Streams instead." That gap is the most important signal in this entire analysis.

1. Trade-Offs

One table per technology. A trade-off means giving up X to get Y.

Amazon SQS

Best default for AWS-native background jobs, async task buffering, and teams that value zero broker operations over deep routing or replay control.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Fully managed for zero broker ownership | No clusters to patch, scale, or page on. Standard queues have effectively unlimited throughput | No control over internals, no on-prem option, vendor lock-in to AWS | When you need a multi-cloud or on-prem deployment and discover the queue semantics do not port | The real product is the absence of an on-call rotation, not the queue. That is worth more than most teams price it at.
Standard queue throughput for at-least-once and out-of-order delivery | Near-unlimited horizontal scale, lowest cost per message | Duplicates and reordering will happen at scale, not just in theory | When a consumer is not idempotent and a duplicate double-charges a customer | Idempotency is not optional with standard queues, it is the contract. Design the dedup key before you write the consumer.
FIFO queue ordering for throughput ceiling | Exactly-once processing and strict per-group ordering | Default 300 TPS per API action without batching; high-throughput mode reaches up to 70,000 TPS per API action in select regions | When a single message group ID becomes a hot path and serializes everything behind it | FIFO throughput scales with the number of message group IDs, not raw volume. One group ID is a single-threaded bottleneck.
Visibility timeout for delivery retry safety | Automatic redelivery if a consumer crashes mid-processing | You must size the timeout to the slowest plausible processing time or you get duplicate work | When processing occasionally takes 90s but the timeout is 30s, so three workers grab the same message | Heartbeat-extend the visibility timeout for long jobs rather than setting a giant fixed value that delays retries on real failures.
Pull-based polling for consumer simplicity | Consumers control their own rate, natural backpressure, no broker push state | Short polling burns empty receives and money; you must tune long polling | When a team leaves short polling on and gets a surprise bill from millions of empty ReceiveMessage calls | Long polling (WaitTimeSeconds=20) is almost always correct. The default of 0 is a cost trap.
14-day max retention for storage simplicity | No infinite-growth risk, predictable storage behavior | Cannot use SQS as an event store or replay log | When you want to reprocess last month's events and they no longer exist | If you need replay, you needed Kafka or Kinesis. SQS is a work queue, not a log.
256 KB message size cap for payload predictability | Consistent low-latency delivery, no large-object handling in the broker | Large payloads need the extended client library and an S3 side-channel | When someone sends a 5 MB document and the send is rejected for exceeding the size limit | The claim-check pattern (payload in S3, pointer in SQS) is the standard escape, but it adds a second consistency surface.
DLQ redrive for poison-message isolation | Failed messages move aside after N retries instead of blocking the queue | You own monitoring and redriving the DLQ; it is not automatic recovery | When nobody alarms on DLQ depth and 40K failed orders sit silently for a week | A DLQ with no alarm is a data-loss timer. Alarm on ApproximateNumberOfMessagesVisible for every DLQ.
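
The standard-queue row treats idempotency as the contract, so a small illustration may help. The following is a minimal sketch, not an SQS client: it assumes each message carries a producer-assigned dedup key, and the `dedup_key` and `amount` field names, the `Ledger` class, and the in-memory `seen` set are all hypothetical. A real consumer would keep the key in a durable store and record it atomically with the side effect.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy consumer state: a balance plus the dedup keys already applied."""
    balance: int = 0
    seen: set = field(default_factory=set)

    def handle(self, message: dict) -> bool:
        """Apply a charge once per dedup key; return False for a duplicate."""
        key = message["dedup_key"]  # hypothetical field set by the producer
        if key in self.seen:
            return False  # duplicate delivery: safe to acknowledge and skip
        self.balance += message["amount"]
        self.seen.add(key)
        return True


# At-least-once delivery: the same message can arrive more than once.
deliveries = [
    {"dedup_key": "order-1", "amount": 50},
    {"dedup_key": "order-2", "amount": 20},
    {"dedup_key": "order-1", "amount": 50},  # redelivered duplicate
]

ledger = Ledger()
results = [ledger.handle(m) for m in deliveries]
print(results, ledger.balance)
```

The duplicate is acknowledged without being applied, which is the behavior a visibility-timeout redelivery needs from the consumer.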

Apache Kafka

Choose when the system needs a durable replayable event log, high fan-out consumers, retained history, and stream processing over ordered partitions.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Durable partitioned log for replay and high fanout | Multiple independent consumer groups read the same data, replay from any offset, millions of msg/s | Operational complexity far above a managed queue, even with KRaft removing ZooKeeper | When a 3-person team adopts self-managed Kafka and spends a quarter on rebalances and disk tuning instead of features | Kafka's superpower is that consumption does not delete data. If you do not need replay or fanout, you are paying the complexity tax for nothing.
Partition count for parallelism | Throughput and consumer concurrency scale with partitions | Producer throughput is best between core-count and ~100 partitions and drops off sharply past ~1,000 per cluster on modest hardware | When someone creates a 10,000-partition topic to be safe and tail latency explodes | Partition count is the hardest thing to change after launch. Increasing it breaks key-based ordering. Size it from target throughput, not optimism.
Ordering guaranteed only within a partition | Strict per-key ordering when you partition by key | No total order across the topic; cross-partition ordering is your problem | When events for one aggregate land in different partitions because the key was null and ordering silently breaks | "Ordered" in Kafka always means "per partition." Anyone who says Kafka gives global ordering has not run it in production.
Consumer-group rebalancing for elastic scaling | Add or remove consumers and partitions redistribute automatically | Classic rebalance is stop-the-world; the whole group pauses during reassignment | When autoscaling triggers frequent rebalances and every scale event stalls processing | KIP-848 (GA in Kafka 4.0) replaces stop-the-world with incremental rebalancing. If you are on an older protocol, this is a real reason to upgrade.
ISR + acks=all for durability | No data loss as long as one in-sync replica survives | Write latency rises; with acks=all every write waits for the ISR | When a team sets acks=1 for speed, loses the leader, and silently drops the unreplicated tail | acks=all + min.insync.replicas=2 on RF=3 is the correct durability floor. acks=1 trades correctness for latency, and most teams do not realize they made that trade.
Pull-based consumption with retained offsets | Consumers replay, rewind, and control their own pace | Offset management is your responsibility; commit too early and you lose messages on crash | When auto-commit fires before processing finishes and a crash drops in-flight messages | Commit offsets after processing, not before. Auto-commit is at-most-once dressed up as convenience.
Log retention by time or size for storage control | Tune how far back you can replay vs disk spend | Long retention means large disks; tiered storage helps but adds a dependency | When retention is set to infinite "just in case" and the cluster runs out of disk during a traffic spike | Tiered storage (offloading old segments to S3) is the modern answer to the retention-vs-cost tension. Use it before buying bigger disks.
Exactly-once semantics via idempotent producer + transactions | End-to-end exactly-once within Kafka, no duplicate processing | Transactions add latency and coordinator overhead; only holds inside the Kafka boundary | When a team assumes EOS extends to their external database write and it does not | Kafka EOS is exactly-once within Kafka. The moment you write to an external system, you are back to at-least-once unless you build idempotency there too.
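
The partition-count row says that increasing partitions breaks key-based ordering. A minimal sketch of why, under one loud assumption: `zlib.crc32` stands in for the client's real hash (Kafka's default partitioner uses murmur2), because the point is the modulo step, not the hash. The `partition_for` helper and the `account-N` keys are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition as hash(key) mod partition count.

    Assumption: crc32 is a stand-in for the producer's real hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"account-{i}" for i in range(1000)]
before = {k: partition_for(k, 6) for k in keys}  # topic created with 6 partitions
after = {k: partition_for(k, 8) for k in keys}   # later expanded to 8

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys map to a different partition after the change")
```

Any key that moves has its older events in one partition and its newer events in another, so per-key ordering no longer holds across the change. That is the reason to size the count up front.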

RabbitMQ

Use for AMQP routing, per-message workflows, request/reply patterns, priority/dead-letter behavior, and broker-managed delivery semantics.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Rich routing (exchanges, bindings, topics) for topology flexibility | Fanout, topic, header, and direct routing without consumer-side filtering | Routing logic lives in broker config, which can become an undocumented dependency | When a binding change in production silently reroutes traffic and no one can find where the config lives | RabbitMQ's routing is its real differentiator over SQS and Kafka. It is also the thing that becomes tribal knowledge if you do not version the topology as code.
Quorum queues for data safety | Raft-replicated durability, predictable failover, higher and more stable throughput than the old mirrored queues | All data persisted to disk before ack; needs fast disks and higher prefetch to perform | When you put short-lived RPC reply traffic on a quorum queue and pay the replication cost for data you will discard in 200ms | Quorum queues are the default replicated type in RabbitMQ 4.x. Mirrored classic queues were removed in 4.0. If a migration guide still mentions them, it is stale.
Push-based delivery with prefetch for low latency | Messages pushed to consumers immediately, very low per-message latency | A misconfigured prefetch overwhelms a slow consumer or starves a fast one | When prefetch is unlimited and one consumer grabs the whole backlog, defeating the point of multiple consumers | Prefetch (QoS) is the single most important tuning knob in RabbitMQ. Default-unlimited is wrong for almost every real workload.
Per-message acknowledgment for processing safety | Fine-grained redelivery; an unacked message returns to the queue on consumer death | Ack bookkeeping overhead; forgotten acks leak memory and stall queues | When a consumer never acks (a bug in manual-ack handling) and unacked messages pile up until the broker OOMs | Auto-ack is at-most-once. Manual ack after processing is the correct default, and it is the inverse of Kafka's offset model.
In-memory-first design for throughput | Very fast when the working set fits in RAM | Performance degrades hard when queues grow long and the broker pages to disk | When consumers fall behind, the queue hits millions of messages, and the broker enters flow control and throttles publishers | RabbitMQ is happiest with short queues. Deep backlogs are an anti-pattern; if your queues are routinely millions deep, you wanted a log (Kafka), not a broker.
Streams (4.x) for replayable log-style workloads | Append-only, replayable, high-throughput log inside RabbitMQ | Different mental model and client API than queues; not a drop-in | When a team uses a classic queue for event history and discovers it cannot replay, when Streams was the right structure | Streams narrow the gap with Kafka for single-cluster log use cases, but Kafka's ecosystem and horizontal scale still win at the high end.
Single logical broker for operational simplicity | Easier to reason about than a partitioned distributed log | Vertical scaling limits; clustering is for HA, not linear throughput scale | When you expect adding nodes to multiply throughput and instead just get more replicas of the same ceiling | RabbitMQ clusters for availability, not for sharding throughput. If you need to scale writes by adding nodes, the model is Kafka's, not RabbitMQ's.
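
The routing row is easier to reason about with the topic-exchange matching rule written out. Below is a minimal sketch of that rule as AMQP 0-9-1 describes it: words are separated by dots, `*` matches exactly one word, and `#` matches zero or more words. It is an illustration, not the broker's implementation, and the `BINDINGS` list, queue names, and routing keys are hypothetical.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Topic matching: '*' is exactly one word, '#' is zero or more words."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)  # pattern used up: key must be used up too
        if p[i] == "#":
            # '#' may absorb any number of the remaining words, including none
            return any(match(i + 1, nxt) for nxt in range(j, len(k) + 1))
        if j < len(k) and p[i] in ("*", k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


# Hypothetical topology kept as data, so it can live in version control.
BINDINGS = [
    ("orders.*.created", "fulfilment"),
    ("orders.#", "audit"),
    ("*.eu.*", "eu-compliance"),
]


def route(routing_key: str) -> list:
    """Return the queues whose binding pattern matches the routing key."""
    return [queue for pattern, queue in BINDINGS if topic_matches(pattern, routing_key)]


print(route("orders.eu.created"))
print(route("orders.cancelled"))
```

Keeping bindings as reviewable data like this is one way to stop the topology from becoming tribal knowledge. A one-line pattern change then shows up in a diff, not only in broker state.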

Redis Pub/Sub

Use classic Pub/Sub only for ephemeral live signals where losing messages during disconnects is acceptable; choose Redis Streams for durable queue-like work.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Fire-and-forget delivery for minimal latency | Sub-millisecond fanout, often a few hundred microseconds, dead simple API | At-most-once delivery, zero persistence, zero acknowledgment | When a subscriber reconnects after a blip and every message sent during the gap is gone forever | This is the defining property. Redis Pub/Sub is a signal bus, not a queue. Any message you cannot afford to lose does not belong here.
No persistence for raw speed | No disk I/O, no storage growth, lowest possible overhead | A message with no live subscriber is dropped silently, no error returned | When the publisher sends to a channel with zero subscribers and assumes success | PUBLISH returns the number of clients that received the message. If you do not check it, you have no idea whether anyone heard you.
Broadcast to all subscribers for simple fanout | Every subscriber gets every message, trivial fanout | No consumer groups, so you cannot distribute work across consumers | When a team tries to load-balance jobs across workers and every worker processes every job | Pub/Sub fans out; it does not load-balance. Work distribution needs Redis Streams consumer groups or a real queue.
In-memory buffering for slow subscribers | Brief tolerance for consumers slightly behind | Slow subscribers cause Redis to buffer in memory; sustained slowness causes backpressure or disconnects | When one slow consumer's output buffer grows until Redis hits client-output-buffer-limit and force-closes it | A slow Pub/Sub subscriber is a memory-pressure incident waiting to happen. There is no disk overflow, only RAM and a kill threshold.
Shared Redis instance for infra reuse | No new system to operate if you already run Redis | Pub/Sub traffic competes with cache/data traffic on the same instance | When a Pub/Sub message storm starves the cache workload sharing the node | Co-locating Pub/Sub with your cache couples two unrelated failure domains. For anything serious, isolate it.
Cluster-mode fanout via sharded Pub/Sub | Scales fanout across a Redis Cluster with SPUBLISH/SSUBSCRIBE | Sharded Pub/Sub confines messages to a shard; classic Pub/Sub broadcasts cluster-wide and adds inter-node traffic | When classic Pub/Sub on a large cluster floods every node with every message and saturates the bus | On Redis Cluster, use sharded Pub/Sub (Redis 7+) or your fanout cost grows with node count, not just subscriber count.
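
The first three rows describe semantics that are easy to state and easy to forget, so here is a toy in-memory model of them. It is a sketch only and does not talk to Redis: `ToyPubSub`, `Subscriber`, and the `jobs` channel are hypothetical names. The one behavior borrowed from the real system is that publishing returns the number of receivers.

```python
from collections import defaultdict


class Subscriber:
    """A connected client; the inbox holds only what arrived while subscribed."""

    def __init__(self, name: str):
        self.name = name
        self.inbox = []


class ToyPubSub:
    """In-memory model of classic Pub/Sub: no storage, no acks, no groups."""

    def __init__(self):
        self.channels = defaultdict(list)  # channel name -> live subscribers

    def subscribe(self, channel: str, sub: Subscriber) -> None:
        self.channels[channel].append(sub)

    def unsubscribe(self, channel: str, sub: Subscriber) -> None:
        self.channels[channel].remove(sub)

    def publish(self, channel: str, message: str) -> int:
        receivers = self.channels.get(channel, [])
        for sub in receivers:
            sub.inbox.append(message)  # every live subscriber gets a copy
        return len(receivers)          # 0 means the message is simply gone


bus = ToyPubSub()
w1, w2 = Subscriber("w1"), Subscriber("w2")

dropped = bus.publish("jobs", "job-0")    # nobody is listening yet
bus.subscribe("jobs", w1)
bus.subscribe("jobs", w2)
delivered = bus.publish("jobs", "job-1")  # fanout: both workers receive it
bus.unsubscribe("jobs", w2)               # w2 has a network blip
bus.publish("jobs", "job-2")
bus.subscribe("jobs", w2)                 # w2 reconnects; job-2 is not replayed
bus.publish("jobs", "job-3")

print(dropped, delivered, w1.inbox, w2.inbox)
```

Two things fall out of the model. Both workers process `job-1`, so this is fanout and not work distribution. And `w2` never sees `job-2`, because nothing was stored for it to catch up on.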

ActiveMQ (Classic & Artemis)

Keep or choose for JMS compatibility, enterprise Java estates, and migration paths where protocol support matters more than modern cloud-native ergonomics.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Full JMS compliance for enterprise Java fit | Drop-in for JMS 1.1/2.0 apps, transactions, durable subscriptions, broad protocol support (AMQP, MQTT, STOMP, OpenWire) | JMS-centric mental model; less natural outside the Java/Jakarta ecosystem | When a polyglot org standardizes on it and the non-Java teams fight the JMS abstractions | ActiveMQ's reason to exist in 2026 is an existing JMS estate. For greenfield non-JMS work, the other four are usually a better starting point.
Classic vs Artemis as two different brokers under one name | Classic is battle-tested and stable; Artemis is the higher-performance non-blocking successor | They differ in I/O model, persistence, HA, and addressing; "ActiveMQ" alone is ambiguous | When a design doc says "ActiveMQ" and ops deploys Classic while the architect assumed Artemis throughput | Always name the flavor. Artemis (the donated HornetQ codebase) uses an append-only journal and asynchronous architecture; Classic routes everything through OpenWire and KahaDB.
Complex-broker / simple-consumer model for routing in the broker | Broker handles routing, consumer state, redelivery, so consumers stay thin | Broker is the bottleneck and the stateful component to scale and protect | When broker-side state becomes the scaling ceiling and you cannot just add stateless consumers to go faster | This is the philosophical opposite of Kafka's simple-broker/complex-consumer design. It is why ActiveMQ tops out far below Kafka on raw throughput.
KahaDB / journal persistence for guaranteed delivery | Durable messages with JMS persistence and transactional guarantees | Disk-bound throughput; Classic's KahaDB indexing can become a hotspot under load | When message volume outgrows the journal's write path and latency climbs unpredictably | Artemis's journal-only, index-light design is the main reason it sustains higher throughput than Classic at scale.
Moderate throughput ceiling for messaging-not-streaming | Solid for moderate enterprise workloads with low latency | Not built for millions of msg/s; sustained high-throughput single-broker work favors Artemis, and Kafka beyond that | When someone benchmarks Classic against Kafka for a firehose workload and is surprised it cannot keep up | Artemis can rival Kafka in some benchmarks for specific workloads, but Kafka's partitioned horizontal scale-out is a different class for sustained streaming.
Network-of-brokers / store-and-forward for topology reach | Federate brokers across sites for geographic distribution | Configuration complexity, message-loop risks, and harder failure reasoning | When a misconfigured network of brokers creates message loops or duplicate delivery across sites | Powerful for legacy WAN topologies, but it is exactly the kind of bespoke configuration that becomes unmaintainable. Prefer simpler HA pairs where you can.

2. Use Cases

One table per technology. The Driving Property is the single thing that ruled out the alternative.

Amazon SQS

Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative
Decoupling microservices on AWS | Any AWS-native service team offloading async work | Zero broker ops, integrates with Lambda/SNS/EventBridge out of the box | From near-zero to unlimited standard-queue throughput | Kafka would add a cluster to run; the team has no platform team for it
Order-processing pipeline with retries | E-commerce checkout backend | Visibility timeout + DLQ give automatic retry and poison isolation | Thousands of orders/s during peak, bursty | Redis Pub/Sub would lose orders on any consumer blip; SQS retries them
Buffering in front of a rate-limited downstream | Service calling a third-party API with strict quotas | Consumers pull at their own pace, natural backpressure | Smooths 10x spikes into steady downstream load | A push broker would overwhelm the rate-limited dependency
Strict per-entity ordering for financial events | Ledger or inventory updates needing exactly-once per account | FIFO queue: exactly-once, strict per-message-group ordering | Up to 70K TPS per API action in high-throughput mode (select regions) | Standard queues reorder and duplicate; Kafka needs self-managed ops
Fan-out work distribution to idempotent workers | Image/video transcoding job farm | At-least-once delivery with horizontal worker scaling | Millions of jobs/day across an autoscaling fleet | RabbitMQ would need a broker to operate; SQS is fully managed

Apache Kafka

Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative
Central event backbone with many consumers | LinkedIn (Kafka's origin), large event-driven orgs | Durable log lets N independent consumer groups read the same stream | Trillions of messages/day across thousands of topics | SQS deletes on consume; you cannot have many independent readers of one stream
Stream processing and real-time analytics | Clickstream / telemetry pipelines feeding Flink or Kafka Streams | Ordered, replayable partitions with high sustained throughput | Millions of events/s, sub-second end-to-end | RabbitMQ degrades on deep backlogs; it is not a streaming substrate
Event sourcing and CDC | Database change-data-capture via Debezium into Kafka | Immutable ordered log is the system of record for changes | Full change history retained, replayable from offset 0 | No queue retains an infinite replayable history by design
Log/metric aggregation pipeline | Centralized observability ingestion | High write throughput with tiered storage for cheap retention | TB/day ingest, weeks of retention | ActiveMQ's broker-centric model tops out well below this volume
Decoupling at scale with replay safety | Large microservice estate needing reprocessing capability | Consumers rewind offsets to reprocess after a bug fix | Hundreds of services, replay of days of events | SQS's 14-day cap and delete-on-consume prevent arbitrary replay
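
Several of these use cases rest on the same property: reading from the log does not delete anything, and each consumer group tracks its own position. A toy single-partition model can make that concrete. It is a sketch, not the Kafka client API, and `ToyLog`, its method names, and the group names are hypothetical.

```python
class ToyLog:
    """Append-only log for one partition; reads never remove records."""

    def __init__(self):
        self.records = []
        self.positions = {}  # consumer group -> next offset to read

    def append(self, value: str) -> int:
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def poll(self, group: str, max_records: int = 10) -> list:
        """Return (offset, value) pairs starting at the group's position."""
        start = self.positions.get(group, 0)
        return list(enumerate(self.records[start:start + max_records], start))

    def commit(self, group: str, next_offset: int) -> None:
        """Record progress; call this only after processing has finished."""
        self.positions[group] = next_offset

    def seek(self, group: str, offset: int) -> None:
        """Rewind (or skip) a group to an arbitrary retained offset."""
        self.positions[group] = offset


log = ToyLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# "billing" processes a batch, then commits the next offset to read.
batch = log.poll("billing")
billing_first = [value for _, value in batch]
log.commit("billing", batch[-1][0] + 1)
billing_idle = log.poll("billing")

# "analytics" is an independent group; billing's progress does not affect it.
analytics = [value for _, value in log.poll("analytics")]

# After a bug fix, billing rewinds and reprocesses from the beginning.
log.seek("billing", 0)
billing_replay = [value for _, value in log.poll("billing")]

print(billing_first, billing_idle, analytics, billing_replay)
```

A delete-on-consume queue cannot express the last two steps, which is the "Why Not Alternative" column in miniature.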

RabbitMQ

Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative
Complex routing to many queue topologies | Workflow engine routing by message attributes | Exchanges/bindings route without consumer-side filtering | Many queues, tens of thousands of msg/s per node | Kafka has no native content-based routing; you would filter consumer-side
RPC / request-reply messaging | Synchronous service calls over a broker | Per-message ack, reply-to queues, correlation IDs | Low-latency request/response at moderate volume | Kafka's log model is awkward for short-lived reply semantics
Task queues with priority and per-message TTL | Background job system with priorities | Priority queues, message TTL, dead-lettering built in | Moderate throughput, rich per-message semantics | SQS lacks native priority; Kafka lacks per-message TTL and priority
Reliable workflows needing data safety | Financial or order workflows on quorum queues | Raft-replicated quorum queues, predictable failover | ~30K msg/s on a quorum queue with 1KB messages, 3-node replication | Redis Pub/Sub has no durability; this work cannot tolerate loss
Multi-protocol integration hub | Bridging AMQP, MQTT, and STOMP clients | Native multi-protocol support in one broker | Mixed IoT + backend client fleet | SQS and Kafka do not natively span MQTT/STOMP/AMQP

Redis Pub/Sub

Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative
Live UI updates and presence | Real-time dashboards, live cursors, "user is typing" | Sub-ms fanout, loss of one update is harmless | Many subscribers, transient state | A durable broker is overkill; the next update overwrites the last anyway
Cache invalidation signals | Multi-node app invalidating local caches | Instant broadcast to all app nodes | Cluster-wide signal fanout | SQS/Kafka latency and ops overhead are unjustified for a fire-and-forget ping
Chat where missed messages are acceptable | Ephemeral chat / low-stakes notifications | Lowest-latency broadcast, simplicity | High message rate, loss-tolerant | If history matters, you need Streams or a queue; here it does not
Cross-process eventing within one app | Coordinating workers already sharing a Redis instance | No new infra, trivial API | Internal signaling, low durability need | Standing up Kafka for an internal ping is disproportionate
Real-time leaderboard / score broadcast | Gaming presence and live score push | Sub-millisecond fanout to connected players | Many concurrent connected clients | Durability is unneeded; only currently-connected players matter

ActiveMQ (Classic & Artemis)

Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative
Legacy enterprise Java messaging | Bank or insurer with a large JMS-based estate | Full JMS compliance, transactional messaging, durable subscriptions | Moderate enterprise volumes, strict reliability | Kafka is not JMS; porting decades of JMS app code is a multi-year project
Multi-protocol broker for mixed clients | Integration layer spanning AMQP, MQTT, STOMP, OpenWire | One broker speaks many protocols natively | Heterogeneous client fleet | SQS is AWS-API-only; Redis Pub/Sub is single-protocol
IoT ingestion over MQTT into JMS backend | Industrial telemetry into enterprise systems | MQTT front, JMS back, in one broker (esp. Artemis) | Many devices, moderate per-device rate | Kafka needs an MQTT bridge; ActiveMQ handles it inline
Guaranteed-delivery point-to-point queues | Order or transaction handoff between services | Persistent queues with redelivery and transactions | Moderate throughput, zero-loss requirement | Redis Pub/Sub cannot guarantee delivery at all
Managed broker on AWS without re-architecting | Lift-and-shift JMS app to Amazon MQ | Managed ActiveMQ keeps existing JMS code unchanged | Existing enterprise workload moved to cloud | Moving to SQS/Kafka would require rewriting the messaging layer

3. Limitations

Matrix layout: rows are limitation categories, columns are technologies. Severity reflects how hard the constraint bites a common workload.

Limitation Axis | SQS | Kafka | RabbitMQ | Redis Pub/Sub | ActiveMQ
Durability / persistence | [Medium] Durable but 14-day max retention, no replay | [Medium] Durable; disk cost scales with retention | [Medium] Durable via quorum queues; not built for deep backlogs | [Critical] None. At-most-once, messages lost if no live subscriber | [Medium] Durable via journal; disk-bound throughput
Replay / history | [High] Delete-on-consume, no replay | [Medium] Full replay is a core strength, gated only by retention | [High] Queues delete on ack; replay only via Streams | [Critical] No history at all | [High] Queues consume-and-remove; no offset replay
Throughput ceiling | [Medium] Standard near-unlimited; FIFO capped per group ID | [Medium] Millions/s; partition-count tuning required | [High] Per-node ceiling; clusters add HA not linear scale | [Medium] Very high fanout but classic Pub/Sub floods cluster nodes | [High] Broker-centric; tops out below Kafka, Classic below Artemis
Ordering guarantees | [Medium] Standard: none. FIFO: per group ID only | [Medium] Per-partition only, never global | [Medium] Per-queue FIFO; lost across competing consumers/requeues | [High] Best-effort only, no guarantee under reconnect | [Medium] Per-destination; message groups for strict ordering
Consumer work distribution | [Medium] Competing consumers native; scales well | [Medium] Bounded by partition count, not consumer count | [Medium] Competing consumers native; prefetch tuning critical | [Critical] No consumer groups; every subscriber gets every message | [Medium] Queues distribute; topics broadcast
Message size | [Medium] 256 KB cap; large payloads need S3 claim-check | [Medium] Default ~1 MB; large messages hurt throughput | [Medium] Large messages strain memory-first design | [Medium] Large payloads buffer in RAM, raising pressure | [Medium] Large messages need blob/stream handling, hit journal
Operational burden | [Medium] Lowest in the set; fully managed | [High] Highest; even KRaft needs partition/disk/rebalance expertise | [High] Cluster, quorum, prefetch, flow-control tuning | [Medium] Low if Redis already run; couples failure domains | [High] Two flavors, HA config, journal tuning, network-of-brokers
Cross-region / geo | [Medium] Regional service; cross-region is app-level | [High] MirrorMaker 2 is async, operationally heavy | [High] Federation/shovel async, not seamless | [High] No native geo Pub/Sub durability | [High] Network-of-brokers works but is loop-prone

4. Fault Tolerance

Matrix layout. RTO is time to restore service, RPO is tolerable data loss.

Dimension | SQS | Kafka | RabbitMQ | Redis Pub/Sub | ActiveMQ
Replication model | Redundant across multiple AZs in-region, managed by AWS | Leader-follower per partition, ISR-based, RF configurable (3 typical) | Quorum queues: Raft across odd node count (3/5) | None for Pub/Sub messages; Redis HA replicates keyspace, not in-flight pub/sub | Shared-store or replicated journal (Artemis); primary/backup pairs
Failure detection | Transparent, AWS-internal | Controller heartbeats; ISR shrink on lag (seconds) | Raft election on leader loss; net-tick timeouts | Sentinel/Cluster gossip for the node, not for pub/sub delivery | Lock/heartbeat on shared store or replication link
Failover mechanism | Automatic, invisible to clients | Automatic leader re-election from ISR | Automatic Raft leader election, predictable | Replica promotion for the node; in-flight pub/sub messages are lost | Backup broker takes the store lock and activates
RTO (typical) | ~0, no client-visible outage | Seconds for leader re-election | Seconds for quorum re-election | Seconds for node failover; pub/sub gap during it | Seconds to tens of seconds for backup activation
RPO (typical) | 0 for committed messages | 0 with acks=all + min.insync.replicas=2; non-zero with acks=1 | 0 for quorum queues with majority intact | Total loss of in-flight pub/sub messages by design | 0 with replicated/shared store; loss possible on async config
Split-brain behavior | N/A, managed; no operator-visible split-brain | Quorum (KRaft) prevents it; minority controllers step down | Raft majority rule; minority partition rejects writes | Cluster minority stops serving; risk of lost writes on partition | Store lock prevents dual-primary on shared store; replication needs care
Blast radius, single-node loss | None visible; AWS absorbs it | Partitions led by that broker re-elect; brief per-partition pause | Queues with leader on that node re-elect; brief unavailability | Subscribers on that node disconnect; their in-flight messages lost | Destinations on that broker unavailable until backup activates
Cross-region failover | App-level; queue is regional | MirrorMaker 2, async, manual cutover, offset translation needed | Federation/shovel, async, manual | No native cross-region pub/sub durability | Network-of-brokers, async, complex
Data loss scenarios | Effectively none for committed messages within retention | acks=1 + leader loss; or all ISR lost simultaneously | Loss only if majority of quorum nodes lost at once | Any disconnect, slow consumer, or no-subscriber publish | Async journal flush + crash; misconfigured non-persistent delivery
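
The Kafka and RabbitMQ cells in the RPO and data-loss rows reduce to two small pieces of arithmetic. The sketch below spells them out under simplifying assumptions: every replica starts in sync, and failures are clean node losses. Both function names are hypothetical.

```python
def raft_tolerated_failures(nodes: int) -> int:
    """Node losses a Raft group (e.g. a quorum queue) survives with a majority left."""
    return (nodes - 1) // 2


def kafka_write_tolerated_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replica losses after which acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas


for n in (3, 4, 5):
    print(f"Raft group of {n}: survives {raft_tolerated_failures(n)} node loss(es)")

for rf, misr in ((3, 2), (3, 1), (5, 3)):
    spare = kafka_write_tolerated_failures(rf, misr)
    print(f"RF={rf}, min.insync.replicas={misr}: writes continue after {spare} replica loss(es)")
```

Two readings follow. A fourth Raft node buys no extra failure tolerance over three, which is why odd counts are the norm. And RF=3 with min.insync.replicas=1 keeps accepting writes with two replicas down, but each acknowledged write may then exist on a single broker, which is the exposure the durability floor is meant to close.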

5. Sharding / Partitioning

How each system divides load across nodes. "Sharding" here means horizontal partitioning of message storage or routing.

Dimension | SQS | Kafka | RabbitMQ | Redis Pub/Sub | ActiveMQ
Sharding model | Internal partitions, AWS-managed; FIFO partitions by message group ID | Explicit partitions per topic; hash on key, or spread keyless records across partitions | No native sharding; queues pinned to nodes (sharding plugin exists) | Cluster hash slots; sharded Pub/Sub keys channel to a slot | No native partitioning; scale via multiple destinations/brokers
Shard key constraints | FIFO group ID chosen by producer; standard is keyless | Partition key set at produce time; null key = no key affinity (sticky or round-robin spread, depending on client version) | N/A natively; consistent-hash exchange approximates it | Channel name hashes to a slot (sharded Pub/Sub) | N/A; message groups pin a group to one consumer for ordering
Rebalancing mechanism | Automatic partition allocation by AWS as load rises | Partition reassignment tool; consumer-group rebalance (KIP-848 incremental) | Manual queue placement / policy; no automatic data rebalance | Cluster slot migration (CLUSTER SETSLOT) | Manual; add brokers/destinations and rebalance clients
Rebalancing cost / impact | Transparent, no client impact | Partition moves copy data and add network load; throttle to limit impact | Moving a queue means draining/recreating; disruptive | Slot migration is online but adds latency during move | Operationally manual, can require client reconnection
Hot-shard behavior | FIFO: a hot group ID serializes; standard spreads automatically | Hot partition (skewed key) overloads one broker; no auto-split | Hot queue saturates its host node; manual resharding | Hot slot concentrates on one node; classic Pub/Sub floods all nodes | Hot destination saturates its broker
Maximum shards (practical) | Managed; no operator-facing partition limit | KRaft handles far more than ZooKeeper's ~200K; per-cluster practical limits remain (latency rises past ~1K-4K active partitions on modest clusters) | Thousands of queues per node before management overhead bites | 16,384 hash slots fixed in Cluster | Many destinations, but broker memory/threads bound it
Resharding without downtime? | Yes, automatic and invisible | Adding partitions is online but breaks key-to-partition mapping and ordering | No clean online resharding of a queue's data | Yes, online slot migration | Adding brokers is online; existing destinations do not auto-redistribute
Cross-shard consumption | N/A; consumers poll the queue, AWS hides partitions | Consumer group spans all partitions of a topic transparently | Consume per-queue; cross-queue is app-level | Sharded Pub/Sub confines to a slot; classic spans cluster | Per-destination; composite destinations aggregate
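
The Redis column refers to 16,384 hash slots and to sharded Pub/Sub channels hashing to a slot. A minimal sketch of that mapping follows, based on the Redis Cluster key-hashing rule: CRC16 (XMODEM variant) of the name, modulo 16384, with `{...}` hash tags narrowing what gets hashed. It assumes that `binascii.crc_hqx` with an initial value of 0 computes that CRC variant. The `hash_slot` name and the sample channel names are hypothetical.

```python
import binascii

HASH_SLOTS = 16384


def hash_slot(name: str) -> int:
    """Slot for a key or sharded Pub/Sub channel name.

    If the name contains '{...}' with at least one character inside the first
    such pair, only that substring is hashed (the hash-tag rule).
    """
    data = name.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # Assumption: crc_hqx(data, 0) is the CRC16/XMODEM variant Redis Cluster uses.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


for channel in ("scores", "{match42}.scores", "{match42}.chat"):
    print(channel, hash_slot(channel))
```

The two `{match42}` channels land in the same slot by construction, which is how related sharded channels can be kept on one shard. Pick the tag badly and the same rule produces the hot slot described in the table.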

6. Replication

How copies are kept and what consistency you get. The Redis Pub/Sub row is the cautionary one: keyspace replicates, in-flight pub/sub does not.

Dimension | SQS | Kafka | RabbitMQ | Redis Pub/Sub | ActiveMQ
Replication topology | Managed multi-AZ; opaque to user | Leader-follower, single leader per partition | Quorum queues: Raft leader-follower | Primary-replica for keyspace; pub/sub itself not replicated | Primary-backup; shared store or journal replication
Sync vs async | Synchronous within region (managed) | Sync to ISR with acks=all; async cross-region (MM2) | Sync to Raft majority before ack | Async replication by default (can lose recent writes) | Configurable; replicated journal can be sync
Replication factor | Managed, not user-configurable | Configurable per topic; RF=3 typical | Quorum: odd count, 3 or 5 typical | Configurable replicas per primary; irrelevant to pub/sub delivery | Typically one backup; Artemis supports replication groups
Consistency options | At-least-once (standard); exactly-once processing within the 5-minute deduplication window (FIFO) | Tunable via acks (0/1/all) + min.insync.replicas | Strong for quorum queues (majority ack) | At-most-once for pub/sub, no tuning | Persistent (durable) vs non-persistent per message
Replication lag (typical) | Not exposed; effectively negligible | Sub-second in-region for healthy ISR | Sub-second within a healthy cluster | Async, sub-second but can lose the unreplicated tail on failover | Low on sync journal replication; higher if async
Conflict resolution | N/A; single-region authoritative | No conflicts; single leader per partition is authoritative | No conflicts; Raft leader authoritative | Last-write-wins on keyspace; N/A for pub/sub | Single active broker authoritative; no multi-master conflicts
Cross-region replication | Not native; app-level fan-out | MirrorMaker 2, async, with offset translation | Federation / shovel, async | CRDT-based active-active in Redis Enterprise, not OSS pub/sub | Network-of-brokers, async store-and-forward
Behavior during partition | Managed; no operator-visible partition handling | Minority leaders step down; majority continues with ISR | Minority side rejects writes; majority continues | Minority stops serving; messages during the gap are lost | Store lock prevents dual-primary; backup waits
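
The replication-factor and majority-ack rows reduce to simple arithmetic. The sketch below is illustrative only and the function names are hypothetical. It computes how many replicas can be lost before acknowledged writes stop, for a Raft group (as used by RabbitMQ quorum queues) and for a Kafka topic written with acks=all.

```python
def raft_quorum(members: int) -> int:
    """Smallest majority of a Raft group."""
    return members // 2 + 1


def raft_tolerated_failures(members: int) -> int:
    """Members that can fail while a majority can still acknowledge writes."""
    return members - raft_quorum(members)


def kafka_tolerated_failures(replication_factor: int, min_insync: int) -> int:
    """Replicas that can drop out of the ISR before acks=all produces are rejected."""
    if not 1 <= min_insync <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync


for size in (3, 4, 5):
    print(f"Raft group of {size}: quorum={raft_quorum(size)}, "
          f"tolerates {raft_tolerated_failures(size)} failure(s)")

print("Kafka RF=3, min.insync.replicas=2: tolerates",
      kafka_tolerated_failures(3, 2), "replica loss")
```

A group of four tolerates the same single failure as a group of three, which is why the table recommends odd counts for quorum queues.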

7. Better Usage Patterns

Where PE depth shows. The anti-patterns that show up in code review and the corrections that compound at scale. One table per technology.

Amazon SQS

Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters
Idempotent consumers | Assume standard queue delivers once and skip dedup | Design a dedup key and make every consumer idempotent from day one | Standard queues will deliver duplicates; idempotency is the contract, not an optimization
Long polling by default | Leave WaitTimeSeconds at 0 and burn empty receives | Set WaitTimeSeconds=20 on every consumer | Short polling multiplies API cost and CPU for no benefit; this is free money left on the table
DLQ with alarms | Configure a DLQ but never alarm on its depth | CloudWatch alarm on DLQ message count, treat non-zero as an incident | A silent DLQ is a data-loss timer; failed messages rot unnoticed
Visibility timeout heartbeating | Set one giant fixed timeout to cover the slowest job | Use ChangeMessageVisibility to extend during long processing | A huge fixed timeout delays retries on genuine failures; heartbeating keeps fast recovery
Claim-check for large payloads | Try to cram big blobs near the message size limit (1 MiB since August 2025, previously 256 KB) | Store payload in S3, put the pointer in SQS (extended client) | Keeps the queue fast and within limits; the broker stays a control plane, not a data store
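
The idempotent-consumer row can be sketched in a few lines. This is a minimal illustration, not an SQS client: the `IdempotentConsumer` class, the `dedup_id` field, and the in-memory set are all assumptions made for the example. A real deployment would keep seen keys in a durable store with a TTL.

```python
import hashlib


class IdempotentConsumer:
    """Wrap a handler so that redelivered messages are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        # ASSUMPTION: in-memory set; production would use a durable store with TTL.
        self._seen = set()

    @staticmethod
    def dedup_key(message: dict) -> str:
        """Prefer an explicit business ID; fall back to a hash of the body."""
        explicit = message.get("dedup_id")  # hypothetical field name
        if explicit is not None:
            return str(explicit)
        return hashlib.sha256(message["body"].encode("utf-8")).hexdigest()

    def handle(self, message: dict) -> bool:
        """Return True if the handler ran, False if the message was a duplicate."""
        key = self.dedup_key(message)
        if key in self._seen:
            return False
        self._handler(message)
        # Record the key only after success, so a failed attempt is retried.
        self._seen.add(key)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
message = {"dedup_id": "order-42", "body": "charge card"}

print(consumer.handle(message))  # first delivery runs the handler
print(consumer.handle(message))  # duplicate delivery is skipped
print(len(processed))
```

Recording the key after the handler succeeds keeps at-least-once behaviour on failure: if the handler raises, the message is not marked as seen and the next delivery is processed normally.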

Apache Kafka

Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters
Partition count sizing | Pick a round number, or over-provision 10K partitions "to be safe" | Size from target throughput and consumer parallelism; leave headroom but not 10x | Too few caps throughput; too many spikes latency and recovery time. It is hard to change later without breaking ordering
Durability floor | Set acks=1 for speed without realizing the loss window | acks=all + min.insync.replicas=2 on RF=3 for anything that matters | acks=1 silently drops the unreplicated tail on leader loss; most teams never knew they opted out of durability
Commit after processing | Use auto-commit and lose in-flight messages on crash | Disable auto-commit; commit offsets only after successful processing | Auto-commit is at-most-once disguised as convenience; manual commit gives real at-least-once
Keyed partitioning for ordering | Send with null keys then wonder why per-entity order breaks | Partition by the entity key so all its events land in one partition | Kafka only orders within a partition; the key is how you buy per-entity ordering
Tiered storage for retention | Buy ever-bigger broker disks to extend retention | Enable tiered storage, offload old segments to object storage | Decouples retention from broker disk; cheaper long replay without giant local volumes
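
One way to turn the partition-sizing row into a number is the common rule of thumb of dividing target throughput by measured per-partition throughput on the produce and consume sides, then taking the larger result. The sketch below assumes that heuristic. The function name, the 1.5x headroom default, and every figure in the example are hypothetical, so substitute measurements from your own cluster.

```python
import math


def size_partitions(target_mb_s: float,
                    produce_mb_s_per_partition: float,
                    consume_mb_s_per_partition: float,
                    consumer_instances: int,
                    headroom: float = 1.5) -> int:
    """Estimate a partition count from throughput and consumer parallelism."""
    by_produce = target_mb_s / produce_mb_s_per_partition
    by_consume = target_mb_s / consume_mb_s_per_partition
    # The slower side of the pipeline sets the throughput-driven minimum.
    throughput_driven = math.ceil(max(by_produce, by_consume) * headroom)
    # A consumer group cannot use more instances than there are partitions.
    return max(throughput_driven, consumer_instances)


# Hypothetical workload: 200 MB/s target, 20 MB/s produce and 10 MB/s consume
# per partition, 12 consumer instances.
print(size_partitions(200, 20, 10, consumer_instances=12))
```

With these inputs the consume side dominates (20 partitions before headroom), and the modest headroom factor lands well short of the "10x to be safe" over-provisioning the table warns against.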

RabbitMQ

Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters
Prefetch (QoS) tuning | Leave prefetch unlimited; one consumer grabs the whole backlog | Set a bounded prefetch sized to processing time and consumer count | Unlimited prefetch defeats load balancing and risks consumer-side memory blowups
Quorum queues for durable work | Still using (or referencing) mirrored classic queues | Use quorum queues; mirrored classic queues were removed in 4.0 | Quorum queues give predictable failover and better throughput; classic mirroring is gone
Manual ack after processing | Use auto-ack and lose messages when a consumer dies mid-work | Manual ack only after the work is durably done | Auto-ack is at-most-once; manual ack is the safety guarantee RabbitMQ is chosen for
Keep queues short | Let backlogs grow into the millions and hit flow control | Scale consumers to keep queues near-empty; use Streams for log-style retention | Deep queues trigger paging and publisher throttling; RabbitMQ is a broker, not a log store
Topology as code | Click exchanges and bindings into existence in the UI | Declare exchanges/queues/bindings via definitions or IaC | Undocumented broker config becomes tribal knowledge and a rerouting hazard
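
The prefetch row is easier to see with a toy model. The sketch below is a deliberately simplified simulation, not RabbitMQ client code: it assumes consumers attach one after another to an existing backlog and that nothing is acknowledged in between. The `dispatch` function and worker names are hypothetical.

```python
def dispatch(backlog: int, consumers: list, prefetch=None):
    """Model a broker pushing a backlog to consumers as they attach in order.

    prefetch=None models an unlimited prefetch window.
    Returns (unacked messages held per consumer, messages left in the queue).
    """
    remaining = backlog
    unacked = {}
    for name in consumers:
        # Each consumer is handed messages until its prefetch window is full.
        capacity = remaining if prefetch is None else min(prefetch, remaining)
        unacked[name] = capacity
        remaining -= capacity
    return unacked, remaining


print(dispatch(1000, ["worker-a", "worker-b"]))               # unlimited prefetch
print(dispatch(1000, ["worker-a", "worker-b"], prefetch=50))  # bounded prefetch
```

In the unlimited case the first worker holds the entire backlog in memory and the second sits idle. With a bound, both workers hold a small window and the rest stays on the broker, where it can be handed to whichever consumer acknowledges first.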

Redis Pub/Sub

Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters
Use Streams when you need a queue | Reach for Pub/Sub as a task queue, then lose jobs | Use Redis Streams (consumer groups, ACK, PEL) for any durable work | Pub/Sub is at-most-once with no groups; Streams give at-least-once and work distribution
Hybrid Streams + Pub/Sub | Pick one and lose either durability or latency | Persist to a Stream for durability, fire a Pub/Sub message as the low-latency trigger | Reconnecting clients catch up from the Stream; you get both speed and recoverability
Check PUBLISH return value | Publish and assume someone received it | Treat the returned subscriber count as a delivery signal; alarm on zero | A no-subscriber publish is a silent drop; the count is your only feedback
Sharded Pub/Sub on Cluster | Use classic Pub/Sub on a large cluster and flood every node | Use SPUBLISH/SSUBSCRIBE (Redis 7+) to confine messages to a shard | Classic cluster-wide fanout scales cost with node count, not subscriber count
Isolate the failure domain | Run Pub/Sub on the same instance as the cache | Dedicate an instance for messaging when it matters | A pub/sub storm should not be able to starve your cache workload
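
The hybrid Streams + Pub/Sub row and the PUBLISH return-value row can be illustrated together. The sketch below uses an in-memory stand-in, not a Redis client: `ToyBus`, `publish_durably`, and the list-backed log are assumptions made for the example. In real Redis the log would be a Stream and the trigger a PUBLISH, whose reply is the number of clients that received the message.

```python
class ToyBus:
    """In-memory model of a durable log plus a fire-and-forget channel."""

    def __init__(self):
        self.stream = []       # append-only log (stands in for a Stream)
        self.subscribers = []  # live callbacks (stand in for connected subscribers)

    def append(self, message) -> int:
        """Persist a message and return its offset in the log."""
        self.stream.append(message)
        return len(self.stream) - 1

    def publish(self, payload) -> int:
        """Deliver to whoever is connected right now; return the receiver count."""
        for callback in self.subscribers:
            callback(payload)
        return len(self.subscribers)

    def read_from(self, offset: int) -> list:
        """Replay persisted messages starting at an offset."""
        return self.stream[offset:]


def publish_durably(bus: ToyBus, message):
    """Write to the log first, then send the offset as a low-latency trigger."""
    offset = bus.append(message)
    receivers = bus.publish(offset)
    return offset, receivers


bus = ToyBus()
offset, receivers = publish_durably(bus, "job-1")
print(receivers)  # nobody is subscribed yet, so the trigger reached no one

# A subscriber that connects later missed the trigger but can catch up from the log.
print(bus.read_from(0))
```

A zero receiver count is the signal the table says to alarm on. Because the message was appended before the trigger fired, the missed notification costs latency rather than data.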

ActiveMQ (Classic & Artemis)

Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters
Choose Artemis for new builds | Default to Classic out of habit for greenfield work | Start on Artemis unless you have a hard Classic dependency | Artemis's non-blocking journal architecture sustains higher throughput, and it has long been positioned as the next-generation ActiveMQ broker
Name the flavor in design docs | Write "ActiveMQ" and let ops guess Classic vs Artemis | Specify Classic or Artemis explicitly with the assumed perf profile | The two differ in I/O, HA, and addressing; ambiguity causes capacity mismatches
Persistent vs non-persistent intent | Leave delivery mode at the default and assume durability | Set persistent delivery explicitly for anything that must survive a crash | Non-persistent messages are lost on broker restart; the default is not always what you want
Prefer HA pairs over networks of brokers | Build a sprawling network of brokers for "scale" | Use primary/backup HA pairs; reserve broker networks for genuine geo needs | Networks of brokers are loop-prone and hard to reason about; simpler HA is more reliable
Consider Amazon MQ for managed ops | Self-host and absorb all patching/HA burden | Run managed ActiveMQ (Amazon MQ) to keep JMS code but drop the ops load | Keeps the existing JMS investment while removing the broker on-call rotation

8. Advanced / Next-Gen Alternatives

Successors, adjacent tech that does a specific job better, and patterns that obviate the original need. One table per technology.

Amazon SQS

Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider
Amazon Kinesis / MSK | Replay, ordered streaming, multiple independent readers | Production | Medium, different consumption model | When you need replay or fanout that a delete-on-consume queue cannot give
EventBridge | Content-based routing, schema registry, SaaS event integrations | Production | Low for new event flows | When routing and event-driven choreography matter more than raw queueing
SNS + SQS fanout | One-to-many delivery to multiple queues | Production | Low, additive | When multiple consumers each need their own durable copy of an event

Apache Kafka

Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider
Redpanda | Kafka-API-compatible, C++/Raft, ~10x lower tail latency in vendor benchmarks, no JVM/ZooKeeper | Production | Low, wire-compatible with Kafka clients | When tail latency and operational simplicity matter and you want to keep Kafka clients
WarpStream / Kafka-on-S3 designs | Object-storage-backed, near-zero inter-AZ cost, stateless brokers | Emerging | Medium, Kafka-compatible API but different latency profile | When inter-AZ network cost dominates and you can tolerate higher latency
Apache Pulsar | Separated compute/storage (BookKeeper), native multi-tenancy and tiered storage | Production | High, different architecture and client model | When you need strong multi-tenancy and independent scaling of storage vs serving

RabbitMQ

Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider
RabbitMQ Streams | Append-only replayable log inside RabbitMQ for high-throughput log workloads | Production (4.x) | Low if staying on RabbitMQ; different client API | When you need log-style retention/replay without leaving RabbitMQ for Kafka
NATS / NATS JetStream | Lighter footprint, simpler ops, fast pub/sub plus optional persistence | Production | Medium, different protocol and semantics | When you want cloud-native simplicity and do not need AMQP's rich routing
Apache Kafka | Horizontal write scale-out and replay that a single-broker model cannot match | Production | High, different mental model | When backlogs are routinely millions deep or you need true streaming scale

Redis Pub/Sub

Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider
Redis Streams | Persistence, consumer groups, ACK, replay, at-least-once delivery | Production | Low, same Redis instance, new commands | The moment you need durability or work distribution. This is the default upgrade path
NATS / NATS Core | Purpose-built lightweight pub/sub with clustering and optional JetStream durability | Production | Medium, new system | When pub/sub is a first-class need rather than a Redis side-feature
Kafka / SQS for durable work | Real durability, retries, and at-least-once semantics | Production | Medium to high | When you discover Pub/Sub was never a safe place for work that cannot be lost

ActiveMQ (Classic & Artemis)

Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider
ActiveMQ Artemis | Non-blocking architecture, higher throughput, JMS 2.0, long positioned as the next-generation ActiveMQ broker | Production | Medium; client code often low-effort, but tooling/runbooks/HA differ | The default target for any Classic estate planning a 3+ year horizon
Apache Kafka | Massive horizontal throughput and replay for streaming workloads | Production | High, not JMS, requires rewrite | When the workload is really streaming/analytics, not enterprise JMS messaging
Amazon MQ (managed) | Removes broker ops while keeping ActiveMQ/JMS compatibility | Production | Low, same broker, managed control plane | When you want to keep JMS code but shed the operational burden

PE trade-off analysis generated 2026-05-27. Reflects Kafka 4.0 (KRaft-only, KIP-848 incremental rebalancing GA), RabbitMQ 4.x (quorum queues as default replicated type, mirrored classic queues removed in 4.0), Amazon SQS FIFO high-throughput mode, and ActiveMQ Classic and Artemis. Throughput and latency figures are workload-dependent; benchmark against your own profile before committing.