Pub/Sub Trade-Offs

Four ways to fan one event out to many consumers: Amazon SNS, Apache Kafka, Redis Pub/Sub, and the SNS+SQS composite. They are not interchangeable. The decisive axis is durability, not throughput.

Messaging / Event Distribution Head-to-Head (4)

As of 2026-05-21. AWS pricing is us-east-1 standard tier. Verify against the AWS pricing console for your region before quoting in a design doc.

PE Verdict

SNS+SQS is the default for durable application fan-out inside AWS. It is the only option here that gives every consumer its own durable, independently-acknowledged copy with per-consumer backpressure, at near-zero ops. Reach past it only for a reason you can name.

Kafka wins when you need replay, ordered partitions, or a shared event log read by many consumer groups at their own offset (analytics replaying history while a live consumer tails the head). You pay for it in cluster ops.

Raw SNS is the right primitive when the subscriber is itself a push endpoint (Lambda, HTTP, mobile push, email/SMS) and you do not need a buffer. Redis Pub/Sub is fire-and-forget: pick it only when losing messages during a disconnect is acceptable (live presence, ephemeral dashboards), never as a system of record.
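
The verdict can be read as a small decision procedure. What follows is a minimal sketch under stated assumptions: `Needs` and `pick_transport` are hypothetical names, not part of any library, and the rule order is one reasonable reading of the three paragraphs above rather than a definitive rule set. In particular, accepting loss is treated here as sufficient for Redis Pub/Sub, whereas in practice it is only necessary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical requirement flags for one fan-out design."""
    replay_or_shared_log: bool = False  # replay, ordered partitions, many groups on one log
    loss_acceptable: bool = False       # losing messages during a disconnect is fine
    push_endpoint_only: bool = False    # subscriber is Lambda, HTTP, mobile push, email/SMS
    needs_buffer: bool = True           # consumer needs a durable buffer and its own ack path


def pick_transport(needs: Needs) -> str:
    """Return the option the verdict points to for these requirements."""
    if needs.replay_or_shared_log:
        return "Kafka"
    if needs.loss_acceptable:
        return "Redis Pub/Sub"
    if needs.push_endpoint_only and not needs.needs_buffer:
        return "SNS"
    return "SNS+SQS"  # the default for durable application fan-out inside AWS
```

With no flags set the function falls through to SNS+SQS, which mirrors the "reach past it only for a reason you can name" rule.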

Framing

Pub/sub decouples a publisher from many subscribers. These options sit at different points on the durability and fan-out spectrum, so the right default depends on whether messages can be lost, replayed, or independently acknowledged.

Redis Pub/Sub: Ephemeral, in-memory, no persistence, no replay, no per-subscriber acknowledgement.
SNS: Managed push fan-out to subscribed endpoints, but no consumer-side buffer by itself.
SNS + SQS: SNS fans out, each consumer owns a durable queue, ack path, and backpressure buffer.
Kafka: Durable partitioned commit log with replay and independent consumer group offsets.


1. Trade-Offs


Amazon SNS (standalone)

Use standalone SNS when subscribers are push endpoints and you do not need a durable per-consumer buffer or replay path.

| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
| --- | --- | --- | --- | --- |
| Push delivery, no consumer buffer | Sub-second fan-out to endpoints with zero polling cost | No durable hold if the endpoint is down past the retry window | HTTP/S endpoint down longer than the retry policy: message lands in a DLQ or is dropped | Always attach a redrive DLQ. Raw SNS retries are finite, not infinite: the window depends on the endpoint type, up to 23 days for AWS-managed endpoints such as SQS and Lambda and far shorter for customer-managed endpoints such as HTTP/S |
| Managed, serverless | No brokers to run, scales to high publish rates automatically | No control over internals, no replay, AWS-locked | You need to reprocess yesterday's events: impossible, SNS keeps nothing | SNS is a router, not a store. If you need history, add Kinesis or archive to S3 via Firehose. SNS+SQS adds a buffer, not history |
| Best-effort ordering (Standard) | Near-unlimited throughput on Standard topics | Out-of-order and duplicate delivery | Consumer assumes ordering: state machine corrupts on reorder | FIFO topics fix ordering but cap at 300 TPS (3,000 batched) and deliver only to SQS queues; ordering holds end to end only into SQS FIFO |
| Native multi-protocol fan-out | One publish reaches Lambda, SQS, HTTP, email, SMS, push together | Per-subscription delivery is billed and metered separately | Large fan-out (1 publish, 50 subs) multiplies delivery cost and failure surface | A2P (SMS/email) carries very different cost and reliability than A2A (SQS/Lambda). Do not mix them in one mental model |
| Message filtering at the broker | Subscribers receive only matching messages, less consumer waste | Attribute filtering free, payload filtering billed per GB scanned | Heavy payload-based filter policies quietly add cost at scale | Prefer attribute-based filtering (free) over payload-based ($/GB scanned). Push filter logic into message attributes at publish time |
| At-least-once delivery | Simpler than building exactly-once | Consumers must be idempotent | Duplicate triggers a double charge or double email | Idempotency is the consumer's job on every option here except Kafka EOS and SNS/SQS FIFO |
| 64KB request chunking | Predictable per-request pricing | A 256KB message bills as 4 requests and 4 deliveries per endpoint | Large payloads silently 4x the bill | Keep payloads small, pass an S3 pointer for anything large (claim-check pattern) |
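
The 64KB chunking rule in the last row is plain ceiling arithmetic. A minimal sketch, assuming the chunk size and per-endpoint delivery billing exactly as stated in the table; `billed_requests` and `billed_deliveries` are hypothetical helper names, and actual billing should be checked against the AWS pricing page.

```python
import math

CHUNK_BYTES = 64 * 1024  # billing unit from the table: one request per 64KB chunk


def billed_requests(payload_bytes: int) -> int:
    """Number of billed requests for one publish of the given payload size."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return math.ceil(payload_bytes / CHUNK_BYTES)


def billed_deliveries(payload_bytes: int, subscriptions: int) -> int:
    """Deliveries are chunked the same way, once per subscribed endpoint."""
    return billed_requests(payload_bytes) * subscriptions


print(billed_requests(256 * 1024))        # 4
print(billed_deliveries(256 * 1024, 50))  # 200
print(billed_requests(200))               # 1, e.g. an S3 pointer (claim check)
```

The last line is the argument for the claim-check pattern: a pointer stays at one chunk no matter how large the object in S3 is.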

Apache Kafka

Choose Kafka when replay, ordered partitions, high-throughput streams, and independent consumer group offsets justify owning broker operations.

| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
| --- | --- | --- | --- | --- |
| Durable, replayable log | Replay from any offset, add new consumers that read all history | Storage cost and retention management | Retention too short: a backfilling consumer finds the data already aged out | Retention is the silent correctness lever. Tiered storage (KIP-405) decouples retention from broker disk |
| Partition-level ordering | Strict order within a partition, parallelism across partitions | No total order across the topic | You need global ordering: forces a single partition, killing throughput | Order is per-key, not per-topic. Choosing the partition key is the real design decision |
| Consumer groups + offsets | Many independent groups read the same stream at their own pace | Offset management complexity, rebalancing pauses | A slow consumer in a group triggers rebalances that stall the others | This is the property SNS+SQS fakes with N queues. Kafka does it with one copy of the data on disk |
| Pull-based consumption | Consumers control their own rate, natural backpressure | Consumers must poll, no native push to arbitrary endpoints | You wanted to trigger a Lambda directly: needs an extra connector or MSK trigger | Pull means lag is observable and bounded by the consumer, not the broker. Lag is your single best health metric |
| Self-managed or MSK | Full control, runs anywhere, no vendor lock-in | Cluster ops: brokers, rebalancing, ISR, controller, upgrades | A broker dies at 3am and under-replicated partitions page you | The ops burden, not throughput, is why teams pick SNS+SQS. MSK and Confluent reduce but do not erase it |
| Exactly-once semantics (EOS) | Transactional produce+consume, no dedup logic in app | Lower throughput, higher latency, only within Kafka | EOS does not extend to an external DB without an outbox or idempotent sink | EOS is intra-Kafka. Crossing to an external system still needs idempotency or the transactional-outbox pattern |
| High throughput via sequential IO | Millions of msg/s on modest hardware, zero-copy reads | Latency floor higher than in-memory Redis | Ultra-low-latency (sub-ms) fan-out: Kafka's batching adds milliseconds | Kafka optimizes throughput, not tail latency. For sub-ms presence signals, Redis Pub/Sub still wins |

Redis Pub/Sub

Use Redis Pub/Sub only for ephemeral broadcasts where losing messages during disconnects is an accepted product behavior.

| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
| --- | --- | --- | --- | --- |
| In-memory, fire-and-forget | Lowest latency here, sub-millisecond fan-out | Zero durability, no persistence, no replay | A subscriber reconnects after a blip: every message published during the gap is gone | This is the one disqualifying property. Redis Pub/Sub is not a message bus, it is a live signal wire |
| Dead simple model | PUBLISH / SUBSCRIBE, no offsets, no consumer groups | No backpressure, no per-consumer acks | A slow subscriber fills its output buffer and Redis disconnects it | Redis drops slow subscribers (client-output-buffer-limit) to protect itself. Silent message loss by design |
| Co-located with your cache | No new infrastructure if Redis is already in the stack | Couples messaging fate to cache fate | Cache eviction pressure or failover takes your event bus down with it | Sharing the Redis instance between cache and bus couples two unrelated failure domains. Separate them |
| Non-blocking publish path | Publisher never blocks waiting for subscribers | A message with zero subscribers vanishes instantly | Subscriber deploys lag publisher startup: early events lost | If no one is subscribed at publish time, the message is dropped. There is no "land and wait" |
| Pattern subscriptions (PSUBSCRIBE) | Wildcard topic matching out of the box | Pattern matching cost grows with the number of subscribed patterns | Thousands of pattern subs degrade publish latency | In Redis Cluster, plain Pub/Sub is not shard-aware: every publish is forwarded to every node. Sharded Pub/Sub (Redis 7+) avoids that, but it has no pattern-subscription variant |
| Cluster mode caveat | Sharded Pub/Sub scales fan-out across the cluster | Classic Pub/Sub broadcasts to all nodes, wasting cluster bandwidth | Migrating to Redis Cluster silently changes Pub/Sub semantics | Many teams hit this in production: classic SUBSCRIBE in cluster mode floods every node. Sharded Pub/Sub fixes it |
| Redis Streams as the durable cousin | If you need durability, Streams (XADD/XREADGROUP) gives consumer groups + persistence | Streams is a different data structure, not Pub/Sub | Teams reach for Pub/Sub when they actually needed Streams | If durability matters and you want to stay in Redis, the answer is almost always Streams, not Pub/Sub |
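
The two loss modes in the table (no subscriber at publish time, and a subscriber that reconnects after a gap) are easy to see in a toy model. The sketch below is an in-memory stand-in written for illustration only: `ToyPubSub` is a hypothetical class, not the Redis client API, and it models nothing beyond "deliver to whoever is connected right now".

```python
from collections import defaultdict


class ToyPubSub:
    """In-memory stand-in for fire-and-forget pub/sub (not the Redis client API)."""

    def __init__(self) -> None:
        # channel -> {subscriber name: inbox list}
        self._channels: dict[str, dict[str, list[str]]] = defaultdict(dict)

    def subscribe(self, channel: str, name: str, inbox: list[str]) -> None:
        self._channels[channel][name] = inbox

    def unsubscribe(self, channel: str, name: str) -> None:
        self._channels[channel].pop(name, None)

    def publish(self, channel: str, message: str) -> int:
        """Hand the message to currently connected subscribers only.

        Returns how many subscribers received it; nothing is stored.
        """
        inboxes = list(self._channels[channel].values())
        for inbox in inboxes:
            inbox.append(message)
        return len(inboxes)


bus = ToyPubSub()
dashboard: list[str] = []

receivers_before = bus.publish("presence", "alice:online")  # nobody subscribed yet
bus.subscribe("presence", "dashboard", dashboard)
bus.publish("presence", "bob:online")                       # delivered
bus.unsubscribe("presence", "dashboard")                    # simulated network blip
bus.publish("presence", "bob:offline")                      # published during the gap
bus.subscribe("presence", "dashboard", dashboard)           # reconnect
bus.publish("presence", "carol:online")                     # delivered

print(receivers_before)  # 0
print(dashboard)         # ['bob:online', 'carol:online']
```

The dashboard never learns that bob went offline, which is harmless for a presence dot and unacceptable for anything resembling a record.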

SNS + SQS (fan-out composite)

Default to SNS plus SQS for durable AWS application fan-out where every consumer needs its own buffer, retry path, and backpressure boundary.

| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
| --- | --- | --- | --- | --- |
| Durable per-consumer fan-out | Each consumer gets its own durable queue, acks independently | N copies of every message (one per queue), N times the SQS cost | Huge fan-out (1 event, 100 consumers) multiplies storage and request cost | This is the textbook AWS pattern. The cost multiplier is the price of decoupling consumer failure domains |
| Independent backpressure | A slow consumer's queue grows without affecting others | No shared replay, each queue drains once | You need a brand-new consumer to read history: SQS already deleted it | Unlike Kafka, a consumed message is gone. Adding a late consumer means it sees only future events |
| Fully managed, near-zero ops | No brokers, no partitions, no rebalancing, autoscaling built in | AWS lock-in, no control over internals | Multi-cloud or on-prem requirement appears: the pattern does not port | The operational simplicity is the entire reason to choose this over Kafka inside AWS |
| Built-in DLQ + visibility timeout | Failed messages redrive to a DLQ, in-flight messages hidden during processing | Visibility timeout tuning is a real source of bugs | Processing takes longer than the timeout: message redelivered, duplicate work | Set visibility timeout to ~6x expected processing time or use the heartbeat (ChangeMessageVisibility) pattern |
| At-least-once (Standard) or exactly-once (FIFO) | Choose ordering+dedup (FIFO) or throughput (Standard) per queue | FIFO caps throughput (300 msg/s, 3,000 batched) and SNS FIFO feeds only SQS queues | Ordering need at high volume forces an awkward FIFO+partitioning design | Mix and match: Standard SNS to Standard SQS for throughput, FIFO end-to-end only where order is mandatory |
| Cheap at moderate scale, polling cost at idle | First 1M requests/month free, $0.40/M after | Empty receives from short polling burn requests on idle queues | Many idle queues polled aggressively: surprise request bill | Always use long polling (WaitTimeSeconds=20). The 64KB chunk rule applies here too: large payloads 4x the count |
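
The idle-polling row is worth working through with numbers. This is a back-of-the-envelope sketch, assuming the $0.40 per million and 1M free-request figures quoted above, a 30-day month, and a hypothetical fleet of 100 idle queues with 2 pollers each; the one-call-per-second short-polling rate is also an assumption chosen for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600
PRICE_PER_MILLION = 0.40    # USD per million requests, figure quoted in the table
FREE_REQUESTS = 1_000_000   # monthly free tier quoted in the table


def idle_receives_per_month(queues: int, pollers_per_queue: int, seconds_per_receive: int) -> int:
    """Empty ReceiveMessage calls per month when every queue stays idle."""
    calls_per_poller = SECONDS_PER_MONTH // seconds_per_receive
    return queues * pollers_per_queue * calls_per_poller


def monthly_cost(requests: int) -> float:
    """Request charge in USD after the free tier."""
    return max(requests - FREE_REQUESTS, 0) / 1_000_000 * PRICE_PER_MILLION


short_poll = idle_receives_per_month(100, 2, seconds_per_receive=1)
long_poll = idle_receives_per_month(100, 2, seconds_per_receive=20)

print(short_poll, round(monthly_cost(short_poll), 2))  # 518400000 206.96
print(long_poll, round(monthly_cost(long_poll), 2))    # 25920000 9.97
```

Long polling does not make idle queues free, but under these assumptions it cuts the empty-receive bill by a factor of twenty.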

2. Use Cases

Amazon SNS (standalone)

| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Mobile / web push notifications | App sends order-status pushes to millions of devices | Native APNs/FCM/A2P delivery, no buffer needed | Millions of endpoints | Kafka has no native push to mobile; SQS does not deliver to devices |
| Lambda fan-out trigger | One event triggers several Lambdas in parallel | Push-based invoke, no polling, no idle cost | Thousands of invokes/s | SQS+Lambda needs polling config; Redis has no Lambda integration |
| System alerts to humans | CloudWatch alarm fans to email + SMS + Slack webhook | Multi-protocol A2P delivery from one publish | Low volume, high reliability | Only SNS speaks email/SMS/HTTP natively in one call |
| Cross-account event broadcast | Central account publishes events consumed by many team accounts | Topic policies allow cross-account subscription | Dozens of accounts | Kafka cross-account needs networking + ACL plumbing |
| Fan-out entry point (paired with SQS) | SNS is the fan-out hub feeding many SQS queues | Decouples publisher from a growing set of consumers | Many independent consumers | Raw SQS cannot fan one message to many queues; SNS is the splitter |

Apache Kafka

| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Event sourcing / CQRS | The log is the source of truth, read models project from it | Durable replayable ordered log | Billions of events retained | SQS deletes on consume; SNS keeps nothing; no replay anywhere else |
| Multi-consumer analytics + real-time | Same clickstream feeds Flink real-time and a nightly batch job | Independent consumer-group offsets on one copy | Millions of msg/s | SNS+SQS would need N duplicated queues and still no replay |
| Stream processing backbone | Kafka feeds Flink/Kafka Streams for joins and windowing | Ordered partitions + offset semantics | High throughput, stateful | Redis/SNS/SQS have no stream-processing ecosystem |
| Log / metrics aggregation pipeline | Thousands of services ship logs into topics, fan to sinks | High write throughput, cheap sequential storage | TB/day ingestion | SQS request pricing at this volume is brutal; Kafka is built for it |
| CDC and database replication | Debezium streams DB changes into Kafka for downstream sync | Ordered per-key change log, replayable | Continuous high volume | Ordering + replay are mandatory; only Kafka offers both here |

Redis Pub/Sub

| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Live presence / typing indicators | Chat app shows who is online and typing | Sub-ms latency, loss is harmless | Many ephemeral signals | Kafka/SQS add latency for a signal that is worthless if delayed |
| WebSocket fan-out across servers | Broadcast a message to all WS servers holding client connections | Lowest-latency in-memory broadcast | Many app servers | SNS/SQS round-trip is too slow for interactive broadcast |
| Live config / cache-invalidation ping | Tell all nodes to drop a cache key now | Instant best-effort broadcast | Cluster-wide | Durability is unnecessary; a missed invalidation self-heals on TTL |
| Real-time leaderboard / dashboard ticks | Push score updates to live dashboards | Low latency, stale data soon overwritten | High-frequency updates | Each tick supersedes the last, so loss does not matter |
| In-game ephemeral events | Broadcast transient in-match events to connected clients | Speed over guarantee | Bursty, latency-sensitive | Durable systems over-engineer a throwaway signal |

SNS + SQS (fan-out composite)

| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Order-event fan-out to microservices | OrderPlaced fans to billing, inventory, email, analytics queues | Each service buffers and acks independently | Many services, moderate volume | Raw SNS drops the message once retries are exhausted if a consumer stays down; this survives outages per consumer |
| Decoupled async job dispatch | An event spawns durable work units consumed by autoscaling workers | Durable buffer + DLQ + visibility timeout | Variable worker fleet | Redis loses jobs on disconnect; Kafka is ops-heavy for simple jobs |
| Reliable cross-service eventing in AWS | Internal events where loss is unacceptable but replay is not needed | At-least-once durable delivery, near-zero ops | Org-wide eventing | Kafka cluster ops not justified when replay is not required |
| Buffering bursty traffic to slow consumers | Spiky publish rate, downstream processes at a steady pace | Queue absorbs the burst, consumer drains steadily | 10x burst factors | SNS alone has no buffer; the SQS queue is the shock absorber |
| Ordered, exactly-once workflow steps | Sequential steps that must not duplicate or reorder | SNS FIFO to SQS FIFO end to end | Up to 300 msg/s (3,000 batched) | Redis/Standard give no ordering or dedup guarantee |

3. Limitations

| Limitation Axis | SNS | Kafka | Redis Pub/Sub | SNS+SQS |
| --- | --- | --- | --- | --- |
| Durability | High: No store; relies on subscriber availability | Medium: Durable to disk, bounded by retention | Critical: None; offline subscriber loses everything | Medium: Durable up to 14-day retention, then dropped |
| Replay / history | High: Impossible without archiving | Medium: Native, bounded by retention | Critical: No replay at all | High: None; consumed messages are deleted |
| Ordering | Medium: Best-effort; FIFO caps at 300 TPS | Medium: Per-partition only, not global | Medium: Per-channel best-effort, no guarantee on reconnect | Medium: Standard unordered; FIFO ordered but 300 TPS |
| Fan-out cost | Medium: Billed per subscription delivery | Medium: One copy on disk, cheap fan-out | Medium: Classic cluster mode broadcasts to all nodes | High: N queues = N copies = N times SQS cost |
| Operational burden | Medium: Low; managed | Critical: Brokers, ISR, rebalancing, controller, upgrades | Medium: Failover and buffer-limit tuning | Medium: Low; managed, but many queues to govern |
| Throughput ceiling | Medium: Very high Standard; 300 TPS FIFO | Medium: Millions/s, the highest here | Medium: High but single-threaded command path | Medium: Very high Standard; 300 TPS FIFO |
| Payload size | Medium: 256KB, billed in 64KB chunks | Medium: Default 1MB, tunable | Medium: Bounded by memory and buffer limits | Medium: 256KB (2GB via S3 extended client) |
| Portability / lock-in | High: AWS-only | Medium: Runs anywhere, open protocol | Medium: Open source, runs anywhere | High: AWS-only pattern |

4. Fault Tolerance

| Dimension | SNS | Kafka | Redis Pub/Sub | SNS+SQS |
| --- | --- | --- | --- | --- |
| Replication model | Internal, multi-AZ, opaque to user | Leader + ISR followers per partition | Primary + replica (async), or none | Internal, multi-AZ, opaque (SQS stores redundantly) |
| Failure detection | AWS-managed | Controller / KRaft heartbeats, ISR shrink | Sentinel or Cluster gossip | AWS-managed |
| Failover mechanism | Transparent, automatic | ISR leader election (seconds) | Sentinel promotes replica (seconds to tens of s) | Transparent, automatic |
| RTO (typical) | Near-zero (managed) | Seconds (leader election) | Seconds to tens of seconds | Near-zero (managed) |
| RPO (typical) | Zero for accepted msgs (best-effort delivery after) | Zero with acks=all + min.insync.replicas | High: in-flight + buffered msgs lost on failover | Zero for enqueued messages |
| Split-brain behavior | N/A, managed | Prevented by min.insync.replicas; unclean election risks loss | Possible with Sentinel misconfig; writes to old primary lost | N/A, managed |
| Blast radius, single node | None visible | Partitions led by that broker fail over; lag spike | If non-replicated, total loss of that shard's channels | None visible |
| Cross-region failover | Region-scoped; DR needs multi-region topics | MirrorMaker 2 / Confluent replication, manual | Active-active needs Redis Enterprise CRDTs | Region-scoped; DR needs multi-region design |
| Data loss scenario | Endpoint down past retry window with no DLQ | Unclean leader election or retention expiry | Routine: any disconnect, slow consumer, or restart | Message age exceeds retention (max 14 days) |

5. Sharding / Partitioning

| Dimension | SNS | Kafka | Redis Pub/Sub | SNS+SQS |
| --- | --- | --- | --- | --- |
| Sharding model | None visible (managed internally) | Explicit partitions, hash on key | Channel-based; Sharded Pub/Sub hashes channel to slot | None visible (managed internally) |
| Shard key constraints | N/A (FIFO uses MessageGroupId for ordering) | Partition key; same key always same partition | Channel name maps to slot in Sharded mode | N/A (FIFO uses MessageGroupId) |
| Rebalancing mechanism | Automatic, invisible | Consumer-group rebalance + partition reassignment | Cluster slot migration | Automatic, invisible |
| Rebalancing cost / impact | None to user | Stop-the-world pause (eager) or incremental (cooperative) | Slot migration moves keys; brief unavailability | None to user |
| Hot-shard behavior | N/A | Skewed key floods one partition; lag on that consumer | A hot channel concentrates load on one node | N/A; a hot queue just scales its consumer fleet |
| Max shards (practical) | N/A | Thousands of partitions/cluster (KRaft raised the ceiling) | 16,384 hash slots in Cluster | Effectively unlimited queues |
| Reshard without downtime? | N/A | Add partitions yes, but breaks key-to-partition mapping | Slot migration is online but operationally heavy | N/A; add queues freely |
| Cross-shard query | N/A | No cross-partition ordering; app must merge | No cross-channel semantics | N/A; queues are independent |
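
The Kafka column makes two claims that pull against each other: the same key always lands on the same partition, yet adding partitions breaks the key-to-partition mapping. A toy hash-mod partitioner shows both. This is a sketch under assumptions: Kafka's default partitioner hashes keys with murmur2, and `zlib.crc32` is swapped in here as a stand-in; `partition_for` and `remapped_keys` are hypothetical names.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Toy partitioner: stable hash of the key modulo the partition count.

    crc32 is a stand-in for the murmur2 hash Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def remapped_keys(keys: list[str], before: int, after: int) -> list[str]:
    """Keys whose partition changes when the partition count changes."""
    return [k for k in keys if partition_for(k, before) != partition_for(k, after)]


keys = [f"order-{i}" for i in range(1000)]

# Same key, same partition count: always the same partition.
stable = partition_for("order-42", 6) == partition_for("order-42", 6)

# Growing from 6 to 8 partitions moves a large share of keys elsewhere.
moved = remapped_keys(keys, before=6, after=8)
print(stable, len(moved) > 0)  # True True
```

Events for a moved key that were written before the resize sit on the old partition and later ones on the new partition, so per-key ordering across the resize is no longer guaranteed. That is why partition counts are usually sized generously up front.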

6. Replication

| Dimension | SNS | Kafka | Redis Pub/Sub | SNS+SQS |
| --- | --- | --- | --- | --- |
| Topology | Managed multi-AZ (opaque) | Leader-follower per partition | Primary-replica (async) | Managed multi-AZ (opaque) |
| Sync vs async | Managed | Configurable: acks=all is sync to ISR | Async; replica can lag the primary | Managed (synchronous across AZs) |
| Replication factor | Managed | Commonly 3 in production (the broker default is 1), tunable per topic | Typically 1 replica per primary | Managed (multiple AZs) |
| Consistency options | At-least-once (Std), exactly-once (FIFO) | Tunable via acks + min.insync.replicas + EOS | None; no consistency guarantee for Pub/Sub | At-least-once (Std), exactly-once (FIFO) |
| Replication lag | N/A | Sub-second healthy; watch ISR shrink | Async, can spike under load | N/A (managed) |
| Conflict resolution | N/A (single writer path) | No conflicts; single leader per partition | Last-write-wins on the primary | N/A |
| Cross-region replication | Not native; design-level | MirrorMaker 2 / cluster linking | Redis Enterprise active-active (CRDT) | Not native; design-level |
| Replication during partition | Managed | ISR shrinks; acks=all writes are rejected if the ISR falls below min.insync.replicas | Primary serves alone; replica diverges | Managed (stays consistent across AZs) |
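
The interaction between acks and min.insync.replicas in the Kafka column reduces to one comparison. The function below is a deliberately simplified model for reasoning about availability, not broker code; `write_accepted` is a hypothetical name, and it assumes the in-sync replica count includes the leader.

```python
def write_accepted(acks: str, in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """Simplified model of whether a produce request is acknowledged.

    in_sync_replicas counts the leader itself. With acks=all the write is
    rejected when the ISR is smaller than min.insync.replicas; with acks=1
    or acks=0 an available leader alone is enough.
    """
    if in_sync_replicas < 1:
        return False  # no leader available for the partition
    if acks == "all":
        return in_sync_replicas >= min_insync_replicas
    return True


# Replication factor 3 with min.insync.replicas=2:
print(write_accepted("all", 3, 2))  # True: healthy
print(write_accepted("all", 2, 2))  # True: one broker lost, still writable
print(write_accepted("all", 1, 2))  # False: durability preferred over availability
print(write_accepted("1", 1, 2))    # True: accepted, but a leader crash can lose it
```

The last two lines are the trade in miniature: acks=all refuses the write rather than risk it, while acks=1 accepts it on a single copy.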

7. Better Usage Patterns

Amazon SNS (standalone)

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| DLQ on every subscription | Rely on default retries and silently lose messages | Attach a redrive DLQ to each subscription | Without a DLQ, an endpoint outage past the retry window is permanent loss |
| Attribute over payload filtering | Filter on message payload, paying per GB scanned | Put filter dimensions in message attributes (free filtering) | Attribute filtering is free; payload filtering bills per GB scanned at scale |
| Batch publishing | Leave PublishBatch unused and send 1 message per API call | Use PublishBatch (up to 10 msgs/call) | Cuts API request cost by up to 90% on small messages |
| Claim-check for big payloads | Send 256KB blobs and eat the 4x chunk billing | Store payload in S3, publish a pointer | Keeps each publish at one 64KB chunk and one delivery unit |
| Reserve FIFO for genuine ordering | Default everything to FIFO topics | Use Standard unless ordering/dedup is truly required | FIFO caps at 300 TPS and delivers only to SQS queues; Standard is far cheaper and faster |
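
The batch-publishing row comes down to grouping messages into sets of at most 10 before calling the API. A minimal sketch: `batches` and `publish_all` are hypothetical helpers, and the actual send is injected as a callable so the grouping logic stays independent of any SDK. The entry shape (`Id`, `Message`) follows the SNS PublishBatch request entries; the sketch ignores the total batch size limit and the per-entry failures a real caller would need to handle.

```python
from typing import Callable, Iterable, Iterator

MAX_BATCH = 10  # PublishBatch limit from the table: up to 10 messages per call


def batches(messages: Iterable[str], size: int = MAX_BATCH) -> Iterator[list[dict]]:
    """Group messages into PublishBatch-shaped entry lists of at most `size`."""
    batch: list[dict] = []
    for index, body in enumerate(messages):
        batch.append({"Id": str(index), "Message": body})  # Id must be unique per batch
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def publish_all(messages: Iterable[str], send_batch: Callable[[list[dict]], None]) -> int:
    """Send every message through `send_batch`; return the number of API calls made."""
    calls = 0
    for entries in batches(messages):
        send_batch(entries)
        calls += 1
    return calls


sent: list[list[dict]] = []
calls = publish_all([f"m{i}" for i in range(25)], sent.append)
print(calls, [len(b) for b in sent])  # 3 [10, 10, 5]
```

With boto3, `send_batch` could wrap `sns.publish_batch(TopicArn=..., PublishBatchRequestEntries=entries)`; here 25 messages cost 3 calls instead of 25.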

Apache Kafka

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Partition-key design | Random keys, losing per-entity ordering | Key by the entity that needs ordered processing | Ordering is per-partition; the key is the only ordering lever you have |
| Cooperative rebalancing | Default eager rebalancing stops all consumers | Use cooperative-sticky assignor | Avoids stop-the-world pauses every time the group changes |
| acks=all + min.insync.replicas | acks=1 and assume durability | acks=all with min.insync.replicas>=2 | acks=1 loses data on leader failure before replication |
| Consumer lag as the SLO | Alert on broker CPU, miss the real signal | Alert on consumer-group lag | Lag is the direct measure of whether consumers keep up; it predicts incidents |
| Tiered storage for long retention | Size broker disks for full retention | Use tiered storage (KIP-405) to offload to object storage | Decouples retention from broker disk, slashing cost for replay-heavy topics |
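
Consumer-group lag, the metric the table recommends alerting on, is the distance between each partition's log end offset and the group's committed offset. A minimal sketch of that arithmetic, assuming the offsets have already been fetched from the cluster by some other means; `partition_lag` and `lag_breaches` are hypothetical helpers, and the alert threshold is an arbitrary illustration.

```python
def partition_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: log end offset minus the group's committed offset.

    A partition with no committed offset is treated as fully unread (offset 0),
    which is an assumption of this sketch.
    """
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def lag_breaches(lag: dict[int, int], threshold: int) -> list[int]:
    """Partitions whose lag exceeds the alert threshold, in partition order."""
    return sorted(p for p, value in lag.items() if value > threshold)


lag = partition_lag(
    end_offsets={0: 1000, 1: 500, 2: 50},
    committed={0: 990, 1: 100},  # partition 2 has never been committed
)
print(lag)                     # {0: 10, 1: 400, 2: 50}
print(lag_breaches(lag, 100))  # [1]
```

Alerting per partition rather than on the group total matters because a single hot or stuck partition can hide inside a healthy-looking sum.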

Redis Pub/Sub

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Use Streams when you need durability | Reach for Pub/Sub then bolt on hacks to avoid loss | Use Redis Streams (consumer groups + persistence) | Pub/Sub will never be durable; Streams is the right structure inside Redis |
| Sharded Pub/Sub in Cluster | Plain SUBSCRIBE in Cluster, flooding all nodes | Use SSUBSCRIBE (Sharded Pub/Sub, Redis 7+) | Classic Pub/Sub broadcasts cluster-wide, wasting bandwidth and capping scale |
| Separate bus from cache | Run Pub/Sub on the shared cache instance | Dedicate a Redis instance to messaging | Decouples failure domains; cache pressure should not kill your event bus |
| Tune client-output-buffer-limit | Leave defaults, slow subscribers get dropped silently | Size buffer limits to consumer pace, monitor disconnects | Redis kills slow subscribers to protect itself, causing invisible loss |
| Accept loss explicitly | Treat Pub/Sub as reliable, build on a false assumption | Only use it where loss is acceptable by design | Designing for guarantees Redis Pub/Sub does not offer is the root cause of most incidents |

SNS + SQS (fan-out composite)

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Long polling everywhere | Short polling burns requests on idle queues | Set WaitTimeSeconds=20 on every receive | Cuts most empty-receive request charges and delivers messages as soon as they arrive |
| Visibility timeout sizing | Default 30s, then duplicate work on slow processing | Set to ~6x processing time or heartbeat with ChangeMessageVisibility | Too-short timeout redelivers in-flight messages, doubling work |
| DLQ + maxReceiveCount | Poison messages loop forever, blocking the queue | Configure a DLQ with a sane maxReceiveCount | A poison pill without a DLQ stalls the whole consumer |
| Subscribe SQS, not raw HTTP | Subscribe an HTTP endpoint directly to SNS, no buffer | Insert an SQS queue as the durable buffer | The SQS queue is what makes the consumer outage-tolerant; that is the whole point of the pattern |
| Idempotent consumers | Assume exactly-once on Standard queues | Make handlers idempotent (dedup key) | Standard SQS is at-least-once; duplicates are normal, not exceptional |
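
The idempotent-consumer row can be sketched as a thin wrapper that remembers which dedup keys it has already processed. Everything here is illustrative: `IdempotentHandler` and the `dedup_key` field are hypothetical, and the in-memory set stands in for the shared, durable store a real consumer fleet would need.

```python
from typing import Callable


class IdempotentHandler:
    """Wrap a handler so a redelivered message takes effect at most once.

    The seen-set lives in memory for illustration only; across several workers
    or restarts it would have to be a shared, durable store.
    """

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()

    def handle(self, message: dict) -> bool:
        """Run the handler unless this dedup key was already handled.

        Returns True if the handler ran, False if the message was a duplicate.
        """
        key = message["dedup_key"]  # hypothetical field set by the publisher
        if key in self._seen:
            return False
        self._handler(message)
        self._seen.add(key)  # recorded only after success, so failures can retry
        return True


charges: list[int] = []
billing = IdempotentHandler(lambda m: charges.append(m["amount"]))

first = billing.handle({"dedup_key": "order-1", "amount": 20})
again = billing.handle({"dedup_key": "order-1", "amount": 20})  # at-least-once redelivery
print(first, again, charges)  # True False [20]
```

Recording the key only after the handler succeeds keeps retries possible; the cost is that a crash between the side effect and the record can still produce a duplicate unless both are committed atomically.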

8. Advanced / Next-Gen Alternatives

Amazon SNS (standalone)

| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| Amazon EventBridge | Rich routing rules, schema registry, SaaS event sources | Production | Low (similar model) | You need content-based routing and many event sources, not just fan-out |
| SNS + Kinesis Firehose archive | Adds durable history / replay to SNS | Production | Low | You like SNS push but need an audit trail or reprocessing |
| Google Pub/Sub | Push + pull, durable, replay, global by default | Production | High (cloud move) | On GCP, or want durable pub/sub without running Kafka |

Apache Kafka

| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| Apache Pulsar | Native multi-tenancy, tiered storage, queue+stream in one | Production | High (different model) | Heavy multi-tenant fan-out plus queueing in one system |
| Redpanda | Kafka API, no JVM/ZooKeeper, lower tail latency | Production | Low (Kafka-compatible) | Want Kafka semantics with simpler ops and better p99 |
| WarpStream / diskless Kafka | Kafka API directly on S3, no local disks, cheaper | Emerging | Low (Kafka-compatible) | Cost-sensitive, latency-tolerant streaming on object storage |

Redis Pub/Sub

| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| Redis Streams | Adds persistence, consumer groups, replay within Redis | Production | Low (same engine) | You need durability but want to stay in Redis |
| NATS / NATS JetStream | Lightweight pub/sub, JetStream adds durability + replay | Production | Medium | Want low-latency messaging with optional durability, lighter than Kafka |
| Redis Sharded Pub/Sub | Scales fan-out across a cluster correctly | Production | Low (Redis 7+ feature) | Already on Redis Cluster and Pub/Sub fan-out is the bottleneck |

SNS + SQS (fan-out composite)

| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| EventBridge + SQS targets | Adds rule-based routing in front of the queues | Production | Low | You outgrow simple fan-out and need filtering/routing logic |
| Amazon MSK / Kafka | Adds replay and a shared log for late/new consumers | Production | High (paradigm shift) | You need history, replay, or many groups reading one stream |
| Kinesis Data Streams | Ordered, replayable shards, managed, AWS-native | Production | Medium | Want Kafka-like replay/ordering without running Kafka, staying on AWS |