Pub/Sub Trade-Offs
Four ways to fan one event out to many consumers: Amazon SNS, Apache Kafka, Redis Pub/Sub, and the SNS+SQS composite. They are not interchangeable. The decisive axis is durability, not throughput.
As of 2026-05-21. AWS pricing is us-east-1 standard tier. Verify against the AWS pricing console for your region before quoting in a design doc.
SNS+SQS is the default for durable application fan-out inside AWS. It is the only option here that gives every consumer its own durable, independently-acknowledged copy with per-consumer backpressure, at near-zero ops. Reach past it only for a reason you can name.
Kafka wins when you need replay, ordered partitions, or a shared event log read by many consumer groups at their own offset (analytics replaying history while a live consumer tails the head). You pay for it in cluster ops.
Raw SNS is the right primitive when the subscriber is itself a push endpoint (Lambda, HTTP, mobile push, email/SMS) and you do not need a buffer. Redis Pub/Sub is fire-and-forget: pick it only when losing messages during a disconnect is acceptable (live presence, ephemeral dashboards), never as a system of record.
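The defaults above can be read as a small decision rule. The sketch below is one hypothetical encoding of it: the `Needs` fields, the `default_choice` name, and the order in which the checks run are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """Hypothetical requirement flags for one fan-out use case."""
    needs_replay_or_ordering: bool = False   # replay, ordered partitions, shared log
    loss_acceptable: bool = False            # losing messages on disconnect is fine
    push_endpoint_no_buffer: bool = False    # Lambda/HTTP/mobile push, no buffer needed


def default_choice(needs: Needs) -> str:
    """Return the default option for the given needs.

    The precedence of the checks is an assumption made for this sketch.
    """
    if needs.needs_replay_or_ordering:
        return "Kafka"
    if needs.loss_acceptable:
        return "Redis Pub/Sub"
    if needs.push_endpoint_no_buffer:
        return "SNS"
    # No named reason to reach past the default.
    return "SNS+SQS"


print(default_choice(Needs()))
print(default_choice(Needs(needs_replay_or_ordering=True)))
```

The fall-through case is deliberate: SNS+SQS is returned whenever no specific reason to pick something else has been stated.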
Framing
Pub/sub decouples a publisher from many subscribers. These options sit at different points on the durability and fan-out spectrum, so the right default depends on whether messages can be lost, replayed, or independently acknowledged.
1. Trade-Offs
Amazon SNS (standalone)
Use standalone SNS when subscribers are push endpoints and you do not need a durable per-consumer buffer or replay path.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Push delivery, no consumer buffer | Sub-second fan-out to endpoints with zero polling cost | No durable hold if the endpoint is down past the retry window | HTTP/S endpoint down longer than the retry policy: message lands in a DLQ or is dropped | Always attach a redrive DLQ. Raw SNS retries are finite, not infinite: the retry window depends on the protocol, from hours or less for customer-managed endpoints (HTTP/S, email, SMS, mobile push) to roughly 23 days for AWS-managed endpoints (SQS, Lambda) |
| Managed, serverless | No brokers to run, scales to high publish rates automatically | No control over internals, no replay, AWS-locked | You need to reprocess yesterday's events: impossible, SNS keeps nothing | SNS is a router, not a store. If you need history, use a replayable log (Kafka, Kinesis) or archive to S3 via Firehose. Adding SQS gives each consumer a durable buffer, not replay |
| Best-effort ordering (Standard) | Near-unlimited throughput on Standard topics | Out-of-order and duplicate delivery | Consumer assumes ordering: state machine corrupts on reorder | FIFO topics fix ordering but cap at 300 TPS (3,000 batched) and only deliver to SQS FIFO |
| Native multi-protocol fan-out | One publish reaches Lambda, SQS, HTTP, email, SMS, push together | Per-subscription delivery is billed and metered separately | Large fan-out (1 publish, 50 subs) multiplies delivery cost and failure surface | A2P (SMS/email) carries very different cost and reliability than A2A (SQS/Lambda). Do not mix them in one mental model |
| Message filtering at the broker | Subscribers receive only matching messages, less consumer waste | Attribute filtering free, payload filtering billed per GB scanned | Heavy payload-based filter policies quietly add cost at scale | Prefer attribute-based filtering (free) over payload-based ($/GB scanned). Push filter logic into message attributes at publish time |
| At-least-once delivery | Simpler than building exactly-once | Consumers must be idempotent | Duplicate triggers a double charge or double email | Idempotency is the consumer's job on every option here except Kafka EOS and SNS/SQS FIFO |
| 64KB request chunking | Predictable per-request pricing | A 256KB message bills as 4 requests and 4 deliveries per endpoint | Large payloads silently 4x the bill | Keep payloads small, pass an S3 pointer for anything large (claim-check pattern) |
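
The 64KB chunking rule in the last row is plain arithmetic. A minimal sketch, assuming the chunk size and the "4 requests and 4 deliveries per endpoint" behaviour stated in the table; the function names are illustrative, and the result should be checked against current AWS pricing before use.

```python
import math

CHUNK_BYTES = 64 * 1024  # billing chunk size, as stated in the table


def billed_requests(payload_bytes: int) -> int:
    """Number of billed requests for one publish of this payload size."""
    return max(1, math.ceil(payload_bytes / CHUNK_BYTES))


def billed_deliveries(payload_bytes: int, subscriptions: int) -> int:
    """Billed deliveries when one publish fans out to `subscriptions` endpoints."""
    return billed_requests(payload_bytes) * subscriptions


print(billed_requests(256 * 1024))        # 4
print(billed_deliveries(256 * 1024, 50))  # 200
```

With the claim-check pattern the published payload is only a small pointer, so both functions return the minimum regardless of how large the object in S3 is.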
Apache Kafka
Choose Kafka when replay, ordered partitions, high-throughput streams, and independent consumer group offsets justify owning broker operations.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Durable, replayable log | Replay from any offset, add new consumers that read all history | Storage cost and retention management | Retention too short: a backfilling consumer finds the data already aged out | Retention is the silent correctness lever. Tiered storage (KIP-405) decouples retention from broker disk |
| Partition-level ordering | Strict order within a partition, parallelism across partitions | No total order across the topic | You need global ordering: forces a single partition, killing throughput | Order is per-key, not per-topic. Choosing the partition key is the real design decision |
| Consumer groups + offsets | Many independent groups read the same stream at their own pace | Offset management complexity, rebalancing pauses | A slow consumer in a group triggers rebalances that stall the others | This is the property SNS+SQS fakes with N queues. Kafka does it with one copy of the data on disk |
| Pull-based consumption | Consumers control their own rate, natural backpressure | Consumers must poll, no native push to arbitrary endpoints | You wanted to trigger a Lambda directly: needs an extra connector or MSK trigger | Pull means lag is observable and bounded by the consumer, not the broker. Lag is your single best health metric |
| Self-managed or MSK | Full control, runs anywhere, no vendor lock | Cluster ops: brokers, rebalancing, ISR, controller, upgrades | A broker dies at 3am and under-replicated partitions page you | The ops burden, not throughput, is why teams pick SNS+SQS. MSK and Confluent reduce but do not erase it |
| Exactly-once semantics (EOS) | Transactional produce+consume, no dedup logic in app | Lower throughput, higher latency, only within Kafka | EOS does not extend to an external DB without an outbox or idempotent sink | EOS is intra-Kafka. Crossing to an external system still needs idempotency or the transactional-outbox pattern |
| High throughput via sequential IO | Millions of msg/s on modest hardware, zero-copy reads | Latency floor higher than in-memory Redis | Ultra-low-latency (sub-ms) fan-out: Kafka's batching adds milliseconds | Kafka optimizes throughput, not tail latency. For sub-ms presence signals, Redis Pub/Sub still wins |
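
To make "order is per-key, not per-topic" concrete, the sketch below routes keyed events to partitions in plain Python. It is a toy model, not Kafka client code: `partition_for` and `route` are hypothetical names, and CRC32 stands in for the hash the real client uses (the Java client's default partitioner uses murmur2).

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in hash for illustration."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to its partition, preserving arrival order."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [("order-1", "created"), ("order-2", "created"), ("order-1", "paid")]
partitions = route(events, num_partitions=4)

# All events for one key land in one partition, in publish order.
p = partition_for("order-1", 4)
print([value for key, value in partitions[p] if key == "order-1"])
```

Events for different keys may land in different partitions and be consumed in any relative order, which is why the key must be the entity whose events need ordering.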
Redis Pub/Sub
Use Redis Pub/Sub only for ephemeral broadcasts where losing messages during disconnects is an accepted product behavior.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| In-memory, fire-and-forget | Lowest latency here, sub-millisecond fan-out | Zero durability, no persistence, no replay | A subscriber reconnects after a blip: every message published during the gap is gone | This is the one disqualifying property. Redis Pub/Sub is not a message bus, it is a live signal wire |
| Dead simple model | PUBLISH / SUBSCRIBE, no offsets, no consumer groups | No backpressure, no per-consumer acks | A slow subscriber fills its output buffer and Redis disconnects it | Redis drops slow subscribers (client-output-buffer-limit) to protect itself. Silent message loss by design |
| Co-located with your cache | No new infrastructure if Redis is already in the stack | Couples messaging fate to cache fate | Cache eviction pressure or failover takes your event bus down with it | Sharing the Redis instance between cache and bus couples two unrelated failure domains. Separate them |
| Non-blocking publish, no queuing | Publisher never blocks waiting for subscribers | A message with zero subscribers vanishes instantly | Subscriber deploys lag publisher startup: early events lost | If no one is subscribed at publish time, the message is dropped. There is no "land and wait" |
| Pattern subscriptions (PSUBSCRIBE) | Wildcard topic matching out of the box | Pattern matching cost grows with subscriber count | Thousands of pattern subs degrade publish latency | In Redis Cluster, plain Pub/Sub is not shard-aware. Use Sharded Pub/Sub (Redis 7+) or fan-out breaks across shards |
| Cluster mode caveat | Sharded Pub/Sub scales fan-out across the cluster | Classic Pub/Sub broadcasts to all nodes, wasting cluster bandwidth | Migrating to Redis Cluster silently changes Pub/Sub semantics | Many teams hit this in production: classic SUBSCRIBE in cluster mode floods every node. Sharded Pub/Sub fixes it |
| Redis Streams as the durable cousin | If you need durability, Streams (XADD/XREADGROUP) gives consumer groups + persistence | Streams is a different data structure, not Pub/Sub | Teams reach for Pub/Sub when they actually needed Streams | If durability matters and you want to stay in Redis, the answer is almost always Streams, not Pub/Sub |
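
The fire-and-forget rows above can be illustrated with a toy in-memory model. This is not Redis client code: `ToyPubSub` is a hypothetical class that mimics only the one property that matters here, namely that a message is delivered to whoever is subscribed at publish time and to no one else.

```python
from collections import defaultdict


class ToyPubSub:
    """Toy model of fire-and-forget delivery; nothing is stored for later."""

    def __init__(self):
        self._subs = defaultdict(list)  # channel -> list of subscriber inboxes

    def subscribe(self, channel):
        inbox = []
        self._subs[channel].append(inbox)
        return inbox

    def unsubscribe(self, channel, inbox):
        # Compare by identity: two empty inboxes are equal but not the same.
        self._subs[channel] = [i for i in self._subs[channel] if i is not inbox]

    def publish(self, channel, message) -> int:
        """Deliver to current subscribers only; return how many received it."""
        receivers = self._subs[channel]
        for inbox in receivers:
            inbox.append(message)
        return len(receivers)


bus = ToyPubSub()
print(bus.publish("presence", "a"))   # 0: no subscribers, message is gone
inbox = bus.subscribe("presence")
print(bus.publish("presence", "b"))   # 1
bus.unsubscribe("presence", inbox)    # simulate a disconnect
bus.publish("presence", "c")          # published during the gap, lost
inbox = bus.subscribe("presence")     # reconnect gets a fresh, empty view
print(inbox)                          # []
```

A durable structure such as Redis Streams differs exactly at the `publish` step: the message would be appended to a log that later readers can still consume.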
SNS + SQS (fan-out composite)
Default to SNS plus SQS for durable AWS application fan-out where every consumer needs its own buffer, retry path, and backpressure boundary.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Durable per-consumer fan-out | Each consumer gets its own durable queue, acks independently | N copies of every message (one per queue), N times the SQS cost | Huge fan-out (1 event, 100 consumers) multiplies storage and request cost | This is the textbook AWS pattern. The cost multiplier is the price of decoupling consumer failure domains |
| Independent backpressure | A slow consumer's queue grows without affecting others | No shared replay, each queue drains once | You need a brand-new consumer to read history: SQS already deleted it | Unlike Kafka, a consumed message is gone. Adding a late consumer means it sees only future events |
| Fully managed, near-zero ops | No brokers, no partitions, no rebalancing, autoscaling built in | AWS lock-in, no control over internals | Multi-cloud or on-prem requirement appears: the pattern does not port | The operational simplicity is the entire reason to choose this over Kafka inside AWS |
| Built-in DLQ + visibility timeout | Failed messages redrive to a DLQ, in-flight messages hidden during processing | Visibility timeout tuning is a real source of bugs | Processing takes longer than the timeout: message redelivered, duplicate work | Set visibility timeout to ~6x expected processing time or use the heartbeat (ChangeMessageVisibility) pattern |
| At-least-once (Standard) or exactly-once (FIFO) | Choose ordering+dedup (FIFO) or throughput (Standard) per queue | FIFO caps throughput (300 msg/s, 3,000 batched) and SNS FIFO only feeds SQS FIFO | Ordering need at high volume forces an awkward FIFO+partitioning design | Mix and match: Standard SNS to Standard SQS for throughput, FIFO end-to-end only where order is mandatory |
| Cheap at moderate scale, polling cost at idle | First 1M requests/month free, $0.40/M after | Empty receives from short polling burn requests on idle queues | Many idle queues polled aggressively: surprise request bill | Always use long polling (WaitTimeSeconds=20). The 64KB chunk rule applies here too: large payloads 4x the count |
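
The long-polling and visibility-timeout guidance above can be captured in a small helper that builds the arguments for an SQS receive call. A sketch under stated assumptions: `receive_kwargs` is a hypothetical helper, the 6x multiplier is the heuristic from the table, and the queue URL is a placeholder. The keys match the parameter names of the SQS `ReceiveMessage` API.

```python
import math

MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # SQS upper bound on visibility timeout
LONG_POLL_SECONDS = 20                 # long-polling wait recommended in the table


def receive_kwargs(queue_url: str, expected_processing_seconds: float,
                   batch_size: int = 10) -> dict:
    """Build ReceiveMessage arguments with long polling and a ~6x visibility timeout."""
    if not 1 <= batch_size <= 10:
        raise ValueError("batch_size must be between 1 and 10")
    visibility = math.ceil(expected_processing_seconds * 6)
    visibility = min(MAX_VISIBILITY_SECONDS, max(1, visibility))
    return {
        "QueueUrl": queue_url,
        "WaitTimeSeconds": LONG_POLL_SECONDS,
        "MaxNumberOfMessages": batch_size,
        "VisibilityTimeout": visibility,
    }


# Placeholder URL for illustration only.
print(receive_kwargs("https://example.invalid/my-queue", expected_processing_seconds=5))
```

With boto3 the result would be passed as `boto3.client("sqs").receive_message(**kwargs)`. For work whose duration is hard to predict, extending the timeout during processing with `ChangeMessageVisibility` is the alternative named in the table.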
2. Use Cases
Amazon SNS (standalone)
| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Mobile / web push notifications | App sends order-status pushes to millions of devices | Native APNs/FCM/A2P delivery, no buffer needed | Millions of endpoints | Kafka has no native push to mobile; SQS does not deliver to devices |
| Lambda fan-out trigger | One event triggers several Lambdas in parallel | Push-based invoke, no polling, no idle cost | Thousands of invokes/s | SQS+Lambda needs polling config; Redis has no Lambda integration |
| System alerts to humans | CloudWatch alarm fans to email + SMS + Slack webhook | Multi-protocol A2P delivery from one publish | Low volume, high reliability | Only SNS speaks email/SMS/HTTP natively in one call |
| Cross-account event broadcast | Central account publishes events consumed by many team accounts | Topic policies allow cross-account subscription | Dozens of accounts | Kafka cross-account needs networking + ACL plumbing |
| Fanout entry point (paired with SQS) | SNS is the fan-out hub feeding many SQS queues | Decouples publisher from a growing set of consumers | Many independent consumers | Raw SQS cannot fan one message to many queues; SNS is the splitter |
Apache Kafka
| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Event sourcing / CQRS | The log is the source of truth, read models project from it | Durable replayable ordered log | Billions of events retained | SQS deletes on consume; SNS keeps nothing; no replay anywhere else |
| Multi-consumer analytics + real-time | Same clickstream feeds Flink real-time and a nightly batch job | Independent consumer-group offsets on one copy | Millions of msg/s | SNS+SQS would need N duplicated queues and still no replay |
| Stream processing backbone | Kafka feeds Flink/Kafka Streams for joins and windowing | Ordered partitions + offset semantics | High throughput, stateful | Redis/SNS/SQS have no stream-processing ecosystem |
| Log / metrics aggregation pipeline | Thousands of services ship logs into topics, fan to sinks | High write throughput, cheap sequential storage | TB/day ingestion | SQS request pricing at this volume is brutal; Kafka is built for it |
| CDC and database replication | Debezium streams DB changes into Kafka for downstream sync | Ordered per-key change log, replayable | Continuous high volume | Ordering + replay are mandatory; only Kafka offers both here |
Redis Pub/Sub
| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Live presence / typing indicators | Chat app shows who is online and typing | Sub-ms latency, loss is harmless | Many ephemeral signals | Kafka/SQS add latency for a signal that is worthless if delayed |
| WebSocket fan-out across servers | Broadcast a message to all WS servers holding client connections | Lowest-latency cross-server broadcast | Many app servers | SNS/SQS round-trip is too slow for interactive broadcast |
| Live config / cache-invalidation ping | Tell all nodes to drop a cache key now | Instant best-effort broadcast | Cluster-wide | Durability is unnecessary; a missed invalidation self-heals on TTL |
| Real-time leaderboard / dashboard ticks | Push score updates to live dashboards | Low latency, stale data soon overwritten | High-frequency updates | Each tick supersedes the last, so loss does not matter |
| In-game ephemeral events | Broadcast transient in-match events to connected clients | Speed over guarantee | Bursty, latency-sensitive | Durable systems over-engineer a throwaway signal |
SNS + SQS (fan-out composite)
| Use Case | Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Order-event fan-out to microservices | OrderPlaced fans to billing, inventory, email, analytics queues | Each service buffers and acks independently | Many services, moderate volume | Raw SNS drops if a consumer is down; this survives outages per consumer |
| Decoupled async job dispatch | An event spawns durable work units consumed by autoscaling workers | Durable buffer + DLQ + visibility timeout | Variable worker fleet | Redis loses jobs on disconnect; Kafka is ops-heavy for simple jobs |
| Reliable cross-service eventing in AWS | Internal events where loss is unacceptable but replay is not needed | At-least-once durable delivery, near-zero ops | Org-wide eventing | Kafka cluster ops not justified when replay is not required |
| Buffering bursty traffic to slow consumers | Spiky publish rate, downstream processes at a steady pace | Queue absorbs the burst, consumer drains steadily | 10x burst factors | SNS alone has no buffer; the SQS queue is the shock absorber |
| Ordered, exactly-once workflow steps | Sequential steps that must not duplicate or reorder | SNS FIFO to SQS FIFO end to end | Up to 300 msg/s (3,000 batched) | Redis/Standard give no ordering or dedup guarantee |
3. Limitations
Each cell leads with a severity rating (Critical, High, or Medium), followed by the detail.
| Limitation Axis | SNS | Kafka | Redis Pub/Sub | SNS+SQS |
|---|---|---|---|---|
| Durability | High: No store; relies on subscriber availability | Medium: Durable to disk, bounded by retention | Critical: None; offline subscriber loses everything | Medium: Durable up to 14-day retention, then dropped |
| Replay / history | High: Impossible without archiving | Medium: Native, bounded by retention | Critical: No replay at all | High: None; consumed messages are deleted |
| Ordering | Medium: Best-effort; FIFO caps at 300 TPS | Medium: Per-partition only, not global | Medium: Per-channel best-effort, no guarantee on reconnect | Medium: Standard unordered; FIFO ordered but 300 TPS |
| Fan-out cost | Medium: Billed per subscription delivery | Medium: One copy on disk, cheap fan-out | Medium: Classic cluster mode broadcasts to all nodes | High: N queues = N copies = N times SQS cost |
| Operational burden | Medium: Low burden; managed | Critical: Brokers, ISR, rebalancing, controller, upgrades | Medium: Failover and buffer-limit tuning | Medium: Low burden; managed, but many queues to govern |
| Throughput ceiling | Medium: Very high Standard; 300 TPS FIFO | Medium: Millions/s, the highest here | Medium: High but single-threaded command path | Medium: Very high Standard; 300 TPS FIFO |
| Payload size | Medium: 256KB, billed in 64KB chunks | Medium: Default 1MB, tunable | Medium: Bounded by memory and buffer limits | Medium: 256KB (2GB via S3 extended client) |
| Portability / lock-in | High: AWS-only | Medium: Runs anywhere, open protocol | Medium: Open source, runs anywhere | High: AWS-only pattern |
4. Fault Tolerance
| Dimension | SNS | Kafka | Redis Pub/Sub | SNS+SQS |
|---|---|---|---|---|
| Replication model | Internal, multi-AZ, opaque to user | Leader + ISR followers per partition | Primary + replica (async), or none | Internal, multi-AZ, opaque (SQS stores redundantly) |
| Failure detection | AWS-managed | Controller / KRaft heartbeats, ISR shrink | Sentinel or Cluster gossip | AWS-managed |
| Failover mechanism | Transparent, automatic | ISR leader election (seconds) | Sentinel promotes replica (seconds to tens of s) | Transparent, automatic |
| RTO (typical) | Near-zero (managed) | Seconds (leader election) | Seconds to tens of seconds | Near-zero (managed) |
| RPO (typical) | Zero for accepted msgs (best-effort delivery after) | Zero with acks=all + min.insync.replicas | High: in-flight + buffered msgs lost on failover | Zero for enqueued messages |
| Split-brain behavior | N/A, managed | Prevented by min.insync.replicas; unclean election risks loss | Possible with Sentinel misconfig; writes to old primary lost | N/A, managed |
| Blast radius, single node | None visible | Partitions led by that broker fail over; lag spike | If non-replicated, total loss of that shard's channels | None visible |
| Cross-region failover | Region-scoped; DR needs multi-region topics | MirrorMaker 2 / Confluent replication, manual | Active-active needs Redis Enterprise CRDTs | Region-scoped; DR needs multi-region design |
| Data loss scenario | Endpoint down past retry window with no DLQ | Unclean leader election or retention expiry | Routine: any disconnect, slow consumer, or restart | Message age exceeds retention (max 14 days) |
6. Replication
| Dimension | SNS | Kafka | Redis Pub/Sub | SNS+SQS |
|---|---|---|---|---|
| Topology | Managed multi-AZ (opaque) | Leader-follower per partition | Primary-replica (async) | Managed multi-AZ (opaque) |
| Sync vs async | Managed | Configurable: acks=all is sync to ISR | Async; replica can lag the primary | Managed (synchronous across AZs) |
| Replication factor | Managed | Commonly 3 in production (the broker default is 1), tunable per topic | Typically 1 replica per primary | Managed (multiple AZs) |
| Consistency options | At-least-once (Std), exactly-once (FIFO) | Tunable via acks + min.insync.replicas + EOS | None; no consistency guarantee for Pub/Sub | At-least-once (Std), exactly-once (FIFO) |
| Replication lag | N/A | Sub-second healthy; watch ISR shrink | Async, can spike under load | N/A (managed) |
| Conflict resolution | N/A (single writer path) | No conflicts; single leader per partition | Last-write-wins on the primary | N/A |
| Cross-region replication | Not native; design-level | MirrorMaker 2 / cluster linking | Redis Enterprise active-active (CRDT) | Not native; design-level |
| Replication during partition | Managed | ISR shrinks; acks=all blocks if below min.insync | Primary serves alone; replica diverges | Managed (stays consistent across AZs) |
7. Better Usage Patterns
Amazon SNS (standalone)
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| DLQ on every subscription | Rely on default retries and silently lose messages | Attach a redrive DLQ to each subscription | Without a DLQ, an endpoint outage past the retry window is permanent loss |
| Attribute over payload filtering | Filter on message payload, paying per GB scanned | Put filter dimensions in message attributes (free filtering) | Attribute filtering is free; payload filtering bills per GB scanned at scale |
| Batch publishing | Ignore PublishBatch and send 1 message per API call | Use PublishBatch (up to 10 msgs/call) | Cuts API request cost by up to 90% on small messages |
| Claim-check for big payloads | Send 256KB blobs and eat the 4x chunk billing | Store payload in S3, publish a pointer | Keeps each publish at one 64KB chunk and one delivery unit |
| Reserve FIFO for genuine ordering | Default everything to FIFO topics | Use Standard unless ordering/dedup is truly required | FIFO caps at 300 TPS and only delivers to SQS FIFO; Standard is far cheaper and faster |
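
The batch-publishing row relies on grouping messages into calls of at most 10. A minimal sketch of that grouping: `batch_entries` is a hypothetical helper, the `Id` and `Message` keys follow the entry shape of the SNS `PublishBatch` API, and the sketch ignores any per-batch payload size limit.

```python
def batch_entries(messages, batch_size=10):
    """Split message bodies into PublishBatch-sized groups of entries.

    Each entry gets an Id that is unique across the whole input, which also
    makes it unique within its batch.
    """
    entries = [{"Id": str(i), "Message": body} for i, body in enumerate(messages)]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


batches = batch_entries([f"event-{i}" for i in range(25)])
print([len(batch) for batch in batches])  # [10, 10, 5]
```

With boto3, each group would be sent as `sns.publish_batch(TopicArn=topic_arn, PublishBatchRequestEntries=batch)`, where `topic_arn` is a placeholder. The response reports per-entry failures, so callers should inspect it rather than assume the whole batch succeeded.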
Apache Kafka
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Partition-key design | Random keys, losing per-entity ordering | Key by the entity that needs ordered processing | Ordering is per-partition; the key is the only ordering lever you have |
| Cooperative rebalancing | Default eager rebalancing stops all consumers | Use cooperative-sticky assignor | Avoids stop-the-world pauses every time the group changes |
| acks=all + min.insync.replicas | acks=1 and assume durability | acks=all with min.insync.replicas>=2 | acks=1 loses data on leader failure before replication |
| Consumer lag as the SLO | Alert on broker CPU, miss the real signal | Alert on consumer-group lag | Lag is the direct measure of whether consumers keep up; it predicts incidents |
| Tiered storage for long retention | Size broker disks for full retention | Use tiered storage (KIP-405) to offload to object storage | Decouples retention from broker disk, slashing cost for replay-heavy topics |
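
Consumer-group lag, the SLO signal recommended above, is the gap between each partition's log end offset and the group's committed offset. A minimal sketch of the arithmetic, assuming the offsets have already been fetched by some client or admin tool; the function names are illustrative, and treating a missing committed offset as 0 is an assumption of this sketch.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log end offset minus committed offset, floored at 0."""
    return {
        partition: max(0, end - committed.get(partition, 0))
        for partition, end in end_offsets.items()
    }


def total_lag(end_offsets: dict, committed: dict) -> int:
    """Sum of per-partition lag for the whole consumer group."""
    return sum(consumer_lag(end_offsets, committed).values())


end_offsets = {0: 1_000, 1: 2_500, 2: 400}
committed = {0: 990, 1: 1_500, 2: 400}
print(consumer_lag(end_offsets, committed))  # {0: 10, 1: 1000, 2: 0}
print(total_lag(end_offsets, committed))     # 1010
```

Per-partition values matter as much as the total: a single hot partition with growing lag usually points at a skewed partition key rather than an undersized consumer fleet.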
Redis Pub/Sub
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Use Streams when you need durability | Reach for Pub/Sub then bolt on hacks to avoid loss | Use Redis Streams (consumer groups + persistence) | Pub/Sub will never be durable; Streams is the right structure inside Redis |
| Sharded Pub/Sub in Cluster | Plain SUBSCRIBE in Cluster, flooding all nodes | Use SSUBSCRIBE (Sharded Pub/Sub, Redis 7+) | Classic Pub/Sub broadcasts cluster-wide, wasting bandwidth and capping scale |
| Separate bus from cache | Run Pub/Sub on the shared cache instance | Dedicate a Redis instance to messaging | Decouples failure domains; cache pressure should not kill your event bus |
| Tune client-output-buffer-limit | Leave defaults, slow subscribers get dropped silently | Size buffer limits to consumer pace, monitor disconnects | Redis kills slow subscribers to protect itself, causing invisible loss |
| Accept loss explicitly | Treat Pub/Sub as reliable, build on a false assumption | Only use it where loss is acceptable by design | Designing for guarantees Redis Pub/Sub does not offer is the root cause of most incidents |
SNS + SQS (fan-out composite)
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Long polling everywhere | Short polling burns requests on idle queues | Set WaitTimeSeconds=20 on every receive | Cuts empty-receive request charges sharply (an idle queue costs at most one request per 20 seconds per poller) and reduces latency churn |
| Visibility timeout sizing | Default 30s, then duplicate work on slow processing | Set to ~6x processing time or heartbeat with ChangeMessageVisibility | Too-short timeout redelivers in-flight messages, doubling work |
| DLQ + maxReceiveCount | Poison messages loop forever, blocking the queue | Configure a DLQ with a sane maxReceiveCount | A poison pill without a DLQ stalls the whole consumer |
| Subscribe SQS, not raw HTTP | Subscribe an HTTP endpoint directly to SNS, no buffer | Insert an SQS queue as the durable buffer | The SQS queue is what makes the consumer outage-tolerant; that is the whole point of the pattern |
| Idempotent consumers | Assume exactly-once on Standard queues | Make handlers idempotent (dedup key) | Standard SQS is at-least-once; duplicates are normal, not exceptional |
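
The idempotent-consumer row can be sketched as a wrapper that skips any message whose dedup key has already been processed. This is a minimal illustration, not a production design: `IdempotentHandler` is a hypothetical class, and the in-memory set stands in for the durable store (a database table or a conditional write) that a real consumer would need to survive restarts.

```python
class IdempotentHandler:
    """Run the wrapped handler at most once per dedup key."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory only, lost on restart

    def handle(self, dedup_key: str, payload) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if dedup_key in self._seen:
            return False
        self._handler(payload)
        # Record the key only after the handler succeeds, so a failed
        # attempt is retried on redelivery instead of being skipped.
        self._seen.add(dedup_key)
        return True


charges = []
consumer = IdempotentHandler(charges.append)
print(consumer.handle("order-1", 100))  # True
print(consumer.handle("order-1", 100))  # False: redelivered duplicate
print(charges)                          # [100]
```

The dedup key should come from the business event (an order ID plus event type, for example) rather than from a transport-level identifier, so that the same event stays deduplicated even if it is published twice.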
8. Advanced / Next-Gen Alternatives
Amazon SNS (standalone)
| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Amazon EventBridge | Rich routing rules, schema registry, SaaS event sources | Production | Low (similar model) | You need content-based routing and many event sources, not just fan-out |
| SNS + Kinesis Firehose archive | Adds durable history / replay to SNS | Production | Low | You like SNS push but need an audit trail or reprocessing |
| Google Pub/Sub | Push + pull, durable, replay, global by default | Production | High (cloud move) | On GCP, or want durable pub/sub without running Kafka |
Apache Kafka
| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Apache Pulsar | Native multi-tenancy, tiered storage, queue+stream in one | Production | High (different model) | Heavy multi-tenant fan-out plus queueing in one system |
| Redpanda | Kafka API, no JVM/ZooKeeper, lower tail latency | Production | Low (Kafka-compatible) | Want Kafka semantics with simpler ops and better p99 |
| WarpStream / diskless Kafka | Kafka API directly on S3, no local disks, cheaper | Emerging | Low (Kafka-compatible) | Cost-sensitive, latency-tolerant streaming on object storage |
Redis Pub/Sub
| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Redis Streams | Adds persistence, consumer groups, replay within Redis | Production | Low (same engine) | You need durability but want to stay in Redis |
| NATS / NATS JetStream | Lightweight pub/sub, JetStream adds durability + replay | Production | Medium | Want low-latency messaging with optional durability, lighter than Kafka |
| Redis Sharded Pub/Sub | Scales fan-out across a cluster correctly | Production | Low (Redis 7+ feature) | Already on Redis Cluster and Pub/Sub fan-out is the bottleneck |
SNS + SQS (fan-out composite)
| Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| EventBridge + SQS targets | Adds rule-based routing in front of the queues | Production | Low | You outgrow simple fan-out and need filtering/routing logic |
| Amazon MSK / Kafka | Adds replay and a shared log for late/new consumers | Production | High (paradigm shift) | You need history, replay, or many groups reading one stream |
| Kinesis Data Streams | Ordered, replayable shards, managed, AWS-native | Production | Medium | Want Kafka-like replay/ordering without running Kafka, staying on AWS |