Pub/Sub Trade-Offs

Four ways to fan one event out to many consumers: Amazon SNS, Apache Kafka, Redis Pub/Sub, and the SNS+SQS composite. They are not interchangeable. The decisive axis is durability, not throughput.

Messaging / Event Distribution Head-to-Head (4)

As of 2026-05-21. AWS pricing is us-east-1 standard tier. Verify against the AWS pricing console for your region before quoting in a design doc.

PE Verdict

SNS+SQS is the default for durable application fan-out inside AWS. It is the only option here that gives every consumer its own durable, independently-acknowledged copy with per-consumer backpressure, at near-zero ops. Reach past it only for a reason you can name.

Kafka wins when you need replay, ordered partitions, or a shared event log read by many consumer groups at their own offset (analytics replaying history while a live consumer tails the head). You pay for it in cluster ops.

Raw SNS is the right primitive when the subscriber is itself a push endpoint (Lambda, HTTP, mobile push, email/SMS) and you do not need a buffer. Redis Pub/Sub is fire-and-forget: pick it only when losing messages during a disconnect is acceptable (live presence, ephemeral dashboards), never as a system of record.

Framing

Pub/sub decouples a publisher from many subscribers. These options sit at different points on the durability and fan-out spectrum, so the right default depends on whether messages can be lost, replayed, or independently acknowledged.

Redis Pub/Sub: Ephemeral, in-memory, no persistence, no replay, no per-subscriber acknowledgement.

SNS: Managed push fan-out to subscribed endpoints, but no consumer-side buffer by itself.

SNS + SQS: SNS fans out, each consumer owns a durable queue, ack path, and backpressure buffer.

Kafka: Durable partitioned commit log with replay and independent consumer group offsets.

Best default choices

SNS + SQS: Durable AWS fan-out with one queue per consumer
Kafka: Replayable event log, ordered partitions, many consumer groups
SNS: Managed push fan-out to Lambda, HTTP, mobile, email, SMS
Redis Pub/Sub: Ephemeral low-latency broadcasts where loss is acceptable

1. Trade-Offs


Amazon SNS (standalone)

Use standalone SNS when subscribers are push endpoints and you do not need a durable per-consumer buffer or replay path.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Push delivery, no consumer buffer	Sub-second fan-out to endpoints with zero polling cost	No durable hold if the endpoint is down past the retry window	HTTP/S endpoint down longer than the retry policy: message lands in a DLQ or is dropped	Always attach a redrive DLQ. Raw SNS retries are finite, not infinite: the default policy gives up after about 6 hours for customer-managed endpoints (HTTP/S, email, SMS, mobile push) and about 23 days for AWS-managed ones (SQS, Lambda)
Managed, serverless	No brokers to run, scales to high publish rates automatically	No control over internals, no replay, AWS-locked	You need to reprocess yesterday's events: impossible, SNS keeps nothing	SNS is a router, not a store. If you need history, archive to S3 via Firehose or publish to a replayable log such as Kinesis. SNS+SQS adds a durable buffer, not replay
Best-effort ordering (Standard)	Near-unlimited throughput on Standard topics	Out-of-order and duplicate delivery	Consumer assumes ordering: state machine corrupts on reorder	FIFO topics fix ordering but cap at 300 TPS (3,000 batched) and only deliver to SQS FIFO
Native multi-protocol fan-out	One publish reaches Lambda, SQS, HTTP, email, SMS, push together	Per-subscription delivery is billed and metered separately	Large fan-out (1 publish, 50 subs) multiplies delivery cost and failure surface	A2P (SMS/email) carries very different cost and reliability than A2A (SQS/Lambda). Do not mix them in one mental model
Message filtering at the broker	Subscribers receive only matching messages, less consumer waste	Attribute filtering free, payload filtering billed per GB scanned	Heavy payload-based filter policies quietly add cost at scale	Prefer attribute-based filtering (free) over payload-based ($/GB scanned). Push filter logic into message attributes at publish time
At-least-once delivery	Simpler than building exactly-once	Consumers must be idempotent	Duplicate triggers a double charge or double email	Idempotency is the consumer's job on every option here except Kafka EOS and SNS/SQS FIFO
64KB request chunking	Predictable per-request pricing	A 256KB message bills as 4 requests and 4 deliveries per endpoint	Large payloads silently 4x the bill	Keep payloads small, pass an S3 pointer for anything large (claim-check pattern)
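
The 64KB chunk rule in the last row can be sketched as arithmetic. This is a minimal illustration, not a pricing calculator: the function names are hypothetical, and the only billing rule assumed is the one stated in the table (each 64KB chunk counts as one request, and each subscription delivery is counted per chunk).

```python
import math

CHUNK_BYTES = 64 * 1024  # per the table: each 64KB chunk bills as one request


def billed_requests(payload_bytes: int) -> int:
    """Billed publish requests for one message of the given size."""
    return max(1, math.ceil(payload_bytes / CHUNK_BYTES))


def billed_units(payload_bytes: int, subscriptions: int) -> tuple[int, int]:
    """(publish requests, delivery units) for one publish fanned out to N subscriptions."""
    chunks = billed_requests(payload_bytes)
    return chunks, chunks * subscriptions


# A 256KB blob to 50 subscriptions versus a 2KB S3 pointer (claim-check).
print(billed_units(256 * 1024, 50))  # (4, 200)
print(billed_units(2 * 1024, 50))    # (1, 50)
```

The claim-check version moves the large payload to S3, so the publish stays at one chunk regardless of how big the underlying object is.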

Apache Kafka

Choose Kafka when replay, ordered partitions, high-throughput streams, and independent consumer group offsets justify owning broker operations.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Durable, replayable log	Replay from any offset, add new consumers that read all history	Storage cost and retention management	Retention too short: a backfilling consumer finds the data already aged out	Retention is the silent correctness lever. Tiered storage (KIP-405) decouples retention from broker disk
Partition-level ordering	Strict order within a partition, parallelism across partitions	No total order across the topic	You need global ordering: forces a single partition, killing throughput	Order is per-key, not per-topic. Choosing the partition key is the real design decision
Consumer groups + offsets	Many independent groups read the same stream at their own pace	Offset management complexity, rebalancing pauses	A slow consumer in a group triggers rebalances that stall the others	This is the property SNS+SQS fakes with N queues. Kafka does it with one copy of the data on disk
Pull-based consumption	Consumers control their own rate, natural backpressure	Consumers must poll, no native push to arbitrary endpoints	You wanted to trigger a Lambda directly: needs an extra connector or MSK trigger	Pull means lag is observable and bounded by the consumer, not the broker. Lag is your single best health metric
Self-managed or MSK	Full control, runs anywhere, no vendor lock	Cluster ops: brokers, rebalancing, ISR, controller, upgrades	A broker dies at 3am and under-replicated partitions page you	The ops burden, not throughput, is why teams pick SNS+SQS. MSK and Confluent reduce but do not erase it
Exactly-once semantics (EOS)	Transactional produce+consume, no dedup logic in app	Lower throughput, higher latency, only within Kafka	EOS does not extend to an external DB without an outbox or idempotent sink	EOS is intra-Kafka. Crossing to an external system still needs idempotency or the transactional-outbox pattern
High throughput via sequential IO	Millions of msg/s on modest hardware, zero-copy reads	Latency floor higher than in-memory Redis	Ultra-low-latency (sub-ms) fan-out: Kafka's batching adds milliseconds	Kafka optimizes throughput, not tail latency. For sub-ms presence signals, Redis Pub/Sub still wins
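
The replay and consumer-group rows above can be illustrated with a toy in-memory log. This is a sketch of the idea only, not Kafka's client API: `Log`, `poll`, and `seek` are hypothetical names, there is a single partition, retention is not modeled, and a new group is assumed to start from the earliest offset.

```python
class Log:
    """One append-only partition shared by every consumer group."""

    def __init__(self):
        self.records = []   # the single stored copy of the data
        self.offsets = {}   # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        """Return the next records for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Rewind (or skip) a group; reading never deletes anything."""
        self.offsets[group] = offset


log = Log()
for i in range(5):
    log.append(f"event-{i}")

print(log.poll("live"))            # the live consumer tails the head
log.append("event-5")
print(log.poll("live"))            # ['event-5']
print(len(log.poll("analytics")))  # a late group still reads all 6 records
log.seek("live", 0)
print(log.poll("live", max_records=2))  # replay: ['event-0', 'event-1']
```

Each group's position is just an integer over the same stored records, which is why adding a consumer group costs no extra copy of the data.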

Redis Pub/Sub

Use Redis Pub/Sub only for ephemeral broadcasts where losing messages during disconnects is an accepted product behavior.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
In-memory, fire-and-forget	Lowest latency here, sub-millisecond fan-out	Zero durability, no persistence, no replay	A subscriber reconnects after a blip: every message published during the gap is gone	This is the one disqualifying property. Redis Pub/Sub is not a message bus, it is a live signal wire
Dead simple model	PUBLISH / SUBSCRIBE, no offsets, no consumer groups	No backpressure, no per-consumer acks	A slow subscriber fills its output buffer and Redis disconnects it	Redis drops slow subscribers (client-output-buffer-limit) to protect itself. Silent message loss by design
Co-located with your cache	No new infrastructure if Redis is already in the stack	Couples messaging fate to cache fate	Cache eviction pressure or failover takes your event bus down with it	Sharing the Redis instance between cache and bus couples two unrelated failure domains. Separate them
Non-blocking publish, no queuing	Publisher never blocks waiting for subscribers	A message with zero subscribers vanishes instantly	Subscribers deploy after the publisher starts: early events are lost	If no one is subscribed at publish time, the message is dropped. There is no "land and wait"
Pattern subscriptions (PSUBSCRIBE)	Wildcard topic matching out of the box	Pattern matching cost grows with subscriber count	Thousands of pattern subs degrade publish latency	Every PUBLISH is checked against every active pattern, so keep the pattern set small. In Redis Cluster, pattern subscriptions use classic Pub/Sub, which broadcasts to every node. Sharded Pub/Sub (Redis 7+) has no pattern variant
Cluster mode caveat	Sharded Pub/Sub scales fan-out across the cluster	Classic Pub/Sub broadcasts to all nodes, wasting cluster bandwidth	Migrating to Redis Cluster silently changes Pub/Sub semantics	Many teams hit this in production: classic SUBSCRIBE in cluster mode floods every node. Sharded Pub/Sub fixes it
Redis Streams as the durable cousin	If you need durability, Streams (XADD/XREADGROUP) gives consumer groups + persistence	Streams is a different data structure, not Pub/Sub	Teams reach for Pub/Sub when they actually needed Streams	If durability matters and you want to stay in Redis, the answer is almost always Streams, not Pub/Sub
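
The fire-and-forget behavior in the rows above can be modeled in a few lines. This is a toy simulation, not the Redis client: `FireAndForgetBus` and its methods are hypothetical names, and the only Redis behavior assumed is that a publish reaches the subscribers connected at that instant and reports how many received it.

```python
class FireAndForgetBus:
    """Toy model of Pub/Sub delivery: no storage, no acks, no replay."""

    def __init__(self):
        self.subscribers = {}  # channel -> {subscriber name: inbox list}

    def subscribe(self, channel, name):
        inbox = []
        self.subscribers.setdefault(channel, {})[name] = inbox
        return inbox

    def unsubscribe(self, channel, name):
        self.subscribers.get(channel, {}).pop(name, None)

    def publish(self, channel, message):
        """Deliver to whoever is connected right now; return the receiver count."""
        receivers = self.subscribers.get(channel, {})
        for inbox in receivers.values():
            inbox.append(message)
        return len(receivers)


bus = FireAndForgetBus()
print(bus.publish("presence", "alice online"))  # 0: nobody subscribed, message is gone

first = bus.subscribe("presence", "ws-1")
bus.publish("presence", "bob online")
bus.unsubscribe("presence", "ws-1")             # network blip
bus.publish("presence", "bob typing")           # published during the gap
second = bus.subscribe("presence", "ws-1")      # reconnect
bus.publish("presence", "bob offline")

print(first + second)  # ['bob online', 'bob offline'], "bob typing" was never seen
```

Nothing in the model can recover the gap, because the bus never stores a message after the publish call returns.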

SNS + SQS (fan-out composite)

Default to SNS plus SQS for durable AWS application fan-out where every consumer needs its own buffer, retry path, and backpressure boundary.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Durable per-consumer fan-out	Each consumer gets its own durable queue, acks independently	N copies of every message (one per queue), N times the SQS cost	Huge fan-out (1 event, 100 consumers) multiplies storage and request cost	This is the textbook AWS pattern. The cost multiplier is the price of decoupling consumer failure domains
Independent backpressure	A slow consumer's queue grows without affecting others	No shared replay, each queue drains once	You need a brand-new consumer to read history: SQS already deleted it	Unlike Kafka, a consumed message is gone. Adding a late consumer means it sees only future events
Fully managed, near-zero ops	No brokers, no partitions, no rebalancing, autoscaling built in	AWS lock-in, no control over internals	Multi-cloud or on-prem requirement appears: the pattern does not port	The operational simplicity is the entire reason to choose this over Kafka inside AWS
Built-in DLQ + visibility timeout	Failed messages redrive to a DLQ, in-flight messages hidden during processing	Visibility timeout tuning is a real source of bugs	Processing takes longer than the timeout: message redelivered, duplicate work	Set visibility timeout to ~6x expected processing time or use the heartbeat (ChangeMessageVisibility) pattern
At-least-once (Standard) or exactly-once (FIFO)	Choose ordering+dedup (FIFO) or throughput (Standard) per queue	FIFO caps throughput (300 msg/s, 3,000 batched) and SNS FIFO only feeds SQS FIFO	Ordering need at high volume forces an awkward FIFO+partitioning design	Mix and match: Standard SNS to Standard SQS for throughput, FIFO end-to-end only where order is mandatory
Cheap at moderate scale, polling cost at idle	First 1M requests/month free, $0.40/M after	Empty receives from short polling burn requests on idle queues	Many idle queues polled aggressively: surprise request bill	Always use long polling (WaitTimeSeconds=20). The 64KB chunk rule applies here too: large payloads 4x the count
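
The first two rows above (one durable copy per consumer, independent backpressure, no shared replay) can be sketched with a toy topic that copies each event into every subscribed queue. This is an in-memory illustration with hypothetical names (`FanOutTopic`, `subscribe`, `publish`), not the AWS API.

```python
from collections import deque


class FanOutTopic:
    """Toy SNS topic: each publish is copied into every subscribed queue."""

    def __init__(self):
        self.queues = {}  # consumer name -> its own queue

    def subscribe(self, name):
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, event):
        for queue in self.queues.values():
            queue.append(event)  # N queues means N stored copies


topic = FanOutTopic()
billing = topic.subscribe("billing")
inventory = topic.subscribe("inventory")

for i in range(3):
    topic.publish(f"OrderPlaced-{i}")

while billing:                         # billing keeps up and drains its queue
    billing.popleft()
print(len(billing), len(inventory))    # 0 3: inventory is down, its backlog waits

analytics = topic.subscribe("analytics")  # a consumer added late
topic.publish("OrderPlaced-3")
print(list(analytics))                 # ['OrderPlaced-3']: no history, only future events
print(len(inventory))                  # 4: still buffering independently
```

The slow consumer's backlog grows in its own queue without touching the others, and the late subscriber shows the cost of having no shared log.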

2. Use Cases

Amazon SNS (standalone)

Use Case	Scenario	Driving Property	Scale Dimension	Why Not Alternative
Mobile / web push notifications	App sends order-status pushes to millions of devices	Native APNs/FCM/A2P delivery, no buffer needed	Millions of endpoints	Kafka has no native push to mobile; SQS does not deliver to devices
Lambda fan-out trigger	One event triggers several Lambdas in parallel	Push-based invoke, no polling, no idle cost	Thousands of invokes/s	SQS+Lambda needs polling config; Redis has no Lambda integration
System alerts to humans	CloudWatch alarm fans to email + SMS + Slack webhook	Multi-protocol A2P delivery from one publish	Low volume, high reliability	Only SNS speaks email/SMS/HTTP natively in one call
Cross-account event broadcast	Central account publishes events consumed by many team accounts	Topic policies allow cross-account subscription	Dozens of accounts	Kafka cross-account needs networking + ACL plumbing
Fanout entry point (paired with SQS)	SNS is the fan-out hub feeding many SQS queues	Decouples publisher from a growing set of consumers	Many independent consumers	Raw SQS cannot fan one message to many queues; SNS is the splitter

Apache Kafka

Use Case	Scenario	Driving Property	Scale Dimension	Why Not Alternative
Event sourcing / CQRS	The log is the source of truth, read models project from it	Durable replayable ordered log	Billions of events retained	SQS deletes on consume; SNS keeps nothing; no replay anywhere else
Multi-consumer analytics + real-time	Same clickstream feeds Flink real-time and a nightly batch job	Independent consumer-group offsets on one copy	Millions of msg/s	SNS+SQS would need N duplicated queues and still no replay
Stream processing backbone	Kafka feeds Flink/Kafka Streams for joins and windowing	Ordered partitions + offset semantics	High throughput, stateful	Redis/SNS/SQS have no stream-processing ecosystem
Log / metrics aggregation pipeline	Thousands of services ship logs into topics, fan to sinks	High write throughput, cheap sequential storage	TB/day ingestion	SQS request pricing at this volume is brutal; Kafka is built for it
CDC and database replication	Debezium streams DB changes into Kafka for downstream sync	Ordered per-key change log, replayable	Continuous high volume	Ordering + replay are mandatory; only Kafka offers both here

Redis Pub/Sub

Use Case	Scenario	Driving Property	Scale Dimension	Why Not Alternative
Live presence / typing indicators	Chat app shows who is online and typing	Sub-ms latency, loss is harmless	Many ephemeral signals	Kafka/SQS add latency for a signal that is worthless if delayed
WebSocket fan-out across servers	Broadcast a message to all WS servers holding client connections	Lowest-latency cross-server broadcast	Many app servers	SNS/SQS round-trip is too slow for interactive broadcast
Live config / cache-invalidation ping	Tell all nodes to drop a cache key now	Instant best-effort broadcast	Cluster-wide	Durability is unnecessary; a missed invalidation self-heals on TTL
Real-time leaderboard / dashboard ticks	Push score updates to live dashboards	Low latency, stale data soon overwritten	High-frequency updates	Each tick supersedes the last, so loss does not matter
In-game ephemeral events	Broadcast transient in-match events to connected clients	Speed over guarantee	Bursty, latency-sensitive	Durable systems over-engineer a throwaway signal

SNS + SQS (fan-out composite)

Use Case	Scenario	Driving Property	Scale Dimension	Why Not Alternative
Order-event fan-out to microservices	OrderPlaced fans to billing, inventory, email, analytics queues	Each service buffers and acks independently	Many services, moderate volume	Raw SNS drops if a consumer is down; this survives outages per consumer
Decoupled async job dispatch	An event spawns durable work units consumed by autoscaling workers	Durable buffer + DLQ + visibility timeout	Variable worker fleet	Redis loses jobs on disconnect; Kafka is ops-heavy for simple jobs
Reliable cross-service eventing in AWS	Internal events where loss is unacceptable but replay is not needed	At-least-once durable delivery, near-zero ops	Org-wide eventing	Kafka cluster ops not justified when replay is not required
Buffering bursty traffic to slow consumers	Spiky publish rate, downstream processes at a steady pace	Queue absorbs the burst, consumer drains steadily	10x burst factors	SNS alone has no buffer; the SQS queue is the shock absorber
Ordered, exactly-once workflow steps	Sequential steps that must not duplicate or reorder	SNS FIFO to SQS FIFO end to end	Up to 300 msg/s (3,000 batched)	Redis/Standard give no ordering or dedup guarantee

3. Limitations

Limitation Axis	SNS	Kafka	Redis Pub/Sub	SNS+SQS
Durability	High: No store; relies on subscriber availability	Medium: Durable to disk, bounded by retention	Critical: None; offline subscriber loses everything	Medium: Durable up to 14-day retention, then dropped
Replay / history	High: Impossible without archiving	Medium: Native, bounded by retention	Critical: No replay at all	High: None; consumed messages are deleted
Ordering	Medium: Best-effort; FIFO caps at 300 TPS	Medium: Per-partition only, not global	Medium: Per-channel best-effort, no guarantee on reconnect	Medium: Standard unordered; FIFO ordered but 300 TPS
Fan-out cost	Medium: Billed per subscription delivery	Medium: One copy on disk, cheap fan-out	Medium: Classic cluster mode broadcasts to all nodes	High: N queues = N copies = N times SQS cost
Operational burden	Medium: Low; managed	Critical: Brokers, ISR, rebalancing, controller, upgrades	Medium: Failover and buffer-limit tuning	Medium: Low; managed, but many queues to govern
Throughput ceiling	Medium: Very high Standard; 300 TPS FIFO	Medium: Millions/s, the highest here	Medium: High but single-threaded command path	Medium: Very high Standard; 300 TPS FIFO
Payload size	Medium: 256KB, billed in 64KB chunks	Medium: Default 1MB, tunable	Medium: Bounded by memory and buffer limits	Medium: 256KB (2GB via S3 extended client)
Portability / lock-in	High: AWS-only	Medium: Runs anywhere, open protocol	Medium: Open source, runs anywhere	High: AWS-only pattern

4. Fault Tolerance

Dimension	SNS	Kafka	Redis Pub/Sub	SNS+SQS
Replication model	Internal, multi-AZ, opaque to user	Leader + ISR followers per partition	Primary + replica (async), or none	Internal, multi-AZ, opaque (SQS stores redundantly)
Failure detection	AWS-managed	Controller / KRaft heartbeats, ISR shrink	Sentinel or Cluster gossip	AWS-managed
Failover mechanism	Transparent, automatic	ISR leader election (seconds)	Sentinel promotes replica (seconds to tens of s)	Transparent, automatic
RTO (typical)	Near-zero (managed)	Seconds (leader election)	Seconds to tens of seconds	Near-zero (managed)
RPO (typical)	Zero for accepted msgs (best-effort delivery after)	Zero with acks=all + min.insync.replicas	High: in-flight + buffered msgs lost on failover	Zero for enqueued messages
Split-brain behavior	N/A, managed	Prevented by min.insync.replicas; unclean election risks loss	Possible with Sentinel misconfig; writes to old primary lost	N/A, managed
Blast radius, single node	None visible	Partitions led by that broker fail over; lag spike	If non-replicated, total loss of that shard's channels	None visible
Cross-region failover	Region-scoped; DR needs multi-region topics	MirrorMaker 2 / Confluent replication, manual	Active-active needs Redis Enterprise CRDTs	Region-scoped; DR needs multi-region design
Data loss scenario	Endpoint down past retry window with no DLQ	Unclean leader election or retention expiry	Routine: any disconnect, slow consumer, or restart	Message age exceeds retention (max 14 days)

5. Sharding / Partitioning

Dimension	SNS	Kafka	Redis Pub/Sub	SNS+SQS
Sharding model	None visible (managed internally)	Explicit partitions, hash on key	Channel-based; Sharded Pub/Sub hashes channel to slot	None visible (managed internally)
Shard key constraints	N/A (FIFO uses MessageGroupId for ordering)	Partition key; same key always same partition	Channel name maps to slot in Sharded mode	N/A (FIFO uses MessageGroupId)
Rebalancing mechanism	Automatic, invisible	Consumer-group rebalance + partition reassignment	Cluster slot migration	Automatic, invisible
Rebalancing cost / impact	None to user	Stop-the-world pause (eager) or incremental (cooperative)	Slot migration moves keys; brief unavailability	None to user
Hot-shard behavior	N/A	Skewed key floods one partition; lag on that consumer	A hot channel concentrates load on one node	N/A; a hot queue just scales its consumer fleet
Max shards (practical)	N/A	Thousands of partitions/cluster (KRaft raised the ceiling)	16,384 hash slots in Cluster	Effectively unlimited queues
Reshard without downtime?	N/A	Add partitions yes, but breaks key-to-partition mapping	Slot migration is online but operationally heavy	N/A; add queues freely
Cross-shard query	N/A	No cross-partition ordering; app must merge	No cross-channel semantics	N/A; queues are independent
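
The Kafka column above says the same key always maps to the same partition, and that adding partitions breaks that mapping. A minimal sketch of both points follows. It assumes a generic hash-mod-N scheme with `zlib.crc32` as a stand-in hash (Kafka's default partitioner uses a different hash function), and `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hash-mod-N placement; crc32 is a stand-in for the real partitioner's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Same key, same partition count: always the same partition, so per-key order holds.
assert partition_for("order-42", 6) == partition_for("order-42", 6)

# Growing the topic from 6 to 8 partitions remaps many existing keys.
keys = [f"order-{i}" for i in range(1000)]
before = {k: partition_for(k, 6) for k in keys}
after = {k: partition_for(k, 8) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed partition")
```

Events for a remapped key written before and after the change sit in different partitions, so per-key ordering across the boundary is no longer guaranteed.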

6. Replication

Dimension	SNS	Kafka	Redis Pub/Sub	SNS+SQS
Topology	Managed multi-AZ (opaque)	Leader-follower per partition	Primary-replica (async)	Managed multi-AZ (opaque)
Sync vs async	Managed	Configurable: acks=all is sync to ISR	Async; replica can lag the primary	Managed (synchronous across AZs)
Replication factor	Managed	Commonly 3 in production (broker default is 1), tunable per topic	Typically 1 replica per primary	Managed (multiple AZs)
Consistency options	At-least-once (Std), exactly-once (FIFO)	Tunable via acks + min.insync.replicas + EOS	None; no consistency guarantee for Pub/Sub	At-least-once (Std), exactly-once (FIFO)
Replication lag	N/A	Sub-second healthy; watch ISR shrink	Async, can spike under load	N/A (managed)
Conflict resolution	N/A (single writer path)	No conflicts; single leader per partition	Last-write-wins on the primary	N/A
Cross-region replication	Not native; design-level	MirrorMaker 2 / cluster linking	Redis Enterprise active-active (CRDT)	Not native; design-level
Replication during partition	Managed	ISR shrinks; acks=all blocks if below min.insync	Primary serves alone; replica diverges	Managed (stays consistent across AZs)
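
The interaction between acks, the ISR, and min.insync.replicas in the Kafka column can be written down as a small rule. This is a simplified model under stated assumptions, not broker code: the function names are hypothetical, and the model only captures that acks=all needs the ISR to be at least min.insync.replicas, while acks=0 or acks=1 only need a live leader.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Would the partition leader accept a produce request in this state?"""
    if isr_size < 1:
        return False  # no in-sync leader available
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True  # acks "0" or "1": min.insync.replicas is not enforced


def copies_before_ack(acks: str, isr_size: int) -> int:
    """Replicas that hold the record at the moment the producer is acknowledged."""
    if acks == "all":
        return isr_size
    return 1 if acks == "1" else 0


# Replication factor 3, min.insync.replicas=2, two followers fall out of the ISR.
print(write_accepted("all", 3, 2))  # True
print(write_accepted("all", 1, 2))  # False: durability is preferred over availability
print(write_accepted("1", 1, 2))    # True, but only the leader has the record
print(copies_before_ack("1", 3))    # 1: lost if the leader dies before followers fetch
```

The acks=1 case is the one that loses data on a leader failure: the producer was told the write succeeded while a single copy existed.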

7. Better Usage Patterns

Amazon SNS (standalone)

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
DLQ on every subscription	Rely on default retries and silently lose messages	Attach a redrive DLQ to each subscription	Without a DLQ, an endpoint outage past the retry window is permanent loss
Attribute over payload filtering	Filter on message payload, paying per GB scanned	Put filter dimensions in message attributes (free filtering)	Attribute filtering is free; payload filtering bills per GB scanned at scale
Batch publishing	Ignore PublishBatch and send one message per API call	Use PublishBatch (up to 10 msgs/call)	Cuts API request cost by up to 90% on small messages
Claim-check for big payloads	Send 256KB blobs and eat the 4x chunk billing	Store payload in S3, publish a pointer	Keeps each publish at one 64KB chunk and one delivery unit
Reserve FIFO for genuine ordering	Default everything to FIFO topics	Use Standard unless ordering/dedup is truly required	FIFO caps at 300 TPS and only delivers to SQS FIFO; Standard is far cheaper and faster
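
The batch-publishing row can be sketched as a grouping step that runs before any API call. The helpers below are hypothetical and make no network calls. The entry shape (`Id`, `Message`) follows the PublishBatch request format, and each resulting list is what would be handed to the SDK's batch publish call.

```python
MAX_BATCH = 10  # PublishBatch accepts up to 10 messages per call


def batches(messages, size=MAX_BATCH):
    """Yield consecutive groups of at most `size` messages."""
    for start in range(0, len(messages), size):
        yield messages[start:start + size]


def to_batch_entries(batch):
    """Build request entries; Id only needs to be unique within one batch."""
    return [{"Id": str(i), "Message": body} for i, body in enumerate(batch)]


messages = [f"msg-{i}" for i in range(95)]
calls = [to_batch_entries(b) for b in batches(messages)]

print(len(messages), "publishes ->", len(calls), "batch calls")  # 95 -> 10
print(len(calls[-1]))                                            # 5 in the last batch
```

The saving applies to small messages. Large payloads still count in 64KB chunks, so batching pairs naturally with the claim-check pattern.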

Apache Kafka

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Partition-key design	Random keys, losing per-entity ordering	Key by the entity that needs ordered processing	Ordering is per-partition; the key is the only ordering lever you have
Cooperative rebalancing	Default eager rebalancing stops all consumers	Use cooperative-sticky assignor	Avoids stop-the-world pauses every time the group changes
acks=all + min.insync.replicas	acks=1 and assume durability	acks=all with min.insync.replicas>=2	acks=1 loses data on leader failure before replication
Consumer lag as the SLO	Alert on broker CPU, miss the real signal	Alert on consumer-group lag	Lag is the direct measure of whether consumers keep up; it predicts incidents
Tiered storage for long retention	Size broker disks for full retention	Use tiered storage (KIP-405) to offload to object storage	Decouples retention from broker disk, slashing cost for replay-heavy topics
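
The lag-as-SLO row reduces to a subtraction per partition. The sketch below assumes the offsets have already been fetched by some monitoring tool. The function names and the threshold are hypothetical, and a partition with no committed offset is treated as fully lagging.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


def lagging_partitions(lag: dict, threshold: int) -> list:
    """Partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, value in lag.items() if value > threshold)


end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 900, 2: 1505}

lag = consumer_lag(end_offsets, committed)
print(lag)                           # {0: 0, 1: 580, 2: 5}
print(sum(lag.values()))             # 585
print(lagging_partitions(lag, 100))  # [1]: one partition is falling behind
```

A single lagging partition with healthy neighbors usually points at a skewed key or a stuck consumer rather than a broker problem, which is why per-partition lag is more useful than the total.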

Redis Pub/Sub

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Use Streams when you need durability	Reach for Pub/Sub then bolt on hacks to avoid loss	Use Redis Streams (consumer groups + persistence)	Pub/Sub will never be durable; Streams is the right structure inside Redis
Sharded Pub/Sub in Cluster	Plain SUBSCRIBE in Cluster, flooding all nodes	Use SSUBSCRIBE (Sharded Pub/Sub, Redis 7+)	Classic Pub/Sub broadcasts cluster-wide, wasting bandwidth and capping scale
Separate bus from cache	Run Pub/Sub on the shared cache instance	Dedicate a Redis instance to messaging	Decouples failure domains; cache pressure should not kill your event bus
Tune client-output-buffer-limit	Leave defaults, slow subscribers get dropped silently	Size buffer limits to consumer pace, monitor disconnects	Redis kills slow subscribers to protect itself, causing invisible loss
Accept loss explicitly	Treat Pub/Sub as reliable, build on a false assumption	Only use it where loss is acceptable by design	Designing for guarantees Redis Pub/Sub does not offer is the root cause of most incidents

SNS + SQS (fan-out composite)

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Long polling everywhere	Short polling burns requests on idle queues	Set WaitTimeSeconds=20 on every receive	Cuts empty receives, and their request charges, to at most one per 20 seconds per poller on an idle queue
Visibility timeout sizing	Default 30s, then duplicate work on slow processing	Set to ~6x processing time or heartbeat with ChangeMessageVisibility	Too-short timeout redelivers in-flight messages, doubling work
DLQ + maxReceiveCount	Poison messages loop forever, blocking the queue	Configure a DLQ with a sane maxReceiveCount	A poison pill without a DLQ stalls the whole consumer
Subscribe SQS, not raw HTTP	Subscribe an HTTP endpoint directly to SNS, no buffer	Insert an SQS queue as the durable buffer	The SQS queue is what makes the consumer outage-tolerant; that is the whole point of the pattern
Idempotent consumers	Assume exactly-once on Standard queues	Make handlers idempotent (dedup key)	Standard SQS is at-least-once; duplicates are normal, not exceptional
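
The idempotent-consumer row can be sketched as a handler that remembers a dedup key per message. This is a minimal in-memory illustration: `IdempotentHandler` and the `dedup_key` field are hypothetical names, and a real consumer would keep the seen-keys set in a durable store shared by all workers.

```python
class IdempotentHandler:
    """Runs the side effect at most once per dedup key."""

    def __init__(self, side_effect):
        self.side_effect = side_effect
        self.seen = set()  # assumption: in-memory; production needs a durable store

    def handle(self, message) -> bool:
        key = message["dedup_key"]
        if key in self.seen:
            return False            # duplicate delivery: skip, then delete as usual
        self.side_effect(message)   # if this raises, the key stays unmarked and is retried
        self.seen.add(key)
        return True


charges = []
handler = IdempotentHandler(lambda m: charges.append(m["amount"]))

# At-least-once delivery: the first message arrives twice.
deliveries = [
    {"dedup_key": "order-1", "amount": 20},
    {"dedup_key": "order-1", "amount": 20},
    {"dedup_key": "order-2", "amount": 5},
]
print([handler.handle(m) for m in deliveries])  # [True, False, True]
print(charges)                                  # [20, 5]: no double charge
```

Marking the key only after the side effect succeeds keeps retries safe. The trade-off is that a crash between the two steps can repeat the side effect, so the effect itself should also tolerate a repeat where possible.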

8. Advanced / Next-Gen Alternatives

Amazon SNS (standalone)

Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Amazon EventBridge	Rich routing rules, schema registry, SaaS event sources	Production	Low (similar model)	You need content-based routing and many event sources, not just fan-out
SNS + Kinesis Firehose archive	Adds durable history / replay to SNS	Production	Low	You like SNS push but need an audit trail or reprocessing
Google Pub/Sub	Push + pull, durable, replay, global by default	Production	High (cloud move)	On GCP, or want durable pub/sub without running Kafka

Apache Kafka

Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Apache Pulsar	Native multi-tenancy, tiered storage, queue+stream in one	Production	High (different model)	Heavy multi-tenant fan-out plus queueing in one system
Redpanda	Kafka API, no JVM/ZooKeeper, lower tail latency	Production	Low (Kafka-compatible)	Want Kafka semantics with simpler ops and better p99
WarpStream / diskless Kafka	Kafka API directly on S3, no local disks, cheaper	Emerging	Low (Kafka-compatible)	Cost-sensitive, latency-tolerant streaming on object storage

Redis Pub/Sub

Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Redis Streams	Adds persistence, consumer groups, replay within Redis	Production	Low (same engine)	You need durability but want to stay in Redis
NATS / NATS JetStream	Lightweight pub/sub, JetStream adds durability + replay	Production	Medium	Want low-latency messaging with optional durability, lighter than Kafka
Redis Sharded Pub/Sub	Scales fan-out across a cluster correctly	Production	Low (Redis 7+ feature)	Already on Redis Cluster and Pub/Sub fan-out is the bottleneck

SNS + SQS (fan-out composite)

Alternative	What It Improves	Maturity	Migration Cost	When To Consider
EventBridge + SQS targets	Adds rule-based routing in front of the queues	Production	Low	You outgrow simple fan-out and need filtering/routing logic
Amazon MSK / Kafka	Adds replay and a shared log for late/new consumers	Production	High (paradigm shift)	You need history, replay, or many groups reading one stream
Kinesis Data Streams	Ordered, replayable shards, managed, AWS-native	Production	Medium	Want Kafka-like replay/ordering without running Kafka, staying on AWS
