Communication Protocols — PE Trade-Off Analysis

REST, Polling, Webhooks, SSE, WebSockets, gRPC, WebRTC — analyzed as a layered stack of application-layer transports, with reframed sharding / replication / fault-tolerance semantics suitable for protocols rather than databases.

Category Sweep · As of 2026-05-30

PE Verdict

These seven are not alternatives. They are points on a 4-axis decision space — direction (pull vs push vs bidi vs peer), state (stateless vs sticky vs session vs peer), ordering (none vs per-stream vs total), and infra coupling (browser-native vs HTTP intermediaries vs custom). Pick by failure mode, not by feature. The interview signal is naming the inflection point where the trade-off flips.

Matrix-table reframing: Sharding becomes Connection Distribution (sticky sessions, LB compatibility, scale-out story). Replication becomes Fanout / State Sync (how one event reaches N receivers). Fault Tolerance becomes Failure & Recovery (reconnect, retry, idempotency, message-loss windows). These reframings apply consistently across all seven protocols.

Best default choices

REST	CRUD, public APIs, cacheable reads
SSE	One-way browser streams and LLM output
WebSockets	Bidirectional realtime interactions
gRPC	Typed internal service boundaries
WebRTC	Peer media, calls, and low-latency data

1. Trade-Offs

Per-protocol. Each row is a real "give up X for Y" decision, not a feature.

REST HTTP/1.1 · HTTP/2 · HTTP/3

Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Stateless request-per-resource	Uniform horizontal scaling, every box identical, zero session affinity	Auth re-validated every request, no server-driven state push without polling	User triggers an action whose result arrives 5s later via a background job, and the only way to learn that is the client polling	Statelessness is real only if you don't smuggle state into cookies, JWT payloads, or CDN cache keys. Most "stateless" REST stacks are sticky in three hidden places.
HTTP verbs over custom RPC	Cache at every layer (browser, CDN, reverse proxy) keyed on method+URL	Forces CRUD shape onto domain operations that are not CRUD	"Refund this transaction with reason X" gets shoehorned into POST /refunds and you lose half the domain language	REST orthodoxy is a tax. Pragmatic teams ship RPC-style POST endpoints and accept the cache loss. The real question is whether your reads can be cached, not whether your writes can be RESTful.
JSON over the wire	curl, browser devtools, every language has a parser, schema is optional	3 to 10x larger payload vs protobuf, CPU on serialize/parse at high QPS	Mobile client on 3G in emerging market loads a 200KB JSON response in 4s when a protobuf equivalent loads in 600ms	JSON cost shows up at the egress bill before it shows up in latency. At 50K QPS with 5KB responses, swapping to protobuf saves ~$30K/mo in some clouds before you talk about latency.
HTTP/1.1 connection model	Universal proxy / LB / WAF / CDN compatibility, zero special-casing	Head-of-line blocking on each connection; browsers cap at about 6 connections per origin	Page with 30 API calls serializes through 6 connections, and p95 page load is 2s slower than over HTTP/2	HTTP/2 fixes this, but only if the entire path supports it. Many internal LBs terminate HTTP/2 at the edge and re-emit HTTP/1.1 to backends, so you get half the benefit.
URL / header versioning	Multiple API versions can coexist on the same fleet, gradual migration	No runtime type checking, breaking changes are entirely a coordination problem	v2 is deployed, v1 is supposed to be deprecated, and you find a customer still on v1 the day after you delete the code	The right invariant is "no breaking changes, ever" — only additive evolution and explicit deprecation windows. Versioning is a workaround for not committing to that discipline.
Verb-based idempotency (GET, PUT, DELETE)	Safe-by-default retries on idempotent methods, no extra protocol	POST has no idempotency unless you build it (Idempotency-Key header pattern)	Network blip mid-checkout, client retries POST /orders, customer gets charged twice	Idempotency-Key with a 24h server-side dedup window is non-negotiable for any POST that mutates money or external state. Stripe's pattern is the reference, not the exception.
Standard auth (cookies, Bearer, mTLS)	Drop-in for every identity provider and gateway	No connection-level identity — every request re-validates the token	JWT validation hits a remote JWKS endpoint, your service is now 50ms slower per request and coupled to the IdP's uptime	Cache JWKS aggressively with a TTL shorter than the longest-lived token, and assume the IdP will have an outage at the worst time. Local validation with periodic refresh, not per-request fetch.
Pull-only by design	Receiver in full control of pace, backpressure, and retry	No server-initiated notifications without a second mechanism (SSE / WebSocket / push)	"User got a new message" requires polling at 10s intervals — your battery and bandwidth team will both find you	REST + SSE is a perfectly valid hybrid. The mistake is treating REST as needing replacement when you need push; you only need push for the push channel, not the rest of the API.
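
The Idempotency-Key row above can be sketched as a small dedup window. This is a minimal illustration, not Stripe's implementation: `IdempotencyStore`, `run_once`, and `create_order` are hypothetical names, the store is an in-memory dict (a real service would need a shared store and would have to handle two concurrent requests carrying the same key), and the 24h TTL is taken from the table.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class StoredResponse:
    status: int
    body: dict
    expires_at: float


class IdempotencyStore:
    """In-memory dedup window keyed by the client's Idempotency-Key."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, StoredResponse] = {}

    def run_once(self, key: str,
                 handler: Callable[[], Tuple[int, dict]]) -> Tuple[int, dict]:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached.expires_at > now:
            # Replay the first response instead of repeating the side effect.
            return cached.status, cached.body
        status, body = handler()
        self._entries[key] = StoredResponse(status, body, now + self._ttl)
        return status, body


charges = []


def create_order() -> Tuple[int, dict]:
    charges.append("charge")  # the side effect that must not run twice
    return 201, {"order_id": len(charges)}


store = IdempotencyStore()
first = store.run_once("key-123", create_order)
retry = store.run_once("key-123", create_order)  # client retry after a blip
```

The retry gets the stored response back and `charges` still holds a single entry; once the TTL lapses, the same key is treated as a new request.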

Polling Short · Long · Conditional

Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Trivial implementation	A loop, an HTTP client, a timer — works in 20 minutes	Latency ceiling equals the poll interval, by definition	Stakeholder demos a "real-time" feature at 10s polling, then asks why it's not real-time	Polling is the right answer more often than engineers admit. The shame around it leads teams to over-engineer SSE/WS solutions when 30s polling would cost 1% as much.
Stateless client and server	Any box can serve any poll, identical to the REST scaling story	No notion of "what's new since last poll" without explicit cursors	Client polls and gets the same 50 items every time, must dedupe locally, and misses events under reordering	Watermark-based or sequence-number-based delta polling is the only sane pattern at scale. Full-state polling at 10K clients is a self-inflicted DDoS.
Works through any firewall / proxy	Zero special-case infra; corporate proxies, transparent NATs, all pass it	Mobile radio wakes for every poll, kills battery and data plan	iOS app polls every 30s in background, App Store reviewer flags it for power use, you ship a fix that quietly disables the feature	Mobile push (APNs / FCM) is the real polling-killer on mobile, not WebSockets. The OS already maintains one persistent connection — piggyback on it.
Long polling reduces tail latency	Delivery latency drops from about interval/2 on average (close to the full interval at p99) to milliseconds when events occur	Each connected client holds a server socket for the full timeout window	10K clients each holding a 30s long-poll = 10K open sockets at all times, and your origin server runs out of file descriptors	Long polling is structurally SSE with worse ergonomics. If you're already paying the socket cost, switch to SSE and get reconnect + event-id for free.
Trivial backoff and retry	Exponential backoff is a one-liner, no protocol-level reconnect logic	N clients on synchronized intervals create thundering herds at every tick	Cron-style polling at :00 every minute drives a 100x QPS spike for 200ms, every minute, forever	Jitter is mandatory, not optional. ±20% randomization on every interval. Server-side rate-limit is the second line, not the first.
Reuses HTTP cache / auth / observability	Every request is a normal HTTP call, every existing tool works	Most polls return 304 / empty, wasting most of the request budget	90% of your /events endpoint requests return "nothing new" but each one costs an auth check, a DB query, and a CDN miss	Conditional requests (If-None-Match with ETag, or If-Modified-Since with Last-Modified) against validator-aware origins can cut empty-poll cost by around 80%. Most poll endpoints ignore conditional headers, which makes honoring them the highest-ROI optimization in this stack.
No server-side fanout complexity	No pub/sub bus, no consumer group, no offset tracking	Each client at a different "tick" — cross-client consistency is on you	User A sees a new message, switches devices, device B is between polls, the conversation history disagrees for 8 seconds	Polling cannot give you per-user causal ordering across devices without per-user state. Once you add that state, you've reinvented pub/sub with worse semantics — at which point go SSE.
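
Two of the patterns named in the table, jittered intervals and cursor-based delta polling, fit in a few lines. This is a sketch under assumptions: `jittered_interval`, `fetch_since`, and `DeltaPoller` are hypothetical names, events are `(sequence, payload)` tuples held in a list sorted by sequence number, and the ±20% spread comes from the table.

```python
import random
from typing import Callable, List, Optional, Tuple

Event = Tuple[int, str]  # (sequence number, payload)


def jittered_interval(base_seconds: float, spread: float = 0.2,
                      rng: Optional[random.Random] = None) -> float:
    """Randomize each sleep by +/- spread so clients do not tick in lockstep."""
    rng = rng if rng is not None else random.Random()
    return base_seconds * rng.uniform(1 - spread, 1 + spread)


def fetch_since(log: List[Event], cursor: int, limit: int = 100) -> List[Event]:
    """Server side: return only events whose sequence number is past the cursor."""
    return [event for event in log if event[0] > cursor][:limit]


class DeltaPoller:
    """Client side: remember the highest sequence seen and ask for newer events."""

    def __init__(self, fetch: Callable[[int], List[Event]]):
        self._fetch = fetch
        self.cursor = 0

    def poll_once(self) -> List[Event]:
        events = self._fetch(self.cursor)
        if events:
            self.cursor = max(seq for seq, _ in events)
        return events
```

A client loop would call `poll_once()` and then sleep for `jittered_interval(30.0)`. An empty result costs the server one cursor comparison rather than a full-state read.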

Webhooks HTTP POST callbacks

Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Push semantics, no persistent client connection	Receiver doesn't pay socket cost when idle, sender pays nothing per receiver	Sender controls when delivery happens, receiver cannot pull on demand	Sender's queue backs up during incident, receivers see no events for hours, then 10x burst on recovery	Pair webhooks with a /events GET fallback. Webhooks alone make lost events permanently lost; the GET fallback makes them recoverable.
Receiver scales independently of sender	Receiver can run on serverless, autoscale on POST volume	No exactly-once delivery — duplicates are routine and your problem to dedup	Stripe retries on receiver 5xx, you ack 200 but the response was lost in transit, you process the same event twice	Event-id dedup with a TTL'd seen-set (Redis or DDB) is the standard pattern. The TTL must exceed the sender's retry window, which means reading the sender's docs carefully.
Standard HTTP at receiver	No new infra: ALB / API Gateway / Lambda handle webhook POSTs directly	No backpressure: sender decides rate, receiver absorbs or returns 429	Marketing sends a campaign, Stripe fires 1M payment events in 2 minutes, and your receiver endpoint, with no queue in front of it, runs out of DB connections	Receiver pattern: accept POST → enqueue to SQS / Kafka → ack 200 immediately. Synchronous processing inside the webhook handler is the most common production fire I've seen.
Decoupled deploy lifecycles	Sender and receiver release independently, no shared library	Schema evolution coordinated only by changelog, no compile-time check	Sender adds a required field, receiver breaks on parse, support tickets flood in two days later when retry buffer drains	CloudEvents spec + schema registry + additive-only evolution. The contract is the schema, not the URL. Most teams treat the URL as the contract and learn this twice.
Public receiver enables cross-org integration	B2B integrations work without VPC peering or shared identity	Signature verification mandatory — sender IP rotates, TLS alone doesn't prove origin	Receiver trusts the source IP, sender migrates clouds, receiver starts rejecting valid events	HMAC signature with a shared secret, timestamp in the signature payload, reject signatures older than 5 minutes. This is the minimum, and most public webhook integrations get all three wrong somewhere.
Retry-with-backoff is standard	Transient failures self-heal without app-level logic	Bounded latency goes out the window — retried events can land hours late	Receiver was down for 1h, sender retries with backoff, your "real-time" event arrives at 11pm when the daytime team is gone	Webhooks are at-least-once with unbounded latency. If your business logic depends on event ordering or timeliness, webhooks are wrong; you want a queue with consumers.
No persistent state on sender side	Sender's webhook delivery service is bounded — fire-and-forget after retry exhaustion	No replay primitive — exhausted retries are gone unless sender separately retains	Receiver discovers a bug after a week, asks sender to "resend last 7 days", sender's retry buffer is 72h	Always pair webhooks with an idempotent /events?since=cursor GET endpoint that can backfill. The webhook is the cache; the GET is the source of truth.
Receiver can fan out internally	One public endpoint, many internal consumers via local pub/sub	No cross-event ordering — parallel HTTP delivery is unordered by definition	"Order created" arrives after "Order shipped" because they took different network paths, your state machine rejects shipped events	If ordering matters, partition by entity id at receive and serialize per-partition. Or use Kafka. Webhooks are not the right transport for ordered streams.
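
The signature check described in the table (HMAC with a shared secret, timestamp inside the signed payload, 5-minute rejection window) might look like the sketch below. The `timestamp.body` message layout and the function names are assumptions for illustration; each real sender documents its own header names and signing format, and that documentation is the contract.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # reject signatures older than 5 minutes


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: sign the timestamp together with the raw request body."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Receiver side: check freshness first, then compare in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # stale or future-dated: possible replay
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verification has to run over the raw bytes as received. Re-serializing parsed JSON before hashing changes the bytes and makes valid signatures fail.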

SSE — Server-Sent Events text/event-stream over HTTP

A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Single long-lived HTTP stream	Server push without a new protocol; works through HTTP-aware infra	One-way only — client-to-server requires a separate POST	Building chat on SSE, every send is a POST to /messages plus a stream listener for receives, two channels to keep in sync	SSE is the right answer when the read path dominates and writes are infrequent (LLM streaming, dashboards, notifications). It is the wrong answer when read and write are interleaved at high frequency.
Auto-reconnect with Last-Event-ID	Browser reconnects automatically on drop, server can resume from id	Server must remember per-event-id state long enough to honor it	Server restarts, in-memory event log is gone, client reconnects with Last-Event-ID and gets nothing — silent gap	Persist the event log to Redis Streams or Kafka with a retention longer than the worst-case reconnect window. In-memory event store is fine for demos and not much else.
Standard HTTP infra compatibility	Works through proxies, LBs, CDNs (with buffering off), no upgrade handshake	Text-only (UTF-8 event stream format)	Binary payload needed, base64 encoding adds 33% size, you're now defeating the bandwidth point	If you need binary at scale, SSE is the wrong tool. WebSockets or gRPC-streaming. Don't fight the format.
Built-in event-id, retry, and event-type semantics	Replayable streams, configurable client-side retry, named event channels — all in the protocol	No multiplexing — one stream = one logical channel unless you multiplex at application layer	Dashboard needs "metrics" stream and "alerts" stream — you open two SSE connections, hit the HTTP/1.1 6-connection cap fast	HTTP/2 makes the multi-stream problem go away (streams share the connection). HTTP/1.1 SSE at 6+ streams is a debugging nightmare.
Trivial REST integration	Add /stream endpoint to existing API, same auth, same observability	No bidirectional, so any "send me commands" channel needs a separate POST + correlation	LLM streaming with tool-use mid-response requires the client to POST a tool result while the stream is still flowing	Anthropic, OpenAI, and the rest standardized on SSE for LLM streaming. The reason: response streaming is one-way and writes are batch (one prompt). The shape matches.
HTTP/2 = one connection, many streams	No browser connection-cap pressure, streams multiplex naturally	HTTP/1.1 SSE caps at 6 streams per origin per browser — hard ceiling	SaaS app opens an SSE stream per tab, user has 7 tabs open, the 7th hangs forever	HTTP/2 is the only sane SSE transport in 2026. If your stack is HTTP/1.1, you have a bigger problem than SSE.
Same-origin / CORS rules as REST	Auth via cookies just works; no protocol-upgrade weirdness	No cross-origin without explicit CORS; some intermediaries strip the stream headers	SSE stream behind Cloudflare with default buffering on, you never receive an event, browser stays connected forever	Set `X-Accel-Buffering: no` and `Cache-Control: no-cache`. Test through the full CDN/LB stack, not localhost. The buffering footgun is the most common SSE production fire.
Stream is just text	curl, browser devtools, any HTTP client can consume the stream raw	No compression option in the spec (gzip on the response works, but per-event compression doesn't)	Streaming a high-rate metrics feed at 10KB/event/sec, you're shipping 10x the bytes a binary protocol would	Application-layer compression (compress payload bytes, send base64) defeats the debuggability point. If size matters that much, switch to WebSockets with permessage-deflate or gRPC.
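
The event-id, retry, and event-type fields mentioned above are plain text lines on the wire. The sketch below formats events and replays everything after a client's Last-Event-ID. `format_event` and `EventLog` are hypothetical names, and the in-memory deque is used only to keep the example short; it is the demo-grade store the table warns about.

```python
from collections import deque
from typing import Deque, Iterator, Optional, Tuple


def format_event(event_id: int, data: str, event: Optional[str] = None,
                 retry_ms: Optional[int] = None) -> str:
    """Serialize one event in text/event-stream format."""
    lines = []
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")  # client-side reconnect delay
    if event is not None:
        lines.append(f"event: {event}")     # named event type
    lines.append(f"id: {event_id}")         # echoed back as Last-Event-ID
    for chunk in data.split("\n"):          # one data: line per payload line
        lines.append(f"data: {chunk}")
    return "\n".join(lines) + "\n\n"        # blank line terminates the event


class EventLog:
    """Bounded replay buffer; events older than `capacity` are dropped."""

    def __init__(self, capacity: int = 1000):
        self._events: Deque[Tuple[int, str, str]] = deque(maxlen=capacity)
        self._next_id = 1

    def append(self, event: str, data: str) -> int:
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, event, data))
        return event_id

    def replay_after(self, last_event_id: Optional[str]) -> Iterator[str]:
        last = int(last_event_id) if last_event_id else 0
        for event_id, event, data in self._events:
            if event_id > last:
                yield format_event(event_id, data, event=event)
```

Because each event carries a type, "metrics" and "alerts" can share one connection and the client dispatches on the `event:` field. If the requested id is older than the oldest retained entry, this sketch silently resumes from what it still has, which is exactly the gap a persistent log with adequate retention is meant to close.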

WebSockets RFC 6455 · Full-duplex over TCP

Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Full-duplex single connection	Send and receive on the same socket, no second channel, no per-message connection setup (latency is bounded by network RTT once connected)	HTTP-layer caching, intermediaries, and observability all stop working post-handshake	Production incident: you can't replay traffic with curl, can't tail at the CDN, can't trace at the proxy. You're inside an opaque TCP pipe	WebSockets are an observability cliff. Invest in protocol-aware tracing (W3C Trace Context inside frames, structured logging at app layer) before you have an incident.
Low per-message overhead	2-14 byte frame header, no per-message HTTP headers	No app-layer backpressure visibility — TCP-level only, you can't tell a slow consumer from a fast one	One slow client's TCP send buffer fills, server keeps queuing messages in heap, OOMs on what should be a 10 KB/s feed	Build explicit ack-window backpressure at the application layer. TCP flow control alone is not enough at any nontrivial fanout. The pattern is RSocket-style credit windows.
HTTP upgrade handshake	Reuse port 443, TLS, hostname routing, existing edge infra for the handshake	Some L7 LBs and WAFs mishandle the Upgrade handshake or cut the long-lived connection after it succeeds	AWS ALB + WAF rules: handshake succeeds, frames get dropped 30s in, you debug for a week before finding the WAF rule	Test with your full production edge path, not localhost. Cloudflare, Cloud Armor, and ALB all have WebSocket-specific gotchas; document them per environment.
Long-lived connection enables low-latency push	Sub-10ms p99 server-to-client for short messages, no per-message setup	Server is stateful — every connected client is a slot on some specific box	Deploy day, you rolling-restart, every client reconnects to a different box, the new box needs to know the client's state — and you didn't think about that	Distributed pub/sub (Redis, NATS, Kafka) between WS servers is mandatory for any non-trivial fanout. Sticky-session-only architectures hit a wall at ~100K concurrent connections.
Binary frames	Efficient encoding, supports protobuf / msgpack / flatbuffers / raw bytes	No protocol semantics on top — you build your own framing, heartbeats, ack, ordering	Six months in, you've reinvented half of gRPC and badly: no flow control, no deadlines, no cancellation propagation	If you're building application-level RPC over WebSockets, you should be using gRPC or RSocket. WS is the wrong layer for the problem you're solving.
Standard browser API	new WebSocket(url) — works in every browser back to IE 10	Reconnect, resume, idempotency are all your problem (no Last-Event-ID equivalent)	User commutes through tunnels, WS drops every 90s, your reconnect logic loses 3 messages each time	Application-layer sequence numbers + server-side replay buffer + idempotent message handling. The pattern is well-known; the bug is forgetting it for V1 and learning the lesson on prod traffic.
Cross-origin via Origin header	Server can accept or reject by Origin, simple security model	Cookies work; bearer tokens must be passed at handshake (subprotocol or query) or in messages, because the browser API cannot set custom headers	Token in URL query gets logged at every proxy and CDN, so you've leaked auth tokens in plaintext to every log aggregator on the path	Pass the token through the Sec-WebSocket-Protocol header (the subprotocol trick) or in the first frame. Never in the URL. This is a CWE-200-class bug waiting to happen.
Persistent connection through most proxies	Once handshake succeeds, frames flow through transparent proxies fine	Graceful shutdown during deploys is hard — every open connection blocks the drain or gets dropped	Rolling deploy with 50K connections per pod, each pod tries to drain for 30s, clients see 30s of reconnect storms across the whole fleet	Send a "server-going-down" frame, give clients 60s with jitter to reconnect to the new fleet, then force-close. Connection-shedding plus jittered reconnect is the production pattern.
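
The credit-window backpressure recommended in the table can be sketched independently of any WebSocket library. `CreditedSender` is a hypothetical per-connection wrapper: `send` stands in for whatever writes a frame to the socket, and the credit grant is assumed to arrive as an application-level message from the client.

```python
from collections import deque
from typing import Any, Callable, Deque


class CreditedSender:
    """Send only while the client has granted credits; bound the backlog."""

    def __init__(self, send: Callable[[Any], None],
                 initial_credits: int = 8, max_queue: int = 64):
        self._send = send
        self._credits = initial_credits
        self._queue: Deque[Any] = deque()
        self._max_queue = max_queue
        self.dropped = 0

    def publish(self, message: Any) -> None:
        if self._credits > 0:
            self._credits -= 1
            self._send(message)
        elif len(self._queue) < self._max_queue:
            self._queue.append(message)
        else:
            # Slow consumer: shed load instead of growing the heap unbounded.
            self.dropped += 1

    def on_credit(self, n: int) -> None:
        """Client reports it has processed n messages and can take n more."""
        self._credits += n
        while self._credits > 0 and self._queue:
            self._credits -= 1
            self._send(self._queue.popleft())
```

What to do on overflow (drop, coalesce, or disconnect the client) is a product decision. The point of the sketch is that the server makes that decision explicitly instead of discovering it as an OOM.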

gRPC HTTP/2 · Protobuf · 4 streaming modes

Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
HTTP/2 multiplexing with binary protobuf	5-10x smaller payloads vs JSON, many concurrent RPCs on one TCP connection	Browser-native support is missing — gRPC-Web requires a proxy (Envoy / Connect)	Public API ambition, two months in, every customer asks for "is there a REST version" — you build it on the side, now you have two APIs	Connect protocol (CNCF) gives you gRPC schemas with HTTP/1.1 + JSON fallback and works in browsers without a proxy. For new greenfield, Connect over raw gRPC is the better default in 2026.
Codegen from .proto contracts	Type-safe clients and servers in every supported language	Every schema change is a regen + redeploy coordinated across services	Cross-team service contract evolves, three teams have different proto versions checked in, the wire works but the types mismatch at compile time everywhere	Buf Schema Registry (or equivalent) with breaking-change CI and a central source of truth. Without it, gRPC's compile-time safety is a lie — you have N copies of the truth.
Strict typed contracts	Wire format errors caught at codegen, no JSON-parse-at-runtime surprises	The loose coupling REST/JSON allows is gone; additive evolution rules are strict (no reusing or changing field numbers, no new required fields)	Engineer renumbers a field "to fix the schema", every existing client breaks silently and you learn from the on-call page	Protobuf evolution rules are non-obvious and unforgiving. Mandatory training for any team adopting gRPC; mandatory `buf breaking` in CI.
Four streaming modes	Unary, server-stream, client-stream, bidi — covers SSE and WebSocket use cases in one protocol	HTTP intermediary compatibility breaks — most CDNs and WAFs do not handle HTTP/2 trailers correctly	CloudFront in front of gRPC, half the calls fail with mysterious INTERNAL errors, you spend a week finding the trailer-stripping	Internal-only gRPC is the production sweet spot. External-facing gRPC at scale needs Envoy or equivalent control over the edge path.
Built-in deadlines and cancellation	Deadline propagates through call graph — downstream services know when to give up	Binary on the wire — no curl, mandatory tooling (grpcurl, BloomRPC, Postman gRPC, Buf Studio)	2am on-call page, "service is slow", you can't curl the endpoint to see what's happening — you need a working grpcurl install on the bastion	Reflection enabled in non-prod environments, disabled in prod. grpcurl scripts in the runbook. The cost of "binary wire" is felt at incident time, not coding time.
Channel-level keepalive and pooling	One channel, many concurrent calls; ping/pong frames keep NAT mappings alive	L4 round-robin LBs fight HTTP/2 multiplexing — every call from one client goes to the same backend	Behind an NLB, 90% of QPS lands on one pod, you scale horizontally and CPU utilization stays the same	Client-side load balancing (round-robin per call, not per connection) or a real service mesh (Istio, Linkerd, Consul Connect). NLB / NLB-equivalent + gRPC is a known anti-pattern.
Wire-format efficiency	Protobuf is ~5-10x smaller than JSON for the same data, parsing is ~10x faster	Debuggability cost — no human-readable wire, no field names on the wire (just tags)	PCAP from a production trace is useless without the .proto file; rolling forward to a new schema and reading old captures is now a tooling problem	Keep the .proto file checked in with every release tag. Re-decoding old PCAPs is a real forensic need; if you don't have the schema, you don't have the data.
Backpressure via HTTP/2 flow control	Window-based per-stream flow control built into the wire	Polyglot edge needs a gateway (grpc-gateway, Envoy, Connect) for REST/JSON clients	Mobile team is on Swift / Kotlin where gRPC is fine, web team needs JSON, you maintain a transcoder forever	Connect's HTTP/1.1 + JSON fallback eliminates this entirely for the web team. The mental model: gRPC for service-to-service, Connect for everywhere a browser is involved.
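
The L4 load-balancing row is easier to see with numbers. The toy simulation below compares connection-level pinning with per-call balancing. It is not gRPC code: the function names are hypothetical, and it assumes each client holds exactly one long-lived HTTP/2 connection.

```python
from itertools import cycle
from typing import Dict, List


def per_connection(clients: int, backends: List[str],
                   calls_per_client: int) -> Dict[str, int]:
    """L4 LB: a backend is chosen once per connection, at connect time."""
    load = {backend: 0 for backend in backends}
    assign = cycle(backends)
    for _ in range(clients):
        backend = next(assign)
        load[backend] += calls_per_client  # every RPC rides the same connection
    return load


def per_call(clients: int, backends: List[str],
             calls_per_client: int) -> Dict[str, int]:
    """Client-side LB: each RPC goes to the next backend in rotation."""
    load = {backend: 0 for backend in backends}
    for _ in range(clients):
        pick = cycle(backends)
        for _ in range(calls_per_client):
            load[next(pick)] += 1
    return load


pods = ["pod-a", "pod-b", "pod-c", "pod-d"]
pinned = per_connection(clients=2, backends=pods, calls_per_client=1000)
spread = per_call(clients=2, backends=pods, calls_per_client=1000)
```

With two clients and four pods, `pinned` leaves two pods idle however many pods are added, while `spread` puts 500 calls on each. This is why scaling out behind a connection-level balancer leaves CPU utilization unchanged.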

WebRTC SRTP · DTLS · ICE / STUN / TURN

Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
Peer-to-peer media path	Sub-100ms end-to-end latency, server bandwidth scales sublinearly with users	Server can't see / record / moderate the stream without a media server (SFU or MCU)	Compliance requires call recording, you didn't budget for an SFU, retrofitting it forces every client to upgrade	Plan for the SFU on day one even if you start with mesh. Mesh-only architectures hit recording, transcoding, and AI-analysis walls and need a rewrite, not a feature.
NAT traversal via ICE/STUN/TURN	Works through 80%+ of consumer NATs without manual port forwarding	Three pieces of infra to run (signaling, STUN, TURN), TURN bandwidth is metered and expensive	15-20% of users behind symmetric NAT or carrier-grade NAT fall back to TURN, your TURN egress cost dominates the bill	Twilio / Cloudflare / Xirsys for TURN-as-a-service if you don't have geographic scale to run your own. The crossover point is roughly 10K concurrent TURN relays.
Adaptive bitrate via SDP / RTCP	Codec, resolution, and FEC adapt to network conditions automatically	Deterministic latency goes out — adaptation can stall frames or drop quality under sustained loss	Cloud gaming or trading floor application needs predictable 50ms latency, WebRTC's adaptation makes that bound impossible to commit to	For "must hit X ms p99" use cases, configure the codec aggressively (forced bitrate, FEC always-on, jitter buffer floor) — accept the artifacts. Or use a different transport.
SRTP for media, DTLS for data channel	Encrypted in transit by default; end-to-end on direct peer paths, where no middlebox can inspect (an SFU terminates SRTP and sees the media)	No middlebox observability: debugging in production is "look at chrome://webrtc-internals"	Customer reports bad audio, you can't pcap-decode the RTP, you can't reproduce, you're guessing from getStats() metrics	getStats() polling at 1s intervals into your observability stack is mandatory. The metrics are good (jitter, packet loss, RTT) but you have to collect them yourself per-call.
Direct peer bypasses server bandwidth	1-on-1 calls cost almost zero server bandwidth (only signaling)	Mesh topology breaks above 4 peers — each client uploads N-1 streams	Sales demo expands from 4 to 8 people, half the participants' Wi-Fi can't sustain 7 uplink streams, calls drop	SFU at 5+ participants is the rule. The inflection point is upload bandwidth, not server cost. Most clients have asymmetric residential connections; mesh asks for upload they don't have.
Data channels for non-media payloads	Low-latency SCTP over DTLS / UDP for arbitrary data; reliable and ordered by default, with opt-in unordered or partially reliable delivery	Reliability is a per-channel decision you own: ordered/unordered, reliable/unreliable (maxRetransmits / maxPacketLifeTime)	Game state needs reliable ordered, but you configured unordered for speed, now state replication has gaps	Multiple data channels with different reliability profiles is the right pattern: one ordered/reliable for control, one unordered/unreliable for high-rate input. The flexibility is the point.
Native in every browser, no SDK	getUserMedia() + RTCPeerConnection are built-in, no install or extension	Server-side participation (record, transcode, AI) requires a headless browser or SFU integration	"Add real-time transcription" — you discover this means running gstreamer or a headless Chromium in the cloud, multiplying your infra cost	SFU-as-a-service (LiveKit, Daily, 100ms, Cloudflare Calls) gives you record + transcode + AI ingestion without running media servers yourself. The make-vs-buy crossover is steep.
Codec negotiation via SDP	Older clients fall back to widely-supported codecs (Opus, VP8, H.264)	Codec deprecation breaks old clients with no graceful path (no codec = no call)	You drop VP8 support to save SFU CPU, all clients on Android 5 lose video, support tickets explode	Maintain a known minimum codec set for the long tail. WebRTC interoperability is a 2-3 codec problem in practice (Opus + H.264 + one of VP8/VP9/AV1); never less.
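
The mesh-versus-SFU inflection point in the table is upload arithmetic, sketched below. The 1,500 kbps per-stream bitrate and 5,000 kbps uplink budget are illustrative assumptions rather than measurements, and the model ignores simulcast, audio, and protocol overhead.

```python
def mesh_uplink_kbps(participants: int, stream_kbps: int) -> int:
    """Mesh: each client uploads one copy of its stream to every other peer."""
    return max(participants - 1, 0) * stream_kbps


def sfu_uplink_kbps(participants: int, stream_kbps: int) -> int:
    """SFU: each client uploads once; the server fans the stream out."""
    return stream_kbps if participants > 1 else 0


def max_mesh_participants(uplink_budget_kbps: int, stream_kbps: int) -> int:
    """Largest mesh call a client with this uplink budget can sustain."""
    return uplink_budget_kbps // stream_kbps + 1


eight_way_mesh = mesh_uplink_kbps(8, 1500)
eight_way_sfu = sfu_uplink_kbps(8, 1500)
mesh_ceiling = max_mesh_participants(5000, 1500)
```

Under these assumptions an 8-person mesh call needs 10,500 kbps of upload per client against 1,500 kbps through an SFU, and a 5 Mbps uplink tops out at 4 mesh participants.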

2. Use Cases

Per-protocol. Concrete workloads with the driving property that ruled out the alternative.

REST

Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Public SaaS API for third parties	Stripe, GitHub, Twilio, SendGrid	Universal client compatibility + edge caching	~100M+ requests/day per platform	gRPC fails on developer ergonomics — every customer would need a code generator install
Cacheable read-heavy paths	News sites, e-commerce catalogs, retail product pages	99%+ CDN cache hit ratio, sub-50ms TTFB	1M+ RPS at edge, <1% origin	GraphQL fragments the cache; gRPC has no edge caching story
Mobile app backends with diverse clients	Uber, Instagram early-era, most B2C apps	Polyglot client stack (iOS / Android / Web / TV / partners)	Billions of requests/day	Maintaining gRPC clients for 5 platforms is a full team's work; REST + OpenAPI is one team
Webhook delivery target	Any system receiving Stripe / Shopify / GitHub callbacks	Receiver already runs HTTP infra; zero new tech	Highly variable, burst-prone	Sender doesn't speak gRPC; webhook delivery industry standardized on HTTP POST
Healthcare / FHIR / regulated exchange	EMR vendors, insurance claim systems	Regulatory-mandated standard (HL7 FHIR over REST)	Per-region, per-payer	Compliance attestation cost rules out anything not in the spec
Internal microservice CRUD (polyglot)	Mid-size companies with mixed language stacks	Lowest-friction inter-service contract	10K-50K QPS per service	gRPC's codegen overhead is hard to justify when teams don't already share a build system
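
The cacheable-read case leans on HTTP validators. A minimal sketch of an ETag-based conditional GET follows, assuming a hypothetical `conditional_get` handler that receives the resource and the client's If-None-Match value. Real servers also have to handle weak validators and comma-separated lists in that header, which this sketch skips, and the 60-second max-age is an arbitrary illustrative value.

```python
import hashlib
import json
from typing import Dict, Optional, Tuple


def make_etag(payload: bytes) -> str:
    """Strong validator derived from the response bytes."""
    return '"' + hashlib.sha256(payload).hexdigest()[:16] + '"'


def conditional_get(resource: dict,
                    if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    body = json.dumps(resource, sort_keys=True).encode()
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=60"}
    if if_none_match == etag:
        return 304, headers, b""  # client or CDN copy is still valid
    return 200, headers, body
```

A 304 carries no body, so a revalidation costs headers only. This version still serializes the resource to compute the tag; a production origin would more likely store a version or hash alongside the resource.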

Polling

Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
CI runner heartbeat / job dequeue	GitHub Actions, GitLab Runner, Jenkins agents	Receiver-initiated for security (no inbound ports needed)	Millions of runners polling periodically	Webhooks would require every runner to expose an inbound HTTP endpoint, which is firewall-hostile
Mobile background sync	Email apps, calendar clients, RSS readers	OS-level battery and connection constraints favor pull on a schedule	100M+ MAU per app	OS limits sustained background WebSocket connections; the platform pushes you to polling or APNs/FCM
Distributed config / discovery	etcd v2 watch fallback, Spring Config Server, Consul long-poll	Long-poll is a known, simple primitive — zero new infra	10K+ services polling shared config	Pub/sub bus adds a dependency; polling reuses existing HTTP
Slow-changing data feeds	Leaderboards, public stats pages, weather data	Staleness is acceptable (30s-5min)	Millions of clients, low event rate	Push channels are overkill when the event rate is <1/min per client
Multi-tenant SaaS workspace state	Tenant-specific dashboard data	Per-tenant isolation simpler with pull	1K-10K tenants, modest per-tenant QPS	WebSockets force you to solve per-tenant fanout up front, polling defers that decision
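
The long-poll primitive behind the config and job-dequeue cases is a blocking read keyed on a version number. The sketch below shows the server-side half with a condition variable. `VersionedConfig` and `wait_for_change` are hypothetical names, and a real handler would sit behind an HTTP endpoint with the timeout set below any proxy idle limit.

```python
import threading
from typing import Optional, Tuple


class VersionedConfig:
    """Holds one value plus a version that increases on every change."""

    def __init__(self, value: str):
        self._value = value
        self._version = 1
        self._cond = threading.Condition()

    def set(self, value: str) -> None:
        with self._cond:
            self._value = value
            self._version += 1
            self._cond.notify_all()  # wake every parked long-poll

    def wait_for_change(self, known_version: int,
                        timeout: float) -> Tuple[int, Optional[str]]:
        """Block until the version passes known_version or the timeout expires."""
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._version > known_version, timeout=timeout)
            if not changed:
                return self._version, None  # nothing new; client polls again
            return self._version, self._value
```

The client sends back the last version it saw. A stale client returns immediately with the current value, and an up-to-date client parks until a change or the timeout.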

Webhooks

Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Payment event notifications	Stripe, Square, PayPal, Adyen	Async settlement / dispute / refund notification across org boundaries	Millions of events/day, per-merchant fanout	Polling would cost merchants 100x in idle requests; cross-org WebSocket is operationally infeasible
SCM repo events to CI	GitHub → Jenkins / CircleCI / GitHub Actions	Sub-second push of code events to build infra	Tens of millions of events/day across GitHub	Polling 50M repos every minute = 833K RPS of "nothing happened"
SaaS automation triggers	Zapier, Workato, Make, n8n	Fanout to user-defined workflows in arbitrary destinations	Per-customer flow counts, variable burst	Receivers are external user infra; only HTTP POST is universally accepted
Marketplace order / inventory events	Shopify, Amazon Seller, eBay → 3PL systems	Heterogeneous third-party fulfillment integrations	Per-merchant, high burst on promotions	Each 3PL runs different stack; webhooks are the lowest common denominator
Observability / alert ingestion	PagerDuty, Opsgenie, Slack incoming webhooks	Many monitoring tools fan-in to one alerting platform	High burst at incident-time	Every monitoring tool already speaks HTTP POST; standardizing on anything else fragments the integration market
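
The "833K RPS" figure in the SCM row is plain arithmetic, and the same calculation applies to any resource count and poll interval. A minimal sketch; the function name is illustrative and not taken from any library:

```python
def idle_polling_rps(resources: int, interval_seconds: float) -> float:
    """Requests per second if every resource is polled once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return resources / interval_seconds


# 50M repositories polled once a minute, as in the SCM row above.
print(f"{idle_polling_rps(50_000_000, 60):,.0f} requests/second")
```

Halving the interval doubles the idle load, which is why push wins once the event rate per resource is far below the poll rate.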

SSE

A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
LLM token-by-token streaming	OpenAI, Anthropic, Google Gemini APIs	One-way server push with HTTP-stack reuse for auth / rate-limit / observability	Per-request, millions of concurrent streams platform-wide	WebSocket adds bidi cost that's unused (one prompt, one stream); reconnect semantics matter less for short streams
Real-time analytics dashboards	Datadog event stream, Grafana Live, observability UIs	Append-only event feed to UI with auto-reconnect	~1K-10K concurrent dashboard sessions	WebSocket bidi is wasted; SSE auto-reconnect is built-in
Build / deploy progress streaming	CI build logs, CD pipeline UI, Terraform Cloud apply	Push log lines to UI as they happen, no UI polling	Per-build, transient sessions	Polling makes the UI choppy; WebSocket complexity isn't justified
Notification / activity feeds	GitHub notifications, Stripe Dashboard live events, JIRA activity	One-way notification push, low frequency, many clients	10K-100K concurrent users per service	WebSocket adds bidi infra cost without bidi benefit
AI agent tool-call streaming	Anthropic Claude, OpenAI Assistants, Bedrock AgentCore	Partial response + tool-use events streamed inline	Per-conversation, long-lived	HTTP/2 SSE reuses existing API gateway infra; WebSocket would require a separate edge path
Server-driven UI update feed	Stripe Dashboard, Linear, Vercel deployment UI	Push state deltas to keep UI in sync with server	100K+ concurrent users per service	SSE is sufficient; WebSocket's complexity buys nothing if writes go through REST

WebSockets

Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Real-time collaborative editing	Figma, Google Docs, Notion, Linear	Bidirectional state sync with sub-50ms write echo	100K+ concurrent connections per shard	SSE can't carry client-to-server in same channel; CRDT sync needs both directions interleaved
Trading platforms — quotes + orders	Coinbase Pro, IBKR, Robinhood, crypto exchanges	Low-latency bidi quotes plus order submission on the same wire	1M+ concurrent connections at peak	HTTP-based protocols add too much per-request overhead; SSE alone misses the order-submit path
Multiplayer game state and lobby	Roblox lobbies, Among Us, Discord game integrations	Reliable ordered bidi for game state and chat	Per-room 8-100 players, millions of rooms	WebRTC's UDP is overkill for turn-based / lobby state; TCP ordering matters
Chat / messaging at scale	Slack, Discord, Twitch chat, WhatsApp Web	Bidi messaging with presence and typing indicators	10M+ concurrent (Slack peak), 100M+ daily (Discord)	SSE for chat means typing indicators go through a separate POST channel — operational complexity not worth it
IoT control panels and dashboards	Smart home web dashboards, factory floor SCADA	Bidi state read + command write, low-latency	Per-deployment, persistent connections	REST can't push state changes; SSE can't carry control commands

gRPC

Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Internal microservice mesh at hyperscale	Google internal, Netflix, Square, Uber internal	Typed contracts + HTTP/2 multiplexing + 5-10x smaller payloads	1M+ RPC/sec per service at the larger end	REST overhead (JSON parse + HTTP/1.1 head-of-line) adds CPU and latency at every hop in a 10-hop call graph
Service mesh control plane	Envoy xDS, Istio control plane, Linkerd	Bidi-streaming for config push to thousands of proxies	10K+ Envoy instances per control plane	REST polling for config doesn't scale; SSE is one-way; bidi gRPC is the natural fit
ML model serving	TensorFlow Serving, NVIDIA Triton, Vertex AI	Binary tensor payloads, server-streaming for batched inference	1M+ inferences/sec per cluster	JSON serialization for tensors is 5-20x larger and 10x slower to parse
Database client protocols	Spanner, CockroachDB, etcd v3, FoundationDB clients	Typed query results, streaming for large scans, connection multiplexing	Per-cluster, sustained connection count	Custom binary protocols would work but gRPC gives you the language ecosystem for free
Mobile-to-backend with bandwidth pressure	Square POS, Lyft driver app, emerging-market apps	Small payloads matter on cellular; protobuf delivers	100K+ devices, variable connectivity	REST/JSON costs 5-10x bandwidth, which on emerging-market data plans is the difference between usable and not

WebRTC

Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Video conferencing	Zoom web client, Google Meet, Microsoft Teams web, Whereby	Sub-100ms peer media with adaptive codec / FEC	Mesh up to ~4 participants; SFU from 5 up to dozens per call	WebSockets carry no audio/video codec stack; building the SRTP / jitter buffer / FEC layer yourself is multi-year work
Live customer support video	Intercom video, Zendesk Talk, in-product video help	Low-latency 1-on-1 video without server bandwidth	1-on-1, on-demand, low concurrency	Server-mediated video costs egress bandwidth per session; P2P avoids it
Cloud gaming / remote desktop	NVIDIA GeForce Now, Xbox Cloud Gaming, Parsec	Sub-100ms latency video + input over UDP with adaptive bitrate	Per-session, 60fps sustained	WebSocket / TCP retransmission stalls the stream on packet loss (head-of-line blocking); UDP plus FEC recovers within the latency budget
P2P file / data transfer	file.pizza, Snapdrop, magic-wormhole-web, browser-based AirDrop alternatives	Direct browser-to-browser without server storage	Pairwise, transient	Server-side upload doubles bandwidth cost and adds privacy concerns
Live audio rooms (many-to-many)	Clubhouse, Discord Stage, Twitter Spaces, Telegram voice chat	Many-to-many audio via SFU with jitter buffer and codec adaptation	Thousands of listeners per room	SFU fanout requires WebRTC-aware media servers; WebSocket audio would require custom codec / jitter / packet-loss handling

3. Limitations

Multi-protocol matrix. Severity reflects how often the limitation forces an architectural choice.

Limitation Axis	REST	Polling	Webhooks	SSE	WebSockets	gRPC	WebRTC
Server push capability	High: None natively; requires SSE/WS side-channel	Critical: Pull-only by definition; latency = interval	Low: Push is the whole point; sender-initiated	Low: One-way server-to-client native	Low: Full-duplex native	Low: Server-streaming and bidi modes native	Low: Native bidi peer or via SFU
Browser-native client	Low: fetch / XHR universal	Low: setInterval + fetch	Critical: Browser cannot be a webhook receiver (no public endpoint)	Low: EventSource native everywhere	Low: WebSocket native everywhere	High: Needs gRPC-Web + proxy, or Connect protocol	Low: RTCPeerConnection native, but signaling is on you
Intermediary compatibility (CDN / WAF / proxy)	Low: Universal; designed for this	Low: Same as REST	Low: Plain HTTP POST	Medium: Buffering at proxies kills the stream silently	Medium: Upgrade handshake; WAFs / L7 LBs need WS support	High: HTTP/2 trailers stripped by many CDNs / WAFs	High: UDP-based, ICE/TURN needed for hostile NAT/firewall
Backpressure / flow control	Low: Per-request, client-paced	Low: Client controls poll rate	High: Sender controls rate; receiver must absorb or 429	Medium: TCP-level only; no app-layer credit window	High: TCP-only; app-layer flow control is on you	Low: HTTP/2 per-stream flow control built in	Medium: RTCP for media; data channel uses SCTP credit
Connection cap per browser per origin	Low: 6 connections on HTTP/1.1, ~100s of streams on HTTP/2	Low: Same as REST	Low: N/A (server-side concept)	High: 6 streams cap on HTTP/1.1; HTTP/2 required at scale	Medium: ~255 connections per origin in Chrome	Low: HTTP/2 multiplexing on one connection	Low: Per-PeerConnection; no origin cap
Observability (curl / pcap / trace)	Low: curl, devtools, every existing tool	Low: Same as REST	Low: ngrok / webhook.site / RequestBin work fine	Low: curl -N streams the response	High: Frame-level tools only; no HTTP-layer visibility	Medium: grpcurl with reflection; Buf Studio; not curl-able	Critical: Encrypted SRTP; getStats() polling is the only window
Mobile battery / radio impact	Medium: Per-request radio wake	High: Periodic radio wake-ups; battery hostile	Low: N/A (server-to-server)	Medium: Persistent connection; OS may kill in background	Medium: Persistent; OS-level background limits apply	Medium: Channel keepalive; smaller payloads help	High: Sustained UDP + media codec is CPU-heavy
HTTP/3 support maturity	Low: Native; transparent QUIC upgrade widely deployed	Low: Same as REST	Low: Same as REST at receiver	Low: Works over HTTP/3 transparently	Medium: RFC 9220 WebSockets-over-HTTP/3 still emerging in 2026	Medium: Experimental in gRPC core; production via Connect or Tonic	Low: Already uses UDP+QUIC-like semantics
Auth model uniformity	Low: Cookies, Bearer, mTLS all standard	Low: Same as REST	Medium: HMAC signature non-standardized across providers	Low: Same as REST	Medium: Token-in-URL leaks; subprotocol header or first-frame	Low: Metadata for tokens, mTLS native	High: Signaling auth is separate from media auth; complex
Graceful deploy / connection draining	Low: Request-scoped; in-flight drains in seconds	Low: Per-poll drain trivial	Low: Sender retries handle drops naturally	Medium: Client auto-reconnects, but Last-Event-ID coordination needed	High: Custom shutdown protocol + jittered reconnect required	Medium: GOAWAY frame standard; streams must finish or restart	High: Re-ICE / re-SDP needed; calls drop during media-server deploys

4. Failure & Recovery

Reframed from "Fault Tolerance". For protocols, the failure modes are reconnect, retry, idempotency, and message-loss windows.

Dimension	REST	Polling	Webhooks	SSE	WebSockets	gRPC	WebRTC
Reconnect model	N/A — request-scoped; each call independent	Trivial — next poll attempts	Sender retries with backoff (provider-specific, often exponential up to 24-72h)	Browser auto-reconnects with `Last-Event-ID` header	Application-implemented; no protocol-level reconnect	Channel reconnects automatically; stream RPCs must be restarted by app	Re-ICE on path failure; full re-SDP if peer connection dies
Resume / state-recovery primitive	ETag / If-Match for optimistic concurrency	Watermark / cursor / since-timestamp param	None native; pair with `/events?since=cursor` backfill API	`Last-Event-ID` header on reconnect; server replays from that id	App-layer sequence numbers + server replay buffer	No native resume on streams; pagination tokens for unary	Media: RTCP NACK / RTX retransmission and keyframe requests (PLI); data channel: SCTP retransmits
Message-loss window	N/A: synchronous response or error	Between polls if events expire from server buffer before next poll	Receiver crash before ack is covered by at-least-once retry; ack before durable enqueue, then crash, loses the event	Between disconnect and reconnect, bounded by server's event-id retention	Disconnect to reconnect: the entire window is lost unless the app buffers	During stream disconnect; unary calls are atomic (response or error)	Datagram loss tolerated for media (FEC / jitter buffer); reliable channels SCTP-retransmit
Idempotency primitive	Verb semantics (GET, PUT, DELETE); `Idempotency-Key` header for POST	Same as REST	Event-id from sender; receiver dedups in seen-set with TTL	Event-id; replay-safe if events are idempotent	App-layer message-id + ack window	App-layer; protocol gives request-id metadata convention	N/A for media; data channel needs app-layer
Retry semantics	Client decides; safe retries on idempotent verbs only	Client-driven; next poll is the retry	Sender retries on non-2xx, usually with exponential backoff (Stripe: up to 3 days in live mode); some providers, GitHub among them, do not auto-retry and rely on manual or API redelivery	Browser-controlled retry interval (`retry:` field), defaults ~3s	Application-implemented; jitter mandatory at scale	Built-in retry policy via service config; per-RPC retryable status codes	Media: FEC + retransmit; data channel: SCTP retransmit policy
Backoff strategy	App-layer; conventionally exponential with jitter	Conventionally exponential on empty; constant on event	Sender-controlled; each provider documents schedule	Client-driven; server can hint via `retry:`	App-implemented; truncated exponential with jitter is standard	Configurable per-service in retry policy	ICE has restart timer; app handles call-level reconnect cadence
Server-failure visibility to client	Immediate — HTTP status code on the response	Immediate per poll	N/A — receiver perspective; sender retries	On connection drop; otherwise opaque (no out-of-band signal)	On close frame or TCP RST; timeout-based otherwise	Per-RPC status code; GOAWAY for graceful shutdown	ICE connection-state events; media-quality from getStats()
Cross-region failover	DNS + health checks; clients re-resolve and retry	Same as REST	Sender re-resolves receiver URL; receiver's failover is its problem	Client reconnects; new region picks up from Last-Event-ID if state is cross-region replicated	Connection lost on region failure; reconnect to new region, state must be in shared store	Client-side LB or service mesh handles; xDS can shift traffic	Re-ICE to new TURN/SFU; calls drop during failover unless multi-region SFU
Half-open / zombie detection	Request timeout (configurable, typically 30s)	Per-poll timeout	Sender timeout (commonly 10-30s); receiver ack required	Server-sent heartbeat comments + client read timeout	Ping/pong frames + app-layer timeout	HTTP/2 keepalive (PING frames)	ICE consent-freshness checks (STUN binding roughly every 5s; consent expires after 30s per RFC 7675)
Heartbeat / keepalive	N/A — request-scoped	The poll itself is the heartbeat	N/A — sender-initiated	Comment lines (`:\n\n`) every 15-30s; mandatory to keep proxies alive	RFC 6455 ping/pong frames; app-layer heartbeat for NAT-rebind detection	HTTP/2 PING frames; configurable interval / timeout	STUN binding requests; RTCP receiver reports
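
The "truncated exponential with jitter" backoff named in the WebSockets column can be sketched as below. This is one common variant (full jitter), not the only valid one; the function name, the 1s base, and the 60s cap are assumptions chosen for illustration.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based)."""
    source = rng if rng is not None else random
    # Truncate the exponential so a long outage does not produce hour-long waits.
    # The exponent is clamped to keep the arithmetic finite for huge attempt counts.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: pick uniformly in [0, ceiling] so clients spread out
    # instead of reconnecting in synchronized waves.
    return source.uniform(0.0, ceiling)


demo_rng = random.Random(42)  # seeded only to make the demo repeatable
for attempt in range(8):
    print(attempt, round(reconnect_delay(attempt, rng=demo_rng), 2))
```

Full jitter trades a slightly longer average wait for a flatter reconnect curve, which is usually the right trade after a deploy or regional failover.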

5. Connection Distribution

Reframed from "Sharding". For protocols, the question is how connections distribute across servers, what stickiness is required, and what scale-out looks like.

Dimension	REST	Polling	Webhooks	SSE	WebSockets	gRPC	WebRTC
Stateless across servers?	Yes (when properly designed)	Yes	Yes at receiver; sender keeps retry state	Per-connection stateful; but events can come from shared pub/sub bus	Stateful — each connection bound to a specific server	Channel-stateful; one channel pinned to one HTTP/2 connection	Per-peer-connection state; SFU is the stateful hub
Sticky session required?	No (anti-pattern if present)	No	No at receiver	Effectively yes for the lifetime of the stream	Yes — sticky for connection lifetime	Yes per channel; LB must respect HTTP/2 connection affinity	Yes — peer-to-SFU pinning
LB type (L4 vs L7)	Either; L7 for path-routing	Either	Either at receiver	L7 required for path routing; L4 works for fleet-level	L7 with WS support; L4 works for raw TCP	L7 with HTTP/2 awareness; L4 NLB causes single-pod hotspots	L4 for UDP; signaling separately L7
Per-server connection ceiling	Limited by FD / thread; ~10K concurrent requests/node typical	Same as REST plus active poll concurrency	Same as REST at receiver	10K-50K connections/node with event-loop server (Go, Node, Netty)	50K-100K+ connections/node; ulimit + tcp_mem tuning required	HTTP/2 multiplexing → 1 connection serves many streams; fewer connections than REST	~2K-5K participants per SFU node typical (CPU bound, not socket bound)
Horizontal scale story	Add nodes behind LB; uniform	Same as REST	Same as REST at receiver	Add nodes; shared pub/sub bus (Redis / NATS / Kafka) for cross-node fanout	Add nodes + shared pub/sub bus + session directory if reconnect-targeting needed	Client-side LB or service mesh; xDS-driven endpoint discovery	SFU cluster with cascade / mesh between SFUs; signaling layer is separate
Session rebalancing	N/A — no session	N/A	N/A	Drop-and-reconnect; client picks new node	Drop-and-reconnect with jitter; reconnect storm risk	Server can send GOAWAY; client re-establishes channel	Re-ICE or full re-SDP; visible to user as call drop
Hot connection behavior	N/A — per-request	High-rate single client looks like many requests; rate-limit by API key	Receiver overload returns 429; sender backs off	Single fast-producing event-id stream can OOM if not bounded	Slow-consumer client blocks server send queue; needs app-layer backpressure	HTTP/2 flow control bounds per-stream	High-bitrate client drives SFU CPU and bandwidth; simulcast / SVC mitigates
Max connections (practical)	Limited by fleet capacity, not protocol	Same as REST	Same as REST at receiver	~1M concurrent connections across a fleet of 20-50 modern nodes	10M+ documented at WhatsApp, Slack, Discord scale	Bounded by channel count, not stream count; effectively very high	SFU bound; ~50K-100K concurrent across an SFU cluster typical
Cross-shard fanout	N/A — stateless	N/A	Sender fans out to N receivers from its queue	Pub/sub bus delivers events to nodes holding subscriber connections	Pub/sub bus + session directory: who is connected to which node	N/A for unary; bidi-streaming uses pub/sub fanout same as WS	SFU forwards to all participants; cascade between SFUs for cross-region
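
The "session directory: who is connected to which node" idea from the WebSockets column can be illustrated with an in-memory sketch. In production this mapping would live in a shared store; the class and method names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class SessionDirectory:
    """Tracks which node currently holds each user's connection."""

    def __init__(self) -> None:
        self._node_by_user: Dict[str, str] = {}

    def connected(self, user_id: str, node_id: str) -> None:
        # A reconnect to a different node simply overwrites the old entry.
        self._node_by_user[user_id] = node_id

    def disconnected(self, user_id: str, node_id: str) -> None:
        # Ignore stale disconnects from a node the user has already left.
        if self._node_by_user.get(user_id) == node_id:
            del self._node_by_user[user_id]

    def route(self, recipients: Iterable[str]) -> Dict[str, List[str]]:
        """Group online recipients by node so each node gets one publish."""
        by_node: Dict[str, List[str]] = defaultdict(list)
        for user_id in recipients:
            node_id = self._node_by_user.get(user_id)
            if node_id is not None:  # offline users are skipped
                by_node[node_id].append(user_id)
        return dict(by_node)


directory = SessionDirectory()
directory.connected("alice", "node-a")
directory.connected("bob", "node-b")
directory.connected("carol", "node-a")
print(directory.route(["alice", "bob", "carol", "dave"]))
```

Without the directory, every event is broadcast to every node and filtered there; with it, the publisher targets only the nodes that hold a subscriber.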

6. Fanout / State Sync

Reframed from "Replication". For protocols, this is how one event reaches N receivers, with what ordering and delivery guarantees.

Dimension	REST	Polling	Webhooks	SSE	WebSockets	gRPC	WebRTC
Fanout model	Unicast — client pulls own copy	Unicast — each client pulls	Sender-side multicast — sender posts to each receiver URL	Server-side broadcast — backend fans out via pub/sub to subscriber connections	Server-side broadcast — same pattern as SSE plus bidi	Server-stream / bidi — server fans out via pub/sub on its end	SFU forwards; or mesh (each peer sends to all others, N²)
State ownership	Server	Server (with client cursor)	Sender (until ack); receiver after	Server with event-id retention	Server + client (per-connection); shared via pub/sub bus	Server; client tracks stream cursor app-side	Peer + SFU; CRDT / OT for shared state via data channel
Multi-server fanout — pub/sub bus required?	No	No	No — fanout is sender-driven HTTP POST	Yes for any cross-node fanout (Redis pub/sub, NATS, Kafka)	Yes — same as SSE	Yes for streaming fanout	SFU is the fanout primitive; cascade for cross-SFU
Message ordering guarantees	Per-call only	Client-visible order = poll order; gaps (not reordering) if events expire between polls	No order guarantee across events (parallel HTTP delivery)	Per-stream FIFO from server	Per-connection FIFO (TCP-ordered)	Per-stream FIFO; HTTP/2 stream ordering	Media: timestamped, jitter-buffered; data channel: configurable ordered/unordered
Delivery guarantee	At-most-once (one response or error)	At-least-once with client dedup on cursor	At-least-once (retries on failure)	At-least-once on reconnect with Last-Event-ID; at-most-once otherwise	App-defined; protocol gives best-effort TCP	At-most-once per RPC; app-layer retry adds at-least-once	Media: best-effort with FEC; data channel: configurable
Cross-region fanout	DNS / multi-region origin	Same as REST	Sender hits receivers per-region	Cross-region pub/sub (MSK MirrorMaker, Confluent, NATS leaf nodes)	Same as SSE plus session directory for reconnect routing	Service mesh with global xDS	SFU cascade across regions; latency budget tightens
Lag (typical)	Sub-50ms intra-region; per-request	Equal to poll interval	Sub-second to seconds; minutes-hours on retry	Sub-100ms intra-region with healthy pub/sub bus	Sub-50ms intra-region typical	Sub-50ms intra-region	50-150ms peer-to-peer; +50ms per SFU hop
Conflict resolution	App-layer (ETag / If-Match / OCC)	App-layer with cursor	App-layer at receiver (idempotency-key dedup)	N/A — one-way stream	App-layer (CRDT, OT, vector clocks)	App-layer	Media: jitter-buffer ordering; data channel: app-layer (CRDT typical)
Behavior during network partition	Request fails fast; client retries	Polls fail; client backs off and retries	Sender retries with backoff; receiver may catch up via /events GET	Connection drops; auto-reconnect with Last-Event-ID	Connection drops; app-layer reconnect required	Channel reconnects; streams must be restarted by app	ICE attempts alternate candidates; falls back to TURN

7. Better Usage Patterns

Per-protocol. Patterns most teams miss until production bites; the PE-grade approach beats them.

REST

Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Idempotency-Key on every POST that mutates external state	Treat POST as fire-and-forget; rely on client to dedup on retry	Server stores Idempotency-Key + request fingerprint + original response for 24h; a replay with the same key returns the stored response	Customer-charged-twice bugs are the most expensive REST bug class; the fix is one header plus a small server-side store
ETag + If-Match for optimistic concurrency	Last-write-wins; the second update silently overwrites the first	Server returns ETag on GET; PUT/PATCH requires If-Match matching the current ETag	Concurrent edits stop corrupting data and start surfacing as 412 Precondition Failed
Cursor-based pagination, not offset	?offset=10000&limit=20: DB scans and discards every skipped row, O(offset) per page	?cursor=opaque_token&limit=20: token encodes the last seen sort key (e.g. primary key)	Cursor pages cost one index seek regardless of position; offsets break at scale with timeouts on deep pages
Faithful 4xx vs 5xx	Return 500 for every error (including malformed input)	4xx = "your request was wrong", 5xx = "we failed"; 4xx is generally not retryable unchanged (408 / 429 excepted), 5xx may be	Client retry logic depends on this distinction; getting it wrong creates retry storms or silent loss
Cache-Control + Vary explicit, never implicit	No headers, then learn the CDN cached a user-specific response globally	Cache-Control: private for user-scoped; Vary: Authorization, Accept-Encoding always	CDN response leakage between users is a CWE-200; explicit headers prevent it
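
The Idempotency-Key pattern in the first row can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class name, the 422 status for a reused key with a different payload, and the tuple-shaped response are choices made for this sketch, and a real deployment would need an atomic insert in a shared store to be safe under concurrent retries.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]  # (status code, body)


class IdempotencyStore:
    """In-memory stand-in for a shared store such as Redis or a database table."""

    def __init__(self, ttl_seconds: float = 24 * 3600) -> None:
        self._ttl = ttl_seconds
        # key -> (request fingerprint, stored response, expiry timestamp)
        self._entries: Dict[str, Tuple[str, Response, float]] = {}

    def execute(self, key: str, body: bytes,
                handler: Callable[[bytes], Response],
                now: Optional[float] = None) -> Response:
        now = time.time() if now is None else now
        fingerprint = hashlib.sha256(body).hexdigest()
        entry = self._entries.get(key)
        if entry is not None and entry[2] > now:
            stored_fingerprint, stored_response, _ = entry
            if stored_fingerprint != fingerprint:
                # Same key, different payload: refuse rather than guess.
                # The 422 status is this sketch's choice, not a standard.
                return 422, "Idempotency-Key reused with a different request"
            return stored_response  # replay: the handler is not run again
        response = handler(body)
        self._entries[key] = (fingerprint, response, now + self._ttl)
        return response


charges = []


def charge(body: bytes) -> Response:  # hypothetical side-effecting handler
    charges.append(body)
    return 201, f"charge-{len(charges)}"


store = IdempotencyStore()
first = store.execute("key-1", b'{"amount": 500}', charge)
retry = store.execute("key-1", b'{"amount": 500}', charge)
print(first == retry, len(charges))
```

Storing the response itself, not just a hash of it, is what lets the retry receive the same body the first attempt would have seen.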

Polling

Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Long-poll with coordinated timeout	Short-poll at 1-5s; 95%+ of requests return empty	30-60s long-poll; server holds until event arrives or timeout	10x reduction in QPS for the same freshness; works through HTTP infra
Conditional requests (If-Modified-Since, ETag)	Endpoint ignores conditional headers; client downloads full payload every poll	Server respects If-None-Match → 304 Not Modified with no body	80%+ bandwidth reduction on unchanged-data polls; trivial server change
Jittered intervals	Hardcoded 60s interval; N clients sync to wall-clock boundary, cause :00 spike	Base + ±20% random; server adds Retry-After hints under load	Thundering herds at scheduled boundaries are an outage class on their own
Delta polling with watermark	Poll /messages and get the full list every time; client dedupes locally	Poll /messages?since=last_seen_cursor; server returns only new items	100x payload reduction at any nontrivial dataset size; cleaner client logic
Exponential backoff on empty	Constant interval regardless of activity; idle clients hammer at 1Hz	Double interval on empty (capped at, say, 5min); reset to base on event	Idle traffic drops 90%+; active clients still see fast updates
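
The jitter and backoff-on-empty rows combine naturally into one interval calculation. A minimal sketch, assuming a 5s base and the 5min cap and ±20% jitter mentioned in the table; the function name and defaults are illustrative.

```python
import random
from typing import Optional, Tuple


def next_poll(current: float, got_events: bool, base: float = 5.0,
              cap: float = 300.0, jitter: float = 0.2,
              rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Return (nominal interval, jittered sleep) for the next poll."""
    source = rng if rng is not None else random
    # Reset to the base interval on activity; double (up to the cap) when idle.
    nominal = base if got_events else min(cap, current * 2)
    # Spread clients by +/- `jitter` so they do not align on wall-clock boundaries.
    sleep_for = nominal * source.uniform(1 - jitter, 1 + jitter)
    return nominal, sleep_for


interval = 5.0
for got_events in [False, False, False, True, False]:
    interval, sleep_for = next_poll(interval, got_events)
    print(f"events={got_events} nominal={interval:.0f}s sleep={sleep_for:.1f}s")
```

The nominal interval is carried between calls while only the sleep is jittered, so the randomness never compounds across polls.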

Webhooks

Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
HMAC signature verification on every request	Trust the sender IP; fail when sender rotates clouds or adds CDN	HMAC-SHA256 over (timestamp + body), compared in constant time, with rejection of stale timestamps (>5min)	Without this, anyone who learns your endpoint can spoof events; IP-based trust breaks whenever the sender's egress changes
Receiver: enqueue first, process async	Synchronous processing inside the webhook handler; slow DB write blocks ack	Validate signature → enqueue to SQS / Kafka → ack 200 in <100ms → process from queue	Sender retry storms on receiver slowness are how production fires start; async absorbs bursts
Idempotent dedup on event-id	Assume each event arrives exactly once; process duplicates and double-charge	Maintain a TTL'd seen-set keyed on event-id (Redis SET NX EX or DDB conditional write)	At-least-once is the real semantic; assuming otherwise causes duplicate side effects
Per-event-type endpoints	One /webhook endpoint that switches on event_type internally	/webhook/payment-success, /webhook/refund-issued, separate handlers, separate scale	You can't independently scale, monitor, or rate-limit a god-endpoint; per-type routes solve this
Replay / backfill endpoint pair	Webhook is the only source of truth; missed events are gone	Always offer /events?since=cursor as a pull-based backfill; webhook is the push cache	Bugs, downtime, schema changes all need replay; without it, you debug by asking the sender to resend
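
The signature-check and dedup rows can be sketched together. The signing format used here (HMAC-SHA256 over `<timestamp>.<body>`) is an assumption for illustration; every provider documents its own scheme, so the exact message layout, header names, and secret format must come from the sender's documentation. The `SeenSet` class is a hypothetical in-memory stand-in for a shared store.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    # Assumed signing scheme for this sketch: HMAC-SHA256 over "<timestamp>.<body>".
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None, tolerance: float = 300.0) -> bool:
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance:
        return False  # stale or future-dated: possible replay
    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())


class SeenSet:
    """In-memory TTL'd set of event ids; a shared store would replace this."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._expiry_by_id: Dict[str, float] = {}

    def first_time(self, event_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expiry = self._expiry_by_id.get(event_id)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery inside the TTL window
        self._expiry_by_id[event_id] = now + self._ttl
        return True


secret = b"whsec_example"  # placeholder secret for the demo
body = b'{"id": "evt_1", "type": "payment.succeeded"}'
ts = int(time.time())
print(verify(secret, ts, body, sign(secret, ts, body)))         # untouched body
print(verify(secret, ts, body + b" ", sign(secret, ts, body)))  # tampered body
```

Verification must run on the raw request bytes; re-serializing parsed JSON before hashing changes the bytes and produces false rejections.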

SSE

A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Heartbeat comments every 15-30s	No heartbeats; proxies kill idle streams at 30-60s with no error visible	Send `:\n\n` comment lines on a timer; clients ignore, proxies see traffic	Silent stream death is the #1 SSE production fire; one timer prevents it
Run over HTTP/2 only	HTTP/1.1 SSE; browser caps at 6 streams per origin, app hangs on 7th tab	Terminate HTTP/2 at the edge and through to the SSE origin	HTTP/1.1 SSE at any nontrivial fanout is structurally broken; HTTP/2 makes it work
Persist Last-Event-ID server-side	Event log in-memory; server restart loses position; reconnect gets silent gap	Redis Streams or Kafka with retention longer than worst-case reconnect window	Auto-reconnect is the SSE selling point; without persistence it's a half-feature
Disable proxy buffering explicitly	Default nginx / Cloudflare config buffers responses; stream never flushes	Set `X-Accel-Buffering: no`, `Cache-Control: no-cache`, `Content-Type: text/event-stream`	Stream-buffering footgun produces zero visible errors; just nothing happens for users
Bound event payload size	Send arbitrary payloads; one large event blocks the stream for slow clients	Cap event body at, say, 32KB; chunk larger payloads with continuation event-ids	A large event occupies the connection until it is fully written, so every event queued behind it waits and p99 latency spikes for slow clients
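
The heartbeat and Last-Event-ID rows come down to a small amount of framing and a replay log. A minimal sketch, assuming integer event ids that only increase; the in-memory buffer stands in for the durable log (Redis Streams, Kafka) the table recommends, and the names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

HEARTBEAT = ":\n\n"  # comment line: ignored by EventSource, visible to proxies


def format_event(event_id: int, data: str) -> str:
    """Serialize one event in text/event-stream framing."""
    lines = [f"id: {event_id}"]
    # Each line of a multi-line payload needs its own "data:" field.
    lines += [f"data: {line}" for line in data.split("\n")]
    return "\n".join(lines) + "\n\n"  # blank line terminates the event


class ReplayBuffer:
    """Bounded in-memory log; stands in for Redis Streams or Kafka here."""

    def __init__(self, max_events: int = 1000) -> None:
        self._events: Deque[Tuple[int, str]] = deque(maxlen=max_events)

    def append(self, event_id: int, data: str) -> None:
        self._events.append((event_id, data))

    def replay_after(self, last_event_id: int) -> List[str]:
        """Frames for every retained event newer than the client's Last-Event-ID."""
        return [format_event(i, d) for i, d in self._events if i > last_event_id]


log = ReplayBuffer(max_events=3)
for event_id, text in enumerate(["a", "b", "c", "d"], start=1):
    log.append(event_id, text)
print(log.replay_after(2))
```

If the client's id is older than the oldest retained event, this sketch silently replays what it has; a production server would detect that gap and tell the client to resync.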

WebSockets

Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Application-layer ping/pong + sequence numbers	Rely on TCP keepalive (default 2hr); miss NAT rebinding, half-open sockets	App-layer ping every 30s; sequence number on every message; replay on reconnect	TCP-only liveness misses real failures; sequence numbers enable lossless reconnect
LB idle-timeout coordination	App holds connection idle; LB kills it at 60s with no warning	App-level keepalive interval < LB idle timeout (e.g. 30s vs 60s)	Mysterious disconnects mid-app-think are almost always LB timeout mismatch
Explicit ack-window backpressure	Server writes to the socket unconditionally; one slow client fills the send queue until the process OOMs	Credit-based: client grants credits ("ready for N more"); server sends at most that many before the next grant	TCP flow control alone is not enough at fanout; slow-consumer OOM is a common outage cause
Connection draining on deploy	Rolling-restart drops all WS connections; reconnect storm hits new pods	Send "going-down" frame, 60s jittered grace, then force-close; new pods warm before old pods drain	Deploy-triggered reconnect storms cause cascading failures that look like real outages
Token in handshake, never in URL	?token=eyJhb...; token is now in every proxy access log forever	Sec-WebSocket-Protocol subprotocol header or first-frame auth message	Tokens leaked to logs are a CWE-200 incident; auditors will find it eventually
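
The credit-based backpressure row can be sketched without any WebSocket library: the server-side sender only writes while the client has granted credits, and it bounds its own queue. The class name, the `send` callback, and the queue limit are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque


class CreditedSender:
    """Per-connection sender that respects client-granted credits."""

    def __init__(self, send: Callable[[str], None], max_queue: int = 1000) -> None:
        self._send = send            # e.g. the connection's write function
        self._credits = 0            # messages the client is ready to receive
        self._queue: Deque[str] = deque()
        self._max_queue = max_queue

    def grant(self, n: int) -> None:
        """Called when the client sends a 'ready for n more' message."""
        self._credits += n
        self._drain()

    def publish(self, message: str) -> bool:
        """Queue a message; False means the consumer is too slow to keep."""
        if len(self._queue) >= self._max_queue:
            return False  # caller should drop or disconnect this client
        self._queue.append(message)
        self._drain()
        return True

    def _drain(self) -> None:
        while self._credits > 0 and self._queue:
            self._send(self._queue.popleft())
            self._credits -= 1


sent = []
sender = CreditedSender(sent.append, max_queue=2)
sender.publish("m1")
sender.publish("m2")
print(sender.publish("m3"))  # queue is full and no credits were granted
sender.grant(1)
print(sent)
```

The bounded queue is the part that prevents the OOM; the credits only decide how fast the queue empties.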

gRPC

Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Deadlines propagated through the call graph	No deadline set; runaway calls block goroutines / threads / Netty workers	Set deadline at the edge; propagate via Context / Metadata; downstreams enforce	Deadline propagation is the single biggest gRPC reliability lever; without it you can't bound tail latency
Channel reuse, not per-call channels	Create a channel per RPC; exhaust ports, leak HTTP/2 connections	One channel per (host, port), shared across goroutines / threads	Channel creation is expensive (TLS + HTTP/2 setup); reuse is 10-100x cheaper
Client-side LB or service mesh, not L4 NLB	NLB round-robins TCP connections; HTTP/2 multiplexing pins all RPCs to one backend	round_robin LB policy in gRPC client, or service mesh with HTTP/2-aware routing	L4 LB + gRPC = single-backend hotspots that look like a scaling bug but are a protocol mismatch
Cancellation propagation	Caller cancels Context; downstream services don't see it, keep working	Pass Context (Go) / CancellationToken (.NET) / equivalent through every call	Without propagation, cancelled calls keep burning CPU / DB / external API budget downstream
Strict additive schema evolution	Reuse field numbers, rename fields, or remove fields because the wire still looks "fine"	Never reuse field numbers; mark removed field numbers and names as reserved; add fields only under new numbers	Protobuf evolution rules are unforgiving; one reused number makes old clients silently misdecode data
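
Deadline propagation, the first row above, means every hop spends from one shared budget instead of starting a fresh timeout. The sketch below shows the idea in plain Python with hypothetical names and no gRPC dependency; in grpcio the corresponding inputs would be `context.time_remaining()` on the server side and the `timeout=` argument on a stub call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Deadline:
    """Absolute deadline fixed at the edge and passed down the call graph."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + timeout_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())


def call_downstream(deadline: Deadline, rpc: Callable[..., T],
                    min_budget: float = 0.005) -> T:
    """Invoke `rpc` with whatever time is left, or fail fast if none is."""
    budget = deadline.remaining()
    if budget <= min_budget:
        # Not enough time for a round trip: do not start work that cannot finish.
        raise TimeoutError("deadline exceeded before calling downstream")
    return rpc(timeout=budget)


def fake_rpc(timeout: float) -> str:  # hypothetical stand-in for a stub call
    return f"called with {timeout:.2f}s budget"


edge_deadline = Deadline(timeout_seconds=2.0)
print(call_downstream(edge_deadline, fake_rpc))
```

Because the deadline is absolute, a slow first hop automatically leaves less time for the second, which is what bounds tail latency across a deep call graph.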

WebRTC

Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Budget for TURN bandwidth on day one	Ship STUN only; 15-20% of users behind symmetric NAT can't connect	TURN servers in each region (or TURN-as-a-service); accept the egress cost	Without TURN, connection success sits around 80-85% and the failures are hard to diagnose from the client side; with TURN it reaches 99%+
SFU at 5+ peers, mesh below	Mesh-only architecture; works at 4, fails at 5; users blame their wifi	Switch to SFU when participant count exceeds 4 (configurable inflection point)	Mesh upload bandwidth is N-1 streams per peer; residential uplinks fail before SFUs do
Trickle ICE always	Vanilla ICE; gather all candidates before offer; 5-15s call setup	Trickle ICE; emit candidates as found, offer goes out fast; sub-second setup	Call-setup latency dominates first-impression UX; trickle is non-optional in 2026
Simulcast for multi-receiver bitrate adaptation	Single bitrate; one viewer on mobile forces everyone to the lowest quality	Send 3 layers (high/mid/low); SFU forwards appropriate layer per subscriber	Without simulcast, one bad-network participant degrades the whole call; with it, each subscriber gets best-available
DTLS-SRTP only, never SDES	Old samples or libraries negotiate SDES key exchange	DTLS-SRTP mandatory in the SDP offer; reject SDES outright	SDES is deprecated and insecure; modern libraries enforce DTLS-SRTP but legacy samples still show SDES
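The mesh-versus-SFU inflection in the table is plain uplink arithmetic. The sketch below assumes illustrative numbers only (a 1500 kbps stream, a 5000 kbps residential uplink, and simulcast layers of 1500/500/150 kbps); none of these figures come from the source, and the function names are hypothetical.

```python
def mesh_uplink_kbps(participants, stream_kbps):
    """Full mesh: each peer uploads one copy of its stream to every other peer."""
    return max(0, participants - 1) * stream_kbps


def sfu_uplink_kbps(simulcast_layers_kbps):
    """SFU with simulcast: each peer uploads every layer once, whatever the room size."""
    return sum(simulcast_layers_kbps)


def largest_mesh_room(uplink_budget_kbps, stream_kbps):
    """Largest participant count whose mesh upload still fits the uplink budget."""
    return uplink_budget_kbps // stream_kbps + 1


if __name__ == "__main__":
    STREAM_KBPS = 1500           # assumed single-layer video bitrate
    UPLINK_KBPS = 5000           # assumed residential uplink
    LAYERS_KBPS = (1500, 500, 150)  # assumed high/mid/low simulcast layers

    for n in range(2, 8):
        mesh = mesh_uplink_kbps(n, STREAM_KBPS)
        fits = "fits" if mesh <= UPLINK_KBPS else "exceeds uplink"
        print(f"{n} peers: mesh {mesh} kbps ({fits}), SFU {sfu_uplink_kbps(LAYERS_KBPS)} kbps")
    print("largest mesh room:", largest_mesh_room(UPLINK_KBPS, STREAM_KBPS))
```

Under these assumed numbers the mesh fits at 4 participants and overflows at 5, while the SFU upload stays constant. Change the bitrate or uplink and the inflection point moves, which is why the table calls it configurable.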

8. Advanced / Next-Gen Alternatives

Per-protocol. The successors, adjacent tech, and patterns that obviate the original when constraints shift.

REST

Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
GraphQL (Federation v2)	Eliminates over-fetching and chained client round-trips across diverse clients; resolver-level N+1 still needs batching	Production	Medium: server resolver layer plus federation gateway	Multiple client teams with different data shapes hitting the same backends
HTTP/3 + QUIC	Removes head-of-line blocking, faster handshake, better mobile resilience	Production — ~35% of Cloudflare HTTPS traffic in 2026	Low — transparent upgrade if edge supports it	Mobile-heavy traffic, lossy networks, latency-sensitive global apps
tRPC / typed REST	End-to-end TypeScript type safety without protobuf toolchain	Production in TS monorepos	Medium — TypeScript-only, both ends must speak it	Greenfield TypeScript fullstack where shared types are the value
Connect protocol (CNCF)	gRPC schemas + HTTP/1.1 + JSON fallback; browser-native, no proxy	Production — CrowdStrike, Bluesky, Dropbox using	Low if you're greenfield; medium from existing REST	Want gRPC's type safety with REST's debuggability and browser story

Polling

Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
SSE / WebSockets	Eliminates poll-interval latency floor; reduces wasted requests	Production	Medium: adds push-channel infra; client refactor	Event rate exceeds 1/min per client and clients keep a browser session open
Long-poll + HTTP/2 keepalive	Same polling model, fewer empty round-trips, no idle radio wake	Production	Low: server-side change; client re-issues the request on each response instead of on a timer	Want polling semantics but reduce wasted bandwidth and battery
Webhooks (for B2B receivers)	Push from sender; no client-side scheduling	Production	Medium: receiver must expose a reachable HTTP endpoint	Receiver is another server with a publicly routable URL
Mobile push (APNs / FCM)	OS-level persistent connection; battery-friendly	Production	Low — provider SDK integration	Mobile app where battery and background limits dominate
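The "wasted requests" and "latency floor" that push removes can be estimated with simple arithmetic. This sketch assumes events are spread evenly and that each event is picked up by at most one poll; both function names are hypothetical.

```python
def empty_poll_fraction(poll_interval_s, events_per_minute):
    """Share of polls that return nothing, assuming at most one useful poll per event."""
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    polls_per_minute = 60 / poll_interval_s
    useful_polls = min(polls_per_minute, events_per_minute)
    return 1 - useful_polls / polls_per_minute


def mean_staleness_s(poll_interval_s):
    """Average delay before a client sees an event that arrives at a uniformly random time."""
    return poll_interval_s / 2


if __name__ == "__main__":
    for interval in (1, 5, 30, 60):
        waste = empty_poll_fraction(interval, events_per_minute=1)
        print(f"interval={interval}s empty={waste:.0%} mean staleness={mean_staleness_s(interval)}s")
```

With a 5-second interval and one event per minute, 11 of 12 polls come back empty. Shortening the interval lowers staleness but raises the empty share, which is the trade-off the alternatives in the table address.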

Webhooks

Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
EventBridge / Kafka topic subscriptions	Replay, consumer groups, per-partition ordering, and exactly-once processing within Kafka transactions (EventBridge itself is at-least-once)	Production	High: architectural shift; both sides must run on compatible infrastructure	Sender and receivers within same cloud / org; ordering matters
CloudEvents spec for payloads	Cross-vendor portable event schema; type / source / id standardized	CNCF Graduated	Low — schema-level only, sits on top of HTTP POST	Multi-vendor event flow with multiple sender / receiver implementations
WebSub (PubSubHubbub)	Discovery via topic URLs; subscriber-to-hub model	Niche — stable but narrow adoption	Low	Open content distribution (RSS/Atom-like) at internet scale
Polling /events?since=cursor as primary	Receiver-paced, replay-friendly, no signature complexity	Production	Low — usually paired with webhooks anyway	Receiver wants control over rate / replay; sender wants no per-receiver retry state
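The `/events?since=cursor` row describes a receiver-paced feed. A minimal in-memory sketch, assuming the cursor is simply a position in an append-only log; `EventLog` and `drain` are hypothetical names standing in for the sender's endpoint and the receiver's loop.

```python
class EventLog:
    """In-memory, append-only log standing in for the sender's event store."""

    def __init__(self):
        self._events = []

    def append(self, payload):
        """Store an event and return the cursor that points just past it."""
        self._events.append(payload)
        return len(self._events)

    def since(self, cursor, limit=100):
        """Return (events, next_cursor) for everything after `cursor`."""
        start = max(0, cursor)
        batch = self._events[start:start + limit]
        return batch, start + len(batch)


def drain(log, cursor, handle, limit=100):
    """Receiver loop: pull pages until caught up, then return the cursor to persist."""
    while True:
        batch, next_cursor = log.since(cursor, limit)
        if not batch:
            return cursor
        for event in batch:
            handle(event)
        cursor = next_cursor


if __name__ == "__main__":
    log = EventLog()
    for i in range(5):
        log.append({"id": i, "type": "order.updated"})

    seen = []
    saved_cursor = drain(log, 0, seen.append, limit=2)
    print("processed", len(seen), "events; cursor =", saved_cursor)

    # Replay is just a read from an older cursor; the sender keeps no per-receiver state.
    replayed, _ = log.since(3)
    print("replayed", len(replayed), "events")
```

Because the receiver owns the cursor, rate control and replay need no cooperation from the sender. A webhook can still be layered on top as a hint to poll sooner.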

SSE

A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
WebTransport	Multi-stream + unreliable datagrams over QUIC; replaces SSE+WS combo	Baseline (Mar 2026) — Safari 26.4 shipped support	Medium — new client API, HTTP/3 edge required	Need datagrams plus reliable streams in one session; lossy mobile networks
gRPC server-streaming + Connect	Typed schemas, binary efficiency, browser support via Connect	Production	Medium — protobuf adoption	You control client and want types over text/event-stream
HTTP/3 SSE	Head-of-line blocking elimination on lossy networks	Production	Low — transparent	Mobile-heavy or globally distributed SSE consumers
RSocket (reactive streaming)	Application-layer backpressure (credit-based), 4 interaction models	Niche	High — language ecosystem narrow (Java / TS / Go)	Reactive stack (Spring Reactor) that wants backpressure as a first-class concept

WebSockets

Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
WebTransport (QUIC)	Multi-stream, datagrams, no head-of-line blocking, native QUIC	Baseline (Mar 2026)	Medium — new client API, HTTP/3 edge required, server stack different	Greenfield real-time apps where datagrams or multi-stream matter (cloud gaming, observability streams)
WebSocketStream API	Native backpressure via ReadableStream / WritableStream	Early: shipped in Chrome 124; limited support elsewhere	Low: same protocol, new API surface	Streaming large payloads or high-throughput per-connection
WebRTC data channels	UDP-based option, peer-to-peer, configurable reliability	Production	High: different protocol stack entirely	Sub-100ms p99 needed, unreliable delivery is tolerable, and a peer-to-peer topology fits
RSocket	Credit-based backpressure built into the protocol, 4 interaction models	Niche	High	Reactive Java / Spring stack with strong backpressure requirements
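Credit-based backpressure, as named in the RSocket row, means the sender may emit only as many items as the receiver has explicitly requested. The sketch below models that idea in isolation; `CreditedStream` is a hypothetical class and not the RSocket API, though `request(n)` plays the role of RSocket's REQUEST_N signal.

```python
from collections import deque


class CreditedStream:
    """Sender side of a credit-based stream: emits only what the receiver asked for."""

    def __init__(self, items):
        self._pending = deque(items)
        self._credits = 0

    def request(self, n):
        """Receiver grants n more credits."""
        if n <= 0:
            raise ValueError("credit grant must be positive")
        self._credits += n

    def emit(self):
        """Send as many items as credits allow; the rest stay queued at the sender."""
        sent = []
        while self._credits > 0 and self._pending:
            sent.append(self._pending.popleft())
            self._credits -= 1
        return sent

    @property
    def backlog(self):
        """Items still waiting for credit."""
        return len(self._pending)


if __name__ == "__main__":
    stream = CreditedStream(range(5))
    print(stream.emit())    # no credit granted yet, so nothing is sent
    stream.request(2)
    print(stream.emit(), "backlog:", stream.backlog)
    stream.request(10)
    print(stream.emit(), "backlog:", stream.backlog)
```

The contrast with a plain WebSocket is that here a slow receiver bounds the sender by construction, instead of the application discovering the problem through growing send buffers.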

gRPC

Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Connect protocol (Buf, CNCF)	HTTP/1.1 + JSON fallback, browser-native, wire-compatible with gRPC	Production (CNCF, 2026) — CrowdStrike, PlanetScale, Bluesky, Dropbox using	Low — wire-compatible, drop-in for many cases	Want gRPC's schema + codegen with REST's debuggability and zero-proxy browser support
gRPC-Web	Browser support via Envoy / similar proxy	Production	Low — sidecar proxy	Need browser clients but don't want to switch protocols server-side
gRPC over HTTP/3	Removes HTTP/2 head-of-line blocking; better mobile	Emerging — Connect and Tonic experimenting; gRPC-core not stable yet	Low — protocol-level upgrade	Mobile or globally-distributed gRPC traffic over lossy networks
Cap'n Proto / FlatBuffers RPC	Zero-copy deserialization, microsecond-tier latency	Niche	High — different schema language and toolchain	Latency budget is single-digit microseconds (HFT, game-server tick loops)

WebRTC

Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Media over QUIC (MoQ)	Pub/sub media delivery over QUIC; relay-friendly; simpler than full WebRTC stack	IETF Draft-17 (May 2026) — Cloudflare relay network, nanocosmos in production at low hundreds of thousands of concurrent	High — new spec, evolving across drafts	Live streaming where 200-300ms is acceptable and you want CDN-scale economics
WebTransport (data-only use)	Low-latency reliable + unreliable without media/codec stack	Baseline (Mar 2026)	Medium — different API surface	You want sub-100ms data, not media — don't need codec / jitter buffer / SFU
WebCodecs API	Low-level codec access; build custom media pipeline (cloud gaming, custom encode)	Production in Chrome/Edge; Safari/Firefox shipping	Medium — pair with WebTransport for end-to-end control	Custom encoding / decoding outside WebRTC's negotiated codec set
SFU-as-a-service (LiveKit, Daily, Cloudflare Calls, 100ms)	Removes operational burden of TURN / SFU / signaling	Production	Low — SDK swap	Don't want to run media infra; trade per-minute cost for zero ops
