Communication Protocols — PE Trade-Off Analysis

REST, Polling, Webhooks, SSE, WebSockets, gRPC, WebRTC — analyzed as a layered stack of application-layer transports, with reframed sharding / replication / fault-tolerance semantics suitable for protocols rather than databases.

Category Sweep As of 2026-05-30
PE Verdict

These seven are not alternatives. They are points on a 4-axis decision space — direction (pull vs push vs bidi vs peer), state (stateless vs sticky vs session vs peer), ordering (none vs per-stream vs total), and infra coupling (browser-native vs HTTP intermediaries vs custom). Pick by failure mode, not by feature. The interview signal is naming the inflection point where the trade-off flips.

Matrix-table reframing: Sharding becomes Connection Distribution (sticky sessions, LB compatibility, scale-out story). Replication becomes Fanout / State Sync (how one event reaches N receivers). Fault Tolerance becomes Failure & Recovery (reconnect, retry, idempotency, message-loss windows). These reframings apply consistently across all seven protocols.

1. Trade-Offs

Per-protocol. Each row is a real "give up X for Y" decision, not a feature.

REST (HTTP/1.1 · HTTP/2 · HTTP/3)

Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Stateless request-per-resource | Uniform horizontal scaling, every box identical, zero session affinity | Auth re-validated every request, no server-driven state push without polling | User triggers an action whose result arrives 5s later via a background job, and the only way to learn that is the client polling | Statelessness is real only if you don't smuggle state into cookies, JWT payloads, or CDN cache keys. Most "stateless" REST stacks are sticky in three hidden places.
HTTP verbs over custom RPC | Cache at every layer (browser, CDN, reverse proxy) keyed on method+URL | Forces CRUD shape onto domain operations that are not CRUD | "Refund this transaction with reason X" gets shoehorned into POST /refunds and you lose half the domain language | REST orthodoxy is a tax. Pragmatic teams ship RPC-style POST endpoints and accept the cache loss. The real question is whether your reads can be cached, not whether your writes can be RESTful.
JSON over the wire | curl, browser devtools, every language has a parser, schema is optional | 3 to 10x larger payload vs protobuf, CPU on serialize/parse at high QPS | Mobile client on 3G in an emerging market loads a 200KB JSON response in 4s when a protobuf equivalent loads in 600ms | JSON cost shows up at the egress bill before it shows up in latency. At 50K QPS with 5KB responses, swapping to protobuf saves ~$30K/mo in some clouds before you talk about latency.
HTTP/1.1 connection model | Universal proxy / LB / WAF / CDN compatibility, zero special-case | Head-of-line blocking on a single connection, browser caps at 6 connections per origin | Page with 30 API calls serializes through 6 connections, p95 page load is about 2s slower than over HTTP/2 | HTTP/2 fixes this but only if the entire path supports it. Many internal LBs terminate HTTP/2 at the edge and re-emit HTTP/1.1 to backends, and you get half the benefit.
URL / header versioning | Multiple API versions can coexist on the same fleet, gradual migration | No runtime type checking, breaking changes are entirely a coordination problem | v2 is deployed, v1 is supposed to be deprecated, and you find a customer still on v1 the day after you delete the code | The right invariant is "no breaking changes, ever": only additive evolution and explicit deprecation windows. Versioning is a workaround for not committing to that discipline.
Verb-based idempotency (GET, PUT, DELETE) | Safe-by-default retries on idempotent methods, no extra protocol | POST has no idempotency unless you build it (Idempotency-Key header pattern) | Network blip mid-checkout, client retries POST /orders, customer gets charged twice | Idempotency-Key with a 24h server-side dedup window is non-negotiable for any POST that mutates money or external state. Stripe's pattern is the reference, not the exception.
Standard auth (cookies, Bearer, mTLS) | Drop-in for every identity provider and gateway | No connection-level identity, so every request re-validates the token | JWT validation hits a remote JWKS endpoint, your service is now 50ms slower per request and coupled to the IdP's uptime | Cache JWKS aggressively with a TTL shorter than the longest-lived token, and assume the IdP will have an outage at the worst time. Local validation with periodic refresh, not per-request fetch.
Pull-only by design | Receiver in full control of pace, backpressure, and retry | No server-initiated notifications without a second mechanism (SSE / WebSocket / push) | "User got a new message" requires polling at 10s intervals, and your battery and bandwidth team will both find you | REST + SSE is a perfectly valid hybrid. The mistake is treating REST as needing replacement when you need push; you only need push for the push channel, not the rest of the API.
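
The Idempotency-Key row above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `IdempotencyStore`, `run_once`, `create_order`, and the key value are hypothetical names, and the in-memory dict stands in for a shared store such as Redis. It is not thread-safe and never evicts expired keys.

```python
import time
from typing import Callable, Dict, Tuple

DEDUP_WINDOW_SECONDS = 24 * 60 * 60  # the 24h window named in the table


class IdempotencyStore:
    """In-memory stand-in for a shared store (hypothetical; not thread-safe)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._responses: Dict[str, Tuple[float, dict]] = {}

    def run_once(self, key: str, handler: Callable[[], dict]) -> dict:
        """Run handler at most once per key within the dedup window."""
        now = self._clock()
        cached = self._responses.get(key)
        if cached is not None and now - cached[0] < DEDUP_WINDOW_SECONDS:
            return cached[1]  # retry: replay the stored response
        response = handler()  # first sighting (or expired key): do the work
        self._responses[key] = (now, response)
        return response


charges = []  # stands in for the payment side effect


def create_order() -> dict:
    charges.append(4200)
    return {"order_id": len(charges), "amount_cents": 4200}


store = IdempotencyStore()
first = store.run_once("checkout-7f3a", create_order)
retry = store.run_once("checkout-7f3a", create_order)  # client retried the POST
print(first == retry, len(charges))  # True 1
```

The retry receives the stored response instead of triggering a second charge. A production version would also need to handle a retry that arrives while the first request is still in flight, which this sketch ignores.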

Polling (Short · Long · Conditional)

Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Trivial implementation | A loop, an HTTP client, a timer: works in 20 minutes | Latency ceiling equals the poll interval, by definition | Stakeholder demos a "real-time" feature at 10s polling, then asks why it's not real-time | Polling is the right answer more often than engineers admit. The shame around it leads teams to over-engineer SSE/WS solutions when 30s polling would cost 1% as much.
Stateless client and server | Any box can serve any poll, identical to REST scaling story | No notion of "what's new since last poll" without explicit cursors | Client polls and gets the same 50 items every time, must dedupe locally, and misses events under reordering | Watermark or sequence-number based delta polling is the only sane pattern at scale. Full-state polling at 10K clients is a self-inflicted DDoS.
Works through any firewall / proxy | Zero special-case infra; corporate proxies, transparent NATs, all pass it | Mobile radio wakes for every poll, kills battery and data plan | iOS app polls every 30s in background, App Store reviewer flags it for power use, you ship a fix that quietly disables the feature | Mobile push (APNs / FCM) is the real polling-killer on mobile, not WebSockets. The OS already maintains one persistent connection, so piggyback on it.
Long polling reduces tail latency | Delivery latency drops from interval/2 on average (and close to the full interval at p99) to milliseconds when events occur | Each connected client holds a server socket for the full timeout window | 10K clients each holding a 30s long-poll = 10K open sockets at all times, and your origin server runs out of file descriptors | Long polling is structurally SSE with worse ergonomics. If you're already paying the socket cost, switch to SSE and get reconnect + event-id for free.
Trivial backoff and retry | Exponential backoff is a one-liner, no protocol-level reconnect logic | N clients on synchronized intervals create thundering herds at every tick | Cron-style polling at :00 every minute drives a 100x QPS spike for 200ms, every minute, forever | Jitter is mandatory, not optional. ±20% randomization on every interval. Server-side rate-limit is the second line, not the first.
Reuses HTTP cache / auth / observability | Every request is a normal HTTP call, every existing tool works | Most polls return 304 / empty, wasting the entire request budget | 90% of your /events endpoint requests return "nothing new" but each one costs an auth check, a DB query, and a CDN miss | Conditional requests (If-None-Match with an ETag, or If-Modified-Since) against origins that honor them cut empty-poll cost by 80%. Most poll endpoints ignore conditional headers; supporting them is the highest-ROI optimization in this stack.
No server-side fanout complexity | No pub/sub bus, no consumer group, no offset tracking | Each client at a different "tick", so cross-client consistency is on you | User A sees a new message, switches devices, device B is between polls, the conversation history disagrees for 8 seconds | Polling cannot give you per-user causal ordering across devices without per-user state. Once you add that state, you've reinvented pub/sub with worse semantics, at which point go SSE.
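
Two of the patterns in the table, ±20% jitter and sequence-number delta polling, fit in a few lines. The sketch below is illustrative only: `EVENT_LOG`, `next_delay`, and `poll_since` are hypothetical names, and the in-memory list stands in for a server-side query by sequence number.

```python
import random
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (sequence number, payload)

# Hypothetical server-side log; a real service would query a store by sequence.
EVENT_LOG: List[Event] = [(1, "created"), (2, "paid"), (3, "shipped")]


def next_delay(base_seconds: float, jitter: float = 0.2,
               rng: Optional[random.Random] = None) -> float:
    """Randomize the interval by +/- jitter so clients do not tick together."""
    rng = rng or random.Random()
    return base_seconds * (1.0 + rng.uniform(-jitter, jitter))


def poll_since(cursor: int) -> Tuple[List[Event], int]:
    """Return only events newer than cursor, plus the cursor to send next time."""
    fresh = [event for event in EVENT_LOG if event[0] > cursor]
    new_cursor = fresh[-1][0] if fresh else cursor
    return fresh, new_cursor


cursor = 0
events, cursor = poll_since(cursor)   # first poll returns all three events
again, cursor = poll_since(cursor)    # nothing new: empty list, cursor unchanged
delay = next_delay(30.0)              # somewhere in [24.0, 36.0] seconds
```

Because the client sends back the last sequence it saw, an empty poll costs the server one cheap range check rather than a full-state read, and the client never has to dedupe.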

Webhooks (HTTP POST callbacks)

Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Push semantics, no persistent client connection | Receiver doesn't pay socket cost when idle, sender pays nothing per receiver | Sender controls when delivery happens, receiver cannot pull on demand | Sender's queue backs up during incident, receivers see no events for hours, then 10x burst on recovery | Pair webhooks with a /events GET fallback. Webhooks alone make lost events permanently lost; the GET fallback makes them recoverable.
Receiver scales independently of sender | Receiver can run on serverless, autoscale on POST volume | No exactly-once delivery: duplicates are routine and your problem to dedup | Sender (Stripe, say) retries on receiver 5xx or timeout; you ack 200 but the response is lost in transit, so you process the same event twice | Event-id dedup with a TTL'd seen-set (Redis or DDB) is the standard pattern. The TTL must exceed the sender's retry window, which means reading the sender's docs carefully.
Standard HTTP at receiver | No new infra: ALB / API Gateway / Lambda handle webhook POSTs directly | No backpressure: sender decides rate, receiver absorbs or returns 429 | Marketing sends a campaign, Stripe fires 1M payment events in 2 minutes, your queue-less receiver endpoint runs out of DB connections | Receiver pattern: accept POST → enqueue to SQS / Kafka → ack 200 immediately. Synchronous processing inside the webhook handler is the most common production fire I've seen.
Decoupled deploy lifecycles | Sender and receiver release independently, no shared library | Schema evolution coordinated only by changelog, no compile-time check | Sender adds a required field, receiver breaks on parse, support tickets flood in two days later when the retry buffer drains | CloudEvents spec + schema registry + additive-only evolution. The contract is the schema, not the URL. Most teams treat the URL as the contract and learn this twice.
Public receiver enables cross-org integration | B2B integrations work without VPC peering or shared identity | Signature verification mandatory: sender IP rotates, TLS alone doesn't prove origin | Receiver trusts the source IP, sender migrates clouds, receiver starts rejecting valid events | HMAC signature with a shared secret, timestamp in the signature payload, reject signatures older than 5 minutes. This is the minimum, and most public webhook integrations get all three wrong somewhere.
Retry-with-backoff is standard | Transient failures self-heal without app-level logic | Bounded latency goes out the window: retried events can land hours late | Receiver was down for 1h, sender retries with backoff, your "real-time" event arrives at 11pm when the daytime team is gone | Webhooks are at-least-once with unbounded latency. If your business logic depends on event ordering or timeliness, webhooks are wrong; you want a queue with consumers.
No persistent state on sender side | Sender's webhook delivery service is bounded: fire-and-forget after retry exhaustion | No replay primitive: exhausted retries are gone unless the sender retains events separately | Receiver discovers a bug after a week, asks sender to "resend last 7 days", sender's retry buffer is 72h | Always pair webhooks with an idempotent /events?since=cursor GET endpoint that can backfill. The webhook is the cache; the GET is the source of truth.
Receiver can fan out internally | One public endpoint, many internal consumers via local pub/sub | No cross-event ordering: parallel HTTP delivery is unordered by definition | "Order created" arrives after "Order shipped" because they took different network paths, your state machine rejects shipped events | If ordering matters, partition by entity id at receive and serialize per-partition. Or use Kafka. Webhooks are not the right transport for ordered streams.
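
The three signature requirements from the table (shared-secret HMAC, timestamp inside the signed payload, 5-minute freshness check) can be sketched as below. The `<timestamp>.<body>` layout, the function names, and the secret are assumptions made for illustration; a real sender documents its own exact scheme, and that is what a receiver should follow.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 5 * 60  # reject signatures older than 5 minutes


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>" (an illustrative layout)."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False  # stale or far-future timestamp: possible replay
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)


secret = b"whsec_example"  # hypothetical shared secret
body = b'{"id": "evt_1", "type": "payment.succeeded"}'
sent_at = 1_700_000_000
signature = sign(secret, sent_at, body)

ok = verify(secret, sent_at, body, signature, now=sent_at + 10)
tampered = verify(secret, sent_at, body + b" ", signature, now=sent_at + 10)
replayed = verify(secret, sent_at, body, signature, now=sent_at + 3600)
```

Here `ok` is true, while `tampered` and `replayed` are false. Signing the raw request bytes (not a re-serialized JSON object) matters in practice, because re-serialization can change whitespace and key order.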

SSE: Server-Sent Events (text/event-stream over HTTP)

A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Single long-lived HTTP stream | Server push without a new protocol; works through HTTP-aware infra | One-way only: client-to-server requires a separate POST | Building chat on SSE, every send is a POST to /messages plus a stream listener for receives, two channels to keep in sync | SSE is the right answer when the read path dominates and writes are infrequent (LLM streaming, dashboards, notifications). It is the wrong answer when read and write are interleaved at high frequency.
Auto-reconnect with Last-Event-ID | Browser reconnects automatically on drop, server can resume from id | Server must remember per-event-id state long enough to honor it | Server restarts, in-memory event log is gone, client reconnects with Last-Event-ID and gets nothing: a silent gap | Persist the event log to Redis Streams or Kafka with a retention longer than the worst-case reconnect window. In-memory event store is fine for demos and not much else.
Standard HTTP infra compatibility | Works through proxies, LBs, CDNs (with buffering off), no upgrade handshake | Text-only (UTF-8 event stream format) | Binary payload needed, base64 encoding adds 33% size, you're now defeating the bandwidth point | If you need binary at scale, SSE is the wrong tool. WebSockets or gRPC-streaming. Don't fight the format.
Built-in event-id, retry, and event-type semantics | Replayable streams, configurable client-side retry, named event channels, all in the protocol | No multiplexing: one stream = one logical channel unless you multiplex at application layer | Dashboard needs "metrics" stream and "alerts" stream, so you open two SSE connections and hit the HTTP/1.1 6-connection cap fast | HTTP/2 makes the multi-stream problem go away (streams share the connection). HTTP/1.1 SSE at 6+ streams is a debugging nightmare.
Trivial REST integration | Add /stream endpoint to existing API, same auth, same observability | No bidirectional channel, so any "send me commands" channel needs a separate POST + correlation | LLM streaming with tool-use mid-response requires the client to POST a tool result while the stream is still flowing | Anthropic, OpenAI, and the rest standardized on SSE for LLM streaming. The reason: response streaming is one-way and writes are batch (one prompt). The shape matches.
HTTP/2 = one connection, many streams | No browser connection-cap pressure, streams multiplex naturally | HTTP/1.1 SSE caps at 6 streams per origin per browser, a hard ceiling | SaaS app opens an SSE stream per tab, user has 7 tabs open, the 7th hangs forever | HTTP/2 is the only sane SSE transport in 2026. If your stack is HTTP/1.1, you have a bigger problem than SSE.
Same-origin / CORS rules as REST | Auth via cookies just works; no protocol-upgrade weirdness | No cross-origin without explicit CORS; some intermediaries strip the stream headers | SSE stream behind Cloudflare with default buffering on, you never receive an event, browser stays connected forever | Set X-Accel-Buffering: no and Cache-Control: no-cache. Test through the full CDN/LB stack, not localhost. The buffering footgun is the most common SSE production fire.
Stream is just text | curl, browser devtools, any HTTP client can consume the stream raw | No compression option in the spec (gzip on the response works, but per-event compression doesn't) | Streaming a high-rate metrics feed at 10KB/event/sec, you're shipping 10x the bytes a binary protocol would | Application-layer compression (compress payload bytes, send base64) defeats the debuggability point. If size matters that much, switch to WebSockets with permessage-deflate or gRPC.
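
To make the event-id and replay rows concrete, here is a small sketch of the text/event-stream framing and a resume-from-Last-Event-ID lookup. `format_event`, `replay_after`, and the sample log are hypothetical; the sketch assumes numeric, monotonically increasing ids and a retained log, which the table notes should live in something durable rather than process memory.

```python
from typing import List, Optional, Tuple


def format_event(event_id: int, data: str, event: Optional[str] = None,
                 retry_ms: Optional[int] = None) -> str:
    """Serialize one event in text/event-stream framing."""
    lines = []
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")  # client reconnect delay hint
    if event is not None:
        lines.append(f"event: {event}")     # named event type
    lines.append(f"id: {event_id}")         # echoed back as Last-Event-ID
    for part in data.split("\n"):           # multi-line data: one field per line
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"        # blank line terminates the event


def replay_after(log: List[Tuple[int, str]],
                 last_event_id: Optional[str]) -> List[str]:
    """Frames for every logged event newer than the client's Last-Event-ID."""
    last_seen = int(last_event_id) if last_event_id else 0
    return [format_event(eid, data) for eid, data in log if eid > last_seen]


log = [(1, "cpu=41"), (2, "cpu=57"), (3, "cpu=63")]  # hypothetical retained events
frames = replay_after(log, "1")  # client reconnected after seeing event 1
```

`frames` holds the frames for events 2 and 3 only. If the requested id is older than anything still retained, a real server has to signal a gap explicitly instead of silently replaying what it has, which this sketch does not attempt.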

WebSockets (RFC 6455 · Full-duplex over TCP)

Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Full-duplex single connection | Send and receive on the same socket, no second channel, no per-message handshake once connected | HTTP-layer caching, intermediaries, observability all stop working post-handshake | Production incident, you can't replay traffic with curl, can't tail at the CDN, can't trace at the proxy: you're inside an opaque TCP pipe | WebSockets are an observability cliff. Invest in protocol-aware tracing (W3C Trace Context inside frames, structured logging at app layer) before you have an incident.
Low per-message overhead | 2-14 byte frame header, no per-message HTTP headers | No app-layer backpressure visibility: TCP-level only, you can't tell a slow consumer from a fast one | One slow client's TCP send buffer fills, server keeps queuing messages in heap, OOMs on what should be a 10 KB/s feed | Build explicit ack-window backpressure at the application layer. TCP flow control alone is not enough at any nontrivial fanout. The pattern is RSocket-style credit windows.
HTTP upgrade handshake | Reuse port 443, TLS, hostname routing, existing edge infra for the handshake | Some L7 LBs and WAFs pass the Upgrade handshake but mishandle the connection afterward | AWS ALB + WAF rules: handshake succeeds, frames get dropped 30s in, you debug for a week before finding the WAF rule | Test with your full production edge path, not localhost. Cloudflare, Cloud Armor, and ALB all have WebSocket-specific gotchas; document them per environment.
Long-lived connection enables low-latency push | Sub-10ms p99 server-to-client for short messages, no per-message setup | Server is stateful: every connected client is a slot on some specific box | Deploy day, you rolling-restart, every client reconnects to a different box, the new box needs to know the client's state, and you didn't think about that | Distributed pub/sub (Redis, NATS, Kafka) between WS servers is mandatory for any non-trivial fanout. Sticky-session-only architectures hit a wall at ~100K concurrent connections.
Binary frames | Efficient encoding, supports protobuf / msgpack / flatbuffers / raw bytes | No protocol semantics on top: you build your own framing, heartbeats, ack, ordering | Six months in, you've reinvented half of gRPC and badly: no flow control, no deadlines, no cancellation propagation | If you're building application-level RPC over WebSockets, you should be using gRPC or RSocket. WS is the wrong layer for the problem you're solving.
Standard browser API | new WebSocket(url) works in every browser back to IE 10 | Reconnect, resume, idempotency are all your problem (no Last-Event-ID equivalent) | User commutes through tunnels, WS drops every 90s, your reconnect logic loses 3 messages each time | Application-layer sequence numbers + server-side replay buffer + idempotent message handling. The pattern is well-known; the bug is forgetting it for V1 and learning the lesson on prod traffic.
Cross-origin via Origin header | Server can accept or reject by Origin, simple security model | Cookies work; bearer tokens must be passed at handshake (subprotocol or query) or in messages | Token in URL query gets logged at every proxy and CDN: you've leaked auth tokens in plaintext to every log aggregator on the path | Send the token via the Sec-WebSocket-Protocol header workaround or in the first frame. Never in the URL. This is a CWE-200-class bug waiting to happen.
Persistent connection through most proxies | Once handshake succeeds, frames flow through transparent proxies fine | Graceful shutdown during deploys is hard: every open connection blocks the drain or gets dropped | Rolling deploy with 50K connections per pod, each pod tries to drain for 30s, clients see 30s of reconnect storms across the whole fleet | Send a "server-going-down" frame, give clients 60s with jitter to reconnect to the new fleet, then force-close. Connection-shedding plus jittered reconnect is the production pattern.
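
The "sequence numbers + replay buffer + idempotent handling" pattern from the table can be sketched without any socket code. `ReplayBuffer` and `Client` are hypothetical names; the buffer is a bounded in-memory deque, so a client that stays offline longer than the buffer covers gets `None` and must do a full resync.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Message = Tuple[int, str]  # (sequence number, payload)


class ReplayBuffer:
    """Server side: number every message and keep the most recent ones."""

    def __init__(self, capacity: int = 1000) -> None:
        self._recent: Deque[Message] = deque(maxlen=capacity)
        self._next_seq = 1

    def publish(self, payload: str) -> Message:
        message = (self._next_seq, payload)
        self._recent.append(message)  # oldest entry falls off when full
        self._next_seq += 1
        return message

    def resume(self, last_seen_seq: int) -> Optional[List[Message]]:
        """Messages after last_seen_seq, or None if the gap was evicted."""
        oldest = self._recent[0][0] if self._recent else self._next_seq
        if last_seen_seq + 1 < oldest:
            return None  # cannot fill the gap: client needs a full resync
        return [m for m in self._recent if m[0] > last_seen_seq]


class Client:
    """Client side: apply each sequence number at most once."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.applied: List[str] = []

    def on_message(self, message: Message) -> None:
        seq, payload = message
        if seq <= self.last_seq:
            return  # duplicate from a replay: already applied
        self.applied.append(payload)
        self.last_seq = seq


server, client = ReplayBuffer(capacity=100), Client()
client.on_message(server.publish("m1"))
server.publish("m2")   # sent while the socket was down
server.publish("m3")   # also missed
for missed in server.resume(client.last_seq) or []:
    client.on_message(missed)
```

After the reconnect the client has applied m1, m2, and m3 in order. In a real deployment the client would send `last_seq` in its first frame after reconnecting, and the buffer would sit in shared storage so any server in the fleet can answer the resume.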

gRPC (HTTP/2 · Protobuf · 4 streaming modes)

Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
HTTP/2 multiplexing with binary protobuf | 5-10x smaller payloads vs JSON, many concurrent RPCs on one TCP connection | Browser-native support is missing: gRPC-Web requires a proxy (Envoy / Connect) | Public API ambition, two months in, every customer asks "is there a REST version?", you build it on the side, now you have two APIs | Connect protocol (CNCF) gives you gRPC schemas with HTTP/1.1 + JSON fallback and works in browsers without a proxy. For new greenfield, Connect over raw gRPC is the better default in 2026.
Codegen from .proto contracts | Type-safe clients and servers in every supported language | Every schema change is a regen + redeploy coordinated across services | Cross-team service contract evolves, three teams have different proto versions checked in, the wire works but the types mismatch at compile time everywhere | Buf Schema Registry (or equivalent) with breaking-change CI and a central source of truth. Without it, gRPC's compile-time safety is a lie: you have N copies of the truth.
Strict typed contracts | Wire format errors caught at codegen, no JSON-parse-at-runtime surprises | The loose coupling that REST/JSON allows is gone: additive evolution rules are strict (no reusing field numbers, no required fields without defaults) | Engineer changes a field number "to fix the schema", every existing client breaks silently and you learn from the on-call page | Protobuf evolution rules are non-obvious and unforgiving. Mandatory training for any team adopting gRPC; mandatory `buf breaking` in CI.
Four streaming modes | Unary, server-stream, client-stream, bidi: covers SSE and WebSocket use cases in one protocol | HTTP intermediary compatibility breaks: most CDNs and WAFs do not handle HTTP/2 trailers correctly | CloudFront in front of gRPC, half the calls fail with mysterious INTERNAL errors, you spend a week finding the trailer-stripping | Internal-only gRPC is the production sweet spot. External-facing gRPC at scale needs Envoy or equivalent control over the edge path.
Built-in deadlines and cancellation | Deadline propagates through call graph, so downstream services know when to give up | Binary on the wire: no curl, mandatory tooling (grpcurl, BloomRPC, Postman gRPC, Buf Studio) | 2am on-call page, "service is slow", you can't curl the endpoint to see what's happening, and you need a working grpcurl install on the bastion | Reflection enabled in non-prod environments, disabled in prod. grpcurl scripts in the runbook. The cost of "binary wire" is felt at incident time, not coding time.
Channel-level keepalive and pooling | One channel, many concurrent calls; ping/pong frames keep NAT mappings alive | L4 round-robin LBs fight HTTP/2 multiplexing: every call from one client goes to the same backend | Behind an NLB, 90% of QPS lands on one pod, you scale horizontally and CPU utilization stays the same | Client-side load balancing (round-robin per call, not per connection) or a real service mesh (Istio, Linkerd, Consul Connect). NLB / NLB-equivalent + gRPC is a known anti-pattern.
Wire-format efficiency | Protobuf is ~5-10x smaller than JSON for the same data, parsing is ~10x faster | Debuggability cost: no human-readable wire, no field names on the wire (just tags) | PCAP from a production trace is useless without the .proto file; rolling forward to a new schema and reading old captures is now a tooling problem | Keep the .proto file checked in with every release tag. Re-decoding old PCAPs is a real forensic need; if you don't have the schema, you don't have the data.
Backpressure via HTTP/2 flow control | Window-based per-stream flow control built into the wire | Polyglot edge needs a gateway (grpc-gateway, Envoy, Connect) for REST/JSON clients | Mobile team is on Swift / Kotlin where gRPC is fine, web team needs JSON, you maintain a transcoder forever | Connect's HTTP/1.1 + JSON fallback eliminates this entirely for the web team. The mental model: gRPC for service-to-service, Connect for everywhere a browser is involved.
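
The L4 load-balancing row is easiest to see with a toy simulation. The sketch below is not gRPC code: the backend names and call counts are made up, and `balance_per_connection` / `balance_per_call` are hypothetical helpers that only count where calls land under each strategy.

```python
from collections import Counter
from itertools import cycle
from typing import List

BACKENDS = ["pod-a", "pod-b", "pod-c"]  # hypothetical backend names


def balance_per_connection(backends: List[str],
                           calls_by_client: List[int]) -> Counter:
    """L4 style: pick a backend once per connection; every call rides it."""
    load = Counter({b: 0 for b in backends})
    picker = cycle(backends)
    for calls in calls_by_client:
        load[next(picker)] += calls
    return load


def balance_per_call(backends: List[str],
                     calls_by_client: List[int]) -> Counter:
    """Client-side style: each client round-robins every call across backends."""
    load = Counter({b: 0 for b in backends})
    for calls in calls_by_client:
        picker = cycle(backends)  # each client keeps its own rotation
        for _ in range(calls):
            load[next(picker)] += 1
    return load


# One busy gateway client plus two light clients (made-up call counts).
calls_by_client = [9000, 500, 500]
l4 = balance_per_connection(BACKENDS, calls_by_client)
l7 = balance_per_call(BACKENDS, calls_by_client)
print(l4["pod-a"] / sum(l4.values()))       # 0.9
print(max(l7.values()) - min(l7.values()))  # 2
```

With per-connection balancing, the single busy client pins 90% of all calls to one pod no matter how many pods exist. With per-call balancing the same traffic spreads to within two calls of even.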

WebRTC (SRTP · DTLS · ICE / STUN / TURN)

Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.

Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance
Peer-to-peer media path | Sub-100ms end-to-end latency, server bandwidth scales sublinearly with users | Server can't see / record / moderate the stream without a media server (SFU or MCU) | Compliance requires call recording, you didn't budget for an SFU, retrofitting it forces every client to upgrade | Plan for the SFU on day one even if you start with mesh. Mesh-only architectures hit recording, transcoding, and AI-analysis walls and need a rewrite, not a feature.
NAT traversal via ICE/STUN/TURN | Works through 80%+ of consumer NATs without manual port forwarding | Three pieces of infra to run (signaling, STUN, TURN), TURN bandwidth is metered and expensive | 15-20% of users behind symmetric NAT or carrier-grade NAT fall back to TURN, your TURN egress cost dominates the bill | Twilio / Cloudflare / Xirsys for TURN-as-a-service if you don't have geographic scale to run your own. The crossover point is roughly 10K concurrent TURN relays.
Adaptive bitrate via SDP / RTCP | Codec, resolution, and FEC adapt to network conditions automatically | Deterministic latency goes out: adaptation can stall frames or drop quality under sustained loss | Cloud gaming or trading floor application needs predictable 50ms latency, WebRTC's adaptation makes that bound impossible to commit to | For "must hit X ms p99" use cases, configure the codec aggressively (forced bitrate, FEC always-on, jitter buffer floor) and accept the artifacts. Or use a different transport.
SRTP for media, DTLS for data channel | End-to-end encrypted, no middle-box can inspect | No middlebox observability: debugging in production is "look at WebRTC internals in chrome://" | Customer reports bad audio, you can't pcap-decode the RTP, you can't reproduce, you're guessing from getStats() metrics | getStats() polling at 1s intervals into your observability stack is mandatory. The metrics are good (jitter, packet loss, RTT) but you have to collect them yourself per-call.
Direct peer bypasses server bandwidth | 1-on-1 calls cost almost zero server bandwidth (only signaling) | Mesh topology breaks above 4 peers: each client uploads N-1 streams | Sales demo expands from 4 to 8 people, half the participants' Wi-Fi can't sustain 7 uplink streams, calls drop | SFU at 5+ participants is the rule. The inflection point is upload bandwidth, not server cost. Most clients have asymmetric residential connections; mesh asks for upload they don't have.
Data channels for non-media payloads | Low-latency SCTP over DTLS/UDP for arbitrary data, with ordering and reliability configurable per channel | Channels are ordered and reliable by default; unordered or unreliable delivery is a per-channel opt-in, and the choice is yours to get right | Game state needs reliable ordered, but you configured unordered for speed, now state replication has gaps | Multiple data channels with different reliability profiles is the right pattern: one ordered/reliable for control, one unordered/unreliable for high-rate input. The flexibility is the point.
Native in every browser, no SDK | getUserMedia() + RTCPeerConnection are built-in, no install or extension | Server-side participation (record, transcode, AI) requires a headless browser or SFU integration | "Add real-time transcription": you discover this means running gstreamer or a headless Chromium in the cloud, multiplying your infra cost | SFU-as-a-service (LiveKit, Daily, 100ms, Cloudflare Calls) gives you record + transcode + AI ingestion without running media servers yourself. The make-vs-buy crossover is steep.
Codec negotiation via SDP | Older clients fall back to widely-supported codecs (Opus, VP8, H.264) | Codec deprecation breaks old clients with no graceful path (no codec = no call) | You drop VP8 support to save SFU CPU, all clients on Android 5 lose video, support tickets explode | Maintain a known minimum codec set for the long tail. WebRTC interoperability is a 2-3 codec problem in practice (Opus + H.264 + one of VP8/VP9/AV1); never less.
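
The mesh-versus-SFU inflection point in the table is plain arithmetic, worked through below. The 1500 kbps per-stream bitrate is an assumed figure for illustration only, and the sketch ignores simulcast and audio-only streams.

```python
def mesh_uplink_streams(participants: int) -> int:
    """Full mesh: each client sends one stream to every other peer."""
    return max(participants - 1, 0)


def sfu_uplink_streams(participants: int) -> int:
    """SFU: each client sends one stream; the server forwards it."""
    return 1 if participants > 1 else 0


def mesh_peer_connections(participants: int) -> int:
    """Total peer connections in a full mesh: n * (n - 1) / 2."""
    return participants * (participants - 1) // 2


PER_STREAM_KBPS = 1500  # assumed video bitrate, for illustration only

for n in (2, 4, 8):
    mesh_kbps = mesh_uplink_streams(n) * PER_STREAM_KBPS
    sfu_kbps = sfu_uplink_streams(n) * PER_STREAM_KBPS
    print(n, mesh_uplink_streams(n), mesh_kbps, sfu_kbps, mesh_peer_connections(n))
# 2 1 1500 1500 1
# 4 3 4500 1500 6
# 8 7 10500 1500 28
```

At 8 participants each mesh client needs about 10.5 Mbps of upload under this assumption, against a constant 1.5 Mbps with an SFU, which is why the table puts the limit at upload bandwidth rather than server cost.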

2. Use Cases

Per-protocol. Concrete workloads with the driving property that ruled out the alternative.

REST

Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative
Public SaaS API for third parties | Stripe, GitHub, Twilio, SendGrid | Universal client compatibility + edge caching | ~100M+ requests/day per platform | gRPC fails on developer ergonomics: every customer would need a code generator install
Cacheable read-heavy paths | News sites, e-commerce catalogs, retail product pages | 99%+ CDN cache hit ratio, sub-50ms TTFB | 1M+ RPS at edge, <1% origin | GraphQL fragments the cache, gRPC has no edge caching story
Mobile app backends with diverse clients | Uber, Instagram early-era, most B2C apps | Polyglot client stack (iOS / Android / Web / TV / partners) | Billions of requests/day | Maintaining gRPC clients for 5 platforms is a full team's work; REST + OpenAPI is one team
Webhook delivery target | Any system receiving Stripe / Shopify / GitHub callbacks | Receiver already runs HTTP infra; zero new tech | Highly variable, burst-prone | Sender doesn't speak gRPC; the webhook delivery industry standardized on HTTP POST
Healthcare / FHIR / regulated exchange | EMR vendors, insurance claim systems | Regulatory-mandated standard (HL7 FHIR over REST) | Per-region, per-payer | Compliance attestation cost rules out anything not in the spec
Internal microservice CRUD (polyglot) | Mid-size companies with mixed language stacks | Lowest-friction inter-service contract | 10K-50K QPS per service | gRPC's codegen overhead is hard to justify when teams don't already share a build system

Polling

Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative
CI runner heartbeat / job dequeue | GitHub Actions, GitLab Runner, Jenkins agents | Receiver-initiated for security (no inbound ports needed) | Millions of runners polling periodically | Webhooks would require every runner to expose an inbound HTTP endpoint; firewall hostile
Mobile background sync | Email apps, calendar clients, RSS readers | OS-level battery and connection constraints favor pull on a schedule | 100M+ MAU per app | OS limits sustained background WebSocket connections; the platform pushes you to polling or APNs/FCM
Distributed config / discovery | etcd v2 watch fallback, Spring Config Server, Consul long-poll | Long-poll is a known, simple primitive with zero new infra | 10K+ services polling shared config | Pub/sub bus adds a dependency; polling reuses existing HTTP
Slow-changing data feeds | Leaderboards, public stats pages, weather data | Staleness is acceptable (30s-5min) | Millions of clients, low event rate | Push channels are overkill when the event rate is <1/min per client
Multi-tenant SaaS workspace state | Tenant-specific dashboard data | Per-tenant isolation simpler with pull | 1K-10K tenants, modest per-tenant QPS | WebSockets force you to solve per-tenant fanout up front, polling defers that decision

Webhooks

| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Payment event notifications | Stripe, Square, PayPal, Adyen | Async settlement / dispute / refund notification across org boundaries | Millions of events/day, per-merchant fanout | Polling would cost merchants 100x in idle requests; cross-org WebSocket is operationally infeasible |
| SCM repo events to CI | GitHub → Jenkins / CircleCI / GitHub Actions | Sub-second push of code events to build infra | Tens of millions of events/day across GitHub | Polling 50M repos once a minute = 50M / 60s ≈ 833K RPS of "nothing happened" |
| SaaS automation triggers | Zapier, Workato, Make, n8n | Fanout to user-defined workflows in arbitrary destinations | Per-customer flow counts, variable burst | Receivers are external user infra; only HTTP POST is universally accepted |
| Marketplace order / inventory events | Shopify, Amazon Seller, eBay → 3PL systems | Heterogeneous third-party fulfillment integrations | Per-merchant, high burst on promotions | Each 3PL runs a different stack; webhooks are the lowest common denominator |
| Observability / alert ingestion | PagerDuty, Opsgenie, Slack incoming webhooks | Many monitoring tools fan in to one alerting platform | High burst at incident time | Every monitoring tool already speaks HTTP POST; standardizing on anything else fragments the integration market |
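
The idle-polling figure in the SCM row is simple arithmetic, and it is worth being able to redo it for your own numbers. A minimal sketch, where the function name `idle_poll_rps` is hypothetical and the inputs are the illustrative figures from the table rather than measured data:

```python
def idle_poll_rps(resources: int, interval_seconds: float) -> float:
    """Requests per second generated by polling every resource once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return resources / interval_seconds


# 50M repositories polled once a minute, as in the SCM row above.
print(round(idle_poll_rps(50_000_000, 60)))  # 833333
```

Almost all of those requests return nothing, which is the cost a webhook removes.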

SSE

A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.

| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| LLM token-by-token streaming | OpenAI, Anthropic, Google Gemini APIs | One-way server push with HTTP-stack reuse for auth / rate-limit / observability | Per-request, millions of concurrent streams platform-wide | WebSocket adds bidi cost that's unused (one prompt, one stream); reconnect semantics matter less for short streams |
| Real-time analytics dashboards | Datadog event stream, Grafana Live, observability UIs | Append-only event feed to UI with auto-reconnect | ~1K-10K concurrent dashboard sessions | WebSocket bidi is wasted; SSE auto-reconnect is built-in |
| Build / deploy progress streaming | CI build logs, CD pipeline UI, Terraform Cloud apply | Push log lines to UI as they happen, no UI polling | Per-build, transient sessions | Polling makes the UI choppy; WebSocket complexity isn't justified |
| Notification / activity feeds | GitHub notifications, Stripe Dashboard live events, JIRA activity | One-way notification push, low frequency, many clients | 10K-100K concurrent users per service | WebSocket adds bidi infra cost without bidi benefit |
| AI agent tool-call streaming | Anthropic Claude, OpenAI Assistants, Bedrock AgentCore | Partial response + tool-use events streamed inline | Per-conversation, long-lived | HTTP/2 SSE reuses existing API gateway infra; WebSocket would require a separate edge path |
| Server-driven UI update feed | Stripe Dashboard, Linear, Vercel deployment UI | Push state deltas to keep UI in sync with server | 100K+ concurrent users per service | SSE is sufficient; WebSocket's complexity buys nothing if writes go through REST |

WebSockets

Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.

| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Real-time collaborative editing | Figma, Google Docs, Notion, Linear | Bidirectional state sync with sub-50ms write echo | 100K+ concurrent connections per shard | SSE can't carry client-to-server in same channel; CRDT sync needs both directions interleaved |
| Trading platforms: quotes + orders | Coinbase Pro, IBKR, Robinhood, crypto exchanges | Low-latency bidi quotes plus order submission on the same wire | 1M+ concurrent connections at peak | HTTP-based protocols add too much per-request overhead; SSE alone misses the order-submit path |
| Multiplayer game state and lobby | Roblox lobbies, Among Us, Discord game integrations | Reliable ordered bidi for game state and chat | Per-room 8-100 players, millions of rooms | WebRTC's UDP is overkill for turn-based / lobby state; TCP ordering matters |
| Chat / messaging at scale | Slack, Discord, Twitch chat, WhatsApp Web | Bidi messaging with presence and typing indicators | 10M+ concurrent (Slack peak), 100M+ daily (Discord) | SSE for chat means typing indicators go through a separate POST channel; operational complexity not worth it |
| IoT control panels and dashboards | Smart home web dashboards, factory floor SCADA | Bidi state read + command write, low-latency | Per-deployment, persistent connections | REST can't push state changes; SSE can't carry control commands |

gRPC

Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.

| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Internal microservice mesh at hyperscale | Google internal, Netflix, Square, Uber internal | Typed contracts + HTTP/2 multiplexing + 5-10x smaller payloads | 1M+ RPC/sec per service at the larger end | REST overhead (JSON parse + HTTP/1.1 head-of-line) adds CPU and latency at every hop in a 10-hop call graph |
| Service mesh control plane | Envoy xDS, Istio control plane, Linkerd | Bidi-streaming for config push to thousands of proxies | 10K+ Envoy instances per control plane | REST polling for config doesn't scale; SSE is one-way; bidi gRPC is the natural fit |
| ML model serving | TensorFlow Serving, NVIDIA Triton, Vertex AI | Binary tensor payloads, server-streaming for batched inference | 1M+ inferences/sec per cluster | JSON serialization for tensors is 5-20x larger and 10x slower to parse |
| Database client protocols | Spanner, Bigtable, etcd v3, TiKV clients (CockroachDB uses gRPC between nodes, but its SQL clients speak the PostgreSQL wire protocol) | Typed query results, streaming for large scans, connection multiplexing | Per-cluster, sustained connection count | Custom binary protocols would work but gRPC gives you the language ecosystem for free |
| Mobile-to-backend with bandwidth pressure | Square POS, Lyft driver app, emerging-market apps | Small payloads matter on cellular; protobuf delivers | 100K+ devices, variable connectivity | REST/JSON costs 5-10x bandwidth, which on emerging-market data plans is the difference between usable and not |

WebRTC

Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.

| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
| --- | --- | --- | --- | --- |
| Video conferencing | Zoom web client, Google Meet, Microsoft Teams web, Whereby | Sub-100ms peer media with adaptive codec / FEC | Mesh up to ~4 participants, SFU beyond | WebSockets carry no audio/video codec stack; building the SRTP / jitter buffer / FEC layer yourself is multi-year work |
| Live customer support video | Intercom video, Zendesk Talk, in-product video help | Low-latency 1-on-1 video without server bandwidth | 1-on-1, on-demand, low concurrency | Server-mediated video costs egress bandwidth per session; P2P avoids it when a direct path exists (TURN-relayed sessions still pay egress) |
| Cloud gaming / remote desktop | NVIDIA GeForce Now, Xbox Cloud Gaming, Parsec | Sub-100ms latency video + input over UDP with adaptive bitrate | Per-session, 60fps sustained | WebSocket / TCP can't recover from packet loss fast enough; UDP plus FEC is the practical option |
| P2P file / data transfer | file.pizza, Snapdrop, magic-wormhole-web, browser-based AirDrop alternatives | Direct browser-to-browser without server storage | Pairwise, transient | Server-side upload doubles bandwidth cost and adds privacy concerns |
| Live audio rooms (many-to-many) | Clubhouse, Discord Stage, Twitter Spaces, Telegram voice chat | Many-to-many audio via SFU with jitter buffer and codec adaptation | Thousands of listeners per room | SFU fanout requires WebRTC-aware media servers; WebSocket audio would require custom codec / jitter / packet-loss handling |

3. Limitations

Multi-protocol matrix. Severity reflects how often the limitation forces an architectural choice.

| Limitation Axis | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Server push capability | High: None natively; requires SSE/WS side-channel | Critical: Pull-only by definition; latency = interval | Low: Push is the whole point; sender-initiated | Low: One-way server-to-client native | Low: Full-duplex native | Low: Server-streaming and bidi modes native | Low: Native bidi peer or via SFU |
| Browser-native client | Low: fetch / XHR universal | Low: setInterval + fetch | Critical: Browser cannot be a webhook receiver (no public endpoint) | Low: EventSource native everywhere | Low: WebSocket native everywhere | High: Needs gRPC-Web + proxy, or Connect protocol | Low: RTCPeerConnection native, but signaling is on you |
| Intermediary compatibility (CDN / WAF / proxy) | Low: Universal; designed for this | Low: Same as REST | Low: Plain HTTP POST | Medium: Buffering at proxies kills the stream silently | Medium: Upgrade handshake; WAFs / L7 LBs need WS support | High: HTTP/2 trailers stripped by many CDNs / WAFs | High: UDP-based, ICE/TURN needed for hostile NAT/firewall |
| Backpressure / flow control | Low: Per-request, client-paced | Low: Client controls poll rate | High: Sender controls rate; receiver must absorb or 429 | Medium: TCP-level only; no app-layer credit window | High: TCP-only; app-layer flow control is on you | Low: HTTP/2 per-stream flow control built in | Medium: RTCP for media; data channel uses SCTP credit |
| Connection cap per browser per origin | Low: 6 connections on HTTP/1.1; ~100 concurrent streams per connection on HTTP/2 | Low: Same as REST | Low: N/A (server-side concept) | High: 6-stream cap on HTTP/1.1; HTTP/2 required at scale | Medium: ~255 connections per origin in Chrome | Low: HTTP/2 multiplexing on one connection | Low: Per-PeerConnection; no origin cap |
| Observability (curl / pcap / trace) | Low: curl, devtools, every existing tool | Low: Same as REST | Low: ngrok / webhook.site / RequestBin work fine | Low: curl -N streams the response | High: Frame-level tools only; no per-message HTTP visibility after the upgrade | Medium: grpcurl with reflection; Buf Studio; not curl-able | Critical: Encrypted SRTP; getStats() polling is the only window |
| Mobile battery / radio impact | Medium: Per-request radio wake | High: Periodic radio wake-ups; battery hostile | Low: N/A (server-to-server) | Medium: Persistent connection; OS may kill in background | Medium: Persistent; OS-level background limits apply | Medium: Channel keepalive; smaller payloads help | High: Sustained UDP + media codec is CPU-heavy |
| HTTP/3 support maturity | Low: Native; transparent QUIC upgrade widely deployed | Low: Same as REST | Low: Same as REST at receiver | Low: Works over HTTP/3 transparently | Medium: RFC 9220 WebSockets-over-HTTP/3 still emerging in 2026 | Medium: Experimental in gRPC core; production via Connect or Tonic | Low: Not applicable; media and data already run over UDP (SRTP, SCTP over DTLS), independent of HTTP |
| Auth model uniformity | Low: Cookies, Bearer, mTLS all standard | Low: Same as REST | Medium: HMAC signature non-standardized across providers | Low: Same as REST, except that native EventSource cannot set custom headers (use cookies or a fetch-based client) | Medium: Token-in-URL leaks; use subprotocol header or first-frame auth | Low: Metadata for tokens, mTLS native | High: Signaling auth is separate from media auth; complex |
| Graceful deploy / connection draining | Low: Request-scoped; in-flight drains in seconds | Low: Per-poll drain trivial | Low: Sender retries handle drops naturally | Medium: Client auto-reconnects, but Last-Event-ID coordination needed | High: Custom shutdown protocol + jittered reconnect required | Medium: GOAWAY frame standard; streams must finish or restart | High: Re-ICE / re-SDP needed; calls drop during media-server deploys |

4. Failure & Recovery

Reframed from "Fault Tolerance". For protocols, the failure modes are reconnect, retry, idempotency, and message-loss windows.

| Dimension | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reconnect model | N/A (request-scoped; each call independent) | Trivial: the next poll is the reconnect | Sender retries with backoff (provider-specific, often exponential up to 24-72h) | Browser auto-reconnects with Last-Event-ID header | Application-implemented; no protocol-level reconnect | Channel reconnects automatically; stream RPCs must be restarted by app | Re-ICE on path failure; full re-SDP if peer connection dies |
| Resume / state-recovery primitive | ETag / If-Match for optimistic concurrency | Watermark / cursor / since-timestamp param | None native; pair with /events?since=cursor backfill API | Last-Event-ID header on reconnect; server replays from that id | App-layer sequence numbers + server replay buffer | No native resume on streams; pagination tokens for unary | Media: NACK / RTX retransmission and keyframe requests; data channel: SCTP retransmits |
| Message-loss window | N/A (synchronous response or error) | Between polls if events expire from server buffer before next poll | Receiver crashes before acking: at-least-once retry covers it; crash after ack but before processing loses the event unless it was enqueued durably | Between disconnect and reconnect, bounded by server's event-id retention | Disconnect to reconnect: the entire window is lost unless the app buffers | During stream disconnect; unary calls are atomic (response or error) | Datagram loss tolerated for media (FEC / jitter buffer); reliable channels SCTP-retransmit |
| Idempotency primitive | Verb semantics (GET, PUT, DELETE); Idempotency-Key header for POST | Same as REST | Event-id from sender; receiver dedups in seen-set with TTL | Event-id; replay-safe if events are idempotent | App-layer message-id + ack window | App-layer; carry a request id in metadata by convention | N/A for media; data channel needs app-layer |
| Retry semantics | Client decides; safe retries on idempotent verbs only | Client-driven; next poll is the retry | Sender retries on non-2xx, usually with exponential backoff (Stripe: up to 3 days; some providers, GitHub among them, do not retry automatically and rely on manual or API redelivery) | Reconnect delay set by the server's retry: field; browser default is a few seconds | Application-implemented; jitter mandatory at scale | Built-in retry policy via service config; per-RPC retryable status codes | Media: FEC + retransmit; data channel: SCTP retransmit policy |
| Backoff strategy | App-layer; conventionally exponential with jitter | Conventionally exponential on empty; constant on event | Sender-controlled; each provider documents schedule | Client-driven; server can hint via retry: | App-implemented; truncated exponential with jitter is standard | Configurable per-service in retry policy | ICE has restart timer; app handles call-level reconnect cadence |
| Server-failure visibility to client | Immediate: HTTP status code on the response | Immediate per poll | N/A (receiver perspective; sender retries) | On connection drop; otherwise opaque (no out-of-band signal) | On close frame or TCP RST; timeout-based otherwise | Per-RPC status code; GOAWAY for graceful shutdown | ICE connection-state events; media-quality from getStats() |
| Cross-region failover | DNS + health checks; clients re-resolve and retry | Same as REST | Sender re-resolves receiver URL; receiver's failover is its problem | Client reconnects; new region picks up from Last-Event-ID if state is cross-region replicated | Connection lost on region failure; reconnect to new region, state must be in shared store | Client-side LB or service mesh handles; xDS can shift traffic | Re-ICE to new TURN/SFU; calls drop during failover unless multi-region SFU |
| Half-open / zombie detection | Request timeout (configurable, typically 30s) | Per-poll timeout | Sender timeout (commonly 10-30s); receiver ack required | Server-sent heartbeat comments + client read timeout | Ping/pong frames + app-layer timeout | HTTP/2 keepalive (PING frames) | ICE consent-freshness checks (STUN binding roughly every 5s; consent expires after about 30s of silence) |
| Heartbeat / keepalive | N/A (request-scoped) | The poll itself is the heartbeat | N/A (sender-initiated) | Comment lines (:\n\n) every 15-30s; needed to stop proxies from closing the idle stream | RFC 6455 ping/pong frames; app-layer heartbeat for NAT-rebind detection | HTTP/2 PING frames; configurable interval / timeout | STUN binding requests; RTCP receiver reports |
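
The "truncated exponential with jitter" reconnect policy named in the backoff row can be sketched in a few lines. This is an illustration under stated assumptions, not any library's implementation: the function name `reconnect_delay`, the 0.5s base, and the 30s cap are all hypothetical, and the "full jitter" variant (uniform between zero and the ceiling) is one common choice among several.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Delay in seconds before reconnect attempt `attempt` (0-based), with full jitter."""
    rng = rng or random
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(max(attempt, 0), 32)
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: spread clients uniformly over [0, ceiling] to avoid synchronized retries.
    return rng.uniform(0.0, ceiling)
```

Spreading delays over the whole interval, rather than adding a small jitter on top of a fixed delay, is what keeps a fleet of clients from reconnecting in lockstep after a deploy or region failure.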

5. Connection Distribution

Reframed from "Sharding". For protocols, the question is how connections distribute across servers, what stickiness is required, and what scale-out looks like.

| Dimension | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Stateless across servers? | Yes (when properly designed) | Yes | Yes at receiver; sender keeps retry state | Per-connection stateful, but events can come from shared pub/sub bus | Stateful: each connection bound to a specific server | Channel-stateful; one channel pinned to one HTTP/2 connection | Per-peer-connection state; SFU is the stateful hub |
| Sticky session required? | No (anti-pattern if present) | No | No at receiver | Effectively yes for the lifetime of the stream | Yes: sticky for connection lifetime | Yes per channel; LB must respect HTTP/2 connection affinity | Yes: peer-to-SFU pinning |
| LB type (L4 vs L7) | Either; L7 for path-routing | Either | Either at receiver | L7 required for path routing; L4 works for fleet-level | L7 with WS support; L4 works for raw TCP | L7 with HTTP/2 awareness; L4 NLB causes single-pod hotspots | L4 for UDP; signaling separately L7 |
| Per-server connection ceiling | Limited by FD / thread; ~10K concurrent requests/node typical | Same as REST plus active poll concurrency | Same as REST at receiver | 10K-50K connections/node with an event-loop or goroutine-based server (Node, Netty, Go) | 50K-100K+ connections/node; ulimit + tcp_mem tuning required | HTTP/2 multiplexing → 1 connection serves many streams; fewer connections than REST | ~2K-5K participants per SFU node typical (CPU bound, not socket bound) |
| Horizontal scale story | Add nodes behind LB; uniform | Same as REST | Same as REST at receiver | Add nodes; shared pub/sub bus (Redis / NATS / Kafka) for cross-node fanout | Add nodes + shared pub/sub bus + session directory if reconnect-targeting needed | Client-side LB or service mesh; xDS-driven endpoint discovery | SFU cluster with cascade / mesh between SFUs; signaling layer is separate |
| Session rebalancing | N/A (no session) | N/A | N/A | Drop-and-reconnect; client picks new node | Drop-and-reconnect with jitter; reconnect storm risk | Server can send GOAWAY; client re-establishes channel | Re-ICE or full re-SDP; visible to user as call drop |
| Hot connection behavior | N/A (per-request) | High-rate single client looks like many requests; rate-limit by API key | Receiver overload returns 429; sender backs off | A single fast-producing event stream can OOM the server's send buffer if not bounded | Slow-consumer client blocks server send queue; needs app-layer backpressure | HTTP/2 flow control bounds per-stream | High-bitrate client drives SFU CPU and bandwidth; simulcast / SVC mitigates |
| Max connections (practical) | Limited by fleet capacity, not protocol | Same as REST | Same as REST at receiver | ~1M concurrent connections across a fleet of 20-50 modern nodes | 10M+ documented at WhatsApp, Slack, Discord scale | Bounded by channel count, not stream count; effectively very high | SFU bound; ~50K-100K concurrent across an SFU cluster typical |
| Cross-shard fanout | N/A (stateless) | N/A | Sender fans out to N receivers from its queue | Pub/sub bus delivers events to nodes holding subscriber connections | Pub/sub bus + session directory: who is connected to which node | N/A for unary; bidi-streaming uses pub/sub fanout same as WS | SFU forwards to all participants; cascade between SFUs for cross-region |

6. Fanout / State Sync

Reframed from "Replication". For protocols, this is how one event reaches N receivers, with what ordering and delivery guarantees.

| Dimension | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Fanout model | Unicast: client pulls own copy | Unicast: each client pulls | Sender-side multicast: sender posts to each receiver URL | Server-side broadcast: backend fans out via pub/sub to subscriber connections | Server-side broadcast: same pattern as SSE plus bidi | Server-stream / bidi: server fans out via pub/sub on its end | SFU forwards; or mesh (each peer sends to all others, N² streams in total) |
| State ownership | Server | Server (with client cursor) | Sender (until ack); receiver after | Server with event-id retention | Server + client (per-connection); shared via pub/sub bus | Server; client tracks stream cursor app-side | Peer + SFU; CRDT / OT for shared state via data channel |
| Multi-server fanout: pub/sub bus required? | No | No | No: fanout is sender-driven HTTP POST | Yes for any cross-node fanout (Redis pub/sub, NATS, Kafka) | Yes: same as SSE | Yes for streaming fanout | SFU is the fanout primitive; cascade for cross-SFU |
| Message ordering guarantees | Per-call only | Client-visible order = poll order; gap risk if events expire between polls | No order guarantee across events (parallel HTTP delivery) | Per-stream FIFO from server | Per-connection FIFO (TCP-ordered) | Per-stream FIFO; HTTP/2 stream ordering | Media: timestamped, jitter-buffered; data channel: configurable ordered/unordered |
| Delivery guarantee | At-most-once (one response or error) | At-least-once with client dedup on cursor | At-least-once (retries on failure) | At-least-once on reconnect with Last-Event-ID; at-most-once otherwise | App-defined; the protocol only guarantees ordered delivery while the TCP connection lives | At-most-once per RPC; app-layer retry adds at-least-once | Media: best-effort with FEC; data channel: configurable |
| Cross-region fanout | DNS / multi-region origin | Same as REST | Sender hits receivers per-region | Cross-region pub/sub (Kafka MirrorMaker, Confluent, NATS leaf nodes) | Same as SSE plus session directory for reconnect routing | Service mesh with global xDS | SFU cascade across regions; latency budget tightens |
| Lag (typical) | Sub-50ms intra-region; per-request | Up to one poll interval (half an interval on average) | Sub-second to seconds; minutes-hours on retry | Sub-100ms intra-region with healthy pub/sub bus | Sub-50ms intra-region typical | Sub-50ms intra-region | 50-150ms peer-to-peer; +50ms per SFU hop |
| Conflict resolution | App-layer (ETag / If-Match / OCC) | App-layer with cursor | App-layer at receiver (idempotency-key dedup) | N/A (one-way stream) | App-layer (CRDT, OT, vector clocks) | App-layer | Media: jitter-buffer ordering; data channel: app-layer (CRDT typical) |
| Behavior during network partition | Request fails fast; client retries | Polls fail; client backs off and retries | Sender retries with backoff; receiver may catch up via /events GET | Connection drops; auto-reconnect with Last-Event-ID | Connection drops; app-layer reconnect required | Channel reconnects; streams must be restarted by app | ICE attempts alternate candidates; falls back to TURN |
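
The server-side broadcast pattern shared by SSE and WebSockets (a bus delivers each event to every node, and each node writes to the connections it holds) can be sketched with in-memory stand-ins. Everything here is hypothetical: `Bus` stands in for Redis pub/sub, NATS, or Kafka, `Node` for one server process, and a plain list for each connection's outbound stream.

```python
from collections import defaultdict
from typing import Dict, List


class Bus:
    """In-memory stand-in for a shared pub/sub bus."""

    def __init__(self) -> None:
        self._nodes: List["Node"] = []

    def attach(self, node: "Node") -> None:
        self._nodes.append(node)

    def publish(self, topic: str, event: str) -> None:
        # Every node sees every event; each node decides who receives it locally.
        for node in self._nodes:
            node.deliver(topic, event)


class Node:
    """One server holding long-lived subscriber connections."""

    def __init__(self, name: str, bus: Bus) -> None:
        self.name = name
        self._subs: Dict[str, List[List[str]]] = defaultdict(list)
        bus.attach(self)

    def connect(self, topic: str) -> List[str]:
        """Register a connection; the returned list models its outbound stream."""
        outbox: List[str] = []
        self._subs[topic].append(outbox)
        return outbox

    def deliver(self, topic: str, event: str) -> None:
        # Appending in publish order gives per-connection FIFO.
        for outbox in self._subs.get(topic, ()):
            outbox.append(event)
```

A publisher never needs to know which node holds which connection; that only becomes necessary for targeted delivery, which is where the session directory comes in.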

7. Better Usage Patterns

Per-protocol. Patterns most teams miss until production bites; the PE-grade approach beats them.

REST

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Idempotency-Key on every POST that mutates external state | Treat POST as fire-and-forget; rely on client to dedup on retry | Server stores Idempotency-Key + request fingerprint + the original response for 24h; replays return the original response | Customer-charged-twice bugs are the most expensive REST bug class; the fix is one header |
| ETag + If-Match for optimistic concurrency | Last-write-wins; the second update silently overwrites the first | Server returns ETag on GET; PUT/PATCH requires If-Match matching the current ETag | Concurrent edits stop corrupting data and start surfacing as 412 Precondition Failed |
| Cursor-based pagination, not offset | ?offset=10000&limit=20; the DB scans and discards O(n) rows per page | ?cursor=opaque_token&limit=20; the token encodes the last seen primary key | A cursor is an index seek whose cost does not grow with page depth; offsets break at scale with timeouts on deep pages |
| Faithful 4xx vs 5xx | Return 500 for every error (including malformed input) | 4xx = "your request was wrong", 5xx = "we failed"; 4xx is not retryable (apart from 408 / 429), 5xx may be | Client retry logic depends on this distinction; getting it wrong creates retry storms or silent loss |
| Cache-Control + Vary explicit, never implicit | No headers, then learn the CDN cached a user-specific response globally | Cache-Control: private for user-scoped; Vary: Authorization, Accept-Encoding always | CDN response leakage between users is a CWE-200 class exposure; explicit headers prevent it |
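
The Idempotency-Key row can be made concrete with a small sketch. It is not a production design: `IdempotencyStore` is a hypothetical in-memory stand-in for a shared store, the 422 status for a mismatched body is one convention among several, and a real server would also need an atomic claim on the key so two concurrent retries cannot both execute.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class StoredResponse:
    fingerprint: str  # hash of the original request body
    status: int
    body: str
    expires_at: float


class IdempotencyStore:
    """In-memory stand-in for a shared store such as Redis or a database table."""

    def __init__(self, ttl_seconds: float = 24 * 3600) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[str, StoredResponse] = {}

    def handle(self, key: str, request_body: str,
               execute: Callable[[str], Tuple[int, str]],
               now: Optional[float] = None) -> Tuple[int, str]:
        """Run `execute` at most once per live key; replay the stored response on retries."""
        now = time.time() if now is None else now
        fingerprint = hashlib.sha256(request_body.encode()).hexdigest()
        entry = self._entries.get(key)
        if entry is not None and entry.expires_at > now:
            if entry.fingerprint != fingerprint:
                # Same key, different payload: refuse rather than guess.
                return 422, "Idempotency-Key reused with a different request body"
            return entry.status, entry.body
        status, body = execute(request_body)
        self._entries[key] = StoredResponse(fingerprint, status, body, now + self.ttl_seconds)
        return status, body
```

Storing the response itself, not just a marker that the key was seen, is what lets a retry get the same answer the first attempt would have received.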

Polling

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Long-poll with coordinated timeout | Short-poll at 1-5s; 95%+ of requests return empty | 30-60s long-poll; server holds until event arrives or timeout, with the hold kept below LB / proxy idle timeouts | Roughly 10x reduction in QPS for the same freshness; works through HTTP infra |
| Conditional requests (If-Modified-Since, ETag) | Endpoint ignores conditional headers; client downloads full payload every poll | Server respects If-None-Match → 304 Not Modified with no body | 80%+ bandwidth reduction on unchanged-data polls; trivial server change |
| Jittered intervals | Hardcoded 60s interval; N clients sync to wall-clock boundary, cause :00 spike | Base interval ±20% random jitter; server adds Retry-After hints under load | Thundering herds at scheduled boundaries are an outage class on their own |
| Delta polling with watermark | Poll /messages and get the full list every time; client dedupes locally | Poll /messages?since=last_seen_cursor; server returns only new items | Payload shrinks by orders of magnitude at any nontrivial dataset size; cleaner client logic |
| Exponential backoff on empty | Constant interval regardless of activity; idle clients hammer at 1Hz | Double interval on empty (capped at, say, 5min); reset to base on event | Idle traffic drops 90%+; active clients still see fast updates |
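
The jitter and backoff-on-empty rows combine naturally into one scheduling function. A minimal sketch, assuming a 5s base, a 5min cap, and ±20% jitter as in the table; the name `next_poll_delay` and its return shape are hypothetical:

```python
import random
from typing import Optional, Tuple


def next_poll_delay(current: float, got_events: bool, base: float = 5.0,
                    cap: float = 300.0, jitter: float = 0.2,
                    rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Return (new_interval, jittered_sleep) for the next poll.

    The interval doubles after an empty poll, up to `cap`, and resets to
    `base` as soon as a poll returns events.
    """
    rng = rng or random
    interval = base if got_events else min(current * 2, cap)
    # Jitter only the sleep, not the stored interval, so doubling stays exact.
    sleep = interval * (1 + rng.uniform(-jitter, jitter))
    return interval, sleep
```

A client loop would keep `new_interval` as its state and sleep for `jittered_sleep`, so clients that started together drift apart instead of hitting the same wall-clock boundary.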

Webhooks

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| HMAC signature verification on every request | Trust the sender IP; fail when sender rotates clouds or adds CDN | HMAC-SHA256 over (timestamp + body) with rejection of stale timestamps (>5min) | Without this, anyone who learns your endpoint can spoof events; IP-based trust does not scale |
| Receiver: enqueue first, process async | Synchronous processing inside the webhook handler; slow DB write blocks ack | Validate signature → enqueue to SQS / Kafka → ack 200 in <100ms → process from queue | Sender retry storms on receiver slowness are how production fires start; async absorbs bursts |
| Idempotent dedup on event-id | Assume exactly-once delivery; process duplicates and double-charge | Maintain a TTL'd seen-set keyed on event-id (Redis SET with NX and EX, or a DDB conditional write) so check-and-mark is atomic | At-least-once is the real semantic; assuming otherwise causes duplicate side effects |
| Per-event-type endpoints | One /webhook endpoint that switches on event_type internally | /webhook/payment-success, /webhook/refund-issued, separate handlers, separate scale | You can't independently scale, monitor, or rate-limit a god-endpoint; per-type routes solve this |
| Replay / backfill endpoint pair | Webhook is the only source of truth; missed events are gone | Always offer /events?since=cursor as a pull-based backfill; the webhook is the low-latency push path, not the system of record | Bugs, downtime, schema changes all need replay; without it, you debug by asking the sender to resend |
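
Signature verification and event-id dedup fit in a short sketch. The signed-message layout (`timestamp.body`), the function names, and the in-memory `SeenSet` are assumptions for illustration; each real provider documents its own header names and signing scheme, and a production seen-set would live in a shared store.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over 'timestamp.body' (assumed layout, not a provider's spec)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None, tolerance: float = 300.0) -> bool:
    """Reject stale timestamps first, then compare signatures in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance:
        return False
    return hmac.compare_digest(sign(secret, timestamp, body), signature)


class SeenSet:
    """TTL'd set of processed event ids (in-memory stand-in for a shared store)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._expires: Dict[str, float] = {}

    def first_time(self, event_id: str, now: float) -> bool:
        """True only for the first sighting of an event id within the TTL."""
        expiry = self._expires.get(event_id)
        if expiry is not None and expiry > now:
            return False
        self._expires[event_id] = now + self.ttl_seconds
        return True
```

A handler would call `verify` on the raw request bytes before parsing, then `first_time` on the event id, and only then enqueue the event and acknowledge.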

SSE

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Heartbeat comments every 15-30s | No heartbeats; proxies kill idle streams at 30-60s with no error visible | Send :\n\n comment lines on a timer; clients ignore, proxies see traffic | Silent stream death is the #1 SSE production fire; one timer prevents it |
| Run over HTTP/2 only | HTTP/1.1 SSE; browser caps at 6 streams per origin, app hangs on 7th tab | Terminate HTTP/2 at the edge and through to the SSE origin | HTTP/1.1 SSE at any nontrivial fanout is structurally broken; HTTP/2 makes it work |
| Persist the event log that backs Last-Event-ID | Event log in-memory; server restart loses position; reconnect gets silent gap | Redis Streams or Kafka with retention longer than worst-case reconnect window | Auto-reconnect is the SSE selling point; without persistence it's a half-feature |
| Disable proxy buffering explicitly | Default nginx / Cloudflare config buffers responses; stream never flushes | Set X-Accel-Buffering: no, Cache-Control: no-cache, Content-Type: text/event-stream | Stream-buffering footgun produces zero visible errors; just nothing happens for users |
| Bound event payload size | Send arbitrary payloads; one large event blocks the stream for slow clients | Cap event body at, say, 32KB; chunk larger payloads with continuation event-ids | A large single event monopolizes the connection's send window and spikes p99 latency for everything queued behind it |
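
The wire format behind the heartbeat and Last-Event-ID rows is small enough to show directly. The framing below follows the text/event-stream format (id:, event:, data: fields, blank line to dispatch, a leading colon for comments); the `ReplayLog` class, its bounded in-memory buffer, and the integer ids are hypothetical stand-ins for a durable log.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

HEARTBEAT = ":\n\n"  # comment line; clients ignore it, proxies see traffic


def format_event(event_id: int, data: str, event: Optional[str] = None) -> str:
    """Serialize one event as a text/event-stream frame."""
    lines = ["id: %d" % event_id]
    if event:
        lines.append("event: %s" % event)
    # Multi-line payloads need one data: field per line.
    lines.extend("data: %s" % part for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"


class ReplayLog:
    """Bounded in-memory event log (stand-in for Redis Streams or Kafka)."""

    def __init__(self, capacity: int) -> None:
        self._events: Deque[Tuple[int, str]] = deque(maxlen=capacity)
        self._next_id = 1

    def append(self, data: str) -> int:
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, data))
        return event_id

    def since(self, last_event_id: Optional[int]) -> List[str]:
        """Frames to replay after a reconnect; nothing for a fresh connection."""
        if last_event_id is None:
            return []
        return [format_event(i, d) for i, d in self._events if i > last_event_id]

    def has_gap(self, last_event_id: int) -> bool:
        """True if events after last_event_id were already evicted."""
        return bool(self._events) and self._events[0][0] > last_event_id + 1
```

On reconnect the server reads the Last-Event-ID request header, checks `has_gap` to decide whether the client needs a full resync, and otherwise writes the frames from `since` before resuming the live stream.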

WebSockets

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Application-layer ping/pong + sequence numbers | Rely on TCP keepalive (default 2hr); miss NAT rebinding, half-open sockets | App-layer ping every 30s; sequence number on every message; replay on reconnect | TCP-only liveness misses real failures; sequence numbers enable lossless reconnect |
| LB idle-timeout coordination | App holds connection idle; LB kills it at 60s with no warning | App-level keepalive interval < LB idle timeout (e.g. 30s vs 60s) | Mysterious disconnects during idle periods are almost always an LB timeout mismatch |
| Explicit ack-window backpressure | Server writes to socket; one slow client fills the send queue; OOM | Credit-based: client grants N credits ("ready for N more"); server sends at most N messages before the next grant | TCP flow control alone is not enough at fanout; OOM under a slow consumer is a common outage cause |
| Connection draining on deploy | Rolling-restart drops all WS connections; reconnect storm hits new pods | Send "going-down" frame, 60s jittered grace, then force-close; new pods warm before old pods drain | Deploy-triggered reconnect storms cause cascading failures that look like real outages |
| Token in a header or first frame, never in URL | ?token=eyJhb...; token is now in every proxy access log forever | Sec-WebSocket-Protocol subprotocol header or first-frame auth message | Tokens leaked to logs are a CWE-200 incident; auditors will find it eventually |
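
Credit-based backpressure is easier to reason about with the bookkeeping written out. A minimal sketch, independent of any WebSocket library: `CreditedSender`, its `send` callback, and the queue bound are all hypothetical names, and what to do when the queue is full (drop, coalesce, or disconnect) is an application decision.

```python
from collections import deque
from typing import Callable, Deque


class CreditedSender:
    """Per-connection send queue that only writes when the client has granted credit."""

    def __init__(self, send: Callable[[str], None], max_queue: int = 1000) -> None:
        self._send = send
        self._credits = 0
        self._queue: Deque[str] = deque()
        self._max_queue = max_queue

    def grant(self, n: int) -> None:
        """Client says it is ready for n more messages."""
        self._credits += n
        self._drain()

    def publish(self, message: str) -> bool:
        """Queue a message; False means the bound was hit and the caller must decide."""
        if len(self._queue) >= self._max_queue:
            return False
        self._queue.append(message)
        self._drain()
        return True

    def _drain(self) -> None:
        # Write only as many messages as the client has credit for.
        while self._credits > 0 and self._queue:
            self._send(self._queue.popleft())
            self._credits -= 1
```

The bounded queue is the point: memory per connection is capped by `max_queue` regardless of how slowly the client reads.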

gRPC

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Deadlines propagated through the call graph | No deadline set; runaway calls block goroutines / threads / Netty workers | Set deadline at the edge; propagate via Context / Metadata; downstreams enforce | Deadline propagation is the single biggest gRPC reliability lever; without it you can't bound tail latency |
| Channel reuse, not per-call channels | Create a channel per RPC; exhaust ports, leak HTTP/2 connections | One channel per (host, port), shared across goroutines / threads | Channel creation is expensive (TLS + HTTP/2 setup); reuse is 10-100x cheaper |
| Client-side LB or service mesh, not L4 NLB | NLB round-robins TCP connections; HTTP/2 multiplexing pins all RPCs to one backend | round_robin LB policy in gRPC client, or service mesh with HTTP/2-aware routing | L4 LB + gRPC = single-backend hotspots that look like a scaling bug but are a protocol mismatch |
| Cancellation propagation | Caller cancels Context; downstream services don't see it, keep working | Pass Context (Go) / CancellationToken (.NET) / equivalent through every call | Without propagation, cancelled calls keep burning CPU / DB / external API budget downstream |
| Strict additive schema evolution | Reuse field numbers or change field types because the build still passes | Never reuse field numbers; mark removed numbers and names reserved; add only with new numbers | Protobuf evolution rules are unforgiving; a reused number makes old clients misdecode data silently |
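
The arithmetic of deadline propagation (each hop hands its callee whatever budget is left, not a fresh timeout) can be shown without any gRPC dependency. This is a library-free illustration: the `Deadline` class and `child_timeout` are hypothetical names, whereas real gRPC stacks carry the deadline for you through the call context.

```python
import time
from typing import Callable


class DeadlineExceeded(Exception):
    """Raised when the request's overall time budget is spent."""


class Deadline:
    """Absolute deadline fixed at the edge and shared by every hop of one request."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self) -> float:
        return self._expires_at - self._clock()

    def check(self) -> None:
        """Fail fast before starting work that can no longer finish in time."""
        if self.remaining() <= 0:
            raise DeadlineExceeded()

    def child_timeout(self, reserve_s: float = 0.0) -> float:
        """Timeout to pass downstream: what is left, minus time reserved for local work."""
        left = self.remaining() - reserve_s
        if left <= 0:
            raise DeadlineExceeded()
        return left
```

The key property is that the budget only shrinks as the request moves through the call graph, so a slow first hop cannot be followed by a second hop that waits its full default timeout.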

WebRTC

| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
| --- | --- | --- | --- |
| Budget for TURN bandwidth on day one | Ship STUN only; 15-20% of users behind symmetric NAT can't connect | TURN servers in each region (or TURN-as-a-service); accept the egress cost | Without TURN, your connection success rate is roughly 80-85% and you'll never debug why; with TURN it's 99%+ |
| SFU at 5+ peers, mesh below | Mesh-only architecture; works at 4, fails at 5; users blame their wifi | Switch to SFU when participant count exceeds 4 (configurable inflection point) | Mesh upload bandwidth is N-1 streams per peer; residential uplinks fail before SFUs do |
| Trickle ICE always | Vanilla ICE; gather all candidates before offer; 5-15s call setup | Trickle ICE; emit candidates as found, offer goes out fast; sub-second setup | Call-setup latency dominates first-impression UX; trickle is non-optional in 2026 |
| Simulcast for multi-receiver bitrate adaptation | Single bitrate; one viewer on mobile forces everyone to the lowest quality | Send 3 layers (high/mid/low); SFU forwards appropriate layer per subscriber | Without simulcast, one bad-network participant degrades the whole call; with it, each subscriber gets best-available |
| DTLS-SRTP only, never SDES | Old samples or libraries negotiate SDES key exchange | DTLS-SRTP mandatory in the SDP offer; reject SDES outright | SDES is deprecated and insecure; modern libraries enforce DTLS-SRTP but legacy samples still show SDES |
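
The mesh-versus-SFU inflection point follows from uplink arithmetic, which a short sketch makes explicit. The bitrates below (1500 kbps per stream, a 5000 kbps residential uplink, three simulcast layers) are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
from typing import Sequence


def mesh_uplink_kbps(participants: int, stream_kbps: int) -> int:
    """In a mesh, each peer uploads one copy of its stream to every other peer."""
    return max(participants - 1, 0) * stream_kbps


def sfu_uplink_kbps(layer_kbps: Sequence[int]) -> int:
    """With an SFU, each peer uploads once: the sum of its simulcast layers."""
    return sum(layer_kbps)


def max_mesh_participants(stream_kbps: int, uplink_kbps: int) -> int:
    """Largest call size a given uplink can sustain in a mesh."""
    return uplink_kbps // stream_kbps + 1


print(mesh_uplink_kbps(5, 1500))           # 6000
print(sfu_uplink_kbps([1500, 500, 150]))   # 2150
print(max_mesh_participants(1500, 5000))   # 4
```

Under these assumed numbers the mesh stops fitting at the fifth participant, while the SFU uplink stays constant no matter how many people join.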

8. Advanced / Next-Gen Alternatives

Per-protocol. The successors, adjacent tech, and patterns that obviate the original when constraints shift.

REST

Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.

| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| GraphQL (Federation v2) | Eliminates over-fetching and client-side N+1 round trips across diverse clients | Production | Medium: server resolver layer plus federation gateway | Multiple client teams with different data shapes hitting the same backends |
| HTTP/3 + QUIC | Removes transport-level head-of-line blocking, faster handshake, better mobile resilience | Production (~35% of Cloudflare HTTPS traffic in 2026) | Low: transparent upgrade if edge supports it | Mobile-heavy traffic, lossy networks, latency-sensitive global apps |
| tRPC / typed REST | End-to-end TypeScript type safety without protobuf toolchain | Production in TS monorepos | Medium: TypeScript-only, both ends must speak it | Greenfield TypeScript fullstack where shared types are the value |
| Connect protocol (CNCF) | gRPC schemas + HTTP/1.1 + JSON fallback; browser-native, no proxy | Production; used by CrowdStrike, Bluesky, Dropbox | Low if you're greenfield; medium from existing REST | Want gRPC's type safety with REST's debuggability and browser story |

Polling

Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.

| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| SSE / WebSockets | Eliminates poll-interval latency floor; reduces wasted requests | Production | Medium: adds push-channel infra; client refactor | Event rate exceeds 1/min per client and clients are open browsers |
| Long-poll + HTTP/2 keepalive | Same polling model, fewer empty round-trips, no idle radio wake | Production | Low: server-side change; client interval becomes ack-on-receive | Want polling semantics with less wasted bandwidth and battery |
| Webhooks (for B2B receivers) | Push from sender; no client-side scheduling | Production | Medium: receiver must expose a reachable HTTP endpoint | Receiver is another server you control with a routable URL |
| Mobile push (APNs / FCM) | OS-level persistent connection; battery-friendly | Production | Low: provider SDK integration | Mobile app where battery and background limits dominate |
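
The "wasted requests" and "latency floor" costs of polling can be estimated with simple arithmetic. The following is a back-of-envelope sketch, assuming events arrive independently at a steady average rate (a Poisson model, which is an assumption of this example and not something the table states). The function names are illustrative.

```python
import math


def empty_poll_fraction(events_per_min, interval_s):
    """Probability that a poll finds nothing new, assuming Poisson arrivals."""
    expected_events = events_per_min * interval_s / 60.0
    return math.exp(-expected_events)


def mean_added_latency_s(interval_s):
    """Average wait between an event occurring and the next poll seeing it."""
    return interval_s / 2.0


# Compare a 10-second poll interval across three assumed event rates.
for rate in (0.1, 1.0, 10.0):
    wasted = empty_poll_fraction(rate, interval_s=10)
    print(f"{rate:>5} events/min, 10s poll: {wasted:.0%} empty polls, "
          f"{mean_added_latency_s(10):.0f}s mean added latency")
```

Under this model, a 10-second poll at 1 event per minute comes back empty about 85% of the time while still adding 5 seconds of average delay. Shortening the interval reduces the delay and raises the share of empty polls, which is the trade-off a push channel removes.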

Webhooks

Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.

| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| EventBridge / Kafka topic subscriptions | Replay and consumer groups; on Kafka, per-partition ordering and exactly-once processing (EventBridge delivers at-least-once) | Production | High: architectural shift, requires both sides in compatible infra | Sender and receivers within same cloud / org; ordering matters |
| CloudEvents spec for payloads | Cross-vendor portable event schema; type / source / id standardized | CNCF Graduated | Low: schema-level only, sits on top of HTTP POST | Multi-vendor event flow with multiple sender / receiver implementations |
| WebSub (PubSubHubbub) | Discovery via topic URLs; subscriber-to-hub model | Niche: stable but narrow adoption | Low | Open content distribution (RSS/Atom-like) at internet scale |
| Polling /events?since=cursor as primary | Receiver-paced, replay-friendly, no signature complexity | Production | Low: usually paired with webhooks anyway | Receiver wants control over rate / replay; sender wants no per-receiver retry state |
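
The receiver-paced, replay-friendly property of cursor polling comes from the sender keeping one append-only log instead of per-receiver retry state. The following is a minimal in-memory sketch of what a handler behind `GET /events?since=cursor` might do. `EventLog` and its methods are hypothetical names, and a real service would back this with durable storage.

```python
class EventLog:
    """Append-only log; a cursor is the sequence number of the last event seen."""

    def __init__(self):
        self._events = []  # list of (seq, payload); seq starts at 1

    def append(self, payload):
        seq = len(self._events) + 1
        self._events.append((seq, payload))
        return seq

    def read_since(self, cursor=0, limit=100):
        """Return (events after cursor, next cursor to send back)."""
        # seq == index + 1, so events after `cursor` start at index `cursor`.
        batch = self._events[cursor:cursor + limit]
        next_cursor = batch[-1][0] if batch else cursor
        return batch, next_cursor


log = EventLog()
for name in ("invoice.created", "invoice.paid", "invoice.refunded"):
    log.append({"type": name})

batch, cursor = log.read_since(0, limit=2)   # receiver chooses its own pace
print(batch, cursor)
batch, cursor = log.read_since(cursor)       # resume from where it left off
print(batch, cursor)
replay, _ = log.read_since(0)                # replay is a read from an old cursor
print(len(replay))
```

The receiver stores only its cursor, so recovery after downtime is a read from the last saved value, and the sender never tracks delivery attempts per receiver.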

SSE

A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.

| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| WebTransport | Multi-stream + unreliable datagrams over QUIC; replaces SSE+WS combo | Baseline (Mar 2026); Safari 26.4 shipped support | Medium: new client API, HTTP/3 edge required | Need datagrams plus reliable streams in one session; lossy mobile networks |
| gRPC server-streaming + Connect | Typed schemas, binary efficiency, browser support via Connect | Production | Medium: protobuf adoption | You control client and want types over text/event-stream |
| HTTP/3 SSE | Head-of-line blocking elimination on lossy networks | Production | Low: transparent | Mobile-heavy or globally distributed SSE consumers |
| RSocket (reactive streaming) | Application-layer backpressure (credit-based), 4 interaction models | Niche | High: language ecosystem narrow (Java / TS / Go) | Reactive stack (Spring Reactor) that wants backpressure as a first-class concept |

WebSockets

Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.

| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| WebTransport (QUIC) | Multi-stream, datagrams, no head-of-line blocking, native QUIC | Baseline (Mar 2026) | Medium: new client API, HTTP/3 edge required, server stack differs | Greenfield real-time apps where datagrams or multi-stream matter (cloud gaming, observability streams) |
| WebSocketStream API | Native backpressure via ReadableStream / WritableStream | Early: Chromium-only (shipped in Chrome 124), not yet standardized | Low: same protocol, new API surface | Streaming large payloads or high-throughput per-connection |
| WebRTC data channels | UDP-based option, peer-to-peer, configurable reliability | Production | High: different protocol stack entirely | Sub-100ms p99 needed, unreliable delivery is tolerable, and a peer-to-peer topology fits |
| RSocket | Credit-based backpressure built into the protocol, 4 interaction models | Niche | High | Reactive Java / Spring stack with strong backpressure requirements |
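
Credit-based backpressure means the sender may emit only as many messages as the receiver has explicitly asked for. The following is a toy sketch of the idea, not an RSocket client: `CreditedStream` is a hypothetical class, and its `request(n)` method plays the role that the REQUEST_N frame plays in RSocket.

```python
from collections import deque


class CreditedStream:
    """Producer side of a stream that may only emit while it holds credits."""

    def __init__(self, items):
        self._pending = deque(items)
        self._credits = 0
        self.delivered = []  # stands in for frames written to the wire

    def request(self, n):
        """Receiver grants n more credits, then the producer drains what it can."""
        self._credits += n
        self._drain()

    def _drain(self):
        # Emit one item per credit; stop when credits or items run out.
        while self._credits > 0 and self._pending:
            self.delivered.append(self._pending.popleft())
            self._credits -= 1


stream = CreditedStream(range(10))
print(stream.delivered)   # nothing is sent before the receiver asks
stream.request(3)
print(stream.delivered)
stream.request(2)
print(stream.delivered)
```

A slow consumer simply stops granting credits, so the producer's queue grows on the sending side where it can be bounded or shed, instead of in socket buffers that neither side can inspect.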

gRPC

Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.

| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| Connect protocol (Buf, CNCF) | HTTP/1.1 + JSON fallback, browser-native, wire-compatible with gRPC | Production (CNCF, 2026); used by CrowdStrike, PlanetScale, Bluesky, Dropbox | Low: wire-compatible, drop-in for many cases | Want gRPC's schema + codegen with REST's debuggability and zero-proxy browser support |
| gRPC-Web | Browser support via Envoy / similar proxy | Production | Low: sidecar proxy | Need browser clients but don't want to switch protocols server-side |
| gRPC over HTTP/3 | Removes the TCP head-of-line blocking that HTTP/2 inherits; better mobile behavior | Emerging: Connect and Tonic experimenting; gRPC-core not stable yet | Low: protocol-level upgrade | Mobile or globally-distributed gRPC traffic over lossy networks |
| Cap'n Proto / FlatBuffers RPC | Zero-copy deserialization, microsecond-tier latency | Niche | High: different schema language and toolchain | Latency budget is single-digit microseconds (HFT, game-server tick loops) |

WebRTC

Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.

| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
| --- | --- | --- | --- | --- |
| Media over QUIC (MoQ) | Pub/sub media delivery over QUIC; relay-friendly; simpler than full WebRTC stack | IETF Draft-17 (May 2026); Cloudflare relay network, nanocosmos in production at low hundreds of thousands of concurrent | High: new spec, evolving across drafts | Live streaming where 200-300ms is acceptable and you want CDN-scale economics |
| WebTransport (data-only use) | Low-latency reliable + unreliable without media/codec stack | Baseline (Mar 2026) | Medium: different API surface | You want sub-100ms data, not media, and don't need codec / jitter buffer / SFU |
| WebCodecs API | Low-level codec access; build custom media pipeline (cloud gaming, custom encode) | Production in Chrome/Edge; Safari/Firefox shipping | Medium: pair with WebTransport for end-to-end control | Custom encoding / decoding outside WebRTC's negotiated codec set |
| SFU-as-a-service (LiveKit, Daily, Cloudflare Calls, 100ms) | Removes operational burden of TURN / SFU / signaling | Production | Low: SDK swap | Don't want to run media infra; trade per-minute cost for zero ops |