Communication Protocols — PE Trade-Off Analysis
REST, Polling, Webhooks, SSE, WebSockets, gRPC, WebRTC — analyzed as a layered stack of application-layer transports, with reframed sharding / replication / fault-tolerance semantics suitable for protocols rather than databases.
Category Sweep · As of 2026-05-30
These seven are not alternatives. They are points on a 4-axis decision space: direction (pull vs push vs bidi vs peer), state (stateless vs sticky vs session vs peer), ordering (none vs per-stream vs total), and infra coupling (browser-native vs HTTP intermediaries vs custom). Pick by failure mode, not by feature. The interview signal is naming the inflection point where the trade-off flips.
Best default choices
1. Trade-Offs
Per-protocol. Each row is a real "give up X for Y" decision, not a feature.
REST HTTP/1.1 · HTTP/2 · HTTP/3
Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Stateless request-per-resource | Uniform horizontal scaling, every box identical, zero session affinity | Auth re-validated every request, no server-driven state push without polling | User triggers an action whose result arrives 5s later via a background job, and the only way to learn that is the client polling | Statelessness is real only if you don't smuggle state into cookies, JWT payloads, or CDN cache keys. Most "stateless" REST stacks are sticky in three hidden places. |
| HTTP verbs over custom RPC | Cache at every layer (browser, CDN, reverse proxy) keyed on method+URL | Forces CRUD shape onto domain operations that are not CRUD | "Refund this transaction with reason X" gets shoehorned into POST /refunds and you lose half the domain language | REST orthodoxy is a tax. Pragmatic teams ship RPC-style POST endpoints and accept the cache loss. The real question is whether your reads can be cached, not whether your writes can be RESTful. |
| JSON over the wire | curl, browser devtools, every language has a parser, schema is optional | 3 to 10x larger payload vs protobuf, CPU on serialize/parse at high QPS | Mobile client on 3G in emerging market loads a 200KB JSON response in 4s when a protobuf equivalent loads in 600ms | JSON cost shows up at the egress bill before it shows up in latency. At 50K QPS with 5KB responses, swapping to protobuf saves ~$30K/mo in some clouds before you talk about latency. |
| HTTP/1.1 connection model | Universal proxy / LB / WAF / CDN compatibility, zero special-casing | Head-of-line blocking on a single connection, browser caps at 6 connections per origin | Page with 30 API calls serializes through 6 connections, and p95 page load is 2s slower than over HTTP/2 | HTTP/2 fixes this, but only if the entire path supports it. Many internal LBs terminate HTTP/2 at the edge and re-emit HTTP/1.1 to backends, and you get half the benefit. |
| URL / header versioning | Multiple API versions can coexist on the same fleet, gradual migration | No runtime type checking, breaking changes are entirely a coordination problem | v2 is deployed, v1 is supposed to be deprecated, and you find a customer still on v1 the day after you delete the code | The right invariant is "no breaking changes, ever" — only additive evolution and explicit deprecation windows. Versioning is a workaround for not committing to that discipline. |
| Verb-based idempotency (GET, PUT, DELETE) | Safe-by-default retries on idempotent methods, no extra protocol | POST has no idempotency unless you build it (Idempotency-Key header pattern) | Network blip mid-checkout, client retries POST /orders, customer gets charged twice | Idempotency-Key with a 24h server-side dedup window is non-negotiable for any POST that mutates money or external state. Stripe's pattern is the reference, not the exception. |
| Standard auth (cookies, Bearer, mTLS) | Drop-in for every identity provider and gateway | No connection-level identity — every request re-validates the token | JWT validation hits a remote JWKS endpoint, your service is now 50ms slower per request and coupled to the IdP's uptime | Cache JWKS aggressively with a TTL shorter than the longest-lived token, and assume the IdP will have an outage at the worst time. Local validation with periodic refresh, not per-request fetch. |
| Pull-only by design | Receiver in full control of pace, backpressure, and retry | No server-initiated notifications without a second mechanism (SSE / WebSocket / push) | "User got a new message" requires polling at 10s intervals — your battery and bandwidth team will both find you | REST + SSE is a perfectly valid hybrid. The mistake is treating REST as needing replacement when you need push; you only need push for the push channel, not the rest of the API. |
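
The Idempotency-Key row above can be sketched as a small dedup store. This is a minimal illustration, not a reference implementation: `IdempotencyStore`, `run_once`, and `create_order` are hypothetical names, the dictionary stands in for a shared store such as Redis, and the sketch ignores concurrent requests that carry the same key.

```python
import time
from typing import Callable, Dict, Tuple

DEDUP_WINDOW_SECONDS = 24 * 60 * 60  # the 24h window named in the table


class IdempotencyStore:
    """In-memory stand-in for a shared dedup store (assumption)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._seen: Dict[str, Tuple[float, dict]] = {}

    def run_once(self, key: str, handler: Callable[[], dict]) -> dict:
        """Run handler for a new key; replay the stored response for a retry."""
        now = self._clock()
        entry = self._seen.get(key)
        if entry is not None and now - entry[0] < DEDUP_WINDOW_SECONDS:
            return entry[1]  # retry inside the window: no second side effect
        response = handler()
        self._seen[key] = (now, response)
        return response


charges = []  # records the side effect that must not happen twice


def create_order() -> dict:
    charges.append(4200)
    return {"order_id": len(charges), "status": "created"}


store = IdempotencyStore()
first = store.run_once("key-abc", create_order)  # original POST /orders
retry = store.run_once("key-abc", create_order)  # client retry after a blip
```

The retry returns the stored response instead of charging again. A production version would also need an atomic "claim the key" step so two in-flight requests with the same key cannot both run the handler.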
Polling Short · Long · Conditional
Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Trivial implementation | A loop, an HTTP client, a timer — works in 20 minutes | Latency ceiling equals the poll interval, by definition | Stakeholder demos a "real-time" feature at 10s polling, then asks why it's not real-time | Polling is the right answer more often than engineers admit. The shame around it leads teams to over-engineer SSE/WS solutions when 30s polling would cost 1% as much. |
| Stateless client and server | Any box can serve any poll, identical to REST scaling story | No notion of "what's new since last poll" without explicit cursors | Client polls and gets the same 50 items every time, must dedupe locally, miss events under reordering | Watermark or sequence-number based delta polling is the only sane pattern at scale. Full-state polling at 10K clients is a self-inflicted DDoS. |
| Works through any firewall / proxy | Zero special-case infra; corporate proxies, transparent NATs, all pass it | Mobile radio wakes for every poll, kills battery and data plan | iOS app polls every 30s in background, App Store reviewer flags it for power use, you ship a fix that quietly disables the feature | Mobile push (APNs / FCM) is the real polling-killer on mobile, not WebSockets. The OS already maintains one persistent connection — piggyback on it. |
| Long polling reduces tail latency | Delivery latency drops from as much as a full poll interval (interval/2 on average) to milliseconds when events occur | Each connected client holds a server socket for the full timeout window | 10K clients each holding a 30s long-poll = 10K open sockets at all times, and your origin server runs out of file descriptors | Long polling is structurally SSE with worse ergonomics. If you're already paying the socket cost, switch to SSE and get reconnect + event-id for free. |
| Trivial backoff and retry | Exponential backoff is a one-liner, no protocol-level reconnect logic | N clients on synchronized intervals create thundering herds at every tick | Cron-style polling at :00 every minute drives a 100x QPS spike for 200ms, every minute, forever | Jitter is mandatory, not optional. ±20% randomization on every interval. Server-side rate-limit is the second line, not the first. |
| Reuses HTTP cache / auth / observability | Every request is a normal HTTP call, every existing tool works | Most polls return 304 / empty, wasting the entire request budget | 90% of your /events endpoint requests return "nothing new" but each one costs an auth check, a DB query, and a CDN miss | Conditional requests (If-None-Match with an ETag, or If-Modified-Since) against origins that honor them cut empty-poll cost by 80%. Most poll endpoints ignore conditional headers; honoring them is the highest-ROI optimization in this stack. |
| No server-side fanout complexity | No pub/sub bus, no consumer group, no offset tracking | Each client at a different "tick" — cross-client consistency is on you | User A sees a new message, switches devices, device B is between polls, the conversation history disagrees for 8 seconds | Polling cannot give you per-user causal ordering across devices without per-user state. Once you add that state, you've reinvented pub/sub with worse semantics — at which point go SSE. |
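
The jitter and backoff rows above combine into one small function. This is a sketch under stated assumptions: `next_poll_delay` and its parameters are hypothetical names, the ±20% figure comes from the table, and the 300s cap is an arbitrary illustration.

```python
import random
from typing import Optional


def next_poll_delay(
    base_seconds: float,
    failures: int = 0,
    jitter: float = 0.2,
    max_seconds: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Exponential backoff on consecutive failures, then +/- jitter."""
    rng = rng if rng is not None else random.Random()
    # Double the interval per consecutive failure, up to the cap.
    backoff = min(base_seconds * (2 ** failures), max_seconds)
    # Spread clients across the interval so they do not tick together.
    return backoff * rng.uniform(1.0 - jitter, 1.0 + jitter)


rng = random.Random(7)  # seeded only so the demo is repeatable
healthy = [next_poll_delay(30, rng=rng) for _ in range(5)]
failing = [next_poll_delay(30, failures=n, rng=rng) for n in range(5)]
```

With a 30s base, every healthy delay falls between 24s and 36s. Jitter is applied after the cap, so a capped delay can still exceed `max_seconds` by up to 20%, which keeps backed-off clients from re-synchronizing at the ceiling.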
Webhooks HTTP POST callbacks
Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Push semantics, no persistent client connection | Receiver doesn't pay socket cost when idle, sender pays nothing per receiver | Sender controls when delivery happens, receiver cannot pull on demand | Sender's queue backs up during incident, receivers see no events for hours, then 10x burst on recovery | Pair webhooks with a /events GET fallback. Webhooks alone make lost events permanently lost; the GET fallback makes them recoverable. |
| Receiver scales independently of sender | Receiver can run on serverless, autoscale on POST volume | No exactly-once delivery — duplicates are routine and your problem to dedup | Stripe retries on receiver 5xx, you ack 200 but the response was lost in transit, you process the same event twice | Event-id dedup with a TTL'd seen-set (Redis or DDB) is the standard pattern. The TTL must exceed the sender's retry window, which means reading the sender's docs carefully. |
| Standard HTTP at receiver | No new infra: ALB / API Gateway / Lambda handle webhook POSTs directly | No backpressure: sender decides rate, receiver absorbs or returns 429 | Marketing sends a campaign, Stripe fires 1M payment events in 2 minutes, your queue-less receiver endpoint runs out of DB connections | Receiver pattern: accept POST → enqueue to SQS / Kafka → ack 200 immediately. Synchronous processing inside the webhook handler is the most common production fire I've seen. |
| Decoupled deploy lifecycles | Sender and receiver release independently, no shared library | Schema evolution coordinated only by changelog, no compile-time check | Sender adds a required field, receiver breaks on parse, support tickets flood in two days later when retry buffer drains | CloudEvents spec + schema registry + additive-only evolution. The contract is the schema, not the URL. Most teams treat the URL as the contract and learn this twice. |
| Public receiver enables cross-org integration | B2B integrations work without VPC peering or shared identity | Signature verification mandatory — sender IP rotates, TLS alone doesn't prove origin | Receiver trusts the source IP, sender migrates clouds, receiver starts rejecting valid events | HMAC signature with a shared secret, timestamp in the signature payload, reject signatures older than 5 minutes. This is the minimum, and most public webhook integrations get all three wrong somewhere. |
| Retry-with-backoff is standard | Transient failures self-heal without app-level logic | Bounded latency goes out the window — retried events can land hours late | Receiver was down for 1h, sender retries with backoff, your "real-time" event arrives at 11pm when the daytime team is gone | Webhooks are at-least-once with unbounded latency. If your business logic depends on event ordering or timeliness, webhooks are wrong; you want a queue with consumers. |
| No persistent state on sender side | Sender's webhook delivery service is bounded — fire-and-forget after retry exhaustion | No replay primitive — exhausted retries are gone unless sender separately retains | Receiver discovers a bug after a week, asks sender to "resend last 7 days", sender's retry buffer is 72h | Always pair webhooks with an idempotent /events?since=cursor GET endpoint that can backfill. The webhook is the cache; the GET is the source of truth. |
| Receiver can fan out internally | One public endpoint, many internal consumers via local pub/sub | No cross-event ordering — parallel HTTP delivery is unordered by definition | "Order created" arrives after "Order shipped" because they took different network paths, your state machine rejects shipped events | If ordering matters, partition by entity id at receive and serialize per-partition. Or use Kafka. Webhooks are not the right transport for ordered streams. |
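
The signature-verification row above (HMAC, timestamp inside the signed payload, 5-minute rejection window) can be sketched with the standard library. The signing layout `"<timestamp>.<body>"` and the names `sign` and `verify` are assumptions for illustration; a real integration must follow whatever layout the sender documents.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 5 * 60  # reject signatures older than 5 minutes


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<body>' (assumed layout)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # stale or future-dated: likely a replay
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)


secret = b"shared-secret"  # hypothetical value
body = b'{"id": "evt_1", "type": "payment.succeeded"}'
sent_at = 1_700_000_000
signature = sign(secret, sent_at, body)
accepted = verify(secret, sent_at, body, signature, now=sent_at + 30)
```

Because the timestamp is part of the signed message, an attacker cannot replay an old body with a fresh timestamp without knowing the secret. Verification must run against the raw request bytes, before any JSON parsing or re-serialization.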
SSE — Server-Sent Events text/event-stream over HTTP
A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Single long-lived HTTP stream | Server push without a new protocol; works through HTTP-aware infra | One-way only — client-to-server requires a separate POST | Building chat on SSE, every send is a POST to /messages plus a stream listener for receives, two channels to keep in sync | SSE is the right answer when the read path dominates and writes are infrequent (LLM streaming, dashboards, notifications). It is the wrong answer when read and write are interleaved at high frequency. |
| Auto-reconnect with Last-Event-ID | Browser reconnects automatically on drop, server can resume from id | Server must remember per-event-id state long enough to honor it | Server restarts, in-memory event log is gone, client reconnects with Last-Event-ID and gets nothing — silent gap | Persist the event log to Redis Streams or Kafka with a retention longer than the worst-case reconnect window. In-memory event store is fine for demos and not much else. |
| Standard HTTP infra compatibility | Works through proxies, LBs, CDNs (with buffering off), no upgrade handshake | Text-only (UTF-8 event stream format) | Binary payload needed, base64 encoding adds 33% size, you're now defeating the bandwidth point | If you need binary at scale, SSE is the wrong tool. WebSockets or gRPC-streaming. Don't fight the format. |
| Built-in event-id, retry, and event-type semantics | Replayable streams, configurable client-side retry, named event channels — all in the protocol | No multiplexing — one stream = one logical channel unless you multiplex at application layer | Dashboard needs "metrics" stream and "alerts" stream — you open two SSE connections, hit the HTTP/1.1 6-connection cap fast | HTTP/2 makes the multi-stream problem go away (streams share the connection). HTTP/1.1 SSE at 6+ streams is a debugging nightmare. |
| Trivial REST integration | Add /stream endpoint to existing API, same auth, same observability | No bidirectional, so any "send me commands" channel needs a separate POST + correlation | LLM streaming with tool-use mid-response requires the client to POST a tool result while the stream is still flowing | Anthropic, OpenAI, and the rest standardized on SSE for LLM streaming. The reason: response streaming is one-way and writes are batch (one prompt). The shape matches. |
| HTTP/2 = one connection, many streams | No browser connection-cap pressure, streams multiplex naturally | HTTP/1.1 SSE caps at 6 streams per origin per browser — hard ceiling | SaaS app opens an SSE stream per tab, user has 7 tabs open, the 7th hangs forever | HTTP/2 is the only sane SSE transport in 2026. If your stack is HTTP/1.1, you have a bigger problem than SSE. |
| Same-origin / CORS rules as REST | Auth via cookies just works; no protocol-upgrade weirdness | No cross-origin without explicit CORS; some intermediaries strip the stream headers | SSE stream behind Cloudflare with default buffering on, you never receive an event, browser stays connected forever | Set X-Accel-Buffering: no and Cache-Control: no-cache. Test through the full CDN/LB stack, not localhost. The buffering footgun is the most common SSE production fire. |
| Stream is just text | curl, browser devtools, any HTTP client can consume the stream raw | No compression option in the spec (gzip on the response works, but per-event compression doesn't) | Streaming a high-rate metrics feed at 10KB/event/sec, you're shipping 10x the bytes a binary protocol would | Application-layer compression (compress payload bytes, send base64) defeats the debuggability point. If size matters that much, switch to WebSockets with permessage-deflate or gRPC. |
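
The event-id and Last-Event-ID rows above can be sketched as a formatter plus a bounded replay log. `format_sse` and `EventLog` are hypothetical names, and the in-memory deque stands in for a durable log such as Redis Streams or Kafka; the point of the sketch is that a gap is reported explicitly instead of silently.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def format_sse(event_id: int, event: str, data: str) -> str:
    """Serialize one event in text/event-stream framing."""
    lines = [f"id: {event_id}", f"event: {event}"]
    # Multi-line payloads become one data: line per line of payload.
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


class EventLog:
    """Bounded in-memory log; old events are evicted when it is full."""

    def __init__(self, capacity: int = 1000) -> None:
        self._events: Deque[Tuple[int, str, str]] = deque(maxlen=capacity)
        self._next_id = 1

    def append(self, event: str, data: str) -> int:
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, event, data))
        return event_id

    def replay_after(self, last_event_id: Optional[int]) -> Optional[List[str]]:
        """Frames newer than last_event_id, or None when there is a gap."""
        if last_event_id is None:
            return []  # fresh client: nothing to replay
        if last_event_id >= self._next_id:
            return None  # client is ahead of us, e.g. the log was reset
        oldest = self._events[0][0] if self._events else self._next_id
        if last_event_id < oldest - 1:
            return None  # needed events were already evicted
        return [format_sse(*e) for e in self._events if e[0] > last_event_id]


log = EventLog(capacity=3)
for n in range(1, 6):
    log.append("metric", f"value={n}")  # ids 1..5, only 3..5 retained
frames = log.replay_after(3)
```

When `replay_after` returns `None`, the server can respond with a full snapshot or a dedicated "resync" event, which turns the silent-gap failure from the table into a visible one.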
WebSockets RFC 6455 · Full-duplex over TCP
Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Full-duplex single connection | Send and receive on the same socket, no second channel, sub-ms RTT once connected | HTTP-layer caching, intermediaries, observability all stop working post-handshake | Production incident, you can't replay traffic with curl, can't tail at the CDN, can't trace at the proxy — you're inside an opaque TCP pipe | WebSockets are an observability cliff. Invest in protocol-aware tracing (W3C Trace Context inside frames, structured logging at app layer) before you have an incident. |
| Low per-message overhead | 2-14 byte frame header, no per-message HTTP headers | No app-layer backpressure visibility — TCP-level only, you can't tell a slow consumer from a fast one | One slow client's TCP send buffer fills, server keeps queuing messages in heap, OOMs on what should be a 10 KB/s feed | Build explicit ack-window backpressure at the application layer. TCP flow control alone is not enough at any nontrivial fanout. The pattern is RSocket-style credit windows. |
| HTTP upgrade handshake | Reuse port 443, TLS, hostname routing, existing edge infra for the handshake | Some L7 LBs and WAFs pass the Upgrade handshake but do not proxy the upgraded connection gracefully afterwards | AWS ALB + WAF rules: handshake succeeds, frames get dropped 30s in, you debug for a week before finding the WAF rule | Test with your full production edge path, not localhost. Cloudflare, Cloud Armor, and ALB all have WebSocket-specific gotchas; document them per environment. |
| Long-lived connection enables low-latency push | Sub-10ms p99 server-to-client for short messages, no per-message setup | Server is stateful — every connected client is a slot on some specific box | Deploy day, you rolling-restart, every client reconnects to a different box, the new box needs to know the client's state — and you didn't think about that | Distributed pub/sub (Redis, NATS, Kafka) between WS servers is mandatory for any non-trivial fanout. Sticky-session-only architectures hit a wall at ~100K concurrent connections. |
| Binary frames | Efficient encoding, supports protobuf / msgpack / flatbuffers / raw bytes | No protocol semantics on top — you build your own framing, heartbeats, ack, ordering | Six months in, you've reinvented half of gRPC and badly: no flow control, no deadlines, no cancellation propagation | If you're building application-level RPC over WebSockets, you should be using gRPC or RSocket. WS is the wrong layer for the problem you're solving. |
| Standard browser API | new WebSocket(url) — works in every browser back to IE 10 | Reconnect, resume, idempotency are all your problem (no Last-Event-ID equivalent) | User commutes through tunnels, WS drops every 90s, your reconnect logic loses 3 messages each time | Application-layer sequence numbers + server-side replay buffer + idempotent message handling. The pattern is well-known; the bug is forgetting it for V1 and learning the lesson on prod traffic. |
| Cross-origin via Origin header | Server can accept or reject by Origin, simple security model | Cookies work; bearer tokens must be passed at handshake (subprotocol or query) or in messages | Token in URL query gets logged at every proxy and CDN, so you've leaked auth tokens in plaintext to every log aggregator on the path | Send the token via the Sec-WebSocket-Protocol subprotocol header or in the first frame, never in the URL. A token in the URL is a CWE-200-class bug waiting to happen. |
| Persistent connection through most proxies | Once handshake succeeds, frames flow through transparent proxies fine | Graceful shutdown during deploys is hard — every open connection blocks the drain or gets dropped | Rolling deploy with 50K connections per pod, each pod tries to drain for 30s, clients see 30s of reconnect storms across the whole fleet | Send a "server-going-down" frame, give clients 60s with jitter to reconnect to the new fleet, then force-close. Connection-shedding plus jittered reconnect is the production pattern. |
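
The ack-window backpressure row above can be sketched as a per-connection sender gated by credits. This is an illustration only: `CreditedSender` is a hypothetical class, the `send` callback stands in for whatever WebSocket library writes the frame, and the credit-grant message format is left to the application protocol.

```python
from collections import deque
from typing import Callable, Deque


class CreditedSender:
    """Outbound queue that only sends while the client has granted credits."""

    def __init__(
        self, send: Callable[[str], None], initial_credits: int, max_queue: int
    ) -> None:
        self._send = send
        self._credits = initial_credits
        self._max_queue = max_queue
        self._queue: Deque[str] = deque()

    def offer(self, message: str) -> bool:
        """Queue a message; False means the consumer is too slow."""
        if len(self._queue) >= self._max_queue:
            return False  # caller should drop, coalesce, or disconnect
        self._queue.append(message)
        self._drain()
        return True

    def grant(self, credits: int) -> None:
        """Called when the client acknowledges and extends the window."""
        self._credits += credits
        self._drain()

    def _drain(self) -> None:
        # Send in order, one credit per message, until either runs out.
        while self._credits > 0 and self._queue:
            self._send(self._queue.popleft())
            self._credits -= 1


sent = []
sender = CreditedSender(sent.append, initial_credits=2, max_queue=3)
results = [sender.offer(f"m{n}") for n in range(1, 7)]  # m1..m6
sender.grant(2)  # client acks two messages
```

The bounded queue is what prevents the heap growth described in the table: once `offer` returns `False`, the server has an explicit signal that this consumer is slow and can shed it instead of buffering without limit.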
gRPC HTTP/2 · Protobuf · 4 streaming modes
Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| HTTP/2 multiplexing with binary protobuf | 5-10x smaller payloads vs JSON, many concurrent RPCs on one TCP connection | Browser-native support is missing — gRPC-Web requires a proxy (Envoy / Connect) | Public API ambition, two months in, every customer asks for "is there a REST version" — you build it on the side, now you have two APIs | Connect protocol (CNCF) gives you gRPC schemas with HTTP/1.1 + JSON fallback and works in browsers without a proxy. For new greenfield, Connect over raw gRPC is the better default in 2026. |
| Codegen from .proto contracts | Type-safe clients and servers in every supported language | Every schema change is a regen + redeploy coordinated across services | Cross-team service contract evolves, three teams have different proto versions checked in, the wire works but the types mismatch at compile time everywhere | Buf Schema Registry (or equivalent) with breaking-change CI and a central source of truth. Without it, gRPC's compile-time safety is a lie — you have N copies of the truth. |
| Strict typed contracts | Wire format errors caught at codegen, no JSON-parse-at-runtime surprises | The loose coupling that REST/JSON allows is gone: additive evolution rules are strict (no reusing field numbers, no required fields without defaults) | Engineer changes a field number "to fix the schema", every existing client breaks silently and you learn from the on-call page | Protobuf evolution rules are non-obvious and unforgiving. Mandatory training for any team adopting gRPC; mandatory `buf breaking` in CI. |
| Four streaming modes | Unary, server-stream, client-stream, bidi — covers SSE and WebSocket use cases in one protocol | HTTP intermediary compatibility breaks — most CDNs and WAFs do not handle HTTP/2 trailers correctly | CloudFront in front of gRPC, half the calls fail with mysterious INTERNAL errors, you spend a week finding the trailer-stripping | Internal-only gRPC is the production sweet spot. External-facing gRPC at scale needs Envoy or equivalent control over the edge path. |
| Built-in deadlines and cancellation | Deadline propagates through call graph — downstream services know when to give up | Binary on the wire — no curl, mandatory tooling (grpcurl, BloomRPC, Postman gRPC, Buf Studio) | 2am on-call page, "service is slow", you can't curl the endpoint to see what's happening — you need a working grpcurl install on the bastion | Reflection enabled in non-prod environments, disabled in prod. grpcurl scripts in the runbook. The cost of "binary wire" is felt at incident time, not coding time. |
| Channel-level keepalive and pooling | One channel, many concurrent calls; ping/pong frames keep NAT mappings alive | L4 round-robin LBs fight HTTP/2 multiplexing — every call from one client goes to the same backend | Behind an NLB, 90% of QPS lands on one pod, you scale horizontally and CPU utilization stays the same | Client-side load balancing (round-robin per call, not per connection) or a real service mesh (Istio, Linkerd, Consul Connect). NLB / NLB-equivalent + gRPC is a known anti-pattern. |
| Wire-format efficiency | Protobuf is ~5-10x smaller than JSON for the same data, parsing is ~10x faster | Debuggability cost — no human-readable wire, no field names on the wire (just tags) | PCAP from a production trace is useless without the .proto file; rolling forward to a new schema and reading old captures is now a tooling problem | Keep the .proto file checked in with every release tag. Re-decoding old PCAPs is a real forensic need; if you don't have the schema, you don't have the data. |
| Backpressure via HTTP/2 flow control | Window-based per-stream flow control built into the wire | Polyglot edge needs a gateway (grpc-gateway, Envoy, Connect) for REST/JSON clients | Mobile team is on Swift / Kotlin where gRPC is fine, web team needs JSON, you maintain a transcoder forever | Connect's HTTP/1.1 + JSON fallback eliminates this entirely for the web team. The mental model: gRPC for service-to-service, Connect for everywhere a browser is involved. |
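
The L4 load-balancing row above is easy to see in a toy simulation. The functions and the traffic mix are hypothetical: three clients, one of them responsible for 90% of the calls, each holding a single long-lived HTTP/2 connection in front of ten backends.

```python
from collections import Counter
from itertools import cycle
from typing import List


def per_connection(calls_per_client: List[int], backends: int) -> Counter:
    """L4 behaviour: each client's one connection is pinned to one backend."""
    load = Counter({b: 0 for b in range(backends)})
    for client, calls in enumerate(calls_per_client):
        load[client % backends] += calls  # every call follows the connection
    return load


def per_call(calls_per_client: List[int], backends: int) -> Counter:
    """Client-side balancing: each call picks the next backend in turn."""
    load = Counter({b: 0 for b in range(backends)})
    for calls in calls_per_client:
        picker = cycle(range(backends))
        for _ in range(calls):
            load[next(picker)] += 1
    return load


def hottest_share(load: Counter) -> float:
    """Fraction of all calls handled by the busiest backend."""
    return max(load.values()) / sum(load.values())


traffic = [9000, 500, 500]  # assumed calls per client
pinned = hottest_share(per_connection(traffic, backends=10))
spread = hottest_share(per_call(traffic, backends=10))
```

With connection pinning, the busiest backend carries the heavy client's entire load while seven backends sit idle, so adding pods changes nothing. Per-call balancing spreads the same traffic evenly.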
WebRTC SRTP · DTLS · ICE / STUN / TURN
Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Peer-to-peer media path | Sub-100ms end-to-end latency, server bandwidth scales sublinearly with users | Server can't see / record / moderate the stream without a media server (SFU or MCU) | Compliance requires call recording, you didn't budget for an SFU, retrofitting it forces every client to upgrade | Plan for the SFU on day one even if you start with mesh. Mesh-only architectures hit recording, transcoding, and AI-analysis walls and need a rewrite, not a feature. |
| NAT traversal via ICE/STUN/TURN | Works through 80%+ of consumer NATs without manual port forwarding | Three pieces of infra to run (signaling, STUN, TURN), TURN bandwidth is metered and expensive | 15-20% of users behind symmetric NAT or carrier-grade NAT fall back to TURN, your TURN egress cost dominates the bill | Twilio / Cloudflare / Xirsys for TURN-as-a-service if you don't have geographic scale to run your own. The crossover point is roughly 10K concurrent TURN relays. |
| Adaptive bitrate via SDP / RTCP | Codec, resolution, and FEC adapt to network conditions automatically | Deterministic latency goes out — adaptation can stall frames or drop quality under sustained loss | Cloud gaming or trading floor application needs predictable 50ms latency, WebRTC's adaptation makes that bound impossible to commit to | For "must hit X ms p99" use cases, configure the codec aggressively (forced bitrate, FEC always-on, jitter buffer floor) — accept the artifacts. Or use a different transport. |
| SRTP for media, DTLS for data channel | End-to-end encrypted, no middle-box can inspect | No middlebox observability — debugging in production is "look at WebRTC internals in chrome://" | Customer reports bad audio, you can't pcap-decode the RTP, you can't reproduce, you're guessing from getStats() metrics | getStats() polling at 1s intervals into your observability stack is mandatory. The metrics are good (jitter, packet loss, RTT) but you have to collect them yourself per-call. |
| Direct peer bypasses server bandwidth | 1-on-1 calls cost almost zero server bandwidth (only signaling) | Mesh topology breaks above 4 peers — each client uploads N-1 streams | Sales demo expands from 4 to 8 people, half the participants' Wi-Fi can't sustain 7 uplink streams, calls drop | SFU at 5+ participants is the rule. The inflection point is upload bandwidth, not server cost. Most clients have asymmetric residential connections; mesh asks for upload they don't have. |
| Data channels for non-media payloads | Low-latency SCTP over UDP for arbitrary data, configurable reliability | Delivery guarantees are a per-channel setting you must get right: ordered/unordered, reliable/unreliable | Game state needs reliable ordered, but you configured unordered for speed, now state replication has gaps | Multiple data channels with different reliability profiles is the right pattern: one ordered/reliable for control, one unordered/unreliable for high-rate input. The flexibility is the point. |
| Native in every browser, no SDK | getUserMedia() + RTCPeerConnection are built-in, no install or extension | Server-side participation (record, transcode, AI) requires a headless browser or SFU integration | "Add real-time transcription" — you discover this means running gstreamer or a headless Chromium in the cloud, multiplying your infra cost | SFU-as-a-service (LiveKit, Daily, 100ms, Cloudflare Calls) gives you record + transcode + AI ingestion without running media servers yourself. The make-vs-buy crossover is steep. |
| Codec negotiation via SDP | Older clients fall back to widely-supported codecs (Opus, VP8, H.264) | Codec deprecation breaks old clients with no graceful path (no codec = no call) | You drop VP8 support to save SFU CPU, all clients on Android 5 lose video, support tickets explode | Maintain a known minimum codec set for the long tail. WebRTC interoperability is a 2-3 codec problem in practice (Opus + H.264 + one of VP8/VP9/AV1); never less. |
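
The mesh-versus-SFU inflection point in the table above is upload arithmetic. The sketch below assumes one video stream per remote peer with no simulcast, and the 1500 kbps per-stream bitrate is a made-up illustration value; all function names are hypothetical.

```python
def uplink_streams(participants: int, topology: str) -> int:
    """Streams each client must upload."""
    if participants < 2:
        return 0
    if topology == "mesh":
        return participants - 1  # one copy per remote peer
    if topology == "sfu":
        return 1  # one copy to the server, which fans out
    raise ValueError(f"unknown topology: {topology}")


def uplink_kbps(participants: int, topology: str, stream_kbps: int = 1500) -> int:
    """Per-client upload requirement at an assumed per-stream bitrate."""
    return uplink_streams(participants, topology) * stream_kbps


def mesh_connections(participants: int) -> int:
    """Total peer connections in a full mesh: n * (n - 1) / 2."""
    return participants * (participants - 1) // 2


mesh_4 = uplink_kbps(4, "mesh")
mesh_8 = uplink_kbps(8, "mesh")
sfu_8 = uplink_kbps(8, "sfu")
```

Going from 4 to 8 participants more than doubles each client's mesh upload (3 streams to 7), while the SFU requirement stays at a single stream regardless of room size. That is why the limit shows up on residential uplinks before it shows up on the server bill.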
2. Use Cases
Per-protocol. Concrete workloads with the driving property that ruled out the alternative.
REST
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Public SaaS API for third parties | Stripe, GitHub, Twilio, SendGrid | Universal client compatibility + edge caching | ~100M+ requests/day per platform | gRPC fails on developer ergonomics — every customer would need a code generator install |
| Cacheable read-heavy paths | News sites, e-commerce catalogs, retail product pages | 99%+ CDN cache hit ratio, sub-50ms TTFB | 1M+ RPS at edge, <1% reaching origin | GraphQL fragments the cache, gRPC has no edge caching story |
| Mobile app backends with diverse clients | Uber, Instagram early-era, most B2C apps | Polyglot client stack (iOS / Android / Web / TV / partners) | Billions of requests/day | Maintaining gRPC clients for 5 platforms is a full team's work; REST + OpenAPI is one team |
| Webhook delivery target | Any system receiving Stripe / Shopify / GitHub callbacks | Receiver already runs HTTP infra; zero new tech | Highly variable, burst-prone | Sender doesn't speak gRPC; webhook delivery industry standardized on HTTP POST |
| Healthcare / FHIR / regulated exchange | EMR vendors, insurance claim systems | Regulatory-mandated standard (HL7 FHIR over REST) | Per-region, per-payer | Compliance attestation cost rules out anything not in the spec |
| Internal microservice CRUD (polyglot) | Mid-size companies with mixed language stacks | Lowest-friction inter-service contract | 10K-50K QPS per service | gRPC's codegen overhead is hard to justify when teams don't already share a build system |
Polling
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| CI runner heartbeat / job dequeue | GitHub Actions, GitLab Runner, Jenkins agents | Receiver-initiated for security (no inbound ports needed) | Millions of runners polling periodically | Webhooks would require every runner to expose an inbound HTTP endpoint; firewall hostile |
| Mobile background sync | Email apps, calendar clients, RSS readers | OS-level battery and connection constraints favor pull on a schedule | 100M+ MAU per app | OS limits sustained background WebSocket connections; the platform pushes you to polling or APNs/FCM |
| Distributed config / discovery | etcd v2 watch fallback, Spring Config Server, Consul long-poll | Long-poll is a known, simple primitive — zero new infra | 10K+ services polling shared config | Pub/sub bus adds a dependency; polling reuses existing HTTP |
| Slow-changing data feeds | Leaderboards, public stats pages, weather data | Staleness is acceptable (30s-5min) | Millions of clients, low event rate | Push channels are overkill when the event rate is <1/min per client |
| Multi-tenant SaaS workspace state | Tenant-specific dashboard data | Per-tenant isolation simpler with pull | 1K-10K tenants, modest per-tenant QPS | WebSockets force you to solve per-tenant fanout up front, polling defers that decision |
Webhooks
Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Payment event notifications | Stripe, Square, PayPal, Adyen | Async settlement / dispute / refund notification across org boundaries | Millions of events/day, per-merchant fanout | Polling would cost merchants 100x in idle requests; cross-org WebSocket is operationally infeasible |
| SCM repo events to CI | GitHub → Jenkins / CircleCI / GitHub Actions | Sub-second push of code events to build infra | Tens of millions of events/day across GitHub | Polling 50M repos every minute = 833K RPS of "nothing happened" |
| SaaS automation triggers | Zapier, Workato, Make, n8n | Fanout to user-defined workflows in arbitrary destinations | Per-customer flow counts, variable burst | Receivers are external user infra; only HTTP POST is universally accepted |
| Marketplace order / inventory events | Shopify, Amazon Seller, eBay → 3PL systems | Heterogeneous third-party fulfillment integrations | Per-merchant, high burst on promotions | Each 3PL runs different stack; webhooks are the lowest common denominator |
| Observability / alert ingestion | PagerDuty, Opsgenie, Slack incoming webhooks | Many monitoring tools fan-in to one alerting platform | High burst at incident-time | Every monitoring tool already speaks HTTP POST; standardizing on anything else fragments the integration market |
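
The "833K RPS" figure in the SCM row is plain arithmetic, and the same calculation is a quick way to size any polling alternative. A minimal sketch; the helper name `poll_rps` is made up for illustration:

```python
def poll_rps(resources: int, interval_s: float) -> float:
    """Steady-state request rate if every resource is polled once per interval."""
    return resources / interval_s


# Figures from the table row above: 50M repos, one poll per minute.
rate = poll_rps(50_000_000, 60)
print(f"{rate:,.0f} requests/second")  # 833,333 requests/second
```

Almost all of those requests would return "nothing changed", which is the cost webhooks remove.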
SSE
A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| LLM token-by-token streaming | OpenAI, Anthropic, Google Gemini APIs | One-way server push with HTTP-stack reuse for auth / rate-limit / observability | Per-request, millions of concurrent streams platform-wide | WebSocket adds bidi cost that's unused (one prompt, one stream); reconnect semantics matter less for short streams |
| Real-time analytics dashboards | Datadog event stream, Grafana Live, observability UIs | Append-only event feed to UI with auto-reconnect | ~1K-10K concurrent dashboard sessions | WebSocket bidi is wasted; SSE auto-reconnect is built-in |
| Build / deploy progress streaming | CI build logs, CD pipeline UI, Terraform Cloud apply | Push log lines to UI as they happen, no UI polling | Per-build, transient sessions | Polling makes the UI choppy; WebSocket complexity isn't justified |
| Notification / activity feeds | GitHub notifications, Stripe Dashboard live events, JIRA activity | One-way notification push, low frequency, many clients | 10K-100K concurrent users per service | WebSocket adds bidi infra cost without bidi benefit |
| AI agent tool-call streaming | Anthropic Claude, OpenAI Assistants, Bedrock AgentCore | Partial response + tool-use events streamed inline | Per-conversation, long-lived | HTTP/2 SSE reuses existing API gateway infra; WebSocket would require a separate edge path |
| Server-driven UI update feed | Stripe Dashboard, Linear, Vercel deployment UI | Push state deltas to keep UI in sync with server | 100K+ concurrent users per service | SSE is sufficient; WebSocket's complexity buys nothing if writes go through REST |
WebSockets
Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Real-time collaborative editing | Figma, Google Docs, Notion, Linear | Bidirectional state sync with sub-50ms write echo | 100K+ concurrent connections per shard | SSE can't carry client-to-server in same channel; CRDT sync needs both directions interleaved |
| Trading platforms — quotes + orders | Coinbase Pro, IBKR, Robinhood, crypto exchanges | Low-latency bidi quotes plus order submission on the same wire | 1M+ concurrent connections at peak | HTTP-based protocols add too much per-request overhead; SSE alone misses the order-submit path |
| Multiplayer game state and lobby | Roblox lobbies, Among Us, Discord game integrations | Reliable ordered bidi for game state and chat | Per-room 8-100 players, millions of rooms | WebRTC's UDP is overkill for turn-based / lobby state; TCP ordering matters |
| Chat / messaging at scale | Slack, Discord, Twitch chat, WhatsApp Web | Bidi messaging with presence and typing indicators | 10M+ concurrent (Slack peak), 100M+ daily (Discord) | SSE for chat means typing indicators go through a separate POST channel — operational complexity not worth it |
| IoT control panels and dashboards | Smart home web dashboards, factory floor SCADA | Bidi state read + command write, low-latency | Per-deployment, persistent connections | REST can't push state changes; SSE can't carry control commands |
gRPC
Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Internal microservice mesh at hyperscale | Google internal, Netflix, Square, Uber internal | Typed contracts + HTTP/2 multiplexing + 5-10x smaller payloads | 1M+ RPC/sec per service at the larger end | REST overhead (JSON parse + HTTP/1.1 head-of-line) adds CPU and latency at every hop in a 10-hop call graph |
| Service mesh control plane | Envoy xDS, Istio control plane, Linkerd | Bidi-streaming for config push to thousands of proxies | 10K+ Envoy instances per control plane | REST polling for config doesn't scale; SSE is one-way; bidi gRPC is the natural fit |
| ML model serving | TensorFlow Serving, NVIDIA Triton, Vertex AI | Binary tensor payloads, server-streaming for batched inference | 1M+ inferences/sec per cluster | JSON serialization for tensors is 5-20x larger and 10x slower to parse |
| Database client protocols | Spanner, CockroachDB, etcd v3, FoundationDB clients | Typed query results, streaming for large scans, connection multiplexing | Per-cluster, sustained connection count | Custom binary protocols would work but gRPC gives you the language ecosystem for free |
| Mobile-to-backend with bandwidth pressure | Square POS, Lyft driver app, emerging-market apps | Small payloads matter on cellular; protobuf delivers | 100K+ devices, variable connectivity | REST/JSON costs 5-10x bandwidth, which on emerging-market data plans is the difference between usable and not |
WebRTC
Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Video conferencing | Zoom web client, Google Meet, Microsoft Teams web, Whereby | Sub-100ms peer media with adaptive codec / FEC | Mesh is practical only for a handful of peers (roughly 4 or fewer); SFU beyond that, up to ~49 on-screen participants | WebSockets carry no audio/video codec stack; building the SRTP / jitter buffer / FEC layer yourself is multi-year work |
| Live customer support video | Intercom video, Zendesk Talk, in-product video help | Low-latency 1-on-1 video without server bandwidth | 1-on-1, on-demand, low concurrency | Server-mediated video costs egress bandwidth per session; P2P avoids it |
| Cloud gaming / remote desktop | NVIDIA GeForce Now, Xbox Cloud Gaming, Parsec | Sub-100ms latency video + input over UDP with adaptive bitrate | Per-session, 60fps sustained | WebSocket / TCP can't recover from packet loss fast enough; UDP plus FEC is the only option |
| P2P file / data transfer | file.pizza, Snapdrop, magic-wormhole-web, browser-based AirDrop alternatives | Direct browser-to-browser without server storage | Pairwise, transient | Server-side upload doubles bandwidth cost and adds privacy concerns |
| Live audio rooms (many-to-many) | Clubhouse, Discord Stage, Twitter Spaces, Telegram voice chat | Many-to-many audio via SFU with jitter buffer and codec adaptation | Thousands of listeners per room | SFU fanout requires WebRTC-aware media servers; WebSocket audio would require custom codec / jitter / packet-loss handling |
3. Limitations
Multi-protocol matrix. Severity reflects how often the limitation forces an architectural choice.
| Limitation Axis | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
|---|---|---|---|---|---|---|---|
| Server push capability | High: None natively; requires SSE/WS side-channel | Critical: Pull-only by definition; latency = interval | Low: Push is the whole point; sender-initiated | Low: One-way server-to-client native | Low: Full-duplex native | Low: Server-streaming and bidi modes native | Low: Native bidi peer or via SFU |
| Browser-native client | Low: fetch / XHR universal | Low: setInterval + fetch | Critical: Browser cannot be a webhook receiver (no public endpoint) | Low: EventSource native everywhere | Low: WebSocket native everywhere | High: Needs gRPC-Web + proxy, or Connect protocol | Low: RTCPeerConnection native, but signaling is on you |
| Intermediary compatibility (CDN / WAF / proxy) | Low: Universal; designed for this | Low: Same as REST | Low: Plain HTTP POST | Medium: Buffering at proxies kills the stream silently | Medium: Upgrade handshake; WAFs / L7 LBs need WS support | High: HTTP/2 trailers stripped by many CDNs / WAFs | High: UDP-based; ICE/TURN needed for hostile NAT/firewall |
| Backpressure / flow control | Low: Per-request, client-paced | Low: Client controls poll rate | High: Sender controls rate; receiver must absorb or 429 | Medium: TCP-level only; no app-layer credit window | High: TCP-only; app-layer flow control is on you | Low: HTTP/2 per-stream flow control built in | Medium: RTCP for media; data channel uses SCTP credit |
| Connection cap per browser per origin | Low: 6 connections on HTTP/1.1, ~100s of streams on HTTP/2 | Low: Same as REST | Low: N/A (server-side concept) | High: 6-connection cap on HTTP/1.1; HTTP/2 required at scale | Medium: ~255 connections per origin in Chrome | Low: HTTP/2 multiplexing on one connection | Low: Per-PeerConnection; no origin cap |
| Observability (curl / pcap / trace) | Low: curl, devtools, every existing tool | Low: Same as REST | Low: ngrok / webhook.site / RequestBin work fine | Low: curl -N streams the response | High: Frame-level tools only; no HTTP-layer visibility | Medium: grpcurl with reflection; Buf Studio; not curl-able | Critical: Encrypted SRTP; getStats() polling is the only window |
| Mobile battery / radio impact | Medium: Per-request radio wake | High: Periodic radio wake-ups; battery hostile | Low: N/A (server-to-server) | Medium: Persistent connection; OS may kill in background | Medium: Persistent; OS-level background limits apply | Medium: Channel keepalive; smaller payloads help | High: Sustained UDP + media codec is CPU-heavy |
| HTTP/3 support maturity | Low: Native; transparent QUIC upgrade widely deployed | Low: Same as REST | Low: Same as REST at receiver | Low: Works over HTTP/3 transparently | Medium: RFC 9220 WebSockets-over-HTTP/3 still emerging in 2026 | Medium: Experimental in gRPC core; production via Connect or Tonic | Low: Not applicable to media or data channels, which already run over UDP (SRTP, SCTP over DTLS); only signaling rides HTTP |
| Auth model uniformity | Low: Cookies, Bearer, mTLS all standard | Low: Same as REST | Medium: HMAC signature non-standardized across providers | Low: Same as REST | Medium: Token-in-URL leaks; use subprotocol header or first-frame auth | Low: Metadata for tokens, mTLS native | High: Signaling auth is separate from media auth; complex |
| Graceful deploy / connection draining | Low: Request-scoped; in-flight drains in seconds | Low: Per-poll drain trivial | Low: Sender retries handle drops naturally | Medium: Client auto-reconnects, but Last-Event-ID coordination needed | High: Custom shutdown protocol + jittered reconnect required | Medium: GOAWAY frame standard; streams must finish or restart | High: Re-ICE / re-SDP needed; calls drop during media-server deploys |
4. Failure & Recovery
Reframed from "Fault Tolerance". For protocols, the failure modes are reconnect, retry, idempotency, and message-loss windows.
| Dimension | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
|---|---|---|---|---|---|---|---|
| Reconnect model | N/A: request-scoped; each call independent | Trivial: next poll attempts | Sender retries with backoff (provider-specific, often exponential up to 24-72h) | Browser auto-reconnects with Last-Event-ID header | Application-implemented; no protocol-level reconnect | Channel reconnects automatically; stream RPCs must be restarted by app | Re-ICE on path failure; full re-SDP if peer connection dies |
| Resume / state-recovery primitive | ETag / If-Match for optimistic concurrency | Watermark / cursor / since-timestamp param | None native; pair with /events?since=cursor backfill API | Last-Event-ID header on reconnect; server replays from that id | App-layer sequence numbers + server replay buffer | No native resume on streams; pagination tokens for unary | NACK / RTX retransmit at media layer; data channel SCTP retransmits |
| Message-loss window | N/A — synchronous response or error | Between polls if events expire from server buffer before next poll | Receiver-side: receive but crash before processing → at-least-once retry covers it | Between disconnect and reconnect, bounded by server's event-id retention | Disconnect to reconnect — entire window of loss unless app buffers | During stream disconnect; unary calls are atomic (response or error) | Datagram loss tolerated for media (FEC / jitter buffer); reliable channels SCTP-retransmit |
| Idempotency primitive | Verb semantics (GET, PUT, DELETE); Idempotency-Key header for POST | Same as REST | Event-id from sender; receiver dedups in seen-set with TTL | Event-id; replay-safe if events are idempotent | App-layer message-id + ack window | App-layer; protocol gives request-id metadata convention | N/A for media; data channel needs app-layer |
| Retry semantics | Client decides; safe retries on idempotent verbs only | Client-driven; next poll is the retry | Sender retries on non-2xx, usually with exponential backoff (Stripe: up to 3 days); some providers, GitHub among them, do not auto-retry and offer manual or API redelivery instead | Browser-controlled retry interval (retry: field), defaults ~3s | Application-implemented; jitter mandatory at scale | Built-in retry policy via service config; per-RPC retryable status codes | Media: FEC + retransmit; data channel: SCTP retransmit policy |
| Backoff strategy | App-layer; conventionally exponential with jitter | Conventionally exponential on empty; constant on event | Sender-controlled; each provider documents schedule | Client-driven; server can hint via retry: | App-implemented; truncated exponential with jitter is standard | Configurable per-service in retry policy | ICE has restart timer; app handles call-level reconnect cadence |
| Server-failure visibility to client | Immediate — HTTP status code on the response | Immediate per poll | N/A — receiver perspective; sender retries | On connection drop; otherwise opaque (no out-of-band signal) | On close frame or TCP RST; timeout-based otherwise | Per-RPC status code; GOAWAY for graceful shutdown | ICE connection-state events; media-quality from getStats() |
| Cross-region failover | DNS + health checks; clients re-resolve and retry | Same as REST | Sender re-resolves receiver URL; receiver's failover is its problem | Client reconnects; new region picks up from Last-Event-ID if state is cross-region replicated | Connection lost on region failure; reconnect to new region, state must be in shared store | Client-side LB or service mesh handles; xDS can shift traffic | Re-ICE to new TURN/SFU; calls drop during failover unless multi-region SFU |
| Half-open / zombie detection | Request timeout (configurable, typically 30s) | Per-poll timeout | Sender timeout (commonly 10-30s); receiver ack required | Server-sent heartbeat comments + client read timeout | Ping/pong frames + app-layer timeout | HTTP/2 keepalive (PING frames) | ICE consent-freshness checks (STUN binding every 5-15s) |
| Heartbeat / keepalive | N/A: request-scoped | The poll itself is the heartbeat | N/A: sender-initiated | Comment lines (:\n\n) every 15-30s; mandatory to keep proxies alive | RFC 6455 ping/pong frames; app-layer heartbeat for NAT-rebind detection | HTTP/2 PING frames; configurable interval / timeout | STUN binding requests; RTCP receiver reports |
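
The "app-layer sequence numbers + server replay buffer" resume primitive in the WebSockets column (and the event-id retention behind SSE's Last-Event-ID) can be sketched as a bounded log. This is an in-memory illustration under assumed names (`ReplayBuffer`, `replay_after`); a real deployment would more likely back it with Redis Streams or Kafka.

```python
from collections import deque


class ReplayBuffer:
    """Bounded log of (seq, payload) pairs so a reconnecting client can resume."""

    def __init__(self, capacity: int) -> None:
        self._events = deque(maxlen=capacity)  # oldest entries fall off the front
        self._next_seq = 1

    def append(self, payload: str) -> int:
        """Store a payload and return the sequence number assigned to it."""
        seq = self._next_seq
        self._events.append((seq, payload))
        self._next_seq += 1
        return seq

    def replay_after(self, last_seen: int):
        """Events newer than last_seen, or None when some were already evicted."""
        oldest = self._events[0][0] if self._events else self._next_seq
        if last_seen + 1 < oldest:
            return None  # message-loss window exceeded: client needs a full resync
        return [(seq, payload) for seq, payload in self._events if seq > last_seen]
```

Returning `None` rather than a partial list makes the message-loss window explicit: the client learns it fell outside retention instead of silently missing events.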
5. Connection Distribution
Reframed from "Sharding". For protocols, the question is how connections distribute across servers, what stickiness is required, and what scale-out looks like.
| Dimension | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
|---|---|---|---|---|---|---|---|
| Stateless across servers? | Yes (when properly designed) | Yes | Yes at receiver; sender keeps retry state | Per-connection stateful; but events can come from shared pub/sub bus | Stateful — each connection bound to a specific server | Channel-stateful; one channel pinned to one HTTP/2 connection | Per-peer-connection state; SFU is the stateful hub |
| Sticky session required? | No (anti-pattern if present) | No | No at receiver | Effectively yes for the lifetime of the stream | Yes — sticky for connection lifetime | Yes per channel; LB must respect HTTP/2 connection affinity | Yes — peer-to-SFU pinning |
| LB type (L4 vs L7) | Either; L7 for path-routing | Either | Either at receiver | L7 required for path routing; L4 works for fleet-level | L7 with WS support; L4 works for raw TCP | L7 with HTTP/2 awareness; L4 NLB causes single-pod hotspots | L4 for UDP; signaling separately L7 |
| Per-server connection ceiling | Limited by FD / thread; ~10K concurrent requests/node typical | Same as REST plus active poll concurrency | Same as REST at receiver | 10K-50K connections/node with event-loop server (Go, Node, Netty) | 50K-100K+ connections/node; ulimit + tcp_mem tuning required | HTTP/2 multiplexing → 1 connection serves many streams; fewer connections than REST | ~2K-5K participants per SFU node typical (CPU bound, not socket bound) |
| Horizontal scale story | Add nodes behind LB; uniform | Same as REST | Same as REST at receiver | Add nodes; shared pub/sub bus (Redis / NATS / Kafka) for cross-node fanout | Add nodes + shared pub/sub bus + session directory if reconnect-targeting needed | Client-side LB or service mesh; xDS-driven endpoint discovery | SFU cluster with cascade / mesh between SFUs; signaling layer is separate |
| Session rebalancing | N/A — no session | N/A | N/A | Drop-and-reconnect; client picks new node | Drop-and-reconnect with jitter; reconnect storm risk | Server can send GOAWAY; client re-establishes channel | Re-ICE or full re-SDP; visible to user as call drop |
| Hot connection behavior | N/A: per-request | High-rate single client looks like many requests; rate-limit by API key | Receiver overload returns 429; sender backs off | A fast producer feeding a slow client can grow the per-connection buffer until the server OOMs, unless the buffer is bounded | Slow-consumer client blocks server send queue; needs app-layer backpressure | HTTP/2 flow control bounds each stream's in-flight data | High-bitrate client drives SFU CPU and bandwidth; simulcast / SVC mitigates |
| Max connections (practical) | Limited by fleet capacity, not protocol | Same as REST | Same as REST at receiver | ~1M concurrent connections across a fleet of 20-50 modern nodes | 10M+ documented at WhatsApp, Slack, Discord scale | Bounded by channel count, not stream count; effectively very high | SFU bound; ~50K-100K concurrent across an SFU cluster typical |
| Cross-shard fanout | N/A — stateless | N/A | Sender fans out to N receivers from its queue | Pub/sub bus delivers events to nodes holding subscriber connections | Pub/sub bus + session directory: who is connected to which node | N/A for unary; bidi-streaming uses pub/sub fanout same as WS | SFU forwards to all participants; cascade between SFUs for cross-region |
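
The "L4 NLB causes single-pod hotspots" entry in the gRPC column is easier to see with numbers. The toy model below is an illustrative assumption, not a real load balancer: 10 hypothetical pods, 3 long-lived client channels, and round-robin selection in both cases.

```python
from collections import Counter
from itertools import cycle

BACKENDS = [f"pod-{i}" for i in range(10)]  # hypothetical backend pool
CHANNELS = 3                                # long-lived client channels
RPCS_PER_CHANNEL = 1_000


def l4_balanced() -> Counter:
    """L4: a backend is picked once per TCP connection; every RPC follows it."""
    pick = cycle(BACKENDS)
    load = Counter()
    for _ in range(CHANNELS):
        backend = next(pick)               # decided at connect time
        load[backend] += RPCS_PER_CHANNEL  # all multiplexed RPCs land here
    return load


def per_rpc_balanced() -> Counter:
    """Client-side round_robin or an HTTP/2-aware proxy: picked per RPC."""
    pick = cycle(BACKENDS)
    load = Counter()
    for _ in range(CHANNELS * RPCS_PER_CHANNEL):
        load[next(pick)] += 1
    return load


print(len(l4_balanced()), "pods busy under L4")          # 3 pods busy under L4
print(len(per_rpc_balanced()), "pods busy per-RPC")      # 10 pods busy per-RPC
```

With connection-level balancing, 7 of the 10 pods sit idle no matter how many RPCs the three channels send, which is why adding pods does not relieve the hotspot.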
6. Fanout / State Sync
Reframed from "Replication". For protocols, this is how one event reaches N receivers, with what ordering and delivery guarantees.
| Dimension | REST | Polling | Webhooks | SSE | WebSockets | gRPC | WebRTC |
|---|---|---|---|---|---|---|---|
| Fanout model | Unicast — client pulls own copy | Unicast — each client pulls | Sender-side multicast — sender posts to each receiver URL | Server-side broadcast — backend fans out via pub/sub to subscriber connections | Server-side broadcast — same pattern as SSE plus bidi | Server-stream / bidi — server fans out via pub/sub on its end | SFU forwards; or mesh (each peer sends to all others, N²) |
| State ownership | Server | Server (with client cursor) | Sender (until ack); receiver after | Server with event-id retention | Server + client (per-connection); shared via pub/sub bus | Server; client tracks stream cursor app-side | Peer + SFU; CRDT / OT for shared state via data channel |
| Multi-server fanout — pub/sub bus required? | No | No | No — fanout is sender-driven HTTP POST | Yes for any cross-node fanout (Redis pub/sub, NATS, Kafka) | Yes — same as SSE | Yes for streaming fanout | SFU is the fanout primitive; cascade for cross-SFU |
| Message ordering guarantees | Per-call only | Client-visible order = poll order; reordering risk if events expire | No order guarantee across events (parallel HTTP delivery) | Per-stream FIFO from server | Per-connection FIFO (TCP-ordered) | Per-stream FIFO; HTTP/2 stream ordering | Media: timestamped, jitter-buffered; data channel: configurable ordered/unordered |
| Delivery guarantee | At-most-once (one response or error) | At-least-once with client dedup on cursor | At-least-once (retries on failure) | At-least-once on reconnect with Last-Event-ID; at-most-once otherwise | App-defined; protocol gives best-effort TCP | At-most-once per RPC; app-layer retry adds at-least-once | Media: best-effort with FEC; data channel: configurable |
| Cross-region fanout | DNS / multi-region origin | Same as REST | Sender hits receivers per-region | Cross-region pub/sub (MSK MirrorMaker, Confluent, NATS leaf nodes) | Same as SSE plus session directory for reconnect routing | Service mesh with global xDS | SFU cascade across regions; latency budget tightens |
| Lag (typical) | Sub-50ms intra-region; per-request | Equal to poll interval | Sub-second to seconds; minutes-hours on retry | Sub-100ms intra-region with healthy pub/sub bus | Sub-50ms intra-region typical | Sub-50ms intra-region | 50-150ms peer-to-peer; +50ms per SFU hop |
| Conflict resolution | App-layer (ETag / If-Match / OCC) | App-layer with cursor | App-layer at receiver (idempotency-key dedup) | N/A — one-way stream | App-layer (CRDT, OT, vector clocks) | App-layer | Media: jitter-buffer ordering; data channel: app-layer (CRDT typical) |
| Behavior during network partition | Request fails fast; client retries | Polls fail; client backs off and retries | Sender retries with backoff; receiver may catch up via /events GET | Connection drops; auto-reconnect with Last-Event-ID | Connection drops; app-layer reconnect required | Channel reconnects; streams must be restarted by app | ICE attempts alternate candidates; falls back to TURN |
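
The "pub/sub bus required" pattern for SSE and WebSockets can be sketched in-process. `Bus` stands in for Redis pub/sub, NATS, or Kafka, and `Node` for one server holding client connections; both classes and the `attach` method are assumptions for illustration only.

```python
from collections import defaultdict


class Bus:
    """In-process stand-in for a shared pub/sub bus."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # topic -> node callbacks

    def subscribe(self, topic, callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic, event) -> None:
        for callback in self._subscribers[topic]:
            callback(topic, event)


class Node:
    """One server process holding SSE or WebSocket connections."""

    def __init__(self, name: str, bus: Bus) -> None:
        self.name = name
        self._bus = bus
        self._connections = defaultdict(list)  # topic -> per-client outboxes

    def attach(self, topic: str) -> list:
        """Register a client connection and return its outbox."""
        outbox = []
        if topic not in self._connections:
            # Subscribe once per node and topic, not once per client.
            self._bus.subscribe(topic, self._deliver)
        self._connections[topic].append(outbox)
        return outbox

    def _deliver(self, topic, event) -> None:
        for outbox in self._connections[topic]:
            outbox.append(event)
```

A publisher never needs to know which node holds which client: it publishes once, and each node fans out only to its own local connections.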
7. Better Usage Patterns
Per-protocol. Each row pairs a pattern most teams miss until production bites with the PE-grade approach that avoids the failure.
REST
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Idempotency-Key on every POST that mutates external state | Treat POST as fire-and-forget; rely on client to dedup on retry | Server stores Idempotency-Key + response hash for 24h; replays return original response | Customer-charged-twice bugs are the most expensive REST bug class; the fix is one header |
| ETag + If-Match for optimistic concurrency | Last-write-wins; the second update silently overwrites the first | Server returns ETag on GET; PUT/PATCH requires If-Match matching the current ETag | Concurrent edits stop corrupting data and start surfacing as 412 Precondition Failed |
| Cursor-based pagination, not offset | ?offset=10000&limit=20: the DB scans and discards every skipped row, so cost grows with the offset | ?cursor=opaque_token&limit=20: the token encodes the last seen primary key | A cursor page costs one index seek plus the page itself regardless of depth; offsets time out on deep pages at scale |
| Faithful 4xx vs 5xx | Return 500 for every error (including malformed input) | 4xx = "your request was wrong", 5xx = "we failed"; 4xx is not retryable, 5xx may be | Client retry logic depends on this distinction; getting it wrong creates retry storms or silent loss |
| Cache-Control + Vary explicit, never implicit | No headers, then learn the CDN cached a user-specific response globally | Cache-Control: private for user-scoped; Vary: Authorization, Accept-Encoding always | CDN response leakage between users is a CWE-200; explicit headers prevent it |
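
The Idempotency-Key row can be sketched as a small server-side store. This is a minimal in-memory illustration, not any provider's implementation: the class name, the choice to keep the full response alongside a request fingerprint, and the 422 on a mismatched body are all assumptions, and the sketch is not safe for concurrent requests.

```python
import hashlib
import time


class IdempotencyStore:
    """Replays the first response for a repeated Idempotency-Key within 24h."""

    TTL_S = 24 * 3600

    def __init__(self, clock=time.time) -> None:
        self._clock = clock
        self._entries = {}  # key -> (expires_at, request_fingerprint, response)

    def execute(self, key: str, body: bytes, handler):
        now = self._clock()
        fingerprint = hashlib.sha256(body).hexdigest()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            _, seen_fingerprint, response = entry
            if seen_fingerprint != fingerprint:
                # Same key, different request: refuse rather than guess.
                return 422, {"error": "Idempotency-Key reused with a different body"}
            return response  # replay: the handler does not run again
        response = handler(body)
        self._entries[key] = (now + self.TTL_S, fingerprint, response)
        return response
```

A retry with the same key and body gets the stored response without invoking `handler` again, which is what prevents the double charge.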
Polling
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Long-poll with coordinated timeout | Short-poll at 1-5s; 95%+ of requests return empty | 30-60s long-poll; server holds until event arrives or timeout | 10x reduction in QPS for the same freshness; works through HTTP infra |
| Conditional requests (If-Modified-Since, ETag) | Endpoint ignores conditional headers; client downloads full payload every poll | Server respects If-None-Match → 304 Not Modified with no body | 80%+ bandwidth reduction on unchanged-data polls; trivial server change |
| Jittered intervals | Hardcoded 60s interval; N clients align on the wall-clock boundary and spike at :00 | Base interval ±20% random jitter; server adds Retry-After hints under load | Thundering herds at scheduled boundaries are an outage class on their own |
| Delta polling with watermark | Poll /messages and get the full list every time; client dedupes locally | Poll /messages?since=last_seen_cursor; server returns only new items | 100x payload reduction at any nontrivial dataset size; cleaner client logic |
| Exponential backoff on empty | Constant interval regardless of activity; idle clients hammer at 1Hz | Double interval on empty (capped at, say, 5min); reset to base on event | Idle traffic drops 90%+; active clients still see fast updates |
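
The jitter and backoff-on-empty rows combine into a few lines of client logic. A minimal sketch: the 5s base is an assumed value, while the 5-minute cap and ±20% jitter follow the table.

```python
import random

BASE_S = 5.0    # assumed base interval
MAX_S = 300.0   # cap at 5 minutes
JITTER = 0.20   # plus or minus 20%


def next_interval(current_s: float, got_events: bool) -> float:
    """Reset to base on activity; double (capped) after an empty poll."""
    return BASE_S if got_events else min(current_s * 2, MAX_S)


def with_jitter(interval_s: float, rng=random) -> float:
    """Spread clients out so they do not align on wall-clock boundaries."""
    return interval_s * rng.uniform(1 - JITTER, 1 + JITTER)
```

A poll loop would sleep for `with_jitter(interval)` and then update `interval = next_interval(interval, got_events)`, keeping the un-jittered value as state so the jitter does not compound.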
Webhooks
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| HMAC signature verification on every request | Trust the sender IP; fail when sender rotates clouds or adds CDN | HMAC-SHA256 over (timestamp + body) with rejection of stale timestamps (>5min) | Without this, anyone who learns your endpoint can spoof events; IP-based trust does not scale |
| Receiver: enqueue first, process async | Synchronous processing inside the webhook handler; slow DB write blocks ack | Validate signature → enqueue to SQS / Kafka → ack 200 in <100ms → process from queue | Sender retry storms on receiver slowness are how production fires start; async absorbs bursts |
| Idempotent dedup on event-id | Assume each event arrives at most once; process duplicates and double-charge | Maintain a TTL'd seen-set keyed on event-id (Redis SET with NX and EX, or a DDB conditional write) so check-and-insert is atomic | At-least-once is the real semantic; assuming otherwise causes duplicate side effects |
| Per-event-type endpoints | One /webhook endpoint that switches on event_type internally | /webhook/payment-success, /webhook/refund-issued, separate handlers, separate scale | You can't independently scale, monitor, or rate-limit a god-endpoint; per-type routes solve this |
| Replay / backfill endpoint pair | Webhook is the only source of truth; missed events are gone | Always offer /events?since=cursor as a pull-based backfill; webhook is the push cache | Bugs, downtime, schema changes all need replay; without it, you debug by asking the sender to resend |
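
Signature verification and event-id dedup from the rows above fit in a short receiver-side sketch. The `timestamp.body` signing layout, the helper names, and the in-memory `SeenSet` are illustrative assumptions; each provider documents its own header names and signing string, and a production seen-set would live in a shared store as the table suggests.

```python
import hashlib
import hmac
import time

MAX_SKEW_S = 300  # reject timestamps more than 5 minutes off


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over an assumed 'timestamp.body' layout."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str, now=None) -> bool:
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW_S:
        return False  # stale or future-dated: likely a replayed request
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)  # constant-time comparison


class SeenSet:
    """TTL'd record of processed event ids (in-memory stand-in)."""

    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._expires = {}  # event_id -> expiry time

    def first_time(self, event_id: str, now: float) -> bool:
        """True exactly once per event id within the TTL."""
        expiry = self._expires.get(event_id)
        if expiry is not None and expiry > now:
            return False
        self._expires[event_id] = now + self._ttl_s
        return True
```

In the handler, the order would be `verify`, then `first_time`, then enqueue and acknowledge, so that spoofed or duplicate deliveries never reach the queue.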
SSE
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Heartbeat comments every 15-30s | No heartbeats; proxies kill idle streams at 30-60s with no error visible | Send :\n\n comment lines on a timer; clients ignore, proxies see traffic | Silent stream death is the #1 SSE production fire; one timer prevents it |
| Run over HTTP/2 only | HTTP/1.1 SSE; browser caps at 6 streams per origin, app hangs on 7th tab | Terminate HTTP/2 at the edge and through to the SSE origin | HTTP/1.1 SSE at any nontrivial fanout is structurally broken; HTTP/2 makes it work |
| Persist Last-Event-ID server-side | Event log in-memory; server restart loses position; reconnect gets silent gap | Redis Streams or Kafka with retention longer than worst-case reconnect window | Auto-reconnect is the SSE selling point; without persistence it's a half-feature |
| Disable proxy buffering explicitly | Default nginx / Cloudflare config buffers responses; stream never flushes | Set X-Accel-Buffering: no, Cache-Control: no-cache, Content-Type: text/event-stream | Stream-buffering footgun produces zero visible errors; just nothing happens for users |
| Bound event payload size | Send arbitrary payloads; one large event blocks the stream for slow clients | Cap event body at, say, 32KB; chunk larger payloads with continuation event-ids | An oversized event must drain through the connection's send buffer before anything queued behind it, which spikes p99 latency for slow clients |
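
On the wire, the heartbeat, buffering, and event-id patterns above come down to a few strings. A minimal sketch of the `text/event-stream` framing; the names `SSE_HEADERS`, `HEARTBEAT`, and `format_event` are made up for illustration, and writing and flushing to the response is left to whatever server framework is in use.

```python
from typing import Optional

SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # nginx hint to disable response buffering
}

HEARTBEAT = ":\n\n"  # comment line; EventSource clients ignore it


def format_event(event_id: str, data: str, event: Optional[str] = None) -> str:
    """Serialize one event; the id is what the browser echoes as Last-Event-ID."""
    lines = [f"id: {event_id}"]
    if event:
        lines.append(f"event: {event}")
    # A payload containing newlines needs one data: line per line of payload.
    lines.extend(f"data: {part}" for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"  # blank line terminates the event
```

For example, `format_event("42", "hello")` yields `id: 42`, `data: hello`, and a terminating blank line.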
WebSockets
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Application-layer ping/pong + sequence numbers | Rely on TCP keepalive (default 2hr); miss NAT rebinding, half-open sockets | App-layer ping every 30s; sequence number on every message; replay on reconnect | TCP-only liveness misses real failures; sequence numbers enable lossless reconnect |
| LB idle-timeout coordination | App holds connection idle; LB kills it at 60s with no warning | App-level keepalive interval < LB idle timeout (e.g. 30s vs 60s) | Mysterious disconnects mid-app-think are almost always LB timeout mismatch |
| Explicit ack-window backpressure | Server writes to socket; one slow client fills the send queue; OOM | Credit-based: client grants credits ("ready for N more"); server sends at most that many before waiting for the next grant | TCP flow control alone is not enough at fanout; slow consumers are a common cause of server OOM |
| Connection draining on deploy | Rolling-restart drops all WS connections; reconnect storm hits new pods | Send "going-down" frame, 60s jittered grace, then force-close; new pods warm before old pods drain | Deploy-triggered reconnect storms cause cascading failures that look like real outages |
| Token in handshake, never in URL | ?token=eyJhb...; token is now in every proxy access log forever | Sec-WebSocket-Protocol subprotocol header or first-frame auth message | Tokens leaked to logs are a CWE-200 incident; auditors will find it eventually |
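
Credit-based backpressure from the ack-window row can be sketched independently of any WebSocket library. `CreditedSender`, `grant`, and `offer` are hypothetical names; `transport_send` stands for whatever function actually writes a frame to the socket.

```python
from collections import deque


class CreditedSender:
    """Server-side send path that only writes while the client has granted credit."""

    def __init__(self, transport_send, max_queue: int) -> None:
        self._send = transport_send
        self._credits = 0
        self._queue = deque()
        self._max_queue = max_queue

    def grant(self, n: int) -> None:
        """Client signalled 'ready for n more'."""
        self._credits += n
        self._drain()

    def offer(self, message) -> bool:
        """Queue a message; False means the consumer is too slow to keep up."""
        if len(self._queue) >= self._max_queue:
            return False  # caller decides: drop, coalesce, or disconnect
        self._queue.append(message)
        self._drain()
        return True

    def _drain(self) -> None:
        while self._credits > 0 and self._queue:
            self._send(self._queue.popleft())
            self._credits -= 1
```

The bounded queue is the important part: memory per slow client is capped at `max_queue` messages, and the `False` return forces an explicit policy instead of an eventual OOM.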
gRPC
Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Deadlines propagated through the call graph | No deadline set; runaway calls block goroutines / threads / Netty workers | Set deadline at the edge; propagate via Context / Metadata; downstreams enforce | Deadline propagation is the single biggest gRPC reliability lever; without it you can't bound tail latency |
| Channel reuse, not per-call channels | Create a channel per RPC; exhaust ports, leak HTTP/2 connections | One channel per (host, port), shared across goroutines / threads | Channel creation is expensive (TLS + HTTP/2 setup); reuse is 10-100x cheaper |
| Client-side LB or service mesh, not L4 NLB | NLB balances TCP connections, not requests; HTTP/2 multiplexing pins every RPC on a connection to one backend | round_robin LB policy in gRPC client, or service mesh with HTTP/2-aware routing | L4 LB + gRPC = single-backend hotspots that look like a scaling bug but are a protocol mismatch |
| Cancellation propagation | Caller cancels Context; downstream services don't see it, keep working | Pass Context (Go) / CancellationToken (.NET) / equivalent through every call | Without propagation, cancelled calls keep burning CPU / DB / external API budget downstream |
| Strict additive schema evolution | Reuse field numbers, rename fields, remove fields without reserving them; the wire looks "fine" | Never reuse field numbers; mark removed fields reserved; add new fields only with new numbers | Protobuf evolution rules are unforgiving; one violation can silently break every old client |
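
The deadline-propagation row comes down to fixing one absolute deadline at the edge and giving each hop only the time that is left. The sketch below is library-agnostic: `Deadline` and `call_with_deadline` are hypothetical names, and `rpc` stands for any callable that accepts a `timeout` keyword in seconds (the calling convention used by Python gRPC stub methods).

```python
import time


class Deadline:
    """Absolute deadline fixed at the edge and handed to every downstream call."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def check(self) -> None:
        """Fail fast instead of starting work that cannot finish in time."""
        if self.remaining() <= 0.0:
            raise TimeoutError("deadline exceeded")


def call_with_deadline(deadline: Deadline, rpc, *args):
    """Invoke rpc with the remaining budget as its timeout."""
    deadline.check()
    return rpc(*args, timeout=deadline.remaining())
```

Passing the same `Deadline` object through every layer means the third hop in a chain sees the budget already spent by the first two, which is what bounds tail latency for the whole call graph.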
WebRTC
Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Budget for TURN bandwidth on day one | Ship STUN only; 15-20% of users behind symmetric NAT can't connect | TURN servers in each region (or TURN-as-a-service); accept the egress cost | Without TURN, your connection success rate stalls around 80-85% and you'll never debug why; with TURN it's 99%+ |
| SFU at 5+ peers, mesh below | Mesh-only architecture; works at 4, fails at 5; users blame their wifi | Switch to SFU when participant count exceeds 4 (configurable inflection point) | Mesh upload bandwidth is N-1 streams per peer; residential uplinks fail before SFUs do |
| Trickle ICE always | Vanilla ICE; gather all candidates before offer; 5-15s call setup | Trickle ICE; emit candidates as found, offer goes out fast; sub-second setup | Call-setup latency dominates first-impression UX; trickle is non-optional in 2026 |
| Simulcast for multi-receiver bitrate adaptation | Single bitrate; one viewer on mobile forces everyone to the lowest quality | Send 3 layers (high/mid/low); SFU forwards appropriate layer per subscriber | Without simulcast, one bad-network participant degrades the whole call; with it, each subscriber gets best-available |
| DTLS-SRTP only, never SDES | Old samples or libraries negotiate SDES key exchange | DTLS-SRTP mandatory in the SDP offer; reject SDES outright | SDES is deprecated and insecure; modern libraries enforce DTLS-SRTP but legacy samples still show SDES |
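
The mesh-versus-SFU inflection point is simple arithmetic on per-peer upload. The sketch below assumes one outgoing stream per remote peer in a mesh and a single upstream to the SFU without simulcast. The function names, the 1.5 Mbps stream bitrate, the 6 Mbps uplink, and the 80% headroom factor are illustrative assumptions, not measurements.

```python
def mesh_upload_mbps(participants: int, stream_mbps: float) -> float:
    """Per-peer upload in a full mesh: one copy of the stream to each other peer."""
    return max(0, participants - 1) * stream_mbps


def sfu_upload_mbps(participants: int, stream_mbps: float) -> float:
    """Per-peer upload with an SFU: one copy to the server, regardless of room size."""
    return stream_mbps if participants > 1 else 0.0


def max_mesh_participants(uplink_mbps: float, stream_mbps: float, headroom: float = 0.8) -> int:
    """Largest mesh size whose per-peer upload fits within headroom * uplink."""
    usable = uplink_mbps * headroom
    return int(usable // stream_mbps) + 1


# Illustrative numbers: 1.5 Mbps video on a 6 Mbps residential uplink.
print(mesh_upload_mbps(4, 1.5))         # 4.5 Mbps, fits in 4.8 usable
print(mesh_upload_mbps(5, 1.5))         # 6.0 Mbps, exceeds 4.8 usable
print(max_mesh_participants(6.0, 1.5))  # 4
```

With these assumed numbers the mesh stops fitting at the fifth participant, which lines up with the "SFU at 5+ peers" row; different bitrates or uplinks move the threshold, which is why the table calls it configurable.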
8. Advanced / Next-Gen Alternatives
Per-protocol. The successors, adjacent tech, and patterns that obviate the original when constraints shift.
REST
Default choice for resource APIs, CRUD reads/writes, public HTTP surfaces, and systems where cacheability and operational simplicity matter more than push.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| GraphQL (Federation v2) | Eliminates over-fetching / N+1 across diverse clients | Production | Medium — server resolver layer plus federation gateway | Multiple client teams with different data shapes hitting the same backends |
| HTTP/3 + QUIC | Removes head-of-line blocking, faster handshake, better mobile resilience | Production — ~35% of Cloudflare HTTPS traffic in 2026 | Low — transparent upgrade if edge supports it | Mobile-heavy traffic, lossy networks, latency-sensitive global apps |
| tRPC / typed REST | End-to-end TypeScript type safety without protobuf toolchain | Production in TS monorepos | Medium — TypeScript-only, both ends must speak it | Greenfield TypeScript fullstack where shared types are the value |
| Connect protocol (CNCF) | gRPC schemas + HTTP/1.1 + JSON fallback; browser-native, no proxy | Production (used by CrowdStrike, Bluesky, Dropbox) | Low if you're greenfield; medium from existing REST | Want gRPC's type safety with REST's debuggability and browser story |
Polling
Use when freshness can lag by seconds or minutes and you want the simplest, most firewall-friendly implementation before paying for push infrastructure.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| SSE / WebSockets | Eliminates poll-interval latency floor; reduces wasted requests | Production | Medium — adds push-channel infra; client refactor | Event rate exceeds 1/min per client and clients are browsers with the page open |
| Long-poll + HTTP/2 keepalive | Same polling model, fewer empty round-trips, no idle radio wake | Production | Low — server-side change; client interval becomes ack-on-receive | Want polling semantics with less wasted bandwidth and battery |
| Webhooks (for B2B receivers) | Push from sender; no client-side scheduling | Production | Medium — receiver must expose accessible HTTP endpoint | Receiver is another server you control with a routable URL |
| Mobile push (APNs / FCM) | OS-level persistent connection; battery-friendly | Production | Low — provider SDK integration | Mobile app where battery and background limits dominate |
Webhooks
Best for server-to-server event notifications across ownership boundaries, as long as receivers build idempotency, buffering, replay, and signature checks.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| EventBridge / Kafka topic subscriptions | Replay, per-partition ordering, consumer groups; exactly-once only inside Kafka transactions, otherwise at-least-once | Production | High — architectural shift, requires both sides in compatible infra | Sender and receivers within same cloud / org; ordering matters |
| CloudEvents spec for payloads | Cross-vendor portable event schema; type / source / id standardized | CNCF Graduated | Low — schema-level only, sits on top of HTTP POST | Multi-vendor event flow with multiple sender / receiver implementations |
| WebSub (PubSubHubbub) | Discovery via topic URLs; subscriber-to-hub model | Niche — stable but narrow adoption | Low | Open content distribution (RSS/Atom-like) at internet scale |
| Polling /events?since=cursor as primary | Receiver-paced, replay-friendly, no signature complexity | Production | Low — usually paired with webhooks anyway | Receiver wants control over rate / replay; sender wants no per-receiver retry state |
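
The "polling /events?since=cursor" row can be sketched as an append-only log whose cursor is the sequence number of the last event the receiver has processed. The `EventLog` class, its field names, and the response shape (`events`, `next_cursor`, `has_more`) are assumptions for illustration, and the in-memory list stands in for whatever durable store the sender uses.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only log; sequence numbers start at 1 and never change."""
    events: list = field(default_factory=list)

    def append(self, payload: dict) -> int:
        seq = len(self.events) + 1
        self.events.append({"seq": seq, "payload": payload})
        return seq

    def since(self, cursor: int = 0, limit: int = 100) -> dict:
        """What a GET /events?since=<cursor> handler could return."""
        cursor = max(0, cursor)
        # seq == index + 1, so slicing from `cursor` yields events with seq > cursor.
        page = self.events[cursor:cursor + limit]
        next_cursor = page[-1]["seq"] if page else cursor
        has_more = cursor + len(page) < len(self.events)
        return {"events": page, "next_cursor": next_cursor, "has_more": has_more}
```

The receiver stores `next_cursor` only after it has processed the page, so a crash mid-page replays the same events on the next poll, and the sender keeps no per-receiver retry state.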
SSE
A strong fit for one-way server-to-browser streams such as LLM output, dashboards, notifications, and feeds where writes can stay as normal HTTP requests.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| WebTransport | Multi-stream + unreliable datagrams over QUIC; replaces SSE+WS combo | Baseline (Mar 2026) — Safari 26.4 shipped support | Medium — new client API, HTTP/3 edge required | Need datagrams plus reliable streams in one session; lossy mobile networks |
| gRPC server-streaming + Connect | Typed schemas, binary efficiency, browser support via Connect | Production | Medium — protobuf adoption | You control client and want types over text/event-stream |
| HTTP/3 SSE | Head-of-line blocking elimination on lossy networks | Production | Low — transparent | Mobile-heavy or globally distributed SSE consumers |
| RSocket (reactive streaming) | Application-layer backpressure (credit-based), 4 interaction models | Niche | High — language ecosystem narrow (Java / TS / Go) | Reactive stack (Spring Reactor) that wants backpressure as a first-class concept |
WebSockets
Choose for bidirectional realtime interactions where both sides speak frequently and you are ready to own connection state, replay, backpressure, and observability.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| WebTransport (QUIC) | Multi-stream, datagrams, no head-of-line blocking, native QUIC | Baseline (Mar 2026) | Medium — new client API, HTTP/3 edge required, server stack different | Greenfield real-time apps where datagrams or multi-stream matter (cloud gaming, observability streams) |
| WebSocketStream API | Native backpressure via ReadableStream / WritableStream | Early — Chromium-only; not in Firefox or Safari | Low — same protocol, new API surface | Streaming large payloads or high-throughput per-connection |
| WebRTC data channels | UDP-based option, peer-to-peer, configurable reliability | Production | High — different protocol stack entirely | Sub-100ms p99 needed, data tolerates loss, and a peer-to-peer topology fits |
| RSocket | Credit-based backpressure built into the protocol, 4 interaction models | Niche | High | Reactive Java / Spring stack with strong backpressure requirements |
gRPC
Best inside service meshes and typed service-to-service boundaries where protobuf contracts, deadlines, streaming, and binary efficiency justify the tooling cost.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Connect protocol (Buf, CNCF) | HTTP/1.1 + JSON fallback, browser-native, wire-compatible with gRPC | Production (CNCF, 2026); used by CrowdStrike, PlanetScale, Bluesky, Dropbox | Low — wire-compatible, drop-in for many cases | Want gRPC's schema + codegen with REST's debuggability and zero-proxy browser support |
| gRPC-Web | Browser support via Envoy / similar proxy | Production | Low — sidecar proxy | Need browser clients but don't want to switch protocols server-side |
| gRPC over HTTP/3 | Removes HTTP/2 head-of-line blocking; better mobile | Emerging — Connect and Tonic experimenting; gRPC-core not stable yet | Low — protocol-level upgrade | Mobile or globally-distributed gRPC traffic over lossy networks |
| Cap'n Proto / FlatBuffers RPC | Zero-copy deserialization, microsecond-tier latency | Niche | High — different schema language and toolchain | Latency budget is single-digit microseconds (HFT, game-server tick loops) |
WebRTC
Use for peer/media paths, calls, low-latency data channels, and cases where NAT traversal, TURN, SFU planning, and client-side stats are part of the product.
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Media over QUIC (MoQ) | Pub/sub media delivery over QUIC; relay-friendly; simpler than full WebRTC stack | IETF Draft-17 (May 2026) — Cloudflare relay network, nanocosmos in production at a concurrency in the low hundreds of thousands | High — new spec, evolving across drafts | Live streaming where 200-300ms is acceptable and you want CDN-scale economics |
| WebTransport (data-only use) | Low-latency reliable + unreliable without media/codec stack | Baseline (Mar 2026) | Medium — different API surface | You want sub-100ms data, not media — don't need codec / jitter buffer / SFU |
| WebCodecs API | Low-level codec access; build custom media pipeline (cloud gaming, custom encode) | Production in Chrome/Edge; Safari/Firefox shipping | Medium — pair with WebTransport for end-to-end control | Custom encoding / decoding outside WebRTC's negotiated codec set |
| SFU-as-a-service (LiveKit, Daily, Cloudflare Calls, 100ms) | Removes operational burden of TURN / SFU / signaling | Production | Low — SDK swap | Don't want to run media infra; trade per-minute cost for zero ops |