Apache Flink

Stateful stream processing at scale. Deep dive on architecture, then a Principal Engineer trade-off pass.

Stream Processing Framework

As of 2026-06-02 · Flink 2.2 (Dec 2025) is GA · 1.20 is LTS · 2.0 introduced disaggregated state

PE Verdict

Flink is the reference implementation for stateful, event-time, exactly-once stream processing. It earns that title with a true streaming runtime, mature checkpointing, and the largest connector ecosystem in the space. It pays for it with operational complexity that punishes teams without dedicated streaming platform engineers. Flink 2.0’s ForSt disaggregated state finally addresses its cloud-native weak spot, but the JVM-centric runtime, RocksDB tuning surface, and key-group-bound rescaling model still make new SQL-first streaming engines (RisingWave, Materialize, Arroyo) the right call for greenfield analytical workloads under 10 TB of state.

Flink Core Mechanics

ArchitectureJobManager, TaskManager, slots, deployment modes Execution ModelDataStream, SQL, operator chaining, backpressure State BackendsHashMap, RocksDB, ForSt, keyed vs operator state CheckpointingAligned, unaligned, incremental, changelog, 2PC sinks Time & WatermarksEvent time, watermark generation, idle partitions, alignment

01 · Cluster Architecture

Flink runs as a two-tier distributed system: a coordinator process (JobManager) and N worker processes (TaskManagers). Slots are the unit of resource isolation inside a TaskManager. Everything else (parallelism, state placement, checkpoint coordination) is layered on top of this model.

Component responsibilities

Component	What It Does	PE-Level Detail
`Dispatcher`	Receives job submissions over REST, spawns a JobMaster per job, exposes the Web UI.	In Application Mode the Dispatcher runs the `main()` on the cluster, eliminating client-side JobGraph generation. Session Mode keeps one Dispatcher across many JobMasters.
`ResourceManager`	Allocates slots to JobMasters. Talks to YARN, Kubernetes, or standalone.	Slot is the unit of scheduling, not CPU. One TaskManager exposes N slots, each gets `1/N` of TM managed memory. Slot sharing groups let operator chains share a slot to cut serialization cost.
`JobMaster`	Owns one job’s ExecutionGraph, schedules tasks, coordinates checkpoints, handles failover.	This is the checkpoint coordinator. It injects barriers into sources, tracks per-operator ACKs, and commits the checkpoint metadata atomically. Loss of JobMaster without HA means job loss.
`TaskManager`	JVM worker process. Runs operator subtasks in slots. Owns local state, network buffers, RocksDB / ForSt.	Heap is for user code and operator framework. Managed memory (off-heap) is for sorting, hash tables, RocksDB block cache. Network memory is for credit-based shuffles. Three pools, three tuning surfaces.
`Slot`	Fixed share of one TM’s managed memory plus one task thread per chained operator.	Slots do not isolate CPU. Two slots on a busy TM compete for cores. This is the source of most “noisy neighbor” pain in shared session clusters.
`Client`	Builds the JobGraph from user code, submits it to Dispatcher. Optional once submitted.	Session Mode keeps the client around for tracking. Application Mode terminates the client after submission, which is mandatory for K8s-native deployments at scale.

02 · Execution Model

User code becomes a logical dataflow, which the optimizer rewrites into a JobGraph, which the JobMaster expands into an ExecutionGraph (one vertex per parallel subtask), which gets pinned to slots. Operator chaining fuses adjacent non-shuffling operators into a single thread to skip serialization.

The four graphs

Graph	Built By	Granularity	What Matters at PE Level
StreamGraph	Client (DataStream API) or Table planner (SQL)	Logical operators	SQL queries go through Calcite plan, then a rule-based stream optimizer. Plan stability across versions is the production risk, which is why `COMPILED PLAN` exists.
JobGraph	Client	Chained operators	Adjacent operators with forward-only edges and matching parallelism fuse into one task. This is where `disableChaining()` calls show up to expose subtasks for backpressure debugging.
ExecutionGraph	JobMaster	Per-subtask	One ExecutionVertex per parallel instance. This is what the Web UI shows and what the scheduler operates on. Region failover scopes resets to the pipelined region of failed vertices, not the whole job.
Physical Graph	Scheduler + RM	Subtask to slot	Pinning happens here. Slot sharing groups, co-location constraints, and resource profiles all collapse into a placement decision the operator cannot easily inspect post-deploy.

API layers

DataStream API: Java / Scala. Low level. Direct access to keyed state, timers, side outputs, custom operators. Used when you need pinpoint control over state and time semantics.
Table API / Flink SQL: Declarative. Goes through Calcite. Supports CEP via MATCH_RECOGNIZE, session windows, named parameters (1.19+), per-operator state TTL via SQL hints. The optimizer can rewrite plans across versions, so COMPILED PLAN snapshots are the upgrade safety net.
PyFlink: Python. Runs UDFs in a Beam-portability sidecar over gRPC. Significantly slower than Java for stateful ops because state access crosses the language boundary. Acceptable for sources, sinks, and stateless transforms.
Stateful Functions: Function-as-a-service style API on top of Flink. Decouples logic from runtime. Largely superseded by Application Mode plus regular DataStream for most teams.
DataSet API: Removed in 2.0. Batch is unified under DataStream + Table API now.

Networking and backpressure

Inter-task data movement uses a credit-based flow control protocol over Netty. Downstream subtasks announce available buffer credits to upstream; upstream sends only as many buffers as credits allow. Backpressure is not a config knob, it is the emergent behavior of a slow consumer starving credit upstream.

The pathology: backpressure couples with aligned checkpoints. Barriers travel through network buffers in FIFO with data. A slow operator delays barrier arrival at downstream operators, alignment time grows, checkpoint duration explodes, and eventually checkpoints time out. Unaligned checkpoints (1.11+) cut barriers ahead of in-flight buffers and snapshot the buffers themselves, breaking the coupling at the cost of larger checkpoint size and a savepoint compatibility caveat.

03 · State Model & Backends

State is what makes Flink Flink. Three kinds of state and three backends, and the choice cascades into checkpoint speed, rescaling cost, and recovery time.

State types

State Type	Scope	Typical Use	Rescaling Behavior
Keyed State	Per key, per operator	Per-user aggregates, joins, session windows, deduplication	Reassigned by key group on parallelism change. Key groups are fixed at max parallelism, set at job start, immutable thereafter.
Operator State	Per operator subtask	Source offsets (Kafka consumer), sink buffers, broadcasted enrichment	Redistributed via `ListState` (even split) or `UnionListState` (full broadcast on restore, expensive at scale).
Broadcast State	Per operator, replicated to all subtasks	Dynamic rules, feature flags, slowly-changing dimensions joined to the main stream	Replicated identically on every parallel instance. Useful but the broadcast side cannot be repartitioned mid-flight.

State backends

Backend	Storage	Best For	Hard Cost
`HashMapStateBackend`	Java heap, plain objects	Small state (under a few GB per TM), latency-sensitive workloads, no serialization overhead per access	GC pressure. Heap fragments under churn. State size capped by heap. Full checkpoint always, no incremental.
`EmbeddedRocksDBStateBackend`	RocksDB on local disk, off-heap block cache	Large state (TB scale), incremental checkpoints, time-windowed aggregates with long retention	Every state access is a serde + LSM lookup. Per-record latency goes from sub-microsecond (heap) to tens of microseconds. RocksDB tuning surface is large: write buffer, block cache, compaction style, bloom filters.
`ForSt` (Flink 2.0)	Distributed file system as primary, local disk as cache	Cloud-native deployments, fast rescaling, large state without local disk dependency	Network I/O for state access. Mitigated by async execution model and tiered cache. Without local cache, throughput drops around 48% on average for I/O-heavy queries on remote storage. With cache it approaches RocksDB parity.

Why ForSt matters. Pre-2.0, large state meant large local disks on TaskManagers, which meant stateful pods on Kubernetes (PVCs, anti-affinity, slow rescheduling). With ForSt, state lives on object storage and the local disk is just a hot cache. Flink 2.0 reports up to 94% checkpoint duration reduction, up to 49x faster recovery after failures or rescaling, and up to 50% cost savings in the disaggregated configuration. The catch is the asynchronous execution model required to hide DFS latency, which is opt-in and not yet covering every operator.

04 · Checkpoint Protocol

Flink’s checkpointing is a variant of Chandy-Lamport adapted for cyclic-free dataflows. The JobMaster injects a numbered barrier into each source. Each operator snapshots its state when it has received the barrier from every input channel, then forwards the barrier downstream. Once every sink ACKs the barrier, the checkpoint is committed.

Checkpoint variants

Variant	Mechanism	When to Use	Cost
Aligned (default)	Operator waits for barrier on every input channel before snapshotting.	Healthy job with no sustained backpressure. Smaller checkpoint size.	Backpressure couples directly into checkpoint duration. Under backpressure, checkpoints may take longer to complete or time out completely.
Unaligned (1.11+)	Barrier cuts ahead of in-flight buffers; the buffers are snapshotted as part of operator state.	Persistent backpressure where aligned CPs are timing out. Stateless or low-state operators benefit most.	Checkpoint size grows by in-flight buffer volume. Only works for exactly-once and with one concurrent checkpoint. Savepoints cannot be unaligned.
Incremental (RocksDB)	Only uploads RocksDB SSTables changed since last checkpoint.	Large state, frequent checkpoints. Default and correct for almost all RocksDB jobs.	RocksDB compaction creates “new” files even for unchanged data, occasionally blowing up incremental delta size. Background manual compaction in 1.20 mitigates.
Generic Log-based (changelog, 1.15+)	Operators write state changes to a durable changelog continuously; materialization happens in the background.	When checkpoint duration must be predictable and short (sub-second p99). Cost is paid continuously, not at CP time.	Doubles write amplification. Extra DFS traffic. Worth it only when CP duration variance is the production pain.
Savepoint	Manually triggered, version-stable, format-independent snapshot.	Upgrades, parallelism changes, code changes, A/B job swaps.	Always aligned. Cannot be unaligned even with the feature enabled. Slower than checkpoints. Holds state in the canonical format, not RocksDB native.
File Merging (1.20+)	Combines many small checkpoint files into fewer larger ones.	Large parallelism (thousands of subtasks) where filesystem metadata is the bottleneck.	MVP feature, opt-in via `execution.checkpointing.file-merging.enabled`. Watch for edge cases on retention.

2PC sinks are the other half of exactly-once. Checkpoint guarantees state consistency. End-to-end exactly-once also requires the sink to participate. Kafka’s KafkaSink with EXACTLY_ONCE uses transactional producers. The producer transaction commits only when the corresponding checkpoint commits, with a 2PC handshake driven by the JobMaster. Get this wrong and you have exactly-once on state but at-least-once on output, which is the most common silent correctness bug in Flink deployments.

05 · Time & Watermarks

Flink supports three notions of time. Event time is the only one that gives correct results under out-of-order data, which is most production data. Event time relies on watermarks: monotonic timestamps that assert “no more data with timestamp earlier than this will arrive”. Watermarks are how Flink fires windows, expires timers, and handles late data.

Time Mode	Source of Time	Correctness Property	Production Risk
Event time	Timestamp embedded in the record.	Deterministic results across reprocessing. Same input → same windows → same output.	Skewed sources stall watermarks. One slow Kafka partition can freeze global watermark and block window firing across the entire job.
Processing time	Wall clock on the TaskManager processing the record.	Lowest latency, no watermark machinery, no late data.	Non-deterministic. Reprocessing a backlog produces different windows than the live run. Most teams regret choosing this within 6 months.
Ingestion time	Wall clock at source operator.	Middle ground. Approximate event-time with no per-record timestamp needed.	Mostly deprecated in practice. If you want event time, use event time.

Watermark mechanics worth knowing

Generation: Per-source-partition watermark, taken as the minimum across partitions for downstream. Idle partition detection (1.11+) lets stalled partitions opt out instead of holding global watermark hostage.
Allowed lateness: Window-level grace period after watermark passes window end. Late records reopen the window and re-emit results. Costs state retention for the lateness window.
Side output for late data: Records arriving after allowed lateness can be diverted to a side output instead of dropped. The right pattern for forensic correctness without inflating main-path state.
Watermark alignment (1.15+): Coordinates watermark progress across source subtasks to prevent one fast partition from racing ahead and forcing huge buffers on slow ones. Mandatory for backpressure-sensitive workloads.

06 · Trade-Offs

Click any column header to sort. Trade-offs are stated as “give up X to get Y” with the operational scenario where the cost actually bites.

Trade-Off	What You Gain	What You Give Up	When It Bites You	PE Nuance
True streaming runtime over microbatch	Per-record latency in low tens of ms, event-time semantics, no batch boundary artifacts	Throughput-per-core lower than Spark Structured Streaming on stateless workloads, more complex resource accounting	Hourly aggregation jobs over Parquet where Spark would just be simpler and cheaper on managed Databricks	Teams pick Flink for “real-time” and then run 90% of their workload as 5-minute tumbling windows. If the SLA tolerates 5 minutes, you are paying streaming-runtime tax for batch outcomes.
Exactly-once via aligned checkpoints	Correctness guarantee across state and 2PC sinks, no application-level dedup	Checkpoint duration coupled to backpressure, end-to-end exactly-once requires sink cooperation (Kafka 2PC, Iceberg, JDBC XA)	Sustained backpressure makes aligned checkpoints time out; you flip to unaligned and savepoints take longer for upgrades	Most “exactly-once” Flink jobs in production are at-least-once at the sink because the team used a non-transactional connector. The semantic is end-to-end, not framework-only.
RocksDB state backend over heap	TB-scale state per TM, incremental checkpoints, predictable memory footprint via managed memory	Per-state-access latency goes from sub-microsecond to tens of microseconds, large tuning surface (write buffer, block cache, compaction)	Per-record state access on the hot path (e.g., session keying), where heap would be 50x faster and the state actually fits	The default of RocksDB is correct for the common case but wrong for low-state, latency-critical jobs. Profile first. HashMap backend is not deprecated, it is underused.
Key groups for hash-partitioned state	Deterministic state placement, parallel recovery, sub-second key routing	Max parallelism is fixed at job submission. Going past it requires a state-incompatible rewrite	Year 2 traffic outgrows the parallelism ceiling you set on day 1. Now you need a savepoint, manual repartitioning, and a release	Always set max parallelism explicitly to something like 720 or 1440 (composite-friendly numbers). Defaults of 128 cap you fast at any real scale.
JVM-centric runtime	Mature ecosystem, deep observability via JMX, vast operator and connector library	GC pauses bite at 100GB+ heaps, native ML/SIMD workloads must JNI out, Python is a second-class citizen	P99 latency spikes from G1 mixed GCs at 8 AM traffic ramp, masking real bottlenecks	ZGC and Shenandoah are the right defaults at scale, not G1. Most “Flink is slow” tickets resolve to a GC config from 2018.
Event-time semantics with watermarks	Deterministic reprocessing, correct windows over out-of-order data, replay safety	Watermark sensitivity to skewed sources, per-source-partition idleness tuning, late-data accounting	One Kafka partition with no producers freezes global watermark; windows stop firing; alerts trip	Watermark alignment (1.15+) and idle-partition timeouts are not optional at scale. Most “windows not firing” tickets are watermark stalls, not bugs.
Mature connector ecosystem	Kafka, Kinesis, Pulsar, JDBC, Iceberg, Delta, HBase, ES, Pinot, CDC connectors covering most production stacks	Quality varies wildly. Some connectors lag Flink versions, others have correctness issues on edge cases	Two-quarter blocker waiting for a connector to support a new Flink minor, or hitting a connector deadlock under specific shutdown sequences	The first-party Kafka and Iceberg connectors are production-grade. Treat community connectors as alpha until you have validated checkpoint + restore + 2PC under chaos testing.
Slot sharing for operator chaining	Skips serialization between chained operators, slashes per-record cost for forward edges	CPU contention inside slots, harder backpressure isolation, debugging requires manually un-chaining	Production performance regression that disappears the moment you `disableChaining()` to see what is slow	Slot sharing groups are powerful for cost but they hide the truth. For new jobs, deploy with chaining disabled in pre-prod, find the bottleneck, then re-enable.
Coupled compute and state (pre-2.0)	Sub-microsecond state access from local disk, predictable RocksDB performance	Stateful pods, slow rescaling (download all state on scale-up), local disk dependency on K8s	Autoscaling event takes 15 minutes for a 200GB-per-TM job because the new TMs must hydrate from S3 first	This is exactly why Flink 2.0 ForSt exists. If you are designing a new cloud-native deployment, evaluate ForSt now; the pre-2.0 model is a structural mismatch for elastic infra.

07 · Use Cases

Production deployments at the top of the scale curve. Numbers are from public engineering posts and conference talks.

Use Case	Company / Scenario	Driving Property	Scale Dimension	Why Not Alternative
Real-time recommendation and transaction processing	Alibaba (Blink fork, now upstreamed)	Sub-100ms p99 stateful joins across user, product, ad streams, exactly-once during Singles’ Day peaks	Alibaba processes 40 billion events per day on Flink	Spark Structured Streaming microbatch boundaries violated the latency SLA. Kafka Streams lacked the multi-stream join and event-time depth.
Streaming SQL platform for analysts	Uber AthenaX	SQL-first stream interface so non-engineers can ship streaming queries against canonical event streams	Trillion messages per day across the AthenaX platform	Custom DSLs created onboarding friction; Spark SQL did not give true event-time. Flink SQL hit the sweet spot for self-serve.
Ad-hoc stream processing at platform layer	Netflix Keystone	Multi-tenant stream processing where teams ship jobs without owning infrastructure, with isolation per job	Hundreds of jobs, mission-critical recommendation features, low-latency event routing	In-house Mantis covered some cases; Flink was chosen for stateful workloads that Mantis was not designed for.
Streaming fraud and risk detection	Stripe (5+ years on Flink), ING	Exactly-once stateful CEP over payment streams with sub-second decision SLAs and replayable audit trail	Stripe scaled Flink across multiple regions and hundreds of jobs	Kafka Streams could not handle the multi-stream join complexity at the required latency. Custom CEP frameworks lacked Flink’s checkpointing.
Real-time experimentation and analytics	Pinterest Xenon platform	Event aggregation for experimentation with windowed metrics, deduplication, and low-latency dashboards	Pinterest runs Flink in production with self-service diagnosis tools and custom deployment frameworks	Batch experimentation pipelines lagged 24 hours; Spark Streaming had operator state limits Xenon hit.
Real-time gaming and personalization	King (Candy Crush parent)	Stateful processing of player events for in-game personalization and anti-cheat	King handles 30 billion daily events across 200+ countries	Kappa-style replay over batch could not meet the in-session personalization latency. Custom code would not have given exactly-once for free.
Telco real-time analytics	Bouygues Telecom	Sub-second event-time windowing over network telemetry with strict SLA for operations dashboards	30+ production Flink apps processing more than 10 billion raw events per day	Apache Storm (predecessor) lacked stateful CEP and exactly-once. Spark Streaming microbatch latency was too high.
End-to-end exactly-once ad event processing	Uber Ad Events (Flink + Kafka + Pinot)	Exactly-once stateful enrichment and dedup of ad impressions with billing-grade correctness	Production at Uber ad-platform scale with audited revenue impact	Application-level dedup at downstream services would have added 30%+ infra cost and is correctness-fragile at peak load.

08 · Limitations

Real ceilings teams hit in production, with the standard workaround and what the workaround actually costs.

Limitation	Severity	Workaround	Workaround Cost
Max parallelism is immutable after first job submission	High	Set high max parallelism (720, 1440) upfront. If exceeded, savepoint, rewrite key partitioning, restart from clean state	State-incompatible rewrite. In practice, jobs hit this and accept a hard restart or rebuild state from source.
Rescaling requires savepoint + restore (pre-2.0)	High	Use Reactive Mode (1.13+) or migrate to Flink 2.0 + ForSt for instant rescaling	Reactive Mode is opt-in and not production-default; ForSt requires async execution and 2.0 minor version.
Local disk dependency on Kubernetes for RocksDB	High	Use PVCs with fast SSD, or move to ForSt with object storage as primary	PVCs slow pod scheduling. ForSt is opt-in and not all operators support async execution.
Backpressure propagates into checkpoint duration	High	Enable unaligned checkpoints, enable buffer debloating, fix the actual backpressure source	Unaligned CP size grows; only works for exactly-once with one concurrent checkpoint; savepoints stay aligned and slower.
State schema evolution is fragile for non-Avro/POJO state	High	Use Avro or POJO state with explicit schema, never `Kryo` default serializers for production state	Refactoring legacy jobs to Avro state is a multi-sprint effort with savepoint-replay validation.
Watermark stalls from idle source partitions	Medium	Use `WatermarkStrategy.withIdleness()` and watermark alignment (1.15+)	Idleness timeout must be tuned per source; too short causes premature firing, too long causes stalls. Requires per-job analysis.
PyFlink is significantly slower than Java for stateful operators	Medium	Keep stateful logic in Java/Scala; use PyFlink only for sources, sinks, and stateless transforms	Mixed-language jobs add deployment and debugging complexity. Pure Python shops face a 2-5x throughput penalty.
Connector quality varies; some community connectors lag Flink versions	Medium	Validate connector under chaos testing (restart + replay + 2PC commit) before production	Adds 1-2 weeks per connector to validation cycle. Some teams maintain forks.
Job graph compilation time grows with SQL plan size	Medium	Split very large SQL pipelines into multiple jobs; use `COMPILED PLAN` to cache the optimized plan across upgrades	Multi-job pipelines need explicit topic-as-boundary contracts. Compiled plans must be tested across Flink versions on upgrade.
Window state grows unboundedly without TTL	Critical	Set `StateTtlConfig` on all keyed state. Use session windows over global windows for unbounded keyspaces	TTL with incremental checkpoints is based on RocksDB compaction, so state size reduction is only observed after compaction occurs. Misleading checkpoint metrics during cleanup.
JobManager is a SPOF without HA configured	Critical	Enable HA via ZooKeeper or Kubernetes leader election; store checkpoint metadata on durable DFS	HA adds operational dependency (ZK cluster) or K8s ConfigMap reliance. Misconfigured HA can split-brain on K8s rolling updates.
Operator state under `UnionListState` broadcasts on restore	Medium	Avoid `UnionListState` at high parallelism; prefer keyed state for genuinely keyed data	Some legacy sources still use it. Migration requires connector cooperation.

09 · Fault Tolerance

Flink’s fault tolerance is checkpoint-and-restart, not in-memory replication. Recovery time is bounded by state size plus source replay, not by RPO/RTO of a replicated quorum.

Dimension	Behavior	Operational Reality
Replication model	State snapshotted asynchronously to durable DFS at checkpoint intervals. No in-memory state replicas across TMs.	Durability inherits from the DFS (S3 11 nines, HDFS configurable replication). Loss between checkpoints is unrecoverable from Flink, must be replayed from source.
Failure detection	TaskManager heartbeats to JobManager. Default timeout 50 seconds, tunable via `heartbeat.timeout`.	Tuning lower than 30s causes false positives under GC; longer delays detection. Set 20-50s and pair with appropriate restart strategy.
Failover mechanism	Three strategies: full job restart, region failover (since 1.9), or individual failover for embarrassingly parallel jobs.	Region failover is the right default for most jobs. Limits blast radius to the connected pipelined region around the failed task, not the whole job graph.
RTO (typical)	Seconds for small state (under 10GB), minutes for large state due to RocksDB hydration from DFS. Flink 2.0 ForSt cuts this dramatically.	Flink 2.0 reports up to 49x faster recovery after failures or rescaling with the disaggregated state model, since state is read from DFS on demand.
RPO (typical)	Zero data loss if source is replayable. Effective RPO is the checkpoint interval (commonly 10s to 5min); work between checkpoints is replayed, not lost.	Non-replayable sources (sockets, HTTP push) make RPO non-zero. Always pair Flink with a replayable log (Kafka, Kinesis, Pulsar, file-based source).
Split-brain behavior	JobManager HA uses ZooKeeper or K8s leader election. Only one active JobMaster per job at any time. TaskManagers refuse to connect to a non-leader.	K8s ConfigMap-based leader election has known races under rolling restarts. ZK-based HA is more battle-tested at large scale.
Blast radius of single-node failure	With region failover: scoped to the connected pipelined region of failed subtasks. With full restart: entire job.	Wide shuffles (keyBy, rebalance) expand the region. Narrow forward edges keep regions small. Operator placement matters more than most teams realize.
Cross-region failover story	No native multi-region. Pattern is savepoint replication to secondary region’s DFS plus a standby cluster that resumes from the latest savepoint on disaster.	Operationally non-trivial. Savepoint frequency determines cross-region RPO. Most teams settle for warm-standby with 5-15 minute savepoint cadence.
Data loss scenarios	Loss possible only if: source cannot replay (non-replayable), or DFS holding checkpoints loses data, or non-2PC sink emits before checkpoint commit.	The third scenario is the silent killer. A team enables exactly-once on state but uses a non-transactional JDBC sink and ships duplicates on every restart.

10 · Sharding / Parallelism (Key Groups)

Flink does not shard data on disk like a database; it shards keyed state across parallel subtasks via key groups. A key group is an intermediate hash bucket that lets state ranges map cleanly to subtasks on rescale.

Dimension	Behavior	Operational Reality
Sharding model	Hash on key into N key groups (max parallelism). Subtasks own a contiguous range of key groups. Default max parallelism is 128, configurable up to 32768.	Key groups are the immutable shard count. Rescaling reassigns ranges to subtasks; it does not change the number of key groups.
Shard key constraints	Any hashable type. `KeySelector` must be deterministic. Null keys go to a default group (anti-pattern).	Skewed keys (one customer takes 90% of traffic) concentrate on one subtask. No automatic split. Solution is application-level salting or upstream pre-aggregation.
Rebalancing mechanism	Savepoint with old parallelism, restore with new parallelism. The framework redistributes key group ranges automatically.	This is the pre-2.0 model. Requires job downtime. Flink 2.0 + ForSt enables online rescaling without state migration since state lives on DFS.
Rebalancing cost / impact	Pre-2.0: full job restart, downtime proportional to state size for RocksDB hydration. 2.0 + ForSt: near-instant since state is read from DFS lazily.	For a 1TB-state job pre-2.0, rescaling typically takes 5-15 minutes depending on DFS bandwidth. ForSt brings this into the seconds range.
Hot-shard behavior	A skewed key concentrates on one subtask. Throughput is capped by single-subtask processing. No automatic split.	Mitigation patterns: key salting (split-then-aggregate), local pre-aggregation, or two-phase aggregation with random keys for the first phase.
Maximum shards (practical)	32768 key groups max. Practical parallelism caps in the low thousands per job due to JM coordination overhead.	Above 2000 subtasks, JobManager checkpoint coordination becomes a bottleneck. File merging (1.20+) helps with metadata pressure.
Resharding without downtime?	Pre-2.0: no. 2.0 + ForSt + async execution: yes, near-instant.	Most production jobs are still on 1.x as of mid-2026. 2.0 adoption is gated by connector and operator support for async execution.
Cross-shard query support	Implicit via `keyBy` (re-partition + shuffle) or broadcast state (every subtask sees all rows). No global state query.	Need a global aggregate? Use `WindowAll` (single-parallel sink) or a downstream OLAP system (Pinot, Druid, ClickHouse) for cross-key queries.

11 · Replication / State Persistence

Replication in Flink lives at the durable storage layer, not in the runtime. State exists on one TaskManager at a time; durability comes from the DFS where checkpoints land.

Dimension	Behavior	Operational Reality
Replication topology	Single owner per key range at runtime. Checkpoints write state to durable DFS at intervals. No multi-leader, no quorum reads.	The runtime is master-of-key. Reads from external systems happen against the latest committed checkpoint via Queryable State (deprecated) or external sinks.
Sync vs async	Checkpoints are asynchronous: barrier injection is synchronous, but state upload to DFS proceeds in background. Operators continue processing during async upload.	Synchronous phase is alignment + state copy; asynchronous phase is upload. If async upload runs longer than the checkpoint interval, checkpoints overlap.
Replication factor	Inherits from DFS: S3 (11 nines durability), HDFS (typically 3x), GCS (multi-region options).	Flink does not control replication factor; it is a DFS concern. Setting RF=1 on the checkpoint DFS is a one-failure-from-data-loss configuration.
Consistency level options	Exactly-once (default) or at-least-once. At-least-once skips alignment and is faster but reprocesses some records on recovery.	At-least-once with idempotent sinks is a valid pattern. Exactly-once with non-transactional sinks is the most common silent at-least-once-end-to-end bug.
Replication lag (typical)	Equal to checkpoint interval, commonly 10 seconds to 5 minutes. ForSt streams state changes continuously, cutting effective lag to seconds.	Shorter intervals (under 10s) increase IOPS pressure on DFS, particularly S3 PUT request rate limits and multipart upload overhead.
Conflict resolution	N/A. Single writer per key by design. No conflict possible within a job.	Cross-job conflicts (two jobs writing to the same Kafka topic) are an application-level problem, not framework-resolvable.
Cross-region replication	Savepoint replication to secondary region’s DFS. Some teams use S3 cross-region replication for checkpoint buckets, but this adds RPO equal to S3 CRR lag.	S3 CRR is “minutes” SLA, not “seconds”. For sub-minute cross-region RPO, application-level Kafka replication (MirrorMaker 2 or Cluster Linking) is the answer, not Flink itself.
Replication during partition	If TaskManager loses contact with JobManager beyond heartbeat timeout, the JM marks it failed and triggers failover. The partitioned TM cannot commit checkpoints.	K8s network partitions during rolling node updates have historically caused spurious failovers. Heartbeat tuning (50s default) is the lever; pair with restart strategy that does not thrash.

12 · Better Usage Patterns

Patterns most engineers miss until they have shipped a few Flink jobs to production and read the post-mortems.

Pattern	What Most Teams Do Wrong	The Better Way	Why It Matters
Set max parallelism explicitly at job creation	Leave it as default 128. Hit the cap a year later when traffic grows.	Set max parallelism to a high composite-friendly number like 720, 1440, or 4320. Sets ceiling for rescaling.	Max parallelism is immutable. Changing it means rebuilding state. One config decision on day 1 saves a 2-week migration on year 2.
Profile before defaulting to RocksDB	Use RocksDB everywhere because “it’s the default for production”.	Use HashMap backend for low-state, latency-critical jobs (under 4GB per TM). RocksDB only when state genuinely exceeds heap.	Per-record state access in heap is sub-microsecond; in RocksDB it is tens of microseconds. For hot-path operators this is a 50x slowdown for state you did not need to persist that way.
Always set state TTL on keyed state	Forget TTL. State grows until savepoint becomes too large to restore.	Set `StateTtlConfig` with explicit retention. Use `cleanupInRocksdbCompactFilter()` to reclaim disk during compaction.	Without TTL, state size is monotonically increasing for non-bounded keyspaces. Even with TTL, observable reduction depends on RocksDB compaction, so set it early not as a recovery measure.
Use Application Mode for production K8s	Run a long-lived Session cluster with many jobs sharing it.	One Application Mode cluster per job. Job lifecycle equals cluster lifecycle.	Session clusters create resource isolation problems (noisy neighbors), version pinning problems (cluster must be on the lowest Flink version that all jobs support), and harder failure attribution. Application mode is the K8s-native pattern.
Use 2PC-capable sinks for exactly-once	Enable exactly-once checkpointing, pair with non-transactional JDBC sink, ship duplicates on every restart.	Use `KafkaSink` with `EXACTLY_ONCE`, Iceberg sink, or JDBC XA. Validate end-to-end exactly-once with a chaos test.	Framework-level exactly-once is necessary but not sufficient. End-to-end correctness requires sink cooperation. This is the silent correctness bug.
Configure RocksDB managed memory pool, not per-instance memory	Hand-tune RocksDB block cache, write buffer, and arena per operator. Drift from the framework’s managed memory accounting.	Let Flink’s managed memory pool sub-allocate to RocksDB instances per slot. Tune the total pool, not per-instance knobs.	The framework owns sizing across operators sharing a TM. Manual tuning drifts from this accounting and causes OOMs when operators rebalance.
Enable unaligned checkpoints only when needed	Turn it on globally because “it solves backpressure”.	Enable it only on jobs with sustained backpressure. Watch checkpoint size growth. Keep aligned for jobs with frequent savepoint upgrades.	Savepoints are always aligned and unaligned CPs only work with one concurrent checkpoint. Globally enabling unaligned slows your upgrade story.
Use side outputs instead of filter chains	Chain multiple `filter()` calls to split a stream into categories.	Use `ProcessFunction` with `OutputTag` side outputs. One pass over the stream, multiple outputs.	Filter chains re-iterate the same stream. Side outputs do a single pass. At high QPS the savings are linear.
Use Avro or POJO state serializers, never default Kryo	Use Kryo for “convenience”. Schema evolution silently breaks on field addition.	Explicit Avro schema or strict POJO with `@TypeInfo` annotations. Validate state evolution in CI with savepoint replay.	Kryo’s “evolution” support is by ordinal, not by name. Adding a field shifts ordinals and corrupts state. Avro evolution by name is the only safe pattern at scale.
Disable operator chaining temporarily to debug	Try to debug performance with chaining on, conflating multiple operators into one subtask metric.	Run pre-prod with `env.disableOperatorChaining()`. Find the bottleneck. Re-enable for production.	Chaining hides which operator is slow. Most performance investigations dead-end on chained tasks until someone splits them.
Use COMPILED PLAN for Flink SQL upgrades	Re-plan SQL queries on each Flink version, accept silent plan changes.	Compile the plan once with `COMPILE PLAN`, store as JSON, replay across versions. Test plan stability in CI before upgrade.	The Calcite optimizer can rewrite plans on minor version upgrade. Compiled plans pin the physical execution, making upgrades safe.
Use watermark alignment for skewed sources	Trust per-partition watermarks. One slow partition stalls global watermark and windows stop firing.	Enable `WatermarkStrategy.withWatermarkAlignment()` (1.15+) plus idleness timeouts.	Most “windows not firing” tickets are watermark stalls from skew, not bugs. Alignment is the framework-level fix.

13 · Next-Gen Alternatives

Where would you reach instead of Flink today, and what does each one actually improve?

Successor / Alternative	What It Improves	Maturity	Migration Cost	When To Consider
Flink 2.0 ForSt (within-family)	Disaggregated state on object storage, near-instant rescaling and recovery, no local-disk dependency. Same Flink runtime above.	Emerging	Low for new jobs. Existing jobs need async execution adoption and operator/connector support validation.	New cloud-native deployments, jobs with TB+ state that must rescale frequently, or anyone tired of stateful pods on K8s.
RisingWave	PostgreSQL-compatible streaming database written in Rust. State on object storage natively. SQL-first, no Java required. Materialized-view-as-pipeline model.	Production	High for jobs with custom Java operators or complex CEP. Low for SQL-only analytical pipelines.	Greenfield streaming analytics under 10 TB of state, SQL-only teams, or where Flink’s operational burden is the cost driver.
Materialize	Strict-serializable consistency, recursive CTEs, differential dataflow runtime. PostgreSQL wire protocol. Best correctness story for streaming joins.	Production	High. Different runtime model. Limited connector ecosystem versus Flink. Single-region for now.	Financial-grade correctness on streaming SQL, recursive computations (graph traversal, BOM explosion), or when consistency matters more than throughput.
Arroyo (now Cloudflare Pipelines)	Rust on DataFusion. Serverless via Cloudflare. Sub-second latency on stateless and lightly-stateful workloads.	Emerging	Moderate. Open source engine still available; managed path is Cloudflare-specific post-2025 acquisition.	Edge-tier ingest and transform, Cloudflare-native stacks, or teams that want streaming SQL without operating any infrastructure.
Kafka Streams	Lightweight Java library. No external cluster. Local state in RocksDB. Sits inside your microservice.	Production	Low for new Kafka-only pipelines. High for migration from Flink because of state model differences (per-task vs key-group).	Single-Kafka-cluster pipelines, microservice-embedded stream processing, teams without dedicated streaming ops.
Spark Structured Streaming	Better batch unification, mature DataFrame API, Databricks/EMR managed runtime. Microbatch model.	Production	Low if already on Spark/Databricks. High if you depend on Flink event-time semantics.	Existing Spark shops, hourly+ latency SLAs, workloads that are batch-y in shape and just need streaming for freshness.
ksqlDB	SQL over Kafka with managed lifecycle (Confluent). Lower ops than Flink for simple stream-stream joins.	Production	Moderate. Different consistency model (Kafka-only), more limited windowing.	Confluent-centric stacks, SQL pipelines that fit Kafka’s ordering model, teams that want managed-only.
Apache Beam on Flink runner	Portable pipeline definition that targets Flink, Spark, Dataflow without rewrite. Avoids lock-in.	Production	Low for new pipelines. Beam abstractions sometimes hide Flink-specific optimizations.	Multi-runtime portability is genuinely needed (e.g., on-prem Flink + GCP Dataflow). Otherwise direct Flink APIs give more control.

Flink Core Mechanics

Search this guide

01 · Cluster Architecture

Component responsibilities

02 · Execution Model

The four graphs

API layers

Networking and backpressure

03 · State Model & Backends

State types

State backends

04 · Checkpoint Protocol

Checkpoint variants

05 · Time & Watermarks

Watermark mechanics worth knowing

06 · Trade-Offs

07 · Use Cases

08 · Limitations

09 · Fault Tolerance

10 · Sharding / Parallelism (Key Groups)

11 · Replication / State Persistence

12 · Better Usage Patterns

13 · Next-Gen Alternatives