Siva Reddy AI
  • Blog
  • Resources
  • About
  • Contact
Back to Deep Dives

Apache Flink

Stateful stream processing at scale. Deep dive on architecture, then a Principal Engineer trade-off pass.

Stream Processing Framework

As of 2026-06-02 · Flink 2.2 (Dec 2025) is GA · 1.20 is LTS · 2.0 introduced disaggregated state

PE Verdict

Flink is the reference implementation for stateful, event-time, exactly-once stream processing. It earns that title with a true streaming runtime, mature checkpointing, and the largest connector ecosystem in the space. It pays for it with operational complexity that punishes teams without dedicated streaming platform engineers. Flink 2.0’s ForSt disaggregated state finally addresses its cloud-native weak spot, but the JVM-centric runtime, RocksDB tuning surface, and key-group-bound rescaling model still make new SQL-first streaming engines (RisingWave, Materialize, Arroyo) the right call for greenfield analytical workloads under 10 TB of state.

Flink Core Mechanics

ArchitectureJobManager, TaskManager, slots, deployment modes Execution ModelDataStream, SQL, operator chaining, backpressure State BackendsHashMap, RocksDB, ForSt, keyed vs operator state CheckpointingAligned, unaligned, incremental, changelog, 2PC sinks Time & WatermarksEvent time, watermark generation, idle partitions, alignment
Architecture Execution State Checkpoints Time Trade-Offs Use Cases Limitations Fault Tolerance Sharding Replication Better Patterns Next-Gen

01 · Cluster Architecture

Flink runs as a two-tier distributed system: a coordinator process (JobManager) and N worker processes (TaskManagers). Slots are the unit of resource isolation inside a TaskManager. Everything else (parallelism, state placement, checkpoint coordination) is layered on top of this model.

CLUSTER Client Submits JobGraph JobManager Dispatcher REST API, job submission ResourceManager Slot allocation JobMaster ExecutionGraph + CP coord HA via ZooKeeper or K8s leader election TaskManager 1 slot slot slot slot RocksDB / ForSt state Local disk / off-heap TaskManager 2 slot slot slot slot Credit-based netty Inter-TM shuffles Durable Storage (S3 / HDFS / GCS) Checkpoints, Savepoints, ForSt primary state Source: replayable log (Kafka, Kinesis, Pulsar)

Component responsibilities

ComponentWhat It DoesPE-Level Detail
DispatcherReceives job submissions over REST, spawns a JobMaster per job, exposes the Web UI.In Application Mode the Dispatcher runs the main() on the cluster, eliminating client-side JobGraph generation. Session Mode keeps one Dispatcher across many JobMasters.
ResourceManagerAllocates slots to JobMasters. Talks to YARN, Kubernetes, or standalone.Slot is the unit of scheduling, not CPU. One TaskManager exposes N slots, each gets 1/N of TM managed memory. Slot sharing groups let operator chains share a slot to cut serialization cost.
JobMasterOwns one job’s ExecutionGraph, schedules tasks, coordinates checkpoints, handles failover.This is the checkpoint coordinator. It injects barriers into sources, tracks per-operator ACKs, and commits the checkpoint metadata atomically. Loss of JobMaster without HA means job loss.
TaskManagerJVM worker process. Runs operator subtasks in slots. Owns local state, network buffers, RocksDB / ForSt.Heap is for user code and operator framework. Managed memory (off-heap) is for sorting, hash tables, RocksDB block cache. Network memory is for credit-based shuffles. Three pools, three tuning surfaces.
SlotFixed share of one TM’s managed memory plus one task thread per chained operator.Slots do not isolate CPU. Two slots on a busy TM compete for cores. This is the source of most “noisy neighbor” pain in shared session clusters.
ClientBuilds the JobGraph from user code, submits it to Dispatcher. Optional once submitted.Session Mode keeps the client around for tracking. Application Mode terminates the client after submission, which is mandatory for K8s-native deployments at scale.

02 · Execution Model

User code becomes a logical dataflow, which the optimizer rewrites into a JobGraph, which the JobMaster expands into an ExecutionGraph (one vertex per parallel subtask), which gets pinned to slots. Operator chaining fuses adjacent non-shuffling operators into a single thread to skip serialization.

The four graphs

GraphBuilt ByGranularityWhat Matters at PE Level
StreamGraphClient (DataStream API) or Table planner (SQL)Logical operatorsSQL queries go through Calcite plan, then a rule-based stream optimizer. Plan stability across versions is the production risk, which is why COMPILED PLAN exists.
JobGraphClientChained operatorsAdjacent operators with forward-only edges and matching parallelism fuse into one task. This is where disableChaining() calls show up to expose subtasks for backpressure debugging.
ExecutionGraphJobMasterPer-subtaskOne ExecutionVertex per parallel instance. This is what the Web UI shows and what the scheduler operates on. Region failover scopes resets to the pipelined region of failed vertices, not the whole job.
Physical GraphScheduler + RMSubtask to slotPinning happens here. Slot sharing groups, co-location constraints, and resource profiles all collapse into a placement decision the operator cannot easily inspect post-deploy.

API layers

DataStream API
Java / Scala. Low level. Direct access to keyed state, timers, side outputs, custom operators. Used when you need pinpoint control over state and time semantics.
Table API / Flink SQL
Declarative. Goes through Calcite. Supports CEP via MATCH_RECOGNIZE, session windows, named parameters (1.19+), per-operator state TTL via SQL hints. The optimizer can rewrite plans across versions, so COMPILED PLAN snapshots are the upgrade safety net.
PyFlink
Python. Runs UDFs in a Beam-portability sidecar over gRPC. Significantly slower than Java for stateful ops because state access crosses the language boundary. Acceptable for sources, sinks, and stateless transforms.
Stateful Functions
Function-as-a-service style API on top of Flink. Decouples logic from runtime. Largely superseded by Application Mode plus regular DataStream for most teams.
DataSet API
Removed in 2.0. Batch is unified under DataStream + Table API now.

Networking and backpressure

Inter-task data movement uses a credit-based flow control protocol over Netty. Downstream subtasks announce available buffer credits to upstream; upstream sends only as many buffers as credits allow. Backpressure is not a config knob, it is the emergent behavior of a slow consumer starving credit upstream.

The pathology: backpressure couples with aligned checkpoints. Barriers travel through network buffers in FIFO with data. A slow operator delays barrier arrival at downstream operators, alignment time grows, checkpoint duration explodes, and eventually checkpoints time out. Unaligned checkpoints (1.11+) cut barriers ahead of in-flight buffers and snapshot the buffers themselves, breaking the coupling at the cost of larger checkpoint size and a savepoint compatibility caveat.

03 · State Model & Backends

State is what makes Flink Flink. Three kinds of state and three backends, and the choice cascades into checkpoint speed, rescaling cost, and recovery time.

State types

State TypeScopeTypical UseRescaling Behavior
Keyed StatePer key, per operatorPer-user aggregates, joins, session windows, deduplicationReassigned by key group on parallelism change. Key groups are fixed at max parallelism, set at job start, immutable thereafter.
Operator StatePer operator subtaskSource offsets (Kafka consumer), sink buffers, broadcasted enrichmentRedistributed via ListState (even split) or UnionListState (full broadcast on restore, expensive at scale).
Broadcast StatePer operator, replicated to all subtasksDynamic rules, feature flags, slowly-changing dimensions joined to the main streamReplicated identically on every parallel instance. Useful but the broadcast side cannot be repartitioned mid-flight.

State backends

BackendStorageBest ForHard Cost
HashMapStateBackendJava heap, plain objectsSmall state (under a few GB per TM), latency-sensitive workloads, no serialization overhead per accessGC pressure. Heap fragments under churn. State size capped by heap. Full checkpoint always, no incremental.
EmbeddedRocksDBStateBackendRocksDB on local disk, off-heap block cacheLarge state (TB scale), incremental checkpoints, time-windowed aggregates with long retentionEvery state access is a serde + LSM lookup. Per-record latency goes from sub-microsecond (heap) to tens of microseconds. RocksDB tuning surface is large: write buffer, block cache, compaction style, bloom filters.
ForSt (Flink 2.0)Distributed file system as primary, local disk as cacheCloud-native deployments, fast rescaling, large state without local disk dependencyNetwork I/O for state access. Mitigated by async execution model and tiered cache. Without local cache, throughput drops around 48% on average for I/O-heavy queries on remote storage. With cache it approaches RocksDB parity.
Why ForSt matters. Pre-2.0, large state meant large local disks on TaskManagers, which meant stateful pods on Kubernetes (PVCs, anti-affinity, slow rescheduling). With ForSt, state lives on object storage and the local disk is just a hot cache. Flink 2.0 reports up to 94% checkpoint duration reduction, up to 49x faster recovery after failures or rescaling, and up to 50% cost savings in the disaggregated configuration. The catch is the asynchronous execution model required to hide DFS latency, which is opt-in and not yet covering every operator.

04 · Checkpoint Protocol

Flink’s checkpointing is a variant of Chandy-Lamport adapted for cyclic-free dataflows. The JobMaster injects a numbered barrier into each source. Each operator snapshots its state when it has received the barrier from every input channel, then forwards the barrier downstream. Once every sink ACKs the barrier, the checkpoint is committed.

ALIGNED (exactly-once, default) Source |B5 Op A |B5 Op B Op B waits for B5 from ALL inputs. Alignment time = backpressure time. UNALIGNED (cuts barrier ahead, snapshots in-flight) Source |B5+buf Op A Op B Barrier overtakes data. In-flight buffers snapshot with state. CP size grows. Barriers are special records. Aligned: alignment-time-bound. Unaligned: snapshot-size-bound.

Checkpoint variants

VariantMechanismWhen to UseCost
Aligned (default)Operator waits for barrier on every input channel before snapshotting.Healthy job with no sustained backpressure. Smaller checkpoint size.Backpressure couples directly into checkpoint duration. Under backpressure, checkpoints may take longer to complete or time out completely.
Unaligned (1.11+)Barrier cuts ahead of in-flight buffers; the buffers are snapshotted as part of operator state.Persistent backpressure where aligned CPs are timing out. Stateless or low-state operators benefit most.Checkpoint size grows by in-flight buffer volume. Only works for exactly-once and with one concurrent checkpoint. Savepoints cannot be unaligned.
Incremental (RocksDB)Only uploads RocksDB SSTables changed since last checkpoint.Large state, frequent checkpoints. Default and correct for almost all RocksDB jobs.RocksDB compaction creates “new” files even for unchanged data, occasionally blowing up incremental delta size. Background manual compaction in 1.20 mitigates.
Generic Log-based (changelog, 1.15+)Operators write state changes to a durable changelog continuously; materialization happens in the background.When checkpoint duration must be predictable and short (sub-second p99). Cost is paid continuously, not at CP time.Doubles write amplification. Extra DFS traffic. Worth it only when CP duration variance is the production pain.
SavepointManually triggered, version-stable, format-independent snapshot.Upgrades, parallelism changes, code changes, A/B job swaps.Always aligned. Cannot be unaligned even with the feature enabled. Slower than checkpoints. Holds state in the canonical format, not RocksDB native.
File Merging (1.20+)Combines many small checkpoint files into fewer larger ones.Large parallelism (thousands of subtasks) where filesystem metadata is the bottleneck.MVP feature, opt-in via execution.checkpointing.file-merging.enabled. Watch for edge cases on retention.
2PC sinks are the other half of exactly-once. Checkpoint guarantees state consistency. End-to-end exactly-once also requires the sink to participate. Kafka’s KafkaSink with EXACTLY_ONCE uses transactional producers. The producer transaction commits only when the corresponding checkpoint commits, with a 2PC handshake driven by the JobMaster. Get this wrong and you have exactly-once on state but at-least-once on output, which is the most common silent correctness bug in Flink deployments.

05 · Time & Watermarks

Flink supports three notions of time. Event time is the only one that gives correct results under out-of-order data, which is most production data. Event time relies on watermarks: monotonic timestamps that assert “no more data with timestamp earlier than this will arrive”. Watermarks are how Flink fires windows, expires timers, and handles late data.

Time ModeSource of TimeCorrectness PropertyProduction Risk
Event timeTimestamp embedded in the record.Deterministic results across reprocessing. Same input → same windows → same output.Skewed sources stall watermarks. One slow Kafka partition can freeze global watermark and block window firing across the entire job.
Processing timeWall clock on the TaskManager processing the record.Lowest latency, no watermark machinery, no late data.Non-deterministic. Reprocessing a backlog produces different windows than the live run. Most teams regret choosing this within 6 months.
Ingestion timeWall clock at source operator.Middle ground. Approximate event-time with no per-record timestamp needed.Mostly deprecated in practice. If you want event time, use event time.

Watermark mechanics worth knowing

Generation
Per-source-partition watermark, taken as the minimum across partitions for downstream. Idle partition detection (1.11+) lets stalled partitions opt out instead of holding global watermark hostage.
Allowed lateness
Window-level grace period after watermark passes window end. Late records reopen the window and re-emit results. Costs state retention for the lateness window.
Side output for late data
Records arriving after allowed lateness can be diverted to a side output instead of dropped. The right pattern for forensic correctness without inflating main-path state.
Watermark alignment (1.15+)
Coordinates watermark progress across source subtasks to prevent one fast partition from racing ahead and forcing huge buffers on slow ones. Mandatory for backpressure-sensitive workloads.

06 · Trade-Offs

Click any column header to sort. Trade-offs are stated as “give up X to get Y” with the operational scenario where the cost actually bites.

Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
True streaming runtime over microbatch Per-record latency in low tens of ms, event-time semantics, no batch boundary artifacts Throughput-per-core lower than Spark Structured Streaming on stateless workloads, more complex resource accounting Hourly aggregation jobs over Parquet where Spark would just be simpler and cheaper on managed Databricks Teams pick Flink for “real-time” and then run 90% of their workload as 5-minute tumbling windows. If the SLA tolerates 5 minutes, you are paying streaming-runtime tax for batch outcomes.
Exactly-once via aligned checkpoints Correctness guarantee across state and 2PC sinks, no application-level dedup Checkpoint duration coupled to backpressure, end-to-end exactly-once requires sink cooperation (Kafka 2PC, Iceberg, JDBC XA) Sustained backpressure makes aligned checkpoints time out; you flip to unaligned and savepoints take longer for upgrades Most “exactly-once” Flink jobs in production are at-least-once at the sink because the team used a non-transactional connector. The semantic is end-to-end, not framework-only.
RocksDB state backend over heap TB-scale state per TM, incremental checkpoints, predictable memory footprint via managed memory Per-state-access latency goes from sub-microsecond to tens of microseconds, large tuning surface (write buffer, block cache, compaction) Per-record state access on the hot path (e.g., session keying), where heap would be 50x faster and the state actually fits The default of RocksDB is correct for the common case but wrong for low-state, latency-critical jobs. Profile first. HashMap backend is not deprecated, it is underused.
Key groups for hash-partitioned state Deterministic state placement, parallel recovery, sub-second key routing Max parallelism is fixed at job submission. Going past it requires a state-incompatible rewrite Year 2 traffic outgrows the parallelism ceiling you set on day 1. Now you need a savepoint, manual repartitioning, and a release Always set max parallelism explicitly to something like 720 or 1440 (composite-friendly numbers). Defaults of 128 cap you fast at any real scale.
JVM-centric runtime Mature ecosystem, deep observability via JMX, vast operator and connector library GC pauses bite at 100GB+ heaps, native ML/SIMD workloads must JNI out, Python is a second-class citizen P99 latency spikes from G1 mixed GCs at 8 AM traffic ramp, masking real bottlenecks ZGC and Shenandoah are the right defaults at scale, not G1. Most “Flink is slow” tickets resolve to a GC config from 2018.
Event-time semantics with watermarks Deterministic reprocessing, correct windows over out-of-order data, replay safety Watermark sensitivity to skewed sources, per-source-partition idleness tuning, late-data accounting One Kafka partition with no producers freezes global watermark; windows stop firing; alerts trip Watermark alignment (1.15+) and idle-partition timeouts are not optional at scale. Most “windows not firing” tickets are watermark stalls, not bugs.
Mature connector ecosystem Kafka, Kinesis, Pulsar, JDBC, Iceberg, Delta, HBase, ES, Pinot, CDC connectors covering most production stacks Quality varies wildly. Some connectors lag Flink versions, others have correctness issues on edge cases Two-quarter blocker waiting for a connector to support a new Flink minor, or hitting a connector deadlock under specific shutdown sequences The first-party Kafka and Iceberg connectors are production-grade. Treat community connectors as alpha until you have validated checkpoint + restore + 2PC under chaos testing.
Slot sharing for operator chaining Skips serialization between chained operators, slashes per-record cost for forward edges CPU contention inside slots, harder backpressure isolation, debugging requires manually un-chaining Production performance regression that disappears the moment you disableChaining() to see what is slow Slot sharing groups are powerful for cost but they hide the truth. For new jobs, deploy with chaining disabled in pre-prod, find the bottleneck, then re-enable.
Coupled compute and state (pre-2.0) Sub-microsecond state access from local disk, predictable RocksDB performance Stateful pods, slow rescaling (download all state on scale-up), local disk dependency on K8s Autoscaling event takes 15 minutes for a 200GB-per-TM job because the new TMs must hydrate from S3 first This is exactly why Flink 2.0 ForSt exists. If you are designing a new cloud-native deployment, evaluate ForSt now; the pre-2.0 model is a structural mismatch for elastic infra.

07 · Use Cases

Production deployments at the top of the scale curve. Numbers are from public engineering posts and conference talks.

Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
Real-time recommendation and transaction processing Alibaba (Blink fork, now upstreamed) Sub-100ms p99 stateful joins across user, product, ad streams, exactly-once during Singles’ Day peaks Alibaba processes 40 billion events per day on Flink Spark Structured Streaming microbatch boundaries violated the latency SLA. Kafka Streams lacked the multi-stream join and event-time depth.
Streaming SQL platform for analysts Uber AthenaX SQL-first stream interface so non-engineers can ship streaming queries against canonical event streams Trillion messages per day across the AthenaX platform Custom DSLs created onboarding friction; Spark SQL did not give true event-time. Flink SQL hit the sweet spot for self-serve.
Ad-hoc stream processing at platform layer Netflix Keystone Multi-tenant stream processing where teams ship jobs without owning infrastructure, with isolation per job Hundreds of jobs, mission-critical recommendation features, low-latency event routing In-house Mantis covered some cases; Flink was chosen for stateful workloads that Mantis was not designed for.
Streaming fraud and risk detection Stripe (5+ years on Flink), ING Exactly-once stateful CEP over payment streams with sub-second decision SLAs and replayable audit trail Stripe scaled Flink across multiple regions and hundreds of jobs Kafka Streams could not handle the multi-stream join complexity at the required latency. Custom CEP frameworks lacked Flink’s checkpointing.
Real-time experimentation and analytics Pinterest Xenon platform Event aggregation for experimentation with windowed metrics, deduplication, and low-latency dashboards Pinterest runs Flink in production with self-service diagnosis tools and custom deployment frameworks Batch experimentation pipelines lagged 24 hours; Spark Streaming had operator state limits Xenon hit.
Real-time gaming and personalization King (Candy Crush parent) Stateful processing of player events for in-game personalization and anti-cheat King handles 30 billion daily events across 200+ countries Kappa-style replay over batch could not meet the in-session personalization latency. Custom code would not have given exactly-once for free.
Telco real-time analytics Bouygues Telecom Sub-second event-time windowing over network telemetry with strict SLA for operations dashboards 30+ production Flink apps processing more than 10 billion raw events per day Apache Storm (predecessor) lacked stateful CEP and exactly-once. Spark Streaming microbatch latency was too high.
End-to-end exactly-once ad event processing Uber Ad Events (Flink + Kafka + Pinot) Exactly-once stateful enrichment and dedup of ad impressions with billing-grade correctness Production at Uber ad-platform scale with audited revenue impact Application-level dedup at downstream services would have added 30%+ infra cost and is correctness-fragile at peak load.

08 · Limitations

Real ceilings teams hit in production, with the standard workaround and what the workaround actually costs.

LimitationSeverityWorkaroundWorkaround Cost
Max parallelism is immutable after first job submission High Set high max parallelism (720, 1440) upfront. If exceeded, savepoint, rewrite key partitioning, restart from clean state State-incompatible rewrite. In practice, jobs hit this and accept a hard restart or rebuild state from source.
Rescaling requires savepoint + restore (pre-2.0) High Use Reactive Mode (1.13+) or migrate to Flink 2.0 + ForSt for instant rescaling Reactive Mode is opt-in and not production-default; ForSt requires async execution and 2.0 minor version.
Local disk dependency on Kubernetes for RocksDB High Use PVCs with fast SSD, or move to ForSt with object storage as primary PVCs slow pod scheduling. ForSt is opt-in and not all operators support async execution.
Backpressure propagates into checkpoint duration High Enable unaligned checkpoints, enable buffer debloating, fix the actual backpressure source Unaligned CP size grows; only works for exactly-once with one concurrent checkpoint; savepoints stay aligned and slower.
State schema evolution is fragile for non-Avro/POJO state High Use Avro or POJO state with explicit schema, never Kryo default serializers for production state Refactoring legacy jobs to Avro state is a multi-sprint effort with savepoint-replay validation.
Watermark stalls from idle source partitions Medium Use WatermarkStrategy.withIdleness() and watermark alignment (1.15+) Idleness timeout must be tuned per source; too short causes premature firing, too long causes stalls. Requires per-job analysis.
PyFlink is significantly slower than Java for stateful operators Medium Keep stateful logic in Java/Scala; use PyFlink only for sources, sinks, and stateless transforms Mixed-language jobs add deployment and debugging complexity. Pure Python shops face a 2-5x throughput penalty.
Connector quality varies; some community connectors lag Flink versions Medium Validate connector under chaos testing (restart + replay + 2PC commit) before production Adds 1-2 weeks per connector to validation cycle. Some teams maintain forks.
Job graph compilation time grows with SQL plan size Medium Split very large SQL pipelines into multiple jobs; use COMPILED PLAN to cache the optimized plan across upgrades Multi-job pipelines need explicit topic-as-boundary contracts. Compiled plans must be tested across Flink versions on upgrade.
Window state grows unboundedly without TTL Critical Set StateTtlConfig on all keyed state. Use session windows over global windows for unbounded keyspaces TTL with incremental checkpoints is based on RocksDB compaction, so state size reduction is only observed after compaction occurs. Misleading checkpoint metrics during cleanup.
JobManager is a SPOF without HA configured Critical Enable HA via ZooKeeper or Kubernetes leader election; store checkpoint metadata on durable DFS HA adds operational dependency (ZK cluster) or K8s ConfigMap reliance. Misconfigured HA can split-brain on K8s rolling updates.
Operator state under UnionListState broadcasts on restore Medium Avoid UnionListState at high parallelism; prefer keyed state for genuinely keyed data Some legacy sources still use it. Migration requires connector cooperation.

09 · Fault Tolerance

Flink’s fault tolerance is checkpoint-and-restart, not in-memory replication. Recovery time is bounded by state size plus source replay, not by RPO/RTO of a replicated quorum.

DimensionBehaviorOperational Reality
Replication modelState snapshotted asynchronously to durable DFS at checkpoint intervals. No in-memory state replicas across TMs.Durability inherits from the DFS (S3 11 nines, HDFS configurable replication). Loss between checkpoints is unrecoverable from Flink, must be replayed from source.
Failure detectionTaskManager heartbeats to JobManager. Default timeout 50 seconds, tunable via heartbeat.timeout.Tuning lower than 30s causes false positives under GC; longer delays detection. Set 20-50s and pair with appropriate restart strategy.
Failover mechanismThree strategies: full job restart, region failover (since 1.9), or individual failover for embarrassingly parallel jobs.Region failover is the right default for most jobs. Limits blast radius to the connected pipelined region around the failed task, not the whole job graph.
RTO (typical)Seconds for small state (under 10GB), minutes for large state due to RocksDB hydration from DFS. Flink 2.0 ForSt cuts this dramatically.Flink 2.0 reports up to 49x faster recovery after failures or rescaling with the disaggregated state model, since state is read from DFS on demand.
RPO (typical)Zero data loss if source is replayable. Effective RPO is the checkpoint interval (commonly 10s to 5min); work between checkpoints is replayed, not lost.Non-replayable sources (sockets, HTTP push) make RPO non-zero. Always pair Flink with a replayable log (Kafka, Kinesis, Pulsar, file-based source).
Split-brain behaviorJobManager HA uses ZooKeeper or K8s leader election. Only one active JobMaster per job at any time. TaskManagers refuse to connect to a non-leader.K8s ConfigMap-based leader election has known races under rolling restarts. ZK-based HA is more battle-tested at large scale.
Blast radius of single-node failureWith region failover: scoped to the connected pipelined region of failed subtasks. With full restart: entire job.Wide shuffles (keyBy, rebalance) expand the region. Narrow forward edges keep regions small. Operator placement matters more than most teams realize.
Cross-region failover storyNo native multi-region. Pattern is savepoint replication to secondary region’s DFS plus a standby cluster that resumes from the latest savepoint on disaster.Operationally non-trivial. Savepoint frequency determines cross-region RPO. Most teams settle for warm-standby with 5-15 minute savepoint cadence.
Data loss scenariosLoss possible only if: source cannot replay (non-replayable), or DFS holding checkpoints loses data, or non-2PC sink emits before checkpoint commit.The third scenario is the silent killer. A team enables exactly-once on state but uses a non-transactional JDBC sink and ships duplicates on every restart.

10 · Sharding / Parallelism (Key Groups)

Flink does not shard data on disk like a database; it shards keyed state across parallel subtasks via key groups. A key group is an intermediate hash bucket that lets state ranges map cleanly to subtasks on rescale.

DimensionBehaviorOperational Reality
Sharding modelHash on key into N key groups (max parallelism). Subtasks own a contiguous range of key groups. Default max parallelism is 128, configurable up to 32768.Key groups are the immutable shard count. Rescaling reassigns ranges to subtasks; it does not change the number of key groups.
Shard key constraintsAny hashable type. KeySelector must be deterministic. Null keys go to a default group (anti-pattern).Skewed keys (one customer takes 90% of traffic) concentrate on one subtask. No automatic split. Solution is application-level salting or upstream pre-aggregation.
Rebalancing mechanismSavepoint with old parallelism, restore with new parallelism. The framework redistributes key group ranges automatically.This is the pre-2.0 model. Requires job downtime. Flink 2.0 + ForSt enables online rescaling without state migration since state lives on DFS.
Rebalancing cost / impactPre-2.0: full job restart, downtime proportional to state size for RocksDB hydration. 2.0 + ForSt: near-instant since state is read from DFS lazily.For a 1TB-state job pre-2.0, rescaling typically takes 5-15 minutes depending on DFS bandwidth. ForSt brings this into the seconds range.
Hot-shard behaviorA skewed key concentrates on one subtask. Throughput is capped by single-subtask processing. No automatic split.Mitigation patterns: key salting (split-then-aggregate), local pre-aggregation, or two-phase aggregation with random keys for the first phase.
Maximum shards (practical)32768 key groups max. Practical parallelism caps in the low thousands per job due to JM coordination overhead.Above 2000 subtasks, JobManager checkpoint coordination becomes a bottleneck. File merging (1.20+) helps with metadata pressure.
Resharding without downtime?Pre-2.0: no. 2.0 + ForSt + async execution: yes, near-instant.Most production jobs are still on 1.x as of mid-2026. 2.0 adoption is gated by connector and operator support for async execution.
Cross-shard query supportImplicit via keyBy (re-partition + shuffle) or broadcast state (every subtask sees all rows). No global state query.Need a global aggregate? Use WindowAll (single-parallel sink) or a downstream OLAP system (Pinot, Druid, ClickHouse) for cross-key queries.

11 · Replication / State Persistence

Replication in Flink lives at the durable storage layer, not in the runtime. State exists on one TaskManager at a time; durability comes from the DFS where checkpoints land.

DimensionBehaviorOperational Reality
Replication topologySingle owner per key range at runtime. Checkpoints write state to durable DFS at intervals. No multi-leader, no quorum reads.The runtime is master-of-key. Reads from external systems happen against the latest committed checkpoint via Queryable State (deprecated) or external sinks.
Sync vs asyncCheckpoints are asynchronous: barrier injection is synchronous, but state upload to DFS proceeds in background. Operators continue processing during async upload.Synchronous phase is alignment + state copy; asynchronous phase is upload. If async upload runs longer than the checkpoint interval, checkpoints overlap.
Replication factorInherits from DFS: S3 (11 nines durability), HDFS (typically 3x), GCS (multi-region options).Flink does not control replication factor; it is a DFS concern. Setting RF=1 on the checkpoint DFS is a one-failure-from-data-loss configuration.
Consistency level optionsExactly-once (default) or at-least-once. At-least-once skips alignment and is faster but reprocesses some records on recovery.At-least-once with idempotent sinks is a valid pattern. Exactly-once with non-transactional sinks is the most common silent at-least-once-end-to-end bug.
Replication lag (typical)Equal to checkpoint interval, commonly 10 seconds to 5 minutes. ForSt streams state changes continuously, cutting effective lag to seconds.Shorter intervals (under 10s) increase IOPS pressure on DFS, particularly S3 PUT request rate limits and multipart upload overhead.
Conflict resolutionN/A. Single writer per key by design. No conflict possible within a job.Cross-job conflicts (two jobs writing to the same Kafka topic) are an application-level problem, not framework-resolvable.
Cross-region replicationSavepoint replication to secondary region’s DFS. Some teams use S3 cross-region replication for checkpoint buckets, but this adds RPO equal to S3 CRR lag.S3 CRR is “minutes” SLA, not “seconds”. For sub-minute cross-region RPO, application-level Kafka replication (MirrorMaker 2 or Cluster Linking) is the answer, not Flink itself.
Replication during partitionIf TaskManager loses contact with JobManager beyond heartbeat timeout, the JM marks it failed and triggers failover. The partitioned TM cannot commit checkpoints.K8s network partitions during rolling node updates have historically caused spurious failovers. Heartbeat tuning (50s default) is the lever; pair with restart strategy that does not thrash.

12 · Better Usage Patterns

Patterns most engineers miss until they have shipped a few Flink jobs to production and read the post-mortems.

PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
Set max parallelism explicitly at job creation Leave it as default 128. Hit the cap a year later when traffic grows. Set max parallelism to a high composite-friendly number like 720, 1440, or 4320. Sets ceiling for rescaling. Max parallelism is immutable. Changing it means rebuilding state. One config decision on day 1 saves a 2-week migration on year 2.
Profile before defaulting to RocksDB Use RocksDB everywhere because “it’s the default for production”. Use HashMap backend for low-state, latency-critical jobs (under 4GB per TM). RocksDB only when state genuinely exceeds heap. Per-record state access in heap is sub-microsecond; in RocksDB it is tens of microseconds. For hot-path operators this is a 50x slowdown for state you did not need to persist that way.
Always set state TTL on keyed state Forget TTL. State grows until savepoint becomes too large to restore. Set StateTtlConfig with explicit retention. Use cleanupInRocksdbCompactFilter() to reclaim disk during compaction. Without TTL, state size is monotonically increasing for non-bounded keyspaces. Even with TTL, observable reduction depends on RocksDB compaction, so set it early not as a recovery measure.
Use Application Mode for production K8s Run a long-lived Session cluster with many jobs sharing it. One Application Mode cluster per job. Job lifecycle equals cluster lifecycle. Session clusters create resource isolation problems (noisy neighbors), version pinning problems (cluster must be on the lowest Flink version that all jobs support), and harder failure attribution. Application mode is the K8s-native pattern.
Use 2PC-capable sinks for exactly-once Enable exactly-once checkpointing, pair with non-transactional JDBC sink, ship duplicates on every restart. Use KafkaSink with EXACTLY_ONCE, Iceberg sink, or JDBC XA. Validate end-to-end exactly-once with a chaos test. Framework-level exactly-once is necessary but not sufficient. End-to-end correctness requires sink cooperation. This is the silent correctness bug.
Configure RocksDB managed memory pool, not per-instance memory Hand-tune RocksDB block cache, write buffer, and arena per operator. Drift from the framework’s managed memory accounting. Let Flink’s managed memory pool sub-allocate to RocksDB instances per slot. Tune the total pool, not per-instance knobs. The framework owns sizing across operators sharing a TM. Manual tuning drifts from this accounting and causes OOMs when operators rebalance.
Enable unaligned checkpoints only when needed Turn it on globally because “it solves backpressure”. Enable it only on jobs with sustained backpressure. Watch checkpoint size growth. Keep aligned for jobs with frequent savepoint upgrades. Savepoints are always aligned and unaligned CPs only work with one concurrent checkpoint. Globally enabling unaligned slows your upgrade story.
Use side outputs instead of filter chains Chain multiple filter() calls to split a stream into categories. Use ProcessFunction with OutputTag side outputs. One pass over the stream, multiple outputs. Filter chains re-iterate the same stream. Side outputs do a single pass. At high QPS the savings are linear.
Use Avro or POJO state serializers, never default Kryo Use Kryo for “convenience”. Schema evolution silently breaks on field addition. Explicit Avro schema or strict POJO with @TypeInfo annotations. Validate state evolution in CI with savepoint replay. Kryo’s “evolution” support is by ordinal, not by name. Adding a field shifts ordinals and corrupts state. Avro evolution by name is the only safe pattern at scale.
Disable operator chaining temporarily to debug Try to debug performance with chaining on, conflating multiple operators into one subtask metric. Run pre-prod with env.disableOperatorChaining(). Find the bottleneck. Re-enable for production. Chaining hides which operator is slow. Most performance investigations dead-end on chained tasks until someone splits them.
Use COMPILED PLAN for Flink SQL upgrades Re-plan SQL queries on each Flink version, accept silent plan changes. Compile the plan once with COMPILE PLAN, store as JSON, replay across versions. Test plan stability in CI before upgrade. The Calcite optimizer can rewrite plans on minor version upgrade. Compiled plans pin the physical execution, making upgrades safe.
Use watermark alignment for skewed sources Trust per-partition watermarks. One slow partition stalls global watermark and windows stop firing. Enable WatermarkStrategy.withWatermarkAlignment() (1.15+) plus idleness timeouts. Most “windows not firing” tickets are watermark stalls from skew, not bugs. Alignment is the framework-level fix.

13 · Next-Gen Alternatives

Where would you reach instead of Flink today, and what does each one actually improve?

Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
Flink 2.0 ForSt (within-family) Disaggregated state on object storage, near-instant rescaling and recovery, no local-disk dependency. Same Flink runtime above. Emerging Low for new jobs. Existing jobs need async execution adoption and operator/connector support validation. New cloud-native deployments, jobs with TB+ state that must rescale frequently, or anyone tired of stateful pods on K8s.
RisingWave PostgreSQL-compatible streaming database written in Rust. State on object storage natively. SQL-first, no Java required. Materialized-view-as-pipeline model. Production High for jobs with custom Java operators or complex CEP. Low for SQL-only analytical pipelines. Greenfield streaming analytics under 10 TB of state, SQL-only teams, or where Flink’s operational burden is the cost driver.
Materialize Strict-serializable consistency, recursive CTEs, differential dataflow runtime. PostgreSQL wire protocol. Best correctness story for streaming joins. Production High. Different runtime model. Limited connector ecosystem versus Flink. Single-region for now. Financial-grade correctness on streaming SQL, recursive computations (graph traversal, BOM explosion), or when consistency matters more than throughput.
Arroyo (now Cloudflare Pipelines) Rust on DataFusion. Serverless via Cloudflare. Sub-second latency on stateless and lightly-stateful workloads. Emerging Moderate. Open source engine still available; managed path is Cloudflare-specific post-2025 acquisition. Edge-tier ingest and transform, Cloudflare-native stacks, or teams that want streaming SQL without operating any infrastructure.
Kafka Streams Lightweight Java library. No external cluster. Local state in RocksDB. Sits inside your microservice. Production Low for new Kafka-only pipelines. High for migration from Flink because of state model differences (per-task vs key-group). Single-Kafka-cluster pipelines, microservice-embedded stream processing, teams without dedicated streaming ops.
Spark Structured Streaming Better batch unification, mature DataFrame API, Databricks/EMR managed runtime. Microbatch model. Production Low if already on Spark/Databricks. High if you depend on Flink event-time semantics. Existing Spark shops, hourly+ latency SLAs, workloads that are batch-y in shape and just need streaming for freshness.
ksqlDB SQL over Kafka with managed lifecycle (Confluent). Lower ops than Flink for simple stream-stream joins. Production Moderate. Different consistency model (Kafka-only), more limited windowing. Confluent-centric stacks, SQL pipelines that fit Kafka’s ordering model, teams that want managed-only.
Apache Beam on Flink runner Portable pipeline definition that targets Flink, Spark, Dataflow without rewrite. Avoids lock-in. Production Low for new pipelines. Beam abstractions sometimes hide Flink-specific optimizations. Multi-runtime portability is genuinely needed (e.g., on-prem Flink + GCP Dataflow). Otherwise direct Flink APIs give more control.

© 2026 Siva Reddy. All rights reserved.