Apache Flink
Stateful stream processing at scale. Deep dive on architecture, then a Principal Engineer trade-off pass.
Stream Processing Framework
As of 2026-06-02 · Flink 2.2 (Dec 2025) is GA · 1.20 is LTS · 2.0 introduced disaggregated state
Flink is the reference implementation for stateful, event-time, exactly-once stream processing. It earns that title with a true streaming runtime, mature checkpointing, and the largest connector ecosystem in the space. It pays for it with operational complexity that punishes teams without dedicated streaming platform engineers. Flink 2.0’s ForSt disaggregated state finally addresses its cloud-native weak spot, but the JVM-centric runtime, RocksDB tuning surface, and key-group-bound rescaling model still make new SQL-first streaming engines (RisingWave, Materialize, Arroyo) the right call for greenfield analytical workloads under 10 TB of state.
Flink Core Mechanics
01 · Cluster Architecture
Flink runs as a two-tier distributed system: a coordinator process (JobManager) and N worker processes (TaskManagers). Slots are the unit of resource isolation inside a TaskManager. Everything else (parallelism, state placement, checkpoint coordination) is layered on top of this model.
Component responsibilities
| Component | What It Does | PE-Level Detail |
|---|---|---|
| Dispatcher | Receives job submissions over REST, spawns a JobMaster per job, exposes the Web UI. | In Application Mode the Dispatcher runs the main() on the cluster, eliminating client-side JobGraph generation. Session Mode keeps one Dispatcher across many JobMasters. |
| ResourceManager | Allocates slots to JobMasters. Talks to YARN, Kubernetes, or standalone. | Slot is the unit of scheduling, not CPU. One TaskManager exposes N slots, each gets 1/N of TM managed memory. Slot sharing groups let operator chains share a slot to cut serialization cost. |
| JobMaster | Owns one job’s ExecutionGraph, schedules tasks, coordinates checkpoints, handles failover. | This is the checkpoint coordinator. It injects barriers into sources, tracks per-operator ACKs, and commits the checkpoint metadata atomically. Loss of JobMaster without HA means job loss. |
| TaskManager | JVM worker process. Runs operator subtasks in slots. Owns local state, network buffers, RocksDB / ForSt. | Heap is for user code and operator framework. Managed memory (off-heap) is for sorting, hash tables, RocksDB block cache. Network memory is for credit-based shuffles. Three pools, three tuning surfaces. |
| Slot | Fixed share of one TM’s managed memory plus one task thread per chained operator. | Slots do not isolate CPU. Two slots on a busy TM compete for cores. This is the source of most “noisy neighbor” pain in shared session clusters. |
| Client | Builds the JobGraph from user code, submits it to Dispatcher. Optional once submitted. | Session Mode keeps the client around for tracking. Application Mode terminates the client after submission, which is mandatory for K8s-native deployments at scale. |
02 · Execution Model
User code becomes a logical dataflow, which the optimizer rewrites into a JobGraph, which the JobMaster expands into an ExecutionGraph (one vertex per parallel subtask), which gets pinned to slots. Operator chaining fuses adjacent non-shuffling operators into a single thread to skip serialization.
The four graphs
| Graph | Built By | Granularity | What Matters at PE Level |
|---|---|---|---|
| StreamGraph | Client (DataStream API) or Table planner (SQL) | Logical operators | SQL queries go through Calcite plan, then a rule-based stream optimizer. Plan stability across versions is the production risk, which is why COMPILED PLAN exists. |
| JobGraph | Client | Chained operators | Adjacent operators with forward-only edges and matching parallelism fuse into one task. This is where disableChaining() calls show up to expose subtasks for backpressure debugging. |
| ExecutionGraph | JobMaster | Per-subtask | One ExecutionVertex per parallel instance. This is what the Web UI shows and what the scheduler operates on. Region failover scopes resets to the pipelined region of failed vertices, not the whole job. |
| Physical Graph | Scheduler + RM | Subtask to slot | Pinning happens here. Slot sharing groups, co-location constraints, and resource profiles all collapse into a placement decision the operator cannot easily inspect post-deploy. |
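The chaining rule from the JobGraph row can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Flink's planner: the `Op` tuple, the `chain` function, and the restriction to a linear pipeline are hypothetical, and the real rule also looks at chaining strategy and slot sharing group.

```python
from typing import List, NamedTuple

class Op(NamedTuple):
    name: str
    parallelism: int
    edge_in: str  # how records reach this operator: "none", "forward", "hash", "rebalance"

def chain(pipeline: List[Op]) -> List[List[str]]:
    """Group a linear pipeline into tasks. An operator joins the previous
    task only if its input edge is forward and the parallelism matches."""
    tasks: List[List[str]] = []
    prev = None
    for op in pipeline:
        fusable = (
            prev is not None
            and op.edge_in == "forward"
            and op.parallelism == prev.parallelism
        )
        if fusable:
            tasks[-1].append(op.name)
        else:
            tasks.append([op.name])
        prev = op
    return tasks

pipeline = [
    Op("source", 4, "none"),
    Op("map", 4, "forward"),
    Op("filter", 4, "forward"),
    Op("window", 4, "hash"),      # keyBy introduces a shuffle, so the chain breaks
    Op("sink", 2, "rebalance"),   # parallelism changes, so the chain breaks again
]
tasks = chain(pipeline)
print(tasks)  # [['source', 'map', 'filter'], ['window'], ['sink']]
```

Each inner list is one task, so its operators run in one thread per subtask and report as a single unit, which is why un-chaining is needed to see per-operator backpressure.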
API layers
- DataStream API
- Java / Scala. Low level. Direct access to keyed state, timers, side outputs, custom operators. Used when you need pinpoint control over state and time semantics.
- Table API / Flink SQL
- Declarative. Goes through Calcite. Supports CEP via MATCH_RECOGNIZE, session windows, named parameters (1.19+), per-operator state TTL via SQL hints. The optimizer can rewrite plans across versions, so COMPILED PLAN snapshots are the upgrade safety net.
- PyFlink
- Python. Runs UDFs in a Beam-portability sidecar over gRPC. Significantly slower than Java for stateful ops because state access crosses the language boundary. Acceptable for sources, sinks, and stateless transforms.
- Stateful Functions
- Function-as-a-service style API on top of Flink. Decouples logic from runtime. Largely superseded by Application Mode plus regular DataStream for most teams.
- DataSet API
- Removed in 2.0. Batch is unified under DataStream + Table API now.
Networking and backpressure
Inter-task data movement uses a credit-based flow control protocol over Netty. Downstream subtasks announce available buffer credits to upstream; upstream sends only as many buffers as credits allow. Backpressure is not a config knob, it is the emergent behavior of a slow consumer starving credit upstream.
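A small simulation makes the "emergent behavior" point concrete. It is a minimal sketch, assuming one sender, one receiver, and one record per buffer; the `Sender`, `Receiver`, and `simulate` names are hypothetical and none of Flink's Netty machinery is modeled.

```python
from collections import deque

class Receiver:
    """Downstream subtask with a fixed pool of network buffers."""
    def __init__(self, buffers: int):
        self.free = buffers
        self.queue = deque()

    def announce_credit(self) -> int:
        # Hand every currently free buffer to the sender as credit.
        credit, self.free = self.free, 0
        return credit

    def receive(self, record) -> None:
        self.queue.append(record)

    def process(self, n: int) -> None:
        # Consuming a record frees its buffer, which becomes future credit.
        for _ in range(min(n, len(self.queue))):
            self.queue.popleft()
            self.free += 1

class Sender:
    """Upstream subtask that may only send while it holds credit."""
    def __init__(self, records):
        self.backlog = deque(records)
        self.credit = 0

    def send(self, receiver: Receiver) -> int:
        sent = 0
        while self.credit > 0 and self.backlog:
            receiver.receive(self.backlog.popleft())
            self.credit -= 1
            sent += 1
        return sent

def simulate(records: int, buffers: int, consume_per_tick: int, ticks: int):
    sender, receiver = Sender(range(records)), Receiver(buffers)
    sent_per_tick = []
    for _ in range(ticks):
        sender.credit += receiver.announce_credit()
        sent_per_tick.append(sender.send(receiver))
        receiver.process(consume_per_tick)
    return sent_per_tick

sent = simulate(records=20, buffers=4, consume_per_tick=1, ticks=5)
print(sent)  # [4, 1, 1, 1, 1]
```

After the initial burst fills the receiver's buffers, the sender's rate drops to the consumer's rate of one record per tick without any explicit throttle setting.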
03 · State Model & Backends
State is what makes Flink Flink. Three kinds of state and three backends, and the choice cascades into checkpoint speed, rescaling cost, and recovery time.
State types
| State Type | Scope | Typical Use | Rescaling Behavior |
|---|---|---|---|
| Keyed State | Per key, per operator | Per-user aggregates, joins, session windows, deduplication | Reassigned by key group on parallelism change. Key groups are fixed at max parallelism, set at job start, immutable thereafter. |
| Operator State | Per operator subtask | Source offsets (Kafka consumer), sink buffers, broadcasted enrichment | Redistributed via ListState (even split) or UnionListState (full broadcast on restore, expensive at scale). |
| Broadcast State | Per operator, replicated to all subtasks | Dynamic rules, feature flags, slowly-changing dimensions joined to the main stream | Replicated identically on every parallel instance. Useful but the broadcast side cannot be repartitioned mid-flight. |
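Key-group routing is easier to reason about with numbers. The sketch below mirrors the two-step mapping (key to key group, key group to subtask) but uses CRC32 as a stand-in hash, which is an assumption made for portability; Flink applies its own murmur hash to the key's hash code. The function names are illustrative.

```python
import zlib
from collections import Counter

def key_group(key: str, max_parallelism: int) -> int:
    # Stand-in hash (CRC32). The key group depends only on the key and
    # max parallelism, which is why max parallelism cannot change later.
    return zlib.crc32(key.encode("utf-8")) % max_parallelism

def subtask_for_key_group(kg: int, max_parallelism: int, parallelism: int) -> int:
    # Contiguous ranges of key groups are owned by one subtask.
    return kg * parallelism // max_parallelism

def groups_per_subtask(max_parallelism: int, parallelism: int) -> Counter:
    return Counter(
        subtask_for_key_group(kg, max_parallelism, parallelism)
        for kg in range(max_parallelism)
    )

# Rescaling from 8 to 12 subtasks moves key groups, never individual keys.
kg = key_group("user-42", 720)
before = subtask_for_key_group(kg, 720, 8)
after = subtask_for_key_group(kg, 720, 12)

even = groups_per_subtask(720, 48)    # 720 is divisible by 48
uneven = groups_per_subtask(128, 48)  # 128 is not
print(set(even.values()), set(uneven.values()))  # {15} {2, 3}
```

With max parallelism 128 and 48 subtasks, some subtasks own 3 key groups and others 2, a 50% load difference before any data skew. A highly composite ceiling such as 720 divides evenly across many parallelism values.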
State backends
| Backend | Storage | Best For | Hard Cost |
|---|---|---|---|
| HashMapStateBackend | Java heap, plain objects | Small state (under a few GB per TM), latency-sensitive workloads, no serialization overhead per access | GC pressure. Heap fragments under churn. State size capped by heap. Full checkpoint always, no incremental. |
| EmbeddedRocksDBStateBackend | RocksDB on local disk, off-heap block cache | Large state (TB scale), incremental checkpoints, time-windowed aggregates with long retention | Every state access is a serde + LSM lookup. Per-record latency goes from sub-microsecond (heap) to tens of microseconds. RocksDB tuning surface is large: write buffer, block cache, compaction style, bloom filters. |
| ForSt (Flink 2.0) | Distributed file system as primary, local disk as cache | Cloud-native deployments, fast rescaling, large state without local disk dependency | Network I/O for state access. Mitigated by async execution model and tiered cache. Without local cache, throughput drops around 48% on average for I/O-heavy queries on remote storage. With cache it approaches RocksDB parity. |
04 · Checkpoint Protocol
Flink’s checkpointing is a variant of Chandy-Lamport adapted for acyclic dataflows. The JobMaster injects a numbered barrier into each source. Each operator snapshots its state when it has received the barrier from every input channel, then forwards the barrier downstream. Once every task, sinks included, has acknowledged its snapshot, the checkpoint is committed.
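Barrier alignment on a multi-input operator can be sketched as follows. This is a toy model under stated assumptions: a single checkpoint, a running sum as the operator state, and round-robin channel polling. The `aligned_snapshot` function and `BARRIER` marker are hypothetical names, not Flink APIs.

```python
from collections import deque

BARRIER = "BARRIER"

def aligned_snapshot(channels):
    """Consume events from several input channels round-robin.
    Once a barrier arrives on a channel, that channel is blocked (its
    records stay queued) until every channel has delivered the barrier.
    Returns (state_at_snapshot, records_processed_after_snapshot)."""
    queues = [deque(c) for c in channels]
    blocked = [False] * len(queues)
    total = 0          # operator state: running sum of all records seen
    snapshot = None
    after = []
    while any(queues):
        progressed = False
        for i, q in enumerate(queues):
            if not q or blocked[i]:
                continue
            progressed = True
            event = q.popleft()
            if event == BARRIER:
                blocked[i] = True
                if all(blocked):
                    snapshot = total                  # take the snapshot
                    blocked = [False] * len(queues)   # resume all channels
            else:
                total += event
                if snapshot is not None:
                    after.append(event)
        if not progressed:
            break  # a channel ended without a barrier; avoid spinning
    return snapshot, after

result = aligned_snapshot([
    [1, 2, BARRIER, 10],
    [3, 4, 5, BARRIER, 20],
])
print(result)  # (15, [10, 20])
```

The snapshot value 15 is exactly the sum of the pre-barrier records on both channels. The record 10 waits behind the blocked first channel until the second barrier arrives, which is the waiting time that backpressure inflates.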
Checkpoint variants
| Variant | Mechanism | When to Use | Cost |
|---|---|---|---|
| Aligned (default) | Operator waits for barrier on every input channel before snapshotting. | Healthy job with no sustained backpressure. Smaller checkpoint size. | Backpressure couples directly into checkpoint duration. Under backpressure, checkpoints may take longer to complete or time out completely. |
| Unaligned (1.11+) | Barrier cuts ahead of in-flight buffers; the buffers are snapshotted as part of operator state. | Persistent backpressure where aligned CPs are timing out. Stateless or low-state operators benefit most. | Checkpoint size grows by in-flight buffer volume. Only works for exactly-once and with one concurrent checkpoint. Savepoints cannot be unaligned. |
| Incremental (RocksDB) | Only uploads RocksDB SSTables changed since last checkpoint. | Large state, frequent checkpoints. Off by default, so enable it explicitly; it is the right setting for almost all RocksDB jobs. | RocksDB compaction creates “new” files even for unchanged data, occasionally blowing up incremental delta size. Background manual compaction in 1.20 mitigates. |
| Generic Log-based (changelog, 1.15+) | Operators write state changes to a durable changelog continuously; materialization happens in the background. | When checkpoint duration must be predictable and short (sub-second p99). Cost is paid continuously, not at CP time. | Doubles write amplification. Extra DFS traffic. Worth it only when CP duration variance is the production pain. |
| Savepoint | Manually triggered, version-stable, format-independent snapshot. | Upgrades, parallelism changes, code changes, A/B job swaps. | Always aligned. Cannot be unaligned even with the feature enabled. Slower than checkpoints. Holds state in the canonical format, not RocksDB native. |
| File Merging (1.20+) | Combines many small checkpoint files into fewer larger ones. | Large parallelism (thousands of subtasks) where filesystem metadata is the bottleneck. | MVP feature, opt-in via execution.checkpointing.file-merging.enabled. Watch for edge cases on retention. |
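The incremental row, including its compaction caveat, reduces to a set difference over file names. The sketch below is illustrative only: the file names and the `files_to_upload` helper are made up, and real incremental checkpoints also track shared-file reference counts.

```python
def files_to_upload(live_sst_files: set, already_on_dfs: set) -> set:
    """An incremental checkpoint uploads only SST files not yet on the DFS."""
    return live_sst_files - already_on_dfs

on_dfs = set()
deltas = []
checkpoints = [
    {"001.sst", "002.sst", "003.sst"},             # first checkpoint: all new
    {"001.sst", "002.sst", "003.sst", "004.sst"},  # one memtable flush since
    {"004.sst", "005.sst"},                        # compaction merged 001-003 into 005
]
for live in checkpoints:
    delta = files_to_upload(live, on_dfs)
    on_dfs |= delta
    deltas.append(sorted(delta))

print(deltas)  # [['001.sst', '002.sst', '003.sst'], ['004.sst'], ['005.sst']]
```

The third checkpoint uploads `005.sst` even though no user data changed, because compaction rewrote existing data into a new file.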
KafkaSink with EXACTLY_ONCE uses transactional producers. The producer transaction commits only when the corresponding checkpoint commits, with a 2PC handshake driven by the JobMaster. Get this wrong and you have exactly-once on state but at-least-once on output, which is the most common silent correctness bug in Flink deployments.
05 · Time & Watermarks
Flink supports three notions of time. Event time is the only one that gives correct results under out-of-order data, which is most production data. Event time relies on watermarks: monotonic timestamps that assert “no more data with timestamp earlier than this will arrive”. Watermarks are how Flink fires windows, expires timers, and handles late data.
| Time Mode | Source of Time | Correctness Property | Production Risk |
|---|---|---|---|
| Event time | Timestamp embedded in the record. | Deterministic results across reprocessing. Same input → same windows → same output. | Skewed sources stall watermarks. One slow Kafka partition can freeze global watermark and block window firing across the entire job. |
| Processing time | Wall clock on the TaskManager processing the record. | Lowest latency, no watermark machinery, no late data. | Non-deterministic. Reprocessing a backlog produces different windows than the live run. Most teams regret choosing this within 6 months. |
| Ingestion time | Wall clock at source operator. | Middle ground. Approximate event-time with no per-record timestamp needed. | Mostly deprecated in practice. If you want event time, use event time. |
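The determinism claim for event time follows from how records map to windows. A minimal sketch, assuming non-negative millisecond timestamps, zero window offset, and no late-data dropping; the function names are illustrative.

```python
from collections import defaultdict

def tumbling_window_start(timestamp_ms: int, size_ms: int) -> int:
    # The window is a pure function of the record's own timestamp.
    return timestamp_ms - (timestamp_ms % size_ms)

def window_counts(event_timestamps, size_ms: int) -> dict:
    counts = defaultdict(int)
    for ts in event_timestamps:
        counts[tumbling_window_start(ts, size_ms)] += 1
    return dict(counts)

live = [1_000, 4_000, 61_000, 59_000, 62_000]  # arrival order in the live run
replay = sorted(live, reverse=True)            # a different arrival order on backfill

live_counts = window_counts(live, 60_000)
replay_counts = window_counts(replay, 60_000)
print(live_counts)  # {0: 3, 60000: 2}
```

Both orders produce the same counts because arrival order never enters the calculation. Under processing time the window would be derived from the wall clock at arrival, so a backlog replay would land in different windows.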
Watermark mechanics worth knowing
- Generation
- Per-source-partition watermark, taken as the minimum across partitions for downstream. Idle partition detection (1.11+) lets stalled partitions opt out instead of holding global watermark hostage.
- Allowed lateness
- Window-level grace period after watermark passes window end. Late records reopen the window and re-emit results. Costs state retention for the lateness window.
- Side output for late data
- Records arriving after allowed lateness can be diverted to a side output instead of dropped. The right pattern for forensic correctness without inflating main-path state.
- Watermark alignment (1.15+)
- Coordinates watermark progress across source subtasks to prevent one fast partition from racing ahead and forcing huge buffers on slow ones. Mandatory for backpressure-sensitive workloads.
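The generation and idleness rules above can be sketched as a minimum over active partitions. This is a simplified model with hypothetical names (`combined_watermark`, `last_event_at`); watermarks are in event time, while `now`, `last_event_at`, and `idle_timeout` are wall-clock milliseconds.

```python
from typing import Dict, Optional

def combined_watermark(
    partition_wm: Dict[str, int],
    last_event_at: Dict[str, int],
    now: int,
    idle_timeout: Optional[int],
) -> Optional[int]:
    """Minimum watermark across partitions, skipping partitions that have
    been silent longer than idle_timeout. None disables idleness detection."""
    active = [
        wm
        for partition, wm in partition_wm.items()
        if idle_timeout is None or now - last_event_at[partition] <= idle_timeout
    ]
    return min(active) if active else None

watermarks = {"p0": 10_000, "p1": 9_500, "p2": 2_000}
last_seen = {"p0": 99_000, "p1": 99_500, "p2": 40_000}  # p2 has stalled

stalled = combined_watermark(watermarks, last_seen, now=100_000, idle_timeout=None)
advancing = combined_watermark(watermarks, last_seen, now=100_000, idle_timeout=30_000)
print(stalled, advancing)  # 2000 9500
```

Without idleness detection the stalled partition pins the combined watermark at 2000, so no window ending after that point can fire. With a 30-second timeout it is excluded and the watermark advances to 9500.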
06 · Trade-Offs
Trade-offs are stated as “give up X to get Y” with the operational scenario where the cost actually bites.
| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| True streaming runtime over microbatch | Per-record latency in low tens of ms, event-time semantics, no batch boundary artifacts | Throughput-per-core lower than Spark Structured Streaming on stateless workloads, more complex resource accounting | Hourly aggregation jobs over Parquet where Spark would just be simpler and cheaper on managed Databricks | Teams pick Flink for “real-time” and then run 90% of their workload as 5-minute tumbling windows. If the SLA tolerates 5 minutes, you are paying streaming-runtime tax for batch outcomes. |
| Exactly-once via aligned checkpoints | Correctness guarantee across state and 2PC sinks, no application-level dedup | Checkpoint duration coupled to backpressure, end-to-end exactly-once requires sink cooperation (Kafka 2PC, Iceberg, JDBC XA) | Sustained backpressure makes aligned checkpoints time out; you flip to unaligned and savepoints take longer for upgrades | Most “exactly-once” Flink jobs in production are at-least-once at the sink because the team used a non-transactional connector. The semantic is end-to-end, not framework-only. |
| RocksDB state backend over heap | TB-scale state per TM, incremental checkpoints, predictable memory footprint via managed memory | Per-state-access latency goes from sub-microsecond to tens of microseconds, large tuning surface (write buffer, block cache, compaction) | Per-record state access on the hot path (e.g., session keying), where heap would be 50x faster and the state actually fits | The default of RocksDB is correct for the common case but wrong for low-state, latency-critical jobs. Profile first. HashMap backend is not deprecated, it is underused. |
| Key groups for hash-partitioned state | Deterministic state placement, parallel recovery, sub-second key routing | Max parallelism is fixed at job submission. Going past it requires a state-incompatible rewrite | Year 2 traffic outgrows the parallelism ceiling you set on day 1. Now you need a savepoint, manual repartitioning, and a release | Always set max parallelism explicitly to something like 720 or 1440 (composite-friendly numbers). Defaults of 128 cap you fast at any real scale. |
| JVM-centric runtime | Mature ecosystem, deep observability via JMX, vast operator and connector library | GC pauses bite at 100GB+ heaps, native ML/SIMD workloads must JNI out, Python is a second-class citizen | P99 latency spikes from G1 mixed GCs at 8 AM traffic ramp, masking real bottlenecks | ZGC and Shenandoah are the right defaults at scale, not G1. Most “Flink is slow” tickets resolve to a GC config from 2018. |
| Event-time semantics with watermarks | Deterministic reprocessing, correct windows over out-of-order data, replay safety | Watermark sensitivity to skewed sources, per-source-partition idleness tuning, late-data accounting | One Kafka partition with no producers freezes global watermark; windows stop firing; alerts trip | Watermark alignment (1.15+) and idle-partition timeouts are not optional at scale. Most “windows not firing” tickets are watermark stalls, not bugs. |
| Mature connector ecosystem | Kafka, Kinesis, Pulsar, JDBC, Iceberg, Delta, HBase, ES, Pinot, CDC connectors covering most production stacks | Quality varies wildly. Some connectors lag Flink versions, others have correctness issues on edge cases | Two-quarter blocker waiting for a connector to support a new Flink minor, or hitting a connector deadlock under specific shutdown sequences | The first-party Kafka and Iceberg connectors are production-grade. Treat community connectors as alpha until you have validated checkpoint + restore + 2PC under chaos testing. |
| Operator chaining (with slot sharing) | Skips serialization between chained operators, slashes per-record cost for forward edges | CPU contention inside slots, harder backpressure isolation, debugging requires manually un-chaining | Production performance regression that disappears the moment you disableChaining() to see what is slow | Chaining and slot sharing are powerful for cost but they hide the truth. For new jobs, deploy with chaining disabled in pre-prod, find the bottleneck, then re-enable. |
| Coupled compute and state (pre-2.0) | Local state access with no network hop (tens of microseconds on RocksDB), predictable RocksDB performance | Stateful pods, slow rescaling (download all state on scale-up), local disk dependency on K8s | Autoscaling event takes 15 minutes for a 200GB-per-TM job because the new TMs must hydrate from S3 first | This is exactly why Flink 2.0 ForSt exists. If you are designing a new cloud-native deployment, evaluate ForSt now; the pre-2.0 model is a structural mismatch for elastic infra. |
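The hydration figure in the last row is simple arithmetic, shown here as a back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB) and a single sequential download per TaskManager, and it ignores RocksDB file ingestion time. Both helper names are hypothetical.

```python
def hydration_seconds(state_gb: float, download_mb_per_s: float) -> float:
    """Time to pull one TaskManager's state from remote storage."""
    return state_gb * 1000 / download_mb_per_s

def implied_throughput_mb_per_s(state_gb: float, seconds: float) -> float:
    """Effective download rate implied by an observed hydration time."""
    return state_gb * 1000 / seconds

# 200 GB per TM restored in 15 minutes implies roughly 222 MB/s per TM.
implied = implied_throughput_mb_per_s(200, 15 * 60)

# Even at a sustained 1000 MB/s, the same TM needs 200 seconds before it can process.
faster = hydration_seconds(200, 1000)
print(round(implied, 1), faster)  # 222.2 200.0
```

The point of the calculation is that hydration time scales linearly with per-TM state, so faster networking shortens the stall but does not remove it.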
07 · Use Cases
Production deployments at the top of the scale curve. Numbers are from public engineering posts and conference talks.
| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Real-time recommendation and transaction processing | Alibaba (Blink fork, now upstreamed) | Sub-100ms p99 stateful joins across user, product, ad streams, exactly-once during Singles’ Day peaks | Alibaba processes 40 billion events per day on Flink | Spark Structured Streaming microbatch boundaries violated the latency SLA. Kafka Streams lacked the multi-stream join and event-time depth. |
| Streaming SQL platform for analysts | Uber AthenaX | SQL-first stream interface so non-engineers can ship streaming queries against canonical event streams | Trillion messages per day across the AthenaX platform | Custom DSLs created onboarding friction; Spark SQL did not give true event-time. Flink SQL hit the sweet spot for self-serve. |
| Ad-hoc stream processing at platform layer | Netflix Keystone | Multi-tenant stream processing where teams ship jobs without owning infrastructure, with isolation per job | Hundreds of jobs, mission-critical recommendation features, low-latency event routing | In-house Mantis covered some cases; Flink was chosen for stateful workloads that Mantis was not designed for. |
| Streaming fraud and risk detection | Stripe (5+ years on Flink), ING | Exactly-once stateful CEP over payment streams with sub-second decision SLAs and replayable audit trail | Stripe scaled Flink across multiple regions and hundreds of jobs | Kafka Streams could not handle the multi-stream join complexity at the required latency. Custom CEP frameworks lacked Flink’s checkpointing. |
| Real-time experimentation and analytics | Pinterest Xenon platform | Event aggregation for experimentation with windowed metrics, deduplication, and low-latency dashboards | Pinterest runs Flink in production with self-service diagnosis tools and custom deployment frameworks | Batch experimentation pipelines lagged 24 hours; Spark Streaming had operator state limits Xenon hit. |
| Real-time gaming and personalization | King (Candy Crush parent) | Stateful processing of player events for in-game personalization and anti-cheat | King handles 30 billion daily events across 200+ countries | Kappa-style replay over batch could not meet the in-session personalization latency. Custom code would not have given exactly-once for free. |
| Telco real-time analytics | Bouygues Telecom | Sub-second event-time windowing over network telemetry with strict SLA for operations dashboards | 30+ production Flink apps processing more than 10 billion raw events per day | Apache Storm (predecessor) lacked stateful CEP and exactly-once. Spark Streaming microbatch latency was too high. |
| End-to-end exactly-once ad event processing | Uber Ad Events (Flink + Kafka + Pinot) | Exactly-once stateful enrichment and dedup of ad impressions with billing-grade correctness | Production at Uber ad-platform scale with audited revenue impact | Application-level dedup at downstream services would have added 30%+ infra cost and is correctness-fragile at peak load. |
08 · Limitations
Real ceilings teams hit in production, with the standard workaround and what the workaround actually costs.
| Limitation | Severity | Workaround | Workaround Cost |
|---|---|---|---|
| Max parallelism is immutable after first job submission | High | Set high max parallelism (720, 1440) upfront. If exceeded, savepoint, rewrite key partitioning, restart from clean state | State-incompatible rewrite. In practice, jobs hit this and accept a hard restart or rebuild state from source. |
| Rescaling requires savepoint + restore (pre-2.0) | High | Use Reactive Mode (1.13+) or migrate to Flink 2.0 + ForSt for instant rescaling | Reactive Mode is opt-in and not production-default; ForSt requires the async execution model and a 2.x release. |
| Local disk dependency on Kubernetes for RocksDB | High | Use PVCs with fast SSD, or move to ForSt with object storage as primary | PVCs slow pod scheduling. ForSt is opt-in and not all operators support async execution. |
| Backpressure propagates into checkpoint duration | High | Enable unaligned checkpoints, enable buffer debloating, fix the actual backpressure source | Unaligned CP size grows; only works for exactly-once with one concurrent checkpoint; savepoints stay aligned and slower. |
| State schema evolution is fragile for non-Avro/POJO state | High | Use Avro or POJO state with explicit schema, never Kryo default serializers for production state | Refactoring legacy jobs to Avro state is a multi-sprint effort with savepoint-replay validation. |
| Watermark stalls from idle source partitions | Medium | Use WatermarkStrategy.withIdleness() and watermark alignment (1.15+) | Idleness timeout must be tuned per source; too short causes premature firing, too long causes stalls. Requires per-job analysis. |
| PyFlink is significantly slower than Java for stateful operators | Medium | Keep stateful logic in Java/Scala; use PyFlink only for sources, sinks, and stateless transforms | Mixed-language jobs add deployment and debugging complexity. Pure Python shops face a 2-5x throughput penalty. |
| Connector quality varies; some community connectors lag Flink versions | Medium | Validate connector under chaos testing (restart + replay + 2PC commit) before production | Adds 1-2 weeks per connector to validation cycle. Some teams maintain forks. |
| Job graph compilation time grows with SQL plan size | Medium | Split very large SQL pipelines into multiple jobs; use COMPILED PLAN to cache the optimized plan across upgrades | Multi-job pipelines need explicit topic-as-boundary contracts. Compiled plans must be tested across Flink versions on upgrade. |
| Window state grows unboundedly without TTL | Critical | Set StateTtlConfig on all keyed state. Use session windows over global windows for unbounded keyspaces | TTL with incremental checkpoints is based on RocksDB compaction, so state size reduction is only observed after compaction occurs. Misleading checkpoint metrics during cleanup. |
| JobManager is a SPOF without HA configured | Critical | Enable HA via ZooKeeper or Kubernetes leader election; store checkpoint metadata on durable DFS | HA adds operational dependency (ZK cluster) or K8s ConfigMap reliance. Misconfigured HA can split-brain on K8s rolling updates. |
| Operator state under UnionListState broadcasts on restore | Medium | Avoid UnionListState at high parallelism; prefer keyed state for genuinely keyed data | Some legacy sources still use it. Migration requires connector cooperation. |
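The cost of union redistribution is easiest to see by counting restored entries. A toy comparison, assuming round-robin assignment for the even split (Flink's actual split assignment may differ); both function names are hypothetical.

```python
def even_split(items, parallelism: int):
    """ListState-style restore: each entry goes to exactly one subtask."""
    return [items[i::parallelism] for i in range(parallelism)]

def union_restore(items, parallelism: int):
    """UnionListState-style restore: every subtask receives the full list."""
    return [list(items) for _ in range(parallelism)]

offsets = [f"partition-{i}" for i in range(12)]

split = even_split(offsets, 4)
union = union_restore(offsets, 4)

restored_split = sum(len(s) for s in split)  # 12 entries read in total
restored_union = sum(len(s) for s in union)  # 48 entries read in total
print(restored_split, restored_union)  # 12 48
```

Restore volume for the union case grows as entries times parallelism, so a job with thousands of subtasks reads the whole operator state thousands of times.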
09 · Fault Tolerance
Flink’s fault tolerance is checkpoint-and-restart, not in-memory replication. Recovery time is bounded by state size plus source replay, not by RPO/RTO of a replicated quorum.
| Dimension | Behavior | Operational Reality |
|---|---|---|
| Replication model | State snapshotted asynchronously to durable DFS at checkpoint intervals. No in-memory state replicas across TMs. | Durability inherits from the DFS (S3 11 nines, HDFS configurable replication). Loss between checkpoints is unrecoverable from Flink, must be replayed from source. |
| Failure detection | TaskManager heartbeats to JobManager. Default timeout 50 seconds, tunable via heartbeat.timeout. | Tuning lower than 30s causes false positives under GC; longer delays detection. Set 30-50s and pair with appropriate restart strategy. |
| Failover mechanism | Three strategies: full job restart, region failover (since 1.9), or individual failover for embarrassingly parallel jobs. | Region failover is the right default for most jobs. Limits blast radius to the connected pipelined region around the failed task, not the whole job graph. |
| RTO (typical) | Seconds for small state (under 10GB), minutes for large state due to RocksDB hydration from DFS. Flink 2.0 ForSt cuts this dramatically. | Flink 2.0 reports up to 49x faster recovery after failures or rescaling with the disaggregated state model, since state is read from DFS on demand. |
| RPO (typical) | Zero data loss if source is replayable. Effective RPO is the checkpoint interval (commonly 10s to 5min); work between checkpoints is replayed, not lost. | Non-replayable sources (sockets, HTTP push) make RPO non-zero. Always pair Flink with a replayable log (Kafka, Kinesis, Pulsar, file-based source). |
| Split-brain behavior | JobManager HA uses ZooKeeper or K8s leader election. Only one active JobMaster per job at any time. TaskManagers refuse to connect to a non-leader. | K8s ConfigMap-based leader election has known races under rolling restarts. ZK-based HA is more battle-tested at large scale. |
| Blast radius of single-node failure | With region failover: scoped to the connected pipelined region of failed subtasks. With full restart: entire job. | Wide shuffles (keyBy, rebalance) expand the region. Narrow forward edges keep regions small. Operator placement matters more than most teams realize. |
| Cross-region failover story | No native multi-region. Pattern is savepoint replication to secondary region’s DFS plus a standby cluster that resumes from the latest savepoint on disaster. | Operationally non-trivial. Savepoint frequency determines cross-region RPO. Most teams settle for warm-standby with 5-15 minute savepoint cadence. |
| Data loss scenarios | Loss possible only if: source cannot replay (non-replayable), or DFS holding checkpoints loses data, or non-2PC sink emits before checkpoint commit. | The third scenario is the silent killer. A team enables exactly-once on state but uses a non-transactional JDBC sink and ships duplicates on every restart. |
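The third data-loss scenario can be reproduced in a few lines. This is a simplified model of checkpoint-coupled commits, not the actual sink API: pre-commit and commit are collapsed into one step, and the class names, `run` function, and callback names are all hypothetical.

```python
class TransactionalSink:
    """Buffers writes in an open transaction; output is visible only on commit."""
    def __init__(self):
        self.visible, self.pending = [], []
    def write(self, record):
        self.pending.append(record)
    def on_checkpoint_complete(self):
        self.visible.extend(self.pending)  # commit the transaction
        self.pending = []
    def on_restart(self):
        self.pending = []                  # abort the uncommitted transaction

class DirectSink:
    """Non-transactional sink: every write is visible immediately."""
    def __init__(self):
        self.visible = []
    def write(self, record):
        self.visible.append(record)
    def on_checkpoint_complete(self):
        pass
    def on_restart(self):
        pass

def run(sink, records, checkpoint_every: int, fail_at: int):
    """Process records, checkpoint every N records, and fail once just
    before record index fail_at, rewinding to the last checkpointed offset."""
    committed_offset, i, failed = 0, 0, False
    while i < len(records):
        if not failed and i == fail_at:
            failed = True
            sink.on_restart()
            i = committed_offset  # replayable source rewinds
            continue
        sink.write(records[i])
        i += 1
        if i % checkpoint_every == 0:
            committed_offset = i
            sink.on_checkpoint_complete()
    sink.on_checkpoint_complete()  # final checkpoint at end of input
    return sink.visible

exactly_once = run(TransactionalSink(), list(range(6)), checkpoint_every=3, fail_at=5)
duplicated = run(DirectSink(), list(range(6)), checkpoint_every=3, fail_at=5)
print(exactly_once)  # [0, 1, 2, 3, 4, 5]
print(duplicated)    # [0, 1, 2, 3, 4, 3, 4, 5]
```

Both runs restore identical state from the checkpoint; only the sink differs. Records 3 and 4 were emitted before the failure and again after replay, which is the duplicate pattern a non-transactional sink ships on every restart.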
11 · Replication / State Persistence
Replication in Flink lives at the durable storage layer, not in the runtime. State exists on one TaskManager at a time; durability comes from the DFS where checkpoints land.
| Dimension | Behavior | Operational Reality |
|---|---|---|
| Replication topology | Single owner per key range at runtime. Checkpoints write state to durable DFS at intervals. No multi-leader, no quorum reads. | The runtime is master-of-key. External systems read state either through Queryable State (deprecated), which serves live operator state rather than checkpoints, or through external sinks. |
| Sync vs async | Checkpoints are asynchronous: barrier injection is synchronous, but state upload to DFS proceeds in background. Operators continue processing during async upload. | Synchronous phase is alignment + state copy; asynchronous phase is upload. If async upload runs longer than the checkpoint interval, the next checkpoint is delayed, or overlaps it when more than one concurrent checkpoint is allowed. |
| Replication factor | Inherits from DFS: S3 (11 nines durability), HDFS (typically 3x), GCS (multi-region options). | Flink does not control replication factor; it is a DFS concern. Setting RF=1 on the checkpoint DFS is a one-failure-from-data-loss configuration. |
| Consistency level options | Exactly-once (default) or at-least-once. At-least-once skips alignment and is faster but reprocesses some records on recovery. | At-least-once with idempotent sinks is a valid pattern. Exactly-once with non-transactional sinks is the most common silent at-least-once-end-to-end bug. |
| Replication lag (typical) | Equal to checkpoint interval, commonly 10 seconds to 5 minutes. ForSt streams state changes continuously, cutting effective lag to seconds. | Shorter intervals (under 10s) increase IOPS pressure on DFS, particularly S3 PUT request rate limits and multipart upload overhead. |
| Conflict resolution | N/A. Single writer per key by design. No conflict possible within a job. | Cross-job conflicts (two jobs writing to the same Kafka topic) are an application-level problem, not framework-resolvable. |
| Cross-region replication | Savepoint replication to secondary region’s DFS. Some teams use S3 cross-region replication for checkpoint buckets, but this adds RPO equal to S3 CRR lag. | S3 CRR is “minutes” SLA, not “seconds”. For sub-minute cross-region RPO, application-level Kafka replication (MirrorMaker 2 or Cluster Linking) is the answer, not Flink itself. |
| Replication during partition | If TaskManager loses contact with JobManager beyond heartbeat timeout, the JM marks it failed and triggers failover. The partitioned TM cannot commit checkpoints. | K8s network partitions during rolling node updates have historically caused spurious failovers. Heartbeat tuning (50s default) is the lever; pair with restart strategy that does not thrash. |
12 · Better Usage Patterns
Patterns most engineers miss until they have shipped a few Flink jobs to production and read the post-mortems.
| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Set max parallelism explicitly at job creation | Leave it as default 128. Hit the cap a year later when traffic grows. | Set max parallelism to a high composite-friendly number like 720, 1440, or 4320. Sets ceiling for rescaling. | Max parallelism is immutable. Changing it means rebuilding state. One config decision on day 1 saves a 2-week migration on year 2. |
| Profile before defaulting to RocksDB | Use RocksDB everywhere because “it’s the default for production”. | Use HashMap backend for low-state, latency-critical jobs (under 4GB per TM). RocksDB only when state genuinely exceeds heap. | Per-record state access in heap is sub-microsecond; in RocksDB it is tens of microseconds. For hot-path operators this is a 50x slowdown for state you did not need to persist that way. |
| Always set state TTL on keyed state | Forget TTL. State grows until savepoint becomes too large to restore. | Set StateTtlConfig with explicit retention. Use cleanupInRocksdbCompactFilter() to reclaim disk during compaction. | Without TTL, state size is monotonically increasing for non-bounded keyspaces. Even with TTL, observable reduction depends on RocksDB compaction, so set it early not as a recovery measure. |
| Use Application Mode for production K8s | Run a long-lived Session cluster with many jobs sharing it. | One Application Mode cluster per job. Job lifecycle equals cluster lifecycle. | Session clusters create resource isolation problems (noisy neighbors), version pinning problems (cluster must be on the lowest Flink version that all jobs support), and harder failure attribution. Application mode is the K8s-native pattern. |
| Use 2PC-capable sinks for exactly-once | Enable exactly-once checkpointing, pair with non-transactional JDBC sink, ship duplicates on every restart. | Use KafkaSink with EXACTLY_ONCE, Iceberg sink, or JDBC XA. Validate end-to-end exactly-once with a chaos test. | Framework-level exactly-once is necessary but not sufficient. End-to-end correctness requires sink cooperation. This is the silent correctness bug. |
| Configure RocksDB managed memory pool, not per-instance memory | Hand-tune RocksDB block cache, write buffer, and arena per operator. Drift from the framework’s managed memory accounting. | Let Flink’s managed memory pool sub-allocate to RocksDB instances per slot. Tune the total pool, not per-instance knobs. | The framework owns sizing across operators sharing a TM. Manual tuning drifts from this accounting and causes OOMs when operators rebalance. |
| Enable unaligned checkpoints only when needed | Turn it on globally because “it solves backpressure”. | Enable it only on jobs with sustained backpressure. Watch checkpoint size growth. Keep aligned checkpoints for jobs with frequent savepoint upgrades. | Savepoints are always aligned, and unaligned checkpoints only work with one concurrent checkpoint. Enabling unaligned checkpoints globally slows your upgrade story. |
| Use side outputs instead of filter chains | Chain multiple filter() calls to split a stream into categories. | Use ProcessFunction with OutputTag side outputs. One pass over the stream, multiple outputs. | Filter chains re-iterate the same stream. Side outputs do a single pass. At high QPS the savings are linear. |
| Use Avro or POJO state serializers, never default Kryo | Use Kryo for “convenience”. Schema evolution silently breaks on field addition. | Explicit Avro schema or strict POJO with @TypeInfo annotations. Validate state evolution in CI with savepoint replay. | Kryo’s “evolution” support is by ordinal, not by name. Adding a field shifts ordinals and corrupts state. Avro evolution by name is the only safe pattern at scale. |
| Disable operator chaining temporarily to debug | Try to debug performance with chaining on, conflating multiple operators into one subtask metric. | Run pre-prod with env.disableOperatorChaining(). Find the bottleneck. Re-enable for production. | Chaining hides which operator is slow. Most performance investigations dead-end on chained tasks until someone splits them. |
| Use COMPILED PLAN for Flink SQL upgrades | Re-plan SQL queries on each Flink version and accept silent plan changes. | Compile the plan once with COMPILE PLAN, store it as JSON, and replay it across versions. Test plan stability in CI before upgrade. | The Calcite optimizer can rewrite plans on a minor version upgrade. Compiled plans pin the physical execution, making upgrades safe. |
| Use watermark alignment for skewed sources | Trust per-partition watermarks. One slow partition stalls global watermark and windows stop firing. | Enable WatermarkStrategy.withWatermarkAlignment() (1.15+) plus idleness timeouts. | Most “windows not firing” tickets are watermark stalls from skew, not bugs. Alignment is the framework-level fix. |
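The max-parallelism row can be made concrete with a little arithmetic. Flink maps every key to one of maxParallelism key groups and gives each subtask a contiguous range of them, so when parallelism does not divide max parallelism, some subtasks own one more key group than others. The sketch below is plain Java, not Flink code: the class KeyGroupSkew and its methods are hypothetical names, the index formula is assumed to match the integer arithmetic in Flink's KeyGroupRangeAssignment, and keys are assumed to hash uniformly across key groups.

```java
// Sketch: how evenly key groups spread over subtasks for a given
// max parallelism. Class and method names are hypothetical.
public class KeyGroupSkew {

    // Subtask index that owns a key group. Assumed to mirror the integer
    // arithmetic Flink uses (keyGroup * parallelism / maxParallelism).
    public static int subtaskForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Number of key groups owned by each subtask.
    public static int[] groupsPerSubtask(int maxParallelism, int parallelism) {
        if (parallelism < 1 || parallelism > maxParallelism) {
            throw new IllegalArgumentException("parallelism must be in [1, maxParallelism]");
        }
        int[] counts = new int[parallelism];
        for (int kg = 0; kg < maxParallelism; kg++) {
            counts[subtaskForKeyGroup(maxParallelism, parallelism, kg)]++;
        }
        return counts;
    }

    // Ratio of the most loaded to the least loaded subtask, by key-group count.
    public static double skew(int maxParallelism, int parallelism) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (int c : groupsPerSubtask(maxParallelism, parallelism)) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        return (double) max / min;
    }

    public static void main(String[] args) {
        // Compare the default-sized ceiling with a highly composite one.
        for (int maxP : new int[] {128, 720}) {
            for (int p : new int[] {48, 90, 120}) {
                System.out.printf("maxParallelism=%d parallelism=%d skew=%.2f%n",
                        maxP, p, skew(maxP, p));
            }
        }
    }
}
```

With 128 key groups at parallelism 48, two thirds of the subtasks own three key groups and the rest own two, a 1.5x imbalance in expected state and load. With 720, every subtask owns exactly 15, and the same holds for any parallelism that divides 720, which is the practical argument for picking a number with many divisors.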
13 · Next-Gen Alternatives
Where would you reach instead of Flink today, and what does each one actually improve?
| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| Flink 2.0 ForSt (within-family) | Disaggregated state on object storage, near-instant rescaling and recovery, no local-disk dependency. Same Flink runtime above. | Emerging | Low for new jobs. Existing jobs need async execution adoption and operator/connector support validation. | New cloud-native deployments, jobs with TB+ state that must rescale frequently, or anyone tired of stateful pods on K8s. |
| RisingWave | PostgreSQL-compatible streaming database written in Rust. State on object storage natively. SQL-first, no Java required. Materialized-view-as-pipeline model. | Production | High for jobs with custom Java operators or complex CEP. Low for SQL-only analytical pipelines. | Greenfield streaming analytics under 10 TB of state, SQL-only teams, or where Flink’s operational burden is the cost driver. |
| Materialize | Strict-serializable consistency, recursive CTEs, differential dataflow runtime. PostgreSQL wire protocol. Best correctness story for streaming joins. | Production | High. Different runtime model. Limited connector ecosystem versus Flink. Single-region for now. | Financial-grade correctness on streaming SQL, recursive computations (graph traversal, BOM explosion), or when consistency matters more than throughput. |
| Arroyo (now Cloudflare Pipelines) | Rust on DataFusion. Serverless via Cloudflare. Sub-second latency on stateless and lightly-stateful workloads. | Emerging | Moderate. The open source engine is still available; the managed path has been Cloudflare-specific since the 2025 acquisition. | Edge-tier ingest and transform, Cloudflare-native stacks, or teams that want streaming SQL without operating any infrastructure. |
| Kafka Streams | Lightweight Java library. No external cluster. Local state in RocksDB. Sits inside your microservice. | Production | Low for new Kafka-only pipelines. High for migration from Flink because of state model differences (per-task vs key-group). | Single-Kafka-cluster pipelines, microservice-embedded stream processing, teams without dedicated streaming ops. |
| Spark Structured Streaming | Better batch unification, mature DataFrame API, Databricks/EMR managed runtime. Microbatch model. | Production | Low if already on Spark/Databricks. High if you depend on Flink event-time semantics. | Existing Spark shops, hourly+ latency SLAs, workloads that are batch-y in shape and just need streaming for freshness. |
| ksqlDB | SQL over Kafka with managed lifecycle (Confluent). Lower ops than Flink for simple stream-stream joins. | Production | Moderate. Different consistency model (Kafka-only), more limited windowing. | Confluent-centric stacks, SQL pipelines that fit Kafka’s ordering model, teams that want managed-only. |
| Apache Beam on Flink runner | Portable pipeline definition that targets Flink, Spark, Dataflow without rewrite. Avoids lock-in. | Production | Low for new pipelines. Beam abstractions sometimes hide Flink-specific optimizations. | Multi-runtime portability is genuinely needed (e.g., on-prem Flink + GCP Dataflow). Otherwise direct Flink APIs give more control. |