Real-Time OLAP — ClickHouse vs Apache Pinot
User-facing analytics at high QPS, sub-second p99, streaming ingest from Kafka. Two engines that converge on the same use cases from opposite architectural ends.
Real-Time OLAP PE / Staff+ DepthAs of 2026-05-31
ClickHouse vs Pinot is not "which is faster." It is "what is your query concurrency, do you need first-class upserts, and is your team SQL-native or Java-native." ClickHouse wins on compression, flexible analytical SQL, ecosystem (Tinybird, Posthog, observability), and operational simplicity — one binary, fewer moving parts. Pinot wins on user-facing p99 under high concurrency, native real-time upserts, star-tree pre-aggregation for low-cardinality dashboards, and deterministic Kafka partition assignment (no consumer-group rebalancing storms). The 2026 reality: ClickHouse Cloud has narrowed Pinot's user-facing moat with separated compute/storage and lightweight updates; Pinot's remaining moat is upserts, the star-tree index, and the LinkedIn/Uber/Stripe pattern of millions of end-users hitting analytical queries. Constant Contact migrated off Pinot to ClickHouse for the flexibility win at their workload shape; LinkedIn/Stripe stay on Pinot for the concurrency win at theirs. Both are correct, in their respective workload shapes.
Best default choices
1. Trade-Offs
Per-technology trade-off lists. The trade space is genuinely different: ClickHouse trades LSM merge-amplification for compression and flexible queries; Pinot trades operational complexity for sub-second p99 and native upserts.
ClickHouse
Columnar LSM OLAP| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Append-only MergeTree LSM architecture | Extreme write throughput; 10-20x compression on typical event data; columnar scans on 1B rows in milliseconds | Single-row updates are anti-pattern; deletes use tombstones with eventual consistency on disk reclamation | Team treats ClickHouse like Postgres for a small operational use case; UPDATE WHERE id = X for each event creates one part per row and compaction can't keep up | The mental model is "you write batches, the engine merges in background." Any pattern that produces one part per write at high frequency degrades the merge pipeline. |
| Async inserts with server-side batching | High-frequency small writes work without client-side batching; thousands of small clients can write directly | wait_for_async_insert=0 trades durability between buffer and flush; wait=1 eliminates the perf gain | Telemetry agent fires 1KB writes at 10K/sec with wait=0; ClickHouse process crashes between buffer and flush, you lose a few seconds of in-flight data | 26.1+ extended async insert deduplication to materialized views. Always pair async_insert with idempotent INSERT statements (use block_id or insert hash) for production durability. |
| Per-column codec compression | Tunable per column: DoubleDelta, Gorilla, LZ4, ZSTD; 30-50% less disk than Pinot equivalents | More configuration burden; default codecs leave significant perf on the table | You leave defaults; storage bill is 2x what it would be with Gorilla on timestamps and LowCardinality on string dimensions | LowCardinality(String) on columns with under ~10K distinct values is the single highest-leverage tuning in ClickHouse. 10x compression and faster filter on the same column. |
| Projections as alternative sort orders or pre-aggregations | Multiple query patterns supported by one table; 25.6+ projection-as-secondary-index reduces storage overhead | Write amplification on insert; projection storage cost; complexity in query optimization | Six projections on a high-throughput table cause merge backpressure during traffic spikes; ingest lag balloons | In 25.6/25.11, projections can store only the sort key plus a _part_offset pointer back to the base table — true secondary-index semantics. Adopt new style before legacy data-copy projections. |
| Materialized views aggregate at insert time | Sub-second queries on pre-aggregated rollups; trigger-style chained MVs let you build aggregation hierarchies | MV-as-trigger semantics are subtle; deduplication scoped per table since 26.1; failures in MV chains cascade | MV chain has 3 levels; backfill produces unexpected double-counts because deduplication is per-table not per-chain | In 26.1, deduplication is on by default for both sync and async inserts. If you explicitly disabled it, you need deduplicate_insert='backward_compatible_choice' to keep old behavior. |
| Single-binary architecture | One process handles ingest, query, replication coordination; simpler ops than Pinot's 5-component cluster | Replication still needs ZooKeeper or ClickHouse Keeper; cluster management is non-trivial at scale | Self-hosted cluster of 50 servers; ZK ensemble becomes the operational bottleneck; you migrate to Keeper mid-project | ClickHouse Keeper is the modern default for new deployments. C++ implementation, lower latency, fewer ops surprises than the ZooKeeper Java pairing. |
| Lightweight Updates (Patch Parts) | UPDATE without table locks; instant read consistency vs old mutation-based updates | Still not OLTP; high-volume updates degrade query performance as patches accumulate | Use case sends 1K updates/sec to a billion-row table; patch parts accumulate faster than merge can collapse them; SELECT performance regresses 30% | Patch Parts is a structural improvement over the old mutation model but does not turn ClickHouse into an OLTP store. For high-rate update workloads, use ReplacingMergeTree or move the workload upstream. |
| Kafka engine via consumer groups | Built-in Kafka ingest as an engine; SELECT FROM Kafka materialized to target table | Consumer group rebalancing pauses ingestion when consumers join or leave; server count limited by Kafka partition count | Cluster autoscale event triggers a Kafka rebalance storm; ingest pauses for tens of seconds; downstream alerting fires | For high-throughput, run an external consumer (Vector, Bytewax, or custom) doing INSERTs instead of the Kafka engine. Better operational control, easier debugging, scales beyond partition count. |
| preads I/O instead of MMAP | Better random read perf; no TLB shootdowns; benefits from CPU prefetching and caching | Requires OS page cache tuning at large scale; cache-cold queries pay disk latency | Server with insufficient RAM for working set; queries hit disk and p99 jumps under load | The Constant Contact migration story (Pinot to ClickHouse) cited preads vs MMAP as a key win for them. Pinot's MMAP shines when segments fit in RAM, hurts otherwise. |
Apache Pinot
Distributed Real-Time OLAP| Trade-Off | What You Gain | What You Give Up | When It Bites You | PE Nuance |
|---|---|---|---|---|
| Star-tree index for multi-dimensional pre-aggregation | Real-time aggregation of billions of rows in milliseconds without explicit materialized views; covers many low-cardinality dimensions in one tree | Storage overhead for the tree; must be configured per query pattern; not free for arbitrary queries | Team enables default star-tree on a wide table; tree doesn't match actual query shapes; storage bill rises but query perf barely moves | Star-tree is purpose-built for "GROUP BY a, b, c with filters on the same columns" — the LinkedIn 'Who viewed your profile' shape. For arbitrary analytical queries, you're better off with ClickHouse's flexible projections. |
| Native real-time upserts via per-segment bitmaps | PK-based upserts at ingestion; in-memory bitmap marks obsolete rows; queries filter at read time; full and partial upserts supported | JVM heap-bound: PK count per server limited by heap size; OSS Pinot tops out at hundreds of millions of PKs per server | Upsert table grows past heap limit; server OOMs; you scale horizontally and discover sticky partition assignment caps your scaling | StarTree Cloud's off-heap upsert metadata lifts the heap ceiling roughly 10x. If your upsert PK count is in the billions, StarTree is the route; OSS Pinot won't scale there. |
| Deterministic Kafka partition-to-server mapping (LLC) | No consumer-group rebalancing storms; predictable ingestion latency; coordinated segment completion across replicas | Sticky assignment: num_partitions × num_replicas must be ≥ num_servers or some nodes go idle | You scale to 10 servers expecting load balancing; you have 6 Kafka partitions and 1 replica, so only 6 of 10 servers ingest anything | Pauseless Ingestion (StarTree Cloud) takes this further to 40M+ events/sec/table. For high-throughput ingest where rebalancing pauses are a hard SLO violation, this is the structural advantage over ClickHouse's Kafka engine. |
| Segment-based architecture with deep store | Clear cold/warm separation; S3 or HDFS as source of truth; servers can be rebuilt from deep store | Five moving pieces: Controller + Broker + Server + Minion + ZK/Helix; ops complexity is the highest in the OLAP space | Team picks Pinot for a 5-engineer startup; spends three engineer-quarters operating the cluster before any feature ships | For self-hosted, the operational depth required is real. Either commit to it with dedicated platform engineers or use StarTree Cloud and pay the managed-service tax. |
| Sub-second p99 at high concurrency | Thousands of QPS user-facing dashboards; LinkedIn 'Who Viewed Your Profile' pattern; Stripe merchant dashboards | Less flexible SQL surface than ClickHouse; v2 engine improves joins but still trails for full-scan analytical patterns | Analyst tries an ad-hoc multi-CTE query against Pinot expecting ClickHouse-like flexibility; query plan is suboptimal or unsupported | Pinot is purpose-built for "many users querying simple analytical questions over fresh data." If your queries look like SQL the analyst wrote, ClickHouse is closer to what you want. |
| MMAP-based segment loading | Fast access when segments fit in memory; OS handles paging | If segments exceed RAM, TLB shootdowns and memory contention degrade performance | Workload grows past RAM working set; queries get slow under contention; you can't tune your way out without adding servers or moving data to tiered storage | The Constant Contact migration cited MMAP scaling issues as a real driver to ClickHouse. Sizing rule: keep hot segments in RAM. Tiered storage in StarTree mitigates this for cold data. |
| Rich indexing toolkit per column | JSON, text, range, bloom, inverted, timestamp, and Array text-index all coexist per column | Configuration burden; choosing wrong indexes costs both storage and query perf | Default everything; segments balloon 3x; query speedup is marginal because indexes don't match access patterns | Indexes in Pinot are a per-column-per-query decision. Profile queries against a representative segment before enabling indexes broadly. Default star-tree config (≤10K cardinality dims, all DateTime cols) is often the wrong shape for production. |
| Pauseless Ingestion (StarTree Cloud) | 40M+ events/sec per table; no pauses on segment completion | Managed-tier feature; OSS Pinot tops out lower; lock-in to StarTree | Self-hosted Pinot hits ingestion ceiling at sustained 1M events/sec/table; migration to StarTree Cloud or re-architect | The OSS-vs-managed gap on Pinot is meaningful for high-throughput workloads. ClickHouse OSS and ClickHouse Cloud have a smaller gap on raw ingest throughput. |
| Off-heap upsert metadata (StarTree) | Reduces JVM heap pressure ~10x; enables billions of PKs per server | StarTree-only differentiator; OSS Pinot upserts hit heap ceiling first | OSS Pinot upserts work for hundreds of millions of PKs; you scale past, GC pauses start tripling p99 | If upsert scale is your driver and you're committed to OSS, you'll hit this. If you need upsert scale without StarTree, ClickHouse's ReplacingMergeTree + lightweight updates is the OSS-friendly alternative. |
2. Use Cases
Concrete production workloads with the driving property that ruled out the alternative. The pattern is consistent: ClickHouse wins on analyst-flexible workloads with high cardinality and complex queries; Pinot wins on user-facing dashboards with predictable query shapes and update semantics.
ClickHouse
Columnar LSM OLAP| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| Observability and logs backend | Uber — Logging platform processing PB-scale logs/traces | Flexible analytical SQL over high-cardinality time-series data; massive compression on log fields | Multi-PB log volume; thousands of fields; ad-hoc multi-dimensional queries from on-call engineers | Pinot's star-tree assumes known query dimensions; log analysis needs ad-hoc dimensionality the engineer didn't pre-define |
| Web analytics SaaS backend | Cloudflare Analytics — request-level visibility for millions of customers | Sub-second queries over high-cardinality dimensions (URL, ASN, country, device); flexible filters per customer | Tens of billions of events per day; petabytes hot; millions of tenants with varied query patterns | Pinot's tenant isolation is harder at this multi-tenancy scale; ClickHouse's flexibility lets each customer query their own shape |
| Migrated from Pinot for flexibility + ops | Constant Contact — moved Automated Path Builder analytics off Pinot to ClickHouse | 10x speedup and 10x cost reduction on their analytical query shape; LSM + preads vs MMAP for their working-set size | Marketing automation event data; complex multi-join queries that didn't match Pinot's star-tree shape | Pinot had been the original choice but operational complexity and MMAP scaling issues drove the migration; ClickHouse's flexibility won for their workload |
| SaaS analytics platform built on ClickHouse | Tinybird — real-time analytics API platform; ClickHouse-as-a-service | Single-binary ops; SQL surface analysts and developers both speak; predictable cost model on managed ClickHouse | Multi-tenant SaaS; thousands of customers building their own analytics products on top | Pinot's operational depth is a SaaS-builder anti-feature; the platform engineer experience matters as much as raw perf |
| Product analytics platform | PostHog — open-source product analytics; events, funnels, retention | Open-source; flexible SQL surface for funnel/retention queries; cost-efficient self-host option for enterprise | Billions of events per day for large customers; multi-step funnel queries; arbitrary cohort definitions | Pinot's licensing is also open but operational complexity is higher; for an OSS product analytics company the simpler ops path matters |
Apache Pinot
Distributed Real-Time OLAP| Use Case | Company / Scenario | Driving Property | Scale Dimension | Why Not Alternative |
|---|---|---|---|---|
| User-facing analytics for millions of end-users | LinkedIn — 'Who viewed your profile,' notifications, member analytics (founding use case) | Sub-second p99 over millions of concurrent users hitting analytical queries; star-tree pre-aggregation makes counts and aggregates instant | Hundreds of millions of members; queries per user per session; freshness measured in seconds | ClickHouse can do high-QPS but the deterministic Kafka mapping + star-tree advantage are LinkedIn-shaped problems |
| Restaurant manager analytics for couriers | Uber — Restaurant Manager dashboards; UberEats restaurant-facing real-time metrics | Sub-second freshness from order events to restaurant-facing dashboard; per-merchant slice queries at high concurrency | Tens of thousands of merchants; per-merchant dashboards; fresh-within-seconds order data | ClickHouse Kafka engine rebalance pauses would violate freshness SLO; Pinot's LLC determinism wins here |
| Merchant dashboards | Stripe — real-time merchant payment analytics | Per-merchant queries fan out to thousands of concurrent dashboards; deterministic latency under load | Millions of merchants; transaction-level events; sub-second dashboard refresh | ClickHouse could serve this but Pinot's purpose-built user-facing pattern is harder to beat at the p99 tail |
| Real-time pricing intelligence | Walmart — pricing and inventory analytics across stores and SKUs | Star-tree index over (store × SKU × time) for sub-second aggregation; upsert support for changing prices | Tens of thousands of stores; millions of SKUs; price updates flow through Kafka | ClickHouse ReplacingMergeTree handles upserts but star-tree on the multi-dim aggregation is where Pinot pulls ahead |
| Connected vehicle telemetry | Mercedes-Benz / automotive — connected fleet analytics | Upsert-friendly model for slowly-changing vehicle state; per-vehicle queries at scale | Millions of vehicles; per-vehicle state with frequent updates; geographic and fleet-level rollups | ClickHouse handles append-heavy IoT well; Pinot's native upserts make the "current state per vehicle" pattern cleaner |
3. Limitations
Per-technology limitation tables. Severity reflects blast radius on production workloads. Workaround Cost is what you actually pay to dodge each one.
ClickHouse
Columnar LSM OLAP| Limitation | Severity | Workaround | Workaround Cost |
|---|---|---|---|
| Single-row INSERTs without async are an anti-pattern | High | Always batch (1000+ rows) client-side OR enable async_insert with appropriate busy_timeout | Client-side batching adds latency; async_insert without wait risks durability between buffer flush |
| Joins historically weaker than DW SQL surface | High | Denormalize at ingest; use dictionaries for small lookup tables; use the new query planner for distributed joins | Storage duplication on denormalized data; complexity at ingest layer |
| Updates and deletes via mutations are expensive on hot data | High | Use Lightweight Updates (Patch Parts) for low-volume updates; ReplacingMergeTree for upsert-like patterns | Patch parts accumulate under high-volume updates and degrade SELECT perf; ReplacingMergeTree returns multiple versions until merge |
| Kafka consumer group rebalancing pauses ingestion | Medium | External consumer process (Vector, Bytewax, custom) doing batched INSERTs instead of Kafka engine | Extra service to operate; loses the built-in Kafka engine convenience |
| Concurrent query handling less optimized than DW engines | Medium | Set max_concurrent_queries; partition workloads via resource_pool / user profiles; scale horizontally | Operational complexity in tuning concurrency limits; horizontal scale requires more cluster ops |
| Tombstone-based deletes are eventually consistent on disk reclamation | Medium | OPTIMIZE TABLE ... FINAL for explicit compaction; partition-level TTL for time-bound data | OPTIMIZE FINAL is heavy; partition TTL only works for time-series-shaped data |
| Cluster scaling requires careful Kafka partition planning if using Kafka engine | Medium | External consumer pattern (separate ingestion from storage scaling) | One more service to operate and monitor; ops surface grows |
Apache Pinot
Distributed Real-Time OLAP| Limitation | Severity | Workaround | Workaround Cost |
|---|---|---|---|
| OSS upsert tables are JVM-heap-bound for PK count per server | High | Move to StarTree Cloud for off-heap upsert metadata; partition tables more aggressively to spread PK count | Managed service cost (StarTree); or operational overhead in partitioning + monitoring heap pressure |
| Operational complexity: 5 components plus ZK/Helix | High | Use StarTree Cloud, or commit a 2-3 engineer platform team to operate it | Managed service premium; or engineer headcount in perpetuity |
| Less mature SQL surface than ClickHouse for complex analytics | Medium | Use v2 query engine (multi-stage); for analyst-facing workloads, layer Trino or another engine; for product-side queries, stick to Pinot-idiomatic shapes | Two-engine complexity; analyst learning curve on Pinot SQL dialect |
| 30-50% more disk than equivalent ClickHouse | Medium | Tune index choices; drop default star-tree on tables that don't need it; use tiered storage for cold | Configuration effort; tiered storage adds query latency variance |
| MMAP I/O causes issues when segments exceed RAM | High | Size servers so hot segments fit in memory; move cold to tiered storage; partition more aggressively | Cluster sizing overhead; cold-tier query latency degrades; possible re-architecture at scale |
| Kubernetes operator less mature than ClickHouse Altinity Operator | Medium | Use StarTree Cloud K8s tooling; or operate on VMs; or invest in custom K8s operator config | Managed service cost or platform engineering effort |
| Join performance trails ClickHouse for complex multi-table queries | Medium | Denormalize at ingest; use lookup joins for small dimension tables; structure data for star-tree-friendly queries | Storage duplication; data model complexity at ingest; less flexibility for ad-hoc joins |
4. Fault Tolerance
Both engines replicate at the data layer, but the durability semantics differ. ClickHouse async inserts have a known in-flight risk window; Pinot's source-of-truth-in-Kafka pattern means recovery is bounded by Kafka offset retention.
| Dimension | ClickHouse | Apache Pinot |
|---|---|---|
| Replication model | ReplicatedMergeTree with async multi-master replication coordinated by ClickHouse Keeper (or ZooKeeper). Write to any replica, propagated to others. | Segment-based replication; Controller assigns segments to multiple Servers via Helix; deep store (S3/HDFS) is the source of truth. |
| Failure detection | Keeper-based heartbeats; replica failure detected via session timeout (default 30s, tunable). | ZooKeeper + Helix manage cluster state; Controller detects Server failure via heartbeat. |
| Failover mechanism | Other replicas serve reads; client retries to other shard via load balancer or smart driver. Distributed table routes around failed replicas. | Broker routes queries to surviving replicas; on persistent Server failure, segments redistributed from deep store to other Servers. |
| RTO (typical) | Seconds for read failover (other replica serves); minutes for full server recovery and replica sync. | Seconds for broker re-routing; minutes for segment redownload from deep store to replacement server. |
| RPO (typical) | Sub-second for replicated commits; async insert buffer at risk between buffer and flush if using wait_for_async_insert=0. | 0 for committed segments (Kafka offsets replayable); sub-second for current consuming segment which is regenerated from Kafka. |
| Split-brain behavior | Keeper quorum prevents split-brain; minority side becomes read-only until partition heals. | ZK quorum prevents controller split-brain; servers in minority partition stop accepting writes until partition heals. |
| Blast radius of single-node failure | Single shard impacted; other shards continue. If single-replica, the shard is unavailable until recovery. | Segments on failed server are redistributed; brief query degradation; deep store as recovery source. |
| Cross-region failover story | Manual: async replica in another region OR Distributed table over WAN; no managed cross-region replication outside ClickHouse Cloud. | Manual: tiered storage + multi-cluster setup; deep store can be regional; StarTree Cloud offers cross-region as a tier. |
| Data loss scenarios | Async insert buffer loss between buffer flush; replica divergence under aggressive K8s autoscaling without graceful shutdown. | Mutable real-time segment loss before commit (rare with Kafka source-of-truth); deep store corruption (extremely rare on managed object stores). |
6. Replication
ClickHouse replication is async multi-master with optional quorum writes. Pinot replication is replica-group-based with Kafka as the durable source of truth for real-time tables — a meaningfully different durability model.
| Dimension | ClickHouse | Apache Pinot |
|---|---|---|
| Replication topology | Multi-master async via ReplicatedMergeTree + Keeper/ZK; writes to any replica propagate to others; Distributed table layers sharding on top. | Replica-group-based; Controller-managed via Helix; deep store as durable source of truth; LLC coordinates Kafka offsets across replicas. |
| Sync vs async | Async by default; quorum INSERT setting (insert_quorum=N) forces sync to N replicas before ACK. | Sync within replica-group on segment commit (LLC + Kafka offset coordination); async push to deep store. |
| Replication factor (default / max) | Configurable per ReplicatedMergeTree table; typical 2-3; bounded by cluster size. | Configurable per table; typical 2-3; sticky Kafka partition assignment constrains practical maximum. |
| Consistency level options | Eventual by default; insert_quorum + select_sequential_consistency for stronger guarantees; consistency tunable per query. | Per-query consistency strict within replica-group view; upsert tables provide read-your-writes within partition. |
| Replication lag (typical) | Sub-second within region for ReplicatedMergeTree; cross-region uses async replica configurations. | Sub-second (Kafka offset-coordinated for real-time tables); deep store sync is also sub-second to seconds. |
| Conflict resolution | No conflict resolution per se — ReplacingMergeTree, CollapsingMergeTree, AggregatingMergeTree provide engine-level dedup or rollup semantics at merge time. | Per-PK upsert resolution at ingestion time using configurable strategy (last-write-wins, partial-upsert with column-level merge). |
| Cross-region replication | Manual: Distributed table over WAN, or async replica in another region; ClickHouse Cloud adds managed cross-region. | Manual: tiered storage + WAN sync; StarTree Cloud provides managed cross-region as a tier. |
| Replication during partition | Replicas continue accepting writes; divergence resolved on heal via merge; insert_quorum prevents writes on minority side. | Replica-group writes continue if quorum maintained; minority side rejects writes to preserve consistency. |
7. Better Usage Patterns
The patterns that distinguish a production deployment from a prototype. ClickHouse's are tuning-heavy (codecs, projections, MV chains); Pinot's are model-heavy (index choice per query, partition-by-PK for upserts, replica-group sizing).
ClickHouse
Columnar LSM OLAP| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Batch inserts or use async_insert | Single-row INSERTs from app code; one part per row; merge pipeline collapses | Batch ≥1000 rows client-side, OR enable async_insert with wait_for_async_insert=1 for durability | Merge throughput stays healthy; storage-level part count stays bounded; SELECT performance doesn't degrade over time |
| LowCardinality(String) for high-repeat string columns | Plain String columns for status, country, device_type, etc. | LowCardinality(String) on any column with <10K distinct values | 10x compression on that column; faster filter and group-by; lowest-effort win in ClickHouse tuning |
| ORDER BY clause as the index | Choose ORDER BY for "logical" sort order without thinking about filter columns | Choose ORDER BY for the columns you filter on most often, in order of cardinality (low-card first) | Primary key skip-index works on ORDER BY; wrong choice means full-table scans on filter predicates |
| ReplacingMergeTree for upsert-like patterns | Try to UPDATE rows on regular MergeTree; mutations queue up; SELECT performance degrades | ReplacingMergeTree with version column; SELECT FINAL or argMax aggregation for current state | Engine handles dedup at merge; no mutation backlog; production-ready upsert semantics |
| Materialized Views for pre-aggregation | GROUP BY at query time on huge tables; latency varies with query parameters | Materialized View with AggregatingMergeTree target; queries hit pre-aggregated data; chain MVs for hierarchical rollups | Order-of-magnitude latency reduction on common aggregations; predictable cost |
| ClickHouse Keeper over ZooKeeper for new deployments | Keep using ZooKeeper because "it's what the docs showed" | ClickHouse Keeper (C++ implementation, lower latency, simpler ops) | Removes JVM operational complexity; lower coordination latency; fewer surprises at scale |
| Avoid SELECT * on wide tables | SELECT * habit from row-oriented databases | Always select the specific columns needed; columnar means column count drives I/O | Scan I/O scales with column count selected, not row count; SELECT * negates the columnar benefit |
| External Kafka consumer for high throughput | Use Kafka engine; hit rebalance pauses and partition-count scaling limits | External consumer (Vector, Bytewax, Benthos) doing batched INSERTs; separates ingest scaling from storage scaling | Eliminates rebalance pauses; scales beyond Kafka partition count; better observability into ingest pipeline |
Apache Pinot
Distributed Real-Time OLAP| Pattern | What Most Teams Do Wrong | The Better Way | Why It Matters |
|---|---|---|---|
| Star-tree for multi-dim aggregations, not arbitrary queries | Enable default star-tree on every table; expect uniform speedup | Configure star-tree explicitly for the (dimensions, aggregations) of your top dashboard queries; profile before broad enablement | Storage overhead pays back only when star-tree matches actual query shape; default config rarely matches production patterns |
| Partition Kafka by PK for upsert tables | Use round-robin or random Kafka partitioning; upsert correctness breaks | Kafka stream MUST be partitioned by primary key; all messages for a PK land on the same Pinot server | Upsert correctness requires this; otherwise PK metadata maps don't converge; this is a hard correctness requirement, not an optimization |
| Cluster sizing: num_partitions × num_replicas ≥ num_servers | Scale to 10 servers with 6 Kafka partitions and 1 replica; 4 servers go idle | Always plan: num_partitions × num_replicas must equal or exceed planned server count; over-provision Kafka partitions for headroom | Sticky partition-to-server mapping means under-partitioning leaves capacity unused; this is non-obvious until you try to scale |
| Minion-based segment compaction for upsert-heavy tables | Let segments accumulate; obsolete-row ratio rises; queries slow over time | Schedule SegmentRefreshTask via Minion to compact segments based on obsolete-row ratio | StarTree Cloud automates this; OSS Pinot requires explicit task scheduling but the same primitive works |
| Bloom filter on high-selectivity columns, inverted on low-cardinality | Default everything; segments balloon 3x; query speedup is marginal | Bloom on columns with high selectivity for point queries; inverted on low-cardinality filter columns; range index on time and numeric ranges | Index choice is per-column-per-query; profile actual query patterns against representative segments before broad enablement |
| Offline + real-time hybrid tables for historical + fresh | Single real-time table; historical data accumulates in real-time tier; cost grows | Real-time table for fresh window (e.g., last 7 days); offline table for historical data, fed by daily batch jobs | Resource isolation; cost optimization (offline servers can be smaller); query routing handles the split transparently |
| Set retention explicitly | OSS Pinot default has no retention; segments accumulate indefinitely | Set retentionTimeUnit + retentionTimeValue in segmentsConfig; verify periodically | Storage cost spirals on long-running tables without retention; this is a "you'll forget about it" footgun |
| Replica-group routing for tenant isolation | Single replica group; all tenants share resources; noisy neighbor under load | Per-tenant or per-workload replica groups; broker routes queries to dedicated replica group | Isolates noisy tenants; predictable p99 per tenant; chargeback becomes possible |
8. Advanced / Next-Gen Alternatives
Where each engine may be augmented or displaced. The real-time OLAP space is consolidating; Druid is fading, ClickHouse and Pinot are the two viable open engines plus their managed services (ClickHouse Cloud, StarTree Cloud).
ClickHouse
Columnar LSM OLAP| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| ClickHouse Cloud (managed) | Separated storage and compute on object store; auto-scaling; managed Keeper; lower ops burden | Production | Low — lift-and-shift from self-hosted; SQL surface is identical; cost model shifts to managed-service line-item | When self-hosted operational burden exceeds the managed-service premium; when you need rapid scaling without operating Keeper + cluster |
| Apache Pinot / StarTree Cloud | Native upserts at scale; sub-second p99 for user-facing dashboards; star-tree for pre-aggregated multi-dim queries | Production | Medium-High — different data model; star-tree shape vs MergeTree shape; rewrite of materialized view logic | When user-facing dashboard p99 becomes the SLO and ClickHouse can't hit it under load; when upsert workload outgrows ReplacingMergeTree |
| DuckDB / Polars (embedded analytical) | Single-node analytical performance for small-to-medium data; no cluster to operate; embeddable | Production | Low for specific workloads — complementary, not replacement; useful for edge analytics or notebook-scale workloads | When data fits on one node (low TB scale) and you'd rather not operate a cluster; for ad-hoc analytics adjacent to ClickHouse |
| Apache Druid | Streaming ingest flexibility for diverse sources and out-of-order events; mature for time-series-shaped workloads | Production | High — different segment model; operational complexity comparable to Pinot; Druid mindshare has been declining | Mostly when inheriting an existing Druid deployment; greenfield, ClickHouse and Pinot are the more active choices in 2026 |
Apache Pinot
Distributed Real-Time OLAP| Successor / Alternative | What It Improves | Maturity | Migration Cost | When To Consider |
|---|---|---|---|---|
| StarTree Cloud (managed Pinot) | Off-heap upsert metadata (10x PK scale); Pauseless Ingestion (40M+ events/sec/table); tiered storage; managed K8s | Production | Low from OSS Pinot — same engine, managed control plane and add-ons | When OSS Pinot's heap-bound upserts limit scaling; when ingest throughput approaches OSS ceiling; when operating 5 components in-house is the bottleneck |
| ClickHouse / ClickHouse Cloud | Operational simplicity (single binary); higher compression; more flexible SQL surface; richer ecosystem | Production | Medium-High — different data model; rewrite of star-tree-shaped queries; index strategy rethink | When ad-hoc analytical queries dominate over pre-defined dashboard queries; when ops headcount is the binding constraint; Constant Contact pattern |
| Apache Druid | More flexible streaming ingest; better tooling polish in some areas (Imply Polaris UI); mature K8s operator | Production | Medium — adjacent architecture but different segment + upsert model; Pinot's upsert maturity is hard to match | When streaming ingest from diverse sources matters more than upsert semantics; mostly relevant for greenfield Druid evaluations |
| Snowflake Hybrid Tables or Databricks Lakebase | OLTP + OLAP unified on a warehouse; no separate operational store needed for some user-facing patterns | Emerging | High — different architecture; not a 1:1 replacement; only viable for specific patterns where warehouse-grade latency is enough | When user-facing latency requirements relax to 100ms+ and the unified-platform pitch outweighs the dedicated OLAP engine perf |