Real-Time OLAP — ClickHouse vs Apache Pinot

User-facing analytics at high QPS, sub-second p99, streaming ingest from Kafka. Two engines that converge on the same use cases from opposite architectural ends.

Real-Time OLAP PE / Staff+ Depth

As of 2026-05-31

PE Verdict

ClickHouse vs Pinot is not "which is faster." It is "what is your query concurrency, do you need first-class upserts, and is your team SQL-native or Java-native." ClickHouse wins on compression, flexible analytical SQL, ecosystem (Tinybird, Posthog, observability), and operational simplicity — one binary, fewer moving parts. Pinot wins on user-facing p99 under high concurrency, native real-time upserts, star-tree pre-aggregation for low-cardinality dashboards, and deterministic Kafka partition assignment (no consumer-group rebalancing storms). The 2026 reality: ClickHouse Cloud has narrowed Pinot's user-facing moat with separated compute/storage and lightweight updates; Pinot's remaining moat is upserts, the star-tree index, and the LinkedIn/Uber/Stripe pattern of millions of end-users hitting analytical queries. Constant Contact migrated off Pinot to ClickHouse for the flexibility win at their workload shape; LinkedIn/Stripe stay on Pinot for the concurrency win at theirs. Both are correct, in their respective workload shapes.

Best default choices

1. Trade-Offs

Per-technology trade-off lists. The trade space is genuinely different: ClickHouse trades LSM merge-amplification for compression and flexible queries; Pinot trades operational complexity for sub-second p99 and native upserts.

ClickHouse

Columnar LSM OLAP
Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
Append-only MergeTree LSM architectureExtreme write throughput; 10-20x compression on typical event data; columnar scans on 1B rows in millisecondsSingle-row updates are anti-pattern; deletes use tombstones with eventual consistency on disk reclamationTeam treats ClickHouse like Postgres for a small operational use case; UPDATE WHERE id = X for each event creates one part per row and compaction can't keep upThe mental model is "you write batches, the engine merges in background." Any pattern that produces one part per write at high frequency degrades the merge pipeline.
Async inserts with server-side batchingHigh-frequency small writes work without client-side batching; thousands of small clients can write directlywait_for_async_insert=0 trades durability between buffer and flush; wait=1 eliminates the perf gainTelemetry agent fires 1KB writes at 10K/sec with wait=0; ClickHouse process crashes between buffer and flush, you lose a few seconds of in-flight data26.1+ extended async insert deduplication to materialized views. Always pair async_insert with idempotent INSERT statements (use block_id or insert hash) for production durability.
Per-column codec compressionTunable per column: DoubleDelta, Gorilla, LZ4, ZSTD; 30-50% less disk than Pinot equivalentsMore configuration burden; default codecs leave significant perf on the tableYou leave defaults; storage bill is 2x what it would be with Gorilla on timestamps and LowCardinality on string dimensionsLowCardinality(String) on columns with under ~10K distinct values is the single highest-leverage tuning in ClickHouse. 10x compression and faster filter on the same column.
Projections as alternative sort orders or pre-aggregationsMultiple query patterns supported by one table; 25.6+ projection-as-secondary-index reduces storage overheadWrite amplification on insert; projection storage cost; complexity in query optimizationSix projections on a high-throughput table cause merge backpressure during traffic spikes; ingest lag balloonsIn 25.6/25.11, projections can store only the sort key plus a _part_offset pointer back to the base table — true secondary-index semantics. Adopt new style before legacy data-copy projections.
Materialized views aggregate at insert timeSub-second queries on pre-aggregated rollups; trigger-style chained MVs let you build aggregation hierarchiesMV-as-trigger semantics are subtle; deduplication scoped per table since 26.1; failures in MV chains cascadeMV chain has 3 levels; backfill produces unexpected double-counts because deduplication is per-table not per-chainIn 26.1, deduplication is on by default for both sync and async inserts. If you explicitly disabled it, you need deduplicate_insert='backward_compatible_choice' to keep old behavior.
Single-binary architectureOne process handles ingest, query, replication coordination; simpler ops than Pinot's 5-component clusterReplication still needs ZooKeeper or ClickHouse Keeper; cluster management is non-trivial at scaleSelf-hosted cluster of 50 servers; ZK ensemble becomes the operational bottleneck; you migrate to Keeper mid-projectClickHouse Keeper is the modern default for new deployments. C++ implementation, lower latency, fewer ops surprises than the ZooKeeper Java pairing.
Lightweight Updates (Patch Parts)UPDATE without table locks; instant read consistency vs old mutation-based updatesStill not OLTP; high-volume updates degrade query performance as patches accumulateUse case sends 1K updates/sec to a billion-row table; patch parts accumulate faster than merge can collapse them; SELECT performance regresses 30%Patch Parts is a structural improvement over the old mutation model but does not turn ClickHouse into an OLTP store. For high-rate update workloads, use ReplacingMergeTree or move the workload upstream.
Kafka engine via consumer groupsBuilt-in Kafka ingest as an engine; SELECT FROM Kafka materialized to target tableConsumer group rebalancing pauses ingestion when consumers join or leave; server count limited by Kafka partition countCluster autoscale event triggers a Kafka rebalance storm; ingest pauses for tens of seconds; downstream alerting firesFor high-throughput, run an external consumer (Vector, Bytewax, or custom) doing INSERTs instead of the Kafka engine. Better operational control, easier debugging, scales beyond partition count.
preads I/O instead of MMAPBetter random read perf; no TLB shootdowns; benefits from CPU prefetching and cachingRequires OS page cache tuning at large scale; cache-cold queries pay disk latencyServer with insufficient RAM for working set; queries hit disk and p99 jumps under loadThe Constant Contact migration story (Pinot to ClickHouse) cited preads vs MMAP as a key win for them. Pinot's MMAP shines when segments fit in RAM, hurts otherwise.

Apache Pinot

Distributed Real-Time OLAP
Trade-OffWhat You GainWhat You Give UpWhen It Bites YouPE Nuance
Star-tree index for multi-dimensional pre-aggregationReal-time aggregation of billions of rows in milliseconds without explicit materialized views; covers many low-cardinality dimensions in one treeStorage overhead for the tree; must be configured per query pattern; not free for arbitrary queriesTeam enables default star-tree on a wide table; tree doesn't match actual query shapes; storage bill rises but query perf barely movesStar-tree is purpose-built for "GROUP BY a, b, c with filters on the same columns" — the LinkedIn 'Who viewed your profile' shape. For arbitrary analytical queries, you're better off with ClickHouse's flexible projections.
Native real-time upserts via per-segment bitmapsPK-based upserts at ingestion; in-memory bitmap marks obsolete rows; queries filter at read time; full and partial upserts supportedJVM heap-bound: PK count per server limited by heap size; OSS Pinot tops out at hundreds of millions of PKs per serverUpsert table grows past heap limit; server OOMs; you scale horizontally and discover sticky partition assignment caps your scalingStarTree Cloud's off-heap upsert metadata lifts the heap ceiling roughly 10x. If your upsert PK count is in the billions, StarTree is the route; OSS Pinot won't scale there.
Deterministic Kafka partition-to-server mapping (LLC)No consumer-group rebalancing storms; predictable ingestion latency; coordinated segment completion across replicasSticky assignment: num_partitions × num_replicas must be ≥ num_servers or some nodes go idleYou scale to 10 servers expecting load balancing; you have 6 Kafka partitions and 1 replica, so only 6 of 10 servers ingest anythingPauseless Ingestion (StarTree Cloud) takes this further to 40M+ events/sec/table. For high-throughput ingest where rebalancing pauses are a hard SLO violation, this is the structural advantage over ClickHouse's Kafka engine.
Segment-based architecture with deep storeClear cold/warm separation; S3 or HDFS as source of truth; servers can be rebuilt from deep storeFive moving pieces: Controller + Broker + Server + Minion + ZK/Helix; ops complexity is the highest in the OLAP spaceTeam picks Pinot for a 5-engineer startup; spends three engineer-quarters operating the cluster before any feature shipsFor self-hosted, the operational depth required is real. Either commit to it with dedicated platform engineers or use StarTree Cloud and pay the managed-service tax.
Sub-second p99 at high concurrencyThousands of QPS user-facing dashboards; LinkedIn 'Who Viewed Your Profile' pattern; Stripe merchant dashboardsLess flexible SQL surface than ClickHouse; v2 engine improves joins but still trails for full-scan analytical patternsAnalyst tries an ad-hoc multi-CTE query against Pinot expecting ClickHouse-like flexibility; query plan is suboptimal or unsupportedPinot is purpose-built for "many users querying simple analytical questions over fresh data." If your queries look like SQL the analyst wrote, ClickHouse is closer to what you want.
MMAP-based segment loadingFast access when segments fit in memory; OS handles pagingIf segments exceed RAM, TLB shootdowns and memory contention degrade performanceWorkload grows past RAM working set; queries get slow under contention; you can't tune your way out without adding servers or moving data to tiered storageThe Constant Contact migration cited MMAP scaling issues as a real driver to ClickHouse. Sizing rule: keep hot segments in RAM. Tiered storage in StarTree mitigates this for cold data.
Rich indexing toolkit per columnJSON, text, range, bloom, inverted, timestamp, and Array text-index all coexist per columnConfiguration burden; choosing wrong indexes costs both storage and query perfDefault everything; segments balloon 3x; query speedup is marginal because indexes don't match access patternsIndexes in Pinot are a per-column-per-query decision. Profile queries against a representative segment before enabling indexes broadly. Default star-tree config (≤10K cardinality dims, all DateTime cols) is often the wrong shape for production.
Pauseless Ingestion (StarTree Cloud)40M+ events/sec per table; no pauses on segment completionManaged-tier feature; OSS Pinot tops out lower; lock-in to StarTreeSelf-hosted Pinot hits ingestion ceiling at sustained 1M events/sec/table; migration to StarTree Cloud or re-architectThe OSS-vs-managed gap on Pinot is meaningful for high-throughput workloads. ClickHouse OSS and ClickHouse Cloud have a smaller gap on raw ingest throughput.
Off-heap upsert metadata (StarTree)Reduces JVM heap pressure ~10x; enables billions of PKs per serverStarTree-only differentiator; OSS Pinot upserts hit heap ceiling firstOSS Pinot upserts work for hundreds of millions of PKs; you scale past, GC pauses start tripling p99If upsert scale is your driver and you're committed to OSS, you'll hit this. If you need upsert scale without StarTree, ClickHouse's ReplacingMergeTree + lightweight updates is the OSS-friendly alternative.

2. Use Cases

Concrete production workloads with the driving property that ruled out the alternative. The pattern is consistent: ClickHouse wins on analyst-flexible workloads with high cardinality and complex queries; Pinot wins on user-facing dashboards with predictable query shapes and update semantics.

ClickHouse

Columnar LSM OLAP
Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
Observability and logs backendUber — Logging platform processing PB-scale logs/tracesFlexible analytical SQL over high-cardinality time-series data; massive compression on log fieldsMulti-PB log volume; thousands of fields; ad-hoc multi-dimensional queries from on-call engineersPinot's star-tree assumes known query dimensions; log analysis needs ad-hoc dimensionality the engineer didn't pre-define
Web analytics SaaS backendCloudflare Analytics — request-level visibility for millions of customersSub-second queries over high-cardinality dimensions (URL, ASN, country, device); flexible filters per customerTens of billions of events per day; petabytes hot; millions of tenants with varied query patternsPinot's tenant isolation is harder at this multi-tenancy scale; ClickHouse's flexibility lets each customer query their own shape
Migrated from Pinot for flexibility + opsConstant Contact — moved Automated Path Builder analytics off Pinot to ClickHouse10x speedup and 10x cost reduction on their analytical query shape; LSM + preads vs MMAP for their working-set sizeMarketing automation event data; complex multi-join queries that didn't match Pinot's star-tree shapePinot had been the original choice but operational complexity and MMAP scaling issues drove the migration; ClickHouse's flexibility won for their workload
SaaS analytics platform built on ClickHouseTinybird — real-time analytics API platform; ClickHouse-as-a-serviceSingle-binary ops; SQL surface analysts and developers both speak; predictable cost model on managed ClickHouseMulti-tenant SaaS; thousands of customers building their own analytics products on topPinot's operational depth is a SaaS-builder anti-feature; the platform engineer experience matters as much as raw perf
Product analytics platformPostHog — open-source product analytics; events, funnels, retentionOpen-source; flexible SQL surface for funnel/retention queries; cost-efficient self-host option for enterpriseBillions of events per day for large customers; multi-step funnel queries; arbitrary cohort definitionsPinot's licensing is also open but operational complexity is higher; for an OSS product analytics company the simpler ops path matters

Apache Pinot

Distributed Real-Time OLAP
Use CaseCompany / ScenarioDriving PropertyScale DimensionWhy Not Alternative
User-facing analytics for millions of end-usersLinkedIn — 'Who viewed your profile,' notifications, member analytics (founding use case)Sub-second p99 over millions of concurrent users hitting analytical queries; star-tree pre-aggregation makes counts and aggregates instantHundreds of millions of members; queries per user per session; freshness measured in secondsClickHouse can do high-QPS but the deterministic Kafka mapping + star-tree advantage are LinkedIn-shaped problems
Restaurant manager analytics for couriersUber — Restaurant Manager dashboards; UberEats restaurant-facing real-time metricsSub-second freshness from order events to restaurant-facing dashboard; per-merchant slice queries at high concurrencyTens of thousands of merchants; per-merchant dashboards; fresh-within-seconds order dataClickHouse Kafka engine rebalance pauses would violate freshness SLO; Pinot's LLC determinism wins here
Merchant dashboardsStripe — real-time merchant payment analyticsPer-merchant queries fan out to thousands of concurrent dashboards; deterministic latency under loadMillions of merchants; transaction-level events; sub-second dashboard refreshClickHouse could serve this but Pinot's purpose-built user-facing pattern is harder to beat at the p99 tail
Real-time pricing intelligenceWalmart — pricing and inventory analytics across stores and SKUsStar-tree index over (store × SKU × time) for sub-second aggregation; upsert support for changing pricesTens of thousands of stores; millions of SKUs; price updates flow through KafkaClickHouse ReplacingMergeTree handles upserts but star-tree on the multi-dim aggregation is where Pinot pulls ahead
Connected vehicle telemetryMercedes-Benz / automotive — connected fleet analyticsUpsert-friendly model for slowly-changing vehicle state; per-vehicle queries at scaleMillions of vehicles; per-vehicle state with frequent updates; geographic and fleet-level rollupsClickHouse handles append-heavy IoT well; Pinot's native upserts make the "current state per vehicle" pattern cleaner

3. Limitations

Per-technology limitation tables. Severity reflects blast radius on production workloads. Workaround Cost is what you actually pay to dodge each one.

ClickHouse

Columnar LSM OLAP
LimitationSeverityWorkaroundWorkaround Cost
Single-row INSERTs without async are an anti-patternHighAlways batch (1000+ rows) client-side OR enable async_insert with appropriate busy_timeoutClient-side batching adds latency; async_insert without wait risks durability between buffer flush
Joins historically weaker than DW SQL surfaceHighDenormalize at ingest; use dictionaries for small lookup tables; use the new query planner for distributed joinsStorage duplication on denormalized data; complexity at ingest layer
Updates and deletes via mutations are expensive on hot dataHighUse Lightweight Updates (Patch Parts) for low-volume updates; ReplacingMergeTree for upsert-like patternsPatch parts accumulate under high-volume updates and degrade SELECT perf; ReplacingMergeTree returns multiple versions until merge
Kafka consumer group rebalancing pauses ingestionMediumExternal consumer process (Vector, Bytewax, custom) doing batched INSERTs instead of Kafka engineExtra service to operate; loses the built-in Kafka engine convenience
Concurrent query handling less optimized than DW enginesMediumSet max_concurrent_queries; partition workloads via resource_pool / user profiles; scale horizontallyOperational complexity in tuning concurrency limits; horizontal scale requires more cluster ops
Tombstone-based deletes are eventually consistent on disk reclamationMediumOPTIMIZE TABLE ... FINAL for explicit compaction; partition-level TTL for time-bound dataOPTIMIZE FINAL is heavy; partition TTL only works for time-series-shaped data
Cluster scaling requires careful Kafka partition planning if using Kafka engineMediumExternal consumer pattern (separate ingestion from storage scaling)One more service to operate and monitor; ops surface grows

Apache Pinot

Distributed Real-Time OLAP
LimitationSeverityWorkaroundWorkaround Cost
OSS upsert tables are JVM-heap-bound for PK count per serverHighMove to StarTree Cloud for off-heap upsert metadata; partition tables more aggressively to spread PK countManaged service cost (StarTree); or operational overhead in partitioning + monitoring heap pressure
Operational complexity: 5 components plus ZK/HelixHighUse StarTree Cloud, or commit a 2-3 engineer platform team to operate itManaged service premium; or engineer headcount in perpetuity
Less mature SQL surface than ClickHouse for complex analyticsMediumUse v2 query engine (multi-stage); for analyst-facing workloads, layer Trino or another engine; for product-side queries, stick to Pinot-idiomatic shapesTwo-engine complexity; analyst learning curve on Pinot SQL dialect
30-50% more disk than equivalent ClickHouseMediumTune index choices; drop default star-tree on tables that don't need it; use tiered storage for coldConfiguration effort; tiered storage adds query latency variance
MMAP I/O causes issues when segments exceed RAMHighSize servers so hot segments fit in memory; move cold to tiered storage; partition more aggressivelyCluster sizing overhead; cold-tier query latency degrades; possible re-architecture at scale
Kubernetes operator less mature than ClickHouse Altinity OperatorMediumUse StarTree Cloud K8s tooling; or operate on VMs; or invest in custom K8s operator configManaged service cost or platform engineering effort
Join performance trails ClickHouse for complex multi-table queriesMediumDenormalize at ingest; use lookup joins for small dimension tables; structure data for star-tree-friendly queriesStorage duplication; data model complexity at ingest; less flexibility for ad-hoc joins

4. Fault Tolerance

Both engines replicate at the data layer, but the durability semantics differ. ClickHouse async inserts have a known in-flight risk window; Pinot's source-of-truth-in-Kafka pattern means recovery is bounded by Kafka offset retention.

Dimension ClickHouse Apache Pinot
Replication model ReplicatedMergeTree with async multi-master replication coordinated by ClickHouse Keeper (or ZooKeeper). Write to any replica, propagated to others. Segment-based replication; Controller assigns segments to multiple Servers via Helix; deep store (S3/HDFS) is the source of truth.
Failure detection Keeper-based heartbeats; replica failure detected via session timeout (default 30s, tunable). ZooKeeper + Helix manage cluster state; Controller detects Server failure via heartbeat.
Failover mechanism Other replicas serve reads; client retries to other shard via load balancer or smart driver. Distributed table routes around failed replicas. Broker routes queries to surviving replicas; on persistent Server failure, segments redistributed from deep store to other Servers.
RTO (typical) Seconds for read failover (other replica serves); minutes for full server recovery and replica sync. Seconds for broker re-routing; minutes for segment redownload from deep store to replacement server.
RPO (typical) Sub-second for replicated commits; async insert buffer at risk between buffer and flush if using wait_for_async_insert=0. 0 for committed segments (Kafka offsets replayable); sub-second for current consuming segment which is regenerated from Kafka.
Split-brain behavior Keeper quorum prevents split-brain; minority side becomes read-only until partition heals. ZK quorum prevents controller split-brain; servers in minority partition stop accepting writes until partition heals.
Blast radius of single-node failure Single shard impacted; other shards continue. If single-replica, the shard is unavailable until recovery. Segments on failed server are redistributed; brief query degradation; deep store as recovery source.
Cross-region failover story Manual: async replica in another region OR Distributed table over WAN; no managed cross-region replication outside ClickHouse Cloud. Manual: tiered storage + multi-cluster setup; deep store can be regional; StarTree Cloud offers cross-region as a tier.
Data loss scenarios Async insert buffer loss between buffer flush; replica divergence under aggressive K8s autoscaling without graceful shutdown. Mutable real-time segment loss before commit (rare with Kafka source-of-truth); deep store corruption (extremely rare on managed object stores).

5. Sharding

ClickHouse's sharding is manual and operator-defined; Pinot's is segment-based and controller-managed. The trade is operational simplicity (Pinot's automation) vs flexibility (ClickHouse's hand-tuned distribution).

Dimension ClickHouse Apache Pinot
Sharding model Distributed table over local shards; sharding via expression (hash, rand, or custom) defined per Distributed table. Hash by partition column for offline tables; for real-time tables, segments map to Kafka partitions (1:1 typical).
Shard key constraints Sharding expression set at Distributed table creation; changing it means rebuilding the Distributed table. Real-time upsert tables require Kafka stream to be partitioned by primary key for correctness; partition column is set in table config.
Rebalancing mechanism Manual: INSERT INTO ... SELECT to new sharded table, or external rebalance scripts; no automatic rebalancer. Helix-managed automatic segment assignment; rebalance API; Minion-based segment movement.
Rebalancing cost / impact High — typically full data movement; can require downtime or shadow-table cutover for clean resharding. Online — segments move via deep store; queries continue during rebalance; minor performance variance during the move.
Hot-shard behavior Manual repartition; no auto-rebalance; hot shard is operator-detected and operator-fixed via shard expression change. Segment splits via Minion compaction; star-tree mitigates hot aggregation paths; sticky Kafka partition assignment for real-time upserts can create unbalanced server load.
Maximum shards (practical) 10s to 100s of shards typical; operational complexity rises with shard count due to manual rebalance requirement. Hundreds of servers with tens of thousands of segments; Helix handles segment-to-server assignment automatically.
Resharding without downtime? Difficult — typically requires shadow table + cutover; some shops use rolling resharding tooling but it's custom. Yes — segment-level moves orchestrated by Minion; deep store ensures durability during move.
Cross-shard query support Distributed engine performs scatter-gather; result merge happens at the initiator; new query planner improves distributed join performance. Native Broker scatter-gather; v2 multi-stage engine handles distributed joins; star-tree can answer aggregation queries from a single tree without scatter-gather.

6. Replication

ClickHouse replication is async multi-master with optional quorum writes. Pinot replication is replica-group-based with Kafka as the durable source of truth for real-time tables — a meaningfully different durability model.

Dimension ClickHouse Apache Pinot
Replication topology Multi-master async via ReplicatedMergeTree + Keeper/ZK; writes to any replica propagate to others; Distributed table layers sharding on top. Replica-group-based; Controller-managed via Helix; deep store as durable source of truth; LLC coordinates Kafka offsets across replicas.
Sync vs async Async by default; quorum INSERT setting (insert_quorum=N) forces sync to N replicas before ACK. Sync within replica-group on segment commit (LLC + Kafka offset coordination); async push to deep store.
Replication factor (default / max) Configurable per ReplicatedMergeTree table; typical 2-3; bounded by cluster size. Configurable per table; typical 2-3; sticky Kafka partition assignment constrains practical maximum.
Consistency level options Eventual by default; insert_quorum + select_sequential_consistency for stronger guarantees; consistency tunable per query. Per-query consistency strict within replica-group view; upsert tables provide read-your-writes within partition.
Replication lag (typical) Sub-second within region for ReplicatedMergeTree; cross-region uses async replica configurations. Sub-second (Kafka offset-coordinated for real-time tables); deep store sync is also sub-second to seconds.
Conflict resolution No conflict resolution per se — ReplacingMergeTree, CollapsingMergeTree, AggregatingMergeTree provide engine-level dedup or rollup semantics at merge time. Per-PK upsert resolution at ingestion time using configurable strategy (last-write-wins, partial-upsert with column-level merge).
Cross-region replication Manual: Distributed table over WAN, or async replica in another region; ClickHouse Cloud adds managed cross-region. Manual: tiered storage + WAN sync; StarTree Cloud provides managed cross-region as a tier.
Replication during partition Replicas continue accepting writes; divergence resolved on heal via merge; insert_quorum prevents writes on minority side. Replica-group writes continue if quorum maintained; minority side rejects writes to preserve consistency.

7. Better Usage Patterns

The patterns that distinguish a production deployment from a prototype. ClickHouse's are tuning-heavy (codecs, projections, MV chains); Pinot's are model-heavy (index choice per query, partition-by-PK for upserts, replica-group sizing).

ClickHouse

Columnar LSM OLAP
PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
Batch inserts or use async_insertSingle-row INSERTs from app code; one part per row; merge pipeline collapsesBatch ≥1000 rows client-side, OR enable async_insert with wait_for_async_insert=1 for durabilityMerge throughput stays healthy; storage-level part count stays bounded; SELECT performance doesn't degrade over time
LowCardinality(String) for high-repeat string columnsPlain String columns for status, country, device_type, etc.LowCardinality(String) on any column with <10K distinct values10x compression on that column; faster filter and group-by; lowest-effort win in ClickHouse tuning
ORDER BY clause as the indexChoose ORDER BY for "logical" sort order without thinking about filter columnsChoose ORDER BY for the columns you filter on most often, in order of cardinality (low-card first)Primary key skip-index works on ORDER BY; wrong choice means full-table scans on filter predicates
ReplacingMergeTree for upsert-like patternsTry to UPDATE rows on regular MergeTree; mutations queue up; SELECT performance degradesReplacingMergeTree with version column; SELECT FINAL or argMax aggregation for current stateEngine handles dedup at merge; no mutation backlog; production-ready upsert semantics
Materialized Views for pre-aggregationGROUP BY at query time on huge tables; latency varies with query parametersMaterialized View with AggregatingMergeTree target; queries hit pre-aggregated data; chain MVs for hierarchical rollupsOrder-of-magnitude latency reduction on common aggregations; predictable cost
ClickHouse Keeper over ZooKeeper for new deploymentsKeep using ZooKeeper because "it's what the docs showed"ClickHouse Keeper (C++ implementation, lower latency, simpler ops)Removes JVM operational complexity; lower coordination latency; fewer surprises at scale
Avoid SELECT * on wide tablesSELECT * habit from row-oriented databasesAlways select the specific columns needed; columnar means column count drives I/OScan I/O scales with column count selected, not row count; SELECT * negates the columnar benefit
External Kafka consumer for high throughputUse Kafka engine; hit rebalance pauses and partition-count scaling limitsExternal consumer (Vector, Bytewax, Benthos) doing batched INSERTs; separates ingest scaling from storage scalingEliminates rebalance pauses; scales beyond Kafka partition count; better observability into ingest pipeline

Apache Pinot

Distributed Real-Time OLAP
PatternWhat Most Teams Do WrongThe Better WayWhy It Matters
Star-tree for multi-dim aggregations, not arbitrary queriesEnable default star-tree on every table; expect uniform speedupConfigure star-tree explicitly for the (dimensions, aggregations) of your top dashboard queries; profile before broad enablementStorage overhead pays back only when star-tree matches actual query shape; default config rarely matches production patterns
Partition Kafka by PK for upsert tablesUse round-robin or random Kafka partitioning; upsert correctness breaksKafka stream MUST be partitioned by primary key; all messages for a PK land on the same Pinot serverUpsert correctness requires this; otherwise PK metadata maps don't converge; this is a hard correctness requirement, not an optimization
Cluster sizing: num_partitions × num_replicas ≥ num_serversScale to 10 servers with 6 Kafka partitions and 1 replica; 4 servers go idleAlways plan: num_partitions × num_replicas must equal or exceed planned server count; over-provision Kafka partitions for headroomSticky partition-to-server mapping means under-partitioning leaves capacity unused; this is non-obvious until you try to scale
Minion-based segment compaction for upsert-heavy tablesLet segments accumulate; obsolete-row ratio rises; queries slow over timeSchedule SegmentRefreshTask via Minion to compact segments based on obsolete-row ratioStarTree Cloud automates this; OSS Pinot requires explicit task scheduling but the same primitive works
Bloom filter on high-selectivity columns, inverted on low-cardinalityDefault everything; segments balloon 3x; query speedup is marginalBloom on columns with high selectivity for point queries; inverted on low-cardinality filter columns; range index on time and numeric rangesIndex choice is per-column-per-query; profile actual query patterns against representative segments before broad enablement
Offline + real-time hybrid tables for historical + freshSingle real-time table; historical data accumulates in real-time tier; cost growsReal-time table for fresh window (e.g., last 7 days); offline table for historical data, fed by daily batch jobsResource isolation; cost optimization (offline servers can be smaller); query routing handles the split transparently
Set retention explicitlyOSS Pinot default has no retention; segments accumulate indefinitelySet retentionTimeUnit + retentionTimeValue in segmentsConfig; verify periodicallyStorage cost spirals on long-running tables without retention; this is a "you'll forget about it" footgun
Replica-group routing for tenant isolationSingle replica group; all tenants share resources; noisy neighbor under loadPer-tenant or per-workload replica groups; broker routes queries to dedicated replica groupIsolates noisy tenants; predictable p99 per tenant; chargeback becomes possible

8. Advanced / Next-Gen Alternatives

Where each engine may be augmented or displaced. The real-time OLAP space is consolidating; Druid is fading, ClickHouse and Pinot are the two viable open engines plus their managed services (ClickHouse Cloud, StarTree Cloud).

ClickHouse

Columnar LSM OLAP
Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
ClickHouse Cloud (managed)Separated storage and compute on object store; auto-scaling; managed Keeper; lower ops burdenProductionLow — lift-and-shift from self-hosted; SQL surface is identical; cost model shifts to managed-service line-itemWhen self-hosted operational burden exceeds the managed-service premium; when you need rapid scaling without operating Keeper + cluster
Apache Pinot / StarTree CloudNative upserts at scale; sub-second p99 for user-facing dashboards; star-tree for pre-aggregated multi-dim queriesProductionMedium-High — different data model; star-tree shape vs MergeTree shape; rewrite of materialized view logicWhen user-facing dashboard p99 becomes the SLO and ClickHouse can't hit it under load; when upsert workload outgrows ReplacingMergeTree
DuckDB / Polars (embedded analytical)Single-node analytical performance for small-to-medium data; no cluster to operate; embeddableProductionLow for specific workloads — complementary, not replacement; useful for edge analytics or notebook-scale workloadsWhen data fits on one node (low TB scale) and you'd rather not operate a cluster; for ad-hoc analytics adjacent to ClickHouse
Apache DruidStreaming ingest flexibility for diverse sources and out-of-order events; mature for time-series-shaped workloadsProductionHigh — different segment model; operational complexity comparable to Pinot; Druid mindshare has been decliningMostly when inheriting an existing Druid deployment; greenfield, ClickHouse and Pinot are the more active choices in 2026

Apache Pinot

Distributed Real-Time OLAP
Successor / AlternativeWhat It ImprovesMaturityMigration CostWhen To Consider
StarTree Cloud (managed Pinot)Off-heap upsert metadata (10x PK scale); Pauseless Ingestion (40M+ events/sec/table); tiered storage; managed K8sProductionLow from OSS Pinot — same engine, managed control plane and add-onsWhen OSS Pinot's heap-bound upserts limit scaling; when ingest throughput approaches OSS ceiling; when operating 5 components in-house is the bottleneck
ClickHouse / ClickHouse CloudOperational simplicity (single binary); higher compression; more flexible SQL surface; richer ecosystemProductionMedium-High — different data model; rewrite of star-tree-shaped queries; index strategy rethinkWhen ad-hoc analytical queries dominate over pre-defined dashboard queries; when ops headcount is the binding constraint; Constant Contact pattern
Apache DruidMore flexible streaming ingest; better tooling polish in some areas (Imply Polaris UI); mature K8s operatorProductionMedium — adjacent architecture but different segment + upsert model; Pinot's upsert maturity is hard to matchWhen streaming ingest from diverse sources matters more than upsert semantics; mostly relevant for greenfield Druid evaluations
Snowflake Hybrid Tables or Databricks LakebaseOLTP + OLAP unified on a warehouse; no separate operational store needed for some user-facing patternsEmergingHigh — different architecture; not a 1:1 replacement; only viable for specific patterns where warehouse-grade latency is enoughWhen user-facing latency requirements relax to 100ms+ and the unified-platform pitch outweighs the dedicated OLAP engine perf
As-of disclaimer. Specifics reflect public documentation, vendor blogs, and production reports through May 2026. ClickHouse 26.1 (Jan 2026) extended async insert deduplication to materialized views and made deduplication default-on for all inserts. ClickHouse 25.6/25.11 introduced projection-as-secondary-index. Apache Pinot 2026: native upserts mature, v2 query engine for distributed joins, JSON/text/range index toolkit. StarTree Cloud adds Pauseless Ingestion and off-heap upsert metadata as managed-tier differentiators. Constant Contact's 2026 case study documents a real-world Pinot-to-ClickHouse migration. Verify against current docs before architectural commitments.