RisingWave Solves Apache Iceberg Scaling Trade-Offs

Apache Iceberg has become the default lakehouse table format, but scaling updates still forces a painful trade-off. Streaming and CDC pipelines need fast writes. BI dashboards need fast, predictable reads. So most stacks make you pick a side.

RisingWave solves this with configurable per-table write modes for Apache Iceberg, so your lakehouse adapts to your workload, not the other way around. Here's what you need to know 👇

Why does Iceberg need write modes?

Iceberg never edits files in place. Every update or delete must either rewrite existing files or track changes separately. That's the CoW vs. MoR trade-off.

Copy-on-Write (CoW): rewrites the full data file on every change.
✅ Fast, clean reads, no merging needed
✅ Ideal for dashboards and interactive analytics
⚠️ Higher write latency

Merge-on-Read (MoR, the default in RisingWave): keeps base files untouched and appends small delta/delete files.
✅ Fast writes, great for CDC and streaming
✅ Lower storage cost
⚠️ Reads must merge base + deltas until compaction runs

Where does it apply in RisingWave?

Both modes work for:
→ Iceberg sinks: RisingWave writing to externally managed Iceberg tables
→ Internal Iceberg tables: tables created and managed inside RisingWave

Each is configured per table in SQL with a single property: write_mode.

Don't forget compaction!

MoR keeps writes fast by deferring cleanup. Compaction periodically merges deltas back into base files to keep reads efficient.
→ Iceberg sinks: enable compaction explicitly
→ Internal tables: compaction is on by default

When to use which?

Use MoR when ingest speed is the priority: CDC, streaming, frequent updates.
Use CoW when read latency must be predictable: dashboards, batch refreshes, ad-hoc analytics.
Most teams run both: MoR for raw streams, CoW for curated analytics tables.

The bottom line: with CoW + MoR in RisingWave, you get a streaming compute engine and Iceberg writer in one, with full compatibility across Spark, Trino, and DuckDB.
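To make the per-table configuration concrete, here is a minimal sketch of what the "one property per table" idea could look like. Only the write_mode property comes from the post; the sink names, connector options, and value spellings below are illustrative assumptions, not verbatim RisingWave syntax — check the RisingWave Iceberg docs for your version.

```sql
-- Hypothetical sketch: everything except write_mode is a placeholder.
-- Curated analytics table: prioritize predictable reads for dashboards.
CREATE SINK curated_orders_sink FROM curated_orders_mv
WITH (
    connector = 'iceberg',
    database.name = 'analytics',      -- placeholder catalog/table options
    table.name = 'curated_orders',
    write_mode = 'copy-on-write'      -- CoW: rewrite files, fast clean reads
);

-- Raw CDC stream: prioritize ingest speed, defer cleanup to compaction.
CREATE SINK raw_events_sink FROM raw_events_mv
WITH (
    connector = 'iceberg',
    database.name = 'raw',
    table.name = 'events',
    write_mode = 'merge-on-read'      -- MoR: append deltas, compact later
);
```

This mirrors the "most teams run both" pattern above: MoR on the raw stream, CoW on the curated table, chosen independently per table.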
Your lakehouse fits your workload, not the other way around. Building a streaming lakehouse with RisingWave? Join the community: go.risingwave.com/slack #ApacheIceberg #Lakehouse #StreamProcessing #DataEngineering #RisingWave

The biggest shift with Iceberg is exactly this: turning storage into a reliable, transactional layer, not just a collection of files. What's interesting is that once you get table semantics right, everything else (multi-engine access, streaming + batch, AI workloads) becomes much easier to build on top. In practice, we're seeing teams move toward Iceberg as the foundation layer and then plug in different compute engines as needed. That's a direction we're following closely at IOMETE as well: open table formats enabling a more flexible, future-proof data stack.

Isn't copy-on-write the default for Iceberg v2?
