RisingWave Solves Apache Iceberg Scaling Trade-Offs

Apache Iceberg has become the default lakehouse table format, but scaling updates still forces a painful trade-off. Streaming and CDC pipelines need fast writes. BI dashboards need fast, predictable reads. So most stacks make you pick a side.

RisingWave solves this with configurable per-table write modes for Apache Iceberg, so your lakehouse adapts to your workload, not the other way around. Here's what you need to know 👇

Why does Iceberg need write modes?

Iceberg never edits files in place. Every update or delete must either rewrite existing files or track changes separately. That's the CoW vs. MoR trade-off.

Copy-on-Write (CoW): rewrites the full data file on every change.
✅ Fast, clean reads, no merging needed
✅ Ideal for dashboards and interactive analytics
⚠️ Higher write latency

Merge-on-Read (MoR, the default in RisingWave): keeps base files untouched and appends small delta/delete files.
✅ Fast writes, great for CDC and streaming
✅ Lower storage cost
⚠️ Reads must merge base + deltas until compaction runs

Where does it apply in RisingWave?

Both modes work for:
→ Iceberg sinks: RisingWave writing to externally managed Iceberg tables
→ Internal Iceberg tables: tables created and managed inside RisingWave

Each is configured per table in SQL with a single property: write_mode.

Don't forget compaction!

MoR keeps writes fast by deferring cleanup. Compaction periodically merges deltas back into base files to keep reads efficient.
→ Iceberg sinks: enable compaction explicitly
→ Internal tables: compaction is on by default

When to use which?

Use MoR when ingest speed is the priority: CDC, streaming, frequent updates.
Use CoW when read latency must be predictable: dashboards, batch refreshes, ad-hoc analytics.
Most teams run both: MoR for raw streams, CoW for curated analytics tables.

The bottom line: with CoW + MoR in RisingWave, you get a streaming compute engine and Iceberg writer in one, with full compatibility across Spark, Trino, and DuckDB.
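To make the per-table configuration concrete, here is a minimal sketch of what the "one property per table" idea could look like. Only the write_mode property comes from the post; the sink names, connector options, and value spellings below are illustrative assumptions, not verbatim RisingWave syntax — check the RisingWave Iceberg docs for your version.

```sql
-- Hypothetical sketch: everything except write_mode is a placeholder.
-- Curated analytics table: prioritize predictable reads for dashboards.
CREATE SINK curated_orders_sink FROM curated_orders_mv
WITH (
    connector = 'iceberg',
    database.name = 'analytics',      -- placeholder catalog/table options
    table.name = 'curated_orders',
    write_mode = 'copy-on-write'      -- CoW: rewrite files, fast clean reads
);

-- Raw CDC stream: prioritize ingest speed, defer cleanup to compaction.
CREATE SINK raw_events_sink FROM raw_events_mv
WITH (
    connector = 'iceberg',
    database.name = 'raw',
    table.name = 'events',
    write_mode = 'merge-on-read'      -- MoR: append deltas, compact later
);
```

This mirrors the "most teams run both" pattern above: MoR on the raw stream, CoW on the curated table, chosen independently per table.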
Your lakehouse fits your workload, not the other way around. Building a streaming lakehouse with RisingWave? Join the community: go.risingwave.com/slack #ApacheIceberg #Lakehouse #StreamProcessing #DataEngineering #RisingWave

The biggest shift with Iceberg is exactly this: turning storage into a reliable, transactional layer, not just a collection of files. What's interesting is that once you get table semantics right, everything else (multi-engine access, streaming + batch, AI workloads) becomes much easier to build on top. In practice, we're seeing teams move toward Iceberg as the foundation layer and then plug in different compute engines as needed. That's a direction we're following closely at IOMETE as well: open table formats enabling a more flexible, future-proof data stack.

Isn't copy-on-write the default for Iceberg v2?
