Hudi handles concurrency without relying on traditional database locking 🔒. It uses multi-version concurrency control (MVCC) with immutable file slices 📂. Each Hudi FileGroup holds a tree map of file slices, ordered by commit time. Each file slice is a snapshot of its file group frozen at a particular moment: a base file plus log files ❄️. Writers either append logs to existing slices or build new ones by writing a new base file. Readers access only committed, unchanging file slices, so reads and writes never conflict or lock each other out ⚔️. Table maintenance can also run concurrently without blocking writers: • 🧹 Cleaner reclaims storage by deleting expired file slices • 🗜️ Compaction asynchronously produces a new base file (invisible until committed) while writers land fresh data into log files. End result? Readers, writers, and table services work on the same file group simultaneously, with snapshot isolation, no copy overhead on reads, and zero blocking 🚀. #ApacheHudi #DataEngineering #BigData #MVCC #Concurrency
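As a rough sketch of the idea (Python, with hypothetical class names — a simplification, not Hudi's actual Java implementation), a file group ordered by commit time might look like this. A reader pinned to a snapshot time keeps seeing the same immutable slice even while writers and compaction add newer ones:

```python
import bisect

class FileSlice:
    """One immutable snapshot within a file group: a base file plus log files."""
    def __init__(self, commit_time, base_file):
        self.commit_time = commit_time
        self.base_file = base_file
        self.log_files = []

class FileGroup:
    """Holds file slices keyed and ordered by commit time (Hudi uses a tree map)."""
    def __init__(self):
        self.slices = {}        # commit_time -> FileSlice
        self.commit_times = []  # kept sorted

    def add_slice(self, commit_time, base_file):
        self.slices[commit_time] = FileSlice(commit_time, base_file)
        bisect.insort(self.commit_times, commit_time)

    def append_log(self, as_of_time, log_file):
        # Writers append logs to the latest slice at or before their instant.
        self.latest_slice_as_of(as_of_time).log_files.append(log_file)

    def latest_slice_as_of(self, as_of_time):
        # Readers resolve only the slice committed at or before their snapshot.
        i = bisect.bisect_right(self.commit_times, as_of_time) - 1
        return self.slices[self.commit_times[i]]

fg = FileGroup()
fg.add_slice("001", "base_001.parquet")
fg.append_log("002", "log_002.avro")     # writer appends; readers unaffected
fg.add_slice("003", "base_003.parquet")  # e.g. compaction writes a new base file

# A reader pinned to snapshot "002" still sees the old slice, untouched by "003".
snap = fg.latest_slice_as_of("002")
```

The key property: nothing is ever mutated in place, so a reader's snapshot can never be invalidated by a concurrent writer.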
Apache Hudi
Data Infrastructure and Analytics
San Francisco, CA 14,930 followers
Open source pioneer of the lakehouse, reimagining batch processing with an incremental framework for low-latency analytics
About us
Open source pioneer of the lakehouse, reimagining old-school batch processing with a powerful new incremental framework for low-latency analytics. Hudi brings database and data warehouse capabilities to the data lake, making it possible to create a unified data lakehouse for ETL, analytics, AI/ML, and more. Apache Hudi is battle-tested at scale, powering some of the largest data lakes on the planet. It provides an open foundation that seamlessly connects to other popular open source tools such as Spark, Presto, Trino, Flink, Hive, and many more. Being an open table format is not enough: Apache Hudi is also a comprehensive platform of the open services and tools necessary to operate your data lakehouse in production at scale. Most importantly, Apache Hudi is a community built by a diverse group of engineers from around the globe! Hudi is a friendly and inviting open source community that is growing every day. Join the community on GitHub: https://github.com/apache/hudi or find links to mailing lists and Slack channels on the Hudi website: https://hudi.apache.org/
- Website
- https://hudi.apache.org/
- Industry
- Data Infrastructure and Analytics
- Company size
- 201-500 employees
- Headquarters
- San Francisco, CA
- Type
- Nonprofit
- Founded
- 2016
- Specialties
- ApacheHudi, DataEngineering, ApacheSpark, ApacheFlink, TrinoDB, Presto, DataAnalytics, DataLakehouse, AWS, GCP, Azure, ChangeDataCapture, and StreamProcessing
Locations
Primary
San Francisco, CA, US
Updates
-
Non-blocking concurrency isn't magic—it needs a solid ordering mechanism 🛡️. For Hudi, that means creating globally monotonic instant times across distributed writers ⏱️. Smart twist: No heavy centralized time service needed. Hudi's TrueTime-inspired approach uses: • Distributed lock 🔒 • Local time generation ⚙️ • Bounded wait for clock skew ⏳ This delivers monotonic timestamps without long serialized sections 🔄. It's those under-the-hood details that make concurrency practical, not just theoretical 📊. #ApacheHudi #DataEngineering #BigData
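A minimal sketch of the lock + local clock + bounded wait recipe (Python; the in-process lock stands in for a distributed lock, and MAX_CLOCK_SKEW is an assumed bound, not a Hudi default). Waiting out the skew bound before releasing the lock guarantees every later caller's local clock reads strictly higher:

```python
import threading
import time

MAX_CLOCK_SKEW = 0.005   # assumed upper bound on clock skew between writers (s)

_lock = threading.Lock()  # stands in for the distributed lock

def next_instant_time():
    """Issue a globally monotonic instant time (TrueTime-style sketch).

    Take the lock, read the local clock, then wait out the skew bound
    before releasing, so any subsequent caller's clock has passed ours.
    """
    with _lock:
        t = time.time()
        time.sleep(MAX_CLOCK_SKEW)  # bounded wait covers worst-case skew
        return t

a = next_instant_time()
b = next_instant_time()
```

The serialized section is just one clock read plus a bounded sleep, which is why this scales better than funneling every timestamp through a central time service.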
-
A lot of lakehouse debate still happens at the “format” layer. 📊 That misses the practical question. ❓ When workloads become mutation-heavy, low-latency, and operationally messy, the main differentiators are no longer just metadata spec compatibility. They are things like: • how updates are located 🔍 • how concurrency is handled 🔄 • how compaction is coordinated ⚙️ • whether indexing exists beyond file stats 📈 • whether ingestion and table services are first-class 🛡️ This is why Hudi often feels less like a thin table spec and more like a storage engine plus management layer for mutable data. That distinction matters because many hard production problems live above the format boundary. 🚀 #ApacheHudi #DataLakehouse #DataEngineering
-
In Hudi, the timeline is more than a transaction log. 📜 It is the table’s control plane. 🛡️ Every commit, compaction, clean, rollback, and table service operation is recorded as an instant on the timeline. That is what makes several features possible: • incremental reads from a point in time ⏱️ • rollback of failed writes 🔄 • time travel ⏳ • coordination of background table services 🤝 A lot of table formats focus only on the storage format. Hudi’s design is interesting because it invested early in a richer notion of table management. That is why the table can behave more like a managed system and less like a collection of files you have to fret over managing by hand. If you want to understand Hudi deeply, start with the timeline. Most of the higher-level behavior hangs off of it. #ApacheHudi #DataLakehouse #DataEngineering
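To make "instants on a timeline" concrete, here is a toy sketch (Python; field and action names follow the post's vocabulary, but this is an illustration, not Hudi's on-disk layout). Note how an incremental read falls out naturally, and how an inflight (uncommitted) instant stays invisible:

```python
from dataclasses import dataclass

@dataclass
class Instant:
    time: str    # monotonically increasing instant time
    action: str  # "commit", "compaction", "clean", "rollback", ...
    state: str   # lifecycle: "requested" -> "inflight" -> "completed"

class Timeline:
    """Toy timeline: an ordered log of instants acting as the control plane."""
    def __init__(self):
        self.instants = []

    def add(self, instant):
        self.instants.append(instant)

    def completed_commits_after(self, instant_time):
        # Incremental read: all data committed strictly after a point in time.
        return [i for i in self.instants
                if i.action == "commit"
                and i.state == "completed"
                and i.time > instant_time]

tl = Timeline()
tl.add(Instant("001", "commit", "completed"))
tl.add(Instant("002", "clean", "completed"))
tl.add(Instant("003", "commit", "inflight"))   # failed/in-progress: invisible
tl.add(Instant("004", "commit", "completed"))

incremental = tl.completed_commits_after("001")
```

Rollback and time travel are variations on the same queries: find instants by state, or resolve the table as of a chosen instant time.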
-
A surprising source of write performance disruption is not disk or network ⚠️. It is progress stalls caused by distributed coordination 🔄. Before Hudi 1.1, Flink writers had to wait for the previous instant to commit before receiving the next instant. That preserved order, but it also created throughput fluctuation around checkpoints 📉. The 1.1 change is conceptually small and operationally important: Allow writers to request the next instant asynchronously before the previous one is fully committed, while still preserving commit ordering and consistency 🚀. This is a useful reminder that “streaming performance” often depends on seemingly small control-plane decisions 💡. #ApacheHudi #DataEngineering #StreamingData #BigData
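A sketch of that decoupling (Python; the class and its bookkeeping are hypothetical, illustrating the ordering argument rather than Hudi's Flink integration). Instants are handed out eagerly, but visibility still advances only through the contiguous committed prefix, so ordering is preserved:

```python
class InstantCoordinator:
    """Sketch: grant instant times before prior instants finish committing,
    while exposing commits strictly in instant order."""
    def __init__(self):
        self._next = 0
        self._committed = set()

    def request_instant(self):
        # A writer gets the next instant without waiting on earlier commits.
        self._next += 1
        return self._next

    def commit(self, instant):
        self._committed.add(instant)

    def last_fully_committed(self):
        # Visible frontier = longest contiguous run of committed instants.
        t = 0
        while (t + 1) in self._committed:
            t += 1
        return t

c = InstantCoordinator()
i1 = c.request_instant()
i2 = c.request_instant()  # granted immediately: no stall waiting on i1
c.commit(i2)              # i2 finishes first...
visible_early = c.last_fully_committed()  # ...but is not yet visible
c.commit(i1)
visible_late = c.last_fully_committed()   # now both become visible
```

The writer pipeline never blocks on the grant, yet readers never observe a commit ahead of an uncommitted predecessor.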
-
Most teams dismiss ingestion as mere connector wiring—until the failures hit. ⚠️ In practice, ingestion is where correctness failures show up first: • duplicates ❌ • missed records 🚫 • schema drift 🔄 • offset/checkpoint issues ⚙️ • source-specific edge cases 🛑 Hudi’s DeltaStreamer is a fully-featured ingest tool that already powers hundreds of production data lakehouses in a standardized way. A storage layer is stronger when it comes with built-in paths for getting data in correctly, ensuring data quality from the get-go. It saves teams months of re-implementing the same reliability code, and of repeating the same performance tuning, just to keep production data movement from breaking on the same edge conditions. #ApacheHudi #DataLakehouse #DataEngineering
-
In most table format designs, keys are treated as optional, a matter of semantics rather than mechanics. In Hudi, record keys are operationally central 🔑. They determine: • how updates find prior records 🔄 • how deletes are applied 🗑️ • how deduplication works 🔍 • how records map to file groups 📁 • how concurrency and merging stay bounded ⚖️ That is why Hudi can treat updates as actual updates instead of as a vague rewrite operation over files. Once you define stable record identity, the table can behave more like a database table and less like a batch artifact. A surprising amount of lakehouse performance depends on whether the system understands row identity natively 🚀. #ApacheHudi #DataLakehouse #DataEngineering
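A tiny sketch of key-based routing (Python; the hashing scheme and fixed file-group count are assumptions for illustration, not Hudi's bucketing logic). Because the key-to-file-group mapping is stable, an update lands exactly where the prior version lives, and writing the same key twice is deduplication rather than duplication:

```python
import hashlib

NUM_FILE_GROUPS = 4  # assumed fixed layout for the sketch

def file_group_for(record_key):
    """Stable record identity -> stable file group."""
    h = int(hashlib.md5(record_key.encode()).hexdigest(), 16)
    return h % NUM_FILE_GROUPS

def upsert(table, records):
    """Updates find prior rows by key instead of blindly rewriting files."""
    for rec in records:
        fg = file_group_for(rec["key"])
        table.setdefault(fg, {})[rec["key"]] = rec  # same key overwrites: dedup

table = {}
upsert(table, [{"key": "user-1", "v": 1}, {"key": "user-2", "v": 1}])
upsert(table, [{"key": "user-1", "v": 2}])  # a real update, not a new row

total_rows = sum(len(group) for group in table.values())
```

Without stable identity, the second upsert would have to scan or rewrite arbitrary files to find the old "user-1" row — which is exactly the "vague rewrite operation" the post contrasts against.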
-
Apache Hudi reposted this
Exciting things are happening in Apache Hudi as we build for the next generation of #Data and #AI and define what it means to be a true "AI Native Lakehouse." AI and ML workloads have fundamentally changed what a data lakehouse needs to support. Modern pipelines don't just store tabular data, they contain vector embeddings for similarity search, large binary objects (images, video, etc.) for multimodal data, schema-flexible semi-structured columns for dynamic metadata, and file formats optimized for GPU loading and random access. Today, teams are forced to stitch together a vector database, an object store, and a document store alongside their lakehouse just to run a single AI pipeline. Amongst the original "big three" open table formats — Apache Hudi, Apache Iceberg, and Delta Lake — Apache Hudi has become the first to start landing these core primitives in open source: 🔢 Native VECTOR Type — First-class vector columns capturing dimension and element type semantics (float, double, int8). 📦 Native BLOB Type — Binary large objects as first-class citizens. Inline or out-of-line reference storage with managed lifecycle tracking. 🔍 Vector Search — K-nearest-neighbor search directly in Spark SQL via hudi_vector_search() — no external vector database required. 📄 Lance File Format support — Native read/write support for the Lance columnar format, designed for AI/ML access patterns. 🧩 VARIANT Type — Semi-structured, schema-flexible columns for JSON-like data. The lakehouse is evolving and Apache Hudi is paving the foundation.
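For readers new to vector search, here is what a k-nearest-neighbor query computes, as a brute-force sketch (plain Python over an assumed "embedding" column; purely conceptual, not how Hudi or Spark SQL executes it):

```python
import math

def knn(query, rows, k=2):
    """Brute-force k-nearest-neighbor over an embedding column:
    rank rows by Euclidean distance to the query vector, keep the top k."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(rows, key=lambda r: dist(query, r["embedding"]))[:k]

rows = [
    {"id": "a", "embedding": [0.0, 0.0]},
    {"id": "b", "embedding": [1.0, 0.0]},
    {"id": "c", "embedding": [5.0, 5.0]},
]
nearest = knn([0.1, 0.0], rows, k=2)
```

The point of landing this natively in the lakehouse is that the same query can run where the embeddings already live, instead of copying them out to a separate vector database.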
-
Indexes are great once they exist 💡. The annoying part is creating and dropping them on a large active table 😩. If index creation blocks writes, most teams will postpone it until it becomes painful enough to justify downtime or disruption ⏳. Hudi’s async indexing model is a better answer 🔄. It separates index creation into two phases: • schedule an indexing plan tied to a table commit 📅 • execute the build in the background while writers keep ingesting ⚙️ Hudi treats indexing as a live operational concern, not as a one-time setup step. The deeper point is that a lakehouse index subsystem has to be manageable, not just powerful and versatile. #ApacheHudi #DataLakehouse #DataEngineering #Indexing
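The two phases can be sketched like this (Python; class and method names are hypothetical, and the "catch-up" step is a simplification of how a background build reconciles commits that land while it runs):

```python
class AsyncIndexer:
    """Sketch of two-phase async indexing: schedule cheaply at a commit,
    then build in the background without blocking writers."""
    def __init__(self):
        self.plans = []    # (index_name, as_of_commit) scheduled plans
        self.indexes = {}  # built indexes: name -> commit covered up to

    def schedule(self, index_name, as_of_commit):
        # Phase 1: cheap metadata-only step; writers keep ingesting.
        self.plans.append((index_name, as_of_commit))

    def execute(self, catch_up_commits):
        # Phase 2: background build over data up to the plan's commit,
        # then catch up on commits that landed during the build.
        for name, as_of in self.plans:
            covered = as_of
            for c in catch_up_commits:
                covered = max(covered, c)
            self.indexes[name] = covered
        self.plans.clear()

ix = AsyncIndexer()
ix.schedule("record_index", as_of_commit=10)
# Writers land commits 11 and 12 while the index builds in the background.
ix.execute(catch_up_commits=[11, 12])
```

The scheduling step is what makes the operation safe to run on a live table: it pins a consistent starting point without ever taking writes offline.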
-
Different query patterns need different index strategies 📊. Range filters can't be handled like equality predicates. Record-key point lookups do not behave like partition pruning expressions. Fast mutable writes need to optimize differently than analytical queries ⚙️. That is why Hudi’s indexing subsystem is interesting. It is not a single index. It is a multi-modal index framework sitting in the metadata table 🔍. That includes: • files index 📁 • column stats 📈 • partition stats 🗂️ • record index 🔑 • secondary index 📝 • expression index 🧮 This matters because lakehouse workloads are increasingly mixed by default. A system that only optimizes one access pattern will keep rediscovering bottlenecks somewhere else 🚧. #ApacheHudi #DataLakehouse #DataEngineering
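One way to see why "multi-modal" matters is a planner-style dispatch sketch (Python; the index names come from the post's list, but the routing rules here are an illustrative simplification, not Hudi's query planning):

```python
def pick_index(predicate):
    """Sketch: route a predicate to the index best suited to answer it."""
    kind, column = predicate["kind"], predicate["column"]
    if kind == "eq" and column == "_record_key":
        return "record_index"     # point lookup by record key
    if kind == "eq":
        return "secondary_index"  # equality on a non-key column
    if kind == "range":
        return "column_stats"     # min/max pruning for range filters
    if kind == "partition":
        return "partition_stats"  # partition pruning
    return "files_index"          # fall back to plain file listings

chosen = [
    pick_index({"kind": "eq", "column": "_record_key"}),
    pick_index({"kind": "range", "column": "ts"}),
    pick_index({"kind": "eq", "column": "email"}),
]
```

A single-index system forces all three predicates through one structure; the framework approach lets each access pattern hit the structure built for it.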