Onehouse reposted this
Also doing a talk next week on "Anatomy of Our Data Agent: How AI Supports Analytics at Preset" at OpenXData by Onehouse, Wednesday, April 29 at 2:35 PT. Free, virtual. Hope to see you there! https://hubs.li/Q04dd_ql0
Onehouse, the pioneer in open data lakehouse technology, empowers enterprises to deploy and manage a world-class data lakehouse in minutes on Apache Hudi, Apache Iceberg, and Delta Lake. Delivered as a fully managed cloud service in your VPC, Onehouse offers high-performance ingestion pipelines for minute-level freshness and optimizes tables for maximum query performance. Thanks to its truly open data architecture, Onehouse eliminates data format, table format, compute, and catalog lock-in; guarantees interoperability with virtually any warehouse or data processing engine; and ensures exceptional ELT and query performance for all your workloads. Companies worldwide rely on Onehouse to power their analytics, reporting, data science, machine learning, and GenAI use cases from a single, unified source of data. Built on Apache Hudi and Apache XTable (Incubating), Onehouse features advanced capabilities such as indexing, ACID transactions, and time travel, ensuring consistent data across all downstream query engines and tools. The platform’s unique incremental processing capabilities deliver unmatched ELT cost and performance by minimizing data movement and optimizing resource usage. With 24/7 reliability, immediate cost savings, and open access for all major tools and query engines, Onehouse's #nolockin philosophy future-proofs any stack.
150 Mathilda Place
Suite 106
Sunnyvale, California 94086, US
🔥 Don’t miss the OpenXData opening keynote. Few topics feel more important right now for data and AI teams.
Everyone thinks data platforms are ready for AI agents. But they're still built for human workflows, and the mismatch is forcing a total rethink right now. OpenXData, a free virtual event next Wednesday, captures this shift perfectly.

For the last decade, we built data platforms for humans: dashboards, reports, notebooks, batch pipelines. Now the workload is changing. AI agents don’t analyze data. They operate on it. Continuously. At machine speed. That shift forces every layer of the stack to evolve.

What excites me about this conference: the talks lay out the story plainly.
📄 New file formats like Lance, Vortex, and even Parquet push beyond structured data
🗄️ Table formats like Hudi evolve for unstructured data and vector search
💬 The context engineering talk is just data engineering retooled for agents
🤖 AI embeds into core systems like Spark, reshaping pipeline builds and ops
🛠️ Tools turn messy inputs like PDFs into clean context

These aren't standalone. They converge on one trend:
👉 Data platforms become context infrastructure for AI systems

In my keynote, I'll connect these dots: what's missing, what's emerging, and how the lakehouse adapts to this world. If you work in data, infra, or AI systems, pay attention to this moment. Excited to see everyone there.

#OpenXData #DataInfrastructure #AI #Lakehouse #DataEngineering
⚡ What does it take to make an open lakehouse work across formats? Join us next Wednesday at OpenXData for a closer look at that question.
I am thrilled to be taking the stage with Yufei Gu from Snowflake at OpenXData 2026 to discuss a critical evolution in modern lakehouses: Polaris Meets Apache Hudi.

The vision of a unified lakehouse often hits a roadblock in reality: most production environments aren't "single format." While Apache Polaris™ provides a powerful open metadata and governance layer, many of the most demanding streaming and ingestion-heavy workloads continue to rely on the unique strengths of Apache Hudi.

In our session, we’ll dive into how Polaris and Hudi are converging to provide a unified catalog, centralized metadata management, and consistent governance, without forcing teams to compromise on their choice of table format.

We will cover:
- Architecture: How Polaris integrates with Hudi to create a unified metadata layer.
- Governance: Implementing format-aware access control and discovery.
- Lessons from the Field: Real-world challenges and what true multi-format interoperability looks like in practice.

Whether you're managing complex data pipelines or architecting the next generation of your data platform, this session will provide a blueprint for a more open, flexible lakehouse.

📅 When: April 29, 2026 | 12:50 PM – 1:15 PM PDT
📍 Event: OpenXData 2026

Registration link in the comment below. Hope to see you there!

#ApacheHudi #ApachePolaris #Snowflake #Onehouse #DataEngineering #Lakehouse #OpenSource #OpenXData
🔥 A must-watch on what AI-native lakehouse architecture really looks like.
Excited to be speaking at Onehouse's OpenXData conference on April 29th alongside Timothy Brown from the frontier model research lab General Intuition. Our topic will be "Apache Hudi for the Next Generation of #AI: Unstructured Data and Vector Search on Open Lakehouse Storage."

We will be covering how Apache Hudi is evolving to become an AI-native #lakehouse:
→ Support for a #VECTOR data type that captures the dimension and element type produced by embedding models
→ Support for a #BLOB type with inline/out-of-line storage and managed lifecycle tracking for images, videos, etc.
→ Support for a #VARIANT type for semi-structured data with type shredding
→ Support for the #Lance file format, which offers fast random access for embeddings and blob data compared to traditional columnar formats
→ Support for #VECTOR_SEARCH as a Spark SQL function: k-NN with cosine, L2, and dot product distance metrics directly on your lakehouse, no external vector DB required

If you're building modern data infrastructure and are interested in what an AI-native table format looks like, come check it out! Link in comments!
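To make the last bullet concrete, here is a minimal sketch of what a k-NN search over embeddings computes, using plain NumPy and cosine similarity (one of the metrics named above). The function name and toy data are illustrative only; they are not the actual `VECTOR_SEARCH` Spark SQL API from the talk.

```python
import numpy as np

def knn_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows of `vectors` most similar to `query`
    by cosine similarity (dot product of unit-normalized vectors)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # cosine similarity per stored vector
    return np.argsort(-sims)[:k]   # highest similarity first

# Toy embeddings: 4 vectors in 3 dimensions
emb = np.array([[1.0, 0.0, 0.0],
                [0.9, 0.1, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(knn_cosine(np.array([1.0, 0.0, 0.0]), emb, k=2))  # → [0 1]
```

Swapping the similarity line for a Euclidean (L2) distance or a raw dot product gives the other two metrics the talk mentions; the point of a lakehouse-native function is running this directly over table storage instead of exporting embeddings to a separate vector database.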
What do compaction, cleaning, and clustering look like when you operate at Uber scale?

At OpenXData, Uber engineers Vamshi Pasunuru and Xinli Shang will share how their team built scalable table services to balance ingestion latency with query performance, and how they decouple background maintenance to keep data fresh and analytics fast. In a recent blog post, Uber noted that its Apache Hudi deployment supports 19,500 datasets, 10 PB of daily ingestion, and 70,000 table service operations per day.

This talk should be especially relevant for teams running large lakehouse deployments where table maintenance directly impacts reliability and performance.

Catch it at OpenXData on April 29 👉 https://www.openxdata.ai/

#OpenXData #ApacheHudi #DataEngineering #Lakehouse #DataPlatform #OpenSource
A great example of the potential impact of Quanton acceleration on Spark jobs.
A fintech company recently ran its own benchmark of Onehouse Quanton against Databricks Photon. The results...? 3.5x lower cost and a job that finished 17 minutes faster. Unlike TPC-DS, this was not a synthetic workload: they brought their toughest job to the table, and no override tuning was allowed by either vendor. Money matters in fintech, and this organization learned how to save bank 💰
Meta. Lyft. Amazon. Tosh Rayadhurgam has worked on AI and ML platforms at serious scale. What happens when non-deterministic agents enter such systems? That is what he will unpack at #OpenXData. Most data architectures still assume queries are deterministic. Same query in, same answer out. Agents break that assumption. 🤖 Tosh's session gets into what changes for governance, trust, and the architectural decisions teams should make before an agent layer reaches production. If agents are moving from demo to production in your stack, register here: 👉 https://www.openxdata.ai/
Onehouse reposted this
Optimizing a vectorized query engine is 90% analysis, 10% code changes.

You run a benchmark. Parse DAG nodes, check cache hit rates, compare cold vs. warm runs, capture flame graphs, monitor native memory. Then do it all over again. Dozens of times. The analysis isn't hard. It's just repetitive.

So we automated it. At Onehouse, we built agents that handle the entire query execution analysis loop: DAG operator metrics, CPU flame graph analysis, native memory diagnostics down to allocator fragmentation. What used to take hours of manual analysis now takes seconds.

It's not about saving time. It's about how many ideas you can test per day. When the analysis loop is fast, the bottleneck shifts from "can I analyze this fast enough" to "what should I optimize next," which is where real engineering happens. AI doesn't optimize the engine for you. It removes the friction between having an idea and testing it.

If you're looking for 4x faster Apache Spark and an engine that keeps getting faster every week, try the Quanton Operator in your cluster with a one-line change. https://lnkd.in/gA37rQ_m
Onehouse reposted this
I am speaking at OpenXData (by Onehouse), the Open Data Architecture event of the year!

OpenXData is honestly the event to be at if your focus is technical deep dives. The speaker lineup is pretty amazing and covers an array of topics such as:
- AI-native data platforms
- Data engineering for AI
- Cost/performance optimization at scale
- Interoperability in lakehouses, and more

I am doing a talk on "What is Really Open in an Open Lakehouse (ft. Apache Iceberg, Apache Hudi & Delta Lake)". Terms like "Open Lakehouse" and "Interoperable Lakehouse" are used everywhere nowadays, but what do they really mean for your data architecture? Join me as I do a technical breakdown of it all.

Also looking forward to these amazing talks by other folks on the list:
✅ What’s new in Spark 4.2 / 4.3 and how to optimize your UDFs in Spark 4+ by Holden Karau
✅ Column Storage for the AI Era by Julien Le Dem
✅ Driving Iceberg Adoption with Open Catalog and Open Datasets by Kevin Liu
✅ Polaris Meets Hudi: Unifying Lakehouse Metadata Across Table Formats by Yufei Gu
✅ Apache Hudi™ for the Next Generation of AI: Unstructured Data and Vector Search on Open Lakehouse Storage by Rahil Chertara & Timothy Brown

Link in comments!

#dataengineering #softwareengineering
Spark on Kubernetes has always come with a frustrating tradeoff: you want the control, security, and flexibility of running on your own infrastructure, but you do not want to give up the kind of performance you get from a more managed stack. What if you could get both?

🚨 Join us for: 4x Faster Spark on Your Own Infrastructure: Bringing Quanton to Self-Managed Kubernetes.
📅 Apr 14, 2026
⏰ 9am PT

We’ll show how teams can run Spark workloads 3 to 4x faster on self-managed Kubernetes, with zero code changes. In the session, we’ll walk through:
1. What the Quanton K8s Operator actually does
2. How to install and configure it
3. How to use the Spark Cost Analyzer to estimate savings from your existing Spark history logs

So if you are running Spark on K8s and tired of choosing between performance and infrastructure control, this one is for you. Performance you need. Infrastructure you control. Zero code changes.

Register here 👇

#ApacheSpark #Kubernetes #DataEngineering #Spark #DataInfrastructure