<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by WarpStream Labs on Medium]]></title>
        <description><![CDATA[Stories by WarpStream Labs on Medium]]></description>
        <link>https://medium.com/@warpstream?source=rss-1868315cb26f------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*_CLYrNS3mJPJWdV5LGQ_fg.png</url>
            <title>Stories by WarpStream Labs on Medium</title>
            <link>https://medium.com/@warpstream?source=rss-1868315cb26f------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 05 Apr 2026 21:57:51 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@warpstream/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Kafka Isn’t a Database, But We Gave It a Query Engine Anyway]]></title>
            <link>https://medium.com/@warpstream/kafka-isnt-a-database-but-we-gave-it-a-query-engine-anyway-311339e784fb?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/311339e784fb</guid>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[kafka-events]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[kafka-topic]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Fri, 20 Mar 2026 13:59:53 GMT</pubDate>
            <atom:updated>2026-03-20T13:59:53.266Z</atom:updated>
            <content:encoded><![CDATA[<p>By Simon Chaffetz and Sami Tabet, Software Engineers, WarpStream</p><h3>Making Life Even Easier for Our Customers</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fplayer.vimeo.com%2Fvideo%2F1174986429%3Fapp_id%3D122963&amp;dntp=1&amp;display_name=Vimeo&amp;url=https%3A%2F%2Fvimeo.com%2F1174986429&amp;image=https%3A%2F%2Fi.vimeocdn.com%2Fvideo%2F2135706292-cf76652c4e8c638f4c8b2471a0fe08665c705171eba56081265d8064bd18947e-d_1280%3Fregion%3Dus&amp;type=text%2Fhtml&amp;schema=vimeo" width="1920" height="1135" frameborder="0" scrolling="no"><a href="https://medium.com/media/0d10ffeb5c8632c4ba4e98455561010b/href">https://medium.com/media/0d10ffeb5c8632c4ba4e98455561010b/href</a></iframe><p>WarpStream’s BYOC architecture gives users the data governance benefits of self-hosting with the convenience of a fully managed service. Events are a new observability tool that take this convenience one big step further.</p><p>When we initially designed WarpStream, our intention was for customers to lean on their existing observability tools to introspect into WarpStream. Scrape <a href="https://docs.warpstream.com/warpstream/agent-setup/monitor-the-warpstream-agents">metrics and logs from your WarpStream Agent</a> and pipe them into your own observability stack. This approach is thorough and keeps your WarpStream telemetry consolidated with the rest. It works well, but over time we noticed two limitations.</p><ol><li><strong>Not all our users have modern observability tooling</strong>. Some may take structured logging with search and analytics for granted, while others don’t have centralized logs at all. For various reasons, some teams have to tail their Agents’ logs in the terminal.</li><li><strong>High-cardinality data is expensive</strong>. Customers with modern observability systems often use third party platforms that charge a fortune for high-cardinality metrics. 
We restrict the Agent’s telemetry data to keep these costs low for our customers, and in turn these restrictions reduce visibility into the Agent.</li></ol><p>We realized that to truly make WarpStream the easiest-to-operate streaming platform on the planet, we needed a way to:</p><ol><li>Emit high-cardinality structured data.</li><li>Query the data efficiently and visualize results easily.</li><li>All while keeping the data inside the customer’s environment.</li></ol><p>Events address all three of these issues by storing high-cardinality event data, i.e. logs, <em>in WarpStream</em>, as Kafka topics, and making these topics queryable via a lightweight query engine.</p><p>Events also solve an immediate problem facing two of our more recent products: <a href="https://www.warpstream.com/managed-data-pipelines-kafka-compatible-stream-processing">Managed Data Pipelines</a> and <a href="https://www.warpstream.com/tableflow">Tableflow</a>, which both empower customers to automate complex data processing all from the WarpStream Console. These products are great, but without a built-in observability feature like Events, customers who want to introspect one of these processes have to fall back to an external tool, and switching away from the WarpStream Console adds friction to their troubleshooting workflows.</p><p>We considered deploying an open-source observability stack alongside each WarpStream cluster, but that would undermine one of WarpStream’s core strengths: no additional infrastructure dependencies. WarpStream is cheap and easy to operate precisely because it’s built on object storage with no extra systems to manage. Adding a sidecar database or log aggregation pipeline would add operational burden and cost.</p><p>So we decided to build it directly into WarpStream. WarpStream already has a robust storage engine, so the only missing piece was a small, native query engine. 
Luckily, many WarpStream engineers helped build <a href="https://www.datadoghq.com/blog/engineering/introducing-husky/">Husky</a> at Datadog, so we know a little something about building query engines!</p><p>This post will have plenty of technical details, but let’s start by diving into the experience of using Events first.</p><h3>An Intuitive Addition to Your Tool Belt</h3><p>Events are a built-in observability layer capturing structured operational data from your WarpStream Agents so you can search and visualize it directly in the Console. No external infrastructure is required: the Events data is simply stored as Kafka topics using the same BYOC architecture as WarpStream’s original streaming product. Here’s a quick example of how you might use it.</p><h3>Concrete Example: Debugging Iceberg Ingestion in Just a Couple Clicks</h3><p>Suppose you’ve just configured a WarpStream Tableflow cluster to replicate an ‘Orders’ Kafka topic into an Iceberg table, with an AWS Glue catalog integration so your analytics team can query the Iceberg table’s data from Athena (AWS’s serverless SQL query engine). A few hours in, you check the WarpStream Console and everything looks healthy. Kafka records are being ingested and the Iceberg table is growing. But when your analysts open Athena, the table isn’t there.</p><p>You navigate to the Tableflow cluster in the Console and scroll down to the Events Explorer at the bottom of the table’s detail view. You search for errors: data.log_level == “error”.</p><p>Alongside the healthy ingestion events, a parallel stream of aws_glue_table_management_job_failed events appears, one every few minutes since ingestion started. You expand one of the events. 
The payload includes the table name, the Glue database, and the error message:</p><pre>&quot;AccessDeniedException: User is not authorized to perform glue:CreateTable on resource&quot;</pre><p>The IAM role attached to your Agents has the right S3 permissions for object storage, which is why ingestion is working, but is missing the Glue permissions needed to register the table in the catalog. You update the IAM policy, and within minutes the errors are replaced by an aws_glue_table_created event. Your analysts refresh Athena and the table appears.</p><p>The data was safe in object storage the entire time and the Iceberg table was healthy; only the catalog registration was failing. Without Events, you would have seen a working pipeline on one side and an empty Athena catalog on the other, with no indication of what was wrong in between. The event payload pointed you directly to the missing permission.</p><h3>Using the Events Explorer</h3><p>Events are consumable through the Kafka protocol like any other topic, but raw kafka-console-consumer output isn’t the most pleasant debugging experience. The Events Explorer in the WarpStream Console provides a purpose-built interface for exploring, filtering, and aggregating your events.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mYIB38aexfQIg-SQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*b38I77Dvy1_qyJNp.png" /></figure><p>The top of the Explorer has four inputs:</p><ol><li><strong>Event type</strong>: which logs to search (Agent logs, ACL logs, Pipeline logs, or all).</li><li><strong>Time range</strong>: quick presets like 15m, 1h, 6h, or custom durations. Absolute date ranges are also supported. 
The search window cannot exceed 24 hours.</li><li><strong>Filter</strong>: conditions on event fields using a straightforward query language.</li><li><strong>Sort order</strong>: newest first, oldest first, or unsorted.</li></ol><p>Results come back as expandable event cards showing the timestamp, type, and message. Expand a card to see the full structured JSON payload. Following the <a href="https://github.com/cloudevents/spec">CloudEvents</a> format, application data lives under data.*.</p><p>The timeline charts event volume over time, making it easy to spot patterns: an error spike after a deploy, periodic ACL denials, or a gradual uptick in pipeline failures. You can group by any field in each event’s payload. Group Agent logs by data.log_level to see how the error-to-warning ratio shifts over time, or group ACL events by data.kafka_principal to see which service accounts generate the most denials.</p><p>You’ll also find Events Explorer widgets embedded in the ACLs and Pipelines tabs. These provide tightly scoped views relevant to the current context. For example, the ACLs widget pre-filters for ACL events, and the Pipelines widget only shows events generated by the current pipeline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nA2EevsflydLZ1tE.png" /></figure><h3>A Light Footprint</h3><h3>Your Storage, Your Data</h3><p>WarpStream Agents run in your cloud account and read/write directly to your object storage. Events fit right into this model. Event data is stored as internal Kafka topics, subject to the same encryption, retention policies, and access controls (ACLs) as any other topic. Importantly, Events queries run in the Agents directly, just like Kafka Produce and Fetch requests, so you don’t have to pay for large volumes of data to be egressed from your VPC.</p><p>Query <em>results</em> do pass through the control plane temporarily so they can be rendered in the WarpStream Console, but they aren’t persisted anywhere. 
In addition, the Events topics themselves are hard-coded to only contain operational metadata such as Agent logs, request types, ACL decisions, and Agent diagnostics. They never contain your topics’ actual messages or raw data.</p><h3>Cost Impact</h3><p>Events contribute to your storage and API costs just like any other topic data persisted in your object storage bucket, but we’ve specifically tuned Events to be cheap. For moderately sized clusters, the expected impact is less than a few dollars per month.</p><p>If cost is a concern, e.g. for very high-throughput clusters, you can selectively disable event types you don’t need, for example, keeping ACL logs and turning off Agent logs. You can also reduce each event type’s retention period below the 6-hour default.</p><h3>But How Does It Work? The Query Engine</h3><p>To put it bluntly, we bolted a query engine onto a distributed log that stores data in a row-oriented format. The storage layer is not columnar, which means we’re never going to win any benchmark competitions. But that’s okay. Our Events product doesn’t need to be the fastest on the market. It just needs to support infinite cardinality and be fast <em>enough</em>, cheap <em>enough</em>, and easy <em>enough</em> to use that it makes life easier for our customers. And that’s what we built.</p><h3>Lifecycle of a Query</h3><p>When you submit a query through the Events Explorer, it gets routed to one of your Agents as a query job. The Agent then:</p><ol><li>Parses the query into an Abstract Syntax Tree (AST).</li><li>Compiles the AST into a logical plan (filter, project, aggregate, sort, limit nodes).</li><li>Physically plans the query by resolving topic metadata: fetching partition counts and start/end offsets, and using that metadata to narrow the offset ranges based on the time filter.</li><li>Splits the offset ranges into tasks. 
Each task covers a contiguous range of offsets for a single partition.</li><li>Schedules tasks for execution in stages, with results flowing between stages via output buffers.</li><li>Executes tasks using our in-house vectorized query engine.</li></ol><p>For now, a single Agent executes the entire query, though the architecture is designed to distribute tasks across multiple Agents in the future.</p><h3>Pruning Timestamps</h3><p>The core challenge is that queries are scoped to a time range, but data in Kafka is organized by offset, not timestamp. While WarpStream supports ListOffsets for coarse time-to-offset translation, the index is approximate (for performance reasons), and small time windows like “the last hour” can still end up scanning much more data than necessary.</p><p>The query engine addresses this with progressive timestamp pruning. As tasks complete, the engine records the actual timestamp ranges observed at each offset range. We call these <em>timestamp watermarks</em>. These watermarks are then used to skip pending tasks whose offset ranges fall entirely outside the query’s time filter.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sfeZmZneVGlJJdjO.png" /></figure><p>The pruning works in both directions:</p><ul><li><strong>Lower offsets</strong>: If a completed task at offset 200 has a minimum timestamp of 1:30 AM, and the query filters for 2:00 AM–4:00 AM, then all tasks with offsets below 200 can be safely skipped: their timestamps can only be earlier.</li><li><strong>Higher offsets</strong>: Similarly, if a completed task shows timestamps already past the query’s end time, all tasks at higher offsets can be skipped.</li></ul><p>To maximize pruning effectiveness, tasks are not scheduled sequentially. 
Instead, the scheduler uses a golden-ratio-based spacing strategy (similar to <a href="https://en.wikipedia.org/wiki/Low-discrepancy_sequence">Kronecker search</a> and <a href="https://en.wikipedia.org/wiki/Golden-section_search">golden section search</a>) to spread early tasks across the offset space, sampling from the middle first and then filling in gaps. This maximizes the chances that the first few completed tasks produce watermarks that eliminate large swaths of remaining work.</p><p>On a typical narrow time-range query, this pruning eliminates the majority of scheduled tasks and allows us to avoid scanning all the stored data.</p><h3>The Direct Kafka Client</h3><p>The query engine reads data using the Kafka protocol, fetching records at specific offset ranges just like a consumer would. But in the normal Kafka path, data flows through a chain of processing: the Agent handling the fetch reads from blob storage (or cache), decompresses the data, wraps it in Kafka fetch responses, compresses it for network transfer, and sends it to the requesting Agent, which then decompresses it again (learn about how we handle distributed file caches in <a href="https://www.warpstream.com/blog/minimizing-s3-api-costs-with-distributed-mmap">this blog post</a>). Even when the query is running on an Agent that is capable of serving the fetch request directly, this involves real network I/O and redundant compression cycles.</p><p>The query engine short-circuits this with a direct in-memory client. It connects to itself using Go’s net.Pipe(), creating an in-memory bidirectional pipe that looks like a network connection to both ends but never hits the network stack. On top of that, the direct client signals via its client ID that compression should be disabled, eliminating the compress-decompress round entirely. 
Additionally, this ensures that the Events feature always works, even when the cluster is secured with advanced authentication schemes like mTLS.</p><p>These two optimizations, in-memory transport and disabled compression, more than doubled the data read throughput of the query engine in our benchmarks. Is it faster than a purpose-built observability solution? Absolutely not, but it’s cheap, easy to use, adds zero additional dependencies, and is integrated natively into the product.</p><h3>Query Fairness and Protecting Ingestion</h3><p>Events are designed as an occasional debugging tool, not a primary observability system. To make sure queries never impact live Kafka workloads, several safeguards are in place:</p><ul><li><strong>Memory limits</strong>: Configurable caps on how much memory a single query can consume.</li><li><strong>Concurrency control</strong>: A semaphore in the control plane limits the maximum number of concurrent queries to 2 per cluster, regardless of the number of Agents. This is intentionally conservative for now and will be relaxed as the system matures.</li><li><strong>Scan limits</strong>: Restrictions on the amount of data scanned from Kafka per query, to minimize pressure on Agents handling fetch requests.</li><li><strong>Query-only Agents</strong>: It’s possible to restrict some Agents to query workloads (<a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/splitting-agent-roles#filtering-job-types">see the documentation here</a>).</li></ul><h3>More Optimizations</h3><p>Beyond pruning and the direct client, the query engine applies several standard techniques:</p><ul><li><strong>Metadata-only evaluation</strong>: For queries that only need record metadata (e.g., counting events by timestamp), the engine skips decoding the record value entirely.</li><li><strong>Early exit</strong>: For list-events and TopN queries, scanning stops as soon as enough results have been collected.</li><li><strong>Adaptive fetch sizing</strong>: List-like queries use smaller fetches (to minimize 
over-reading), while aggregate queries use larger fetches (to maximize throughput).</li><li><strong>Progressive results</strong>: For timeline queries, multiple sub-queries are scheduled to show results progressively for a more interactive UI.</li></ul><h3>Data Streams and Future Plans</h3><p>Events launches with three data streams:</p><ol><li><strong>Agent logs</strong>: structured logs from every Agent in the cluster, regardless of role. Filter by log level, search for specific error messages, or correlate Agent behavior with a timestamp.</li><li><strong>ACL events</strong>: every authorization decision, including denials. Captures the principal, resource, operation, and reason. Useful for rolling out ACL changes, managing multi-tenant clusters, and auditing shadow ACL decisions.</li><li><strong>Pipeline events</strong>: execution logs from WarpStream Managed Data Pipelines. These help you understand why a pipeline is failing and make the Tableflow product much easier to operate, since you can see processing feedback directly in the Console without context-switching to an external logging system.</li></ol><p>We plan to add new data streams over time as we identify more areas where built-in observability can make our customers’ lives easier.</p><h3>Audit Logs</h3><p>The same infrastructure that powers Events also drives WarpStream’s <a href="https://docs.warpstream.com/warpstream/reference/audit-logs">Audit Logs</a> feature. Audit Logs track control plane and data plane actions–Console requests, API calls, Kafka authentication/authorization events, and Schema Registry operations–using the same CloudEvents format. 
They are queryable through the Events Explorer with the same query language and enjoy the same query engine optimizations.</p><p>The only difference is that in the audit logs product, the WarpStream control plane hosts the Kafka topics and query engine because many audit log events are not tied to any specific cluster.</p><h3>Getting Started</h3><p>Events are available now for all WarpStream clusters. To enable them, you’ll need to upgrade your Agents to version <a href="https://docs.warpstream.com/warpstream/overview/change-log#release-v770"><strong>v770</strong></a> or later. Once your Agents are updated:</p><ol><li>Go to your cluster’s <strong>Events</strong> tab in the Console.</li><li>Turn on <strong>Global Events Status</strong>.</li><li>Optionally toggle individual event types on or off.</li><li>Wait a few minutes for your Agents to capture their first events.</li><li>Scroll down to the Events Explorer and run a search.</li></ol><p>The global status and individual event types can be toggled on and off via the API or the Terraform provider.</p><p>Event topics have a default retention of 6 hours. Retention and partition count are configurable per event type. Find these details and more in the <a href="https://docs.warpstream.com/warpstream/reference/events">Events Explorer’s public documentation page</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*J032EroUniljnoYZ.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=311339e784fb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Audit Logs for WarpStream: Full Visibility Into Every Action on Your Clusters]]></title>
            <link>https://medium.com/@warpstream/audit-logs-for-warpstream-full-visibility-into-every-action-on-your-clusters-e7aa3868334f?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/e7aa3868334f</guid>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[kafka-cluster]]></category>
            <category><![CDATA[audit-logging]]></category>
            <category><![CDATA[kafka-security]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 14:44:59 GMT</pubDate>
            <atom:updated>2026-02-13T14:44:59.805Z</atom:updated>
            <content:encoded><![CDATA[<p>By Jason Lauritzen, Sr. Product Marketing Manager, and Caleb Grillo, Head of Product</p><p>When teams run Kafka at scale, “who did what, and when?” becomes a vital question that needs an answer. A topic gets deleted, a consumer group disappears, or a new ACL shows up, and nobody knows who made the change. In regulated industries, that gap isn’t just inconvenient; it can be a compliance violation. Even outside of compliance, the absence of a reliable audit trail makes incident response slower and root-cause analysis harder.</p><p>Today we’re releasing <a href="https://docs.warpstream.com/warpstream/reference/audit-logs">Audit Logs</a> for WarpStream, giving you a complete, structured record of every authentication action, authorization decision, and platform operation across your clusters.</p><h3>Why Audit Logs?</h3><p>Most Kafka operators are familiar with the pain. Kafka itself provides limited built-in auditing. You can piece together some information from broker logs, but those logs are unstructured, scattered across brokers, and not designed for long-term retention or compliance. Building a proper audit trail on top of vanilla Kafka typically means stitching together log aggregation pipelines, parsing free-text log lines, and hoping nothing falls through the cracks.</p><p>With WarpStream, because all metadata flows through the hosted control plane, we can expose a centralized view of every operation that occurs across your clusters. The logs are structured, queryable, and consumable using the tools you already know.</p><h3>What Gets Logged?</h3><p>Audit Logs capture two broad categories of events:</p><ol><li><strong>Cluster Audit Logs</strong> track Kafka-level operations on all Pro and Enterprise clusters. 
Every authentication attempt, every authorization decision, and every administrative action, including topic creation and deletion, consumer group deletion, ACL modification, and more, is recorded as a structured event. If a client connects with the wrong credentials and gets rejected, an audit log will be emitted. If someone creates a topic, you will see the full configuration that was applied.</li><li><strong>Platform Audit Logs</strong> track account-level operations in the WarpStream Console. Creating or deleting clusters, managing API keys, modifying user accounts — any mutating operation against the WarpStream API generates an audit event.</li></ol><p>Every event follows the<a href="https://cloudevents.io/"> CloudEvents spec</a> and conforms to the same schema used by<a href="https://docs.confluent.io/cloud/current/monitoring/audit-logging/audit-log-schema.html"> Confluent Cloud Audit Logs</a>, so if you already have tooling built around that format, it works out of the box.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ld2HjX5RzGQA2ycv.png" /></figure><h3>Built on Kafka, Consumed With Kafka</h3><p>WarpStream Audit Logs aren’t just written to a database somewhere. They’re produced into a fully-managed WarpStream cluster running on WarpStream’s own cloud infrastructure. That means you can consume your Audit Logs using the Kafka protocol, with any Kafka client, and export them anywhere: your SIEM, a data lake, an Iceberg table via Tableflow, and more.</p><p>This approach gives you the best of both worlds. 
You get a built-in UI in the WarpStream Console for browsing and searching audit events directly, and you also get the full power and flexibility of the Kafka ecosystem for building whatever downstream integrations your compliance or security team needs.</p><p>To get started consuming Audit Logs programmatically, just navigate to the Audit section of the Console, create SASL credentials under the “Credentials” tab, and use the connection details from the “Connect” tab. It’s the same workflow you’d use to connect any Kafka client to any WarpStream cluster.</p><h3>What Does an Audit Log Event Look Like?</h3><p>Each audit log event is a structured JSON payload. Here’s a simplified example of what you’d see when a client creates a topic:</p><pre>{<br>  &quot;type&quot;: &quot;warpstream.kafka/request&quot;,<br>  &quot;time&quot;: &quot;2025-12-17T09:29:36.120450058Z&quot;,<br>  &quot;data&quot;: {<br>    &quot;methodName&quot;: &quot;kafka.CreateTopics&quot;,<br>    &quot;authenticationInfo&quot;: {<br>      &quot;credentials&quot;: {<br>        &quot;mechanism&quot;: &quot;SASL_SCRAM&quot;,<br>        &quot;idSecretCredentials&quot;: {<br>          &quot;credentialId&quot;: &quot;ccun_xxxx&quot;<br>        }<br>      }<br>    },<br>    &quot;request&quot;: {<br>      &quot;accessType&quot;: &quot;MODIFICATION&quot;,<br>      &quot;data&quot;: {<br>        &quot;name&quot;: &quot;my-new-topic&quot;,<br>        &quot;numPartitions&quot;: 4,<br>        &quot;replicationFactor&quot;: 1,<br>        &quot;configs&quot;: [<br>          {&quot;name&quot;: &quot;retention.ms&quot;, &quot;value&quot;: &quot;86400000&quot;},<br>          {&quot;name&quot;: &quot;cleanup.policy&quot;, &quot;value&quot;: &quot;compact,delete&quot;}<br>        ]<br>      }<br>    },<br>    &quot;result&quot;: {<br>      &quot;status&quot;: &quot;SUCCESS&quot;,<br>      &quot;data&quot;: {&quot;errorCode&quot;: 0}<br>    }<br>  }<br>}</pre><p>The event tells you exactly what happened (a topic was 
created), who did it (which credential was used), when it happened, and what the outcome was. Failed operations are logged, too. If a DeleteGroups call fails because the group doesn’t exist, you’ll see the error code and message in the result, which is invaluable for debugging client issues or spotting misconfigured applications.</p><p>Platform-level events follow the same structure.</p><p>Note that the Audit Logs events still only contain <em>metadata.</em> No data is egressed from your WarpStream clusters, and no access is delegated to WarpStream. The hosted cluster is simply a convenient interface that we expose so that you don’t have to deploy and manage another BYOC cluster in order to store and retrieve your Audit Logs.</p><h3>Pricing</h3><p>Because the Audit Logs cluster is hosted on WarpStream’s infrastructure rather than in your cloud account, billing works slightly differently than for BYOC clusters. It’s metered across four dimensions: uncompressed data written, storage (GiB-minutes), network ingress, and network egress.</p><p>For the vast majority of use cases, Audit Logs are inexpensive. Most clusters produce well under 500 GiB/month of audit log data, which translates to less than $10/month for writes. Storage and networking charges are similarly modest. You can find the full pricing breakdown in<a href="https://docs.warpstream.com/warpstream/reference/audit-logs#billing-for-audit-logs"> our docs</a>.</p><h3>Getting Started</h3><p>Audit Logs are available today for all WarpStream accounts on <a href="https://www.warpstream.com/pricing">Pro or Enterprise clusters</a>. To enable them:</p><ol><li>Make sure your Agents are running version v736 or above.</li><li>Navigate to the “Audit” section in the WarpStream Console.</li><li>Enable Audit Logs with a single click.</li></ol><p>Once enabled, logs start flowing immediately. 
There is no additional configuration to apply, and no sidecar processes or log aggregation pipelines to build.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e7aa3868334f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Art of Being Lazy(log): Lower latency and Higher Availability With Delayed Sequencing]]></title>
            <link>https://medium.com/@warpstream/the-art-of-being-lazy-log-lower-latency-and-higher-availability-with-delayed-sequencing-870c253af852?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/870c253af852</guid>
            <category><![CDATA[data-streaming]]></category>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[s3-express-one-zone]]></category>
            <category><![CDATA[object-storage]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Wed, 04 Feb 2026 19:20:36 GMT</pubDate>
            <atom:updated>2026-02-04T19:20:36.848Z</atom:updated>
            <content:encoded><![CDATA[<p>By Manu Cupcic, Sr. Manager, Engineering, and Yusuf Birader, Sr. Software Engineer, WarpStream</p><p>WarpStream is an Apache Kafka® protocol compatible data streaming platform built directly on top of object storage. As we discuss in <a href="https://www.warpstream.com/blog/kafka-is-dead-long-live-kafka">the original blog post</a>, WarpStream uses cloud object storage as its sole data layer, delivering excellent throughput and durability while being extremely cost efficient.</p><p>The tradeoff? Latency. The minimum latency for a PUT operation in traditional object stores is on the order of a few hundred milliseconds, whereas a modern SSD can complete an I/O in less than a millisecond. As a result, WarpStream typically achieves a p99 produce latency of 400ms in its default configuration.</p><p>When <a href="https://aws.amazon.com/s3/storage-classes/express-one-zone/">S3 Express One Zone (S3EOZ)</a> launched (offering significantly lower latencies at a different price point), we immediately added support and <a href="https://www.warpstream.com/blog/warpstream-s3-express-one-zone-benchmark-and-total-cost-of-ownership">tested it</a>. We found that with S3EOZ we could lower WarpStream’s median produce latency to 105ms, and the p99 to 170ms.</p><p>Today we are introducing <a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/low-latency-clusters/lightning-topics">Lightning Topics</a>. Combined with S3EOZ, WarpStream Lightning Topics running in our lowest-latency configuration achieved a median produce latency of <strong>33ms</strong> and p99 of <strong>50ms</strong> — a 70% reduction compared to the previous S3EOZ results. 
Best of all, Lightning Topics deliver this reduced latency with <strong>zero increase in costs.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VcDXpYa9L_GRLzy4.png" /></figure><p>We are also introducing a new <a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/ripcord">Ripcord Mode</a> that allows the WarpStream Agents to continue processing Produce requests <strong>even when the Control Plane is unavailable</strong>.</p><p>Lightning Topics and Ripcord Mode may seem like two unrelated features at first glance, but when we dive into the technical details, you will see that they are two sides of the same coin.</p><h3>How WarpStream Handles Produce Requests</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*uXx1jUr-dsyDRGgG.png" /></figure><p>To understand how Lightning Topics work, we first need to understand the full lifecycle of a Produce request in WarpStream. When a WarpStream Agent receives a Produce request from a Kafka client, it executes the following four steps sequentially:</p><ol><li><strong>Buffer records</strong>. The Agent buffers incoming records in memory for <a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/low-latency-clusters#agent-batch-timeout">a configurable batch timeout</a> (250ms by default) or until reaching a certain size (4MB by default). This batching is essential because each object storage PUT operation incurs a cost.</li><li><strong>Write file to object storage.</strong> When the buffer is ready, the Agent uploads a single file to object storage.</li><li><strong>Commit file metadata in the Control Plane</strong>. After the upload succeeds, the Agent notifies the Control Plane to “commit” the file. Critically, the file does not “exist” in WarpStream until this step is completed even though the data has already been made durable in object storage. 
The reason for this is that the WarpStream Agents are stateless and leaderless, which means that any Agent can process Produce requests for any partition. As a result, before records can “enter” the system, they must first be sequenced by a centralized authority to ensure that every record has a deterministic offset.</li><li><strong>Acknowledge</strong>. With the file now stored in S3 and sequenced by the Control Plane, the Agent responds to the Kafka client with a successful acknowledgement and the assigned offset for every record. This step is nearly instantaneous — the Agent simply executes local code and writes to the TCP connection.</li></ol><p><a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/low-latency-clusters">WarpStream’s latency is tunable</a>. Specifically, the batch timeout in Step 1 can be configured as low as 25ms (10x lower than the default of 250ms), and low-latency storage backends like S3EOZ provide much lower latency than traditional object storage. In a typical “low-latency” configuration (50ms batch timeout and using S3EOZ), the produce latency is roughly distributed as follows:</p><ol><li><strong>~ 30ms to buffer records</strong> (some records arrive toward the end of the 50ms buffer window).</li><li><strong>~ 20ms to write records</strong> to a quorum of S3EOZ buckets.</li><li><strong>~ 50ms to commit to the Control Plane.</strong> The Control Plane also batches the requests for a short amount of time.</li></ol><p>If we want to reduce the latency of Produce requests below ~100ms, one of these three variables has to give. 
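</p><p>Plugging in the numbers above, the latency budget works out as a simple sum (a back-of-the-envelope sketch; the constants come straight from the breakdown above and are illustrative):</p>

```python
# Back-of-the-envelope produce-latency budget for a typical "low-latency"
# WarpStream configuration (numbers from the breakdown above; illustrative).
buffer_ms = 30    # waiting out part of the 50ms batch timeout
put_ms = 20       # writing the file to a quorum of S3EOZ buckets
commit_ms = 50    # committing the file metadata to the Control Plane

total_ms = buffer_ms + put_ms + commit_ms
commit_share = commit_ms / total_ms

print(total_ms)        # ~100ms end to end
print(commit_share)    # the Control Plane commit is half the budget
```

<p>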
In the normal “high latency” configuration mode of WarpStream, the Control Plane commit latency is a small fraction of the overall latency, but when WarpStream is configured with a low-latency storage backend (like S3EOZ) and a low batch timeout, the Control Plane latency dominates!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*74J8NSG_m5QwHvVN.png" /></figure><p>There are steps we could take to reduce the Control Plane latency, but zero latency is better than low latency, so we took a step back to see if we could do something more radical: remove the need to commit to the Control Plane in the critical path entirely. If we could accomplish that then we’d be able to cut the latency of Produce requests in the “low-latency” configuration <strong>by half</strong>!</p><p>Of course this begs the question: isn’t the Control Plane commit important? It wouldn’t be in the critical path otherwise. Earlier, we explained how the Control Plane is in the critical path of Produce requests so it can sequence the records and assign offsets that can be returned back to the client.</p><p>But what if we just didn’t return offsets back to the client? We’d still have to commit the file metadata to the Control Plane at some point, otherwise the records would never be assigned offsets, but that could happen asynchronously without blocking the Producer client.</p><p>It turns out this isn’t a novel idea!</p><h3>LazyLog: Decoupling Durability and Sequencing</h3><p>Lightning Topics are loosely based on an idea developed in the <a href="https://dassl-uiuc.github.io/pdfs/papers/lazylog.pdf">LazyLog paper</a>. 
The key idea in the LazyLog paper is that every log needs: (1) durability and (2) ordering, but that while durability needs to be ensured upfront, sequencing rarely does.</p><p>More specifically, sequencing <strong>must</strong> complete before records are consumed from the log, but it is not necessary to wait for sequencing to complete before returning a successful acknowledgement to clients producing to the log <strong>unless those clients need to know the exact offset of the records they just produced.</strong> It turns out, producer clients almost never need that information.</p><p>And this is exactly what WarpStream Lightning Topics do: the call that commits the new files to the Control Plane is executed lazily (asynchronously) so that the records are eventually sequenced and can be consumed, but it’s not a prerequisite for successfully acknowledging a Produce request.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RUNFZMSMz4xREqBA.png" /></figure><p>Let’s look at the lifecycle of a Produce request in WarpStream again, but this time for a Lightning Topic. Unlike Classic (non-Lightning) Topic produce requests, which have four steps, Lightning Topic produce requests only have three steps before returning a response to the client:</p><ol><li><strong>Buffer records. </strong>The Agents buffer incoming records, just like they do for “Classic” Topics.</li><li><strong>Write file to object storage. </strong>When the buffer is ready, the Agent uploads a single file to object storage, but in a different location than it normally would. The file is written to what we call a “sequence”. A sequence is just a folder that contains about 1,000 consecutive files written by a single Agent. When the sequence has too many files or has been opened for too long, the Agent starts a new sequence.</li><li><strong>Acknowledge</strong>. As soon as the file has been safely written, the Agent responds to the Kafka client. 
Since it has not received the Control Plane response yet, it cannot include the exact offsets of every record that was inserted into the log. In practice this means:<ul><li>The Agent returns 0 as the offset of all inserted records in the produce response.</li><li>The Agent rejects writes from clients with Kafka’s “idempotent producer” feature enabled (the Control Plane is necessary for processing them correctly).</li><li>The Agent rejects writes from transactional clients (for the same reason).</li></ul></li><li><strong>Commit asynchronously. </strong>Finally, the Agent tries to commit the new file to the Control Plane. The Control Plane accepts the file and commits the metadata. The new records are now visible to consumers (remember, we mentioned earlier in the LazyLog section that sequencing must complete before LazyLog records can be read by consumers).</li></ol><p><strong>This shorter process cuts the latency of writing data into a “low-latency” WarpStream cluster in half</strong>. We ran a small Open Messaging Benchmark workload where we lowered the batch timeout (the amount of time the Agents will wait to accumulate records before flushing a file to object storage) from 50ms to 25ms, and we were able to achieve a <strong>median produce latency of just 33ms and a P99 Produce latency of &lt;50ms.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MMDSidBmVTJI-QPe.png" /></figure><p>While Lightning Topics decrease produce latency, they have no noticeable effect on end-to-end latency. If we refer back to the LazyLog paper, consumers cannot progress until records have been made durable <strong>and they have been sequenced</strong>. So while the sequencing is not in the critical path for producers anymore, it is still in the critical path for consumers. For this reason, in Lightning Topics, the sequencing starts immediately after the records have been journaled to S3. 
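</p><p>The fast path described above (acknowledge once durable, commit lazily) can be pictured as a tiny acknowledge-then-commit pipeline. This is a single-process toy, not WarpStream code; the names journal, committed, and produce are illustrative:</p>

```python
import queue
import threading

journal = []       # stands in for files made durable in object storage
committed = []     # stands in for file metadata sequenced by the Control Plane
commit_queue = queue.Queue()

def produce(record):
    journal.append(record)        # steps 1+2 (buffer + durable write), collapsed
    commit_queue.put(record)      # step 4: the commit happens later, off the hot path
    return {"error": None, "offset": 0}  # step 3: ack now; offset unknown, so 0

def commit_worker():
    while True:
        record = commit_queue.get()
        if record is None:        # shutdown sentinel
            break
        committed.append(record)  # sequencing: records become consumable here

worker = threading.Thread(target=commit_worker)
worker.start()
ack = produce("r1")               # the client gets its ack before the commit lands
commit_queue.put(None)
worker.join()
```

<p>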
This way, the consumers see the data as quickly as they would have if the topic had been a Classic Topic.</p><h3>It’s Not That Simple, Unfortunately</h3><p>The previous section was a bit of an oversimplification. It explains the “happy path” perfectly, but WarpStream is a large-scale distributed system. Therefore, we have to take into account how the system will behave when things don’t go according to plan. The most obvious failure mode for Lightning Topics occurs when an Agent acknowledges a write to the client successfully (because the records were durably written to object storage), but then fails to commit the file to the Control Plane asynchronously. This can happen in several different ways:</p><ul><li>The Control Plane RPC fails.</li><li>The Agent crashes immediately after acknowledging the write back to the producer, but before it manages to commit the file metadata to the Control Plane.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JcLj0pm19BmDo4wv.png" /></figure><p>For a Classic Topic, when the Commit phase fails, we return an error to the client, and the client knows it needs to retry.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*smXZqWxbWZO6J14b.png" /></figure><p>But for a Lightning Topic, it’s too late: the response has already been sent.</p><p>If WarpStream let this happen, it would be a big problem because failing to commit a file to the Control Plane is equivalent to data loss. The Agent would tell the client “your data is safely stored”, but that data would never actually get sequenced and made visible to consumers. From the customer’s perspective, acknowledged data would simply have vanished. In the next section, we explain how WarpStream ensures this <strong>never happens</strong>.</p><h3>The Slow Path</h3><p>To solve this problem, the Agents need to make sure that every journaled file gets sequenced (eventually). 
To accomplish this, the Agents run a periodic background job that lists all files in the object storage journal and checks whether they need to be sequenced. This background job also handles expiring stale journal entries.</p><h3>1. The Control Plane Orchestrates Discovery and Replay</h3><p>The Control Plane’s scheduler periodically assigns <em>scan jobs</em> to Agents, instructing them to list the available <em>sequences</em>. When an Agent reports back with discovered sequences, the scheduler immediately dispatches <em>replay jobs</em> to process each one.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RneeuxrFyP_eE3B9.png" /></figure><h3>2. Agents Replay Sequences in Three Steps</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QTf6_Jni6TBYFl2a.png" /></figure><p>When an Agent receives a replay job for a sequence, it:</p><ol><li><strong>Lists the folder</strong> to discover all uncommitted files.</li><li><strong>Downloads the metadata section of the file</strong>, and extracts the metadata needed for commit.</li><li><strong>Calls the Control Plane</strong> to commit each file.</li></ol><p>Once committed, the data becomes immediately readable by Kafka clients.</p><h3>3. Cleanup Happens Automatically</h3><p>As time passes, <a href="https://www.warpstream.com/blog/unlocking-idempotency-with-retroactive-tombstones">the compaction process</a> merges these files with the rest of the data. When a file from a sequence has been compacted away, the system marks it as expired and the Agent is allowed to delete it from object storage. 
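</p><p>Taken together, the slow path (scan, replay, expire) can be modeled with a toy in-memory sketch. All names are illustrative; real Agents list object storage and call the Control Plane via RPC:</p>

```python
# Toy model of the slow path: every journaled file must eventually be
# committed (sequenced), and compacted-away files are eventually cleaned up.
object_store = {
    "sequence-1": ["file-a", "file-b"],  # a "sequence": folder of journal files
    "sequence-2": ["file-c"],
}
control_plane = {"file-a"}  # only file-a's fast-path commit succeeded
compacted = {"file-c"}      # file-c's data has already been compacted away

def replay(sequence):
    for f in list(object_store[sequence]):  # 1. list the folder
        if f not in control_plane:          # 2. metadata fetch elided
            control_plane.add(f)            # 3. commit to the Control Plane

def cleanup(sequence):
    files = object_store[sequence]
    files[:] = [f for f in files if f not in compacted]  # delete expired files
    if not files:
        del object_store[sequence]  # empty sequence: drop it from the scan list

for seq in sorted(object_store):  # scan jobs discover sequences...
    replay(seq)                   # ...replay jobs commit uncommitted files
for seq in sorted(object_store):
    cleanup(seq)
```

<p>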
Once all files in a sequence are gone, the sequence itself is closed and removed from the scan list — no further work required.</p><h3>The Result</h3><p><strong>The slow path guarantees zero data loss.</strong> Even when the fast path fails to commit files correctly, the slow path ensures 100% of the journaled data is eventually sequenced and made visible to consumers.</p><p>We already discussed the trade-offs associated with Lightning Topics earlier: no returned offsets, no idempotent producers, and no transactions. But the existence of the “Slow Path” highlights one additional trade-off with Lightning Topics: <strong>loss of external consistency</strong>.</p><p>With WarpStream Classic Topics, if a Kafka client produces one record, gets a successful acknowledgement, and then produces another record and gets a second successful acknowledgement, WarpStream guarantees that the offset of the second message is strictly greater than the offset of the first message (i.e., if it was added <em>after</em>, it is inserted <em>after</em> in the partition).</p><p>This is no longer guaranteed to be true with Lightning Topics. For example, the first produce request’s file could fail to commit on the fast path. By the time the slow path picks it up and commits it to the Control Plane, the second record will already be sequenced, and its offset will be lower than the offset of the first one.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4Mqw8LbUApx0EK1e.png" /></figure><h3>Ripcord Mode</h3><p>The keen reader has likely noticed: the slow path guarantees eventual ingestion of Lightning Topic files, even when the Control Plane is unavailable. 
This is a useful property because it means, at least in theory, that Lightning Topics should be able to continue to store data even if the Control Plane is completely unavailable — whether from a networking issue, regional cloud provider outage, or incident in WarpStream’s Control Plane itself.</p><p>This data won’t be immediately available to consumers of course (remember, sequencing is in the critical path of consumers, but not producers), but the Agents can keep accepting data from producers without interruption. This is critical for customers who cannot backpressure their data sources — applications that must either write records immediately or drop them entirely. With Lightning Topics, those records are journaled to object storage, waiting to be sequenced once the Control Plane recovers.</p><p>Unfortunately, Lightning Topics are not enough to guarantee uptime even when the Control Plane is unavailable. One reason is that most clusters will contain a mixture of Classic and Lightning Topics, and so even if the Lightning Topics keep working, there’s a chance that client retries for the Classic Topics generate so much additional load on the Agents that they disrupt the Lightning Topics. Worse, the Agents have other dependencies (like the Agent’s service discovery system, and internal caches with maximum staleness guarantees) that depend on the Control Plane being available in order to continue functioning.</p><p>As a result of all this, we developed a separate feature called “Ripcord Mode”. At a very high level, Ripcord Mode is a flag that can be set on Agent deployments that causes them to treat all topics as if they were Lightning Topics. In addition, Ripcord Mode modifies the behavior of several internal systems to make them resilient to Control Plane unavailability. 
For example, when Ripcord Mode is enabled:</p><ol><li>The Agents will automatically fall back to using object storage as the service discovery mechanism instead of the Control Plane if the Control Plane is unavailable.</li><li>The Agents will adjust the behavior of some internal caches to favor availability over consistency instead of consistency over availability. For example, a recently-created topic may not be available for Produce requests for several seconds or even minutes, whereas in the normal configuration it’s guaranteed to be available for writes within one second of being created.</li></ol><p>We tested Ripcord extensively, including with continuous disruption tests that run multiple times every day in some of our internal test clusters. Below, we demonstrate the results of one of our manual tests where we disrupted the connection between the Agents and the Control Plane for roughly one hour.</p><p>Agents continued accepting Produce requests the whole time. A backlog of files waiting to be ingested built up and then slowly receded when the Agents were allowed to reconnect to the Control Plane. Success!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XkQJDqLrfzg1DnTY.png" /><figcaption>When Agents cannot connect to the control plane, they cannot consume. But they never stop producing. When the Agent can reconnect to the Control Plane, consuming catches up.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*44FzI3CSs5VISAtf.png" /><figcaption>Number of sequences opened by the Agent during a disruption. First, it goes up because the Agent can write files in new sequences but cannot properly ingest the files into the Control Plane. Then, when the disruption is over, the number of sequences stabilizes and goes back down to almost 0. The second chart shows the number of files the Agent manages to insert properly into the Control Plane. 
The yellow bars show the number of “fast path” files that could be ingested, and the red bars the number of “slow path” files. For about an hour, the connection from the Agent to the Control Plane was disrupted, so the Agent could not process a single file. When the connection was re-established, the Agent processed both the live traffic and the backlog of files that needed to be replayed.</figcaption></figure><h3>Ripcord Does Not Eliminate the Need for the Control Plane</h3><p>While Agents can continue accepting writes, any operation requiring Control Plane metadata will still fail if the Control Plane is unavailable.</p><p>What remains blocked:</p><ul><li><strong>Consuming data and all read operations</strong>. All metadata is held in the Control Plane.</li><li><strong>Topic management</strong>. Creating, editing, or deleting topics requires Control Plane access.</li><li><strong>Consumer group management.</strong></li><li>Basically anything that is not just producing batches of data and making sure they’re durable.</li></ul><p>Also, new <strong>Agents cannot start when the Control Plane is unavailable</strong>. The WarpStream Agent caches a lot of information from the Control Plane, and the only way these caches can be warmed up is by loading them from the Control Plane. For this reason, currently, you cannot start (or restart) an Agent while the Control Plane is unavailable.</p><p>We will invest in this area in the future, and will probably keep a copy of these caches in the object storage bucket to make it possible to start a new Agent entirely without a connection to the Control Plane. 
Currently, an Agent will fail during initialization if it cannot reach the Control Plane.</p><h3>How Do I Use This?</h3><p>Both Lightning Topics and Ripcord Agents are Generally Available starting today.</p><ul><li>Download the Agent (<a href="https://docs.warpstream.com/warpstream/overview/change-log#release-v743">at least v744</a>).</li><li>Create Lightning Topics by setting <strong>warpstream.topic.type=Lightning</strong> (or in the Console) as explained <a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/low-latency-clusters/lightning-topics">in the documentation</a>.</li></ul><p>Or start the Agent in Ripcord Mode with <strong>-enableRipcord</strong> as explained <a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/ripcord">in the documentation</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=870c253af852" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Shadowing Kafka ACLs: A Safer Path to Authorization]]></title>
            <link>https://medium.com/@warpstream/shadowing-kafka-acls-a-safer-path-to-authorization-492b1190afaa?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/492b1190afaa</guid>
            <category><![CDATA[access-control]]></category>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Wed, 17 Dec 2025 15:38:38 GMT</pubDate>
            <atom:updated>2025-12-17T15:38:38.187Z</atom:updated>
            <content:encoded><![CDATA[<p>By Yusuf Birader, Software Engineer, WarpStream</p><p>Access Control Lists (ACLs) are Kafka’s native mechanism for controlling who is allowed to do what within a cluster. They define which principals (e.g. users and applications) are allowed to perform which operations (e.g. produce, consume, etc.) on which resources (e.g. topics, consumer groups, etc.).</p><p>ACLs are an important tool for securing Kafka clusters. Without them, any authenticated client can read and write any piece of data freely, as well as perform destructive actions like deleting entire topics.</p><h3>The Risks of Enabling ACLs Blindly</h3><p>At first glance, enabling ACLs in a live Kafka cluster seems like a simple affair:</p><ol><li>Ensure that every client is connecting to the cluster with authenticatable credentials like SASL or mTLS.</li><li>Pre-create all of the necessary ACL rules so that every user has the exact set of permissions required for their application.</li><li>Enable ACL enforcement.</li></ol><p>Seems straightforward enough, but step 3 poses a challenge. Once ACL enforcement is enabled, Kafka defaults to a “deny all” policy, where any request from a non-superuser is denied unless there is an explicit ACL granting access. Whilst great for security, it can make enabling ACLs on a production Kafka cluster for the first time terrifying, because a single missing permission or misconfigured ACL can halt producers, stall consumers, and disrupt connectors, making operators understandably cautious.</p><p>The risk is amplified by how easy ACLs are to misconfigure. Wildcards are a common source of errors. For example, <em>Prefixed(“*”)</em> matches nothing, despite appearing to be a catch-all, whereas <em>Literal(“*”)</em> is required for a true wildcard. 
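</p><p>The wildcard pitfall can be made concrete with a toy model of Kafka’s resource-pattern matching semantics (a sketch of the rules, not Kafka’s actual implementation):</p>

```python
def matches(pattern_type, pattern, resource):
    """Toy model of Kafka ACL resource-pattern matching semantics."""
    if pattern_type == "LITERAL":
        # A LITERAL "*" is the special wildcard that matches every resource.
        return pattern == "*" or pattern == resource
    if pattern_type == "PREFIXED":
        # PREFIXED patterns are plain string prefixes; "*" is not special here.
        return resource.startswith(pattern)
    raise ValueError(f"unknown pattern type: {pattern_type}")

print(matches("LITERAL", "*", "payments"))   # True: the true wildcard
print(matches("PREFIXED", "*", "payments"))  # False: only names starting with "*" match
```

<p>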
Other mistakes, like forgetting that principal names are case-sensitive, can similarly cause unexpected denials and production issues.</p><p>These sorts of subtleties are easy to miss during ACL creation and are nearly impossible to notice until enforcement begins — after which it’s already too late.</p><p>Even perfectly written ACLs may not behave as expected. Since ACLs apply to authenticated principals, any misconfiguration in SASL, Kerberos, certificates, or identity mapping can result in clients not being recognised or being treated as the wrong user. At enablement time, Kafka will simply deny access, with operators left to deal with the consequences.</p><h3>The Need for a Safer ACL Approach</h3><p>Given how risky it is to turn on ACL enforcement blindly, what we really need is a way to see how our ACLs would behave before they start blocking real clients. If we could run ACLs in a mode where they are evaluated but not enforced, we would be able to surface misconfigurations early, without putting production traffic at risk, and free operators from the typical enable-and-pray predicament.</p><p>That idea led us to build <a href="https://docs.warpstream.com/warpstream/kafka/manage-security/configure-acls#acl-shadowing">ACL Shadowing for WarpStream</a>, a way to validate your ACLs and simulate authorization decisions safely on a live cluster before they’re enforced.</p><p>With the flip of a toggle in the console UI, you can enable ACL Shadowing on any Kafka cluster. Once it’s active, the cluster begins running your ACL rules in a shadow mode. 
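</p><p>Conceptually, shadow mode wraps authorization so that the decision is computed and recorded but never enforced. A minimal sketch with illustrative names, not WarpStream’s implementation:</p>

```python
deny_log = []  # would-be denials surfaced to the operator

def acls_allow(principal, operation, resource):
    # Stand-in for real ACL evaluation against the cluster's stored rules.
    return (principal, operation, resource) == ("alice", "READ", "topic-a")

def authorize(principal, operation, resource, shadow=True):
    allowed = acls_allow(principal, operation, resource)
    if not allowed and shadow:
        # Shadow mode: record the would-be denial, but let the request through.
        deny_log.append({"principal": principal, "operation": operation,
                         "resource": resource})
        return True
    return allowed

authorize("bob", "WRITE", "topic-a")  # would be denied; allowed, but logged
```

<p>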
ACLs are fully evaluated against live traffic, but their results are not enforced.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BbFEmPC4B699HWjM.png" /></figure><p>In other words, your WarpStream cluster continues operating normally, but you gain full visibility into what <em>would</em> happen if authorization were turned on.</p><p>Whenever an operation would be denied by your ACLs, the system surfaces this immediately, in a manner similar to how a real authorization failure would be communicated: deny logs are generated for every failing check and a diagnostic is emitted showing the principal, operation, and resource that would have been blocked.</p><p>Because ACL Shadowing never interferes with real traffic, teams can inspect and fix any issues like incorrect principals, wildcard mistakes, or misaligned authentication, without risking an outage.</p><p>By the time you enable ACLs, you have much greater visibility into how your rules are likely to behave. Fewer surprises, less firefighting, and a lower risk of impacting production, allowing developers to go live with confidence.</p><p>ACL Shadowing is available now for all WarpStream clusters. To get started, navigate to the ACLs tab for your cluster. If you have any questions, <a href="https://www.warpstream.com/contact-us">contact us</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=492b1190afaa" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What React and Apache Iceberg Have in Common: Scaling Iceberg with Virtual Metadata]]></title>
            <link>https://medium.com/@warpstream/what-react-and-apache-iceberg-have-in-common-scaling-iceberg-with-virtual-metadata-98a2e8a9c473?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/98a2e8a9c473</guid>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[react]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[apache-iceberg]]></category>
            <category><![CDATA[data-lake]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Fri, 12 Dec 2025 17:25:13 GMT</pubDate>
            <atom:updated>2025-12-12T17:25:13.467Z</atom:updated>
            <content:encoded><![CDATA[<p>By Richard Artoul, Co-Founder, WarpStream</p><p>I want to start this blog post by talking about React. I know what you’re thinking: “React? I thought this was a blog post about Iceberg!” It is, so please just give me a few minutes to explain before you click away.</p><h3>React’s Lesson in Declarative Design</h3><p>React launched in 2013 and one of the main developers, Pete Hunt, wrote a thoughtful and concise answer to a simple question: “<a href="https://legacy.reactjs.org/blog/2013/06/05/why-react.html">Why did we build react?</a>”. The post is short, so I encourage you to go read it right now, but I’ll summarize what I think are the most important points:</p><ol><li>React’s declarative model makes building applications where data changes over time easy. Without React, developers had to write the declarative code to render the application’s initial state, but they also had to write all of the imperative code for every possible state transition that the application would ever make. With React, developers only have to write one declarative function to render the application given any input state, and React takes care of all the imperative state transitions automatically.</li><li>React is backend-agnostic. Most popularly, it’s used to render dynamic web applications with HTML, but it can also be used to drive UI rendering outside the context of a browser entirely. For example, react-native allows developers to write native iOS and Android applications as easily as they write web applications.</li></ol><h3>The Similarity Between the DOM and Iceberg Metadata</h3><p>So what does any of this have to do with data lakes and Iceberg? Well, it turns out that engineers building data lakes in 2025 have a lot in common with frontend engineers in 2013. 
At its most basic level, Iceberg is “just” a formal specification for representing tables as a tree of metadata files.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Wh8X3ufGQK7Ec5pl.png" /></figure><p>Similarly, the browser Document Object Model (DOM) is “just” a specification for representing web pages as a tree of objects.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xbpeapRM8b3nOnQd.png" /></figure><p>While this may sound trite (“Ok, they’re both trees, big deal”), the similarities are more than superficial. For example, the biggest problem for engineers interacting with either abstraction is the same: building the initial tree is easy, keeping the tree updated in real-time (efficiently and without introducing any bugs) is hard.</p><p>For example, consider the performance difference between inserting 10,000 new items into the DOM iteratively vs. inserting them all at once as a batch. The batch approach is almost 20x faster because it performs significantly fewer DOM tree mutations and less re-rendering.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bpiF5F2LRsH8GkOw.png" /><figcaption><em>Thanks to Cursor, </em><a href="https://www.warpstream.com/blog/the-future-of-coding-how-cursor-and-warpstream-power-ai-productivity"><em>a WarpStream customer</em></a><em>, all my benchmarks come with a sweet UI now.</em></figcaption></figure><p>In this particular scenario, writing the batching code is an easy fix. But it’s not always easy to remember, and as applications grow larger and more complex, performance problems like this can be hard to spot just by looking at the code.</p><h3>React’s Virtual DOM: A Layer of Indirection for Efficiency and Correctness</h3><p>React solves this problem more generally by introducing a layer of indirection between the programmer and the actual DOM. This layer of indirection is called the “virtual DOM”. 
When a programmer creates a component in React, all they have to do is write a declarative render() function that accepts data and generates the desired tree of objects.</p><p>However, the render function does not interact with the browser DOM directly. Instead, React takes the output of the render function (a tree of objects) and diffs it with the previous output. It then uses this diff to generate the minimal set of DOM manipulations to transition the DOM from the old state to the new desired state. A programmer <strong>could</strong> write this code themselves, but React automates it even for large and complex applications.</p><p>This layer of indirection also introduces many opportunities for optimization. For example, React can delay updating the DOM for a short period of time so that it can accumulate additional virtual DOM manipulations before updating the actual DOM all at once (automatic batching).</p><h3>Managing Iceberg Metadata: A Tedious, Error-Prone Chore</h3><p>Let’s transition back to Iceberg now and walk through all of the steps required to add a new Parquet file (that’s already been generated) to an existing Iceberg table:</p><ol><li>Locate and read the current <strong>metadata.json</strong> file for the table.</li><li>Validate compatibility with the Iceberg table’s schema.</li><li>Compute the partition values for the new file.</li><li>Create the DataFile metadata object.</li><li>Read the old manifest file(s).</li><li>Create a <strong>new manifest file</strong> listing the new data file(s).</li><li>Generate a new version of the metadata.json file.</li></ol><p>Optionally (but must be done at some point):</p><ol><li><strong>Expire old snapshots</strong> (metadata cleanup).</li><li><strong>Rewrite manifests</strong> for optimization.</li><li><strong>Reorder or compact files</strong> if needed for read performance.</li></ol><p>That’s a lot of steps, and it all has to be done 100% correctly or the table will become corrupted. 
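</p><p>The commit sequence above can be boiled down to a deliberately toy, in-memory sketch. Real commits go through an Iceberg catalog and the canonical libraries; the partitioning and snapshot-expiry steps are elided here:</p>

```python
# Toy, in-memory model of appending a data file to an Iceberg-style table.
table = {
    "version": 1,  # stands in for the current metadata.json version
    "metadata": {"schema": ["id", "ts"], "manifests": [["old.parquet"]]},
}

def append_file(table, data_file, columns):
    meta = table["metadata"]                  # 1. read the current metadata
    if columns != meta["schema"]:             # 2. schema compatibility check
        raise ValueError("schema mismatch")
    new_manifest = [data_file]                # 6. new manifest listing the file
    table["metadata"] = {                     # 7. new metadata.json version
        "schema": meta["schema"],
        "manifests": meta["manifests"] + [new_manifest],
    }
    table["version"] += 1

append_file(table, "new.parquet", ["id", "ts"])
```

<p>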
Worse, all of these steps have to be done single-threaded for the most part.</p><p>This complexity is the reason that there is very little Iceberg adoption outside of the Java ecosystem. It’s almost impossible to do any of this correctly without access to the canonical Java libraries. That’s also the reason why Spark has historically been the only game in town for building real-time data lakes.</p><h3>How We Built WarpStream Tableflow</h3><p>WarpStream’s Tableflow implementation has a lot in common with React. At the most basic level, the goal of our Tableflow product is to continuously modify the D̶O̶M̶ Iceberg metadata tree efficiently and correctly. There are two ways to do this:</p><ol><li>Manipulate the metadata tree directly in object storage. This is what Spark and everybody else does.</li><li>Create a “virtual” version of the metadata tree, manipulate that, and then reflect those changes back into object storage asynchronously, akin to what React does.</li></ol><p>We went with the latter option for a number of reasons, foremost of which is performance. Normally, Iceberg metadata operations must be executed single-threaded, but our virtual metadata system can be updated millions of times per second. This allows us to reduce ingestion latency dramatically and scale seamlessly from 1MiB/s to 10+GiB/s of ingestion with minimal orchestration.</p><p>But what exactly is the “virtual metadata tree”? In WarpStream’s case, it’s just a database. The exact same database that powers metadata for WarpStream’s Kafka clusters!
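</p><p>The throughput difference between the two options can be illustrated with a deliberately simplified model (hypothetical classes, not WarpStream’s actual code): one metadata commit per append versus a virtual tree that absorbs appends in memory and materializes a single snapshot.</p>

```python
class DirectWriter:
    """One object-store metadata commit per append (the Spark-style model)."""
    def __init__(self):
        self.metadata_versions = 0
    def append(self, f):
        self.metadata_versions += 1  # rewrite metadata.json every time

class VirtualWriter:
    """Appends mutate an in-memory tree; metadata is materialized async."""
    def __init__(self):
        self.pending, self.metadata_versions = [], 0
    def append(self, f):
        self.pending.append(f)       # cheap in-memory mutation
    def snapshot(self):
        self.metadata_versions += 1  # one metadata version for the batch
        self.pending.clear()

direct, virtual = DirectWriter(), VirtualWriter()
for i in range(10_000):
    direct.append(f"data/{i}.parquet")
    virtual.append(f"data/{i}.parquet")
virtual.snapshot()
print(direct.metadata_versions, virtual.metadata_versions)  # 10000 vs 1
```

<p>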
This database isn’t just fast; it also provides extremely strong guarantees in terms of consistency and isolation (all transactions are fully serializable), which makes it much easier to implement data lake features and ensure that they’re correct.</p><h3>Tableflow in Action: Exactly-Once Ingestion at Scale</h3><p>So what does this all look like in practice?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wA1-qROr2vpxFCSc.png" /></figure><p>Let’s track the flow of data starting with data stored in a Kafka topic (WarpStream or otherwise):</p><ol><li>The WarpStream Agent issues fetch request(s) against the Kafka cluster to fetch records for a specific topic.</li><li>The Agent deserializes the records, optionally applies any user-specified transformation functions, makes sure the records match the table schema, creates a Parquet file, and then flushes that file to the object store.</li><li>The Agent commits the existence of a new Parquet file to the WarpStream control plane. This operation also atomically updates the set of consumed offsets tracked by the control plane, which provides the system with exactly-once ingestion guarantees (practically for free!).</li><li>At this point the records are “durable” in WarpStream Tableflow (the source Kafka cluster could go offline or the Kafka topic could be deleted and we wouldn’t lose any records), but not yet queryable by external query engines. The reason for this is that even though the records have been written to a Parquet file in object storage, we still need to update the Iceberg metadata in the object store to reflect the existence of these new files.</li><li>Finally, the WarpStream control plane takes a new snapshot of its internal table state and generates a new set of Iceberg metadata files in the object store. Now the newly-ingested data is queryable by external query engines [1].</li></ol><p>That’s just the ingestion path.
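</p><p>Step 3 can be sketched as a single atomic check-and-commit. This is an invented, single-process stand-in for the control plane (a dict and a plain function rather than a real distributed, transactional database), but it shows why registering the file and advancing the consumed offsets in one transaction makes a retried commit harmless:</p>

```python
committed_files = []
consumed_offsets = {}  # (topic, partition) -> next offset to consume

def commit(file, topic, partition, start, end):
    """Atomically register a Parquet file and advance the consumed offset.

    If the offsets were already consumed (e.g. a retry after a timeout),
    the whole commit is rejected and no duplicate file is registered."""
    expected = consumed_offsets.get((topic, partition), 0)
    if start != expected:
        return False  # stale or duplicate commit: reject atomically
    committed_files.append(file)
    consumed_offsets[(topic, partition)] = end
    return True

assert commit("f1.parquet", "logs", 0, 0, 100) is True
assert commit("f1.parquet", "logs", 0, 0, 100) is False  # retry is a no-op
assert len(committed_files) == 1  # no duplicates despite the retry
```

<p>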
WarpStream Tableflow provides a ton of other table management services like:</p><ul><li>Data expiration</li><li>Orphan file cleanup</li><li>Background compaction</li><li>Custom partitioning</li><li>Sorting, etc.</li></ul><p>But for brevity, we won’t go into the implementation details of those features in this post.</p><p>It’s easy to see why this approach is much more efficient than the alternative despite introducing an additional layer of indirection: we can perform concurrency control using a low-latency transactional database (millions of operations/s), which reduces the window for conflicts when compared to a single-writer model on top of object storage alone. For table operations which don’t conflict, we can freely execute them concurrently and only abort those with true conflicts. The most common operation in Tableflow, the append of new records, is one of those operations that is extremely unlikely to have a true conflict due to how our append jobs are scheduled within the cluster.</p><p>In summary, unlike traditional data lake implementations that perform all metadata mutation operations directly against the object store (single-threaded), our implementation trivially parallelizes itself and scales to millions of metadata operations per second. In addition, we don’t need to worry about the number of partitions or files that participate in any individual operation.</p><p>Earlier in the post, I alluded to the fact that this approach makes developing new features easier, as well as guaranteeing that they’re correct. Take another look at step 3 above: WarpStream Tableflow <strong>guarantees</strong> <strong>exactly once ingestion </strong>into the data lake table, and I almost never remember to brag about it because it just falls so naturally out of the design that we barely thought about it.
When you have a fast database that provides fully serializable transactions, strong guarantees and correctness (almost) just happen [2].</p><h3>Multiple Table Formats (for Free!)</h3><p>We’ve spent a lot of time talking about performance and correctness, but the original React creators had more than that in mind when they came up with the idea of the virtual DOM: multiple backends. While it was originally designed to power web applications, today React has a variety of backends like React Native that enable the same code and libraries to power UIs outside of the browser like native Android and iOS apps.</p><p>Virtual metadata provides the same benefit for Tableflow. Today, we only support the Iceberg table format, but in the near future we’ll add full support for Delta Lake as well. Let’s take another look at the Tableflow architecture diagram:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*o_Hj_u-SnKxyu67m.png" /></figure><p>The <strong>only</strong> step that needs to change to support Delta Lake is step 4. There is (effectively) a single transformation function that takes Tableflow’s virtual metadata snapshot as an input and outputs Iceberg metadata files. All we need to do is write another function that takes Tableflow’s virtual metadata snapshot as input and outputs Delta Lake metadata files, and we’re done.</p><p>Every other feature of Tableflow (compaction, orphan file cleanup, stateless transformations, partitioning, sorting, etc) remains completely unchanged. If Tableflow operated on the metadata in the object store directly, every single one of those features would have to be rewritten to accommodate Delta Lake as well.</p><p>Interested in trying WarpStream Tableflow? 
You can learn more about Tableflow <a href="https://www.warpstream.com/tableflow">here</a>, read <a href="https://docs.warpstream.com/warpstream/tableflow/tableflow">the Tableflow docs</a>, and fill out <a href="https://www.warpstream.com/tableflow-early-access">this form</a> to get early access to Tableflow.</p><p><strong>Footnotes</strong></p><p>‍[1] For security reasons, the WarpStream control plane doesn’t actually have access to the customer’s object storage bucket, so instead it writes the new metadata to a WarpStream-owned bucket and sends the Agents a pre-signed URL they use to copy the files into the customer’s bucket.</p><p>[2] I’m exaggerating a little bit, our engineers work very hard to guarantee the correctness of our products.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=98a2e8a9c473" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Robinhood Swaps Kafka for WarpStream to Tame Logging Workloads and Costs]]></title>
            <link>https://medium.com/@warpstream/robinhood-swaps-kafka-for-warpstream-to-tame-logging-workloads-and-costs-76869538a522?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/76869538a522</guid>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[kafka-alternative]]></category>
            <category><![CDATA[object-storage]]></category>
            <category><![CDATA[s3]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Tue, 09 Dec 2025 16:21:18 GMT</pubDate>
            <atom:updated>2025-12-09T18:06:40.944Z</atom:updated>
<content:encoded><![CDATA[<p>By Robinhood + Jason Lauritzen</p><blockquote><strong>By switching from Kafka to WarpStream</strong> for their logging workloads, <strong>Robinhood saved 45%</strong>. WarpStream auto-scaling always keeps clusters right-sized, and features like Agent Groups eliminate issues like noisy neighbors and complex networking like PrivateLink and VPC peering.</blockquote><p><a href="https://robinhood.com/us/en/">Robinhood</a> is a financial services company that offers electronic trading of stocks and cryptocurrency, automated portfolio management and investing, and more. With over 14 million monthly active users and over 10 terabytes of data processed per day, its data scale and needs are massive.</p><p>Robinhood software engineers Ethan Chen and Renan Rueda presented a talk at Current New Orleans 2025 (see the appendix for slides, a video of their talk, and before-and-after cost-reduction charts) about their transition from Kafka to WarpStream for their logging needs, which we’ve reproduced below.</p><h3>Why Robinhood Picked WarpStream for Its Logging Workload</h3><p>Logs at Robinhood fall into two categories: application-related logs and observability pipelines, which are powered by <a href="https://vector.dev/">Vector</a>. Prior to WarpStream, these were produced to and consumed from Kafka.</p><p>The decision to migrate was driven by the highly cyclical nature of Robinhood’s platform activity, which is directly tied to U.S. stock market hours. There’s a consistent pattern where market hours result in higher workloads. External factors can vary the load throughout the day, and sudden spikes are not unusual.
Nights and weekends are usually low-traffic times.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ypB8aBGxVC4zMV7H.png" /></figure><p>Traditional Kafka cloud deployments that rely on provisioned storage like EBS volumes lack the ability to scale up and down automatically during low- and high-traffic times, leading to substantial compute (since EC2 instances must be provisioned for EBS) and storage waste.</p><p>“If we have something that is elastic, it would save us a big amount of money by scaling down when we don’t have that much traffic,” said Rueda.</p><p>WarpStream’s S3-compatible diskless architecture combined with its ability to auto-scale made it a perfect fit for these logging workloads, but what about latency?</p><p>“Logging is a perfect candidate,” noted Chen. “Latency is not super sensitive.”</p><h3>Architecture and Migration</h3><p>The logging system’s complexity necessitated a phased migration to ensure minimal disruption, no duplicate logs, and no impact on the log-viewing experience.</p><p>Before WarpStream, the logging setup was:</p><ol><li>Logs were produced to Kafka from the Vector daemonset.</li><li>Vector consumed the Kafka logs.</li><li>Vector shipped logs to the logging service.</li><li>The logging application used Kafka as the backend.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Wf3Xue9AS4rQsOyd.png" /></figure><p>To migrate, the Robinhood team broke the monolithic Kafka cluster into two WarpStream clusters (one for the logging service and one for the Vector daemonset) and split the migration into two corresponding phases.</p><p>For the logging service migration, Robinhood’s logging Kafka setup is “all or nothing.” They couldn’t move everything over bit by bit — it had to be done all at once.
They wanted as little disruption or impact as possible (at most a few minutes), so they:</p><ol><li>Temporarily shut off Vector ingestion.</li><li>Buffered logs in Kafka.</li><li>Waited until the logging application finished processing the queue.</li><li>Performed the quick switchover to WarpStream.</li></ol><p>The Vector log shipping migration was more gradual and involved two steps:</p><ol><li>They temporarily duplicated their Vector consumers, so one shipped to Kafka and the other to WarpStream.</li><li>They then gradually pointed the log producers to WarpStream and turned off Kafka.</li></ol><p>Now, Robinhood uses the following logging architecture, which allows them more flexibility:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_xcw6AK52pHkTaRQ.png" /></figure><h3>Deploying WarpStream</h3><p>Below, you can see how Robinhood set up its WarpStream cluster.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1asdZoxrVv-wXNHl.png" /></figure><p>The team designed their deployment to maximize isolation, configuration flexibility, and efficient multi-account operation by using <a href="https://docs.warpstream.com/warpstream/kafka/advanced-agent-deployment-options/agent-groups">Agent Groups</a>. This allowed them to:</p><ul><li>Assign particular clients to specific groups, which isolated <a href="https://www.warpstream.com/blog/getting-rid-of-kafka-noisy-neighbors-without-having-to-buy-a-mansion">noisy neighbors</a> from one another and eliminated concerns about resource contention.</li><li>Apply different configurations as needed, e.g., enable TLS for one group, but plaintext for another.</li></ul><p>This architecture also unlocked another major win: it simplified multi-account infrastructure. Robinhood granted permissions to read and write from a central WarpStream S3 bucket and then put their Agent Groups in different VPCs.
An application talks to one Agent Group to ship logs to S3, and another Agent Group consumes them, eliminating the need for complex inter-VPC networking like VPC peering or AWS PrivateLink setups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*arx73RsMXDg1zfRe.png" /></figure><h3>Configuring WarpStream</h3><p>WarpStream is optimized for reduced costs and simplified operations out of the box. Every deployment of WarpStream can be further tuned based on business needs.</p><p><a href="https://docs.warpstream.com/warpstream/agent-setup/deploy#instance-selection">WarpStream’s standard instance recommendation</a> is one core per 4 GiB of RAM, which Robinhood followed. They also leveraged:</p><ul><li><strong>Horizontal pod auto-scaling (HPA). </strong><a href="https://docs.warpstream.com/warpstream/agent-setup/deploy#zone-specific-scaling">This auto-scaling policy</a> was critical for handling their cyclical traffic. It allowed fast scale ups that handled sudden traffic spikes (like when the market opens) and slow, graceful scale downs that prevented latency spikes by allowing clients enough time to move away from terminating Agents.</li><li><strong>AZ-aware scaling</strong>. To match capacity to where workloads needed it, they deployed three K8s deployments (one per AZ), each with its own HPA and made them <a href="https://docs.warpstream.com/warpstream/kafka/configure-kafka-client/configure-clients-to-eliminate-az-networking-costs">AZ aware</a>. This allowed each zone’s capacity to scale independently based on its specific traffic load.</li><li><strong>Customized batch settings.</strong> They chose larger batch sizes which resulted in fewer S3 requests and significant S3 API savings. 
The latency increase was minimal (see the before and after chart below) — an increase from 0.2 to 0.45 seconds, which is an acceptable trade-off for logging.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qdhuQPzCkZIndLky.png" /><figcaption>Robinhood’s average produce latency before and after batch tuning (in seconds).</figcaption></figure><h3>Pros of Migrating and Cost Savings</h3><p>Compared to their prior Kafka-powered logging setup, WarpStream massively simplified operations by:</p><ul><li><strong>Simplifying storage</strong>. Using S3 provides automatic data replication, lower storage costs than EBS, and virtually unlimited capacity, eliminating the need to constantly increase EBS volumes.</li><li><strong>Eliminating Kafka control plane maintenance</strong>. Since the WarpStream control plane is managed by WarpStream, this operations item was completely eliminated.</li><li><strong>Increasing stability</strong>. WarpStream removed the burden of dealing with URPs (under-replicated partitions) as that’s handled by S3 automatically.</li><li><strong>Reducing on-call burden</strong>. Less time is spent keeping services healthy.</li><li><strong>Faster automation</strong>. New clusters can be created in a matter of hours.</li></ul><p>And how did that translate into more networking, compute, and storage efficiency, and cost savings vs. Kafka? Overall, <strong>WarpStream saved Robinhood 45% compared to Kafka</strong>. This efficiency stemmed from eliminating inter-AZ networking fees entirely, reducing compute costs by 36%, and reducing storage costs by 13%.</p><h3>Appendix</h3><p>You can grab a PDF copy of the slides from Robinhood’s presentation by <a href="https://cdn.prod.website-files.com/683882630fceea3de6f5776a/690a70219b4410c0718e8f42_robinhood_current_nola_2025_talk.pdf">clicking here</a>.
Below, you’ll find a video version of the presentation:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FO9DynAshbxA%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DO9DynAshbxA&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FO9DynAshbxA%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/29a2ba12ce25539bc512d02c7a7506fc/href">https://medium.com/media/29a2ba12ce25539bc512d02c7a7506fc/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/779/0*iqhhLgRKcNmj3lGf.png" /><figcaption>Robinhood’s inter-AZ, storage, and compute costs before and after WarpStream.</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=76869538a522" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Going All in on Protobuf With Schema Registry and Tableflow]]></title>
            <link>https://medium.com/@warpstream/going-all-in-on-protobuf-with-schema-registry-and-tableflow-8132045ab63d?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/8132045ab63d</guid>
            <category><![CDATA[avro-schema]]></category>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[protobuf]]></category>
            <category><![CDATA[kafka-schema-registry]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Wed, 03 Dec 2025 17:53:26 GMT</pubDate>
            <atom:updated>2025-12-03T17:53:26.210Z</atom:updated>
            <content:encoded><![CDATA[<p>By Maud Gautier, Software Engineer, WarpStream</p><p>Protocol Buffers (<a href="https://protobuf.dev/">Protobuf</a>) have become one of the most widely-adopted data serialization formats, used by countless organizations to exchange structured data in APIs and internal services at scale.</p><p>At WarpStream, we originally launched our <a href="https://docs.warpstream.com/warpstream/schema-registry/warpstream-byoc-schema-registry">BYOC Schema Registry product</a> with full support for <a href="https://www.warpstream.com/blog/introducing-warpstream-byoc-schema-registry">Avro schemas</a>. However, one missing piece was Protobuf support.</p><p>Today, we’re excited to share that we have closed that gap: our Schema Registry now supports Protobuf schemas, with complete compatibility with Confluent’s Schema Registry.</p><h3>A Refresher of WarpStream’s BYOC Schema Registry Architecture</h3><p>Like all schemas in WarpStream’s BYOC Schema Registry, your Protobuf schemas are stored directly in your own object store. Behind the scenes, the WarpStream Agent runs inside your own cloud environment and handles validation and compatibility checks locally, while the control plane only manages lightweight coordination and metadata.</p><p>This ensures your data never leaves your environment and requires no separate registry infrastructure to manage. As a result, WarpStream’s Schema Registry requires zero operational overhead or inter-zone networking fees, and provides instant scalability by increasing the number of stateless Agents (for more details, see <a href="https://www.warpstream.com/blog/introducing-warpstream-byoc-schema-registry">our previous blog post</a> for a deep-dive on the architecture).</p><h3>Compatibility Rules in the Schema Registry</h3><p>In many cases, implementing a new feature via an application code change also requires a change to be made in a schema (to add a new field, for example). 
Oftentimes, new versions of the code are deployed one node at a time via rolling upgrades. This means that both old and new versions of the code may coexist with old and new data formats at the same time.</p><p>Two terms are usually employed to characterize those evolutions:</p><ul><li>Backward compatibility, i.e., new code can read old data. In the context of a Schema Registry, that means that if consumers are upgraded first, they should still be able to read the data written by old producers.</li><li>Forward compatibility, i.e., old code can read new data. In the context of a Schema Registry, that means that if producers are upgraded first, the data they write should still be readable by the old consumers.</li></ul><p>This is why compatibility rules are a critical component of any Schema Registry: they determine whether a new schema version can safely coexist with existing ones.</p><p>Like Confluent’s Schema Registry, WarpStream’s BYOC Schema Registry offers seven compatibility types: BACKWARD, FORWARD, FULL (i.e., both BACKWARD and FORWARD), NONE (i.e., all checks are disabled), BACKWARD_TRANSITIVE (i.e., BACKWARD but checked against <em>all</em> previous versions), FORWARD_TRANSITIVE (i.e., FORWARD but checked against <em>all</em> previous versions), and FULL_TRANSITIVE (i.e., BACKWARD and FORWARD but checked against <em>all</em> previous versions).</p><p>Getting these rules right is essential: if an incompatible change slips through, producers and consumers may interpret the same bytes on the wire differently, thus leading to deserialization errors or even data loss.</p><h3>Wire-Level Compatibility: Relying on Protobuf’s Wire Encoding</h3><p>Whether two schemas are compatible ultimately comes down to the following question: will the exact same sequence of bytes on the wire using one schema still be interpreted correctly using the other schema? If yes, the change is compatible.
If not, the change is incompatible.</p><p>In Protobuf, this depends heavily on how each type is encoded. For example, both int32 and bool types are serialized as a variable-length integer, or <a href="https://protobuf.dev/programming-guides/encoding/#varints">“varint”</a>. Essentially, varints are an efficient way to transmit integers on the wire, as they minimize the number of bytes used: small numbers (0 to 127) use a single byte, moderately large numbers (128 to 16,383) use two bytes, etc.</p><p>Because both types share the same encoding, turning an int32 into a bool is a wire-compatible change. The reader interprets a 0 as false and any non-zero value as true, but the bytes remain meaningful to both types.</p><p>However, a change from an int32 into a sint32 (signed integer) is not wire-compatible, because sint32 uses a different encoding: <a href="https://protobuf.dev/programming-guides/encoding/#signed-ints">the “ZigZag” encoding</a>. Essentially, this encoding remaps numbers by literally zigzagging between positive and negative numbers: -1 is encoded as 1, 1 as 2, -2 as 3, 2 as 4, etc. This gives <em>negative</em> integers the ability to be encoded efficiently as varints, since they have been remapped to small numbers requiring very few bytes to be transmitted. (Comparatively, a negative int32 is encoded as a two’s complement and always requires a full 10 bytes).</p><p>Because of the difference in encoding, the same sequence of bytes would be interpreted differently. For example, the byte 0x01 would decode to 1 when read as an int32 but as -1 when read as a sint32 after ZigZag decoding.
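</p><p>Both encodings are simple enough to reproduce in a few lines. The sketch below follows the varint and ZigZag rules from the Protobuf encoding guide; it is a toy for illustration, not a real Protobuf library:</p>

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = 0
    for i, b in enumerate(data):
        n |= (b & 0x7F) << (7 * i)
    return n

def zigzag_decode(n: int) -> int:
    """sint32/sint64 remapping: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ..."""
    return (n >> 1) ^ -(n & 1)

wire = bytes([0x01])
print(decode_varint(wire))                 # read as int32: 1
print(zigzag_decode(decode_varint(wire)))  # read as sint32: -1
```

<p>The same wire byte yields two different values depending on which schema the reader believes, which is exactly why int32 → sint32 is flagged as incompatible.</p><p>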
Therefore, converting an int32 to a sint32 (and vice-versa) is incompatible.</p><p>Note that since compatibility rules are so fundamentally tied to the underlying wire encoding, they also differ across serialization formats: while int32 -&gt; bool is compatible in Protobuf as discussed above, the analogous change int -&gt; boolean is incompatible in Avro (because booleans are encoded as a single byte in Avro, and not as a varint).</p><h3>The Testing Framework We Used To Guarantee Full Compatibility</h3><p>These examples are only two among dozens of compatibility rules required to properly implement a Protobuf Schema Registry that behaves exactly like Confluent’s. The full set is extensive, and manually writing test cases for all of them would have been unrealistic.</p><p>Instead, we built a random Protobuf schema generator and a mutation engine to produce tens of thousands of different schema pairs (see Figure 1). We submit each pair to both Confluent’s Schema Registry and WarpStream BYOC Schema Registry and then compare the compatibility results (see Figure 2). Any discrepancy reveals a missing rule, a subtle edge case, or an interaction between rules that we failed to consider. This testing approach is similar in spirit to <a href="https://www.cockroachlabs.com/blog/metamorphic-testing-the-database/">CockroachDB’s metamorphic testing</a>: in our case, the input space is explored via the generator and mutator combo, while the two Schema Registry implementations serve as alternative configurations whose outputs must match.</p><p>Our random generator covers every Protobuf feature: all scalar types, nested messages (up to three levels deep), oneof blocks, maps, enums with or without aliases, reserved fields, gRPC services, imports, repeated and optional fields, comments, field options, etc.
Essentially, any feature listed in the <a href="https://protobuf.dev/programming-guides/proto3/">Protobuf docs</a>.</p><p>Our mutation engine then applies random schema evolutions on each generated schema. We created a pool of more than 20 different mutation types corresponding to real evolutions of a schema, such as: adding or removing a message, changing a field type, moving a field into or out of a oneof block, converting a map to a repeated message, changing a field’s cardinality (e.g., switching between optional, required, and repeated), etc…</p><p>For each test case, the engine picks one to five of those mutations randomly from that pool to generate the final mutated schema. We repeat this operation hundreds of times to generate hundreds of pairs of schemas that may or may not be compatible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lWCJ3b6LaXjXkbqz.png" /><figcaption>Figure 1: Exploring the input space with randomness: the random generator generates an initial schema and the mutation engine picks 1–5 mutations randomly from a pool to generate the final schema. This is repeated N times so we can generate N distinct pairs of schemas that may or may not be compatible.</figcaption></figure><p>Each pair of writer/reader schemas is then submitted to both Confluent’s and WarpStream’s Schema Registries. For each run, we compare the two responses: we’re aiming for them to be identical for any random pair of schemas.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GUq4NYSwd9E46XME.png" /><figcaption>Figure 2: Comparing the responses of Confluent’s and WarpStream’s Schema Registry implementations with every pair of writer-reader schemas. 
An identical response (left) indicates the two implementations are aligned but a different response (right) indicates a missing compatibility rule or an overlooked corner case that needs to be looked into.</figcaption></figure><p>This framework allowed us to improve our implementation until it perfectly matched Confluent’s. In particular, the fact that the mutation engine selects not one, but multiple mutations atomically allowed us to uncover a few rare interactions between schema changes that would not have appeared had we tested each mutation in isolation. This was notably the case for changes around oneof fields, whose compatibility rules are <a href="https://yokota.blog/2021/08/26/understanding-protobuf-compatibility/">a bit subtle</a>.</p><p>For example, removing a field from a oneof block is a backward-incompatible change. Let’s take the following schema versions for the writer and reader:</p><pre>// Writer schema<br>message User {<br>  oneof ContactMethod {<br>    string email = 1;<br>    int32 phone = 2;<br>    int32 fax = 3;<br>  }<br>}</pre><pre>// Reader schema<br>message User {<br>  oneof ContactMethod {<br>    string email = 1;<br>    int32 phone = 2;<br>  }<br>}</pre><p>As you can see, the writer’s schema allows for three contact methods (email, phone, fax) whereas the reader’s schema allows for only the first two. In this case, the reader may receive data where the field fax was set (encoded with the writer’s schema) and incorrectly assume no contact method exists. This results in information loss as there was a contact method when the record was written. 
Hence, removing a oneof field is backward-incompatible.</p><p>However, if the oneof block gets renamed to ModernContactMethod on top of the removal of the fax field from the oneof block:</p><pre>// Reader schema<br>message User {<br>  oneof ModernContactMethod {<br>    string email = 1;<br>    int32 phone = 2;<br>  }<br>}</pre><p>Then the semantics change: the reader no longer claims that “these are the possible contact methods” but instead “these are the possible <em>modern</em> contact methods”. Now, reading a record where the fax field was set results in no data loss: the truth is preserved that no <em>modern</em> contact method was set at the time the record was written.</p><p>This kind of subtle interaction, where the compatibility of one change depends on another, was uncovered by our testing framework, thanks to the mutation engine’s ability to combine multiple mutations at once.</p><p>All in all, combining a comprehensive schema generator with a mutation engine and consistently getting the same response from Confluent’s and WarpStream’s Schema Registries over hundreds of thousands of tests gave us exceptional confidence in the correctness of our Protobuf Schema Registry.</p><p>So what about <a href="https://www.warpstream.com/tableflow">WarpStream Tableflow</a>? While that product is still in <a href="https://www.warpstream.com/tableflow-early-access">early access</a>, we’ve had exceptional demand for Protobuf support there as well, so that’s what we’re working on next. We expect that Tableflow will have full Protobuf support by the end of this year.</p><p>If you are looking for a place to store your Protobuf schemas with minimal operational and storage costs, and guaranteed compatibility, the search is over.
<a href="https://docs.warpstream.com/warpstream/schema-registry/warpstream-byoc-schema-registry">Check out our docs</a> to get started or <a href="https://www.warpstream.com/contact-us">reach out to our team</a> to learn more.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8132045ab63d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Superwall Uses WarpStream and ClickHouse Cloud to Scale Subscription Monetization]]></title>
            <link>https://medium.com/@warpstream/how-superwall-uses-warpstream-and-clickhouse-cloud-to-scale-subscription-monetization-ff525b3e5153?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/ff525b3e5153</guid>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Fri, 10 Oct 2025 14:37:44 GMT</pubDate>
            <atom:updated>2025-10-10T14:37:44.953Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://clickhouse.com/blog/superwall-warpstream-clickhouse-cloud-monetization">This engineering blog</a> is reproduced with permission from ClickHouse and Superwall.</p><p><a href="https://www.superwall.com/">Superwall</a> is a small team doing big things. With only 14 people, the company powers paywall monetization for hundreds of mobile subscription apps that reach billions of end users.</p><p>The pitch is straightforward: install Superwall’s SDK once, and you can instantly A/B test checkout flows, adjust pricing globally, and roll out discounts, all without shipping an app store update. As co-founder and CTO Brian Anglin explains, at the heart of the platform is a focus on continuously balancing supply and demand. “With digital products, usually you have high fixed costs and low variable costs,” he says. “You want to be able to accept any offer without cannibalizing customers who would have paid more.”</p><p>But that type of experimentation and optimization at scale is anything but simple. Every paywall view, conversion, and experiment generates an event that must be captured, streamed, and analyzed in near real-time. Dashboards have to update in seconds so customers can see results. 
To ingest 100 MB of events per second and store 300+ TB of data, Superwall needed a stack that was fast, durable, and perhaps most importantly, low maintenance.</p><p>We caught up with Brian to learn how they landed on <a href="https://www.warpstream.com/">WarpStream</a> and <a href="https://clickhouse.com/cloud">ClickHouse Cloud</a>, how their stack has evolved over the past few years, and the missing piece that finally tied it all together.</p><h3>Getting Started With Apache Kafka® and ClickHouse</h3><p>When Superwall first began building its platform, the team had two main needs: a reliable data streaming backbone to handle the flood of events from its SDK, and an analytical database that could make sense of those events quickly enough to power customer dashboards. “We didn’t want to settle for hourly or daily roll ups,” Brian says. “We wanted something near real-time.”</p><p>On the streaming side, Kafka compatibility was important. “It really sucks to lose data, and we also make mistakes,” he says. By turning up retention, the team could replay events if something broke or reprocess data when new use cases emerged. That durability gave the team confidence to move quickly.</p><p>For analytics, they chose <a href="https://clickhouse.com/">ClickHouse</a>. Blog posts from PostHog and Cloudflare — including one showing <a href="https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/?utm_content=buffer34e09&amp;utm_medium=social&amp;utm_source=linkedin&amp;utm_campaign=buffer&amp;glxid=d9f3db4d-2d6e-4017-a386-1834d275734d&amp;pagePath=%2Fblog%2Fsuperwall-warpstream-clickhouse-cloud-monetization&amp;origPath=%2Fcompany%2Fevents%2F202411-amer-microsoft-ignite">how Cloudflare processed 6 million requests per second</a> — gave Brian confidence it could handle web-scale workloads. 
“I was like, ‘If their Kafka-plus-ClickHouse stack can do that, it’ll be enough for our piddly little SDK with, like, one customer,’” he jokes.</p><p>ClickHouse’s <a href="https://clickhouse.com/docs/materialized-views">materialized views</a>, he says, were like “magic.” Instead of writing custom jobs to update counters or roll up metrics, the team could lean on ClickHouse to pre-aggregate in real time. “When I realized I could do that with SQL, I was like, oh, this is so much easier.”</p><p><a href="https://clickhouse.com/docs/data-compression/compression-in-clickhouse">Compression</a> was another unexpected win, allowing the team to store huge amounts of data without jacking up costs. “I was shocked by the compression ratios we were able to achieve,” Brian says. “You don’t get that for free in any other database systems.”</p><p>The initial production stack was simple: Kafka up front, ClickHouse on a single EC2 instance behind it. “Once we had those two things working,” Brian says, “we started writing queries against it and had everything we needed to deliver a super simple MVP that was more real-time than anyone else in the space.”</p><h3>Evolving With WarpStream and ClickHouse Cloud</h3><p>Within a few months, however, the limits of that early stack started to show. With ClickHouse running on EC2, disk space was a constant headache. “I’d find myself over and over trying to resize the instance,” Brian recalls. “It became this background task I was always thinking about.”</p><p>They turned to a third-party ClickHouse consulting company, which solved some problems but introduced others, including cost. 
“When I actually did the math about what we were paying between compute, backups, the managed service, and S3 usage in our own AWS account, it got really, really expensive,” he says.</p><p>As they looked for a better long-term foundation, <a href="https://docs.google.com/document/u/0/d/1VZYKrf8OQnYLIywxa1mRR4F7MWSq8O-MBLvux72ZtK0/edit">ClickHouse Cloud</a> became the logical next step for analytics. Its <a href="https://clickhouse.com/docs/guides/separation-storage-compute">separation of storage and compute</a> promised to eliminate the endless disk resizing, while <a href="https://clickhouse.com/docs/cloud/reference/shared-merge-tree">SharedMergeTree</a> offered elastic scaling that could flex with demand. Just as important, the pricing model aligned with their usage. “Being able to just charge my credit card and have more storage was pretty compelling,” Brian says.</p><p>On the streaming side, <a href="https://www.warpstream.com/">WarpStream</a> (which was <a href="https://www.confluent.io/blog/confluent-acquires-warpstream/">acquired by Confluent</a> in September 2024) emerged as the answer. It preserved Kafka compatibility — meaning no rewrites — while introducing a bring-your-own-cloud model that made it cheaper and easier to run at scale.</p><blockquote><em>“</em><strong><em>Unlike trying to run Kafka locally, I ran the little agent demo and it just worked</em></strong><em>,” Brian says. “</em><strong><em>It was so refreshing to have a Kafka-compatible API running out of the box</em></strong><em>.”</em></blockquote><p>Together, ClickHouse Cloud and WarpStream offered Superwall the scalable, cost-effective, ops-light stack it needed to keep up with growth. There was only one missing piece…</p><h3>Ingesting From WarpStream to ClickHouse with ClickPipes</h3><p>Before Superwall could fully migrate to the new ClickHouse-Cloud-plus-WarpStream stack, they needed a reliable way to move data between the two without taking on operational overhead. 
Any break in ingestion risked leaving customer dashboards stale.</p><p>They had previously relied on the Kafka table engine in ClickHouse OSS. However, this approach provided limited monitoring, lacked notifications, and coupled scaling to the ClickHouse server — too many gray areas. “Every time something broke, the question was always, did everything stop? Or did reporting just stop?” Brian says. Running connectors themselves wasn’t appealing either. “I just wasn’t confident we could do it perfectly.”</p><p><a href="https://clickhouse.com/cloud/clickpipes">ClickPipes</a> offered a better, fully managed alternative. By streaming data directly from Kafka-compatible sources like WarpStream into ClickHouse Cloud, it gave Brian and the team both durability and peace of mind. Invalid records, which once threatened to bring pipelines to a halt, now land in a dedicated <a href="https://clickhouse.com/blog/evolution-of-clickpipes#system-tables-centralization">errors table</a> for review. “If something goes wrong, we can still remain available for valid records, and we have an auditable trail for what didn’t make it,” he says.</p><p>Today, ClickPipes reliably moves more than 100 MB of events per second from WarpStream into ClickHouse Cloud, feeding a dataset of more than 300 TB that’s growing by 40 TB each month. Instead of adding another system to build and maintain, it takes the ingestion burden off their plates, while removing the gray areas that used to slow them down. That clarity lets them stay focused on building product features and spinning up new integrations quickly, knowing the underlying pipeline is solid.</p><h3>Confluent + ClickHouse = “Durable, Scalable, Powerful”</h3><p>Even at Superwall’s scale, customer dashboards still load in seconds. 
Developers get the immediate visibility they need to measure the success of pricing experiments or checkout flows without delay.</p><p>For Brian, the strength of the new stack lies in its balance of durability, scalability, and simplicity. Kafka compatibility ensures no data is lost and makes it easy to reprocess streams when new use cases arise. ClickHouse’s materialized views and indexing strategies give them the performance to support complex breakdowns (like analyzing paywall conversion rates across dozens of attributes) without trade-offs. And because the stack is fully managed, the team can focus on product innovation instead of operations.</p><p>The next diagram shows Superwall’s data stack today, powered by WarpStream and ClickHouse Cloud:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kfic8JFqpe_lI-5G.png" /></figure><blockquote><em>“</em><strong><em>It’s totally elastic</em></strong><em>,” Brian says of the new stack. “</em><strong><em>We can increase our volume as much as we want, both on the storage side and on the streaming side. And it’s pretty much ops-free, which is the goal we were trying to get to.</em></strong><em>”</em></blockquote><p>With WarpStream by Confluent and ClickHouse Cloud, what was once a patchwork of managed services and constant disk resizing has become a cohesive architecture that connects streaming and analytics in a single, resilient pipeline. Asked to sum up the stack in three words, Brian pauses for a moment, and says, “Durable, scalable, powerful.”</p><h3>Helping Mobile Apps Monetize Smarter</h3><p>Today, Superwall is starting to layer on new AI-driven features like Demand Score, which rolls countless signals into a single metric predicting a user’s likelihood to subscribe. It’s an early example of how the company plans to bring more intelligence and personalization to developers. 
“There’s a lot more we’ll be doing on top of Confluent and ClickHouse,” Brian says.</p><p>That solid foundation gives them the freedom to focus on what matters: <strong>building product.</strong></p><blockquote><em>“We wanted to make our systems as horizontally scalable, simple, and operationally non-intensive as possible,” Brian says. “Because for us to be successful, we don’t have to be 10x Kafka engineers or 10x ClickHouse engineers — we need to make a great product.”</em></blockquote><p>Superwall’s ambitions are big, but the team remains grounded. “We’re a small team and we do the best we can,” Brian says. “But we definitely rely on great vendors and partners.” With Confluent and ClickHouse working side by side, they’re ready to keep building, iterating, and helping mobile subscription apps optimize monetization.</p><p>Ready to take your data stack to the next level? Get started with a free trial of <a href="https://clickhouse.com/cloud">ClickHouse Cloud</a> and <a href="https://console.warpstream.com/signup">WarpStream by Confluent</a> today.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ff525b3e5153" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Case for an Iceberg-Native Database: Why Spark Jobs and Zero-Copy Kafka Won’t Cut It]]></title>
            <link>https://medium.com/@warpstream/the-case-for-an-iceberg-native-database-why-spark-jobs-and-zero-copy-kafka-wont-cut-it-e82f21989e5a?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/e82f21989e5a</guid>
            <category><![CDATA[data-lake]]></category>
            <category><![CDATA[apache-iceberg]]></category>
            <category><![CDATA[apache-spark]]></category>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Tue, 30 Sep 2025 13:47:51 GMT</pubDate>
            <atom:updated>2025-09-30T13:47:51.584Z</atom:updated>
            <content:encoded><![CDATA[<p>By Richard Artoul, Co-Founder, WarpStream</p><p><strong>TLDR: </strong>We launched a new product called <strong>WarpStream Tableflow</strong> that is the easiest, cheapest, and most flexible way to convert Kafka topic data into Iceberg tables with low latency, and keep them compacted. If you’re already familiar with the challenges of converting Kafka topics into Iceberg tables, feel free to skip ahead to our solution in the “What if We Had a Magic Box?” section.</p><p>Apache Iceberg and Delta Lake are table formats that provide the illusion of a traditional database table on top of object storage, including schema evolution, concurrency control, and partitioning that is transparent to the user. These table formats allow many open-source and proprietary query engines and data warehouse systems to operate on the same underlying data, which prevents vendor lock-in and allows using best-of-breed tools for different workloads without making additional copies of that data that are expensive and hard to govern.</p><p>Table formats are really cool, but they’re just that, formats. Something or someone has to actually build and maintain them. 
As a result, one of the most debated topics in the data infrastructure space right now is the best way to build Iceberg and Delta Lake tables from real-time data stored in Kafka.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*acMXhpaCNnHaINA7.png" /></figure><h3>The Problem With Apache Spark</h3><p>The canonical solution to this problem is to use Spark batch jobs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0oAwWVbPLlnjmVqT.png" /></figure><p>This is how things have been done historically, and it’s not a terrible solution, but there are a few problems with it:</p><ol><li>You have to write a lot of finicky code to do the transformation, handle schema migrations, etc.</li><li>Latency between data landing in Kafka and the Iceberg table being updated is very high, usually hours or days depending on how frequently the batch job runs if compaction is not enabled (more on that shortly). This is annoying if we’ve already gone through all the effort of setting up real-time infrastructure like Kafka.</li><li>Apache Spark is an incredibly powerful but complex piece of technology. For companies that are already heavy users of Spark, this is not a problem, but for companies that just want to land some events into a data lake, learning to scale, tune, and manage Spark is a huge undertaking.</li></ol><p>Problems 1 and 3 can’t be solved with Spark, but we might be able to solve problem 2 (table update delay) by using Spark Streaming and micro-batch processing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NGay3v_XjeLTk5DJ.png" /></figure><p>Well, not quite. It’s true that if you use Spark Streaming to run smaller micro-batch jobs, your Iceberg table will be updated much more frequently. 
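To get a feel for the commit rate involved, here is a quick back-of-the-envelope sketch (the commit interval and partition count are hypothetical, chosen just to illustrate the growth rate):

```python
# Back-of-the-envelope (hypothetical numbers): how fast small files pile up
# when a streaming job commits to an Iceberg table every 30 seconds.
SECONDS_PER_DAY = 24 * 60 * 60
commit_interval_s = 30
table_partitions = 16  # e.g. partitioned by hour plus one other column

commits_per_day = SECONDS_PER_DAY // commit_interval_s
# Worst case: every commit writes at least one file into every partition.
files_per_day = commits_per_day * table_partitions

print(commits_per_day, files_per_day)  # 2880 commits -> 46080 new files per day
```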
However, now you have two <strong>new</strong> problems in addition to the ones you already had:</p><ol><li>Small file problem</li><li>Single writer problem</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lka_VNtUaUAz-iPy.png" /></figure><p>Anyone who has ever built a data lake is familiar with the small files problem: the more often you write to the data lake, the faster it will accumulate files, and the longer your queries will take until eventually they become so expensive and slow that they stop working altogether.</p><p>That’s OK though, because there is a well known solution: more Spark!</p><p>We can create a new Spark batch job that periodically runs compactions that take all of the small files that were created by the Spark Streaming job and merges them together into bigger files:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IrKcalR2jw75p6sd.png" /></figure><p>The compaction job solves the small file problem, but it introduces a new one. Iceberg tables suffer from an issue known as the “single writer problem” which is that only one process can mutate the table concurrently. If two processes try to mutate the table at the same time, one of them will fail and have to redo a bunch of work [1].</p><p>This means that your ingestion process and compaction processes are racing with each other, and if either of them runs too frequently relative to the other, the conflict rate will spike and the overall throughput of the system will come crashing down.</p><p>Of course, there is a solution to this problem: run compaction infrequently (say once a day), and with coarse granularity. That works, but it introduces two new problems:</p><ol><li>If compaction only runs once every 24 hours, the query latency at hour 23 will be significantly worse than at hour 1.</li><li>The compaction job needs to process all of the data that was ingested in the last 24 hours in a short period of time. 
For example, if you want to bound your compaction job’s run time at 1 hour, then it will require ~24x as much compute for that one hour period as your entire ingestion workload [2]. Provisioning 24x as much compute once a day is technically feasible in modern cloud environments, but it’s operationally difficult and annoying.</li></ol><p>Exhausted yet? Well, we’re still not done. Every Iceberg table modification results in a new snapshot being created. Over time, these snapshots will accumulate (costing you money) and eventually the metadata JSON file will get so large that the table becomes un-queryable. So in addition to compaction, you need another periodic background job to prune old snapshots.</p><p>Also, sometimes your ingestion or compaction jobs will fail, and you’ll have orphan parquet files stuck in your object storage bucket that don’t belong to any snapshot. So you’ll need yet another periodic background job to scan the bucket for orphan files and delete them.</p><p>It feels like we’re playing a never-ending game of whack-a-mole where every time we try to solve one problem, we end up introducing two more. Well, there’s a reason for that: the Iceberg and Delta Lake specifications are just that, specifications. They are <strong>not</strong> implementations.</p><p>Imagine I gave you the specification for how PostgreSQL lays out its B-trees on disk and some libraries that could manipulate those B-trees. Would you feel confident building and deploying a PostgreSQL-compatible database to power your company’s most critical applications? Probably not, because you’d still have to figure out: concurrency control, connection pool management, transactions, isolation levels, locking, MVCC, schema modifications, and the million other things that a modern transactional database does besides just arranging bits on disk.</p><p>The same analogy applies to data lakes. 
Spark provides a small toolkit for manipulating parquet and Iceberg manifest files, but what users actually want is 50% of the functionality of a modern data warehouse. The gap between what Spark actually provides out of the box, and what users need to be successful, is a chasm.</p><p>When we look at things through this lens, it’s no longer surprising that all of this is so hard. Saying: “I’m going to use Spark to create a modern data lake for my company” is practically equivalent to announcing: “I’m going to create a bespoke database for every single one of my company’s data pipelines”. No one would ever expect that to be easy. Databases are hard.</p><p>Most people want nothing to do with managing any of this infrastructure. They just want to be able to emit events from one application and have those events show up in their Iceberg tables within a reasonable amount of time. That’s it.</p><p>It’s a simple enough problem statement, but the unfortunate reality is that solving it to a satisfactory degree requires building and running half of the functionality of a modern database.</p><p>It’s no small undertaking! I would know. My co-founder and I (along with some other folks at WarpStream) have <a href="https://www.datadoghq.com/blog/engineering/introducing-husky/">done all of this before</a>.</p><h3>Can I Just Use Kafka Please?</h3><p><strong>TLDR:</strong> If you don’t care about buzzwords like “Iceberg topics”, “Zero Copy Iceberg”, and “Diskless Kafka” feel free to skip this section and jump directly to the <a href="https://www.warpstream.com/blog/the-case-for-an-iceberg-native-database-why-spark-jobs-and-zero-copy-kafka-wont-cut-it#a-better-way-what-if-we-just-had-a-magic-box">“What if We Had a Magic Box?”</a> section.</p><p>Hopefully by now you can see why people have been looking for a better solution to this problem. 
Many different approaches have been tried, but one that has been gaining traction recently is to have Kafka itself (and its various protocol-compatible implementations) build the Iceberg tables for you.</p><p>The thought process goes like this: Kafka (and many other Kafka-compatible implementations) already have tiered storage for historical topic data. Once records / log segments are old enough, Kafka can tier them off to object storage to reduce disk usage and costs for data that is infrequently consumed.</p><p>Why not “just” have the tiered log segments be parquet files instead, then add a little metadata magic on top and <strong>voila</strong>, we now have a “zero-copy” streaming data lake where we only have to maintain one copy of the data to serve both Kafka consumers and Iceberg queries, and we didn’t even have to learn anything about Spark!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*E2GsIHCKxhJhDzYt.png" /></figure><p>Problem solved: we can all just switch to a Kafka implementation that supports this feature, modify a few topic configs, and rest easy that our colleagues will be able to derive insights from our real-time Iceberg tables using the query engine of their choice.</p><p>Of course, that’s not actually true in practice. This is the WarpStream blog after all, so dedicated readers will know that the last 4 paragraphs were just an elaborate axe sharpening exercise for my real point, which is this: none of this works, and it will never work.</p><p>I know what you’re thinking: “Richie, you say everything doesn’t work. Didn’t you write like a 10 page rant about how tiered storage in Kafka doesn’t work?” <a href="https://www.warpstream.com/blog/tiered-storage-wont-fix-kafka">Yes, I did.</a></p><p>I will admit, I am extremely biased against tiered storage in Kafka. It’s an idea that sounds great in theory, but falls flat on its face in most practical implementations. 
Maybe I am a little jaded because a non-trivial percentage of all migrations to WarpStream get (temporarily) stalled at some point when the customer tries to actually copy the historical data out of their Kafka cluster into WarpStream and loading the historical data from tiered storage degrades their Kafka cluster.</p><p>But that’s exactly my point: I have seen tiered storage fail at serving historical reads <em>in the real world</em>, time and time again.</p><p>I won’t repeat the (numerous) problems associated with tiered storage in Apache Kafka and most vendor implementations in this blog post, but I will (predictably) point out that changing the tiered storage format fixes none of those problems, makes some of them worse, and results in a sub-par Iceberg experience to boot.</p><h3>Iceberg Makes Existing (Already Bad) Tiered Storage Implementations Worse</h3><p>Let’s start with how the Iceberg format makes existing tiered storage implementations that already perform poorly perform even worse. First off, generating parquet files is <em>expensive</em>. Like really expensive. Compared to copying a log segment from the local disk to object storage, it uses <em>at least</em> an order of magnitude more CPU cycles and significant amounts of memory.</p><p>That would be fine if this operation were running on a random stateless compute node, but it’s not: it’s running on one of the incredibly important Kafka brokers that is the leader for some of the topic-partitions in your cluster. 
This is the <strong>worst possible place</strong> to perform computationally expensive operations like generating parquet files.</p><p>To make matters worse, loading the tiered data from object storage to serve historical Kafka consumers (the primary performance issue with tiered storage) becomes <strong>even more</strong> operationally difficult and expensive because now the Parquet files have to be decoded and converted back into the Kafka record batch format, once again, in <strong>the worst possible place</strong> to perform computationally expensive operations: the Kafka broker responsible for serving the producers and consumers that power your real-time workloads.</p><p>This approach works in prototypes and technical demos, but it will become an operational and performance nightmare for anyone who takes it into production at any kind of meaningful scale. Or you’ll just have to massively over-provision your Kafka cluster, which essentially amounts to throwing an incredible amount of money at the problem and hoping for the best.</p><h3>Tiered Storage Makes Sad Iceberg Tables</h3><p>Let’s say you don’t believe me about the performance issues with tiered storage. That’s fine, because it doesn’t really matter anyway. The point of using Iceberg as the tiered storage format for Apache Kafka would be to generate a real-time Iceberg table that can be used for something. Unfortunately, tiered storage doesn’t give you Iceberg tables that are actually useful.</p><p>If the Iceberg table is generated by Kafka’s tiered storage system, then the partitioning of the Iceberg table <strong>has</strong> to match the partitioning of the Kafka topic. This is extremely annoying for all of the obvious reasons. 
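To make that concrete, here is a toy model (hypothetical numbers; a few lines of plain Python stand in for a query planner) of how topic-aligned files defeat partition pruning:

```python
# Toy model (hypothetical numbers, not a real query planner) of the
# partitioning mismatch: the topic is partitioned by hash(user_id) for
# operational load-balancing, but the analytical query filters on country.
import random

random.seed(7)
NUM_TOPIC_PARTITIONS = 64
COUNTRIES = ["US", "DE", "JP", "BR"]
records = [(uid, random.choice(COUNTRIES)) for uid in range(100_000)]

# Layout 1: Iceberg files mirror the Kafka topic (one file per partition).
# Every country gets smeared across every file.
kafka_aligned = [set() for _ in range(NUM_TOPIC_PARTITIONS)]
for uid, country in records:
    kafka_aligned[uid % NUM_TOPIC_PARTITIONS].add(country)

# Layout 2: Iceberg files partitioned by country for analytics.
country_partitioned = {c: [f"{c}.parquet"] for c in COUNTRIES}

# Query: "all events where country = 'DE'" -- only files that can contain
# a matching row have to be opened.
scanned_aligned = sum(1 for countries in kafka_aligned if "DE" in countries)
scanned_analytic = len(country_partitioned["DE"])

print(scanned_aligned, scanned_analytic)  # 64 files scanned vs 1
```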
Your Kafka partitioning strategy is selected for <strong>operational</strong> use-cases, but your Iceberg partitioning strategy should be selected for <strong>analytical</strong> use-cases.</p><p>There is a natural impedance mismatch here that will constantly get in your way. Optimal query performance is always going to come from partitioning and sorting your data to get the best pruning of files on the Iceberg side, but this is <strong>impossible</strong> if the same set of files must also be capable of serving as tiered storage for Kafka consumers as well.</p><p>There is an obvious way to solve this problem: store two copies of the tiered data, one for serving Kafka consumers, and the other optimized for Iceberg queries. This is a great idea, and it’s how every modern data system that is capable of serving both operational and analytic workloads at scale is designed.</p><p>But if you’re going to store two different copies of the data, there’s no point in conflating the two use-cases at all. The only benefit you get is perceived convenience, but you will pay for it dearly down the line in unending operational and performance problems.</p><p>In summary, the idea of a “zero-copy” Iceberg implementation running inside of production Kafka clusters is a pipe dream. It would be much better to just let Kafka be Kafka and Iceberg be Iceberg.</p><h3>I’m Not Even Going to Talk About Compaction</h3><p>Remember the small file problem from the Spark section? Unfortunately, the small file problem doesn’t just magically disappear if we shove parquet file generation into our Kafka brokers. We still need to perform table maintenance and file compaction to keep the tables queryable.</p><p>This is a hard problem to solve in Spark, but it’s an even harder problem to solve when the maintenance and compaction work has to be performed in the same nodes powering your Kafka cluster. 
The reason for that is simple: Spark is a stateless compute layer that can be spun up and down at will.</p><p>When you need to run your daily major compaction session on your Iceberg table with Spark, you can literally cobble together a Spark cluster on-demand from whatever mixed-bag, spare-part virtual machines happen to be lying around your multi-tenant Kubernetes cluster at the moment. You can even use spot instances, it’s all stateless, it just doesn’t matter!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2r8zYvYC3840zIJX.png" /><figcaption>The VMs powering your Spark cluster. Probably.</figcaption></figure><p>No matter how much compaction you need to run, or how compute intensive it is, or how long it takes, it will never in a million years impair the performance or availability of your real-time Kafka workloads.</p><p>Contrast that with your <strong>pristine</strong> Kafka cluster that has been <strong>carefully provisioned</strong> to run on <strong>high end</strong> VMs with tons of <strong>spare RAM</strong> and <strong>expensive SSDs/EBS volumes</strong>. Resizing the cluster takes <strong>hours</strong>, maybe even <strong>days</strong>. If the cluster goes down, you immediately start incurring data loss in your business. <strong>THAT’S where you want to spend precious CPU cycles and RAM smashing Parquet files together!?</strong></p><p>It just doesn’t make any sense.</p><h3>What About Diskless Kafka Implementations?</h3><p>“Diskless” Kafka implementations like WarpStream are in a slightly better position to just build the Iceberg functionality directly into the Kafka brokers because they separate storage from compute which makes the compute itself more fungible.</p><p>However, I still think this is a bad idea, primarily because building and compacting Iceberg files is an incredibly expensive operation compared to just shuffling bytes around like Kafka normally does. 
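Here is a crude, stdlib-only sketch of that asymmetry (a toy model with made-up records; real parquet encoding is considerably more involved): tiering a segment is one byte copy, while columnar encoding has to decode every record and compress every column.

```python
# Crude, stdlib-only illustration (a toy model, not real parquet): tiering a
# log segment is a single byte copy, while building a columnar file means
# decoding every record, pivoting rows into columns, and compressing each one.
import json
import zlib

segment = b"".join(
    json.dumps({"user_id": i % 100, "event": "click", "ts": 1_700_000_000 + i}).encode() + b"\n"
    for i in range(10_000)
)

# Path 1: plain tiered storage -- one copy, no per-record work.
tiered = bytes(segment)

# Path 2: columnar encoding -- parse all 10,000 records, then compress per column.
rows = [json.loads(line) for line in segment.splitlines()]
columns = {name: [row[name] for row in rows] for name in rows[0]}
columnar = {name: zlib.compress(json.dumps(values).encode()) for name, values in columns.items()}
columnar_size = sum(len(blob) for blob in columnar.values())

# The columnar file ends up much smaller, but producing it burned CPU on
# every single record and field instead of doing one memcpy.
print(len(tiered), columnar_size)
```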
In addition, the cost and memory required to build and maintain Iceberg tables are highly variable with the schema itself. A small schema change to add a few extra columns to the Iceberg table could easily result in the load on your Kafka cluster increasing by more than 10x. That would be disastrous if that Kafka cluster, diskless or not, is being used to serve live production traffic for critical applications.</p><p>Finally, all of the existing Kafka implementations that do support this functionality inevitably end up tying the partitioning of the Iceberg tables to the partitioning of the Kafka topics themselves, which results in sad Iceberg tables as we described earlier. Either that, or they leave out the issue of table maintenance and compaction altogether.</p><h3>A Better Way: What If We Just Had a Magic Box?</h3><p>Look, I get it. Creating Iceberg tables with any kind of reasonable latency guarantees is really hard and annoying. Tiered storage and diskless architectures like WarpStream and Freight are all the rage in the Kafka ecosystem right now. If Kafka is already moving towards storing its data in object storage anyway, can’t we all just play nice, massage the log segments into parquet files somehow (waves hands), and just live happily ever after?</p><p>I get it, I really do. The idea is obvious, irresistible even. We all crave simplicity in our systems. That’s why this idea has taken root so quickly in the community, and why so many vendors have rushed poorly conceived implementations out the door. 
But as I explained in the previous section, it’s a bad idea, and there is a <strong>much </strong>better way.</p><p>What if instead of all of this tiered storage insanity, we had, and please bear with me for a moment, a <strong>magic box</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OfXiGtp2cgPw1zYd.png" /><figcaption>Behold, the humble magic box.</figcaption></figure><p>Instead of looking inside the magic box, let’s first talk about what the magic box <em>does</em>. The magic box knows how to do only <strong>one thing</strong>: it reads from Kafka, builds Iceberg tables, and keeps them compacted. Ok that’s three things, but I fit them into a short sentence so it still counts.</p><p>That’s all this box does and ever strives to do. If we had a magic box like this, then all of our Kafka and Iceberg problems would be solved because we could just do this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dqcW1aawyo8drGKr.png" /></figure><p>And life would be beautiful.</p><p>Again, I know what you’re thinking: “It’s Spark isn’t it? You put Spark in the box!?”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SEQiWPrHPit_8QnV.png" /><figcaption>What’s in the box?!</figcaption></figure><p>That would be one way to do it. 
You could write an elaborate set of Spark programs that all interacted with each other to integrate with schema registries, carefully handle schema migrations, DLQ invalid records, handle upserts, solve the concurrent writer problem, gracefully schedule incremental compactions, and even auto-scale to boot.</p><p>And it would work.</p><p>But it would not be a magic box.</p><p>It would be Spark in a box, and Spark’s sharp edges would always find a way to poke holes in our beautiful box.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Il_g9HI3C1yKmLrO.png" /><figcaption>I promise you wouldn’t like the contents of this box.</figcaption></figure><p>That wouldn’t be a problem if you were building this box to run as a SaaS service in a pristine environment operated by the experts who built the box. But that’s not a box that you would ever want to deploy and run yourself.</p><p>Spark is a garage full of tools. You can carefully arrange the tools in a garage into an elaborate rube goldberg machine that with sufficient and frequent human intervention periodically spits out widgets of varying quality.</p><p>But that’s not what we need. What we need is an Iceberg assembly line. A coherent, custom-built, well-oiled machine that does <strong>nothing</strong> but make Iceberg, day in and day out, with ruthless efficiency and without human supervision or intervention. Kafka goes in, Iceberg comes out.</p><p><strong>THAT </strong>would be a magic box that you could deploy into your own environment and run yourself.</p><p>It’s a matter of packaging.</p><h3>We Built the Magic Box (Kind Of)</h3><p>You’re on the WarpStream blog, so this is the part where I tell you that we built the magic box. It’s called <a href="https://www.confluent.io/product/tableflow/">Tableflow</a>, and it’s not a new idea. In fact, Confluent Cloud users have been able to enjoy Tableflow as a fully managed service for over 6 months now, and they love it. 
It’s cost effective, efficient, and tightly integrated with Confluent Cloud’s entire ecosystem, including Flink.</p><p>However, there’s one problem with Confluent Cloud Tableflow: it’s a fully managed service that runs in Confluent Cloud, and therefore it doesn’t work with WarpStream’s BYOC deployment model. We realized that we needed a BYOC version of Tableflow, so that all of Confluent’s WarpStream users could get the same benefits of Tableflow, but in their own cloud account with a BYOC deployment model.</p><p>So that’s what we built!</p><p><strong>WarpStream Tableflow</strong> (henceforth referred to as just Tableflow in this blog post) is to Iceberg-generating Spark pipelines what WarpStream is to Apache Kafka.</p><p>It’s a magic, auto-scaling, completely stateless, single-binary database that runs in your environment, connects to your Kafka cluster (whether it’s Apache Kafka, WarpStream, AWS MSK, Confluent Platform, or any other Kafka-compatible implementation) and manufactures Iceberg tables to your exacting specification using a declarative YAML configuration.</p><pre>source_clusters:<br> - name: &quot;benchmark&quot;<br>   credentials:<br>     sasl_username_env: &quot;YOUR_SASL_USERNAME&quot;<br>     sasl_password_env: &quot;YOUR_SASL_PASSWORD&quot;<br>   bootstrap_brokers:<br>     - hostname: &quot;your-kafka-brokers.example.com&quot;<br>       port: 9092<br><br>tables:<br>   - source_cluster_name: &quot;benchmark&quot;<br>     source_topic: &quot;example_json_logs_topic&quot;<br>     source_format: &quot;json&quot;<br>     schema_mode: &quot;inline&quot;<br>     schema:<br>       fields:<br>         - { name: environment, type: string, id: 1}<br>         - { name: service, type: string, id: 2}<br>         - { name: status, type: string, id: 3}<br>         - { name: message, type: string, id: 4}<br>   - source_cluster_name: &quot;benchmark&quot;<br>     source_topic: &quot;example_avro_events_topic&quot;<br>     source_format: 
&quot;avro&quot;<br>     schema_mode: &quot;inline&quot;<br>     schema:<br>       fields:<br>           - { name: event_id, id: 1, type: string }<br>           - { name: user_id, id: 2, type: long }<br>           - { name: session_id, id: 3, type: string }<br>           - name: profile<br>             id: 4<br>             type: struct<br>             fields:<br>               - { name: country, id: 5, type: string }<br>               - { name: language, id: 6, type: string }</pre><p>Tableflow automates <strong>all</strong> of the annoying parts about generating and maintaining Iceberg tables:</p><ol><li>It auto-scales.</li><li>It integrates with schema registries or lets you declare the schemas inline.</li><li>It has a DLQ.</li><li>It handles upserts.</li><li>It enforces retention policies.</li><li>It can perform stateless transformations as records are ingested.</li><li>It keeps the table compacted, and it does so continuously and incrementally without having to run a giant major compaction at regular intervals.</li><li>It cleans up old snapshots automatically.</li><li>It detects and cleans up orphaned files that were created as part of failed inserts or compactions.</li><li>It can ingest data at massive rates (GiB/s) while also maintaining strict (and configurable) freshness guarantees.</li><li>It speaks multiple table formats (yes, Delta Lake too).</li><li>It works exactly the same in every cloud.</li></ol><p>Unfortunately, Tableflow can’t actually do <em>all</em> of these things yet. But it can do a lot of them, and the missing gaps will all be filled in shortly.</p><p>How does it work? Well, that’s the subject of our next blog post. 
But to summarize: we built a custom, BYOC-native and cloud-native database whose only function is the efficient creation and maintenance of streaming data lakes.</p><p>More on the technical details in our next post, but if this interests you, please check out <a href="https://docs.warpstream.com/warpstream/byoc/tableflow">our documentation</a>, and <a href="https://www.warpstream.com/contact-us">contact us</a> to get admitted to our early access program. You can also subscribe to <a href="https://www.warpstream.com/contact-us#subscribe">our newsletter</a> to make sure you’re notified when we publish our next post in this series with all the gory technical details.</p><h3>Footnotes</h3><ol><li>This whole problem could have been avoided if the Iceberg specification defined an RPC interface for a metadata service instead of a static metadata file format, but I digress.</li><li>This isn’t 100% true because compaction is usually more efficient than ingestion, but it’s directionally true.</li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e82f21989e5a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Road to 100PiBs and Hundreds of Thousands of Partitions: Goldsky Case Study]]></title>
            <link>https://medium.com/@warpstream/the-road-to-100pibs-and-hundreds-of-thousands-of-partitions-goldsky-case-study-fd381fe6e364?source=rss-1868315cb26f------2</link>
            <guid isPermaLink="false">https://medium.com/p/fd381fe6e364</guid>
            <category><![CDATA[cryptocurrency]]></category>
            <category><![CDATA[apache-kafka]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[tiered-storage]]></category>
            <category><![CDATA[aws-s3]]></category>
            <dc:creator><![CDATA[WarpStream Labs]]></dc:creator>
            <pubDate>Mon, 29 Sep 2025 14:02:00 GMT</pubDate>
            <atom:updated>2025-09-29T14:02:00.452Z</atom:updated>
            <content:encoded><![CDATA[<p>By Richard Artoul, Co-Founder, WarpStream</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hjc34XWPp6o-ccMT_nh4lQ.png" /></figure><h3>Goldsky</h3><p>Goldsky is a <strong>Web3 developer platform</strong> focused on providing <strong>real-time and historical blockchain data</strong> access. They enable developers to index and stream data from smart contracts and entire chains via <strong>API endpoints</strong>, <strong>Subgraphs</strong>, and <strong>streaming pipelines</strong>, making it easier to build dApps without managing complex infrastructure.</p><p>Goldsky has two primary products: subgraph and mirror. Subgraph enables customers to index and query blockchain data in real-time. Mirror is a solution for CDCing blockchain data directly into customer-owned databases so that application data and blockchain data can coexist in the same data stores and be queried together.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*v3yjh3Iv18jAyHOJ.png" /></figure><h3>Use Case</h3><p>As a real-time data streaming company, Goldsky’s engineering stack naturally was built on-top of Apache Kafka. Goldsky uses Kafka as the storage and transport layer for all of the blockchain data that powers their product.</p><p>But Goldsky doesn’t use Apache Kafka like most teams do. They treat Kafka as a permanent, historical record of all of the blockchain data that they’ve ever processed. That means that almost all of Goldsky’s topics are configured with infinite retention and topic compaction enabled. In addition, they track many different sources of blockchain data and as a result, have many topics in their cluster (7,000+ topics and 49,000+ partitions before replication).</p><p>In other words, Goldsky uses Kafka as a database and they push it <em>hard</em>. They run more than 200 processing jobs and 10,000 consumer clients against their cluster 24/7 and that number grows every week. 
Unsurprisingly, when we first met Goldsky they were struggling to scale their growing Kafka cluster. They tried multiple different vendors (incurring significant migration effort each time), but ran into both cost and reliability issues with each solution.</p><p>“One of our customers churned when our previous Kafka provider went down right when that customer was giving an investor demo. That was our biggest contract at the time. The dashboard with the number of consumers and kafka error rate was the first thing I used to check every morning.”</p><p><strong>– Jeffrey Ling, Goldsky CTO</strong></p><p>In addition to all of the reliability problems they were facing, Goldsky was also struggling with costs.</p><p>The first source of huge costs for them was their partition counts. Most Kafka vendors charge for partitions. For example, in AWS MSK an express.m7g.4xl broker is recommended to host no more than 6,000 partitions (including replication). That means that Goldsky’s workload would require (41,000 × 3 (replication factor)) / 6,000 ≈ <strong>21 express.m7g.4xl brokers</strong> just to support the number of partitions in their cluster. At $3.264/broker/hr, their cluster <strong>would cost more than half a million dollars a year with zero traffic and zero storage!</strong></p><p>It’s not just the partition counts in this cluster that make it expensive though. The sheer amount of data being stored due to all the topics having infinite retention also makes it very expensive.</p><p>The canonical solution to high storage costs in Apache Kafka is to use tiered storage, and that’s what Goldsky did. 
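</p><p>To make the partition-driven broker math above concrete, here is a quick sketch using the figures quoted in this post (the MSK sizing guidance and pricing are as cited above, not necessarily current):</p>

```python
import math

# Figures as quoted in the post (illustrative, not current AWS pricing).
partitions = 41_000            # partitions before replication
replication_factor = 3
partitions_per_broker = 6_000  # recommended max per express.m7g.4xl broker
price_per_broker_hour = 3.264  # USD per broker-hour

brokers = math.ceil(partitions * replication_factor / partitions_per_broker)
annual_cost = brokers * price_per_broker_hour * 24 * 365

print(brokers)             # 21 brokers just to hold the partitions
print(round(annual_cost))  # ≈ $600k/year before any traffic or storage
```

<p>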
They enabled tiered storage in all the different Kafka vendors they used, but they found that while the tiered storage implementations did reduce their storage costs, they also dramatically reduced the performance of reading historical data and caused significant reliability problems.</p><p>In some cases the poor performance of consuming historical data would lock up the entire cluster and prevent them from producing/consuming recent data. This problem was so acute that the number one obstacle for Goldsky’s migration to WarpStream was the fact that they had to severely throttle the migration to avoid overloading their previous Kafka cluster’s tiered storage implementations when copying their historical data into WarpStream.</p><h3>WarpStream Migration</h3><p>Goldsky saw a number of cost and performance benefits with their migration to WarpStream. Their total cost of ownership (TCO) with WarpStream is less than 1/10th of what their TCO was with their previous vendor. Some of these savings came from WarpStream’s reduced licensing costs, but most of them came from WarpStream’s diskless architecture reducing their infrastructure costs.</p><p>“We ended up picking WarpStream because we felt that it was the perfect architecture for this specific use case, and was much more cost effective for us due to less data transfer costs, and horizontally scalable Agents.”</p><p><strong>– Jeffrey Ling, Goldsky CTO</strong></p><h3>Decoupling of Hardware from Partition Counts</h3><p>With their previous vendor, Goldsky had to constantly scale their cluster (vertically or horizontally) as the partition count of the cluster naturally increased over time. With WarpStream, there is no relationship between the number of partitions and the hardware requirements of the cluster. 
Now Goldsky’s WarpStream cluster auto-scales based solely on the write and read throughput of their workloads.</p><h3>Tiered Storage That Actually Works</h3><p>Goldsky had a lot of problems with the implementation of tiered storage with their previous vendor. The solution to these performance problems was to over-provision their cluster and hope for the best. With WarpStream these performance problems have disappeared completely and the performance of their workloads remains constant regardless of whether they’re consuming live or historical data.</p><p>This is particularly important to Goldsky because the nature of their product means that customers taking a variety of actions can trigger processing jobs to spin up and begin reading large topics in their entirety. These new workloads can spin up at any moment (even in the middle of the night), and with their previous vendor they had to ensure their cluster was always over-provisioned. With WarpStream, Goldsky can just let the cluster scale itself up automatically in the middle of the night to accommodate the new workload, and then let it scale itself back down when the workload is complete.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qpZnrUwWMN-HObbh.png" /><figcaption>Goldsky’s cluster auto-scaling in response to changes in write / read throughput.</figcaption></figure><h3>Zero Networking Costs</h3><p>With their previous solution, Goldsky incurred significant costs just from networking due to the traffic between clients and brokers. This problem was particularly acute due to the 4x consumer fanout of their workload. With WarpStream, their networking costs dropped to zero as WarpStream zonally aligns traffic between producers, consumers, and the WarpStream Agents. 
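</p><p>To see how big a line item this can be, here is a back-of-the-envelope model of cross-AZ transfer costs for a classic three-AZ Kafka deployment. The ~$0.02/GB combined cross-AZ rate and the example throughput are assumptions for illustration, not Goldsky’s actual numbers; the 4x consumer fanout matches the workload described above:</p>

```python
# Toy model of inter-AZ networking costs for a classic 3-AZ Kafka cluster.
# Rates and throughput are illustrative assumptions.
COST_PER_GB = 0.02         # ~$0.01 egress + $0.01 ingress across AZs on AWS
CROSS_AZ_FRACTION = 2 / 3  # with 3 AZs, ~2/3 of client<->broker hops cross zones

def annual_network_cost(write_gb_per_day: float, consumer_fanout: int,
                        replication_factor: int = 3) -> float:
    produce = write_gb_per_day * CROSS_AZ_FRACTION           # producers -> leaders
    replicate = write_gb_per_day * (replication_factor - 1)  # leaders -> followers
    consume = write_gb_per_day * consumer_fanout * CROSS_AZ_FRACTION
    return (produce + replicate + consume) * COST_PER_GB * 365

# e.g. 10 TiB/day of writes with a 4x consumer fanout:
print(round(annual_network_cost(10_240, consumer_fanout=4)))  # ≈ $0.4M/year
```

<p>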
In addition, WarpStream relies on object storage for replication across zones, which does not incur any networking costs either.</p><p>In addition to the reduced TCO, Goldsky also gained dramatically better reliability with WarpStream.</p><p>“The sort of stuff we put our WarpStream cluster through wasn’t even an option with our previous solution. It kept crashing due to the scale of our data, specifically the amount of data we wanted to store. WarpStream just worked. I used to be able to tell you exactly how many consumers we had running at any given moment. We tracked it on a giant health dashboard because if it got too high, the whole system would come crashing down. Today I don’t even keep track of how many consumers are running anymore.”</p><p><strong>– Jeffrey Ling, Goldsky CTO</strong></p><h3>Continued Growth</h3><p>Goldsky was one of WarpStream’s earliest customers. When we first met with them their Kafka cluster had &lt;10,000 partitions and 20 TiB of data in tiered storage. <strong>Today their WarpStream cluster has 3.7 PiB of stored data, 11,000+ topics, 41,000+ partitions, and is still growing at an impressive rate!</strong></p><p>When we first onboarded Goldsky, we told them we were confident that WarpStream would scale up to “several PiBs of stored data”, which seemed like more data than anyone could possibly want to store in Kafka at the time. However, as Goldsky’s cluster continued to grow, and we encountered more and more customers who needed infinite retention for high volumes of data streams, we realized that “several PiBs of stored data” wasn’t going to be enough.</p><p>At this point you might be asking yourself: “What’s so hard about storing a few PiBs of data? Isn’t the object store doing all the heavy lifting anyways?” That’s mostly true, but to provide the abstraction of Kafka over an object store, the WarpStream control plane has to store a lot of metadata. 
Specifically, WarpStream has to track the location of each batch of data in the object store. This means that the amount of metadata tracked by WarpStream is a function of both the number of partitions in the system and the overall retention:</p><p><strong>metadata = O(num_partitions * retention)</strong></p><p>This is a gross oversimplification, but the core of the idea is accurate. We realized that if we were going to scale WarpStream to the largest Kafka clusters in the world, we’d have to break this relationship or at least alter its slope.</p><h3>WarpStream Storage V2 AKA “Big Clusters”</h3><p>We decided to redesign WarpStream’s storage engine and metadata store to solve this problem. This would enable us to accommodate even more growth in Goldsky’s cluster, and onboard even larger infinite-retention use cases with high partition counts.</p><p>We called the project “Big Clusters” and set to work. The details of the rewrite are beyond the scope of this blog post, but to summarize: WarpStream’s storage engine is a log-structured merge tree. We realized that by rewriting our compaction planning algorithms we could dramatically reduce the amount of metadata amplification that our compaction system generated.</p><p>We came up with a solution that reduces the metadata amplification by up to 256x, but it came with a trade-off: steady-state write amplification from compaction increased from 2x to 4x for some workloads. To solve this, we modified the storage engine to have two different “modes”: “small” clusters and “big” clusters. 
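</p><p>A toy model of the relationship above, and of how grouping batch entries together (the up-to-256x reduction mentioned earlier) changes its slope. This is purely illustrative, not WarpStream’s actual metadata layout, and the flush interval is an assumption:</p>

```python
# Toy model: metadata entries grow as O(num_partitions * retention).
# Assume each partition emits one batch-location entry per flush interval;
# grouping entries together divides the total by the grouping factor.
def metadata_entries(num_partitions: int, retention_days: int,
                     flushes_per_day: int = 86_400 // 5,  # one flush per ~5s
                     grouping_factor: int = 1) -> int:
    return (num_partitions * retention_days * flushes_per_day) // grouping_factor

small_mode = metadata_entries(41_000, retention_days=365)
big_mode = metadata_entries(41_000, retention_days=365, grouping_factor=256)
print(small_mode // big_mode)  # 256x less metadata to track in "big" mode
```

<p>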
All clusters start off in “small” mode by default, and a background process running in our control plane analyzes the workload continuously and automatically upgrades the cluster into “big” mode when appropriate.</p><p>This gives our customers the best of both worlds: highly optimized workloads remain as efficient as they are today, and only difficult workloads with hundreds of thousands of partitions and many PiBs of storage are upgraded to “big” mode to support their increased growth. Customers never have to think about it, and WarpStream’s control plane makes sure that each cluster is using the most efficient strategy automatically. As an added bonus, the fact that WarpStream hosts the customer’s control plane means we were able to perform this upgrade for all of our customers without involving them. The process was completely transparent to them.</p><p>This is what the upgrade from “small” mode to “big” mode looked like for Goldsky’s cluster:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2FTd6hPScWYVDnVT.png" /></figure><p>An 85% reduction in the amount of metadata tracked by WarpStream’s control plane!</p><p>Paradoxically, Goldsky’s compaction amplification actually <strong>decreased</strong> when we upgraded their cluster from “small” mode to “big” mode:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k1w7mFLRMW3b_60n.png" /></figure><p>That’s because previously the compaction system was overscheduling compactions to keep the size of the cluster’s metadata under control, but with the new algorithm this was no longer necessary. 
There’s a reason compaction planning is an NP-hard problem, and the results are not always intuitive!</p><p>With the new “big” mode, <strong>we’re confident now that a single WarpStream cluster can support up to 100PiBs of data</strong>, even when the cluster has hundreds of thousands of topic-partitions.</p><p>If you’re struggling with infinite retention, tiered storage, or high costs from a high number of topic-partitions in your Kafka clusters then <a href="https://www.warpstream.com/contact-us">get in touch with us</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fd381fe6e364" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>