Dremio Blog

14 minute read · April 10, 2026

The VARIANT Type: How to Store JSON Without the Pain

Will Martin, Technical Evangelist

Key Takeaways

  • Working with JSON in an Iceberg lakehouse has been challenging due to performance issues with VARCHAR strings and schema bloat from flattened tables.
  • Apache Iceberg v3 introduces the VARIANT type, a native semi-structured data type that stores JSON-like data in compact binary format for faster field-level reads.
  • VARIANT supports various data types and allows flexible structures without requiring a fixed schema across rows, improving query performance.
  • The 'shredding' optimization extracts frequently occurring fields into typed Parquet columns, enhancing read performance, especially for filter-heavy queries.
  • Dremio's implementation of VARIANT on Iceberg v3 tables allows for efficient querying and supports diverse use cases like application logs and telemetry data.

Working with JSON in an Iceberg lakehouse has always been a compromise: you either store JSON as VARCHAR strings and accept the performance hit every time a query needs to extract a field, or you flatten the JSON into a wide table of nullable columns and watch your schema bloat. Both work fine but have always felt like workarounds to a problem rather than a proper solution.

Apache Iceberg's v3 specification introduces a better approach, and Dremio now supports it: the VARIANT type. It's a native semi-structured data type that stores JSON-like data in a compact binary encoding and supports a Parquet-level optimisation that makes field-level reads significantly faster than anything VARCHAR gives you.


Why VARCHAR Was Never the Right Answer for JSON in a Lakehouse

Storing JSON as a string is very appealing. It requires no schema agreement upfront, it accepts any payload, and it never fails on load. However, it's when you get to querying the data that the pain shows up.

Every query that extracts a field from a VARCHAR JSON column has to parse the entire string at runtime. For a table with 10 million log events, reading a single nested field means deserialising 10 million JSON strings, pulling the value you need, and discarding the rest. There is no predicate pushdown into the string, no column pruning, and no opportunity for the Parquet reader to skip irrelevant row groups. The query planner can't help you as it has no visibility into the structure of a string column.

Flattening JSON into a wide typed table solves this parsing problem but creates a different one. A payload with 50 possible fields becomes a table with 50 nullable columns, most of them potentially empty for any given row. Parquet handles sparse data reasonably well, but the metadata overhead grows and any schema changes become table alterations. For data that genuinely varies in structure across rows, a fixed schema is the wrong abstraction. You end up either missing fields or maintaining a schema that is mostly nulls.

What Actually Is VARIANT?

VARIANT is a self-describing binary type defined in the Apache Parquet project and added to Apache Iceberg in the v3 specification, announced in June 2025. A VARIANT column holds any combination of the following within the same table, without requiring consistent structure across rows:

  • Primitives, such as strings, integers, decimals, dates, and timestamps
  • Ordered arrays of variant values
  • Objects with string keys and variant values

The type range is wider than JSON's, which accommodates only strings, numbers, booleans, arrays, objects, and null. VARIANT natively encodes dates, timestamps with and without timezone offset, decimals, and binary data. This means a field stored as a date inside a VARIANT is typed as a date, not a string that happens to parse like one, and range filters and date comparisons work without casting at query time.
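As a sketch of what that buys you (using the VARIANT_GET extraction function covered below, against a hypothetical events table), a timestamp written into a VARIANT with its type preserved can be range-filtered directly:

```sql
-- Sketch: 'created_at' was written into the VARIANT as a typed timestamp,
-- so the extracted value compares against a TIMESTAMP literal with no cast.
-- Table and field names are illustrative.
SELECT VARIANT_GET(event_payload, '$.created_at', 'TIMESTAMP') AS event_time
FROM app_events
WHERE VARIANT_GET(event_payload, '$.created_at', 'TIMESTAMP')
      >= TIMESTAMP '2026-01-01 00:00:00'
```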

In Dremio, VARIANT is available exclusively on Apache Iceberg v3 tables. Adding a VARIANT column to a v1 or v2 table is not supported, which is a constraint of the spec rather than of Dremio's implementation. If your existing tables are on earlier format versions, you will need to migrate them to v3 first.
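Creating a fresh v3 table with a VARIANT column might look like the following sketch. The `'format-version' = '3'` property is the Iceberg spec's table property name; check your engine's documentation for the exact DDL it expects, as syntax varies across engines.

```sql
-- Sketch: an Iceberg v3 table with a VARIANT column.
-- Column names are illustrative; the TBLPROPERTIES key follows the
-- Iceberg spec and may differ in your engine's DDL.
CREATE TABLE app_events (
  event_id      BIGINT,
  received_at   TIMESTAMP,
  event_payload VARIANT
)
TBLPROPERTIES ('format-version' = '3')
```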

Getting Shredded for Performance Gains

Binary encoding is already a meaningful improvement over VARCHAR for JSON storage. But the more impressive performance gains come from "shredding", an optimisation worth understanding in detail.

Shredding is a Parquet-level optimisation. When Dremio writes VARIANT data to an Iceberg v3 table, it analyses the values being written and extracts frequently occurring fields into separate typed Parquet columns alongside the main VARIANT binary column. The binary column still exists and still holds the full value for every row, with no data lost and without imposing a schema on the table. But for the fields that were "shredded out", queries that access those fields can read directly from the typed column, skipping the binary decoding step entirely.

The practical effect is significant for filter-heavy queries. Consider a query like:

SELECT
  VARIANT_GET(payload, '$.user_id', 'VARCHAR') AS user_id,
  VARIANT_GET(payload, '$.revenue_usd', 'DECIMAL(10,2)') AS revenue
FROM purchase_events
WHERE VARIANT_GET(payload, '$.event_type', 'VARCHAR') = 'purchase'
  AND VARIANT_GET(payload, '$.revenue_usd', 'DECIMAL(10,2)') > 100

On a non-shredded table, every row's VARIANT binary value must be decoded to evaluate both filter conditions. On a shredded table where event_type and revenue_usd were extracted during write, the Parquet reader can push both predicates down to the typed shredded columns, apply row group and page-level skipping, and only decode the full VARIANT binary for the rows that pass the filter. For high-cardinality tables with selective filters, this difference is substantial.

How Dremio Handles Shredding

Dremio applies shredding by default on every write to a VARIANT column. If your workload is write-heavy and the overhead of shredding during ingestion is a concern, you can disable it at the table level by setting the iceberg.enableVariantShredding table property to false. For analytics workloads where reads outnumber writes, the default is the right setting.
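Disabling shredding on a write-heavy table is a one-line property change. The property name comes from the paragraph above; the ALTER syntax shown here is a sketch and may differ slightly in your Dremio version:

```sql
-- Sketch: turn off shredding for a write-heavy ingestion table.
-- 'iceberg.enableVariantShredding' is the table property named above.
ALTER TABLE purchase_events
SET TBLPROPERTIES ('iceberg.enableVariantShredding' = 'false')
```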

Shredding is also interoperable across engines. When Apache Spark (4.1 or later) writes shredded VARIANT data to a shared Iceberg v3 table, Dremio reads it and uses the shredded layout transparently on the read path. A multi-engine lakehouse where Spark handles ingestion and Dremio handles analytics can take full advantage of shredding end to end, without any coordination between the two engines beyond both conforming to the Iceberg and Parquet variant specifications.

Getting Data Into a VARIANT Column

There are two functions to load data into VARIANT columns, and they handle types differently.

  • TO_VARIANT converts a SQL-typed value into a VARIANT while preserving the original type. For example, a DATE stays a DATE. If your source data already has SQL types and you want type fidelity inside the VARIANT, this is the option for you.

  • PARSE_JSON parses a raw JSON string into a VARIANT. It preserves JSON's native primitive types of strings, numbers, booleans, and null. Date and timestamp strings won't be interpreted as dates or timestamps unless explicitly cast after extraction. If your pipeline lands JSON directly from Kafka, a webhook, or an application log, PARSE_JSON is the natural entry point.
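The difference between the two is easiest to see side by side. In this sketch (table and field names are illustrative), the first insert preserves the SQL DATE type inside the VARIANT, while the second lands the same date as a plain JSON string:

```sql
-- TO_VARIANT preserves SQL types: the DATE stays a DATE inside the VARIANT.
INSERT INTO app_events (event_id, event_payload)
VALUES (1, TO_VARIANT(DATE '2026-01-15'))

-- PARSE_JSON keeps JSON's native primitives: "2026-01-15" here is a string
-- until it is extracted and explicitly cast.
INSERT INTO app_events (event_id, event_payload)
VALUES (2, PARSE_JSON('{"event_type": "signup", "created": "2026-01-15"}'))
```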

Querying VARIANT Data With VARIANT_GET

Values inside a VARIANT column are extracted using VARIANT_GET, which takes the column, a path expression, and a target type. The function returns a typed SQL value ready for aggregation, filtering, or joining without a separate casting step:

SELECT
  VARIANT_GET(event_payload, '$.session.duration_ms', 'BIGINT') AS duration_ms,
  VARIANT_GET(event_payload, '$.device.os', 'VARCHAR') AS operating_system,
  VARIANT_GET(event_payload, '$.created_at', 'TIMESTAMP') AS event_time
FROM app_events
WHERE VARIANT_GET(event_payload, '$.event_type', 'VARCHAR') = 'session_end'
  AND VARIANT_GET(event_payload, '$.session.duration_ms', 'BIGINT') > 5000

On a shredded table, if event_type and session.duration_ms were extracted during write, the WHERE clause predicates push down to the typed Parquet columns, and the full VARIANT binary is only decoded for rows that pass both conditions. The analyst writing this query does not need to think about whether shredding happened. The query looks the same either way; the performance difference is handled by the storage and execution layers.

The Use Cases That Benefit Most From VARIANT and Shredding

VARIANT earns its place in workloads where data is genuinely variable in structure, evolves over time, or comes from sources that don't have a stable schema.

Application event logs are the clearest fit. A single events table typically receives payloads from many product surfaces, each with different fields, different nesting, and different types depending on the event. Flattening that into a fixed schema means either creating hundreds of nullable columns or rebuilding the schema every time a new event type ships. A VARIANT column accepts all of it. Shredding ensures the fields that appear in most queries remain fast to read without defining them in the table schema.

Telemetry and IoT data have similar characteristics: high volume, variable structure across device types and firmware versions, and a small set of fields that drive most analysis. Landing data in VARIANT and shredding the high-frequency fields gives you fast access to what you query most while keeping the full payload available for ad-hoc investigation.

Streaming and CDC pipelines that land JSON from Kafka or similar systems benefit from VARIANT because it removes the requirement to agree on a flat schema before ingestion starts. Data lands as-is, shredding runs on write, and the fields needed for monitoring queries are fast from the first day. Schema evolution in the upstream application no longer requires a pipeline pause and a table migration.
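The landing pattern for such pipelines can be as simple as parsing on write. In this sketch, raw JSON arrives as VARCHAR in a hypothetical staging table (the names `kafka_landing_zone` and `raw_json` are illustrative, not a real Dremio source):

```sql
-- Sketch: land raw JSON from a staging table into a VARIANT column.
-- PARSE_JSON converts the string payload; shredding runs on write.
INSERT INTO app_events (event_id, event_payload)
SELECT event_id, PARSE_JSON(raw_json)
FROM kafka_landing_zone
```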

A Better Model for Semi-Structured Data in the Lakehouse

The case for VARIANT is straightforward. You replace a VARCHAR column that requires full string parsing at query time with a binary-encoded, self-typed column that supports predicate pushdown and Parquet-level skipping on the fields that matter most. For data whose structure does not fit a fixed schema, this is a better model than the alternatives data teams have been working with for years.

Dremio's implementation gives you full read and write support on Iceberg v3 tables, shredding applied by default on write, transparent reading of shredded data written by other engines, per-table control over shredding behaviour, and the VARIANT_GET function for typed field extraction. It is available now in Dremio Cloud.

If you want to test VARIANT against your own JSON data, you can spin up a free Dremio environment at dremio.com/get-started. Iceberg v3 table creation and VARIANT column support are available from day one.

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.