From 3d6ee4072f47153a83806581766401811065bc61 Mon Sep 17 00:00:00 2001
From: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
Date: Tue, 25 Nov 2025 23:11:29 -0500
Subject: [PATCH 1/4] Variant blog post

---
 _data/contributors.yml | 3 +
 _posts/2025-11-25-variant.md | 114 +++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)
 create mode 100644 _posts/2025-11-25-variant.md

diff --git a/_data/contributors.yml b/_data/contributors.yml
index 8cefa3151222..42c0e5a1a6cd 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -76,4 +76,7 @@
 - name: Andrew Lamb
   apacheId: alamb
   githubId: alamb
+- name: Matthew Kim
+  apacheId: friendlymatthew # Not a real apacheId
+  githubId: friendlymatthew
 # End contributors.yml
diff --git a/_posts/2025-11-25-variant.md b/_posts/2025-11-25-variant.md
new file mode 100644
index 000000000000..f2bc876df9c6
--- /dev/null
+++ b/_posts/2025-11-25-variant.md
@@ -0,0 +1,114 @@
+---
+layout: post
+title: "Variant: tbd"
+author: friendlymatthew
+date: "2025-11-25 00:00:00"
+categories: [release]
+---

Variant is a data type designed to solve JSON's performance problems in OLAP systems, and its initial implementation has been released in Arrow 57.0.0.

This article explains the limitations of JSON, how Variant solves these problems, and why you should be excited about it by analyzing performance characteristics on a real world dataset.

## What's the problem with JSON?

Many real-world datasets are messy and lack a rigid schema. As a result, it's common to store such data as a JSON column.

The access pattern of this model looks like:

- **read**: for every row you scan, decode the entire JSON into memory and evaluate it
- **write**: for every record you write, encode it as its own JSON value

There are many problems with this approach. 
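To make the access pattern concrete, here is a minimal Python sketch of what evaluating a single predicate against a JSON column entails. The rows and field names are made up for illustration:

```python
import json

# A "JSON column": every row is an independently encoded JSON document.
# The rows and field names here are hypothetical.
rows = [
    json.dumps({"user_id": i, "event": "click", "payload": {"x": i % 7}})
    for i in range(10_000)
]

matches = 0
bytes_parsed = 0
for raw in rows:
    doc = json.loads(raw)       # the entire document is deserialized...
    bytes_parsed += len(raw)
    if doc["user_id"] == 1234:  # ...just to inspect one field
        matches += 1

print(matches)  # 1
```

Every byte of every row gets parsed, even though the query only ever touches one field.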
+ +_Read performance degrades quickly._ To evaluate any predicate requires scanning every row and deserializing the entire JSON payload into memory, no matter its size. Even multi-megabyte documents must be completely decoded just to inspect a single field. + +_Write efficiency is also poor._ Because each row stores an independently encoded JSON object, common object fields are redundantly serialized over and over. This increases storage size and adds unnecessary encoding overhead. + +These problems aren't new and solutions exist that improve the access patterns of a JSON column. One common solution is to extract commonly occuring object keys into a dedicated column, also known as object shredding. Yet, this requires substantial engineering effort and implementation details vary by query engine. + +## How Variant solves these problems + +Variant is a data type with an efficient binary encoding. It's designed to store JSON data in a way that is more performant for OLAP query engines. + +### Variant has richer data types + +JSON is limited to just six data types: strings, numbers, booleans, nulls, objects, and arrays. This simplicity comes at a cost, as specialized data types like timestamps and UUIDs must be encoded as strings, then inferred and parsed back at query time. + +Variant removes this overhead by supporting a much broader range of native types organized into three categories: primitives, objects, and arrays. Objects and arrays work exactly like their JSON counterparts, supporting arbitrary nested structures. Variant extends the primitive category to include 20 specialized types. Dates, timestamps, UUIDs, binary data, and integers and floats of various width all have their own native representations. Values are encoded in type-specific binary formats, optimized for their native type rather than a stringifed representation. This reduces both storage-overhead and query-time parsing costs. 
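As a rough illustration of the size difference, here is a sketch using only Python's standard library. The exact Variant wire encodings are defined by the spec and differ in detail; this only shows the gap between a stringified and a fixed-width binary representation:

```python
import struct
import uuid
from datetime import datetime, timezone

# A UUID in JSON is a 36-character hex string; natively it is 16 bytes.
u = uuid.uuid4()
json_uuid = str(u)     # e.g. 'f47ac10b-58cc-...'
binary_uuid = u.bytes  # 16 raw bytes

# A timestamp in JSON is an ISO-8601 string; a fixed-width integer of
# microseconds since the epoch fits in 8 bytes.
ts = datetime(2025, 4, 21, tzinfo=timezone.utc)
json_ts = ts.isoformat()  # '2025-04-21T00:00:00+00:00'
binary_ts = struct.pack("<q", int(ts.timestamp() * 1_000_000))

print(len(json_uuid), len(binary_uuid))  # 36 16
print(len(json_ts), len(binary_ts))      # 25 8
```

The UUID shrinks from 36 bytes to 16 and the timestamp from 25 bytes to 8, before any query-time parsing savings are counted.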

_[Figure: diagram of { id: "some uuid", timestamp: ...} encoded in both JSON and Variant logically and physically]_

### Variant has efficient serialization through a 2-column design

JSON columns have naive serialization which leads to unnecessary encoding overhead. For example, when storing `{"user_id": 123, "timestamp: "2025-04-21" }` thousands of times, the field names `user_id` and `timestamp` are written out in full for each row, even though they're identical across rows.

Variant avoids such overhead by splitting data across 2 columns: `metadata` and `value`.

The `metadata` column stores a dictionary of unique field names. The `value` column stores the actual data. When encoding an object, field names aren't written inline. Instead, each field references its name by offset position in the metadata dictionary.

_[Figure: diagram of a primitive variant and variant object being written to a metadata and value column]_

At the file level, this design enables a powerful optimization. You can build a single metadata dictionary per column chunk, containing the union of all unique field names across every row in that chunk. Rows with different schemas both reference the same shared dictionary. This way, each unique field name appears exactly once in the dictionary, even if it's used in thousands of rows.

_[Figure: diagram of multiple variants pointing to the same metadata dictionary]_

### Variant guarantees faster search performance

When objects are encoded, field entries must be written in lexicographic order by field name. This ordering constraint enables efficient field lookups: finding a value by field name takes O(log(n)) time by binary search, where n is the number of fields in the object.

Without this guarantee, you'd need to sequentially scan through every field. For deeply nested objects with dozens of fields, this difference compounds quickly. 
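The lookup can be sketched in a few lines of Python, assuming a hypothetical object whose field entries are already sorted by name (the names and placeholder values are invented; the real encoding stores offsets into binary buffers):

```python
from bisect import bisect_left

# Hypothetical object: field entries sorted lexicographically by name,
# as the Variant object encoding requires. Values are placeholder bytes.
names = ["device", "event", "region", "timestamp", "user_id"]
values = [b"\x01", b"\x02", b"\x03", b"\x04", b"\x05"]

def get_field(name):
    """Find a field's value in O(log n) via binary search over the names."""
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return values[i]
    return None  # the field is absent from this object

print(get_field("timestamp"))  # b'\x04'
print(get_field("missing"))    # None
```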
+ +_maybe i should add some benchmarks here?_ + +The metadata dictionary can similarly be optimized for faster search performance by making the list of field names unique and sorted. + +### Variant can leverage file format capabilities + +Even with fast field lookups, Variant still requires full deserialization to access any field, as the entire data must be decoded just to read the single value. This wastes CPU and memory on data that you don't need. + +Variant solves this problem with its own shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. For example, if you frequently query a timestamp field called `start_timestamp` or a 64-bit integer field called `user_id`, it can be shredded into dedicated timestamp and integer columns alongside the Variant columns. + +This enables columnar file formats like Parquet to leverage their full optimization toolset Suppose you had to evaluate a query like: + +```sql +SELECT user_id, start_timestamp from events +WHERE start_timestamp > '2025-01-01' AND user_id = 12345; +``` + +Zone maps can skip entire row groups where `start_timestamp` falls outside the query range. Dictionary encoding can compress repeated `user_id` values efficiently. Bloom filters can rule out row groups without your target user. More importantly, you only deserialize the shredded columns you need, not the Variant columns. + +Shredding makes the trade-off explicit: extract frequently queried fields into optimized columns at write time, and keep everything else flexible. You get columnar performance for common access patterns and schema flexibility for everything else. + +## Why you should be excited about Variant + +In this section, we'll explore the performance characteristics of JSON and Variant data. 
We'll use ClickHouse's JSON Bench [dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events), a billion-record collection of Bluesky events designed to represent real-world production data.

Our benchmarks focus on three key metrics: write performance when serializing data to Parquet files, storage efficiency comparing both compressed and uncompressed file sizes, and query performance across common access patterns.

For query execution, we use DataFusion, a popular query engine. To work with Variant in DataFusion, we use [datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant), a library that implements native Variant type support.

Full benchmark implementation code is available [here]().

### Write performance

### Read performance

From dcba452f5d6522794f2036aae058573a37a4d274 Mon Sep 17 00:00:00 2001
From: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
Date: Mon, 1 Dec 2025 12:24:02 +0100
Subject: [PATCH 2/4] make edits

---
 _posts/2025-11-25-variant.md | 43 ++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/_posts/2025-11-25-variant.md b/_posts/2025-11-25-variant.md
index f2bc876df9c6..05a1dcbc257c 100644
--- a/_posts/2025-11-25-variant.md
+++ b/_posts/2025-11-25-variant.md
@@ -40,11 +40,11 @@ The access pattern of this model looks like:

There are many problems with this approach.

-_Read performance degrades quickly._ To evaluate any predicate requires scanning every row and deserializing the entire JSON payload into memory, no matter its size. Even multi-megabyte documents must be completely decoded just to inspect a single field.
+_Read performance degrades quickly._ Evaluating any predicate requires scanning every row and deserializing the entire JSON payload into memory, no matter its size. 
Even multi-megabyte documents must be completely decoded just to inspect a single field.

_Write efficiency is also poor._ Because each row stores an independently encoded JSON object, common object fields are redundantly serialized over and over. This increases storage size and adds unnecessary encoding overhead.

-These problems aren't new and solutions exist that improve the access patterns of a JSON column. One common solution is to extract commonly occuring object keys into a dedicated column, also known as object shredding. Yet, this requires substantial engineering effort and implementation details vary by query engine.
+These problems aren't new, and solutions exist that improve the access patterns of a JSON column. One common solution is to extract commonly occurring object keys into a dedicated column, also known as object shredding. The challenge is that without a standardized specification, each query engine would shred JSON differently, using incompatible column naming schemes, type mappings, or encoding strategies. This would force downstream systems to either re-shred the data (wasting compute and storage) or fall back to full deserialization (losing the performance benefits).

## How Variant solves these problems

Variant is a data type with an efficient binary encoding. It's designed to store JSON data in a way that is more performant for OLAP query engines.

### Variant has richer data types

-JSON is limited to just six data types: strings, numbers, booleans, nulls, objects, and arrays. This simplicity comes at a cost, as specialized data types like timestamps and UUIDs must be encoded as strings, then inferred and parsed back at query time.
+JSON is limited to just six data types: strings, numbers, booleans, nulls, objects, and arrays. This simplicity comes at a cost: specialized data types like timestamps and UUIDs have to be coerced into strings at write time, and then parsed back into their original types at read time. 

-Variant removes this overhead by supporting a much broader range of native types organized into three categories: primitives, objects, and arrays. Objects and arrays work exactly like their JSON counterparts, supporting arbitrary nested structures. Variant extends the primitive category to include 20 specialized types. Dates, timestamps, UUIDs, binary data, and integers and floats of various width all have their own native representations. Values are encoded in type-specific binary formats, optimized for their native type rather than a stringifed representation. This reduces both storage-overhead and query-time parsing costs.
+Variant removes this overhead by supporting a much broader range of native types. It still has the composite data types (arrays and objects) but extends the primitive category to include 20 specialized types. Dates, timestamps, UUIDs, binary data, and integers and floats of various widths all have their own native representations. Values are encoded in type-specific binary formats, optimized for their native type rather than a stringified representation. This reduces both storage overhead and query-time parsing costs.
+
+For example, UUIDs get a 2x size improvement by being stored as a `u128` instead of a 36-character hex string.

_[Figure: diagram of { id: "some uuid", timestamp: ...} encoded in both JSON and Variant logically and physically]_

-### Variant has efficient serialization through a 2-column design
+### Variant can be compactly encoded through a 2-column design

-JSON columns have naive serialization which leads to unnecessary encoding overhead. For example, when storing `{"user_id": 123, "timestamp: "2025-04-21" }` thousands of times, the field names `user_id` and `timestamp` are written out in full for each row, even though they're identical across rows.
+JSON columns serialize every row independently, which leads to unnecessary encoding overhead. 
For example, when storing `{"user_id": 123, "timestamp": "2025-04-21"}` thousands of times, the field names `user_id` and `timestamp` are written out in full for each row, even though they're identical across rows.

Variant avoids such overhead by splitting data across 2 columns: `metadata` and `value`.

The `metadata` column stores a dictionary of unique field names. The `value` column stores the actual data. When encoding an object, field names aren't written inline. Instead, each field references its name by offset position in the metadata dictionary.

_[Figure: diagram of a primitive variant and variant object being written to a metadata and value column]_

-At the file level, this design enables a powerful optimization. You can build a single metadata dictionary per column chunk, containing the union of all unique field names across every row in that chunk. Rows with different schemas both reference the same shared dictionary. This way, each unique field name appears exactly once in the dictionary, even if it's used in thousands of rows.
+At the file level, this design enables a powerful optimization. In Parquet, for example, data is written in row groups, and each column within a row group forms a column chunk: a contiguous set of values. You can build a single metadata dictionary per column chunk, containing the union of all unique field names across every row in that chunk. Rows with different schemas all reference the same shared dictionary. This way, each unique field name appears exactly once in the dictionary, even if it's used in thousands of rows.

_[Figure: diagram of multiple variants pointing to the same metadata dictionary]_

### Variant guarantees faster search performance

-When objects are encoded, field entries must be written in lexicographic order by field name. This ordering constraint enables efficient field lookups: finding a value by field name takes O(log(n)) time by binary search, where n is the number of fields in the object.
-
-Without this guarantee, you'd need to sequentially scan through every field. For deeply nested objects with dozens of fields, this difference compounds quickly. 
- -_maybe i should add some benchmarks here?_ +When objects are encoded, field entries must be written in lexicographic order by field name. This ordering constraint enables efficient field lookups: finding a value by field name takes `O(log(n))` time, where `n` is the number of fields in the object. Without this guarantee, you'd need to sequentially scan through every field. For deeply nested objects with dozens of fields, this difference compounds quickly. The metadata dictionary can similarly be optimized for faster search performance by making the list of field names unique and sorted. @@ -86,16 +84,27 @@ The metadata dictionary can similarly be optimized for faster search performance Even with fast field lookups, Variant still requires full deserialization to access any field, as the entire data must be decoded just to read the single value. This wastes CPU and memory on data that you don't need. -Variant solves this problem with its own shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. For example, if you frequently query a timestamp field called `start_timestamp` or a 64-bit integer field called `user_id`, it can be shredded into dedicated timestamp and integer columns alongside the Variant columns. +Variant solves this problem with its own shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. By defining a common specification, a query engine that shreds Variant data into separate columns means any other Variant-compatible engine can understand and query those shredded columns directly. This interopability is crucial in modern systems where data written by Spark might be read by Datafusion for example. + +In the following example, we'll showcase the benefits when reading shredded Variants from Parquet files. 

-This enables columnar file formats like Parquet to leverage their full optimization toolset Suppose you had to evaluate a query like:
+Let's say we have a column of Variant data, and we notice that a timestamp field `start_timestamp` and a 64-bit integer field `user_id` are frequently queried:

```sql
-SELECT user_id, start_timestamp from events
-WHERE start_timestamp > '2025-01-01' AND user_id = 12345;
+SELECT count(*) FROM events
+WHERE typed_value.start_timestamp > '2025-01-01' AND typed_value.user_id = 12345;
```

-Zone maps can skip entire row groups where `start_timestamp` falls outside the query range. Dictionary encoding can compress repeated `user_id` values efficiently. Bloom filters can rule out row groups without your target user. More importantly, you only deserialize the shredded columns you need, not the Variant columns.
+By shredding the `start_timestamp` and `user_id` fields into dedicated timestamp and integer columns, we gain the full benefits of native Parquet columns.
+
+Pair the shredding with the benefits that a columnar file format like Parquet offers, and the execution looks like this:
+
+- **zone maps** check each row group's `start_timestamp` range. Row groups with `max(start_timestamp) <= '2025-01-01'` are skipped entirely.
+- **bloom filters** check remaining row groups for `user_id = 12345`. Row groups without this user are skipped.
+
+More importantly, you only deserialize the shredded columns you need, not the Variant columns.

Shredding makes the trade-off explicit: extract frequently queried fields into optimized columns at write time, and keep everything else flexible. You get columnar performance for common access patterns and schema flexibility for everything else. 
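The zone-map pruning over shredded columns can be simulated in a few lines of Python. The row-group layout and statistics below are invented for illustration; a real engine reads them from the Parquet footer metadata:

```python
# Three hypothetical row groups, each carrying a shredded `user_id`
# column plus the min/max statistics a Parquet writer records for it.
row_groups = [
    {"min": 0,      "max": 9_999,  "user_id": list(range(0, 10_000))},
    {"min": 10_000, "max": 19_999, "user_id": list(range(10_000, 20_000))},
    {"min": 20_000, "max": 29_999, "user_id": list(range(20_000, 30_000))},
]

target = 12_345
scanned_groups = 0
hits = 0
for rg in row_groups:
    # Zone-map check: skip the whole group if the target cannot be in it.
    if not (rg["min"] <= target <= rg["max"]):
        continue
    scanned_groups += 1
    hits += sum(1 for v in rg["user_id"] if v == target)

print(scanned_groups, hits)  # 1 1
```

Only one of the three row groups is ever scanned, and no Variant value is deserialized along the way.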
From cd1a22371a346b50881cf74d48c35ee6b3b1b973 Mon Sep 17 00:00:00 2001
From: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
Date: Mon, 1 Dec 2025 12:30:18 +0100
Subject: [PATCH 3/4] Update _posts/2025-11-25-variant.md

Co-authored-by: Andrew Lamb
---
 _posts/2025-11-25-variant.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2025-11-25-variant.md b/_posts/2025-11-25-variant.md
index 05a1dcbc257c..802aa1b5cddd 100644
--- a/_posts/2025-11-25-variant.md
+++ b/_posts/2025-11-25-variant.md
@@ -27,7 +27,7 @@ limitations under the License.

Variant is a data type designed to solve JSON's performance problems in OLAP systems, and its initial implementation has been released in Arrow 57.0.0.

-This article explains the limitations of JSON, how Variant solves these problems, and why you should be excited about it by analyzing performance characteristics on a real world dataset.
+This article explains the limitations of JSON, introduces the Variant type that solves these problems, and explains why you should be excited about it by analyzing performance characteristics on a real-world dataset.

## What's the problem with JSON?

From bdddfdcb48d3ca1dc2bbfe85fa4a8bb58ddac0f4 Mon Sep 17 00:00:00 2001
From: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
Date: Mon, 1 Dec 2025 12:31:02 +0100
Subject: [PATCH 4/4] more edits

---
 _posts/2025-11-25-variant.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2025-11-25-variant.md b/_posts/2025-11-25-variant.md
index 802aa1b5cddd..cfdbb1570840 100644
--- a/_posts/2025-11-25-variant.md
+++ b/_posts/2025-11-25-variant.md
@@ -84,7 +84,7 @@ The metadata dictionary can similarly be optimized for faster search performance

Even with fast field lookups, Variant still requires full deserialization to access any field, as the entire data must be decoded just to read the single value. This wastes CPU and memory on data that you don't need. 

-Variant solves this problem with its own shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. By defining a common specification, a query engine that shreds Variant data into separate columns means any other Variant-compatible engine can understand and query those shredded columns directly. This interopability is crucial in modern systems where data written by Spark might be read by Datafusion for example.
+Variant solves this problem by standardizing a shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. Because the specification is shared, when one query engine shreds Variant data into separate columns, any other Variant-compatible engine can understand and query those shredded columns directly. This interoperability is crucial in modern systems where, for example, data written by Spark might be read by DataFusion.

In the following example, we'll showcase the benefits when reading shredded Variants from Parquet files.