From 66d24700c9387f747bc12ada5b5995ebbca11122 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Wed, 18 Mar 2026 07:23:17 -0600
Subject: [PATCH 1/3] Comet 0.14 blog post

---
 .../2026-03-18-datafusion-comet-0.14.0.md | 110 ++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 content/blog/2026-03-18-datafusion-comet-0.14.0.md

diff --git a/content/blog/2026-03-18-datafusion-comet-0.14.0.md b/content/blog/2026-03-18-datafusion-comet-0.14.0.md
new file mode 100644
index 00000000..40ea8824
--- /dev/null
+++ b/content/blog/2026-03-18-datafusion-comet-0.14.0.md
@@ -0,0 +1,110 @@
+---
+layout: post
+title: Apache DataFusion Comet 0.14.0 Release
+date: 2026-03-18
+author: pmc
+categories: [subprojects]
+---
+
+
+
+[TOC]
+
+The Apache DataFusion PMC is pleased to announce version 0.14.0 of the [Comet](https://datafusion.apache.org/comet/) subproject.
+
+Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.
+
+This release covers approximately eight weeks of development work and is the result of merging 189 PRs from 21
+contributors. See the [change log] for more information.
+
+[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.14.0.md
+
+## Key Features
+
+### Native Columnar-to-Row Conversion
+
+Comet now uses a native columnar-to-row (C2R) conversion by default. This
+feature replaces Comet's JVM-based columnar-to-row transition with a native Rust implementation, reducing JVM memory overhead
+when data flows from Comet's native execution back to Spark operators that require row-based input.
+
+### Native Iceberg Improvements
+
+Comet's fully-native Iceberg integration received several enhancements:
+
+**Per-Partition Plan Serialization**: `CometExecRDD` now supports per-partition plan data, reducing serialization
+overhead for native Iceberg scans and enabling dynamic partition pruning (DPP).
+
+**Vended Credentials**: Native Iceberg scans now support passing vended credentials from the catalog, improving
+integration with cloud storage services.
+
+**Performance Optimizations**:
+
+- Single-pass `FileScanTask` validation for reduced planning overhead
+- Configurable data file concurrency via `spark.comet.scan.icebergNative.dataFileConcurrency`
+- Channel-based executor thread parking instead of `yield_now()` for reduced CPU overhead
+- Reuse of `CometConf` and native utility instances in batch decoding
+
+### New Expressions
+
+This release adds support for the following expressions:
+
+- Date/time functions: `make_date`, `next_day`
+- String functions: `right`, `string_split`
+- Math functions: `width_bucket`, `crc32`
+- Map functions: `map_contains_key`, `map_from_entries`
+- Conversion functions: `to_csv`, `luhn_check`
+- Cast support: date to timestamp, integer to timestamp, numeric to timestamp, integer to binary, boolean to decimal, date to numeric
+
+### ANSI Mode Error Messages
+
+ANSI SQL mode now produces proper error messages matching Spark's expected output, improving compatibility for
+workloads that rely on strict SQL error handling.
+
+### DataFusion Configuration Passthrough
+
+DataFusion session-level configurations can now be set directly from Spark using the `spark.comet.datafusion.*`
+prefix. This enables tuning DataFusion internals such as batch sizes and memory limits without modifying Comet code.
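The passthrough described above can be illustrated with a submit-time sketch. The plugin class follows the Comet user guide; the exact DataFusion key expected after the `spark.comet.datafusion.` prefix is an assumption for illustration, not taken from this post:

```shell
# Hypothetical example: forwarding a DataFusion session setting through Comet.
# Assumes the suffix after "spark.comet.datafusion." maps onto a DataFusion
# session configuration key such as datafusion.execution.batch_size, and that
# my_job.py is a placeholder application.
spark-submit \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.datafusion.execution.batch_size=16384 \
  my_job.py
```

Because these values flow straight into the DataFusion session, they can be tuned per job without rebuilding or patching Comet itself.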
+
+## Performance Improvements
+
+This release includes extensive performance optimizations:
+
+- **Sum aggregation**: Specialized implementations for each eval mode eliminate per-row mode checks
+- **Contains expression**: SIMD-based scalar pattern search for faster string matching
+- **Batch coalescing**: Reduced IPC schema overhead in `BufBatchWriter` by coalescing small batches
+- **Tokio runtime**: Worker threads now initialize from `spark.executor.cores` for better resource utilization
+- **Decimal expressions**: Optimized decimal arithmetic operations
+- **Row-to-columnar transition**: Improved performance for JVM shuffle data conversion
+- **Aligned pointer reads**: Optimized `SparkUnsafeRow` field accessors using aligned memory reads
+
+## Deprecations and Removals
+
+The deprecated `native_comet` scan mode has been removed. Use `native_datafusion` instead. Note
+that the `native_iceberg_compat` scan mode is now deprecated and will be removed in a future release.
+
+## Compatibility
+
+This release upgrades to DataFusion 52.3, Arrow 57.3, and iceberg-rust 0.9.0. Published binaries now target
+x86-64-v3 and neoverse-n1 CPU architectures for improved performance on modern hardware.
+
+Supported platforms include Spark 3.4.3, 3.5.4-3.5.8, and Spark 4.0.x with various JDK and Scala combinations.
+
+The community encourages users to test Comet with existing Spark workloads and welcomes contributions to ongoing development.

From 08240a08de9e58bcbe316d36ed5adbaac4b30a00 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Wed, 18 Mar 2026 07:53:58 -0600
Subject: [PATCH 2/3] fix: add write permissions to stage-site workflow

The pull_request event defaults to read-only GITHUB_TOKEN permissions,
causing the pelican action to fail when pushing to asf-staging branch.
---
 .github/workflows/stage-site.yml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/.github/workflows/stage-site.yml b/.github/workflows/stage-site.yml
index 52b37d6a..f34492f9 100644
--- a/.github/workflows/stage-site.yml
+++ b/.github/workflows/stage-site.yml
@@ -9,6 +9,9 @@ name: Stage Site
 on:
   pull_request:
     branches: ["main"]
+permissions:
+  contents: write
+
 jobs:
   build-pelican:
     if: startsWith(github.head_ref, 'site/')

From 3348f36523c082634164840a780e238b845c36db Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Wed, 18 Mar 2026 08:11:01 -0600
Subject: [PATCH 3/3] Revert "fix: add write permissions to stage-site workflow"

This reverts commit 08240a08de9e58bcbe316d36ed5adbaac4b30a00.
---
 .github/workflows/stage-site.yml | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/.github/workflows/stage-site.yml b/.github/workflows/stage-site.yml
index f34492f9..52b37d6a 100644
--- a/.github/workflows/stage-site.yml
+++ b/.github/workflows/stage-site.yml
@@ -9,9 +9,6 @@ name: Stage Site
 on:
   pull_request:
     branches: ["main"]
-permissions:
-  contents: write
-
 jobs:
   build-pelican:
     if: startsWith(github.head_ref, 'site/')