
[GLUTEN-11605][VL] Write per-block column statistics in shuffle writer#11769

Open
acvictor wants to merge 2 commits into apache:main from acvictor:acvictor/writerChanges

@acvictor acvictor commented Mar 16, 2026

What changes are proposed in this pull request?

This PR adds per-block column statistics (min/max/hasNull) to the shuffle writer pipeline, a prerequisite for block-level pruning with dynamic filters at the shuffle reader. When spark.gluten.sql.columnar.backend.velox.valueStream.dynamicFilter.enabled is true, the shuffle writer computes per-column min/max statistics from the raw Arrow buffers during evictBuffers() and serializes them as a kStatisticsPayload block ahead of each non-dictionary payload in the output file. This mirrors how Parquet row-group statistics enable predicate pushdown.
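
For readers unfamiliar with the idea, here is a minimal sketch of how min/max/hasNull can be derived from a raw Arrow-style buffer and later used to skip a block. It is not the code in this PR: `ColumnStats`, `isValid`, `computeStats`, and `mayMatchRange` are hypothetical names, only a fixed-width int64 column is handled, and the reader-side check illustrates the follow-up pruning these statistics are meant to enable rather than anything implemented here.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical per-column statistics record; names are illustrative and
// not taken from the Gluten source.
struct ColumnStats {
  int64_t min = std::numeric_limits<int64_t>::max();
  int64_t max = std::numeric_limits<int64_t>::min();
  bool hasNull = false;
  bool hasValue = false; // stays false when every row in the block is null
};

// Arrow validity bitmaps are LSB-ordered: bit i set means row i is non-null.
// A null bitmap pointer is treated as "no nulls in this column".
inline bool isValid(const uint8_t* validity, size_t i) {
  return validity == nullptr || ((validity[i >> 3] >> (i & 7)) & 1) != 0;
}

// Single pass over a raw int64 values buffer, skipping null slots.
inline ColumnStats computeStats(
    const int64_t* values,
    const uint8_t* validity,
    size_t numRows) {
  ColumnStats stats;
  for (size_t i = 0; i < numRows; ++i) {
    if (!isValid(validity, i)) {
      stats.hasNull = true;
      continue;
    }
    if (values[i] < stats.min) {
      stats.min = values[i];
    }
    if (values[i] > stats.max) {
      stats.max = values[i];
    }
    stats.hasValue = true;
  }
  return stats;
}

// Reader-side check: could any non-null row in the block fall in [lo, hi]?
// If this returns false, the whole block can be skipped for that filter.
inline bool mayMatchRange(const ColumnStats& stats, int64_t lo, int64_t hi) {
  return stats.hasValue && stats.min <= hi && stats.max >= lo;
}
```

Calling `computeStats` once per column per block keeps the extra cost to one linear scan over buffers the writer already holds in memory, which is the same trade-off Parquet makes when it records row-group statistics.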

How was this patch tested?

Added new tests and ran the CI with the config set to true.

Was this patch authored or co-authored using generative AI tooling?

No

Related issue: #11605

@github-actions github-actions bot added the VELOX label Mar 16, 2026
@acvictor acvictor force-pushed the acvictor/writerChanges branch 6 times, most recently from cb073fd to 19b8d5a March 16, 2026 11:46
@acvictor acvictor force-pushed the acvictor/writerChanges branch from 19b8d5a to 50e0444 March 16, 2026 13:51
@acvictor acvictor marked this pull request as ready for review March 17, 2026 13:57