Skip to content

GH-3457: Validate column index for null_pages/null_counts contradiction#3458

Open
clee704 wants to merge 2 commits intoapache:masterfrom
clee704:validate-column-index-null-pages
Open

GH-3457: Validate column index for null_pages/null_counts contradiction#3458
clee704 wants to merge 2 commits intoapache:masterfrom
clee704:validate-column-index-null-pages

Conversation

@clee704
Copy link
Copy Markdown

@clee704 clee704 commented Mar 28, 2026

Fixes #3457

What changes were proposed in this pull request?

Add validation in ColumnIndexBuilder.build(PrimitiveType) to detect a contradictory column index where null_pages[i] is true (page consists entirely of null values) but null_counts[i] is 0 (page has zero null values). When detected, build() returns null, following the existing pattern for invalid min/max values.

Also fix a pre-existing NPE in the static build() method where build(type) could return null but was not null-checked before accessing boundaryOrder.

Why are the changes needed?

Column index filtering silently excludes pages with this contradiction from query results for all predicates:

  • Non-null predicates (e.g., WHERE col = 50): BoundaryOrder comparators iterate over pageIndexes, which omits pages where null_pages[i] is true. These pages are never evaluated and their rows are excluded.
  • Null predicates (e.g., WHERE col IS NULL): ColumnIndexBase.visit(Eq) checks nullCounts[pageIndex] > 0, which returns false when null_counts is 0. The page is excluded.

The result is silent data loss with no error or warning. Only unfiltered reads return correct results.

ColumnIndexBuilder.build(PrimitiveType) already returns null for other kinds of invalid metadata (empty pages, invalid min/max values). This adds one more validation of the same kind.

How was this patch tested?

3 new test methods (15 assertions) in TestColumnIndexBuilder:

  • testBuildReturnsNullForNullPageCountContradiction: 5 rejection cases — contradiction at first/middle/last page, single page, all pages
  • testBuildPreservesValidColumnIndex: 6 preservation cases — legitimate null pages, all non-null pages, single pages, boundary null_counts=1
  • testBuildWithoutNullCountsIsNotRejected: null_counts absent (optional field) is not rejected

A null_pages entry of true indicates the page consists entirely of null values;
a corresponding null_counts entry of zero contradicts this, as it indicates the
page has no null values at all. This inconsistency is a sign of invalid metadata
that would cause incorrect page skipping during column index filtering —
the reader treats such pages as all-null and silently excludes them from query
results for predicates requiring non-null values.

Add validation in ColumnIndexBuilder.build(PrimitiveType) to detect this
contradiction and return null, consistent with the existing pattern for invalid
min/max values. The null propagates through the read path, causing the filter to
fall back to reading all pages for the affected column.

Also fix a pre-existing NPE in the static build() method where build(type)
could return null but was not null-checked before accessing boundaryOrder.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wgtmac wgtmac changed the title Validate column index for null_pages/null_counts contradiction GH-3457: Validate column index for null_pages/null_counts contradiction Apr 19, 2026
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @clee704 for improving this! While I doubt that if this could really happen, but it would be good for the reader to gracefully deal with malformed stats.

…n/columnindex/ColumnIndexBuilder.java

Co-authored-by: Gang Wu <ustcwg@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Column index with null_pages=true and null_counts=0 causes silent data loss during filter pushdown

2 participants