fix: widen pruned nested struct schemas to preserve Arrow child ordinals #442

Open
butnaruandrei wants to merge 1 commit into lance-format:main from butnaruandrei:read-schema-nested-struct-widening

butnaruandrei commented on Apr 16, 2026

fix: widen pruned nested struct schemas to preserve Arrow child ordinals

Problem

When Spark reads a Lance dataset with a nested struct column (e.g. a metadata struct
{first: String, second: Long}), its optimizer applies column pruning: if a query
only references metadata.second, Spark calls SupportsPushDownRequiredColumns#pruneColumns
with a narrowed schema in which metadata only contains {second}.

The previous implementation of pruneColumns blindly assigned this narrowed schema:

this.schema = requiredSchema; // metadata → {second}

This causes silent data corruption in the vectorized read path. LanceStructAccessor resolves Arrow
struct children by position (ordinal), not by name:

public ColumnVector getChild(int ordinal) {
    return childColumns[ordinal]; // built from vector.getChildByOrdinal(i)
}

Lance always returns the full struct from the native scanner — it has no mechanism to strip
individual sub-fields from a struct column. So the Arrow batch always contains:

child[0] = first  (String)
child[1] = second (Long)

But after pruning, Catalyst believes the struct is {second} only, so it reads ordinal 0
expecting second and gets first instead — wrong type, wrong data, no exception.
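
To make the mismatch concrete, here is a minimal, self-contained sketch in plain Java. It uses neither Spark nor Arrow: `OrdinalMismatchSketch`, `readByOrdinal`, `ordinalIn`, and the sample values are hypothetical and only model a positional child lookup against a schema that may or may not match the delivered batch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the problem; this is not Lance or Spark code.
final class OrdinalMismatchSketch {
    // Children as the native scanner delivers them: always the full struct, in table order.
    static final List<String> ARROW_CHILD_NAMES = Arrays.asList("first", "second");
    static final Object[] ARROW_CHILD_VALUES = {"alice", 42L};

    // Positional lookup, analogous to childColumns[ordinal] in the accessor.
    static Object readByOrdinal(int ordinal) {
        return ARROW_CHILD_VALUES[ordinal];
    }

    // The ordinal a reader derives from the schema it believes the struct has.
    static int ordinalIn(List<String> schemaFieldNames, String field) {
        return schemaFieldNames.indexOf(field);
    }
}
```

With the pruned view `["second"]`, `ordinalIn` yields 0 and `readByOrdinal(0)` returns the String stored for `first`. With the full view `["first", "second"]`, the same field resolves to ordinal 1 and the Long is returned.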

Why top-level column pruning is safe

Top-level column pruning (dropping entire columns) does not have this problem. Lance natively
handles projection at the scanner level and simply does not return vectors for unprojected top-level
columns. The ordinal mismatch only occurs when Lance returns a struct column but Catalyst has a
pruned view of its sub-fields.

Fix

Two changes:

1. LanceScanBuilder — capture the original full table schema before pruneColumns can
overwrite it, then route through ReadSchemaNestedStructWidening:

this.fullTableSchema = schema; // saved at construction time

@Override
public void pruneColumns(StructType requiredSchema) {
    this.schema = ReadSchemaNestedStructWidening.widenRequiredSchema(requiredSchema, fullTableSchema);
}

2. ReadSchemaNestedStructWidening — for each field in the required schema, if it is a
StructType whose fields are a name-subset of the corresponding table struct, replace the pruned
struct with the full table struct definition (restoring Arrow ordinal alignment). Top-level column
pruning is left unchanged. Widening recurses into ArrayType and MapType element/value types.

The subset check and widening are done in a single pass over table fields using a HashMap index
over the required fields, making the operation O(n) in the number of struct sub-fields.
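
The single-pass subset check and widening described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `WideningSketch` and its `Field` class are a hypothetical stand-in for Spark's `StructType`/`StructField`, and the ArrayType/MapType recursion is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal schema model; not the PR's implementation.
final class WideningSketch {
    static final class Field {
        final String name;
        final List<Field> children; // null for a non-struct (leaf) field
        Field(String name, List<Field> children) { this.name = name; this.children = children; }
        boolean isStruct() { return children != null; }
    }

    static Field leaf(String name) { return new Field(name, null); }
    static Field struct(String name, Field... children) {
        return new Field(name, Arrays.asList(children));
    }

    static Map<String, Field> indexByName(List<Field> fields) {
        Map<String, Field> byName = new HashMap<>();
        for (Field f : fields) byName.put(f.name, f);
        return byName;
    }

    // Top level: keep exactly the required columns, in the required order.
    static List<Field> widenRequired(List<Field> required, List<Field> table) {
        Map<String, Field> tableByName = indexByName(table);
        List<Field> out = new ArrayList<>();
        for (Field req : required) {
            Field tbl = tableByName.get(req.name);
            out.add(tbl == null ? req : widenField(req, tbl));
        }
        return out;
    }

    // Nested level: one pass over the table's children, using a HashMap over the required ones.
    static Field widenField(Field required, Field table) {
        if (!required.isStruct() || !table.isStruct()) return required;
        Map<String, Field> requiredByName = indexByName(required.children);
        List<Field> widened = new ArrayList<>();
        int matched = 0;
        for (Field tbl : table.children) {
            Field req = requiredByName.get(tbl.name);
            if (req != null) {
                matched++;
                widened.add(widenField(req, tbl)); // recurse into deeper structs
            } else {
                widened.add(tbl); // restore the pruned sub-field at its table ordinal
            }
        }
        // A required sub-field the table lacks means the schemas diverged: leave it unchanged.
        if (matched != requiredByName.size()) return required;
        return new Field(required.name, widened);
    }

    static List<String> childNames(Field f) {
        List<String> names = new ArrayList<>();
        for (Field c : f.children) names.add(c.name);
        return names;
    }
}
```

In this model, widening a pruned `metadata {second}` against a table `metadata {first, second}` gives children `[first, second]`, while a required struct containing a field the table does not have is returned as-is.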

Note: the schema passed to the Lance native scanner after widening will include all sub-fields of
any nested struct, meaning Lance reads them all from disk even if the query only needs one. This
adds no extra I/O: Lance was already reading them all (it has no sub-struct projection). The bug was
only that the Catalyst schema disagreed with what Arrow actually delivered.

Tests

ReadSchemaNestedStructWideningTest covers four cases:

| Test | What it verifies |
| --- | --- |
| widensSubsetNestedStructToFullTableFieldOrderAndFields | Core case: pruned nested struct is restored to full table field order |
| widensSubsetStructNestedInsideArrayElementType | Widening recurses into ArrayType element structs |
| doesNotWidenTopLevelColumnPruning | Top-level column drops are left untouched |
| doesNotWidenWhenRequiredFieldIsAbsentFromTableSchema | Diverged/evolved schemas are not widened to avoid corruption |

butnaruandrei changed the title from "feat: implement ReadSchemaNestedStructWidening for nested struct handling" to "fix: implement ReadSchemaNestedStructWidening for nested struct handling" on Apr 16, 2026
github-actions bot added the enhancement (New feature or request) and bug (Something isn't working) labels on Apr 16, 2026
- Introduced `ReadSchemaNestedStructWidening` to align Spark's pruned read schema with full Arrow batches from Lance, ensuring correct child ordinals during vectorized execution.
- Updated `LanceScanBuilder` to utilize the new widening logic when pruning columns, preserving the full table schema for nested structures.
- Added unit tests for `ReadSchemaNestedStructWidening` to validate behavior across various scenarios, including nested structs within arrays and handling of absent fields in the table schema.
butnaruandrei force-pushed the read-schema-nested-struct-widening branch from 23992a5 to 86e759e on April 16, 2026 17:19
butnaruandrei changed the title from "fix: implement ReadSchemaNestedStructWidening for nested struct handling" to "fix: widen pruned nested struct schemas to preserve Arrow child ordinals" on Apr 16, 2026