fix: widen pruned nested struct schemas to preserve Arrow child ordinals by butnaruandrei · Pull Request #442 · lance-format/lance-spark

butnaruandrei · 2026-04-16T17:18:44Z

fix: widen pruned nested struct schemas to preserve Arrow child ordinals

Problem

When Spark reads a Lance dataset with a nested struct column (e.g. a metadata struct
{first: String, second: Long}), its optimizer applies column pruning: if a query
only references metadata.second, Spark calls SupportsPushDownRequiredColumns#pruneColumns
with a narrowed schema in which metadata only contains {second}.

The previous implementation of pruneColumns blindly assigned this narrowed schema:

this.schema = requiredSchema; // metadata → {second}

This causes silent data corruption in the vectorized read path. LanceStructAccessor indexes Arrow
struct children by ordinal — positionally:

public ColumnVector getChild(int ordinal) {
    return childColumns[ordinal]; // built from vector.getChildByOrdinal(i)
}

Lance always returns the full struct from the native scanner — it has no mechanism to strip
individual sub-fields from a struct column. So the Arrow batch always contains:

child[0] = first  (String)
child[1] = second (Long)

But after pruning, Catalyst believes the struct is {second} only, so it reads ordinal 0
expecting second and gets first instead — wrong type, wrong data, no exception.
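
The mismatch can be reproduced without Spark or Arrow. The following is a minimal sketch, not the connector's code: `OrdinalMismatchDemo`, `ARROW_CHILDREN`, `childAt`, and `resolve` are hypothetical names, and a plain list of field names stands in for the Arrow struct's child vectors.

```java
import java.util.List;

class OrdinalMismatchDemo {
    // Stand-in for the Arrow batch: both children are always present, in table order.
    static final List<String> ARROW_CHILDREN = List.of("first", "second");

    // Purely positional lookup, mirroring what getChild(ordinal) does.
    static String childAt(int ordinal) {
        return ARROW_CHILDREN.get(ordinal);
    }

    // Returns the Arrow child a reader lands on when it resolves fieldName
    // against the Catalyst-side view of the struct.
    static String resolve(List<String> catalystFields, String fieldName) {
        int ordinal = catalystFields.indexOf(fieldName);
        return childAt(ordinal);
    }

    public static void main(String[] args) {
        // Pruned view {second}: ordinal 0 points at the wrong child.
        System.out.println(resolve(List.of("second"), "second"));          // prints "first"
        // Full view {first, second}: ordinal 1 points at the right child.
        System.out.println(resolve(List.of("first", "second"), "second")); // prints "second"
    }
}
```

With the pruned view the lookup succeeds but lands on the wrong child, which is why the failure is silent rather than an exception.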

Why top-level column pruning is safe

Top-level column pruning (dropping entire columns) does not have this problem. Lance natively
handles projection at the scanner level and simply does not return vectors for unprojected top-level
columns. The ordinal mismatch only occurs when Lance returns a struct column but Catalyst has a
pruned view of its sub-fields.
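
As a rough illustration of why the top-level case stays aligned, the sketch below models scanner-level projection with plain lists. `TopLevelProjectionDemo`, `scan`, and `aligned` are hypothetical names, and returning columns in the requested order is an assumption made for the example rather than a statement about Lance's scanner.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelProjectionDemo {
    // Hypothetical stand-in for scanner-level projection: only requested
    // top-level columns come back (assumed here to be in the requested order).
    static List<String> scan(List<String> tableColumns, List<String> projected) {
        List<String> batch = new ArrayList<>();
        for (String col : projected) {
            if (tableColumns.contains(col)) {
                batch.add(col);
            }
        }
        return batch;
    }

    // The read is positionally safe when Catalyst's view equals the batch layout.
    static boolean aligned(List<String> catalystColumns, List<String> batchColumns) {
        return catalystColumns.equals(batchColumns);
    }

    public static void main(String[] args) {
        List<String> table = List.of("id", "metadata", "payload");
        List<String> required = List.of("metadata");
        // The scanner drops id and payload, so schema and batch agree.
        System.out.println(aligned(required, scan(table, required))); // prints "true"
    }
}
```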

Fix

Two changes:

1. LanceScanBuilder — capture the original full table schema before pruneColumns can
overwrite it, then route through ReadSchemaNestedStructWidening:

this.fullTableSchema = schema; // saved at construction time

@Override
public void pruneColumns(StructType requiredSchema) {
    this.schema = ReadSchemaNestedStructWidening.widenRequiredSchema(requiredSchema, fullTableSchema);
}

2. ReadSchemaNestedStructWidening — for each field in the required schema, if it is a
StructType whose fields are a name-subset of the corresponding table struct, replace the pruned
struct with the full table struct definition (restoring Arrow ordinal alignment). Top-level column
pruning is left unchanged. Widening recurses into ArrayType and MapType element/value types.

The subset check and widening are done in a single pass over table fields using a HashMap index
over the required fields, making the operation O(n) in the number of struct sub-fields.
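
The single-pass subset check and widening can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in the PR: `WideningSketch`, `Field`, `widenRequired`, `widenField`, and `describe` are hypothetical, the `Field` class replaces Spark's `StructType`/`StructField`, and recursion into `ArrayType` and `MapType` is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class WideningSketch {
    // Hypothetical minimal schema model: children == null means a leaf field.
    static final class Field {
        final String name;
        final List<Field> children;
        Field(String name, List<Field> children) { this.name = name; this.children = children; }
        static Field leaf(String name) { return new Field(name, null); }
        static Field struct(String name, Field... children) { return new Field(name, Arrays.asList(children)); }
        boolean isStruct() { return children != null; }
    }

    static Map<String, Field> indexByName(List<Field> fields) {
        Map<String, Field> byName = new HashMap<>();
        for (Field f : fields) byName.put(f.name, f);
        return byName;
    }

    // Top level: keep exactly the required columns, widening only their nested structs.
    static List<Field> widenRequired(List<Field> required, List<Field> table) {
        Map<String, Field> tableByName = indexByName(table);
        List<Field> out = new ArrayList<>();
        for (Field req : required) {
            Field tab = tableByName.get(req.name);
            out.add(tab == null ? req : widenField(req, tab));
        }
        return out;
    }

    // One pass over the table struct's fields, using a HashMap index of the required fields.
    static Field widenField(Field req, Field tab) {
        if (!req.isStruct() || !tab.isStruct()) return req;
        Map<String, Field> reqByName = indexByName(req.children);
        List<Field> widened = new ArrayList<>();
        int matched = 0;
        for (Field t : tab.children) {
            Field r = reqByName.get(t.name);
            if (r == null) {
                widened.add(t);                 // pruned away by Spark: restore from the table
            } else {
                matched++;
                widened.add(widenField(r, t));  // recurse into nested structs
            }
        }
        // If a required field is missing from the table struct, it is not a subset: leave it alone.
        return matched == reqByName.size() ? new Field(req.name, widened) : req;
    }

    static String describe(Field f) {
        if (!f.isStruct()) return f.name;
        return f.name + f.children.stream().map(WideningSketch::describe)
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        List<Field> table = List.of(Field.leaf("id"),
                Field.struct("metadata", Field.leaf("first"), Field.leaf("second")));
        List<Field> required = List.of(Field.struct("metadata", Field.leaf("second")));
        System.out.println(describe(widenRequired(required, table).get(0))); // prints "metadata{first,second}"
    }
}
```

Counting matches while iterating over the table fields is what lets the subset check and the rebuild share one loop. The unrequested top-level column `id` is not added back, which matches the rule that top-level pruning is left unchanged.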

Note: the schema passed to the Lance native scanner after widening will include all sub-fields of
any nested struct, meaning Lance reads them all from disk even if the query only needs one. This is
correct — Lance was already reading them all (it has no sub-struct projection), and the bug was that
the Catalyst schema disagreed with what Arrow actually delivered.

Tests

ReadSchemaNestedStructWideningTest covers four cases:

| Test | What it verifies |
| --- | --- |
| `widensSubsetNestedStructToFullTableFieldOrderAndFields` | Core case: pruned nested struct is restored to full table field order |
| `widensSubsetStructNestedInsideArrayElementType` | Widening recurses into `ArrayType` element structs |
| `doesNotWidenTopLevelColumnPruning` | Top-level column drops are left untouched |
| `doesNotWidenWhenRequiredFieldIsAbsentFromTableSchema` | Diverged/evolved schemas are not widened to avoid corruption |

- Introduced `ReadSchemaNestedStructWidening` to align Spark's pruned read schema with full Arrow batches from Lance, ensuring correct child ordinals during vectorized execution.
- Updated `LanceScanBuilder` to utilize the new widening logic when pruning columns, preserving the full table schema for nested structures.
- Added unit tests for `ReadSchemaNestedStructWidening` to validate behavior across various scenarios, including nested structs within arrays and handling of absent fields in the table schema.

butnaruandrei changed the title ~~feat: implement ReadSchemaNestedStructWidening for nested struct handling~~ fix: implement ReadSchemaNestedStructWidening for nested struct handling Apr 16, 2026

github-actions bot added the enhancement and bug labels Apr 16, 2026

butnaruandrei force-pushed the read-schema-nested-struct-widening branch from 23992a5 to 86e759e Compare April 16, 2026 17:19

butnaruandrei changed the title ~~fix: implement ReadSchemaNestedStructWidening for nested struct handling~~ fix: widen pruned nested struct schemas to preserve Arrow child ordinals Apr 16, 2026

ivscheianu approved these changes Apr 17, 2026

butnaruandrei wants to merge 1 commit into lance-format:main from butnaruandrei:read-schema-nested-struct-widening
