fix: widen pruned nested struct schemas to preserve Arrow child ordinals #442
Open
butnaruandrei wants to merge 1 commit into lance-format:main from
Conversation
- Introduced `ReadSchemaNestedStructWidening` to align Spark's pruned read schema with the full Arrow batches returned by Lance, ensuring correct child ordinals during vectorized execution.
- Updated `LanceScanBuilder` to use the new widening logic when pruning columns, preserving the full table schema for nested structures.
- Added unit tests for `ReadSchemaNestedStructWidening` to validate behavior across various scenarios, including nested structs within arrays and handling of fields absent from the table schema.
Force-pushed from 23992a5 to 86e759e
ivscheianu approved these changes on Apr 17, 2026
fix: widen pruned nested struct schemas to preserve Arrow child ordinals
Problem
When Spark reads a Lance dataset with a nested struct column (e.g. a `metadata` struct
`{first: String, second: Long}`), its optimizer applies column pruning: if a query only references
`metadata.second`, Spark calls `SupportsPushDownRequiredColumns#pruneColumns` with a narrowed schema
in which `metadata` only contains `{second}`. The previous implementation of `pruneColumns` blindly
assigned this narrowed schema, which causes silent data corruption in the vectorized read path.
`LanceStructAccessor` indexes Arrow struct children by ordinal, i.e. positionally.
Lance always returns the full struct from the native scanner; it has no mechanism to strip
individual sub-fields from a struct column. So the Arrow batch always contains both
`first` and `second`.
But after pruning, Catalyst believes the struct is `{second}` only, so it reads ordinal 0 expecting
`second` and gets `first` instead: wrong type, wrong data, no exception.
Why top-level column pruning is safe
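The mismatch can be reproduced with a toy model (plain Python, not the connector's actual classes; the variable names are illustrative):

```python
# Toy model of the bug: Arrow struct children are stored positionally, and
# the accessor resolves a field's ordinal from the *Catalyst* schema, not
# from the Arrow schema that actually describes the batch.

# Full struct as Lance returns it from the native scanner (always all children):
arrow_struct_children = ["first_values", "second_values"]  # ordinals 0 and 1
arrow_child_names = ["first", "second"]

# Pruned Catalyst schema after pruneColumns: metadata only contains {second}
pruned_schema = ["second"]

# A LanceStructAccessor-style lookup: ordinal taken from the pruned schema
ordinal_of_second = pruned_schema.index("second")      # 0 in the pruned view
value_read = arrow_struct_children[ordinal_of_second]  # reads Arrow child 0

print(value_read)  # "first_values": wrong child, and no exception is raised
```

The read succeeds silently because ordinal 0 is a perfectly valid child index; only the name (and possibly the type) is wrong.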
Top-level column pruning (dropping entire columns) does not have this problem. Lance natively
handles projection at the scanner level and simply does not return vectors for unprojected top-level
columns. The ordinal mismatch only occurs when Lance returns a struct column but Catalyst has a
pruned view of its sub-fields.
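In the same toy terms, a sketch of why whole-column projection stays aligned (hypothetical names, assuming the scanner drops unprojected columns entirely):

```python
# Toy sketch, not connector code: when projection is handled by the native
# scanner, the batch simply omits unprojected top-level columns, so every
# column in the Catalyst schema is present in the batch under the same name.

table_columns = ["id", "metadata", "payload"]
projection = ["metadata", "payload"]  # pushed down to the Lance scanner

# The scanner returns vectors only for projected columns; no struct is ever
# partially stripped, so no ordinal can drift inside a column.
batch_columns = [c for c in table_columns if c in projection]

print(batch_columns)  # ['metadata', 'payload'] matches the pruned schema
```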
Fix
Two changes:
1. `LanceScanBuilder`: capture the original full table schema before `pruneColumns` can
overwrite it, then route the pruned schema through `ReadSchemaNestedStructWidening`.
2. `ReadSchemaNestedStructWidening`: for each field in the required schema, if it is a
`StructType` whose fields are a name-subset of the corresponding table struct, replace the pruned
struct with the full table struct definition (restoring Arrow ordinal alignment). Top-level column
pruning is left unchanged. Widening recurses into `ArrayType` and `MapType` element/value types.

The subset check and widening are done in a single pass over table fields using a `HashMap` index
over the required fields, making the operation O(n) in the number of struct sub-fields.
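A minimal sketch of the widening rule over a toy schema representation (dicts and tuples rather than Spark's `StructType`; `widen` and `widen_schema` are invented names, not the PR's API):

```python
# Illustrative sketch only. Structs are {"struct": [(name, dtype), ...]},
# arrays are {"array": elem_type}, and leaf types are plain strings.

def _is(kind, t):
    return isinstance(t, dict) and kind in t

def widen(required, table):
    """If `required` is a struct whose field names are a subset of the table
    struct's names, replace it with the full table struct, restoring Arrow
    child ordinals. Recurse into array element types."""
    if _is("struct", required) and _is("struct", table):
        table_names = {n for n, _ in table["struct"]}
        if all(n in table_names for n, _ in required["struct"]):
            return table        # full definition, full ordinals
        return required         # a required field is absent: leave unchanged
    if _is("array", required) and _is("array", table):
        return {"array": widen(required["array"], table["array"])}
    return required

def widen_schema(required_fields, table_fields):
    """Top-level pruning is preserved: only required columns survive, but
    each one is widened against its table counterpart via a hash index."""
    table_index = dict(table_fields)  # O(1) lookups, single pass overall
    return [(n, widen(t, table_index[n])) if n in table_index else (n, t)
            for n, t in required_fields]

table = [("id", "long"),
         ("metadata", {"struct": [("first", "string"), ("second", "long")]})]
required = [("metadata", {"struct": [("second", "long")]})]

widened = widen_schema(required, table)
print(widened)
# [('metadata', {'struct': [('first', 'string'), ('second', 'long')]})]
```

The real implementation operates on Catalyst types and also recurses into `MapType` value types; the sketch shows only structs and arrays.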
Note: the schema passed to the Lance native scanner after widening will include all sub-fields of
any nested struct, meaning Lance reads them all from disk even if the query only needs one. This is
correct — Lance was already reading them all (it has no sub-struct projection), and the bug was that
the Catalyst schema disagreed with what Arrow actually delivered.
Tests
`ReadSchemaNestedStructWideningTest` covers four cases:

- `widensSubsetNestedStructToFullTableFieldOrderAndFields`
- `widensSubsetStructNestedInsideArrayElementType` (`ArrayType` element structs)
- `doesNotWidenTopLevelColumnPruning`
- `doesNotWidenWhenRequiredFieldIsAbsentFromTableSchema`