fix: propagate index_details from distributed index creation #475
Open
LuciferYang wants to merge 1 commit into lance-format:main
…defensive describeIndices

Root cause: AddIndexExec discarded protobuf index_details bytes from worker-created indexes, causing describeIndices() to fail with "Index details are required for index description" on the read path.

Write-side fix (AddIndexExec):
- FragmentIndexTask and RangeBTreeIndexBuilder now capture index_details bytes from the created Index and return them to the driver
- Driver sets index_details on the Index.Builder before committing
- Extract IndexUtils.extractIndexDetails() (Java Optional → Scala Option) and IndexUtils.collectFirstIndexDetails() to eliminate duplication and unify the serialization format across fragment-based and range-based paths

Read-side fix (LanceScanBuilder):
- findZonemapIndexedColumns() uses describeIndices(criteria) with empty IndexCriteria so the Rust side filters out legacy indexes that lack index_details, instead of failing

Tests:
- Add testBTreeIndexHasIndexDetails, testRangeBTreeIndexHasIndexDetails, testFtsIndexHasIndexDetails covering all three index creation paths
- Shared verifyIndexDetails() helper checks index_details presence, criteria-based and no-arg describeIndices success
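As a rough illustration of the write-side idea in this commit message, the sketch below shows a driver picking the first non-empty index_details payload out of the per-worker results. It is not the lance-spark implementation: WorkerIndexResult, IndexDetailsCollector, and collectFirst are hypothetical names used only for this example, and the real IndexUtils.collectFirstIndexDetails() may behave differently.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for what a worker task sends back to the driver.
// Field names are illustrative and not taken from the lance-spark source.
final class WorkerIndexResult {
    final int fragmentId;
    final byte[] indexDetails; // serialized protobuf bytes; null or empty if absent

    WorkerIndexResult(int fragmentId, byte[] indexDetails) {
        this.fragmentId = fragmentId;
        this.indexDetails = indexDetails;
    }
}

final class IndexDetailsCollector {
    // Returns the first non-empty index_details payload, or empty if no
    // worker produced one (assumed here to mean "leave the field unset").
    static Optional<byte[]> collectFirst(List<WorkerIndexResult> results) {
        return results.stream()
            .map(r -> r.indexDetails)
            .filter(d -> d != null && d.length > 0)
            .findFirst();
    }
}
```

Under this assumption the driver would only set the builder field when the returned Optional is present, so a run in which no worker reported details does not write an empty payload.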
911a24f to 158c3c2
Collaborator: Great fix!
LuciferYang added a commit to LuciferYang/lance-spark that referenced this pull request (Apr 24, 2026)
…e index
Extends TpcdsBenchmarkRunner to sweep DFP state per lance run:
- --dfp-mode {on,off,both,default} flag. When "both", each lance query is
run twice (once pinning spark.lance.runtime.filtering.enabled=true, once
false); non-lance formats run once with dfp_mode="n/a"
- dfp_mode column in the output CSV
- BenchmarkReporter emits a separate "DFP On-vs-Off Comparison (Lance only)"
table with per-query median wall-clock, speedup, fragmentsScanned on/off,
pruning %, plus a geometric-mean summary
- TpcdsQueryRunner now captures the fragmentsScanned SQLMetric by walking
the executed plan (reaching through AdaptiveSparkPlanExec via reflection)
and stores it on QueryMetrics.lanceFragmentsScanned
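The comparison table described in the list above can be sketched roughly as follows. This is only an illustration of the arithmetic: the class name DfpSummary is hypothetical, and the definitions used here (speedup as off-time divided by on-time, pruning % as the share of fragments no longer scanned when DFP is on) are assumptions, since the commit message does not spell them out.

```java
import java.util.Arrays;

final class DfpSummary {
    // Median of the wall-clock samples for one query.
    static double median(double[] samples) {
        double[] s = samples.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Assumed definition: values above 1.0 mean DFP-on was faster.
    static double speedup(double offMedianMs, double onMedianMs) {
        return offMedianMs / onMedianMs;
    }

    // Assumed definition: percentage of fragments skipped with DFP on.
    static double pruningPercent(long fragmentsOff, long fragmentsOn) {
        if (fragmentsOff == 0) {
            return 0.0;
        }
        return 100.0 * (fragmentsOff - fragmentsOn) / fragmentsOff;
    }

    // Geometric mean of per-query speedups, computed in log space.
    static double geometricMean(double[] values) {
        double logSum = 0.0;
        for (double v : values) {
            logSum += Math.log(v);
        }
        return Math.exp(logSum / values.length);
    }
}
```

A geometric mean is the usual choice for summarizing ratios such as speedups, because a 2x gain and a 2x loss then cancel out to 1.0.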
Supporting pieces:
- DfpClusterRebuilder: Spark job that re-sorts a Lance table by a target
column so downstream zonemap index bounds are tight per fragment (DFP
benefit requires fact-side data clustering)
- index/store_sales.sql: fact-side btree indexes on ss_sold_date_sk,
ss_item_sk, ss_customer_sk, ss_store_sk, ss_promo_sk — the join keys DFP
needs to operate on
- TpcdsIndexBuilder adds store_sales to the table list so the above resource
is applied by the standard indexing step
- run-benchmark.sh forwards a DFP_MODE env var through to --dfp-mode
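For readers unfamiliar with the --dfp-mode sweep described in this commit message, a minimal sketch of how the flag could expand into runs is shown below. DfpModePlan, runLabels, and pinnedValue are hypothetical names; the assumption that "default" leaves the config unset is also mine. Only the flag values and the spark.lance.runtime.filtering.enabled key come from the commit message.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class DfpModePlan {
    // Config key named in the commit message.
    static final String CONF_KEY = "spark.lance.runtime.filtering.enabled";

    // One dfp_mode label per run of a query. Non-lance formats run once
    // and are labeled "n/a"; "both" yields two runs for lance.
    static List<String> runLabels(String dfpMode, boolean isLanceFormat) {
        if (!isLanceFormat) {
            return Collections.singletonList("n/a");
        }
        switch (dfpMode) {
            case "on":
                return Collections.singletonList("on");
            case "off":
                return Collections.singletonList("off");
            case "both":
                return Arrays.asList("on", "off");
            case "default":
                return Collections.singletonList("default");
            default:
                throw new IllegalArgumentException("unknown --dfp-mode: " + dfpMode);
        }
    }

    // Value to pin for CONF_KEY in a given run; empty means the config is
    // left untouched (assumed behavior for "default" and "n/a").
    static Optional<Boolean> pinnedValue(String label) {
        switch (label) {
            case "on":
                return Optional.of(true);
            case "off":
                return Optional.of(false);
            default:
                return Optional.empty();
        }
    }
}
```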
End-to-end DFP activity is still blocked upstream by Lance's btree
getZonemapStats returning "must be retrained" for indexes built with
lance-core 6.0.0-beta.2. See lance-format#475, which propagates
index_details from distributed index creation and uses the safer
describeIndices(criteria) overload. Once lance-format#475 merges, this
harness produces real pruning numbers.
Background
After distributed index creation via AddIndexExec, querying zonemap indexes on the read side fails with "Index details are required for index description".

The root cause: when executors call createIndex, the returned Index object carries index_details (protobuf bytes), but those bytes were discarded when merging results back on the driver. The committed IndexMetadata ends up with empty index_details. The no-arg describeIndices() on the Rust side doesn't filter these out and blows up.

Fix
Write side (AddIndexExec)
- FragmentIndexTask and RangeBTreeIndexBuilder now capture index_details bytes after createIndex and ship them back to the driver via Kryo
- The driver calls IndexUtils.collectFirstIndexDetails() and sets the result on Index.Builder.indexDetails() before committing the transaction
- IndexUtils.extractIndexDetails() handles the Java Optional → Scala Option conversion; both the fragment and range paths use the same serialization format

Read side (LanceScanBuilder)
- findZonemapIndexedColumns() now calls describeIndices(criteria) with an empty IndexCriteria instead of the no-arg overload, so the Rust side silently skips indexes missing index_details. This is backward compatible with datasets created before this fix.

Tests
- testBTreeIndexHasIndexDetails: fragment-based btree path
- testRangeBTreeIndexHasIndexDetails: range-based btree path
- testFtsIndexHasIndexDetails: FTS (inverted) path

All three verify that index_details is populated, describeIndices(criteria) returns the correct index type, and the no-arg describeIndices() no longer throws.
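To make the read-side difference concrete, here is a small model of the two behaviors described above: a strict describe that fails on an index with no index_details, and a lenient one that skips it. This is a sketch under stated assumptions and not the Lance Rust or Java code; IndexMeta, DescribeSketch, and both method names are hypothetical, and only the error text is taken from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of committed index metadata.
final class IndexMeta {
    final String name;
    final byte[] indexDetails; // null or empty for legacy indexes

    IndexMeta(String name, byte[] indexDetails) {
        this.name = name;
        this.indexDetails = indexDetails;
    }

    boolean hasDetails() {
        return indexDetails != null && indexDetails.length > 0;
    }
}

final class DescribeSketch {
    // Models the failing path: the first index without details aborts the call.
    static List<String> describeAll(List<IndexMeta> indexes) {
        List<String> names = new ArrayList<>();
        for (IndexMeta m : indexes) {
            if (!m.hasDetails()) {
                throw new IllegalStateException(
                    "Index details are required for index description");
            }
            names.add(m.name);
        }
        return names;
    }

    // Models the defensive path: indexes without details are skipped.
    static List<String> describeSkippingLegacy(List<IndexMeta> indexes) {
        List<String> names = new ArrayList<>();
        for (IndexMeta m : indexes) {
            if (m.hasDetails()) {
                names.add(m.name);
            }
        }
        return names;
    }
}
```

In this model a dataset that mixes one legacy index and one new index makes describeAll throw, while describeSkippingLegacy returns only the new index, which is the backward-compatible behavior the read-side change relies on.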