fix: propagate index_details from distributed index creation #475

Open
LuciferYang wants to merge 1 commit into lance-format:main from LuciferYang:fix/propagate-index-details

LuciferYang (Contributor) commented Apr 23, 2026

Background

After distributed index creation via AddIndexExec, querying zonemap indexes on the read side fails with:

WARN LanceScanBuilder: Failed to query zonemap indexes:
LanceError(Index): Index details are required for index description.
This index must be retrained to support this method.

Root cause: when executors call createIndex, the returned Index object carries index_details (protobuf bytes), but the driver discarded those bytes when merging the executor results. The committed IndexMetadata therefore ends up with empty index_details. The no-arg describeIndices() on the Rust side does not filter such entries out, so it fails with the error above.

Fix

Write side (AddIndexExec)

  • FragmentIndexTask and RangeBTreeIndexBuilder now capture index_details bytes after createIndex and ship them back to the driver via Kryo
  • Driver picks the first non-empty result with IndexUtils.collectFirstIndexDetails() and sets it on Index.Builder.indexDetails() before committing the transaction
  • Extracted IndexUtils.extractIndexDetails() for Java Optional → Scala Option conversion; both the fragment and range paths use the same serialization format
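The driver-side selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `IndexDetailsSketch` and the method signature are assumptions, and the real `IndexUtils.collectFirstIndexDetails()` may take different types. It only shows the "first non-empty result wins" rule applied to the per-task payloads.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the driver-side merge step; not the lance-spark source. */
final class IndexDetailsSketch {
  private IndexDetailsSketch() {}

  /**
   * Returns the first payload that is both present and non-empty, in task order.
   * The signature is an assumption made for this sketch.
   */
  static Optional<byte[]> collectFirstIndexDetails(List<Optional<byte[]>> taskResults) {
    return taskResults.stream()
        .filter(Optional::isPresent)        // drop tasks that returned no details
        .map(Optional::get)
        .filter(bytes -> bytes.length > 0)  // drop empty payloads
        .findFirst();
  }
}
```

If every task returns an empty payload, the result is `Optional.empty()`, and in that case the sketch assumes the caller simply leaves `indexDetails` unset on the builder.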

Read side (LanceScanBuilder)

  • findZonemapIndexedColumns() now calls describeIndices(criteria) with an empty IndexCriteria instead of the no-arg overload, so the Rust side silently skips indexes missing index_details — backward compatible with datasets created before this fix
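To make the difference between the two overloads concrete, here is a small self-contained model of the behavior described above. It does not call the Lance API: `IndexMeta`, `describeAll`, and `describeSkippingLegacy` are hypothetical names, and the real filtering happens on the Rust side. The strict method mimics the no-arg path, which fails on the first index without details. The lenient method mimics the criteria path, which skips such indexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical model of strict vs. lenient index description; not the Lance API. */
final class DescribeIndicesSketch {
  private DescribeIndicesSketch() {}

  /** Simplified index metadata entry (assumed shape, for illustration only). */
  static final class IndexMeta {
    final String name;
    final Optional<byte[]> details;

    IndexMeta(String name, Optional<byte[]> details) {
      this.name = name;
      this.details = details;
    }

    boolean hasDetails() {
      return details.isPresent() && details.get().length > 0;
    }
  }

  /** Models the no-arg path: any index without details aborts the whole call. */
  static List<String> describeAll(List<IndexMeta> indices) {
    List<String> described = new ArrayList<>();
    for (IndexMeta meta : indices) {
      if (!meta.hasDetails()) {
        throw new IllegalStateException(
            "Index details are required for index description: " + meta.name);
      }
      described.add(meta.name);
    }
    return described;
  }

  /** Models the criteria path: indexes without details are skipped. */
  static List<String> describeSkippingLegacy(List<IndexMeta> indices) {
    List<String> described = new ArrayList<>();
    for (IndexMeta meta : indices) {
      if (meta.hasDetails()) {
        described.add(meta.name);
      }
    }
    return described;
  }
}
```

With one legacy index and one index created after this fix, the strict method throws while the lenient one returns only the new index, which is the backward-compatible outcome the read-side change relies on.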

Tests

  • testBTreeIndexHasIndexDetails — fragment-based btree path
  • testRangeBTreeIndexHasIndexDetails — range-based btree path
  • testFtsIndexHasIndexDetails — FTS (inverted) path

All three verify that index_details is populated, describeIndices(criteria) returns the correct index type, and the no-arg describeIndices() no longer throws.

github-actions bot added the bug label on Apr 23, 2026
…defensive describeIndices

Root cause: AddIndexExec discarded protobuf index_details bytes from
worker-created indexes, causing describeIndices() to fail with
"Index details are required for index description" on the read path.

Write-side fix (AddIndexExec):
- FragmentIndexTask and RangeBTreeIndexBuilder now capture index_details
  bytes from the created Index and return them to the driver
- Driver sets index_details on the Index.Builder before committing
- Extract IndexUtils.extractIndexDetails() (Java Optional → Scala Option)
  and IndexUtils.collectFirstIndexDetails() to eliminate duplication and
  unify the serialization format across fragment-based and range-based paths

Read-side fix (LanceScanBuilder):
- findZonemapIndexedColumns() uses describeIndices(criteria) with empty
  IndexCriteria so the Rust side filters out legacy indexes that lack
  index_details, instead of failing

Tests:
- Add testBTreeIndexHasIndexDetails, testRangeBTreeIndexHasIndexDetails,
  testFtsIndexHasIndexDetails covering all three index creation paths
- Shared verifyIndexDetails() helper checks index_details presence,
  criteria-based and no-arg describeIndices success
LuciferYang force-pushed the fix/propagate-index-details branch from 911a24f to 158c3c2 on April 23, 2026 11:12
fangbo (Collaborator) commented Apr 24, 2026

Great fix!

LuciferYang added a commit to LuciferYang/lance-spark that referenced this pull request Apr 24, 2026
…e index

Extends TpcdsBenchmarkRunner to sweep DFP state per lance run:

- --dfp-mode {on,off,both,default} flag. When "both", each lance query is
  run twice (once pinning spark.lance.runtime.filtering.enabled=true, once
  false); non-lance formats run once with dfp_mode="n/a"
- dfp_mode column in the output CSV
- BenchmarkReporter emits a separate "DFP On-vs-Off Comparison (Lance only)"
  table with per-query median wall-clock, speedup, fragmentsScanned on/off,
  pruning %, plus a geometric-mean summary
- TpcdsQueryRunner now captures the fragmentsScanned SQLMetric by walking
  the executed plan (reaching through AdaptiveSparkPlanExec via reflection)
  and stores it on QueryMetrics.lanceFragmentsScanned

Supporting pieces:

- DfpClusterRebuilder: Spark job that re-sorts a Lance table by a target
  column so downstream zonemap index bounds are tight per fragment (DFP
  benefit requires fact-side data clustering)
- index/store_sales.sql: fact-side btree indexes on ss_sold_date_sk,
  ss_item_sk, ss_customer_sk, ss_store_sk, ss_promo_sk — the join keys DFP
  needs to operate on
- TpcdsIndexBuilder adds store_sales to the table list so the above resource
  is applied by the standard indexing step
- run-benchmark.sh forwards a DFP_MODE env var through to --dfp-mode

End-to-end DFP activity is still blocked upstream by Lance's btree
getZonemapStats returning "must be retrained" for indexes built with
lance-core 6.0.0-beta.2 — see lance-format#475
which propagates index_details from distributed index creation and uses the
safer describeIndices(criteria) overload. Once lance-format#475 merges this harness
produces real pruning numbers.
Labels: bug (Something isn't working)
