Skip to content

feat: Support Distributed Segmented BTREE Index Building#21

Merged
rchowell merged 2 commits into
daft-engine:mainfrom
beinan:feat/segmented-btree-indices
Jun 5, 2026
Merged

feat: Support Distributed Segmented BTREE Index Building#21
rchowell merged 2 commits into
daft-engine:mainfrom
beinan:feat/segmented-btree-indices

Conversation

@beinan
Copy link
Copy Markdown
Contributor

@beinan beinan commented Jun 5, 2026

Summary

This PR implements Segmented BTREE Index support when building distributed scalar indexes with daft-lance using the new segmented=True parameter.

What's Changing

  • Introduced SegmentedFragmentIndexHandler that builds a fully formed independent lance.Index segment on each worker.
  • Replaced the manual index transaction creation block on the coordinator with a clean, native atomic commit of the gathered index segments using commit_existing_index_segments().
  • Added extensive test coverage for segmented BTree indexes (string/integer columns, multiple segments, replacements, and fallback behaviors).
  • Upgraded the pylance dependency constraint to >=7.0.0 to support the required segment commitment APIs.

Why This is Better

  1. Resolves describe_indices() Crash: The previous manual transaction building block produced indices with empty index_details. Downstream tooling calling describe_indices() would crash. This change preserves proper protobuf-serialized index_details in the committed index segments.
  2. Clean Transactions: Uses Lance's native atomic index commitment mechanics instead of low-level LanceOperation crafting.

🚀 Architectural Benefits for Extremely Large Tables

Building indices on extremely large tables (multi-billion rows, hundreds of gigabytes/terabytes of data) introduces severe scaling and performance bottlenecks. Adopting a segmented index building approach over a single monolithic file provides immense advantages:

1. Zero Shuffling or Global Sorting Overhead

A BTree index is fundamentally a sorted data structure. Under the classic partitioned-and-merged approach, merging parallel partitions requires a global K-way sort merge across workers' outputs.

  • The bottleneck: For multi-terabyte datasets, doing a sort-merge over S3/network page storage creates massive memory pressure and I/O bottlenecks on the coordinator.
  • Segmented benefit: Each worker builds a fully formed, independent, pre-sorted index segment for its local fragments. There is zero coordination, zero shuffling, and zero global sorting required at the end. The merge step is reduced to an $O(1)$ atomic metadata transaction registering the existing segments.

2. Elastic Horizontal Scalability (No Coordinator Bottlenecks)

In the partitioned-and-merged model, the merge phase runs entirely on the coordinator node.

  • The bottleneck: The coordinator becomes a hard bottleneck; as dataset size and worker counts scale up, the coordinator's CPU and memory requirements grow quadratically, often resulting in Out-Of-Memory (OOM) crashes.
  • Segmented benefit: All computation scales horizontally with your Daft workers. The coordinator only executes a lightweight transaction commit registering pre-built segments. You can scale to thousands of workers without bottlenecking the coordinator.

3. Highly Cost-Effective Incremental index Updates (Append-Friendly)

In real-world data lakes, datasets are rarely static; new data is appended constantly.

  • The bottleneck: With a monolithic index, appending new fragments forces a full rewrite of the entire unified index. Re-indexing a 10TB dataset to index a new 50GB append is prohibitively expensive.
  • Segmented benefit: Since the index is segmented, you can build new index segments for only the newly appended fragments, and atomically attach them to the existing index using commit_existing_index_segments. Already-indexed data remains untouched, turning index maintenance costs from $O(\text{Total Dataset})$ to $O(\text{Append Size})$.

4. Resiliency & Fault-Tolerance

If a distributed job fails mid-way:

  • Legacy: All work is lost because a single unified file must be created at the end.
  • Segmented: Individual worker tasks commit fully formed index segments. Failed worker tasks can be retried independently without invalidating or discarding the successfully built segments of other workers.

Test Plan

Ran pytest on local suite:

  • Added comprehensive unit tests in test_lancedb_scalar_index.py verifying multiple segment generation, query correctness, index description, and fallback workflows.
  • All 29 tests pass successfully with no regressions.

🤖 Generated with Claude Code

…e segment API

This change introduces parallel/distributed segmented BTREE index building in daft-lance.
By building full independent index segments per worker and committing them atomically via the dataset's  API, we resolve the issue where describe_indices failed on distributed indexes due to empty index_details metadata.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@beinan beinan force-pushed the feat/segmented-btree-indices branch from 80b79e3 to bca48a3 Compare June 5, 2026 05:23
Refactors the `type: ignore` comment in `daft_lance/lance_scalar_index.py` at line 101 to omit `arg-type` as keyword arguments are now used and the warning is no longer emitted.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@beinan beinan force-pushed the feat/segmented-btree-indices branch from 1c79c59 to c9a7f51 Compare June 5, 2026 06:24
@rchowell
Copy link
Copy Markdown
Contributor

rchowell commented Jun 5, 2026

Thanks for the contribution!

@rchowell rchowell merged commit e7cbf97 into daft-engine:main Jun 5, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants