feat: add param rows_per_range for range-based btree index built #439

fangbo wants to merge 2 commits into lance-format:main

Conversation
hamersaw
left a comment
Appreciate the add! I think overall this looks good, we just need to understand the implications of moving the dataset initialization.
  val dataset = Utils.openDatasetBuilder(readOptions).build()
  ...
  // Create distributed index job and run it
- createIndexJob(lanceDataset, readOptions, uuid.toString, fragmentIds).run()
+ createIndexJob(dataset, lanceDataset, readOptions, uuid.toString, fragmentIds).run()
This change means if createIndexJob throws an exception then dataset is never closed, because it's outside of the try-finally.

Also, I think there may be a bigger issue where reading the manifest before createIndexJob can potentially miss index entries because it runs on a stale handle. We need to make sure this results in correct, performant execution.
> This change means if createIndexJob throws an exception then dataset is never closed, because it's outside of the try-finally.

Thank you for pointing this out; I have fixed it.
> Also, I think there may be a bigger issue where reading the manifest before createIndexJob can potentially miss index entries because it runs on a stale handle. We need to make sure this results in correct, performant execution.
I think readOptions is not changed while the index-building job is created and running, so logically the whole process operates on the same dataset version. In my opinion there is no problem here. If I've misunderstood, please point it out. Thank you.
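To illustrate the resource-management point the reviewer raised, here is a minimal, language-neutral sketch in Python (the actual change is Scala/Spark code; FakeDataset, build_index, and create_index_job are hypothetical stand-ins for the dataset handle, the calling method, and createIndexJob). The point it shows: a handle opened before the try block must still be closed in the finally clause, so it is released even when the job throws.

```python
class FakeDataset:
    """Hypothetical stand-in for the dataset handle; tracks whether close() ran."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def build_index(create_index_job):
    # Analogous to: val dataset = Utils.openDatasetBuilder(readOptions).build()
    ds = FakeDataset()
    try:
        # Analogous to: createIndexJob(dataset, ...).run()
        create_index_job(ds)
    finally:
        # Runs whether or not the job threw, so the handle never leaks.
        ds.close()
    return ds
```

If the open call sat outside the try-finally and the job raised, close() would be skipped; moving the close into finally (as the author says was done) guarantees cleanup on both the success and failure paths.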
Background
Currently the partition (range) number is configured by a Spark parameter such as spark.sql.adaptive.coalescePartitions.initialPartitionNum or spark.sql.shuffle.partitions when building a btree index in range mode. The current approach cannot dynamically adjust the number of ranges based on changes in the total row count of the Dataset, which becomes quite inconvenient when the Dataset's total row count continues to grow.
Design
So we add a new parameter, rows_per_range, for range mode. This parameter specifies the number of rows per range; the Spark partition number is calculated as Dataset.total_rows / rows_per_range. This way the number of ranges adjusts dynamically with the Dataset's row count.

@hamersaw @puchengy Could you please take a look and see if this makes sense? Thank you.
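The partition calculation above can be sketched as follows (a Python sketch, not the PR's Scala code; num_partitions is a hypothetical helper, and rounding up so trailing rows still get a range is my assumption, since the description only says total_rows / rows_per_range):

```python
import math

def num_partitions(total_rows: int, rows_per_range: int) -> int:
    """Hypothetical helper: derive the Spark partition (range) count from the
    dataset's row count, as the PR describes: total_rows / rows_per_range.
    Rounded up (an assumption) so a partial tail still gets its own range."""
    if rows_per_range <= 0:
        raise ValueError("rows_per_range must be positive")
    return max(1, math.ceil(total_rows / rows_per_range))

# As the dataset grows, the range count grows with it,
# with no Spark config change needed:
print(num_partitions(1_000_000, 100_000))  # 10 ranges
print(num_partitions(1_050_000, 100_000))  # 11 ranges (tail rows get a range)
```

This is the key difference from the fixed spark.sql.shuffle.partitions approach: the range count is a function of the Dataset's current size rather than a static setting.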