feat: add param rows_per_range for range-based btree index built#439

Open
fangbo wants to merge 2 commits intolance-format:mainfrom
fangbo:rows-per-range

Conversation

@fangbo
Collaborator

@fangbo fangbo commented Apr 16, 2026

Background

Currently, when building a btree index in range mode, the number of partitions (ranges) is configured through Spark parameters such as spark.sql.adaptive.coalescePartitions.initialPartitionNum or spark.sql.shuffle.partitions.

This approach cannot adjust the number of ranges as the Dataset's total row count changes, which becomes quite inconvenient when the Dataset keeps growing.

Design

So, we add a new parameter rows_per_range for range mode. This parameter specifies the number of rows in each range. The Spark partition number is then calculated as Dataset.total_rows / rows_per_range, so the number of ranges adjusts dynamically with the Dataset's row count.
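As a rough sketch of the calculation described above (the accessor name, option key handling, and default value here are assumptions for illustration, not the PR's actual code):

```scala
// Hypothetical sketch: deriving the Spark partition count from rows_per_range.
// `readOptions` is assumed to be a Map[String, String]; `countRows` and the
// default of 1,000,000 rows per range are illustrative assumptions.
val rowsPerRange: Long = readOptions
  .getOrElse("rows_per_range", "1000000")
  .toLong
val totalRows: Long = dataset.countRows()

// Ceiling division, floored at 1, so a trailing partial range still gets its
// own partition and an empty Dataset still yields one range.
val numRanges: Int =
  math.max(1L, (totalRows + rowsPerRange - 1) / rowsPerRange).toInt
```

Compared with a fixed spark.sql.shuffle.partitions, this keeps the per-range row count roughly constant as the Dataset grows, instead of keeping the range count constant.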

@hamersaw @puchengy Could you please take a look and see if this makes sense? Thank you.

@github-actions github-actions Bot added the enhancement New feature or request label Apr 16, 2026
@puchengy
Contributor

@fangbo Hi, I will try to take a look, but I'm not sure I can review it because I am still ramping up. Please ping @hamersaw if I don't provide comments in time, thanks!

Collaborator

@hamersaw hamersaw left a comment


Appreciate the add! I think overall this looks good, we just need to understand the implications of moving the dataset initialization.

Comment on lines +87 to +90

```diff
+ val dataset = Utils.openDatasetBuilder(readOptions).build()

  // Create distributed index job and run it
- createIndexJob(lanceDataset, readOptions, uuid.toString, fragmentIds).run()
+ createIndexJob(dataset, lanceDataset, readOptions, uuid.toString, fragmentIds).run()
```
Collaborator


This change means that if createIndexJob throws an exception, dataset is never closed, because it's outside of the try-finally.

Also, I think there may be a bigger issue where reading the manifest before createIndexJob can potentially miss index entries because it runs on a stale handle. We need to make sure this results in correct, performant execution.

Collaborator Author


> This change means that if createIndexJob throws an exception, dataset is never closed, because it's outside of the try-finally.

Thank you for pointing this, I have fixed it.
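For context, the shape of such a fix could look like the following (a sketch only; the actual change is in the PR's commits):

```scala
// Sketch: open the dataset and guard the job with try-finally, so the
// handle is closed even when createIndexJob (or run) throws.
val dataset = Utils.openDatasetBuilder(readOptions).build()
try {
  createIndexJob(dataset, lanceDataset, readOptions, uuid.toString, fragmentIds).run()
} finally {
  dataset.close()
}
```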

> Also, I think there may be a bigger issue where reading the manifest before createIndexJob can potentially miss index entries because it runs on a stale handle. We need to make sure this results in correct, performant execution.

I think readOptions is not changed while the index-building job is created and run. Logically, the whole process is based on the same dataset version, so in my opinion there is no problem. If I've misunderstood, please point it out for me. Thank you.


3 participants