feat: add param rows_per_range for range-based btree index built #439

fangbo wants to merge 2 commits into lance-format:main

Conversation
hamersaw
left a comment
Appreciate the add! I think overall this looks good, we just need to understand the implications of moving the dataset initialization.
  val dataset = Utils.openDatasetBuilder(readOptions).build()
  ...
  // Create distributed index job and run it
- createIndexJob(lanceDataset, readOptions, uuid.toString, fragmentIds).run()
+ createIndexJob(dataset, lanceDataset, readOptions, uuid.toString, fragmentIds).run()
This change means if createIndexJob throws an exception then dataset is never closed, because it's outside of the try-finally.

Also, I think there may be a bigger issue where reading the manifest before createIndexJob can potentially miss index entries because it runs on a stale handle. We need to make sure this results in correct, performant execution.
> This change means if createIndexJob throws an exception then dataset is never closed, because it's outside of the try-finally.

Thank you for pointing this out; I have fixed it.
> Also, I think there may be a bigger issue where reading the manifest before createIndexJob can potentially miss index entries because it runs on a stale handle. We need to make sure this results in correct, performant execution.
I think readOptions is not changed while the index-building job is created and running, so logically the whole process operates on the same dataset version. In my opinion there is no problem here. If I've misunderstood, please point it out. Thank you.
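To illustrate the resource-management point the reviewer raised, here is a minimal, language-neutral sketch in Python (the actual change is Scala/Spark code; FakeDataset, build_index, and create_index_job are hypothetical stand-ins for the dataset handle, the calling method, and createIndexJob). The point it shows: a handle opened before the try block must still be closed in the finally clause, so it is released even when the job throws.

```python
class FakeDataset:
    """Hypothetical stand-in for the dataset handle; tracks whether close() ran."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def build_index(create_index_job):
    # Analogous to: val dataset = Utils.openDatasetBuilder(readOptions).build()
    ds = FakeDataset()
    try:
        # Analogous to: createIndexJob(dataset, ...).run()
        create_index_job(ds)
    finally:
        # Runs whether or not the job threw, so the handle never leaks.
        ds.close()
    return ds
```

If the open call sat outside the try-finally and the job raised, close() would be skipped; moving the close into finally (as the author says was done) guarantees cleanup on both the success and failure paths.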
Background
Currently the partition (range) number is configured by a Spark parameter such as spark.sql.adaptive.coalescePartitions.initialPartitionNum or spark.sql.shuffle.partitions when building a btree index in range mode. The current approach cannot dynamically adjust the number of ranges based on changes in the total row count of the Dataset, which becomes quite inconvenient when the Dataset's total row count continues to grow.
Design
So we add a new parameter, rows_per_range, for range mode. This parameter specifies the number of rows per range; the Spark partition number is calculated as Dataset.total_rows / rows_per_range. This way the number of ranges adjusts dynamically with the Dataset's row count.

@hamersaw @puchengy Could you please take a look and see if this makes sense? Thank you.
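The partition calculation above can be sketched as follows (a Python sketch, not the PR's Scala code; num_partitions is a hypothetical helper, and rounding up so trailing rows still get a range is my assumption, since the description only says total_rows / rows_per_range):

```python
import math

def num_partitions(total_rows: int, rows_per_range: int) -> int:
    """Hypothetical helper: derive the Spark partition (range) count from the
    dataset's row count, as the PR describes: total_rows / rows_per_range.
    Rounded up (an assumption) so a partial tail still gets its own range."""
    if rows_per_range <= 0:
        raise ValueError("rows_per_range must be positive")
    return max(1, math.ceil(total_rows / rows_per_range))

# As the dataset grows, the range count grows with it,
# with no Spark config change needed:
print(num_partitions(1_000_000, 100_000))  # 10 ranges
print(num_partitions(1_050_000, 100_000))  # 11 ranges (tail rows get a range)
```

This is the key difference from the fixed spark.sql.shuffle.partitions approach: the range count is a function of the Dataset's current size rather than a static setting.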