feat: zone-level SPJ partitioning#440
Open
LuciferYang wants to merge 2 commits intolance-format:mainfrom
Open
Conversation
Enable a single Lance fragment to contribute to multiple SPJ partitions when its zonemap zones carry different single-valued partition keys. Changes: - Replace fragment-level `computeFragmentPartitionValues` with zone-level `computeZonePartitions` that builds per-zone (fragmentId, value) assignments - Add safety guards: nullCount>0 rejection, type allowlist (String/Long/ Integer/Boolean), max-assignments cap (10k), fragment-coverage check - Expand `planInputPartitions` to emit one partition per assignment with a `col == V` predicate AND-ed into the scan WHERE clause - Report stable `KeyGroupedPartitioning` via `FieldReference.column()` - Enable `USING zonemap` in CREATE INDEX syntax - Add projection-aware statistics estimation Blocked: end-to-end SPJ activation requires lance-core to expose btree zone stats via `getZonemapStats` (positive integration test is @disabled).
f88a434 to
0d8a714
Compare
…, add 3.4 subclass - Rename to match project naming convention (no version suffix) - Add ZoneLevelPartitioningTest for lance-spark-3.4_2.12
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements approach (2) from #430 — partition filter injection. A single fragment can now contribute to multiple SPJ partitions when its zonemap zones carry different single-valued keys.
Convention choice: strict values (
min == maxper zone), not ranges. This preserves cross-format SPJ compatibility with Iceberg/Hive (which use discrete partition values). If any zone hasmin != max, SPJ is skipped and the query runs normally. Range-based partitioning can layer on top as a follow-up.Core changes:
ZonemapFragmentPruner: replacecomputeFragmentPartitionValueswithcomputeZonePartitions— checks each zone independently, builds per-zone(fragmentId, value)assignmentsLanceScan.planInputPartitions: emit one partition per assignment withcol == VAND-ed into the WHERE clauseLanceScanBuilder: add fragment-coverage check to prevent data loss from unindexed fragmentsLanceStatistics: add projection-aware size estimation for accurate broadcast decisionsAddIndexExec: acceptUSING zonemapin CREATE INDEX syntaxSafety guards: null-count rejection, type allowlist (
String/Long/Integer/Boolean), fragment-coverage check, 10k assignment cap.Blocked on lance-core:
getZonemapStatscan't read btree zone data — btree stores stats inpage_lookup.lance(schema:min/max/null_count/page_idx) whilegetZonemapStatsreadszonemap.lance(schema addsfragment_id/zone_start/zone_length).USING zonemapvia SQL also doesn't work (zonemap trainer rejects fragment-based training). See #430 comment for discussion on lance-core fix options. The Spark side already accepts both"BTREE"and"ZONEMAP"— once lance-core exposes zone stats from either path, the feature activates with no further Spark changes. Positive SPJ integration test is@Disabledpending this fix.Test plan
computeZonePartitions(multi-value, dedup, null-count, type rejection, boundary)compilePredicate(string and long values)LanceStatistics.estimateProjectedbehavior