Skip to content

feat: zone-level SPJ partitioning#440

Open
LuciferYang wants to merge 2 commits intolance-format:mainfrom
LuciferYang:issue-430-zone-partitioning
Open

feat: zone-level SPJ partitioning#440
LuciferYang wants to merge 2 commits intolance-format:mainfrom
LuciferYang:issue-430-zone-partitioning

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang commented Apr 16, 2026

Summary

Implements approach (2) from #430 — partition filter injection. A single fragment can now contribute to multiple SPJ partitions when its zonemap zones carry different single-valued keys.

Convention choice: strict values (min == max per zone), not ranges. This preserves cross-format SPJ compatibility with Iceberg/Hive (which use discrete partition values). If any zone has min != max, SPJ is skipped and the query runs normally. Range-based partitioning can layer on top as a follow-up.

Core changes:

  • ZonemapFragmentPruner: replace computeFragmentPartitionValues with computeZonePartitions — checks each zone independently, builds per-zone (fragmentId, value) assignments
  • LanceScan.planInputPartitions: emit one partition per assignment with col == V AND-ed into the WHERE clause
  • LanceScanBuilder: add fragment-coverage check to prevent data loss from unindexed fragments
  • LanceStatistics: add projection-aware size estimation for accurate broadcast decisions
  • AddIndexExec: accept USING zonemap in CREATE INDEX syntax

Safety guards: null-count rejection, type allowlist (String/Long/Integer/Boolean), fragment-coverage check, 10k assignment cap.

Blocked on lance-core: getZonemapStats can't read btree zone data — btree stores stats in page_lookup.lance (schema: min/max/null_count/page_idx) while getZonemapStats reads zonemap.lance (schema adds fragment_id/zone_start/zone_length). USING zonemap via SQL also doesn't work (zonemap trainer rejects fragment-based training). See #430 comment for discussion on lance-core fix options. The Spark side already accepts both "BTREE" and "ZONEMAP" — once lance-core exposes zone stats from either path, the feature activates with no further Spark changes. Positive SPJ integration test is @Disabled pending this fix.

Test plan

  • 8 new unit tests for computeZonePartitions (multi-value, dedup, null-count, type rejection, boundary)
  • 2 new unit tests for compilePredicate (string and long values)
  • 1 new test pinning LanceStatistics.estimateProjected behavior
  • 3 integration tests: non-single-valued zone fallback, unindexed-fragment coverage check, SPJ join (@disabled)
  • 262 existing tests pass
  • Checkstyle clean

@github-actions github-actions Bot added the enhancement New feature or request label Apr 16, 2026
Enable a single Lance fragment to contribute to multiple SPJ partitions
when its zonemap zones carry different single-valued partition keys.

Changes:
- Replace fragment-level `computeFragmentPartitionValues` with zone-level
  `computeZonePartitions` that builds per-zone (fragmentId, value) assignments
- Add safety guards: nullCount>0 rejection, type allowlist (String/Long/
  Integer/Boolean), max-assignments cap (10k), fragment-coverage check
- Expand `planInputPartitions` to emit one partition per assignment with
  a `col == V` predicate AND-ed into the scan WHERE clause
- Report stable `KeyGroupedPartitioning` via `FieldReference.column()`
- Enable `USING zonemap` in CREATE INDEX syntax
- Add projection-aware statistics estimation

Blocked: end-to-end SPJ activation requires lance-core to expose btree
zone stats via `getZonemapStats` (positive integration test is @disabled).
@LuciferYang LuciferYang force-pushed the issue-430-zone-partitioning branch from f88a434 to 0d8a714 Compare April 16, 2026 13:40
…, add 3.4 subclass

- Rename to match project naming convention (no version suffix)
- Add ZoneLevelPartitioningTest for lance-spark-3.4_2.12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant