Star-tree Index Builder: Support dimension columns with raw forward index + shared dictionary #18637
… shared dictionary Thread IndexLoadingConfig through MultipleTreesBuilder and SegmentPreProcessor so the segment loader has the table's TableConfig available when the forward index is backed by external storage. Add a raw-FI + separate-dictionary fast path in PinotSegmentColumnReader.getDictId so the star-tree builder can resolve dictionary IDs for such columns. Propagate segment-level custom metadata to remote index buffers so downstream readers can locate their data source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
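The fallback described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: `SketchForwardIndex`, `SketchDictionary`, `SketchColumnReader`, and `RawStringForwardIndex` are hypothetical stand-ins for Pinot's reader and dictionary types, and the dictionary is assumed to be a sorted array so that `indexOf` can use binary search.

```java
import java.util.Arrays;

// Hypothetical stand-in for a forward index reader (not Pinot's interface).
interface SketchForwardIndex {
  boolean isDictionaryEncoded();

  // Only meaningful when the forward index is dictionary-encoded.
  int getDictId(int docId);

  // Only meaningful when the forward index stores raw values.
  String getString(int docId);
}

// Hypothetical stand-in for a separately stored, sorted dictionary.
final class SketchDictionary {
  private final String[] _sortedValues;

  SketchDictionary(String[] sortedValues) {
    _sortedValues = sortedValues;
  }

  // Returns the dictionary ID of the value, or -1 if it is absent.
  int indexOf(String value) {
    int idx = Arrays.binarySearch(_sortedValues, value);
    return idx >= 0 ? idx : -1;
  }
}

// Raw forward index: stores the values themselves, one per document.
final class RawStringForwardIndex implements SketchForwardIndex {
  private final String[] _values;

  RawStringForwardIndex(String[] values) {
    _values = values;
  }

  @Override
  public boolean isDictionaryEncoded() {
    return false;
  }

  @Override
  public int getDictId(int docId) {
    throw new UnsupportedOperationException("raw forward index has no dict IDs");
  }

  @Override
  public String getString(int docId) {
    return _values[docId];
  }
}

final class SketchColumnReader {
  private final SketchForwardIndex _forwardIndex;
  private final SketchDictionary _dictionary;

  SketchColumnReader(SketchForwardIndex forwardIndex, SketchDictionary dictionary) {
    _forwardIndex = forwardIndex;
    _dictionary = dictionary;
  }

  int getDictId(int docId) {
    if (_forwardIndex.isDictionaryEncoded()) {
      // Existing path: the forward index already stores dictionary IDs.
      return _forwardIndex.getDictId(docId);
    }
    // Raw path: read the raw value, then resolve it through the dictionary.
    return _dictionary.indexOf(_forwardIndex.getString(docId));
  }
}
```

In this sketch each raw-path call costs one forward-index read plus one dictionary lookup, which is the per-row overhead a star-tree build over such a column would pay.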
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## master #18637 +/- ##
============================================
- Coverage 56.82% 56.79% -0.04%
- Complexity 1 7 +6
============================================
Files 2571 2576 +5
Lines 149210 149501 +291
Branches 24109 24154 +45
============================================
+ Hits 84789 84907 +118
- Misses 57249 57410 +161
- Partials 7172 7184 +12
public int getDictId(int docId) {
  return _forwardIndexReader.getDictId(docId, _forwardIndexReaderContext);
  if (_forwardIndexReader.isDictionaryEncoded()) {
Can we add a test for the new use case?
case STRING:
  return _dictionary.indexOf(_forwardIndexReader.getString(docId, _forwardIndexReaderContext));
case BYTES:
  return _dictionary.indexOf(new ByteArray(_forwardIndexReader.getBytes(docId, _forwardIndexReaderContext)));
This allocates a fresh byte[] plus a ByteArray wrapper per row during the star-tree build (numDocs × numBytesDims allocations in total). Have you measured the build-time cost of this path on a representative segment? Can we optimize it?
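One possible direction, sketched under loud assumptions: if the dictionary's sorted values are reachable as plain `byte[]` and are ordered by unsigned lexicographic comparison, the lookup can binary-search the raw bytes directly and skip the per-row wrapper object. `SketchBytesDictionary` is hypothetical and is not Pinot's dictionary API; `Arrays.compareUnsigned` requires Java 9 or later. This removes only the wrapper allocation; the `byte[]` returned by the forward index reader is still allocated per row.

```java
import java.util.Arrays;

// Hypothetical bytes dictionary holding values sorted in unsigned
// lexicographic order (an assumption of this sketch).
final class SketchBytesDictionary {
  private final byte[][] _sortedValues;

  SketchBytesDictionary(byte[][] sortedValues) {
    _sortedValues = sortedValues;
  }

  // Binary search on the raw bytes; no wrapper object is created per call.
  // Returns the dictionary ID, or -1 if the value is absent.
  int indexOf(byte[] value) {
    int low = 0;
    int high = _sortedValues.length - 1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      int cmp = Arrays.compareUnsigned(_sortedValues[mid], value);
      if (cmp < 0) {
        low = mid + 1;
      } else if (cmp > 0) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -1;
  }
}
```

Whether this helps in practice would still need to be measured on a representative segment, as the comment above asks.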
raghavyadav01 left a comment:
On MultipleTreesBuilder.java:151 — the second constructor (MultipleTreesBuilder(List<StarTreeIndexConfig>, boolean enableDefaultStarTree, ...)) still uses ImmutableSegmentLoader.load(indexDir, ReadMode.mmap).
Do we need to apply this change to other code paths that build star-trees too? How does this work for minion tasks — is there a separate path?
As of today, we don't have a way to support a StarTree index when the forward index is RAW. This PR adds that support for the case where the dictionary is stored separately.
It also passes down some properties needed for this to happen.