feat(search): native hybrid search with score fusion (fuse + hybrid)#9738
Open
shaunpatterson wants to merge 19 commits into
Open
feat(search): native hybrid search with score fusion (fuse + hybrid)#9738shaunpatterson wants to merge 19 commits into
shaunpatterson wants to merge 19 commits into
Conversation
Add BM25 relevance-ranked text search to Dgraph, enabling users to query text predicates and receive results ordered by relevance score instead of boolean matching. Implementation: - New BM25 tokenizer using the fulltext pipeline (normalize, stopwords, stem) that preserves term frequencies for TF counting - BM25-specific index storage: per-term TF posting lists, doc length lists, and corpus statistics (doc count, total terms) - Query execution with full BM25 scoring: score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf) IDF = log1p((N - df + 0.5) / (df + 0.5)) - DQL syntax: bm25(predicate, "query" [, "k", "b"]) as root func or filter - Schema syntax: @index(bm25) - Parameter validation (k > 0, 0 <= b <= 1) - Early UID intersection for filter-mode performance - All-stopword document and query handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three critical bugs fixed: 1. REF postings lose Value during rollup: The posting list encode/rollup cycle strips the Value field from REF postings without facets (list.go:1630). BM25 term frequencies and doc lengths were stored in Value and lost. Fix: Store TF and doclen as facets on REF postings, which are preserved. 2. Missing function validation: query/query.go has a separate isValidFuncName check from dql/parser.go. "bm25" was only added to the parser, causing "Invalid function name: bm25" at query time. 3. Unsorted UIDs break query pipeline: BM25 returned UIDs sorted by score, but the query pipeline (algo.MergeSorted, child predicate fetching) requires UID-ascending order. Fix: Sort UIDs ascending in UidMatrix, apply first/offset pagination on score-sorted results before UID sorting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the facet-based BM25 storage (~40-50 bytes/posting) with compact varint-encoded binary blobs stored as direct Badger KV entries (~4-6 bytes/posting, ~10x reduction). Add bm25_score pseudo-predicate for variable-based score ordering following the similar_to pattern. - Add posting/bm25enc package for compact binary encode/decode - Rewrite write path in posting/index.go for direct Badger KV - Add bm25Writes buffer to LocalCache with read-your-own-writes - Flush BM25 blobs in CommitToDisk with BitBM25Data UserMeta - Rewrite read path in worker/task.go with direct blob decoding - Add bm25_score pseudo-predicate in query/query.go - Add score ordering integration tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cases Cover incremental add/update/delete, IDF score stability as corpus grows, large corpus pagination, unicode, stopwords, uid filtering, score validation, and concurrent batch adds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c tests Addresses test coverage gaps identified during code review against ArangoDB's BM25 implementation: - TestBM25ExactScoreValues: validates numerical correctness of BM25 formula using b=0 to enable hand-computed expected scores - TestBM25BM15NoLengthNormalization: verifies b=0 disables length normalization and contrasts with default b=0.75 behavior - TestBM25SingleMatchingDocument: covers df=1 edge case with high IDF Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 of BM25 scaling plan. Introduces bm25block package with: - BlockMeta/Dir types for block directory encoding/decoding - SplitIntoBlocks: splits monolithic entry slices into 128-entry blocks - MergeAllBlocks: compacts overlapping blocks with dedup and tombstone removal - ComputeUBPre/SuffixMaxUBPre: WAND upper-bound precomputation - New key functions: BM25TermDirKey, BM25TermBlockKey, BM25DocLenDirKey, BM25DocLenBlockKey for block-addressed Badger KV storage 17 unit tests and benchmarks for the block storage format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phases 2-4 of BM25 scaling plan: Phase 2 - Segmented mutation path: - addBM25IndexMutations now writes to block-based storage - Each term's postings split into ~128-entry blocks with a directory - Blocks automatically split when exceeding 256 entries - Doc-length list also uses block-based storage - Block removal and directory cleanup on deletes Phase 3 - WAND top-k query path: - New bm25wand.go with listIter for block-based posting list iteration - WAND algorithm with min-heap for top-k early termination - Per-block upper bounds (UBPre) computed from maxTF at query time - Suffix-max UBPre for efficient threshold checking - Falls back to scoring all docs when no first: limit or offset is used Phase 4 - Block-Max WAND: - skipToWithBMW skips entire blocks whose UB + other terms can't beat theta - Avoids Badger reads for blocks that can't contribute to top-k - Enabled by default in handleBM25Search Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 5 - Migration support: - newListIter falls back to legacy monolithic blob when no block directory exists - lookupDocLen falls back to legacy BM25DocLenKey blob - wandSearch falls back to legacy BM25IndexKey for df computation - Legacy data transparently served through synthetic single-block directory - New writes always use block format; old data works until overwritten Unit tests for WAND components: - TestTopKHeapBasic: heap operations, threshold, eviction - TestTopKHeapTieBreaking: deterministic ordering on score ties - TestBm25ScoreFunction: formula verification, tf/dl/b edge cases - TestBm25ScoreNaN: no NaN/Inf for edge-case inputs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes critical bugs and performance issues identified by GPT-5 review: - Fix negative inBlockPos panic: guard currentDoc/currentTF/skipTo against inBlockPos < 0 (possible before first next() call) - Fix empty block pathological behavior: next()/skipTo()/skipToWithBMW() now skip empty blocks instead of leaving iterator in invalid state with MaxUint64 pivotDoc - Fix legacy loadBlock: no longer resets inBlockPos to 0 (was moving pointer backwards, could cause re-scoring or infinite loops) - Fix remainingUB panic: guard against blockIdx < 0 (before first next()) - Add docLenCache: caches doclen directory + block reads within a single query, avoiding repeated Badger reads per scored document - Optimize BMW otherUB: compute as sumUB - thisUB (O(1)) instead of iterating all other terms (O(q^2) -> O(q)) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…UB underestimate Three fixes: 1. CRITICAL: addBM25IndexMutations now checks if a UID already exists in doclen blocks before incrementing stats, preventing double-counting on SET when the document was already indexed (defensive guard for batch mutations). 2. HIGH: WAND sumUB now accumulates across ALL iterators (not just up to pivot), so BMW's otherUB calculation is correct and won't skip valid candidate blocks. 3. PERF: newListIter accepts pre-read Dir to eliminate duplicate Badger reads (directory was read once for df, then again inside newListIter). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ength Defensive hardening from GPT-5 review: if inBlockPos exceeds block length after next() reaches end of block, the sort.Search span could go negative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DecodeCount() to bm25enc for O(1) entry count reads without full decode, preventing OOM on legacy migration with large posting lists (e.g., common terms with millions of entries) - Use DecodeCount in WAND search legacy DF calculation path - Fix integer overflow in DecodeDir bounds check by using uint64 arithmetic (prevents panic on corrupted data with MaxUint32 count) - Pre-allocate shared score buffer in handleBM25Search with three-index slices to prevent accidental append corruption - Document bm25Writes concurrency model and limitations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the parallel block-storage + retrieval stack (declined in review) with an implementation that rides Dgraph's standard posting-list machinery, addressing the maintainer's feedback (independently endorsed by GPT-5 and Gemini). Net ~1300 fewer lines. Storage / indexing: - BM25 term postings are standard index posting lists at IndexKey(attr, IdentBM25||term), written via the normal delta path, so they inherit MVCC, deltas, rollup, splits, backup and snapshot. Each posting is a REF posting whose value packs (term-frequency, doc-length) as two uvarints. - Fix the linchpin: List.encode() now retains REF postings that carry a value through rollup (otherwise the term frequency was silently stripped). Mirrors how faceted postings already coexist in Pack (uid) + Postings (payload). - Document length is packed into the posting value rather than a separate list, avoiding a write-hot doclen key and a per-candidate random read at query time. - Corpus stats (docCount, totalTerms) are sharded across 32 buckets keyed by uid%32 so concurrent writers rarely contend, while same-bucket updates still conflict-and-retry (mirrors the @count pattern). Term postings get the standard index conflict key (fingerprint(key)^uid), so two docs sharing a term commit concurrently without conflict -- resolving the concurrency regression that the block version only mitigated via Raft serialization. - Delete posting/bm25block and posting/bm25enc; remove LocalCache.bm25Writes, BitBM25Data, and the BM25 commit branch in mvcc.go. Query / scoring: - WAND / Block-Max WAND reworked over the standard posting-list iterator: per-term cursors are materialized from the in-memory List with per-128-posting maxTF/minDocLen bounds -- no parallel block format, no proto changes. - Surface the score via Dgraph's existing value-variable mechanism: the bm25 root function binds its per-doc score to its own variable (Uids + Vals), so `scores as var(func: bm25(...))` works with uid(scores), val(scores) and orderdesc: val(scores). Removes the bm25_score pseudo-predicate and the __bm25_scores__ ParentVars channel. - Skip the query-layer pagination pass for bm25 roots (the worker already paginates over score order), mirroring the existing `has` handling, to avoid double-applying first/offset. Tests: - Rollup TF/doc-length survival, bucketed-stats accumulation (incl. same-bucket in-transaction read-your-own-writes and deletes), value-codec round trip, and a 200-trial randomized WAND/Block-Max-WAND vs brute-force correctness check. - Convert the query integration tests to the value-variable syntax. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
updateBM25Stats maintains the bucketed (docCount, totalTerms) counters via read-modify-write, but read them with LocalCache.GetFromDelta, which skips disk and returns only the current transaction's in-memory delta. Across separately committed mutations each transaction therefore started from zero and overwrote its stats bucket instead of accumulating, collapsing the corpus document count (e.g. to the per-term df). Since avgDL = totalTerms/docCount and idf depends on N = docCount, every length- and idf-weighted BM25 score was wrong, while result ordering (a constant idf factor for a single-term query) still looked correct. Read the committed stats with LocalCache.Get instead. Term postings are unaffected: they are additive deltas that merge through the normal posting-list mechanism and never read-modify-write. Found by the BM25 integration tests (exact-score and uid-filter cases). The prior unit test only exercised a single transaction, where read-your-own-writes masked the bug; add TestBM25StatsAccumulateAcrossTxns covering the multi- transaction path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BM25 scores a single document (one value) per UID, so per-document length and corpus statistics are ill-defined for a list predicate. The bucketed stats also rely on conflict detection that a list predicate's value-dependent conflict key would not provide (a code-review concern about stats integrity on list/ @noconflict predicates). Reject the combination in checkSchema rather than silently mis-scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The redesign rationale (including the doc-length storage decision) lives in code comments and the PR description; the planning doc does not belong in the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add native multi-signal search fusion so a single DQL query can combine BM25 text relevance with vector similarity (and any other scored value variable), instead of issuing separate queries and fusing in application code. New surface: - fuse(v1, v2, ..., method:"rrf"|"linear", k:60, weights:"0.3,0.7", normalize:"max"|"none", topk:N): an N-way combinator over already-scored DQL value variables. RRF = sum 1/(k+rank); linear = weighted sum of (optionally max-normalized) scores. Outer-join/union semantics: a uid missing from a channel contributes nothing (RRF) or 0 (linear), never dropped. Computed coordinator-side over resolved variables; the existing dependency scheduler orders it after its channel blocks. - hybrid(textPred, "query", vecPred, $vec, topk:N, method:..., k:...): convenience sugar rewritten at parse time into bm25 + similar_to channel blocks plus a fuse() block (no distinct execution path). Vector scores surfaced: - similar_to now binds a higher-is-better similarity score (cosine/dot as-is; euclidean as 1/(1+d^2)) to a value variable, so vector results can be a fusion channel alongside BM25. New SearchScored / SearchWithUidScored mirror Search / SearchWithUid exactly, so scoring a plain vector query does not change which neighbors it returns; the *Options* scored variants apply only when ef/distance-threshold is supplied. Robustness (post adversarial review by GPT-5 + Gemini): - non-finite scores are dropped before fusion so they cannot break the sort comparator or poison linear sums; - fused value variable follows the bm25 value-variable contract (ascending uid set + uid->score map; ranked order via orderdesc: val(var); topk selected before the ascending sort); - undefined fuse channel vars are rejected at parse time; - the __hybrid prefix is reserved to avoid synthetic/user var collisions. Tests: fusion-core unit tests (RRF/linear/union/ties/NaN/topk/determinism), parser tests (fuse + hybrid + error paths), score-orientation tests, and end-to-end integration tests combining BM25 + vector (fuse and hybrid). Design and review notes in docs/superpowers/specs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Auto-fixed issues flagged by a majority (>=3/5) of /judge runs across Claude, GPT5, and Gemini. Minority findings documented in the deep-review report. - hybrid(): bound the generated bm25 channel to topk (first:) so a broad text query no longer scores the entire corpus before fusion (5/5). - fuse(): apply per-channel weights under RRF too, not only linear, so a user passing weights with the default method no longer has them silently ignored; default weight 1.0 keeps standard RRF (4/5). - hybrid(): check the reserved __hybrid var prefix across nested blocks, not just top-level (4/5); reject malformed (odd) option lists instead of dropping them (3/5). - fuse(): distinguish a genuinely missing channel variable (internal invariant error) from a channel that ran but matched nothing (valid empty channel) (3/5). Added unit tests for weighted RRF and the hybrid bm25 bound; updated the hybrid/fuse equivalence test to mirror the bounded bm25 channel. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two issues found in deep review of native hybrid search: 1. fuse() rejected at subgraph conversion. The DQL parser accepts `fuse` (validFuncName), but query.go's isValidFuncName did not list it, so every fuse()/hybrid() query failed with "Invalid function name: fuse". The integration tests that exercise this are CI-gated and had not been run. Added "fuse"; all 25 fuse/hybrid/similar_to-score integration tests now pass. (hybrid is rewritten to fuse before this check, so only fuse needs registering.) 2. Score/UID misalignment under @filter. The generalized bm25/similar_to ranker binding zipped uidMatrix[0] with valueMatrix positionally; a later @filter on the ranker block runs updateUidMatrix (and pagination), which shrinks/reorders uidMatrix[0] in place without touching valueMatrix — misbinding scores to UIDs and feeding wrong scores into fuse() channels. Snapshot the aligned worker result into a uid->score map (sg.rankerScores) at result time, before any mutation, and bind by UID. Identical behavior on the tested no-filter paths.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Native hybrid search with score fusion (
fuse+hybrid)Brings hybrid retrieval (dense vector + sparse/keyword, fused into one ranked list)
natively into a single DQL query, instead of issuing separate queries and fusing in
application code.
Important
This branch is stacked on the BM25 ranked-text-search work, which is not yet
upstream. Of the commits here, 17 are the BM25 foundation and 2 are hybrid
search (
feat(search): native hybrid search…+ a fix). The hybrid feature reusesBM25's value-variable ranker contract and needs it as a fusion channel, so it
cannot be reviewed standalone. This PR should land after a BM25 PR (or be
split once BM25 is upstreamed). Opening it now for visibility/review of the
fusion layer.
Two pieces
fuse()— N-way fusion combinator over already-scored DQL value variables(the general primitive). Methods:
rrf(Reciprocal Rank Fusion, default,k=60)and
linear(weighted sum of optionally max-normalized scores). Supportsper-channel
weights,normalize, andtopk.hybrid()— convenience sugar for the common BM25 + vector case, rewritten atparse time into two channel blocks + a
fuse(). No distinct executor path.Design notes
channels, not an intersection: a uid missing from a channel contributes nothing
(RRF: rank = ∞; linear: score = 0), never dropped, never NaN.
fuse()is never dispatched to a worker; it's computed inpopulateVarMapover already-resolved value variables. The existing variabledependency scheduler runs channel blocks first.
similar_tonow also surfaces ahigher-is-better similarity score (cosine/dot as-is; euclidean as
1/(1+d)) sovector results can be a fusion channel.
Search*Scoredmirror the existingSearch*neighbor selection exactly — scoring a plain query does not change whichneighbors it returns; the scored path is only taken when a score is needed.
the sort comparator's strict-weak-ordering or poison a linear sum.
Tests
query/fuse_test.go,dql/fuse_parser_test.go): RRF ranks/union/ties/weights,linear normalize/weights/none, topk truncation, determinism, parser options,
hybrid expansion, error paths.
query): RRF/linear over BM25 channels, BM25+vector fusion,hybrid()equivalent to explicitfuse(), pagination, error handling, andsimilar_toscore-variable surfacing.25 fuse/hybrid/similar-score integration tests + the fusion/parser unit tests pass.
🤖 Generated with Claude Code