feat(search): native hybrid search with score fusion (fuse + hybrid) by shaunpatterson · Pull Request #9738 · dgraph-io/dgraph

shaunpatterson · 2026-06-10T16:41:33Z

Native hybrid search with score fusion (`fuse` + `hybrid`)

Brings hybrid retrieval (dense vector + sparse/keyword, fused into one ranked list)
natively into a single DQL query, instead of issuing separate queries and fusing in
application code.

Important

This branch is stacked on the BM25 ranked-text-search work, which is not yet
upstream. Of the commits here, 17 are the BM25 foundation and 2 are hybrid
search (feat(search): native hybrid search… + a fix). The hybrid feature reuses
BM25's value-variable ranker contract and needs it as a fusion channel, so it
cannot be reviewed standalone. This PR should land after a BM25 PR (or be
split once BM25 is upstreamed). Opening it now for visibility/review of the
fusion layer.

Two pieces

fuse() — N-way fusion combinator over already-scored DQL value variables
(the general primitive). Methods: rrf (Reciprocal Rank Fusion, default, k=60)
and linear (weighted sum of optionally max-normalized scores). Supports
per-channel weights, normalize, and topk.
hybrid() — convenience sugar for the common BM25 + vector case, rewritten at
parse time into two channel blocks + a fuse(). No distinct executor path.

v as var(func: bm25(text, "quick brown fox"))
e as var(func: similar_to(emb, 100, $queryVec))
f as var(func: fuse(v, e, method: "rrf", k: 60))
{ result(func: uid(f), orderdesc: val(f), first: 10) { uid val(f) } }

Design notes

Outer-join (union) semantics. Fusion is a set union of candidate uids across
channels, not an intersection: a uid missing from a channel contributes nothing
(RRF: rank = ∞; linear: score = 0), never dropped, never NaN.
Coordinator-side. fuse() is never dispatched to a worker; it's computed in
populateVarMap over already-resolved value variables. The existing variable
dependency scheduler runs channel blocks first.
Vector score surfacing (prerequisite). similar_to now also surfaces a
higher-is-better similarity score (cosine/dot as-is; euclidean as 1/(1+d)) so
vector results can be a fusion channel. Search*Scored mirror the existing
Search* neighbor selection exactly — scoring a plain query does not change which
neighbors it returns; the scored path is only taken when a score is needed.
NaN/Inf safety. Non-finite channel scores are dropped so they can never break
the sort comparator's strict-weak-ordering or poison a linear sum.

Tests

Unit (query/fuse_test.go, dql/fuse_parser_test.go): RRF ranks/union/ties/weights,
linear normalize/weights/none, topk truncation, determinism, parser options,
hybrid expansion, error paths.
Integration (query): RRF/linear over BM25 channels, BM25+vector fusion,
hybrid() equivalent to explicit fuse(), pagination, error handling, and
similar_to score-variable surfacing.

25 fuse/hybrid/similar-score integration tests + the fusion/parser unit tests pass.

🤖 Generated with Claude Code

Add BM25 relevance-ranked text search to Dgraph, enabling users to query text predicates and receive results ordered by relevance score instead of boolean matching. Implementation: - New BM25 tokenizer using the fulltext pipeline (normalize, stopwords, stem) that preserves term frequencies for TF counting - BM25-specific index storage: per-term TF posting lists, doc length lists, and corpus statistics (doc count, total terms) - Query execution with full BM25 scoring: score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf) IDF = log1p((N - df + 0.5) / (df + 0.5)) - DQL syntax: bm25(predicate, "query" [, "k", "b"]) as root func or filter - Schema syntax: @index(bm25) - Parameter validation (k > 0, 0 <= b <= 1) - Early UID intersection for filter-mode performance - All-stopword document and query handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three critical bugs fixed: 1. REF postings lose Value during rollup: The posting list encode/rollup cycle strips the Value field from REF postings without facets (list.go:1630). BM25 term frequencies and doc lengths were stored in Value and lost. Fix: Store TF and doclen as facets on REF postings, which are preserved. 2. Missing function validation: query/query.go has a separate isValidFuncName check from dql/parser.go. "bm25" was only added to the parser, causing "Invalid function name: bm25" at query time. 3. Unsorted UIDs break query pipeline: BM25 returned UIDs sorted by score, but the query pipeline (algo.MergeSorted, child predicate fetching) requires UID-ascending order. Fix: Sort UIDs ascending in UidMatrix, apply first/offset pagination on score-sorted results before UID sorting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the facet-based BM25 storage (~40-50 bytes/posting) with compact varint-encoded binary blobs stored as direct Badger KV entries (~4-6 bytes/posting, ~10x reduction). Add bm25_score pseudo-predicate for variable-based score ordering following the similar_to pattern. - Add posting/bm25enc package for compact binary encode/decode - Rewrite write path in posting/index.go for direct Badger KV - Add bm25Writes buffer to LocalCache with read-your-own-writes - Flush BM25 blobs in CommitToDisk with BitBM25Data UserMeta - Rewrite read path in worker/task.go with direct blob decoding - Add bm25_score pseudo-predicate in query/query.go - Add score ordering integration tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…cases Cover incremental add/update/delete, IDF score stability as corpus grows, large corpus pagination, unicode, stopwords, uid filtering, score validation, and concurrent batch adds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…c tests Addresses test coverage gaps identified during code review against ArangoDB's BM25 implementation: - TestBM25ExactScoreValues: validates numerical correctness of BM25 formula using b=0 to enable hand-computed expected scores - TestBM25BM15NoLengthNormalization: verifies b=0 disables length normalization and contrasts with default b=0.75 behavior - TestBM25SingleMatchingDocument: covers df=1 edge case with high IDF Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 1 of BM25 scaling plan. Introduces bm25block package with: - BlockMeta/Dir types for block directory encoding/decoding - SplitIntoBlocks: splits monolithic entry slices into 128-entry blocks - MergeAllBlocks: compacts overlapping blocks with dedup and tombstone removal - ComputeUBPre/SuffixMaxUBPre: WAND upper-bound precomputation - New key functions: BM25TermDirKey, BM25TermBlockKey, BM25DocLenDirKey, BM25DocLenBlockKey for block-addressed Badger KV storage 17 unit tests and benchmarks for the block storage format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phases 2-4 of BM25 scaling plan: Phase 2 - Segmented mutation path: - addBM25IndexMutations now writes to block-based storage - Each term's postings split into ~128-entry blocks with a directory - Blocks automatically split when exceeding 256 entries - Doc-length list also uses block-based storage - Block removal and directory cleanup on deletes Phase 3 - WAND top-k query path: - New bm25wand.go with listIter for block-based posting list iteration - WAND algorithm with min-heap for top-k early termination - Per-block upper bounds (UBPre) computed from maxTF at query time - Suffix-max UBPre for efficient threshold checking - Falls back to scoring all docs when no first: limit or offset is used Phase 4 - Block-Max WAND: - skipToWithBMW skips entire blocks whose UB + other terms can't beat theta - Avoids Badger reads for blocks that can't contribute to top-k - Enabled by default in handleBM25Search Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 5 - Migration support: - newListIter falls back to legacy monolithic blob when no block directory exists - lookupDocLen falls back to legacy BM25DocLenKey blob - wandSearch falls back to legacy BM25IndexKey for df computation - Legacy data transparently served through synthetic single-block directory - New writes always use block format; old data works until overwritten Unit tests for WAND components: - TestTopKHeapBasic: heap operations, threshold, eviction - TestTopKHeapTieBreaking: deterministic ordering on score ties - TestBm25ScoreFunction: formula verification, tf/dl/b edge cases - TestBm25ScoreNaN: no NaN/Inf for edge-case inputs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fixes critical bugs and performance issues identified by GPT-5 review: - Fix negative inBlockPos panic: guard currentDoc/currentTF/skipTo against inBlockPos < 0 (possible before first next() call) - Fix empty block pathological behavior: next()/skipTo()/skipToWithBMW() now skip empty blocks instead of leaving iterator in invalid state with MaxUint64 pivotDoc - Fix legacy loadBlock: no longer resets inBlockPos to 0 (was moving pointer backwards, could cause re-scoring or infinite loops) - Fix remainingUB panic: guard against blockIdx < 0 (before first next()) - Add docLenCache: caches doclen directory + block reads within a single query, avoiding repeated Badger reads per scored document - Optimize BMW otherUB: compute as sumUB - thisUB (O(1)) instead of iterating all other terms (O(q^2) -> O(q)) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…UB underestimate Three fixes: 1. CRITICAL: addBM25IndexMutations now checks if a UID already exists in doclen blocks before incrementing stats, preventing double-counting on SET when the document was already indexed (defensive guard for batch mutations). 2. HIGH: WAND sumUB now accumulates across ALL iterators (not just up to pivot), so BMW's otherUB calculation is correct and won't skip valid candidate blocks. 3. PERF: newListIter accepts pre-read Dir to eliminate duplicate Badger reads (directory was read once for df, then again inside newListIter). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ength Defensive hardening from GPT-5 review: if inBlockPos exceeds block length after next() reaches end of block, the sort.Search span could go negative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add DecodeCount() to bm25enc for O(1) entry count reads without full decode, preventing OOM on legacy migration with large posting lists (e.g., common terms with millions of entries) - Use DecodeCount in WAND search legacy DF calculation path - Fix integer overflow in DecodeDir bounds check by using uint64 arithmetic (prevents panic on corrupted data with MaxUint32 count) - Pre-allocate shared score buffer in handleBM25Search with three-index slices to prevent accidental append corruption - Document bm25Writes concurrency model and limitations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces the parallel block-storage + retrieval stack (declined in review) with an implementation that rides Dgraph's standard posting-list machinery, addressing the maintainer's feedback (independently endorsed by GPT-5 and Gemini). Net ~1300 fewer lines. Storage / indexing: - BM25 term postings are standard index posting lists at IndexKey(attr, IdentBM25||term), written via the normal delta path, so they inherit MVCC, deltas, rollup, splits, backup and snapshot. Each posting is a REF posting whose value packs (term-frequency, doc-length) as two uvarints. - Fix the linchpin: List.encode() now retains REF postings that carry a value through rollup (otherwise the term frequency was silently stripped). Mirrors how faceted postings already coexist in Pack (uid) + Postings (payload). - Document length is packed into the posting value rather than a separate list, avoiding a write-hot doclen key and a per-candidate random read at query time. - Corpus stats (docCount, totalTerms) are sharded across 32 buckets keyed by uid%32 so concurrent writers rarely contend, while same-bucket updates still conflict-and-retry (mirrors the @count pattern). Term postings get the standard index conflict key (fingerprint(key)^uid), so two docs sharing a term commit concurrently without conflict -- resolving the concurrency regression that the block version only mitigated via Raft serialization. - Delete posting/bm25block and posting/bm25enc; remove LocalCache.bm25Writes, BitBM25Data, and the BM25 commit branch in mvcc.go. Query / scoring: - WAND / Block-Max WAND reworked over the standard posting-list iterator: per-term cursors are materialized from the in-memory List with per-128-posting maxTF/minDocLen bounds -- no parallel block format, no proto changes. - Surface the score via Dgraph's existing value-variable mechanism: the bm25 root function binds its per-doc score to its own variable (Uids + Vals), so `scores as var(func: bm25(...))` works with uid(scores), val(scores) and orderdesc: val(scores). Removes the bm25_score pseudo-predicate and the __bm25_scores__ ParentVars channel. - Skip the query-layer pagination pass for bm25 roots (the worker already paginates over score order), mirroring the existing `has` handling, to avoid double-applying first/offset. Tests: - Rollup TF/doc-length survival, bucketed-stats accumulation (incl. same-bucket in-transaction read-your-own-writes and deletes), value-codec round trip, and a 200-trial randomized WAND/Block-Max-WAND vs brute-force correctness check. - Convert the query integration tests to the value-variable syntax. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

updateBM25Stats maintains the bucketed (docCount, totalTerms) counters via read-modify-write, but read them with LocalCache.GetFromDelta, which skips disk and returns only the current transaction's in-memory delta. Across separately committed mutations each transaction therefore started from zero and overwrote its stats bucket instead of accumulating, collapsing the corpus document count (e.g. to the per-term df). Since avgDL = totalTerms/docCount and idf depends on N = docCount, every length- and idf-weighted BM25 score was wrong, while result ordering (a constant idf factor for a single-term query) still looked correct. Read the committed stats with LocalCache.Get instead. Term postings are unaffected: they are additive deltas that merge through the normal posting-list mechanism and never read-modify-write. Found by the BM25 integration tests (exact-score and uid-filter cases). The prior unit test only exercised a single transaction, where read-your-own-writes masked the bug; add TestBM25StatsAccumulateAcrossTxns covering the multi- transaction path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BM25 scores a single document (one value) per UID, so per-document length and corpus statistics are ill-defined for a list predicate. The bucketed stats also rely on conflict detection that a list predicate's value-dependent conflict key would not provide (a code-review concern about stats integrity on list/ @noconflict predicates). Reject the combination in checkSchema rather than silently mis-scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The redesign rationale (including the doc-length storage decision) lives in code comments and the PR description; the planning doc does not belong in the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add native multi-signal search fusion so a single DQL query can combine BM25 text relevance with vector similarity (and any other scored value variable), instead of issuing separate queries and fusing in application code. New surface: - fuse(v1, v2, ..., method:"rrf"|"linear", k:60, weights:"0.3,0.7", normalize:"max"|"none", topk:N): an N-way combinator over already-scored DQL value variables. RRF = sum 1/(k+rank); linear = weighted sum of (optionally max-normalized) scores. Outer-join/union semantics: a uid missing from a channel contributes nothing (RRF) or 0 (linear), never dropped. Computed coordinator-side over resolved variables; the existing dependency scheduler orders it after its channel blocks. - hybrid(textPred, "query", vecPred, $vec, topk:N, method:..., k:...): convenience sugar rewritten at parse time into bm25 + similar_to channel blocks plus a fuse() block (no distinct execution path). Vector scores surfaced: - similar_to now binds a higher-is-better similarity score (cosine/dot as-is; euclidean as 1/(1+d^2)) to a value variable, so vector results can be a fusion channel alongside BM25. New SearchScored / SearchWithUidScored mirror Search / SearchWithUid exactly, so scoring a plain vector query does not change which neighbors it returns; the *Options* scored variants apply only when ef/distance-threshold is supplied. Robustness (post adversarial review by GPT-5 + Gemini): - non-finite scores are dropped before fusion so they cannot break the sort comparator or poison linear sums; - fused value variable follows the bm25 value-variable contract (ascending uid set + uid->score map; ranked order via orderdesc: val(var); topk selected before the ascending sort); - undefined fuse channel vars are rejected at parse time; - the __hybrid prefix is reserved to avoid synthetic/user var collisions. Tests: fusion-core unit tests (RRF/linear/union/ties/NaN/topk/determinism), parser tests (fuse + hybrid + error paths), score-orientation tests, and end-to-end integration tests combining BM25 + vector (fuse and hybrid). Design and review notes in docs/superpowers/specs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Auto-fixed issues flagged by a majority (>=3/5) of /judge runs across Claude, GPT5, and Gemini. Minority findings documented in the deep-review report. - hybrid(): bound the generated bm25 channel to topk (first:) so a broad text query no longer scores the entire corpus before fusion (5/5). - fuse(): apply per-channel weights under RRF too, not only linear, so a user passing weights with the default method no longer has them silently ignored; default weight 1.0 keeps standard RRF (4/5). - hybrid(): check the reserved __hybrid var prefix across nested blocks, not just top-level (4/5); reject malformed (odd) option lists instead of dropping them (3/5). - fuse(): distinguish a genuinely missing channel variable (internal invariant error) from a channel that ran but matched nothing (valid empty channel) (3/5). Added unit tests for weighted RRF and the hybrid bm25 bound; updated the hybrid/fuse equivalence test to mirror the bounded bm25 channel. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@filter

Two issues found in deep review of native hybrid search: 1. fuse() rejected at subgraph conversion. The DQL parser accepts `fuse` (validFuncName), but query.go's isValidFuncName did not list it, so every fuse()/hybrid() query failed with "Invalid function name: fuse". The integration tests that exercise this are CI-gated and had not been run. Added "fuse"; all 25 fuse/hybrid/similar_to-score integration tests now pass. (hybrid is rewritten to fuse before this check, so only fuse needs registering.) 2. Score/UID misalignment under @filter. The generalized bm25/similar_to ranker binding zipped uidMatrix[0] with valueMatrix positionally; a later @filter on the ranker block runs updateUidMatrix (and pagination), which shrinks/reorders uidMatrix[0] in place without touching valueMatrix — misbinding scores to UIDs and feeding wrong scores into fuse() channels. Snapshot the aligned worker result into a uid->score map (sg.rankerScores) at result time, before any mutation, and bind by UID. Identical behavior on the tested no-filter paths.

shaunpatterson and others added 19 commits June 3, 2026 19:56

fix(bm25): clamp startPos in skipTo to prevent negative sort.Search l…

81b18c1

…ength Defensive hardening from GPT-5 review: if inBlockPos exceeds block length after next() reaches end of block, the sort.Search span could go negative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore(bm25): drop working design notes

7b063f0

The redesign rationale (including the doc-length storage decision) lives in code comments and the PR description; the planning doc does not belong in the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shaunpatterson requested a review from a team as a code owner June 10, 2026 16:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(search): native hybrid search with score fusion (fuse + hybrid)#9738

feat(search): native hybrid search with score fusion (fuse + hybrid)#9738
shaunpatterson wants to merge 19 commits into
dgraph-io:mainfrom
shaunpatterson:sp/hybrid-search

shaunpatterson commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Conversation

shaunpatterson commented Jun 10, 2026

Native hybrid search with score fusion (fuse + hybrid)

Two pieces

Design notes

Tests

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Native hybrid search with score fusion (`fuse` + `hybrid`)