Skip to content

feat(search): native hybrid search with score fusion (fuse + hybrid)#9738

Open
shaunpatterson wants to merge 19 commits into
dgraph-io:mainfrom
shaunpatterson:sp/hybrid-search
Open

feat(search): native hybrid search with score fusion (fuse + hybrid)#9738
shaunpatterson wants to merge 19 commits into
dgraph-io:mainfrom
shaunpatterson:sp/hybrid-search

Conversation

@shaunpatterson

Copy link
Copy Markdown
Contributor

Native hybrid search with score fusion (fuse + hybrid)

Brings hybrid retrieval (dense vector + sparse/keyword, fused into one ranked list)
natively into a single DQL query, instead of issuing separate queries and fusing in
application code.

Important

This branch is stacked on the BM25 ranked-text-search work, which is not yet
upstream.
Of the commits here, 17 are the BM25 foundation and 2 are hybrid
search (feat(search): native hybrid search… + a fix). The hybrid feature reuses
BM25's value-variable ranker contract and needs it as a fusion channel, so it
cannot be reviewed standalone. This PR should land after a BM25 PR (or be
split once BM25 is upstreamed). Opening it now for visibility/review of the
fusion layer.

Two pieces

  • fuse() — N-way fusion combinator over already-scored DQL value variables
    (the general primitive). Methods: rrf (Reciprocal Rank Fusion, default, k=60)
    and linear (weighted sum of optionally max-normalized scores). Supports
    per-channel weights, normalize, and topk.
  • hybrid() — convenience sugar for the common BM25 + vector case, rewritten at
    parse time into two channel blocks + a fuse(). No distinct executor path.
v as var(func: bm25(text, "quick brown fox"))
e as var(func: similar_to(emb, 100, $queryVec))
f as var(func: fuse(v, e, method: "rrf", k: 60))
{ result(func: uid(f), orderdesc: val(f), first: 10) { uid val(f) } }

Design notes

  • Outer-join (union) semantics. Fusion is a set union of candidate uids across
    channels, not an intersection: a uid missing from a channel contributes nothing
    (RRF: rank = ∞; linear: score = 0), never dropped, never NaN.
  • Coordinator-side. fuse() is never dispatched to a worker; it's computed in
    populateVarMap over already-resolved value variables. The existing variable
    dependency scheduler runs channel blocks first.
  • Vector score surfacing (prerequisite). similar_to now also surfaces a
    higher-is-better similarity score (cosine/dot as-is; euclidean as 1/(1+d)) so
    vector results can be a fusion channel. Search*Scored mirror the existing
    Search* neighbor selection exactly — scoring a plain query does not change which
    neighbors it returns; the scored path is only taken when a score is needed.
  • NaN/Inf safety. Non-finite channel scores are dropped so they can never break
    the sort comparator's strict-weak-ordering or poison a linear sum.

Tests

  • Unit (query/fuse_test.go, dql/fuse_parser_test.go): RRF ranks/union/ties/weights,
    linear normalize/weights/none, topk truncation, determinism, parser options,
    hybrid expansion, error paths.
  • Integration (query): RRF/linear over BM25 channels, BM25+vector fusion,
    hybrid() equivalent to explicit fuse(), pagination, error handling, and
    similar_to score-variable surfacing.

25 fuse/hybrid/similar-score integration tests + the fusion/parser unit tests pass.

🤖 Generated with Claude Code

shaunpatterson and others added 19 commits June 3, 2026 19:56
Add BM25 relevance-ranked text search to Dgraph, enabling users to query
text predicates and receive results ordered by relevance score instead of
boolean matching.

Implementation:
- New BM25 tokenizer using the fulltext pipeline (normalize, stopwords, stem)
  that preserves term frequencies for TF counting
- BM25-specific index storage: per-term TF posting lists, doc length lists,
  and corpus statistics (doc count, total terms)
- Query execution with full BM25 scoring:
  score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf)
  IDF = log1p((N - df + 0.5) / (df + 0.5))
- DQL syntax: bm25(predicate, "query" [, "k", "b"]) as root func or filter
- Schema syntax: @index(bm25)
- Parameter validation (k > 0, 0 <= b <= 1)
- Early UID intersection for filter-mode performance
- All-stopword document and query handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three critical bugs fixed:

1. REF postings lose Value during rollup: The posting list encode/rollup
   cycle strips the Value field from REF postings without facets (list.go:1630).
   BM25 term frequencies and doc lengths were stored in Value and lost.
   Fix: Store TF and doclen as facets on REF postings, which are preserved.

2. Missing function validation: query/query.go has a separate isValidFuncName
   check from dql/parser.go. "bm25" was only added to the parser, causing
   "Invalid function name: bm25" at query time.

3. Unsorted UIDs break query pipeline: BM25 returned UIDs sorted by score,
   but the query pipeline (algo.MergeSorted, child predicate fetching) requires
   UID-ascending order. Fix: Sort UIDs ascending in UidMatrix, apply
   first/offset pagination on score-sorted results before UID sorting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the facet-based BM25 storage (~40-50 bytes/posting) with compact
varint-encoded binary blobs stored as direct Badger KV entries (~4-6
bytes/posting, ~10x reduction). Add bm25_score pseudo-predicate for
variable-based score ordering following the similar_to pattern.

- Add posting/bm25enc package for compact binary encode/decode
- Rewrite write path in posting/index.go for direct Badger KV
- Add bm25Writes buffer to LocalCache with read-your-own-writes
- Flush BM25 blobs in CommitToDisk with BitBM25Data UserMeta
- Rewrite read path in worker/task.go with direct blob decoding
- Add bm25_score pseudo-predicate in query/query.go
- Add score ordering integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cases

Cover incremental add/update/delete, IDF score stability as corpus
grows, large corpus pagination, unicode, stopwords, uid filtering,
score validation, and concurrent batch adds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c tests

Addresses test coverage gaps identified during code review against ArangoDB's
BM25 implementation:

- TestBM25ExactScoreValues: validates numerical correctness of BM25 formula
  using b=0 to enable hand-computed expected scores
- TestBM25BM15NoLengthNormalization: verifies b=0 disables length normalization
  and contrasts with default b=0.75 behavior
- TestBM25SingleMatchingDocument: covers df=1 edge case with high IDF

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 of BM25 scaling plan. Introduces bm25block package with:
- BlockMeta/Dir types for block directory encoding/decoding
- SplitIntoBlocks: splits monolithic entry slices into 128-entry blocks
- MergeAllBlocks: compacts overlapping blocks with dedup and tombstone removal
- ComputeUBPre/SuffixMaxUBPre: WAND upper-bound precomputation
- New key functions: BM25TermDirKey, BM25TermBlockKey, BM25DocLenDirKey,
  BM25DocLenBlockKey for block-addressed Badger KV storage

17 unit tests and benchmarks for the block storage format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phases 2-4 of BM25 scaling plan:

Phase 2 - Segmented mutation path:
- addBM25IndexMutations now writes to block-based storage
- Each term's postings split into ~128-entry blocks with a directory
- Blocks automatically split when exceeding 256 entries
- Doc-length list also uses block-based storage
- Block removal and directory cleanup on deletes

Phase 3 - WAND top-k query path:
- New bm25wand.go with listIter for block-based posting list iteration
- WAND algorithm with min-heap for top-k early termination
- Per-block upper bounds (UBPre) computed from maxTF at query time
- Suffix-max UBPre for efficient threshold checking
- Falls back to scoring all docs when no first: limit or offset is used

Phase 4 - Block-Max WAND:
- skipToWithBMW skips entire blocks whose UB + other terms can't beat theta
- Avoids Badger reads for blocks that can't contribute to top-k
- Enabled by default in handleBM25Search

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 5 - Migration support:
- newListIter falls back to legacy monolithic blob when no block directory exists
- lookupDocLen falls back to legacy BM25DocLenKey blob
- wandSearch falls back to legacy BM25IndexKey for df computation
- Legacy data transparently served through synthetic single-block directory
- New writes always use block format; old data works until overwritten

Unit tests for WAND components:
- TestTopKHeapBasic: heap operations, threshold, eviction
- TestTopKHeapTieBreaking: deterministic ordering on score ties
- TestBm25ScoreFunction: formula verification, tf/dl/b edge cases
- TestBm25ScoreNaN: no NaN/Inf for edge-case inputs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes critical bugs and performance issues identified by GPT-5 review:

- Fix negative inBlockPos panic: guard currentDoc/currentTF/skipTo against
  inBlockPos < 0 (possible before first next() call)
- Fix empty block pathological behavior: next()/skipTo()/skipToWithBMW() now
  skip empty blocks instead of leaving iterator in invalid state with
  MaxUint64 pivotDoc
- Fix legacy loadBlock: no longer resets inBlockPos to 0 (was moving pointer
  backwards, could cause re-scoring or infinite loops)
- Fix remainingUB panic: guard against blockIdx < 0 (before first next())
- Add docLenCache: caches doclen directory + block reads within a single
  query, avoiding repeated Badger reads per scored document
- Optimize BMW otherUB: compute as sumUB - thisUB (O(1)) instead of
  iterating all other terms (O(q^2) -> O(q))

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…UB underestimate

Three fixes:
1. CRITICAL: addBM25IndexMutations now checks if a UID already exists in doclen
   blocks before incrementing stats, preventing double-counting on SET when the
   document was already indexed (defensive guard for batch mutations).
2. HIGH: WAND sumUB now accumulates across ALL iterators (not just up to pivot),
   so BMW's otherUB calculation is correct and won't skip valid candidate blocks.
3. PERF: newListIter accepts pre-read Dir to eliminate duplicate Badger reads
   (directory was read once for df, then again inside newListIter).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ength

Defensive hardening from GPT-5 review: if inBlockPos exceeds block length
after next() reaches end of block, the sort.Search span could go negative.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DecodeCount() to bm25enc for O(1) entry count reads without
  full decode, preventing OOM on legacy migration with large posting
  lists (e.g., common terms with millions of entries)
- Use DecodeCount in WAND search legacy DF calculation path
- Fix integer overflow in DecodeDir bounds check by using uint64
  arithmetic (prevents panic on corrupted data with MaxUint32 count)
- Pre-allocate shared score buffer in handleBM25Search with
  three-index slices to prevent accidental append corruption
- Document bm25Writes concurrency model and limitations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the parallel block-storage + retrieval stack (declined in review)
with an implementation that rides Dgraph's standard posting-list machinery,
addressing the maintainer's feedback (independently endorsed by GPT-5 and
Gemini). Net ~1300 fewer lines.

Storage / indexing:
- BM25 term postings are standard index posting lists at
  IndexKey(attr, IdentBM25||term), written via the normal delta path, so they
  inherit MVCC, deltas, rollup, splits, backup and snapshot. Each posting is a
  REF posting whose value packs (term-frequency, doc-length) as two uvarints.
- Fix the linchpin: List.encode() now retains REF postings that carry a value
  through rollup (otherwise the term frequency was silently stripped). Mirrors
  how faceted postings already coexist in Pack (uid) + Postings (payload).
- Document length is packed into the posting value rather than a separate list,
  avoiding a write-hot doclen key and a per-candidate random read at query time.
- Corpus stats (docCount, totalTerms) are sharded across 32 buckets keyed by
  uid%32 so concurrent writers rarely contend, while same-bucket updates still
  conflict-and-retry (mirrors the @count pattern). Term postings get the standard
  index conflict key (fingerprint(key)^uid), so two docs sharing a term commit
  concurrently without conflict -- resolving the concurrency regression that the
  block version only mitigated via Raft serialization.
- Delete posting/bm25block and posting/bm25enc; remove LocalCache.bm25Writes,
  BitBM25Data, and the BM25 commit branch in mvcc.go.

Query / scoring:
- WAND / Block-Max WAND reworked over the standard posting-list iterator:
  per-term cursors are materialized from the in-memory List with per-128-posting
  maxTF/minDocLen bounds -- no parallel block format, no proto changes.
- Surface the score via Dgraph's existing value-variable mechanism: the bm25
  root function binds its per-doc score to its own variable (Uids + Vals), so
  `scores as var(func: bm25(...))` works with uid(scores), val(scores) and
  orderdesc: val(scores). Removes the bm25_score pseudo-predicate and the
  __bm25_scores__ ParentVars channel.
- Skip the query-layer pagination pass for bm25 roots (the worker already
  paginates over score order), mirroring the existing `has` handling, to avoid
  double-applying first/offset.

Tests:
- Rollup TF/doc-length survival, bucketed-stats accumulation (incl. same-bucket
  in-transaction read-your-own-writes and deletes), value-codec round trip, and
  a 200-trial randomized WAND/Block-Max-WAND vs brute-force correctness check.
- Convert the query integration tests to the value-variable syntax.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
updateBM25Stats maintains the bucketed (docCount, totalTerms) counters via
read-modify-write, but read them with LocalCache.GetFromDelta, which skips disk
and returns only the current transaction's in-memory delta. Across separately
committed mutations each transaction therefore started from zero and overwrote
its stats bucket instead of accumulating, collapsing the corpus document count
(e.g. to the per-term df). Since avgDL = totalTerms/docCount and idf depends on
N = docCount, every length- and idf-weighted BM25 score was wrong, while result
ordering (a constant idf factor for a single-term query) still looked correct.

Read the committed stats with LocalCache.Get instead. Term postings are
unaffected: they are additive deltas that merge through the normal posting-list
mechanism and never read-modify-write.

Found by the BM25 integration tests (exact-score and uid-filter cases). The
prior unit test only exercised a single transaction, where read-your-own-writes
masked the bug; add TestBM25StatsAccumulateAcrossTxns covering the multi-
transaction path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BM25 scores a single document (one value) per UID, so per-document length and
corpus statistics are ill-defined for a list predicate. The bucketed stats also
rely on conflict detection that a list predicate's value-dependent conflict key
would not provide (a code-review concern about stats integrity on list/
@noconflict predicates). Reject the combination in checkSchema rather than
silently mis-scoring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The redesign rationale (including the doc-length storage decision) lives in code
comments and the PR description; the planning doc does not belong in the tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add native multi-signal search fusion so a single DQL query can combine
BM25 text relevance with vector similarity (and any other scored value
variable), instead of issuing separate queries and fusing in application
code.

New surface:
- fuse(v1, v2, ..., method:"rrf"|"linear", k:60, weights:"0.3,0.7",
  normalize:"max"|"none", topk:N): an N-way combinator over already-scored
  DQL value variables. RRF = sum 1/(k+rank); linear = weighted sum of
  (optionally max-normalized) scores. Outer-join/union semantics: a uid
  missing from a channel contributes nothing (RRF) or 0 (linear), never
  dropped. Computed coordinator-side over resolved variables; the existing
  dependency scheduler orders it after its channel blocks.
- hybrid(textPred, "query", vecPred, $vec, topk:N, method:..., k:...):
  convenience sugar rewritten at parse time into bm25 + similar_to channel
  blocks plus a fuse() block (no distinct execution path).

Vector scores surfaced:
- similar_to now binds a higher-is-better similarity score (cosine/dot
  as-is; euclidean as 1/(1+d^2)) to a value variable, so vector results can
  be a fusion channel alongside BM25. New SearchScored / SearchWithUidScored
  mirror Search / SearchWithUid exactly, so scoring a plain vector query does
  not change which neighbors it returns; the *Options* scored variants apply
  only when ef/distance-threshold is supplied.

Robustness (post adversarial review by GPT-5 + Gemini):
- non-finite scores are dropped before fusion so they cannot break the sort
  comparator or poison linear sums;
- fused value variable follows the bm25 value-variable contract (ascending
  uid set + uid->score map; ranked order via orderdesc: val(var); topk
  selected before the ascending sort);
- undefined fuse channel vars are rejected at parse time;
- the __hybrid prefix is reserved to avoid synthetic/user var collisions.

Tests: fusion-core unit tests (RRF/linear/union/ties/NaN/topk/determinism),
parser tests (fuse + hybrid + error paths), score-orientation tests, and
end-to-end integration tests combining BM25 + vector (fuse and hybrid).
Design and review notes in docs/superpowers/specs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Auto-fixed issues flagged by a majority (>=3/5) of /judge runs across
Claude, GPT5, and Gemini. Minority findings documented in the deep-review report.

- hybrid(): bound the generated bm25 channel to topk (first:) so a broad text
  query no longer scores the entire corpus before fusion (5/5).
- fuse(): apply per-channel weights under RRF too, not only linear, so a user
  passing weights with the default method no longer has them silently ignored;
  default weight 1.0 keeps standard RRF (4/5).
- hybrid(): check the reserved __hybrid var prefix across nested blocks, not just
  top-level (4/5); reject malformed (odd) option lists instead of dropping them (3/5).
- fuse(): distinguish a genuinely missing channel variable (internal invariant
  error) from a channel that ran but matched nothing (valid empty channel) (3/5).

Added unit tests for weighted RRF and the hybrid bm25 bound; updated the
hybrid/fuse equivalence test to mirror the bounded bm25 channel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two issues found in deep review of native hybrid search:

1. fuse() rejected at subgraph conversion. The DQL parser accepts `fuse`
   (validFuncName), but query.go's isValidFuncName did not list it, so every
   fuse()/hybrid() query failed with "Invalid function name: fuse". The
   integration tests that exercise this are CI-gated and had not been run.
   Added "fuse"; all 25 fuse/hybrid/similar_to-score integration tests now
   pass. (hybrid is rewritten to fuse before this check, so only fuse needs
   registering.)

2. Score/UID misalignment under @filter. The generalized bm25/similar_to
   ranker binding zipped uidMatrix[0] with valueMatrix positionally; a later
   @filter on the ranker block runs updateUidMatrix (and pagination), which
   shrinks/reorders uidMatrix[0] in place without touching valueMatrix —
   misbinding scores to UIDs and feeding wrong scores into fuse() channels.
   Snapshot the aligned worker result into a uid->score map (sg.rankerScores)
   at result time, before any mutation, and bind by UID. Identical behavior on
   the tested no-filter paths.
@shaunpatterson shaunpatterson requested a review from a team as a code owner June 10, 2026 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant