research(seprag): BET 2⊗4 filtered-ANN region-pruning — qualified NO-GO (ADR-201) [finding, not a feature]#536

Open: shaal wants to merge 8 commits into ruvnet:main from shaal:docs/seprag-bet2-filtered-ann

Conversation

@shaal shaal commented Jun 4, 2026

Contributor

⚠️ This is a research finding, not a feature. The verdict is a qualified NO-GO. Merging records an ADR + reusable benchmark tooling — it does not add a production code path or claim a win. No urgency to merge; opened for visibility and the record (same as the ADR-199 kill).

TL;DR

BET 2 ⊗ BET 4 of the SepRAG research line (#534): does region-pruned IVF search beat the ruvector-acorn incumbent on correlated filtered queries? Pre-registered ≥5× distance-eval gate.

Verdict: qualified NO-GO. Region-pruning beats vanilla ACORN 6–48× at selectivity ≤ 1% — but the win does not survive the mandatory adversarial check: giving ACORN a predicate-aware entry (a simple, standard enhancement) collapses the gap to ~2× at high correlation, below the bar. A real but narrow edge remains at moderate correlation (ρ≈0.7). Full reasoning: docs/adr/ADR-201.

What's in this PR (independent of #535)

  • New self-contained crate ruvector-filtered-bench: depends only on ruvector-acorn + ruvector-rairs, with zero dependency on ruvector-seprag or PR #535 (SepRAG: CCH-inspired retrieval exploration + customizable re-weighting, ADRs 196-200).
  • ADR-201 + the pre-registration doc (gate frozen before the run).
  • Additive, result-preserving instrumentation of ruvector-acorn: acorn_search_counted, flat_filtered_search_counted, acorn_search_seeded_counted. Existing functions delegate unchanged; 13 acorn tests prove behavior is preserved. Useful tooling for anyone measuring ACORN's distance-eval cost.

Why it may still be worth landing despite the NO-GO

  • ADR-201 is a documented, cited finding — kills are first-class in this thread (cf. ADR-199).
  • The ACORN eval-counting variants and the ρ-correlation-knob benchmark are reusable for the named follow-ups (multi-predicate conjunctions; large-n).

Honest verdict (cost at matched recall, n=20k arxiv)

| ρ | sel | A vs vanilla ACORN | A vs predicate-aware-entry ACORN |
|---|---|---|---|
| 1.0 | 0.1% | 25.9× | 2.4× (below the 5× bar) |
| 1.0 | 1% | 6.1× | 1.9× (below the bar) |
| 0.7 | 1% | 9.0× | 6.5× (holds) |

Central lesson: a filtered-ANN cost claim is meaningless without a predicate-aware-entry baseline.

Notes for review

Resolves the BET 2 ⊗ BET 4 item of #534 (qualified NO-GO). Follow-ups (conjunctions, large-n, BET 4 standalone) noted in ADR-201 and #534.

shaal added 7 commits June 4, 2026 14:44
… (issue ruvnet#534)

Region-pruned filtered ANN vs tuned ACORN. New self-contained crate
ruvector-filtered-bench, depending only on ruvector-acorn (incumbent + oracle)
and ruvector-rairs (IVF) — independent of ruvector-seprag/PR ruvnet#535.

Pre-registration (docs/plans/bet2-filtered-ann/PRE-REGISTRATION.md) freezes a
selectivity-shaped win/kill gate before any contender runs: at correlation
rho>=0.7, contender A within 2% filtered-recall@10 of tuned ACORN at >=5x fewer
distance-evals/query at sel<=1% (>=2x at sel=5%), monotonic in selectivity;
graceful-degradation and wall-clock honesty guards; rho=0 recall-collapse kill
control.

M0 (plumbing, pre-freeze-safe):
- data.rs: aligned ogbn-arxiv feat/label/year loader.
- predicate.rs: rho-correlation knob holding selectivity exactly constant across
  rho, plus natural label/year predicate families.
- tests/oracle_gate.rs: exact_filtered_knn cross-checked against an independent
  brute force on a real arxiv slice (sel x rho grid). 5 tests green, clippy clean.
… baseline

Instrument ruvector-acorn with additive, result-preserving counted-search variants
(acorn_search_counted, flat_filtered_search_counted) so distance-evals — the
pre-registered primary cost metric — are measured exactly on ACORN-as-shipped.
13 acorn tests pass incl. a counted==uncounted + flat-evals==#matches invariant.

filtered-bench contenders (src/contenders.rs):
- B: ACORN predicate-agnostic search (the incumbent), exact eval counts.
- C: classic post-filter (retrieve top-pool unfiltered, then filter) — the floor.

M1 findings (n=20k arxiv, ρ=1, k=10):
- TEETH (examples/teeth.rs): at the gate-relevant low selectivity, post-filter
  collapses while ACORN holds — sel=0.1%: 73.7% vs 22.7%; sel=0.5%: 90.4% vs 59.7%;
  sel=1%: 92.6% vs 79.3%. At sel>=5% post-filter is fine (as theory predicts).
  Benchmark is demonstrably sensitive (50+ pt recall swing) — the negative control.
- TUNED ACORN (examples/acorn_tune.rs): ACORN reaches ~92.6% recall at sel=1% with
  gamma=2, ef=512, at ~1622 evals/query; evals are ~flat in ef (early-termination
  bound), so "tuned" = crank ef for recall at near-constant cost. This is the fair
  incumbent baseline for the M3 gate, and it validates the >=5x bar: contender A must
  reach >=90.6% recall at <=~324 evals/query to win.
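
The threshold arithmetic above can be sketched as follows. This is a minimal illustration, not code from the PR: `GateTarget`, `gate_target`, and `passes` are hypothetical names, and "within 2%" is read as two percentage points, which is what the 92.6% to 90.6% figure implies.

```rust
/// Thresholds a contender must meet, derived from the tuned incumbent.
/// (Hypothetical helper type, not part of ruvector-filtered-bench.)
#[derive(Debug)]
struct GateTarget {
    min_recall_pct: f64,
    max_evals: f64,
}

/// Recall may trail the incumbent by `recall_slack_pct` points;
/// evals must be at least `min_speedup` times lower.
fn gate_target(
    incumbent_recall_pct: f64,
    incumbent_evals: f64,
    recall_slack_pct: f64,
    min_speedup: f64,
) -> GateTarget {
    GateTarget {
        min_recall_pct: incumbent_recall_pct - recall_slack_pct,
        max_evals: incumbent_evals / min_speedup,
    }
}

fn passes(target: &GateTarget, recall_pct: f64, evals: f64) -> bool {
    recall_pct >= target.min_recall_pct && evals <= target.max_evals
}

fn main() {
    // Tuned ACORN at sel=1% as reported above: 92.6% recall, ~1622 evals/query.
    let target = gate_target(92.6, 1622.0, 2.0, 5.0);
    println!(
        "need recall >= {:.1}% at <= {:.1} evals/query",
        target.min_recall_pct, target.max_evals
    );
    // A contender point at 99.9% recall and 264 evals clears both thresholds.
    println!("passes: {}", passes(&target, 99.9, 264.0));
}
```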

src/prune.rs: RegionPruneIvf, built on ruvector-rairs k-means (ADR-193 substrate).
Two stacked prunings realizing the salvaged SepRAG kernel on the treewidth-immune
IVF hierarchy:
  1. predicate pruning — skip clusters with zero matching members (the BET-2 win).
  2. branch-and-bound distance pruning — triangle-inequality lower bound
     (dist(q,centroid) - radius); once the top-k heap is full, clusters whose LB
     exceeds the worst result are skipped. Probe in LB order so the bound lets us
     break, not just skip — a strict improvement over the M2-sketch's match-count
     ordering, and it yields EXACT filtered top-k.

Cost metric = nclusters (routing) + matching members scanned; the O(1) predicate
gates the expensive distance, so non-matching points cost nothing (the asymmetry
vs ACORN, which evaluates a distance per expanded node regardless of predicate).

max_probe knob: None = exact B&B (recall 1.0); Some(p) caps match-clusters probed
(trades recall for fewer evals, mirroring ACORN's ef) for equal-recall comparison.
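
A rough sketch of the two stacked prunings and the cost metric described above, under stated assumptions: `Cluster`, `make_cluster`, `demo`, and `region_pruned_search` are illustrative names rather than the `RegionPruneIvf` API, the top-k is a small sorted `Vec` instead of a heap, and the "does this cluster contain a match" test simply scans member ids with the O(1) predicate (uncharged, as in the metric above).

```rust
/// One IVF cell: centroid, covering radius (max centroid-to-member distance), member ids.
struct Cluster {
    centroid: Vec<f32>,
    radius: f32,
    members: Vec<usize>,
}

fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Build a cell from member ids: mean centroid plus a radius covering every member.
fn make_cluster(data: &[Vec<f32>], members: Vec<usize>) -> Cluster {
    let mut centroid = vec![0.0f32; data[0].len()];
    for &id in &members {
        for (c, x) in centroid.iter_mut().zip(&data[id]) {
            *c += x / members.len() as f32;
        }
    }
    let radius = members.iter().map(|&id| l2(&centroid, &data[id])).fold(0.0, f32::max);
    Cluster { centroid, radius, members }
}

/// Returns (top-k as (distance, id), nearest first; distance evals charged).
fn region_pruned_search(
    query: &[f32],
    data: &[Vec<f32>],
    clusters: &[Cluster],
    pred: impl Fn(usize) -> bool,
    k: usize,
    max_probe: Option<usize>,
) -> (Vec<(f32, usize)>, usize) {
    let mut evals = clusters.len(); // routing is charged as one eval per centroid
    if k == 0 {
        return (Vec::new(), evals);
    }
    // 1. Predicate pruning: drop clusters with zero matching members.
    let mut cands: Vec<(f32, &Cluster)> = clusters
        .iter()
        .filter(|c| c.members.iter().any(|&id| pred(id)))
        .map(|c| ((l2(query, &c.centroid) - c.radius).max(0.0), c))
        .collect();
    // 2. Probe in ascending lower-bound order so the bound can break, not just skip.
    cands.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut top: Vec<(f32, usize)> = Vec::new();
    for (probed, (lb, c)) in cands.iter().enumerate() {
        if max_probe.map_or(false, |p| probed >= p) {
            break; // probe cap reached: approximate mode
        }
        if top.len() == k && *lb > top[k - 1].0 {
            break; // every remaining cluster has an even larger lower bound
        }
        for &id in &c.members {
            if pred(id) {
                evals += 1; // only matching members cost a distance
                top.push((l2(query, &data[id]), id));
            }
        }
        top.sort_by(|a, b| a.0.total_cmp(&b.0));
        top.truncate(k);
    }
    (top, evals)
}

/// Toy data: three well-separated 2-D groups of three points, one cell per group.
fn demo() -> (Vec<Vec<f32>>, Vec<Cluster>) {
    let data: Vec<Vec<f32>> = vec![
        vec![0.0, 0.0], vec![1.0, 0.0], vec![0.0, 1.0],
        vec![10.0, 10.0], vec![11.0, 10.0], vec![10.0, 11.0],
        vec![20.0, 0.0], vec![21.0, 0.0], vec![20.0, 1.0],
    ];
    let clusters = (0..3).map(|c| make_cluster(&data, (3 * c..3 * c + 3).collect())).collect();
    (data, clusters)
}

fn main() {
    let (data, clusters) = demo();
    // The predicate matches one point in cell 0 and one in cell 1; cell 2 has no match.
    let pred = |id: usize| id == 1 || id == 4;
    let (top, evals) = region_pruned_search(&[0.9, 0.1], &data, &clusters, pred, 1, None);
    println!("top = {:?}, evals = {}", top, evals);
}
```

With `max_probe = None` the loop only stops on the lower-bound test, so the result is the exact filtered top-k; `Some(p)` is the recall-for-evals trade mentioned above.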

Tests: exact_bb_matches_oracle (recall 1.0 vs exact_filtered_knn on 20 queries) and
zero_match_clusters_are_skipped (1% selectivity → <1000 evals vs 4000 full scan).
8 unit + 1 integration green, clippy clean.
… sel<=1%)

examples/sweep.rs: full selectivity x rho grid, cost-at-matched-recall comparison
(tune A's probe cap to ACORN's recall, then compare distance-evals), with the
wall-clock honesty guard and the rho=0 kill control.
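
For readers unfamiliar with cost-at-matched-recall, a minimal sketch of the comparison follows. `OpPoint`, `cheapest_at_recall`, and `eval_ratio` are hypothetical names, not the sweep.rs API, and the contender curve mixes one reported point with purely illustrative ones (marked in comments).

```rust
/// One operating point of a tunable index: knob value, recall@k in percent,
/// and mean distance evals per query.
#[derive(Clone, Copy, Debug)]
struct OpPoint {
    knob: usize,
    recall: f64,
    evals: f64,
}

/// Cheapest point on the curve whose recall reaches `target`; None if none does.
fn cheapest_at_recall(curve: &[OpPoint], target: f64) -> Option<OpPoint> {
    curve
        .iter()
        .copied()
        .filter(|p| p.recall >= target)
        .min_by(|a, b| a.evals.total_cmp(&b.evals))
}

/// Incumbent evals divided by the contender's evals at matched-or-better recall.
fn eval_ratio(incumbent: OpPoint, contender_curve: &[OpPoint]) -> Option<f64> {
    cheapest_at_recall(contender_curve, incumbent.recall).map(|p| incumbent.evals / p.evals)
}

fn main() {
    // Incumbent point as reported in this PR (ACORN, ef=512, rho=1, sel=1%).
    let acorn = OpPoint { knob: 512, recall: 92.6, evals: 1622.0 };
    let a_curve = [
        OpPoint { knob: 2, recall: 85.0, evals: 150.0 },   // illustrative only
        OpPoint { knob: 8, recall: 99.9, evals: 264.0 },   // recall/evals as reported; knob value made up
        OpPoint { knob: 16, recall: 100.0, evals: 400.0 }, // illustrative only
    ];
    let matched = cheapest_at_recall(&a_curve, acorn.recall);
    println!("matched point: {:?}", matched);
    println!("eval ratio: {:?}", eval_ratio(acorn, &a_curve));
}
```

The point of tuning to the incumbent's recall first is that a cheaper but lower-recall setting (the first row here) never counts toward the ratio.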

VERDICT vs the frozen gate (n=20k, ACORN gamma=2 ef=512, IVF nclusters=64):
- WIN at sel<=1%, rho>=0.7: region-pruned IVF beats tuned ACORN by 6.1-48x evals
  and 4.7-26x wall-clock at equal-or-better recall (A's exact B&B recall >= ACORN).
  e.g. rho=1 sel=1%: ACORN 92.6%@1622 evals vs A 99.9%@264 evals = 6.1x (4.7x wall).
- MISS at sel=5%: best 1.5x (gate wanted >=2x). The win is a low-selectivity
  (<=1%) phenomenon — the dominant production metadata-filter regime, but a real
  boundary, not the full pre-registered claim.
- Mechanism partly refuted: A also wins at rho=0 (low sel), so the eval advantage
  is selectivity-driven (few matches -> cheap exact B&B) more than correlation-
  driven; correlation governs recall, not cost. Reported, not buried.
- rho=0 kill control: A does NOT collapse (recall-safe); high-sel (>=10%) A loses
  as expected (ACORN's regime). Wall-clock guard: PASS (win survives the clock).

nclusters is A's tuning knob (parallel to ACORN's ef): 64 beats 128 in the win
regime (cheaper routing); both confirm the same boundary.
…y fails the gate

Adds predicate-aware-entry ACORN (the rule-#5 "tune harder" adversary):
- ruvector-acorn: acorn_search_seeded_counted (beam starts from caller seeds instead
  of multi-probe entry); acorn_search_impl refactored to take Option<seeds>, existing
  fns pass None — 13 acorn tests still green (behavior preserved).
- contenders.rs: Acorn::search_predicate_entry: stride-sample probes, predicate-test them
  for free (uncharged), distance-eval only the matching probes, seed the beam from the nearest matches.
- examples/adversarial.rs: A vs best-of(vanilla-B, predicate-entry-D) at matched recall.
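
The seed-selection step of the predicate-aware entry can be sketched roughly as below. This is an assumption-laden illustration: `predicate_entry_seeds` is a hypothetical standalone function, not `Acorn::search_predicate_entry`, and the resulting ids would be handed to a seeded beam search such as `acorn_search_seeded_counted`, whose signature is not shown in this PR and so is not called here.

```rust
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Stride-sample the corpus, predicate-test each probe (uncharged), compute a
/// distance only for matching probes, and keep the nearest `n_seeds` as entry points.
/// Returns (seed ids nearest first, distance evals spent on seeding).
fn predicate_entry_seeds(
    query: &[f32],
    data: &[Vec<f32>],
    pred: impl Fn(usize) -> bool,
    stride: usize,
    n_seeds: usize,
) -> (Vec<usize>, usize) {
    let mut scored: Vec<(f32, usize)> = Vec::new();
    for id in (0..data.len()).step_by(stride.max(1)) {
        if pred(id) {
            // Only matching probes cost a distance evaluation.
            scored.push((l2_sq(query, &data[id]), id));
        }
    }
    let evals = scored.len();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.truncate(n_seeds);
    (scored.into_iter().map(|(_, id)| id).collect(), evals)
}

fn main() {
    // Toy 1-D corpus: point i sits at coordinate i.
    let data: Vec<Vec<f32>> = (0..20).map(|i| vec![i as f32]).collect();
    let pred = |id: usize| id % 4 == 0;
    let (seeds, evals) = predicate_entry_seeds(&[9.0], &data, pred, 2, 2);
    println!("seeds = {:?}, seeding evals = {}", seeds, evals);
}
```

Note that the loop still runs one predicate test per probe; the eval metric treats those as free, which is the at-scale caveat raised in the finding that follows.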

FINDING (rule #5 changed the verdict): predicate-aware entry slashes ACORN's cost at
HIGH correlation (rho=1 sel=0.1%: 3753 -> 203 evals), collapsing A's advantage from
44.7x (vs vanilla) to 2.4x — BELOW the pre-registered 5x bar. A vs best ACORN:
  rho=1.0: 2.4x / 2.3x / 1.9x (sel .001/.005/.01) — MISS at the 5x bar.
  rho=0.7: 38.8x / 14.6x / 6.5x — WIN (D's seeding is weak at moderate correlation,
           where matches are scattered so a seeded walk still wanders).
So A and predicate-entry-ACORN exploit the SAME structure and converge (~2x) at high
correlation; A's clean win is NOT robust to a properly-tuned ACORN. Honest verdict:
largely a KILL at the pre-registered bar, with a narrower conditional edge at rho~0.7.
Caveat favoring A: D's seeding leans on ~16k "free" predicate tests (the eval metric
ignores the O(1) predicate scan); at scale that scan isn't free, restoring some edge.
…O-GO (M4)

Writes up the BET 2 ⊗ BET 4 outcome with ADR-199/200 honesty. Verdict: region-pruned
IVF beats VANILLA ACORN 6-48x evals (4.7-26x wall-clock) at sel<=1%, but the
pre-registered >=5x WIN does NOT survive the rule-#5 adversarial check: giving ACORN
a predicate-aware entry collapses the gap to ~2x at high correlation (rho=1), below the
bar. A retains a narrow conditional edge at moderate correlation (rho~0.7, 6-39x) plus
an at-scale caveat (D's seeding leans on a ~full predicate scan the eval metric treats
as free). Net: the bet does not cleanly pay; the clean win was an artifact of an
under-equipped incumbent. Central lesson: a filtered-ANN cost claim is meaningless
without a predicate-aware-entry baseline.

Also strips a stray tag from the pre-registration doc (non-semantic).
The experiment's own evidence points to two flip conditions (conjunctions where
ACORN's predicate-seeding degrades but cluster-skip composes; large-n where the
predicate scan stops being free) and the open BET 4 standalone baseline.
… hold)

A conjunction is a single O(1) boolean predicate of selectivity = product; in the
distance-eval metric it reduces to (selectivity, scatter) — both already swept. The
'exponentially-unlikely seed' reasoning was wrong (testing a conjunction is O(1)).
Residual leads downgraded to narrow/speculative (predicate-eval cost, large-n).
Recommend closing BET 2 ⊗ BET 4; thread value is BET 1 productionization + BET 3.
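
A small sketch of the conjunction argument, with hypothetical names (`Meta`, `selectivity`, `demo_meta`) and synthetic metadata: the conjunction is just one more O(1) boolean test per point, and when the two predicates are independent (as they are by construction in this toy data) its selectivity is the product of the individual selectivities.

```rust
/// Per-point metadata (illustrative; the benchmark's real predicate families are label/year based).
struct Meta {
    label: u8,
    year: u16,
}

/// Fraction of points satisfying a predicate.
fn selectivity(meta: &[Meta], pred: impl Fn(&Meta) -> bool) -> f64 {
    meta.iter().filter(|&m| pred(m)).count() as f64 / meta.len() as f64
}

/// Synthetic metadata where label and year vary independently.
fn demo_meta() -> Vec<Meta> {
    (0..1000u32)
        .map(|i| Meta { label: (i % 10) as u8, year: 2000 + ((i / 10) % 10) as u16 })
        .collect()
}

fn main() {
    let meta = demo_meta();
    let sel_label = selectivity(&meta, |m| m.label == 3);
    let sel_year = selectivity(&meta, |m| m.year >= 2008);
    // The conjunction is still a single constant-time test per point.
    let sel_both = selectivity(&meta, |m| m.label == 3 && m.year >= 2008);
    println!("label: {sel_label}, year: {sel_year}, both: {sel_both}");
    println!("product: {}", sel_label * sel_year);
}
```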
ruvnet added a commit that referenced this pull request Jun 18, 2026
…e — scale-gated WIN (ADR-206) (#542)

* docs(bet4): pre-register LB-B&B IVF vs plain-IVF nprobe gate (FROZEN)

Closes the BET 4 caveat left open by ADR-201: the region-pruning IVF
kernel was only run against ACORN (BET 2), never against its natural
incumbent, plain IVF nprobe, on unfiltered ANN. Frozen gate: WIN = >=2x
member-scan reduction at matched recall@10 (R=0.95) AND wall-clock win
across nclusters in {64,256,1024}; KILL = <1.5x or wall-clock reverses.
Two controls: exact-vs-exact pruning-fraction probe + low-d (PCA-8)
soundness control. Honest prior: NO-GO lean (128-d concentration makes
the triangle-inequality bound loose) — the IVF-level companion to
ADR-199. Branch off clean main; B&B kernel rebuilt self-contained
(BET 2's lives only on #536).

* feat(bet4): M0 — self-contained BnBIvf kernel + oracle gate (exactness certified)

New crate ruvector-bet4-ivf-bench (deps: ruvector-rairs, rand).
- data.rs: aligned arxiv 128-d feature CSV loader.
- kernel.rs: BnBIvf — IVF probed in ascending lower-bound order with B&B
  early termination (break when LB >= kth-best); LB(q,c)=max(0,|q-mu_c|-r_c),
  r_c=max member radius. Full budget = exact; max_probe cap = nprobe analogue.
  Built on ruvector-rairs kmeans so it shares centroids with the IvfFlat
  incumbent (shared-index pre-reg requirement).
- oracle.rs: brute-force exact kNN + recall@k + shared true-L2 helper.
- M0 gate test PASSES on real arxiv slice: full-budget B&B == oracle
  (recall@10 >= 0.999) → B&B invariant certified. clippy clean.

Frozen gate: docs/plans/bet4-ivf-pruning/PRE-REGISTRATION.md. Off clean main.

* feat(bet4): M1 — instrumented plain-IVF incumbent on shared index + faithfulness gate

BnBIvf::search_nprobe: the plain-IVF incumbent strategy (nprobe nearest
centroids, scan all members, no B&B) on the SAME centroids/lists as the
B&B contender, with member-eval counting. Refactored top-k accumulation
into shared consider()/finalize() so both strategies accumulate
identically and only the probe loop differs (shared-index pre-reg
requirement). New gate instrumented_nprobe_matches_rairs PASSES: recall
matches ruvector-rairs::IvfFlat within 0.01 at matched params → the
cost-measured incumbent is algorithmically the real one. 3 tests green.

* feat(bet4): M2/M3 — steelman B&B + PCA-8 control + matched-recall sweep

- kernel: search_bnb_skip — the STEELMAN. Centroid-distance order (the
  effective nprobe ordering) + per-cluster LB-skip (correctness-safe in
  any order, unlike the LB-order global break). The strongest cluster-level
  B&B: if it can't beat tuned nprobe, the bound doesn't pay.
- pca: minimal power-iteration top-m PCA (no linalg dep) for the low-dim
  control — projects real arxiv features to 8-d where the bound is tight.
- examples/ivf_pruning_sweep: 3 contenders share one index per nclusters
  (plain nprobe / B&B LB-order / B&B steelman) x 2 regimes (128-d, PCA-8),
  exact-regime pruning probe, matched-recall@0.95, frozen-gate verdict.
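
To make the steelman concrete, here is a hedged sketch of plain nprobe and centroid-order LB-skip sharing one probe loop. `ivf_search`, `Cluster`, `make_cluster`, and `demo` are illustrative names, not the `BnBIvf` API, and the toy data is synthetic.

```rust
/// One IVF cell: centroid, covering radius, member ids.
struct Cluster {
    centroid: Vec<f32>,
    radius: f32,
    members: Vec<usize>,
}

fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

fn make_cluster(data: &[Vec<f32>], members: Vec<usize>) -> Cluster {
    let mut centroid = vec![0.0f32; data[0].len()];
    for &id in &members {
        for (c, x) in centroid.iter_mut().zip(&data[id]) {
            *c += x / members.len() as f32;
        }
    }
    let radius = members.iter().map(|&id| l2(&centroid, &data[id])).fold(0.0, f32::max);
    Cluster { centroid, radius, members }
}

/// Probe the `nprobe` nearest clusters in centroid-distance order.
/// With `lb_skip`, a cluster is skipped when max(0, |q - mu_c| - r_c) >= current kth-best.
/// Returns (top-k as (distance, id), member distance evals).
fn ivf_search(
    q: &[f32],
    data: &[Vec<f32>],
    clusters: &[Cluster],
    k: usize,
    nprobe: usize,
    lb_skip: bool,
) -> (Vec<(f32, usize)>, usize) {
    assert!(k > 0);
    let mut order: Vec<(f32, &Cluster)> =
        clusters.iter().map(|c| (l2(q, &c.centroid), c)).collect();
    order.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut top: Vec<(f32, usize)> = Vec::new();
    let mut member_evals = 0;
    for (d, c) in order.iter().take(nprobe) {
        let lb = (d - c.radius).max(0.0);
        // `continue`, not `break`: in centroid order a later cluster may have a smaller LB.
        if lb_skip && top.len() == k && lb >= top[k - 1].0 {
            continue;
        }
        for &id in &c.members {
            member_evals += 1;
            top.push((l2(q, &data[id]), id));
        }
        top.sort_by(|a, b| a.0.total_cmp(&b.0));
        top.truncate(k);
    }
    (top, member_evals)
}

/// Toy data: three well-separated 2-D groups of three points, one cell per group.
fn demo() -> (Vec<Vec<f32>>, Vec<Cluster>) {
    let data: Vec<Vec<f32>> = vec![
        vec![0.0, 0.0], vec![1.0, 0.0], vec![0.0, 1.0],
        vec![10.0, 10.0], vec![11.0, 10.0], vec![10.0, 11.0],
        vec![20.0, 0.0], vec![21.0, 0.0], vec![20.0, 1.0],
    ];
    let clusters = (0..3).map(|c| make_cluster(&data, (3 * c..3 * c + 3).collect())).collect();
    (data, clusters)
}

fn main() {
    let (data, clusters) = demo();
    let q = [0.9f32, 0.1];
    let (_, untuned) = ivf_search(&q, &data, &clusters, 1, 3, false);
    let (_, skip) = ivf_search(&q, &data, &clusters, 1, 3, true);
    let (_, tuned) = ivf_search(&q, &data, &clusters, 1, 1, false);
    println!("nprobe=3 plain: {untuned}, nprobe=3 LB-skip: {skip}, nprobe=1 plain: {tuned}");
}
```

On this toy index the skip only removes the far cells that a tuned nprobe would not have visited anyway, which is the kind of redundancy the sweep is designed to detect.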

RESULT (n=20k & n=50k both): steelman = 1.00x evals vs nprobe in EVERY
cell, BOTH regimes. NO-GO. Mechanism is structural, not dimensional: the
LB bound only prunes FAR clusters that tuned nprobe already skips, so it's
redundant with nprobe's centroid-distance cutoff. Exact-prune fraction
scales correctly with dim (0-13% @128-d, 8-87% @PCA-8) => kernel sound;
the redundancy is fundamental. LB-ORDER (faithful BET-2 kernel) is strictly
WORSE (0.18-0.25x) — LB-ordering probes far large-radius clusters early.

* docs(bet4): ADR-205 — cluster-pruning vs plain IVF nprobe = structural NO-GO

Verdict: NO-GO (robust, structural). Steelman B&B (centroid order +
LB-skip) ties tuned nprobe at exactly 1.00x member-evals in every cell,
n=20k & n=50k, 128-d & PCA-8. Mechanism: the triangle-inequality bound
only prunes FAR clusters that tuned nprobe already skips => redundant with
nprobe's centroid-distance cutoff; win is structurally impossible, not
just hard in high-d. LB-order (faithful BET-2 kernel) strictly worse
(0.18-0.25x). Companion to ADR-199.

Honest deviation recorded: the pre-registered PCA-8 control expected a B&B
WIN (tight bound). It tied instead — the premise was false (tight bound
beats full-scan, not tuned nprobe). Control still valid: exact-prune
fraction scales correctly with dim (0-13% @128-d, 8-82% @PCA-8) => kernel
sound; it revealed the structural redundancy. Scoreboard 2 WINS / 4 KILLS.

* chore(bet4): lockfile for ruvector-bet4-ivf-bench workspace member

* docs(bet5): FROZEN pre-registration — PQ/IVFADC within-list pruning vs tuned nprobe

Opens the one lever ADR-205 left explicitly open (within-list PQ asymmetric
distance, orthogonal to the killed cluster-level bound). Frozen gate: PQ must
beat the cheaper of {plain full-L2, early-abandon exact-L2} nprobe by >=2x
full-L2-equivalent member-evals at recall@10=0.95 AND wall-clock, across
nclusters{64,256,1024} at >=1 scale N>=50k. Honest prior: ~55% win-at-scale,
named kill-paths = amortization crossover + concentration re-rank ceiling.
Stacked on feat/seprag-bet4-ivf-pruning to reuse ruvector-bet4-ivf-bench.
Thread #534.

* feat(bet5): M0 — PqIvf (IVFADC) kernel + early-abandon steelman + gate

PqIvf trains m sub-quantizers on the shared ruvector-rairs k-means substrate
(kmeans assignments ARE the PQ codes), encodes corpus to m-byte codes, and adds
search_adc_rerank (cheap ADC scan of nprobe lists + exact L2 re-rank of top-R)
plus search_adc_only (pure-ADC ceiling probe). AdcCost charges everything in one
honest unit: 256 (LUT) + adc_members*m/D + rerank*1 full-L2-equivalents.
BnBIvf gains search_nprobe_abandon = the early-abandon exact-L2 steelman
incumbent (user-confirmed verdict-setter), charged in dims_touched/D.
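
The cost unit above can be written out as a small helper. This is a sketch only: the field and method names are guesses, not the crate's actual `AdcCost` definition, and the member and re-rank counts in `main` are illustrative. The 256 term is taken directly from the formula above.

```rust
/// Work done by one ADC-plus-rerank query, before conversion to a common unit.
/// (Field names are assumptions; only the formula comes from the commit message.)
struct AdcCost {
    adc_members: usize, // members scored with the cheap m-lookup ADC distance
    rerank: usize,      // candidates re-scored with a full exact L2
}

impl AdcCost {
    /// Cost in full-L2-equivalents: 256 (LUT) + adc_members * m / D + rerank * 1.
    fn full_l2_equivalents(&self, m: usize, dim: usize) -> f64 {
        const LUT: f64 = 256.0;
        LUT + self.adc_members as f64 * m as f64 / dim as f64 + self.rerank as f64
    }
}

fn main() {
    // m = 16 sub-quantizers on 128-d vectors; member and rerank counts are illustrative.
    let cost = AdcCost { adc_members: 4000, rerank: 100 };
    println!("PQ cost: {} full-L2-equivalents", cost.full_l2_equivalents(16, 128));
    println!("plain scan of the same lists: {} full-L2 evals", cost.adc_members);
}
```

Charging the lookup table as a fixed per-query term is what makes the amortization crossover visible: with few members scanned, the 256 dominates.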

Gates (real 2k arxiv slice): PqIvf shares centroids w/ BnBIvf; PQ@full-rerank
exact (recall>=0.999); early-abandon exact vs full L2 (<0.001). 6 tests green,
clippy clean. Thread #534, BET5 pre-reg frozen at 1d920b3.
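
For context, a minimal sketch of an early-abandon exact L2 and its dims_touched/D charge. `l2_sq_early_abandon` is a hypothetical helper, not the `search_nprobe_abandon` implementation; it works on squared distances and assumes the caller passes the current kth-best squared distance as the bound.

```rust
/// Squared L2 that stops as soon as the partial sum exceeds `bound_sq`.
/// Returns (Some(dist_sq) if the candidate stayed within the bound, else None;
/// number of dimensions touched).
fn l2_sq_early_abandon(a: &[f32], b: &[f32], bound_sq: f32) -> (Option<f32>, usize) {
    debug_assert_eq!(a.len(), b.len());
    let mut acc = 0.0f32;
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        acc += (x - y) * (x - y);
        if acc > bound_sq {
            // Terms are non-negative, so the full distance can only be larger.
            return (None, i + 1);
        }
    }
    (Some(acc), a.len())
}

fn main() {
    let q = [0.0f32; 8];
    let mut far = [0.0f32; 8];
    far[0] = 3.0;
    let (dist, dims) = l2_sq_early_abandon(&q, &far, 4.0);
    // Charged cost in full-L2-equivalents is dims_touched / D.
    println!("dist = {:?}, cost = {}", dist, dims as f64 / q.len() as f64);
}
```

Because abandoned candidates could never have entered the top-k, the result set is identical to a full-L2 scan; only the charged cost changes.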

* feat(bet5): M1/M2/M3 — matched-recall PQ sweep harness

examples/pq_pruning_sweep.rs: shared index per nclusters; tune incumbent nprobe
to min reaching recall@10>=0.95; PQ scans the SAME nprobe lists (cannot rerank an
unscanned neighbour) and we tune the smallest re-rank R recovering >=0.95. Charges
all PQ ops in full-L2-equivalents (256 LUT + adc*m/D + R rerank). Reports pure-ADC
ceiling, R*, early-abandon dim-prune fraction, wall-clock, crossover n*, frozen gate.
Thread #534.

* style(bet5): clippy-clean PQ kernel + sweep (iterator idioms, type alias)

* perf(bet5): shared IvfParts — build k-means once per cell, not per contender

Extract build_ivf -> IvfParts; BnBIvf::from_parts + PqIvf::from_parts reuse one
seeded k-means for the incumbent and every PQ(m). Cuts the worst cell (nc=1024
@100k) from 3x k-means to 1x while guaranteeing the shared-index property by
construction. Behavior-preserving (N=5000 numbers identical). 6 tests green.

* fix(bet5): charge routing (nclusters centroid evals) to both contenders

Pre-reg accounting + 'no free routing' adversarial check require the nclusters
query-centroid routing evals charged equally to incumbent AND PQ. Harness omitted
it, silently flattering PQ where routing dominates (high nclusters). Now prints
member-only ratio (transparency) AND the gate-deciding TOTAL ratio with routing;
verdict decided on total. Wall-clock already included routing (search computes
centroid dists) so the wall guard was already honest. Re-run authoritative.
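
The effect of the routing charge on the ratio is simple arithmetic, sketched below with a hypothetical `ratios` helper and illustrative numbers (not measurements from this sweep): adding the same nclusters term to both sides pulls any ratio above 1 toward 1.

```rust
/// Returns (member-only ratio, total ratio with routing charged to both sides).
fn ratios(incumbent_member_evals: f64, pq_member_evals: f64, nclusters: f64) -> (f64, f64) {
    let member_only = incumbent_member_evals / pq_member_evals;
    let total = (incumbent_member_evals + nclusters) / (pq_member_evals + nclusters);
    (member_only, total)
}

fn main() {
    // Illustrative numbers only: a 3.0x member-only advantage at nclusters = 1024.
    let (member_only, total) = ratios(3000.0, 1000.0, 1024.0);
    println!("member-only: {:.2}x, total with routing: {:.2}x", member_only, total);
}
```

In this made-up case a comfortable member-only margin drops just under a 2x gate once routing is charged, which is why the verdict is decided on the total.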

* docs(bet5): ADR-206 — PQ/IVFADC within-list pruning = scale-gated WIN

Opens ADR-205's one open lever (within-list PQ asymmetric distance, orthogonal
to the killed cluster-level bound). PQ (cheap ADC scan + exact top-R rerank)
beats tuned plain nprobe AND the early-abandon exact-L2 steelman by >=2x
full-L2-equivalent member-evals at recall@10=0.95 AND wall-clock, across all
three nclusters{64,256,1024} at N=100k. Win GROWS with N, crossover n* RISES
with nclusters (routing amortization) -> >=2x at nclusters~sqrt(n) from n~20-50k.

Honest caveats (none buried): win rides on the exact rerank not pure ADC
(ceiling ~0.5) = IVFADC+refine validated, not a new method; scale-gated (full
sweep only at 100k); nc=1024/100k knife-edge 2.03x; m=16 tuned; recall-floor
tunability flatters PQ modestly; steelman halved the naive-L2 ratio. Routing
charge bug in my own harness caught by the pre-registered 'no free routing'
check (nc=1024/50k 2.24x member -> 1.65x total). Scoreboard 3 WINS / 4 KILLS.
Thread #534, pre-reg frozen at 1d920b3.

---------

Co-authored-by: ruv <ruvnet@users.noreply.github.com>