[ADR-258] Multiresolution Hash Encoding & Neural Index v2 — Phase 1 by shaal · Pull Request #3 · shaal/ruvector

shaal · 2026-06-18T05:19:42Z

What & why

Implements Phase 1 of ADR-258, adapting the Instant-NGP multiresolution hash encoding (Müller et al., SIGGRAPH 2022, arXiv:2201.05989) into RuVector's GNN-over-HNSW self-learning loop.

The self-learning loop today is bandwidth-bound: online InfoNCE updates touch full d_embed embeddings and accumulate dense gradients through MmapGradientAccumulator. Multiresolution hash encoding replaces that with O(L) cache-resident table lookups whose gradients are sparse (only 2^d_idx·L params per sample), giving the GNN an explicit multi-scale signal aligned with the HNSW layer hierarchy, at a fixed, bounded memory budget.

advanced/neural_hash.rs already exists but is binary LSH (non-differentiable, single-scale). MHE is additive — a different, trainable, continuous, interpolated object — not a replacement.

Full rationale, alternatives, success criteria, risks, and the phased plan are in the ADR: docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md.

What's in this PR

New crate ruvector-hashenc (dependency-light: thiserror only, optional memmap2; WASM-friendly):

HashEncoder — projection (LockedRandom/PcaInit) into a low-d_idx index space, hashed multiresolution grid (L levels, T table size, F features), d-linear interpolation; dense collision-free coarse levels + spatial hashing (h(x)=⊕ xᵢπᵢ mod T) for fine levels.
FeatureTables + GradAccum — trainable tables with a sparse-scatter backward and fused AXPY update; file persistence.
Tests: partition-of-unity, determinism, dense-coarse, save/load, and a finite-difference vs analytic gradient check (the differentiability proof).
Criterion benches (encode, forward+backward).

GNN integration (behind hashenc feature, default off → backward compatible):

FeatureSource trait with FlatEmbedding (legacy, zero-overhead default) and HashAugmented (concat raw + encoded). RuvectorLayer::forward signature unchanged; only input_dim grows.

Self-learning validation harness (ruvector-selflearn):

Reproducible online workload on a low-dim latent manifold lifted to high-D with multi-frequency (multi-scale) relevance; contrastive (InfoNCE) updates with hard negatives.
Statistical rigor: ≥5 seeds, mean ± 95% CI, Cohen's d; emits CSV + ASCII curve + bench_results/selflearn_REPORT.md.

Phase-1 results (5 seeds)

Metric	Baseline	HashEnc	Δ	Effect size
Recall@10 (final)	0.196	0.289	+47.3%	Cohen's d = 1.24
Recall@100 (final)	0.280	0.341	+21.8%	Cohen's d = 1.00
Sessions to surpass baseline	—	1.0	reaches + exceeds baseline	—
Encode cost added per query	—	+1.83 µs	+3.1% of a ~60µs query (S7 ✓)	—

Baseline learns a diagonal metric; the hashenc variant freezes the linear part and learns only the encoder, isolating its contribution on a relevance target a linear metric provably cannot capture.

Reproduce / verify

cargo test -p ruvector-hashenc                                     # incl. gradient check
cargo run -p ruvector-hashenc --bin ruvector-selflearn --release   # -> bench_results/
cargo test -p ruvector-gnn --features hashenc feature_source       # integration
cargo check -p ruvector-gnn                                        # default build unaffected

All green locally; clippy clean on the new crate.

Scope / safety

Default builds are unchanged (hashenc and the new crate are opt-in).
No changes to existing crate behavior; ruvector-hashenc added to workspace members.

Roadmap (tracked here since Issues are disabled on this repo)

Phase 1 — encoder + GNN FeatureSource integration + self-learning harness + gradient-check (this PR)
Phase 2 — residual GAT attention, hard-negative mining, temperature annealing, temporal replay + EWC guard, learned projection; rerun harness against the live GNN-over-HNSW index (targets S1–S3, S7)
Phase 3 — TieredFeatureStore (HOT MHE / WARM PQ·RaBitQ / COLD block-aligned, wiring quantization into the live path, DbOptions.quantization is silently ignored — never applied by VectorDB (misleading default + docs) ruvnet/RuVector#563), async query path, AVX512/NEON/wasm gather kernels, full suite + comparison report; promote default if S-criteria pass

Success criteria (ADR-258 §5)

S1 Recall@10 +25–50% · S2 Recall@100 +15–35% · S3 convergence 2–3× · S4 QPS 1.8–3× · S5 p50→25–40µs · S6 mem −25–45% · S7 overhead ≤+15%. Promotion gated on S1 ∧ S3 with ≥5 seeds, 95% CI, Cohen's d ≥ 0.8.

🤖 Generated with claude-flow

Generated by Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added an optional hash-encoding capability (feature-flagged) to enhance GNN node feature representation with trainable multiresolution parameters.
- Introduced a pluggable node feature source abstraction and a new residual attention block for improved message passing.
Tests / Benchmarks
- Added gradient-check and end-to-end online-learning tests for the new encoder, plus a self-learning validation harness and microbenchmarks.
Documentation
- Added ADR-258 and a self-learning validation report with baseline vs. hash-encoding performance comparisons.

Design for adopting Instant-NGP-style multiresolution hash encoding (arXiv:2201.05989) as a trainable, multi-scale, mmap-persisted feature source for the GNN-over-HNSW self-learning loop, plus GNN/self-learning upgrades, unified tiered storage (PQ/RaBitQ), and an async query path. Includes measurable success criteria, validation plan, risks, and a phased rollout behind a `hashenc` feature flag. Co-Authored-By: claude-flow <ruv@ruv.net> Claude-Session: https://claude.ai/code/session_01X6ARy2fKVicSxK3GtjkV7x

…tegration + self-learning harness (ADR-258) Implements Phase 1 of ADR-258 (Instant-NGP-style multiresolution hash encoding for RuVector's neural index, arXiv:2201.05989). New crate `ruvector-hashenc`: - HashEncoder: projection (LockedRandom/PcaInit) into a low-d index space, hashed multiresolution grid (L levels, T table size, F features), d-linear interpolation; dense collision-free coarse levels + spatial hashing for fine. - FeatureTables + GradAccum: trainable tables with a sparse-scatter backward (only 2^d_idx*L params per sample) and a fused AXPY update; file persistence. - Dependency-light (thiserror only; optional memmap2), WASM-friendly. - Tests: partition-of-unity, determinism, dense-coarse, save/load, and a finite-difference vs analytic gradient check (differentiability proof). - Criterion benches for encode and forward+backward. GNN integration (behind `hashenc` feature, default off): - `FeatureSource` trait with `FlatEmbedding` (legacy, zero-overhead default) and `HashAugmented` (concat raw + encoded). Layer signature unchanged. Self-learning validation harness (`ruvector-selflearn`): - Reproducible online workload on a low-dim latent manifold lifted to high-D with multi-frequency (multi-scale) relevance; contrastive (InfoNCE) updates with hard negatives; tracks recall@10/100 over sessions. - Statistical rigor: >=5 seeds, mean +/- 95% CI, Cohen's d; emits CSV + ASCII curve + REPORT.md. - Phase-1 result (5 seeds): recall@10 +47.3% (d=1.24), recall@100 +21.8% (d=1.00), encoder adds +1.83us/query (+3.1% of a ~60us query). Co-Authored-By: claude-flow <ruv@ruv.net> Claude-Session: https://claude.ai/code/session_01X6ARy2fKVicSxK3GtjkV7x

coderabbitai · 2026-06-18T05:20:02Z

📝 Walkthrough

Walkthrough

A new ruvector-hashenc crate is introduced implementing a trainable multiresolution hash encoder with SplitMix64 PRNG, spatial hashing, feature tables, d-linear interpolation, learned projection, and forward/backward API. The encoder is integrated into ruvector-gnn behind a hashenc feature flag via a new FeatureSource trait abstraction and a ResidualGatBlock residual-attention component. Supporting utilities include contrastive negative sampling, temperature scheduling, and tiered storage with int8 quantization. A self-learning validation harness binary, Criterion benchmarks, gradient correctness tests, and ADR-258 documentation are also added.

Changes

Multiresolution Hash Encoder, GNN Integration, and Validation

Layer / File(s)	Summary
Workspace registration and crate manifests `Cargo.toml`, `crates/ruvector-hashenc/Cargo.toml`, `crates/ruvector-gnn/Cargo.toml`	Registers `ruvector-hashenc` in the workspace, creates its manifest with `std`/`mmap`/`simd` feature flags, binary and benchmark targets, and wires the optional `ruvector-hashenc` dependency plus `hashenc` feature into `ruvector-gnn`.
PRNG, encoder config, and spatial hashing `crates/ruvector-hashenc/src/rng.rs`, `crates/ruvector-hashenc/src/config.rs`, `crates/ruvector-hashenc/src/hash.rs`	Implements `SplitMix64` PRNG with uniform/normal sampling, `HashEncConfig` with `ProjectionKind` enum (LockedRandom, PcaInit, Learned) and hyperparameter helpers (`resolution`, `output_dim`, `level_rows`, `level_is_dense`, `validate`), and spatial hashing primitives (`PRIMES`, `spatial_hash`, `dense_index`, `row_index`).
Feature tables and gradient accumulator `crates/ruvector-hashenc/src/tables.rs`	Adds `FeatureTables` (per-level contiguous `f32` buffers, row accessors, param counting, `save`/`load` with magic-header validation) and `GradAccum` (mirrored layout with `add`, fused SGD `apply`, `zero`, `l2_norm`); includes unsafe `f32`↔`u8` casting for serialization.
Projection: random/PCA initialization and apply `crates/ruvector-hashenc/src/projection.rs`	Adds `Projection` with seeded random Gaussian row initialization, optional PCA power-iteration `fit` for PcaInit and Learned variants, and `apply` that dot-products input then logistic-squashes output to (0,1) for stable grid indexing. Includes `ProjGrad` accumulator for gradient backprop and SGD updates.
D-linear interpolation and EncodeCache `crates/ruvector-hashenc/src/interp.rs`	Adds `EncodeCache` (per-level `(row, weight)` corner storage) and `dlinear` to compute floor/frac coordinates, multilinearly accumulate feature-row contributions across all 2^d hypercube corners, and cache non-zero corner pairs. Includes `dlinear_coord_grad` for backpropagating through coordinate derivatives.
HashEncoder public API and unit tests `crates/ruvector-hashenc/src/lib.rs`	Defines `HashEncError` and `HashEncoder` with constructors, introspection (`output_dim`, `config`, `tables`), forward encoding (`encode`, `fresh_cache`, `encode_into`), backward sparse gradient scatter (`backward`), projection accessors and update, and table persistence. Adds unit tests for dimensionality, weight partition-of-unity, determinism, density, and save/load roundtrip.
GNN FeatureSource trait and implementations `crates/ruvector-gnn/src/feature_source.rs`, `crates/ruvector-gnn/src/lib.rs`	Introduces `FeatureSource` trait (`node_features` returning `Cow`, `out_dim`), `FlatEmbedding` as legacy identity implementation, and `hashenc`-gated `HashAugmented` that concatenates optional raw prefix with encoder output. Re-exports all three from `ruvector-gnn` with appropriate feature gating; includes unit tests for both implementations.
ResidualGatBlock with attention and edge gating `crates/ruvector-gnn/src/residual.rs`, `crates/ruvector-gnn/src/lib.rs`	Adds `ResidualGatBlock` combining residual skip, multi-head attention over node-neighbor pairs, and learned edge-gain-weighted neighbor aggregation normalized by edge weights, with `LayerNorm` applied to output. Includes unit tests for dimensionality, empty-neighbor behavior, and parameter sensitivity.
Contrastive sampling and temperature schedule `crates/ruvector-hashenc/src/sampling.rs`	Adds `NegativeSampler` enum (Random, HnswHard band-based, Mixed hard+random) for deterministic negative sampling via `SplitMix64` while excluding specified indices. Includes `TemperatureSchedule` with cosine-annealed InfoNCE temperature from start to end over total_steps; unit tests validate band sourcing, exclusion handling, and monotonic annealing.
Tiered feature storage: quantized warm and hot layers `crates/ruvector-hashenc/src/tiered.rs`	Adds `WarmInt8` for int8 quantization (per-vector min/scale, 4× compression), `TierStats`, and `TieredFeatureStore`. Implements L2 distance with runtime dispatch to AVX2 SIMD on x86_64 or scalar fallback (unsafe SIMD includes accumulation/tail). Unit tests cover reconstruction error, compression ratio, SIMD-vs-scalar equivalence, and rerank behavior.
Gradient correctness and end-to-end learning tests `crates/ruvector-hashenc/tests/gradient_check.rs`, `crates/ruvector-hashenc/tests/learning.rs`	Adds `gradient_check.rs` with finite-difference validation: table-entry gradients to 1e-3 tolerance and learned projection parameters to 2e-3 tolerance. Adds `learning.rs` test confirming end-to-end trainability via InfoNCE loss reduction over 200 epochs with tables and learned projection SGD updates.
Self-learning validation harness and Criterion benchmarks `crates/ruvector-hashenc/src/bin/selflearn.rs`, `crates/ruvector-hashenc/benches/encode.rs`	Adds `ruvector-selflearn` binary with synthetic Warp dataset, baseline/hashenc Model variants, InfoNCE contrastive training per session, statistics (CI/Cohen's d), ASCII curve rendering, and report writer (selflearn_REPORT.md, selflearn.csv). Includes Criterion benchmarks measuring encode and forward/backward throughput.
ADR-258 and associated documentation `docs/adr/ADR-258-*.md`, `bench_results/selflearn_REPORT.md`	Adds ADR-258 describing ruvector-hashenc design, FeatureSource integration, GNN/self-learning upgrades, tiered storage, and validation criteria with accepted decision outcome. Includes issue/PR draft templates and the generated self-learning benchmark report with baseline vs. hashenc comparison and success criteria.

Sequence Diagram(s)

sequenceDiagram
  participant GNNLayer
  participant FeatureSource
  participant HashEncoder
  participant Projection
  participant FeatureTables
  participant GradAccum

  rect rgba(100, 149, 237, 0.5)
    note over GNNLayer,FeatureTables: Forward Pass
    GNNLayer->>FeatureSource: node_features(node_id, raw)
    FeatureSource->>HashEncoder: encode_into(raw, cache)
    HashEncoder->>Projection: apply(raw, coords)
    Projection-->>HashEncoder: coords in (0,1)^d_idx
    loop per resolution level
      HashEncoder->>FeatureTables: dlinear accumulate
      HashEncoder->>HashEncoder: record (row, weight) in EncodeCache
    end
    HashEncoder-->>FeatureSource: encoded Vec f32
    FeatureSource-->>GNNLayer: Cow [raw || encoded]
  end

  rect rgba(60, 179, 113, 0.5)
    note over GNNLayer,GradAccum: Backward Pass
    GNNLayer->>HashEncoder: backward(cache, grad_out, grad_accum)
    loop per cached (level, row, weight)
      HashEncoder->>GradAccum: add(level, row, feat, weight × grad_out[feat])
    end
    GNNLayer->>GradAccum: apply(tables, lr)
    GradAccum->>FeatureTables: fused SGD update + zero accum
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 Hopping through hash grids, level by level I go,
Projecting embeddings with a logistic-squash glow,
Dlinear corners, each weight duly found,
GradAccum scatters gradients back around.
From selflearn sessions, recall curves arise—
A neural index, upgraded, reaches for the skies! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title '[ADR-258] Multiresolution Hash Encoding & Neural Index v2 — Phase 1' directly and specifically refers to the main change: implementing ADR-258 for multiresolution hash encoding as the core feature source upgrade. It highlights the phase and proposal number, matching the objectives and file summaries.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/ruvector-hash-encoding-upgrade-1gouzf

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

socket-security · 2026-06-18T05:21:28Z

Dependency limit exceeded — report not shown.

This pull request scan exceeded the 10,000-dependency limit applied to this scan, so the results are incomplete and may be inaccurate. To avoid reporting false positives, Socket has not posted a report.

Upgrade your plan to raise the dependency limit and get complete reports, or view the partial scan in the dashboard.

Socket is always free for open source. If this is a non-commercial open source project, contact us to request a free Team account.

coderabbitai

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ruvector-gnn/src/feature_source.rs`:
- Around line 35-40: The `node_features()` method implementations do not
guarantee that returned vectors have a fixed length matching the value
advertised by `out_dim()`. Ensure that all implementations of `node_features()`
(including `FlatEmbedding` around lines 67-68 and `HashAugmented` around lines
82-87) always return vectors with exact length equal to `out_dim()`. If the raw
input length differs from the output dimension, apply appropriate
transformations (truncation, padding, or resizing) to enforce the fixed-width
contract so downstream tensor operations always receive consistently-sized
vectors.

In `@crates/ruvector-hashenc/src/bin/selflearn.rs`:
- Around line 74-117: The parse() function accepts CLI arguments without
validating that the configuration values satisfy invariants required by the rest
of the code (such as needing at least 10 ranked items and ensuring n_pos + n_neg
does not exceed the length of ranked items). Add validation checks at the end of
the parse() function before returning the Cfg struct to verify these invariants
are met, and either exit with a clear error message or provide sensible defaults
if the constraints are violated, preventing downstream panic paths.
- Around line 229-234: The L2-normalization of the representation variable
(variable `r`) in lines 229-234 is not accounted for in the gradient
backpropagation logic in lines 424-454. To fix this, apply the proper Jacobian
of the L2-normalization operation when computing gradients during
backpropagation. Specifically, when propagating gradients back through the
normalized representation in the gradient computation section, include the
mathematical transformation that accounts for how the normalization affects the
gradient flow, ensuring that the backprop correctly optimizes the same objective
as the forward pass with normalization.
- Line 658: The writeln! macro call that generates the Markdown code block for
the learning curve is using a bare triple-backtick fence without a language tag,
which triggers markdownlint MD040 warnings. Add a language tag (such as "text")
immediately after the opening triple-backticks in the format string to properly
specify the code block language. Locate the writeln! call in the selflearn.rs
file that outputs the learning curve and update the opening fence from ``` to
```text.
- Line 553: The session range label in the writeln! macro on line 553 displays
an off-by-one range. The label currently shows "0 .. base.len()" but since array
indices are zero-based, the maximum valid index is base.len() - 1. Change the
writeln! statement to display the upper bound as base.len() - 1 instead of
base.len() to accurately reflect the range of plotted indices.

In `@crates/ruvector-hashenc/src/config.rs`:
- Around line 111-129: The validate method is missing a validation check for the
n_min field. Add a validation condition that checks if n_min is less than 1 and
returns an appropriate error message. This check should be added alongside the
existing n_max >= n_min validation in the validate method to prevent n_min from
being 0, which would cause ln(0) to produce NaN/infinity values that corrupt
downstream grid indexing calculations in resolution.

In `@crates/ruvector-hashenc/src/lib.rs`:
- Around line 219-221: The roundtrip test uses a fixed filename
"rhe_roundtrip.bin" when calling save_tables() which causes file path collisions
during parallel test execution. Replace the hardcoded filename with a unique
identifier such as generating a temporary file using the tempfile crate,
creating a UUID-based filename, or appending a process ID and timestamp to
ensure each test instance has its own isolated file path.
- Around line 66-67: The new constructor method is calling .expect() on
cfg.validate(), which causes a panic if validation fails instead of returning an
error. Change the return type of the new method from Self to Result<Self,
HashEncError>, and replace the cfg.validate().expect() call with cfg.validate()?
to propagate the validation error. This allows callers to handle invalid
configurations gracefully without triggering a process-level panic.

In `@crates/ruvector-hashenc/src/projection.rs`:
- Around line 56-68: The code in the PCA indexing loops assumes every sample in
the samples collection has at least d elements without validating this
assumption. Add validation before the mean calculation loop to ensure each
sample has at least d dimensions. Check that all samples in the samples iterator
have length >= d, and either return an error or panic with a descriptive message
if any sample is shorter than expected. This validation should occur before the
first loop that accesses s[i] to prevent runtime panics when indexing into
undersized samples.

In `@crates/ruvector-hashenc/src/tables.rs`:
- Around line 109-131: The load function reads the saved levels and
features_per_level values from the file at lines 118-119 but never validates
them against the cfg parameter, which means corrupted weights could be silently
loaded if the file was saved with different hyperparameters. After reading the
u32buf values for levels and features_per_level, extract them as u32 integers,
compare each against cfg.levels and cfg.features_per_level respectively, and
return an io::Error with ErrorKind::InvalidData if either value does not match
before proceeding to load the table data in the for loop.

In `@docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md`:
- Line 297: Update the path reference for the self-learning simulation harness
in the ADR document at line 297. The current reference to
`crates/ruvector-bench` is incorrect; it should be updated to
`crates/ruvector-hashenc/src/bin/selflearn.rs` to match the actual
implementation location where the selflearn subcommand is implemented. This
ensures the ADR accurately reflects the actual codebase structure for
auditability.
- Around line 114-129: The fenced code blocks in the markdown document are
missing language identifiers required by markdownlint (MD040 rule). Add the
appropriate language identifier after the opening triple backticks (```) for all
code fences throughout the document. For the code block showing the Rust crate
structure and all other occurrences at lines 133-169, 173-186, 190-213, 221-234,
and 247-255, specify either `text` for generic file structures or `rust` for
Rust code blocks to ensure consistent markdown compliance.
- Line 3: The ADR document has an inconsistent status between line 3 and line
327. Review both the Status field at the beginning of the document (line 3) and
the status mentioned at line 327, then update both locations to use the same
explicit status. Choose one status value and ensure it is consistently applied
throughout the document. Consider using a more descriptive status like "Proposed
(pending Phase-1 results)" or "Accepted (conditional)" to provide additional
context about the decision lifecycle state.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 81fc105a-430c-427d-81d6-67f748d191ce

📥 Commits

Reviewing files that changed from the base of the PR and between a5ed288 and 7464c2f.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock
bench_results/selflearn.csv is excluded by !**/*.csv

📒 Files selected for processing (17)

Cargo.toml
bench_results/selflearn_REPORT.md
crates/ruvector-gnn/Cargo.toml
crates/ruvector-gnn/src/feature_source.rs
crates/ruvector-gnn/src/lib.rs
crates/ruvector-hashenc/Cargo.toml
crates/ruvector-hashenc/benches/encode.rs
crates/ruvector-hashenc/src/bin/selflearn.rs
crates/ruvector-hashenc/src/config.rs
crates/ruvector-hashenc/src/hash.rs
crates/ruvector-hashenc/src/interp.rs
crates/ruvector-hashenc/src/lib.rs
crates/ruvector-hashenc/src/projection.rs
crates/ruvector-hashenc/src/rng.rs
crates/ruvector-hashenc/src/tables.rs
crates/ruvector-hashenc/tests/gradient_check.rs
docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md

coderabbitai · 2026-06-18T05:28:59Z

+    fn node_features<'a>(&self, _node_id: u64, raw: &'a [f32]) -> Cow<'a, [f32]> {
+        Cow::Borrowed(raw)
+    }
+    #[inline]
+    fn out_dim(&self) -> usize {
+        self.dim


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce fixed-width output invariants for node_features().

out_dim() is advertised as a fixed layer input width, but current logic can return variable-length vectors (FlatEmbedding if raw.len() != dim, HashAugmented when raw.len() < raw_dim due to min). This can desynchronize downstream tensor shapes and fail at runtime in message passing layers.

💡 Proposed patch

impl FeatureSource for FlatEmbedding { #[inline] fn node_features<'a>(&self, _node_id: u64, raw: &'a [f32]) -> Cow<'a, [f32]> { + assert_eq!( + raw.len(), + self.dim, + "FlatEmbedding expected raw dim {}, got {}", + self.dim, + raw.len() + ); Cow::Borrowed(raw) } @@ impl FeatureSource for HashAugmented { fn node_features<'a>(&self, _node_id: u64, raw: &'a [f32]) -> Cow<'a, [f32]> { let enc = self.encoder.encode(raw); if self.include_raw { - let mut v = Vec::with_capacity(self.raw_dim + enc.len()); - v.extend_from_slice(&raw[..self.raw_dim.min(raw.len())]); + assert!( + raw.len() >= self.raw_dim, + "HashAugmented expected raw len >= {}, got {}", + self.raw_dim, + raw.len() + ); + let mut v = Vec::with_capacity(self.out_dim); + v.extend_from_slice(&raw[..self.raw_dim]); v.extend_from_slice(&enc); Cow::Owned(v) } else { Cow::Owned(enc) }

Also applies to: 67-68, 82-87

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-gnn/src/feature_source.rs` around lines 35 - 40, The `node_features()` method implementations do not guarantee that returned vectors have a fixed length matching the value advertised by `out_dim()`. Ensure that all implementations of `node_features()` (including `FlatEmbedding` around lines 67-68 and `HashAugmented` around lines 82-87) always return vectors with exact length equal to `out_dim()`. If the raw input length differs from the output dimension, apply appropriate transformations (truncation, padding, or resizing) to enforce the fixed-width contract so downstream tensor operations always receive consistently-sized vectors.

coderabbitai · 2026-06-18T05:29:00Z

+    fn parse() -> Self {
+        let mut c = Cfg {
+            n_items: 1500,
+            n_queries: 200,
+            dim: 24,
+            latent_dim: 3,
+            sessions: 60,
+            queries_per_session: 128,
+            n_pos: 4,
+            n_neg: 16,
+            temperature: 0.3,
+            lr_w: 0.01,
+            lr_enc: 0.05,
+            beta: 1.0,
+            seeds: vec![1, 2, 3, 4, 5],
+            out_dir: PathBuf::from("bench_results"),
+        };
+        let mut args = std::env::args().skip(1);
+        while let Some(a) = args.next() {
+            match a.as_str() {
+                "--sessions" => c.sessions = args.next().unwrap().parse().unwrap(),
+                "--seeds" => {
+                    let k: usize = args.next().unwrap().parse().unwrap();
+                    c.seeds = (1..=k as u64).collect();
+                }
+                "--items" => c.n_items = args.next().unwrap().parse().unwrap(),
+                "--queries" => c.n_queries = args.next().unwrap().parse().unwrap(),
+                "--out" => c.out_dir = PathBuf::from(args.next().unwrap()),
+                "--quick" => {
+                    c.n_items = 600;
+                    c.n_queries = 80;
+                    c.sessions = 40;
+                    c.queries_per_session = 64;
+                    c.seeds = vec![1, 2, 3];
+                }
+                "--help" | "-h" => {
+                    println!("ruvector-selflearn [--sessions N] [--seeds K] [--items N] [--queries N] [--out DIR] [--quick]");
+                    std::process::exit(0);
+                }
+                _ => {}
+            }
+        }
+        c
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add invariant checks after CLI parsing to prevent panic paths.

Line 74-117 accepts values that later violate assumptions (e.g., Line 255 needs at least 10 ranked items; Lines 370-372 require n_pos + n_neg <= ranked.len()). This currently turns user input into runtime panics.

Suggested fix

impl Cfg { + fn validate(&self) -> Result<(), String> { + if self.sessions == 0 { return Err("sessions must be > 0".into()); } + if self.seeds.is_empty() { return Err("seeds must be >= 1".into()); } + if self.n_queries == 0 { return Err("queries must be > 0".into()); } + if self.n_items < 10 { return Err("items must be >= 10 for Recall@10".into()); } + let min_ranked = self.n_pos + self.n_neg; + if self.n_items < min_ranked { + return Err(format!("items must be >= n_pos + n_neg ({min_ranked})")); + } + if self.n_neg == 0 || self.n_pos == 0 { + return Err("n_pos and n_neg must be > 0".into()); + } + if self.temperature <= 0.0 { + return Err("temperature must be > 0".into()); + } + Ok(()) + } + fn parse() -> Self { let mut c = Cfg { n_items: 1500, @@ - c + c.validate().expect("invalid selflearn configuration"); + c } }

Also applies to: 255-255, 370-372

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/bin/selflearn.rs` around lines 74 - 117, The parse() function accepts CLI arguments without validating that the configuration values satisfy invariants required by the rest of the code (such as needing at least 10 ranked items and ensuring n_pos + n_neg does not exceed the length of ranked items). Add validation checks at the end of the parse() function before returning the Cfg struct to verify these invariants are met, and either exit with a clear error message or provide sensible defaults if the constraints are violated, preventing downstream panic paths.

coderabbitai · 2026-06-18T05:29:00Z

+        // L2-normalize so training (dot) == evaluation (cosine); controls norm
+        // drift and keeps the contrastive objective consistent with retrieval.
+        let norm = r.iter().map(|v| v * v).sum::<f32>().sqrt().max(1e-9);
+        for v in &mut r {
+            *v /= norm;
+        }


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Backprop currently ignores the L2-normalization Jacobian.

Line 229-234 normalizes rep, but Line 424-454 propagates gradients as if rep were unnormalized. That optimizes a different objective and can bias reported learning gains.

Also applies to: 424-454

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/bin/selflearn.rs` around lines 229 - 234, The L2-normalization of the representation variable (variable `r`) in lines 229-234 is not accounted for in the gradient backpropagation logic in lines 424-454. To fix this, apply the proper Jacobian of the L2-normalization operation when computing gradients during backpropagation. Specifically, when propagating gradients back through the normalized representation in the gradient computation section, include the mathematical transformation that accounts for how the normalization affects the gradient flow, ensuring that the backprop correctly optimizes the same objective as the forward pass with normalization.

coderabbitai · 2026-06-18T05:29:00Z

+        s.push('\n');
+    }
+    let _ = writeln!(s, "      +{}", "-".repeat(w));
+    let _ = writeln!(s, "       session 0 .. {}   ('*'=hashenc  '.'=baseline  '#'=both)", base.len());


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Session range label is off by one.

Line 553 prints 0 .. len, but the last plotted index is len - 1.

Suggested fix

- let _ = writeln!(s, " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)", base.len()); + let _ = writeln!( + s, + " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)", + base.len().saturating_sub(1) + );

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let _ = writeln!(s, " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)", base.len());

let _ = writeln!(

s,

" session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)",

base.len().saturating_sub(1)

);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/bin/selflearn.rs` at line 553, The session range label in the writeln! macro on line 553 displays an off-by-one range. The label currently shows "0 .. base.len()" but since array indices are zero-based, the maximum valid index is base.len() - 1. Change the writeln! statement to display the upper bound as base.len() - 1 instead of base.len() to accurately reflect the range of plotted indices.

coderabbitai · 2026-06-18T05:29:00Z

+    let _ = writeln!(rep, "| Convergence to 90% of own plateau (sessions) | {:.1} | {:.1} | — | — |", mean(&base_conv), mean(&hash_conv));
+    let _ = writeln!(rep, "| Sessions to surpass baseline's final recall | — | {:.1} | reaches baseline quality fast, then exceeds it | — |", mean(&surpass));
+    let _ = writeln!(rep, "| Encode cost added per query | — | +{:.2} µs | **{:+.1}%** of a ~60µs query | — |", added_us, lat_overhead);
+    let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```\n{curve}```\n");


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Emit fenced code blocks with a language tag to avoid MD040 warnings.

Line 658 generates a bare triple-backtick block; markdownlint expects a language (e.g., text).

Suggested fix

- let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```\n{curve}```\n"); + let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```text\n{curve}```\n");

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```\n{curve}```\n");

let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/bin/selflearn.rs` at line 658, The writeln! macro call that generates the Markdown code block for the learning curve is using a bare triple-backtick fence without a language tag, which triggers markdownlint MD040 warnings. Add a language tag (such as "text") immediately after the opening triple-backticks in the format string to properly specify the code block language. Locate the writeln! call in the selflearn.rs file that outputs the learning curve and update the opening fence from ``` to ```text.

Source: Linters/SAST tools

coderabbitai · 2026-06-18T05:29:00Z

+        for s in samples {
+            for i in 0..d {
+                mean[i] += s[i];
+            }
+        }
+        for m in &mut mean {
+            *m /= samples.len() as f32;
+        }
+        // Centered copies.
+        let centered: Vec<Vec<f32>> = samples
+            .iter()
+            .map(|s| (0..d).map(|i| s[i] - mean[i]).collect())
+            .collect();


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate sample dimensionality before PCA indexing loops.

Line 58 and Line 67 assume every sample has at least input_dim elements; shorter samples will panic at runtime.

Proposed fix

pub fn fit(&mut self, cfg: &HashEncConfig, samples: &[Vec<f32>]) { if cfg.projection != ProjectionKind::PcaInit || samples.is_empty() { return; } let d = self.input_dim; + if samples.iter().any(|s| s.len() < d) { + // Keep locked-random projection when sample shape is invalid. + return; + } // Mean. let mut mean = vec![0.0f32; d];

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/projection.rs` around lines 56 - 68, The code in the PCA indexing loops assumes every sample in the samples collection has at least d elements without validating this assumption. Add validation before the mean calculation loop to ensure each sample has at least d dimensions. Check that all samples in the samples iterator have length >= d, and either return an error or panic with a descriptive message if any sample is shorter than expected. This validation should occur before the first loop that accesses s[i] to prevent runtime panics when indexing into undersized samples.

coderabbitai · 2026-06-18T05:29:00Z

+    /// Load tables previously written by [`save`] into `cfg`-shaped buffers.
+    pub fn load(cfg: &HashEncConfig, path: &Path) -> io::Result<Self> {
+        let mut r = io::BufReader::new(std::fs::File::open(path)?);
+        let mut u32buf = [0u8; 4];
+        r.read_exact(&mut u32buf)?;
+        if u32::from_le_bytes(u32buf) != TABLE_MAGIC {
+            return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
+        }
+        let mut me = Self::new(cfg);
+        r.read_exact(&mut u32buf)?; // levels
+        r.read_exact(&mut u32buf)?; // F
+        for t in &mut me.levels {
+            let mut u64buf = [0u8; 8];
+            r.read_exact(&mut u64buf)?;
+            let len = u64::from_le_bytes(u64buf) as usize;
+            if len != t.len() {
+                return Err(io::Error::new(io::ErrorKind::InvalidData, "shape mismatch"));
+            }
+            let bytes: &mut [u8] = bytemuck_cast_mut(t);
+            r.read_exact(bytes)?;
+        }
+        Ok(me)
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

load discards stored levels/features_per_level without validating against cfg.

Lines 118-119 read the saved levels and features counts but never compare them to cfg.levels and cfg.features_per_level. If the file was saved with different hyperparameters, the per-level length check on line 124 may pass coincidentally (e.g., fewer levels with larger rows), loading corrupted weights silently.

🛡️ Proposed fix

pub fn load(cfg: &HashEncConfig, path: &Path) -> io::Result<Self> { let mut r = io::BufReader::new(std::fs::File::open(path)?); let mut u32buf = [0u8; 4]; r.read_exact(&mut u32buf)?; if u32::from_le_bytes(u32buf) != TABLE_MAGIC { return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic")); } let mut me = Self::new(cfg); - r.read_exact(&mut u32buf)?; // levels - r.read_exact(&mut u32buf)?; // F + r.read_exact(&mut u32buf)?; + let file_levels = u32::from_le_bytes(u32buf) as usize; + r.read_exact(&mut u32buf)?; + let file_f = u32::from_le_bytes(u32buf) as usize; + if file_levels != cfg.levels || file_f != cfg.features_per_level { + return Err(io::Error::new( + io::ErrorKind::InvalidData, + "config/file shape mismatch", + )); + } for t in &mut me.levels {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/tables.rs` around lines 109 - 131, The load function reads the saved levels and features_per_level values from the file at lines 118-119 but never validates them against the cfg parameter, which means corrupted weights could be silently loaded if the file was saved with different hyperparameters. After reading the u32buf values for levels and features_per_level, extract them as u32 integers, compare each against cfg.levels and cfg.features_per_level respectively, and return an io::Error with ErrorKind::InvalidData if either value does not match before proceeding to load the table data in the for loop.

coderabbitai · 2026-06-18T05:29:00Z

+```
+crates/ruvector-hashenc/
+  Cargo.toml
+  src/
+    lib.rs            // public API, HashEncoder, HashEncConfig, FeatureSource impl
+    config.rs         // HashEncConfig (L, T, F, d_idx, N_min, N_max), defaults
+    grid.rs           // level resolutions, geometric growth, corner enumeration
+    hash.rs           // spatial hash (XOR of coord*prime mod T), per-level
+    interp.rs         // d-linear interpolation forward + analytic backward
+    projection.rs     // P: R^d -> R^{d_idx} (locked random / learned / PCA-init)
+    tables.rs         // FeatureTables: in-memory + mmap-backed, trainable
+    backward.rs       // sparse gradient scatter into tables + projection grad
+    simd.rs           // AVX2/AVX512/NEON gather + interpolation kernels
+    persist.rs        // mmap layout + header, reuse MmapManager conventions
+    wasm.rs           // wasm32 path (no mmap; in-memory tables)
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks (MD040).

These code fences are unlabeled; add text/rust to satisfy markdownlint consistently.

Also applies to: 133-169, 173-186, 190-213, 221-234, 247-255

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 114-114: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md` around lines 114 - 129, The fenced code blocks in the markdown document are missing language identifiers required by markdownlint (MD040 rule). Add the appropriate language identifier after the opening triple backticks (```) for all code fences throughout the document. For the code block showing the Rust crate structure and all other occurrences at lines 133-169, 173-186, 190-213, 221-234, and 247-255, specify either `text` for generic file structures or `rust` for Rust code blocks to ensure consistent markdown compliance.

Source: Linters/SAST tools

…, tiered store, SIMD (ADR-258) Lands the remaining ADR-258 roadmap as opt-in, fully unit-tested components behind the `hashenc` feature flag (default build unchanged). Phase 2 (GNN / self-learning upgrades): - Learned projection: trainable P with full analytic gradient (HashEncoder::projection_grad / apply_projection_grad), PCA-initialized via the new ProjectionKind::Learned. Proven by a finite-difference gradient check and an end-to-end learning test (tables + projection reduce loss >50%). - sampling.rs: NegativeSampler (Random / HnswHard mid-rank / Mixed) for hard-negative mining, and TemperatureSchedule (cosine InfoNCE annealing). - ruvector-gnn residual.rs: ResidualGatBlock — residual skip + learned edge gain over MultiHeadAttention + LayerNorm. Phase 3 (storage / SIMD): - tiered.rs: TieredFeatureStore (HOT tables / WARM int8 reconstruction / COLD) with footprint accounting — wires quantization into the live path (spirit of ruvnet#563). AVX2 + scalar L2 rerank distance with a SIMD==scalar differential test. Tests: ruvector-hashenc 13 unit + 2 gradient checks + e2e learning + doctest; ruvector-gnn 215+6+10 with `hashenc`. Clippy clean; default builds unaffected. Docs: ADR-258 updated with a living implementation-status table and Accepted outcome; added ADR-258 issue + PR description drafts for submission upstream. Co-Authored-By: claude-flow <ruv@ruv.net> Claude-Session: https://claude.ai/code/session_01X6ARy2fKVicSxK3GtjkV7x

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ruvector-gnn/src/residual.rs`:
- Around line 42-62: The forward method in the residual.rs file does not
validate input tensor shapes, allowing mismatched dimensions to silently
propagate and potentially cause panics later in LayerNorm. Add input validation
at the beginning of the forward method to enforce that neighbors.len() matches
edge_weights.len(), node.len() matches the configured embed_dim, and all
neighbor vectors have consistent dimensions. Consider updating the forward
method signature to return a Result type to properly communicate validation
failures to the caller.

In `@crates/ruvector-hashenc/src/sampling.rs`:
- Around line 39-43: Add guard conditions to prevent modulo-by-zero panic and
infinite loops in the sampling logic. Before the while loop that generates
random indices using `rng.next_u64() % n_items as u64`, verify that n_items is
greater than zero. Additionally, when calling the modulo operation on n_items,
ensure there exists at least one non-excluded item available, otherwise the loop
will hang indefinitely trying to find a valid sample. These guard checks should
be applied to both sampling blocks referenced in the comment (the one around
lines 39-43 and the similar one around lines 61-66).
- Around line 68-77: The hard_frac value is used without validation to calculate
n_hard in the NegativeSampler::Mixed branch, which can result in n_hard
exceeding n if hard_frac is greater than 1.0. This causes an integer underflow
when computing n - out.len() on line 76. Clamp hard_frac to the range [0.0, 1.0]
before using it to compute n_hard to ensure n_hard never exceeds n.

In `@crates/ruvector-hashenc/src/tiered.rs`:
- Around line 178-189: The l2_avx2 function uses the _mm256_fmadd_ps intrinsic
which requires the FMA instruction set, but the function only declares the
"avx2" target feature in its #[target_feature] attribute and the runtime CPU
capability check at line 159 only verifies for AVX2 support. Add "fma" to the
target_feature enable list in the l2_avx2 function definition to properly
declare the FMA requirement, and update the runtime CPU check to also verify
FMA3 capability is supported before allowing this code path to execute.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9cc7163b-4946-4340-9811-bdce95c75fa1

📥 Commits

Reviewing files that changed from the base of the PR and between 7464c2f and e910fa9.

📒 Files selected for processing (13)

crates/ruvector-gnn/src/lib.rs
crates/ruvector-gnn/src/residual.rs
crates/ruvector-hashenc/src/config.rs
crates/ruvector-hashenc/src/interp.rs
crates/ruvector-hashenc/src/lib.rs
crates/ruvector-hashenc/src/projection.rs
crates/ruvector-hashenc/src/sampling.rs
crates/ruvector-hashenc/src/tiered.rs
crates/ruvector-hashenc/tests/gradient_check.rs
crates/ruvector-hashenc/tests/learning.rs
docs/adr/ADR-258-ISSUE-DRAFT.md
docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md
docs/adr/ADR-258-PR-DRAFT.md

✅ Files skipped from review due to trivial changes (3)

docs/adr/ADR-258-PR-DRAFT.md
docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md
docs/adr/ADR-258-ISSUE-DRAFT.md

coderabbitai · 2026-06-19T05:07:26Z

+    pub fn forward(&self, node: &[f32], neighbors: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> {
+        let d = node.len();
+        let mut out = node.to_vec(); // residual skip
+
+        if !neighbors.is_empty() {
+            // Attention sub-layer.
+            let attn = self.attention.forward(node, neighbors, neighbors);
+            for k in 0..d.min(attn.len()) {
+                out[k] += attn[k];
+            }
+            // Edge-weighted neighbour aggregation (normalized), scaled by gain.
+            let wsum: f32 = edge_weights.iter().copied().sum::<f32>().max(1e-9);
+            for (nb, &w) in neighbors.iter().zip(edge_weights) {
+                let wn = (w / wsum) * self.edge_gain;
+                for k in 0..d.min(nb.len()) {
+                    out[k] += wn * nb[k];
+                }
+            }
+        }
+        self.norm.forward(&out)
+    }


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Validate input shapes and edge-weight arity at the forward boundary.

forward currently accepts inconsistent tensor sizes: neighbors.iter().zip(edge_weights) silently drops unmatched items, and node.len() can differ from embed_dim, which then flows into LayerNorm configured with embed_dim parameters and can panic on length mismatch. This should be enforced as a hard contract.

Proposed fix (explicit contract via Result)

-use crate::error::Result; +use crate::error::{GnnError, Result}; @@ - pub fn forward(&self, node: &[f32], neighbors: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> { + pub fn forward( + &self, + node: &[f32], + neighbors: &[Vec<f32>], + edge_weights: &[f32], + ) -> Result<Vec<f32>> { + if node.len() != self.embed_dim { + return Err(GnnError::layer_config(format!( + "node dim {} != embed_dim {}", + node.len(), + self.embed_dim + ))); + } + if neighbors.len() != edge_weights.len() { + return Err(GnnError::layer_config(format!( + "neighbors len {} != edge_weights len {}", + neighbors.len(), + edge_weights.len() + ))); + } + if neighbors.iter().any(|nb| nb.len() != self.embed_dim) { + return Err(GnnError::layer_config("neighbor dim must match embed_dim".to_string())); + } + let d = node.len(); let mut out = node.to_vec(); // residual skip @@ - self.norm.forward(&out) + Ok(self.norm.forward(&out)) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

pub fn forward(&self, node: &[f32], neighbors: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> {

let d = node.len();

let mut out = node.to_vec(); // residual skip

if !neighbors.is_empty() {

// Attention sub-layer.

let attn = self.attention.forward(node, neighbors, neighbors);

for k in 0..d.min(attn.len()) {

out[k] += attn[k];

}

// Edge-weighted neighbour aggregation (normalized), scaled by gain.

let wsum: f32 = edge_weights.iter().copied().sum::<f32>().max(1e-9);

for (nb, &w) in neighbors.iter().zip(edge_weights) {

let wn = (w / wsum) * self.edge_gain;

for k in 0..d.min(nb.len()) {

out[k] += wn * nb[k];

}

}

}

self.norm.forward(&out)

}

pub fn forward(

&self,

node: &[f32],

neighbors: &[Vec<f32>],

edge_weights: &[f32],

) -> Result<Vec<f32>> {

if node.len() != self.embed_dim {

return Err(GnnError::layer_config(format!(

"node dim {} != embed_dim {}",

node.len(),

self.embed_dim

)));

}

if neighbors.len() != edge_weights.len() {

return Err(GnnError::layer_config(format!(

"neighbors len {} != edge_weights len {}",

neighbors.len(),

edge_weights.len()

)));

}

if neighbors.iter().any(|nb| nb.len() != self.embed_dim) {

return Err(GnnError::layer_config("neighbor dim must match embed_dim".to_string()));

}

let d = node.len();

let mut out = node.to_vec(); // residual skip

if !neighbors.is_empty() {

// Attention sub-layer.

let attn = self.attention.forward(node, neighbors, neighbors);

for k in 0..d.min(attn.len()) {

out[k] += attn[k];

}

// Edge-weighted neighbour aggregation (normalized), scaled by gain.

let wsum: f32 = edge_weights.iter().copied().sum::<f32>().max(1e-9);

for (nb, &w) in neighbors.iter().zip(edge_weights) {

let wn = (w / wsum) * self.edge_gain;

for k in 0..d.min(nb.len()) {

out[k] += wn * nb[k];

}

}

}

Ok(self.norm.forward(&out))

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-gnn/src/residual.rs` around lines 42 - 62, The forward method in the residual.rs file does not validate input tensor shapes, allowing mismatched dimensions to silently propagate and potentially cause panics later in LayerNorm. Add input validation at the beginning of the forward method to enforce that neighbors.len() matches edge_weights.len(), node.len() matches the configured embed_dim, and all neighbor vectors have consistent dimensions. Consider updating the forward method signature to return a Result type to properly communicate validation failures to the caller.

coderabbitai · 2026-06-19T05:07:27Z

+                    let c = (rng.next_u64() % n_items as u64) as usize;
+                    if !is_excluded(c) {
+                        out.push(c);
+                    }
+                }


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard impossible sampling states to avoid panic/hangs.

n_items == 0 triggers modulo-by-zero, and if all in-range items are excluded these while loops can run indefinitely.

Suggested fix

pub fn sample( &self, ranked: &[usize], n_items: usize, n: usize, exclude: &[usize], rng: &mut SplitMix64, ) -> Vec<usize> { + if n == 0 || n_items == 0 { + return Vec::new(); + } + let excluded_in_range = exclude.iter().filter(|&&x| x < n_items).count(); + if excluded_in_range >= n_items { + return Vec::new(); + } let is_excluded = |x: usize| exclude.contains(&x);

Also applies to: 61-66

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/sampling.rs` around lines 39 - 43, Add guard conditions to prevent modulo-by-zero panic and infinite loops in the sampling logic. Before the while loop that generates random indices using `rng.next_u64() % n_items as u64`, verify that n_items is greater than zero. Additionally, when calling the modulo operation on n_items, ensure there exists at least one non-excluded item available, otherwise the loop will hang indefinitely trying to find a valid sample. These guard checks should be applied to both sampling blocks referenced in the comment (the one around lines 39-43 and the similar one around lines 61-66).

coderabbitai · 2026-06-19T05:07:27Z

+            NegativeSampler::Mixed { band, hard_frac } => {
+                let n_hard = ((n as f32) * hard_frac).round() as usize;
+                let hard =
+                    NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng);
+                out.extend(hard);
+                let mut excl2 = exclude.to_vec();
+                excl2.extend_from_slice(&out);
+                let rest =
+                    NegativeSampler::Random.sample(ranked, n_items, n - out.len(), &excl2, rng);
+                out.extend(rest);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clamp hard_frac before computing n_hard.

Unchecked hard_frac can produce n_hard > n; then n - out.len() underflows at Line 76 after extending hard negatives.

Suggested fix

NegativeSampler::Mixed { band, hard_frac } => { - let n_hard = ((n as f32) * hard_frac).round() as usize; + let frac = hard_frac.clamp(0.0, 1.0); + let n_hard = ((n as f32) * frac).round() as usize; let hard = NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng); out.extend(hard); let mut excl2 = exclude.to_vec(); excl2.extend_from_slice(&out); let rest = - NegativeSampler::Random.sample(ranked, n_items, n - out.len(), &excl2, rng); + NegativeSampler::Random.sample(ranked, n_items, n.saturating_sub(out.len()), &excl2, rng); out.extend(rest); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

NegativeSampler::Mixed { band, hard_frac } => {

let n_hard = ((n as f32) * hard_frac).round() as usize;

let hard =

NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng);

out.extend(hard);

let mut excl2 = exclude.to_vec();

excl2.extend_from_slice(&out);

let rest =

NegativeSampler::Random.sample(ranked, n_items, n - out.len(), &excl2, rng);

out.extend(rest);

NegativeSampler::Mixed { band, hard_frac } => {

let frac = hard_frac.clamp(0.0, 1.0);

let n_hard = ((n as f32) * frac).round() as usize;

let hard =

NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng);

out.extend(hard);

let mut excl2 = exclude.to_vec();

excl2.extend_from_slice(&out);

let rest =

NegativeSampler::Random.sample(ranked, n_items, n.saturating_sub(out.len()), &excl2, rng);

out.extend(rest);

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/sampling.rs` around lines 68 - 77, The hard_frac value is used without validation to calculate n_hard in the NegativeSampler::Mixed branch, which can result in n_hard exceeding n if hard_frac is greater than 1.0. This causes an integer underflow when computing n - out.len() on line 76. Clamp hard_frac to the range [0.0, 1.0] before using it to compute n_hard to ensure n_hard never exceeds n.

coderabbitai · 2026-06-19T05:07:27Z

+#[cfg(target_arch = "x86_64")]
+#[target_feature(enable = "avx2")]
+unsafe fn l2_avx2(a: &[f32], b: &[f32]) -> f32 {
+    use std::arch::x86_64::*;
+    let n = a.len().min(b.len());
+    let mut sum = _mm256_setzero_ps();
+    let mut i = 0;
+    while i + 8 <= n {
+        let va = _mm256_loadu_ps(a.as_ptr().add(i));
+        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
+        let d = _mm256_sub_ps(va, vb);
+        sum = _mm256_fmadd_ps(d, d, sum);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

FMA intrinsic used without FMA feature detection/enablement.

_mm256_fmadd_ps at line 189 requires the FMA instruction set, but the runtime check (line 159) only verifies AVX2 and #[target_feature] only enables "avx2". While virtually all x86-64 CPUs with AVX2 also support FMA3, these are technically independent feature flags.

🔧 Proposed fix: enable and detect FMA

#[cfg(target_arch = "x86_64")] -#[target_feature(enable = "avx2")] +#[target_feature(enable = "avx2", enable = "fma")] unsafe fn l2_avx2(a: &[f32], b: &[f32]) -> f32 {

And update the runtime check:

#[cfg(target_arch = "x86_64")] { - if is_x86_feature_detected!("avx2") { + if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") { // Safety: guarded by runtime feature detection. return unsafe { l2_avx2(a, b) }; } }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ruvector-hashenc/src/tiered.rs` around lines 178 - 189, The l2_avx2 function uses the _mm256_fmadd_ps intrinsic which requires the FMA instruction set, but the function only declares the "avx2" target feature in its #[target_feature] attribute and the runtime CPU capability check at line 159 only verifies for AVX2 support. Add "fma" to the target_feature enable list in the l2_avx2 function definition to properly declare the FMA requirement, and update the runtime CPU check to also verify FMA3 capability is supported before allowing this code path to execute.

claude and others added 2 commits June 18, 2026 04:33

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

	let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```\n{curve}```\n");
	let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n

Conversation

shaal commented Jun 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

What's in this PR

Phase-1 results (5 seeds)

Reproduce / verify

Scope / safety

Roadmap (tracked here since Issues are disabled on this repo)

Success criteria (ADR-258 §5)

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

socket-security Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shaal commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

socket-security Bot commented Jun 18, 2026 •

edited

Loading