[ADR-258] Multiresolution Hash Encoding & Neural Index v2 — Phase 1#3
[ADR-258] Multiresolution Hash Encoding & Neural Index v2 — Phase 1#3shaal wants to merge 3 commits into
Conversation
Design for adopting Instant-NGP-style multiresolution hash encoding (arXiv:2201.05989) as a trainable, multi-scale, mmap-persisted feature source for the GNN-over-HNSW self-learning loop, plus GNN/self-learning upgrades, unified tiered storage (PQ/RaBitQ), and an async query path. Includes measurable success criteria, validation plan, risks, and a phased rollout behind a `hashenc` feature flag. Co-Authored-By: claude-flow <ruv@ruv.net> Claude-Session: https://claude.ai/code/session_01X6ARy2fKVicSxK3GtjkV7x
…tegration + self-learning harness (ADR-258) Implements Phase 1 of ADR-258 (Instant-NGP-style multiresolution hash encoding for RuVector's neural index, arXiv:2201.05989). New crate `ruvector-hashenc`: - HashEncoder: projection (LockedRandom/PcaInit) into a low-d index space, hashed multiresolution grid (L levels, T table size, F features), d-linear interpolation; dense collision-free coarse levels + spatial hashing for fine. - FeatureTables + GradAccum: trainable tables with a sparse-scatter backward (only 2^d_idx*L params per sample) and a fused AXPY update; file persistence. - Dependency-light (thiserror only; optional memmap2), WASM-friendly. - Tests: partition-of-unity, determinism, dense-coarse, save/load, and a finite-difference vs analytic gradient check (differentiability proof). - Criterion benches for encode and forward+backward. GNN integration (behind `hashenc` feature, default off): - `FeatureSource` trait with `FlatEmbedding` (legacy, zero-overhead default) and `HashAugmented` (concat raw + encoded). Layer signature unchanged. Self-learning validation harness (`ruvector-selflearn`): - Reproducible online workload on a low-dim latent manifold lifted to high-D with multi-frequency (multi-scale) relevance; contrastive (InfoNCE) updates with hard negatives; tracks recall@10/100 over sessions. - Statistical rigor: >=5 seeds, mean +/- 95% CI, Cohen's d; emits CSV + ASCII curve + REPORT.md. - Phase-1 result (5 seeds): recall@10 +47.3% (d=1.24), recall@100 +21.8% (d=1.00), encoder adds +1.83us/query (+3.1% of a ~60us query). Co-Authored-By: claude-flow <ruv@ruv.net> Claude-Session: https://claude.ai/code/session_01X6ARy2fKVicSxK3GtjkV7x
📝 WalkthroughWalkthroughA new ChangesMultiresolution Hash Encoder, GNN Integration, and Validation
Sequence Diagram(s)sequenceDiagram
participant GNNLayer
participant FeatureSource
participant HashEncoder
participant Projection
participant FeatureTables
participant GradAccum
rect rgba(100, 149, 237, 0.5)
note over GNNLayer,FeatureTables: Forward Pass
GNNLayer->>FeatureSource: node_features(node_id, raw)
FeatureSource->>HashEncoder: encode_into(raw, cache)
HashEncoder->>Projection: apply(raw, coords)
Projection-->>HashEncoder: coords in (0,1)^d_idx
loop per resolution level
HashEncoder->>FeatureTables: dlinear accumulate
HashEncoder->>HashEncoder: record (row, weight) in EncodeCache
end
HashEncoder-->>FeatureSource: encoded Vec f32
FeatureSource-->>GNNLayer: Cow [raw || encoded]
end
rect rgba(60, 179, 113, 0.5)
note over GNNLayer,GradAccum: Backward Pass
GNNLayer->>HashEncoder: backward(cache, grad_out, grad_accum)
loop per cached (level, row, weight)
HashEncoder->>GradAccum: add(level, row, feat, weight × grad_out[feat])
end
GNNLayer->>GradAccum: apply(tables, lr)
GradAccum->>FeatureTables: fused SGD update + zero accum
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Dependency limit exceeded — report not shown. This pull request scan exceeded the 10,000-dependency limit applied to this scan, so the results are incomplete and may be inaccurate. To avoid reporting false positives, Socket has not posted a report. Upgrade your plan to raise the dependency limit and get complete reports, or view the partial scan in the dashboard. Socket is always free for open source. If this is a non-commercial open source project, contact us to request a free Team account. |
There was a problem hiding this comment.
Actionable comments posted: 13
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ruvector-gnn/src/feature_source.rs`:
- Around line 35-40: The `node_features()` method implementations do not
guarantee that returned vectors have a fixed length matching the value
advertised by `out_dim()`. Ensure that all implementations of `node_features()`
(including `FlatEmbedding` around lines 67-68 and `HashAugmented` around lines
82-87) always return vectors with exact length equal to `out_dim()`. If the raw
input length differs from the output dimension, apply appropriate
transformations (truncation, padding, or resizing) to enforce the fixed-width
contract so downstream tensor operations always receive consistently-sized
vectors.
In `@crates/ruvector-hashenc/src/bin/selflearn.rs`:
- Around line 74-117: The parse() function accepts CLI arguments without
validating that the configuration values satisfy invariants required by the rest
of the code (such as needing at least 10 ranked items and ensuring n_pos + n_neg
does not exceed the length of ranked items). Add validation checks at the end of
the parse() function before returning the Cfg struct to verify these invariants
are met, and either exit with a clear error message or provide sensible defaults
if the constraints are violated, preventing downstream panic paths.
- Around line 229-234: The L2-normalization of the representation variable
(variable `r`) in lines 229-234 is not accounted for in the gradient
backpropagation logic in lines 424-454. To fix this, apply the proper Jacobian
of the L2-normalization operation when computing gradients during
backpropagation. Specifically, when propagating gradients back through the
normalized representation in the gradient computation section, include the
mathematical transformation that accounts for how the normalization affects the
gradient flow, ensuring that the backprop correctly optimizes the same objective
as the forward pass with normalization.
- Line 658: The writeln! macro call that generates the Markdown code block for
the learning curve is using a bare triple-backtick fence without a language tag,
which triggers markdownlint MD040 warnings. Add a language tag (such as "text")
immediately after the opening triple-backticks in the format string to properly
specify the code block language. Locate the writeln! call in the selflearn.rs
file that outputs the learning curve and update the opening fence from ``` to
```text.
- Line 553: The session range label in the writeln! macro on line 553 displays
an off-by-one range. The label currently shows "0 .. base.len()" but since array
indices are zero-based, the maximum valid index is base.len() - 1. Change the
writeln! statement to display the upper bound as base.len() - 1 instead of
base.len() to accurately reflect the range of plotted indices.
In `@crates/ruvector-hashenc/src/config.rs`:
- Around line 111-129: The validate method is missing a validation check for the
n_min field. Add a validation condition that checks if n_min is less than 1 and
returns an appropriate error message. This check should be added alongside the
existing n_max >= n_min validation in the validate method to prevent n_min from
being 0, which would cause ln(0) to produce NaN/infinity values that corrupt
downstream grid indexing calculations in resolution.
In `@crates/ruvector-hashenc/src/lib.rs`:
- Around line 219-221: The roundtrip test uses a fixed filename
"rhe_roundtrip.bin" when calling save_tables() which causes file path collisions
during parallel test execution. Replace the hardcoded filename with a unique
identifier such as generating a temporary file using the tempfile crate,
creating a UUID-based filename, or appending a process ID and timestamp to
ensure each test instance has its own isolated file path.
- Around line 66-67: The new constructor method is calling .expect() on
cfg.validate(), which causes a panic if validation fails instead of returning an
error. Change the return type of the new method from Self to Result<Self,
HashEncError>, and replace the cfg.validate().expect() call with cfg.validate()?
to propagate the validation error. This allows callers to handle invalid
configurations gracefully without triggering a process-level panic.
In `@crates/ruvector-hashenc/src/projection.rs`:
- Around line 56-68: The code in the PCA indexing loops assumes every sample in
the samples collection has at least d elements without validating this
assumption. Add validation before the mean calculation loop to ensure each
sample has at least d dimensions. Check that all samples in the samples iterator
have length >= d, and either return an error or panic with a descriptive message
if any sample is shorter than expected. This validation should occur before the
first loop that accesses s[i] to prevent runtime panics when indexing into
undersized samples.
In `@crates/ruvector-hashenc/src/tables.rs`:
- Around line 109-131: The load function reads the saved levels and
features_per_level values from the file at lines 118-119 but never validates
them against the cfg parameter, which means corrupted weights could be silently
loaded if the file was saved with different hyperparameters. After reading the
u32buf values for levels and features_per_level, extract them as u32 integers,
compare each against cfg.levels and cfg.features_per_level respectively, and
return an io::Error with ErrorKind::InvalidData if either value does not match
before proceeding to load the table data in the for loop.
In `@docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md`:
- Line 297: Update the path reference for the self-learning simulation harness
in the ADR document at line 297. The current reference to
`crates/ruvector-bench` is incorrect; it should be updated to
`crates/ruvector-hashenc/src/bin/selflearn.rs` to match the actual
implementation location where the selflearn subcommand is implemented. This
ensures the ADR accurately reflects the actual codebase structure for
auditability.
- Around line 114-129: The fenced code blocks in the markdown document are
missing language identifiers required by markdownlint (MD040 rule). Add the
appropriate language identifier after the opening triple backticks (```) for all
code fences throughout the document. For the code block showing the Rust crate
structure and all other occurrences at lines 133-169, 173-186, 190-213, 221-234,
and 247-255, specify either `text` for generic file structures or `rust` for
Rust code blocks to ensure consistent markdown compliance.
- Line 3: The ADR document has an inconsistent status between line 3 and line
327. Review both the Status field at the beginning of the document (line 3) and
the status mentioned at line 327, then update both locations to use the same
explicit status. Choose one status value and ensure it is consistently applied
throughout the document. Consider using a more descriptive status like "Proposed
(pending Phase-1 results)" or "Accepted (conditional)" to provide additional
context about the decision lifecycle state.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 81fc105a-430c-427d-81d6-67f748d191ce
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.lockbench_results/selflearn.csvis excluded by!**/*.csv
📒 Files selected for processing (17)
Cargo.tomlbench_results/selflearn_REPORT.mdcrates/ruvector-gnn/Cargo.tomlcrates/ruvector-gnn/src/feature_source.rscrates/ruvector-gnn/src/lib.rscrates/ruvector-hashenc/Cargo.tomlcrates/ruvector-hashenc/benches/encode.rscrates/ruvector-hashenc/src/bin/selflearn.rscrates/ruvector-hashenc/src/config.rscrates/ruvector-hashenc/src/hash.rscrates/ruvector-hashenc/src/interp.rscrates/ruvector-hashenc/src/lib.rscrates/ruvector-hashenc/src/projection.rscrates/ruvector-hashenc/src/rng.rscrates/ruvector-hashenc/src/tables.rscrates/ruvector-hashenc/tests/gradient_check.rsdocs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md
| fn node_features<'a>(&self, _node_id: u64, raw: &'a [f32]) -> Cow<'a, [f32]> { | ||
| Cow::Borrowed(raw) | ||
| } | ||
| #[inline] | ||
| fn out_dim(&self) -> usize { | ||
| self.dim |
There was a problem hiding this comment.
Enforce fixed-width output invariants for node_features().
out_dim() is advertised as a fixed layer input width, but current logic can return variable-length vectors (FlatEmbedding if raw.len() != dim, HashAugmented when raw.len() < raw_dim due to min). This can desynchronize downstream tensor shapes and fail at runtime in message passing layers.
💡 Proposed patch
impl FeatureSource for FlatEmbedding {
#[inline]
fn node_features<'a>(&self, _node_id: u64, raw: &'a [f32]) -> Cow<'a, [f32]> {
+ assert_eq!(
+ raw.len(),
+ self.dim,
+ "FlatEmbedding expected raw dim {}, got {}",
+ self.dim,
+ raw.len()
+ );
Cow::Borrowed(raw)
}
@@
impl FeatureSource for HashAugmented {
fn node_features<'a>(&self, _node_id: u64, raw: &'a [f32]) -> Cow<'a, [f32]> {
let enc = self.encoder.encode(raw);
if self.include_raw {
- let mut v = Vec::with_capacity(self.raw_dim + enc.len());
- v.extend_from_slice(&raw[..self.raw_dim.min(raw.len())]);
+ assert!(
+ raw.len() >= self.raw_dim,
+ "HashAugmented expected raw len >= {}, got {}",
+ self.raw_dim,
+ raw.len()
+ );
+ let mut v = Vec::with_capacity(self.out_dim);
+ v.extend_from_slice(&raw[..self.raw_dim]);
v.extend_from_slice(&enc);
Cow::Owned(v)
} else {
Cow::Owned(enc)
}Also applies to: 67-68, 82-87
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-gnn/src/feature_source.rs` around lines 35 - 40, The
`node_features()` method implementations do not guarantee that returned vectors
have a fixed length matching the value advertised by `out_dim()`. Ensure that
all implementations of `node_features()` (including `FlatEmbedding` around lines
67-68 and `HashAugmented` around lines 82-87) always return vectors with exact
length equal to `out_dim()`. If the raw input length differs from the output
dimension, apply appropriate transformations (truncation, padding, or resizing)
to enforce the fixed-width contract so downstream tensor operations always
receive consistently-sized vectors.
| fn parse() -> Self { | ||
| let mut c = Cfg { | ||
| n_items: 1500, | ||
| n_queries: 200, | ||
| dim: 24, | ||
| latent_dim: 3, | ||
| sessions: 60, | ||
| queries_per_session: 128, | ||
| n_pos: 4, | ||
| n_neg: 16, | ||
| temperature: 0.3, | ||
| lr_w: 0.01, | ||
| lr_enc: 0.05, | ||
| beta: 1.0, | ||
| seeds: vec![1, 2, 3, 4, 5], | ||
| out_dir: PathBuf::from("bench_results"), | ||
| }; | ||
| let mut args = std::env::args().skip(1); | ||
| while let Some(a) = args.next() { | ||
| match a.as_str() { | ||
| "--sessions" => c.sessions = args.next().unwrap().parse().unwrap(), | ||
| "--seeds" => { | ||
| let k: usize = args.next().unwrap().parse().unwrap(); | ||
| c.seeds = (1..=k as u64).collect(); | ||
| } | ||
| "--items" => c.n_items = args.next().unwrap().parse().unwrap(), | ||
| "--queries" => c.n_queries = args.next().unwrap().parse().unwrap(), | ||
| "--out" => c.out_dir = PathBuf::from(args.next().unwrap()), | ||
| "--quick" => { | ||
| c.n_items = 600; | ||
| c.n_queries = 80; | ||
| c.sessions = 40; | ||
| c.queries_per_session = 64; | ||
| c.seeds = vec![1, 2, 3]; | ||
| } | ||
| "--help" | "-h" => { | ||
| println!("ruvector-selflearn [--sessions N] [--seeds K] [--items N] [--queries N] [--out DIR] [--quick]"); | ||
| std::process::exit(0); | ||
| } | ||
| _ => {} | ||
| } | ||
| } | ||
| c | ||
| } |
There was a problem hiding this comment.
Add invariant checks after CLI parsing to prevent panic paths.
Line 74-117 accepts values that later violate assumptions (e.g., Line 255 needs at least 10 ranked items; Lines 370-372 require n_pos + n_neg <= ranked.len()). This currently turns user input into runtime panics.
Suggested fix
impl Cfg {
+ fn validate(&self) -> Result<(), String> {
+ if self.sessions == 0 { return Err("sessions must be > 0".into()); }
+ if self.seeds.is_empty() { return Err("seeds must be >= 1".into()); }
+ if self.n_queries == 0 { return Err("queries must be > 0".into()); }
+ if self.n_items < 10 { return Err("items must be >= 10 for Recall@10".into()); }
+ let min_ranked = self.n_pos + self.n_neg;
+ if self.n_items < min_ranked {
+ return Err(format!("items must be >= n_pos + n_neg ({min_ranked})"));
+ }
+ if self.n_neg == 0 || self.n_pos == 0 {
+ return Err("n_pos and n_neg must be > 0".into());
+ }
+ if self.temperature <= 0.0 {
+ return Err("temperature must be > 0".into());
+ }
+ Ok(())
+ }
+
fn parse() -> Self {
let mut c = Cfg {
n_items: 1500,
@@
- c
+ c.validate().expect("invalid selflearn configuration");
+ c
}
}Also applies to: 255-255, 370-372
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/bin/selflearn.rs` around lines 74 - 117, The
parse() function accepts CLI arguments without validating that the configuration
values satisfy invariants required by the rest of the code (such as needing at
least 10 ranked items and ensuring n_pos + n_neg does not exceed the length of
ranked items). Add validation checks at the end of the parse() function before
returning the Cfg struct to verify these invariants are met, and either exit
with a clear error message or provide sensible defaults if the constraints are
violated, preventing downstream panic paths.
| // L2-normalize so training (dot) == evaluation (cosine); controls norm | ||
| // drift and keeps the contrastive objective consistent with retrieval. | ||
| let norm = r.iter().map(|v| v * v).sum::<f32>().sqrt().max(1e-9); | ||
| for v in &mut r { | ||
| *v /= norm; | ||
| } |
There was a problem hiding this comment.
Backprop currently ignores the L2-normalization Jacobian.
Line 229-234 normalizes rep, but Line 424-454 propagates gradients as if rep were unnormalized. That optimizes a different objective and can bias reported learning gains.
Also applies to: 424-454
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/bin/selflearn.rs` around lines 229 - 234, The
L2-normalization of the representation variable (variable `r`) in lines 229-234
is not accounted for in the gradient backpropagation logic in lines 424-454. To
fix this, apply the proper Jacobian of the L2-normalization operation when
computing gradients during backpropagation. Specifically, when propagating
gradients back through the normalized representation in the gradient computation
section, include the mathematical transformation that accounts for how the
normalization affects the gradient flow, ensuring that the backprop correctly
optimizes the same objective as the forward pass with normalization.
| s.push('\n'); | ||
| } | ||
| let _ = writeln!(s, " +{}", "-".repeat(w)); | ||
| let _ = writeln!(s, " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)", base.len()); |
There was a problem hiding this comment.
Session range label is off by one.
Line 553 prints 0 .. len, but the last plotted index is len - 1.
Suggested fix
- let _ = writeln!(s, " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)", base.len());
+ let _ = writeln!(
+ s,
+ " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)",
+ base.len().saturating_sub(1)
+ );📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let _ = writeln!(s, " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)", base.len()); | |
| let _ = writeln!( | |
| s, | |
| " session 0 .. {} ('*'=hashenc '.'=baseline '#'=both)", | |
| base.len().saturating_sub(1) | |
| ); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/bin/selflearn.rs` at line 553, The session range
label in the writeln! macro on line 553 displays an off-by-one range. The label
currently shows "0 .. base.len()" but since array indices are zero-based, the
maximum valid index is base.len() - 1. Change the writeln! statement to display
the upper bound as base.len() - 1 instead of base.len() to accurately reflect
the range of plotted indices.
| let _ = writeln!(rep, "| Convergence to 90% of own plateau (sessions) | {:.1} | {:.1} | — | — |", mean(&base_conv), mean(&hash_conv)); | ||
| let _ = writeln!(rep, "| Sessions to surpass baseline's final recall | — | {:.1} | reaches baseline quality fast, then exceeds it | — |", mean(&surpass)); | ||
| let _ = writeln!(rep, "| Encode cost added per query | — | +{:.2} µs | **{:+.1}%** of a ~60µs query | — |", added_us, lat_overhead); | ||
| let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```\n{curve}```\n"); |
There was a problem hiding this comment.
Emit fenced code blocks with a language tag to avoid MD040 warnings.
Line 658 generates a bare triple-backtick block; markdownlint expects a language (e.g., text).
Suggested fix
- let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```\n{curve}```\n");
+ let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```text\n{curve}```\n");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n```\n{curve}```\n"); | |
| let _ = writeln!(rep, "\n## Recall@10 learning curve (averaged over seeds)\n\n |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/bin/selflearn.rs` at line 658, The writeln! macro
call that generates the Markdown code block for the learning curve is using a
bare triple-backtick fence without a language tag, which triggers markdownlint
MD040 warnings. Add a language tag (such as "text") immediately after the
opening triple-backticks in the format string to properly specify the code block
language. Locate the writeln! call in the selflearn.rs file that outputs the
learning curve and update the opening fence from ``` to ```text.
Source: Linters/SAST tools
| for s in samples { | ||
| for i in 0..d { | ||
| mean[i] += s[i]; | ||
| } | ||
| } | ||
| for m in &mut mean { | ||
| *m /= samples.len() as f32; | ||
| } | ||
| // Centered copies. | ||
| let centered: Vec<Vec<f32>> = samples | ||
| .iter() | ||
| .map(|s| (0..d).map(|i| s[i] - mean[i]).collect()) | ||
| .collect(); |
There was a problem hiding this comment.
Validate sample dimensionality before PCA indexing loops.
Line 58 and Line 67 assume every sample has at least input_dim elements; shorter samples will panic at runtime.
Proposed fix
pub fn fit(&mut self, cfg: &HashEncConfig, samples: &[Vec<f32>]) {
if cfg.projection != ProjectionKind::PcaInit || samples.is_empty() {
return;
}
let d = self.input_dim;
+ if samples.iter().any(|s| s.len() < d) {
+ // Keep locked-random projection when sample shape is invalid.
+ return;
+ }
// Mean.
let mut mean = vec![0.0f32; d];🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/projection.rs` around lines 56 - 68, The code in
the PCA indexing loops assumes every sample in the samples collection has at
least d elements without validating this assumption. Add validation before the
mean calculation loop to ensure each sample has at least d dimensions. Check
that all samples in the samples iterator have length >= d, and either return an
error or panic with a descriptive message if any sample is shorter than
expected. This validation should occur before the first loop that accesses s[i]
to prevent runtime panics when indexing into undersized samples.
| /// Load tables previously written by [`save`] into `cfg`-shaped buffers. | ||
| pub fn load(cfg: &HashEncConfig, path: &Path) -> io::Result<Self> { | ||
| let mut r = io::BufReader::new(std::fs::File::open(path)?); | ||
| let mut u32buf = [0u8; 4]; | ||
| r.read_exact(&mut u32buf)?; | ||
| if u32::from_le_bytes(u32buf) != TABLE_MAGIC { | ||
| return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic")); | ||
| } | ||
| let mut me = Self::new(cfg); | ||
| r.read_exact(&mut u32buf)?; // levels | ||
| r.read_exact(&mut u32buf)?; // F | ||
| for t in &mut me.levels { | ||
| let mut u64buf = [0u8; 8]; | ||
| r.read_exact(&mut u64buf)?; | ||
| let len = u64::from_le_bytes(u64buf) as usize; | ||
| if len != t.len() { | ||
| return Err(io::Error::new(io::ErrorKind::InvalidData, "shape mismatch")); | ||
| } | ||
| let bytes: &mut [u8] = bytemuck_cast_mut(t); | ||
| r.read_exact(bytes)?; | ||
| } | ||
| Ok(me) | ||
| } |
There was a problem hiding this comment.
load discards stored levels/features_per_level without validating against cfg.
Lines 118-119 read the saved levels and features counts but never compare them to cfg.levels and cfg.features_per_level. If the file was saved with different hyperparameters, the per-level length check on line 124 may pass coincidentally (e.g., fewer levels with larger rows), loading corrupted weights silently.
🛡️ Proposed fix
pub fn load(cfg: &HashEncConfig, path: &Path) -> io::Result<Self> {
let mut r = io::BufReader::new(std::fs::File::open(path)?);
let mut u32buf = [0u8; 4];
r.read_exact(&mut u32buf)?;
if u32::from_le_bytes(u32buf) != TABLE_MAGIC {
return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
}
let mut me = Self::new(cfg);
- r.read_exact(&mut u32buf)?; // levels
- r.read_exact(&mut u32buf)?; // F
+ r.read_exact(&mut u32buf)?;
+ let file_levels = u32::from_le_bytes(u32buf) as usize;
+ r.read_exact(&mut u32buf)?;
+ let file_f = u32::from_le_bytes(u32buf) as usize;
+ if file_levels != cfg.levels || file_f != cfg.features_per_level {
+ return Err(io::Error::new(
+ io::ErrorKind::InvalidData,
+ "config/file shape mismatch",
+ ));
+ }
for t in &mut me.levels {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/tables.rs` around lines 109 - 131, The load
function reads the saved levels and features_per_level values from the file at
lines 118-119 but never validates them against the cfg parameter, which means
corrupted weights could be silently loaded if the file was saved with different
hyperparameters. After reading the u32buf values for levels and
features_per_level, extract them as u32 integers, compare each against
cfg.levels and cfg.features_per_level respectively, and return an io::Error with
ErrorKind::InvalidData if either value does not match before proceeding to load
the table data in the for loop.
| ``` | ||
| crates/ruvector-hashenc/ | ||
| Cargo.toml | ||
| src/ | ||
| lib.rs // public API, HashEncoder, HashEncConfig, FeatureSource impl | ||
| config.rs // HashEncConfig (L, T, F, d_idx, N_min, N_max), defaults | ||
| grid.rs // level resolutions, geometric growth, corner enumeration | ||
| hash.rs // spatial hash (XOR of coord*prime mod T), per-level | ||
| interp.rs // d-linear interpolation forward + analytic backward | ||
| projection.rs // P: R^d -> R^{d_idx} (locked random / learned / PCA-init) | ||
| tables.rs // FeatureTables: in-memory + mmap-backed, trainable | ||
| backward.rs // sparse gradient scatter into tables + projection grad | ||
| simd.rs // AVX2/AVX512/NEON gather + interpolation kernels | ||
| persist.rs // mmap layout + header, reuse MmapManager conventions | ||
| wasm.rs // wasm32 path (no mmap; in-memory tables) | ||
| ``` |
There was a problem hiding this comment.
Add language identifiers to fenced code blocks (MD040).
These code fences are unlabeled; add text/rust to satisfy markdownlint consistently.
Also applies to: 133-169, 173-186, 190-213, 221-234, 247-255
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 114-114: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md`
around lines 114 - 129, The fenced code blocks in the markdown document are
missing language identifiers required by markdownlint (MD040 rule). Add the
appropriate language identifier after the opening triple backticks (```) for all
code fences throughout the document. For the code block showing the Rust crate
structure and all other occurrences at lines 133-169, 173-186, 190-213, 221-234,
and 247-255, specify either `text` for generic file structures or `rust` for
Rust code blocks to ensure consistent markdown compliance.
Source: Linters/SAST tools
…, tiered store, SIMD (ADR-258) Lands the remaining ADR-258 roadmap as opt-in, fully unit-tested components behind the `hashenc` feature flag (default build unchanged). Phase 2 (GNN / self-learning upgrades): - Learned projection: trainable P with full analytic gradient (HashEncoder::projection_grad / apply_projection_grad), PCA-initialized via the new ProjectionKind::Learned. Proven by a finite-difference gradient check and an end-to-end learning test (tables + projection reduce loss >50%). - sampling.rs: NegativeSampler (Random / HnswHard mid-rank / Mixed) for hard-negative mining, and TemperatureSchedule (cosine InfoNCE annealing). - ruvector-gnn residual.rs: ResidualGatBlock — residual skip + learned edge gain over MultiHeadAttention + LayerNorm. Phase 3 (storage / SIMD): - tiered.rs: TieredFeatureStore (HOT tables / WARM int8 reconstruction / COLD) with footprint accounting — wires quantization into the live path (spirit of ruvnet#563). AVX2 + scalar L2 rerank distance with a SIMD==scalar differential test. Tests: ruvector-hashenc 13 unit + 2 gradient checks + e2e learning + doctest; ruvector-gnn 215+6+10 with `hashenc`. Clippy clean; default builds unaffected. Docs: ADR-258 updated with a living implementation-status table and Accepted outcome; added ADR-258 issue + PR description drafts for submission upstream. Co-Authored-By: claude-flow <ruv@ruv.net> Claude-Session: https://claude.ai/code/session_01X6ARy2fKVicSxK3GtjkV7x
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ruvector-gnn/src/residual.rs`:
- Around line 42-62: The forward method in the residual.rs file does not
validate input tensor shapes, allowing mismatched dimensions to silently
propagate and potentially cause panics later in LayerNorm. Add input validation
at the beginning of the forward method to enforce that neighbors.len() matches
edge_weights.len(), node.len() matches the configured embed_dim, and all
neighbor vectors have consistent dimensions. Consider updating the forward
method signature to return a Result type to properly communicate validation
failures to the caller.
In `@crates/ruvector-hashenc/src/sampling.rs`:
- Around line 39-43: Add guard conditions to prevent modulo-by-zero panic and
infinite loops in the sampling logic. Before the while loop that generates
random indices using `rng.next_u64() % n_items as u64`, verify that n_items is
greater than zero. Additionally, when calling the modulo operation on n_items,
ensure there exists at least one non-excluded item available, otherwise the loop
will hang indefinitely trying to find a valid sample. These guard checks should
be applied to both sampling blocks referenced in the comment (the one around
lines 39-43 and the similar one around lines 61-66).
- Around line 68-77: The hard_frac value is used without validation to calculate
n_hard in the NegativeSampler::Mixed branch, which can result in n_hard
exceeding n if hard_frac is greater than 1.0. This causes an integer underflow
when computing n - out.len() on line 76. Clamp hard_frac to the range [0.0, 1.0]
before using it to compute n_hard to ensure n_hard never exceeds n.
In `@crates/ruvector-hashenc/src/tiered.rs`:
- Around line 178-189: The l2_avx2 function uses the _mm256_fmadd_ps intrinsic
which requires the FMA instruction set, but the function only declares the
"avx2" target feature in its #[target_feature] attribute and the runtime CPU
capability check at line 159 only verifies for AVX2 support. Add "fma" to the
target_feature enable list in the l2_avx2 function definition to properly
declare the FMA requirement, and update the runtime CPU check to also verify
FMA3 capability is supported before allowing this code path to execute.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9cc7163b-4946-4340-9811-bdce95c75fa1
📒 Files selected for processing (13)
crates/ruvector-gnn/src/lib.rscrates/ruvector-gnn/src/residual.rscrates/ruvector-hashenc/src/config.rscrates/ruvector-hashenc/src/interp.rscrates/ruvector-hashenc/src/lib.rscrates/ruvector-hashenc/src/projection.rscrates/ruvector-hashenc/src/sampling.rscrates/ruvector-hashenc/src/tiered.rscrates/ruvector-hashenc/tests/gradient_check.rscrates/ruvector-hashenc/tests/learning.rsdocs/adr/ADR-258-ISSUE-DRAFT.mddocs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.mddocs/adr/ADR-258-PR-DRAFT.md
✅ Files skipped from review due to trivial changes (3)
- docs/adr/ADR-258-PR-DRAFT.md
- docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md
- docs/adr/ADR-258-ISSUE-DRAFT.md
| pub fn forward(&self, node: &[f32], neighbors: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> { | ||
| let d = node.len(); | ||
| let mut out = node.to_vec(); // residual skip | ||
|
|
||
| if !neighbors.is_empty() { | ||
| // Attention sub-layer. | ||
| let attn = self.attention.forward(node, neighbors, neighbors); | ||
| for k in 0..d.min(attn.len()) { | ||
| out[k] += attn[k]; | ||
| } | ||
| // Edge-weighted neighbour aggregation (normalized), scaled by gain. | ||
| let wsum: f32 = edge_weights.iter().copied().sum::<f32>().max(1e-9); | ||
| for (nb, &w) in neighbors.iter().zip(edge_weights) { | ||
| let wn = (w / wsum) * self.edge_gain; | ||
| for k in 0..d.min(nb.len()) { | ||
| out[k] += wn * nb[k]; | ||
| } | ||
| } | ||
| } | ||
| self.norm.forward(&out) | ||
| } |
There was a problem hiding this comment.
Validate input shapes and edge-weight arity at the forward boundary.
forward currently accepts inconsistent tensor sizes: neighbors.iter().zip(edge_weights) silently drops unmatched items, and node.len() can differ from embed_dim, which then flows into LayerNorm configured with embed_dim parameters and can panic on length mismatch. This should be enforced as a hard contract.
Proposed fix (explicit contract via Result)
-use crate::error::Result;
+use crate::error::{GnnError, Result};
@@
- pub fn forward(&self, node: &[f32], neighbors: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> {
+ pub fn forward(
+ &self,
+ node: &[f32],
+ neighbors: &[Vec<f32>],
+ edge_weights: &[f32],
+ ) -> Result<Vec<f32>> {
+ if node.len() != self.embed_dim {
+ return Err(GnnError::layer_config(format!(
+ "node dim {} != embed_dim {}",
+ node.len(),
+ self.embed_dim
+ )));
+ }
+ if neighbors.len() != edge_weights.len() {
+ return Err(GnnError::layer_config(format!(
+ "neighbors len {} != edge_weights len {}",
+ neighbors.len(),
+ edge_weights.len()
+ )));
+ }
+ if neighbors.iter().any(|nb| nb.len() != self.embed_dim) {
+ return Err(GnnError::layer_config("neighbor dim must match embed_dim".to_string()));
+ }
+
let d = node.len();
let mut out = node.to_vec(); // residual skip
@@
- self.norm.forward(&out)
+ Ok(self.norm.forward(&out))
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pub fn forward(&self, node: &[f32], neighbors: &[Vec<f32>], edge_weights: &[f32]) -> Vec<f32> { | |
| let d = node.len(); | |
| let mut out = node.to_vec(); // residual skip | |
| if !neighbors.is_empty() { | |
| // Attention sub-layer. | |
| let attn = self.attention.forward(node, neighbors, neighbors); | |
| for k in 0..d.min(attn.len()) { | |
| out[k] += attn[k]; | |
| } | |
| // Edge-weighted neighbour aggregation (normalized), scaled by gain. | |
| let wsum: f32 = edge_weights.iter().copied().sum::<f32>().max(1e-9); | |
| for (nb, &w) in neighbors.iter().zip(edge_weights) { | |
| let wn = (w / wsum) * self.edge_gain; | |
| for k in 0..d.min(nb.len()) { | |
| out[k] += wn * nb[k]; | |
| } | |
| } | |
| } | |
| self.norm.forward(&out) | |
| } | |
| pub fn forward( | |
| &self, | |
| node: &[f32], | |
| neighbors: &[Vec<f32>], | |
| edge_weights: &[f32], | |
| ) -> Result<Vec<f32>> { | |
| if node.len() != self.embed_dim { | |
| return Err(GnnError::layer_config(format!( | |
| "node dim {} != embed_dim {}", | |
| node.len(), | |
| self.embed_dim | |
| ))); | |
| } | |
| if neighbors.len() != edge_weights.len() { | |
| return Err(GnnError::layer_config(format!( | |
| "neighbors len {} != edge_weights len {}", | |
| neighbors.len(), | |
| edge_weights.len() | |
| ))); | |
| } | |
| if neighbors.iter().any(|nb| nb.len() != self.embed_dim) { | |
| return Err(GnnError::layer_config("neighbor dim must match embed_dim".to_string())); | |
| } | |
| let d = node.len(); | |
| let mut out = node.to_vec(); // residual skip | |
| if !neighbors.is_empty() { | |
| // Attention sub-layer. | |
| let attn = self.attention.forward(node, neighbors, neighbors); | |
| for k in 0..d.min(attn.len()) { | |
| out[k] += attn[k]; | |
| } | |
| // Edge-weighted neighbour aggregation (normalized), scaled by gain. | |
| let wsum: f32 = edge_weights.iter().copied().sum::<f32>().max(1e-9); | |
| for (nb, &w) in neighbors.iter().zip(edge_weights) { | |
| let wn = (w / wsum) * self.edge_gain; | |
| for k in 0..d.min(nb.len()) { | |
| out[k] += wn * nb[k]; | |
| } | |
| } | |
| } | |
| Ok(self.norm.forward(&out)) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-gnn/src/residual.rs` around lines 42 - 62, The forward method
in the residual.rs file does not validate input tensor shapes, allowing
mismatched dimensions to silently propagate and potentially cause panics later
in LayerNorm. Add input validation at the beginning of the forward method to
enforce that neighbors.len() matches edge_weights.len(), node.len() matches the
configured embed_dim, and all neighbor vectors have consistent dimensions.
Consider updating the forward method signature to return a Result type to
properly communicate validation failures to the caller.
| let c = (rng.next_u64() % n_items as u64) as usize; | ||
| if !is_excluded(c) { | ||
| out.push(c); | ||
| } | ||
| } |
There was a problem hiding this comment.
Guard impossible sampling states to avoid panic/hangs.
n_items == 0 triggers modulo-by-zero, and if all in-range items are excluded these while loops can run indefinitely.
Suggested fix
pub fn sample(
&self,
ranked: &[usize],
n_items: usize,
n: usize,
exclude: &[usize],
rng: &mut SplitMix64,
) -> Vec<usize> {
+ if n == 0 || n_items == 0 {
+ return Vec::new();
+ }
+ let excluded_in_range = exclude.iter().filter(|&&x| x < n_items).count();
+ if excluded_in_range >= n_items {
+ return Vec::new();
+ }
let is_excluded = |x: usize| exclude.contains(&x);Also applies to: 61-66
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/sampling.rs` around lines 39 - 43, Add guard
conditions to prevent modulo-by-zero panic and infinite loops in the sampling
logic. Before the while loop that generates random indices using `rng.next_u64()
% n_items as u64`, verify that n_items is greater than zero. Additionally, when
calling the modulo operation on n_items, ensure there exists at least one
non-excluded item available, otherwise the loop will hang indefinitely trying to
find a valid sample. These guard checks should be applied to both sampling
blocks referenced in the comment (the one around lines 39-43 and the similar one
around lines 61-66).
| NegativeSampler::Mixed { band, hard_frac } => { | ||
| let n_hard = ((n as f32) * hard_frac).round() as usize; | ||
| let hard = | ||
| NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng); | ||
| out.extend(hard); | ||
| let mut excl2 = exclude.to_vec(); | ||
| excl2.extend_from_slice(&out); | ||
| let rest = | ||
| NegativeSampler::Random.sample(ranked, n_items, n - out.len(), &excl2, rng); | ||
| out.extend(rest); |
There was a problem hiding this comment.
Clamp hard_frac before computing n_hard.
Unchecked hard_frac can produce n_hard > n; then n - out.len() underflows at Line 76 after extending hard negatives.
Suggested fix
NegativeSampler::Mixed { band, hard_frac } => {
- let n_hard = ((n as f32) * hard_frac).round() as usize;
+ let frac = hard_frac.clamp(0.0, 1.0);
+ let n_hard = ((n as f32) * frac).round() as usize;
let hard =
NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng);
out.extend(hard);
let mut excl2 = exclude.to_vec();
excl2.extend_from_slice(&out);
let rest =
- NegativeSampler::Random.sample(ranked, n_items, n - out.len(), &excl2, rng);
+ NegativeSampler::Random.sample(ranked, n_items, n.saturating_sub(out.len()), &excl2, rng);
out.extend(rest);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| NegativeSampler::Mixed { band, hard_frac } => { | |
| let n_hard = ((n as f32) * hard_frac).round() as usize; | |
| let hard = | |
| NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng); | |
| out.extend(hard); | |
| let mut excl2 = exclude.to_vec(); | |
| excl2.extend_from_slice(&out); | |
| let rest = | |
| NegativeSampler::Random.sample(ranked, n_items, n - out.len(), &excl2, rng); | |
| out.extend(rest); | |
| NegativeSampler::Mixed { band, hard_frac } => { | |
| let frac = hard_frac.clamp(0.0, 1.0); | |
| let n_hard = ((n as f32) * frac).round() as usize; | |
| let hard = | |
| NegativeSampler::HnswHard { band }.sample(ranked, n_items, n_hard, exclude, rng); | |
| out.extend(hard); | |
| let mut excl2 = exclude.to_vec(); | |
| excl2.extend_from_slice(&out); | |
| let rest = | |
| NegativeSampler::Random.sample(ranked, n_items, n.saturating_sub(out.len()), &excl2, rng); | |
| out.extend(rest); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/sampling.rs` around lines 68 - 77, The hard_frac
value is used without validation to calculate n_hard in the
NegativeSampler::Mixed branch, which can result in n_hard exceeding n if
hard_frac is greater than 1.0. This causes an integer underflow when computing n
- out.len() on line 76. Clamp hard_frac to the range [0.0, 1.0] before using it
to compute n_hard to ensure n_hard never exceeds n.
| #[cfg(target_arch = "x86_64")] | ||
| #[target_feature(enable = "avx2")] | ||
| unsafe fn l2_avx2(a: &[f32], b: &[f32]) -> f32 { | ||
| use std::arch::x86_64::*; | ||
| let n = a.len().min(b.len()); | ||
| let mut sum = _mm256_setzero_ps(); | ||
| let mut i = 0; | ||
| while i + 8 <= n { | ||
| let va = _mm256_loadu_ps(a.as_ptr().add(i)); | ||
| let vb = _mm256_loadu_ps(b.as_ptr().add(i)); | ||
| let d = _mm256_sub_ps(va, vb); | ||
| sum = _mm256_fmadd_ps(d, d, sum); |
There was a problem hiding this comment.
FMA intrinsic used without FMA feature detection/enablement.
_mm256_fmadd_ps at line 189 requires the FMA instruction set, but the runtime check (line 159) only verifies AVX2 and #[target_feature] only enables "avx2". While virtually all x86-64 CPUs with AVX2 also support FMA3, these are technically independent feature flags.
🔧 Proposed fix: enable and detect FMA
#[cfg(target_arch = "x86_64")]
-#[target_feature(enable = "avx2")]
+#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn l2_avx2(a: &[f32], b: &[f32]) -> f32 {And update the runtime check:
#[cfg(target_arch = "x86_64")]
{
- if is_x86_feature_detected!("avx2") {
+ if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
// Safety: guarded by runtime feature detection.
return unsafe { l2_avx2(a, b) };
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/ruvector-hashenc/src/tiered.rs` around lines 178 - 189, The l2_avx2
function uses the _mm256_fmadd_ps intrinsic which requires the FMA instruction
set, but the function only declares the "avx2" target feature in its
#[target_feature] attribute and the runtime CPU capability check at line 159
only verifies for AVX2 support. Add "fma" to the target_feature enable list in
the l2_avx2 function definition to properly declare the FMA requirement, and
update the runtime CPU check to also verify FMA3 capability is supported before
allowing this code path to execute.
What & why
Implements Phase 1 of ADR-258, adapting the Instant-NGP multiresolution hash encoding (Müller et al., SIGGRAPH 2022, arXiv:2201.05989) into RuVector's GNN-over-HNSW self-learning loop.
The self-learning loop today is bandwidth-bound: online InfoNCE updates touch full
d_embedembeddings and accumulate dense gradients throughMmapGradientAccumulator. Multiresolution hash encoding replaces that with O(L) cache-resident table lookups whose gradients are sparse (only2^d_idx·Lparams per sample), giving the GNN an explicit multi-scale signal aligned with the HNSW layer hierarchy, at a fixed, bounded memory budget.Full rationale, alternatives, success criteria, risks, and the phased plan are in the ADR:
docs/adr/ADR-258-Multiresolution-Hash-Encoding-and-Neural-Index-Upgrade.md.What's in this PR
New crate
ruvector-hashenc(dependency-light:thiserroronly, optionalmemmap2; WASM-friendly):HashEncoder— projection (LockedRandom/PcaInit) into a low-d_idxindex space, hashed multiresolution grid (Llevels,Ttable size,Ffeatures), d-linear interpolation; dense collision-free coarse levels + spatial hashing (h(x)=⊕ xᵢπᵢ mod T) for fine levels.FeatureTables+GradAccum— trainable tables with a sparse-scatter backward and fused AXPY update; file persistence.encode,forward+backward).GNN integration (behind
hashencfeature, default off → backward compatible):FeatureSourcetrait withFlatEmbedding(legacy, zero-overhead default) andHashAugmented(concat raw + encoded).RuvectorLayer::forwardsignature unchanged; onlyinput_dimgrows.Self-learning validation harness (
ruvector-selflearn):bench_results/selflearn_REPORT.md.Phase-1 results (5 seeds)
Baseline learns a diagonal metric; the hashenc variant freezes the linear part and learns only the encoder, isolating its contribution on a relevance target a linear metric provably cannot capture.
Reproduce / verify
All green locally; clippy clean on the new crate.
Scope / safety
hashencand the new crate are opt-in).ruvector-hashencadded to workspace members.Roadmap (tracked here since Issues are disabled on this repo)
FeatureSourceintegration + self-learning harness + gradient-check (this PR)TieredFeatureStore(HOT MHE / WARM PQ·RaBitQ / COLD block-aligned, wiring quantization into the live path, DbOptions.quantization is silently ignored — never applied by VectorDB (misleading default + docs) ruvnet/RuVector#563), async query path, AVX512/NEON/wasm gather kernels, full suite + comparison report; promote default if S-criteria passSuccess criteria (ADR-258 §5)
S1 Recall@10 +25–50% · S2 Recall@100 +15–35% · S3 convergence 2–3× · S4 QPS 1.8–3× · S5 p50→25–40µs · S6 mem −25–45% · S7 overhead ≤+15%. Promotion gated on S1 ∧ S3 with ≥5 seeds, 95% CI, Cohen's d ≥ 0.8.
🤖 Generated with claude-flow
Generated by Claude Code
Summary by CodeRabbit
Release Notes
New Features
Tests / Benchmarks
Documentation