Summary
embed() (sync) always returns hash-based vectors regardless of ONNX availability, while embedAsync() returns true ONNX semantic vectors. Since remember() uses embedAsync() but many natural usage patterns call embed() directly, the persisted DB ends up with mixed embedding spaces and different dimensions (128/256 hash vs 384 ONNX). recall() queries with embedAsync() — cosine similarity against mismatched-dimension vectors produces meaningless scores or silent failures.
Root Cause
embed() has a code path that appears to handle ONNX but actually just calls hashEmbed() unconditionally:
    embed(text) {
      const dim = this.config.embeddingDim;
      if (this.onnxReady && this.onnxEmbedder) {
        try {
          // Note: This is sync wrapper for async ONNX
          // For full async, use embedAsync
          return this.hashEmbed(text, dim); // ← always hash, ONNX never used
        } catch { /* fall through */ }
      }
      return this.hashEmbed(text, dim);
    }

The comment documents the intent (sync ONNX wrapper) but the implementation never calls ONNX. The result: embed() is always a hash function, with no warning or error.
Consequences
- Dimension mismatch: embed() returns embeddingDim-dimensional hash vectors (default 128 or 256). embedAsync() with ONNX returns 384-dim vectors (all-MiniLM-L6-v2). Calling cosineSimilarity(384-dim-query, 128-dim-stored) produces incorrect results: the implementation uses a.length !== b.length as a guard that silently returns 0.
- Mixed embedding spaces: Any code that pre-computes embeddings via engine.embed() and stores them directly (e.g. bulk import pipelines) will store hash vectors, while recall() queries with ONNX vectors. The two spaces are incomparable, so no meaningful similarity can be computed.
- Silent quality regression: Users who configure ONNX expecting semantic recall get hash-quality recall with no indication anything is wrong. Scores appear normal (0.6–0.75 range for hash similarity) but do not reflect semantic meaning.
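The dimension-mismatch case can be illustrated in isolation. The sketch below is not ruvector's code: cosineSimilarity here is a hypothetical stand-in that assumes only the behavior described above (a length guard that returns 0 without warning), and the two vectors are dummy data with the dimensions reported in this issue.

```javascript
// Hypothetical stand-in for the library's cosineSimilarity, assuming only the
// behavior described in this issue: a length guard that silently returns 0.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) return 0; // silent: no warning, no error
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Dummy vectors standing in for a stored hash embedding and an ONNX query.
const storedHashVec = new Float32Array(128).fill(0.1);
const onnxQueryVec = new Float32Array(384).fill(0.1);

// Mismatched lengths hit the guard, so the caller sees a plain 0.
const mismatchedScore = cosineSimilarity(onnxQueryVec, storedHashVec);
// Same-length vectors go through the normal computation.
const matchedScore = cosineSimilarity(storedHashVec, storedHashVec);

console.log({ mismatchedScore, matchedScore });
```

From the caller's side, a score of 0 from the guard is indistinguishable from a genuinely unrelated pair, which is why the mismatch goes unnoticed.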
Reproduction
    const engine = new IntelligenceEngine({ embeddingDim: 128 });
    // If ONNX initialises:
    const syncVec = engine.embed('knowledge synthesis framework'); // 128-dim hash
    const asyncVec = await engine.embedAsync('knowledge synthesis framework'); // 384-dim ONNX (if active)
    console.log(syncVec.length, asyncVec.length); // 128, 384 (incompatible)

Suggested Fix
Option A — Document and enforce the distinction:
Rename embed() to hashEmbed() as the public API (breaking), or add a clear deprecation warning when ONNX is active and embed() is called. Update all internal usages in import() and any bulk-write paths to use embedAsync().
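A minimal sketch of the warning variant of Option A, assuming a warn-once policy. SketchEngine, its constructor options, the warnedSyncEmbed flag, and the placeholder hashEmbed() body are all hypothetical and only mirror the field names quoted in Root Cause; this is not the real IntelligenceEngine.

```javascript
// Sketch of Option A: warn once when the sync path is used while ONNX is active.
// SketchEngine is a hypothetical stand-in, not the real IntelligenceEngine.
class SketchEngine {
  constructor({ embeddingDim = 128, onnxReady = false } = {}) {
    this.config = { embeddingDim };
    this.onnxReady = onnxReady;
    this.warnedSyncEmbed = false; // hypothetical flag so the warning fires only once
  }

  // Placeholder hash embedding (illustrative only, not the library's algorithm).
  hashEmbed(text, dim) {
    const vec = new Float32Array(dim);
    for (let i = 0; i < text.length; i++) {
      vec[(text.charCodeAt(i) * 31 + i) % dim] += 1;
    }
    return vec;
  }

  // Behavior is unchanged (still hash-based), but the caller is now told about it.
  embed(text) {
    if (this.onnxReady && !this.warnedSyncEmbed) {
      this.warnedSyncEmbed = true;
      console.warn(
        '[ruvector] embed() returns hash vectors even when ONNX is active; ' +
          'use embedAsync() for semantic vectors.'
      );
    }
    return this.hashEmbed(text, this.config.embeddingDim);
  }
}
```

Warning once per engine instance keeps bulk callers from flooding the log while still surfacing the problem on first use.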
Option B — Consistent embedding on write:
Ensure import() re-embeds stored memories using embedAsync() when ONNX becomes available, or store an embeddingModel tag on each memory entry so mismatches can be detected and flagged at recall time.
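The tagging half of Option B could look roughly like the following. The entry shape, the makeEntry() and partitionByModel() helpers, and the 'hash-128' / 'onnx-384' labels are all assumptions made for illustration; only the embeddingModel field name comes from the proposal above.

```javascript
// Sketch of Option B: tag each entry with the model that produced its vector.
// The entry shape and the 'hash-128' / 'onnx-384' labels are hypothetical.
function makeEntry(text, embedding, embeddingModel) {
  return { text, embedding, embeddingModel };
}

// Split stored entries into those comparable with the query model and those
// that were embedded by something else (candidates for re-embedding).
function partitionByModel(entries, queryModel) {
  const comparable = [];
  const stale = [];
  for (const entry of entries) {
    if (entry.embeddingModel === queryModel) {
      comparable.push(entry);
    } else {
      stale.push(entry);
    }
  }
  return { comparable, stale };
}

const entries = [
  makeEntry('bulk imported note', new Float32Array(128), 'hash-128'),
  makeEntry('remembered note', new Float32Array(384), 'onnx-384'),
];

const { comparable, stale } = partitionByModel(entries, 'onnx-384');
if (stale.length > 0) {
  console.warn(
    `[ruvector] ${stale.length} stored entries use a different embedding model and need re-embedding.`
  );
}
```

Comparing on an explicit tag rather than on vector length also catches the case where two different models happen to share a dimension.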
Option C — Dimension guard in cosineSimilarity():
At minimum, log a warning (not silent return) when a.length !== b.length:
    cosineSimilarity(a, b) {
      if (a.length !== b.length) {
        console.warn(`[ruvector] Embedding dimension mismatch: ${a.length} vs ${b.length}. Results unreliable.`);
        // ... truncate or return 0
      }
    }

Related
This compounds the issue in #315 (import() not rebuilding HNSW): even after fixing HNSW rebuild, if stored embeddings are hash-based and queries are ONNX-based, HNSW distances will be incorrect and brute-force cosine scores will be meaningless.
Environment
- ruvector@0.2.19
- Node.js v20.19.5
- dist/core/intelligence-engine.js: embed() ~line 183, embedAsync() ~line 211, remember() ~line 336
- ONNX bundle: dist/core/onnx/pkg/ruvector_onnx_embeddings_wasm.js (present)
Best regards,
Rob / Roble