Bug: embed() always returns hash vectors — breaks cosine similarity when ONNX is active #316

@RobLe3

Summary

embed() (sync) always returns hash-based vectors regardless of ONNX availability, while embedAsync() returns true ONNX semantic vectors. Since remember() uses embedAsync() but many natural usage patterns call embed() directly, the persisted DB ends up with mixed embedding spaces and different dimensions (128/256 hash vs 384 ONNX). recall() queries with embedAsync() — cosine similarity against mismatched-dimension vectors produces meaningless scores or silent failures.

Root Cause

embed() has a code path that appears to handle ONNX but actually just calls hashEmbed() unconditionally:

embed(text) {
    const dim = this.config.embeddingDim;
    if (this.onnxReady && this.onnxEmbedder) {
        try {
            // Note: This is sync wrapper for async ONNX
            // For full async, use embedAsync
            return this.hashEmbed(text, dim); // ← always hash, ONNX never used
        } catch { /* fall through */ }
    }
    return this.hashEmbed(text, dim);
}

The comment documents the intent (sync ONNX wrapper) but the implementation never calls ONNX. The result: embed() is always a hash function, with no warning or error.

Consequences

  1. Dimension mismatch: embed() returns embeddingDim-dimensional hash vectors (default 128 or 256). embedAsync() with ONNX returns 384-dimensional vectors (all-MiniLM-L6-v2). Calling cosineSimilarity() with a 384-dim query and a 128-dim stored vector does not raise an error: the implementation guards on a.length !== b.length and silently returns 0, so the mismatch goes unnoticed.

  2. Mixed embedding spaces: Any code that pre-computes embeddings via engine.embed() and stores them directly (e.g. bulk import pipelines) will store hash vectors. recall() queries with ONNX vectors. The two spaces are incomparable — no meaningful similarity can be computed.

  3. Silent quality regression: Users who configure ONNX expecting semantic recall get hash-quality recall with no indication anything is wrong. Scores appear normal (0.6–0.75 range for hash similarity) but do not reflect semantic meaning.
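
To check whether an existing store is affected by the consequences listed above, one option is to tally the lengths of the stored vectors. The following is a minimal sketch, not ruvector code: the entry shape ({ id, embedding }) and the helper name tallyEmbeddingDims are assumptions made for illustration.

```javascript
// Hypothetical entry shape: { id: string, embedding: number[] }.
// tallyEmbeddingDims is an illustrative helper, not part of ruvector.
function tallyEmbeddingDims(entries) {
  const counts = new Map();
  for (const entry of entries) {
    const dim = entry.embedding.length;
    counts.set(dim, (counts.get(dim) || 0) + 1);
  }
  return counts;
}

// Sample data standing in for a persisted DB with mixed vectors.
const sampleEntries = [
  { id: 'a', embedding: new Array(128).fill(0) }, // hash-sized vector
  { id: 'b', embedding: new Array(384).fill(0) }, // ONNX-sized vector
  { id: 'c', embedding: new Array(384).fill(0) },
];

const dimCounts = tallyEmbeddingDims(sampleEntries);
const isMixed = dimCounts.size > 1;
console.log(Object.fromEntries(dimCounts), isMixed ? 'mixed' : 'consistent');
```

This only catches the dimension mismatch. If embeddingDim happens to equal the ONNX dimension, hash and ONNX vectors have the same length and a length tally cannot tell them apart.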

Reproduction

const engine = new IntelligenceEngine({ embeddingDim: 128 });
// If ONNX initialises:
const syncVec  = engine.embed('knowledge synthesis framework');      // 128-dim hash
const asyncVec = await engine.embedAsync('knowledge synthesis framework'); // 384-dim ONNX (if active)
console.log(syncVec.length, asyncVec.length); // 128, 384 — incompatible

Suggested Fix

Option A — Document and enforce the distinction:
Rename embed() to hashEmbed() as the public API (breaking), or add a clear deprecation warning when ONNX is active and embed() is called. Update all internal usages in import() and any bulk-write paths to use embedAsync().
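
A minimal sketch of the warning variant of Option A, assuming a one-time warning is acceptable. EngineSketch is a stand-in class written for this example, its hashEmbed() is a toy hash, and the warn option and warnedSyncEmbed flag are hypothetical; none of this is ruvector's actual implementation.

```javascript
// Stand-in class for illustration only; not ruvector's IntelligenceEngine.
class EngineSketch {
  constructor({ embeddingDim = 128, onnxReady = false, warn = console.warn } = {}) {
    this.config = { embeddingDim };
    this.onnxReady = onnxReady;
    this.warn = warn;             // injectable so the warning can be captured
    this.warnedSyncEmbed = false; // hypothetical flag: warn only once
  }

  // Toy deterministic hash embedding so the sketch runs; the real hashEmbed() differs.
  hashEmbed(text, dim) {
    const vec = new Array(dim).fill(0);
    for (let i = 0; i < text.length; i++) {
      vec[(text.charCodeAt(i) * 31 + i) % dim] += 1;
    }
    return vec;
  }

  // Still returns hash vectors, but says so once when ONNX is active.
  embed(text) {
    if (this.onnxReady && !this.warnedSyncEmbed) {
      this.warnedSyncEmbed = true;
      this.warn('[ruvector] embed() returns hash vectors even when ONNX is active; use embedAsync() for semantic vectors.');
    }
    return this.hashEmbed(text, this.config.embeddingDim);
  }
}

const capturedWarnings = [];
const sketchEngine = new EngineSketch({
  onnxReady: true,
  warn: (msg) => capturedWarnings.push(msg),
});
const firstVec = sketchEngine.embed('knowledge synthesis framework');
sketchEngine.embed('a second call does not warn again');
console.log(firstVec.length, capturedWarnings.length);
```

Warning once rather than on every call keeps bulk pipelines from flooding the log while still surfacing the problem.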

Option B — Consistent embedding on write:
Ensure import() re-embeds stored memories using embedAsync() when ONNX becomes available, or store an embeddingModel tag on each memory entry so mismatches can be detected and flagged at recall time.
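
The tagging half of Option B could look roughly like the sketch below. The entry shape, the model labels ('hash-128', 'onnx-all-MiniLM-L6-v2') and the helper splitByEmbeddingModel are all assumptions for illustration; ruvector does not currently store such a tag.

```javascript
// Hypothetical entry shape: { id, embedding, embeddingModel }.
// The model labels used here are illustrative, not values ruvector defines.
function splitByEmbeddingModel(entries, queryModel) {
  const comparable = [];
  const stale = [];
  for (const entry of entries) {
    if (entry.embeddingModel === queryModel) {
      comparable.push(entry);
    } else {
      stale.push(entry); // embedded in a different space; needs re-embedding
    }
  }
  return { comparable, stale };
}

const storedMemories = [
  { id: 'm1', embedding: new Array(128).fill(0), embeddingModel: 'hash-128' },
  { id: 'm2', embedding: new Array(384).fill(0), embeddingModel: 'onnx-all-MiniLM-L6-v2' },
  { id: 'm3', embedding: new Array(384).fill(0) }, // legacy entry without a tag
];

const split = splitByEmbeddingModel(storedMemories, 'onnx-all-MiniLM-L6-v2');
if (split.stale.length > 0) {
  console.warn(`[ruvector] ${split.stale.length} stored memories use a different embedding model and should be re-embedded.`);
}
console.log(split.comparable.map((e) => e.id), split.stale.map((e) => e.id));
```

Entries without a tag are treated as stale here, which is the conservative choice for data written before the tag existed.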

Option C — Dimension guard in cosineSimilarity():
At minimum, log a warning (not silent return) when a.length !== b.length:

cosineSimilarity(a, b) {
    if (a.length !== b.length) {
        console.warn(`[ruvector] Embedding dimension mismatch: ${a.length} vs ${b.length}. Results unreliable.`);
        // ... truncate or return 0
    }
}
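
For completeness, a standalone version of the guard in Option C might look like the following. It is a sketch, not the library's code: it picks "return 0" over truncation, and the injectable warn parameter is an assumption added so the warning can be captured.

```javascript
// Standalone sketch, not ruvector's implementation.
// The warn parameter is hypothetical and defaults to console.warn.
function cosineSimilarity(a, b, warn = console.warn) {
  if (a.length !== b.length) {
    warn(`[ruvector] Embedding dimension mismatch: ${a.length} vs ${b.length}. Results unreliable.`);
    return 0; // still returns 0, but no longer silently
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors have no direction
}

const mismatchMessages = [];
const mismatchedScore = cosineSimilarity(
  new Array(384).fill(1), // ONNX-sized query
  new Array(128).fill(1), // hash-sized stored vector
  (msg) => mismatchMessages.push(msg),
);
const identicalScore = cosineSimilarity([1, 0], [1, 0]);
const orthogonalScore = cosineSimilarity([1, 0], [0, 1]);
console.log(mismatchedScore, identicalScore, orthogonalScore, mismatchMessages.length);
```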

Related

This compounds the issue in #315 (import() not rebuilding HNSW): even after fixing HNSW rebuild, if stored embeddings are hash-based and queries are ONNX-based, HNSW distances will be incorrect and brute-force cosine scores will be meaningless.

Environment

  • ruvector@0.2.19
  • Node.js v20.19.5
  • dist/core/intelligence-engine.js: embed() ~line 183, embedAsync() ~line 211, remember() ~line 336
  • ONNX bundle: dist/core/onnx/pkg/ruvector_onnx_embeddings_wasm.js (present)

Best regards,
Rob / Roble
