Bug: embed() always returns hash vectors — breaks cosine similarity when ONNX is active

## Summary

`embed()` (sync) always returns hash-based vectors regardless of ONNX availability, while `embedAsync()` returns true ONNX semantic vectors when ONNX is active. `remember()` uses `embedAsync()`, but many natural usage patterns call `embed()` directly, so the persisted DB ends up with mixed embedding spaces and different dimensions (128/256 hash vs 384 ONNX). `recall()` queries with `embedAsync()`, so cosine similarity against stored vectors of a different dimension yields meaningless scores or fails silently.

## Root Cause

`embed()` has a code path that appears to handle ONNX but actually just calls `hashEmbed()` unconditionally:

```js
embed(text) {
    const dim = this.config.embeddingDim;
    if (this.onnxReady && this.onnxEmbedder) {
        try {
            // Note: This is sync wrapper for async ONNX
            // For full async, use embedAsync
            return this.hashEmbed(text, dim); // ← always hash, ONNX never used
        } catch { /* fall through */ }
    }
    return this.hashEmbed(text, dim);
}
```

The comment documents the intent (sync ONNX wrapper) but the implementation never calls ONNX. The result: `embed()` is always a hash function, with no warning or error.

## Consequences

1. **Dimension mismatch**: `embed()` returns `embeddingDim`-dimensional hash vectors (default 128 or 256). `embedAsync()` with ONNX returns 384-dim vectors (`all-MiniLM-L6-v2`). Comparing a 384-dim query against a 128-dim stored vector in `cosineSimilarity()` cannot produce a meaningful score: the implementation guards on `a.length !== b.length` and silently returns 0.

2. **Mixed embedding spaces**: Any code that pre-computes embeddings via `engine.embed()` and stores them directly (e.g. bulk import pipelines) will store hash vectors. `recall()` queries with ONNX vectors. The two spaces are incomparable — no meaningful similarity can be computed.

3. **Silent quality regression**: Users who configure ONNX expecting semantic recall get hash-quality recall with no indication anything is wrong. Scores appear normal (0.6–0.75 range for hash similarity) but do not reflect semantic meaning.
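
To make the first consequence concrete, here is a minimal standalone sketch of a cosine similarity with the same kind of length guard. It is a hypothetical stand-in written for illustration, not the actual ruvector source, and the constant-filled vectors only mimic the two dimensions involved.

```js
// Hypothetical stand-in for the library's cosineSimilarity(); not the ruvector source.
function cosineSimilarity(a, b) {
    if (a.length !== b.length) return 0; // silent guard, as described above
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    const denom = Math.sqrt(normA) * Math.sqrt(normB);
    return denom === 0 ? 0 : dot / denom;
}

// Placeholder vectors: only the lengths matter for this demonstration.
const stored = new Float32Array(128).fill(0.1); // stands in for a hash vector from embed()
const query = new Float32Array(384).fill(0.1);  // stands in for an ONNX vector from embedAsync()

const mismatchScore = cosineSimilarity(query, stored); // guard triggers, score is 0
const sameSpaceScore = cosineSimilarity(stored, stored); // identical vectors, score is ~1
console.log(mismatchScore, sameSpaceScore);
```

With a guard like this, every stored hash vector scores 0 against an ONNX query, so affected memories simply drop out of recall without any error.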

## Reproduction

```js
const engine = new IntelligenceEngine({ embeddingDim: 128 });
// If ONNX initialises:
const syncVec  = engine.embed('knowledge synthesis framework');      // 128-dim hash
const asyncVec = await engine.embedAsync('knowledge synthesis framework'); // 384-dim ONNX (if active)
console.log(syncVec.length, asyncVec.length); // 128, 384 — incompatible
```

## Suggested Fix

**Option A — Document and enforce the distinction:**
Rename `embed()` to `hashEmbed()` as the public API (breaking), or add a clear deprecation warning when ONNX is active and `embed()` is called. Update all internal usages in `import()` and any bulk-write paths to use `embedAsync()`.
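
A rough sketch of the warning variant of Option A, assuming a one-time warning is acceptable. `EngineSketch`, the `warnedSyncEmbed` flag, and the toy `hashEmbed()` below are all hypothetical and exist only to make the example self-contained; the real engine's hashing and constructor differ.

```js
// Hypothetical sketch of Option A; class and field names are illustrative only.
class EngineSketch {
    constructor({ embeddingDim = 128, onnxReady = false } = {}) {
        this.config = { embeddingDim };
        this.onnxReady = onnxReady;
        this.warnedSyncEmbed = false; // assumed flag so the warning fires only once
    }

    // Toy bucket-count hash, NOT the library's hashEmbed(); it only produces a dim-length vector.
    hashEmbed(text, dim) {
        const vec = new Float32Array(dim);
        for (let i = 0; i < text.length; i++) {
            vec[text.charCodeAt(i) % dim] += 1;
        }
        return vec;
    }

    embed(text) {
        if (this.onnxReady && !this.warnedSyncEmbed) {
            console.warn('[ruvector] embed() returns hash vectors even though ONNX is active; use embedAsync() for semantic vectors.');
            this.warnedSyncEmbed = true;
        }
        return this.hashEmbed(text, this.config.embeddingDim);
    }
}
```

Warning once per engine instance keeps bulk pipelines from flooding the log while still surfacing the mismatch.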

**Option B — Consistent embedding on write:**
Ensure `import()` re-embeds stored memories using `embedAsync()` when ONNX becomes available, or store an `embeddingModel` tag on each memory entry so mismatches can be detected and flagged at recall time.
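
One possible shape for the `embeddingModel` tag, as a sketch only. The helper names, the `'hash-128'` tag string, and the `{ name, dim }` descriptor are assumptions made for illustration and are not part of the current ruvector API.

```js
// Hypothetical helpers for Option B; names and tag strings are illustrative assumptions.
function makeMemory(content, embedding, embeddingModel) {
    return { content, embedding, embeddingModel };
}

// Split stored memories into those comparable with the active model and those needing re-embedding.
function splitByEmbeddingModel(memories, activeModel) {
    const comparable = [];
    const stale = [];
    for (const memory of memories) {
        const sameModel = memory.embeddingModel === activeModel.name;
        const sameDim = memory.embedding.length === activeModel.dim;
        (sameModel && sameDim ? comparable : stale).push(memory);
    }
    return { comparable, stale };
}

const memories = [
    makeMemory('imported note', new Float32Array(128), 'hash-128'),
    makeMemory('remembered note', new Float32Array(384), 'all-MiniLM-L6-v2'),
];
const { comparable, stale } = splitByEmbeddingModel(memories, { name: 'all-MiniLM-L6-v2', dim: 384 });
console.log(comparable.length, stale.map((m) => m.content));
```

Entries that land in `stale` could be re-embedded with `embedAsync()` or flagged at recall time, rather than silently scoring 0.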

**Option C — Dimension guard in `cosineSimilarity()`:**
At minimum, log a warning (not silent return) when `a.length !== b.length`:
```js
cosineSimilarity(a, b) {
    if (a.length !== b.length) {
        console.warn(`[ruvector] Embedding dimension mismatch: ${a.length} vs ${b.length}. Results unreliable.`);
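        return 0; // or truncate to the shorter length; either way, do not stay silent
    }
    // ... existing dot-product / norm computation continues here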
}
```

## Related

This compounds the issue in #315 (`import()` not rebuilding HNSW): even after fixing HNSW rebuild, if stored embeddings are hash-based and queries are ONNX-based, HNSW distances will be incorrect and brute-force cosine scores will be meaningless.

## Environment

- `ruvector@0.2.19`
- Node.js v20.19.5
- `dist/core/intelligence-engine.js` — `embed()` ~line 183, `embedAsync()` ~line 211, `remember()` ~line 336
- ONNX bundle: `dist/core/onnx/pkg/ruvector_onnx_embeddings_wasm.js` (present)

---

Best regards,
Rob / Roble
