
`engraph search` latency scales sharply with query length (1.4 s for a 2-word query → 32 s for a 50-word natural-language query) #34


Description

@GrinNoCat

Version: engraph 1.6.1 (prebuilt Linux x86_64 binary)
OS: Linux 6.17.0 (Ubuntu)
Embedder in use: all-MiniLM-L6-v2 (per engraph status)
Vault size: 723 files, 11,739 chunks, 40 MB SQLite index

Summary

engraph search runtime grows dramatically with input length. A short keyword query returns in ~1.4 s; the same vault queried with a 50-word natural-language sentence takes ~32 s — a >20× slowdown that's not explained by candidate-set size (top-N is identical, results overlap).

This blocks LLM-driven retrieval pipelines that pass the user's prompt directly to engraph search, because realistic user prompts are usually full sentences, not keyword strings.

Reproduction

$ engraph status
Vault:      /home/jeb/Vaults/Archives
Files:      723
Chunks:     11739
Edges:      4846
Index size: 40.1 MB
Model:      all-MiniLM-L6-v2

$ time engraph search "ternary quantization" --json -n 10 >/dev/null
real    0m1.422s

$ time engraph search "I have an interest Ternary computing and there's probably some additional videos that should be ingested into the vault.  Is Ternary quant still going to be GPU compatible in addition to loading into RAM for performance on both?" --json -n 10 >/dev/null
real    0m32.474s

Both queries return overlapping top-N candidates, so the work being done isn't proportionally more useful — it's just slower.

Hypothesis

Without source access I can only speculate, but the likely suspects are:

  • Embedder tokenization or pooling cost growing with input length (MiniLM is a 512-token-context model; anything past that limit is presumably being truncated or chunked oddly).
  • A lexical / BM25-style scoring lane doing per-term passes against the full chunk corpus.
  • Hybrid-fusion ranking allocating per-query-term work.

A keyword-only fast path (short-query bypass) or an input clamp would mask the symptom, but the underlying scaling is what's surprising.
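
A quick way to characterize the curve from the CLI alone (a rough sketch; the word list and counts are arbitrary, and it assumes GNU time at /usr/bin/time):

base="ternary quantization gpu ram video ingest compatible loading"
for n in 2 5 10 20 35 50; do
  # Build an n-word query by repeating a fixed phrase, so content
  # stays comparable while only length changes.
  q=$(printf '%s ' $base $base $base $base $base $base $base | cut -d' ' -f1-"$n")
  /usr/bin/time -f "$n words: %e s" engraph search "$q" --json -n 10 >/dev/null
done

If latency grows roughly linearly with word count, that points at a per-term scoring lane; a sharp cliff near the embedder's context limit would point at tokenization/chunking instead.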

Workaround we're using

Added a "Stage 0" distillation step in the calling pipeline that calls a local LLM (Ollama, gemma4:26b) to extract 5–10 keywords from long prompts before calling engraph search. Long-query end-to-end dropped from 30 s+ timeout to ~6 s, and recall actually improved (the keyword form retrieves notes the natural-language form missed).
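
Trimmed to its essentials, the hook looks like this (a sketch rather than our exact code; it assumes a local Ollama daemon on the default port and jq on PATH):

#!/usr/bin/env sh
# Stage 0: distill a long natural-language prompt into keywords with a
# local Ollama model, then pass only the keywords to engraph search.
prompt="$1"
keywords=$(curl -s http://localhost:11434/api/generate \
  -d "$(jq -n --arg p "$prompt" \
        '{model: "gemma4:26b", stream: false,
          prompt: ("Extract 5-10 search keywords, space-separated, no prose:\n\n" + $p)}')" \
  | jq -r '.response')
engraph search "$keywords" --json -n 10

Running the distiller locally keeps the whole pipeline offline, and even with the extra LLM call it nets out far faster than the raw long-query search.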

Happy to share the full hook if useful.

Minor observation: engraph status Model field

After engraph configure --model embed hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf and a full engraph index --rebuild (19.9 min, all 721 files re-embedded), engraph status still reports Model: all-MiniLM-L6-v2. The [models] config block contains the new URI, and the rebuild took the expected wall-clock time for re-embedding everything, so I assume the display field is hardcoded or not wired to read the active config. Low impact, but confusing during diagnosis.

Asks

  1. Is the long-query scaling expected, or does it look like a bug?
  2. Would an input-length clamp / warning be reasonable at the CLI layer?
  3. Is the engraph status "Model:" field expected to reflect the configured embedder, or is it intentionally a static label?
