
feat: libscope eval — offline search quality evaluation (recall@k, MRR) #486

@RobertLD

Description

Summary

There is currently no way to measure whether a libscope knowledge base is actually answering questions well. Adding a libscope eval CLI command lets users run a ground-truth evaluation against their indexed content and get recall, MRR, and per-document hit rate — without any external service.

Why

  • RAG quality is a black box for most users. They don't know if their chunking strategy, embedding model, or search parameters are working.
  • Few, if any, local knowledge base tools in the ecosystem ship offline evaluation tooling; it's a genuine differentiator.
  • libscope already has ratings infrastructure (src/core/ratings.ts) and search analytics (src/core/analytics.ts) — eval builds naturally on top of these.
  • A libscope eval command would also make the semantic chunking and reranking features (see related issues) verifiable — users can see before/after quality numbers when changing configuration.

Proposed design

Input format

A JSONL file where each line is an evaluation example:

{"query": "how do I configure the embedding provider?", "relevant": ["chunk-abc123", "chunk-def456"]}
{"query": "what databases does libscope support?", "relevant": ["chunk-xyz789"]}

relevant is a list of chunk IDs that should appear in the top-k results. These can be obtained by running a search, finding the right chunks manually, and saving them — or by using the existing rate command to mark good results.
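For concreteness, here is a minimal TypeScript sketch of parsing this format. The `EvalExample` shape and `parseEvalFile` helper are illustrative, not existing libscope code:

```typescript
// Illustrative shape for one JSONL line; not an existing libscope type.
interface EvalExample {
  query: string;      // the search query to run
  relevant: string[]; // chunk IDs expected in the top-k results
}

// Parse a JSONL eval file, skipping blank lines and validating each record.
function parseEvalFile(text: string): EvalExample[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line, i) => {
      const obj = JSON.parse(line);
      if (typeof obj.query !== "string" || !Array.isArray(obj.relevant)) {
        throw new Error(`record ${i + 1}: expected {"query": string, "relevant": string[]}`);
      }
      return obj as EvalExample;
    });
}
```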

CLI command

libscope eval <eval-file.jsonl> [options]

Options:
  --k <numbers>        k values to evaluate at (default: 1,5,10)
  --method <method>    Search method: hybrid|vector|fts5 (default: hybrid)
  --rerank             Enable reranking (if configured)
  --output <file>      Write JSON results to file
  --workspace <name>   Workspace to evaluate against

Output

Evaluation: my-eval.jsonl (42 queries)
Method: hybrid  k: 1, 5, 10

Metric          @1       @5       @10
──────────────────────────────────────
Recall          0.524    0.786    0.857
Precision       0.524    0.214    0.114
MRR             0.612    —        —
NDCG            0.524    0.698    0.724
Hit Rate        52.4%    78.6%    85.7%

Worst performers (lowest MRR):
  "how to add a custom connector?" → no hits in top 10
  "what is the default chunk size?" → first hit at rank 8
  "supported file formats" → first hit at rank 4

Run with --output results.json to save full per-query breakdown.
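One possible shape for a per-query record in that JSON file, sketched as a TypeScript interface. All field names here are a proposal, not a finalized schema, and the chunk IDs are made up:

```typescript
// Sketch of one entry in the --output per-query breakdown; illustrative only.
interface PerQueryResult {
  query: string;
  relevantIds: string[];       // ground-truth chunk IDs from the eval file
  returnedIds: string[];       // chunk IDs returned by search, best first
  firstHitRank: number | null; // 1-based rank of first relevant hit, null if none
  reciprocalRank: number;      // 1 / firstHitRank, or 0 if no hit
}

// Example record matching the "first hit at rank 4" case above (IDs invented).
const example: PerQueryResult = {
  query: "supported file formats",
  relevantIds: ["chunk-aaa111"],
  returnedIds: ["chunk-b", "chunk-c", "chunk-d", "chunk-aaa111"],
  firstHitRank: 4,
  reciprocalRank: 0.25,
};
```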

Metrics

  • Recall@k: fraction of relevant chunks found in top k
  • Precision@k: fraction of top k results that are relevant
  • MRR (Mean Reciprocal Rank): average of 1/rank of first relevant result
  • NDCG@k: normalized discounted cumulative gain
  • Hit Rate@k: fraction of queries with at least one relevant result in top k
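The metric definitions above can be sketched in TypeScript. Binary relevance is assumed for NDCG, since the proposed JSONL format carries no graded judgments (with binary relevance, NDCG@1 collapses to Precision@1):

```typescript
// Per-query inputs for metric computation; a sketch, not existing libscope code.
interface QueryResult {
  relevant: Set<string>; // ground-truth chunk IDs
  ranked: string[];      // chunk IDs returned by search, best first
}

// Recall@k: fraction of relevant chunks found in the top k.
function recallAtK(r: QueryResult, k: number): number {
  const hits = r.ranked.slice(0, k).filter((id) => r.relevant.has(id)).length;
  return r.relevant.size === 0 ? 0 : hits / r.relevant.size;
}

// Precision@k: fraction of the top k slots occupied by relevant chunks.
function precisionAtK(r: QueryResult, k: number): number {
  const hits = r.ranked.slice(0, k).filter((id) => r.relevant.has(id)).length;
  return hits / k;
}

// Reciprocal rank: 1 / rank of the first relevant result, 0 if none.
function reciprocalRank(r: QueryResult): number {
  const idx = r.ranked.findIndex((id) => r.relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// NDCG@k with binary gains: each relevant hit at 0-based position i
// contributes 1 / log2(i + 2); normalize by the ideal ordering.
function ndcgAtK(r: QueryResult, k: number): number {
  const dcg = r.ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (r.relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(r.relevant.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

MRR and Hit Rate@k are then just the mean of `reciprocalRank` and the mean of "did `reciprocalRank` exceed 0 within the top k" across all queries.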

Helper: generate eval set from ratings

libscope eval generate-from-ratings --min-rating 4 --output eval.jsonl

Converts existing 4–5 star ratings into an eval JSONL, bootstrapping evaluation without manual annotation.
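One possible shape for that conversion, grouping highly rated chunks by query. The `Rating` record below is a guess for illustration; the real schema lives in src/core/ratings.ts:

```typescript
// Hypothetical rating record; the actual fields in src/core/ratings.ts may differ.
interface Rating {
  query: string;
  chunkId: string;
  stars: number; // 1-5
}

// Group chunks rated at or above minRating by query and emit eval JSONL.
function ratingsToEvalSet(ratings: Rating[], minRating = 4): string {
  const byQuery = new Map<string, Set<string>>();
  for (const r of ratings) {
    if (r.stars < minRating) continue;
    if (!byQuery.has(r.query)) byQuery.set(r.query, new Set());
    byQuery.get(r.query)!.add(r.chunkId);
  }
  return [...byQuery.entries()]
    .map(([query, ids]) => JSON.stringify({ query, relevant: [...ids] }))
    .join("\n");
}
```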

Implementation

  • New command: src/cli/commands/eval.ts
  • Eval runner: src/core/eval.ts — takes eval set, runs searches, computes metrics
  • No new database tables required — reads from existing search and ratings infrastructure
  • Results are purely in-memory / file output (no persistence)
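A minimal sketch of the runner loop that would live in src/core/eval.ts, shown here for MRR only. The `SearchFn` signature is an assumption; the issue does not specify libscope's internal search API:

```typescript
// Assumed search signature: returns ranked chunk IDs for a query.
type SearchFn = (query: string, k: number) => Promise<string[]>;

interface Example {
  query: string;
  relevant: string[]; // ground-truth chunk IDs
}

// Run each query, find the rank of the first relevant hit, average 1/rank.
async function meanReciprocalRank(
  examples: Example[],
  search: SearchFn,
  k: number
): Promise<number> {
  let total = 0;
  for (const ex of examples) {
    const ranked = await search(ex.query, k);
    const idx = ranked.findIndex((id) => ex.relevant.includes(id));
    total += idx === -1 ? 0 : 1 / (idx + 1);
  }
  return examples.length === 0 ? 0 : total / examples.length;
}
```

The other metrics would aggregate over the same per-query search results, so each query is searched once per run regardless of how many metrics are reported.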

Acceptance criteria

  • libscope eval <file> command runs evaluation and prints metric table
  • Supports --k, --method, --output flags
  • Metrics computed: Recall@k, Precision@k, MRR, NDCG@k, Hit Rate@k
  • "Worst performers" section highlights queries with no hits or low rank
  • libscope eval generate-from-ratings generates an eval JSONL from existing ratings
  • --output writes full per-query breakdown as JSON
  • Unit tests for metric computation (recall, MRR, NDCG) with known fixtures
  • Integration test: eval against a seeded test DB with known ground truth
  • Help text documents the JSONL format with examples

Related issues

  • Semantic chunking (eval can measure chunking strategy impact)
  • Cross-encoder reranking (eval can measure reranking quality gain)

Metadata

Labels: enhancement (New feature or request)