Summary
There is currently no way to measure whether a libscope knowledge base is actually answering questions well. Adding a `libscope eval` CLI command lets users run a ground-truth evaluation against their indexed content and get recall, MRR, and hit-rate numbers — without any external service.
Why
- RAG quality is a black box for most users. They don't know if their chunking strategy, embedding model, or search parameters are working.
- No local knowledge base tool in the ecosystem ships offline evaluation tooling. It's a genuine differentiator.
- libscope already has ratings infrastructure (`src/core/ratings.ts`) and search analytics (`src/core/analytics.ts`) — eval builds naturally on top of these.
- A `libscope eval` command would also make the semantic chunking and reranking features (see related issues) verifiable — users can see before/after quality numbers when changing configuration.
Proposed design
Input format
A JSONL file where each line is an evaluation example:
```
{"query": "how do I configure the embedding provider?", "relevant": ["chunk-abc123", "chunk-def456"]}
{"query": "what databases does libscope support?", "relevant": ["chunk-xyz789"]}
```
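For illustration, loading and validating this format could look like the sketch below. The `EvalExample` type and function names are hypothetical, not part of libscope:

```typescript
// Hypothetical loader for the eval JSONL format described above.
interface EvalExample {
  query: string;
  relevant: string[]; // chunk IDs expected in the top-k results
}

function parseEvalLine(line: string, lineNo: number): EvalExample {
  const obj = JSON.parse(line);
  if (typeof obj.query !== "string" || obj.query.length === 0) {
    throw new Error(`line ${lineNo}: "query" must be a non-empty string`);
  }
  if (!Array.isArray(obj.relevant) || obj.relevant.some((id: unknown) => typeof id !== "string")) {
    throw new Error(`line ${lineNo}: "relevant" must be an array of chunk IDs`);
  }
  return { query: obj.query, relevant: obj.relevant };
}

function parseEvalFile(text: string): EvalExample[] {
  return text
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0) // blank lines are skipped
    .map((l, i) => parseEvalLine(l, i + 1)); // index among non-empty lines
}
```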
`relevant` is a list of chunk IDs that should appear in the top-k results. These can be obtained by running a search, finding the right chunks manually, and saving their IDs — or by using the existing `rate` command to mark good results.
CLI command
```
libscope eval <eval-file.jsonl> [options]

Options:
  --k <numbers>        k values to evaluate at (default: 1,5,10)
  --method <method>    Search method: hybrid | vector | fts5 (default: hybrid)
  --rerank             Enable reranking (if configured)
  --output <file>      Write JSON results to file
  --workspace <name>   Workspace to evaluate against
```
Output
```
Evaluation: my-eval.jsonl (42 queries)
Method: hybrid    k: 1, 5, 10

Metric       @1      @5      @10
──────────────────────────────────
Recall       0.524   0.786   0.857
Precision    0.524   0.214   0.114
MRR          0.612   —       —
NDCG         0.612   0.698   0.724
Hit Rate     52.4%   78.6%   85.7%

Worst performers (lowest MRR):
  "how to add a custom connector?"   → no hits in top 10
  "what is the default chunk size?"  → first hit at rank 8
  "supported file formats"           → first hit at rank 4
```
Run with `--output results.json` to save the full per-query breakdown.
Metrics
- Recall@k: fraction of relevant chunks found in top k
- Precision@k: fraction of top k results that are relevant
- MRR (Mean Reciprocal Rank): average of 1/rank of first relevant result
- NDCG@k: normalized discounted cumulative gain
- Hit Rate@k: fraction of queries with at least one relevant result in top k
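The per-query math behind these metrics can be sketched as follows, assuming binary relevance (a chunk is either relevant or not). `ranked` is the ordered list of chunk IDs a search returned; `relevant` is the ground-truth set; MRR and Hit Rate come from averaging `reciprocalRank` and `hitRate` across queries:

```typescript
// Sketch of per-query metric computation with binary relevance.
function metricsAtK(ranked: string[], relevant: string[], k: number) {
  const rel = new Set(relevant);
  const topK = ranked.slice(0, k);
  const hits = topK.filter((id) => rel.has(id)).length;

  // DCG with binary gains: sum of 1/log2(rank+1) over relevant results in top k.
  let dcg = 0;
  topK.forEach((id, i) => { if (rel.has(id)) dcg += 1 / Math.log2(i + 2); });
  // Ideal DCG: all relevant chunks ranked first (capped at k).
  let idcg = 0;
  for (let i = 0; i < Math.min(rel.size, k); i++) idcg += 1 / Math.log2(i + 2);

  // Rank of the first relevant result over the full returned list.
  const firstHit = ranked.findIndex((id) => rel.has(id));

  return {
    recall: hits / rel.size,          // Recall@k
    precision: hits / k,              // Precision@k
    hitRate: hits > 0 ? 1 : 0,        // Hit Rate@k (per query: 0 or 1)
    ndcg: idcg > 0 ? dcg / idcg : 0,  // NDCG@k
    reciprocalRank: firstHit >= 0 ? 1 / (firstHit + 1) : 0, // feeds MRR
  };
}
```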
Helper: generate eval set from ratings
```
libscope eval generate-from-ratings --min-rating 4 --output eval.jsonl
```
Converts existing 4–5 star ratings into an eval JSONL, bootstrapping evaluation without manual annotation.
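A sketch of that conversion is below. The `Rating` shape is an assumption for illustration — the real schema lives in `src/core/ratings.ts`:

```typescript
// Hypothetical ratings → eval-set conversion; Rating fields are assumed.
interface Rating { query: string; chunkId: string; rating: number; }

function ratingsToEvalJsonl(ratings: Rating[], minRating = 4): string {
  // Group highly rated chunk IDs by query, deduplicating repeats.
  const byQuery = new Map<string, Set<string>>();
  for (const r of ratings) {
    if (r.rating < minRating) continue;
    if (!byQuery.has(r.query)) byQuery.set(r.query, new Set());
    byQuery.get(r.query)!.add(r.chunkId);
  }
  // Emit one JSONL line per query.
  return [...byQuery.entries()]
    .map(([query, ids]) => JSON.stringify({ query, relevant: [...ids] }))
    .join("\n");
}
```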
Implementation
- New command: `src/cli/commands/eval.ts`
- Eval runner: `src/core/eval.ts` — takes an eval set, runs searches, computes metrics
- No new database tables required — reads from existing search and ratings infrastructure
- Results are purely in-memory / file output (no persistence)
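The runner loop in `src/core/eval.ts` could be as small as the sketch below. `search` stands in for libscope's real search (which presumably is async and takes more parameters); it is shown synchronous here for brevity:

```typescript
// Minimal sketch of an eval runner loop; `search` is a stand-in.
interface RunResult { mrr: number; hitRate: number; }

function runEval(
  examples: { query: string; relevant: string[] }[],
  search: (query: string, k: number) => string[],
  k: number,
): RunResult {
  let rrSum = 0;
  let hitSum = 0;
  for (const ex of examples) {
    const ranked = search(ex.query, k);
    const rel = new Set(ex.relevant);
    const first = ranked.findIndex((id) => rel.has(id));
    if (first >= 0) {
      rrSum += 1 / (first + 1); // reciprocal rank of first relevant hit
      hitSum += 1;              // this query counts toward Hit Rate
    }
  }
  return { mrr: rrSum / examples.length, hitRate: hitSum / examples.length };
}
```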
Acceptance criteria
- `libscope eval <file>` command runs evaluation and prints the metric table
- `--k`, `--method`, and `--output` flags are supported
- `libscope eval generate-from-ratings` generates an eval JSONL from existing ratings
- `--output` writes the full per-query breakdown as JSON
Related issues
- Semantic chunking (eval can measure chunking strategy impact)
- Cross-encoder reranking (eval can measure reranking quality gain)