
feat: libscope eval — offline search quality evaluation (recall@k, MRR) #486

@RobertLD

Description

Summary

There is currently no way to measure whether a libscope knowledge base is actually answering questions well. Adding a libscope eval CLI command lets users run a ground-truth evaluation against their indexed content and get recall, MRR, and per-document hit rate — without any external service.

Why

  • RAG quality is a black box for most users. They don't know if their chunking strategy, embedding model, or search parameters are working.
  • Few, if any, local knowledge base tools in the ecosystem ship offline evaluation tooling; it's a genuine differentiator.
  • libscope already has ratings infrastructure (src/core/ratings.ts) and search analytics (src/core/analytics.ts) — eval builds naturally on top of these.
  • A libscope eval command would also make the semantic chunking and reranking features (see related issues) verifiable — users can see before/after quality numbers when changing configuration.

Proposed design

Input format

A JSONL file where each line is an evaluation example:

{"query": "how do I configure the embedding provider?", "relevant": ["chunk-abc123", "chunk-def456"]}
{"query": "what databases does libscope support?", "relevant": ["chunk-xyz789"]}

relevant is a list of chunk IDs that should appear in the top-k results. These can be obtained by running a search, finding the right chunks manually, and saving them — or by using the existing rate command to mark good results.
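For concreteness, here is a minimal TypeScript sketch of parsing this format. The `EvalExample` shape and `parseEvalFile` helper are illustrative, not existing libscope code:

```typescript
// Illustrative shape for one JSONL line; not an existing libscope type.
interface EvalExample {
  query: string;      // the search query to run
  relevant: string[]; // chunk IDs expected in the top-k results
}

// Parse a JSONL eval file, skipping blank lines and validating each record.
function parseEvalFile(text: string): EvalExample[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line, i) => {
      const obj = JSON.parse(line);
      if (typeof obj.query !== "string" || !Array.isArray(obj.relevant)) {
        throw new Error(`record ${i + 1}: expected {"query": string, "relevant": string[]}`);
      }
      return obj as EvalExample;
    });
}
```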

CLI command

libscope eval <eval-file.jsonl> [options]

Options:
  --k <numbers>        k values to evaluate at (default: 1,5,10)
  --method <method>    Search method: hybrid|vector|fts5 (default: hybrid)
  --rerank             Enable reranking (if configured)
  --output <file>      Write JSON results to file
  --workspace <name>   Workspace to evaluate against

Output

Evaluation: my-eval.jsonl (42 queries)
Method: hybrid  k: 1, 5, 10

Metric          @1       @5       @10
──────────────────────────────────────
Recall          0.524    0.786    0.857
Precision       0.524    0.214    0.114
MRR             0.612    —        —
NDCG            0.524    0.698    0.724
Hit Rate        52.4%    78.6%    85.7%

Worst performers (lowest MRR):
  "how to add a custom connector?" → no hits in top 10
  "what is the default chunk size?" → first hit at rank 8
  "supported file formats" → first hit at rank 4

Run with --output results.json to save full per-query breakdown.
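One possible shape for a per-query record in that JSON file, sketched as a TypeScript interface. All field names here are a proposal, not a finalized schema, and the chunk IDs are made up:

```typescript
// Sketch of one entry in the --output per-query breakdown; illustrative only.
interface PerQueryResult {
  query: string;
  relevantIds: string[];       // ground-truth chunk IDs from the eval file
  returnedIds: string[];       // chunk IDs returned by search, best first
  firstHitRank: number | null; // 1-based rank of first relevant hit, null if none
  reciprocalRank: number;      // 1 / firstHitRank, or 0 if no hit
}

// Example record matching the "first hit at rank 4" case above (IDs invented).
const example: PerQueryResult = {
  query: "supported file formats",
  relevantIds: ["chunk-aaa111"],
  returnedIds: ["chunk-b", "chunk-c", "chunk-d", "chunk-aaa111"],
  firstHitRank: 4,
  reciprocalRank: 0.25,
};
```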

Metrics

  • Recall@k: fraction of relevant chunks found in top k
  • Precision@k: fraction of top k results that are relevant
  • MRR (Mean Reciprocal Rank): average of 1/rank of first relevant result
  • NDCG@k: normalized discounted cumulative gain
  • Hit Rate@k: fraction of queries with at least one relevant result in top k
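The metric definitions above can be sketched in TypeScript. Binary relevance is assumed for NDCG, since the proposed JSONL format carries no graded judgments (with binary relevance, NDCG@1 collapses to Precision@1):

```typescript
// Per-query inputs for metric computation; a sketch, not existing libscope code.
interface QueryResult {
  relevant: Set<string>; // ground-truth chunk IDs
  ranked: string[];      // chunk IDs returned by search, best first
}

// Recall@k: fraction of relevant chunks found in the top k.
function recallAtK(r: QueryResult, k: number): number {
  const hits = r.ranked.slice(0, k).filter((id) => r.relevant.has(id)).length;
  return r.relevant.size === 0 ? 0 : hits / r.relevant.size;
}

// Precision@k: fraction of the top k slots occupied by relevant chunks.
function precisionAtK(r: QueryResult, k: number): number {
  const hits = r.ranked.slice(0, k).filter((id) => r.relevant.has(id)).length;
  return hits / k;
}

// Reciprocal rank: 1 / rank of the first relevant result, 0 if none.
function reciprocalRank(r: QueryResult): number {
  const idx = r.ranked.findIndex((id) => r.relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// NDCG@k with binary gains: each relevant hit at 0-based position i
// contributes 1 / log2(i + 2); normalize by the ideal ordering.
function ndcgAtK(r: QueryResult, k: number): number {
  const dcg = r.ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (r.relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(r.relevant.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

MRR and Hit Rate@k are then just the mean of `reciprocalRank` and the mean of "did `reciprocalRank` exceed 0 within the top k" across all queries.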

Helper: generate eval set from ratings

libscope eval generate-from-ratings --min-rating 4 --output eval.jsonl

Converts existing 4–5 star ratings into an eval JSONL, bootstrapping evaluation without manual annotation.
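One possible shape for that conversion, grouping highly rated chunks by query. The `Rating` record below is a guess for illustration; the real schema lives in src/core/ratings.ts:

```typescript
// Hypothetical rating record; the actual fields in src/core/ratings.ts may differ.
interface Rating {
  query: string;
  chunkId: string;
  stars: number; // 1-5
}

// Group chunks rated at or above minRating by query and emit eval JSONL.
function ratingsToEvalSet(ratings: Rating[], minRating = 4): string {
  const byQuery = new Map<string, Set<string>>();
  for (const r of ratings) {
    if (r.stars < minRating) continue;
    if (!byQuery.has(r.query)) byQuery.set(r.query, new Set());
    byQuery.get(r.query)!.add(r.chunkId);
  }
  return [...byQuery.entries()]
    .map(([query, ids]) => JSON.stringify({ query, relevant: [...ids] }))
    .join("\n");
}
```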

Implementation

  • New command: src/cli/commands/eval.ts
  • Eval runner: src/core/eval.ts — takes eval set, runs searches, computes metrics
  • No new database tables required — reads from existing search and ratings infrastructure
  • Results are purely in-memory / file output (no persistence)
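A minimal sketch of the runner loop that would live in src/core/eval.ts, shown here for MRR only. The `SearchFn` signature is an assumption; the issue does not specify libscope's internal search API:

```typescript
// Assumed search signature: returns ranked chunk IDs for a query.
type SearchFn = (query: string, k: number) => Promise<string[]>;

interface Example {
  query: string;
  relevant: string[]; // ground-truth chunk IDs
}

// Run each query, find the rank of the first relevant hit, average 1/rank.
async function meanReciprocalRank(
  examples: Example[],
  search: SearchFn,
  k: number
): Promise<number> {
  let total = 0;
  for (const ex of examples) {
    const ranked = await search(ex.query, k);
    const idx = ranked.findIndex((id) => ex.relevant.includes(id));
    total += idx === -1 ? 0 : 1 / (idx + 1);
  }
  return examples.length === 0 ? 0 : total / examples.length;
}
```

The other metrics would aggregate over the same per-query search results, so each query is searched once per run regardless of how many metrics are reported.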

Acceptance criteria

  • libscope eval <file> command runs evaluation and prints metric table
  • Supports --k, --method, --output flags
  • Metrics computed: Recall@k, Precision@k, MRR, NDCG@k, Hit Rate@k
  • "Worst performers" section highlights queries with no hits or low rank
  • libscope eval generate-from-ratings generates an eval JSONL from existing ratings
  • --output writes full per-query breakdown as JSON
  • Unit tests for metric computation (recall, MRR, NDCG) with known fixtures
  • Integration test: eval against a seeded test DB with known ground truth
  • Help text documents the JSONL format with examples

Related issues

  • Semantic chunking (eval can measure chunking strategy impact)
  • Cross-encoder reranking (eval can measure reranking quality gain)

Metadata

Labels: enhancement (New feature or request)