CodeWell Evaluation

This document defines how to check whether CodeWell is useful, not just whether its tests pass.

Evaluation Goal

CodeWell should reduce the work an AI coding agent or developer spends finding repository context and re-solving known adaptation failures.

The core question is:

Does CodeWell help retrieve the right code context and verified revision memory faster than ad hoc file search alone?

What The Baseline Proves

The bundled fixture baseline is intentionally small. It proves that the local loop works end to end:

Index a Python project.
Find relevant files and symbols for a task query.
Trace a symbol to direct callers and callees.
Build a context pack within a token budget.
Record and recall a verified revision memory item.

It does not prove real-world usefulness by itself. Real usefulness requires running the same criteria against larger projects and comparing agent or developer behavior with and without CodeWell.

Run The Fixture Baseline

python scripts/evaluate_fixture.py

Expected result:

the command exits with status 0
the JSON output has "failed": 0
every check has "passed": true
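
The three conditions above can also be checked by a small helper. This is a minimal sketch, not part of the repository: it assumes the report is a JSON object with a top-level "failed" count and a list of checks under a "checks" key. The "checks" and "name" keys and the function name baseline_passed are assumptions made for illustration.

```python
import json


def baseline_passed(report_text: str, exit_status: int) -> bool:
    """Return True when a report meets all three expected-result conditions."""
    report = json.loads(report_text)
    # Assumed layout: {"failed": int, "checks": [{"name": str, "passed": bool}, ...]}
    checks = report.get("checks", [])
    return (
        exit_status == 0
        and report.get("failed") == 0
        and all(check.get("passed") is True for check in checks)
    )


# Hand-written sample report, used only to show the expected shape.
sample = json.dumps(
    {"failed": 0, "checks": [{"name": "index_fixture", "passed": True}]}
)
print(baseline_passed(sample, exit_status=0))  # prints True
```

Checking the exit status, the failure count, and each check separately guards against a report that is internally inconsistent, for example a zero failure count next to a check marked as not passed.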

Use --keep to keep the temporary workspace for manual inspection:

python scripts/evaluate_fixture.py --keep

Run A Project Task Evaluation

Use evaluate_project.py when you have a local project and a JSON task list with expected files, symbols, and optional trace relationships:

python scripts/evaluate_project.py /path/to/python-project tasks.json

The repository includes a narrow self-evaluation task list for CodeWell itself:

python scripts/evaluate_project.py . evaluations/codewell_self.json

This baseline uses explicit symbol queries to verify the stable local loop. A second maintained baseline uses natural-language task queries:

python scripts/evaluate_project.py . evaluations/codewell_natural.json

The natural-language baseline is still small, but it verifies that query normalization and reranking can find implementation files from user-style questions rather than exact symbol names only.

A broader acceptance task list covers release-readiness paths such as CLI dispatch, GitHub repository ingest, schema/status reporting, Python AST parsing, and the real-project smoke script:

python scripts/evaluate_project.py . evaluations/codewell_acceptance.json

To run the same evaluator across multiple local projects, create a manifest based on evaluations/real_projects/manifest.example.json and run:

python scripts/evaluate_real_projects.py path/to/manifest.json --output-dir reports/real-projects

Task files can start from evaluations/real_projects/task_template.json. Do not mark a real-project evaluation as complete unless the expected files, symbols, and trace relationships were established before running CodeWell.

The first maintained open-source evaluation uses pinned PyPI sources for Click, Requests, and Rich:

python scripts/evaluate_real_projects.py evaluations/real_projects/pypi_manifest.json --output-dir reports/real-projects

The repository also includes a maintained local GitHub-backed manifest for five real TS/JS projects:

python scripts/evaluate_real_projects.py evaluations/real_projects/github_manifest.json --output-dir reports/github-real-projects

Current GitHub-backed targets are:

express-typescript-prisma-postgresql
bulletproof-react (apps/react-vite)
nestjs-clean-architecture
nestjs-boilerplate
adonisjs-starter-kits (api-monorepo/apps/backend)

See docs/REAL_PROJECT_EVALUATION.md for setup commands, current results, and known misses.

Before publishing or tagging a release, run the complete local quality gate:

python scripts/check_release.py

The release gate includes a package smoke check that builds a wheel, installs it into a temporary virtual environment, and runs the installed codewell CLI against the fixture project:

python scripts/check_package.py

Bundled language fixtures also include TypeScript coverage for:

basic function/import flows
barrel and re-export flows
route-entry expansion through app.ts -> routes.ts -> service.ts
command-entry expansion through cli.ts -> service.ts
test-entry expansion through tests/service.spec.ts -> src/service.ts
named re-export chains such as export { x as y } from ...
object methods, namespace/module scopes, and decorator-prefixed methods
class-field arrow methods and object-property arrow functions
class-local this. and super. call relationships

To run the maintained bundled multi-project fixture suite, including the current TypeScript task set, use:

python scripts/evaluate_real_projects.py evaluations/fixture_suite_manifest.json --output-dir reports/fixture-suite

This fixture suite is still local and curated, not a substitute for broader external real-project evaluation. It exists so TS/JS retrieval changes can be judged by task-level reports instead of parser-only unit tests.

The GitHub-backed manifest goes one step further than the fixture suite by exercising maintained real project structures. It covers backend route/controller/service/repository relationships and frontend route/form/api/test/protected-route relationships. It also includes a NestJS clean-architecture sample with controller/use-case/factory/CRM/e2e flows; a broader NestJS boilerplate sample with auth controllers and services, admin guard wiring, a file-upload interceptor flow, and global bootstrap pipeline/interceptor configuration; and an AdonisJS backend sample with controller-validator flows, auth middleware, and kernel middleware registration.

By default, evaluate_project.py evaluates a temporary copy of the project so that it does not write .codewell into the original. Pass --in-place only when you want to index the original project, or --keep to preserve the temporary workspace for inspection.
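
The copy-then-evaluate pattern described above can be sketched with the standard library. This is an illustration of the general idea under stated assumptions, not the script's actual implementation: the helper name temporary_copy, the "codewell-eval-" prefix, and the keep parameter (which mirrors the --keep flag) are all hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_copy(project_dir: str, keep: bool = False):
    """Yield a throwaway copy of project_dir so index files never touch the original."""
    temp_root = Path(tempfile.mkdtemp(prefix="codewell-eval-"))
    workspace = temp_root / "project"
    shutil.copytree(project_dir, workspace)
    try:
        yield workspace
    finally:
        # Remove the copy unless the caller asked to keep it for inspection.
        if not keep:
            shutil.rmtree(temp_root, ignore_errors=True)
```

A caller would wrap the evaluation in `with temporary_copy("/path/to/python-project") as workspace:` and index `workspace` instead of the original directory.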

Minimal task file:

{
  "tasks": [
    {
      "name": "login_flow",
      "query": "login",
      "context_budget": 800,
      "expected_files": ["app.py", "auth.py"],
      "expected_symbols": ["handle_login", "login_user"],
      "trace_symbols": [
        {
          "symbol": "login_user",
          "expected_incoming": [
            {"caller": "handle_login", "callee": "login_user"}
          ],
          "expected_outgoing": [
            {"caller": "login_user", "callee": "bool"}
          ]
        }
      ]
    }
  ]
}
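
Before running an evaluation, it can help to check a task file for structural problems. The sketch below is a hypothetical validator, not a tool shipped with CodeWell. It treats the five top-level task keys from the example above as required and trace_symbols as optional; that is an assumption based on the example, not a documented schema.

```python
import json

# Assumed required keys, taken from the minimal task file example.
REQUIRED_TASK_KEYS = (
    "name", "query", "context_budget", "expected_files", "expected_symbols"
)


def validate_task_file(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    data = json.loads(text)
    tasks = data.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        return ["'tasks' must be a non-empty list"]
    problems = []
    for index, task in enumerate(tasks):
        label = task.get("name", f"task #{index}")
        for key in REQUIRED_TASK_KEYS:
            if key not in task:
                problems.append(f"{label}: missing '{key}'")
        # trace_symbols is optional; check only the entries that are present.
        for trace in task.get("trace_symbols", []):
            if "symbol" not in trace:
                problems.append(f"{label}: trace entry missing 'symbol'")
            for direction in ("expected_incoming", "expected_outgoing"):
                for edge in trace.get(direction, []):
                    if not set(edge) >= {"caller", "callee"}:
                        problems.append(
                            f"{label}: {direction} edge needs 'caller' and 'callee'"
                        )
    return problems
```

Returning a list of problems instead of raising on the first one lets a task author fix a whole file in one pass.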

The script reports pass/fail checks per task:

search includes expected files
context pack includes expected files
context pack includes expected symbols
context pack fits the token budget
symbol traces include expected incoming and outgoing calls

It also emits machine-readable metrics at the report level and per task:

search_file_recall and search_file_precision
context_file_recall and context_file_precision
context_symbol_recall and context_symbol_precision
trace_call_recall
budget_fit
index_latency_seconds and task elapsed_seconds

Precision is calculated against the expected lists in the task file. Treat it as an evaluation signal, not a full relevance judgment, because a task file may omit files that are genuinely useful.
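
To make the caveat concrete, here is a minimal sketch of recall and precision computed against an expected list. The function name and the empty-list conventions are assumptions for illustration, and the evaluator's own calculation may differ in detail. The file names utils.py and db.py are hypothetical.

```python
def recall_and_precision(
    expected: list[str], retrieved: list[str]
) -> tuple[float, float]:
    """Compute set-based recall and precision against an expected list."""
    expected_set, retrieved_set = set(expected), set(retrieved)
    hits = expected_set & retrieved_set
    # Assumed conventions: nothing expected counts as full recall,
    # nothing retrieved counts as zero precision.
    recall = len(hits) / len(expected_set) if expected_set else 1.0
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    return recall, precision


# Both expected files are found, but two extra files halve the precision
# even if one of them would have been useful to a developer.
print(recall_and_precision(
    ["app.py", "auth.py"],
    ["app.py", "auth.py", "utils.py", "db.py"],
))  # prints (1.0, 0.5)
```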

Current Baseline Checks

The fixture baseline checks:

index_fixture: indexes the bundled Python fixture and verifies expected file, symbol, and call edge counts.
search_login: verifies a login query finds the login handler context.
trace_login_user: verifies login_user can be traced back to handle_login and includes a source excerpt.
context_login: verifies the context pack includes relevant files and symbol traces within the requested budget.
revision_memory_round_trip: records a failed adaptation, records a verified fix, and verifies revision search can recall it.

Real-Project Evaluation Template

For each real project, create 3-5 tasks. Each task should have a known expected answer before running CodeWell.

Recommended task types:

Find the entry point for a feature such as login, routing, config loading, or job execution.
Find direct callers and callees for an important function.
Build the smallest useful context pack for a bug fix.
Record a failed snippet adaptation and verify that the fixed revision is searchable later.
Re-run indexing after a file change and verify unchanged files are skipped.

For each task, record:

project name and commit or local snapshot
query text
expected files
expected symbols
expected callers or callees
token budget
whether CodeWell found the expected context
whether a developer or agent needed extra manual search

Suggested Metrics

Track both correctness and effort:

context_recall: share of expected files or symbols included in the result.
context_precision: share of selected files or symbols that are expected; irrelevant selections lower it.
trace_recall: share of expected callers or callees included in symbol traces.
budget_fit: context pack token estimate stays at or under the requested budget.
manual_search_steps: extra grep, editor search, or file-open steps needed after CodeWell.
time_to_first_relevant_file: seconds from query start to first correct file.
revision_reuse: whether a previous verified fix is found for a similar failure.
index_latency: seconds to index a cold project.
incremental_latency: seconds to re-index after a small change.
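
Two of the correctness metrics above, trace_recall and budget_fit, can be sketched in a few lines. This assumes call edges are represented as caller/callee pairs, as in the task file example. The function names mirror the metric names but are hypothetical helpers, not the evaluator's code.

```python
def trace_recall(expected_edges: list[dict], traced_edges: list[dict]) -> float:
    """Share of expected caller/callee pairs that appear in the traced edges."""
    expected = {(edge["caller"], edge["callee"]) for edge in expected_edges}
    traced = {(edge["caller"], edge["callee"]) for edge in traced_edges}
    if not expected:
        return 1.0  # assumed convention: nothing expected means nothing missed
    return len(expected & traced) / len(expected)


def budget_fit(token_estimate: int, budget: int) -> bool:
    """True when the context pack estimate stays at or under the budget."""
    return token_estimate <= budget


expected = [
    {"caller": "handle_login", "callee": "login_user"},
    {"caller": "login_user", "callee": "bool"},
]
traced = [{"caller": "handle_login", "callee": "login_user"}]
print(trace_recall(expected, traced))  # prints 0.5
print(budget_fit(760, 800))  # prints True
```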

Agent-Level Evaluation

Retrieval metrics alone are not enough to show CodeWell's product value for coding agents.

For agent-level comparison, run a lightweight A/B evaluation:

baseline: the agent works without CodeWell retrieval
codewell: the agent works with CodeWell search, trace, and context

Use:

docs/AGENT_EVALUATION.md for the protocol
evaluations/agent_eval_task_template.json for task design
evaluations/agent_eval_result_template.json for result recording
evaluations/agent_eval_tsjs_first_batch.json for the initial 10-task TS/JS batch

This layer should measure:

task success
time to first relevant file
tool call count
manual search steps
wrong context events
final patch placement in the expected feature area

Run this protocol before attempting a heavier SWE-bench-style comparison. It is cheaper, easier to reproduce locally, and easier to attribute to CodeWell itself.

For the current condensed V1 summary across these evaluation layers, see docs/V1_EVALUATION_NOTE.md.

Interpreting Results

Use this rough scale while CodeWell is pre-alpha:

Pass: expected files, symbols, and direct graph relationships are present; context fits the budget; no extra manual search is needed for the task.
Partial: the right area is found, but key callers, callees, or source snippets are missing.
Fail: the result misses the expected file or symbol, exceeds budget, or points the user to unrelated code.
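
The scale above can be expressed as a small decision function. This is a sketch of one reasonable reading, not an official scoring rule: the function and parameter names are hypothetical, missing source snippets are not modeled separately, and a result that finds everything but still needs manual search is treated as Partial, a case the scale does not define.

```python
def classify_result(
    files_found: bool,
    symbols_found: bool,
    relationships_found: bool,
    fits_budget: bool,
    manual_search_needed: bool = False,
    points_to_unrelated_code: bool = False,
) -> str:
    """Map one task's outcome onto the Pass / Partial / Fail scale."""
    # Fail: a missing expected file or symbol, a blown budget, or unrelated code.
    if not (files_found and symbols_found) or not fits_budget or points_to_unrelated_code:
        return "Fail"
    # Pass: everything is present and no extra manual search was needed.
    if relationships_found and not manual_search_needed:
        return "Pass"
    # Otherwise the right area was found but something is still missing.
    return "Partial"


print(classify_result(True, True, False, True))  # prints Partial
```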

Do not add embeddings, additional reranking, or UI until repeated real-project evaluations show where the retrieval loop is actually weak.
