This document defines how to check whether CodeWell is useful, not just whether its tests pass.
CodeWell should reduce the work an AI coding agent or developer spends finding repository context and re-solving known adaptation failures.
The core question is:
Does CodeWell help retrieve the right code context and verified revision memory faster than ad hoc file search alone?
The bundled fixture baseline is intentionally small. It proves that the local loop works end to end:
- Index a Python project.
- Find relevant files and symbols for a task query.
- Trace a symbol to direct callers and callees.
- Build a context pack within a token budget.
- Record and recall a verified revision memory item.
It does not prove real-world usefulness by itself. Real usefulness requires running the same criteria against larger projects and comparing agent or developer behavior with and without CodeWell.

    python scripts/evaluate_fixture.py

Expected result:
- the command exits with status 0
- the JSON output has "failed": 0
- every check has "passed": true
Use --keep to keep the temporary workspace for manual inspection:

    python scripts/evaluate_fixture.py --keep

Use evaluate_project.py when you have a local project and a JSON task list with expected files,
symbols, and optional trace relationships:

    python scripts/evaluate_project.py /path/to/python-project tasks.json

The repository includes a narrow self-evaluation task list for CodeWell itself:

    python scripts/evaluate_project.py . evaluations/codewell_self.json

This baseline uses explicit symbol queries to verify the stable local loop. A second maintained baseline uses natural-language task queries:

    python scripts/evaluate_project.py . evaluations/codewell_natural.json

The natural-language baseline is still small, but it verifies that query normalization and reranking can find implementation files from user-style questions rather than exact symbol names only.
A broader acceptance task list covers release-readiness paths such as CLI dispatch, GitHub repository ingest, schema/status reporting, Python AST parsing, and the real-project smoke script:

    python scripts/evaluate_project.py . evaluations/codewell_acceptance.json

To run the same evaluator across multiple local projects, create a manifest based on
evaluations/real_projects/manifest.example.json and run:

    python scripts/evaluate_real_projects.py path/to/manifest.json --output-dir reports/real-projects

Task files can start from evaluations/real_projects/task_template.json. Do not mark a
real-project evaluation as complete unless the expected files, symbols, and trace relationships
were established before running CodeWell.
The first maintained open-source evaluation uses pinned PyPI sources for Click, Requests, and Rich:

    python scripts/evaluate_real_projects.py evaluations/real_projects/pypi_manifest.json --output-dir reports/real-projects

The repository also includes a maintained local GitHub-backed manifest for five real TS/JS projects:

    python scripts/evaluate_real_projects.py evaluations/real_projects/github_manifest.json --output-dir reports/github-real-projects

Current GitHub-backed targets are:
- express-typescript-prisma-postgresql
- bulletproof-react (apps/react-vite)
- nestjs-clean-architecture
- nestjs-boilerplate
- adonisjs-starter-kits (api-monorepo/apps/backend)
See docs/REAL_PROJECT_EVALUATION.md for setup commands, current results, and known misses.
Before publishing or tagging a release, run the complete local quality gate:

    python scripts/check_release.py

The release gate includes a package smoke check that builds a wheel, installs it into a temporary
virtual environment, and runs the installed codewell CLI against the fixture project:

    python scripts/check_package.py

Bundled language fixtures also include TypeScript coverage for:
- basic function/import flows
- barrel and re-export flows
- route-entry expansion through app.ts -> routes.ts -> service.ts
- command-entry expansion through cli.ts -> service.ts
- test-entry expansion through tests/service.spec.ts -> src/service.ts
- named re-export chains such as export { x as y } from ...
- object methods, namespace/module scopes, and decorator-prefixed methods
- class-field arrow methods and object-property arrow functions
- class-local this. and super. call relationships
To run the maintained bundled multi-project fixture suite, including the current TypeScript task set, use:

    python scripts/evaluate_real_projects.py evaluations/fixture_suite_manifest.json --output-dir reports/fixture-suite

This fixture suite is still local and curated, not a substitute for broader external real-project evaluation. It exists so TS/JS retrieval changes can be judged by task-level reports instead of parser-only unit tests.
The GitHub-backed manifest goes one step further than the fixture suite by exercising maintained real project structures:
- backend route/controller/service/repository relationships
- frontend route/form/api/test/protected-route relationships
- a NestJS clean-architecture sample with controller/use-case/factory/CRM/e2e flows
- a broader NestJS boilerplate sample with auth controllers/services, admin guard wiring, file-upload interceptor flow, and global bootstrap pipeline/interceptor configuration
- an AdonisJS backend sample with controller-validator flows, auth middleware, and kernel middleware registration
By default the script evaluates a temporary copy of the project so it does not write .codewell
into the original. Pass --in-place only when you want to index the original project, or --keep
to preserve the temporary workspace for inspection.
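The copy-by-default behavior can be pictured with a short sketch. This is not the script's actual implementation: the helper name evaluation_workspace, the temp-directory prefix, and the "project" subdirectory are assumptions made for illustration. Only the three modes (temporary copy, --in-place, --keep) come from the description above.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def evaluation_workspace(project: Path, in_place: bool = False, keep: bool = False):
    """Yield the directory to index (hypothetical helper, not CodeWell's API)."""
    if in_place:
        # Index the original project directly; nothing to clean up.
        yield project
        return

    # Copy the project so index data such as .codewell never touches the original.
    temp_root = Path(tempfile.mkdtemp(prefix="codewell-eval-"))
    workspace = temp_root / "project"
    shutil.copytree(project, workspace)
    try:
        yield workspace
    finally:
        if not keep:
            # Default: discard the temporary copy once evaluation finishes.
            shutil.rmtree(temp_root, ignore_errors=True)
```

With keep=True the copy survives the with-block, which mirrors how --keep leaves the workspace available for manual inspection.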
Minimal task file:

    {
      "tasks": [
        {
          "name": "login_flow",
          "query": "login",
          "context_budget": 800,
          "expected_files": ["app.py", "auth.py"],
          "expected_symbols": ["handle_login", "login_user"],
          "trace_symbols": [
            {
              "symbol": "login_user",
              "expected_incoming": [
                {"caller": "handle_login", "callee": "login_user"}
              ],
              "expected_outgoing": [
                {"caller": "login_user", "callee": "bool"}
              ]
            }
          ]
        }
      ]
    }

The script reports pass/fail checks per task:
- search includes expected files
- context pack includes expected files
- context pack includes expected symbols
- context pack fits the token budget
- symbol traces include expected incoming and outgoing calls
It also emits machine-readable metrics at the report level and per task:
- search_file_recall and search_file_precision
- context_file_recall and context_file_precision
- context_symbol_recall and context_symbol_precision
- trace_call_recall
- budget_fit
- index_latency_seconds and task elapsed_seconds
Precision is calculated against the expected lists in the task file. Treat it as an evaluation signal, not a full relevance judgment, because a task file may omit files that are genuinely useful.
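As a rough illustration of how recall and precision relate to the expected lists, here is a minimal set-based sketch. It assumes plain set overlap, which may differ from the evaluator's exact formula. The function names and the handling of empty inputs are assumptions.

```python
def recall(expected, found):
    """Share of expected items that appear among the found items."""
    expected_set, found_set = set(expected), set(found)
    if not expected_set:
        return 1.0  # assumption: nothing expected counts as fully recalled
    return len(expected_set & found_set) / len(expected_set)


def precision(expected, found):
    """Share of found items that appear in the expected list."""
    expected_set, found_set = set(expected), set(found)
    if not found_set:
        return 1.0  # assumption: an empty result selects nothing irrelevant
    return len(expected_set & found_set) / len(found_set)
```

With expected files ["app.py", "auth.py"] and a result of ["app.py", "utils.py"], both values are 0.5. The same functions work for call edges if each edge is a (caller, callee) tuple. The precision value also shows the caveat above: a useful file that is missing from the task file lowers precision even though the result was relevant.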
The fixture baseline checks:
- index_fixture: indexes the bundled Python fixture and verifies expected file, symbol, and call edge counts.
- search_login: verifies a login query finds the login handler context.
- trace_login_user: verifies login_user can be traced back to handle_login and includes a source excerpt.
- context_login: verifies the context pack includes relevant files and symbol traces within the requested budget.
- revision_memory_round_trip: records a failed adaptation, records a verified fix, and verifies revision search can recall it.
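A fixture run can be validated mechanically in CI. The sketch below assumes the JSON report has a top-level "failed" count and a "checks" list of objects with "name" and "passed" keys. The "failed" and "passed" fields are described in this document, while the "checks" and "name" keys and the helper name are assumptions.

```python
import json

# Check names taken from the fixture baseline list in this document.
EXPECTED_CHECKS = {
    "index_fixture",
    "search_login",
    "trace_login_user",
    "context_login",
    "revision_memory_round_trip",
}


def fixture_report_is_green(exit_code: int, stdout_text: str) -> bool:
    """Return True when the run exits 0, reports no failures, and every check passed."""
    if exit_code != 0:
        return False
    report = json.loads(stdout_text)
    if report.get("failed") != 0:
        return False
    # Assumed layout: "checks" is a list of {"name": ..., "passed": ...} objects.
    checks = report.get("checks", [])
    if not all(check.get("passed") is True for check in checks):
        return False
    # Also require that none of the documented checks silently disappeared.
    return EXPECTED_CHECKS <= {check.get("name") for check in checks}
```

The arguments can come from subprocess.run(..., capture_output=True, text=True), using its returncode and stdout attributes.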
For each real project, create 3-5 tasks. Each task should have a known expected answer before running CodeWell.
Recommended task types:
- Find the entry point for a feature such as login, routing, config loading, or job execution.
- Find direct callers and callees for an important function.
- Build the smallest useful context pack for a bug fix.
- Record a failed snippet adaptation and verify that the fixed revision is searchable later.
- Re-run indexing after a file change and verify unchanged files are skipped.
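For the last task type, one way to decide which files should have been re-indexed is to compare content hashes taken before and after the change. This sketch is independent of CodeWell's own change detection, which is not described here. All function names are hypothetical, and the list of re-indexed files is assumed to come from whatever the indexer reports.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path, pattern: str = "*.py") -> dict:
    """Map each matching file (relative POSIX path) to a SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
        if path.is_file()
    }


def changed_files(before: dict, after: dict) -> set:
    """Files that were added, removed, or modified between two snapshots."""
    return {
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    }


def unexpectedly_reindexed(before: dict, after: dict, reindexed) -> set:
    """Files the indexer touched again even though their content did not change."""
    return set(reindexed) - changed_files(before, after)
```

An empty result from unexpectedly_reindexed means every unchanged file was skipped.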
For each task, record:
- project name and commit or local snapshot
- query text
- expected files
- expected symbols
- expected callers or callees
- token budget
- whether CodeWell found the expected context
- whether a developer or agent needed extra manual search
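The per-task record can be kept as a small structured object so that runs stay comparable over time. The dataclass below is only a sketch: the class name, field names, and JSON-lines output are assumptions, not a format CodeWell defines.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaskRecord:
    """One evaluation task, written down before CodeWell is run (hypothetical schema)."""

    project: str
    snapshot: str  # commit hash or local snapshot label
    query: str
    expected_files: List[str]
    expected_symbols: List[str]
    token_budget: int
    expected_calls: List[Tuple[str, str]] = field(default_factory=list)  # (caller, callee)
    # Outcome fields stay None until the run has been reviewed.
    found_expected_context: Optional[bool] = None
    needed_manual_search: Optional[bool] = None


def to_json_line(record: TaskRecord) -> str:
    """Serialize a record as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Leaving the outcome fields empty until after the run keeps the expected answer clearly separated from the observed result.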
Track both correctness and effort:
- context_recall: expected files or symbols included in the result.
- context_precision: irrelevant selected files or symbols.
- trace_recall: expected callers or callees included in symbol traces.
- budget_fit: context pack token estimate stays at or under the requested budget.
- manual_search_steps: extra grep, editor search, or file-open steps needed after CodeWell.
- time_to_first_relevant_file: seconds from query start to first correct file.
- revision_reuse: whether a previous verified fix is found for a similar failure.
- index_latency: seconds to index a cold project.
- incremental_latency: seconds to re-index after a small change.
Retrieval metrics alone are not enough to show CodeWell's product value for coding agents.
For agent-level comparison, run a lightweight A/B evaluation:
- baseline: the agent works without CodeWell retrieval
- codewell: the agent works with CodeWell search, trace, and context
Use:
- docs/AGENT_EVALUATION.md for the protocol
- evaluations/agent_eval_task_template.json for task design
- evaluations/agent_eval_result_template.json for result recording
- evaluations/agent_eval_tsjs_first_batch.json for the initial 10-task TS/JS batch
This layer should measure:
- task success
- time to first relevant file
- tool call count
- manual search steps
- wrong context events
- final patch placement in the expected feature area
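Once both arms have been run, the measurements can be compared per arm. The sketch below assumes each result is a dict with "arm", "success", "tool_calls", and "manual_search_steps" keys. These field names are hypothetical and may not match evaluations/agent_eval_result_template.json.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_arm(results):
    """Group result rows by arm and average the main measurements."""
    rows_by_arm = defaultdict(list)
    for row in results:
        rows_by_arm[row["arm"]].append(row)

    summary = {}
    for arm, rows in rows_by_arm.items():
        summary[arm] = {
            "tasks": len(rows),
            # Success is counted as 1.0 or 0.0 so the mean is a success rate.
            "success_rate": mean(1.0 if row["success"] else 0.0 for row in rows),
            "mean_tool_calls": mean(row["tool_calls"] for row in rows),
            "mean_manual_search_steps": mean(row["manual_search_steps"] for row in rows),
        }
    return summary
```

Averages over a 10-task batch are noisy, so per-task differences between the two arms are worth reading alongside the summary.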
Run this protocol before attempting a heavier SWE-bench-style comparison. It is cheaper, easier to reproduce locally, and easier to attribute to CodeWell itself.
For the current compressed V1 read across these evaluation layers, see
docs/V1_EVALUATION_NOTE.md.
Use this rough scale while CodeWell is pre-alpha:
- Pass: expected files, symbols, and direct graph relationships are present; context fits the budget; no extra manual search is needed for the task.
- Partial: the right area is found, but key callers, callees, or source snippets are missing.
- Fail: the result misses the expected file or symbol, exceeds budget, or points the user to unrelated code.
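The scale can be applied consistently with a small helper. This is one possible reading of the three levels, not a definition used by the scripts. The function name and parameters are assumptions. A result that finds everything but still needs manual search is treated here as Partial, and the "unrelated code" failure case is left to human judgment.

```python
def grade_task(
    files_found: bool,
    symbols_found: bool,
    traces_found: bool,
    within_budget: bool,
    manual_search_steps: int,
) -> str:
    """Map task observations onto the rough pass/partial/fail scale."""
    # Fail: an expected file or symbol is missing, or the pack exceeds the budget.
    if not (files_found and symbols_found) or not within_budget:
        return "fail"
    # Pass: graph relationships are present and no extra manual search was needed.
    if traces_found and manual_search_steps == 0:
        return "pass"
    # Partial: the right area was found, but traces are missing or more search was needed.
    return "partial"
```

For example, grade_task(True, True, False, True, 0) returns "partial" because the expected callers or callees are missing.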
Do not add embeddings, reranking, or UI until repeated real-project evaluations show where the retrieval loop is actually weak.