Last updated: 2026-05-22
This document starts the V2 branch after the V1 freeze candidate was preserved.
Current V2 working branch:
feature/v2-multi-goal-retrieval
Current V1 comparison baseline:
- `snapshot/v1-open-source-baseline-candidate`
- `v1-open-source-baseline-candidate-2026-05-21-head`
The V1 evidence now gives a clearer problem split than before.
V1 already showed:
- strong positive value on navigation-heavy repository tasks
- weak or negative value on obvious or near-obvious tasks
- a non-trivial `strong_multi_goal` slice that cannot be interpreted from single runs
The strongest new lesson is:
reaching the right repository area is not the whole problem; agents can still waste work when several locally plausible repair targets compete.
That makes V2 different from “just improve retrieval more.”
Recommended V2 question:
Can CodeWell become ambiguity-aware, so that it changes retrieval behavior when the prompt itself contains multiple plausible local repair goals?
This V2 direction stays compatible with the lightweight philosophy:
- local-first
- SQLite plus static retrieval
- no required LLM API
- no required embeddings
- no required reranker
- no hosted service
The first V2 thesis should be narrow:
The next gain is not from retrieving more files. The next gain is from controlling which local goal the retrieval process serves first.
This suggests a first V2 focus:
- prompt goal-shape detection
- primary-goal-first context assembly
- bounded retrieval behavior when local goal competition is high
This also implies a strict implementation rule for the first V2 line:
- solve the baseline problem without required LLM planning
- solve the baseline problem without required embeddings
- treat both as optional later enhancers, not as prerequisites for V2 usefulness
The first V2 slice should stay small and measurable.
Start from the current strong_multi_goal task family:
- `adonis_kernel_auth_middleware_wiring`
- `adonis_profile_route_auth_middleware_wiring`
- `nestjs_users_admin_guard_service_wiring`
Current working subtype split:
- `stable_first_file_multi_target`
  - `adonis_kernel_auth_middleware_wiring`
  - `adonis_profile_route_auth_middleware_wiring`
- `unstable_first_file_multi_target`
  - `nestjs_users_admin_guard_service_wiring`
Use V1 as the control.
Add only one new decision layer at first:
- estimate whether the prompt is:
  - stable single-target
  - stable-first-file but multi-target
  - unstable-first-file multi-target
- reduce or redirect expansion accordingly
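A minimal sketch of what such a rule set could look like, assuming each subgoal has already been probed lexically and returns scored candidate files. The names `GoalShape`, `ProbeHit`, `classify_goal_shape`, and the `margin` threshold are hypothetical illustrations, not CodeWell's actual API, and the file paths in the usage example are invented.

```python
from dataclasses import dataclass
from enum import Enum


class GoalShape(Enum):
    STABLE_SINGLE_TARGET = "stable_single_target"
    STABLE_FIRST_FILE_MULTI_TARGET = "stable_first_file_multi_target"
    UNSTABLE_FIRST_FILE_MULTI_TARGET = "unstable_first_file_multi_target"


@dataclass(frozen=True)
class ProbeHit:
    path: str
    score: float  # shallow lexical score; higher is better (assumed scale 0..1)


def classify_goal_shape(probes: list[list[ProbeHit]], margin: float = 0.2) -> GoalShape:
    """Classify a prompt from per-subgoal probe hits (each list sorted best first).

    Assumed rules:
    - all subgoals agree on one top file      -> stable single-target
    - several top files, one clearly leading  -> stable-first-file, multi-target
    - several top files, no clear leader      -> unstable-first-file, multi-target
    """
    non_empty = [hits for hits in probes if hits]
    top_files = {hits[0].path for hits in non_empty}
    if len(top_files) <= 1:
        return GoalShape.STABLE_SINGLE_TARGET

    # Multi-target: check whether the first hop is clearly decided.
    all_hits = sorted(
        (hit for hits in non_empty for hit in hits),
        key=lambda hit: hit.score,
        reverse=True,
    )
    best = all_hits[0]
    runner_up = next((hit for hit in all_hits if hit.path != best.path), None)
    if runner_up is None or best.score - runner_up.score >= margin:
        return GoalShape.STABLE_FIRST_FILE_MULTI_TARGET
    return GoalShape.UNSTABLE_FIRST_FILE_MULTI_TARGET


# Usage with invented paths and scores:
shape = classify_goal_shape([
    [ProbeHit("src/users/admin.guard.ts", 0.60)],
    [ProbeHit("src/users/users.service.ts", 0.55)],
])
print(shape.value)
```

The margin test is only one possible stand-in for "first-file stability"; a real rule set would need to be calibrated against the repeated-run evidence from V1.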
For user tasks that contain several small subgoals, the intended retrieval policy is:
- decompose the task into short structured subgoals
- run only shallow lexical probes across those subgoals first
- merge subgoals that map to the same local workspace region
- choose one primary branch for deep context expansion
- keep at most one lightweight backup branch
- when budget allows, retain a micro-context for that backup branch
This is intentionally not "expand every subgoal in parallel".
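The policy above can be sketched roughly as follows. This is an illustration under stated assumptions, not the CodeWell implementation: `Subgoal`, `Branch`, `plan_branches`, the directory-based notion of a "workspace region", and the `deep_cost` / `micro_cost` token figures are all hypothetical, as are the paths in the usage example.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Subgoal:
    text: str
    hits: list[tuple[str, float]]  # (path, shallow lexical score), best first


@dataclass
class Branch:
    region: str
    subgoals: list[str] = field(default_factory=list)
    score: float = 0.0
    files: list[str] = field(default_factory=list)


def plan_branches(subgoals, token_budget, deep_cost=1500, micro_cost=200):
    """Merge subgoals by region, then pick one primary and at most one backup."""
    branches: dict[str, Branch] = {}
    for subgoal in subgoals:
        if not subgoal.hits:
            continue  # nothing found by the shallow probe; skip this subgoal
        path, score = subgoal.hits[0]
        # Assumption: the parent directory of the best hit stands in for
        # the "local workspace region".
        region = str(PurePosixPath(path).parent)
        branch = branches.setdefault(region, Branch(region))
        branch.subgoals.append(subgoal.text)
        branch.score += score
        if path not in branch.files:
            branch.files.append(path)

    ranked = sorted(branches.values(), key=lambda b: b.score, reverse=True)
    if not ranked:
        return {"primary": None, "backup": None, "backup_micro_context": []}

    primary = ranked[0]                             # only this branch is expanded deeply
    backup = ranked[1] if len(ranked) > 1 else None  # at most one backup branch
    remaining = token_budget - deep_cost
    # Keep a single-file micro-context for the backup only if budget remains.
    micro = backup.files[:1] if backup and remaining >= micro_cost else []
    return {"primary": primary, "backup": backup, "backup_micro_context": micro}


# Usage with invented subgoals and paths:
plan = plan_branches(
    [
        Subgoal("register auth middleware", [("start/kernel.ts", 0.9)]),
        Subgoal("name the auth middleware", [("start/kernel.ts", 0.4)]),
        Subgoal("protect the profile route", [("app/routes/profile.ts", 0.7)]),
    ],
    token_budget=2000,
)
print(plan["primary"].region, plan["backup"].region, plan["backup_micro_context"])
```

Any third or later branch is dropped rather than expanded, which is the point of the policy: the budget goes to one primary goal instead of being split across every subgoal.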
The first V2 implementation round should aim for:
- a lightweight ambiguity classifier or rule set
- one new context-assembly policy that reacts to that classifier
- an ablation against the current V1 baseline
- repeated-run comparison on the current three `strong_multi_goal` tasks
The default V2 multi-goal baseline should remain fully local and model-free at runtime.
Core baseline responsibilities:
- retrieval-aware task decomposition
- shallow subgoal probing
- shared-workspace clustering
- primary-branch selection
- bounded deep expansion for the primary branch
- optional single backup branch retention
- optional backup-branch micro-context retention when budget remains
Default baseline constraints:
- no required LLM API calls
- no required embedding model
- no vector database
- no hosted planner or reranker
Why this baseline matters:
- it preserves the lightweight product thesis
- it keeps evaluation ablations interpretable
- it makes open-source use and reproduction easier
Later V2 work may add optional enhancement layers, but they must stay optional.
Optional planner layer:
- candidate subgoal generation from complex natural-language prompts
- subgoal dependency ordering suggestions
- fallback decomposition when rule-based splitting is weak
Optional semantic layer:
- embedding-based subgoal clustering
- semantic workspace-cluster merging
- reranking support for ambiguous candidate groups
Required rule for both layers:
- the core V2 path must still work when both are disabled
- evaluations must always preserve a pure no-LLM no-embedding baseline condition
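One way to make this rule enforceable is to model both layers as explicit opt-in flags and to reject any evaluation matrix that lacks the pure baseline. The sketch below is hypothetical: `V2Config`, its field names, and `check_eval_matrix` are assumptions for illustration, not existing CodeWell settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V2Config:
    # Both enhancers default to off so the core path never depends on them.
    use_llm_planner: bool = False
    use_embeddings: bool = False

    @property
    def is_pure_baseline(self) -> bool:
        return not (self.use_llm_planner or self.use_embeddings)


def check_eval_matrix(conditions: dict[str, V2Config]) -> None:
    """Raise if no condition is the pure no-LLM no-embedding baseline."""
    if not any(cfg.is_pure_baseline for cfg in conditions.values()):
        raise ValueError("evaluation matrix lacks a pure no-LLM no-embedding baseline")


# Usage: a matrix that keeps the baseline alongside one optional layer.
check_eval_matrix({
    "v2_baseline": V2Config(),
    "v2_plus_planner": V2Config(use_llm_planner=True),
})
```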
The first V2 round is useful if it can improve at least one of these without making V1 heavy:
- lower elapsed median on `strong_multi_goal`
- lower token median on `strong_multi_goal`
- improve first-relevant-file stability on unstable-first-file tasks
- reduce local-goal drift while preserving the navigation-heavy bucket
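As a sketch of how the first three criteria might be computed from repeated runs: the `Run` record and `summarize` function below are hypothetical, and "first-relevant-file stability" is defined here, as an assumption, as the share of runs whose first relevant file matches the most common one.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Run:
    elapsed_s: float
    tokens: int
    first_relevant_file: Optional[str]  # None if the run never reached a relevant file


def summarize(runs: list[Run]) -> dict[str, float]:
    """Summarize repeated runs of one task under one condition (V1 or V2)."""
    if not runs:
        raise ValueError("need at least one run")
    firsts = Counter(r.first_relevant_file for r in runs if r.first_relevant_file)
    modal_count = firsts.most_common(1)[0][1] if firsts else 0
    return {
        "elapsed_median": median(r.elapsed_s for r in runs),
        "token_median": median(r.tokens for r in runs),
        # Assumed definition: fraction of runs agreeing with the modal first file.
        "first_file_stability": modal_count / len(runs),
    }


# Usage with invented numbers:
summary = summarize([
    Run(10.0, 1000, "a.ts"),
    Run(20.0, 3000, "a.ts"),
    Run(30.0, 2000, "b.ts"),
])
print(summary)
```

Computing the same summary for the V1 control and for V2 on each task gives a per-task comparison that does not rest on any single run.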
Do not do these in the first V2 round:
- add embeddings by default
- add a reranker before proving the ambiguity problem is not solvable with cheaper logic
- broaden to many new benchmarks before the internal V2 hypothesis is coherent
- weaken the V1 lightweight positioning just to create a larger method delta
The next concrete V2 action should be:
- formalize the two `strong_multi_goal` subtypes observed in V1
- encode them as evaluation labels or diagnostics
- design the smallest ambiguity-aware retrieval control that can be ablated against V1
That subtype layer now serves a concrete purpose:
- `stable_first_file_multi_target` should test whether CodeWell can stop expanding earlier and help the agent commit faster after reaching the right area
- `unstable_first_file_multi_target` should test whether CodeWell can reduce first-hop drift before deeper local context assembly starts
Recommended build order:
- stabilize the no-LLM no-embedding decomposition baseline
- add structured subgoal records and shared-workspace clustering
- add primary-branch versus backup-branch retrieval control
- add backup-branch micro-context retention
- run ablations on the current `strong_multi_goal` slice
- only then consider an optional LLM planner
- only after that consider optional embedding-based clustering or reranking
Near-term execution checklist:
- define the structured subgoal schema
- define shallow-probe scoring and merge thresholds
- implement shared-workspace clustering from lexical and graph overlap
- expose primary versus backup branch diagnostics in context output
- retain a minimal backup-branch context under leftover budget
- re-run the three current repeated-run `strong_multi_goal` tasks
- compare token, elapsed, and first-file stability before any optional model-based enhancer work
Current implementation-side smoke findings from 2026-05-22:
- the first V2 baseline did help the unstable-first-file task sample
- the same baseline initially regressed on stable-first-file samples
- a large part of that regression was caused by overexposing multi-goal diagnostics and backup branch context in the default response path
- trimming default output while keeping the branch policy internal flipped one stable-first-file smoke sample back to positive
Current practical rule:
- keep rich multi-goal diagnostics in `debug`
- keep default output lean
- do not retain backup micro-context when the primary branch is an entrypoint-search branch
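The practical rule can be expressed as a small response-assembly step. This is a sketch only: `build_response`, the response keys, and the `"entrypoint_search"` branch label are hypothetical names chosen for illustration, not the actual CodeWell output format.

```python
def build_response(primary_context, backup_micro_context, diagnostics,
                   primary_kind, debug=False):
    """Assemble context output: lean by default, rich only in debug mode."""
    response = {"context": list(primary_context)}

    # Rule: drop the backup micro-context when the primary branch is an
    # entrypoint-search branch (label assumed for this sketch).
    if backup_micro_context and primary_kind != "entrypoint_search":
        response["backup_context"] = list(backup_micro_context)

    # Rule: multi-goal diagnostics are exposed only when debug is requested.
    if debug:
        response["multi_goal_diagnostics"] = dict(diagnostics)
    return response


# Usage with invented values:
lean = build_response(
    primary_context=["start/kernel.ts"],
    backup_micro_context=["app/routes/profile.ts"],
    diagnostics={"goal_shape": "stable_first_file_multi_target"},
    primary_kind="entrypoint_search",
)
print(sorted(lean))
```

The branch policy still runs internally in both modes; only what is surfaced to the agent changes.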
- `docs/V1_EVALUATION_NOTE.md`
- `docs/V1_FREEZE_SUMMARY.md`
- `docs/ARCHITECTURE_V2.md`
- `docs/AGENT_EVAL_SUMMARY.md`
- `docs/V2_PROGRESS_2026-05-23.md`