CodeWell V2 Kickoff

Last updated: 2026-05-22

Scope

This document starts the V2 branch after the V1 freeze candidate was preserved.

Current V2 working branch:

feature/v2-multi-goal-retrieval

Current V1 comparison baseline:

snapshot/v1-open-source-baseline-candidate
v1-open-source-baseline-candidate-2026-05-21-head

Why V2 Starts Here

The V1 evidence now gives a clearer problem split than before.

V1 already showed:

strong positive value on navigation-heavy repository tasks
weak or negative value on obvious or near-obvious tasks
a non-trivial strong_multi_goal slice that cannot be interpreted from single runs

The strongest new lesson is:

reaching the right repository area is not the whole problem; agents can still waste work by switching between several competing, locally plausible repair targets.

That makes V2 different from “just improve retrieval more.”

V2 Core Question

First V2 Thesis

The first V2 thesis should be narrow:

The next gain is not from retrieving more files. The next gain is from controlling which local goal the retrieval process serves first.

This suggests a first V2 focus:

prompt goal-shape detection
primary-goal-first context assembly
bounded retrieval behavior when local goal competition is high

This also implies a strict implementation rule for the first V2 line:

solve the baseline problem without required LLM planning
solve the baseline problem without required embeddings
treat both as optional later enhancers, not as prerequisites for V2 usefulness

First V2 Experimental Slice

The first V2 slice should stay small and measurable.

Start from the current strong_multi_goal task family:

adonis_kernel_auth_middleware_wiring
adonis_profile_route_auth_middleware_wiring
nestjs_users_admin_guard_service_wiring

Current working subtype split:

stable_first_file_multi_target
- adonis_kernel_auth_middleware_wiring
- adonis_profile_route_auth_middleware_wiring
unstable_first_file_multi_target
- nestjs_users_admin_guard_service_wiring

Use V1 as the control.

Add only one new decision layer at first:

estimate whether the prompt is:
- stable single-target
- stable-first-file but multi-target
- unstable-first-file multi-target
reduce or redirect expansion accordingly
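
The decision layer above could be sketched as a small rule set. This is only an illustration: the inputs (`target_count`, `first_file_margin`), the threshold, and the returned policy fields are assumptions made for this example, not existing CodeWell names.

```python
from enum import Enum


class GoalShape(Enum):
    STABLE_SINGLE_TARGET = "stable_single_target"
    STABLE_FIRST_FILE_MULTI_TARGET = "stable_first_file_multi_target"
    UNSTABLE_FIRST_FILE_MULTI_TARGET = "unstable_first_file_multi_target"


def classify_goal_shape(target_count: int, first_file_margin: float,
                        margin_threshold: float = 0.2) -> GoalShape:
    """Hypothetical rule set; both inputs and the threshold are assumptions.

    target_count: number of distinct repair targets detected in the prompt.
    first_file_margin: score gap between the top two first-file candidates.
    """
    if target_count <= 1:
        return GoalShape.STABLE_SINGLE_TARGET
    if first_file_margin >= margin_threshold:
        return GoalShape.STABLE_FIRST_FILE_MULTI_TARGET
    return GoalShape.UNSTABLE_FIRST_FILE_MULTI_TARGET


def expansion_policy(shape: GoalShape) -> dict:
    """Map a goal shape to an illustrative expansion decision."""
    if shape is GoalShape.STABLE_SINGLE_TARGET:
        return {"action": "expand", "max_depth": 2}
    if shape is GoalShape.STABLE_FIRST_FILE_MULTI_TARGET:
        # The right area is likely already reached: reduce expansion.
        return {"action": "reduce", "max_depth": 1}
    # The first hop is uncertain: redirect effort to settling the first file.
    return {"action": "redirect", "max_depth": 1}
```

Keeping the classifier as plain rules means it can be switched off cleanly in an ablation against V1.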

For user tasks that contain several small subgoals, the intended retrieval policy is:

decompose the task into short structured subgoals
run only shallow lexical probes across those subgoals first
merge subgoals that map to the same local workspace region
choose one primary branch for deep context expansion
keep at most one lightweight backup branch
when budget allows, retain a micro-context for that backup branch

This is intentionally not "expand every subgoal in parallel".
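
A minimal sketch of the policy above, under stated assumptions: probes are plain token overlap, a "workspace region" is approximated by the parent directory of the best-matching file, and all function names are hypothetical. Subgoal decomposition is taken as already done.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; no stemming, no model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shallow_probe(subgoal: str, files: dict[str, str]) -> tuple[str, int]:
    """Return (best_path, score) by token overlap with path plus body."""
    goal = tokens(subgoal)
    scored = [(len(goal & tokens(path + " " + body)), path)
              for path, body in files.items()]
    # Highest score wins; ties are broken by path for determinism.
    score, path = min(scored, key=lambda pair: (-pair[0], pair[1]))
    return path, score


def plan_branches(subgoals: list[str], files: dict[str, str]) -> dict:
    """Probe each subgoal, merge by region, pick one primary and one backup."""
    if not subgoals or not files:
        return {"primary": None, "backup": None, "regions": {}}
    regions: dict[str, dict] = defaultdict(lambda: {"subgoals": [], "score": 0})
    for subgoal in subgoals:
        path, score = shallow_probe(subgoal, files)
        region = str(PurePosixPath(path).parent)  # region = parent dir (assumption)
        regions[region]["subgoals"].append(subgoal)
        regions[region]["score"] += score
    ranked = sorted(regions.items(), key=lambda item: (-item[1]["score"], item[0]))
    primary = ranked[0][0]
    # At most one backup branch is kept; lower-ranked regions are dropped.
    backup = ranked[1][0] if len(ranked) > 1 else None
    return {"primary": primary, "backup": backup, "regions": dict(regions)}


files = {
    "app/middleware/auth.ts": "export class AuthMiddleware handle token",
    "start/kernel.ts": "router named middleware register",
    "app/controllers/profile.ts": "profile controller show user",
}
plan = plan_branches(
    ["register auth middleware in kernel",
     "name the middleware in the router",
     "protect profile controller route"],
    files,
)
```

Only the primary region would then receive deep context expansion; the backup is carried as a pointer, not expanded.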

First V2 Deliverables

The first V2 implementation round should aim for:

a lightweight ambiguity classifier or rule set
one new context-assembly policy that reacts to that classifier
an ablation against the current V1 baseline
repeated-run comparison on the current three strong_multi_goal tasks

Default V2 Baseline

The default V2 multi-goal baseline should remain fully local and model-free at runtime.

Core baseline responsibilities:

retrieval-aware task decomposition
shallow subgoal probing
shared-workspace clustering
primary-branch selection
bounded deep expansion for the primary branch
optional single backup branch retention
optional backup-branch micro-context retention when budget remains

Default baseline constraints:

no required LLM API calls
no required embedding model
no vector database
no hosted planner or reranker

Why this baseline matters:

it preserves the lightweight product thesis
it keeps evaluation ablations interpretable
it makes open-source use and reproduction easier

Optional Enhancers

Later V2 work may add optional enhancement layers, but they must stay optional.

Optional planner layer:

candidate subgoal generation from complex natural-language prompts
subgoal dependency ordering suggestions
fallback decomposition when rule-based splitting is weak

Optional semantic layer:

embedding-based subgoal clustering
semantic workspace-cluster merging
reranking support for ambiguous candidate groups

Required rule for both layers:

the core V2 path must still work when both are disabled
evaluations must always preserve a pure no-LLM no-embedding baseline condition
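
One way to make this rule hard to violate is to encode it in configuration. The sketch below is an assumption about shape only; `V2Config` and `evaluation_conditions` are hypothetical names, not part of CodeWell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V2Config:
    # Both enhancers default to off, so the default config is the pure baseline.
    use_llm_planner: bool = False
    use_embeddings: bool = False

    @property
    def is_pure_baseline(self) -> bool:
        return not (self.use_llm_planner or self.use_embeddings)


def evaluation_conditions(extra: list[V2Config]) -> list[V2Config]:
    """Always put the pure baseline first, without duplicating it."""
    baseline = V2Config()
    return [baseline] + [cfg for cfg in extra if cfg != baseline]


conditions = evaluation_conditions([V2Config(use_llm_planner=True), V2Config()])
```

With this shape, an evaluation run cannot be assembled without the no-LLM no-embedding condition.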

Success Criteria

The first V2 round is useful if it improves at least one of these without giving up V1's lightweight footprint:

lower elapsed median on strong_multi_goal
lower token median on strong_multi_goal
improve first-relevant-file stability on unstable-first-file tasks
reduce local-goal drift while preserving the navigation-heavy bucket
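
A sketch of how the first three criteria might be computed from repeated runs. The record keys (`elapsed_s`, `tokens`, `first_file`) are hypothetical, and "first-file stability" is assumed here to mean the share of runs whose first relevant file matches the most common one; the document does not fix that definition.

```python
from collections import Counter
from statistics import median


def summarize_runs(runs: list[dict]) -> dict:
    """Summarize repeated runs of one task under one condition."""
    first_files = Counter(run["first_file"] for run in runs)
    modal_file, modal_count = first_files.most_common(1)[0]
    return {
        "elapsed_median": median(run["elapsed_s"] for run in runs),
        "token_median": median(run["tokens"] for run in runs),
        # Share of runs that opened the most common first file (assumed metric).
        "first_file_stability": modal_count / len(runs),
        "modal_first_file": modal_file,
    }


runs = [
    {"elapsed_s": 10.0, "tokens": 1000, "first_file": "a.ts"},
    {"elapsed_s": 14.0, "tokens": 1200, "first_file": "a.ts"},
    {"elapsed_s": 12.0, "tokens": 900, "first_file": "b.ts"},
]
summary = summarize_runs(runs)
```

Comparing the same summary for the V1 control and the V2 condition gives the per-task deltas.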

What Not To Do Yet

Do not do these in the first V2 round:

add embeddings by default
add a reranker before proving the ambiguity problem is not solvable with cheaper logic
broaden to many new benchmarks before the internal V2 hypothesis is coherent
weaken the V1 lightweight positioning just to create a larger method delta

Recommended Immediate Next Step

The next concrete V2 action should be:

formalize the two strong_multi_goal subtypes observed in V1
encode them as evaluation labels or diagnostics
design the smallest ambiguity-aware retrieval control that can be ablated against V1

That subtype layer now serves a concrete purpose:

stable_first_file_multi_target should test whether CodeWell can stop expanding earlier and help the agent commit faster after reaching the right area
unstable_first_file_multi_target should test whether CodeWell can reduce first-hop drift before deeper local context assembly starts
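
As a possible encoding of these labels, the current working split can be written as a plain mapping. The task ids and subtype names come from the split listed earlier; the variable and helper names are illustrative assumptions.

```python
from collections import defaultdict

# Current working subtype split for the strong_multi_goal task family.
SUBTYPE_LABELS = {
    "adonis_kernel_auth_middleware_wiring": "stable_first_file_multi_target",
    "adonis_profile_route_auth_middleware_wiring": "stable_first_file_multi_target",
    "nestjs_users_admin_guard_service_wiring": "unstable_first_file_multi_target",
}


def tasks_by_subtype(labels: dict[str, str]) -> dict[str, list[str]]:
    """Group task ids under their subtype label, sorted for stable reports."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for task_id, subtype in labels.items():
        grouped[subtype].append(task_id)
    return {subtype: sorted(ids) for subtype, ids in grouped.items()}


groups = tasks_by_subtype(SUBTYPE_LABELS)
```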

Ordered Implementation Plan

Recommended build order:

stabilize the no-LLM no-embedding decomposition baseline
add structured subgoal records and shared-workspace clustering
add primary-branch versus backup-branch retrieval control
add backup-branch micro-context retention
run ablations on the current strong_multi_goal slice
only then consider an optional LLM planner
only after that consider optional embedding-based clustering or reranking

Near-term execution checklist:

define the structured subgoal schema
define shallow-probe scoring and merge thresholds
implement shared-workspace clustering from lexical and graph overlap
expose primary versus backup branch diagnostics in context output
retain a minimal backup-branch context under leftover budget
re-run the three current repeated-run strong_multi_goal tasks
compare token, elapsed, and first-file stability before any optional model-based enhancer work
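
The first three checklist items could start from something like the sketch below. The schema fields, the Jaccard-based overlap, the equal lexical/graph weighting, and the `0.3` merge threshold are all assumptions chosen for illustration; the real values are exactly what the checklist asks to define.

```python
from dataclasses import dataclass, field


@dataclass
class Subgoal:
    id: str
    text: str
    candidate_files: set[str] = field(default_factory=set)  # from lexical probes
    graph_neighbors: set[str] = field(default_factory=set)  # from import/call graph


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a: Subgoal, b: Subgoal, graph_weight: float = 0.5) -> float:
    """Weighted mix of lexical and graph overlap between two subgoals."""
    lexical = jaccard(a.candidate_files, b.candidate_files)
    graph = jaccard(a.graph_neighbors, b.graph_neighbors)
    return (1 - graph_weight) * lexical + graph_weight * graph


def cluster_subgoals(subgoals: list[Subgoal],
                     merge_threshold: float = 0.3) -> list[list[str]]:
    """Greedy one-pass grouping: join the first cluster with a close member."""
    clusters: list[list[Subgoal]] = []
    for goal in subgoals:
        for cluster in clusters:
            if any(overlap(goal, member) >= merge_threshold for member in cluster):
                cluster.append(goal)
                break
        else:
            clusters.append([goal])
    return [[g.id for g in cluster] for cluster in clusters]


g1 = Subgoal("g1", "register middleware",
             {"start/kernel.ts", "app/middleware/auth.ts"}, {"start/routes.ts"})
g2 = Subgoal("g2", "name middleware", {"start/kernel.ts"}, {"start/routes.ts"})
g3 = Subgoal("g3", "protect profile route",
             {"app/controllers/profile.ts"}, {"app/models/user.ts"})
clusters = cluster_subgoals([g1, g2, g3])
```

A one-pass grouping is order-dependent, which is acceptable for a first baseline but worth noting when comparing repeated runs.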

Early Smoke Read

Current implementation-side smoke findings from 2026-05-22:

the first V2 baseline did help the unstable-first-file task sample
the same baseline initially regressed on stable-first-file samples
a large part of that regression was caused by overexposing multi-goal diagnostics and backup branch context in the default response path
trimming default output while keeping the branch policy internal flipped one stable-first-file smoke sample back to positive

Current practical rule:

keep rich multi-goal diagnostics in debug
keep default output lean
do not retain backup micro-context when the primary branch is an entrypoint-search branch
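
The practical rule above might look like this at the response boundary. This is a sketch under assumptions: the field names, the `entrypoint_search` kind value, and the character-based budget check are hypothetical, and it assumes retained backup micro-context is included in the output only when the rule allows it.

```python
from typing import Optional


def build_response(primary: dict, backup: Optional[dict], diagnostics: dict,
                   debug: bool = False, leftover_budget: int = 0) -> dict:
    """Assemble a lean default response; rich diagnostics only in debug."""
    response = {"context": primary["context"]}
    keep_backup = (
        backup is not None
        # No backup micro-context when the primary is an entrypoint search.
        and primary.get("kind") != "entrypoint_search"
        # Budget is counted in characters here purely for illustration.
        and len(backup["micro_context"]) <= leftover_budget
    )
    if keep_backup:
        response["backup_micro_context"] = backup["micro_context"]
    if debug:
        response["diagnostics"] = diagnostics
    return response


backup = {"micro_context": "guard.ts: canActivate"}
diag = {"branches": 2}
lean = build_response({"kind": "entrypoint_search", "context": "ctx"},
                      backup, diag, leftover_budget=100)
with_backup = build_response({"kind": "local_edit", "context": "ctx"},
                             backup, diag, leftover_budget=100)
debug_view = build_response({"kind": "local_edit", "context": "ctx"},
                            None, diag, debug=True)
```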

Related Files

docs/V1_EVALUATION_NOTE.md
docs/V1_FREEZE_SUMMARY.md
docs/ARCHITECTURE_V2.md
docs/AGENT_EVAL_SUMMARY.md
docs/V2_PROGRESS_2026-05-23.md
