CodeWell V2 Kickoff

Last updated: 2026-05-22

Scope

This document starts the V2 branch after the V1 freeze candidate was preserved.

Current V2 working branch:

  • feature/v2-multi-goal-retrieval

Current V1 comparison baseline:

  • snapshot/v1-open-source-baseline-candidate
  • v1-open-source-baseline-candidate-2026-05-21-head

Why V2 Starts Here

The V1 evidence now gives a clearer problem split than before.

V1 already showed:

  • strong positive value on navigation-heavy repository tasks
  • weak or negative value on obvious or near-obvious tasks
  • a non-trivial strong_multi_goal slice that cannot be interpreted from single runs

The strongest new lesson is:

reaching the right repository area is not the whole problem; agents can still waste work by switching between several competing, locally plausible repair targets.

That makes V2 different from “just improve retrieval more.”

V2 Core Question

Recommended V2 question:

Can CodeWell become ambiguity-aware, so that it changes retrieval behavior when the prompt itself contains multiple plausible local repair goals?

This V2 direction stays compatible with the lightweight philosophy:

  • local-first
  • SQLite plus static retrieval
  • no required LLM API
  • no required embeddings
  • no required reranker
  • no hosted service

First V2 Thesis

The first V2 thesis should be narrow:

The next gain is not from retrieving more files. The next gain is from controlling which local goal the retrieval process serves first.

This suggests a first V2 focus:

  • prompt goal-shape detection
  • primary-goal-first context assembly
  • bounded retrieval behavior when local goal competition is high

This also implies a strict implementation rule for the first V2 line:

  • solve the baseline problem without required LLM planning
  • solve the baseline problem without required embeddings
  • treat both as optional later enhancers, not as prerequisites for V2 usefulness

First V2 Experimental Slice

The first V2 slice should stay small and measurable.

Start from the current strong_multi_goal task family:

  1. adonis_kernel_auth_middleware_wiring
  2. adonis_profile_route_auth_middleware_wiring
  3. nestjs_users_admin_guard_service_wiring

Current working subtype split:

  • stable_first_file_multi_target
    • adonis_kernel_auth_middleware_wiring
    • adonis_profile_route_auth_middleware_wiring
  • unstable_first_file_multi_target
    • nestjs_users_admin_guard_service_wiring

Use V1 as the control.

Add only one new decision layer at first:

  1. estimate whether the prompt is:
    • stable single-target
    • stable-first-file but multi-target
    • unstable-first-file multi-target
  2. reduce or redirect expansion accordingly
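
One way to picture this decision layer is a small rule set that maps shallow probe results to one of the three shapes. The sketch below is illustrative only: the enum values mirror the list above, but the function names, the inputs (`target_count`, `first_file_votes`), the 0.6 agreement threshold, and the depth values are all assumptions, not CodeWell's actual classifier.

```python
from collections import Counter
from enum import Enum


class GoalShape(Enum):
    STABLE_SINGLE_TARGET = "stable_single_target"
    STABLE_FIRST_FILE_MULTI_TARGET = "stable_first_file_multi_target"
    UNSTABLE_FIRST_FILE_MULTI_TARGET = "unstable_first_file_multi_target"


def classify_goal_shape(target_count, first_file_votes, agreement_threshold=0.6):
    """Hypothetical rule set for step 1.

    target_count: number of distinct repair targets detected in the prompt.
    first_file_votes: top-ranked file returned by each shallow probe.
    agreement_threshold: assumed cutoff; share of probes that must agree
        on the first file for it to count as stable.
    """
    if target_count <= 1:
        return GoalShape.STABLE_SINGLE_TARGET
    if not first_file_votes:
        # No evidence about the first file: treat it as unstable.
        return GoalShape.UNSTABLE_FIRST_FILE_MULTI_TARGET
    _, top_votes = Counter(first_file_votes).most_common(1)[0]
    agreement = top_votes / len(first_file_votes)
    if agreement >= agreement_threshold:
        return GoalShape.STABLE_FIRST_FILE_MULTI_TARGET
    return GoalShape.UNSTABLE_FIRST_FILE_MULTI_TARGET


def expansion_depth(shape, default_depth=3):
    """Hypothetical step 2: reduce expansion according to the shape."""
    if shape is GoalShape.STABLE_SINGLE_TARGET:
        return default_depth
    if shape is GoalShape.STABLE_FIRST_FILE_MULTI_TARGET:
        return max(1, default_depth - 1)  # stop expanding earlier
    return 1  # unstable first file: stay shallow until the first hop settles
```

Keeping the classifier as plain rules over probe output means it can be switched off for the V1 control without touching the rest of the retrieval path.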

For user tasks that contain several small subgoals, the intended retrieval policy is:

  1. decompose the task into short structured subgoals
  2. run only shallow lexical probes across those subgoals first
  3. merge subgoals that map to the same local workspace region
  4. choose one primary branch for deep context expansion
  5. keep at most one lightweight backup branch
  6. when budget allows, retain a micro-context for that backup branch

This is intentionally not "expand every subgoal in parallel".
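
The six steps above can be sketched as a single selection function. This is a minimal illustration under stated assumptions: the `Subgoal` and `Branch` records, the use of a summed probe score to rank branches, and the `deep_cost` and `micro_cost` token estimates are hypothetical and are not taken from the CodeWell codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Subgoal:
    text: str           # short structured subgoal (step 1)
    region: str         # workspace region hit by the shallow probe (step 2)
    probe_score: float  # shallow lexical probe score (step 2)


@dataclass
class Branch:
    region: str
    subgoals: list = field(default_factory=list)

    @property
    def score(self):
        return sum(s.probe_score for s in self.subgoals)


def plan_branches(subgoals, token_budget, deep_cost=800, micro_cost=150):
    """Return (primary, backup, keep_micro_context).

    deep_cost and micro_cost are made-up token estimates for illustration.
    """
    # Step 3: merge subgoals that map to the same workspace region.
    by_region = {}
    for sg in subgoals:
        by_region.setdefault(sg.region, Branch(sg.region)).subgoals.append(sg)
    ranked = sorted(by_region.values(), key=lambda b: b.score, reverse=True)
    if not ranked:
        return None, None, False
    # Step 4: exactly one primary branch gets deep expansion.
    primary = ranked[0]
    # Step 5: at most one lightweight backup branch; the rest are dropped.
    backup = ranked[1] if len(ranked) > 1 else None
    # Step 6: keep a micro-context for the backup only if budget remains.
    leftover = token_budget - deep_cost
    keep_micro = backup is not None and leftover >= micro_cost
    return primary, backup, keep_micro
```

Every branch below rank two is discarded rather than expanded, which is what keeps the policy bounded when local goal competition is high.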

First V2 Deliverables

The first V2 implementation round should aim for:

  1. a lightweight ambiguity classifier or rule set
  2. one new context-assembly policy that reacts to that classifier
  3. an ablation against the current V1 baseline
  4. repeated-run comparison on the current three strong_multi_goal tasks

Default V2 Baseline

The default V2 multi-goal baseline should remain fully local and model-free at runtime.

Core baseline responsibilities:

  1. retrieval-aware task decomposition
  2. shallow subgoal probing
  3. shared-workspace clustering
  4. primary-branch selection
  5. bounded deep expansion for the primary branch
  6. optional single backup branch retention
  7. optional backup-branch micro-context retention when budget remains
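
As a rough illustration of what shallow subgoal probing could look like on a SQLite-only stack, the sketch below counts how many probe terms appear in each indexed file. The `files(path, body)` table, the function names, and the term-count scoring are assumptions made for this example; the real CodeWell index schema is not described in this document.

```python
import sqlite3


def make_demo_index(rows):
    """Build an in-memory index with an assumed files(path, body) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO files (path, body) VALUES (?, ?)", rows)
    return conn


def shallow_probe(conn, terms, limit=3):
    """Rank files by the number of probe terms found in their text.

    Purely lexical: no embeddings, no reranker, no model call.
    """
    if not terms:
        return []
    # Each comparison evaluates to 0 or 1 in SQLite, so the sum is a hit count.
    score_expr = " + ".join("(instr(lower(body), ?) > 0)" for _ in terms)
    sql = (
        f"SELECT path, hits FROM (SELECT path, {score_expr} AS hits FROM files) "
        "WHERE hits > 0 ORDER BY hits DESC, path LIMIT ?"
    )
    params = [t.lower() for t in terms] + [limit]
    return conn.execute(sql, params).fetchall()
```

A probe like this is cheap enough to run once per subgoal before any deep expansion is committed to a branch.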

Default baseline constraints:

  1. no required LLM API calls
  2. no required embedding model
  3. no vector database
  4. no hosted planner or reranker

Why this baseline matters:

  • it preserves the lightweight product thesis
  • it keeps evaluation ablations interpretable
  • it makes open-source use and reproduction easier

Optional Enhancers

Later V2 work may add optional enhancement layers, but they must stay optional.

Optional planner layer:

  • candidate subgoal generation from complex natural-language prompts
  • subgoal dependency ordering suggestions
  • fallback decomposition when rule-based splitting is weak

Optional semantic layer:

  • embedding-based subgoal clustering
  • semantic workspace-cluster merging
  • reranking support for ambiguous candidate groups

Required rule for both layers:

  • the core V2 path must still work when both are disabled
  • evaluations must always preserve a pure no-LLM no-embedding baseline condition
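
A small configuration sketch shows one way to enforce both rules. The `EnhancerConfig` type and `evaluation_conditions` helper are hypothetical names introduced here for illustration; the only point is that both enhancers default to off and that the pure baseline condition is always part of an evaluation run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnhancerConfig:
    # Both optional layers are disabled unless explicitly requested.
    use_llm_planner: bool = False
    use_embeddings: bool = False

    @property
    def is_pure_baseline(self):
        return not (self.use_llm_planner or self.use_embeddings)


def evaluation_conditions(requested):
    """Return the conditions to run, always starting with the pure baseline."""
    conditions = [EnhancerConfig()]
    for cfg in requested:
        if cfg not in conditions:
            conditions.append(cfg)
    return conditions
```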

Success Criteria

The first V2 round is useful if it achieves at least one of the following without making CodeWell heavier than V1:

  • lower elapsed median on strong_multi_goal
  • lower token median on strong_multi_goal
  • improve first-relevant-file stability on unstable-first-file tasks
  • reduce local-goal drift while preserving the navigation-heavy bucket
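
The first three criteria can be computed from repeated runs with a few lines of standard-library code. In this sketch the run record keys (`elapsed_s`, `tokens`, `first_file`) are assumed names, and first-relevant-file stability is defined, as an assumption, as the share of runs that open the most common first file.

```python
from collections import Counter
from statistics import median


def summarize_runs(runs):
    """Summarize repeated runs of one task under one condition.

    runs: list of dicts with 'elapsed_s', 'tokens', and 'first_file' keys.
    """
    if not runs:
        raise ValueError("need at least one run")
    first_files = Counter(r["first_file"] for r in runs)
    _, modal_count = first_files.most_common(1)[0]
    return {
        "elapsed_median": median(r["elapsed_s"] for r in runs),
        "token_median": median(r["tokens"] for r in runs),
        # 1.0 means every run started from the same file.
        "first_file_stability": modal_count / len(runs),
    }
```

Running the same summary for the V1 control and the V2 condition on each strong_multi_goal task gives a like-for-like comparison.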

What Not To Do Yet

Do not do these in the first V2 round:

  1. add embeddings by default
  2. add a reranker before showing that cheaper logic cannot solve the ambiguity problem
  3. broaden to many new benchmarks before the internal V2 hypothesis is coherent
  4. weaken the V1 lightweight positioning just to create a larger method delta

Recommended Immediate Next Step

The next concrete V2 actions should be:

  1. formalize the two strong_multi_goal subtypes observed in V1
  2. encode them as evaluation labels or diagnostics
  3. design the smallest ambiguity-aware retrieval control that can be ablated against V1

That subtype layer now serves a concrete purpose:

  • stable_first_file_multi_target should test whether CodeWell can stop expanding earlier and help the agent commit faster after reaching the right area
  • unstable_first_file_multi_target should test whether CodeWell can reduce first-hop drift before deeper local context assembly starts

Ordered Implementation Plan

Recommended build order:

  1. stabilize the no-LLM no-embedding decomposition baseline
  2. add structured subgoal records and shared-workspace clustering
  3. add primary-branch versus backup-branch retrieval control
  4. add backup-branch micro-context retention
  5. run ablations on the current strong_multi_goal slice
  6. only then consider an optional LLM planner
  7. only after that consider optional embedding-based clustering or reranking

Near-term execution checklist:

  1. define the structured subgoal schema
  2. define shallow-probe scoring and merge thresholds
  3. implement shared-workspace clustering from lexical and graph overlap
  4. expose primary versus backup branch diagnostics in context output
  5. retain a minimal backup-branch context under leftover budget
  6. re-run the three current repeated-run strong_multi_goal tasks
  7. compare token, elapsed, and first-file stability before any optional model-based enhancer work
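
For checklist items 2 and 3, one simple starting point is to merge subgoals whose shallow-probe candidate files overlap strongly. The sketch below uses Jaccard overlap with single-link merging. It covers only the lexical side: graph overlap is omitted, and the function names and the 0.5 merge threshold are assumptions rather than settled design.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of file paths."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_subgoals(candidates, merge_threshold=0.5):
    """Group subgoals that point at the same workspace region.

    candidates: dict of subgoal id -> iterable of candidate file paths.
    Returns a sorted list of clusters, each a sorted list of subgoal ids.
    """
    ids = list(candidates)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(candidates[a], candidates[b]) >= merge_threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())
```

Single-link merging is the cheapest choice and can chain loosely related subgoals together, so the threshold is one of the values worth ablating.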

Early Smoke Read

Current implementation-side smoke findings from 2026-05-22:

  1. the first V2 baseline did help the unstable-first-file task sample
  2. the same baseline initially regressed on stable-first-file samples
  3. a large part of that regression was caused by overexposing multi-goal diagnostics and backup branch context in the default response path
  4. trimming default output while keeping the branch policy internal flipped one stable-first-file smoke sample back to positive

Current practical rule:

  • keep rich multi-goal diagnostics in debug
  • keep default output lean
  • do not retain backup micro-context when the primary branch is an entrypoint-search branch
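
The practical rule can be expressed as a small response builder. This is a sketch under assumptions: the dict shape with `kind` and `context` keys, the `entrypoint_search` label, and the choice to include the retained backup micro-context in the response are hypothetical and are not taken from the implementation.

```python
def build_response(primary, backup=None, diagnostics=None, debug=False):
    """Assemble context output from branch results.

    primary / backup: dicts with 'kind' and 'context' keys (assumed shape).
    diagnostics: multi-goal diagnostics, only surfaced when debug is True.
    """
    response = {"context": primary["context"]}
    entrypoint_search = primary.get("kind") == "entrypoint_search"
    # Drop the backup micro-context when the primary branch is still
    # searching for an entrypoint.
    if backup is not None and not entrypoint_search:
        response["backup_micro_context"] = backup["context"]
    # Rich diagnostics stay out of the default response path.
    if debug and diagnostics is not None:
        response["diagnostics"] = diagnostics
    return response
```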

Related Files

  • docs/V1_EVALUATION_NOTE.md
  • docs/V1_FREEZE_SUMMARY.md
  • docs/ARCHITECTURE_V2.md
  • docs/AGENT_EVAL_SUMMARY.md
  • docs/V2_PROGRESS_2026-05-23.md