V2 of multi-agent-codegen — the same pipeline, now with a self-evolving tester.
A multi-agent AI code generation pipeline (LangGraph + Claude) where the Tester agent autonomously improves its own test generation strategy over successive generations through self-evaluation, failure analysis, and prompt evolution.
The five-agent pipeline (Orchestrator → Planner → Coder → Reviewer → Tester) is inherited from V1. V2 adds a self-evolution engine that wraps the Tester, observing its performance across batches of pipeline runs and iteratively rewriting its system prompt to produce better tests.
The sidebar shows real-time agent status indicators. A student grade tracker was generated, reviewed (9/10), and passed 79/79 tests on the first attempt.
┌──────────────────────────────────────────────────────────────┐
│ CODE GENERATION PIPELINE (LangGraph StateGraph) │
│ │
│ Orchestrator → Planner → Coder → Reviewer → Tester ──┐ │
│ ▲ │ │
│ └── revision loop ──────────┘ │
└──────────────────────────────────┬───────────────────────────┘
│ test results + metadata
▼
┌──────────────────────────────────────────────────────────────┐
│ SELF-EVOLUTION ENGINE (evolution/) │
│ │
│ Evaluator → Analyzer → Evolver → Tracker │
│ │ │ │
│ │ scores tests │ writes new prompt │
│ │ via LLM-as-Judge │ (prompts/tester_gen_N.txt) │
│ │ │ │
│ └──────────────────────┘ │
│ ▲ │
│ │ next generation uses updated prompt │
└──────────────────────────────────────────────────────────────┘
| Agent | Model (Dev) | Model (Final) | Role |
|---|---|---|---|
| Orchestrator | claude-sonnet-4-6 | claude-opus-4-6 | Interprets the request, sets pipeline status |
| Planner | claude-sonnet-4-6 | claude-sonnet-4-6 | Produces a structured plan: objective, steps, files, dependencies |
| Coder | claude-sonnet-4-6 | claude-sonnet-4-6 | Generates CodeArtifact objects; on revisions, receives review issues and test failures as context |
| Reviewer | claude-sonnet-4-6 | claude-sonnet-4-6 | Scores code 0–10, flags issues and suggestions |
| Tester | claude-sonnet-4-6 | claude-sonnet-4-6 | Generates pytest files for the given prompt generation, runs them in a Docker sandbox |
After the Tester runs, a should_continue router either ends the pipeline (COMPLETED) or loops back to the Coder if tests failed or the review score is below threshold.
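The routing described above could look roughly like this (a minimal sketch; the state keys and the review threshold of 7 are illustrative assumptions, not the actual implementation):

```python
# Hypothetical sketch of the should_continue router.
# REVIEW_THRESHOLD and the state keys are assumptions for illustration.
REVIEW_THRESHOLD = 7

def should_continue(state: dict) -> str:
    """Route to the end node when tests pass and the review score meets
    the bar; otherwise loop back to the Coder for another revision."""
    tests_passed = state.get("tests_passed", False)
    review_score = state.get("review_score", 0)
    if tests_passed and review_score >= REVIEW_THRESHOLD:
        return "end"    # pipeline status -> COMPLETED
    return "coder"      # revision loop
```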
| Component | Model | Role |
|---|---|---|
| Evaluator | claude-haiku-4-5 | LLM-as-Judge: scores each test on bug detection, false failures, redundancy, and coverage category |
| Analyzer | claude-sonnet-4-6 | Identifies the top 3 specific, actionable failure patterns from metrics + raw results |
| Evolver | claude-sonnet-4-6 | Makes surgical edits to the tester prompt — preserving strengths, inserting fix instructions |
| Tracker | — | JSON persistence to experiments/{name}/; cost estimation; maintains running evolution_history.json |
WEIGHTS = {
"bug_detection_rate": 0.30, # higher is better
"false_failure_rate": 0.25, # inverted: lower is better
"coverage_quality": 0.20, # 1-10 normalised to 0-1
"edge_case_coverage": 0.15, # 1-10 normalised to 0-1
"redundancy_rate": 0.10, # inverted: lower is better
}
If a newly evolved prompt causes the overall score to drop by more than 15% relative to the previous generation, the evolution loop automatically reverts to the prior generation's prompt and continues.
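Putting the weights and the revert rule together, the scoring could be sketched as follows (an illustration under assumptions: metric names mirror WEIGHTS, and the "lower is better" rates are assumed to be inverted as `1 - value` before weighting):

```python
# Illustrative weighted scoring; the inversion scheme is an assumption.
WEIGHTS = {
    "bug_detection_rate": 0.30,
    "false_failure_rate": 0.25,
    "coverage_quality": 0.20,
    "edge_case_coverage": 0.15,
    "redundancy_rate": 0.10,
}
INVERTED = {"false_failure_rate", "redundancy_rate"}  # lower is better

def overall_score(metrics: dict) -> float:
    """Weighted sum over 0-1 metrics; inverted metrics contribute (1 - value)."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = metrics[name]
        if name in INVERTED:
            value = 1.0 - value
        total += weight * value
    return total

def should_revert(prev_score: float, new_score: float) -> bool:
    """Revert when the score drops by more than 15% relative to the
    previous generation, per the rule above."""
    return prev_score > 0 and (prev_score - new_score) / prev_score > 0.15
```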
# 1. Clone and install
git clone https://github.com/your-username/self-evolving-codegen.git
cd self-evolving-codegen
pip install -e ".[dev]"
# 2. Set your API key
cp .env.example .env
# open .env and set ANTHROPIC_API_KEY=sk-ant-...
# 3. Launch the Streamlit UI
streamlit run app.py
# 4. Run the evolution loop (separate terminal)
python run_evolution.py --generations 2 --batch-size 2 --experiment micro_run # Phase 4: ~$3-5
python run_evolution.py --generations 5 --batch-size 3 --experiment dev_run # Phase 5: ~$8-12
python run_evolution.py --generations 10 --batch-size 5 --experiment final_run # Phase 6: ~$18-28
# Phase 6 only: enable Opus for the Orchestrator via env var (do NOT edit config.py)
ORCHESTRATOR_MODEL=claude-opus-4-6 python run_evolution.py --generations 10 --batch-size 5 --experiment final_run
Keeping API spend predictable is a first-class concern. Several mechanisms work together:
Every pipeline run is cached to .cache/pipeline_runs/ using a deterministic MD5 key of {task}:gen{generation}. Re-running the evolution loop (e.g. after a bug fix) replays all cached results at $0.00 — only genuinely new task/generation combinations hit the API.
[CACHE HIT] A Python calculator... (gen 0) ← free
[API CALL] A Python linked list... (gen 0) ← costs money, then cached
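The deterministic key described above is simple to reproduce (a sketch of the stated scheme, MD5 over `"{task}:gen{generation}"`; the function name is hypothetical):

```python
import hashlib

def cache_key(task: str, generation: int) -> str:
    """Deterministic cache key: MD5 of '{task}:gen{generation}',
    as described in the caching section."""
    raw = f"{task}:gen{generation}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

Because the key depends only on the task text and the generation number, re-running the same batch always resolves to the same cached file.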
All pipeline invocations go through rate_limited_call(), which adds a 1.5-second delay between calls to respect Anthropic Pro plan rate limits.
Before any API spend, the evolution loop prints an estimate and requires explicit confirmation:
Estimated cost: $4.23 (cached runs cost $0.00)
Continue? [y/N]
| Role | Model | Why |
|---|---|---|
| Evaluator | Haiku (always) | Highest-volume call — ~90% cheaper than Sonnet |
| All other agents | Sonnet (dev) / Opus (final showcase only) | Quality where it matters |
| Opus override | env var only | ORCHESTRATOR_MODEL=claude-opus-4-6 — never hardcoded |
Every LLM call has an explicit max_tokens limit set from config.py to prevent runaway output costs:
MAX_TOKENS = {
"orchestrator": 1500, "planner": 2500, "coder": 4000,
"reviewer": 1500, "tester": 4000, "evaluator": 1500,
"analyzer": 2000, "evolver": 2000,
}
--generations N     Total evolution generations (default: 10)
--batch-size N Coding tasks evaluated per generation (default: 5)
--experiment NAME Experiment identifier used as the output directory name
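The flags above map naturally onto an argparse definition; this is a sketch with names and defaults taken from the text, not necessarily what run_evolution.py does internally:

```python
import argparse

def parse_args(argv=None):
    """CLI flags as documented above (defaults from the text)."""
    p = argparse.ArgumentParser(description="Run the tester evolution loop")
    p.add_argument("--generations", type=int, default=10,
                   help="Total evolution generations")
    p.add_argument("--batch-size", type=int, default=5,
                   help="Coding tasks evaluated per generation")
    p.add_argument("--experiment", required=True,
                   help="Experiment identifier used as the output directory name")
    return p.parse_args(argv)
```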
For each generation, the tracker saves to experiments/{name}/:
experiments/my_run_001/
├── evolution_history.json # running log of all GenerationMetrics
├── metrics_gen_0.json
├── metrics_gen_1.json
├── ...
├── prompt_gen_0.txt # tester prompt used at gen 0
├── prompt_gen_1.txt # evolved prompt for gen 1
├── ...
├── analysis_gen_0.json # failure patterns + proposed fixes
└── evolution_chart.png # 2×2 performance chart
After running the loop, open the Evolution Dashboard tab in the Streamlit app to:
- Browse experiments by name
- View the performance chart
- Compare per-generation metrics in a table
- Diff tester prompts side-by-side
- Read per-generation strengths and weaknesses
All unit tests run against mock data in evolution/mock_data.py — no API calls required.
python -m pytest tests/ -v # run all tests
python -m pytest tests/ -x -v    # stop on first failure
tests/test_models.py — Pydantic model validation and serialization
tests/test_evaluator.py — Scoring and aggregation logic (pure functions)
tests/test_tracker.py — JSON persistence, best-gen selection, cost estimation
tests/test_visualizer.py — Chart generation from mock data
The evolution loop evaluates the tester across 10 coding tasks of varying complexity:
- Python calculator with division-by-zero handling
- FastAPI server with /health and /echo endpoints
- Linked list with insert, delete, search, reverse
- File-based todo list manager
- Password validator (length, case, digits, special chars)
- CSV parser with column statistics
- Token-bucket rate limiter
- LRU cache
- Markdown-to-HTML converter
- Binary search tree
Tasks are drawn cyclically if --batch-size exceeds 10.
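The cyclic draw can be expressed with itertools (a sketch; the task names are abbreviated from the list above, and the helper name is hypothetical):

```python
import itertools

# Abbreviated names for the ten benchmark tasks listed above.
TASKS = [
    "calculator", "fastapi_server", "linked_list", "todo_manager",
    "password_validator", "csv_parser", "rate_limiter", "lru_cache",
    "markdown_converter", "binary_search_tree",
]

def batch_tasks(batch_size: int) -> list[str]:
    """Draw batch_size tasks, wrapping around when batch_size
    exceeds the number of available tasks."""
    return list(itertools.islice(itertools.cycle(TASKS), batch_size))
```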
Evolution is a wrapper, not a core change. The pipeline agents (graph/workflow.py) are untouched. run_evolution.py invokes the pipeline programmatically; evolution components never import from graph/. The Tester is the only agent parameterised by generation — generation=0 produces identical behaviour to V1.
Surgical prompt edits, not rewrites. The Evolver is instructed to preserve the structure and wording that works, insert specific imperative directives, and keep the output under 1000 words to prevent prompt bloat.
Specific failure patterns, not vague advice. The Analyzer prompt enforces a strict rule: every identified pattern must be concrete and cite observable evidence (e.g. "misses division-by-zero in 70% of arithmetic tasks"), never generic (e.g. "should be more thorough").
Cost-efficient judging. The Evaluator uses Claude Haiku to score each test. Sonnet is reserved for the Analyzer and Evolver where reasoning quality matters more. All model assignments are centralised in config.py and overrideable via env vars — no model name is hardcoded in agent files.
Cache-first execution. Every pipeline run is cached to .cache/pipeline_runs/. Re-running the loop (e.g. after fixing a bug in the analyzer) replays all prior results for free. The cache key is deterministic: MD5("{task}:gen{generation}").
| Library | Version | Purpose |
|---|---|---|
| LangGraph | ≥ 0.2.0 | Agent orchestration with conditional edges |
| LangChain Anthropic | ≥ 0.3.0 | Claude API via ChatAnthropic |
| Pydantic | ≥ 2.0.0 | Typed state schemas and structured LLM outputs |
| Streamlit | ≥ 1.40.0 | Web UI with real-time streaming |
| Matplotlib | ≥ 3.8.0 | Evolution performance charts |
| python-dotenv | ≥ 1.0.0 | .env loading |
| Docker | — | Isolated sandbox for test execution |
Python ≥ 3.10 required.
# Required
ANTHROPIC_API_KEY=sk-ant-...
# Model overrides (all default to Sonnet except Evaluator=Haiku — see config.py)
# Uncomment ONLY for the final showcase run (Phase 6):
# ORCHESTRATOR_MODEL=claude-opus-4-6
# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langsmith_key
LANGCHAIN_PROJECT=self-evolving-codegen
self-evolving-codegen/
├── agents/ # Agent implementations
│ ├── orchestrator.py # Uses config.ORCHESTRATOR_MODEL
│ ├── planner.py # Uses config.PLANNER_MODEL
│ ├── coder.py # Uses config.CODER_MODEL
│ ├── reviewer.py # Uses config.REVIEWER_MODEL
│ └── tester.py # Modified: supports generation param
├── evolution/ # Self-evolution engine
│ ├── models.py # TestEffectivenessScore, GenerationMetrics, EvolutionHistory
│ ├── evaluator.py # LLM-as-Judge test scoring (Haiku always)
│ ├── analyzer.py # Failure pattern analysis
│ ├── evolver.py # Prompt rewriter
│ ├── tracker.py # JSON persistence + cost estimation
│ ├── visualize.py # Matplotlib performance charts
│ ├── cache.py # Pipeline result caching + rate limiting
│ └── mock_data.py # Fake data for $0 unit testing
├── graph/
│ └── workflow.py # LangGraph StateGraph (unchanged from V1)
├── models/
│ └── schemas.py # AgentState, TestResult (with V2 optional fields)
├── prompts/
│ ├── tester_gen_0.txt # Base tester prompt
│ └── tester_gen_N.txt # Auto-generated evolved versions
├── sandbox/
│ └── Dockerfile # Isolated test runner
├── tests/ # Unit tests (zero API cost)
│ ├── test_models.py
│ ├── test_evaluator.py
│ ├── test_tracker.py
│ └── test_visualizer.py
├── experiments/ # Saved evolution run data (gitignored)
├── .cache/ # Cached pipeline results (gitignored)
├── config.py # Centralized model names, MAX_TOKENS, cost controls
├── app.py # Streamlit UI with Evolution Dashboard tab
└── run_evolution.py # Main evolution loop
- Claude (Anthropic) — used as the AI assistant throughout development: writing and refactoring code, debugging, and weighing architecture decisions.
