GWVG/agentic-code-repair

Agent Smith

Autonomous code-repair agent that iterates over a task in a secure sandbox using a Thought → Code → Observation loop.

Technical overview

This project is built around a few small, focused components:

  • student/orchestrator.py drives the loop, tracks metrics, and decides when to stop.
  • student/llm_client.py talks to OpenAI-compatible providers, handles retries, and rotates API keys.
  • student/code_extractor.py normalizes provider output into executable Python or tool-call equivalents.
  • student/sandbox.py executes untrusted code with import, filesystem, timeout, and memory restrictions.
  • student/mcp_client.py exposes MCP tools over stdio or streamable HTTP.
  • student/agent_mbpp.py and student/agent_swebench.py provide benchmark-specific CLIs.

The agent supports multiple provider backends through OpenAI-compatible endpoints. Configured providers include Groq, OpenRouter, Google, Together, Mistral, Cohere, Sarvam, and an Artificial Analysis-compatible alias. The code path is provider-agnostic: switching providers changes only the base URL and the set of API keys.
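
As a rough illustration of what "only the base URL and API keys change" can look like, the sketch below bundles those two values into one object and rotates keys round-robin. ProviderConfig and next_key are hypothetical names chosen for this example and are not taken from student/llm_client.py; the Groq URL is the one used in the run commands in this README, and the keys are placeholders.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterator


@dataclass
class ProviderConfig:
    """Hypothetical container for everything that differs between providers."""

    base_url: str
    api_keys: list[str]
    _rotation: Iterator[str] = field(init=False, repr=False, compare=False)

    def __post_init__(self) -> None:
        if not self.api_keys:
            raise ValueError("at least one API key is required")
        # cycle() yields the keys in order and starts over after the last one.
        self._rotation = cycle(self.api_keys)

    def next_key(self) -> str:
        """Return the next key in round-robin order."""
        return next(self._rotation)


groq = ProviderConfig(
    base_url="https://api.groq.com/openai/v1",
    api_keys=["key-a", "key-b"],  # placeholders, not real credentials
)
```

A caller could ask for next_key() before each request, or only after a rate-limit error; which policy the real client uses is not stated here.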

Prerequisites

  • Python 3.10+
  • uv
  • Docker, for MBPP and SWE-bench evaluation runs

Installation

uv sync

Create a .env file at the repository root and add the provider keys you plan to use. The repository includes .env.example with placeholder variable names such as OPENROUTER_API_KEYS, GROQ_API_KEYS, and TOGETHER_API_KEYS.
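
The plural variable names suggest that each one can hold several keys. The sketch below assumes a comma-separated format and assumes the .env values have already been loaded into the process environment; both are assumptions, so check .env.example for the actual format. load_api_keys is a hypothetical helper, not a function from this repository.

```python
import os


def load_api_keys(var_name: str) -> list[str]:
    """Split an environment variable into a list of API keys.

    Assumption: keys are separated by commas. Blank entries and
    surrounding whitespace are dropped; a missing variable yields [].
    """
    raw = os.environ.get(var_name, "")
    return [key.strip() for key in raw.split(",") if key.strip()]
```

For example, load_api_keys("GROQ_API_KEYS") would return an empty list when the variable is unset, which lets the caller skip that provider.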

Run it

Sandbox and agent entry points are exposed as console scripts:

uv run sandbox sandbox_template.json
uv run agent-mbpp --task-file cache/mbpp_task.json --output cache/mbpp_solution.json --model-name "qwen/qwen3-32b" --provider-url "https://api.groq.com/openai/v1"
uv run agent-swebench --task-file cache/swebench_task.json --output cache/swebench_solution.json --model-name "qwen/qwen3-32b" --provider-url "https://api.groq.com/openai/v1"

The Makefile wraps the same workflow with shorter commands such as make mbpp-run, make swebench-run, make exam-mbpp, and make exam-swebench.

Architecture notes

The workflow is intentionally simple:

  1. Load a benchmark task from JSON.
  2. Prompt an LLM through an OpenAI-compatible provider.
  3. Extract Python or tool-call output.
  4. Execute the code in a restricted sandbox.
  5. Feed the observation back into the next iteration.

The sandbox exposes only allowlisted imports and allowlisted filesystem paths, and it enforces a bounded memory budget and a hard execution timeout. Tooling is injected dynamically through MCP, so the same agent loop can work against MBPP helpers or SWE-bench repository tools.

Benchmarks and metrics

This repository includes BENCHMARK_REPORT.md, which compares five models across three SWE-bench tasks and records pass/fail, iterations, token counts, wall-clock time, and provider behavior.

Key results from that report:

  • Overall: 8 / 15 benchmark runs passed
  • Best-performing model in that comparison: qwen/qwen3-32b
  • Fastest provider on average: Groq
  • The report also includes an ablation study and provider reliability analysis

Demo input

The agent is designed to work on benchmark task JSON files produced by the evaluation harness:

  • MBPP tasks for algorithmic Python problems
  • SWE-bench tasks for real-world repository bug fixes

The repo also contains benchmark artifacts under benchmark/ and replay data under evaluations/.

Notes

  • No API keys are hardcoded in source files; credentials are expected from environment variables.
  • The project records step-level metrics for reproducibility and post-run analysis.
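
For readers who want a feel for what a step-level record could contain, here is a hedged sketch built from the quantities the benchmark report is said to track (pass/fail, iterations, token counts, wall-clock time). StepMetrics, its field names, and the JSON Lines output are assumptions for illustration, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StepMetrics:
    """Hypothetical per-iteration record; field names are illustrative."""

    iteration: int
    prompt_tokens: int
    completion_tokens: int
    wall_clock_seconds: float
    passed: bool


def to_jsonl(steps: list[StepMetrics]) -> str:
    """Serialize one JSON object per line so runs can be appended and replayed."""
    return "\n".join(json.dumps(asdict(step)) for step in steps)
```

One JSON object per line keeps the log append-only, which is convenient when a run is interrupted partway through.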
