Autonomous code-repair agent that iterates over a task in a secure sandbox using a Thought → Code → Observation loop.
This project is built around a few small, focused components:
- student/orchestrator.py drives the loop, tracks metrics, and decides when to stop.
- student/llm_client.py talks to OpenAI-compatible providers, handles retries, and rotates API keys.
- student/code_extractor.py normalizes provider output into executable Python or tool-call equivalents.
- student/sandbox.py executes untrusted code with import, filesystem, timeout, and memory restrictions.
- student/mcp_client.py exposes MCP tools over stdio or streamable HTTP.
- student/agent_mbpp.py and student/agent_swebench.py provide benchmark-specific CLIs.
The agent supports multiple provider backends through OpenAI-compatible endpoints. Configured providers include Groq, OpenRouter, Google, Together, Mistral, Cohere, Sarvam, and an Artificial Analysis-compatible alias. The code path is provider-agnostic; only the base URL and API key set change.
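Because only the base URL and key set differ between providers, the per-provider configuration can be pictured as a small lookup table plus a key rotator. The following is a minimal sketch under stated assumptions, not the contents of student/llm_client.py: the Provider dataclass, the PROVIDERS mapping, and the KeyRotator class are hypothetical names, and round-robin rotation is just one plausible strategy.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator


@dataclass(frozen=True)
class Provider:
    """Hypothetical per-provider settings; the rest of the client is shared."""
    base_url: str
    keys_env: str  # name of the environment variable holding the API keys


# Illustrative entries only; the real project may configure these differently.
PROVIDERS = {
    "groq": Provider("https://api.groq.com/openai/v1", "GROQ_API_KEYS"),
    "openrouter": Provider("https://openrouter.ai/api/v1", "OPENROUTER_API_KEYS"),
}


class KeyRotator:
    """Round-robin over a fixed list of API keys (an assumed strategy)."""

    def __init__(self, keys: list[str]) -> None:
        if not keys:
            raise ValueError("at least one API key is required")
        self._keys: Iterator[str] = cycle(keys)

    def next_key(self) -> str:
        """Return the next key, wrapping around at the end of the list."""
        return next(self._keys)
```

Switching providers in this sketch means picking a different PROVIDERS entry; the request-building code would stay the same.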
- Python 3.10+
- uv
- Docker, for MBPP and SWE-bench evaluation runs
uv sync

Create a .env file at the repository root and add the provider keys you plan to use. The repository includes .env.example with placeholder variable names such as OPENROUTER_API_KEYS, GROQ_API_KEYS, and TOGETHER_API_KEYS.
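As a rough illustration of how such variables might be consumed, the sketch below parses a multi-key variable into a list. It assumes each *_API_KEYS value is a comma-separated string, which the plural names suggest but the repository does not confirm here, and that the variables are already present in the process environment. The helper names parse_keys and load_keys are hypothetical.

```python
import os
from typing import Mapping


def parse_keys(raw: str) -> list[str]:
    """Split a comma-separated key string, dropping blanks and whitespace."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def load_keys(var_name: str, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the keys stored under var_name, or an empty list if it is unset."""
    return parse_keys(environ.get(var_name, ""))
```

For example, load_keys("GROQ_API_KEYS") would return every key found in that variable, or an empty list when it is missing.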
Sandbox and agent entry points are exposed as console scripts:
uv run sandbox sandbox_template.json
uv run agent-mbpp --task-file cache/mbpp_task.json --output cache/mbpp_solution.json --model-name "qwen/qwen3-32b" --provider-url "https://api.groq.com/openai/v1"
uv run agent-swebench --task-file cache/swebench_task.json --output cache/swebench_solution.json --model-name "qwen/qwen3-32b" --provider-url "https://api.groq.com/openai/v1"

The Makefile wraps the same workflow with shorter commands such as make mbpp-run, make swebench-run, make exam-mbpp, and make exam-swebench.
The workflow is intentionally simple:
- Load a benchmark task from JSON.
- Prompt an LLM through an OpenAI-compatible provider.
- Extract Python or tool-call output.
- Execute the code in a restricted sandbox.
- Feed the observation back into the next iteration.
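The steps above can be sketched as a short loop. This is an illustrative outline rather than the orchestrator's actual code: extract_code and run_loop are hypothetical names, the LLM call and the sandbox are passed in as plain callables, and stopping when a reply contains no code is an assumed rule.

```python
import re
from typing import Callable, Optional

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> Optional[str]:
    """Return the first fenced code block in an LLM reply, or None if absent."""
    match = CODE_BLOCK.search(reply)
    return match.group(1).strip() if match else None


def run_loop(
    task: str,
    ask_llm: Callable[[str], str],
    execute: Callable[[str], str],
    max_iterations: int = 5,
) -> list[str]:
    """Alternate between prompting and executing; return the observations seen."""
    observations: list[str] = []
    prompt = task
    for _ in range(max_iterations):
        reply = ask_llm(prompt)        # Thought, plus proposed code
        code = extract_code(reply)     # Code
        if code is None:
            break                      # assumed stop rule: no code means done
        observation = execute(code)    # Observation
        observations.append(observation)
        # Feed the latest observation back into the next prompt.
        prompt = f"{task}\n\nPrevious observation:\n{observation}"
    return observations
```

Passing the model and the executor in as callables keeps the loop testable without network access or a real sandbox.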
The sandbox only exposes allowlisted imports, allowlisted filesystem paths, a bounded memory budget, and a hard execution timeout. Tooling is injected dynamically through MCP so the same agent loop can work against MBPP helpers or SWE-bench repository tools.
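To make two of these restrictions concrete, here is a minimal sketch of an import allowlist and a hard timeout. It is not the implementation in student/sandbox.py: blocked_imports and run_restricted are hypothetical names, the check is a static AST scan that does not catch dynamic imports such as __import__, and filesystem and memory limits are left out entirely.

```python
import ast
import subprocess
import sys


def blocked_imports(code: str, allowed: set[str]) -> set[str]:
    """Return top-level module names imported by code that are not in allowed."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - allowed


def run_restricted(code: str, allowed: set[str], timeout_s: float = 5.0) -> str:
    """Run code in a child interpreter and return a text observation."""
    bad = blocked_imports(code, allowed)
    if bad:
        # Refuse before spawning anything if a disallowed import is present.
        return f"blocked imports: {sorted(bad)}"
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"timed out after {timeout_s}s"
    return result.stdout + result.stderr
```

A production sandbox would need stronger isolation than this; the sketch only shows how an allowlist check and a timeout can produce an observation string for the agent loop.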
This repository includes BENCHMARK_REPORT.md, which compares five models across three SWE-bench tasks and records pass/fail, iterations, token counts, wall-clock time, and provider behavior.
Key results from that report:
- Overall: 8 / 15 benchmark runs passed
- Best-performing model in that slice: qwen/qwen3-32b
- Fastest provider on average: Groq
- The report also includes an ablation study and provider reliability analysis
The agent is designed to work on benchmark task JSON files produced by the evaluation harness:
- MBPP tasks for algorithmic Python problems
- SWE-bench tasks for real-world repository bug fixes
The repo also contains benchmark artifacts under benchmark/ and replay data under evaluations/.
- No API keys are hardcoded in source files; credentials are expected from environment variables.
- The project records step-level metrics for reproducibility and post-run analysis.
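One possible shape for such step-level records is sketched below. The StepMetrics fields mirror the quantities the benchmark report tracks (iterations, token counts, wall-clock time, pass/fail), but the class, its field names, and the JSON Lines output are assumptions made for illustration, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StepMetrics:
    """Hypothetical per-step record; the project's actual fields may differ."""
    iteration: int
    prompt_tokens: int
    completion_tokens: int
    wall_seconds: float
    passed: bool


def to_jsonl(steps: list[StepMetrics]) -> str:
    """Serialize steps as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in steps)
```

Writing one JSON object per line keeps the log append-friendly and easy to load for post-run analysis.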