Agent Smith

Autonomous code-repair agent that iterates over a task in a secure sandbox using a Thought → Code → Observation loop.

Technical overview

This project is built around a few small, focused components:

student/orchestrator.py drives the loop, tracks metrics, and decides when to stop.
student/llm_client.py talks to OpenAI-compatible providers, handles retries, and rotates API keys.
student/code_extractor.py normalizes provider output into executable Python or tool-call equivalents.
student/sandbox.py executes untrusted code with import, filesystem, timeout, and memory restrictions.
student/mcp_client.py exposes MCP tools over stdio or streamable HTTP.
student/agent_mbpp.py and student/agent_swebench.py provide benchmark-specific CLIs.

The agent supports multiple provider backends through OpenAI-compatible endpoints. Configured providers include Groq, OpenRouter, Google, Together, Mistral, Cohere, Sarvam, and an Artificial Analysis-compatible alias. The code path is provider-agnostic; only the base URL and API key set change.
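
The retry and key-rotation behavior can be sketched roughly as follows. This is not the code in student/llm_client.py: load_keys, call_with_rotation, the comma-separated key format, and the use of RuntimeError as a stand-in for a provider failure are all assumptions made for illustration. The network call is passed in as a plain function so the sketch runs without any provider.

```python
import itertools
import os
from typing import Callable


def load_keys(env_var: str) -> list[str]:
    """Split a key list from the environment (comma-separated format assumed)."""
    raw = os.environ.get(env_var, "")
    return [key.strip() for key in raw.split(",") if key.strip()]


def call_with_rotation(
    send: Callable[[str, str], str],
    base_url: str,
    keys: list[str],
    max_attempts: int = 3,
) -> str:
    """Call send(base_url, key), moving to the next key after each failure."""
    if not keys:
        raise ValueError("no API keys configured")
    key_cycle = itertools.cycle(keys)
    last_error: Exception | None = None
    for _ in range(max_attempts):
        key = next(key_cycle)
        try:
            return send(base_url, key)
        except RuntimeError as exc:  # stand-in for a rate-limit or auth error
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

Because only base_url and keys vary, the same function would serve any of the listed providers under these assumptions.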

Prerequisites

Python 3.10+
uv
Docker, for MBPP and SWE-bench evaluation runs

Installation

uv sync

Create a .env file at the repository root and add the provider keys you plan to use. The repository includes .env.example with placeholder variable names such as OPENROUTER_API_KEYS, GROQ_API_KEYS, and TOGETHER_API_KEYS.

Run it

Sandbox and agent entry points are exposed as console scripts:

uv run sandbox sandbox_template.json
uv run agent-mbpp --task-file cache/mbpp_task.json --output cache/mbpp_solution.json --model-name "qwen/qwen3-32b" --provider-url "https://api.groq.com/openai/v1"
uv run agent-swebench --task-file cache/swebench_task.json --output cache/swebench_solution.json --model-name "qwen/qwen3-32b" --provider-url "https://api.groq.com/openai/v1"

The Makefile wraps the same workflow with shorter commands such as make mbpp-run, make swebench-run, make exam-mbpp, and make exam-swebench.

Architecture notes

The workflow is intentionally simple:

Load a benchmark task from JSON.
Prompt an LLM through an OpenAI-compatible provider.
Extract Python or tool-call output.
Execute the code in a restricted sandbox.
Feed the observation back into the next iteration.
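
The steps above can be sketched as a small driver function. The names run_loop and RunResult, the callable parameters, and the prompt format are hypothetical; the real orchestrator also tracks metrics and has its own stopping rules, which this sketch reduces to a fixed iteration cap.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunResult:
    solved: bool
    iterations: int
    observations: list[str] = field(default_factory=list)


def run_loop(
    task: str,
    ask_llm: Callable[[str], str],
    extract: Callable[[str], Optional[str]],
    execute: Callable[[str], tuple[bool, str]],
    max_iterations: int = 5,
) -> RunResult:
    """Prompt, extract, execute, and feed the observation back until success."""
    prompt = task
    observations: list[str] = []
    for iteration in range(1, max_iterations + 1):
        reply = ask_llm(prompt)
        code = extract(reply)
        if code is None:
            # A reply without code still yields an observation for the next turn.
            ok, observation = False, "no executable code found in reply"
        else:
            ok, observation = execute(code)
        observations.append(observation)
        if ok:
            return RunResult(True, iteration, observations)
        prompt = f"{task}\n\nObservation from attempt {iteration}:\n{observation}"
    return RunResult(False, max_iterations, observations)
```

Passing the LLM call, extractor, and executor in as functions keeps the loop independent of any particular provider or benchmark.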

The sandbox only exposes allowlisted imports, allowlisted filesystem paths, a bounded memory budget, and a hard execution timeout. Tooling is injected dynamically through MCP so the same agent loop can work against MBPP helpers or SWE-bench repository tools.
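
To make two of these restrictions concrete, here is a minimal sketch of an import allowlist combined with a hard timeout. It is not the implementation in student/sandbox.py: ALLOWED_IMPORTS, disallowed_imports, and run_restricted are hypothetical, the allowlist contents are invented, and filesystem and memory limits are left out. The import check is static, so it does not catch dynamic imports such as calls to __import__; a real sandbox needs stronger isolation than this.

```python
import ast
import subprocess
import sys

ALLOWED_IMPORTS = {"math", "re", "itertools"}  # hypothetical allowlist


def disallowed_imports(source: str) -> set[str]:
    """Return top-level modules imported by source that are not allowlisted."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED_IMPORTS


def run_restricted(source: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Reject non-allowlisted imports, then run source in a child interpreter."""
    try:
        blocked = disallowed_imports(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc}"
    if blocked:
        return False, f"blocked imports: {sorted(blocked)}"
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    return proc.returncode == 0, proc.stdout + proc.stderr
```

The (ok, output) return value is the kind of observation that could be fed back into the next iteration of the agent loop.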

Benchmarks and metrics

This repository includes BENCHMARK_REPORT.md, which compares five models across three SWE-bench tasks and records pass/fail, iterations, token counts, wall-clock time, and provider behavior.

Key results from that report:

Overall: 8 / 15 benchmark runs passed
Best-performing model in that slice: qwen/qwen3-32b
Fastest provider on average: Groq
The report also includes an ablation study and a provider reliability analysis.

Demo input

The agent is designed to work on benchmark task JSON files produced by the evaluation harness:

MBPP tasks for algorithmic Python problems
SWE-bench tasks for real-world repository bug fixes

The repo also contains benchmark artifacts under benchmark/ and replay data under evaluations/.

Notes

No API keys are hardcoded in source files; credentials are expected from environment variables.
The project records step-level metrics for reproducibility and post-run analysis.
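
As a rough picture of what a step-level record could look like, the sketch below serializes one entry per iteration to JSON Lines. The StepMetric fields are assumptions loosely based on the quantities BENCHMARK_REPORT.md is said to record (iterations, token counts, wall-clock time, pass/fail); the project's actual schema may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StepMetric:
    # Hypothetical field names, chosen for this example only.
    iteration: int
    prompt_tokens: int
    completion_tokens: int
    wall_seconds: float
    passed: bool


def to_jsonl(steps: list[StepMetric]) -> str:
    """Serialize one JSON object per line so runs can be replayed or diffed."""
    return "\n".join(json.dumps(asdict(step), sort_keys=True) for step in steps)
```

One object per line keeps the log appendable during a run and easy to load afterwards for post-run analysis.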

