Benchmark framework for evaluating the Ripperdoc AI coding agent with the Harbor framework.
This repository integrates Ripperdoc as a custom agent in the Harbor benchmark framework, enabling standardized evaluation on various coding benchmarks including terminal-bench-2.
ripperdoc-benchmark/
├── agents/
│ ├── ripperdoc.py # Harbor agent wrapper for Ripperdoc
│ └── install-ripperdoc.sh.j2 # Installation template for containers
├── pyproject.toml # Project configuration
├── setup.py # Package setup
└── README.md # This file
- Ripperdoc SDK: Located at /mnt/hdd1/QuantmewRipperdoc
- Harbor Framework: https://github.com/laude-institute/harbor
- Python: 3.10+
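
The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of this repository: the function name `check_prerequisites` is hypothetical, the SDK path is the one listed above, and the minimum version is Harbor's stated 3.12 requirement.

```python
import sys
from pathlib import Path

# Values taken from this README; adjust them for your machine.
MIN_PYTHON = (3, 12)  # Harbor's stated minimum version
SDK_PATH = Path("/mnt/hdd1/QuantmewRipperdoc")


def check_prerequisites(version=None, sdk_path=SDK_PATH, min_python=MIN_PYTHON):
    """Return a list of problems found; an empty list means every check passed.

    `version` defaults to the running interpreter's (major, minor) pair.
    """
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version < min_python:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{min_python[0]}.{min_python[1]}+ required"
        )
    if not Path(sdk_path).is_dir():
        problems.append(f"Ripperdoc SDK directory not found: {sdk_path}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"WARNING: {problem}")
```

Returning a list rather than raising on the first failure lets a single run report every missing prerequisite.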
Harbor requires Python 3.12+. Install from source:
git clone https://github.com/laude-institute/harbor.git
cd harbor
pip install -e .

Then install this repository:
cd /mnt/hdd1/xiahan_github/ripperdoc-benchmark
pip install -e .

Configure your API keys as environment variables:
# For Anthropic Claude models
export ANTHROPIC_API_KEY="your-api-key"
# For OpenAI models
export OPENAI_API_KEY="your-api-key"
# For DeepSeek models
export DEEPSEEK_API_KEY="your-api-key"

The ripperdoc conda environment has Python 3.12 and Harbor pre-installed.
# Activate the environment
conda activate ripperdoc
# Run tests
python test_agent.py
# Run benchmark with Ripperdoc
./run_ripperdoc_benchmark.sh
# Run with Terminus-2 for comparison
./run_ripperdoc_benchmark.sh --terminus

First, test the framework with Terminus-2:
harbor run -d terminal-bench@2.0 --agent terminus-2

Use the --agent-import-path flag to specify the Ripperdoc agent:
harbor run \
-d terminal-bench@2.0 \
--agent-import-path agents.ripperdoc:Ripperdoc \
--model glm-4.7

Specify different models using the --model flag:
# GLM-4.7 (default)
harbor run -d terminal-bench@2.0 --agent-import-path agents.ripperdoc:Ripperdoc --model glm-4.7
# Claude Sonnet
harbor run -d terminal-bench@2.0 --agent-import-path agents.ripperdoc:Ripperdoc --model claude/claude-sonnet-4-20250514
# GPT-4
harbor run -d terminal-bench@2.0 --agent-import-path agents.ripperdoc:Ripperdoc --model openai/gpt-4

# Set maximum thinking tokens for reasoning
export MAX_THINKING_TOKENS=20000
harbor run -d terminal-bench@2.0 --agent-import-path agents.ripperdoc:Ripperdoc --model glm-4.7
# Run on specific tasks
harbor run -d terminal-bench@2.0 --agent-import-path agents.ripperdoc:Ripperdoc --task-ids task_1,task_2

Currently using terminal-bench-2 from https://github.com/laude-institute/terminal-bench-2/
Future datasets may be added for custom evaluation.
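
Because the dataset, model, and task subset are all selected by command-line flags, sweeping over several configurations can be scripted. The sketch below assembles the `harbor run` argument list using only the flags shown in this README (`-d`, `--agent-import-path`, `--model`, `--task-ids`). The helper name `build_harbor_command` is hypothetical, and the sketch only prints the commands rather than executing them.

```python
import shlex


def build_harbor_command(
    model,
    dataset="terminal-bench@2.0",
    agent_import_path="agents.ripperdoc:Ripperdoc",
    task_ids=None,
):
    """Build the argument list for one `harbor run` invocation."""
    cmd = [
        "harbor", "run",
        "-d", dataset,
        "--agent-import-path", agent_import_path,
        "--model", model,
    ]
    if task_ids:
        # Task IDs are passed as one comma-separated value.
        cmd += ["--task-ids", ",".join(task_ids)]
    return cmd


# Print one command per model; pass the list to subprocess.run to execute it.
for model in ("glm-4.7", "openai/gpt-4"):
    print(shlex.join(build_harbor_command(model)))
```

Keeping the command as a list avoids shell-quoting problems if a future dataset or model name contains unusual characters.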
The Ripperdoc agent (agents/ripperdoc.py) implements Harbor's BaseInstalledAgent interface:
- Installation: Uses a Jinja2 template to install Ripperdoc in the container
- Execution: Runs Ripperdoc headlessly using the Python SDK
- Trajectory: Converts Ripperdoc history to ATIF format for analysis
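
The trajectory step can be illustrated with a simplified sketch. It is not the code in agents/ripperdoc.py and does not reproduce the real ATIF schema: the event shape (`type`, `text`), the output fields (`step_id`, `source`, `message`), and the function name `convert_events_to_steps` are all assumptions made for illustration.

```python
# Hypothetical mapping from raw event types to trajectory step sources.
SOURCE_BY_EVENT_TYPE = {
    "user": "user",
    "assistant": "agent",
    "tool_result": "environment",
}


def convert_events_to_steps(events):
    """Convert raw agent events into sequentially numbered trajectory steps.

    Events whose type is not in SOURCE_BY_EVENT_TYPE are skipped.
    """
    steps = []
    for event in events:
        source = SOURCE_BY_EVENT_TYPE.get(event.get("type"))
        if source is None:
            continue  # ignore events with no trajectory counterpart
        steps.append(
            {
                "step_id": len(steps) + 1,  # numbering counts kept steps only
                "source": source,
                "message": event.get("text", ""),
            }
        )
    return steps


history = [
    {"type": "user", "text": "Fix the failing test"},
    {"type": "debug", "text": "internal bookkeeping"},
    {"type": "assistant", "text": "Patched the off-by-one error"},
]
steps = convert_events_to_steps(history)
```

Numbering steps after filtering keeps the step IDs contiguous even when the raw history contains events that are dropped.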
To extend Ripperdoc's capabilities for benchmarking:
- Modify agents/ripperdoc.py to add new tool support
- Update the ALLOWED_TOOLS list for tool filtering
- Adjust trajectory conversion in _convert_events_to_trajectory()
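
As an illustration of the tool-filtering idea, here is a minimal sketch of allowlist filtering. The tool names and the helper `filter_tools` are hypothetical; the actual contents of ALLOWED_TOOLS and the way agents/ripperdoc.py applies it may differ.

```python
# Hypothetical allowlist; the real ALLOWED_TOOLS in agents/ripperdoc.py may differ.
ALLOWED_TOOLS = ["Bash", "Read", "Edit"]


def filter_tools(requested, allowed=ALLOWED_TOOLS):
    """Return the requested tool names that appear in the allowlist.

    The order of `requested` is preserved.
    """
    allowed_set = set(allowed)  # set lookup keeps membership checks O(1)
    return [name for name in requested if name in allowed_set]


enabled = filter_tools(["Read", "WebSearch", "Bash"])
```

With this shape, adding support for a new tool means appending its name to the allowlist, and anything not listed is excluded by default.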
Harbor requires Python 3.12+. If you have Python 3.9:
# Install pyenv and switch to Python 3.12
pyenv install 3.12
pyenv local 3.12

Ensure the SDK is accessible:
ls -la /mnt/hdd1/QuantmewRipperdoc

If missing, install from source:
pip install -e /mnt/hdd1/QuantmewRipperdoc

Verify your environment variables are set:
echo $ANTHROPIC_API_KEY
echo $OPENAI_API_KEY

Apache License 2.0 - see LICENSE file for details.
- Ripperdoc - Open-source AI coding agent
- Harbor Framework - Agent benchmark framework
- Terminal Bench 2 - Benchmark dataset