Motivation
Enable SWE-bench accuracy evaluation, aligned with the performance dataset that will be used in the agentic inference benchmark.
Proposed Solution
Call mini-swe-agent directly, using a modified version of https://github.com/SWE-agent/mini-swe-agent/blob/main/src/minisweagent/config/benchmarks/swebench.yaml that supports a custom model and custom sampling parameters. Wait for the run to finish, then use the SWE-bench evaluation harness to verify the pass rate and collect the results.
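The two steps above could be driven by something like the following minimal sketch. It is not a settled design: the helper names (`build_commands`, `run_pipeline`), the `preds.json` file name, and the default `run_id` are assumptions for illustration, and the exact `mini-extra swebench` and `swebench.harness.run_evaluation` flags should be checked against the installed versions before relying on them.

```python
import subprocess


def build_commands(model, config_path, output_dir, subset="lite", split="dev",
                   dataset_name="princeton-nlp/SWE-bench_Lite",
                   run_id="lite-dev", workers=4):
    """Return (agent_cmd, eval_cmd) as argument lists; nothing is executed here."""
    # Step 1: batch run of mini-swe-agent with the modified config.
    # Flag names are assumed from the mini-swe-agent CLI; verify with --help.
    agent_cmd = [
        "mini-extra", "swebench",
        "--subset", subset,
        "--split", split,
        "--model", model,
        "--config", config_path,
        "--output", output_dir,
        "--workers", str(workers),
    ]
    # Step 2: SWE-bench harness scores the generated patches.
    # Assumes the agent writes its predictions to <output_dir>/preds.json.
    eval_cmd = [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--split", split,
        "--predictions_path", f"{output_dir}/preds.json",
        "--max_workers", str(workers),
        "--run_id", run_id,
    ]
    return agent_cmd, eval_cmd


def run_pipeline(model, config_path, output_dir, **kwargs):
    """Run the agent, wait for it to finish, then run the evaluation."""
    agent_cmd, eval_cmd = build_commands(model, config_path, output_dir, **kwargs)
    # subprocess.run blocks until the command exits; check=True raises on failure,
    # so the evaluation never starts on a failed or partial agent run.
    subprocess.run(agent_cmd, check=True)
    subprocess.run(eval_cmd, check=True)
```

Keeping command construction separate from execution makes the invocation easy to log and to unit-test without launching containers.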
Start with the princeton-nlp/SWE-bench_Lite dev split (23 samples), then extend to the test split (300 samples) or to SWE-bench_Verified as needed.
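As a rough illustration of how the pass rate could be reported for these splits, the sketch below divides the number of resolved instances by the full split size, so that instances with missing or failed patches count against the score. The names `EXPECTED_SIZES` and `pass_rate` are hypothetical, and how the resolved instance IDs are read from the harness report is left open here.

```python
# Split sizes quoted in this proposal; extend the table when adding
# SWE-bench_Verified or other datasets.
EXPECTED_SIZES = {
    ("princeton-nlp/SWE-bench_Lite", "dev"): 23,
    ("princeton-nlp/SWE-bench_Lite", "test"): 300,
}


def pass_rate(resolved_ids, dataset_name, split):
    """Fraction of the whole split that was resolved (hypothetical helper)."""
    total = EXPECTED_SIZES[(dataset_name, split)]
    resolved = set(resolved_ids)  # de-duplicate instance IDs
    if len(resolved) > total:
        raise ValueError(
            f"{len(resolved)} resolved instances exceeds split size {total}"
        )
    # Denominator is the split size, not the number of submitted patches,
    # so skipped or errored instances are treated as failures.
    return len(resolved) / total
```

For example, two resolved instances on the dev split would give `pass_rate([...], "princeton-nlp/SWE-bench_Lite", "dev")` equal to 2/23.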
Alternatives Considered
No response
Additional Context
No response