[Feature]: SWE-bench using mini-swe-agent #310

@tianmu-li

Motivation

Enable SWE-bench accuracy evaluation, aligned with the performance dataset that will be used in the agentic inference benchmark.

Proposed Solution

Call mini-swe-agent directly using a modified version of https://github.com/SWE-agent/mini-swe-agent/blob/main/src/minisweagent/config/benchmarks/swebench.yaml that supports custom model and sampling parameters. Wait for the run to finish, then use SWE-bench to verify the pass rate and collect results.
Start with the princeton-nlp/SWE-bench_Lite dev split (23 samples), then extend to the test split (300 samples) or SWE-bench_Verified as needed.
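
A rough sketch of the two-step flow described above (run the agent, then score the predictions), to make the proposal concrete. This is not an implementation: the `mini-extra swebench` and `swebench.harness.run_evaluation` flags are written from memory and should be checked against the installed versions, and the helper names, the `preds.json` file name, and the report keys `resolved_instances` / `total_instances` are assumptions for illustration.

```python
import subprocess
import sys
from pathlib import Path

DATASET = "princeton-nlp/SWE-bench_Lite"


def agent_cmd(model, config, out_dir, subset="lite", split="dev", workers=4):
    """Build the mini-swe-agent batch command (flags are assumptions)."""
    return [
        "mini-extra", "swebench",
        "--model", model,
        "--config", str(config),   # modified swebench.yaml with sampling params
        "--subset", subset,
        "--split", split,
        "--workers", str(workers),
        "--output", str(out_dir),
    ]


def eval_cmd(preds_path, run_id, split="dev", workers=4):
    """Build the SWE-bench harness command (flags are assumptions)."""
    return [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", DATASET,
        "--split", split,
        "--predictions_path", str(preds_path),
        "--max_workers", str(workers),
        "--run_id", run_id,
    ]


def pass_rate(report):
    """Fraction of resolved instances; the key names are assumed."""
    total = report["total_instances"]
    return report["resolved_instances"] / total if total else 0.0


def run_pipeline(model, config, out_dir, run_id):
    """Run the agent to completion, then evaluate its predictions."""
    out_dir = Path(out_dir)
    # check=True blocks until the agent run finishes and raises on failure.
    subprocess.run(agent_cmd(model, config, out_dir), check=True)
    # Assumes the agent writes its predictions to preds.json in out_dir.
    subprocess.run(eval_cmd(out_dir / "preds.json", run_id), check=True)
```

Calling `run_pipeline("my-model", "swebench_custom.yaml", "runs/lite-dev", "lite-dev-001")` would cover the 23-sample dev split; all four arguments here are placeholders. Moving to the test split would mean passing `split="test"` to both command builders.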

Alternatives Considered

No response

Additional Context

No response
