[Feature]: SWE-bench using mini-swe-agent #310

### Motivation

Enable SWE-bench accuracy evaluation, aligned with the performance dataset to be used in the agentic inference benchmark.

### Proposed Solution

Call mini-swe-agent directly using a modified version of https://github.com/SWE-agent/mini-swe-agent/blob/main/src/minisweagent/config/benchmarks/swebench.yaml that supports a custom model and custom sampling parameters. Wait for the run to finish, then use SWE-bench to verify the pass rate and collect the results.
Start with the princeton-nlp/SWE-bench_Lite dev split (23 samples), then extend to the test split (300 samples) or SWE-bench_Verified as needed.
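
As a rough illustration of the two pieces above (overriding the model and sampling parameters in the benchmark config, then computing a pass rate), here is a minimal sketch using only the standard library. It assumes the config is already parsed into a dict and that model settings live under `model.model_name` and `model.model_kwargs`; that layout, the helper names (`deep_merge`, `with_model_overrides`, `pass_rate`), and the example values are all hypothetical and should be checked against the actual `swebench.yaml` and the evaluation output.

```python
import copy
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or new keys: the override wins.
            merged[key] = copy.deepcopy(value)
    return merged


def with_model_overrides(
    base_config: dict[str, Any], model_name: str, sampling: dict[str, Any]
) -> dict[str, Any]:
    """Overlay a custom model and sampling parameters on a parsed config.

    ASSUMPTION: the `model.model_name` / `model.model_kwargs` layout is a
    guess at the config structure, not a confirmed schema.
    """
    overrides = {"model": {"model_name": model_name, "model_kwargs": sampling}}
    return deep_merge(base_config, overrides)


def pass_rate(resolved_ids: list[str], all_ids: list[str]) -> float:
    """Fraction of evaluated instances that were resolved."""
    all_set = set(all_ids)
    if not all_set:
        raise ValueError("no instances were evaluated")
    # Ignore resolved IDs that are not part of the evaluated set.
    return len(set(resolved_ids) & all_set) / len(all_set)


# Hypothetical stand-in for the parsed base config.
base = {
    "agent": {"step_limit": 250},
    "model": {"model_name": "placeholder", "model_kwargs": {"drop_params": True}},
}
cfg = with_model_overrides(base, "my-org/my-model", {"temperature": 0.0, "top_p": 1.0})
rate = pass_rate(["a", "b"], ["a", "b", "c", "d"])
```

The merge returns a new dict rather than mutating the base config, so a single base can be reused across several model and sampling combinations. Reading and writing the YAML file itself is left out and would need a YAML library.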

### Alternatives Considered

_No response_

### Additional Context

_No response_
