agent-eval

Tools for evaluating the performance of agents on Primer-related tasks.

Overview

This project provides a framework for running experiments. In each experiment, we define treatments that set up the environment for the agent before it completes the evaluations. Each evaluation (eval) represents a scenario in which, given a prompt, we evaluate how the agent behaves.

This framework allows us to test the effectiveness of different experimental treatments on the evaluations we care about. In particular, we score results based on:

Correctness: how many of the tests the agent's output passes
Cost: how much the agent spent on API calls
Latency: how long the agent took to complete the evals
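
To make the three scores concrete, here is a minimal sketch of how per-run results could be rolled up per treatment. The `EvalResult` and `TreatmentSummary` shapes, their field names, and the `summarize` helper are all hypothetical, invented for illustration; the framework's actual result types and aggregation may differ.

```typescript
// Hypothetical shape for one eval run. These field names are assumptions,
// not the framework's actual types.
interface EvalResult {
  treatment: string
  testsPassed: number
  testsTotal: number
  costUsd: number // API spend for this run
  durationMs: number // wall-clock time for this run
}

interface TreatmentSummary {
  treatment: string
  passRate: number // correctness: fraction of tests passed, 0..1
  totalCostUsd: number // cost: summed API spend
  meanDurationMs: number // latency: average time per run
}

// Group runs by treatment, then compute one summary per group.
function summarize(results: EvalResult[]): TreatmentSummary[] {
  const byTreatment = new Map<string, EvalResult[]>()
  for (const result of results) {
    const group = byTreatment.get(result.treatment) ?? []
    group.push(result)
    byTreatment.set(result.treatment, group)
  }

  const summaries: TreatmentSummary[] = []
  byTreatment.forEach((group, treatment) => {
    const passed = group.reduce((sum, r) => sum + r.testsPassed, 0)
    const total = group.reduce((sum, r) => sum + r.testsTotal, 0)
    const cost = group.reduce((sum, r) => sum + r.costUsd, 0)
    const duration = group.reduce((sum, r) => sum + r.durationMs, 0)
    summaries.push({
      treatment,
      // Avoid dividing by zero when a treatment ran no tests.
      passRate: total === 0 ? 0 : passed / total,
      totalCostUsd: cost,
      meanDurationMs: duration / group.length,
    })
  })
  return summaries
}
```

Summarizing per treatment keeps the comparison aligned with what an experiment is meant to answer: whether one treatment improves correctness, and at what cost and latency.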

Experiments

Experiments live in ./experiments/src. Each experiment is a new TypeScript file that exports an experiment config:

import type {ExperimentConfig} from '@primer/agent-experiment'

export const experiment: ExperimentConfig = {
  name: 'Example experiment',
  description: 'Experiment config demonstrating different options',
  models: ['gpt-5.5', 'claude-opus-4.7', 'claude-sonnet-4.6'],
  evals: ['001-agent-uses-button-from-primer', '002-agent-uses-octicon-from-primer'],
  treatments: [
    {
      name: 'Treatment one',
      async setup({sandbox}) {
        // ...
      },
    },
    {
      name: 'Treatment two',
      async setup({sandbox}) {
        // ...
      },
    },
  ],
}

The experiment config will specify:

A name for the experiment
A description that explains what the experiment is for
Models against which you would like to run the experiment
Evaluations (evals) that are used to grade the output of the model
Treatments that specify the different conditions you would like to test (for example, testing with an MCP server versus without)
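
One way to read these fields together is as a grid of runs. The sketch below assumes the framework runs every model against every eval under every treatment; the source does not confirm this. `ExperimentSketch`, `PlannedRun`, and `planRuns` are hypothetical names, and the interface is a simplified stand-in for the real `ExperimentConfig` type.

```typescript
// Simplified stand-in for the config shape shown above. The real
// ExperimentConfig type from '@primer/agent-experiment' may differ.
interface Treatment {
  name: string
  setup(context: {sandbox: unknown}): Promise<void> | void
}

interface ExperimentSketch {
  name: string
  description: string
  models: string[]
  evals: string[]
  treatments: Treatment[]
}

interface PlannedRun {
  model: string
  evalName: string
  treatment: string
}

// Hypothetical helper: list every model x eval x treatment combination.
// Assumption: the framework runs the full cross product.
function planRuns(config: ExperimentSketch): PlannedRun[] {
  const runs: PlannedRun[] = []
  for (const model of config.models) {
    for (const evalName of config.evals) {
      for (const treatment of config.treatments) {
        runs.push({model, evalName, treatment: treatment.name})
      }
    }
  }
  return runs
}
```

Under that assumption, the example config above (three models, two evals, two treatments) would yield 3 × 2 × 2 = 12 runs, which is worth keeping in mind when estimating cost before adding another model or treatment.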
