Merged
24 commits
304e54c
[Move DISCO queue to core]:
arubique Mar 8, 2026
c0f81b9
[Move DISCO queue to core]:
arubique Mar 9, 2026
6ad80a8
[Move DISCO queue to core]:
arubique Mar 9, 2026
b498ce7
[Move DISCO queue to core]:
arubique Mar 11, 2026
14bcb3f
[Move DISCO queue to core] Add ModelScorer, ModelAgentAdapter, and re…
arubique Mar 11, 2026
079ef47
[Move DISCO queue to core]:
arubique Mar 12, 2026
e23b1df
[Move DISCO queue to core] Remove dummy implementations and tighten d…
arubique Mar 12, 2026
dd46f1a
[Move DISCO queue to core]:
arubique Mar 13, 2026
3779e2e
[Move DISCO queue to core]:
arubique Mar 13, 2026
bf4abbb
[Move DISCO queue to core]:
arubique Mar 14, 2026
f6a5885
[Move DISCO queue to core]:
arubique Mar 14, 2026
2693197
[Move DISCO queue to core]:
arubique Mar 16, 2026
afd2cf9
[Move DISCO queue to core]:
arubique Mar 16, 2026
7201832
[Move DISCO queue to core]: Update docs to reflect MMLU, scorer, and …
arubique Mar 16, 2026
e7d15a8
[Move DISCO queue to core]:
arubique Mar 16, 2026
3aa675e
[Move DISCO queue to core]:
arubique Mar 16, 2026
6f5b0e2
Add benchmark/index.md to mkdocs.yml to fix warning during docs building
arubique Mar 16, 2026
fa280fd
small quality fixes
cemde Mar 27, 2026
9a61aaf
improved testing
cemde Mar 28, 2026
7a0fe30
fixed typing errors in tests
cemde Mar 28, 2026
0d5c221
fixed potential scientific integrity issues
cemde Mar 28, 2026
3d10c15
Merge remote-tracking branch 'upstream/main' into adaptive_queue
cemde Mar 28, 2026
524fee5
small bug fix, formatting issues
cemde Mar 28, 2026
9b2696a
fixed changelog and docs
cemde Mar 28, 2026
18 changes: 17 additions & 1 deletion BENCHMARKS.md
@@ -79,7 +79,23 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses

---

## 6. [Name of Next Benchmark]
## 6. MMLU (Massive Multitask Language Understanding) (Beta)

MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.

> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!

> **Implemented:** A ready-to-use implementation is available via `DefaultMMLUBenchmark` with HuggingFace model support. Install with `pip install maseval[mmlu]`. See the [MMLU documentation](https://maseval.readthedocs.io/en/stable/benchmark/mmlu/) for usage details.

### Source and License

- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021)
- **DISCO Paper:** [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://arxiv.org/abs/2510.07959) (Rubinstein et al., ICLR 2026)
- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU)

---

## 7. [Name of Next Benchmark]

(Description for the next benchmark...)

62 changes: 27 additions & 35 deletions CHANGELOG.md
@@ -11,7 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

**Core**

- Fixed `MessageHistory.to_list()` returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #PR_NUMBER_PLACEHOLDER)
- Fixed `MessageHistory.to_list()` returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #48)
- Fixed `get_git_info()` crashing on a detached HEAD (e.g. in a CI checkout); it now returns `detached@<short-hash>` as the branch name. (PR: #41)
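
The `to_list()` fix above comes down to list aliasing. The sketch below uses a hypothetical stand-in class `History` (not MASEval's actual `MessageHistory`, whose internals are not shown here) to illustrate why returning the internal list lets a log taken earlier pick up later messages, and how returning a copy avoids it.

```python
class History:
    """Hypothetical stand-in for a message history; not MASEval's real class."""

    def __init__(self):
        self._messages = []

    def add(self, role, content):
        self._messages.append({"role": role, "content": content})

    def to_list_aliased(self):
        # Buggy variant: hands out the internal list object itself.
        return self._messages

    def to_list(self):
        # Fixed variant: a shallow copy, so later appends are not visible
        # in snapshots that were taken earlier.
        return list(self._messages)


history = History()
history.add("user", "hello")

aliased_log = history.to_list_aliased()  # snapshot taken via the buggy path
copied_log = history.to_list()           # snapshot taken via the fixed path

history.add("assistant", "hi there")     # happens after both snapshots

print(len(aliased_log))  # 2: the "log" now contains a future message
print(len(copied_log))   # 1: the copy still reflects the time of logging
```

Note that `list(...)` is a shallow copy: the message dicts themselves are still shared, so in-place edits to a message would remain visible in both lists.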

**Interface**

@@ -24,16 +25,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45)
- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45)
- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)

- `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
- `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34 and #41)
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34 and #41)
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
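
As a rough illustration of the anchor-point idea behind the `DISCOQueue` entry above, the sketch below loads a list of indices from a `.json` file and keeps only the tasks at those positions. The helper names `load_anchor_indices` and `select_anchor_tasks`, and the assumption that the file holds a flat list of integers, are hypothetical; they are not MASEval's API, and the actual `DISCOQueue.load_anchor_points()` behavior may differ.

```python
import json
import tempfile
from pathlib import Path


def load_anchor_indices(path):
    """Read a JSON file assumed to hold a flat list of integer task indices."""
    with Path(path).open() as f:
        indices = json.load(f)
    return [int(i) for i in indices]


def select_anchor_tasks(tasks, anchor_indices):
    """Return only the tasks at the anchor indices, in anchor order."""
    return [tasks[i] for i in anchor_indices]


# Write a toy anchor file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    anchor_file = Path(tmp) / "anchor_points.json"
    anchor_file.write_text(json.dumps([4, 0, 2]))
    anchors = load_anchor_indices(anchor_file)

tasks = [f"question-{i}" for i in range(6)]
subset = select_anchor_tasks(tasks, anchors)
print(subset)  # ['question-4', 'question-0', 'question-2']
```

Only the subset is evaluated; estimating full-benchmark performance from those results is a separate prediction step that this sketch does not attempt.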

**Benchmarks**
**Interface**

- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added `HuggingFaceModelScorer` in `maseval.interface.inference` — log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34 and #41)
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
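
One plausible reading of the "single-token optimisation" mentioned for `HuggingFaceModelScorer` above is that, when every answer choice is a single token, one next-token distribution is enough to score all choices. The sketch below shows that idea in plain Python on a toy logit vector. The function names, the toy vocabulary, and the token ids are assumptions for illustration only; this is not the scorer's actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def score_single_token_choices(next_token_logits, choice_token_ids):
    """Log-likelihood of each single-token choice from one next-token distribution."""
    log_probs = log_softmax(next_token_logits)
    return {choice: log_probs[tid] for choice, tid in choice_token_ids.items()}


# Toy vocabulary of 6 tokens; the ids for "A".."D" are made up.
next_token_logits = [0.1, 2.0, 0.5, 3.0, -1.0, 0.0]
choice_token_ids = {"A": 1, "B": 2, "C": 3, "D": 4}

scores = score_single_token_choices(next_token_logits, choice_token_ids)
prediction = max(scores, key=scores.get)  # choice with the highest log-likelihood
print(prediction)  # C
```

With multi-token choices this shortcut no longer applies, and each continuation would need its own log-likelihood summed over its tokens.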

- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
**Benchmarks**

- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. `MMLUBenchmark` is a framework-agnostic base class (`setup_agents()` and `get_model_adapter()` must be implemented by subclasses); `DefaultMMLUBenchmark` provides a ready-made HuggingFace implementation. Also includes `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34 and #41)
- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)

- GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
- `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
- `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
@@ -43,7 +58,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
- Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
- Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)

- MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
- `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
- `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
@@ -54,32 +68,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
**Examples**

- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)

- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34 and #41)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)

**Documentation**

- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)

**Core**

- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)

**Interface**

- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)

**Testing**

- Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
@@ -91,7 +87,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34)
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)

### Changed

@@ -108,28 +104,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `LlamaIndexAgentAdapter`: Added `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
- `SmolAgentAdapter`: New `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing false "success" status on empty steps. (PR: #39)
- `GoogleGenAIModelAdapter`: Consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
- `SmolAgentUser` → `SmolAgentLLMUser`
- `LangGraphUser` → `LangGraphLLMUser`
- `LlamaIndexUser` → `LlamaIndexLLMUser`

**Benchmarks**

- `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
- `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
- `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
- `Tau2Benchmark`: Seeds `simulators/user`, `agents/default_agent`
- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
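
The path-style seed names listed above (for example `simulators/user`) suggest per-component seeds derived from one root seed. The sketch below shows one way such a derivation could work with SHA-256. The function `derive_seed`, the `root:path` encoding, and the 32-bit truncation are assumptions for illustration, not necessarily what `DefaultSeedGenerator` does.

```python
import hashlib


def derive_seed(root_seed, path):
    """Derive a 32-bit child seed from a root seed and a component path.

    Hypothetical scheme: hash "<root_seed>:<path>" with SHA-256 and keep
    the first four bytes as an unsigned big-endian integer.
    """
    digest = hashlib.sha256(f"{root_seed}:{path}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


user_seed = derive_seed(42, "simulators/user")
agent_seed = derive_seed(42, "agents/default_agent")

# The same root seed and path always give the same child seed,
# regardless of the order in which components ask for theirs.
print(user_seed == derive_seed(42, "simulators/user"))  # True
```

Deriving seeds from names rather than drawing them sequentially from one random generator means that adding or removing a component does not shift the seeds of the others.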

**User**

- Refactored `User` class into abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`) with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
- Renamed `AgenticUser` → `AgenticLLMUser` for consistency with the new hierarchy (PR: #22)

**Interface**

- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
- `SmolAgentUser` → `SmolAgentLLMUser`
- `LangGraphUser` → `LangGraphLLMUser`
- `LlamaIndexUser` → `LlamaIndexLLMUser`

- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)

**Testing**

- Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)
7 changes: 7 additions & 0 deletions README.md
@@ -102,6 +102,13 @@ pip install "maseval[langgraph]"
pip install "maseval[llamaindex]"
```

Or install benchmark-specific dependencies:

```bash
# MMLU (HuggingFace models)
pip install "maseval[mmlu]"
```

## Example

Examples are available in the [Documentation](https://maseval.readthedocs.io/en/stable/).