diff --git a/AREISSUES.md b/AREISSUES.md new file mode 100644 index 0000000..8fc1027 --- /dev/null +++ b/AREISSUES.md @@ -0,0 +1,109 @@ +# AREEnvironment Issues + +Issues identified by comparing `AREEnvironment` (generic ARE wrapper in `maseval/interface/environments/`) with `Gaia2Environment` (dataset-specific implementation in `maseval/benchmark/gaia2/`). + +Reviewed through scientific coding principles: silent failures that could produce wrong-but-plausible benchmark results are treated as high priority. + +**Design principle:** maseval's `Benchmark` class provides `fail_on_task_error`, `fail_on_setup_error`, `fail_on_evaluation_error`, etc. flags that give users explicit control over error tolerance. Environment and tool code must propagate errors so the benchmark runner can classify them (agent fault vs. infrastructure fault) and respect the user's `fail_on_*` settings. Silent swallowing inside environment code bypasses this mechanism entirely — the user asked for strict mode but gets silent degradation. + +## High Priority + +### 1. Oracle mode silently degrades to empty data + +`maseval/interface/environments/are.py:200-203` uses `hasattr` checks that silently fall back to empty dicts/lists: + +```python +oracle_traces = { + "apps_state": oracle_env.get_apps_state() if hasattr(oracle_env, "get_apps_state") else {}, + "world_logs": oracle_env.get_world_logs() if hasattr(oracle_env, "get_world_logs") else [], +} +``` + +If the ARE API changes or the methods are missing, oracle traces will be `{}` and `[]` — structurally valid but scientifically empty. Downstream evaluation code will run without error and produce meaningless scores. This is a textbook "wrong result that looks right" scenario. + +`Gaia2Environment` avoids this entirely by delegating to ARE's canonical `preprocess_scenario()`. + +No tests cover oracle mode, so this cannot be caught by CI. + +**Fix:** Remove the `hasattr` fallbacks. 
Call the methods directly — if they don't exist, the crash immediately tells you the ARE API changed. Alternatively, delegate to `preprocess_scenario()`. Add oracle mode tests. + +### 2. Missing simulation time tracking in AREToolWrapper + +`Gaia2GenericTool` records simulation time before/after each tool call (`maseval/benchmark/gaia2/tool_wrapper.py:123-150`). `AREToolWrapper` only records wall-clock time (`maseval/interface/environments/are_tool_wrapper.py:122-128`). + +Simulation time is the temporal coordinate of the experiment. Without it, any analysis of agent behavior in time-sensitive ARE scenarios is done against wall-clock time, which conflates LLM latency with simulated environment dynamics. Results would be unreproducible across different hardware or API providers. + +**Fix:** Add `_get_simulation_time()` helper and record `simulation_time_before`, `simulation_time_after`, `simulation_time_elapsed` in the invocation `meta` dict, matching `Gaia2GenericTool`. + +### 3. Schema extraction silently fabricates defaults + +`AREToolWrapper._extract_schema()` (`maseval/interface/environments/are_tool_wrapper.py:90-95`): + +```python +"type": getattr(arg, "arg_type", "string"), # fabricates "string" if missing +if not getattr(arg, "has_default", True): # assumes optional if missing +``` + +If an ARE tool arg is missing `arg_type`, it silently becomes `"string"`. If `has_default` is missing, it silently becomes optional. The schema will look valid but will be wrong — agents will receive incorrect type information and parameter requirements, producing tool calls that either fail in hard-to-trace ways or succeed with wrong arguments. + +`Gaia2GenericTool` accesses these directly (`arg.arg_type`, `arg.has_default`) — it will crash immediately if the structure is unexpected, which is the correct behavior. + +**Fix:** Remove the silent defaults. Access `arg.arg_type` and `arg.has_default` directly. 
If ARE tools don't have these attributes, that's a real problem that should surface immediately. Additionally, use ARE's `AppToolAdapter` as the canonical source of tool metadata, as `Gaia2GenericTool` does. + +### 4. `poll_notifications()` silently swallows all exceptions + +Both `AREEnvironment` (`are.py:330-331`) and `Gaia2Environment` (`environment.py:328-329`): + +```python +except Exception: + return [], [], False +``` + +If the notification system has a bug, returns corrupt data, or the ARE API changes, this catch-all returns "no notifications, simulation not stopped" — a plausible-looking empty result. The agent loop continues as if nothing happened, missing user messages and environment events. The benchmark produces scores based on an agent that never received its inputs. + +This is distinct from lifecycle cleanup (where swallowing errors is acceptable). Notification polling is part of the data path — it directly affects what the agent sees and does. + +Additionally, this bypasses maseval's `fail_on_*_error` mechanism: even if the user configured strict mode via `fail_on_task_error` or `fail_on_setup_error` to catch infrastructure failures, these errors are swallowed before the benchmark runner ever sees them. + +**Fix:** Remove the bare `except Exception`. Catch only the specific exceptions that represent expected transient conditions (if any). Let unexpected errors propagate — the benchmark runner classifies them (`ENVIRONMENT_ERROR` during execution, `SETUP_FAILED` during setup) and the user's `fail_on_*` settings decide whether to abort or continue. + +## Medium Priority + +### 5. AUI tool filtering missing from AREEnvironment + +`Gaia2Environment` filters out `AgentUserInterface` message-retrieval tools and sets `wait_for_user_response = False` (`maseval/benchmark/gaia2/environment.py:178-213`), matching ARE's default agent behavior. `AREEnvironment` doesn't. 
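The opt-in filtering the fix suggests could be sketched generically as follows. This is illustrative only: the `AgentUserInterface__` prefix and the blocked method names are assumptions modeled on ARE's AUI tool naming; the real blocklist should mirror what `Gaia2Environment` filters.

```python
# Hypothetical helper for AREEnvironment (names are assumptions, not ARE's API).
# Drops AUI message-*retrieval* tools so the agent relies on notification-based
# delivery instead of polling/blocking; send tools are kept.
AUI_PREFIX = "AgentUserInterface__"
BLOCKED_SUFFIXES = ("get_last_message_from_user", "get_all_messages")


def filter_aui_tools(tools: dict) -> dict:
    """Return tools minus AUI message-retrieval tools."""
    return {
        name: tool
        for name, tool in tools.items()
        if not (name.startswith(AUI_PREFIX) and name.endswith(BLOCKED_SUFFIXES))
    }


tools = {
    "AgentUserInterface__get_all_messages": object(),       # filtered out
    "AgentUserInterface__send_message_to_user": object(),   # kept
    "Calendar__create_event": object(),                     # kept
}
filtered = filter_aui_tools(tools)
assert "AgentUserInterface__get_all_messages" not in filtered
assert "AgentUserInterface__send_message_to_user" in filtered
```
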
+ +Anyone using `AREEnvironment` with ARE's notification-based message delivery will get tools that block waiting for a response or duplicate user messages. This is a correctness issue for interactive scenarios, not just a convenience gap. + +**Fix:** Add an opt-in parameter (e.g. `filter_aui_tools=True`) to `AREEnvironment.__init__`, or at minimum document the behavior prominently so users of the generic wrapper know they must handle this themselves. + +### 6. Inconsistent error handling in AREEnvironment lifecycle methods + +`cleanup()` wraps in try/except (`are.py:360-364`), but `pause()` and `resume_with_offset()` propagate exceptions (`are.py:268-280`). `Gaia2Environment` wraps all lifecycle methods consistently. + +For lifecycle methods, the question is: "if this fails, can the experiment still produce correct results?" + +- `cleanup()`: Swallowing is acceptable — the task is already done, we're tearing down. +- `pause()` / `resume_with_offset()`: These control simulation time during agent execution. A failure here means time is advancing when it shouldn't be (or vice versa), directly affecting the experimental conditions. The current inconsistency means `pause()` will crash the benchmark run while `cleanup()` won't — but neither behavior is fully correct. + +Crucially, swallowing these errors removes the user's ability to control error tolerance via `fail_on_task_error` / `fail_on_setup_error`. The user opted into strict mode precisely to catch conditions like "simulation time wasn't paused during LLM generation." Silent swallowing overrides that choice. + +**Fix:** Let `pause()` and `resume_with_offset()` propagate exceptions. The benchmark runner classifies them and the user's `fail_on_*` settings decide the outcome. `cleanup()` is the only lifecycle method where swallowing is acceptable (teardown after task completion). + +### 7. 
No `get_turn_notifications()` on AREEnvironment + +`Gaia2Environment` has `get_turn_notifications()` (`maseval/benchmark/gaia2/environment.py:331-380`) which re-queues environment notifications for the inner agent loop — essential for multi-turn ARE scenarios. Without it, environment notifications are consumed by the outer turn loop and never reach the agent's step loop. + +Anyone building a multi-turn agent on `AREEnvironment` will silently lose environment notifications between turns. + +**Fix:** Add `get_turn_notifications()` to `AREEnvironment`, or document that `poll_notifications()` alone is insufficient for multi-turn agent loops and that users must implement re-queuing themselves. + +## Low Priority + +### 8. Metadata extraction should use AppToolAdapter + +`Gaia2GenericTool` delegates metadata extraction to ARE's `AppToolAdapter` (`maseval/benchmark/gaia2/tool_wrapper.py:68-75`) — the canonical source of truth for tool name, description, inputs, and output_type. `AREToolWrapper` reads these attributes directly from the ARE tool object. + +This works today but couples `AREToolWrapper` to ARE's internal tool structure rather than its public adapter API. + +**Fix:** Use `AppToolAdapter` in `AREToolWrapper` for metadata extraction. diff --git a/docs/interface/environments/are.md b/docs/interface/environments/are.md new file mode 100644 index 0000000..571b320 --- /dev/null +++ b/docs/interface/environments/are.md @@ -0,0 +1,26 @@ +# ARE + +Environment wrapper for Meta's [Agent Research Environments (ARE)](https://github.com/facebookresearch/meta-agents-research-environments). 
+ +- [Documentation](https://github.com/facebookresearch/meta-agents-research-environments) +- [Code Repository](https://github.com/facebookresearch/meta-agents-research-environments) + +## Installation + +```bash +pip install maseval[are] +``` + +Alternatively, install ARE directly: + +```bash +pip install meta-agents-research-environments +``` + +## API Reference + +[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/environments/are.py){ .md-source-file } + +::: maseval.interface.environments.are.AREEnvironment + +::: maseval.interface.environments.are_tool_wrapper.AREToolWrapper diff --git a/docs/superpowers/plans/2026-03-27-are-environment.md b/docs/superpowers/plans/2026-03-27-are-environment.md new file mode 100644 index 0000000..f3803d1 --- /dev/null +++ b/docs/superpowers/plans/2026-03-27-are-environment.md @@ -0,0 +1,1260 @@ +# AREEnvironment Integration Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add a generic AREEnvironment to maseval that wraps Meta's ARE simulation infrastructure as a reusable building block for interactive agent environments. + +**Architecture:** `AREEnvironment` is a maseval `Environment` subclass in `maseval/interface/environments/` that wraps ARE's `Environment`, `Scenario`, and `Tool` classes. It accepts either a pre-built ARE `Scenario` or a shorthand dict of apps+events via `task_data` (= `task.environment_data`). Tools are wrapped in `AREToolWrapper` with `ToolInvocationHistory` tracing. The ARE event loop is user-controlled (start/stop/pause/resume). + +**Tech Stack:** Python 3.10+, maseval core (`Environment`, `TraceableMixin`, `ConfigurableMixin`, `ToolInvocationHistory`), ARE (`meta-agents-research-environments>=1.2.0`) as optional dependency. 
+ +**Spec:** `docs/superpowers/specs/2026-03-27-are-environment-design.md` + +--- + +### Task 1: Create `maseval/interface/environments/` Package + +**Files:** +- Create: `maseval/interface/environments/__init__.py` +- Modify: `maseval/interface/__init__.py` + +- [ ] **Step 1: Create the environments package init** + +Create `maseval/interface/environments/__init__.py` with conditional ARE import (matching the pattern in `maseval/interface/agents/__init__.py`): + +```python +"""maseval.interface.environments + +Environment integrations for external simulation platforms. +""" + +__all__: list[str] = [] + +try: + from .are import AREEnvironment # noqa: F401 + from .are_tool_wrapper import AREToolWrapper # noqa: F401 + + __all__.extend(["AREEnvironment", "AREToolWrapper"]) +except ImportError: + pass +``` + +- [ ] **Step 2: Register the environments subpackage in `maseval/interface/__init__.py`** + +Add `environments` to the interface package. Modify `maseval/interface/__init__.py`: + +```python +"""maseval.interface + +This package contains adapters and thin shims that integrate external libraries and services +with MASEval. Each integration is optional and requires installing the corresponding extra. + +Organization: +- inference/: Model inference adapters (OpenAI, Google, HuggingFace, etc.) +- agents/: Agent framework adapters (smolagents, langgraph, etc.) +- environments/: Environment integrations (ARE, etc.) +- logging/: Logging platform adapters (wandb, langfuse, etc.) + +Canonical rules: +- Keep adapters thin: translate between MASEval internal abstractions and the external API. +- Avoid heavy imports at module import time; import lazily inside functions/classes. + +See `maseval/interface/README.md` for more details and conventions for optional dependencies, +packaging extras, and testing. +""" + +# Import subpackages +from . import inference, agents, environments +from . 
import logging as logging_ # Rename to avoid conflict with stdlib + +__all__ = ["inference", "agents", "environments", "logging_"] +``` + +- [ ] **Step 3: Commit** + +```bash +git add maseval/interface/environments/__init__.py maseval/interface/__init__.py +git commit -m "feat: add maseval/interface/environments/ package" +``` + +--- + +### Task 2: Implement `AREToolWrapper` + +**Files:** +- Create: `maseval/interface/environments/are_tool_wrapper.py` +- Create: `tests/interface/environments/test_are_tool_wrapper.py` + +- [ ] **Step 1: Write the failing tests** + +Create `tests/interface/environments/__init__.py` (empty) and `tests/interface/environments/test_are_tool_wrapper.py`: + +```python +"""Tests for AREToolWrapper.""" + +from unittest.mock import MagicMock +import pytest + +from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + +class TestAREToolWrapper: + """Tests for AREToolWrapper.""" + + def _make_mock_are_tool(self, name="Calendar__create_event", description="Create a calendar event", + inputs=None, output_type="string", return_value="Event created"): + """Create a mock ARE tool.""" + tool = MagicMock() + tool.name = name + tool.description = description + tool.inputs = inputs or {"title": {"type": "string", "description": "Event title"}} + tool.output_type = output_type + tool.return_value = return_value + tool.__call__ = MagicMock(return_value=return_value) + return tool + + def test_metadata_from_are_tool(self): + """Wrapper exposes ARE tool metadata.""" + are_tool = self._make_mock_are_tool() + env = MagicMock() + + wrapper = AREToolWrapper(are_tool, env) + + assert wrapper.name == "Calendar__create_event" + assert wrapper.description == "Create a calendar event" + assert wrapper.inputs == {"title": {"type": "string", "description": "Event title"}} + assert wrapper.output_type == "string" + + def test_call_delegates_to_are_tool(self): + """Calling wrapper delegates to underlying ARE tool.""" + are_tool = 
self._make_mock_are_tool(return_value="Event created") + env = MagicMock() + + wrapper = AREToolWrapper(are_tool, env) + result = wrapper(title="Standup") + + are_tool.assert_called_once_with(title="Standup") + assert result == "Event created" + + def test_call_records_success_in_history(self): + """Successful calls are recorded in invocation history.""" + are_tool = self._make_mock_are_tool(return_value="OK") + env = MagicMock() + + wrapper = AREToolWrapper(are_tool, env) + wrapper(title="Test") + + assert len(wrapper.history) == 1 + record = wrapper.history.to_list()[0] + assert record["inputs"] == {"title": "Test"} + assert record["outputs"] == "OK" + assert record["status"] == "success" + + def test_call_records_error_in_history(self): + """Failed calls are recorded in invocation history and re-raised.""" + are_tool = self._make_mock_are_tool() + are_tool.side_effect = ValueError("Invalid title") + env = MagicMock() + + wrapper = AREToolWrapper(are_tool, env) + + with pytest.raises(ValueError, match="Invalid title"): + wrapper(title="") + + assert len(wrapper.history) == 1 + record = wrapper.history.to_list()[0] + assert record["status"] == "error" + assert "Invalid title" in record["outputs"] + + def test_gather_traces(self): + """gather_traces returns structured trace data.""" + are_tool = self._make_mock_are_tool(return_value="Done") + env = MagicMock() + + wrapper = AREToolWrapper(are_tool, env) + wrapper(title="Test1") + wrapper(title="Test2") + + traces = wrapper.gather_traces() + assert traces["type"] == "AREToolWrapper" + assert traces["name"] == "Calendar__create_event" + assert traces["total_invocations"] == 2 + assert len(traces["invocations"]) == 2 + + def test_gather_config(self): + """gather_config returns tool configuration.""" + are_tool = self._make_mock_are_tool() + env = MagicMock() + + wrapper = AREToolWrapper(are_tool, env) + config = wrapper.gather_config() + + assert config["name"] == "Calendar__create_event" + assert 
config["description"] == "Create a calendar event" + assert "input_schema" in config + + def test_repr(self): + """String representation shows tool signature.""" + are_tool = self._make_mock_are_tool() + env = MagicMock() + + wrapper = AREToolWrapper(are_tool, env) + r = repr(wrapper) + + assert "AREToolWrapper" in r + assert "Calendar__create_event" in r +``` + +- [ ] **Step 2: Run tests to verify they fail** + +```bash +uv run pytest tests/interface/environments/test_are_tool_wrapper.py -v +``` + +Expected: FAIL with `ModuleNotFoundError` (module doesn't exist yet). + +- [ ] **Step 3: Implement AREToolWrapper** + +Create `maseval/interface/environments/are_tool_wrapper.py`: + +```python +"""ARE Tool Wrapper for MASEval. + +Framework-agnostic wrapper for ARE Tool instances. Provides a callable +interface with ToolInvocationHistory tracing and metadata exposure for +framework adapters (smolagents, LangGraph, etc.) to build framework-native +tools from. + +This is the layer 1->2 wrapper: +- Layer 1: ARE Tool (forward(), inputs, output_type) +- Layer 2: maseval generic (callable, ToolInvocationHistory, metadata) +- Layer 3: framework-specific -- NOT handled here. +""" + +from datetime import datetime +from typing import TYPE_CHECKING, Any, Dict + +from maseval.core.tracing import TraceableMixin +from maseval.core.config import ConfigurableMixin +from maseval.core.history import ToolInvocationHistory + +if TYPE_CHECKING: + from maseval.interface.environments.are import AREEnvironment + + +class AREToolWrapper(TraceableMixin, ConfigurableMixin): + """Framework-agnostic wrapper for ARE tools with maseval tracing. + + Wraps an ARE Tool and exposes its metadata (name, description, inputs, + output_type) so that agent adapters can construct framework-native tools. 
+ + Example for smolagents:: + + class MySmolagentsTool(smolagents.Tool): + skip_forward_signature_validation = True + + def __init__(self, wrapper: AREToolWrapper): + self.wrapper = wrapper + self.name = wrapper.name + self.description = wrapper.description + self.inputs = wrapper.inputs + self.output_type = wrapper.output_type + super().__init__() + + def forward(self, **kwargs) -> str: + return self.wrapper(**kwargs) + """ + + def __init__(self, are_tool: Any, environment: "AREEnvironment"): + """Initialize the tool wrapper. + + Args: + are_tool: ARE Tool instance to wrap. + environment: The AREEnvironment this tool belongs to. + """ + super().__init__() + self._are_tool = are_tool + self._environment = environment + self.history = ToolInvocationHistory() + + # Expose ARE tool metadata for framework adapters + self.name: str = are_tool.name + self.description: str = are_tool.description + self.inputs: Dict[str, Any] = are_tool.inputs + self.output_type: str = are_tool.output_type + + # Extract JSON schema from ARE tool args (if available) + self.input_schema: Dict[str, Any] = self._extract_schema(are_tool) + + @staticmethod + def _extract_schema(are_tool: Any) -> Dict[str, Any]: + """Convert ARE's args list to JSON schema format. + + Args: + are_tool: ARE Tool instance. + + Returns: + JSON schema dict with properties and required fields. + """ + args = getattr(are_tool, "args", None) + if not args: + return {} + + properties = {} + required = [] + + for arg in args: + param_name = getattr(arg, "name", None) + if not param_name: + continue + properties[param_name] = { + "type": getattr(arg, "arg_type", "string"), + "description": getattr(arg, "description", ""), + } + if not getattr(arg, "has_default", True): + required.append(param_name) + + return {"properties": properties, "required": required} + + def __call__(self, **kwargs: Any) -> Any: + """Execute the ARE tool with tracing. + + Args: + **kwargs: Tool arguments matching the inputs schema. 
+ + Returns: + Tool output (type varies per tool). + + Raises: + Any exception from the underlying ARE tool is re-raised. + """ + start_time = datetime.now() + status = "success" + result = None + error_message = None + + try: + result = self._are_tool(**kwargs) + except Exception as e: + status = "error" + error_message = str(e) + raise + finally: + self.history.add_invocation( + inputs=kwargs, + outputs=result if status == "success" else error_message, + status=status, + timestamp=start_time.isoformat(), + ) + + return result + + def gather_traces(self) -> Dict[str, Any]: + """Gather execution traces from this tool. + + Returns: + Dictionary with tool name, invocation history, and counts. + """ + return { + **super().gather_traces(), + "name": self.name, + "invocations": self.history.to_list(), + "total_invocations": len(self.history), + } + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration from this tool. + + Returns: + Dictionary with tool name, description, and schema. + """ + return { + **super().gather_config(), + "name": self.name, + "description": self.description, + "input_schema": self.input_schema, + } + + def __repr__(self) -> str: + args = ", ".join(f"{k}: {v.get('type', '?')}" for k, v in self.inputs.items()) + return f"{self.__class__.__name__}({self.name}({args}) -> {self.output_type})" +``` + +- [ ] **Step 4: Run tests to verify they pass** + +```bash +uv run pytest tests/interface/environments/test_are_tool_wrapper.py -v +``` + +Expected: All 7 tests PASS. 
+ +- [ ] **Step 5: Commit** + +```bash +git add maseval/interface/environments/are_tool_wrapper.py tests/interface/environments/__init__.py tests/interface/environments/test_are_tool_wrapper.py +git commit -m "feat: add AREToolWrapper with tracing and metadata" +``` + +--- + +### Task 3: Implement `AREEnvironment` — Core Structure and Scenario Path + +**Files:** +- Create: `maseval/interface/environments/are.py` +- Create: `tests/interface/environments/test_are_environment.py` + +This task implements the core class with the Scenario construction path. The shorthand path (apps+events) is added in Task 4. + +- [ ] **Step 1: Write the failing tests** + +Create `tests/interface/environments/test_are_environment.py`: + +```python +"""Tests for AREEnvironment.""" + +from unittest.mock import MagicMock, patch, PropertyMock +import pytest + +from maseval.interface.environments.are import AREEnvironment + + +def _make_mock_scenario(scenario_id="test-001", duration=600, seed=42, start_time=0): + """Create a mock ARE Scenario.""" + scenario = MagicMock() + scenario.scenario_id = scenario_id + scenario.duration = duration + scenario.seed = seed + scenario.start_time = start_time + scenario.time_increment_in_seconds = 1 + scenario.apps = [MagicMock(name="EmailClient"), MagicMock(name="Calendar")] + return scenario + + +def _make_mock_are_env(apps=None): + """Create a mock ARE Environment.""" + env = MagicMock() + if apps is None: + # Create mock apps with mock tools + email_app = MagicMock() + email_tool = MagicMock() + email_tool.name = "EmailClient__send_email" + email_tool.description = "Send an email" + email_tool.inputs = {"to": {"type": "string", "description": "Recipient"}} + email_tool.output_type = "string" + email_app.get_tools.return_value = [email_tool] + email_app.name = "EmailClient" + + calendar_app = MagicMock() + cal_tool = MagicMock() + cal_tool.name = "Calendar__create_event" + cal_tool.description = "Create event" + cal_tool.inputs = {"title": {"type": 
"string", "description": "Title"}} + cal_tool.output_type = "string" + calendar_app.get_tools.return_value = [cal_tool] + calendar_app.name = "Calendar" + + env.apps = {"EmailClient": email_app, "Calendar": calendar_app} + else: + env.apps = apps + env.current_time = 0.0 + return env + + +class TestAREEnvironmentScenarioPath: + """Tests for AREEnvironment with Scenario objects.""" + + @patch("maseval.interface.environments.are._import_are") + def test_setup_state_with_scenario(self, mock_import): + """setup_state initialises from an ARE Scenario and returns state dict.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + task_data = {"scenario": scenario} + + env = AREEnvironment(task_data) + + assert env.state["scenario_id"] == "test-001" + assert env.state["duration"] == 600 + assert env.state["seed"] == 42 + assert env._are_env is mock_are_env + + @patch("maseval.interface.environments.are._import_are") + def test_setup_state_requires_scenario_or_apps(self, mock_import): + """setup_state raises ValueError if neither scenario nor apps provided.""" + mock_import.return_value = MagicMock() + + with pytest.raises(ValueError, match="must contain either"): + AREEnvironment(task_data={}) + + @patch("maseval.interface.environments.are._import_are") + def test_create_tools_wraps_are_tools(self, mock_import): + """create_tools wraps all ARE app tools in AREToolWrapper.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + + tools = env.get_tools() + assert "EmailClient__send_email" in tools + assert "Calendar__create_event" in tools + assert len(tools) == 2 + + # Check wrapper metadata + email_tool = 
tools["EmailClient__send_email"] + assert email_tool.name == "EmailClient__send_email" + assert email_tool.description == "Send an email" + + @patch("maseval.interface.environments.are._import_are") + def test_start_runs_scenario(self, mock_import): + """start() calls are_env.run() with the scenario.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + env.start() + + mock_are_env.run.assert_called_once() + call_kwargs = mock_are_env.run.call_args + assert call_kwargs[1].get("wait_for_end") is False + + @patch("maseval.interface.environments.are._import_are") + def test_stop_stops_env(self, mock_import): + """stop() calls are_env.stop().""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + env.stop() + + mock_are_env.stop.assert_called_once() + + @patch("maseval.interface.environments.are._import_are") + def test_pause_and_resume(self, mock_import): + """pause() and resume_with_offset() delegate to ARE env.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + env.pause() + mock_are_env.pause.assert_called_once() + + env.resume_with_offset(5.0) + mock_are_env.resume_with_offset.assert_called_once_with(5.0) + + @patch("maseval.interface.environments.are._import_are") + def test_get_simulation_time(self, mock_import): + """get_simulation_time() returns ARE env's current_time.""" + mock_are_mod = MagicMock() + mock_import.return_value = 
mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_env.current_time = 42.5 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + + assert env.get_simulation_time() == 42.5 + + @patch("maseval.interface.environments.are._import_are") + def test_cleanup_stops_env(self, mock_import): + """cleanup() stops the ARE environment.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + env.cleanup() + + mock_are_env.stop.assert_called_once() + + @patch("maseval.interface.environments.are._import_are") + def test_gather_traces(self, mock_import): + """gather_traces returns structured trace data.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_env.current_time = 100.0 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + + traces = env.gather_traces() + assert traces["scenario_id"] == "test-001" + assert traces["tool_count"] == 2 + assert "tools" in traces + assert traces["final_simulation_time"] == 100.0 + + @patch("maseval.interface.environments.are._import_are") + def test_gather_config(self, mock_import): + """gather_config returns environment configuration.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + + config = env.gather_config() + assert config["scenario_id"] == "test-001" + assert config["duration"] == 600 + assert config["notification_verbosity"] == "medium" + assert 
"tool_names" in config +``` + +- [ ] **Step 2: Run tests to verify they fail** + +```bash +uv run pytest tests/interface/environments/test_are_environment.py -v +``` + +Expected: FAIL with `ModuleNotFoundError`. + +- [ ] **Step 3: Implement AREEnvironment** + +Create `maseval/interface/environments/are.py`: + +```python +"""AREEnvironment — generic maseval Environment wrapping ARE simulation. + +Provides a reusable building block for interactive agent environments +using Meta's ARE (Agents Research Environments) infrastructure. + +Original Repository: https://github.com/facebookresearch/meta-agents-research-environments +Code License: MIT +""" + +from typing import Any, Dict, List, Optional, Tuple + +from maseval.core.environment import Environment +from maseval.core.callback import EnvironmentCallback +from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + +def _check_are_installed() -> None: + """Check if ARE is installed and raise a helpful error if not.""" + try: + import are # noqa: F401 + except ImportError as e: + raise ImportError( + "ARE (Agent Research Environments) is required for AREEnvironment.\n" + "Install with: pip install maseval[are]\n" + "Or: uv add meta-agents-research-environments" + ) from e + + +def _import_are() -> Any: + """Lazily import and return the ARE simulation module. + + Returns: + The ``are.simulation`` module namespace with Environment, + EnvironmentConfig, Scenario, etc. + + Raises: + ImportError: If ARE is not installed. + """ + _check_are_installed() + from types import SimpleNamespace + from are.simulation.environment import Environment as AREEnv # type: ignore[import-not-found] + from are.simulation.environment import EnvironmentConfig # type: ignore[import-not-found] + + return SimpleNamespace( + Environment=AREEnv, + EnvironmentConfig=EnvironmentConfig, + ) + + +class AREEnvironment(Environment): + """Generic maseval Environment wrapping ARE's simulation infrastructure. 
+
+    Supports two construction paths via ``task_data`` (= ``task.environment_data``):
+
+    1. **Scenario path:** ``task_data = {"scenario": <ARE Scenario instance>}``
+    2. **Shorthand path:** ``task_data = {"apps": [...], "events": [...], "duration": 1800, ...}``
+
+    Lifecycle is user-controlled: call ``start()`` before ``run_agents()``,
+    ``stop()`` after. ``pause()``/``resume_with_offset()`` control simulation time.
+    """
+
+    def __init__(
+        self,
+        task_data: Dict[str, Any],
+        callbacks: Optional[List[EnvironmentCallback]] = None,
+        run_oracle: bool = False,
+        notification_verbosity: str = "medium",
+    ):
+        """Initialize AREEnvironment.
+
+        Args:
+            task_data: ``task.environment_data`` dict. Must contain either:
+                - ``"scenario"``: ARE Scenario object, OR
+                - ``"apps"``: list of ARE App instances, plus optional ``"events"``,
+                  ``"duration"``, ``"seed"``, ``"start_time"``, ``"time_increment_in_seconds"``
+            callbacks: Optional maseval EnvironmentCallbacks.
+            run_oracle: If True, run ARE oracle mode during setup to generate
+                expected event log. Stored in traces for evaluation.
+            notification_verbosity: ARE notification verbosity level.
+                ``"low"`` = no environment notifications,
+                ``"medium"`` = standard notifications,
+                ``"high"`` = all notifications.
+        """
+        self._run_oracle = run_oracle
+        self._notification_verbosity = notification_verbosity
+        self._are_env: Any = None
+        self._scenario: Any = None
+        self._oracle_traces: Optional[Dict[str, Any]] = None
+        self._tool_wrappers: Dict[str, AREToolWrapper] = {}
+
+        super().__init__(task_data, callbacks)
+
+    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
+        """Initialize ARE environment from task data.
+
+        Args:
+            task_data: Dict with ``"scenario"`` or ``"apps"`` key.
+
+        Returns:
+            State dict with scenario metadata.
+
+        Raises:
+            ValueError: If task_data contains neither ``"scenario"`` nor ``"apps"``.
+ """ + are_mod = _import_are() + + scenario = task_data.get("scenario") + + if scenario is None and "apps" not in task_data: + raise ValueError( + "task_data must contain either 'scenario' (ARE Scenario object) " + "or 'apps' (list of ARE App instances)." + ) + + if scenario is None: + scenario = self._build_scenario_from_shorthand(task_data) + + self._scenario = scenario + + # Run oracle mode if requested + if self._run_oracle: + self._oracle_traces = self._run_oracle_mode(are_mod, scenario) + + # Create ARE Environment (but don't start the event loop yet) + config = are_mod.EnvironmentConfig( + oracle_mode=False, + duration=scenario.duration, + time_increment_in_seconds=getattr(scenario, "time_increment_in_seconds", 1), + ) + if getattr(scenario, "start_time", None) and scenario.start_time > 0: + config.start_time = scenario.start_time + + # Create notification system based on verbosity + notification_system = self._create_notification_system() + self._are_env = are_mod.Environment(config, notification_system=notification_system) + + # Register apps from scenario so tools are available before start() + self._are_env.register_apps(scenario.apps) + + return { + "scenario_id": getattr(scenario, "scenario_id", None), + "duration": scenario.duration, + "seed": getattr(scenario, "seed", None), + "start_time": getattr(scenario, "start_time", None), + "app_names": [getattr(app, "name", str(app)) for app in scenario.apps], + "oracle_traces": self._oracle_traces, + } + + def _build_scenario_from_shorthand(self, task_data: Dict[str, Any]) -> Any: + """Build an ARE Scenario from shorthand task_data. + + Args: + task_data: Dict with ``"apps"``, and optional ``"events"``, + ``"duration"``, ``"seed"``, ``"start_time"``, + ``"time_increment_in_seconds"``. + + Returns: + ARE Scenario instance. 
+ """ + from are.simulation.scenarios.scenario import Scenario # type: ignore[import-not-found] + + apps = task_data["apps"] + events = task_data.get("events", []) + duration = task_data.get("duration", 1800) + seed = task_data.get("seed", 0) + start_time = task_data.get("start_time", 0) + time_increment = task_data.get("time_increment_in_seconds", 1) + + scenario = Scenario( + scenario_id=task_data.get("scenario_id", "custom"), + apps=apps, + events=events, + duration=duration, + seed=seed, + start_time=start_time, + time_increment_in_seconds=time_increment, + ) + scenario.initialize() + return scenario + + def _run_oracle_mode(self, are_mod: Any, scenario: Any) -> Dict[str, Any]: + """Run ARE oracle mode to generate expected event log. + + Args: + are_mod: ARE module namespace. + scenario: ARE Scenario instance. + + Returns: + Dict with oracle event log. + """ + oracle_config = are_mod.EnvironmentConfig( + oracle_mode=True, + duration=scenario.duration, + time_increment_in_seconds=getattr(scenario, "time_increment_in_seconds", 1), + ) + oracle_env = are_mod.Environment(oracle_config) + oracle_env.run(scenario, wait_for_end=True, schedule_events=True) + + # Capture oracle state + oracle_traces = { + "apps_state": oracle_env.get_apps_state() if hasattr(oracle_env, "get_apps_state") else {}, + "world_logs": oracle_env.get_world_logs() if hasattr(oracle_env, "get_world_logs") else [], + } + + # Soft-reset so app state is clean for agent run + scenario.soft_reset() + + return oracle_traces + + def _create_notification_system(self) -> Any: + """Create ARE notification system based on verbosity setting. + + Returns: + ARE NotificationSystem instance. 
+ """ + try: + from are.simulation.notification_system import ( # type: ignore[import-not-found] + VerboseNotificationSystem, + VerbosityLevel, + ) + + level_map = { + "low": VerbosityLevel.LOW, + "medium": VerbosityLevel.MEDIUM, + "high": VerbosityLevel.HIGH, + } + level = level_map.get(self._notification_verbosity, VerbosityLevel.MEDIUM) + return VerboseNotificationSystem(verbosity_level=level) + except ImportError: + return None + + def create_tools(self) -> Dict[str, AREToolWrapper]: + """Wrap all ARE app tools in AREToolWrapper. + + Returns: + Dict mapping tool names to AREToolWrapper instances. + """ + tools: Dict[str, AREToolWrapper] = {} + + if self._are_env is None: + return tools + + for app in self._are_env.apps.values(): + for are_tool in app.get_tools(): + wrapper = AREToolWrapper(are_tool, self) + tools[are_tool.name] = wrapper + self._tool_wrappers[are_tool.name] = wrapper + + return tools + + # ── Lifecycle ────────────────────────────────────────────────────── + + def start(self) -> None: + """Start the ARE simulation event loop. + + Call this after environment setup and before running agents. + Runs the scenario with ``wait_for_end=False`` so control returns + immediately for agent interaction. + """ + if self._are_env is not None and self._scenario is not None: + self._are_env.run(self._scenario, wait_for_end=False, schedule_events=True) + + def stop(self) -> None: + """Stop the ARE simulation event loop.""" + if self._are_env is not None: + self._are_env.stop() + + def pause(self) -> None: + """Pause simulation time progression.""" + if self._are_env is not None: + self._are_env.pause() + + def resume_with_offset(self, offset: float) -> None: + """Resume simulation with a time offset. + + Args: + offset: Seconds to advance simulation clock before resuming. 
+ """ + if self._are_env is not None: + self._are_env.resume_with_offset(offset) + + # ── Notification Polling ────────────────────────────────────────── + + def poll_notifications(self) -> Tuple[List[str], List[str], bool]: + """Drain pending notifications from ARE's notification queue. + + Returns: + Tuple of ``(user_messages, env_notifications, has_stop_signal)``. + ``user_messages``: Messages from simulated users. + ``env_notifications``: System events (new email, calendar reminder, etc.). + ``has_stop_signal``: True when simulation has ended. + + Agent adapters should call this between agent steps and inject + the messages into the agent's context. + """ + if self._are_env is None: + return [], [], False + + notification_system = getattr(self._are_env, "notification_system", None) + if notification_system is None: + return [], [], False + + try: + from datetime import datetime, timezone + from are.simulation.notification_system import MessageType # type: ignore[import-not-found] + + sim_time = self.get_simulation_time() + timestamp = datetime.fromtimestamp(sim_time, tz=timezone.utc) + unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp) + + if not unhandled: + return [], [], False + + user_messages: List[str] = [] + env_notifications: List[str] = [] + has_stop = False + + for notif in unhandled: + msg_type = getattr(notif, "message_type", None) + if msg_type == MessageType.USER_MESSAGE: + user_messages.append(notif.message) + elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION: + ts = notif.timestamp.strftime("%Y-%m-%d %H:%M:%S") if notif.timestamp else "" + env_notifications.append(f"[{ts}] {notif.message}") + elif msg_type == MessageType.ENVIRONMENT_STOP: + has_stop = True + + return user_messages, env_notifications, has_stop + + except Exception: + return [], [], False + + # ── Data Access ─────────────────────────────────────────────────── + + def get_simulation_time(self) -> float: + """Get current simulation time in 
seconds since scenario start.""" + if self._are_env is None: + return 0.0 + try: + return self._are_env.current_time + except AttributeError: + return 0.0 + + def get_are_environment(self) -> Any: + """Get the underlying ARE Environment instance.""" + return self._are_env + + def get_oracle_traces(self) -> Optional[Dict[str, Any]]: + """Get oracle event log if oracle mode was enabled. + + Returns: + Oracle traces dict, or None if oracle was not run. + """ + return self._oracle_traces + + # ── Cleanup ─────────────────────────────────────────────────────── + + def cleanup(self) -> None: + """Stop ARE simulation. Called by maseval after task completes.""" + if self._are_env is not None: + try: + self._are_env.stop() + except Exception: + pass + + # ── Tracing & Config ────────────────────────────────────────────── + + def gather_traces(self) -> Dict[str, Any]: + """Collect traces from environment and all tools.""" + tool_traces = {} + for name, wrapper in self._tool_wrappers.items(): + tool_traces[name] = wrapper.gather_traces() + + return { + **super().gather_traces(), + "scenario_id": self.state.get("scenario_id"), + "duration": self.state.get("duration"), + "seed": self.state.get("seed"), + "app_names": self.state.get("app_names", []), + "oracle_traces": self._oracle_traces, + "final_simulation_time": self.get_simulation_time(), + "tool_count": len(self._tool_wrappers), + "tools": tool_traces, + } + + def gather_config(self) -> Dict[str, Any]: + """Gather environment configuration for reproducibility.""" + return { + **super().gather_config(), + "scenario_id": self.state.get("scenario_id"), + "duration": self.state.get("duration"), + "seed": self.state.get("seed"), + "start_time": self.state.get("start_time"), + "notification_verbosity": self._notification_verbosity, + "run_oracle": self._run_oracle, + } +``` + +- [ ] **Step 4: Run tests to verify they pass** + +```bash +uv run pytest tests/interface/environments/test_are_environment.py -v +``` + +Expected: All 10 
tests PASS. + +- [ ] **Step 5: Commit** + +```bash +git add maseval/interface/environments/are.py tests/interface/environments/test_are_environment.py +git commit -m "feat: add AREEnvironment with scenario path and lifecycle control" +``` + +--- + +### Task 4: Implement Shorthand Construction Path + +**Files:** +- Modify: `maseval/interface/environments/are.py` (already has `_build_scenario_from_shorthand` stub) +- Modify: `tests/interface/environments/test_are_environment.py` + +- [ ] **Step 1: Write the failing tests** + +Add to `tests/interface/environments/test_are_environment.py`: + +```python +class TestAREEnvironmentShorthandPath: + """Tests for AREEnvironment with apps+events shorthand.""" + + @patch("maseval.interface.environments.are._import_are") + @patch("maseval.interface.environments.are.AREEnvironment._build_scenario_from_shorthand") + def test_shorthand_builds_scenario(self, mock_build, mock_import): + """Shorthand task_data with 'apps' key triggers scenario construction.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_scenario = _make_mock_scenario() + mock_build.return_value = mock_scenario + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + task_data = { + "apps": [MagicMock(), MagicMock()], + "events": [MagicMock()], + "duration": 300, + "seed": 99, + } + env = AREEnvironment(task_data=task_data) + + mock_build.assert_called_once_with(task_data) + assert env._scenario is mock_scenario + + @patch("maseval.interface.environments.are._import_are") + def test_shorthand_passes_config_to_scenario(self, mock_import): + """Shorthand config values are passed through to Scenario construction.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + # Patch Scenario import inside _build_scenario_from_shorthand + mock_scenario_cls = MagicMock() + 
mock_scenario_instance = _make_mock_scenario()
+        mock_scenario_cls.return_value = mock_scenario_instance
+
+        with patch.dict("sys.modules", {
+            "are": MagicMock(),
+            "are.simulation": MagicMock(),
+            "are.simulation.scenarios": MagicMock(),
+            "are.simulation.scenarios.scenario": MagicMock(Scenario=mock_scenario_cls),
+        }):
+            apps = [MagicMock(), MagicMock()]
+            task_data = {
+                "apps": apps,
+                "duration": 300,
+                "seed": 99,
+                "start_time": 100,
+                "time_increment_in_seconds": 5,
+            }
+            env = AREEnvironment(task_data=task_data)
+
+        mock_scenario_cls.assert_called_once()
+        call_kwargs = mock_scenario_cls.call_args[1]
+        assert call_kwargs["duration"] == 300
+        assert call_kwargs["seed"] == 99
+        assert call_kwargs["start_time"] == 100
+        assert call_kwargs["time_increment_in_seconds"] == 5
+        mock_scenario_instance.initialize.assert_called_once()
+```
+
+- [ ] **Step 2: Run the new tests**
+
+```bash
+uv run pytest tests/interface/environments/test_are_environment.py::TestAREEnvironmentShorthandPath -v
+```
+
+Expected: PASS. Unlike the earlier TDD steps, the shorthand path was already implemented in Task 3, so these tests document and lock in its behavior rather than drive new code.
+
+- [ ] **Step 3: Run full test suite to verify everything still passes**
+
+```bash
+uv run pytest tests/interface/environments/ -v
+```
+
+Expected: All tests PASS (shorthand path was already stubbed in Task 3).
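For orientation, the branch that `setup_state` takes over the two `task_data` shapes can be sketched standalone, with no ARE install needed. `resolve_path` is a hypothetical helper name used only for this illustration, not part of the codebase:

```python
from typing import Any, Dict


def resolve_path(task_data: Dict[str, Any]) -> str:
    """Mirror setup_state's dispatch: an explicit scenario wins, else shorthand."""
    if task_data.get("scenario") is not None:
        return "scenario"
    if "apps" in task_data:
        return "shorthand"
    raise ValueError(
        "task_data must contain either 'scenario' (ARE Scenario object) "
        "or 'apps' (list of ARE App instances)."
    )


print(resolve_path({"scenario": object()}))         # scenario
print(resolve_path({"apps": [], "duration": 300}))  # shorthand
```

Note that a `task_data` containing both keys resolves to the scenario path, which matches `setup_state` only building from shorthand when `scenario` is absent.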
+ +- [ ] **Step 4: Commit** + +```bash +git add tests/interface/environments/test_are_environment.py +git commit -m "test: add shorthand construction path tests for AREEnvironment" +``` + +--- + +### Task 5: Add Optional Dependency to pyproject.toml + +**Files:** +- Modify: `pyproject.toml` + +- [ ] **Step 1: Add the `are` optional extra** + +In `pyproject.toml`, under `[project.optional-dependencies]`, add the `are` extra near the existing benchmark extras (near `gaia2`): + +```toml +are = ["meta-agents-research-environments>=1.2.0"] +``` + +- [ ] **Step 2: Add pytest marker** + +In `pyproject.toml` under `[tool.pytest.ini_options]` markers, add: + +```toml +"are: Tests that specifically require ARE (Agent Research Environments)", +``` + +Note: check existing markers — if there is already an equivalent `gaia2` marker that covers ARE, this may be redundant. Add only if no existing marker covers the `are` extra specifically. + +- [ ] **Step 3: Commit** + +```bash +git add pyproject.toml +git commit -m "feat: add 'are' optional dependency extra" +``` + +--- + +### Task 6: Integration Smoke Test (optional, requires ARE installed) + +**Files:** +- Create: `tests/interface/environments/test_are_integration.py` + +This test is marked with `@pytest.mark.are` so it only runs when ARE is installed. 
+ +- [ ] **Step 1: Write the integration test** + +Create `tests/interface/environments/test_are_integration.py`: + +```python +"""Integration tests for AREEnvironment (requires ARE installed).""" + +import pytest + +try: + import are # noqa: F401 + HAS_ARE = True +except ImportError: + HAS_ARE = False + + +@pytest.mark.are +@pytest.mark.skipif(not HAS_ARE, reason="ARE not installed") +class TestAREEnvironmentIntegration: + """Integration tests that exercise real ARE infrastructure.""" + + def test_import_works(self): + """AREEnvironment can be imported when ARE is installed.""" + from maseval.interface.environments.are import AREEnvironment + assert AREEnvironment is not None + + def test_tool_wrapper_import(self): + """AREToolWrapper can be imported when ARE is installed.""" + from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + assert AREToolWrapper is not None + + def test_package_init_exports(self): + """Package __init__ exports AREEnvironment when ARE is installed.""" + from maseval.interface.environments import AREEnvironment, AREToolWrapper + assert AREEnvironment is not None + assert AREToolWrapper is not None +``` + +- [ ] **Step 2: Run integration tests (only if ARE is installed)** + +```bash +uv run pytest tests/interface/environments/test_are_integration.py -v -m are +``` + +Expected: PASS if ARE is installed, SKIP otherwise. + +- [ ] **Step 3: Run full test suite to verify nothing is broken** + +```bash +uv run pytest tests/ -v --ignore=tests/interface/environments/test_are_integration.py +``` + +Expected: All existing tests PASS, new unit tests PASS. 
+ +- [ ] **Step 4: Commit** + +```bash +git add tests/interface/environments/test_are_integration.py +git commit -m "test: add ARE integration smoke tests" +``` diff --git a/docs/superpowers/plans/2026-03-28-are-fixes-gaia2-simplification.md b/docs/superpowers/plans/2026-03-28-are-fixes-gaia2-simplification.md new file mode 100644 index 0000000..69a5dcf --- /dev/null +++ b/docs/superpowers/plans/2026-03-28-are-fixes-gaia2-simplification.md @@ -0,0 +1,1391 @@ +# AREEnvironment Fixes & Gaia2 Simplification + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Fix silent-failure bugs in AREEnvironment (issues from AREISSUES.md), then simplify Gaia2Environment by making it a subclass of AREEnvironment and eliminating Gaia2GenericTool. + +**Architecture:** Phase 1 fixes AREEnvironment and AREToolWrapper to be correct and feature-complete (simulation time tracking, error propagation, AppToolAdapter, AUI filtering, turn notifications). Phase 2 makes Gaia2Environment subclass AREEnvironment, overriding only setup_state and create_tools, and deletes Gaia2GenericTool. All existing Gaia2 tests must continue to pass unchanged to confirm behavioral equivalence. 
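The Phase 2 target shape described above can be sketched as follows. The base-class body and the GAIA2-specific lines are illustrative placeholders under the plan's stated constraint (override only `setup_state` and `create_tools`), not the final implementation:

```python
from typing import Any, Dict


class AREEnvironment:
    """Stand-in for maseval.interface.environments.are.AREEnvironment."""

    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
        return {"scenario_id": task_data.get("scenario_id")}

    def create_tools(self) -> Dict[str, Any]:
        return {}


class Gaia2Environment(AREEnvironment):
    """Overrides only setup_state and create_tools; lifecycle,
    tracing, and tool wrapping are inherited unchanged."""

    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
        state = super().setup_state(task_data)
        state["benchmark"] = "gaia2"  # illustrative GAIA2-specific extension
        return state

    def create_tools(self) -> Dict[str, Any]:
        tools = super().create_tools()
        # GAIA2-specific tool filtering/augmentation would happen here.
        return tools


env = Gaia2Environment()
print(env.setup_state({"scenario_id": "gaia2-001"}))
```

Because the subclass delegates everything else, the existing Gaia2 tests passing unchanged is exactly what demonstrates behavioral equivalence after the refactor.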
+ +**Tech Stack:** Python, unittest.mock, pytest, ARE (optional dependency, mocked in tests) + +--- + +## File Structure + +### Phase 1 — AREEnvironment Fixes + +| Action | File | Responsibility | +|--------|------|----------------| +| Modify | `maseval/interface/environments/are_tool_wrapper.py` | Add simulation time tracking, use AppToolAdapter, remove silent defaults | +| Modify | `maseval/interface/environments/are.py` | Fix oracle mode, add AUI filtering, add get_turn_notifications, fix error handling, add convenience accessors | +| Modify | `tests/interface/environments/test_are_environment.py` | Add tests for all new/changed behavior | + +### Phase 2 — Gaia2 Simplification + +| Action | File | Responsibility | +|--------|------|----------------| +| Modify | `maseval/benchmark/gaia2/environment.py` | Subclass AREEnvironment, keep only GAIA2-specific logic | +| Delete | `maseval/benchmark/gaia2/tool_wrapper.py` | Replaced by AREToolWrapper | +| Modify | `tests/test_benchmarks/test_gaia2/test_tool_wrapper.py` | Update imports to AREToolWrapper | +| Modify | `tests/test_benchmarks/test_gaia2/test_environment.py` | Update imports, verify all tests still pass | +| Modify | `tests/test_benchmarks/test_gaia2/conftest.py` | Update fixtures if they reference Gaia2GenericTool | + +--- + +## Phase 1: AREEnvironment Fixes + +### Task 1: Add simulation time tracking to AREToolWrapper + +**Files:** +- Modify: `maseval/interface/environments/are_tool_wrapper.py` +- Test: `tests/interface/environments/test_are_environment.py` + +**AREISSUES.md ref:** Issue #2 (simulation time tracking), Issue #3 (schema defaults), Issue #8 (AppToolAdapter) + +- [ ] **Step 1: Write failing test for simulation time in invocation meta** + +Add to `tests/interface/environments/test_are_environment.py`: + +```python +class TestAREToolWrapper: + """Tests for AREToolWrapper.""" + + def test_invocation_records_simulation_time(self): + """Tool invocations record simulation_time_before/after in 
meta.""" + from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + mock_env = MagicMock() + # Simulate time advancing during tool call + mock_env.get_simulation_time.side_effect = [100.0, 105.0] + + mock_tool = MagicMock() + mock_tool.name = "Email__send" + mock_tool.description = "Send email" + mock_tool.inputs = {"to": {"type": "string"}} + mock_tool.output_type = "string" + mock_tool.args = [] + mock_tool.return_value = "sent" + + wrapper = AREToolWrapper(mock_tool, mock_env) + wrapper(to="alice@example.com") + + invocation = wrapper.history.to_list()[0] + assert invocation["meta"]["simulation_time_before"] == 100.0 + assert invocation["meta"]["simulation_time_after"] == 105.0 + assert invocation["meta"]["simulation_time_elapsed"] == 5.0 +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREToolWrapper::test_invocation_records_simulation_time -v` + +Expected: FAIL — `meta` will be `{}` (no simulation time recorded). 
+ +- [ ] **Step 3: Write failing test for simulation time when get_simulation_time raises** + +```python + def test_invocation_records_none_when_sim_time_unavailable(self): + """If get_simulation_time() raises, meta records None without crashing.""" + from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + mock_env = MagicMock() + mock_env.get_simulation_time.side_effect = AttributeError("no current_time") + + mock_tool = MagicMock() + mock_tool.name = "Email__send" + mock_tool.description = "Send email" + mock_tool.inputs = {} + mock_tool.output_type = "string" + mock_tool.args = [] + mock_tool.return_value = "sent" + + wrapper = AREToolWrapper(mock_tool, mock_env) + wrapper() + + invocation = wrapper.history.to_list()[0] + assert invocation["meta"]["simulation_time_before"] is None + assert invocation["meta"]["simulation_time_after"] is None + assert invocation["meta"]["simulation_time_elapsed"] is None +``` + +- [ ] **Step 4: Run test to verify it fails** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREToolWrapper::test_invocation_records_none_when_sim_time_unavailable -v` + +Expected: FAIL. + +- [ ] **Step 5: Implement simulation time tracking in AREToolWrapper** + +In `maseval/interface/environments/are_tool_wrapper.py`, add a `_get_simulation_time` helper and update `__call__`: + +```python +from typing import TYPE_CHECKING, Any, Dict, Optional + +# ... existing imports ... + +class AREToolWrapper(TraceableMixin, ConfigurableMixin): + # ... existing __init__ ... + + def _get_simulation_time(self) -> Optional[float]: + """Get current simulation time from the parent AREEnvironment. + + Returns: + Simulation time in seconds, or None if unavailable. + """ + try: + return self._environment.get_simulation_time() + except Exception: + return None + + def __call__(self, **kwargs: Any) -> Any: + """Execute the ARE tool with tracing. 
+ + Args: + **kwargs: Tool arguments matching the inputs schema. + + Returns: + Tool output (type varies per tool). + + Raises: + Any exception from the underlying ARE tool is re-raised. + """ + start_time = datetime.now() + sim_time_before = self._get_simulation_time() + status = "success" + result = None + error_message = None + + try: + result = self._are_tool(**kwargs) + except Exception as e: + status = "error" + error_message = str(e) + raise + finally: + sim_time_after = self._get_simulation_time() + self.history.add_invocation( + inputs=kwargs, + outputs=result if status == "success" else error_message, + status=status, + timestamp=start_time.isoformat(), + meta={ + "wall_time": start_time.isoformat(), + "simulation_time_before": sim_time_before, + "simulation_time_after": sim_time_after, + "simulation_time_elapsed": ( + sim_time_after - sim_time_before + if sim_time_after is not None and sim_time_before is not None + else None + ), + }, + ) + + return result +``` + +Also update the import at the top of the file to include `Optional`: + +```python +from typing import TYPE_CHECKING, Any, Dict, Optional +``` + +- [ ] **Step 6: Run both tests to verify they pass** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREToolWrapper -v` + +Expected: PASS (both tests). + +- [ ] **Step 7: Commit** + +```bash +cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman +git add maseval/interface/environments/are_tool_wrapper.py tests/interface/environments/test_are_environment.py +git commit -m "feat(are): add simulation time tracking to AREToolWrapper invocations + +Record simulation_time_before, simulation_time_after, and +simulation_time_elapsed in invocation meta dict, matching +Gaia2GenericTool behavior. Gracefully returns None when +simulation time is unavailable." 
+``` + +### Task 2: Remove silent defaults from AREToolWrapper schema extraction and use AppToolAdapter + +**Files:** +- Modify: `maseval/interface/environments/are_tool_wrapper.py` +- Test: `tests/interface/environments/test_are_environment.py` + +**AREISSUES.md ref:** Issue #3 (schema defaults), Issue #8 (AppToolAdapter) + +- [ ] **Step 1: Write failing test — schema extraction crashes on missing arg_type** + +```python + def test_schema_extraction_crashes_on_missing_arg_type(self): + """_extract_schema raises AttributeError if arg lacks arg_type.""" + from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + mock_tool = MagicMock() + mock_arg = MagicMock(spec=[]) # empty spec — no attributes + mock_arg.name = "param1" + # Deliberately no arg_type or has_default + mock_tool.args = [mock_arg] + + with pytest.raises(AttributeError): + AREToolWrapper._extract_schema(mock_tool) +``` + +- [ ] **Step 2: Run test to verify it fails (currently getattr returns "string" instead of crashing)** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREToolWrapper::test_schema_extraction_crashes_on_missing_arg_type -v` + +Expected: FAIL — test expects AttributeError, but getattr returns "string" silently. 
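The failure mode this test pins down is ordinary `getattr` semantics, shown standalone below. `ToolArg` is a made-up stand-in for an ARE tool arg, not a real ARE class:

```python
class ToolArg:
    """Stand-in for an ARE tool arg whose arg_type vanished after an API change."""
    name = "param1"


arg = ToolArg()

# Silent-default style: the missing attribute becomes a plausible-looking value,
# so a wrong schema propagates downstream without any error.
print(getattr(arg, "arg_type", "string"))  # prints: string

# Direct access: the API drift surfaces immediately at the call site.
try:
    arg.arg_type  # type: ignore[attr-defined]
except AttributeError as exc:
    print(f"caught: {exc}")
```

This is the whole argument for removing the defaults: the second form fails loudly at the exact line where the assumption broke, instead of handing agents a fabricated schema.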
+ +- [ ] **Step 3: Write failing test — AppToolAdapter is used for metadata** + +```python + @patch("maseval.interface.environments.are_tool_wrapper.AppToolAdapter") + def test_uses_app_tool_adapter_for_metadata(self, mock_adapter_cls): + """AREToolWrapper delegates metadata extraction to AppToolAdapter.""" + mock_adapter = MagicMock() + mock_adapter.name = "Calendar__create_event" + mock_adapter.description = "Create a calendar event" + mock_adapter.inputs = {"title": {"type": "string"}} + mock_adapter.output_type = "string" + mock_adapter.actual_return_type = "str" + mock_adapter_cls.return_value = mock_adapter + + mock_tool = MagicMock() + mock_tool.args = [] + + mock_env = MagicMock() + + wrapper = AREToolWrapper(mock_tool, mock_env) + + mock_adapter_cls.assert_called_once_with(mock_tool) + assert wrapper.name == "Calendar__create_event" + assert wrapper.description == "Create a calendar event" + assert wrapper.inputs == {"title": {"type": "string"}} + assert wrapper.output_type == "string" +``` + +- [ ] **Step 4: Run test to verify it fails** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREToolWrapper::test_uses_app_tool_adapter_for_metadata -v` + +Expected: FAIL — AppToolAdapter not imported or used. + +- [ ] **Step 5: Implement — use AppToolAdapter, remove silent defaults** + +Update `maseval/interface/environments/are_tool_wrapper.py`: + +```python +"""ARE Tool Wrapper for MASEval. + +Framework-agnostic wrapper for ARE Tool instances. Provides a callable +interface with ToolInvocationHistory tracing and metadata exposure for +framework adapters (smolagents, LangGraph, etc.) to build framework-native +tools from. + +This is the layer 1->2 wrapper: +- Layer 1: ARE Tool (forward(), inputs, output_type) +- Layer 2: maseval generic (callable, ToolInvocationHistory, metadata) +- Layer 3: framework-specific -- NOT handled here. 
+""" + +from datetime import datetime +from typing import TYPE_CHECKING, Any, Dict, Optional + +from maseval.core.tracing import TraceableMixin +from maseval.core.config import ConfigurableMixin +from maseval.core.history import ToolInvocationHistory + +try: + from are.simulation.tool_utils import AppToolAdapter # type: ignore[import-not-found] +except ImportError: + AppToolAdapter = None # type: ignore[assignment,misc] + +if TYPE_CHECKING: + from maseval.interface.environments.are import AREEnvironment + + +class AREToolWrapper(TraceableMixin, ConfigurableMixin): + """Framework-agnostic wrapper for ARE tools with maseval tracing. + + Wraps an ARE Tool and exposes its metadata (name, description, inputs, + output_type) so that agent adapters can construct framework-native tools. + + Example for smolagents:: + + class MySmolagentsTool(smolagents.Tool): + skip_forward_signature_validation = True + + def __init__(self, wrapper: AREToolWrapper): + self.wrapper = wrapper + self.name = wrapper.name + self.description = wrapper.description + self.inputs = wrapper.inputs + self.output_type = wrapper.output_type + super().__init__() + + def forward(self, **kwargs) -> str: + return self.wrapper(**kwargs) + """ + + def __init__(self, are_tool: Any, environment: "AREEnvironment"): + """Initialize the tool wrapper. + + Args: + are_tool: ARE Tool instance to wrap. + environment: The AREEnvironment this tool belongs to. + + Raises: + ImportError: If ARE is not installed (AppToolAdapter unavailable). + """ + super().__init__() + self._are_tool = are_tool + self._environment = environment + self.history = ToolInvocationHistory() + + # Delegate metadata extraction to ARE's AppToolAdapter (tool_utils.py:544-584). + # This is the source of truth for tool name, description, inputs, and output_type. 
+ if AppToolAdapter is None: + raise ImportError( + "ARE (Agent Research Environments) is required for AREToolWrapper.\n" + "Install with: pip install maseval[are]" + ) + adapter = AppToolAdapter(are_tool) + self.name: str = adapter.name + self.description: str = adapter.description + self.inputs: Dict[str, Any] = adapter.inputs + self.output_type: str = adapter.output_type + self.actual_return_type: Optional[str] = adapter.actual_return_type + + # Extract JSON schema from ARE tool args (if available) + self.input_schema: Dict[str, Any] = self._extract_schema(are_tool) + + @staticmethod + def _extract_schema(are_tool: Any) -> Dict[str, Any]: + """Convert ARE's args list to JSON schema format. + + Args: + are_tool: ARE Tool instance. + + Returns: + JSON schema dict with properties and required fields. + + Raises: + AttributeError: If an arg lacks expected attributes (arg_type, + has_default). This is intentional — a missing attribute means + the ARE API changed and the schema would be wrong. + """ + args = getattr(are_tool, "args", None) + if not args: + return {} + + properties = {} + required = [] + + for arg in args: + param_name = getattr(arg, "name", None) + if not param_name: + continue + properties[param_name] = { + "type": arg.arg_type, + "description": getattr(arg, "description", ""), + } + if not arg.has_default: + required.append(param_name) + + return {"properties": properties, "required": required} + + def _get_simulation_time(self) -> Optional[float]: + """Get current simulation time from the parent AREEnvironment. + + Returns: + Simulation time in seconds, or None if unavailable. + """ + try: + return self._environment.get_simulation_time() + except Exception: + return None + + def __call__(self, **kwargs: Any) -> Any: + """Execute the ARE tool with tracing. + + Args: + **kwargs: Tool arguments matching the inputs schema. + + Returns: + Tool output (type varies per tool). + + Raises: + Any exception from the underlying ARE tool is re-raised. 
+ """ + start_time = datetime.now() + sim_time_before = self._get_simulation_time() + status = "success" + result = None + error_message = None + + try: + result = self._are_tool(**kwargs) + except Exception as e: + status = "error" + error_message = str(e) + raise + finally: + sim_time_after = self._get_simulation_time() + self.history.add_invocation( + inputs=kwargs, + outputs=result if status == "success" else error_message, + status=status, + timestamp=start_time.isoformat(), + meta={ + "wall_time": start_time.isoformat(), + "simulation_time_before": sim_time_before, + "simulation_time_after": sim_time_after, + "simulation_time_elapsed": ( + sim_time_after - sim_time_before + if sim_time_after is not None and sim_time_before is not None + else None + ), + }, + ) + + return result + + def gather_traces(self) -> Dict[str, Any]: + """Gather execution traces from this tool. + + Returns: + Dictionary with tool name, invocation history, and counts. + """ + return { + **super().gather_traces(), + "name": self.name, + "invocations": self.history.to_list(), + "total_invocations": len(self.history), + } + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration from this tool. + + Returns: + Dictionary with tool name, description, and schema. + """ + return { + **super().gather_config(), + "name": self.name, + "description": self.description, + "input_schema": self.input_schema, + } + + def __repr__(self) -> str: + args = ", ".join(f"{k}: {v.get('type', '?')}" for k, v in self.inputs.items()) + return f"{self.__class__.__name__}({self.name}({args}) -> {self.output_type})" +``` + +- [ ] **Step 6: Update existing tests to mock AppToolAdapter** + +The existing tests in `test_are_environment.py` that create `AREEnvironment` instances need to mock `AppToolAdapter` since it's now imported at module level. 
Update the mock tool setup in `_make_mock_are_env` and add the patch: + +In `_make_mock_are_env`, the mock tools already have `.name`, `.description`, `.inputs`, `.output_type` — but now AREToolWrapper reads these from `AppToolAdapter(are_tool)` instead of directly. Add a module-level patch for tests: + +```python +@pytest.fixture(autouse=True) +def mock_app_tool_adapter(): + """Mock AppToolAdapter so AREToolWrapper can initialize without ARE installed.""" + def make_adapter(are_tool): + adapter = MagicMock() + adapter.name = are_tool.name + adapter.description = are_tool.description + adapter.inputs = are_tool.inputs + adapter.output_type = are_tool.output_type + adapter.actual_return_type = None + return adapter + + with patch("maseval.interface.environments.are_tool_wrapper.AppToolAdapter", side_effect=make_adapter): + yield +``` + +- [ ] **Step 7: Run all AREEnvironment tests** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py -v` + +Expected: All PASS. + +- [ ] **Step 8: Commit** + +```bash +cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman +git add maseval/interface/environments/are_tool_wrapper.py tests/interface/environments/test_are_environment.py +git commit -m "fix(are): use AppToolAdapter for metadata, remove silent schema defaults + +AREToolWrapper now delegates to ARE's AppToolAdapter for tool +metadata (name, description, inputs, output_type), matching +Gaia2GenericTool. Schema extraction no longer silently fabricates +'string' type or 'optional' status when attributes are missing — +crashes immediately so ARE API changes are detected." 
+``` + +### Task 3: Fix oracle mode — remove hasattr fallbacks + +**Files:** +- Modify: `maseval/interface/environments/are.py` +- Test: `tests/interface/environments/test_are_environment.py` + +**AREISSUES.md ref:** Issue #1 + +- [ ] **Step 1: Write failing test — oracle mode calls methods directly** + +```python +class TestAREEnvironmentOracleMode: + """Tests for AREEnvironment oracle mode.""" + + @patch("maseval.interface.environments.are._import_are") + def test_oracle_mode_captures_traces(self, mock_import): + """Oracle mode runs scenario and captures apps_state and world_logs.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_oracle_env = MagicMock() + mock_oracle_env.get_apps_state.return_value = {"email": {"inbox": []}} + mock_oracle_env.get_world_logs.return_value = [{"event": "email_sent"}] + + mock_agent_env = _make_mock_are_env() + + # First Environment() call = oracle env, second = agent env + mock_are_mod.Environment.side_effect = [mock_oracle_env, mock_agent_env] + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}, run_oracle=True) + + assert env._oracle_traces is not None + assert env._oracle_traces["apps_state"] == {"email": {"inbox": []}} + assert env._oracle_traces["world_logs"] == [{"event": "email_sent"}] + mock_oracle_env.get_apps_state.assert_called_once() + mock_oracle_env.get_world_logs.assert_called_once() + scenario.soft_reset.assert_called_once() + + @patch("maseval.interface.environments.are._import_are") + def test_oracle_mode_crashes_if_get_apps_state_missing(self, mock_import): + """Oracle mode raises AttributeError if ARE env lacks get_apps_state.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_oracle_env = MagicMock(spec=[]) # no methods + mock_oracle_env.run = MagicMock() # only run() exists + + mock_agent_env = _make_mock_are_env() + mock_are_mod.Environment.side_effect = [mock_oracle_env, mock_agent_env] + + scenario = 
_make_mock_scenario() + with pytest.raises(AttributeError): + AREEnvironment(task_data={"scenario": scenario}, run_oracle=True) +``` + +- [ ] **Step 2: Run tests to verify the second test fails** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREEnvironmentOracleMode -v` + +Expected: `test_oracle_mode_captures_traces` PASS, `test_oracle_mode_crashes_if_get_apps_state_missing` FAIL (hasattr returns False, silently returns `{}`). + +- [ ] **Step 3: Fix — remove hasattr fallbacks** + +In `maseval/interface/environments/are.py`, replace `_run_oracle_mode`: + +```python + def _run_oracle_mode(self, are_mod: Any, scenario: Any) -> Dict[str, Any]: + """Run ARE oracle mode to generate expected event log. + + Args: + are_mod: ARE module namespace. + scenario: ARE Scenario instance. + + Returns: + Dict with oracle event log. + + Raises: + AttributeError: If ARE environment lacks expected oracle methods. + """ + oracle_config = are_mod.EnvironmentConfig( + oracle_mode=True, + duration=scenario.duration, + time_increment_in_seconds=getattr(scenario, "time_increment_in_seconds", 1), + ) + oracle_env = are_mod.Environment(oracle_config) + oracle_env.run(scenario, wait_for_end=True, schedule_events=True) + + oracle_traces = { + "apps_state": oracle_env.get_apps_state(), + "world_logs": oracle_env.get_world_logs(), + } + + # Soft-reset so app state is clean for agent run + scenario.soft_reset() + + return oracle_traces +``` + +- [ ] **Step 4: Run oracle mode tests** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREEnvironmentOracleMode -v` + +Expected: Both PASS. 
+ +- [ ] **Step 5: Commit** + +```bash +cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman +git add maseval/interface/environments/are.py tests/interface/environments/test_are_environment.py +git commit -m "fix(are): remove hasattr fallbacks from oracle mode + +Oracle mode now calls get_apps_state() and get_world_logs() +directly. If the ARE API changed, this crashes immediately +instead of silently returning empty data that produces +meaningless evaluation scores." +``` + +### Task 4: Remove silent exception swallowing from poll_notifications and lifecycle methods + +**Files:** +- Modify: `maseval/interface/environments/are.py` +- Test: `tests/interface/environments/test_are_environment.py` + +**AREISSUES.md ref:** Issue #4 (poll_notifications), Issue #6 (pause/resume) + +- [ ] **Step 1: Write failing test — poll_notifications propagates unexpected errors** + +```python +class TestAREEnvironmentNotifications: + """Tests for notification polling.""" + + @patch("maseval.interface.environments.are._import_are") + def test_poll_notifications_propagates_errors(self, mock_import): + """poll_notifications does not swallow unexpected exceptions.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + # Set up notification system that raises on access + mock_notif_sys = MagicMock() + mock_notif_sys.message_queue.get_by_timestamp.side_effect = RuntimeError("corrupt queue") + mock_are_env.notification_system = mock_notif_sys + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + + with pytest.raises(RuntimeError, match="corrupt queue"): + env.poll_notifications() +``` + +- [ ] **Step 2: Write failing test — pause propagates errors** + +```python + @patch("maseval.interface.environments.are._import_are") + def test_pause_propagates_errors(self, mock_import): + """pause() lets exceptions propagate 
for fail_on_task_error to catch.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_env.pause.side_effect = RuntimeError("pause failed") + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + + with pytest.raises(RuntimeError, match="pause failed"): + env.pause() +``` + +- [ ] **Step 3: Run tests to verify they fail** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREEnvironmentNotifications -v` + +Expected: FAIL — exceptions are currently swallowed. + +Note: `test_pause_propagates_errors` will actually PASS already because AREEnvironment's `pause()` currently does NOT wrap in try/except. The test for `poll_notifications` is the one that will fail. Keep both tests — they document the intended contract. + +- [ ] **Step 4: Fix — remove except Exception from poll_notifications** + +In `maseval/interface/environments/are.py`, replace the `poll_notifications` method. Remove the outer `try/except Exception` block: + +```python + def poll_notifications(self) -> Tuple[List[str], List[str], bool]: + """Drain pending notifications from ARE's notification queue. + + Returns: + Tuple of ``(user_messages, env_notifications, has_stop_signal)``. + ``user_messages``: Messages from simulated users. + ``env_notifications``: System events (new email, calendar reminder, etc.). + ``has_stop_signal``: True when simulation has ended. + + Raises: + Any unexpected exception from the notification system propagates + so that the benchmark runner can classify it via ``fail_on_*`` flags. + + Agent adapters should call this between agent steps and inject + the messages into the agent's context. 
+ """ + if self._are_env is None: + return [], [], False + + notification_system = getattr(self._are_env, "notification_system", None) + if notification_system is None: + return [], [], False + + from datetime import datetime, timezone + from are.simulation.notification_system import MessageType # type: ignore[import-not-found] + + sim_time = self.get_simulation_time() + timestamp = datetime.fromtimestamp(sim_time, tz=timezone.utc) + unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp) + + if not unhandled: + return [], [], False + + user_messages: List[str] = [] + env_notifications: List[str] = [] + has_stop = False + + for notif in unhandled: + msg_type = getattr(notif, "message_type", None) + if msg_type == MessageType.USER_MESSAGE: + user_messages.append(notif.message) + elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION: + ts = notif.timestamp.strftime("%Y-%m-%d %H:%M:%S") if notif.timestamp else "" + env_notifications.append(f"[{ts}] {notif.message}") + elif msg_type == MessageType.ENVIRONMENT_STOP: + has_stop = True + + return user_messages, env_notifications, has_stop +``` + +- [ ] **Step 5: Run all notification and lifecycle tests** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREEnvironmentNotifications tests/interface/environments/test_are_environment.py::TestAREEnvironmentScenarioPath::test_pause_and_resume -v` + +Expected: All PASS. 
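For orientation, here is a sketch of the inner-loop pattern the `poll_notifications` docstring describes: an adapter drains notifications between agent steps and injects the messages into the agent's context. `StubEnv` and `run_inner_loop` are illustrative stand-ins, not maseval APIs.

```python
from typing import List, Tuple

class StubEnv:
    """Illustrative stand-in for AREEnvironment (not a real maseval class)."""

    def __init__(self, batches: List[Tuple[List[str], List[str], bool]]):
        self._batches = batches

    def poll_notifications(self) -> Tuple[List[str], List[str], bool]:
        # The real AREEnvironment drains ARE's notification queue here.
        return self._batches.pop(0) if self._batches else ([], [], True)

def run_inner_loop(env: StubEnv, max_steps: int = 10) -> List[str]:
    """Drain notifications between steps; stop on the simulation's stop signal."""
    context: List[str] = []
    for _ in range(max_steps):
        user_msgs, env_notifs, has_stop = env.poll_notifications()
        context.extend(user_msgs)      # simulated-user messages, verbatim
        context.extend(env_notifs)     # timestamped environment events
        if has_stop:
            break
        # ... one agent step would run here ...
    return context

env = StubEnv([(["Reschedule my 3pm"], ["[2026-03-27 09:00:00] New email"], False), ([], [], True)])
assert run_inner_loop(env) == ["Reschedule my 3pm", "[2026-03-27 09:00:00] New email"]
```

Note that any exception raised inside `poll_notifications` propagates out of this loop, which is exactly what the `fail_on_*` classification machinery needs.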
+ +- [ ] **Step 6: Commit** + +```bash +cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman +git add maseval/interface/environments/are.py tests/interface/environments/test_are_environment.py +git commit -m "fix(are): propagate errors from poll_notifications and lifecycle methods + +Remove bare except Exception from poll_notifications — errors on +the data path must propagate so the benchmark runner can classify +them and respect fail_on_task_error / fail_on_setup_error settings. +pause() and resume_with_offset() already propagate (keep as-is). +cleanup() keeps try/except (teardown only)." +``` + +### Task 5: Add AUI tool filtering and get_turn_notifications to AREEnvironment + +**Files:** +- Modify: `maseval/interface/environments/are.py` +- Test: `tests/interface/environments/test_are_environment.py` + +**AREISSUES.md ref:** Issue #5 (AUI filtering), Issue #7 (get_turn_notifications) + +- [ ] **Step 1: Write failing test — AUI tools filtered when opt-in enabled** + +```python +class TestAREEnvironmentAUIFiltering: + """Tests for AUI tool filtering.""" + + @patch("maseval.interface.environments.are._import_are") + def test_aui_tools_filtered_when_enabled(self, mock_import): + """AUI message-retrieval tools are excluded when filter_aui_tools=True.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + # Create app with AUI tools and a normal tool + aui_app = MagicMock() + aui_app.name = "AgentUserInterface" + aui_tool_get = MagicMock() + aui_tool_get.name = "AgentUserInterface__get_last_message_from_user" + aui_tool_send = MagicMock() + aui_tool_send.name = "AgentUserInterface__send_message_to_user" + aui_tool_send.description = "Send message" + aui_tool_send.inputs = {} + aui_tool_send.output_type = "string" + aui_tool_send.args = [] + aui_app.get_tools.return_value = [aui_tool_get, aui_tool_send] + + email_app = MagicMock() + email_app.name = "EmailClient" + email_tool = MagicMock() + email_tool.name = 
"EmailClient__send_email" + email_tool.description = "Send email" + email_tool.inputs = {"to": {"type": "string"}} + email_tool.output_type = "string" + email_tool.args = [] + email_app.get_tools.return_value = [email_tool] + + mock_are_env = MagicMock() + mock_are_env.apps = {"AgentUserInterface": aui_app, "EmailClient": email_app} + mock_are_env.current_time = 0.0 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}, filter_aui_tools=True) + + tools = env.get_tools() + assert "AgentUserInterface__get_last_message_from_user" not in tools + assert "AgentUserInterface__send_message_to_user" in tools + assert "EmailClient__send_email" in tools + + @patch("maseval.interface.environments.are._import_are") + def test_aui_tools_not_filtered_by_default(self, mock_import): + """AUI tools are included by default (filter_aui_tools=False).""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + aui_app = MagicMock() + aui_app.name = "AgentUserInterface" + aui_tool = MagicMock() + aui_tool.name = "AgentUserInterface__get_last_message_from_user" + aui_tool.description = "Get message" + aui_tool.inputs = {} + aui_tool.output_type = "string" + aui_tool.args = [] + aui_app.get_tools.return_value = [aui_tool] + + mock_are_env = MagicMock() + mock_are_env.apps = {"AgentUserInterface": aui_app} + mock_are_env.current_time = 0.0 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(task_data={"scenario": scenario}) + + tools = env.get_tools() + assert "AgentUserInterface__get_last_message_from_user" in tools +``` + +- [ ] **Step 2: Write failing test — get_turn_notifications re-queues env notifications** + +```python +class TestAREEnvironmentTurnNotifications: + """Tests for get_turn_notifications.""" + + @patch("maseval.interface.environments.are._import_are") + def 
test_get_turn_notifications_requeues_env_notifications(self, mock_import):
+        """get_turn_notifications separates user messages and re-queues env notifications."""
+        mock_are_mod = MagicMock()
+        mock_import.return_value = mock_are_mod
+
+        mock_are_env = _make_mock_are_env()
+        mock_are_mod.Environment.return_value = mock_are_env
+
+        # Mock notification system with MessageType enum
+        mock_message_type = MagicMock()
+        mock_message_type.USER_MESSAGE = "user"
+        mock_message_type.ENVIRONMENT_NOTIFICATION = "env"
+        mock_message_type.ENVIRONMENT_STOP = "stop"
+
+        user_notif = MagicMock()
+        user_notif.message_type = mock_message_type.USER_MESSAGE
+        user_notif.message = "Hello agent"
+
+        env_notif = MagicMock()
+        env_notif.message_type = mock_message_type.ENVIRONMENT_NOTIFICATION
+        env_notif.message = "New email arrived"
+
+        mock_notif_sys = MagicMock()
+        mock_notif_sys.message_queue.get_by_timestamp.return_value = [user_notif, env_notif]
+        mock_are_env.notification_system = mock_notif_sys
+
+        scenario = _make_mock_scenario()
+
+        # MessageType is imported inside get_turn_notifications from
+        # are.simulation.notification_system, so patch that source module.
+        # Patching the maseval module would raise AttributeError because
+        # are.py defines no module-level MessageType.
+        with patch("are.simulation.notification_system.MessageType", mock_message_type):
+            env = AREEnvironment(task_data={"scenario": scenario})
+            user_msgs, has_env, has_stop = env.get_turn_notifications()
+
+        assert user_msgs == ["Hello agent"]
+        assert has_env is True
+        assert has_stop is False
+        # Env notification was re-queued
+        mock_notif_sys.message_queue.put.assert_called_once_with(env_notif)
+```
+
+- [ ] **Step 3: Run tests to verify they fail**
+
+Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py::TestAREEnvironmentAUIFiltering tests/interface/environments/test_are_environment.py::TestAREEnvironmentTurnNotifications -v`
+
+Expected: FAIL — `filter_aui_tools` parameter doesn't exist, `get_turn_notifications` doesn't exist.
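A note on the patch target in the test above: `unittest.mock.patch` replaces an attribute on the module named in the target string, and a name imported locally inside a function is re-resolved against its source module on every call. So a name that `get_turn_notifications` imports from `are.simulation.notification_system` at call time must be patched on that module, not on `maseval.interface.environments.are`. A minimal, self-contained illustration (standard library only; unrelated to maseval):

```python
from unittest.mock import patch
import json

def formats_via_module_attr() -> str:
    # json.dumps is resolved on the json module at call time,
    # so patching "json.dumps" affects this call.
    return json.dumps({"a": 1})

def formats_via_local_import() -> str:
    # A local import also re-resolves against the source module at each
    # call, so the correct patch target is still "json.dumps", not the
    # module that contains this function.
    from json import dumps
    return dumps({"a": 1})

with patch("json.dumps", return_value="patched"):
    assert formats_via_module_attr() == "patched"
    assert formats_via_local_import() == "patched"

assert formats_via_module_attr() == '{"a": 1}'
```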
+ +- [ ] **Step 4: Implement AUI filtering and get_turn_notifications** + +In `maseval/interface/environments/are.py`: + +Add `filter_aui_tools` parameter to `__init__`: + +```python + # Tools removed by ARE's remove_aui_irrelevant_tools() + # ARE agents/default_agent/are_simulation_main.py:206-228 + _AUI_TOOLS_TO_REMOVE = { + "AgentUserInterface__get_last_message_from_user", + "AgentUserInterface__get_last_message_from_agent", + "AgentUserInterface__get_last_unread_messages", + "AgentUserInterface__get_all_messages", + } + + def __init__( + self, + task_data: Dict[str, Any], + callbacks: Optional[List[EnvironmentCallback]] = None, + run_oracle: bool = False, + notification_verbosity: str = "medium", + filter_aui_tools: bool = False, + ): + """Initialize AREEnvironment. + + Args: + task_data: ``task.environment_data`` dict. Must contain either: + - ``"scenario"``: ARE Scenario object, OR + - ``"apps"``: list of ARE App instances, plus optional ``"events"``, + ``"duration"``, ``"seed"``, ``"start_time"``, ``"time_increment_in_seconds"`` + callbacks: Optional maseval EnvironmentCallbacks. + run_oracle: If True, run ARE oracle mode during setup to generate + expected event log. Stored in traces for evaluation. + notification_verbosity: ARE notification verbosity level. + ``"low"`` = no environment notifications, + ``"medium"`` = standard notifications, + ``"high"`` = all notifications. + filter_aui_tools: If True, remove AgentUserInterface message-retrieval + tools and set ``wait_for_user_response = False``, matching ARE's + default agent behavior. Required when using notification-based + message delivery. 
+ """ + self._run_oracle = run_oracle + self._notification_verbosity = notification_verbosity + self._filter_aui_tools = filter_aui_tools + self._are_env: Any = None + self._scenario: Any = None + self._oracle_traces: Optional[Dict[str, Any]] = None + self._tool_wrappers: Dict[str, AREToolWrapper] = {} + + super().__init__(task_data, callbacks) +``` + +Update `create_tools` to support AUI filtering: + +```python + def create_tools(self) -> Dict[str, AREToolWrapper]: + """Wrap all ARE app tools in AREToolWrapper. + + When ``filter_aui_tools=True``, removes AgentUserInterface + message-retrieval tools and sets ``wait_for_user_response = False``, + matching ARE's ``remove_aui_irrelevant_tools()``. + + Returns: + Dict mapping tool names to AREToolWrapper instances. + """ + tools: Dict[str, AREToolWrapper] = {} + + if self._are_env is None: + return tools + + for app in self._are_env.apps.values(): + if self._filter_aui_tools and hasattr(app, "wait_for_user_response"): + app.wait_for_user_response = False + + for are_tool in app.get_tools(): + if self._filter_aui_tools and are_tool.name in self._AUI_TOOLS_TO_REMOVE: + continue + wrapper = AREToolWrapper(are_tool, self) + tools[are_tool.name] = wrapper + self._tool_wrappers[are_tool.name] = wrapper + + return tools +``` + +Add `get_turn_notifications` and convenience accessors: + +```python + def get_turn_notifications(self) -> Tuple[List[str], bool, bool]: + """Get notifications for turn transitions, re-queuing env notifications. + + Drains the notification queue, separates by type, re-queues environment + notifications (so the inner loop's pre-step picks them up), and returns + user messages and status flags. + + Matches ARE's ``get_notifications()`` in ``are_simulation_main.py:331-359``. + + Returns: + Tuple of ``(user_messages, has_env_notifications, has_stop)``. + + Raises: + Any unexpected exception from the notification system propagates. 
+ """ + if self._are_env is None: + return [], False, False + + notification_system = getattr(self._are_env, "notification_system", None) + if notification_system is None: + return [], False, False + + from datetime import datetime, timezone + from are.simulation.notification_system import MessageType # type: ignore[import-not-found] + + sim_time = self.get_simulation_time() + timestamp = datetime.fromtimestamp(sim_time, tz=timezone.utc) + unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp) + + if not unhandled: + return [], False, False + + user_messages: List[str] = [] + has_env = False + has_stop = False + + for notif in unhandled: + msg_type = getattr(notif, "message_type", None) + if msg_type == MessageType.USER_MESSAGE: + user_messages.append(notif.message) + elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION: + notification_system.message_queue.put(notif) + has_env = True + elif msg_type == MessageType.ENVIRONMENT_STOP: + has_stop = True + + return user_messages, has_env, has_stop + + def get_scenario(self) -> Any: + """Get the ARE scenario object.""" + return self._scenario + + def get_start_time(self) -> Optional[float]: + """Get the scenario start time. + + Returns: + Start time as Unix timestamp, or None if not available. + """ + return self.state.get("start_time") + + def get_notification_system(self) -> Any: + """Get the ARE notification system. + + Returns: + ARE NotificationSystem instance, or None if not available. + """ + if self._are_env is None: + return None + return getattr(self._are_env, "notification_system", None) +``` + +- [ ] **Step 5: Run all AREEnvironment tests** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/test_are_environment.py -v` + +Expected: All PASS. 
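To make the turn-transition contract concrete, here is a sketch of an outer loop consuming `get_turn_notifications()`. `StubTurnEnv` and the loop shape are illustrative assumptions, not maseval APIs; the key point is that environment notifications are not returned directly but re-queued for the inner loop's pre-step poll.

```python
from typing import List, Tuple

class StubTurnEnv:
    """Illustrative stand-in for AREEnvironment.get_turn_notifications."""

    def __init__(self, turns: List[Tuple[List[str], bool, bool]]):
        self._turns = turns

    def get_turn_notifications(self) -> Tuple[List[str], bool, bool]:
        return self._turns.pop(0) if self._turns else ([], False, True)

def run_turns(env: StubTurnEnv) -> List[List[str]]:
    """Start one agent turn per notification batch until the stop signal."""
    turns: List[List[str]] = []
    while True:
        user_msgs, has_env, has_stop = env.get_turn_notifications()
        if has_stop:
            break
        if user_msgs or has_env:
            # User messages seed the new turn's prompt. Re-queued environment
            # notifications are left for the inner loop's pre-step poll.
            turns.append(user_msgs)
            # ... run one agent turn here ...
        # (a real runner would advance simulation time before polling again)
    return turns

env = StubTurnEnv([(["Do X"], False, False), ([], True, False), ([], False, True)])
assert run_turns(env) == [["Do X"], []]
```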
+ +- [ ] **Step 6: Commit** + +```bash +cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman +git add maseval/interface/environments/are.py tests/interface/environments/test_are_environment.py +git commit -m "feat(are): add AUI tool filtering and get_turn_notifications + +Add filter_aui_tools parameter to AREEnvironment for notification- +based message delivery. Add get_turn_notifications() for multi-turn +agent loops. Add get_scenario(), get_start_time(), and +get_notification_system() convenience accessors." +``` + +### Task 6: Run full ARE test suite to verify Phase 1 + +**Files:** +- Test: `tests/interface/environments/test_are_environment.py` + +- [ ] **Step 1: Run all ARE tests** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/ -v` + +Expected: All PASS. + +- [ ] **Step 2: Run linter** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run ruff check maseval/interface/environments/` + +Expected: No errors. + +--- + +## Phase 2: Gaia2 Simplification + +### Task 7: Make Gaia2Environment subclass AREEnvironment + +**Files:** +- Modify: `maseval/benchmark/gaia2/environment.py` +- Test: `tests/test_benchmarks/test_gaia2/test_environment.py` (run existing tests, no changes yet) + +- [ ] **Step 1: Rewrite Gaia2Environment to subclass AREEnvironment** + +Replace `maseval/benchmark/gaia2/environment.py` with: + +```python +"""Gaia2 Benchmark - Environment. + +MASEval Environment wrapping ARE's simulation. + +Original Repository: https://github.com/facebookresearch/meta-agents-research-environments +Code License: MIT + +Citation: + Froger, R., Benhalloum, A., Rusakov, A., et al. (2026). Gaia2: Benchmarking LLM + Agents on Dynamic and Asynchronous Environments. ICLR 2026. 
+ https://openreview.net/forum?id=9gw03JpKK4 +""" + +from typing import Any, Dict, List, Optional + +from maseval.interface.environments.are import AREEnvironment +from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + +class Gaia2Environment(AREEnvironment): + """GAIA2 benchmark environment built on AREEnvironment. + + Extends AREEnvironment with GAIA2-specific setup: + - Delegates to ARE's ``preprocess_scenario()`` for oracle run, judge + creation, and turn initialization + - Configures custom judge engine for semantic comparison + - Filters AUI tools (notification-based message delivery) + + Inherits from AREEnvironment: + - Tool wrapping with simulation time tracking + - Notification polling (poll_notifications, get_turn_notifications) + - Lifecycle control (pause, resume_with_offset, cleanup) + - Tracing and configuration gathering + """ + + def __init__( + self, + task_data: Dict[str, Any], + callbacks: Optional[List[Any]] = None, + judge_engine_config: Optional[Any] = None, + ): + """Initialize Gaia2 environment. + + Args: + task_data: Task data containing: + - scenario: ARE BenchmarkScenario object + - capability: Capability type (execution, search, etc.) + - universe_id: Universe identifier + callbacks: Optional callbacks + judge_engine_config: Optional :class:`Gaia2JudgeEngineConfig` controlling + which LLM model and provider the ARE judge uses for semantic comparison. + """ + self._judge_engine_config = judge_engine_config + # Gaia2 always uses notification-based delivery, so filter AUI tools + super().__init__( + task_data, + callbacks=callbacks, + filter_aui_tools=True, + notification_verbosity="medium", + ) + + def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]: + """Initialize ARE scenario using preprocess_scenario(). 
+ + Delegates to ARE's ``preprocess_scenario()`` for faithful preprocessing: + SystemApp insertion, duration setting, initialization, oracle run, + soft reset, judge creation, and turn initialization. + + Args: + task_data: Task data with scenario, capability, universe_id + + Returns: + State dictionary with scenario metadata + """ + try: + from are.simulation.environment import Environment as AREEnv # type: ignore[import-not-found] + from are.simulation.environment import EnvironmentConfig # type: ignore[import-not-found] + from are.simulation.notification_system import VerboseNotificationSystem # type: ignore[import-not-found] + from are.simulation.scenarios.scenario_imported_from_json.utils import ( # type: ignore[import-not-found] + get_scenario_duration, + preprocess_scenario, + ) + from are.simulation.validation import GraphPerEventJudgeConfig # type: ignore[import-not-found] + except ImportError as e: + raise ImportError( + "ARE (Agent Research Environments) is required for Gaia2 benchmark.\n" + "Install with: pip install meta-agents-research-environments\n" + "Or: uv add --optional gaia2 meta-agents-research-environments" + ) from e + + from are.simulation.scenarios.config import ( # type: ignore[import-not-found] + MAX_SCENARIO_DURATION, + MAX_TIME_SCENARIO_DURATION, + ) + + scenario = task_data.get("scenario") + if scenario is None: + raise ValueError("Task data must contain 'scenario' with ARE BenchmarkScenario") + + max_duration = get_scenario_duration(scenario, MAX_TIME_SCENARIO_DURATION, MAX_SCENARIO_DURATION) + + # Build judge config + if self._judge_engine_config is not None: + from are.simulation.agents.are_simulation_agent_config import ( # type: ignore[import-not-found] + LLMEngineConfig, + ) + from are.simulation.validation.configs import create_judge_engine # type: ignore[import-not-found] + + llm_engine_config = LLMEngineConfig( + model_name=self._judge_engine_config.model_name, + provider=self._judge_engine_config.provider, + 
endpoint=self._judge_engine_config.endpoint, + ) + engine = create_judge_engine(llm_engine_config) + judge_config = GraphPerEventJudgeConfig(engine=engine) + else: + judge_config = GraphPerEventJudgeConfig() + + preprocess_scenario( + scenario=scenario, + judge_config=judge_config, + max_scenario_duration=max_duration, + ) + + # Create ARE environment and start simulation + config = EnvironmentConfig( + oracle_mode=False, + duration=scenario.duration, + time_increment_in_seconds=scenario.time_increment_in_seconds, + ) + if scenario.start_time and scenario.start_time > 0: + config.start_time = scenario.start_time + + self._are_env = AREEnv(config, notification_system=VerboseNotificationSystem()) + self._scenario = scenario + self._are_env.run(scenario, wait_for_end=False, schedule_events=True) + + return { + "scenario_id": getattr(scenario, "scenario_id", None), + "duration": scenario.duration, + "capability": task_data.get("capability"), + "universe_id": task_data.get("universe_id"), + "start_time": getattr(scenario, "start_time", None), + } + + def gather_traces(self) -> Dict[str, Any]: + """Collect traces with GAIA2-specific fields.""" + traces = super().gather_traces() + traces["capability"] = self.state.get("capability") + traces["universe_id"] = self.state.get("universe_id") + return traces + + def gather_config(self) -> Dict[str, Any]: + """Gather config with GAIA2-specific fields.""" + config = super().gather_config() + config["capability"] = self.state.get("capability") + config["universe_id"] = self.state.get("universe_id") + return config +``` + +- [ ] **Step 2: Run the existing Gaia2 environment tests (no changes to tests)** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/test_benchmarks/test_gaia2/test_environment.py -v` + +Expected: Most tests PASS. Some may fail due to import path changes (`Gaia2GenericTool` no longer exists as the tool type). 
Note which tests fail — they will be fixed in Task 8. + +- [ ] **Step 3: Commit (WIP — tests may not all pass yet)** + +```bash +cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman +git add maseval/benchmark/gaia2/environment.py +git commit -m "refactor(gaia2): make Gaia2Environment subclass AREEnvironment + +WIP — Gaia2Environment now inherits tool wrapping, notification +polling, lifecycle control, and cleanup from AREEnvironment. +Only setup_state (preprocess_scenario + judge) and GAIA2-specific +gather_traces/gather_config fields remain as overrides." +``` + +### Task 8: Update Gaia2 tests and delete Gaia2GenericTool + +**Files:** +- Delete: `maseval/benchmark/gaia2/tool_wrapper.py` +- Modify: `tests/test_benchmarks/test_gaia2/test_tool_wrapper.py` +- Modify: `tests/test_benchmarks/test_gaia2/test_environment.py` +- Modify: `tests/test_benchmarks/test_gaia2/conftest.py` (if needed) + +- [ ] **Step 1: Check what references Gaia2GenericTool in test files** + +Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && grep -rn "Gaia2GenericTool\|gaia2\.tool_wrapper\|from maseval.benchmark.gaia2.tool_wrapper" tests/` + +This tells you every import and reference that needs updating. + +- [ ] **Step 2: Update test_tool_wrapper.py imports and assertions** + +Replace `from maseval.benchmark.gaia2.tool_wrapper import Gaia2GenericTool` with `from maseval.interface.environments.are_tool_wrapper import AREToolWrapper` throughout. Update class instantiation from `Gaia2GenericTool(mock_are_tool, mock_env)` to `AREToolWrapper(mock_are_tool, mock_env)`. Update assertion class names in type checks. + +The test structure stays the same — these tests now validate AREToolWrapper behavior with the same expectations that were validated against Gaia2GenericTool. + +- [ ] **Step 3: Update test_environment.py imports** + +Replace any `Gaia2GenericTool` references with `AREToolWrapper`. 
The environment tests that check tool types (e.g., `isinstance(tool, Gaia2GenericTool)`) should check `isinstance(tool, AREToolWrapper)`.
+
+- [ ] **Step 4: Update conftest.py if it references Gaia2GenericTool**
+
+Check and update any fixtures that import or reference `Gaia2GenericTool`.
+
+- [ ] **Step 5: Check for other references across the codebase**
+
+Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && grep -rn "Gaia2GenericTool\|gaia2\.tool_wrapper" maseval/ --include="*.py"`
+
+Update any remaining references (benchmark.py, __init__.py, etc.).
+
+- [ ] **Step 6: Delete Gaia2GenericTool**
+
+```bash
+cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman
+rm maseval/benchmark/gaia2/tool_wrapper.py
+```
+
+- [ ] **Step 7: Run all Gaia2 tests**
+
+Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/test_benchmarks/test_gaia2/ -v`
+
+Expected: All PASS. If any fail, fix the specific import or assertion and re-run.
+
+- [ ] **Step 8: Commit**
+
+```bash
+cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman
+git add -A
+git commit -m "refactor(gaia2): delete Gaia2GenericTool, use AREToolWrapper
+
+Gaia2GenericTool is functionally identical to AREToolWrapper now
+that AREToolWrapper has simulation time tracking and AppToolAdapter.
+All Gaia2 tool wrapper tests updated to use AREToolWrapper directly."
+```
+
+### Task 9: Final verification — full test suite
+
+**Files:** None (verification only)
+
+- [ ] **Step 1: Run all ARE and Gaia2 tests together**
+
+Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest tests/interface/environments/ tests/test_benchmarks/test_gaia2/ -v`
+
+Expected: All PASS.
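The core contract these wrapper tests pin down is call-and-record: a successful call stores a "success" invocation and returns the result; a failing call stores an "error" invocation and re-raises. A self-contained sketch of that pattern, using illustrative stand-in classes only (the real `AREToolWrapper` and `ToolInvocationHistory` live in maseval and are not reproduced here):

```python
# Illustrative stand-ins -- NOT the real maseval classes.
class StubHistory:
    """Minimal substitute for ToolInvocationHistory."""

    def __init__(self):
        self.records = []

    def add_invocation(self, inputs, outputs, status):
        self.records.append({"inputs": inputs, "outputs": outputs, "status": status})


class StubWrapper:
    """Mirrors the call-and-record behavior the wrapper tests assert."""

    def __init__(self, are_tool):
        self.are_tool = are_tool
        self.name = are_tool.name
        self.history = StubHistory()

    def __call__(self, **kwargs):
        try:
            result = self.are_tool(**kwargs)
            self.history.add_invocation(kwargs, result, "success")
            return result
        except Exception as exc:
            # Record the failure, then propagate it to the caller
            self.history.add_invocation(kwargs, str(exc), "error")
            raise


class FakeARETool:
    """Mock ARE tool: succeeds with a recipient, fails without one."""

    name = "Email__send"

    def __call__(self, **kwargs):
        if "to" not in kwargs:
            raise ValueError("missing 'to'")
        return "sent"


def test_wrapper_records_success_and_error():
    wrapper = StubWrapper(FakeARETool())
    assert wrapper(to="alice") == "sent"
    try:
        wrapper()
    except ValueError:
        pass
    assert [r["status"] for r in wrapper.history.records] == ["success", "error"]
```

The real tests exercise the same shape against `AREToolWrapper` with a mocked ARE tool and environment.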
+
+- [ ] **Step 2: Run linter on all changed files**
+
+Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run ruff check maseval/interface/environments/ maseval/benchmark/gaia2/`
+
+Expected: No errors.
+
+- [ ] **Step 3: Run full test suite to check for regressions**
+
+Run: `cd /Users/cornelius/Repositories/maseval/.claude/worktrees/objective-raman && uv run pytest --tb=short -q`
+
+Expected: No regressions outside of ARE/Gaia2 tests.
diff --git a/docs/superpowers/specs/2026-03-27-are-environment-design.md b/docs/superpowers/specs/2026-03-27-are-environment-design.md
new file mode 100644
index 0000000..5fd5708
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-27-are-environment-design.md
@@ -0,0 +1,354 @@
+# AREEnvironment Integration Design
+
+**Date:** 2026-03-27
+**Status:** Draft
+**Scope:** Add generic ARE (Meta Agents Research Environments) integration to maseval as a reusable environment building block.
+
+## Context
+
+[ARE](https://github.com/facebookresearch/meta-agents-research-environments) is Meta's platform for evaluating AI agents in dynamic, time-evolving scenarios. It provides apps (Email, Calendar, Messaging, etc.), events, time management, and validation infrastructure.
+
+maseval already has a Gaia2 benchmark that wraps ARE, but that integration is tightly coupled to Gaia2-specific logic (oracle preprocessing, judge config, scenario format). This design adds a generic `AREEnvironment` that any maseval benchmark can use to build interactive ARE-based environments.
+
+## Goals
+
+1. Generic ARE integration usable by any maseval benchmark
+2. Support both loading ARE `Scenario` objects and programmatic composition (apps + events)
+3. Framework-agnostic tool wrapping (layer 1->2 only; agent adapters handle 2->3)
+4. Built-in notification polling for event-driven scenarios
+5. Optional oracle mode with traces for evaluation
+6.
Do not modify existing Gaia2 code + +## Non-Goals + +- Framework-specific tool conversion (smolagents, LangGraph, etc.) -- that's the agent adapter's job +- Replacing or refactoring `Gaia2Environment` +- Building a fluent builder API (can be added later) +- ARE scenario authoring tools + +## Module Structure + +``` +maseval/interface/environments/ + __init__.py + are.py # AREEnvironment class + are_tool_wrapper.py # AREToolWrapper class +``` + +ARE is an optional dependency, imported lazily inside methods (matching Gaia2's pattern). + +## AREEnvironment + +### Class Definition + +```python +class AREEnvironment(Environment): + """Generic maseval Environment wrapping ARE's simulation infrastructure. + + Supports two construction paths via task_data (= task.environment_data): + + 1. Scenario path: task_data = {"scenario": } + 2. Shorthand path: task_data = {"apps": [...], "events": [...], "duration": 1800, ...} + + The shorthand path internally constructs an ARE Scenario from the provided + apps, events, and config, then follows the same initialization as the + scenario path. + + Lifecycle is user-controlled: call start() before run_agents(), stop() + after. pause()/resume_with_offset() control simulation time. + """ + + def __init__( + self, + task_data: Dict[str, Any], + callbacks: Optional[List[EnvironmentCallback]] = None, + run_oracle: bool = False, + notification_verbosity: str = "medium", + ): + """Initialize AREEnvironment. + + Args: + task_data: task.environment_data dict. Must contain either: + - "scenario": ARE Scenario object, OR + - "apps": list of ARE App instances, plus optional "events", + "duration", "seed", "start_time", "time_increment_in_seconds" + callbacks: Optional maseval EnvironmentCallbacks. + run_oracle: If True, run ARE oracle mode during setup to generate + expected event log. Stored in traces for evaluation. + notification_verbosity: ARE notification verbosity level. 
+ "low" = no environment notifications, + "medium" = standard notifications (email, calendar, etc.), + "high" = all notifications. + """ +``` + +### setup_state(task_data) -> Dict[str, Any] + +1. **Detect input mode**: check for `"scenario"` key vs `"apps"` key +2. **Shorthand -> Scenario**: if apps/events provided, construct an ARE `Scenario`: + - Instantiate Scenario with provided apps, events, duration, seed + - Call `scenario.initialize()` to populate app state and event graph +3. **Oracle mode** (if `run_oracle=True`): + - Create ARE Environment in oracle mode + - Run scenario to completion (no agent) + - Capture oracle event log (expected actions) + - Soft-reset scenario for agent run +4. **Create ARE Environment and register apps**: + - Build `EnvironmentConfig` from scenario params + - Create `are.simulation.Environment(config, notification_system=...)` + - Register apps from scenario onto the ARE env (`env.register_apps(scenario.apps)`) + - Store scenario for later use by `start()` (event scheduling happens at start) + - Store as `self._are_env` (but do NOT start the event loop yet -- user calls `start()`) +5. **Return state dict**: + ```python + { + "scenario_id": scenario.scenario_id, + "duration": scenario.duration, + "seed": scenario.seed, + "start_time": scenario.start_time, + "app_names": [app.name for app in scenario.apps], + "oracle_traces": oracle_event_log, # None if oracle not run + } + ``` + +### create_tools() -> Dict[str, AREToolWrapper] + +Iterates all apps in the ARE environment, wraps each app's tools: + +```python +tools = {} +for app in self._are_env.apps.values(): + for are_tool in app.get_tools(): + wrapper = AREToolWrapper(are_tool, self) + tools[are_tool.name] = wrapper +return tools +``` + +No tool filtering by default (unlike Gaia2 which removes AUI tools). Subclasses or config can filter if needed (e.g., `tool_filter` callable parameter). 
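A `tool_filter` of that shape could be a plain predicate over ARE tools. A minimal sketch, assuming `tool_filter` is added as a constructor parameter (a proposed extension, not part of the current API; `wrap` stands in for `AREToolWrapper` construction):

```python
from typing import Any, Callable, Dict, Optional


def create_tools_with_filter(
    are_env: Any,
    wrap: Callable[[Any], Any],
    tool_filter: Optional[Callable[[Any], bool]] = None,
) -> Dict[str, Any]:
    """Wrap every app tool, keeping only those accepted by tool_filter.

    Hypothetical helper illustrating the proposed tool_filter hook.
    """
    tools: Dict[str, Any] = {}
    for app in are_env.apps.values():
        for are_tool in app.get_tools():
            # None means "no filtering" -- keep every tool (the default)
            if tool_filter is not None and not tool_filter(are_tool):
                continue
            tools[are_tool.name] = wrap(are_tool)
    return tools


# Example predicate: drop AgentUserInterface tools, mirroring Gaia2's filtering
def drop_aui(are_tool: Any) -> bool:
    return not are_tool.name.startswith("AgentUserInterface__")
```

With such a hook, a Gaia2-style subclass could pass `tool_filter=drop_aui` instead of overriding `create_tools()` wholesale.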
+ +**Note on shorthand apps:** The `"apps"` list in the shorthand path must contain instantiated ARE App objects (not classes), since apps hold mutable state (inbox contents, calendar entries, etc.) that defines the initial environment. + +### Lifecycle Methods + +```python +def start(self) -> None: + """Start the ARE simulation event loop. + + Call this after environment setup and before running agents. + Runs the scenario with wait_for_end=False so control returns + immediately for agent interaction. + """ + +def stop(self) -> None: + """Stop the ARE simulation event loop.""" + +def pause(self) -> None: + """Pause simulation time progression. + + Call during LLM generation to prevent simulation time from + advancing while the agent is "thinking". + """ + +def resume_with_offset(self, offset: float) -> None: + """Resume simulation with a time offset. + + Args: + offset: Seconds to advance simulation clock before resuming. + """ +``` + +### Notification Polling + +```python +def poll_notifications(self) -> Tuple[List[str], List[str], bool]: + """Drain pending notifications from ARE's notification queue. + + Returns: + Tuple of (user_messages, env_notifications, has_stop_signal). + user_messages: Messages from simulated users. + env_notifications: System events (new email, calendar reminder, etc.). + has_stop_signal: True when simulation has ended. + + Agent adapters should call this between agent steps and inject + the messages into the agent's context. + """ +``` + +### Data Access + +```python +def get_simulation_time(self) -> float: + """Current simulation time in seconds since scenario start.""" + +def get_are_environment(self) -> Any: + """Underlying ARE Environment instance for advanced use.""" + +def get_oracle_traces(self) -> Optional[Dict[str, Any]]: + """Oracle event log if oracle mode was enabled, else None.""" +``` + +### Tracing + +```python +def gather_traces(self) -> Dict[str, Any]: + """Collect traces from environment and all tools. 
+ + Returns dict with: + - Standard TraceableMixin fields (type, gathered_at) + - scenario_id, duration, seed + - app_names + - oracle_traces (if oracle was run) + - final_simulation_time + - tool_count + - tools: {name: tool.gather_traces() for each tool} + """ + +def gather_config(self) -> Dict[str, Any]: + """Environment configuration for reproducibility logging. + + Returns dict with: + - Standard ConfigurableMixin fields (type, gathered_at) + - scenario_id, duration, seed, start_time + - notification_verbosity + - run_oracle + - tool_count, tool_names + """ +``` + +### Cleanup + +```python +def cleanup(self) -> None: + """Stop ARE simulation. Called by maseval after task completes.""" +``` + +## AREToolWrapper + +```python +class AREToolWrapper: + """Wraps an ARE Tool into a maseval-compatible tool with tracing. + + This is the layer 1->2 wrapper: + - Layer 1: ARE Tool (forward(), inputs, output_type) + - Layer 2: maseval generic (callable, ToolInvocationHistory, metadata) + - Layer 3: framework-specific (smolagents Tool, LangGraph tool, etc.) + -- NOT handled here, that's the agent adapter's responsibility. + + Exposes ARE tool metadata (name, description, inputs schema, output_type) + so that agent adapters can construct framework-native tools. + """ + + def __init__(self, are_tool: Any, environment: "AREEnvironment"): + self.are_tool = are_tool + self.environment = environment + self.history = ToolInvocationHistory() + + # Metadata for framework adapters + self.name: str = are_tool.name + self.description: str = are_tool.description + self.inputs: dict = are_tool.inputs + self.output_type: str = are_tool.output_type + + def __call__(self, **kwargs) -> Any: + """Call the ARE tool with tracing. + + Args: + **kwargs: Tool arguments matching the inputs schema. + + Returns: + Tool output (type varies per tool). + + Raises: + Any exception from the underlying ARE tool. 
+ """ + try: + result = self.are_tool(**kwargs) + self.history.add_invocation( + inputs=kwargs, outputs=result, status="success" + ) + return result + except Exception as exc: + self.history.add_invocation( + inputs=kwargs, outputs=str(exc), status="error" + ) + raise + + def gather_traces(self) -> Dict[str, Any]: + return { + "type": "AREToolWrapper", + "name": self.name, + "invocations": self.history.to_list(), + "total_invocations": len(self.history), + } +``` + +## Usage Example: Custom Benchmark + +```python +from maseval import Benchmark, Task, Environment +from maseval.interface.environments.are import AREEnvironment + + +class MyCustomBenchmark(Benchmark): + + def load_tasks(self): + # Custom environment from apps + events + return [ + Task( + query="Schedule a meeting with Alice for tomorrow at 2pm", + environment_data={ + "apps": [EmailClient(), Calendar(), Contacts()], + "events": [ + # Simulated user sends email at t=60s + SendEmailEvent(at=60, from_="alice", subject="Meeting?"), + ], + "duration": 600, + "seed": 42, + }, + evaluation_data={"expected_action": "calendar_create_event"}, + ) + ] + + def setup_environment(self, agent_data, task, seed_generator): + env = AREEnvironment( + task_data=task.environment_data, + run_oracle=True, + ) + return env + + def run_agents(self, agents, task, environment, query): + environment.start() # Start ARE event loop + try: + # Agent adapter calls environment.poll_notifications() + # between steps and injects messages into context + result = agents[0].run(query) + finally: + environment.stop() + return result +``` + +## Usage Example: Loading ARE Scenario + +```python +def setup_environment(self, agent_data, task, seed_generator): + # task.environment_data = {"scenario": } + env = AREEnvironment( + task_data=task.environment_data, + run_oracle=True, + notification_verbosity="medium", + ) + return env +``` + +## Dependencies + +- `meta-agents-research-environments` as optional dependency +- Lazy import pattern 
(matching Gaia2's approach) +- Added to `pyproject.toml` under an `[are]` optional extra + +## Testing Strategy + +- Unit tests for AREToolWrapper (mock ARE Tool, verify tracing) +- Unit tests for AREEnvironment construction (both paths) +- Integration test with a minimal ARE scenario (if ARE is installed) +- Verify gather_traces() and gather_config() output structure diff --git a/maseval/benchmark/gaia2/__init__.py b/maseval/benchmark/gaia2/__init__.py index ea38e7d..d1e1310 100644 --- a/maseval/benchmark/gaia2/__init__.py +++ b/maseval/benchmark/gaia2/__init__.py @@ -71,10 +71,9 @@ def get_model_adapter(self, model_id, **kwargs): ) # Tool wrapper -from maseval.benchmark.gaia2.tool_wrapper import ( - Gaia2GenericTool, - wrap_are_tools, -) +# Backward compatibility: Gaia2GenericTool is now AREToolWrapper +from maseval.interface.environments.are_tool_wrapper import AREToolWrapper as Gaia2GenericTool +from maseval.benchmark.gaia2.tool_wrapper import wrap_are_tools # Data loading and configuration from maseval.benchmark.gaia2.data_loader import ( diff --git a/maseval/benchmark/gaia2/environment.py b/maseval/benchmark/gaia2/environment.py index f7e6b2c..5648288 100644 --- a/maseval/benchmark/gaia2/environment.py +++ b/maseval/benchmark/gaia2/environment.py @@ -11,28 +11,25 @@ https://openreview.net/forum?id=9gw03JpKK4 """ -from typing import Any, Dict, List, Optional, Tuple +from typing import Any, Dict, List, Optional -from maseval import Environment +from maseval.interface.environments.are import AREEnvironment -from maseval.benchmark.gaia2.tool_wrapper import Gaia2GenericTool +class Gaia2Environment(AREEnvironment): + """GAIA2 benchmark environment built on AREEnvironment. -class Gaia2Environment(Environment): - """MASEval Environment wrapping ARE's simulation. 
+ Extends AREEnvironment with GAIA2-specific setup: + - Delegates to ARE's ``preprocess_scenario()`` for oracle run, judge + creation, and turn initialization + - Configures custom judge engine for semantic comparison + - Filters AUI tools (notification-based message delivery) - The ARE simulation runs its own internal event loop. Agent interaction - happens purely through tool calls - including time control via - SystemApp.wait_for_notification(). No special execution loop needed. - - Exposes all ARE app tools (Calendar, Email, Messaging, Contacts, Shopping, - Cab, City, FileSystem, Browser, ChatsApp, SystemApp, Timer) to agents. - - Key Features: - - Wraps ARE's simulation environment - - Provides MASEval-compatible tool wrappers with tracing - - Exposes simulation time for temporal reasoning tasks - - Handles proper cleanup of ARE resources + Inherits from AREEnvironment: + - Tool wrapping with simulation time tracking (AREToolWrapper) + - Notification polling (poll_notifications, get_turn_notifications) + - Lifecycle control (pause, resume_with_offset, cleanup) + - Tracing and configuration gathering """ def __init__( @@ -53,12 +50,14 @@ def __init__( which LLM model and provider the ARE judge uses for semantic comparison. Passed explicitly from ``setup_environment()`` (lives in ``evaluation_data``). """ - self._scenario = environment_data.get("scenario") self._judge_engine_config = judge_engine_config - self._are_env: Any = None - self._tool_wrappers: Dict[str, Gaia2GenericTool] = {} - - super().__init__(environment_data, callbacks) + # Gaia2 always uses notification-based delivery, so filter AUI tools + super().__init__( + environment_data, + callbacks=callbacks, + filter_aui_tools=True, + notification_verbosity="medium", + ) def setup_state(self, environment_data: Dict[str, Any]) -> Dict[str, Any]: """Initialize ARE scenario and start simulation. 
@@ -81,7 +80,7 @@ def setup_state(self, environment_data: Dict[str, Any]) -> Dict[str, Any]: """ # Import ARE modules (optional dependency) try: - from are.simulation.environment import Environment as AREEnvironment # type: ignore[import-not-found] + from are.simulation.environment import Environment as AREEnv # type: ignore[import-not-found] from are.simulation.environment import EnvironmentConfig # type: ignore[import-not-found] from are.simulation.scenarios.scenario_imported_from_json.utils import ( # type: ignore[import-not-found] get_scenario_duration, @@ -158,7 +157,10 @@ def setup_state(self, environment_data: Dict[str, Any]) -> Dict[str, Any]: # to VerbosityLevel.MEDIUM, which includes environment notifications # (email, messaging, shopping, cab, calendar). Without this, the default # is VerbosityLevel.LOW (no environment notifications). - self._are_env = AREEnvironment(config, notification_system=VerboseNotificationSystem()) + self._are_env = AREEnv(config, notification_system=VerboseNotificationSystem()) + + # Store scenario for lifecycle methods and accessors + self._scenario = scenario # Run scenario (registers apps, schedules events, starts event loop) # wait_for_end=False so control returns immediately for agent interaction @@ -172,295 +174,16 @@ def setup_state(self, environment_data: Dict[str, Any]) -> Dict[str, Any]: "start_time": getattr(scenario, "start_time", None), } - # Tools removed by ARE's remove_aui_irrelevant_tools() - # ARE agents/default_agent/are_simulation_main.py:206-228 - # User messages are delivered via the notification system, not via these tools. - _AUI_TOOLS_TO_REMOVE = { - "AgentUserInterface__get_last_message_from_user", - "AgentUserInterface__get_last_message_from_agent", - "AgentUserInterface__get_last_unread_messages", - "AgentUserInterface__get_all_messages", - } - - def create_tools(self) -> Dict[str, Gaia2GenericTool]: - """Wrap ARE app tools for MASEval tracing. 
- - Creates framework-agnostic Gaia2GenericTool instances that provide - clean API with built-in tracing. - - Filters out AgentUserInterface message-retrieval tools that ARE removes - in ``remove_aui_irrelevant_tools()``, and sets ``wait_for_user_response`` - to ``False`` so the AUI does not block waiting for a response when the - agent sends a message. User messages are delivered via the notification - system instead. - - ARE agents/default_agent/are_simulation_main.py:206-228 - - Returns: - Dict mapping tool names to Gaia2GenericTool instances - """ - tools: Dict[str, Gaia2GenericTool] = {} - - if self._are_env is None: - return tools - - # Get all tools from all apps, filtering out AUI message-retrieval tools - # ARE agents/default_agent/are_simulation_main.py:221-227 - for app in self._are_env.apps.values(): - # Set wait_for_user_response=False on AUI so it doesn't block - # ARE agents/default_agent/are_simulation_main.py:216 - if hasattr(app, "wait_for_user_response"): - app.wait_for_user_response = False - - for tool in app.get_tools(): - if tool.name in self._AUI_TOOLS_TO_REMOVE: - continue - wrapper = Gaia2GenericTool(tool, self) - tools[tool.name] = wrapper - self._tool_wrappers[tool.name] = wrapper - - return tools - - def get_simulation_time(self) -> float: - """Get current simulation time in seconds. - - Returns: - Current simulation time in seconds since scenario start - """ - if self._are_env is None: - return 0.0 - - try: - return self._are_env.current_time - except AttributeError: - return 0.0 - - def get_scenario(self) -> Any: - """Get the ARE scenario object. - - Returns: - ARE BenchmarkScenario object - """ - return self._scenario - - def get_are_environment(self) -> Any: - """Get the underlying ARE Environment. - - Used by the evaluator to access completed events and judge. - - Returns: - ARE Environment instance - """ - return self._are_env - - def get_notification_system(self) -> Any: - """Get the ARE notification system. 
- - Used by agents that need to poll for messages between iterations, - matching ARE's pre-step notification polling behavior. - - Returns: - ARE NotificationSystem instance, or None if not available - """ - if self._are_env is None: - return None - return getattr(self._are_env, "notification_system", None) - - def poll_notifications(self) -> Tuple[List[str], List[str], bool]: - """Poll pending notifications from the ARE notification system. - - Drains all pending messages from the notification queue and returns - them as pre-formatted strings. Call this between agent steps to - receive messages that arrived during ``wait_for_notification()`` or - from background simulation events. - - GAIA2 uses an event-driven multi-turn architecture. When the agent - calls ``SystemApp__wait_for_notification``, the ARE environment - processes scheduled events, advances simulation time, and queues - notifications. After the tool returns, call this method to retrieve - those notifications and inject them into the agent's context before - the next LLM call. - - ARE agents/default_agent/steps/are_simulation.py:26-62 - - Returns: - Tuple of ``(user_messages, env_notifications, has_stop_message)``. - ``user_messages`` and ``env_notifications`` contain pre-formatted - strings ready to inject into agent context. ``has_stop_message`` - is True when the environment has signalled the simulation is over. - """ - notification_system = self.get_notification_system() - if notification_system is None: - return [], [], False - - try: - from datetime import datetime, timezone - - from are.simulation.notification_system import MessageType # type: ignore[import-not-found] - - # Use simulation time, not wall-clock time. Notifications are timestamped - # with simulation time (via TimeManager), so querying with wall-clock would - # drain all messages prematurely. Matches ARE agents/default_agent/steps/are_simulation.py:30-32. 
- sim_time = self.get_simulation_time() - timestamp = datetime.fromtimestamp(sim_time, tz=timezone.utc) - unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp) - - if not unhandled: - return [], [], False - - # Separate by message type, matching ARE steps/are_simulation.py:34-61 - user_messages: List[str] = [] - env_notifications: List[str] = [] - has_stop = False - - for notif in unhandled: - msg_type = getattr(notif, "message_type", None) - if msg_type == MessageType.USER_MESSAGE: - user_messages.append(notif.message) - elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION: - ts = notif.timestamp.strftime("%Y-%m-%d %H:%M:%S") if notif.timestamp else "" - env_notifications.append(f"[{ts}] {notif.message}") - elif msg_type == MessageType.ENVIRONMENT_STOP: - has_stop = True - - return user_messages, env_notifications, has_stop - - except Exception: - return [], [], False - - def get_turn_notifications(self) -> Tuple[List[str], bool, bool]: - """Get notifications for turn transitions, re-queuing env notifications. - - Matches ARE's ``get_notifications()`` in ``are_simulation_main.py:331-359``: - drains the notification queue, separates by type, re-queues environment - notifications (so the inner loop's pre-step picks them up), and returns - user messages and status flags. - - Returns: - Tuple of ``(user_messages, has_env_notifications, has_stop)``. - ``user_messages`` are raw message strings for ``[TASK]`` formatting. - ``has_env_notifications`` is True when env notifications were re-queued. - ``has_stop`` is True when the environment signalled stop. 
- """ - notification_system = self.get_notification_system() - if notification_system is None: - return [], False, False - - try: - from datetime import datetime, timezone - - from are.simulation.notification_system import MessageType # type: ignore[import-not-found] - - sim_time = self.get_simulation_time() - timestamp = datetime.fromtimestamp(sim_time, tz=timezone.utc) - unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp) - - if not unhandled: - return [], False, False - - user_messages: List[str] = [] - has_env = False - has_stop = False - - for notif in unhandled: - msg_type = getattr(notif, "message_type", None) - if msg_type == MessageType.USER_MESSAGE: - user_messages.append(notif.message) - elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION: - # Re-queue for inner loop's pre-step to pick up - # ARE are_simulation_main.py:349-352 - notification_system.message_queue.put(notif) - has_env = True - elif msg_type == MessageType.ENVIRONMENT_STOP: - has_stop = True - - return user_messages, has_env, has_stop - - except Exception: - return [], False, False - - def get_start_time(self) -> Optional[float]: - """Get the scenario start time. - - Returns: - Start time as Unix timestamp, or None if not available - """ - return self.state.get("start_time") - - def pause(self) -> None: - """Pause the ARE simulation environment. - - Stops time progression during LLM generation, matching ARE's - simulated generation time behavior. - ARE simulation/environment.py:262-272 - - No-op if environment is not available or not running. - """ - if self._are_env is not None: - try: - self._are_env.pause() - except Exception: - pass - - def resume_with_offset(self, offset: float) -> None: - """Resume the ARE simulation environment with a time offset. - - Advances simulation time by the given offset and resumes the event loop. 
- ARE simulation/environment.py:286-298 - - Args: - offset: Time in seconds to advance the simulation clock - """ - if self._are_env is not None: - try: - self._are_env.resume_with_offset(offset) - except Exception: - pass - - def cleanup(self) -> None: - """Stop ARE simulation when task completes. - - Ensures proper resource cleanup and stops any running simulation. - """ - if self._are_env is not None: - try: - self._are_env.stop() - except Exception: - pass # Ignore cleanup errors - def gather_traces(self) -> Dict[str, Any]: - """Collect traces from environment and all tools. - - Returns: - Trace dictionary with scenario info and all tool traces - """ - tool_traces = {} - for name, wrapper in self._tool_wrappers.items(): - tool_traces[name] = wrapper.gather_traces() - - return { - **super().gather_traces(), - "scenario_id": self.state.get("scenario_id"), - "capability": self.state.get("capability"), - "universe_id": self.state.get("universe_id"), - "final_simulation_time": self.get_simulation_time(), - "tool_count": len(self._tool_wrappers), - "tools": tool_traces, - } + """Collect traces with GAIA2-specific fields.""" + traces = super().gather_traces() + traces["capability"] = self.state.get("capability") + traces["universe_id"] = self.state.get("universe_id") + return traces def gather_config(self) -> Dict[str, Any]: - """Gather environment configuration. 
- - Returns: - Configuration dictionary - """ + """Gather config with GAIA2-specific fields.""" config = super().gather_config() - config.update( - { - "scenario_id": self.state.get("scenario_id"), - "capability": self.state.get("capability"), - "universe_id": self.state.get("universe_id"), - "duration": self.state.get("duration"), - } - ) + config["capability"] = self.state.get("capability") + config["universe_id"] = self.state.get("universe_id") return config diff --git a/maseval/benchmark/gaia2/gaia2.py b/maseval/benchmark/gaia2/gaia2.py index 9c7ecf8..38daabb 100644 --- a/maseval/benchmark/gaia2/gaia2.py +++ b/maseval/benchmark/gaia2/gaia2.py @@ -1378,7 +1378,7 @@ def setup_agents( # type: ignore[override] model = self.get_model_adapter(model_id, register_name="agent_model", seed=agent_seed) agent = DefaultGaia2Agent( - tools=tools, # type: ignore[arg-type] # Gaia2GenericTool has __call__ + tools=tools, # type: ignore[arg-type] # AREToolWrapper has __call__ model=model, environment=environment, llm_args=llm_args, diff --git a/maseval/benchmark/gaia2/tool_wrapper.py b/maseval/benchmark/gaia2/tool_wrapper.py index 9d0755e..b665ddc 100644 --- a/maseval/benchmark/gaia2/tool_wrapper.py +++ b/maseval/benchmark/gaia2/tool_wrapper.py @@ -1,205 +1,27 @@ """Gaia2 Benchmark - Tool Wrapper. -Framework-agnostic wrapper for ARE AppTool instances. -Provides clean API with built-in tracing for MASEval compatibility. +Backward compatibility module. The canonical tool wrapper is now +AREToolWrapper in maseval.interface.environments.are_tool_wrapper. Original Repository: https://github.com/facebookresearch/meta-agents-research-environments Code License: MIT - -Citation: - Froger, R., Benhalloum, A., Rusakov, A., et al. (2026). Gaia2: Benchmarking LLM - Agents on Dynamic and Asynchronous Environments. ICLR 2026. 
- https://openreview.net/forum?id=9gw03JpKK4 """ -from datetime import datetime -from typing import TYPE_CHECKING, Any, Dict, List, Optional +from typing import TYPE_CHECKING, Any, Dict, List -from maseval.core.tracing import TraceableMixin -from maseval.core.config import ConfigurableMixin -from maseval.core.history import ToolInvocationHistory +from maseval.interface.environments.are_tool_wrapper import AREToolWrapper if TYPE_CHECKING: from maseval.benchmark.gaia2.environment import Gaia2Environment - -class Gaia2GenericTool(TraceableMixin, ConfigurableMixin): - """Framework-agnostic wrapper for ARE tools. - - Provides clean API with built-in tracing for MASEval compatibility. - Developers wrap this for their framework using composition. - - Example for smolagents: - - class MySmolagentsTool(smolagents.Tool): - skip_forward_signature_validation = True - - def __init__(self, generic_tool: Gaia2GenericTool): - self.generic_tool = generic_tool - self.name = generic_tool.name - self.description = generic_tool.description - self.inputs = generic_tool.inputs - self.output_type = generic_tool.output_type - super().__init__() - - def forward(self, **kwargs) -> str: - return self.generic_tool(**kwargs) - - def gather_traces(self): - return self.generic_tool.gather_traces() - - This wrapper preserves ARE's native return types while adding - MASEval tracing capabilities and providing a framework-agnostic interface. - """ - - def __init__(self, are_tool: Any, environment: "Gaia2Environment"): - """Initialize the tool wrapper. - - Args: - are_tool: ARE AppTool instance to wrap - environment: The Gaia2Environment this tool belongs to - """ - super().__init__() - self._are_tool = are_tool - self._environment = environment - - # Delegate metadata extraction to ARE's AppToolAdapter (tool_utils.py:544-584). - # This is the source of truth for tool name, description, inputs, and output_type. 
- from are.simulation.tool_utils import AppToolAdapter # type: ignore[import-not-found] - - adapter = AppToolAdapter(are_tool) - self.name: str = adapter.name - self.description: str = adapter.description - self.inputs: Dict[str, Any] = adapter.inputs - self.output_type: str = adapter.output_type - self.actual_return_type: Optional[str] = adapter.actual_return_type - self.input_schema: Dict[str, Any] = self._extract_schema(are_tool) - - # Initialize invocation history - self.history = ToolInvocationHistory() - - @staticmethod - def _extract_schema(are_tool: Any) -> Dict[str, Any]: - """Convert ARE's args list to JSON schema format for tracing/config. - - Args: - are_tool: ARE AppTool instance - - Returns: - JSON schema dictionary with properties and required fields - """ - args = getattr(are_tool, "args", None) - if not args: - return {} - - properties = {} - required = [] - - for arg in args: - param_name = getattr(arg, "name", None) - if not param_name: - continue - - properties[param_name] = { - "type": arg.arg_type, - "description": getattr(arg, "description", ""), - } - - if not arg.has_default: - required.append(param_name) - - return {"properties": properties, "required": required} - - def __call__(self, **kwargs: Any) -> Any: - """Execute tool and record invocation. 
- - Args: - **kwargs: Tool arguments - - Returns: - Tool execution result (preserves ARE's native return type) - """ - start_time = datetime.now() - sim_time_before = self._get_simulation_time() - - # Execute the ARE tool - status = "success" - result = None - error_message = None - - try: - result = self._are_tool(**kwargs) - except Exception as e: - status = "error" - error_message = str(e) - raise - finally: - sim_time_after = self._get_simulation_time() - - # Record invocation with timing metadata (same structure as before) - self.history.add_invocation( - inputs=kwargs, - outputs=result if status == "success" else error_message, - status=status, - timestamp=start_time.isoformat(), - meta={ - "wall_time": start_time.isoformat(), - "simulation_time_before": sim_time_before, - "simulation_time_after": sim_time_after, - "simulation_time_elapsed": sim_time_after - sim_time_before if sim_time_after and sim_time_before else None, - }, - ) - - return result - - def _get_simulation_time(self) -> Optional[float]: - """Get current simulation time from ARE environment. - - Returns: - Simulation time in seconds, or None if not available - """ - try: - return self._environment.get_simulation_time() - except Exception: - return None - - def gather_traces(self) -> Dict[str, Any]: - """Gather execution traces from this tool. - - Returns: - Dictionary containing tool traces with invocation history - """ - return { - **super().gather_traces(), - "name": self.name, - "description": self.description, - "invocations": self.history.to_list(), - "total_invocations": len(self.history), - } - - def gather_config(self) -> Dict[str, Any]: - """Gather configuration from this tool. 
- - Returns: - Dictionary containing tool configuration - """ - return { - **super().gather_config(), - "name": self.name, - "description": self.description, - "input_schema": self.input_schema, - } - - def __repr__(self) -> str: - """String representation of the tool.""" - args = ", ".join(f"{k}: {v['type']}" for k, v in self.inputs.items()) - return f"{self.__class__.__name__}({self.name}({args}) -> {self.output_type})" +# Backward compatibility alias +Gaia2GenericTool = AREToolWrapper def wrap_are_tools( are_tools: List[Any], environment: "Gaia2Environment", -) -> Dict[str, Gaia2GenericTool]: +) -> Dict[str, AREToolWrapper]: """Wrap multiple ARE tools for MASEval. Args: @@ -209,10 +31,8 @@ def wrap_are_tools( Returns: Dictionary mapping tool names to wrapped tools """ - wrapped: Dict[str, Gaia2GenericTool] = {} - + wrapped: Dict[str, AREToolWrapper] = {} for tool in are_tools: - wrapper = Gaia2GenericTool(tool, environment) + wrapper = AREToolWrapper(tool, environment) wrapped[wrapper.name] = wrapper - return wrapped diff --git a/maseval/interface/__init__.py b/maseval/interface/__init__.py index 6034652..ccbec85 100644 --- a/maseval/interface/__init__.py +++ b/maseval/interface/__init__.py @@ -7,6 +7,7 @@ - inference/: Model inference adapters (OpenAI, Google, HuggingFace, etc.) - agents/: Agent framework adapters (smolagents, langgraph, etc.) - logging/: Logging platform adapters (wandb, langfuse, etc.) +- environments/: Environment integrations (ARE, etc.) Canonical rules: - Keep adapters thin: translate between MASEval internal abstractions and the external API. @@ -17,7 +18,7 @@ """ # Import subpackages -from . import inference, agents +from . import inference, agents, environments from . 
import logging as logging_ # Rename to avoid conflict with stdlib -__all__ = ["inference", "agents", "logging_"] +__all__ = ["inference", "agents", "logging_", "environments"] diff --git a/maseval/interface/environments/__init__.py b/maseval/interface/environments/__init__.py new file mode 100644 index 0000000..f9c54e9 --- /dev/null +++ b/maseval/interface/environments/__init__.py @@ -0,0 +1,14 @@ +"""maseval.interface.environments + +Environment integrations for external simulation platforms. +""" + +__all__: list[str] = [] + +try: + from .are import AREEnvironment # noqa: F401 + from .are_tool_wrapper import AREToolWrapper # noqa: F401 + + __all__.extend(["AREEnvironment", "AREToolWrapper"]) +except ImportError: + pass diff --git a/maseval/interface/environments/are.py b/maseval/interface/environments/are.py new file mode 100644 index 0000000..7a91cfe --- /dev/null +++ b/maseval/interface/environments/are.py @@ -0,0 +1,476 @@ +"""AREEnvironment — generic maseval Environment wrapping ARE simulation. + +Provides a reusable building block for interactive agent environments +using Meta's ARE (Agents Research Environments) infrastructure. + +Original Repository: https://github.com/facebookresearch/meta-agents-research-environments +Code License: MIT +""" + +from typing import Any, Dict, List, Optional, Tuple + +from maseval.core.environment import Environment +from maseval.core.callback import EnvironmentCallback +from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + +def _check_are_installed() -> None: + """Check if ARE is installed and raise a helpful error if not.""" + try: + import are # noqa: F401 + except ImportError as e: + raise ImportError( + "ARE (Agent Research Environments) is required for AREEnvironment.\n" + "Install with: pip install maseval[are]\n" + "Or: uv add meta-agents-research-environments" + ) from e + + +def _import_are() -> Any: + """Lazily import and return the ARE simulation module. 
+
+    Returns:
+        The ``are.simulation`` module namespace with ``Environment`` and
+        ``EnvironmentConfig``.
+
+    Raises:
+        ImportError: If ARE is not installed.
+    """
+    _check_are_installed()
+    from types import SimpleNamespace
+    from are.simulation.environment import Environment as AREEnv  # type: ignore[import-not-found]
+    from are.simulation.environment import EnvironmentConfig  # type: ignore[import-not-found]
+
+    return SimpleNamespace(
+        Environment=AREEnv,
+        EnvironmentConfig=EnvironmentConfig,
+    )
+
+
+class AREEnvironment(Environment):
+    """Generic maseval Environment wrapping ARE's simulation infrastructure.
+
+    Supports two construction paths via ``environment_data``:
+
+    1. **Scenario path:** ``environment_data = {"scenario": <ARE Scenario>}``
+    2. **Shorthand path:** ``environment_data = {"apps": [...], "events": [...], "duration": 1800, ...}``
+
+    Lifecycle is user-controlled: call ``start()`` before ``run_agents()``,
+    ``stop()`` after. ``pause()``/``resume_with_offset()`` control simulation time.
+    """
+
+    _AUI_TOOLS_TO_REMOVE = {
+        "AgentUserInterface__get_last_message_from_user",
+        "AgentUserInterface__get_last_message_from_agent",
+        "AgentUserInterface__get_last_unread_messages",
+        "AgentUserInterface__get_all_messages",
+    }
+
+    def __init__(
+        self,
+        environment_data: Dict[str, Any],
+        callbacks: Optional[List[EnvironmentCallback]] = None,
+        run_oracle: bool = False,
+        notification_verbosity: str = "medium",
+        filter_aui_tools: bool = False,
+    ):
+        """Initialize AREEnvironment.
+
+        Args:
+            environment_data: Dict. Must contain either:
+                - ``"scenario"``: ARE Scenario object, OR
+                - ``"apps"``: list of ARE App instances, plus optional ``"events"``,
+                  ``"duration"``, ``"seed"``, ``"start_time"``, ``"time_increment_in_seconds"``
+            callbacks: Optional maseval EnvironmentCallbacks.
+            run_oracle: If True, run ARE oracle mode during setup to generate
+                the expected event log. Stored in traces for evaluation.
+            notification_verbosity: ARE notification verbosity level.
+                ``"low"`` = no environment notifications,
+                ``"medium"`` = standard notifications,
+                ``"high"`` = all notifications.
+            filter_aui_tools: If True, drop the AgentUserInterface
+                message-query tools listed in ``_AUI_TOOLS_TO_REMOVE`` and
+                disable blocking waits for user responses.
+        """
+        self._run_oracle = run_oracle
+        self._notification_verbosity = notification_verbosity
+        self._filter_aui_tools = filter_aui_tools
+        self._are_env: Any = None
+        self._scenario: Any = None
+        self._oracle_traces: Optional[Dict[str, Any]] = None
+        self._tool_wrappers: Dict[str, AREToolWrapper] = {}
+
+        super().__init__(environment_data, callbacks)
+
+    def setup_state(self, environment_data: Dict[str, Any]) -> Dict[str, Any]:
+        """Initialize ARE environment from environment data.
+
+        Args:
+            environment_data: Dict with ``"scenario"`` or ``"apps"`` key.
+
+        Returns:
+            State dict with scenario metadata.
+
+        Raises:
+            ValueError: If environment_data contains neither ``"scenario"`` nor ``"apps"``.
+        """
+        are_mod = _import_are()
+
+        scenario = environment_data.get("scenario")
+
+        if scenario is None and "apps" not in environment_data:
+            raise ValueError("environment_data must contain either 'scenario' (ARE Scenario object) or 'apps' (list of ARE App instances).")
+
+        if scenario is None:
+            scenario = self._build_scenario_from_shorthand(environment_data)
+
+        self._scenario = scenario
+
+        # Run oracle mode if requested
+        if self._run_oracle:
+            self._oracle_traces = self._run_oracle_mode(are_mod, scenario)
+
+        # Create ARE Environment (but don't start the event loop yet)
+        config = are_mod.EnvironmentConfig(
+            oracle_mode=False,
+            duration=scenario.duration,
+            time_increment_in_seconds=getattr(scenario, "time_increment_in_seconds", 1),
+        )
+        if getattr(scenario, "start_time", None) and scenario.start_time > 0:
+            config.start_time = scenario.start_time
+
+        # Create notification system based on verbosity
+        notification_system = self._create_notification_system()
+        self._are_env = are_mod.Environment(config, notification_system=notification_system)
+
+        # Register apps from scenario so tools are available before start()
+
self._are_env.register_apps(scenario.apps) + + return { + "scenario_id": getattr(scenario, "scenario_id", None), + "duration": scenario.duration, + "seed": getattr(scenario, "seed", None), + "start_time": getattr(scenario, "start_time", None), + "app_names": [getattr(app, "name", str(app)) for app in scenario.apps], + "oracle_traces": self._oracle_traces, + } + + def _build_scenario_from_shorthand(self, environment_data: Dict[str, Any]) -> Any: + """Build an ARE Scenario from shorthand environment_data. + + Args: + environment_data: Dict with ``"apps"``, and optional ``"events"``, + ``"duration"``, ``"seed"``, ``"start_time"``, + ``"time_increment_in_seconds"``. + + Returns: + ARE Scenario instance. + """ + from are.simulation.scenarios.scenario import Scenario # type: ignore[import-not-found] + + apps = environment_data["apps"] + events = environment_data.get("events", []) + duration = environment_data.get("duration", 1800) + seed = environment_data.get("seed", 0) + start_time = environment_data.get("start_time", 0) + time_increment = environment_data.get("time_increment_in_seconds", 1) + + scenario = Scenario( + scenario_id=environment_data.get("scenario_id", "custom"), # ty: ignore[unknown-argument] + apps=apps, # ty: ignore[unknown-argument] + events=events, # ty: ignore[unknown-argument] + duration=duration, # ty: ignore[unknown-argument] + seed=seed, # ty: ignore[unknown-argument] + start_time=start_time, # ty: ignore[unknown-argument] + time_increment_in_seconds=time_increment, # ty: ignore[unknown-argument] + ) + scenario.initialize() + return scenario + + def _run_oracle_mode(self, are_mod: Any, scenario: Any) -> Dict[str, Any]: + """Run ARE oracle mode to generate expected event log. + + Args: + are_mod: ARE module namespace. + scenario: ARE Scenario instance. + + Returns: + Dict with oracle event log. + + Raises: + AttributeError: If the ARE Environment API changed and expected + methods (get_apps_state, get_world_logs) are missing. 
+        """
+        oracle_config = are_mod.EnvironmentConfig(
+            oracle_mode=True,
+            duration=scenario.duration,
+            time_increment_in_seconds=getattr(scenario, "time_increment_in_seconds", 1),
+        )
+        oracle_env = are_mod.Environment(oracle_config)
+        oracle_env.run(scenario, wait_for_end=True, schedule_events=True)
+
+        # Capture oracle state
+        oracle_traces = {
+            "apps_state": oracle_env.get_apps_state(),
+            "world_logs": oracle_env.get_world_logs(),
+        }
+
+        # Soft-reset so app state is clean for agent run
+        scenario.soft_reset()
+
+        return oracle_traces
+
+    def _create_notification_system(self) -> Any:
+        """Create ARE notification system based on verbosity setting.
+
+        Returns:
+            ARE NotificationSystem instance, or None if the verbose
+            notification system cannot be imported (ARE then uses its default).
+        """
+        try:
+            from are.simulation.notification_system import (  # type: ignore[import-not-found]
+                VerboseNotificationSystem,
+                VerbosityLevel,
+            )
+
+            level_map = {
+                "low": VerbosityLevel.LOW,
+                "medium": VerbosityLevel.MEDIUM,
+                "high": VerbosityLevel.HIGH,
+            }
+            level = level_map.get(self._notification_verbosity, VerbosityLevel.MEDIUM)
+            return VerboseNotificationSystem(verbosity_level=level)
+        except ImportError:
+            return None
+
+    def create_tools(self) -> Dict[str, AREToolWrapper]:
+        """Wrap all ARE app tools in AREToolWrapper.
+
+        Returns:
+            Dict mapping tool names to AREToolWrapper instances.
+        """
+        tools: Dict[str, AREToolWrapper] = {}
+
+        if self._are_env is None:
+            return tools
+
+        for app in self._are_env.apps.values():
+            if self._filter_aui_tools and hasattr(app, "wait_for_user_response"):
+                app.wait_for_user_response = False
+            for are_tool in app.get_tools():
+                if self._filter_aui_tools and are_tool.name in self._AUI_TOOLS_TO_REMOVE:
+                    continue
+                wrapper = AREToolWrapper(are_tool, self)
+                tools[are_tool.name] = wrapper
+                self._tool_wrappers[are_tool.name] = wrapper
+
+        return tools
+
+    # ── Lifecycle ──────────────────────────────────────────────────────
+
+    def start(self) -> None:
+        """Start the ARE simulation event loop.
+ + Call this after environment setup and before running agents. + Runs the scenario with ``wait_for_end=False`` so control returns + immediately for agent interaction. + """ + if self._are_env is not None and self._scenario is not None: + self._are_env.run(self._scenario, wait_for_end=False, schedule_events=True) + + def stop(self) -> None: + """Stop the ARE simulation event loop.""" + if self._are_env is not None: + self._are_env.stop() + + def pause(self) -> None: + """Pause simulation time progression.""" + if self._are_env is not None: + self._are_env.pause() + + def resume_with_offset(self, offset: float) -> None: + """Resume simulation with a time offset. + + Args: + offset: Seconds to advance simulation clock before resuming. + """ + if self._are_env is not None: + self._are_env.resume_with_offset(offset) + + # ── Notification Polling ────────────────────────────────────────── + + def poll_notifications(self) -> Tuple[List[str], List[str], bool]: + """Drain pending notifications from ARE's notification queue. + + Returns: + Tuple of ``(user_messages, env_notifications, has_stop_signal)``. + ``user_messages``: Messages from simulated users. + ``env_notifications``: System events (new email, calendar reminder, etc.). + ``has_stop_signal``: True when simulation has ended. + + Agent adapters should call this between agent steps and inject + the messages into the agent's context. + + Raises: + Any exception from the underlying ARE notification system is + propagated so the benchmark runner can classify it via + ``fail_on_task_error`` / ``fail_on_setup_error``. 
+ """ + if self._are_env is None: + return [], [], False + + notification_system = getattr(self._are_env, "notification_system", None) + if notification_system is None: + return [], [], False + + from datetime import datetime, timezone + from are.simulation.notification_system import MessageType # type: ignore[import-not-found] + + sim_time = self.get_simulation_time() + timestamp = datetime.fromtimestamp(sim_time, tz=timezone.utc) + unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp) + + if not unhandled: + return [], [], False + + user_messages: List[str] = [] + env_notifications: List[str] = [] + has_stop = False + + for notif in unhandled: + msg_type = getattr(notif, "message_type", None) + if msg_type == MessageType.USER_MESSAGE: + user_messages.append(notif.message) + elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION: + ts = notif.timestamp.strftime("%Y-%m-%d %H:%M:%S") if notif.timestamp else "" + env_notifications.append(f"[{ts}] {notif.message}") + elif msg_type == MessageType.ENVIRONMENT_STOP: + has_stop = True + + return user_messages, env_notifications, has_stop + + def get_turn_notifications(self) -> Tuple[List[str], bool, bool]: + """Drain pending notifications, re-queuing environment notifications. + + Like ``poll_notifications`` but instead of formatting environment + notifications into strings, it re-queues them back onto the message + queue so they remain available for later processing, and returns + boolean flags indicating their presence. + + Returns: + Tuple of ``(user_messages, has_env_notifications, has_stop)``. + ``user_messages``: Messages from simulated users. + ``has_env_notifications``: True if any environment notifications were seen. + ``has_stop``: True when simulation has ended. + + Raises: + Any exception from the underlying ARE notification system is + propagated so the benchmark runner can classify it. 
+ """ + if self._are_env is None: + return [], False, False + + notification_system = getattr(self._are_env, "notification_system", None) + if notification_system is None: + return [], False, False + + from datetime import datetime, timezone + from are.simulation.notification_system import MessageType # type: ignore[import-not-found] + + sim_time = self.get_simulation_time() + timestamp = datetime.fromtimestamp(sim_time, tz=timezone.utc) + unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp) + + if not unhandled: + return [], False, False + + user_messages: List[str] = [] + has_env = False + has_stop = False + + for notif in unhandled: + msg_type = getattr(notif, "message_type", None) + if msg_type == MessageType.USER_MESSAGE: + user_messages.append(notif.message) + elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION: + has_env = True + notification_system.message_queue.put(notif) + elif msg_type == MessageType.ENVIRONMENT_STOP: + has_stop = True + + return user_messages, has_env, has_stop + + # ── Data Access ─────────────────────────────────────────────────── + + def get_simulation_time(self) -> float: + """Get current simulation time in seconds since scenario start.""" + if self._are_env is None: + return 0.0 + try: + return self._are_env.current_time + except AttributeError: + return 0.0 + + def get_are_environment(self) -> Any: + """Get the underlying ARE Environment instance.""" + return self._are_env + + def get_scenario(self) -> Any: + """Get the ARE scenario object.""" + return self._scenario + + def get_start_time(self) -> Optional[float]: + """Get the scenario start time.""" + return self.state.get("start_time") + + def get_notification_system(self) -> Any: + """Get the ARE notification system.""" + if self._are_env is None: + return None + return getattr(self._are_env, "notification_system", None) + + def get_oracle_traces(self) -> Optional[Dict[str, Any]]: + """Get oracle event log if oracle mode was enabled. 
+ + Returns: + Oracle traces dict, or None if oracle was not run. + """ + return self._oracle_traces + + # ── Cleanup ─────────────────────────────────────────────────────── + + def cleanup(self) -> None: + """Stop ARE simulation. Called by maseval after task completes.""" + if self._are_env is not None: + try: + self._are_env.stop() + except Exception: + pass + + # ── Tracing & Config ────────────────────────────────────────────── + + def gather_traces(self) -> Dict[str, Any]: + """Collect traces from environment and all tools.""" + tool_traces = {} + for name, wrapper in self._tool_wrappers.items(): + tool_traces[name] = wrapper.gather_traces() + + return { + **super().gather_traces(), + "scenario_id": self.state.get("scenario_id"), + "duration": self.state.get("duration"), + "seed": self.state.get("seed"), + "app_names": self.state.get("app_names", []), + "oracle_traces": self._oracle_traces, + "final_simulation_time": self.get_simulation_time(), + "tool_count": len(self._tool_wrappers), + "tools": tool_traces, + } + + def gather_config(self) -> Dict[str, Any]: + """Gather environment configuration for reproducibility.""" + return { + **super().gather_config(), + "scenario_id": self.state.get("scenario_id"), + "duration": self.state.get("duration"), + "seed": self.state.get("seed"), + "start_time": self.state.get("start_time"), + "notification_verbosity": self._notification_verbosity, + "run_oracle": self._run_oracle, + } diff --git a/maseval/interface/environments/are_tool_wrapper.py b/maseval/interface/environments/are_tool_wrapper.py new file mode 100644 index 0000000..0089d39 --- /dev/null +++ b/maseval/interface/environments/are_tool_wrapper.py @@ -0,0 +1,195 @@ +"""ARE Tool Wrapper for MASEval. + +Framework-agnostic wrapper for ARE Tool instances. Provides a callable +interface with ToolInvocationHistory tracing and metadata exposure for +framework adapters (smolagents, LangGraph, etc.) to build framework-native +tools from. 
+ +This is the layer 1->2 wrapper: +- Layer 1: ARE Tool (forward(), inputs, output_type) +- Layer 2: maseval generic (callable, ToolInvocationHistory, metadata) +- Layer 3: framework-specific -- NOT handled here. +""" + +from datetime import datetime +from typing import TYPE_CHECKING, Any, Dict, Optional + +from maseval.core.tracing import TraceableMixin +from maseval.core.config import ConfigurableMixin +from maseval.core.history import ToolInvocationHistory + +try: + from are.simulation.tool_utils import AppToolAdapter # type: ignore[import-not-found] +except ImportError: + AppToolAdapter = None # type: ignore[assignment,misc] + +if TYPE_CHECKING: + from maseval.interface.environments.are import AREEnvironment + + +class AREToolWrapper(TraceableMixin, ConfigurableMixin): + """Framework-agnostic wrapper for ARE tools with maseval tracing. + + Wraps an ARE Tool and exposes its metadata (name, description, inputs, + output_type) so that agent adapters can construct framework-native tools. + + Example for smolagents:: + + class MySmolagentsTool(smolagents.Tool): + skip_forward_signature_validation = True + + def __init__(self, wrapper: AREToolWrapper): + self.wrapper = wrapper + self.name = wrapper.name + self.description = wrapper.description + self.inputs = wrapper.inputs + self.output_type = wrapper.output_type + super().__init__() + + def forward(self, **kwargs) -> str: + return self.wrapper(**kwargs) + """ + + def __init__(self, are_tool: Any, environment: "AREEnvironment"): + """Initialize the tool wrapper. + + Args: + are_tool: ARE Tool instance to wrap. + environment: The AREEnvironment this tool belongs to. 
+ """ + super().__init__() + self._are_tool = are_tool + self._environment = environment + self.history = ToolInvocationHistory() + + # Expose ARE tool metadata via AppToolAdapter (canonical source of truth) + if AppToolAdapter is None: + raise ImportError("ARE (Agent Research Environments) is required for AREToolWrapper.\nInstall with: pip install maseval[are]") + adapter = AppToolAdapter(are_tool) + self.name: str = adapter.name + self.description: str = adapter.description + self.inputs: Dict[str, Any] = adapter.inputs + self.output_type: str = adapter.output_type + self.actual_return_type: Optional[str] = adapter.actual_return_type + + # Extract JSON schema from ARE tool args (if available) + self.input_schema: Dict[str, Any] = self._extract_schema(are_tool) + + @staticmethod + def _extract_schema(are_tool: Any) -> Dict[str, Any]: + """Convert ARE's args list to JSON schema format. + + Args: + are_tool: ARE Tool instance. + + Returns: + JSON schema dict with properties and required fields. + """ + args = getattr(are_tool, "args", None) + if not args: + return {} + + properties = {} + required = [] + + for arg in args: + param_name = getattr(arg, "name", None) + if not param_name: + continue + properties[param_name] = { + "type": arg.arg_type, + "description": getattr(arg, "description", ""), + } + if not arg.has_default: + required.append(param_name) + + return {"properties": properties, "required": required} + + def _get_simulation_time(self) -> Optional[float]: + """Get the current simulation time from the ARE environment. + + Returns: + The current simulation time, or None if unavailable. + """ + try: + return self._environment.get_simulation_time() + except Exception: + return None + + def __call__(self, **kwargs: Any) -> Any: + """Execute the ARE tool with tracing. + + Args: + **kwargs: Tool arguments matching the inputs schema. + + Returns: + Tool output (type varies per tool). + + Raises: + Any exception from the underlying ARE tool is re-raised. 
+ """ + start_time = datetime.now() + sim_time_before = self._get_simulation_time() + status = "success" + result = None + error_message = None + + try: + result = self._are_tool(**kwargs) + except Exception as e: + status = "error" + error_message = str(e) + raise + finally: + sim_time_after = self._get_simulation_time() + if sim_time_before is not None and sim_time_after is not None: + sim_time_elapsed = sim_time_after - sim_time_before + else: + sim_time_elapsed = None + + meta = { + "wall_time": start_time.isoformat(), + "simulation_time_before": sim_time_before, + "simulation_time_after": sim_time_after, + "simulation_time_elapsed": sim_time_elapsed, + } + + self.history.add_invocation( + inputs=kwargs, + outputs=result if status == "success" else error_message, + status=status, + timestamp=start_time.isoformat(), + meta=meta, + ) + + return result + + def gather_traces(self) -> Dict[str, Any]: + """Gather execution traces from this tool. + + Returns: + Dictionary with tool name, invocation history, and counts. + """ + return { + **super().gather_traces(), + "name": self.name, + "invocations": self.history.to_list(), + "total_invocations": len(self.history), + } + + def gather_config(self) -> Dict[str, Any]: + """Gather configuration from this tool. + + Returns: + Dictionary with tool name, description, and schema. 
+ """ + return { + **super().gather_config(), + "name": self.name, + "description": self.description, + "input_schema": self.input_schema, + } + + def __repr__(self) -> str: + args = ", ".join(f"{k}: {v.get('type', '?')}" for k, v in self.inputs.items()) + return f"{self.__class__.__name__}({self.name}({args}) -> {self.output_type})" diff --git a/mkdocs.yml b/mkdocs.yml index 6a6edf9..bbcd264 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -130,6 +130,8 @@ nav: - HuggingFace: interface/inference/huggingface.md - LiteLLM: interface/inference/litellm.md - OpenAI: interface/inference/openai.md + - Environments: + - ARE: interface/environments/are.md - Benchmarks: - ConVerse: benchmark/converse.md - GAIA2: benchmark/gaia2.md diff --git a/pyproject.toml b/pyproject.toml index 6d58e8d..6475808 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -48,6 +48,7 @@ wandb = ["wandb>=0.15.0"] langfuse = ["langfuse>=3.3.4"] # Benchmarks +are = ["meta-agents-research-environments>=1.2.0"] gaia2 = ["meta-agents-research-environments>=1.2.0", "datasets>=3.0.0"] macs = [] multiagentbench = [ @@ -220,6 +221,7 @@ markers = [ "smolagents: Tests that specifically require smolagents", "langgraph: Tests that specifically require langgraph", "llamaindex: Tests that specifically require llama-index-core", + "are: Tests that specifically require ARE (Agent Research Environments)", "gaia2: Tests that specifically require ARE (Agent Research Environments)", "camel: Tests that specifically require camel-ai", "mmlu: Tests that specifically require MMLU benchmark (HuggingFace + DISCO)", diff --git a/tests/interface/__init__.py b/tests/interface/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/interface/environments/__init__.py b/tests/interface/environments/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/interface/environments/test_are_environment.py b/tests/interface/environments/test_are_environment.py new file mode 100644 index 0000000..558613e 
--- /dev/null +++ b/tests/interface/environments/test_are_environment.py @@ -0,0 +1,665 @@ +"""Tests for AREEnvironment.""" + +from unittest.mock import MagicMock, patch +import pytest + +from maseval.interface.environments.are import AREEnvironment + + +@pytest.fixture(autouse=True) +def mock_app_tool_adapter(): + """Mock AppToolAdapter so AREToolWrapper can initialize without ARE installed.""" + + def make_adapter(are_tool): + adapter = MagicMock() + adapter.name = are_tool.name + adapter.description = are_tool.description + adapter.inputs = are_tool.inputs + adapter.output_type = are_tool.output_type + adapter.actual_return_type = None + return adapter + + with patch("maseval.interface.environments.are_tool_wrapper.AppToolAdapter", side_effect=make_adapter): + yield + + +def _make_mock_scenario(scenario_id="test-001", duration=600, seed=42, start_time=0): + """Create a mock ARE Scenario.""" + scenario = MagicMock() + scenario.scenario_id = scenario_id + scenario.duration = duration + scenario.seed = seed + scenario.start_time = start_time + scenario.time_increment_in_seconds = 1 + scenario.apps = [MagicMock(name="EmailClient"), MagicMock(name="Calendar")] + return scenario + + +def _make_mock_are_env(apps=None): + """Create a mock ARE Environment.""" + env = MagicMock() + if apps is None: + # Create mock apps with mock tools + email_app = MagicMock() + email_tool = MagicMock() + email_tool.name = "EmailClient__send_email" + email_tool.description = "Send an email" + email_tool.inputs = {"to": {"type": "string", "description": "Recipient"}} + email_tool.output_type = "string" + email_app.get_tools.return_value = [email_tool] + email_app.name = "EmailClient" + + calendar_app = MagicMock() + cal_tool = MagicMock() + cal_tool.name = "Calendar__create_event" + cal_tool.description = "Create event" + cal_tool.inputs = {"title": {"type": "string", "description": "Title"}} + cal_tool.output_type = "string" + calendar_app.get_tools.return_value = [cal_tool] + 
calendar_app.name = "Calendar" + + env.apps = {"EmailClient": email_app, "Calendar": calendar_app} + else: + env.apps = apps + env.current_time = 0.0 + return env + + +class TestAREEnvironmentScenarioPath: + """Tests for AREEnvironment with Scenario objects.""" + + @patch("maseval.interface.environments.are._import_are") + def test_setup_state_with_scenario(self, mock_import): + """setup_state initialises from an ARE Scenario and returns state dict.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + environment_data = {"scenario": scenario} + + env = AREEnvironment(environment_data) + + assert env.state["scenario_id"] == "test-001" + assert env.state["duration"] == 600 + assert env.state["seed"] == 42 + assert env._are_env is mock_are_env + + @patch("maseval.interface.environments.are._import_are") + def test_setup_state_requires_scenario_or_apps(self, mock_import): + """setup_state raises ValueError if neither scenario nor apps provided.""" + mock_import.return_value = MagicMock() + + with pytest.raises(ValueError, match="must contain either"): + AREEnvironment(environment_data={}) + + @patch("maseval.interface.environments.are._import_are") + def test_create_tools_wraps_are_tools(self, mock_import): + """create_tools wraps all ARE app tools in AREToolWrapper.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + + tools = env.get_tools() + assert "EmailClient__send_email" in tools + assert "Calendar__create_event" in tools + assert len(tools) == 2 + + # Check wrapper metadata + email_tool = tools["EmailClient__send_email"] + assert email_tool.name == "EmailClient__send_email" + assert 
email_tool.description == "Send an email" + + @patch("maseval.interface.environments.are._import_are") + def test_start_runs_scenario(self, mock_import): + """start() calls are_env.run() with the scenario.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + env.start() + + mock_are_env.run.assert_called_once() + call_kwargs = mock_are_env.run.call_args + assert call_kwargs[1].get("wait_for_end") is False + + @patch("maseval.interface.environments.are._import_are") + def test_stop_stops_env(self, mock_import): + """stop() calls are_env.stop().""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + env.stop() + + mock_are_env.stop.assert_called_once() + + @patch("maseval.interface.environments.are._import_are") + def test_pause_and_resume(self, mock_import): + """pause() and resume_with_offset() delegate to ARE env.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + env.pause() + mock_are_env.pause.assert_called_once() + + env.resume_with_offset(5.0) + mock_are_env.resume_with_offset.assert_called_once_with(5.0) + + @patch("maseval.interface.environments.are._import_are") + def test_get_simulation_time(self, mock_import): + """get_simulation_time() returns ARE env's current_time.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_env.current_time = 
42.5 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + + assert env.get_simulation_time() == 42.5 + + @patch("maseval.interface.environments.are._import_are") + def test_cleanup_stops_env(self, mock_import): + """cleanup() stops the ARE environment.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + env.cleanup() + + mock_are_env.stop.assert_called_once() + + @patch("maseval.interface.environments.are._import_are") + def test_gather_traces(self, mock_import): + """gather_traces returns structured trace data.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_env.current_time = 100.0 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + + traces = env.gather_traces() + assert traces["scenario_id"] == "test-001" + assert traces["tool_count"] == 2 + assert "tools" in traces + assert traces["final_simulation_time"] == 100.0 + + @patch("maseval.interface.environments.are._import_are") + def test_gather_config(self, mock_import): + """gather_config returns environment configuration.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + + config = env.gather_config() + assert config["scenario_id"] == "test-001" + assert config["duration"] == 600 + assert config["notification_verbosity"] == "medium" + assert "tool_names" in config + + +class 
TestAREEnvironmentShorthandPath: + """Tests for AREEnvironment with apps+events shorthand.""" + + @patch("maseval.interface.environments.are._import_are") + @patch("maseval.interface.environments.are.AREEnvironment._build_scenario_from_shorthand") + def test_shorthand_builds_scenario(self, mock_build, mock_import): + """Shorthand environment_data with 'apps' key triggers scenario construction.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_scenario = _make_mock_scenario() + mock_build.return_value = mock_scenario + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + environment_data = { + "apps": [MagicMock(), MagicMock()], + "events": [MagicMock()], + "duration": 300, + "seed": 99, + } + env = AREEnvironment(environment_data=environment_data) + + mock_build.assert_called_once_with(environment_data) + assert env._scenario is mock_scenario + + @patch("maseval.interface.environments.are._import_are") + def test_shorthand_passes_config_to_scenario(self, mock_import): + """Shorthand config values are passed through to Scenario construction.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + # Patch Scenario import inside _build_scenario_from_shorthand + mock_scenario_cls = MagicMock() + mock_scenario_instance = _make_mock_scenario() + mock_scenario_cls.return_value = mock_scenario_instance + + with patch.dict( + "sys.modules", + { + "are": MagicMock(), + "are.simulation": MagicMock(), + "are.simulation.scenarios": MagicMock(), + "are.simulation.scenarios.scenario": MagicMock(Scenario=mock_scenario_cls), + }, + ): + apps = [MagicMock(), MagicMock()] + environment_data = { + "apps": apps, + "duration": 300, + "seed": 99, + "start_time": 100, + "time_increment_in_seconds": 5, + } + AREEnvironment(environment_data=environment_data) + + mock_scenario_cls.assert_called_once() 
+ call_kwargs = mock_scenario_cls.call_args[1] + assert call_kwargs["duration"] == 300 + assert call_kwargs["seed"] == 99 + assert call_kwargs["start_time"] == 100 + assert call_kwargs["time_increment_in_seconds"] == 5 + mock_scenario_instance.initialize.assert_called_once() + + +class TestAREEnvironmentOracleMode: + """Tests for AREEnvironment oracle mode.""" + + @patch("maseval.interface.environments.are._import_are") + def test_oracle_mode_captures_traces(self, mock_import): + """Oracle mode runs scenario and captures apps_state and world_logs.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_oracle_env = MagicMock() + mock_oracle_env.get_apps_state.return_value = {"email": {"inbox": []}} + mock_oracle_env.get_world_logs.return_value = [{"event": "email_sent"}] + + mock_agent_env = _make_mock_are_env() + + # First Environment() call = oracle env, second = agent env + mock_are_mod.Environment.side_effect = [mock_oracle_env, mock_agent_env] + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}, run_oracle=True) + + assert env._oracle_traces is not None + assert env._oracle_traces["apps_state"] == {"email": {"inbox": []}} + assert env._oracle_traces["world_logs"] == [{"event": "email_sent"}] + mock_oracle_env.get_apps_state.assert_called_once() + mock_oracle_env.get_world_logs.assert_called_once() + scenario.soft_reset.assert_called_once() + + @patch("maseval.interface.environments.are._import_are") + def test_oracle_mode_crashes_if_methods_missing(self, mock_import): + """Oracle mode raises AttributeError if ARE env lacks expected methods.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + # Create oracle env with spec=[] so it has NO attributes except run + mock_oracle_env = MagicMock(spec=[]) + mock_oracle_env.run = MagicMock() + + mock_agent_env = _make_mock_are_env() + mock_are_mod.Environment.side_effect = [mock_oracle_env, mock_agent_env] + + scenario = 
_make_mock_scenario() + with pytest.raises(AttributeError): + AREEnvironment(environment_data={"scenario": scenario}, run_oracle=True) + + +class TestAREToolWrapper: + """Tests for AREToolWrapper simulation time tracking.""" + + def _make_wrapper(self, mock_env): + """Create an AREToolWrapper with a mock tool and environment.""" + from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + mock_tool = MagicMock() + mock_tool.name = "TestTool__do_thing" + mock_tool.description = "Does a thing" + mock_tool.inputs = {"x": {"type": "string", "description": "input"}} + mock_tool.output_type = "string" + mock_tool.args = [] + mock_tool.return_value = "result" + + wrapper = AREToolWrapper(mock_tool, mock_env) + return wrapper + + def test_invocation_records_simulation_time(self): + """Wrapper records simulation time before/after tool call in meta.""" + mock_env = MagicMock() + mock_env.get_simulation_time = MagicMock(side_effect=[100.0, 105.0]) + + wrapper = self._make_wrapper(mock_env) + wrapper(x="hello") + + assert len(wrapper.history.logs) == 1 + meta = wrapper.history.logs[0]["meta"] + assert meta["simulation_time_before"] == 100.0 + assert meta["simulation_time_after"] == 105.0 + assert meta["simulation_time_elapsed"] == 5.0 + assert "wall_time" in meta + + def test_invocation_records_none_when_sim_time_unavailable(self): + """Wrapper records None values when get_simulation_time raises.""" + mock_env = MagicMock() + mock_env.get_simulation_time = MagicMock(side_effect=AttributeError) + + wrapper = self._make_wrapper(mock_env) + wrapper(x="hello") + + assert len(wrapper.history.logs) == 1 + meta = wrapper.history.logs[0]["meta"] + assert meta["simulation_time_before"] is None + assert meta["simulation_time_after"] is None + assert meta["simulation_time_elapsed"] is None + + def test_schema_extraction_crashes_on_missing_arg_type(self): + """_extract_schema raises AttributeError if arg lacks arg_type.""" + from 
maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + mock_tool = MagicMock(spec=[]) # empty spec — no attributes + mock_tool.name = "param1" + # Deliberately no arg_type or has_default + mock_are_tool = MagicMock() + mock_are_tool.args = [mock_tool] + + with pytest.raises(AttributeError): + AREToolWrapper._extract_schema(mock_are_tool) + + @patch("maseval.interface.environments.are_tool_wrapper.AppToolAdapter") + def test_uses_app_tool_adapter_for_metadata(self, mock_adapter_cls): + """AREToolWrapper delegates metadata extraction to AppToolAdapter.""" + mock_adapter = MagicMock() + mock_adapter.name = "Calendar__create_event" + mock_adapter.description = "Create a calendar event" + mock_adapter.inputs = {"title": {"type": "string"}} + mock_adapter.output_type = "string" + mock_adapter.actual_return_type = "str" + mock_adapter_cls.return_value = mock_adapter + + mock_tool = MagicMock() + mock_tool.args = [] + mock_env = MagicMock() + + from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + wrapper = AREToolWrapper(mock_tool, mock_env) + + mock_adapter_cls.assert_called_once_with(mock_tool) + assert wrapper.name == "Calendar__create_event" + assert wrapper.description == "Create a calendar event" + assert wrapper.inputs == {"title": {"type": "string"}} + assert wrapper.output_type == "string" + assert wrapper.actual_return_type == "str" + + +class TestAREEnvironmentNotifications: + """Tests for notification polling.""" + + @patch("maseval.interface.environments.are._import_are") + def test_poll_notifications_propagates_errors(self, mock_import): + """poll_notifications does not swallow unexpected exceptions.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + # Set up notification system that raises on access + mock_notif_sys = MagicMock() + mock_notif_sys.message_queue.get_by_timestamp.side_effect = 
RuntimeError("corrupt queue") + mock_are_env.notification_system = mock_notif_sys + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + + with patch.dict("sys.modules", {"are.simulation.notification_system": MagicMock()}): + with pytest.raises(RuntimeError, match="corrupt queue"): + env.poll_notifications() + + @patch("maseval.interface.environments.are._import_are") + def test_poll_notifications_returns_empty_when_no_env(self, mock_import): + """poll_notifications returns empty tuple when no ARE env.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + env._are_env = None + + result = env.poll_notifications() + assert result == ([], [], False) + + +class TestAREEnvironmentAUIFiltering: + """Tests for AUI tool filtering.""" + + @patch("maseval.interface.environments.are._import_are") + def test_aui_tools_filtered_when_enabled(self, mock_import): + """AUI message-retrieval tools are excluded when filter_aui_tools=True.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + aui_app = MagicMock() + aui_app.name = "AgentUserInterface" + aui_tool_get = MagicMock() + aui_tool_get.name = "AgentUserInterface__get_last_message_from_user" + aui_tool_get.description = "Get message" + aui_tool_get.inputs = {} + aui_tool_get.output_type = "string" + aui_tool_get.args = [] + aui_tool_send = MagicMock() + aui_tool_send.name = "AgentUserInterface__send_message_to_user" + aui_tool_send.description = "Send message" + aui_tool_send.inputs = {} + aui_tool_send.output_type = "string" + aui_tool_send.args = [] + aui_app.get_tools.return_value = [aui_tool_get, aui_tool_send] + + email_app = MagicMock() + email_app.name = "EmailClient" + email_tool = MagicMock() + email_tool.name = 
"EmailClient__send_email" + email_tool.description = "Send email" + email_tool.inputs = {"to": {"type": "string"}} + email_tool.output_type = "string" + email_tool.args = [] + email_app.get_tools.return_value = [email_tool] + + mock_are_env = MagicMock() + mock_are_env.apps = {"AgentUserInterface": aui_app, "EmailClient": email_app} + mock_are_env.current_time = 0.0 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}, filter_aui_tools=True) + + tools = env.get_tools() + assert "AgentUserInterface__get_last_message_from_user" not in tools + assert "AgentUserInterface__send_message_to_user" in tools + assert "EmailClient__send_email" in tools + + @patch("maseval.interface.environments.are._import_are") + def test_aui_tools_not_filtered_by_default(self, mock_import): + """AUI tools are included by default.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + aui_app = MagicMock() + aui_app.name = "AgentUserInterface" + aui_tool = MagicMock() + aui_tool.name = "AgentUserInterface__get_last_message_from_user" + aui_tool.description = "Get message" + aui_tool.inputs = {} + aui_tool.output_type = "string" + aui_tool.args = [] + aui_app.get_tools.return_value = [aui_tool] + + mock_are_env = MagicMock() + mock_are_env.apps = {"AgentUserInterface": aui_app} + mock_are_env.current_time = 0.0 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + + tools = env.get_tools() + assert "AgentUserInterface__get_last_message_from_user" in tools + + @patch("maseval.interface.environments.are._import_are") + def test_wait_for_user_response_set_false(self, mock_import): + """filter_aui_tools=True sets wait_for_user_response=False on AUI app.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + + aui_app = MagicMock() + aui_app.name = 
"AgentUserInterface" + aui_app.wait_for_user_response = True + aui_app.get_tools.return_value = [] + + mock_are_env = MagicMock() + mock_are_env.apps = {"AgentUserInterface": aui_app} + mock_are_env.current_time = 0.0 + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + AREEnvironment(environment_data={"scenario": scenario}, filter_aui_tools=True) + + assert aui_app.wait_for_user_response is False + + +class TestAREEnvironmentTurnNotifications: + """Tests for get_turn_notifications.""" + + @patch("maseval.interface.environments.are._import_are") + def test_requeues_env_notifications(self, mock_import): + """get_turn_notifications re-queues env notifications.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + mock_message_type = MagicMock() + mock_message_type.USER_MESSAGE = "user" + mock_message_type.ENVIRONMENT_NOTIFICATION = "env" + mock_message_type.ENVIRONMENT_STOP = "stop" + + user_notif = MagicMock() + user_notif.message_type = mock_message_type.USER_MESSAGE + user_notif.message = "Hello agent" + + env_notif = MagicMock() + env_notif.message_type = mock_message_type.ENVIRONMENT_NOTIFICATION + env_notif.message = "New email arrived" + + mock_notif_sys = MagicMock() + mock_notif_sys.message_queue.get_by_timestamp.return_value = [user_notif, env_notif] + mock_are_env.notification_system = mock_notif_sys + + scenario = _make_mock_scenario() + + with patch.dict("sys.modules", {"are.simulation.notification_system": MagicMock(MessageType=mock_message_type)}): + env = AREEnvironment(environment_data={"scenario": scenario}) + user_msgs, has_env, has_stop = env.get_turn_notifications() + + assert user_msgs == ["Hello agent"] + assert has_env is True + assert has_stop is False + mock_notif_sys.message_queue.put.assert_called_once_with(env_notif) + + @patch("maseval.interface.environments.are._import_are") + def 
test_returns_empty_when_no_env(self, mock_import): + """Returns empty results when no ARE environment.""" + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + env._are_env = None + + result = env.get_turn_notifications() + assert result == ([], False, False) + + +class TestAREEnvironmentAccessors: + """Tests for convenience accessors.""" + + @patch("maseval.interface.environments.are._import_are") + def test_get_scenario(self, mock_import): + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + assert env.get_scenario() is scenario + + @patch("maseval.interface.environments.are._import_are") + def test_get_start_time(self, mock_import): + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + mock_are_env = _make_mock_are_env() + mock_are_mod.Environment.return_value = mock_are_env + scenario = _make_mock_scenario(start_time=1000) + env = AREEnvironment(environment_data={"scenario": scenario}) + assert env.get_start_time() == 1000 + + @patch("maseval.interface.environments.are._import_are") + def test_get_notification_system(self, mock_import): + mock_are_mod = MagicMock() + mock_import.return_value = mock_are_mod + mock_are_env = _make_mock_are_env() + mock_notif_sys = MagicMock() + mock_are_env.notification_system = mock_notif_sys + mock_are_mod.Environment.return_value = mock_are_env + scenario = _make_mock_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + assert env.get_notification_system() is mock_notif_sys diff --git a/tests/interface/environments/test_are_integration.py 
b/tests/interface/environments/test_are_integration.py new file mode 100644 index 0000000..a5ea3dc --- /dev/null +++ b/tests/interface/environments/test_are_integration.py @@ -0,0 +1,456 @@ +"""Integration tests for AREEnvironment with real ARE simulation stack. + +These tests exercise AREEnvironment against real ARE apps and scenarios — +no mocks. They validate that the maseval wrapper correctly integrates with +ARE's simulation infrastructure: tool wrapping, lifecycle, tracing, oracle +mode, and the shorthand construction path. + +Marked ``interface`` + ``are``. Runs in the default test suite (no network +or API keys needed — ARE is a local dependency). +""" + +import pytest + +try: + from are.simulation.apps import CalendarApp, ContactsApp, SystemApp + from are.simulation.scenarios.scenario import Scenario + + HAS_ARE = True +except ImportError: + HAS_ARE = False + +pytestmark = [pytest.mark.interface, pytest.mark.are] + +skip_no_are = pytest.mark.skipif(not HAS_ARE, reason="ARE not installed") + + +def _make_scenario(duration=60, seed=42, start_time=0): + """Create a minimal ARE scenario with Calendar + Contacts + SystemApp.""" + apps = [CalendarApp(), ContactsApp(), SystemApp()] + scenario = Scenario( + scenario_id="test-integration", # ty: ignore[unknown-argument] + apps=apps, # ty: ignore[unknown-argument] + events=[], # ty: ignore[unknown-argument] + duration=duration, # ty: ignore[unknown-argument] + seed=seed, # ty: ignore[unknown-argument] + start_time=start_time, # ty: ignore[unknown-argument] + time_increment_in_seconds=1, # ty: ignore[unknown-argument] + ) + scenario.initialize() + return scenario + + +# ============================================================================= +# Environment Lifecycle +# ============================================================================= + + +@skip_no_are +class TestAREEnvironmentLifecycle: + """Test the full AREEnvironment lifecycle with real ARE.""" + + def test_scenario_path_creates_environment(self): 
+ """AREEnvironment initializes from a real ARE Scenario.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + assert env.state["scenario_id"] == "test-integration" + assert env.state["duration"] == 60 + assert env.state["seed"] == 42 + assert "CalendarApp" in env.state["app_names"] + assert "ContactsApp" in env.state["app_names"] + assert "SystemApp" in env.state["app_names"] + finally: + env.cleanup() + + def test_shorthand_path_creates_environment(self): + """AREEnvironment initializes from shorthand apps + config.""" + from maseval.interface.environments.are import AREEnvironment + + apps = [CalendarApp(), ContactsApp(), SystemApp()] + env = AREEnvironment( + environment_data={ + "apps": apps, + "duration": 30, + "seed": 99, + } + ) + try: + assert env.state["duration"] == 30 + assert env.state["seed"] == 99 + assert len(env.tools) > 0 + finally: + env.cleanup() + + def test_start_stop_lifecycle(self): + """start() begins simulation, stop() ends it without error.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario(duration=10) + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + env.start() + assert env.get_simulation_time() >= 0 + env.stop() + finally: + env.cleanup() + + def test_pause_resume(self): + """pause() and resume_with_offset() control simulation time.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario(duration=60) + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + env.start() + env.pause() + time_at_pause = env.get_simulation_time() + env.resume_with_offset(10.0) + time_after_resume = env.get_simulation_time() + assert time_after_resume >= time_at_pause + 10.0 + finally: + env.cleanup() + + def test_cleanup_is_idempotent(self): + """cleanup() can be called multiple times without error.""" + from 
maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario(duration=10) + env = AREEnvironment(environment_data={"scenario": scenario}) + env.start() + env.cleanup() + env.cleanup() # second call should not raise + + +# ============================================================================= +# Tool Wrapping +# ============================================================================= + + +@skip_no_are +class TestAREToolWrapping: + """Test that real ARE tools are correctly wrapped and callable.""" + + def test_tools_are_created_from_real_apps(self): + """create_tools() produces AREToolWrapper instances for all ARE app tools.""" + from maseval.interface.environments.are import AREEnvironment + from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + tools = env.get_tools() + assert len(tools) > 0 + for name, tool in tools.items(): + assert isinstance(tool, AREToolWrapper), f"{name} is {type(tool).__name__}" + finally: + env.cleanup() + + def test_tool_metadata_from_real_apps(self): + """Wrapped tools expose name, description, inputs, output_type from ARE.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + for name, tool in env.get_tools().items(): + assert isinstance(tool.name, str) and tool.name + assert isinstance(tool.description, str) and tool.description + assert isinstance(tool.inputs, dict) + assert isinstance(tool.output_type, str) + finally: + env.cleanup() + + def test_tool_call_returns_result_and_traces(self): + """Calling a real ARE tool returns a result and records a traced invocation.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + env.start() + get_time = 
env.get_tool("SystemApp__get_current_time") + assert get_time is not None + + result = get_time() + assert isinstance(result, dict) + assert "current_timestamp" in result + + # Invocation was traced + assert len(get_time.history) == 1 + record = get_time.history.to_list()[0] + assert record["status"] == "success" + assert record["outputs"] == result + + # Simulation time was captured in meta + meta = record["meta"] + assert meta["simulation_time_before"] is not None + assert meta["simulation_time_after"] is not None + assert isinstance(meta["simulation_time_elapsed"], (int, float)) + finally: + env.cleanup() + + def test_tool_error_is_traced_and_reraised(self): + """A tool call that fails records the error and re-raises.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + env.start() + # get_calendar_event with a nonexistent ID should fail + get_event = env.get_tool("CalendarApp__get_calendar_event") + assert get_event is not None + + with pytest.raises(Exception): + get_event(event_id="nonexistent-id-12345") + + assert len(get_event.history) == 1 + assert get_event.history.to_list()[0]["status"] == "error" + finally: + env.cleanup() + + def test_multiple_tool_calls_accumulate_history(self): + """Multiple calls to the same tool accumulate in history.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + env.start() + get_time = env.get_tool("SystemApp__get_current_time") + assert get_time is not None + get_time() + get_time() + get_time() + + assert len(get_time.history) == 3 + for record in get_time.history.to_list(): + assert record["status"] == "success" + finally: + env.cleanup() + + +# ============================================================================= +# AUI Tool Filtering +# 
=============================================================================
+
+
+@skip_no_are
+class TestAUIToolFiltering:
+    """Test AUI tool filtering with real ARE apps."""
+
+    def test_filter_aui_tools_excludes_removal_set(self):
+        """filter_aui_tools=True removes tools in _AUI_TOOLS_TO_REMOVE."""
+        from maseval.interface.environments.are import AREEnvironment
+
+        scenario = _make_scenario()
+        env = AREEnvironment(
+            environment_data={"scenario": scenario},
+            filter_aui_tools=True,
+        )
+        try:
+            for name in env.tools:
+                assert name not in AREEnvironment._AUI_TOOLS_TO_REMOVE
+        finally:
+            env.cleanup()
+
+    def test_unfiltered_has_at_least_as_many_tools(self):
+        """The unfiltered environment exposes at least as many tools as the filtered one."""
+        from maseval.interface.environments.are import AREEnvironment
+
+        env_unfiltered = AREEnvironment(environment_data={"scenario": _make_scenario()})
+        env_filtered = AREEnvironment(
+            environment_data={"scenario": _make_scenario()},
+            filter_aui_tools=True,
+        )
+        try:
+            assert len(env_filtered.tools) <= len(env_unfiltered.tools)
+        finally:
+            env_unfiltered.cleanup()
+            env_filtered.cleanup()
+
+
+# =============================================================================
+# Oracle Mode
+# =============================================================================
+
+
+@skip_no_are
+class TestAREOracleMode:
+    """Test oracle mode with real ARE simulation."""
+
+    def test_oracle_mode_produces_traces(self):
+        """run_oracle=True produces oracle_traces with apps_state and world_logs."""
+        from maseval.interface.environments.are import AREEnvironment
+
+        scenario = _make_scenario(duration=10)
+        env = AREEnvironment(
+            environment_data={"scenario": scenario},
+            run_oracle=True,
+        )
+        try:
+            traces = env.get_oracle_traces()
+            assert traces is not None
+            assert "apps_state" in traces
+            assert "world_logs" in traces
+            assert isinstance(traces["apps_state"], dict)
+            assert isinstance(traces["world_logs"], list)
+
+            # Oracle traces appear in gather_traces too
+ full_traces = env.gather_traces() + assert full_traces["oracle_traces"] is not None + finally: + env.cleanup() + + def test_no_oracle_by_default(self): + """Oracle mode is off by default.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + assert env.get_oracle_traces() is None + finally: + env.cleanup() + + +# ============================================================================= +# Tracing & Config +# ============================================================================= + + +@skip_no_are +class TestARETracingAndConfig: + """Test that tracing and config capture real environment state.""" + + def test_gather_traces_after_tool_calls(self): + """gather_traces() captures tool invocation history from real calls.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + env.start() + get_time = env.get_tool("SystemApp__get_current_time") + assert get_time is not None + get_time() + get_time() + + traces = env.gather_traces() + assert traces["tool_count"] == len(env.tools) + assert traces["scenario_id"] == "test-integration" + + # The tool we called should have 2 invocations in traces + tool_traces = traces["tools"]["SystemApp__get_current_time"] + assert tool_traces["total_invocations"] == 2 + finally: + env.cleanup() + + def test_gather_config_captures_settings(self): + """gather_config() records environment settings for reproducibility.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment( + environment_data={"scenario": scenario}, + notification_verbosity="high", + run_oracle=False, + ) + try: + config = env.gather_config() + assert config["scenario_id"] == "test-integration" + assert config["duration"] == 60 + assert config["seed"] == 42 + assert 
config["notification_verbosity"] == "high"
+            assert config["run_oracle"] is False
+            assert config["tool_count"] == len(env.tools)
+            assert "tool_names" in config
+            assert "SystemApp__get_current_time" in config["tool_names"]
+        finally:
+            env.cleanup()
+
+    def test_simulation_time_advances_with_wait(self):
+        """Simulation time advances when wait_for_notification is called."""
+        from maseval.interface.environments.are import AREEnvironment
+
+        scenario = _make_scenario()
+        env = AREEnvironment(environment_data={"scenario": scenario})
+        try:
+            env.start()
+            t0 = env.get_simulation_time()
+
+            # Assert the tool exists rather than silently skipping the check —
+            # a missing tool should fail the test, not vacuously pass it.
+            wait_tool = env.get_tool("SystemApp__wait_for_notification")
+            assert wait_tool is not None
+            wait_tool(timeout=2)
+            t1 = env.get_simulation_time()
+            assert t1 > t0, f"Simulation time should advance: {t0} -> {t1}"
+        finally:
+            env.cleanup()
+
+
+# =============================================================================
+# Accessors
+# =============================================================================
+
+
+@skip_no_are
+class TestAREAccessors:
+    """Test convenience accessors return real ARE objects."""
+
+    def test_get_are_environment_returns_real_env(self):
+        """get_are_environment() returns the actual ARE Environment instance."""
+        from are.simulation.environment import Environment as RealAREEnv
+
+        from maseval.interface.environments.are import AREEnvironment
+
+        scenario = _make_scenario()
+        env = AREEnvironment(environment_data={"scenario": scenario})
+        try:
+            are_env = env.get_are_environment()
+            assert isinstance(are_env, RealAREEnv)
+        finally:
+            env.cleanup()
+
+    def test_get_scenario_returns_real_scenario(self):
+        """get_scenario() returns the ARE Scenario that was passed in."""
+        from maseval.interface.environments.are import AREEnvironment
+
+        scenario = _make_scenario()
+        env = AREEnvironment(environment_data={"scenario": scenario})
+        try:
+            assert env.get_scenario() is scenario
+        finally:
+            env.cleanup()
+
+    def test_get_notification_system_returns_real_system(self):
+ """get_notification_system() returns a real ARE notification system.""" + from maseval.interface.environments.are import AREEnvironment + + scenario = _make_scenario() + env = AREEnvironment(environment_data={"scenario": scenario}) + try: + ns = env.get_notification_system() + assert ns is not None + assert hasattr(ns, "message_queue") + finally: + env.cleanup() + + +# ============================================================================= +# Error Handling +# ============================================================================= + + +@skip_no_are +class TestAREErrorHandling: + """Test that errors propagate correctly (not swallowed).""" + + def test_invalid_environment_data_raises(self): + """Missing scenario and apps raises ValueError.""" + from maseval.interface.environments.are import AREEnvironment + + with pytest.raises(ValueError, match="must contain either"): + AREEnvironment(environment_data={}) diff --git a/tests/interface/environments/test_are_tool_wrapper.py b/tests/interface/environments/test_are_tool_wrapper.py new file mode 100644 index 0000000..688b0f6 --- /dev/null +++ b/tests/interface/environments/test_are_tool_wrapper.py @@ -0,0 +1,136 @@ +"""Tests for AREToolWrapper.""" + +from unittest.mock import MagicMock, patch +import pytest + +from maseval.interface.environments.are_tool_wrapper import AREToolWrapper + + +@pytest.fixture(autouse=True) +def mock_app_tool_adapter(): + """Mock AppToolAdapter so AREToolWrapper can initialize without real ARE validation.""" + + def make_adapter(are_tool): + adapter = MagicMock() + adapter.name = are_tool.name + adapter.description = are_tool.description + adapter.inputs = are_tool.inputs + adapter.output_type = are_tool.output_type + adapter.actual_return_type = None + return adapter + + with patch("maseval.interface.environments.are_tool_wrapper.AppToolAdapter", side_effect=make_adapter): + yield + + +class TestAREToolWrapper: + """Tests for AREToolWrapper.""" + + def _make_mock_are_tool( + 
+        self,
+        name="Calendar__create_event",
+        description="Create a calendar event",
+        inputs=None,
+        output_type="string",
+        return_value="Event created",
+    ):
+        """Create a mock ARE tool."""
+        tool = MagicMock()
+        tool.name = name
+        tool.description = description
+        tool.inputs = inputs or {"title": {"type": "string", "description": "Event title"}}
+        tool.output_type = output_type
+        tool.return_value = return_value
+        tool.__call__ = MagicMock(return_value=return_value)
+        return tool
+
+    def test_metadata_from_are_tool(self):
+        """Wrapper exposes ARE tool metadata."""
+        are_tool = self._make_mock_are_tool()
+        env = MagicMock()
+
+        wrapper = AREToolWrapper(are_tool, env)
+
+        assert wrapper.name == "Calendar__create_event"
+        assert wrapper.description == "Create a calendar event"
+        assert wrapper.inputs == {"title": {"type": "string", "description": "Event title"}}
+        assert wrapper.output_type == "string"
+
+    def test_call_delegates_to_are_tool(self):
+        """Calling wrapper delegates to underlying ARE tool."""
+        are_tool = self._make_mock_are_tool(return_value="Event created")
+        env = MagicMock()
+
+        wrapper = AREToolWrapper(are_tool, env)
+        result = wrapper(title="Standup")
+
+        are_tool.assert_called_once_with(title="Standup")
+        assert result == "Event created"
+
+    def test_call_records_success_in_history(self):
+        """Successful calls are recorded in invocation history."""
+        are_tool = self._make_mock_are_tool(return_value="OK")
+        env = MagicMock()
+
+        wrapper = AREToolWrapper(are_tool, env)
+        wrapper(title="Test")
+
+        assert len(wrapper.history) == 1
+        record = wrapper.history.to_list()[0]
+        assert record["inputs"] == {"title": "Test"}
+        assert record["outputs"] == "OK"
+        assert record["status"] == "success"
+
+    def test_call_records_error_in_history(self):
+        """Failed calls are recorded in invocation history and re-raised."""
+        are_tool = self._make_mock_are_tool()
+        are_tool.side_effect = ValueError("Invalid title")
+        env = MagicMock()
+
+        wrapper = AREToolWrapper(are_tool, env)
+
+        with pytest.raises(ValueError, match="Invalid title"):
+            wrapper(title="")
+
+        assert len(wrapper.history) == 1
+        record = wrapper.history.to_list()[0]
+        assert record["status"] == "error"
+        assert "Invalid title" in record["outputs"]
+
+    def test_gather_traces(self):
+        """gather_traces returns structured trace data."""
+        are_tool = self._make_mock_are_tool(return_value="Done")
+        env = MagicMock()
+
+        wrapper = AREToolWrapper(are_tool, env)
+        wrapper(title="Test1")
+        wrapper(title="Test2")
+
+        traces = wrapper.gather_traces()
+        assert traces["type"] == "AREToolWrapper"
+        assert traces["name"] == "Calendar__create_event"
+        assert traces["total_invocations"] == 2
+        assert len(traces["invocations"]) == 2
+
+    def test_gather_config(self):
+        """gather_config returns tool configuration."""
+        are_tool = self._make_mock_are_tool()
+        env = MagicMock()
+
+        wrapper = AREToolWrapper(are_tool, env)
+        config = wrapper.gather_config()
+
+        assert config["name"] == "Calendar__create_event"
+        assert config["description"] == "Create a calendar event"
+        assert "input_schema" in config
+
+    def test_repr(self):
+        """String representation shows tool signature."""
+        are_tool = self._make_mock_are_tool()
+        env = MagicMock()
+
+        wrapper = AREToolWrapper(are_tool, env)
+        r = repr(wrapper)
+
+        assert "AREToolWrapper" in r
+        assert "Calendar__create_event" in r
diff --git a/tests/test_benchmarks/test_gaia2/test_environment.py b/tests/test_benchmarks/test_gaia2/test_environment.py
index 1b96abe..2ec086e 100644
--- a/tests/test_benchmarks/test_gaia2/test_environment.py
+++ b/tests/test_benchmarks/test_gaia2/test_environment.py
@@ -8,6 +8,36 @@
 from unittest.mock import MagicMock, patch
 
 
+@pytest.fixture(autouse=True)
+def mock_app_tool_adapter():
+    """Mock AppToolAdapter so AREToolWrapper can initialize without real ARE."""
+
+    def make_adapter(are_tool):
+        adapter = MagicMock()
+        adapter.name = getattr(are_tool, "_public_name", are_tool.name) if hasattr(are_tool, "_public_name") else are_tool.name
+        app_name = getattr(are_tool, "app_name", "")
+        desc = getattr(are_tool, "_public_description", getattr(are_tool, "function_description", ""))
+        adapter.description = f"Acts on app {app_name}: {desc}" if app_name else desc
+        type_map = {"str": "string", "int": "integer", "float": "number", "bool": "boolean"}
+        inputs = {}
+        for arg in getattr(are_tool, "args", []):
+            arg_name = getattr(arg, "name", None)
+            if arg_name:
+                raw_type = getattr(arg, "arg_type", "string")
+                entry = {"type": type_map.get(raw_type, raw_type), "description": getattr(arg, "description", "")}
+                if getattr(arg, "has_default", False):
+                    entry["default"] = getattr(arg, "default", None)
+                    entry["nullable"] = True
+                inputs[arg_name] = entry
+        adapter.inputs = inputs
+        adapter.output_type = "string"
+        adapter.actual_return_type = None
+        return adapter
+
+    with patch("maseval.interface.environments.are_tool_wrapper.AppToolAdapter", side_effect=make_adapter):
+        yield
+
+
 def _make_are_mock():
     """Create a fully-mocked ARE module structure for sys.modules patching.
@@ -45,6 +75,7 @@ def _get_scenario_duration(scenario, max_time_duration, max_duration):
         "are.simulation.environment": mock_are.simulation.environment,
         "are.simulation.types": mock_are.simulation.types,
         "are.simulation.validation": mock_are.simulation.validation,
+        "are.simulation.notification_system": mock_are.simulation.notification_system,
         "are.simulation.scenarios": mock_are.simulation.scenarios,
         "are.simulation.scenarios.config": mock_are.simulation.scenarios.config,
         "are.simulation.scenarios.scenario_imported_from_json": (mock_are.simulation.scenarios.scenario_imported_from_json),
@@ -133,7 +164,7 @@ class TestGaia2EnvironmentCreateTools:
     def test_create_tools_wraps_are_tools(self):
         """Test create_tools returns wrapped ARE tools."""
         from maseval.benchmark.gaia2.environment import Gaia2Environment
-        from maseval.benchmark.gaia2.tool_wrapper import Gaia2GenericTool
+        from maseval.interface.environments.are_tool_wrapper import AREToolWrapper
         from types import SimpleNamespace
 
         mock_are, modules = _make_are_mock()
@@ -166,7 +197,7 @@ def test_create_tools_wraps_are_tools(self):
         tools = env.create_tools()
 
         assert "TestTool__do_something" in tools
-        assert isinstance(tools["TestTool__do_something"], Gaia2GenericTool)
+        assert isinstance(tools["TestTool__do_something"], AREToolWrapper)
 
     def test_create_tools_filters_aui_tools(self):
         """Test create_tools filters out AUI message-retrieval tools."""
@@ -863,8 +894,8 @@ def test_resume_with_offset_noop_when_no_are_env(self):
         env.resume_with_offset(5.0)  # Should not raise
 
-    def test_pause_swallows_exception(self):
-        """pause() swallows exceptions from ARE environment."""
+    def test_pause_propagates_exception(self):
+        """pause() propagates exceptions from ARE environment."""
         from maseval.benchmark.gaia2.environment import Gaia2Environment
 
         env = Gaia2Environment.__new__(Gaia2Environment)
@@ -875,4 +906,5 @@ def test_pause_swallows_exception(self):
         env._tool_wrappers = {}
         env.state = {}
 
-        env.pause()  # Should not raise
+        with pytest.raises(RuntimeError, match="pause failed"):
+            env.pause()
diff --git a/tests/test_benchmarks/test_gaia2/test_integration.py b/tests/test_benchmarks/test_gaia2/test_integration.py
index f6d1818..edd6bf1 100644
--- a/tests/test_benchmarks/test_gaia2/test_integration.py
+++ b/tests/test_benchmarks/test_gaia2/test_integration.py
@@ -76,7 +76,7 @@ def test_environment_setup_state(self, first_real_task):
     def test_real_tools_are_created(self, first_real_task):
         """Tools created from a real scenario are non-empty Gaia2GenericTool instances."""
         from maseval.benchmark.gaia2.environment import Gaia2Environment
-        from maseval.benchmark.gaia2.tool_wrapper import Gaia2GenericTool
+        from maseval.interface.environments.are_tool_wrapper import AREToolWrapper
 
         env = Gaia2Environment(environment_data=first_real_task.environment_data)
         try:
@@ -86,7 +86,7 @@ def test_real_tools_are_created(self, first_real_task):
         assert len(tools) > 0, "No tools created from real scenario. ARE environment should expose app tools (Calendar, Email, etc.)."
         for name, tool in tools.items():
-            assert isinstance(tool, Gaia2GenericTool), f"Tool '{name}' is {type(tool).__name__}, expected Gaia2GenericTool"
+            assert isinstance(tool, AREToolWrapper), f"Tool '{name}' is {type(tool).__name__}, expected AREToolWrapper"
             assert tool.name, "Tool has empty name"
     finally:
         env.cleanup()
diff --git a/tests/test_benchmarks/test_gaia2/test_tool_wrapper.py b/tests/test_benchmarks/test_gaia2/test_tool_wrapper.py
index 052e94a..ee85ee1 100644
--- a/tests/test_benchmarks/test_gaia2/test_tool_wrapper.py
+++ b/tests/test_benchmarks/test_gaia2/test_tool_wrapper.py
@@ -5,7 +5,49 @@
 """
 
 import pytest
-from unittest.mock import MagicMock
+from unittest.mock import MagicMock, patch
+
+
+@pytest.fixture(autouse=True)
+def mock_app_tool_adapter():
+    """Mock AppToolAdapter so AREToolWrapper can initialize without real ARE validation."""
+
+    def make_adapter(are_tool):
+        adapter = MagicMock()
+        # Match ARE's AppToolAdapter behavior:
+        # - name comes from _public_name
+        adapter.name = getattr(are_tool, "_public_name", are_tool.name)
+        # - description includes app prefix: "Acts on app {app_name}: {desc}"
+        app_name = getattr(are_tool, "app_name", "")
+        desc = getattr(are_tool, "_public_description", getattr(are_tool, "function_description", ""))
+        adapter.description = f"Acts on app {app_name}: {desc}" if app_name else desc
+        # - inputs as flat dict from args
+        # ARE AppToolAdapter type mapping (tool_utils.py:572-578)
+        type_map = {"str": "string", "int": "integer", "float": "number", "bool": "boolean"}
+        inputs = {}
+        for arg in getattr(are_tool, "args", []):
+            arg_name = getattr(arg, "name", None)
+            if arg_name:
+                raw_type = getattr(arg, "arg_type", "string")
+                entry = {
+                    "type": type_map.get(raw_type, raw_type),
+                    "description": getattr(arg, "description", ""),
+                }
+                if getattr(arg, "has_default", False):
+                    entry["default"] = getattr(arg, "default", None)
+                    entry["nullable"] = True
+                inputs[arg_name] = entry
+        adapter.inputs = inputs
+        adapter.output_type = "string"
+        # Return type suffix for description
+        rt = getattr(are_tool, "return_type", None)
+        if rt is not None:
+            adapter.description += f" Returns {rt.__name__ if isinstance(rt, type) else str(rt)}"
+        adapter.actual_return_type = None
+        return adapter
+
+    with patch("maseval.interface.environments.are_tool_wrapper.AppToolAdapter", side_effect=make_adapter):
+        yield
 
 
 # =============================================================================
diff --git a/uv.lock b/uv.lock
index f6701d6..ccc710d 100644
--- a/uv.lock
+++ b/uv.lock
@@ -3433,6 +3433,9 @@ all = [
 anthropic = [
     { name = "anthropic" },
 ]
+are = [
+    { name = "meta-agents-research-environments" },
+]
 camel = [
     { name = "camel-ai" },
     { name = "litellm" },
@@ -3641,6 +3644,7 @@ requires-dist = [
     { name = "maseval", extras = ["smolagents", "langgraph", "llamaindex", "camel", "anthropic", "openai", "google-genai", "litellm", "langfuse", "gaia2", "macs", "tau2", "disco"], marker = "extra == 'examples'" },
     { name = "matplotlib", marker = "extra == 'disco'", specifier = ">=3.5.0" },
     { name = "mcp", marker = "extra == 'examples'", specifier = ">=1.22.0" },
+    { name = "meta-agents-research-environments", marker = "extra == 'are'", specifier = ">=1.2.0" },
     { name = "meta-agents-research-environments", marker = "extra == 'gaia2'", specifier = ">=1.2.0" },
     { name = "names", marker = "extra == 'multiagentbench'", specifier = ">=0.3.0" },
     { name = "numpy", marker = "extra == 'mmlu'", specifier = ">=1.20.0" },
@@ -3676,7 +3680,7 @@
     { name = "waitress", marker = "extra == 'multiagentbench'", specifier = ">=3.0.0" },
     { name = "wandb", marker = "extra == 'wandb'", specifier = ">=0.15.0" },
 ]
-provides-extras = ["smolagents", "langgraph", "llamaindex", "camel", "anthropic", "openai", "google-genai", "transformers", "litellm", "wandb", "langfuse", "gaia2", "macs", "multiagentbench", "tau2", "converse", "mmlu", "lm-eval", "disco", "examples", "all"]
+provides-extras = ["smolagents", "langgraph", "llamaindex", "camel", "anthropic", "openai", "google-genai", "transformers", "litellm", "wandb", "langfuse", "are", "gaia2", "macs", "multiagentbench", "tau2", "converse", "mmlu", "lm-eval", "disco", "examples", "all"]
 
 [package.metadata.requires-dev]
 dev = [