Open
Labels
- core: In regards to the core package `maseval/core`
- enhancement: New feature or request
Description
Feature Request
Problem
Benchmark scenarios always execute from the beginning. To investigate agent behavior at a specific point, run ablation studies on a later phase, or compare different agents against identical mid-scenario conditions, you must replay the entire scenario from time zero every time.
This is a general maseval limitation that affects all benchmarks and environments.
Proposed Solution
Add support for restoring the full system state at a specific point from previously gathered traces, then continuing execution from there. Full system state includes:
- Environment state -- simulation clock, app state, world/event logs
- Agent state -- message histories, internal memory, accumulated context
- User/participant state -- interaction history, pending messages, turn state
- Tool state -- any stateful tool context accumulated during the run
Given traces from a prior run, maseval should be able to:
- Initialize all components to the state at time `t` by replaying or restoring the recorded history up to that point
- Resume live execution from `t` onward with a real agent making new decisions
- Do this repeatedly from the same checkpoint to enable controlled comparisons
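To make the replay path concrete, here is a minimal sketch of restoring components to time `t` by replaying recorded events. Everything in it is an assumption for illustration: `TraceEvent`, `Replayable`, `MessageLog`, and `restore_to` are hypothetical names, not existing maseval types, and the real trace format may look quite different.

```python
from collections.abc import Mapping
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass(frozen=True)
class TraceEvent:
    """One recorded event (hypothetical shape, not the maseval trace format)."""
    time: float
    component: str  # assumed labels: "environment", "agent", "user", "tool"
    payload: Any


class Replayable(Protocol):
    """Anything that can rebuild its state by consuming recorded events."""
    def apply(self, event: TraceEvent) -> None: ...


@dataclass
class MessageLog:
    """Toy component whose whole state is the list of payloads it has seen."""
    messages: list[Any] = field(default_factory=list)

    def apply(self, event: TraceEvent) -> None:
        self.messages.append(event.payload)


def restore_to(components: Mapping[str, Replayable],
               events: list[TraceEvent], t: float) -> None:
    """Replay every event recorded at or before `t`, in time order."""
    for event in sorted(events, key=lambda e: e.time):
        if event.time > t:
            break  # events after the checkpoint are left for live execution
        components[event.component].apply(event)
```

Calling `restore_to` on freshly constructed components with the same `events` and `t` should yield the same starting state each time, which is what the repeated-comparison requirement relies on.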
Sketch
```python
# Run a benchmark, collect traces
result = benchmark.run(agent=agent_a, tasks=tasks)
traces = result.traces  # full system traces from gather_traces()

# Later: replay from midpoint with a different agent
result_b = benchmark.run(
    agent=agent_b,
    tasks=tasks,
    restore_from=traces,
    resume_at=0.5,  # e.g., 50% through the scenario, or a simulation timestamp
)
```
Use Cases
- Debugging: An agent fails at turn 40 of a 50-turn interaction. Restore to turn 38 and replay with verbose logging, without re-running turns 0-37.
- Ablation studies: Test how different agents or configurations handle identical mid-scenario state.
- Evaluation efficiency: For long benchmarks, evaluate only the critical decision window instead of the full scenario.
- Reproducibility: Save and share checkpoints so collaborators can reproduce results from a specific system state.
- Agent comparison: Run multiple agents from the same checkpoint for controlled comparisons that are not confounded by different early-game trajectories.
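The agent-comparison case depends on every run starting from an identical, unshared copy of the checkpoint. A rough sketch of that isolation, assuming a checkpoint can be represented as a plain dict and an agent run as a callable (both are hypothetical stand-ins, not maseval types):

```python
import copy
from collections.abc import Callable, Mapping
from typing import Any

# Hypothetical stand-ins: a checkpoint is a dict of restored state, and an
# "agent run" is any callable that continues the scenario from that state.
Checkpoint = dict[str, Any]
AgentRun = Callable[[Checkpoint], Any]


def compare_from_checkpoint(checkpoint: Checkpoint,
                            agents: Mapping[str, AgentRun]) -> dict[str, Any]:
    """Run each agent from its own deep copy of the same checkpoint."""
    results: dict[str, Any] = {}
    for name, run in agents.items():
        # Deep copy so one agent's mutations never leak into the next run
        # or into the saved checkpoint itself.
        results[name] = run(copy.deepcopy(checkpoint))
    return results
```

Deep copying only works if the restored state is copyable in memory; state held in external processes or services would need its own snapshot mechanism.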
Considerations
- Trace format: `gather_traces()` already captures environment and tool traces. This needs to extend to agent message history and user interaction state in a restorable format.
- Environment support: Each `Environment` subclass needs to implement a restore/checkpoint protocol. Some environments have natural checkpoint semantics (e.g. ARE with its simulation clock); others may need to replay events sequentially to reach the target state.
- Agent restore: `AgentAdapter` needs a way to inject prior message history so the agent's context matches what it would have seen at the checkpoint. Each framework adapter (smolagents, langgraph, etc.) needs its own implementation.
- Determinism: Restoring to a checkpoint does not guarantee the same outcome since LLM responses are stochastic. The value is in controlled starting conditions, not exact replay.
- Checkpoint granularity: Need to define what "midpoint" means across different benchmark types -- simulation time, turn number, percentage, or a specific event.
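One way to approach the granularity question is to make the unit explicit instead of overloading a bare float. The sketch below is an assumption, not a proposed final API: `ResumePoint` and `resolve_turn` are hypothetical names, and it assumes each recorded turn carries a simulation timestamp. It maps a turn number, a fraction of the scenario, or a simulation time onto a single count of turns to restore.

```python
import bisect
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ResumePoint:
    """Hypothetical explicit resume specification (not an existing maseval type)."""
    kind: Literal["turn", "fraction", "time"]
    value: float


def resolve_turn(point: ResumePoint, turn_times: list[float]) -> int:
    """Return how many recorded turns to restore before going live.

    `turn_times` holds the sorted simulation timestamp of each recorded turn.
    """
    n = len(turn_times)
    if point.kind == "turn":
        turns = int(point.value)
    elif point.kind == "fraction":
        if not 0.0 <= point.value <= 1.0:
            raise ValueError("fraction must be within [0, 1]")
        turns = int(point.value * n)  # rounds down to a whole turn
    elif point.kind == "time":
        # Count the turns whose timestamp is at or before the requested time.
        turns = bisect.bisect_right(turn_times, point.value)
    else:
        raise ValueError(f"unknown kind: {point.kind!r}")
    if not 0 <= turns <= n:
        raise ValueError(f"resume point outside recorded range 0..{n}")
    return turns
```

With an explicit `kind`, a value such as `0.5` is no longer ambiguous between "50% through the scenario" and "simulation time 0.5". Event-based checkpoints could be added as another `kind` once the trace format identifies events.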