Support checkpoint-based replay: restore full system state mid-scenario #56

@cemde

Description

Feature Request

Problem

Benchmark scenarios always execute from the beginning. To investigate agent behavior at a specific point, run ablation studies on a later phase, or compare different agents against identical mid-scenario conditions, you must replay the entire scenario from time zero every time.

This is a general maseval limitation that affects all benchmarks and environments.

Proposed Solution

Add support for restoring full system state from previously gathered traces to a specific point, then continue execution from there. Full system state includes:

  • Environment state -- simulation clock, app state, world/event logs
  • Agent state -- message histories, internal memory, accumulated context
  • User/participant state -- interaction history, pending messages, turn state
  • Tool state -- any stateful tool context accumulated during the run
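The four state categories above could be bundled into one serializable checkpoint object. A minimal sketch, purely illustrative (SystemCheckpoint and its fields are hypothetical names, not an existing maseval API):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SystemCheckpoint:
    """Hypothetical container for full system state at simulation time t."""
    sim_time: float                                             # environment simulation clock
    env_state: dict[str, Any] = field(default_factory=dict)     # app state, world/event logs
    agent_state: dict[str, Any] = field(default_factory=dict)   # message histories, internal memory
    user_state: dict[str, Any] = field(default_factory=dict)    # interaction history, turn state
    tool_state: dict[str, Any] = field(default_factory=dict)    # stateful tool context
```

A flat dataclass like this keeps the checkpoint trivially serializable (e.g. to JSON) so it can be saved, shared, and diffed across runs.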

Given traces from a prior run, maseval should be able to:

  1. Initialize all components to the state at time t by replaying or restoring the recorded history up to that point
  2. Resume live execution from t onward with a real agent making new decisions
  3. Do this repeatedly from the same checkpoint to enable controlled comparisons

Sketch

# Run a benchmark, collect traces
result = benchmark.run(agent=agent_a, tasks=tasks)
traces = result.traces  # full system traces from gather_traces()

# Later: replay from midpoint with a different agent
result_b = benchmark.run(
    agent=agent_b,
    tasks=tasks,
    restore_from=traces,
    resume_at=0.5,  # e.g., 50% through the scenario, or a simulation timestamp
)

Use Cases

  • Debugging: An agent fails at turn 40 of a 50-turn interaction. Restore to turn 38 and replay with verbose logging, without re-running turns 0-37.
  • Ablation studies: Test how different agents or configurations handle identical mid-scenario state.
  • Evaluation efficiency: For long benchmarks, evaluate only the critical decision window instead of the full scenario.
  • Reproducibility: Save and share checkpoints so collaborators can reproduce results from a specific system state.
  • Agent comparison: Run multiple agents from the same checkpoint for controlled comparisons that are not confounded by different early-game trajectories.

Considerations

  • Trace format: gather_traces() already captures environment and tool traces; it would need to be extended to cover agent message history and user interaction state in a restorable format.
  • Environment support: Each Environment subclass needs to implement a restore/checkpoint protocol. Some environments have natural checkpoint semantics (e.g. ARE with its simulation clock); others may need to replay events sequentially to reach the target state.
  • Agent restore: AgentAdapter needs a way to inject prior message history so the agent's context matches what it would have seen at the checkpoint. Each framework adapter (smolagents, langgraph, etc.) needs its own implementation.
  • Determinism: Restoring to a checkpoint does not guarantee the same outcome since LLM responses are stochastic. The value is in controlled starting conditions, not exact replay.
  • Checkpoint granularity: Need to define what "midpoint" means across different benchmark types -- simulation time, turn number, percentage, or a specific event.
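One possible shape for the restore/checkpoint protocol mentioned above, as a sketch (Checkpointable, EventLogEnvironment, and all method names here are hypothetical, not existing maseval APIs). It also illustrates the event-replay fallback for environments without native snapshot semantics:

```python
from abc import ABC, abstractmethod
from typing import Any

class Checkpointable(ABC):
    """Hypothetical protocol an Environment subclass could implement."""

    @abstractmethod
    def checkpoint(self) -> dict[str, Any]:
        """Snapshot restorable state at the current point."""

    @abstractmethod
    def restore(self, snapshot: dict[str, Any]) -> None:
        """Reset this component to a previously captured snapshot."""

class EventLogEnvironment(Checkpointable):
    """Toy environment without native snapshots: it reaches the target
    state by replaying recorded events sequentially."""

    def __init__(self) -> None:
        self.clock = 0
        self.events: list[str] = []

    def apply(self, event: str) -> None:
        self.events.append(event)
        self.clock += 1

    def checkpoint(self) -> dict[str, Any]:
        # Copy the log so later live events do not mutate the snapshot
        return {"events": list(self.events)}

    def restore(self, snapshot: dict[str, Any]) -> None:
        self.clock = 0
        self.events = []
        for event in snapshot["events"]:
            self.apply(event)  # replay events to reach the target state
```

Environments with a simulation clock (e.g. ARE) could implement checkpoint() as a direct state dump instead of an event log; the protocol does not need to care which strategy a subclass uses. AgentAdapter could expose an analogous pair of methods for injecting prior message history.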

    Labels

    core: In regards to the core package `maseval/core`
    enhancement: New feature or request
