Support checkpoint-based replay: restore full system state mid-scenario #56

@cemde

Description

Feature Request

Problem

Benchmark scenarios always execute from the beginning. To investigate agent behavior at a specific point, run ablation studies on a later phase, or compare different agents against identical mid-scenario conditions, you must replay the entire scenario from time zero every time.

This is a general maseval limitation that affects all benchmarks and environments.

Proposed Solution

Add support for restoring full system state from previously gathered traces to a specific point, then continue execution from there. Full system state includes:

  • Environment state -- simulation clock, app state, world/event logs
  • Agent state -- message histories, internal memory, accumulated context
  • User/participant state -- interaction history, pending messages, turn state
  • Tool state -- any stateful tool context accumulated during the run
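The four state categories above could be bundled into one serializable checkpoint object. A minimal sketch, purely illustrative (SystemCheckpoint and its fields are hypothetical names, not an existing maseval API):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SystemCheckpoint:
    """Hypothetical container for full system state at simulation time t."""
    sim_time: float                                             # environment simulation clock
    env_state: dict[str, Any] = field(default_factory=dict)     # app state, world/event logs
    agent_state: dict[str, Any] = field(default_factory=dict)   # message histories, internal memory
    user_state: dict[str, Any] = field(default_factory=dict)    # interaction history, turn state
    tool_state: dict[str, Any] = field(default_factory=dict)    # stateful tool context
```

A flat dataclass like this keeps the checkpoint trivially serializable (e.g. to JSON) so it can be saved, shared, and diffed across runs.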

Given traces from a prior run, maseval should be able to:

  1. Initialize all components to the state at time t by replaying or restoring the recorded history up to that point
  2. Resume live execution from t onward with a real agent making new decisions
  3. Do this repeatedly from the same checkpoint to enable controlled comparisons

Sketch

# Run a benchmark, collect traces
result = benchmark.run(agent=agent_a, tasks=tasks)
traces = result.traces  # full system traces from gather_traces()

# Later: replay from midpoint with a different agent
result_b = benchmark.run(
    agent=agent_b,
    tasks=tasks,
    restore_from=traces,
    resume_at=0.5,  # e.g., 50% through the scenario, or a simulation timestamp
)

Use Cases

  • Debugging: An agent fails at turn 40 of a 50-turn interaction. Restore to turn 38 and replay with verbose logging, without re-running turns 0-37.
  • Ablation studies: Test how different agents or configurations handle identical mid-scenario state.
  • Evaluation efficiency: For long benchmarks, evaluate only the critical decision window instead of the full scenario.
  • Reproducibility: Save and share checkpoints so collaborators can reproduce results from a specific system state.
  • Agent comparison: Run multiple agents from the same checkpoint for controlled comparisons that are not confounded by different early-game trajectories.

Considerations

  • Trace format: gather_traces() already captures environment and tool traces; it would need to be extended to cover agent message history and user interaction state in a restorable format.
  • Environment support: Each Environment subclass needs to implement a restore/checkpoint protocol. Some environments have natural checkpoint semantics (e.g. ARE with its simulation clock); others may need to replay events sequentially to reach the target state.
  • Agent restore: AgentAdapter needs a way to inject prior message history so the agent's context matches what it would have seen at the checkpoint. Each framework adapter (smolagents, langgraph, etc.) needs its own implementation.
  • Determinism: Restoring to a checkpoint does not guarantee the same outcome since LLM responses are stochastic. The value is in controlled starting conditions, not exact replay.
  • Checkpoint granularity: Need to define what "midpoint" means across different benchmark types -- simulation time, turn number, percentage, or a specific event.
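One possible shape for the restore/checkpoint protocol mentioned above, as a sketch (Checkpointable, EventLogEnvironment, and all method names here are hypothetical, not existing maseval APIs). It also illustrates the event-replay fallback for environments without native snapshot semantics:

```python
from abc import ABC, abstractmethod
from typing import Any

class Checkpointable(ABC):
    """Hypothetical protocol an Environment subclass could implement."""

    @abstractmethod
    def checkpoint(self) -> dict[str, Any]:
        """Snapshot restorable state at the current point."""

    @abstractmethod
    def restore(self, snapshot: dict[str, Any]) -> None:
        """Reset this component to a previously captured snapshot."""

class EventLogEnvironment(Checkpointable):
    """Toy environment without native snapshots: it reaches the target
    state by replaying recorded events sequentially."""

    def __init__(self) -> None:
        self.clock = 0
        self.events: list[str] = []

    def apply(self, event: str) -> None:
        self.events.append(event)
        self.clock += 1

    def checkpoint(self) -> dict[str, Any]:
        # Copy the log so later live events do not mutate the snapshot
        return {"events": list(self.events)}

    def restore(self, snapshot: dict[str, Any]) -> None:
        self.clock = 0
        self.events = []
        for event in snapshot["events"]:
            self.apply(event)  # replay events to reach the target state
```

Environments with a simulation clock (e.g. ARE) could implement checkpoint() as a direct state dump instead of an event log; the protocol does not need to care which strategy a subclass uses. AgentAdapter could expose an analogous pair of methods for injecting prior message history.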

    Labels

    core: In regards to the core package `maseval/core`
    enhancement: New feature or request
