Skip to content

Sensitivity: record variation draws + self-describing episode summaries#788

Draft
cvolkcvolk wants to merge 2 commits into
alex/feature/variations_in_eval_runnerfrom
cvolk/feature/sensitivity_variation_recording
Draft

Sensitivity: record variation draws + self-describing episode summaries#788
cvolkcvolk wants to merge 2 commits into
alex/feature/variations_in_eval_runnerfrom
cvolk/feature/sensitivity_variation_recording

Conversation

@cvolkcvolk

@cvolkcvolk cvolkcvolk commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

DRAFT Stacked on #779 (alex/feature/variations_in_eval_runner). Produces the self-describing episode_summary.jsonl the sensitivity toolbox consumes, sourcing factor values from the variation system's realized draws.

Build-time variations now stash their sampled value (VariationBase.last_draw),
and the builder exposes two collectors keyed by <asset>.<variation>:

- get_enabled_variation_factor_schema(): the prior (type + range) per enabled
  variation, derived from its sampler (Uniform -> continuous, Choice -> categorical)
- get_enabled_variation_draws(): this build's realized values

These feed the sensitivity episode-summary writer so variation-sampled factors are
recorded the same way static CLI factors already are. Run-time (per-reset) draws are
not captured yet (would need a per-env buffer).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Opt-in --episode_summary writes a self-describing JSONL for sensitivity analysis:

- A one-time meta header (run slice + enabled-variation factor schema + outcome names),
  then one episode row each carrying arena_env_args, realized variation_draws, and outcomes.
- episode_writer reads per-episode outcomes via MetricsManager.compute_per_episode and the
  factor schema / draws off the env (attached in load_env). Written per rebuild.
- The write is guarded so a JSONL failure can't flip a succeeded job to FAILED.
- metrics.get_recorded_metric_data sorts demos by numeric index so per-episode ordering is
  correct at >= 10 episodes (h5py iterates demo_0, demo_1, demo_10, ... alphabetically).
- Job carries arena_env_args_dict (the pre-CLI dict) for logging.

The analysis-side reader (SensitivityDataset.from_file) lands with the analyzer.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

@arena-review-bot arena-review-bot Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab-Arena Review Bot

Summary

This PR records realized build-time variation draws and emits a self-describing episode_summary.jsonl for downstream sensitivity analysis. The variation-recording plumbing (_record_draw/last_draw, the pure-Python variation_recording collectors) is clean and well-placed, and the numeric demo-sort fix in metrics.py is correct. However, there is a critical type mismatch in episode_writer.py that makes the whole feature silently write no episode rows — and because the call site swallows the exception, it fails as a [WARNING] rather than loudly. See Findings.

Findings

🔴 Critical: episode_writer.py:73metrics_to_plain_python_types expects a MetricsDataCollection (it reads .num_episodes and .metric_data_entries), but compute_per_episode() returns a list[dict[str, Any]] — so episode_metrics is a plain dict. The call raises AttributeError: 'dict' object has no attribute 'num_episodes' on the first episode, which the except Exception guard in eval_runner.py:333 turns into a [WARNING]. Net effect: the meta header is written but every episode row is dropped, silently, on every run. Detail below.

🟡 Warning: episode_writer.py:92 (run-header / single-slice) — the header is written once (empty-file guard) from whichever job/rebuild reaches the file first, but eval_runner.main() appends all jobs to the same --episode_summary path. If two jobs differ in slice or factor schema, only the first job's slice/factors are recorded and the rest are silently mixed in. Is one-file-per-slice an enforced invariant? If so, consider asserting later jobs match the header slice rather than relying on it.

🔵 Improvement: episode_writer.py:70 (rebuild identity) — rows carry episode_idx (restarting at 0 each rebuild) but no rebuild_idx, while write_episode_summaries is called once per rebuild with re-sampled draws. So (job_name, episode_idx) isn't unique across rebuilds. The draws/outcomes still differ per row, but would it help the reader (#729) to stamp each row with rebuild_idx?

Test Coverage

No tests are added. collect_variation_draws / collect_variation_factor_schema are pure-Python and trivially unit-testable (build a couple of variations with UniformSampler/ChoiceSampler, assert the schema/draws) without spinning up sim — worth covering since they're the contract the analyzer depends on. The metrics.py demo-ordering fix is a bug fix that ideally gets a regression test (a dataset with ≥10 demos asserting recorded order), and compute_per_episode has no coverage. Given the 🔴 above, even a smoke test that actually exercises write_episode_summaries end-to-end would have caught the broken feature.

Verdict

Significant concerns — the critical type mismatch means the feature doesn't currently record any episodes.

"episode_idx": episode_idx,
"arena_env_args": arena_env_args_snapshot,
"variation_draws": variation_draws,
"outcomes": metrics_to_plain_python_types(episode_metrics),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Criticalmetrics_to_plain_python_types takes a MetricsDataCollection and immediately accesses metrics_data.num_episodes / .metric_data_entries. But compute_per_episode() returns a list[dict[str, Any]], so episode_metrics here is a plain {metric_name: value} dict — this raises AttributeError: 'dict' object has no attribute 'num_episodes'. The try/except Exception in eval_runner.py:333 then catches it and logs a [WARNING], so the run looks fine but the JSONL ends up with only the meta header and zero episode rows.

You need a per-value sanitizer here, not the collection-level one. Suggest extracting the per-value branch out of metrics_to_plain_python_types (e.g. sanitize_metric_value(value) in metrics_logger.py) and reusing it in both places to avoid duplicating the numpy→JSON conversion:

Suggested change
"outcomes": metrics_to_plain_python_types(episode_metrics),
"outcomes": {name: sanitize_metric_value(value) for name, value in episode_metrics.items()},

(where sanitize_metric_value is the np.bool_/np.floating/np.integer/np.ndarray conversion currently inlined in metrics_to_plain_python_types). A smoke test exercising this path would also keep it from silently regressing behind the guard.

header = {
"record": "meta",
"schema_version": SCHEMA_VERSION,
"slice": _slice_identity(job),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Question / single-slice — this header is written exactly once (empty-file guard), but main() appends every job to the same --episode_summary file. If jobs differ in slice or factors, only the first job's are captured and later jobs are silently folded under the wrong header. The _slice_identity comment notes "MNPE assumes a single source" — if that's a hard invariant, would it be worth asserting subsequent jobs match the recorded slice instead of relying on it silently?

summary_row = {
"record": "episode",
"job_name": job.name,
"episode_idx": episode_idx,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Each rebuild calls write_episode_summaries with freshly re-sampled draws, and episode_idx restarts at 0 per rebuild — so (job_name, episode_idx) isn't unique across rebuilds (the per-rebuild count is only in the print, not the row). Worth stamping each row with rebuild_idx so the analyzer can disambiguate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant