Sensitivity: record variation draws + self-describing episode summaries#788
Conversation
Build-time variations now stash their sampled value (VariationBase.last_draw), and the builder exposes two collectors keyed by <asset>.<variation>: - get_enabled_variation_factor_schema(): the prior (type + range) per enabled variation, derived from its sampler (Uniform -> continuous, Choice -> categorical) - get_enabled_variation_draws(): this build's realized values These feed the sensitivity episode-summary writer so variation-sampled factors are recorded the same way static CLI factors already are. Run-time (per-reset) draws are not captured yet (would need a per-env buffer). Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Opt-in --episode_summary writes a self-describing JSONL for sensitivity analysis: - A one-time meta header (run slice + enabled-variation factor schema + outcome names), then one episode row each carrying arena_env_args, realized variation_draws, and outcomes. - episode_writer reads per-episode outcomes via MetricsManager.compute_per_episode and the factor schema / draws off the env (attached in load_env). Written per rebuild. - The write is guarded so a JSONL failure can't flip a succeeded job to FAILED. - metrics.get_recorded_metric_data sorts demos by numeric index so per-episode ordering is correct at >= 10 episodes (h5py iterates demo_0, demo_1, demo_10, ... alphabetically). - Job carries arena_env_args_dict (the pre-CLI dict) for logging. The analysis-side reader (SensitivityDataset.from_file) lands with the analyzer. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This PR records realized build-time variation draws and emits a self-describing episode_summary.jsonl for downstream sensitivity analysis. The variation-recording plumbing (_record_draw/last_draw, the pure-Python variation_recording collectors) is clean and well-placed, and the numeric demo-sort fix in metrics.py is correct. However, there is a critical type mismatch in episode_writer.py that makes the whole feature silently write no episode rows — and because the call site swallows the exception, it fails as a [WARNING] rather than loudly. See Findings.
Findings
🔴 Critical: episode_writer.py:73 — metrics_to_plain_python_types expects a MetricsDataCollection (it reads .num_episodes and .metric_data_entries), but compute_per_episode() returns a list[dict[str, Any]] — so episode_metrics is a plain dict. The call raises AttributeError: 'dict' object has no attribute 'num_episodes' on the first episode, which the except Exception guard in eval_runner.py:333 turns into a [WARNING]. Net effect: the meta header is written but every episode row is dropped, silently, on every run. Detail below.
🟡 Warning: episode_writer.py:92 (run-header / single-slice) — the header is written once (empty-file guard) from whichever job/rebuild reaches the file first, but eval_runner.main() appends all jobs to the same --episode_summary path. If two jobs differ in slice or factor schema, only the first job's slice/factors are recorded and the rest are silently mixed in. Is one-file-per-slice an enforced invariant? If so, consider asserting later jobs match the header slice rather than relying on it.
🔵 Improvement: episode_writer.py:70 (rebuild identity) — rows carry episode_idx (restarting at 0 each rebuild) but no rebuild_idx, while write_episode_summaries is called once per rebuild with re-sampled draws. So (job_name, episode_idx) isn't unique across rebuilds. The draws/outcomes still differ per row, but would it help the reader (#729) to stamp each row with rebuild_idx?
Test Coverage
No tests are added. collect_variation_draws / collect_variation_factor_schema are pure-Python and trivially unit-testable (build a couple of variations with UniformSampler/ChoiceSampler, assert the schema/draws) without spinning up sim — worth covering since they're the contract the analyzer depends on. The metrics.py demo-ordering fix is a bug fix that ideally gets a regression test (a dataset with ≥10 demos asserting recorded order), and compute_per_episode has no coverage. Given the 🔴 above, even a smoke test that actually exercises write_episode_summaries end-to-end would have caught the broken feature.
Verdict
Significant concerns — the critical type mismatch means the feature doesn't currently record any episodes.
| "episode_idx": episode_idx, | ||
| "arena_env_args": arena_env_args_snapshot, | ||
| "variation_draws": variation_draws, | ||
| "outcomes": metrics_to_plain_python_types(episode_metrics), |
There was a problem hiding this comment.
🔴 Critical — metrics_to_plain_python_types takes a MetricsDataCollection and immediately accesses metrics_data.num_episodes / .metric_data_entries. But compute_per_episode() returns a list[dict[str, Any]], so episode_metrics here is a plain {metric_name: value} dict — this raises AttributeError: 'dict' object has no attribute 'num_episodes'. The try/except Exception in eval_runner.py:333 then catches it and logs a [WARNING], so the run looks fine but the JSONL ends up with only the meta header and zero episode rows.
You need a per-value sanitizer here, not the collection-level one. Suggest extracting the per-value branch out of metrics_to_plain_python_types (e.g. sanitize_metric_value(value) in metrics_logger.py) and reusing it in both places to avoid duplicating the numpy→JSON conversion:
| "outcomes": metrics_to_plain_python_types(episode_metrics), | |
| "outcomes": {name: sanitize_metric_value(value) for name, value in episode_metrics.items()}, |
(where sanitize_metric_value is the np.bool_/np.floating/np.integer/np.ndarray conversion currently inlined in metrics_to_plain_python_types). A smoke test exercising this path would also keep it from silently regressing behind the guard.
| header = { | ||
| "record": "meta", | ||
| "schema_version": SCHEMA_VERSION, | ||
| "slice": _slice_identity(job), |
There was a problem hiding this comment.
🟡 Question / single-slice — this header is written exactly once (empty-file guard), but main() appends every job to the same --episode_summary file. If jobs differ in slice or factors, only the first job's are captured and later jobs are silently folded under the wrong header. The _slice_identity comment notes "MNPE assumes a single source" — if that's a hard invariant, would it be worth asserting subsequent jobs match the recorded slice instead of relying on it silently?
| summary_row = { | ||
| "record": "episode", | ||
| "job_name": job.name, | ||
| "episode_idx": episode_idx, |
There was a problem hiding this comment.
🔵 Each rebuild calls write_episode_summaries with freshly re-sampled draws, and episode_idx restarts at 0 per rebuild — so (job_name, episode_idx) isn't unique across rebuilds (the per-rebuild count is only in the print, not the row). Worth stamping each row with rebuild_idx so the analyzer can disambiguate?
Summary
DRAFT Stacked on #779 (
alex/feature/variations_in_eval_runner). Produces the self-describingepisode_summary.jsonlthe sensitivity toolbox consumes, sourcing factor values from the variation system's realized draws.