Sensitivity: record variation draws + self-describing episode summaries by cvolkcvolk · Pull Request #788 · isaac-sim/IsaacLab-Arena

cvolkcvolk · 2026-06-12T13:31:28Z

Summary

DRAFT Stacked on #779 (alex/feature/variations_in_eval_runner). Produces the self-describing episode_summary.jsonl the sensitivity toolbox consumes, sourcing factor values from the variation system's realized draws.

Build-time variations now stash their sampled value (VariationBase.last_draw), and the builder exposes two collectors keyed by <asset>.<variation>: - get_enabled_variation_factor_schema(): the prior (type + range) per enabled variation, derived from its sampler (Uniform -> continuous, Choice -> categorical) - get_enabled_variation_draws(): this build's realized values These feed the sensitivity episode-summary writer so variation-sampled factors are recorded the same way static CLI factors already are. Run-time (per-reset) draws are not captured yet (would need a per-env buffer). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Opt-in --episode_summary writes a self-describing JSONL for sensitivity analysis: - A one-time meta header (run slice + enabled-variation factor schema + outcome names), then one episode row each carrying arena_env_args, realized variation_draws, and outcomes. - episode_writer reads per-episode outcomes via MetricsManager.compute_per_episode and the factor schema / draws off the env (attached in load_env). Written per rebuild. - The write is guarded so a JSONL failure can't flip a succeeded job to FAILED. - metrics.get_recorded_metric_data sorts demos by numeric index so per-episode ordering is correct at >= 10 episodes (h5py iterates demo_0, demo_1, demo_10, ... alphabetically). - Job carries arena_env_args_dict (the pre-CLI dict) for logging. The analysis-side reader (SensitivityDataset.from_file) lands with the analyzer. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

arena-review-bot

🤖 Isaac Lab-Arena Review Bot

Summary

This PR records realized build-time variation draws and emits a self-describing episode_summary.jsonl for downstream sensitivity analysis. The variation-recording plumbing (_record_draw/last_draw, the pure-Python variation_recording collectors) is clean and well-placed, and the numeric demo-sort fix in metrics.py is correct. However, there is a critical type mismatch in episode_writer.py that makes the whole feature silently write no episode rows — and because the call site swallows the exception, it fails as a [WARNING] rather than loudly. See Findings.

Findings

🔴 Critical: episode_writer.py:73 — metrics_to_plain_python_types expects a MetricsDataCollection (it reads .num_episodes and .metric_data_entries), but compute_per_episode() returns a list[dict[str, Any]] — so episode_metrics is a plain dict. The call raises AttributeError: 'dict' object has no attribute 'num_episodes' on the first episode, which the except Exception guard in eval_runner.py:333 turns into a [WARNING]. Net effect: the meta header is written but every episode row is dropped, silently, on every run. Detail below.

🟡 Warning: episode_writer.py:92 (run-header / single-slice) — the header is written once (empty-file guard) from whichever job/rebuild reaches the file first, but eval_runner.main() appends all jobs to the same --episode_summary path. If two jobs differ in slice or factor schema, only the first job's slice/factors are recorded and the rest are silently mixed in. Is one-file-per-slice an enforced invariant? If so, consider asserting later jobs match the header slice rather than relying on it.

🔵 Improvement: episode_writer.py:70 (rebuild identity) — rows carry episode_idx (restarting at 0 each rebuild) but no rebuild_idx, while write_episode_summaries is called once per rebuild with re-sampled draws. So (job_name, episode_idx) isn't unique across rebuilds. The draws/outcomes still differ per row, but would it help the reader (#729) to stamp each row with rebuild_idx?

Test Coverage

No tests are added. collect_variation_draws / collect_variation_factor_schema are pure-Python and trivially unit-testable (build a couple of variations with UniformSampler/ChoiceSampler, assert the schema/draws) without spinning up sim — worth covering since they're the contract the analyzer depends on. The metrics.py demo-ordering fix is a bug fix that ideally gets a regression test (a dataset with ≥10 demos asserting recorded order), and compute_per_episode has no coverage. Given the 🔴 above, even a smoke test that actually exercises write_episode_summaries end-to-end would have caught the broken feature.

Verdict

Significant concerns — the critical type mismatch means the feature doesn't currently record any episodes.

arena-review-bot · 2026-06-12T13:35:11Z

+                "episode_idx": episode_idx,
+                "arena_env_args": arena_env_args_snapshot,
+                "variation_draws": variation_draws,
+                "outcomes": metrics_to_plain_python_types(episode_metrics),


🔴 Critical — metrics_to_plain_python_types takes a MetricsDataCollection and immediately accesses metrics_data.num_episodes / .metric_data_entries. But compute_per_episode() returns a list[dict[str, Any]], so episode_metrics here is a plain {metric_name: value} dict — this raises AttributeError: 'dict' object has no attribute 'num_episodes'. The try/except Exception in eval_runner.py:333 then catches it and logs a [WARNING], so the run looks fine but the JSONL ends up with only the meta header and zero episode rows.

You need a per-value sanitizer here, not the collection-level one. Suggest extracting the per-value branch out of metrics_to_plain_python_types (e.g. sanitize_metric_value(value) in metrics_logger.py) and reusing it in both places to avoid duplicating the numpy→JSON conversion:

Suggested change

"outcomes": metrics_to_plain_python_types(episode_metrics),

"outcomes": {name: sanitize_metric_value(value) for name, value in episode_metrics.items()},

(where sanitize_metric_value is the np.bool_/np.floating/np.integer/np.ndarray conversion currently inlined in metrics_to_plain_python_types). A smoke test exercising this path would also keep it from silently regressing behind the guard.

arena-review-bot · 2026-06-12T13:35:11Z

+    header = {
+        "record": "meta",
+        "schema_version": SCHEMA_VERSION,
+        "slice": _slice_identity(job),


🟡 Question / single-slice — this header is written exactly once (empty-file guard), but main() appends every job to the same --episode_summary file. If jobs differ in slice or factors, only the first job's are captured and later jobs are silently folded under the wrong header. The _slice_identity comment notes "MNPE assumes a single source" — if that's a hard invariant, would it be worth asserting subsequent jobs match the recorded slice instead of relying on it silently?

arena-review-bot · 2026-06-12T13:35:11Z

+            summary_row = {
+                "record": "episode",
+                "job_name": job.name,
+                "episode_idx": episode_idx,


🔵 Each rebuild calls write_episode_summaries with freshly re-sampled draws, and episode_idx restarts at 0 per rebuild — so (job_name, episode_idx) isn't unique across rebuilds (the per-rebuild count is only in the print, not the row). Worth stamping each row with rebuild_idx so the analyzer can disambiguate?

cvolkcvolk added 2 commits June 12, 2026 15:30

arena-review-bot Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sensitivity: record variation draws + self-describing episode summaries#788

Sensitivity: record variation draws + self-describing episode summaries#788
cvolkcvolk wants to merge 2 commits into
alex/feature/variations_in_eval_runnerfrom
cvolk/feature/sensitivity_variation_recording

cvolkcvolk commented Jun 12, 2026 •

edited

Loading

Uh oh!

arena-review-bot Bot left a comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Uh oh!

arena-review-bot Bot Jun 12, 2026

Uh oh!

arena-review-bot Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	"outcomes": metrics_to_plain_python_types(episode_metrics),
	"outcomes": {name: sanitize_metric_value(value) for name, value in episode_metrics.items()},

Conversation

cvolkcvolk commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

arena-review-bot Bot left a comment

Choose a reason for hiding this comment

🤖 Isaac Lab-Arena Review Bot

Summary

Findings

Test Coverage

Verdict

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cvolkcvolk commented Jun 12, 2026 •

edited

Loading