Sensitivity analysis MVP: MNPE/NPE posterior by cvolkcvolk · Pull Request #729 · isaac-sim/IsaacLab-Arena

cvolkcvolk · 2026-05-27T13:07:33Z

Summary

Sensitivity analysis toolbox (MNPE/NPE) running on synthetic data. From an eval sweep's per-episode results, learn which environment conditions drive success: Fit a posterior over the varied factors, conditioned on the outcome, and render one summary figure.

Inspired by robolab, but the factors now come from a factors.yaml and are piped generically, allowing for arbitrary continuous/categorical factor mixes.

Detailed description

factors.yaml declares the varied factors and ranges. This file will be auto generated by the Variation System and could be moved into one time write into the same output file. For now its hand crafted.
SensitivitDataset genericatlly handling n dimensional factors, categorical and continous
SensitivityAnalyzer auto-selects MNPE (mixed continuous + categorical) or NPE (continuous-only), trains on the full (theta, x), and samples the joint posterior at a chosen observation. Continuous factors are normalized so factors on very different scales train on equal footing.
generate_report produces one figure, a density curve per continuous factor, a probability bar per categorical, saved by file extension (.png/.pdf).
A synthetic ground-truth simulator (synthetic.py)
python -m isaaclab_arena.analysis.sensitivity.synthetic --kind {mixed,continuous,rich} runs the whole pipeline in one command.

Next: Plug in real sim/ variation pipeline

Builds on the MVP-1 foundation (#729) with categorical factor support, a cleaner analyzer/plotting separation, and a tighter eval-side / analysis-side contract that drops a class of drift bugs. - Analyzer hierarchy (BaseAnalyzer / PosteriorAnalyzer / NPEAnalyzer / MNPEAnalyzer / EmpiricalAnalyzer) dispatched via make_analyzer. Pure- categorical schemas use empirical frequency analysis directly (under uniform prior the posterior is exactly the normalized per-category success rate); sbi MNPE 0.26 also requires at least one continuous theta column, which this dispatch handles automatically. - Split inference (analyzer.py) from rendering (plotting.py). Analyzers expose continuous_marginal_density and categorical_marginal_probs queries; plotting consumes them via plot_marginal. New plot types become additive (free functions) without touching the analyzer. - Drop --factor_keys CLI flag on eval_runner. The writer now logs the full arena_env_args per episode; the analyzer-side factors.yaml picks what to study. Removes the drift bug class where --factor_keys and factors.yaml could disagree. - Rename JSONL field "factors" -> "arena_env_args". Honest about provenance and leaves room for sibling source fields (future "sim_state" for MVP-3 reset-time snapshots, "variation_draws" for the variation system) without further wire-format changes. - Add synthetic_data_categorical.py smoke-test generator and rename synthetic_data.py -> synthetic_data_continuous.py for symmetry. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Adds a policy-sensitivity analysis stack under isaaclab_arena/analysis/ sensitivity/: a SensitivityDataset loader (factors.yaml + episode JSONL), NPE / MNPE / KDE / empirical analyzers (sbi-backed), continuous + categorical factor support with LogUniform priors, and an interactive Plotly HTML report. eval_runner gains an opt-in --episode_summary flag that appends one JSONL row per recorded episode (full arena_env_args dict + task outcomes); the analyzer decides which arena_env_args keys are factors via factors.yaml, so eval needs no knowledge of "factors". Job now carries arena_env_args_dict so the writer logs typed values. Adds sbi to dev deps. Driver scripts: analyze_sensitivity.py (single factor/outcome) and generate_sensitivity_report.py (full multi-factor HTML deliverable). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Paired (factors.yaml, jobs_config.json) sets for pi0 on the droid pick_and_place_maple_table task: * light_intensity_sweep — single continuous factor (light intensity) * pick_up_object_sweep — single categorical factor (object identity) * multi_factor_overnight_sweep — light_intensity (log-uniform) x 5 objects, num_episodes=4 * two_object_shiny_matte_sweep — focused 2-object contrast (matte mustard vs specular soup can) x log-uniform light, num_episodes=2 factors.yaml declares each factor's type/range/distribution for the analyzer; jobs configs are consumed by eval_runner --episode_summary. Use --chunk_size for the long sweeps to avoid host-RSS OOM. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

main's metrics refactor (#733) made cfg.metrics a MetricsCfg configclass (one field per metric) rather than an iterable of metric objects. Iterate its fields and use compute_metric_func/recorder_term_name/params, matching MetricsManager. Fixes 'MetricsCfg object is not iterable' that produced empty episode-summary JSONL. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

sbi logs TensorBoard training curves under <cwd>/sbi-logs by default (get_log_root hardcodes the cwd), so fitting raised PermissionError when the cwd wasn't writable — e.g. generating a report from a repo checkout in a non-root container. A one-shot report fit never reads those curves, so pass a no-op tracker (_NullTracker) that discards them: no files written, no hidden cwd dependency, runs from any directory. Centralize the tracker on the base by having subclasses name their sbi class via _inference_cls. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Grow the two-object light sweep to 500 rubiks_cube + 500 alphabet_soup_can (num_envs=2, num_episodes=2) for an overnight run with denser log-uniform light sampling. Update the factors.yaml header to match. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Simplify the deliverable to one PDF on disk showing the most important plots (robolab-style): an outcome x factor grid of marginal-posterior plots, fit one analyzer per outcome. Drop the involved Plotly HTML report (report.py + its CLI) — to be reintroduced in a follow-up PR. - plotting.py: split the renderers into draw_marginal(ax, ...) that draws onto a caller-supplied Axes; plot_marginal keeps its single-figure save behavior. - pdf_report.py: new generate_pdf_report() lays out the grid and saves one PDF. - generate_sensitivity_report.py: now drives the PDF (--output_pdf). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Split the report title across two lines (report+episodes / slice) and widen the top margin so it doesn't clip when the grid is a single column. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Scope PR #729 to the MVP: sensitivity to a single light_intensity factor on one object. Remove the multi-factor (multi_factor_overnight) and multi-object (two_object_shiny_matte, pick_up_object) sweep configs — including two 22k-line job configs — preserved on branch cvolk/feature/sensitivity_large_sweep_configs for the larger overnight runs. Keep only the single-factor light_intensity configs. Also drop the stale --factor_keys reference (no such flag; the writer logs the full arena_env_args). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Collapse the multi-line explanatory comment blocks to 1-2 dense lines, keeping the non-obvious 'why' (log-space transforms, the step-count/len gotcha, the deferred pxr import) and dropping the over-explanation. No code changes. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Drop log_uniform from the first PR: the FactorSpec.distribution field, the YAML parsing/validation, the log10 theta transform + log-space prior in dataset, the log-grid handling in NPE/KDE marginals, and the log x-axis in plotting. The MVP sweeps linearly; analyzers and plotting are unchanged for linear factors. The full log_uniform implementation is preserved on branch cvolk/feature/sensitivity_log_uniform for a follow-up PR. Verified post-removal: KDE, MNPE+categorical, and NPE all fit and render. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

…tracker - Remove the verbose module-level docstrings across the sensitivity package; the two synthetic-data generators and the two CLI scripts now pass a short literal argparse description instead of `description=__doc__`. - Remove the `_NullTracker` workaround. With in-container runs no longer executing as root, sbi's default tracker writes a (gitignored) `sbi-logs/` owned by the user, so the PermissionError it guarded against no longer occurs. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

…nalyzers - Make `EmpiricalAnalyzer` an abstract base for the two direct (non-neural) analyzers, which estimate the posterior straight from data under a uniform prior. Rename the concrete categorical analyzer to `FrequencyTableAnalyzer` and reparent `KDEAnalyzer` beneath the same base. Both now share a named `SUCCESS_THRESHOLD` and a `_success_mask()` helper instead of inlining the `>= 0.5` success test. - Update `make_analyzer` dispatch and the plotting/docstring references; fix a stale claim that only `PosteriorAnalyzer` provides `continuous_marginal_density` (KDEAnalyzer does too). - Drop `v0.3`/`MVP-1` wording from the analyzer docstrings, keeping the substantive uniform-prior and binary-outcome assumptions. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

- Break the monolithic analyzer.py into analyzer_base, posterior_analyzer, and empirical_analyzer modules along the neural/empirical family seam. - Move the make_analyzer dispatch into factory.py, re-exported from the package __init__; lazy concrete imports keep package import free of torch/sbi so the eval-time episode_writer path stays light. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

They're standalone smoke-test tools — not part of the runtime pipeline — so they don't belong in the production analysis namespace. Relocated to the test-helper package, ready to back a sim-free analyzer regression test. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Relocate analyze_sensitivity.py / generate_sensitivity_report.py from scripts/ into analysis/sensitivity/ as analyze.py / generate_report.py, mirroring how eval_runner/policy_runner live flat inside the evaluation package. Drops the redundant "sensitivity" prefix now that the package name carries it. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

- Ship one analyzer per family — KDEAnalyzer (empirical) and MNPEAnalyzer (the sbi robolab port) — keeping the reviewable surface small while still demonstrating the multi-analyzer design across both the empirical and neural families. - Park NPEAnalyzer, FrequencyTableAnalyzer, and the now-orphaned categorical synthetic data generator on cvolk/feature/sensitivity_deferred_analyzers to bring in later. - Guard the deferred factor mixes in make_analyzer with clear asserts pointing at that branch: pure-categorical → FrequencyTable; multi-continuous or non-binary → NPE. - Keep the PosteriorAnalyzer/EmpiricalAnalyzer family bases as the extension points the parked siblings re-attach to; drop the now-unused binary-outcome warn hook. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Drop git-branch references and the "robolab" attribution from the make_analyzer asserts and the analyzer docstrings; state the unsupported factor mixes plainly. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

episode_writer is eval-time data production — it runs inside the eval loop, depends on the metrics/evaluation machinery, and is called only by eval_runner. It has no coupling to the analysis code (the analyzer consumes the JSONL purely as a format). Relocating it beside its caller frees the sensitivity package of any pxr/sim dependency. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

The analyze.py CLI rendered one (factor, outcome) marginal to a PNG — a strict subset of the outcome × factor grid that generate_report already produces. Remove it along with the now-unused plot_marginal/_plot_title/_save_figure helpers, leaving draw_marginal (used by the PDF report) as the single rendering path. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

task_duration was a synthetic-only continuous outcome (no matching registered metric) that existed to exercise NPE. With NPE deferred, a single continuous factor with a non-binary outcome now asserts, so emitting it made generate_report crash on the smoke dataset. The generator now emits only the binary success/object-moved outcomes the MVP analyzes. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Address PR review: change the copyright headers on the sensitivity package files (plus the moved episode_writer and the synthetic generator) from 2025-2026 to 2026, matching the --use-current-year convention used by new files in the repo. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Drop the /eval/ ignore added earlier; leave .gitignore as on main. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Keep #729 focused on the analysis toolbox (analysis/sensitivity + tests + docs), which is CPU-only and validated on synthetic data. The eval-side recording that produces episode_summary.jsonl — episode_writer.py plus the eval_runner/CLI/job_manager/metrics hooks — moves to a stacked follow-up PR where it can land with proper simulation tests. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

These are imported at module level by analysis/sensitivity, so they belong in install_requires, not the [dev] extra — otherwise importing the package fails without [dev]. The Docker image already installs via pip install -e .[dev], which pulls install_requires too, so the image is unchanged; [dev] now holds only dev tools. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

nargs='*' let '--observation' be passed with zero values, producing an empty tensor that reaches sbi as a cryptic shape mismatch. nargs='+' makes argparse reject it clearly; omitting the flag still falls back to the default observation (success). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Show only the 'rich' run (3 continuous + 2 categorical) and mention the other --kind values in a line, instead of three near-identical commands. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

greptile-apps · 2026-06-11T14:56:14Z

+        self._continuous_low = torch.tensor([factor.range[0][0] for factor in continuous_factors])
+        self._continuous_high = torch.tensor([factor.range[0][1] for factor in continuous_factors])


factor.range indexed without a None guard in __init__

Both list comprehensions on lines 38–39 call factor.range[0][0] and factor.range[0][1] directly. FactorSpec.range defaults to None, and the in-memory constructor path (SensitivityDataset(schema, theta, x)) has no assertion that continuous factors carry a range before SensitivityAnalyzer is constructed. If a user builds a schema in-memory with FactorSpec(name="x", type="continuous") (omitting range), the constructor crashes with TypeError: 'NoneType' object is not subscriptable rather than a legible assertion. The from_files path is safe because _infer_missing_factor_ranges fills the range first, but the in-memory path has no equivalent guard here.

'rich' wasn't descriptive and duplicated the MNPE case. Fold it into make_mixed_dataset (now 3 continuous + 2 categorical — 'mixed' = mixed factor types) and drop make_rich_dataset. make_continuous_dataset stays for the NPE path. Two builders, one per estimator, both named for what they are. The MNPE test now asserts all five planted effects; --kind is {mixed, continuous}. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

cvolkcvolk · 2026-06-11T15:36:09Z

+**episode_summary.jsonl** is produced by the eval runner — one JSON object per episode. It
+carries every measured outcome; the analysis picks which one(s) to condition on:
+
+.. code-block:: json


Note to only add a TODO, this is not yet in this version

cvolkcvolk · 2026-06-12T09:00:05Z

/review

arena-review-bot

🤖 Isaac Lab-Arena Review Bot

Summary

Adds a self-contained sensitivity-analysis toolbox (isaaclab_arena/analysis/sensitivity/) that fits an sbi neural posterior (MNPE/NPE) over swept factors conditioned on an outcome and renders one marginals figure. The layering is clean — it's a CPU-only analysis layer over eval JSONL with no sim coupling — the typed FactorSpec/FactorSchema design is nice, and the synthetic-ground-truth tests cover both estimator paths. A couple of points below worth resolving before merge.

Design, Boundaries & Scope

The toolbox is an optional, narrowly-used MVP, but sbi (plus scipy, matplotlib) is now in the core RUNTIME_DEPS, so every install of isaaclab_arena pulls the full SBI stack (pyro/pyknos/…) whether or not anyone runs a report. Per Arena's lean-by-default / conservative-defaults stance, would it be better to make the whole analysis.sensitivity subpackage an optional extra (extras_require={"analysis": ["sbi", "scipy", "matplotlib"]}) and defer the top-level sbi/matplotlib/scipy imports so the core stays lean and an analysis-free install doesn't carry the weight? See the inline note on setup.py.

Findings

🟡 Warning: analyzer.py:73 / generate_report.py — The report path is not seeded, so it isn't reproducible.
🔵 Improvement: dataset.py:220 — SensitivityDataset.prior (and its BoxUniform import) appears to be dead code.
🟡 Warning: setup.py:20 — heavy sbi dependency added to core runtime deps (see Design section).

Test Coverage

Good. test_sensitivity_analysis.py builds in-memory synthetic datasets with a known planted relationship and asserts the posterior recovers it, covering both the MNPE (mixed) and NPE (continuous-only) paths with substantive statistical assertions. These are pure CPU tests with no Isaac Sim dependency, so the inner/outer run_simulation_app_function pattern correctly does not apply and they land in Phase 1. One thing to confirm: the tests set torch.manual_seed(0) before fit(), so they rely on sbi training being deterministic under that seed — if it proves version-sensitive in CI, the statistical assertions could become flaky.

Verdict

Minor fixes needed

arena-review-bot · 2026-06-12T09:03:34Z

+        denormalized[:, : self._num_continuous] = theta[:, : self._num_continuous] * span + self._continuous_low
+        return denormalized
+
+    def fit(self, training_batch_size: int = 50):


🟡 Reproducibility: fit() trains a neural estimator and sample_posterior() draws 5000 samples — both consume the global torch RNG, but nothing in the report path seeds it. The tests only pass because they call torch.manual_seed(0) themselves; generate_report(...) and the synthetic.py --kind ... CLI set no seed, so running the same episode_summary.jsonl twice yields different posteriors and a different figure. For an eval/benchmark library that's a reproducibility gap.

Could we thread a seed through generate_report / the CLI (and ideally the analyzer) and set it via the existing isaaclab_arena.utils.random.set_seed, plus log the seed used so a run can be reproduced?

arena-review-bot · 2026-06-12T09:03:34Z

+        return any(factor.type == "categorical" for factor in self.schema.factors)
+
+    @property
+    def prior(self):


🔵 This prior property doesn't seem to be used anywhere — SensitivityAnalyzer builds its own _normalized_prior() (over the normalized [0,1] continuous dims) and never reads dataset.prior. BoxUniform is imported (line 15) only for this property. Is it needed, or can the property and the import be dropped?

@cvolkcvolk could be true?

Yep — I traced it again to be sure: the only plausible reader is the analyzer, and it builds _normalized_prior() itself rather than calling dataset.prior. So both the property and the BoxUniform import look safe to drop unless you're planning to wire it in soon. Happy to be wrong if there's a downstream consumer I missed.

arena-review-bot · 2026-06-12T09:03:34Z

    "pydantic>=2.0",
    "openai>=2.0",
+    # Sensitivity analysis (isaaclab_arena.analysis.sensitivity), imported at module level.
+    "sbi",


🟡 This puts the full sbi stack (pyro, pyknos, …), plus scipy/matplotlib, into the core RUNTIME_DEPS, so every isaaclab_arena install pulls it even when no one runs a sensitivity report. Since analysis.sensitivity is an optional MVP, would an extra be cleaner?

Suggested change

"sbi",

"openai>=2.0",

...and move these to extras_require (e.g. "analysis": ["sbi", "scipy", "matplotlib"]), making the top-level imports in the analysis.sensitivity modules lazy so an analysis-free install stays lean. What do you think?

We would want to run the sensitivity analysis in one go witht the simulation runs eventually

Concern still stands, I think: as written every isaaclab_arena install pulls the full sbi stack + scipy/matplotlib just to import the package, even for users who never touch sensitivity analysis. Concretely: move sbi/scipy/matplotlib to extras_require under an "analysis" key and make the top-level imports in analysis.sensitivity lazy, so an analysis-free install stays lean and pip install isaaclab_arena[analysis] opts in. Happy to be wrong if there's a reason these need to be core — is there?

Got it (test received 👍). Whenever you have a moment — any thoughts on moving the sbi/scipy/matplotlib stack into an analysis extra and making the analysis.sensitivity imports lazy, so an analysis-free install stays lean?

arena-review-bot

🤖 Isaac Lab-Arena Review Bot

Summary

Adds a self-contained sensitivity-analysis toolbox (isaaclab_arena/analysis/sensitivity/) that fits an MNPE/NPE posterior over swept factors conditioned on an outcome and renders a marginals figure, plus a synthetic ground-truth simulator and end-to-end tests. The module is cleanly layered — pure-Python, sim-agnostic, no coupling into the Scene/Embodiment/Task primitives — and the synthetic-data tests are a nice way to validate recovery of planted effects on CPU. My main concern is the dependency footprint added to the core install; the rest are small cleanups.

Design, Boundaries & Scope

The one scope concern is setup.py: sbi, scipy, and matplotlib are added to the core RUNTIME_DEPS, so every Arena install now pulls them in (sbi in particular drags in a large stack: pyro, nflows/zuko, etc.) for what is an opt-in analysis MVP that nothing in the core import path touches. extras_require already exists (dev). Could these move to an analysis extra so the default install stays lean? See the inline note.

Findings

See inline comments. In short: heavy deps added to the core install (🟡), an unused prior property (🔵), full estimator training in the default test phase (🔵), and docs describing an eval-runner integration that doesn't exist yet (🔵).

Test Coverage

Good: both estimator paths (MNPE for the mixed schema, NPE for continuous-only) are exercised against a known ground truth with substantive assertions on the recovered posterior, and seeds are set for reproducibility. No sim is involved, so the inner/outer run_simulation_app_function pattern correctly doesn't apply and the unmarked tests land in Phase 1 as intended. One caveat on test cost noted inline.

Verdict

Minor fixes needed

arena-review-bot · 2026-06-12T13:22:06Z

    "pydantic>=2.0",
    "openai>=2.0",
+    # Sensitivity analysis (isaaclab_arena.analysis.sensitivity), imported at module level.
+    "sbi",


🟡 Warning — this puts sbi, scipy, and matplotlib in the core RUNTIME_DEPS, so every isaaclab_arena install now pulls them in. sbi especially is heavy (it drags in pyro, nflows/zuko, a large ML stack) and nothing in the core import path imports analysis.sensitivity — it's an opt-in MVP. Could we keep the default install lean by making this an extra instead? There's already an extras_require block to extend:

extras_require={ "dev": DEV_DEPS, "analysis": ["sbi", "scipy", "matplotlib"], },

Then pip install isaaclab_arena[analysis] enables the toolbox without taxing everyone else.

arena-review-bot · 2026-06-12T13:22:06Z

+        return any(factor.type == "categorical" for factor in self.schema.factors)
+
+    @property
+    def prior(self):


🔵 Improvement — this prior property doesn't appear to be used anywhere: the analyzer builds its own _normalized_prior() (it needs the prior in normalized [0,1] space, not the original ranges this returns), and nothing else reads dataset.prior. Looks like dead code — remove it, or if it's meant to be the analyzer's source of truth, wire it in there so the two priors can't drift apart?

arena-review-bot · 2026-06-12T13:22:06Z

+def test_mnpe_recovers_all_planted_effects():
+    """Mixed continuous + categorical (MNPE): recover every planted effect at once."""
+    dataset = make_mixed_dataset(seed=0)
+    analyzer = SensitivityAnalyzer(dataset)


🔵 Improvement — these tests train a full MNPE estimator on 3000 episodes (and NPE on 2000 in the other test). With no marker they run in Phase 1 (the default suite on every PR), and neural-posterior training on CPU isn't cheap. Do the planted effects still recover at a smaller num_episodes (and a capped max_num_epochs on fit)? Trimming it would keep the default phase fast without losing the ground-truth check.

arena-review-bot · 2026-06-12T13:22:06Z

+
+The toolbox is a thin analysis layer over `sbi <https://sbi.readthedocs.io>`_'s
+neural posterior estimators. The flow is:
+


🔵 Improvement — this (and "episode_summary.jsonl is produced by the eval runner" below) describes episode_writer / episode_summary.jsonl in the present tense, but I couldn't find either in the repo — the PR runs on synthetic data and the eval-runner integration is still "Next." Worth marking this step as planned (e.g. "will append" / a note) so readers don't go looking for a writer that isn't there yet.

greptile-apps · 2026-06-12T13:24:23Z

    "pydantic>=2.0",
    "openai>=2.0",
+    # Sensitivity analysis (isaaclab_arena.analysis.sensitivity), imported at module level.
+    "sbi",


Missing version lower-bound for sbi — MNPE unavailable before v0.25.0

MNPE was introduced in sbi v0.25.0 (see changelog). With the current unpinned "sbi" spec, pip install can resolve to any older release; the first line of analyzer.py — from sbi.inference import MNPE, NPE — then raises ImportError: cannot import name 'MNPE' from 'sbi.inference' for every user who doesn't already have a recent sbi installed. Changing the spec to "sbi>=0.25.0" makes the constraint explicit and ensures the MNPE class is always present.

Suggested change

"sbi",

"sbi>=0.25.0",

alexmillane

Great work. Super clean.

I have a bunch of nits, but merge away when you see fit.

alexmillane · 2026-06-12T11:59:37Z

+The sensitivity-analysis toolbox answers a single question about a policy:
+*which environment conditions drive success?* Given the per-episode results of an
+evaluation sweep — where factors such as lighting, object mass, or table material were
+varied — it fits a posterior over those factors conditioned on the outcome and renders


Maybe we should define what the posterior is in this case? Does the posterior mean that we have a prior integrated?

alexmillane · 2026-06-12T12:02:15Z

+- **Factors interact.** How much light a policy needs can depend on the object — a matte
+  object may succeed at low light while a shiny one needs far more. A per-factor
+  "success vs light" curve averages over objects and reports one blurry gate that is wrong
+  for both. The joint posterior keeps the interaction, so you can condition on a specific
+  object and see its gate.
+- **Factors confound each other.** If bright-light episodes also happened to use an easy
+  object, a per-factor light chart cannot tell which one drove success. Modelling all
+  factors together attributes the effect to the factor that actually carries it.


Out of interest, Do I understand correctly that the main reason to do this analysis is to capture a multi-variate distribution. Does the fact that we're modelling a posterior come into play here separately from the multivariate aspect?

alexmillane · 2026-06-12T12:46:17Z

+neural posterior estimators. The flow is:
+
+1. **Per-episode recording.** During evaluation, ``episode_writer`` appends one row per
+   episode to an ``episode_summary.jsonl`` file.


jsonl is correct?

having read on, yes this is correct.

alexmillane · 2026-06-12T12:47:06Z

+2. **Schema.** A ``factors.yaml`` declares the *factors* — which ``arena_env_args`` columns
+   were varied and whether each is continuous or categorical, plus the continuous ranges
+   that were swept (so the analyzer's prior matches the simulation). It does **not** list
+   outcomes — *which* outcome to condition on is chosen at analysis time, not saved here.


outcome (e.g. success-rate)

alexmillane · 2026-06-12T12:47:47Z

+   that were swept (so the analyzer's prior matches the simulation). It does **not** list
+   outcomes — *which* outcome to condition on is chosen at analysis time, not saved here.
+3. **Inference.** ``SensitivityAnalyzer`` loads the pair, trains an estimator on the full
+   ``(theta, x)`` jointly, and samples the joint posterior conditioned on a chosen


theta and x not introduced above. I guess this is varied parameter and the outcome?

alexmillane · 2026-06-12T13:21:14Z

+        return any(factor.type == "categorical" for factor in self.schema.factors)
+
+    @property
+    def prior(self):


@cvolkcvolk could be true?

alexmillane · 2026-06-12T13:27:20Z

We dont test any parsing of the input files (schema/dataset). Suggest to add a short test on that.

alexmillane · 2026-06-12T13:33:52Z

+        factors_yaml: str | Path,
+        jsonl_path: str | Path,


Is there a reason to store these in separate files? Does the jsonlstores the factors schema, the values of the factors taken at each experiment, and the outcomes?

It's a nit though. Happy to keep this as is. We could choose later on down the track to simplify to a single file. Let's go with this for now.

alexmillane · 2026-06-12T13:35:00Z

What do you think about moving this to isaaclab_arena_examples. It's more of a script demoing a feature in the core library, rather than part of the core library.

alexmillane · 2026-06-12T13:40:56Z

+- The analysis assumes the ``episode_summary.jsonl`` is a single coherent slice — one
+  policy, task, and embodiment. **TODO:** add a filter (in the spirit of robolab's
+  ``--filter-policy`` / ``--filter-task``) to select that slice from a larger JSONL,
+  rather than relying on the caller to pre-filter it.


Could be worth including some little example files that are used in the analysis. I.e. the factors yaml and outcome jsonl.

This comment was marked as outdated.

Sign in to view

cvolkcvolk changed the title ~~MVP-1: per-episode sensitivity recording + NPE analyzer~~ Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers May 28, 2026

cvolkcvolk force-pushed the cvolk/feature/sensitivity_analysis_mvp1 branch from 38baa56 to 74585f1 Compare May 28, 2026 15:36

cvolkcvolk added 2 commits June 3, 2026 17:03

cvolkcvolk force-pushed the cvolk/feature/sensitivity_analysis_mvp1 branch from 74585f1 to 48fba5d Compare June 3, 2026 15:38

cvolkcvolk added 9 commits June 4, 2026 10:23

Fix PDF suptitle clipping on narrow single-factor figures

730a1db

Split the report title across two lines (report+episodes / slice) and widen the top margin so it doesn't clip when the grid is a single column. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Merge branch 'main' into cvolk/feature/sensitivity_analysis_mvp1

7ac550a

cvolkcvolk changed the title ~~Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers~~ Sensitivity analysis MVP: per-episode recording, analyzers, single-PDF report Jun 5, 2026

cvolkcvolk commented Jun 8, 2026

View reviewed changes

Comment thread isaaclab_arena/analysis/sensitivity/analyzer.py Outdated

cvolkcvolk commented Jun 8, 2026

View reviewed changes

Comment thread isaaclab_arena/analysis/sensitivity/__init__.py

cvolkcvolk added 10 commits June 8, 2026 14:16

Make sensitivity factory asserts and docstrings self-contained

f4a3fad

Drop git-branch references and the "robolab" attribution from the make_analyzer asserts and the analyzer docstrings; state the unsupported factor mixes plainly. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

cvolkcvolk added 2 commits June 11, 2026 15:44

Revert .gitignore change

2ee03e4

Drop the /eval/ ignore added earlier; leave .gitignore as on main. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

cvolkcvolk mentioned this pull request Jun 11, 2026

Per-episode sensitivity recording (eval + metrics) #781

Open

Merge branch 'main' into cvolk/feature/sensitivity_analysis_mvp1

fe4cad6

cvolkcvolk commented Jun 11, 2026

View reviewed changes

Comment thread docs/pages/concepts/policy/concept_sensitivity_analysis.rst

cvolkcvolk marked this pull request as ready for review June 11, 2026 14:04

cvolkcvolk requested review from alexmillane, peterd-NV, qianl-nv, viiik-inside, xyao-nv and zhx06 as code owners June 11, 2026 14:04

This comment was marked as outdated.

Sign in to view

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread isaaclab_arena/analysis/sensitivity/generate_report.py

Comment thread isaaclab_arena/analysis/sensitivity/analyzer.py

Comment thread isaaclab_arena/analysis/sensitivity/plotting.py

cvolkcvolk added 2 commits June 11, 2026 16:24

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread isaaclab_arena/analysis/sensitivity/generate_report.py

Docs: collapse the synthetic demo to one example

88ff6b8

Show only the 'rich' run (3 continuous + 2 categorical) and mention the other --kind values in a line, instead of three near-identical commands. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread isaaclab_arena/analysis/sensitivity/dataset.py

cvolkcvolk commented Jun 11, 2026

View reviewed changes

cvolkcvolk changed the title ~~Sensitivity analysis MVP: MNPE/NPE posterior report~~ Sensitivity analysis MVP: MNPE/NPE posterior Jun 11, 2026

arena-review-bot Bot reviewed Jun 12, 2026

View reviewed changes

Merge branch 'main' into cvolk/feature/sensitivity_analysis_mvp1

f0afbaf

arena-review-bot Bot reviewed Jun 12, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 12, 2026

View reviewed changes

cvolkcvolk mentioned this pull request Jun 12, 2026

Sensitivity: record variation draws + self-describing episode summaries #788

Draft

alexmillane approved these changes Jun 12, 2026

View reviewed changes

		self._continuous_low = torch.tensor([factor.range[0][0] for factor in continuous_factors])
		self._continuous_high = torch.tensor([factor.range[0][1] for factor in continuous_factors])


		The toolbox is a thin analysis layer over `sbi <https://sbi.readthedocs.io>`_'s
		neural posterior estimators. The flow is:

Conversation

cvolkcvolk commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Detailed description

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cvolkcvolk commented Jun 12, 2026

Uh oh!

arena-review-bot Bot left a comment

Choose a reason for hiding this comment

🤖 Isaac Lab-Arena Review Bot

Summary

Design, Boundaries & Scope

Findings

Test Coverage

Verdict

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot left a comment

Choose a reason for hiding this comment

🤖 Isaac Lab-Arena Review Bot

Summary

Design, Boundaries & Scope

Findings

Test Coverage

Verdict

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

arena-review-bot Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 12, 2026

Choose a reason for hiding this comment

cvolkcvolk commented May 27, 2026 •

edited

Loading