Skip to content

Incorporate statistical significance testing to benchmarks#614

Open
Micky774 wants to merge 2 commits into
devfrom
zain/bench/stats
Open

Incorporate statistical significance testing to benchmarks#614
Micky774 wants to merge 2 commits into
devfrom
zain/bench/stats

Conversation

@Micky774

@Micky774 Micky774 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Description

This PR adds support for @Arech8's benchstats repo for general significance testing (primarily using the Brunner Munzel test). This is important for distinguishing large-but-noisy and small-but-significant changes.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

New: compare_results.py --stats mode

  • Compares two per-sample CSVs (from --csv-samples) using a statistical test (Brunner-Munzel by default) via the benchstats package, instead of point-estimate ratios.
  • Marks each (config, label) as faster/slower/not-significant; exits 1 on a significant timing regression for CI gating.
  • New flags: --alpha (default 0.001), --method, --always-show-pvalues, --export-to (.txt/.svg/.html).
  • Reports time (main metric, drives exit code) alongside throughput/bandwidth (TFLOPS / GB/s) as a secondary metric.

New: parser_TEsamples.py

  • benchstats parser turning the samples CSV into the {benchmark: {metric: ndarray}} structure.
  • Exposes time_s (always) and a throughput metric keyed by unit; warns when <10 samples.
  • Drops throughput-less composite rows (e.g. Forward+Backward) so the metric set stays uniform.

utils.py — sampling & throughput

  • New --min-samples N flag (default 12) wired globally via run_benchmarks so all benchmark scripts inherit it without edits.
  • time_func tops up torch autorange shortfalls with additional equal-sized timing blocks (new _RawSamples measurement stand-in) instead of re-averaging.
  • Samples CSV gains throughput and unit columns; per-sample throughput derived as thr_mean * ms_mean / sample_ms.
  • Fix: time_ms now uses already-per-run sample times (removed the erroneous extra / number_per_run division).

Supporting

  • New requirements.txt pinning benchstats>=3.4.
  • README.md documents --stats, --min-samples, and the samples-CSV columns.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Micky774 and others added 2 commits June 5, 2026 16:31
Adds a statistically-rigorous comparison path to the microbenchmark suite,
on top of the existing throughput-ratio comparison.

- utils.py: time_func gains a repetitions param; --repetitions flag (default
  15) threaded through run_benchmarks so all bench scripts collect enough
  per-sample data with no per-script edits. repetitions<=1 preserves the
  original single-measurement behavior.
- parser_TEsamples.py: benchstats ParserBase reading the --csv-samples CSV
  into {bench_name: {time_ms: ndarray}}.
- compare_results.py: new --stats mode running a Brunner-Munzel test (via
  benchstats) on two samples CSVs; exits 1 on a significant regression for
  CI gating. Default ratio path unchanged; benchstats lazy-imported.
- requirements.txt: benchstats>=3.4.
- README: document the repetitions knob and --stats workflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refines the benchstats integration:

- compare_results.py --stats now reports calculated throughput/bandwidth
  (TFLOPS / GB/s) alongside timing. Time (seconds) is the main metric and
  drives the exit code; throughput is a secondary metric. The throughput
  unit suffix is blanked (the column header names the unit) and time is fed
  in seconds so benchstats' auto-scaling (ms/us/ns) is correct.
- utils.py: --csv-samples now also writes per-sample throughput + unit
  columns, derived from the sample time (blank for samples-only records).
- parser_TEsamples.py exposes time_s (main) and a unit-keyed throughput
  metric, dropping throughput-less composite rows so the metric set stays
  uniform for benchstats' renderer.
- utils.py: replace --repetitions with --min-samples (default 12). Instead
  of re-running autorange and averaging, time_func keeps the raw per-block
  samples and tops up any shortfall with additional equal-sized blocks.
  Also fix the samples writer to not re-divide already-per-run times.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Arech8

Arech8 commented Jun 9, 2026

Copy link
Copy Markdown

Wow! I'm super happy to see benchstats is used in the wild, especially in such a deep and well done integration! Thanks for implementing that and pinging me, Meekail!

I'm still a long way to write some blog post/presentation about reliable benchmarking, but can share some things to consider for the future, that are very helpful to combat the noise. Noise is going to be a primary thing affecting comparison results after the correct results comparison procedure, like you did, is in place:

  • unless the code to benchmark runs inside a tight loop in the production setting, it's very beneficial to randomly interleave timed executions of different benchmarks. I.e, if A, B and C are single benchmarked runs of different functions, then if in production setting it doesn't run as A-A-A-... and so on, i.e. in a tight loop, it's also not correct to measure it as A-A-A-..., B-B-B..., C-C-C... First of all, data and instruction caches affect results quite heavily and second - a concurrent noise occuring during one of A-A-A... and not occuring during B-B-B (which is what typically happens even on local machines without other tenants) will skew the results heavily in favor of B. A better approach is to run individual functions in randomly interleaved order, like C-A-B-B-A-B-C-B-.... this helps a lot to even out the noise. The problem is that none of the benchmark runners I know of are doing that, except of my own from benchstats.qbench module... The runner you're using IIUC is Triton's benchmark module, which essentially unfolds to the same quite biased A-A-A-..., B-B-B..., C-C-C...

    The benchmark runner in benchstats.qbench module does interleaved running by default, however, it might be a bit problematic to reuse it directly for your setting, since that runner currently doesn't have a mode to let a caller chose how to time execution. Instead it uses perf_counter_ns() wall clock time, which isn't always what you want when measuring kernels performance (and have to carefully enforce synchronization for inputs and outputs in an async setting). I have a long standing plan to fix that, but since I primarily work with JAX, there's a whole load of other issues leading me to postpone this plan until better times... PyTorch/Triton combination is more flexible and it's easy to extract a kernel runtime from it, I should expedite implementation of that plan...

  • If there should be no serial dependency in run times of code A (i.e. if there are no internal caches, non-deterministic behaviour and so on, and if A1 is the first run and A2 is the second run, then all differences in runtimes should only happen due to noise), then any run Ai is expected to be no better or no worse then any other run Aj, right? Then it's beneficial to randomly permute obtained run-times and compute PValue for it, then repeat the permutation and compute PValue again, and so on - to gather a set of P-Values. Then judge a statistical significance by "any/none" criteria, i.e. "if none of PValues show significance, assume there's really no significance, otherwise there's significance or results aren't reliable". This helps to strengthen outcomes and make them way less prone to noise again (provided that the assumption about absence of dependencies, which are greatly diminished by the prev. bullet point, holds). This "P-Value bootstrapping" is also implemented in the benchmark runner in benchstats.qbench by default... Damn... The ability to reuse it could shrink your integration code & effort in the order of magnitude and turn it into a very concise integration like here I use for benchmarking kernels in jax-triton.... I need to fix it...

And just a note about process: for a non-parametric test that make no assumptions about underlying data distribution, such as Brunner Munzel Test (and in practice this also generally works with Mann–Whitney U test, but never with T-Test or similar), it's safe to use a tiny number of iterations, like 2-3-4 (as long as times are aggregated within a repetition with mean/average, which is the only correct choice almost always). Maybe even 1 would work nicely, I just only tested them up to 2 😁 . Then even with 100-200 repetitions, a total execution budget would be still rather small, but sensitivity of the test will be very high.

@Micky774

Micky774 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Wow! I'm super happy to see benchstats is used in the wild, especially in such a deep and well done integration! Thanks for implementing that and pinging me, Meekail!

It's a great project, and I appreciate your advocacy for statistical testing of benchmark results -- I think it's important for a healthy ecosystem!

  • unless the code to benchmark runs inside a tight loop in the production setting, it's very beneficial to randomly interleave timed executions of different benchmarks. I.e, if A, B and C are single benchmarked runs of different functions, then if in production setting it doesn't run as A-A-A-... and so on, i.e. in a tight loop, it's also not correct to measure it as A-A-A-..., B-B-B..., C-C-C... First of all, data and instruction caches affect results quite heavily and second - a concurrent noise occuring during one of A-A-A... and not occuring during B-B-B (which is what typically happens even on local machines without other tenants) will skew the results heavily in favor of B. A better approach is to run individual functions in randomly interleaved order, like C-A-B-B-A-B-C-B-.... this helps a lot to even out the noise. The problem is that none of the benchmark runners I know of are doing that, except of my own from benchstats.qbench module... The runner you're using IIUC is Triton's benchmark module, which essentially unfolds to the same quite biased A-A-A-..., B-B-B..., C-C-C...

Right, specifically the cache invalidation was a feature that I planned on introducing afterwards, but the interleaving having the advantage of mitigating local machine-level noise is a great point that isn't easily solved without interleaving. Unfortunately interleaving is a non-trivial addition and breaks our current benchmarking flow, but IMO it is worth trying to figure out how to implement on our end.

The benchmark runner in benchstats.qbench module does interleaved running by default, however, it might be a bit problematic to reuse it directly for your setting, since that runner currently doesn't have a mode to let a caller chose how to time execution. Instead it uses perf_counter_ns() wall clock time, which isn't always what you want when measuring kernels performance (and have to carefully enforce synchronization for inputs and outputs in an async setting). I have a long standing plan to fix that, but since I primarily work with JAX, there's a whole load of other issues leading me to postpone this plan until better times... PyTorch/Triton combination is more flexible and it's easy to extract a kernel runtime from it, I should expedite implementation of that plan...

I'd love to see such an update, but as you mentioned right now we're soft-locked onto the pytorch timing solution.

  • If there should be no serial dependency in run times of code A (i.e. if there are no internal caches, non-deterministic behaviour and so on, and if A1 is the first run and A2 is the second run, then all differences in runtimes should only happen due to noise), then any run Ai is expected to be no better or no worse then any other run Aj, right? Then it's beneficial to randomly permute obtained run-times and compute PValue for it, then repeat the permutation and compute PValue again, and so on - to gather a set of P-Values. Then judge a statistical significance by "any/none" criteria, i.e. "if none of PValues show significance, assume there's really no significance, otherwise there's significance or results aren't reliable". This helps to strengthen outcomes and make them way less prone to noise again (provided that the assumption about absence of dependencies, which are greatly diminished by the prev. bullet point, holds). This "P-Value bootstrapping" is also implemented in the benchmark runner in benchstats.qbench by default... Damn... The ability to reuse it could shrink your integration code & effort in the order of magnitude and turn it into a very concise integration like here I use for benchmarking kernels in jax-triton.... I need to fix it...

Just wanted to mention that bootstrapping and permutation testing are such miracles and are part of what I adore of statistics 💯 .

And just a note about process: for a non-parametric test that make no assumptions about underlying data distribution, such as Brunner Munzel Test (and in practice this also generally works with Mann–Whitney U test, but never with T-Test or similar), it's safe to use a tiny number of iterations, like 2-3-4 (as long as times are aggregated within a repetition with mean/average, which is the only correct choice almost always). Maybe even 1 would work nicely, I just only tested them up to 2 😁 . Then even with 100-200 repetitions, a total execution budget would be still rather small, but sensitivity of the test will be very high.

That's a good thing to note. I was wondering what the stability would look like empirically, since eventually we'll want to optimize for runtime in a CI environment. Thanks for the additional info!

Overall, thank you very much for your project, and for taking the time to speak on our current benchmarking implementation. This PR is still much of a prototype PR, but I'm hopeful we can get it integrated and move to some more meaningful results. You've keenly spotted a few challenges we'll have even w/ or w/o this specific approach. Now all that's left is for me to figure out the least painful way to address them 😄

@Arech8

Arech8 commented Jun 9, 2026

Copy link
Copy Markdown

advocacy for statistical testing of benchmark results -- I think it's important for a healthy ecosystem!

Yes, absolutely. Using unreliable benchmarks is like building on top of ever-shifting sands: it's only a matter of time when such benchmarks misfire. Using statistical tools for comparison is an important step towards repeatable results and reliable conclusions.

Right, specifically the cache invalidation was a feature that I planned on introducing afterwards ...

I'm not familiar with Torch's benchmarks runner, so can't tell yet exactly, however a few things to note here: Triton's benchmarking method do_bench() is also based on Torch and it utilizes cache clearing. On both platforms the implementation is identical, - it just allocates and zeros a 256MiB chunk of data, so assuming it's a correct way to clear cache, it's super easy to port. I wish clearing caches on CPUs were so simple! 😁 (for the context - it's incredibly hard and costly to ensure CPU caches are clear, as CPUs are latency focused and their caching algorithms are much smarter)

the interleaving having the advantage of mitigating local machine-level noise is a great point that isn't easily solved without interleaving.

Precisely that!

Unfortunately interleaving is a non-trivial addition and breaks our current benchmarking flow, but IMO it is worth trying to figure out how to implement on our end.

One of the advantages of Torch over JAX is relative openness and maturity of internals. In Torch it's very simple to implement a benchmarks runner that measures precisely a kernel run time. For example, here's how Triton does it with CUDA/HIP events, exported by Torch. Doing that isn't fun in JAX, since JAX (currently) just doesn't expose anything needed for that 😁

If things go right, somewhat soon I might return back to Silo to a team working with Torch and then, if @jcaraban agrees, I could pay much more attention to Torch benchmarking issues, and will definitely implement support in benchstats for custom timers (still in my spare time, sorry, as benchstats is my personal passion project and I'm not going to give it away by working on it in during AMD work time). Then all of that would be achievable with just a few lines of configuration code...

I was wondering what the stability would look like empirically, since eventually we'll want to optimize for runtime in a CI environment.

So, I was using the Jax-Triton benchmarking script I linked above for maybe...at least several hundred times already on a shared machine with multiple tenants working in parallel. Once I implemented interleaving (and this happened pretty early) and later pvalue bootstrapping, I've always used them and, especially after interleaving, I'm getting very consistent benchmarking results: if the results were found significant once, they are basically always significant. Basically - because it still depends on the background noise level and the difference in magnitude of runtimes. But I don't think I recall any failures for differences from 1-2% and higher, even though individual runtimes may fluctuate wildly, and depending on the machine load, even though average runtimes (across all executions of a benchmark in a session) might fluctuate x2+ times between some session - results are always the same. So I'm very happy with it. I've never seen before so solid reproducibility of complex benchmarks even on quiesced machines. I basically even ceased to worry about parallel tenants producing noise, even though this is typically the first thing you have to do before even starting to work on the machine quiescing... But please don't get me wrong - there are very many variables to care about to get solid results - and as long as you aren't using precisely the same in all minor details approach, "your mileage might vary". Hence the only thing I'm confidently expecting is that your results will be definitely more reliable, especially with interleaving.

Now all that's left is for me to figure out the least painful way to address them 😄

You might want to have a look at links above to Triton benchmarks runner for inspiration. This should be not hard to (re-)implement independently in Torch, though I hope I'll deliver custom timers to benchstats.qbench sooner and you just won't need to write a custom runner (but I really have no idea about requirements to your runner and might be wildly off guessing here, you know better).

@Micky774

Micky774 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@matthiasdiener here's an example simulation displaying the effect of transient noise. Results:

Results

==============================================================================
EXPT 0  CONTROL (no transient)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.048     0.002 |       0.9996  0.9842..1.0154
  ctrl | interleaved |    0.056     0.002 |       1.0003  0.9846..1.0158
-------------------------------------------------------------------------

==============================================================================
EXPT 1  WARM-UP RAMP (monotonic drift)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.049     0.000 |       0.9999  0.9844..1.0156
  ctrl | interleaved |    0.051     0.002 |       1.0002  0.9846..1.0157
-------------------------------------------------------------------------
  0.05 | sequential  |    0.864     0.414 |       1.0250  1.0093..1.0414
  0.05 | interleaved |    0.031     0.001 |       1.0009  0.9851..1.0172
-------------------------------------------------------------------------
  0.10 | sequential  |    1.000     0.993 |       1.0496  1.0332..1.0674
  0.10 | interleaved |    0.009     0.000 |       1.0015  0.9840..1.0206
-------------------------------------------------------------------------
  0.20 | sequential  |    1.000     1.000 |       1.0970  1.0768..1.1171
  0.20 | interleaved |    0.000     0.000 |       1.0027  0.9805..1.0253
-------------------------------------------------------------------------

==============================================================================
EXPT 2  STEP THROTTLE (random onset, 50% duration)   (true_ratio B/A = 1.000, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.052     0.002 |       1.0002  0.9843..1.0159
  ctrl | interleaved |    0.057     0.003 |       0.9999  0.9842..1.0156
-------------------------------------------------------------------------
  0.05 | sequential  |    0.625     0.405 |       1.0007  0.9518..1.0508
  0.05 | interleaved |    0.012     0.000 |       1.0001  0.9825..1.0176
-------------------------------------------------------------------------
  0.10 | sequential  |    0.698     0.530 |       1.0000  0.9080..1.1005
  0.10 | interleaved |    0.001     0.000 |       1.0003  0.9776..1.0235
-------------------------------------------------------------------------
  0.20 | sequential  |    0.646     0.467 |       0.9925  0.8323..1.2006
  0.20 | interleaved |    0.000     0.000 |       1.0004  0.9728..1.0269
-------------------------------------------------------------------------
  0.40 | sequential  |    0.652     0.482 |       1.0055  0.7138..1.4004
  0.40 | interleaved |    0.000     0.000 |       0.9996  0.9731..1.0269
-------------------------------------------------------------------------

==============================================================================
EXPT 3  REAL 5% SPEEDUP (B truly 5% FASTER) under warm-up ramp
        rate = fraction calling B SLOWER *with significance* (false regression)   (true_ratio B/A = 0.950, N=30/kernel, trials=4000)
==============================================================================
   amp | policy      |  FPR@.05  FPR@.001 | median ratio    ratio p5..p95
-------------------------------------------------------------------------
  ctrl | sequential  |    0.000     0.000 |       0.9500  0.9352..0.9650
  ctrl | interleaved |    0.000     0.000 |       0.9501  0.9350..0.9652
-------------------------------------------------------------------------
  0.05 | sequential  |    0.000     0.000 |       0.9737  0.9585..0.9893
  0.05 | interleaved |    0.000     0.000 |       0.9508  0.9353..0.9661
-------------------------------------------------------------------------
  0.10 | sequential  |    0.005     0.000 |       0.9974  0.9815..1.0133
  0.10 | interleaved |    0.000     0.000 |       0.9514  0.9340..0.9689
-------------------------------------------------------------------------
  0.20 | sequential  |    0.993     0.767 |       1.0421  1.0236..1.0608
  0.20 | interleaved |    0.000     0.000 |       0.9531  0.9319..0.9745
-------------------------------------------------------------------------

==============================================================================
DISTRIBUTION OF MEASURED RATIO  (median_B/median_A), warm-up ramp amp=0.20, null (true=1.0)
==============================================================================
  Truth = 1.000.  Sequential is shifted (BIAS); interleaved is centered.
  sequential  (biased: B always runs hotter)
    +0.860 | 
    +0.879 | 
    +0.898 | 
    +0.917 | 
    +0.936 | 
    +0.955 | 
    +0.974 | 
    +0.993 | 
    +1.012 | 
    +1.031 | 
    +1.050 | 
    +1.069 | #######
    +1.088 | ##############################################
    +1.107 | ###########################################
    +1.126 | ######
    +1.145 | 
    +1.164 | 
    +1.183 | 
    +1.202 | 
    +1.221 | 
    +1.240 | 
  interleaved (centered on truth)
    +0.860 | 
    +0.879 | 
    +0.898 | 
    +0.917 | 
    +0.936 | 
    +0.955 | 
    +0.974 | ########
    +0.993 | #############################################
    +1.012 | ##############################################
    +1.031 | #########
    +1.050 | 
    +1.069 | 
    +1.088 | 
    +1.107 | 
    +1.126 | 
    +1.145 | 
    +1.164 | 
    +1.183 | 
    +1.202 | 
    +1.221 | 
    +1.240 | 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants