Incorporate statistical significance testing to benchmarks#614
Incorporate statistical significance testing to benchmarks#614Micky774 wants to merge 2 commits into
Conversation
Adds a statistically-rigorous comparison path to the microbenchmark suite,
on top of the existing throughput-ratio comparison.
- utils.py: time_func gains a repetitions param; --repetitions flag (default
15) threaded through run_benchmarks so all bench scripts collect enough
per-sample data with no per-script edits. repetitions<=1 preserves the
original single-measurement behavior.
- parser_TEsamples.py: benchstats ParserBase reading the --csv-samples CSV
into {bench_name: {time_ms: ndarray}}.
- compare_results.py: new --stats mode running a Brunner-Munzel test (via
benchstats) on two samples CSVs; exits 1 on a significant regression for
CI gating. Default ratio path unchanged; benchstats lazy-imported.
- requirements.txt: benchstats>=3.4.
- README: document the repetitions knob and --stats workflow.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refines the benchstats integration: - compare_results.py --stats now reports calculated throughput/bandwidth (TFLOPS / GB/s) alongside timing. Time (seconds) is the main metric and drives the exit code; throughput is a secondary metric. The throughput unit suffix is blanked (the column header names the unit) and time is fed in seconds so benchstats' auto-scaling (ms/us/ns) is correct. - utils.py: --csv-samples now also writes per-sample throughput + unit columns, derived from the sample time (blank for samples-only records). - parser_TEsamples.py exposes time_s (main) and a unit-keyed throughput metric, dropping throughput-less composite rows so the metric set stays uniform for benchstats' renderer. - utils.py: replace --repetitions with --min-samples (default 12). Instead of re-running autorange and averaging, time_func keeps the raw per-block samples and tops up any shortfall with additional equal-sized blocks. Also fix the samples writer to not re-divide already-per-run times. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Wow! I'm super happy to see I'm still a long way to write some blog post/presentation about reliable benchmarking, but can share some things to consider for the future, that are very helpful to combat the noise. Noise is going to be a primary thing affecting comparison results after the correct results comparison procedure, like you did, is in place:
And just a note about process: for a non-parametric test that make no assumptions about underlying data distribution, such as Brunner Munzel Test (and in practice this also generally works with Mann–Whitney U test, but never with T-Test or similar), it's safe to use a tiny number of iterations, like 2-3-4 (as long as times are aggregated within a repetition with |
It's a great project, and I appreciate your advocacy for statistical testing of benchmark results -- I think it's important for a healthy ecosystem!
Right, specifically the cache invalidation was a feature that I planned on introducing afterwards, but the interleaving having the advantage of mitigating local machine-level noise is a great point that isn't easily solved without interleaving. Unfortunately interleaving is a non-trivial addition and breaks our current benchmarking flow, but IMO it is worth trying to figure out how to implement on our end.
I'd love to see such an update, but as you mentioned right now we're soft-locked onto the pytorch timing solution.
Just wanted to mention that bootstrapping and permutation testing are such miracles and are part of what I adore of statistics 💯 .
That's a good thing to note. I was wondering what the stability would look like empirically, since eventually we'll want to optimize for runtime in a CI environment. Thanks for the additional info! Overall, thank you very much for your project, and for taking the time to speak on our current benchmarking implementation. This PR is still much of a prototype PR, but I'm hopeful we can get it integrated and move to some more meaningful results. You've keenly spotted a few challenges we'll have even w/ or w/o this specific approach. Now all that's left is for me to figure out the least painful way to address them 😄 |
Yes, absolutely. Using unreliable benchmarks is like building on top of ever-shifting sands: it's only a matter of time when such benchmarks misfire. Using statistical tools for comparison is an important step towards repeatable results and reliable conclusions.
I'm not familiar with Torch's benchmarks runner, so can't tell yet exactly, however a few things to note here: Triton's benchmarking method
Precisely that!
One of the advantages of Torch over JAX is relative openness and maturity of internals. In Torch it's very simple to implement a benchmarks runner that measures precisely a kernel run time. For example, here's how Triton does it with CUDA/HIP events, exported by Torch. Doing that isn't fun in JAX, since JAX (currently) just doesn't expose anything needed for that 😁 If things go right, somewhat soon I might return back to Silo to a team working with Torch and then, if @jcaraban agrees, I could pay much more attention to Torch benchmarking issues, and will definitely implement support in
So, I was using the Jax-Triton benchmarking script I linked above for maybe...at least several hundred times already on a shared machine with multiple tenants working in parallel. Once I implemented interleaving (and this happened pretty early) and later pvalue bootstrapping, I've always used them and, especially after interleaving, I'm getting very consistent benchmarking results: if the results were found significant once, they are basically always significant. Basically - because it still depends on the background noise level and the difference in magnitude of runtimes. But I don't think I recall any failures for differences from 1-2% and higher, even though individual runtimes may fluctuate wildly, and depending on the machine load, even though average runtimes (across all executions of a benchmark in a session) might fluctuate x2+ times between some session - results are always the same. So I'm very happy with it. I've never seen before so solid reproducibility of complex benchmarks even on quiesced machines. I basically even ceased to worry about parallel tenants producing noise, even though this is typically the first thing you have to do before even starting to work on the machine quiescing... But please don't get me wrong - there are very many variables to care about to get solid results - and as long as you aren't using precisely the same in all minor details approach, "your mileage might vary". Hence the only thing I'm confidently expecting is that your results will be definitely more reliable, especially with interleaving.
You might want to have a look at links above to Triton benchmarks runner for inspiration. This should be not hard to (re-)implement independently in Torch, though I hope I'll deliver custom timers to |
|
@matthiasdiener here's an example simulation displaying the effect of transient noise. Results: Results
|
Description
This PR adds support for @Arech8's benchstats repo for general significance testing (primarily using the Brunner Munzel test). This is important for distinguishing large-but-noisy and small-but-significant changes.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
New:
compare_results.py --statsmode--csv-samples) using a statistical test (Brunner-Munzel by default) via thebenchstatspackage, instead of point-estimate ratios.(config, label)as faster/slower/not-significant; exits1on a significant timing regression for CI gating.--alpha(default0.001),--method,--always-show-pvalues,--export-to(.txt/.svg/.html).New:
parser_TEsamples.pybenchstatsparser turning the samples CSV into the{benchmark: {metric: ndarray}}structure.time_s(always) and a throughput metric keyed by unit; warns when <10 samples.Forward+Backward) so the metric set stays uniform.utils.py— sampling & throughput--min-samples Nflag (default 12) wired globally viarun_benchmarksso all benchmark scripts inherit it without edits.time_functops up torch autorange shortfalls with additional equal-sized timing blocks (new_RawSamplesmeasurement stand-in) instead of re-averaging.throughputandunitcolumns; per-sample throughput derived asthr_mean * ms_mean / sample_ms.time_msnow uses already-per-run sample times (removed the erroneous extra/ number_per_rundivision).Supporting
requirements.txtpinningbenchstats>=3.4.README.mddocuments--stats,--min-samples, and the samples-CSV columns.Checklist: