[Performance Analysis] Adding intra-kernel timing runs #829
[Performance Analysis] Adding intra-kernel timing runs #829SergioMartin86 wants to merge 19 commits into
Conversation
…. This is important for accurate timing
There was a problem hiding this comment.
Code Review
This pull request introduces a performance timing framework for AICPU kernels, enabling warmup and timed execution iterations configurable via environment variables. The changes include a new two-phase barrier for thread synchronization, the use of thread-local storage for thread indexing, and enhanced logging. Feedback highlights several critical issues: an operator precedence bug in the thread completion logic that prevents proper cleanup, thread-safety violations when calling initialization routines concurrently, and a break in binary compatibility due to field insertion in the Runtime class. Additionally, improvements are suggested for memory ordering in the barrier, robustness in environment variable parsing, and correcting a log message typo.
| std::string env_timing_iterations_string = std::string(env_timing_iterations); | ||
| bool isValidValue = false; | ||
| if (env_timing_iterations_string == "True") { runtime->is_timing_enabled = true; isValidValue = true; } | ||
| if (env_timing_iterations_string == "False") { runtime->is_timing_enabled = false; isValidValue = true; } | ||
| if (isValidValue == false) | ||
| { | ||
| LOG_WARN("PTO2_KERNEL_TIMING_ENABLED=%s is invalid, using default: \"False\"", env_timing_iterations); | ||
| runtime->is_timing_enabled = false; | ||
| } | ||
| } |
There was a problem hiding this comment.
The environment variable parsing for PTO2_KERNEL_TIMING_ENABLED is brittle as it only accepts exact case-sensitive matches for 'True' or 'False'. It would be more robust to support a wider range of boolean representations (e.g., '1', '0', 'true', 'false', 'on', 'off') and perform case-insensitive comparisons.
|
Please see the updated PR description: it contains this comparison |
We want to add the ability to run a task multiple times inside the same kernel launch. This is essential for precise timing and performance evaluation of both orchestration and scheduling.
We add:
By running multiple timed runs, we dissipate OS/device noise that cause random variations in running time. This noise is significant when running these extremely low-latency kernels, so, if we want to precisely measure scheduling/orchestration performance, we need to use a statistical analysis with many samples inside the same kernel launch.
Relevant Change:
See https://github.com/hw-native-sys/simpler/pull/829/changes#diff-f1bd1d412c7f0c6e99f4f11c3830d67582037fbbd6ef3a981c34edb244f9a849R761 for main timing function we added.
Why is it necessary:
Simpler already provides a way to repeat a kernel launch
Ntimes within the same run by using the--rounds Nparameter. This can be use for obtaining many samples for a statistical performance analysis. However, measuring the scheduling/orchestration time using different kernel launches leads to high variance and, for some reason, a bimodal distribution:This might indicate a bug or problem that is introduced by re-launching a kernel immediate after the previous launch finishes and needs to be taken a look at.
What this PR adds is the ability of running
Nsamples within the same kernel launch. By doing this, the bug mentioned above, along with other possible the sources of noise and variance that are unrelated to the scheduling/orchestration efficiency can be removed. This yields a stable distribution (orange) that is in the same range as the ~500kns peak:This result coincides with the slower half of the
--rounds 100samples, but it is slightly more concentrated.