[Performance Analysis] Adding intra-kernel timing runs by SergioMartin86 · Pull Request #829 · hw-native-sys/simpler

SergioMartin86 · 2026-05-20T10:03:42Z

We want to add the ability to run a task multiple times inside the same kernel launch. This is essential for precise timing and performance evaluation of both orchestration and scheduling.

We add:

Warmup runs: untimed runs used to absorb cache initialization, dlopen, and kernel launch noise.
Timed runs: these are actually timed, and an average + stddev is reported.

Running multiple timed runs averages out the OS/device noise that causes random variations in running time. This noise is significant for these extremely low-latency kernels, so precisely measuring scheduling/orchestration performance requires a statistical analysis over many samples taken inside the same kernel launch.
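The warmup/timed split described above can be sketched on the host side as follows. This is a minimal illustration, not the code in this PR: `TimingStats`, `summarize`, and `time_task` are hypothetical names, `std::chrono::steady_clock` stands in for whatever clock the kernel actually uses, and the standard deviation is assumed to be the sample (n - 1) form.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Summary statistics over the timed samples, in nanoseconds.
struct TimingStats {
    double mean_ns;
    double stddev_ns;
    std::size_t samples;
};

// Mean and sample standard deviation (n - 1 denominator, an assumption).
inline TimingStats summarize(const std::vector<double>& samples_ns) {
    const std::size_t n = samples_ns.size();
    if (n == 0) return {0.0, 0.0, 0};
    double sum = 0.0;
    for (double s : samples_ns) sum += s;
    const double mean = sum / static_cast<double>(n);
    double sq_dev = 0.0;
    for (double s : samples_ns) sq_dev += (s - mean) * (s - mean);
    const double stddev =
        (n > 1) ? std::sqrt(sq_dev / static_cast<double>(n - 1)) : 0.0;
    return {mean, stddev, n};
}

// Hypothetical helper: run `task` untimed `warmup_runs` times, then time
// each of `timed_runs` further executions and summarize the samples.
inline TimingStats time_task(const std::function<void()>& task,
                             std::size_t warmup_runs,
                             std::size_t timed_runs) {
    for (std::size_t i = 0; i < warmup_runs; ++i) task();  // results discarded

    std::vector<double> samples_ns;
    samples_ns.reserve(timed_runs);
    for (std::size_t i = 0; i < timed_runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        task();
        const auto t1 = std::chrono::steady_clock::now();
        samples_ns.push_back(
            std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    return summarize(samples_ns);
}
```

Keeping the warmup loop outside the sample vector means first-touch costs never enter the reported average, which is the point of separating the two kinds of run.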

Relevant Change:

See https://github.com/hw-native-sys/simpler/pull/829/changes#diff-f1bd1d412c7f0c6e99f4f11c3830d67582037fbbd6ef3a981c34edb244f9a849R761 for the main timing function we added.

Why is it necessary:

Simpler already provides a way to repeat a kernel launch N times within the same run by using the --rounds N parameter. This can be used to obtain many samples for a statistical performance analysis. However, measuring the scheduling/orchestration time across different kernel launches leads to high variance and, for some reason, a bimodal distribution:

This might indicate a bug or problem introduced by re-launching a kernel immediately after the previous launch finishes, and it should be investigated.

What this PR adds is the ability to run N samples within the same kernel launch. By doing this, the bug mentioned above, along with other possible sources of noise and variance that are unrelated to scheduling/orchestration efficiency, can be removed. This yields a stable distribution (orange) that is in the same range as the ~500kns peak:

This result coincides with the slower half of the --rounds 100 samples, but it is slightly more concentrated.


gemini-code-assist

Code Review

This pull request introduces a performance timing framework for AICPU kernels, enabling warmup and timed execution iterations configurable via environment variables. The changes include a new two-phase barrier for thread synchronization, the use of thread-local storage for thread indexing, and enhanced logging. Feedback highlights several critical issues: an operator precedence bug in the thread completion logic that prevents proper cleanup, thread-safety violations when calling initialization routines concurrently, and a break in binary compatibility due to field insertion in the Runtime class. Additionally, improvements are suggested for memory ordering in the barrier, robustness in environment variable parsing, and correcting a log message typo.

gemini-code-assist · 2026-05-20T10:05:37Z

+            std::string env_timing_iterations_string = std::string(env_timing_iterations);
+            bool isValidValue = false;
+            if (env_timing_iterations_string == "True") { runtime->is_timing_enabled = true; isValidValue = true; }
+            if (env_timing_iterations_string == "False") { runtime->is_timing_enabled = false; isValidValue = true; }
+            if (isValidValue == false) 
+            {
+                LOG_WARN("PTO2_KERNEL_TIMING_ENABLED=%s is invalid, using default: \"False\"", env_timing_iterations);
+                runtime->is_timing_enabled = false;
+            }
+        }


The environment variable parsing for PTO2_KERNEL_TIMING_ENABLED is brittle as it only accepts exact case-sensitive matches for 'True' or 'False'. It would be more robust to support a wider range of boolean representations (e.g., '1', '0', 'true', 'false', 'on', 'off') and perform case-insensitive comparisons.
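One way the suggestion above could look is sketched below. It is an illustration under assumptions, not the parser adopted in this PR: `parse_bool_flag` is a hypothetical helper, and the accepted spellings are exactly those listed in the review comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: parse a boolean flag case-insensitively.
// Accepts 1/0, true/false, on/off (surrounding whitespace ignored).
// Returns `fallback` and sets *is_valid to false when `raw` is null
// or not one of the accepted spellings.
inline bool parse_bool_flag(const char* raw, bool fallback,
                            bool* is_valid = nullptr) {
    if (is_valid != nullptr) *is_valid = false;
    if (raw == nullptr) return fallback;

    std::string v(raw);
    const auto not_space = [](unsigned char c) { return !std::isspace(c); };
    // Trim leading, then trailing whitespace.
    v.erase(v.begin(), std::find_if(v.begin(), v.end(), not_space));
    v.erase(std::find_if(v.rbegin(), v.rend(), not_space).base(), v.end());
    // Lowercase so that "True", "TRUE", and "true" compare equal.
    std::transform(v.begin(), v.end(), v.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });

    bool value = fallback;
    bool ok = true;
    if (v == "1" || v == "true" || v == "on") {
        value = true;
    } else if (v == "0" || v == "false" || v == "off") {
        value = false;
    } else {
        ok = false;  // unrecognized spelling: keep the fallback
    }
    if (is_valid != nullptr) *is_valid = ok;
    return value;
}
```

A caller could then pass `std::getenv("PTO2_KERNEL_TIMING_ENABLED")` (from `<cstdlib>`) as `raw` and emit the existing warning only when the variable is set but `is_valid` comes back false.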

ChaoWao · 2026-05-21T03:00:12Z

Please give the comparison data of N*kernelLaunch and 1 kernelLaunch*N inner run.

Run 100 times and trim highest 10 and lowest 10
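The trimming requested here (sort the samples, then drop the k highest and k lowest before averaging) could be sketched as below. `trim_extremes` and `mean_of` are hypothetical helper names and are not part of Simpler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the samples with the k smallest and the
// k largest values removed. Returns an empty vector if nothing would remain.
inline std::vector<double> trim_extremes(std::vector<double> samples,
                                         std::size_t k) {
    if (samples.size() <= 2 * k) return {};
    std::sort(samples.begin(), samples.end());
    const auto offset = static_cast<std::ptrdiff_t>(k);
    return std::vector<double>(samples.begin() + offset,
                               samples.end() - offset);
}

// Arithmetic mean; returns 0.0 for an empty input.
inline double mean_of(const std::vector<double>& v) {
    if (v.empty()) return 0.0;
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}
```

With 100 samples and k = 10, the remaining 80 samples are the ones that feed the comparison, so both measurement modes are judged on the same central portion of their distributions.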

SergioMartin86 · 2026-05-26T09:45:20Z

> Please give the comparison data of N*kernelLaunch and 1 kernelLaunch*N inner run.
>
> Run 100 times and trim highest 10 and lowest 10

Please see the updated PR description: it contains this comparison

SergioMartin86 added 5 commits May 19, 2026 15:24

Adding the capability of re-running a task within a single kernel run…

a1ecccf

…. This is important for accurate timing

Adding timing

847b64c

Progress

38e5008

Simplifying

f1953c5

merging with upstream

7c37c6c

gemini-code-assist Bot reviewed May 20, 2026


SergioMartin86 added 3 commits May 20, 2026 12:10

Addressing some agent suggestions

3bf8780

Fixes

f9f25c3

Fix

220d1d9

ChaoWao closed this May 21, 2026

ChaoWao reopened this May 21, 2026

SergioMartin86 added 11 commits May 21, 2026 10:14

Succesffuly running two consecutive inner runs

192ef42

Recovering timing runs

108bfc4

Separated orchestration loading from actual run

55580d6

Separating orhestration from scheduling activities

570660f

Adding timing statistics

20140dd

Moving initialization routine outside main loop

454f875

Now producing avg+stddev

4da588a

Adding missing barrier

726c3bc

Small finalization fix

3080389

Merge branch 'main' into intraKernelTiming

df40ddb

merge with main

7f9c651

SergioMartin86 marked this pull request as ready for review May 26, 2026 09:43

SergioMartin86 wants to merge 19 commits into hw-native-sys:main from huawei-csl:intraKernelTiming
