Introduce roofline analyzer #86
Conversation
Consolidates the previous kernel_benchmark.py and pytorch_benchmark.py into a streamlined 3-file architecture with clear separation of concerns.

Architecture:
- benchmark.py (299 lines): main Benchmark class with a simplified API
  - benchmark_kernel(): always uses a subprocess for crash protection
  - benchmark_pytorch(): always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios
- timing.py (437 lines): complete timing infrastructure
  - timing: time_with_cuda_events(), time_with_triton_do_bench()
  - loading: prepare_pytorch_model(), load_kernel_function()
  - stats: compute_timing_stats() with essential metrics (mean/std/min/max)
- kernel_subprocess.py (442 lines): subprocess runner for kernel isolation
  - crash protection for potentially buggy kernels
  - clean CUDA state between runs
  - timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed the confusing use_subprocess parameter (behavior is now deterministic)
- Fixed a dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:

```python
bench = Benchmark(logger, temp_dir, lock, worker_id)
pytorch_result = bench.benchmark_pytorch(problem_file)
kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
```
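For context, here is a minimal sketch of the crash-protection pattern that benchmark_kernel() relies on. The function name, the CLI of kernel_subprocess.py, and the JSON result protocol are assumptions for illustration, not the PR's actual code:

```python
import json
import subprocess
import sys

def benchmark_kernel_isolated(kernel_file: str, problem_file: str,
                              timeout_s: float = 120.0) -> dict:
    """Run a possibly-buggy kernel in a child process so a crash
    (segfault, CUDA illegal memory access) cannot take down the worker."""
    try:
        proc = subprocess.run(
            [sys.executable, "kernel_subprocess.py", kernel_file, problem_file],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    if proc.returncode != 0:  # crashed or raised: report, don't propagate
        return {"ok": False, "error": proc.stderr.strip()}
    return json.loads(proc.stdout)  # child prints its timing result as JSON
```

Because each kernel runs in a fresh process, CUDA state is also guaranteed clean between runs, which is why the PR can make the subprocess/direct split deterministic instead of exposing a use_subprocess flag.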
```python
        else:
            return "compute"


    def _default_result(self) -> RooflineResult:
```
Why do we need a default function? Can't we just set it in the object?
good point - let me move it inline
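For illustration, "setting it in the object" could look like giving RooflineResult field defaults directly, so the fallback is just `RooflineResult()` and `_default_result()` disappears. The field names here are guessed from the PR summary:

```python
from dataclasses import dataclass

@dataclass
class RooflineResult:
    compute_sol_pct: float = 0.0
    memory_sol_pct: float = 0.0
    efficiency_pct: float = 0.0
    bottleneck: str = "unknown"
    at_roofline: bool = False
```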
| """Configuration for roofline analysis.""" | ||
|
|
||
| threshold: float = 0.90 # 90% SOL = at roofline | ||
| early_stop: bool = True # Stop optimization when at roofline |
The early_stop config is unused. Presumably it should gate the stop check, e.g. `if self.config.early_stop and result.at_roofline:`.
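A sketch of the wiring this suggests, taken directly from the quoted condition (the real should_stop() may also check convergence, per the PR description):

```python
def should_stop(self, result: RooflineResult) -> bool:
    # Gate early stopping on the config flag so it can be disabled.
    if self.config.early_stop and result.at_roofline:
        return True
    return False
```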
```python
class RooflineConfig:
    """Configuration for roofline analysis."""

    threshold: float = 0.90  # 90% SOL = at roofline
```
Either store threshold as percent (90.0), or keep as fraction but rename to threshold_frac and document “0–1”.
Good call - let me rename it to threshold_pct and update accordingly.
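After the rename, the config could look like this (a sketch; the stored value switches to percent to match the `_pct` suffix):

```python
from dataclasses import dataclass

@dataclass
class RooflineConfig:
    """Configuration for roofline analysis."""

    threshold_pct: float = 90.0  # percent of SOL, 0-100; >= threshold_pct = at roofline
    early_stop: bool = True      # Stop optimization when at roofline
```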
```python
        )
        return tc_cycles > self.config.tensor_core_threshold


    def _classify_bottleneck(self, compute_sol: float, memory_sol: float) -> str:
```
Comment says: “lower SOL = bottleneck”. That’s plausible as “what’s limiting utilization”, but you also define: both < 60% → “latency”. That “latency” bucket is really “neither compute nor memory saturated” (could be instruction mix, occupancy, launch config, dependency stalls, small problem size, etc.). Calling it “latency” is OK as a heuristic, but I’d suggest naming it "underutilized" or "latency/overhead" to avoid overclaiming.
Sounds good! I'll change the name for the threshold and the config.
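A sketch of the classifier with the renamed bucket, following the PR's stated "lower SOL = bottleneck" convention and the 60% cutoff from the summary (both values are assumptions about the actual implementation):

```python
def _classify_bottleneck(self, compute_sol: float, memory_sol: float) -> str:
    # Neither unit near saturation: not clearly compute- or memory-bound.
    if compute_sol < 60.0 and memory_sol < 60.0:
        return "latency/overhead"
    # Per the module's convention, the side with the lower SOL is reported
    # as the bottleneck ("what's limiting utilization").
    if memory_sol < compute_sol:
        return "memory"
    else:
        return "compute"
```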
```python
    - Compute SOL: SM throughput as % of peak
    - Memory SOL: DRAM throughput as % of peak

    Updated in January 20226
```
-> 2026
```python
# See the License for the specific language governing permissions and
# limitations under the License.

"""Kernel Performance Agent package."""
```
This is odd:

```python
"""Kernel Performance Agent package."""
# "Kernel Performance Agent package
__all__ = []
```

The comment just repeats the docstring (and is missing its closing quote).
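One possible cleanup, keeping only the docstring (whether `__all__` should stay is a separate judgment call):

```python
"""Kernel Performance Agent package."""

__all__ = []
```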
```python
            RooflineResult with SOL-based efficiency analysis
        """
        # Extract SOL metrics
        compute_sol = ncu_metrics.get(
```
If either metric is missing, you'll treat missing as 0 and proceed, potentially misclassifying. Consider adding warnings when keys are missing, e.g.:
- if the compute metric is missing → warning + set compute_sol_pct=0
- if the memory metric is missing → warning + set memory_sol_pct=0

and only default-fail when both keys are absent (not just when both values are 0).
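A sketch of the suggested guard. The metric key names below are placeholders standing in for whatever NCU SOL metrics the PR actually reads:

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger(__name__)

# Assumed key names; substitute the real NCU SOL metric identifiers.
COMPUTE_SOL_KEY = "sm__throughput.avg.pct_of_peak_sustained_elapsed"
MEMORY_SOL_KEY = "gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed"

def extract_sol_pcts(ncu_metrics: dict) -> Optional[Tuple[float, float]]:
    compute = ncu_metrics.get(COMPUTE_SOL_KEY)
    memory = ncu_metrics.get(MEMORY_SOL_KEY)
    if compute is None and memory is None:
        return None  # default-fail only when *both* keys are absent
    if compute is None:
        logger.warning("compute SOL metric missing; defaulting compute_sol_pct to 0")
        compute = 0.0
    if memory is None:
        logger.warning("memory SOL metric missing; defaulting memory_sol_pct to 0")
        memory = 0.0
    return compute, memory
```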
Force-pushed from e77967c to 3c607b5.
This module provides a roofline analyzer using NCU's Speed of Light (SOL) metrics to determine kernel efficiency relative to hardware limits.
Key Components:
- compute_sol_pct / memory_sol_pct - SM and memory throughput as % of peak
- efficiency_pct - max(compute, memory) SOL
- bottleneck - "memory" | "compute" | "latency"
- at_roofline - True if efficiency >= 90% threshold (configurable)
- analyze(ncu_metrics) - takes an NCU metrics dict, returns a RooflineResult
- should_stop(result) - determines if optimization should stop (at roofline or converged)

Bottleneck classification: lower SOL = bottleneck; both < 60% = latency bound.
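Putting the pieces together, usage presumably looks something like the following (the RooflineAnalyzer class name and constructor signature are assumptions based on this summary):

```python
analyzer = RooflineAnalyzer(RooflineConfig(threshold_pct=90.0))
result = analyzer.analyze(ncu_metrics)  # ncu_metrics: dict of NCU SOL values

print(f"SOL efficiency {result.efficiency_pct:.1f}% ({result.bottleneck} bound)")
if analyzer.should_stop(result):
    print("At roofline (or converged) - stopping optimization")
```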
compute_sol_pct/memory_sol_pct- SM and memory throughput as % of peakefficiency_pct- max(compute, memory) SOLbottleneck- "memory" | "compute" | "latency"at_roofline- True if efficiency >= 90% threshold (configurable)analyze(ncu_metrics)- Takes NCU metrics dict, returns RooflineResultshould_stop(result)- Determines if optimization should stop (at roofline or converged)Bottleneck classification: lower SOL = bottleneck; both < 60% = latency bound