Introduce roofline analyzer #86
Conversation
Consolidates the previous kernel_benchmark.py and pytorch_benchmark.py into a streamlined 3-file architecture with clear separation of concerns.

Architecture:
- benchmark.py (299 lines): main Benchmark class with a simplified API
  - benchmark_kernel(): always uses a subprocess for crash protection
  - benchmark_pytorch(): always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios
- timing.py (437 lines): complete timing infrastructure
  - timing: time_with_cuda_events(), time_with_triton_do_bench()
  - loading: prepare_pytorch_model(), load_kernel_function()
  - stats: compute_timing_stats() with essential metrics (mean/std/min/max)
- kernel_subprocess.py (442 lines): subprocess runner for kernel isolation
  - crash protection for potentially buggy kernels
  - clean CUDA state between runs
  - timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed the confusing use_subprocess parameter (behavior is now deterministic)
- Fixed a dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:

```python
bench = Benchmark(logger, temp_dir, lock, worker_id)
pytorch_result = bench.benchmark_pytorch(problem_file)
kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
```
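For context, here is a minimal sketch of the crash-protection pattern that benchmark_kernel() relies on. The function name, the CLI of kernel_subprocess.py, and the JSON result protocol are assumptions for illustration, not the PR's actual code:

```python
import json
import subprocess
import sys

def benchmark_kernel_isolated(kernel_file: str, problem_file: str,
                              timeout_s: float = 120.0) -> dict:
    """Run a possibly-buggy kernel in a child process so a crash
    (segfault, CUDA illegal memory access) cannot take down the worker."""
    try:
        proc = subprocess.run(
            [sys.executable, "kernel_subprocess.py", kernel_file, problem_file],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    if proc.returncode != 0:  # crashed or raised: report, don't propagate
        return {"ok": False, "error": proc.stderr.strip()}
    return json.loads(proc.stdout)  # child prints its timing result as JSON
```

Because each kernel runs in a fresh process, CUDA state is also guaranteed clean between runs, which is why the PR can make the subprocess/direct split deterministic instead of exposing a use_subprocess flag.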
```python
        else:
            return "compute"


    def _default_result(self) -> RooflineResult:
```
Why do we need a default function? Can't we just set it in the object?
good point - let me move it inline
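For illustration, "setting it in the object" could look like giving RooflineResult field defaults directly, so the fallback is just `RooflineResult()` and `_default_result()` disappears. The field names here are guessed from the PR summary:

```python
from dataclasses import dataclass

@dataclass
class RooflineResult:
    compute_sol_pct: float = 0.0
    memory_sol_pct: float = 0.0
    efficiency_pct: float = 0.0
    bottleneck: str = "unknown"
    at_roofline: bool = False
```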
| """Configuration for roofline analysis.""" | ||
|
|
||
| threshold: float = 0.90 # 90% SOL = at roofline | ||
| early_stop: bool = True # Stop optimization when at roofline |
The early_stop config is unused. Presumably it should gate the stop check, e.g. `if self.config.early_stop and result.at_roofline:`.
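A sketch of the wiring this suggests, taken directly from the quoted condition (the real should_stop() may also check convergence, per the PR description):

```python
def should_stop(self, result: RooflineResult) -> bool:
    # Gate early stopping on the config flag so it can be disabled.
    if self.config.early_stop and result.at_roofline:
        return True
    return False
```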
```python
class RooflineConfig:
    """Configuration for roofline analysis."""

    threshold: float = 0.90  # 90% SOL = at roofline
```
Either store threshold as percent (90.0), or keep as fraction but rename to threshold_frac and document “0–1”.
Good call - let me rename it to threshold_pct and update accordingly.
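After the rename, the config could look like this (a sketch; the stored value switches to percent to match the `_pct` suffix):

```python
from dataclasses import dataclass

@dataclass
class RooflineConfig:
    """Configuration for roofline analysis."""

    threshold_pct: float = 90.0  # percent of SOL, 0-100; >= threshold_pct = at roofline
    early_stop: bool = True      # Stop optimization when at roofline
```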
```python
        )
        return tc_cycles > self.config.tensor_core_threshold


    def _classify_bottleneck(self, compute_sol: float, memory_sol: float) -> str:
```
Comment says: “lower SOL = bottleneck”. That’s plausible as “what’s limiting utilization”, but you also define: both < 60% → “latency”. That “latency” bucket is really “neither compute nor memory saturated” (could be instruction mix, occupancy, launch config, dependency stalls, small problem size, etc.). Calling it “latency” is OK as a heuristic, but I’d suggest naming it "underutilized" or "latency/overhead" to avoid overclaiming.
Sounds good! I'll change the name for the threshold and the config.
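A sketch of the classifier with the renamed bucket, following the PR's stated "lower SOL = bottleneck" convention and the 60% cutoff from the summary (both values are assumptions about the actual implementation):

```python
def _classify_bottleneck(self, compute_sol: float, memory_sol: float) -> str:
    # Neither unit near saturation: not clearly compute- or memory-bound.
    if compute_sol < 60.0 and memory_sol < 60.0:
        return "latency/overhead"
    # Per the module's convention, the side with the lower SOL is reported
    # as the bottleneck ("what's limiting utilization").
    if memory_sol < compute_sol:
        return "memory"
    else:
        return "compute"
```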
```python
    - Compute SOL: SM throughput as % of peak
    - Memory SOL: DRAM throughput as % of peak

    Updated in January 20226
```
-> 2026
```python
# See the License for the specific language governing permissions and
# limitations under the License.

"""Kernel Performance Agent package."""
```
This is odd:

```python
"""Kernel Performance Agent package."""
# "Kernel Performance Agent package
__all__ = []
```

The comment just repeats the docstring (and is missing its closing quote).
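One possible cleanup, keeping only the docstring (whether `__all__` should stay is a separate judgment call):

```python
"""Kernel Performance Agent package."""

__all__ = []
```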
```python
            RooflineResult with SOL-based efficiency analysis
        """
        # Extract SOL metrics
        compute_sol = ncu_metrics.get(
```
If either metric is missing, you'll treat missing as 0 and proceed, potentially misclassifying. Consider adding warnings when keys are missing, e.g.:
- if the compute metric is missing → warning + set compute_sol_pct=0
- if the memory metric is missing → warning + set memory_sol_pct=0

and only default-fail when both keys are absent (not just when both values are 0).
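A sketch of the suggested guard. The metric key names below are placeholders standing in for whatever NCU SOL metrics the PR actually reads:

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger(__name__)

# Assumed key names; substitute the real NCU SOL metric identifiers.
COMPUTE_SOL_KEY = "sm__throughput.avg.pct_of_peak_sustained_elapsed"
MEMORY_SOL_KEY = "gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed"

def extract_sol_pcts(ncu_metrics: dict) -> Optional[Tuple[float, float]]:
    compute = ncu_metrics.get(COMPUTE_SOL_KEY)
    memory = ncu_metrics.get(MEMORY_SOL_KEY)
    if compute is None and memory is None:
        return None  # default-fail only when *both* keys are absent
    if compute is None:
        logger.warning("compute SOL metric missing; defaulting compute_sol_pct to 0")
        compute = 0.0
    if memory is None:
        logger.warning("memory SOL metric missing; defaulting memory_sol_pct to 0")
        memory = 0.0
    return compute, memory
```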
Force-pushed from e77967c to 3c607b5.
This module provides a roofline analyzer using NCU's Speed of Light (SOL) metrics to determine kernel efficiency relative to hardware limits.
Key Components:
- compute_sol_pct / memory_sol_pct - SM and memory throughput as % of peak
- efficiency_pct - max(compute, memory) SOL
- bottleneck - "memory" | "compute" | "latency"
- at_roofline - True if efficiency >= 90% threshold (configurable)
- analyze(ncu_metrics) - takes an NCU metrics dict, returns a RooflineResult
- should_stop(result) - determines if optimization should stop (at roofline or converged)

Bottleneck classification: lower SOL = bottleneck; both < 60% = latency bound.
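Putting the pieces together, usage presumably looks something like the following (the RooflineAnalyzer class name and constructor signature are assumptions based on this summary):

```python
analyzer = RooflineAnalyzer(RooflineConfig(threshold_pct=90.0))
result = analyzer.analyze(ncu_metrics)  # ncu_metrics: dict of NCU SOL values

print(f"SOL efficiency {result.efficiency_pct:.1f}% ({result.bottleneck} bound)")
if analyzer.should_stop(result):
    print("At roofline (or converged) - stopping optimization")
```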
compute_sol_pct/memory_sol_pct- SM and memory throughput as % of peakefficiency_pct- max(compute, memory) SOLbottleneck- "memory" | "compute" | "latency"at_roofline- True if efficiency >= 90% threshold (configurable)analyze(ncu_metrics)- Takes NCU metrics dict, returns RooflineResultshould_stop(result)- Determines if optimization should stop (at roofline or converged)Bottleneck classification: lower SOL = bottleneck; both < 60% = latency bound