
Conversation

@kaiming-cheng
Contributor

This module provides a roofline analyzer using NCU's Speed of Light (SOL) metrics to determine kernel efficiency relative to hardware limits.

Key Components:

  • RooflineResult - Dataclass with core metrics:
    • compute_sol_pct / memory_sol_pct - SM and memory throughput as % of peak
    • efficiency_pct - max(compute, memory) SOL
    • bottleneck - "memory" | "compute" | "latency"
    • at_roofline - True if efficiency >= 90% threshold (configurable)
  • RooflineAnalyzer - Main analysis class:
    • analyze(ncu_metrics) - Takes NCU metrics dict, returns RooflineResult
    • should_stop(result) - Determines if optimization should stop (at roofline or converged)
    • Bottleneck classification: lower SOL = bottleneck; both < 60% = latency bound
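A minimal sketch of how these pieces fit together (the NCU metric keys, the 60% cutoff, and the threshold default below are assumptions drawn from this description, not the exact code):

  from dataclasses import dataclass

  @dataclass
  class RooflineResult:
      compute_sol_pct: float  # SM throughput as % of peak
      memory_sol_pct: float   # DRAM throughput as % of peak
      efficiency_pct: float   # max of the two SOL values
      bottleneck: str         # "memory" | "compute" | "latency"
      at_roofline: bool       # efficiency_pct >= threshold

  class RooflineAnalyzer:
      # Assumed NCU SOL metric keys; the PR may read different ones.
      COMPUTE_KEY = "sm__throughput.avg.pct_of_peak_sustained_elapsed"
      MEMORY_KEY = "gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed"

      def __init__(self, threshold_pct: float = 90.0):
          self.threshold_pct = threshold_pct

      def analyze(self, ncu_metrics: dict) -> RooflineResult:
          compute = float(ncu_metrics.get(self.COMPUTE_KEY, 0.0))
          memory = float(ncu_metrics.get(self.MEMORY_KEY, 0.0))
          efficiency = max(compute, memory)
          if compute < 60.0 and memory < 60.0:
              bottleneck = "latency"  # neither unit saturated
          else:
              # the unit with the lower SOL is what limits utilization
              bottleneck = "memory" if memory < compute else "compute"
          return RooflineResult(compute, memory, efficiency, bottleneck,
                                efficiency >= self.threshold_pct)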

Kaiming Cheng and others added 27 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
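For context, CUDA-event timing of the kind time_with_cuda_events() provides typically follows this pattern (a generic sketch, not the file's exact code):

  import torch

  def time_with_cuda_events(fn, warmup: int = 10, iters: int = 100) -> float:
      """Return mean time per call in milliseconds, measured with CUDA events."""
      for _ in range(warmup):
          fn()
      torch.cuda.synchronize()
      start = torch.cuda.Event(enable_timing=True)
      end = torch.cuda.Event(enable_timing=True)
      start.record()
      for _ in range(iters):
          fn()
      end.record()
      torch.cuda.synchronize()  # wait for the end event before reading it
      return start.elapsed_time(end) / iters  # elapsed_time() reports ms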
@meta-cla bot added the CLA Signed label on Jan 29, 2026
        else:
            return "compute"

    def _default_result(self) -> RooflineResult:
Contributor

Why do we need a default function? Can't we just set it in the object?

Contributor Author

good point - let me move it inline
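For example, if RooflineResult is a dataclass, giving every field a default makes RooflineResult() itself the fallback and lets _default_result() go away entirely (sketch):

  from dataclasses import dataclass

  @dataclass
  class RooflineResult:
      compute_sol_pct: float = 0.0
      memory_sol_pct: float = 0.0
      efficiency_pct: float = 0.0
      bottleneck: str = "latency"
      at_roofline: bool = False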

"""Configuration for roofline analysis."""

threshold: float = 0.90 # 90% SOL = at roofline
early_stop: bool = True # Stop optimization when at roofline
Contributor

The early_stop config is unused. Suggest gating on it: if self.config.early_stop and result.at_roofline:
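One way to wire that in, following the comment's suggested guard (sketch):

  # Sketch, inside RooflineAnalyzer; the convergence check mentioned in the
  # PR description is elided here.
  def should_stop(self, result: RooflineResult) -> bool:
      """Stop once the kernel is at the roofline, if early stopping is enabled."""
      return self.config.early_stop and result.at_roofline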

class RooflineConfig:
    """Configuration for roofline analysis."""

    threshold: float = 0.90  # 90% SOL = at roofline
Contributor

Either store threshold as percent (90.0), or keep as fraction but rename to threshold_frac and document “0–1”.

Contributor Author

good call - let me rename it to threshold_pct and update accordingly
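i.e. something along these lines (sketch):

  from dataclasses import dataclass

  @dataclass
  class RooflineConfig:
      """Configuration for roofline analysis."""

      threshold_pct: float = 90.0  # SOL percentage (0-100) treated as at-roofline
      early_stop: bool = True      # Stop optimization when at roofline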

        )
        return tc_cycles > self.config.tensor_core_threshold

    def _classify_bottleneck(self, compute_sol: float, memory_sol: float) -> str:
Contributor

Comment says: “lower SOL = bottleneck”. That’s plausible as “what’s limiting utilization”, but you also define: both < 60% → “latency”. That “latency” bucket is really “neither compute nor memory saturated” (could be instruction mix, occupancy, launch config, dependency stalls, small problem size, etc.). Calling it “latency” is OK as a heuristic, but I’d suggest naming it "underutilized" or "latency/overhead" to avoid overclaiming.

Contributor Author

Sounds good! I'll change the name for the threshold and the config.
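The classifier would then read something like this (sketch; the 60% cutoff is an assumption from this thread):

  # Sketch, inside RooflineAnalyzer.
  def _classify_bottleneck(self, compute_sol: float, memory_sol: float) -> str:
      # Neither unit saturated: not clearly compute- or memory-bound, so
      # report "latency/overhead" rather than overclaiming "latency".
      if compute_sol < 60.0 and memory_sol < 60.0:
          return "latency/overhead"
      # Otherwise the unit with the lower SOL is reported as the limiter.
      return "memory" if memory_sol < compute_sol else "compute"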

- Compute SOL: SM throughput as % of peak
- Memory SOL: DRAM throughput as % of peak

Updated in January 20226
Contributor

-> 2026

# See the License for the specific language governing permissions and
# limitations under the License.

"""Kernel Performance Agent package."""
Contributor

This is odd:

"""Kernel Performance Agent package."""
# "Kernel Performance Agent package
__all__ = []

        RooflineResult with SOL-based efficiency analysis
        """
        # Extract SOL metrics
        compute_sol = ncu_metrics.get(
Contributor

If either metric is missing, you'll treat missing as 0 and proceed, potentially misclassifying. Consider adding warnings when keys are missing, e.g.:

- if compute metric missing → warning + set compute_sol_pct=0
- if memory metric missing → warning + set memory_sol_pct=0

and only default-fail when both keys are absent (not just both values 0).
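A sketch of that suggestion (the metric key names are placeholders for whatever the analyzer actually reads):

  import logging

  logger = logging.getLogger(__name__)

  COMPUTE_KEY = "sm__throughput.avg.pct_of_peak_sustained_elapsed"  # assumed key
  MEMORY_KEY = "gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed"  # assumed key

  def extract_sol_pcts(ncu_metrics: dict):
      """Return (compute_sol_pct, memory_sol_pct), or None if both keys are absent."""
      if COMPUTE_KEY not in ncu_metrics and MEMORY_KEY not in ncu_metrics:
          return None  # only default-fail when both keys are missing
      if COMPUTE_KEY not in ncu_metrics:
          logger.warning("NCU compute SOL metric missing; defaulting to 0")
      if MEMORY_KEY not in ncu_metrics:
          logger.warning("NCU memory SOL metric missing; defaulting to 0")
      return (float(ncu_metrics.get(COMPUTE_KEY, 0.0)),
              float(ncu_metrics.get(MEMORY_KEY, 0.0)))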
