Conversation

@kaiming-cheng
Contributor
This PR adds a new Jinja2 template for bottleneck-guided kernel optimization.

Each optimization round targets exactly one root cause with one recommended fix.

kernel_optimization.j2

  • ROOFLINE ANALYSIS section with SOL%, efficiency, and headroom (see the sketch after this list)
  • BOTTLENECK ANALYSIS expects a single bottleneck with its root_cause and recommended_fix
  • PERFORMANCE TARGET sets a 10% improvement goal over the current best or the Eager baseline
  • TARGET GPU includes all fields from gpu_specs_database
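
For illustration, a minimal sketch of how SOL%, efficiency, and headroom can be derived from achieved vs. peak rates; the formulas here are assumptions for readability, not the actual RooflineAnalyzer implementation:

  # Assumed formulas for illustration only -- not the actual RooflineAnalyzer.
  def roofline_metrics(achieved_tflops, peak_tflops, achieved_gbps, peak_gbps):
      """Derive SOL%, efficiency, and headroom from achieved vs. peak rates."""
      compute_sol_pct = 100.0 * achieved_tflops / peak_tflops
      memory_sol_pct = 100.0 * achieved_gbps / peak_gbps
      # The kernel is limited by whichever unit is closer to its speed of light.
      efficiency_pct = max(compute_sol_pct, memory_sol_pct)
      return {
          "bottleneck": "compute" if compute_sol_pct >= memory_sol_pct else "memory",
          "compute_sol_pct": compute_sol_pct,
          "memory_sol_pct": memory_sol_pct,
          "efficiency_pct": efficiency_pct,
          "headroom_pct": 100.0 - efficiency_pct,
      }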

prompt_manager.py

  • Added kernel_optimization template
  • New render_kernel_optimization_prompt() with explicit params: category, summary, reasoning, root_cause, recommended_fix (signature sketched below)
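
For reference, a plausible signature for the new method, inferred from the example call under Example Usage below; the parameter names come from this PR, while the type hints and defaults are assumptions:

  # Inferred signature sketch -- the actual method lives in prompt_manager.py.
  def render_kernel_optimization_prompt(
      self,
      problem_description: str,
      kernel_code: str,
      gpu_specs: dict,
      roofline: dict,
      category: str,
      summary: str,
      reasoning: str,
      root_cause: str,
      recommended_fix: str,
      pytorch_baseline_ms: float,
      current_best_ms: float | None = None,
  ) -> str:
      """Render kernel_optimization.j2 for a single root-cause/fix pair."""
      ...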

Example Usage

  1. Generate bottleneck analysis from NCU data
  from kernel_perf_agent.kernel_opt.diagnose_prompt.judger_prompt import (
      build_bottleneck_prompt,
      parse_bottleneck_response,
  )
  from kernel_perf_agent.kernel_opt.roofline.ncu_roofline import RooflineAnalyzer

  # Analyze roofline
  analyzer = RooflineAnalyzer()
  roofline = analyzer.analyze(ncu_metrics)

  # Build prompt for LLM
  prompt = build_bottleneck_prompt(
      kernel_code=kernel_src,
      ncu_metrics=ncu_metrics,
      roofline=roofline,
      gpu_specs=gpu_specs,
      num_bottlenecks=2,
      num_causes=2,
      num_fixes=1,
  )

  # Call LLM and parse response
  llm_response = call_llm(prompt)
  bottlenecks = parse_bottleneck_response(llm_response)
  # Returns: [BottleneckResult, BottleneckResult]

  2. Render optimization prompt (one cause, one fix)
  from triton_kernel_agent.prompt_manager import PromptManager

  prompt_manager = PromptManager()

  # Pick first bottleneck, first cause/fix pair
  b = bottlenecks[0]

  prompt = prompt_manager.render_kernel_optimization_prompt(
      problem_description="Fused attention kernel",
      kernel_code=current_kernel,
      gpu_specs=gpu_specs,
      roofline={
          "bottleneck": roofline.bottleneck,
          "compute_sol_pct": roofline.compute_sol_pct,
          "memory_sol_pct": roofline.memory_sol_pct,
          "efficiency_pct": roofline.efficiency_pct,
          "headroom_pct": roofline.headroom_pct,
          "at_roofline": roofline.at_roofline,
          "uses_tensor_cores": roofline.uses_tensor_cores,
          "warnings": roofline.warnings,
      },
      category=b.category,
      summary=b.summary,
      reasoning=b.reasoning,
      root_cause=b.root_causes[0],
      recommended_fix=b.recommended_fixes[0],
      pytorch_baseline_ms=1.234,
      current_best_ms=0.987,  # from previous iteration
  )

  # Call LLM to generate optimized kernel
  optimized_kernel = call_llm(prompt)
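
The shape of BottleneckResult is not shown in this PR; below is a minimal sketch consistent with the attributes the example reads (b.category, b.summary, b.reasoning, b.root_causes, b.recommended_fixes). Field types are assumptions:

  from dataclasses import dataclass, field

  # Assumed shape, inferred from the attribute accesses in the example above;
  # not necessarily the actual class in judger_prompt.
  @dataclass
  class BottleneckResult:
      category: str       # e.g. "memory" or "compute"
      summary: str        # one-line bottleneck description
      reasoning: str      # justification tied to NCU metrics
      root_causes: list[str] = field(default_factory=list)
      recommended_fixes: list[str] = field(default_factory=list)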

Kaiming Cheng and others added 30 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation (sketched below)
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling
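
For illustration, a minimal sketch of the subprocess isolation described above; the child-process CLI (arguments and stdout contract) is an assumption, not the actual kernel_subprocess.py interface:

  import subprocess
  import sys

  # Sketch only: assumes the runner takes (kernel_file, problem_file) as argv
  # and prints the measured time in ms on stdout.
  def run_kernel_isolated(kernel_file, problem_file, timeout_s=120.0):
      """Benchmark in a child process so a crashing or hanging kernel cannot
      take down the worker, and CUDA state is fresh on every run."""
      try:
          proc = subprocess.run(
              [sys.executable, "kernel_subprocess.py", kernel_file, problem_file],
              capture_output=True, text=True, timeout=timeout_s,
          )
      except subprocess.TimeoutExpired:
          return {"ok": False, "error": f"timed out after {timeout_s}s"}
      if proc.returncode != 0:
          return {"ok": False, "error": proc.stderr.strip()}
      return {"ok": True, "time_ms": float(proc.stdout.strip())}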

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
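
For reference, a minimal sketch of compute_timing_stats() matching the metrics listed above (mean/std/min/max); the actual implementation in timing.py may differ:

  import statistics

  # Sketch matching the description above; timing.py's version may differ.
  def compute_timing_stats(times_ms):
      """Reduce per-iteration timings (ms) to the essential metrics."""
      return {
          "mean": statistics.mean(times_ms),
          "std": statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0,
          "min": min(times_ms),
          "max": max(times_ms),
      }
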
@meta-cla (bot) added the CLA Signed label on Jan 31, 2026.
@kaiming-cheng changed the title from Kaiming/opt template to Add Kernel Optimization Template to PromptManager on Jan 31, 2026.