Nsight Systems traces are powerful, but the SQLite export is still hard to interpret when you’re chasing LLM inference bottlenecks.
This tool turns a trace.sqlite into a concise report: top kernels, launch storms, sync indicators, GPU idle gaps, NVTX ranges, and per-PID breakdowns.
It is designed for vLLM-style multi-process traces and is A100-first (capture/runbook assumes A100).
Every number is trace-derived and the report calls out coverage/limitations explicitly.
Install:

    python3.9 -m pip install -e .

Capture a trace:

    nsys profile \
      --trace=cuda,nvtx,osrt \
      --sample=none \
      --cpuctxsw=none \
      --cuda-graph-trace=node \
      -o trace \
      python your_workload.py

Export to SQLite:

    nsys export --type sqlite --output trace.sqlite --force-overwrite=true --lazy=false trace.nsys-rep

Generate the report:

    nsys-llm-explain trace.sqlite --out artifacts/run_YYYYMMDD_HHMMSS/

Notes:
- `--cuda-graph-trace=node` matters for workloads that use CUDA graphs.
- Optional NVTX phase mapping: `--phase-map phases.json` (alias: `--phases-json`).
The output directory contains:
- `report.md`, `report.json`
- `tables/kernels.csv`
- `tables/gpu_idle_gaps.csv` (if computed)
- `tables/nvtx_ranges.csv` (if present)
- `tables/kernels_by_pid.csv`, `tables/sync_by_pid.csv`, `tables/nvtx_by_pid.csv` (best-effort, if PID/NVTX info exists)
See examples/a100_vllm/ for a committed, real capture (outputs only). The raw trace.sqlite is intentionally omitted to keep the repo small.
Excerpt (from examples/a100_vllm/report.md):
## Warnings
- NVTX-attributed GPU time is best-effort (NVTX→runtime→kernel correlation). Coverage is 2.2% (< 70.0%). Low coverage → interpret cautiously.
- Per-PID NVTX-attributed GPU time has low coverage for at least one PID (worst PID 495200: 2.2%). Interpret per-phase/per-PID attribution cautiously.
## What to do next
- **[medium] CPU↔GPU synchronization detected (runtime API)**
  - **Evidence**:
    - Top sync-like call `cudaEventSynchronize_v3020` total 129.97 ms across 129 calls.
    - All sync-like calls total 233.63 ms.
- **[high] Significant GPU idle gaps**
  - **Evidence**:
    - GPU 0 idle 99.4% of observed window (82727.3 ms / 83203.8 ms).
## Global: top CUDA kernels (by total time)
| kernel_name | device_id | total_ms | calls | avg_us | p50_us | p90_us | pct_kernel_time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn | 0 | 47.015 | 3612 | 13.02 | 12.29 | 16.86 | 9.9 |
From examples/a100_vllm/metadata.txt:
- GPU: NVIDIA A100-SXM4-80GB
- Nsight Systems: 2023.4.4.54-234433681190v0
- Python: 3.12.12
- vLLM: 0.15.2rc1.dev110+g785cf28ff.d20260209
- OS / driver / CUDA / PyTorch: not recorded in this capture (outputs-only example)
“GPU idle” is computed over the observed kernel time window (first kernel start → last kernel end). If your capture includes model load, long warmup, or a long idle tail, idle% can look extreme.
Recommended practice:
- Warm up first, then start the capture.
- Capture a short steady-state window (seconds).
- If your Nsight Systems version supports capture-range options, consider restricting capture to an NVTX range (see `nsys profile --help`).
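
To make the idle definition concrete, here is a minimal sketch of an idle percentage computed over the kernel time window, assuming kernel intervals are available as `(start_ns, end_ns)` pairs for one device. The function names are hypothetical and this is not the tool's actual code; it only illustrates why a long tail or warmup gap inflates idle%.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns) of one kernel on one device


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Union overlapping or touching intervals into disjoint busy spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def idle_fraction(intervals: Iterable[Interval]) -> float:
    """Idle share of the window from first kernel start to last kernel end."""
    merged = merge_intervals(intervals)
    if not merged:
        return 0.0
    window = merged[-1][1] - merged[0][0]
    if window <= 0:
        return 0.0
    busy = sum(end - start for start, end in merged)
    return 1.0 - busy / window


# Two overlapping kernels, a long gap, then one late kernel:
# busy = 20 + 20 = 40 ns over a 100 ns window.
example = [(0, 10), (5, 20), (80, 100)]
print(f"idle = {idle_fraction(example):.1%}")
```

A single late kernel stretches the window to its end timestamp, so everything between the steady-state work and that kernel counts as idle.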
- Top CUDA kernels: from `CUPTI_ACTIVITY_KIND_KERNEL` (GPU kernel intervals), names resolved via `StringIds`.
- Launch storm: kernel launches/sec + duration percentiles derived from kernel intervals.
- CPU↔GPU sync indicators: runtime/driver API call durations from `CUPTI_ACTIVITY_KIND_RUNTIME` filtered to sync-like calls (e.g., `cudaDeviceSynchronize`, `cudaStreamSynchronize`, waits).
- GPU idle gaps (estimate): per-device union of kernel intervals to estimate busy vs idle within the kernel time window.
- NVTX breakdown: `NVTX_EVENTS` rows with `end` timestamps summarized by range name; optional mapping of range names into phases via `--phase-map`.
- NVTX-attributed GPU kernel time (best-effort): if `correlationId` + `globalTid` are present, attributes GPU kernel time to enclosing NVTX ranges via NVTX→runtime→kernel correlation; can be disabled with `--no-nvtx-kernel-map`.
- Multi-process (PID) breakdown (best-effort): top PIDs by kernel time, per-PID top kernels, and per-PID sync-like calls when PID columns are available in the export.
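
As an illustration of the top-kernel aggregation, the sketch below runs a similar query with Python's standard `sqlite3` module. It assumes the kernel table exposes `start`, `end`, `deviceId`, and a `shortName` column referencing `StringIds.id`; real exports may differ by Nsight Systems version. `make_toy_db` is a hypothetical helper that builds a tiny stand-in database so the example runs without a trace, and none of this is the tool's actual code.

```python
import sqlite3


def top_kernels(conn: sqlite3.Connection, limit: int = 10):
    """Return (kernel_name, device_id, total_ms, calls), largest total first."""
    query = """
        SELECT s.value                        AS kernel_name,
               k.deviceId                     AS device_id,
               SUM(k."end" - k.start) / 1e6   AS total_ms,  -- ns to ms
               COUNT(*)                       AS calls
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        GROUP BY s.value, k.deviceId
        ORDER BY total_ms DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


def make_toy_db() -> sqlite3.Connection:
    """Hypothetical stand-in for trace.sqlite with only the assumed columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
            start INTEGER, "end" INTEGER, deviceId INTEGER, shortName INTEGER
        );
        INSERT INTO StringIds VALUES (1, 'gemm'), (2, 'softmax');
        INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES
            (0,       2000000, 0, 1),
            (3000000, 4000000, 0, 1),
            (5000000, 5500000, 0, 2);
        """
    )
    return conn


for row in top_kernels(make_toy_db()):
    print(row)
```

To try it on a real export, replace `make_toy_db()` with `sqlite3.connect("trace.sqlite")` and check the column names against your schema first.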
- Offline-only, trace-derived metrics only.
- Reports coverage/limitations instead of implying certainty.
- No benchmark claims or speedup promises.
- Kernel launch storm: classified using thresholds in `src/nsys_llm_explainer/heuristics.py` (high launches/sec + tiny median kernel).
- Dominant kernel:
  - If the top kernel is ≥ 50% of total kernel time, that is usually the first place to focus (single hotspot dominates).
- Sync calls:
  - Frequent `cudaDeviceSynchronize` / `cudaStreamSynchronize` / `cudaEventSynchronize` can indicate CPU↔GPU barriers that reduce overlap.
- CPU-bound signatures:
  - Large GPU idle gaps + many short kernels can be consistent with CPU scheduling/launch overhead or unnecessary synchronization.
- NVTX phase interpretation:
  - NVTX wall-time is host timing. NVTX-attributed GPU kernel time (if present) is best-effort correlation and must be interpreted with its reported coverage.
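
The launch-storm and dominant-kernel rules above can be sketched as two small predicates. The launch-storm thresholds here are hypothetical placeholders chosen for illustration; the real values live in `src/nsys_llm_explainer/heuristics.py` and may differ. Only the 50% dominant-kernel cutoff comes from the text above.

```python
from statistics import median
from typing import Dict, Optional, Sequence

# Hypothetical thresholds, for illustration only (not the tool's values).
LAUNCHES_PER_SEC_THRESHOLD = 50_000.0
MEDIAN_KERNEL_US_THRESHOLD = 10.0
# From the interpretation guide above.
DOMINANT_KERNEL_PCT = 50.0


def is_launch_storm(durations_us: Sequence[float], window_s: float) -> bool:
    """High launch rate combined with a tiny median kernel duration."""
    if not durations_us or window_s <= 0:
        return False
    launches_per_sec = len(durations_us) / window_s
    return (
        launches_per_sec >= LAUNCHES_PER_SEC_THRESHOLD
        and median(durations_us) <= MEDIAN_KERNEL_US_THRESHOLD
    )


def dominant_kernel(total_ms_by_kernel: Dict[str, float]) -> Optional[str]:
    """Name of the top kernel if it holds >= 50% of kernel time, else None."""
    total = sum(total_ms_by_kernel.values())
    if total <= 0:
        return None
    name, top_ms = max(total_ms_by_kernel.items(), key=lambda kv: kv[1])
    return name if 100.0 * top_ms / total >= DOMINANT_KERNEL_PCT else None


print(is_launch_storm([5.0] * 100_000, window_s=1.0))
print(dominant_kernel({"gemm": 60.0, "softmax": 40.0}))
```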
- Tested export: the committed example was captured/exported in the ASU environment; see `examples/a100_vllm/metadata.txt` for the recorded `nsys --version` output.
- Graceful degradation: the tool probes the SQLite schema at runtime and only emits sections it can compute from available tables/columns.
Key probes and fallbacks:
- String table: prefers `StringIds(id,value)`, falls back to any `id`+`value` table.
- Kernel activity: prefers `CUPTI_ACTIVITY_KIND_KERNEL`, falls back to `CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL`.
- Runtime API: prefers `CUPTI_ACTIVITY_KIND_RUNTIME`.
- NVTX: prefers `NVTX_EVENTS`.
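
A probe-and-fallback step like the one listed above might look roughly like this with the standard `sqlite3` module. The helper names are hypothetical and this is a sketch of the general pattern, not the tool's implementation.

```python
import sqlite3
from typing import Optional, Sequence, Set


def existing_tables(conn: sqlite3.Connection) -> Set[str]:
    """Names of all tables present in the export."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}


def first_available(
    conn: sqlite3.Connection, candidates: Sequence[str]
) -> Optional[str]:
    """Return the first candidate table that exists, in preference order."""
    tables = existing_tables(conn)
    for name in candidates:
        if name in tables:
            return name
    return None


def table_columns(conn: sqlite3.Connection, table: str) -> Set[str]:
    """Column names of a table. PRAGMA arguments cannot be bound as
    parameters, so `table` must come from a trusted list such as the
    result of first_available()."""
    return {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}


KERNEL_TABLES = (
    "CUPTI_ACTIVITY_KIND_KERNEL",
    "CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL",
)

# Toy export that only has the fallback kernel table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL "
    '(start INTEGER, "end" INTEGER, correlationId INTEGER)'
)
kernel_table = first_available(conn, KERNEL_TABLES)
print(kernel_table, sorted(table_columns(conn, kernel_table)))
print(first_available(conn, ("NVTX_EVENTS",)))  # None, so that section is skipped
```

Returning `None` instead of raising lets the caller skip a report section and record the missing table as a limitation.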
- Nsight Systems tables are created lazily; not all tables are present in every export. The tool probes schema at runtime and degrades gracefully.
- Timestamps are interpreted as nanoseconds (Nsight Systems CUPTI exports) and converted to ms/us. If an export uses a different unit scale, time-derived values will be wrong; the report warns when it cannot run a sanity check.
- Idle/busy is kernel-interval based (does not include non-kernel GPU work unless you extend it to include memcpy/memset workloads).
- NVTX phase attribution depends on NVTX being present, and on Nsight exporting `correlationId` / `globalTid` needed to correlate kernels back to NVTX ranges. Coverage may be partial.
- Per-PID sections depend on PID-bearing columns (`globalPid` / `globalTid` / `pid` / `processId`). The report will emit a warning if PID attribution looks ambiguous.
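
To show why attribution coverage can be partial, here is a simplified sketch of NVTX→runtime→kernel correlation over in-memory records. It assumes a kernel is attributed when its `correlationId` matches a runtime call that lies fully inside an NVTX range on the same `globalTid`, and that the innermost (shortest) enclosing range wins. Both rules and all class and function names are assumptions made for illustration; the tool's actual matching may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NvtxRange:
    name: str
    global_tid: int
    start: int
    end: int


@dataclass
class RuntimeCall:
    correlation_id: int
    global_tid: int
    start: int
    end: int


@dataclass
class Kernel:
    correlation_id: int
    start: int
    end: int


def attribute_kernels(
    ranges: List[NvtxRange], calls: List[RuntimeCall], kernels: List[Kernel]
) -> Tuple[Dict[str, int], float]:
    """Return (GPU ns per NVTX range name, coverage in percent)."""
    call_by_corr = {c.correlation_id: c for c in calls}
    gpu_ns_by_range: Dict[str, int] = {}
    attributed_ns = total_ns = 0
    for kernel in kernels:
        duration = kernel.end - kernel.start
        total_ns += duration
        call = call_by_corr.get(kernel.correlation_id)
        if call is None:
            continue  # no launching runtime call recorded
        enclosing = [
            r for r in ranges
            if r.global_tid == call.global_tid
            and r.start <= call.start and call.end <= r.end
        ]
        if not enclosing:
            continue  # launch happened outside every NVTX range
        innermost = min(enclosing, key=lambda r: r.end - r.start)
        gpu_ns_by_range[innermost.name] = (
            gpu_ns_by_range.get(innermost.name, 0) + duration
        )
        attributed_ns += duration
    coverage = 100.0 * attributed_ns / total_ns if total_ns else 0.0
    return gpu_ns_by_range, coverage


ranges = [NvtxRange("decode", global_tid=1, start=0, end=100)]
calls = [RuntimeCall(7, 1, 10, 20), RuntimeCall(8, 2, 10, 20)]
kernels = [Kernel(7, 30, 60), Kernel(8, 70, 80), Kernel(9, 90, 100)]
print(attribute_kernels(ranges, calls, kernels))
```

In this toy input only one of three kernels is launched from an annotated thread, so 30 of 50 ns are attributed. Launches from unannotated threads or without a recorded runtime call lower coverage in the same way.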
- Nsight Systems SQLite exporter schema reference: https://docs.nvidia.com/nsight-systems/ (see `nsys-exporter` docs for your version).
- Add a small “known-good schema” matrix per Nsight Systems version (tables/columns observed).
- Add optional inclusion of memcpy/memset activity into the “busy/idle” estimate.
- Add guidance for common inference stacks (vLLM, torch.compile, CUDA graphs) in the runbook.