
feat: expert I/O microbenchmark suite with NVTX instrumentation#89

Closed
drunkcoding wants to merge 20 commits into dev from feat/expert-io-microbench

Conversation

@drunkcoding
Contributor

Summary

  • 10-component microbenchmark suite for expert I/O overhead decomposition
  • In-situ profiling via MOE_INFINITY_PROFILE_IO=1 environment variable with JSONL output
  • C++ NVTX v3 annotations in ExpertDispatcher, Node::SetDevice, ArcherTaskPool, and AIO handle
  • Python NVTX annotations in routing, dispatch, cache, prefetch, and scheduler paths
  • Extended ExpertTracer with I/O event ring buffer tracking
  • Integration runner with bandwidth analysis and executive summary
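To make the in-situ profiling path concrete, here is a minimal sketch of an env-var-gated JSONL event logger in the spirit of the `IOProfiler` singleton above. The class internals (field names, `record` signature) are illustrative assumptions, not the PR's actual implementation; only the two environment variables come from the PR.

```python
import json
import os
import time


class IOProfiler:
    """Illustrative sketch: JSONL I/O-event logger gated by MOE_INFINITY_PROFILE_IO."""

    _instance = None

    def __init__(self):
        self.enabled = os.environ.get("MOE_INFINITY_PROFILE_IO") == "1"
        self.path = os.environ.get("MOE_INFINITY_PROFILE_IO_OUT", "profile.jsonl")

    @classmethod
    def get(cls):
        # Singleton accessor so instrumentation sites share one writer.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def record(self, event, **fields):
        # One JSON object per line; cheap no-op when profiling is off.
        if not self.enabled:
            return
        fields.update(event=event, ts=time.monotonic())
        with open(self.path, "a") as f:
            f.write(json.dumps(fields) + "\n")


# Hypothetical call site, e.g. inside a transfer path:
# IOProfiler.get().record("cpu_to_gpu", expert_id=3, bytes=4096, ms=1.2)
```

JSONL keeps the writer append-only and crash-tolerant, and the output can be post-processed with any line-by-line JSON reader.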

Docker Benchmark Results (DeepSeek-V2-Lite-Chat, A5000 24GB)

Component              Metric                      Value
Routing overhead       mean / p50 / p95            0.114 ms / 0.115 ms / 0.151 ms
Sync overhead          % of step time              88.6%
Pipeline bubble ratio  expert_wait / step_total    87.15%
Step decomposition     total step time (mean)      947.5 ms
Step decomposition     expert wait time (mean)     825.7 ms
Step decomposition     non-wait compute (mean)     121.8 ms

Key Finding

87% of the inference step is idle time spent waiting for expert I/O (cache miss → disk → CPU → GPU transfer). This confirms that the main-loop bottleneck is expert fetching, with synchronization overhead consuming 88.6% of step time.
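The bubble ratio follows directly from the step decomposition in the table above:

```python
# Sanity-check the reported decomposition.
step_total = 947.5    # ms, mean total step time
expert_wait = 825.7   # ms, mean time blocked on expert I/O

bubble_ratio = expert_wait / step_total
non_wait_compute = step_total - expert_wait

print(f"bubble ratio: {bubble_ratio:.2%}")             # 87.15%
print(f"non-wait compute: {non_wait_compute:.1f} ms")  # 121.8 ms
```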

Optimization Targets (ranked by impact)

  1. Sync barrier (88.6%) — wait_expert() blocks until ALL experts are loaded. Pipelining expert compute with I/O would eliminate most of this.
  2. SetDevice strong sync — `cudaStreamSynchronize` after `cudaMemcpyAsync` kills compute/transfer overlap.
  3. dispatch_local .cpu().numpy() — GPU→CPU sync for routing stats fires every layer every step.
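For target 3, a common mitigation is to accumulate routing statistics on-device and copy them to the host only occasionally, so the GPU→CPU sync no longer fires on every layer of every step. A pure-Python sketch of the batching pattern (plain lists stand in for GPU tensors; `RoutingStats` and its parameters are hypothetical, not code from this PR):

```python
class RoutingStats:
    """Batch host-side flushes of per-expert routing counts.

    `counts` stands in for a GPU tensor; in PyTorch, flush() would be
    the only place calling .cpu().numpy(), rather than every layer.
    """

    def __init__(self, num_experts, flush_every=100):
        self.counts = [0] * num_experts  # on-device accumulator
        self.flush_every = flush_every
        self.updates = 0
        self.host_snapshots = []         # host-side copies

    def update(self, expert_ids):
        # Per-layer update stays on-device (e.g. a scatter_add_).
        for e in expert_ids:
            self.counts[e] += 1
        self.updates += 1
        if self.updates % self.flush_every == 0:
            self.flush()

    def flush(self):
        # The single GPU -> CPU sync point.
        self.host_snapshots.append(list(self.counts))
```

The trade-off is stats freshness: with `flush_every=100` the host view lags by up to 100 layer updates, which is usually acceptable for monitoring-style routing statistics.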

Usage

# Full suite
python benchmarks/expert_io_microbench/run_all.py \
  --model deepseek-ai/DeepSeek-V2-Lite-Chat \
  --offload-dir /path/to/offload \
  --output-json results.json

# In-situ profiling during real inference
MOE_INFINITY_PROFILE_IO=1 MOE_INFINITY_PROFILE_IO_OUT=profile.jsonl \
  python examples/interface_example.py ...

# NVTX in Nsight Systems
nsys profile python benchmarks/expert_io_microbench/run_all.py --scenario routing ...
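On the Python side, ranges that show up in the Nsight Systems timeline can be emitted with the `nvtx` package's `annotate` context manager (a cheap no-op when no profiler is attached). A sketch, with a fallback so it also runs without `nvtx` installed — `route_tokens` is a toy stand-in, not this repo's router:

```python
try:
    import nvtx  # pip install nvtx; ranges appear in `nsys profile` timelines
    annotate = nvtx.annotate
except ImportError:
    # No-op fallback so the code still runs without nvtx installed.
    from contextlib import contextmanager

    @contextmanager
    def annotate(message=None, color=None):
        yield


def route_tokens(tokens, num_experts=4):
    # Toy router: map each token id onto an expert index.
    with annotate("routing", color="blue"):
        return [t % num_experts for t in tokens]
```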

Files Changed

  • setup.py — NVTX_DISABLE build toggle + vendored nvtx3 include path
  • core/include/nvtx3/nvtx3.hpp — Vendored NVTX v3 header (header-only, ~100KB)
  • core/parallel/expert_dispatcher.{cpp,h} — C++ NVTX: expert_enqueue, gpu_fetch, expert_compute, expert_wait_barrier
  • core/model/model_topology.{cpp,h} — C++ NVTX: disk_to_cpu, cpu_to_gpu, gpu_to_cpu, cuda_stream_sync
  • core/prefetch/task_scheduler.cpp — C++ NVTX: task_queue_push/pop/execute
  • core/aio/archer_prio_aio_handle.{cpp,h} — C++ NVTX: aio_read/write_submit, aio_wait
  • moe_infinity/profiling/ — IOProfiler singleton + nvtx_utils helper
  • moe_infinity/memory/expert_tracer.py — Extended with I/O event tracking
  • moe_infinity/distributed/expert_executor.py — Python NVTX routing/dispatch/wait
  • moe_infinity/memory/expert_{predictor,prefetcher}.py — Python NVTX prefetch
  • moe_infinity/engine/{unified_transfer_scheduler,expert_offload_coordinator}.py — Python NVTX scheduler/cache
  • benchmarks/expert_io_microbench/ — Full benchmark suite (6 scripts + stats harness + README)

xly added 20 commits April 4, 2026 10:31
…hs; add Docker benchmark results

Vendored nvtx3.hpp from NVIDIA NVTX v3 (header-only) so builds work
in environments where CUDA toolkit doesn't include nvtx3/ directory.

Docker benchmark results (DeepSeek-V2-Lite-Chat on A5000 24GB):
- Routing: mean=0.114ms p50=0.115ms p95=0.151ms (78 events/3 iters)
- Sync overhead: 88.6% of step time spent in synchronization
- Pipeline bubble ratio: 87.15% (main loop stalled waiting for expert I/O)
- Transfer bandwidth: disk_to_cpu/cpu_to_gpu events only visible via C++ NVTX/nsys
… of bubble

Disk mode:      88.3% bubble (1017ms step, 899ms wait)
Host-only mode: 86.8% bubble (1028ms step, 892ms wait)
Delta:          only 1.5% bubble reduction removing disk I/O

Key finding: existing prefetcher already hides disk latency.
The dominant bottleneck is synchronization overhead, NOT disk reads.
@drunkcoding drunkcoding changed the base branch from main to dev April 5, 2026 11:18
@drunkcoding
Contributor Author

Merged via local fast-forward into dev after rebasing onto current dev HEAD (48f169d).

Conflict resolution:

  • core/model/model_topology.cpp/.h: kept event-based sync from dev, added NVTX scoped ranges around new paths
  • moe_infinity/distributed/expert_executor.py: kept speculative prefetch plumbing from dev, added NVTX/profiler instrumentation
  • moe_infinity/memory/expert_prefetcher.py: kept speculative/correct prefetch from dev, added NVTX/profiler wrapping

Note: benchmark numbers in PR description are from the pre-event-sync codebase. Updated numbers need a Docker re-run since dev now uses event-based sync instead of cudaStreamSynchronize (the top optimization target this PR identified).

@drunkcoding drunkcoding closed this Apr 6, 2026