
feat: expert I/O microbenchmark suite with NVTX instrumentation#89

Closed
drunkcoding wants to merge 20 commits into dev from feat/expert-io-microbench

Conversation

@drunkcoding
Contributor

Summary

  • 10-component microbenchmark suite for expert I/O overhead decomposition
  • In-situ profiling via MOE_INFINITY_PROFILE_IO=1 environment variable with JSONL output
  • C++ NVTX v3 annotations in ExpertDispatcher, Node::SetDevice, ArcherTaskPool, and AIO handle
  • Python NVTX annotations in routing, dispatch, cache, prefetch, and scheduler paths
  • Extended ExpertTracer with I/O event ring buffer tracking
  • Integration runner with bandwidth analysis and executive summary
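To make the in-situ profiling path concrete, here is a minimal sketch of an env-var-gated JSONL event logger in the spirit of the `IOProfiler` singleton above. The class internals (field names, `record` signature) are illustrative assumptions, not the PR's actual implementation; only the two environment variables come from the PR.

```python
import json
import os
import time


class IOProfiler:
    """Illustrative sketch: JSONL I/O-event logger gated by MOE_INFINITY_PROFILE_IO."""

    _instance = None

    def __init__(self):
        self.enabled = os.environ.get("MOE_INFINITY_PROFILE_IO") == "1"
        self.path = os.environ.get("MOE_INFINITY_PROFILE_IO_OUT", "profile.jsonl")

    @classmethod
    def get(cls):
        # Singleton accessor so instrumentation sites share one writer.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def record(self, event, **fields):
        # One JSON object per line; cheap no-op when profiling is off.
        if not self.enabled:
            return
        fields.update(event=event, ts=time.monotonic())
        with open(self.path, "a") as f:
            f.write(json.dumps(fields) + "\n")


# Hypothetical call site, e.g. inside a transfer path:
# IOProfiler.get().record("cpu_to_gpu", expert_id=3, bytes=4096, ms=1.2)
```

JSONL keeps the writer append-only and crash-tolerant, and the output can be post-processed with any line-by-line JSON reader.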

Docker Benchmark Results (DeepSeek-V2-Lite-Chat, A5000 24GB)

Component              Metric                      Value
Routing overhead       mean / p50 / p95            0.114 ms / 0.115 ms / 0.151 ms
Sync overhead          % of step time              88.6%
Pipeline bubble ratio  expert_wait / step_total    87.15%
Step decomposition     total step time (mean)      947.5 ms
Step decomposition     expert wait time (mean)     825.7 ms
Step decomposition     non-wait compute (mean)     121.8 ms

Key Finding

87% of the inference step is idle time spent waiting for expert I/O (cache miss → disk → CPU → GPU transfer). This confirms that the main-loop bottleneck is expert fetching, with synchronization overhead consuming 88.6% of step time.
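The bubble ratio follows directly from the step decomposition in the table above:

```python
# Sanity-check the reported decomposition.
step_total = 947.5    # ms, mean total step time
expert_wait = 825.7   # ms, mean time blocked on expert I/O

bubble_ratio = expert_wait / step_total
non_wait_compute = step_total - expert_wait

print(f"bubble ratio: {bubble_ratio:.2%}")             # 87.15%
print(f"non-wait compute: {non_wait_compute:.1f} ms")  # 121.8 ms
```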

Optimization Targets (ranked by impact)

  1. Sync barrier (88.6%) — wait_expert() blocks until ALL experts are loaded. Pipelining expert compute with I/O would eliminate most of this.
  2. SetDevice strong sync — `cudaStreamSynchronize` after `cudaMemcpyAsync` kills compute/transfer overlap.
  3. dispatch_local .cpu().numpy() — GPU→CPU sync for routing stats fires every layer every step.
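For target 3, a common mitigation is to accumulate routing statistics on-device and copy them to the host only occasionally, so the GPU→CPU sync no longer fires on every layer of every step. A pure-Python sketch of the batching pattern (plain lists stand in for GPU tensors; `RoutingStats` and its parameters are hypothetical, not code from this PR):

```python
class RoutingStats:
    """Batch host-side flushes of per-expert routing counts.

    `counts` stands in for a GPU tensor; in PyTorch, flush() would be
    the only place calling .cpu().numpy(), rather than every layer.
    """

    def __init__(self, num_experts, flush_every=100):
        self.counts = [0] * num_experts  # on-device accumulator
        self.flush_every = flush_every
        self.updates = 0
        self.host_snapshots = []         # host-side copies

    def update(self, expert_ids):
        # Per-layer update stays on-device (e.g. a scatter_add_).
        for e in expert_ids:
            self.counts[e] += 1
        self.updates += 1
        if self.updates % self.flush_every == 0:
            self.flush()

    def flush(self):
        # The single GPU -> CPU sync point.
        self.host_snapshots.append(list(self.counts))
```

The trade-off is stats freshness: with `flush_every=100` the host view lags by up to 100 layer updates, which is usually acceptable for monitoring-style routing statistics.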

Usage

# Full suite
python benchmarks/expert_io_microbench/run_all.py \
  --model deepseek-ai/DeepSeek-V2-Lite-Chat \
  --offload-dir /path/to/offload \
  --output-json results.json

# In-situ profiling during real inference
MOE_INFINITY_PROFILE_IO=1 MOE_INFINITY_PROFILE_IO_OUT=profile.jsonl \
  python examples/interface_example.py ...

# NVTX in Nsight Systems
nsys profile python benchmarks/expert_io_microbench/run_all.py --scenario routing ...
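On the Python side, ranges that show up in the Nsight Systems timeline can be emitted with the `nvtx` package's `annotate` context manager (a cheap no-op when no profiler is attached). A sketch, with a fallback so it also runs without `nvtx` installed — `route_tokens` is a toy stand-in, not this repo's router:

```python
try:
    import nvtx  # pip install nvtx; ranges appear in `nsys profile` timelines
    annotate = nvtx.annotate
except ImportError:
    # No-op fallback so the code still runs without nvtx installed.
    from contextlib import contextmanager

    @contextmanager
    def annotate(message=None, color=None):
        yield


def route_tokens(tokens, num_experts=4):
    # Toy router: map each token id onto an expert index.
    with annotate("routing", color="blue"):
        return [t % num_experts for t in tokens]
```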

Files Changed

  • setup.py — NVTX_DISABLE build toggle + vendored nvtx3 include path
  • core/include/nvtx3/nvtx3.hpp — Vendored NVTX v3 header (header-only, ~100KB)
  • core/parallel/expert_dispatcher.{cpp,h} — C++ NVTX: expert_enqueue, gpu_fetch, expert_compute, expert_wait_barrier
  • core/model/model_topology.{cpp,h} — C++ NVTX: disk_to_cpu, cpu_to_gpu, gpu_to_cpu, cuda_stream_sync
  • core/prefetch/task_scheduler.cpp — C++ NVTX: task_queue_push/pop/execute
  • core/aio/archer_prio_aio_handle.{cpp,h} — C++ NVTX: aio_read/write_submit, aio_wait
  • moe_infinity/profiling/ — IOProfiler singleton + nvtx_utils helper
  • moe_infinity/memory/expert_tracer.py — Extended with I/O event tracking
  • moe_infinity/distributed/expert_executor.py — Python NVTX routing/dispatch/wait
  • moe_infinity/memory/expert_{predictor,prefetcher}.py — Python NVTX prefetch
  • moe_infinity/engine/{unified_transfer_scheduler,expert_offload_coordinator}.py — Python NVTX scheduler/cache
  • benchmarks/expert_io_microbench/ — Full benchmark suite (6 scripts + stats harness + README)

xly added 20 commits April 4, 2026 10:31
…hs; add Docker benchmark results

Vendored nvtx3.hpp from NVIDIA NVTX v3 (header-only) so builds work
in environments where CUDA toolkit doesn't include nvtx3/ directory.

Docker benchmark results (DeepSeek-V2-Lite-Chat on A5000 24GB):
- Routing: mean=0.114ms p50=0.115ms p95=0.151ms (78 events/3 iters)
- Sync overhead: 88.6% of step time spent in synchronization
- Pipeline bubble ratio: 87.15% (main loop stalled waiting for expert I/O)
- Transfer bandwidth: disk_to_cpu/cpu_to_gpu events only visible via C++ NVTX/nsys
… of bubble

Disk mode:      88.3% bubble (1017ms step, 899ms wait)
Host-only mode: 86.8% bubble (1028ms step, 892ms wait)
Delta:          only 1.5% bubble reduction removing disk I/O

Key finding: existing prefetcher already hides disk latency.
The dominant bottleneck is synchronization overhead, NOT disk reads.
@drunkcoding drunkcoding changed the base branch from main to dev April 5, 2026 11:18
@drunkcoding
Contributor Author

Merged via local fast-forward into dev after rebasing onto current dev HEAD (48f169d).

Conflict resolution:

  • core/model/model_topology.cpp/.h: kept event-based sync from dev, added NVTX scoped ranges around new paths
  • moe_infinity/distributed/expert_executor.py: kept speculative prefetch plumbing from dev, added NVTX/profiler instrumentation
  • moe_infinity/memory/expert_prefetcher.py: kept speculative/correct prefetch from dev, added NVTX/profiler wrapping

Note: benchmark numbers in PR description are from the pre-event-sync codebase. Updated numbers need a Docker re-run since dev now uses event-based sync instead of cudaStreamSynchronize (the top optimization target this PR identified).

@drunkcoding drunkcoding closed this Apr 6, 2026