feat: expert I/O microbenchmark suite with NVTX instrumentation (#89)
Closed
drunkcoding wants to merge 20 commits into dev
Conversation
added 20 commits on April 4, 2026 at 10:31
…hs; add Docker benchmark results

Vendored nvtx3.hpp from NVIDIA NVTX v3 (header-only) so builds work in environments where the CUDA toolkit doesn't include the nvtx3/ directory.

Docker benchmark results (DeepSeek-V2-Lite-Chat on A5000 24GB):
- Routing: mean=0.114ms, p50=0.115ms, p95=0.151ms (78 events / 3 iters)
- Sync overhead: 88.6% of step time spent in synchronization
- Pipeline bubble ratio: 87.15% (main loop stalled waiting for expert I/O)
- Transfer bandwidth: disk_to_cpu / cpu_to_gpu events only visible via C++ NVTX/nsys
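The routing figures above are aggregates over per-event timings from the JSONL profiler output. A minimal sketch of how such mean/p50/p95 statistics can be computed from JSONL lines — the field names `event` and `duration_ms` are assumptions for illustration, not necessarily the suite's actual schema:

```python
import json
import statistics

def routing_stats(jsonl_lines):
    """Aggregate routing-event durations from JSONL profiler output.

    Assumes each line is a JSON object with hypothetical fields
    'event' and 'duration_ms'; the real schema may differ.
    """
    durations = sorted(
        rec["duration_ms"]
        for rec in map(json.loads, jsonl_lines)
        if rec["event"] == "routing"
    )
    return {
        "mean": statistics.mean(durations),
        "p50": statistics.median(durations),
        # nearest-rank p95: element at index ceil(0.95 * n) - 1
        "p95": durations[max(0, -(-len(durations) * 95 // 100) - 1)],
    }
```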
… of bubble

Disk mode: 88.3% bubble (1017ms step, 899ms wait)
Host-only mode: 86.8% bubble (1028ms step, 892ms wait)
Delta: only 1.5% bubble reduction from removing disk I/O

Key finding: the existing prefetcher already hides disk latency. The dominant bottleneck is synchronization overhead, NOT disk reads.
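The bubble ratios above follow directly from the reported step and wait times; a small sketch of the arithmetic (function name is illustrative):

```python
def bubble_ratio(step_ms: float, wait_ms: float) -> float:
    """Fraction of the step spent stalled waiting on expert I/O."""
    return wait_ms / step_ms

# Figures from the commit message above.
disk = bubble_ratio(1017.0, 899.0)   # disk mode
host = bubble_ratio(1028.0, 892.0)   # host-only mode (no disk reads)

# Removing disk I/O shrinks the bubble by under two percentage points,
# so synchronization, not disk latency, dominates the stall.
delta_pp = (disk - host) * 100.0
```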
Merged via local fast-forward into dev after rebasing onto the current dev HEAD (48f169d). Conflict resolution:
Note: the benchmark numbers in the PR description are from the pre-event-sync codebase. Updated numbers need a Docker re-run, since dev now uses event-based sync instead of cudaStreamSynchronize (the top optimization target this PR identified).
Summary
- MOE_INFINITY_PROFILE_IO=1 environment variable with JSONL output

Docker Benchmark Results (DeepSeek-V2-Lite-Chat, A5000 24GB)
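A minimal sketch of an environment-gated JSONL profiler in the spirit of the MOE_INFINITY_PROFILE_IO=1 toggle above. The class name, field names, and default path here are illustrative — the real IOProfiler singleton in moe_infinity/profiling/ may differ:

```python
import json
import os
import time

class IOProfiler:
    """Append-only JSONL event log, enabled via an environment variable.

    A sketch only: the actual IOProfiler in moe_infinity/profiling/
    may use different names, fields, and buffering.
    """

    def __init__(self, path="io_profile.jsonl",
                 env_var="MOE_INFINITY_PROFILE_IO"):
        self.enabled = os.environ.get(env_var) == "1"
        self.path = path

    def record(self, event: str, duration_ms: float, **extra):
        if not self.enabled:
            return  # near-zero overhead when profiling is off
        rec = {"event": event, "ts": time.time(),
               "duration_ms": duration_ms, **extra}
        with open(self.path, "a") as f:
            f.write(json.dumps(rec) + "\n")
```

One line per event keeps the output trivially parseable (`json.loads` per line) and safe to append from long-running processes.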
Key Finding
87% of the inference step is idle time waiting for expert I/O (cache miss → disk→CPU→GPU transfer). This confirms the main loop bottleneck is expert fetching, with synchronization overhead consuming 88.6% of step time.
Optimization Targets (ranked by impact)
1. wait_expert() blocks until ALL experts are loaded. Pipelining expert compute with I/O would eliminate most of this.
2. cudaStreamSynchronize after cudaMemcpyAsync kills compute/transfer overlap.
3. .cpu().numpy() — a GPU→CPU sync for routing stats that fires every layer, every step.

Usage
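Target #1 can be illustrated with a pure-Python thread-pool sketch: instead of a global wait for all experts (the wait_expert() barrier), each expert's compute starts as soon as its own fetch lands, overlapping compute with the remaining I/O. All function names below are hypothetical stand-ins, not the dispatcher's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_expert(expert_id):
    """Stand-in for the disk->CPU->GPU expert load (hypothetical)."""
    return f"weights[{expert_id}]"

def compute_expert(expert_id, weights):
    """Stand-in for running one expert's FFN (hypothetical)."""
    return f"out[{expert_id}]<-{weights}"

def run_barrier(expert_ids):
    # Current behavior: wait for ALL experts, then compute.
    with ThreadPoolExecutor() as pool:
        weights = list(pool.map(fetch_expert, expert_ids))
    return [compute_expert(e, w) for e, w in zip(expert_ids, weights)]

def run_pipelined(expert_ids):
    # Target: start each expert's compute as soon as ITS fetch completes,
    # so compute overlaps the still-outstanding fetches.
    with ThreadPoolExecutor() as pool:
        futures = [(e, pool.submit(fetch_expert, e)) for e in expert_ids]
        return [compute_expert(e, f.result()) for e, f in futures]
```

Both variants produce the same results; the pipelined one just hides fetch latency behind compute instead of behind an idle barrier.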
Files Changed
- setup.py — NVTX_DISABLE build toggle + vendored nvtx3 include path
- core/include/nvtx3/nvtx3.hpp — Vendored NVTX v3 header (header-only, ~100KB)
- core/parallel/expert_dispatcher.{cpp,h} — C++ NVTX: expert_enqueue, gpu_fetch, expert_compute, expert_wait_barrier
- core/model/model_topology.{cpp,h} — C++ NVTX: disk_to_cpu, cpu_to_gpu, gpu_to_cpu, cuda_stream_sync
- core/prefetch/task_scheduler.cpp — C++ NVTX: task_queue_push/pop/execute
- core/aio/archer_prio_aio_handle.{cpp,h} — C++ NVTX: aio_read/write_submit, aio_wait
- moe_infinity/profiling/ — IOProfiler singleton + nvtx_utils helper
- moe_infinity/memory/expert_tracer.py — Extended with I/O event tracking
- moe_infinity/distributed/expert_executor.py — Python NVTX routing/dispatch/wait
- moe_infinity/memory/expert_{predictor,prefetcher}.py — Python NVTX prefetch
- moe_infinity/engine/{unified_transfer_scheduler,expert_offload_coordinator}.py — Python NVTX scheduler/cache
- benchmarks/expert_io_microbench/ — Full benchmark suite (6 scripts + stats harness + README)
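The nvtx_utils helper listed above suggests a thin wrapper over NVTX ranges that degrades gracefully when NVTX is unavailable. A sketch of such a fallback, using torch.cuda.nvtx's range_push/range_pop (a real PyTorch API) when usable and a no-op otherwise; the actual helper's interface may differ:

```python
from contextlib import contextmanager

try:
    # torch.cuda.nvtx provides range_push/range_pop for nsys timelines.
    from torch.cuda import nvtx as _nvtx
except ImportError:  # no PyTorch installed: ranges become no-ops
    _nvtx = None

@contextmanager
def nvtx_range(name: str):
    """Emit an NVTX range around a block, or do nothing if NVTX is
    unavailable. Sketch of a helper like nvtx_utils; the real
    helper's interface may differ."""
    pushed = False
    if _nvtx is not None:
        try:
            _nvtx.range_push(name)
            pushed = True
        except Exception:  # CPU-only torch builds stub NVTX out
            pass
    try:
        yield
    finally:
        if pushed:
            _nvtx.range_pop()
```

Wrapped code runs identically with or without profiling, which is what lets the same Python instrumentation ship in both the NVTX_DISABLE and instrumented builds:

```python
with nvtx_range("expert_fetch"):
    pass  # fetch logic here
```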