
feat(benchmarks): add Dockerized comparison benchmark suite (vLLM, llama.cpp, gpt-oss-20b) #88

Merged

drunkcoding merged 14 commits into dev from feat/benchmark-comparison on Apr 6, 2026

Conversation

@drunkcoding
Contributor

Summary

Adds a complete, reproducible benchmark comparison suite for measuring MoE-Infinity's per-token latency against vLLM v0.18.1 and llama.cpp b8640 across 4 MoE models on a single 24GB GPU.

What's new

Benchmark scripts (benchmarks/comparison/):

  • common.py — Shared types (BenchmarkConfig, BenchmarkResult), 4-model registry including openai/gpt-oss-20b, 20-prompt fixed dataset, save_result/load_results utilities
  • run_moe_infinity.py — MoE-Infinity benchmark (FP16, expert offloading, StopWatch timing)
  • run_vllm.py — vLLM v0.18.1 benchmark (FP8 quantization for large models, graceful exit codes)
  • run_llamacpp.py — llama.cpp b8640 benchmark (Q4_K_M GGUF, GPU offloading via -ngl)
  • aggregate_results.py — Combines JSON results into a Markdown comparison table
  • run_all.sh — Orchestrator: builds Docker images, runs frameworks sequentially with 60s thermal cooldown, aggregates results
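The shared pieces in common.py might look roughly like the sketch below. This is a hypothetical reconstruction from the summary above: the field names, the filename scheme, and the `per_token_latency_ms` metric name are assumptions, not the PR's actual definitions.

```python
# Hypothetical sketch of the shared types in benchmarks/comparison/common.py;
# the actual field names and file layout in the PR may differ.
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class BenchmarkConfig:
    framework: str         # e.g. "moe-infinity", "vllm", "llamacpp"
    model: str             # e.g. "openai/gpt-oss-20b"
    precision: str         # "FP16", "FP8", or "Q4_K_M"
    num_prompts: int = 20  # the fixed 20-prompt dataset

@dataclass
class BenchmarkResult:
    config: BenchmarkConfig
    per_token_latency_ms: float

def save_result(result: BenchmarkResult, results_dir: Path) -> Path:
    """Write one result as JSON so aggregate_results.py can pick it up later."""
    results_dir.mkdir(parents=True, exist_ok=True)
    name = f"{result.config.framework}_{result.config.model.replace('/', '_')}.json"
    out = results_dir / name
    out.write_text(json.dumps(asdict(result), indent=2))
    return out
```

One JSON file per (framework, model) pair keeps each Docker run independent: a crashed container loses only its own result, and aggregation is a pure read of the results directory.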

Docker isolation (benchmarks/comparison/):

  • Dockerfile.vllm — wraps vllm/vllm-openai:v0.18.1
  • Dockerfile.llamacpp — wraps ghcr.io/ggml-org/llama.cpp:server-cuda
  • MoE-Infinity uses existing docker/Dockerfile
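Since the vLLM and llama.cpp Dockerfiles wrap existing upstream images, they can stay very thin. The fragment below is a sketch of what such a wrapper might contain; the copied paths and entrypoint are assumptions, not the PR's actual Dockerfile contents.

```dockerfile
# Hypothetical sketch of Dockerfile.vllm — a thin wrapper over the upstream image.
# Actual file paths and entrypoint in the PR may differ.
FROM vllm/vllm-openai:v0.18.1
COPY benchmarks/comparison/common.py benchmarks/comparison/run_vllm.py /workspace/
ENTRYPOINT ["python3", "/workspace/run_vllm.py"]
```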

Documentation:

  • docs/benchmark_reproduction.md — Step-by-step reproduction guide (prerequisites, quick start, per-framework setup, troubleshooting, extending)
  • README.md — Updated performance table with new structure: 4 models (incl. gpt-oss-20b), 3 frameworks with precision labels (FP16/FP8/Q4_K_M), em-dash placeholders, legacy table in collapsible section

Models benchmarked

| Framework | deepseek-v2-lite | mixtral-8x7b | qwen3-30b | gpt-oss-20b |
|---|---|---|---|---|
| MoE-Infinity | FP16 + offloading | FP16 + offloading | FP16 + offloading | FP16 + offloading |
| vLLM v0.18.1 | FP8 | FP8 | FP8 | X (unsupported) |
| llama.cpp b8640 | Q4_K_M | Q4_K_M | Q4_K_M | X (no GGUF) |

Running benchmarks

```bash
# A5000 GPU detected — infrastructure is ready
bash benchmarks/comparison/run_all.sh \
    --offload-dir /path/to/ssd/offload \
    --results-dir benchmarks/comparison/results/
```

Takes ~2-3 hours. Results saved as JSON + Markdown table.
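The JSON-to-Markdown aggregation step could be sketched as below. This is illustrative only: the PR's aggregate_results.py may use different column names and layout, and the JSON keys here are assumed, not taken from the actual code.

```python
# Hypothetical sketch of how aggregate_results.py might fold per-run JSON
# files into one Markdown comparison table; real column layout may differ.
import json
from pathlib import Path

def aggregate(results_dir: Path) -> str:
    """Read every *.json result in the directory and emit a Markdown table."""
    rows = [json.loads(p.read_text()) for p in sorted(results_dir.glob("*.json"))]
    lines = [
        "| Framework | Model | Precision | Per-token latency (ms) |",
        "|---|---|---|---|",
    ]
    for r in rows:
        c = r["config"]
        lines.append(
            f"| {c['framework']} | {c['model']} | {c['precision']} "
            f"| {r['per_token_latency_ms']:.1f} |"
        )
    return "\n".join(lines)
```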

Key design decisions

  • Docker per framework: each framework runs in its own container to avoid CUDA/PyTorch version conflicts
  • Precision transparency: precision column shows what each framework sacrifices to fit on 24GB (FP8/Q4_K_M vs MoE-Infinity's FP16)
  • SGLang excluded: it cannot serve these MoE models on a single 24GB GPU (no expert offloading)
  • gpt-oss-20b added: competitors get "X" since no vLLM support or GGUF available
  • Exit codes standardized: 0=success, 2=not supported, 3=OOM, 4=GGUF unavailable

Verification

  • All 4 final-wave reviewers: APPROVE (F1 plan compliance, F2 code quality, F3 manual QA, F4 scope fidelity)
  • Python 3.9.13 compatible (verified: AST parse + import + --help all pass)
  • All Docker images build successfully and --help works in container

@drunkcoding drunkcoding merged commit ce2393c into dev Apr 6, 2026
8 checks passed