
feat(benchmarks): add Dockerized comparison benchmark suite (vLLM, llama.cpp, gpt-oss-20b) #88

Merged

drunkcoding merged 14 commits into dev from feat/benchmark-comparison on Apr 6, 2026

Conversation

@drunkcoding
Contributor

Summary

Adds a complete, reproducible benchmark comparison suite for measuring MoE-Infinity's per-token latency against vLLM v0.18.1 and llama.cpp b8640 across 4 MoE models on a single 24GB GPU.

What's new

Benchmark scripts (benchmarks/comparison/):

  • common.py — Shared types (BenchmarkConfig, BenchmarkResult), 4-model registry including openai/gpt-oss-20b, 20-prompt fixed dataset, save_result/load_results utilities
  • run_moe_infinity.py — MoE-Infinity benchmark (FP16, expert offloading, StopWatch timing)
  • run_vllm.py — vLLM v0.18.1 benchmark (FP8 quantization for large models, graceful exit codes)
  • run_llamacpp.py — llama.cpp b8640 benchmark (Q4_K_M GGUF, GPU offloading via -ngl)
  • aggregate_results.py — Combines JSON results into a Markdown comparison table
  • run_all.sh — Orchestrator: builds Docker images, runs frameworks sequentially with 60s thermal cooldown, aggregates results
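The shared pieces in common.py might look roughly like the sketch below. This is a hypothetical reconstruction from the summary above: the field names, the filename scheme, and the `per_token_latency_ms` metric name are assumptions, not the PR's actual definitions.

```python
# Hypothetical sketch of the shared types in benchmarks/comparison/common.py;
# the actual field names and file layout in the PR may differ.
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class BenchmarkConfig:
    framework: str         # e.g. "moe-infinity", "vllm", "llamacpp"
    model: str             # e.g. "openai/gpt-oss-20b"
    precision: str         # "FP16", "FP8", or "Q4_K_M"
    num_prompts: int = 20  # the fixed 20-prompt dataset

@dataclass
class BenchmarkResult:
    config: BenchmarkConfig
    per_token_latency_ms: float

def save_result(result: BenchmarkResult, results_dir: Path) -> Path:
    """Write one result as JSON so aggregate_results.py can pick it up later."""
    results_dir.mkdir(parents=True, exist_ok=True)
    name = f"{result.config.framework}_{result.config.model.replace('/', '_')}.json"
    out = results_dir / name
    out.write_text(json.dumps(asdict(result), indent=2))
    return out
```

One JSON file per (framework, model) pair keeps each Docker run independent: a crashed container loses only its own result, and aggregation is a pure read of the results directory.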

Docker isolation (benchmarks/comparison/):

  • Dockerfile.vllm — wraps vllm/vllm-openai:v0.18.1
  • Dockerfile.llamacpp — wraps ghcr.io/ggml-org/llama.cpp:server-cuda
  • MoE-Infinity uses existing docker/Dockerfile
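Since the vLLM and llama.cpp Dockerfiles wrap existing upstream images, they can stay very thin. The fragment below is a sketch of what such a wrapper might contain; the copied paths and entrypoint are assumptions, not the PR's actual Dockerfile contents.

```dockerfile
# Hypothetical sketch of Dockerfile.vllm — a thin wrapper over the upstream image.
# Actual file paths and entrypoint in the PR may differ.
FROM vllm/vllm-openai:v0.18.1
COPY benchmarks/comparison/common.py benchmarks/comparison/run_vllm.py /workspace/
ENTRYPOINT ["python3", "/workspace/run_vllm.py"]
```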

Documentation:

  • docs/benchmark_reproduction.md — Step-by-step reproduction guide (prerequisites, quick start, per-framework setup, troubleshooting, extending)
  • README.md — Updated performance table with new structure: 4 models (incl. gpt-oss-20b), 3 frameworks with precision labels (FP16/FP8/Q4_K_M), em-dash placeholders, legacy table in collapsible section

Models benchmarked

| Framework | deepseek-v2-lite | mixtral-8x7b | qwen3-30b | gpt-oss-20b |
|---|---|---|---|---|
| MoE-Infinity | FP16 + offloading | FP16 + offloading | FP16 + offloading | FP16 + offloading |
| vLLM v0.18.1 | FP8 | FP8 | FP8 | X (unsupported) |
| llama.cpp b8640 | Q4_K_M | Q4_K_M | Q4_K_M | X (no GGUF) |

Running benchmarks

```bash
# A5000 GPU detected — infrastructure is ready
bash benchmarks/comparison/run_all.sh \
    --offload-dir /path/to/ssd/offload \
    --results-dir benchmarks/comparison/results/
```

Takes ~2-3 hours. Results saved as JSON + Markdown table.
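The JSON-to-Markdown aggregation step could be sketched as below. This is illustrative only: the PR's aggregate_results.py may use different column names and layout, and the JSON keys here are assumed, not taken from the actual code.

```python
# Hypothetical sketch of how aggregate_results.py might fold per-run JSON
# files into one Markdown comparison table; real column layout may differ.
import json
from pathlib import Path

def aggregate(results_dir: Path) -> str:
    """Read every *.json result in the directory and emit a Markdown table."""
    rows = [json.loads(p.read_text()) for p in sorted(results_dir.glob("*.json"))]
    lines = [
        "| Framework | Model | Precision | Per-token latency (ms) |",
        "|---|---|---|---|",
    ]
    for r in rows:
        c = r["config"]
        lines.append(
            f"| {c['framework']} | {c['model']} | {c['precision']} "
            f"| {r['per_token_latency_ms']:.1f} |"
        )
    return "\n".join(lines)
```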

Key design decisions

  • Docker per framework: each framework runs in its own container to avoid CUDA/PyTorch version conflicts
  • Precision transparency: precision column shows what each framework sacrifices to fit on 24GB (FP8/Q4_K_M vs MoE-Infinity's FP16)
  • SGLang excluded: it cannot serve these MoE models on a single 24GB GPU (no expert offloading)
  • gpt-oss-20b added: competitors get "X" since no vLLM support or GGUF available
  • Exit codes standardized: 0=success, 2=not supported, 3=OOM, 4=GGUF unavailable

Verification

  • All 4 final-wave reviewers: APPROVE (F1 plan compliance, F2 code quality, F3 manual QA, F4 scope fidelity)
  • Python 3.9.13 compatible (verified: AST parse + import + --help all pass)
  • All Docker images build successfully and --help works in container

@drunkcoding drunkcoding merged commit ce2393c into dev Apr 6, 2026
8 checks passed