
docs: add MTP speculative decoding benchmark results (M5 Pro 64GB)#105

Merged
solderzzc merged 3 commits into main from docs/mtp-benchmark-results
May 13, 2026

Conversation

@solderzzc
Member

Adds fresh benchmark results for MTP speculative decoding on Gemma 4-26B-A4B-it-4bit, measured locally on an M5 Pro 64 GB MacBook Pro.

Results Summary

| Configuration | Avg TPS | 100K TTFT | GPU Alloc @ 40K |
|---|---|---|---|
| Baseline | 43.6 tok/s | 63.11s | 54.8 GB |
| MTP Speculative | 46.3 tok/s (+6%) | 55.17s | 54.6 GB |
| MTP + TurboQuant | 66.5 tok/s (+53%) | 33.95s | 23.9 GB |

Key highlights:

  • 🚀 +53% avg throughput with MTP + TurboQuant
  • 🏎️ Nearly 2× faster TTFT at 100K context (33.95s vs 63.11s)
  • 💾 56% less GPU memory at 40K context (23.9 GB vs 54.8 GB)
  • 🔬 MTP alone costs zero extra memory while improving TTFT

Gemma 4-26B-A4B benchmarks across Baseline / MTP Speculative / MTP+TurboQuant:
- MTP + TurboQuant: 66.5 tok/s avg (+53% vs baseline)
- TTFT at 100K context: 33.95s vs 63.11s (-46%)
- GPU alloc at 40K context: 23.9 GB vs 54.8 GB (-56%)
- MTP alone: +6% TPS, lower TTFT, zero memory overhead
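The relative improvements quoted above follow directly from the raw numbers in the results table. A minimal sanity-check sketch (the `pct_change` helper is illustrative, not part of the project's tooling):

```python
# Sanity-check the percentages quoted in the summary against the raw
# benchmark numbers above. All values are copied from the results table.

def pct_change(new: float, old: float) -> float:
    """Signed percent change from old to new."""
    return (new - old) / old * 100

baseline_tps, mtp_tps, turbo_tps = 43.6, 46.3, 66.5   # avg tok/s
baseline_ttft, turbo_ttft = 63.11, 33.95              # seconds at 100K context
baseline_mem, turbo_mem = 54.8, 23.9                  # GB allocated at 40K context

print(f"MTP alone:        {pct_change(mtp_tps, baseline_tps):+.0f}% TPS")   # +6%
print(f"MTP + TurboQuant: {pct_change(turbo_tps, baseline_tps):+.0f}% TPS") # +53%
print(f"TTFT @ 100K:      {pct_change(turbo_ttft, baseline_ttft):+.0f}%")   # -46%
print(f"GPU alloc @ 40K:  {pct_change(turbo_mem, baseline_mem):+.0f}%")     # -56%
```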
Copilot AI review requested due to automatic review settings May 13, 2026 05:55
Contributor

Copilot AI left a comment


Pull request overview

Adds a new README performance section documenting benchmark results for MTP speculative decoding (with and without TurboQuant) on gemma-4-26b-a4b-it-4bit on an M5 Pro 64 GB MacBook Pro, including reproduction instructions via the existing profiling script.

Changes:

  • Introduces a dedicated “MTP Speculative Decoding” benchmark section with throughput, TTFT, and memory tables.
  • Summarizes key takeaways (throughput, TTFT improvements, and memory impact) for Baseline vs MTP vs MTP+TurboQuant.
  • Adds a command to reproduce the benchmark with scripts/profiling/profile_runner.py across 512/40K/100K contexts.


Comment thread README.md Outdated
- 🚀 **+53% avg throughput** — MTP + TurboQuant delivers 66.5 tok/s average vs 43.6 tok/s baseline
- 🏎️ **Nearly 2× faster TTFT at 100K context** — 33.95s vs 63.11s baseline (46% reduction)
- 💾 **Massive memory savings at long context** — GPU allocation drops from 54.8 GB → 23.9 GB at 40K tokens (TurboQuant KV compression)
- 🔬 **MTP alone is free** — +6% TPS and lower TTFT with zero additional memory overhead
Comment thread README.md
Comment on lines +74 to +80
### GPU Memory (allocated / peak physical RAM)

| Configuration | 512 tokens | 40K tokens | 100K tokens |
|---|---|---|---|
| Baseline | 20.4 GB / 15.8 GB | 54.8 GB / 19.3 GB | 54.3 GB / 23.3 GB |
| MTP Speculative | 20.4 GB / 16.0 GB | 54.6 GB / 20.8 GB | 54.3 GB / 23.1 GB |
| **MTP + TurboQuant** ⭐ | **20.3 GB / 15.8 GB** | **23.9 GB / 17.3 GB** | **26.4 GB / 18.2 GB** |
@solderzzc solderzzc merged commit d5a9d11 into main May 13, 2026
1 check passed
@solderzzc solderzzc deleted the docs/mtp-benchmark-results branch May 13, 2026 06:07