docs: add MTP speculative decoding benchmark results (M5 Pro 64GB) #105
Merged
Gemma 4-26B-A4B benchmarks across Baseline / MTP Speculative / MTP+TurboQuant:
- MTP + TurboQuant: 66.5 tok/s avg (+53% vs baseline)
- TTFT at 100K context: 33.95s vs 63.11s (-46%)
- GPU alloc at 40K context: 23.9 GB vs 54.8 GB (-56%)
- MTP alone: +6% TPS, lower TTFT, zero memory overhead
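The percentages in this summary can be re-derived from the raw figures. A minimal Python sketch of that arithmetic follows; the helper name `pct_change` is hypothetical (not part of the repository), and the 43.6 tok/s baseline is the figure quoted in the README highlights of this PR.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Figures copied from the PR summary and README highlights.
throughput_gain = pct_change(66.5, 43.6)    # MTP + TurboQuant vs baseline, tok/s
ttft_change = pct_change(33.95, 63.11)      # TTFT at 100K context, seconds
alloc_change = pct_change(23.9, 54.8)       # GPU allocation at 40K context, GB
ttft_speedup = 63.11 / 33.95                # ratio behind "nearly 2x faster"

print(f"throughput: {throughput_gain:+.1f}%")
print(f"TTFT:       {ttft_change:+.1f}%")
print(f"GPU alloc:  {alloc_change:+.1f}%")
print(f"TTFT speedup: {ttft_speedup:.2f}x")
```

Rounded to whole percentages, these reproduce the +53%, -46%, and -56% figures quoted above.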
Pull request overview
Adds a new README performance section documenting benchmark results for MTP speculative decoding (with and without TurboQuant) on gemma-4-26b-a4b-it-4bit on an M5 Pro 64 GB MacBook Pro, including reproduction instructions via the existing profiling script.
Changes:
- Introduces a dedicated “MTP Speculative Decoding” benchmark section with throughput, TTFT, and memory tables.
- Summarizes key takeaways (throughput, TTFT improvements, and memory impact) for Baseline vs MTP vs MTP+TurboQuant.
- Adds a command to reproduce the benchmark with `scripts/profiling/profile_runner.py` across 512/40K/100K contexts.
- 🚀 **+53% avg throughput**: MTP + TurboQuant delivers 66.5 tok/s average vs 43.6 tok/s baseline
- 🏎️ **Nearly 2× faster TTFT at 100K context**: 33.95s vs 63.11s baseline (46% reduction)
- 💾 **Massive memory savings at long context**: GPU allocation drops from 54.8 GB → 23.9 GB at 40K tokens (TurboQuant KV compression)
- 🔬 **MTP alone is free**: +6% TPS and lower TTFT with zero additional memory overhead
Comment on lines +74 to +80
### GPU Memory (allocated / peak physical RAM)

| Configuration | 512 tokens | 40K tokens | 100K tokens |
|---|---|---|---|
| Baseline | 20.4 GB / 15.8 GB | 54.8 GB / 19.3 GB | 54.3 GB / 23.3 GB |
| MTP Speculative | 20.4 GB / 16.0 GB | 54.6 GB / 20.8 GB | 54.3 GB / 23.1 GB |
| **MTP + TurboQuant** ⭐ | **20.3 GB / 15.8 GB** | **23.9 GB / 17.3 GB** | **26.4 GB / 18.2 GB** |
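To make the table easier to compare row by row, here is a small Python sketch that computes the reduction of each configuration relative to Baseline. The `MEMORY_GB` dictionary and the `reduction_pct` helper are illustrative names introduced here, not part of the profiling script; the values are copied from the table.

```python
# (allocated GB, peak physical GB) per context length, copied from the table.
MEMORY_GB = {
    "Baseline": {"512": (20.4, 15.8), "40K": (54.8, 19.3), "100K": (54.3, 23.3)},
    "MTP Speculative": {"512": (20.4, 16.0), "40K": (54.6, 20.8), "100K": (54.3, 23.1)},
    "MTP + TurboQuant": {"512": (20.3, 15.8), "40K": (23.9, 17.3), "100K": (26.4, 18.2)},
}


def reduction_pct(config: str, context: str, index: int = 0) -> float:
    """Percent reduction versus Baseline (negative means an increase).

    index 0 selects allocated memory, index 1 selects peak physical RAM.
    """
    base = MEMORY_GB["Baseline"][context][index]
    value = MEMORY_GB[config][context][index]
    return (base - value) / base * 100.0


for config in ("MTP Speculative", "MTP + TurboQuant"):
    for context in ("512", "40K", "100K"):
        alloc = reduction_pct(config, context, 0)
        peak = reduction_pct(config, context, 1)
        print(f"{config:<17} {context:>4}: allocated {alloc:6.1f}%, peak physical {peak:6.1f}%")
```

On these numbers, MTP + TurboQuant cuts allocated memory by roughly half at both 40K and 100K tokens. MTP alone shows a higher peak physical reading than Baseline at 40K (20.8 GB vs 19.3 GB), so the "zero additional memory overhead" highlight appears to describe allocated memory rather than every peak reading.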
Adds fresh benchmark results for MTP speculative decoding on Gemma 4-26B-A4B-it-4bit, measured locally on an M5 Pro 64 GB MacBook Pro.
Results Summary
Key highlights: