
docs: add MTP speculative decoding benchmark results (M5 Pro 64GB)#105

Merged
solderzzc merged 3 commits into main from docs/mtp-benchmark-results
May 13, 2026

Conversation

@solderzzc
Member

Adds fresh benchmark results for MTP speculative decoding on Gemma 4-26B-A4B-it-4bit, measured locally on an M5 Pro 64 GB MacBook Pro.

Results Summary

| Configuration | Avg TPS | 100K TTFT | GPU Alloc @ 40K |
|---|---|---|---|
| Baseline | 43.6 tok/s | 63.11s | 54.8 GB |
| MTP Speculative | 46.3 tok/s (+6%) | 55.17s | 54.6 GB |
| MTP + TurboQuant | 66.5 tok/s (+53%) | 33.95s | 23.9 GB |

Key highlights:

  • 🚀 +53% avg throughput with MTP + TurboQuant
  • 🏎️ Nearly 2× faster TTFT at 100K context (33.95s vs 63.11s)
  • 💾 56% less GPU memory at 40K context (23.9 GB vs 54.8 GB)
  • 🔬 MTP alone costs zero extra memory while improving TTFT

Gemma 4-26B-A4B benchmarks across Baseline / MTP Speculative / MTP+TurboQuant:
- MTP + TurboQuant: 66.5 tok/s avg (+53% vs baseline)
- TTFT at 100K context: 33.95s vs 63.11s (-46%)
- GPU alloc at 40K context: 23.9 GB vs 54.8 GB (-56%)
- MTP alone: +6% TPS, lower TTFT, zero memory overhead
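The relative improvements quoted above follow directly from the raw numbers in the results table. A minimal sanity-check sketch (the `pct_change` helper is illustrative, not part of the project's tooling):

```python
# Sanity-check the percentages quoted in the summary against the raw
# benchmark numbers above. All values are copied from the results table.

def pct_change(new: float, old: float) -> float:
    """Signed percent change from old to new."""
    return (new - old) / old * 100

baseline_tps, mtp_tps, turbo_tps = 43.6, 46.3, 66.5   # avg tok/s
baseline_ttft, turbo_ttft = 63.11, 33.95              # seconds at 100K context
baseline_mem, turbo_mem = 54.8, 23.9                  # GB allocated at 40K context

print(f"MTP alone:        {pct_change(mtp_tps, baseline_tps):+.0f}% TPS")   # +6%
print(f"MTP + TurboQuant: {pct_change(turbo_tps, baseline_tps):+.0f}% TPS") # +53%
print(f"TTFT @ 100K:      {pct_change(turbo_ttft, baseline_ttft):+.0f}%")   # -46%
print(f"GPU alloc @ 40K:  {pct_change(turbo_mem, baseline_mem):+.0f}%")     # -56%
```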
Copilot AI review requested due to automatic review settings May 13, 2026 05:55
Contributor

Copilot AI left a comment


Pull request overview

Adds a new README performance section documenting benchmark results for MTP speculative decoding (with and without TurboQuant) on gemma-4-26b-a4b-it-4bit on an M5 Pro 64 GB MacBook Pro, including reproduction instructions via the existing profiling script.

Changes:

  • Introduces a dedicated “MTP Speculative Decoding” benchmark section with throughput, TTFT, and memory tables.
  • Summarizes key takeaways (throughput, TTFT improvements, and memory impact) for Baseline vs MTP vs MTP+TurboQuant.
  • Adds a command to reproduce the benchmark with scripts/profiling/profile_runner.py across 512/40K/100K contexts.


Comment thread README.md Outdated
- 🚀 **+53% avg throughput** — MTP + TurboQuant delivers 66.5 tok/s average vs 43.6 tok/s baseline
- 🏎️ **Nearly 2× faster TTFT at 100K context** — 33.95s vs 63.11s baseline (46% reduction)
- 💾 **Massive memory savings at long context** — GPU allocation drops from 54.8 GB → 23.9 GB at 40K tokens (TurboQuant KV compression)
- 🔬 **MTP alone is free** — +6% TPS and lower TTFT with zero additional memory overhead
Comment thread README.md
Comment on lines +74 to +80
### GPU Memory (allocated / peak physical RAM)

| Configuration | 512 tokens | 40K tokens | 100K tokens |
|---|---|---|---|
| Baseline | 20.4 GB / 15.8 GB | 54.8 GB / 19.3 GB | 54.3 GB / 23.3 GB |
| MTP Speculative | 20.4 GB / 16.0 GB | 54.6 GB / 20.8 GB | 54.3 GB / 23.1 GB |
| **MTP + TurboQuant** ⭐ | **20.3 GB / 15.8 GB** | **23.9 GB / 17.3 GB** | **26.4 GB / 18.2 GB** |
@solderzzc solderzzc merged commit d5a9d11 into main May 13, 2026
1 check passed
@solderzzc solderzzc deleted the docs/mtp-benchmark-results branch May 13, 2026 06:07