Elastic-Cache is a training-free framework that accelerates diffusion language models through intelligent KV caching. It achieves up to 45× speedup while maintaining or even improving accuracy.
| Metric | Value |
|---|---|
| Speedup | Up to 45.1× (GSM8K, 512 tokens) |
| Accuracy | 81.50% vs. 80.36% baseline |
| Code Generation | 5× faster (HumanEval) |
*Illustration of key-value cache methods for diffusion LLMs. (a) The Fast-dLLM (Wu et al., 2025) block-wise decoding method caches the key-value states of all tokens outside the current block at each step and updates the KV cache only after a block of decoding is completed. (b) Our proposed Elastic-Cache caches the key-value states of tokens outside a sliding window that moves flexibly through the sequence from left to right at each iteration. When the attention weights of the most-attended tokens (one per layer) change significantly at a layer l, we recompute the KV cache from layer l + 1 to the last layer.*
Our approach introduces three complementary strategies:
- **Sliding Window Decoding**: A flexible window that caches distant MASK tokens while computing attention only for the active tokens
- **Attention-Aware Monitoring**: Tracks the most-attended tokens and triggers cache updates when their attention patterns change significantly (a minimal sketch follows this list)
- **Layer-Aware Updates**: Selective cache refresh starting from the deeper layers, where changes are most significant
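To make the trigger concrete, here is a minimal sketch of the attention-aware check, assuming per-layer attention maps of shape (heads, queries, keys). The function names `anchor_indices` and `needs_refresh` and the toy tensors are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of the attention-aware cache-update trigger (illustrative only;
# function names and tensor shapes are assumptions, not the repository's API).
import torch

def anchor_indices(attn: torch.Tensor, track_num: int = 1) -> torch.Tensor:
    """Indices of the `track_num` most-attended key positions in one layer's
    attention map of shape (heads, query_len, key_len)."""
    per_key = attn.mean(dim=(0, 1))            # average attention each cached token receives
    return per_key.topk(track_num).indices

def needs_refresh(prev_attn: torch.Tensor, curr_attn: torch.Tensor,
                  idx: torch.Tensor, gamma: float = 0.01) -> bool:
    """Signal a KV-cache recomputation when attention to the anchor tokens
    drifts by more than `gamma` between consecutive decoding steps."""
    drift = (curr_attn[..., idx] - prev_attn[..., idx]).abs().mean().item()
    return drift > gamma

# Toy usage for a single layer across two consecutive decoding steps.
prev_attn = torch.softmax(torch.randn(8, 16, 64), dim=-1)   # (heads, queries, keys)
curr_attn = torch.softmax(torch.randn(8, 16, 64), dim=-1)
idx = anchor_indices(prev_attn, track_num=1)
print("refresh cache from the next layer onward:", needs_refresh(prev_attn, curr_attn, idx))
```

In the paper's formulation, when this check fires at a layer l, the KV cache is recomputed only from layer l + 1 to the last layer, while shallower layers keep their cached entries.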
*Visualization of our motivation. (a) MASK tokens located near each other receive high attention, while those situated far apart have minimal influence. (b) Over time, the representations in the KV states of cached tokens evolve, with deeper layers experiencing more substantial changes. (c) The changes in attention weights of the most-attended tokens follow similar patterns to the changes in KV states of all cached tokens. (d) The KV states of the most-attended tokens change the least.*
Our design is motivated by three key observations:
- **Spatial Locality**: Distant MASK tokens have minimal attention influence
- **Layer-wise KV Drift**: Deeper layers exhibit more significant changes over time (a toy measurement follows this list)
- **Attention Stability**: The most-attended tokens show the smallest changes, making them reliable indicators of cache validity
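The layer-wise drift observation can be pictured with a toy measurement. The snippet below assumes cached key states are available as one tensor per layer; all names, shapes, and the perturbation pattern are made up for the example and are not taken from the repository.

```python
# Toy measurement of per-layer KV drift between two decoding steps
# (names and shapes are illustrative, not the repository's internals).
import torch

def kv_drift(prev_keys, curr_keys):
    """Mean absolute change of cached key states, reported per layer."""
    return [(c - p).abs().mean().item() for p, c in zip(prev_keys, curr_keys)]

# Four layers with cached keys of shape (heads, seq_len, head_dim); deeper layers
# receive larger perturbations here to mimic the drift pattern described above.
prev = [torch.randn(8, 32, 64) for _ in range(4)]
curr = [k + 0.01 * (i + 1) * torch.randn_like(k) for i, k in enumerate(prev)]
for layer, d in enumerate(kv_drift(prev, curr)):
    print(f"layer {layer}: key drift = {d:.4f}")
```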
*Comprehensive benchmark results on the LLaDA-1.5 suite. Each cell shows accuracy (top) and decoding throughput in tokens/sec with the relative speedup over the LLaDA baseline (bottom; blue: tokens/sec, orange: speedup). Bold cells denote the highest throughput and speedup per configuration.*
- **Training-Free**: No model modifications or retraining required
- **Architecture-Agnostic**: Works with LLaDA, Dream, LLaDA-V, and other diffusion LLMs
- **Scalable**: Speedups grow with longer sequences
- **Controllable Trade-offs**: Tunable balance between accuracy and latency
```bash
# Clone the repository
git clone https://github.com/VILA-Lab/elastic-cache.git
cd elastic-cache

# Install dependencies
pip install -r requirements.txt
```

| Parameter | Description |
|---|---|
| `--gen_length` | Maximum length of generated text |
| `--window_size` | Sliding window length (≤ `--gen_length`) |
| `--threshold` | Confidence-aware decoding threshold |
| `--gamma` | Cache update trigger threshold |
| `--track_num` | Number of most-attended tokens used to trigger cache updates |
| `--block_caching` | Block caching for far-away [MASK] tokens |
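For orientation, a hypothetical wiring of these flags is sketched below. The parser, default values, and the assumption that `--block_caching` is a boolean switch are illustrative and not taken from the repository; the `eval_{task}.sh` scripts below show the actual usage.

```python
# Illustrative argparse wiring for the flags above (not the repository's entry point;
# defaults are placeholders, and --block_caching is assumed to be a boolean switch).
import argparse

parser = argparse.ArgumentParser(description="Elastic-Cache decoding options (illustrative)")
parser.add_argument("--gen_length", type=int, default=512, help="maximum length of generated text")
parser.add_argument("--window_size", type=int, default=64, help="sliding window length (<= gen_length)")
parser.add_argument("--threshold", type=float, default=0.9, help="confidence-aware decoding threshold")
parser.add_argument("--gamma", type=float, default=0.01, help="cache update trigger threshold")
parser.add_argument("--track_num", type=int, default=1, help="most-attended tokens tracked per layer")
parser.add_argument("--block_caching", action="store_true", help="cache far-away [MASK] tokens in blocks")

args = parser.parse_args(["--gen_length", "256", "--window_size", "32"])  # example values
print(args)
```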
LLaDA Model:
```bash
cd llada
bash eval_{task}.sh
```

Dream Model:
```bash
cd dream
bash eval_{task}.sh
```

- [✅] Serve diffusion LLMs with Elastic-Cache and batch inference
- [🚀] Triton implementation
- [🚀] Integrate into additional models (e.g., MMaDA)
- [🚀] Elastic-Cache v2
```bibtex
@article{nguyen2025attention,
  title={Attention is All You Need for {KV} Cache in Diffusion {LLMs}},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2510.14973},
  year={2025}
}
```

This repository is built upon LLaDA, Dream, LLaDA-V, and lm-evaluation-harness.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
