Elastic-Cache is a training-free framework that accelerates diffusion language models through intelligent KV caching. It achieves up to 45× speedup while maintaining or even improving accuracy.
| Metric | Value |
|---|---|
| Speedup | Up to 45.1× (GSM8K, 512 tokens) |
| Accuracy | 81.50% vs. 80.36% baseline |
| Code Generation | 5× faster (HumanEval) |
*Illustration of key-value cache methods for diffusion LLMs. (a) The Fast-dLLM (Wu et al., 2025) block-wise decoding method caches the key-value states of all tokens outside the current block at each step and updates the KV cache only after a block of decoding is completed. (b) Our proposed Elastic-Cache caches the key-value states of tokens outside a sliding window that moves flexibly through the sequence from left to right at each iteration. When the attention weights of the most-attended tokens (one per layer) change significantly at a layer l, we recompute the KV cache from layer l + 1 to the last layer.*
Our approach introduces three complementary strategies:
- **Sliding Window Decoding**: A flexible window that caches distant MASK tokens while computing attention only for the active tokens
- **Attention-Aware Monitoring**: Tracks the most-attended tokens and triggers cache updates when their attention patterns change significantly (a minimal sketch follows this list)
- **Layer-Aware Updates**: Selective cache refresh starting from the deeper layers, where changes are most significant
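To make the trigger concrete, here is a minimal sketch of the attention-aware check, assuming per-layer attention maps of shape (heads, queries, keys). The function names `anchor_indices` and `needs_refresh` and the toy tensors are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of the attention-aware cache-update trigger (illustrative only;
# function names and tensor shapes are assumptions, not the repository's API).
import torch

def anchor_indices(attn: torch.Tensor, track_num: int = 1) -> torch.Tensor:
    """Indices of the `track_num` most-attended key positions in one layer's
    attention map of shape (heads, query_len, key_len)."""
    per_key = attn.mean(dim=(0, 1))            # average attention each cached token receives
    return per_key.topk(track_num).indices

def needs_refresh(prev_attn: torch.Tensor, curr_attn: torch.Tensor,
                  idx: torch.Tensor, gamma: float = 0.01) -> bool:
    """Signal a KV-cache recomputation when attention to the anchor tokens
    drifts by more than `gamma` between consecutive decoding steps."""
    drift = (curr_attn[..., idx] - prev_attn[..., idx]).abs().mean().item()
    return drift > gamma

# Toy usage for a single layer across two consecutive decoding steps.
prev_attn = torch.softmax(torch.randn(8, 16, 64), dim=-1)   # (heads, queries, keys)
curr_attn = torch.softmax(torch.randn(8, 16, 64), dim=-1)
idx = anchor_indices(prev_attn, track_num=1)
print("refresh cache from the next layer onward:", needs_refresh(prev_attn, curr_attn, idx))
```

In the paper's formulation, when this check fires at a layer l, the KV cache is recomputed only from layer l + 1 to the last layer, while shallower layers keep their cached entries.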
*Visualization of our motivation. (a) MASK tokens located near each other receive high attention, while those situated far apart have minimal influence. (b) Over time, the representations in the KV states of cached tokens evolve, with deeper layers experiencing more substantial changes. (c) The changes in attention weights of the most-attended tokens follow similar patterns to the changes in KV states of all cached tokens. (d) The KV states of the most-attended tokens change the least.*
Our design is motivated by three key observations:
- **Spatial Locality**: Distant MASK tokens have minimal attention influence
- **Layer-wise KV Drift**: Deeper layers exhibit more significant changes over time (a toy measurement follows this list)
- **Attention Stability**: The most-attended tokens show the smallest changes, making them reliable indicators of cache validity
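The layer-wise drift observation can be pictured with a toy measurement. The snippet below assumes cached key states are available as one tensor per layer; all names, shapes, and the perturbation pattern are made up for the example and are not taken from the repository.

```python
# Toy measurement of per-layer KV drift between two decoding steps
# (names and shapes are illustrative, not the repository's internals).
import torch

def kv_drift(prev_keys, curr_keys):
    """Mean absolute change of cached key states, reported per layer."""
    return [(c - p).abs().mean().item() for p, c in zip(prev_keys, curr_keys)]

# Four layers with cached keys of shape (heads, seq_len, head_dim); deeper layers
# receive larger perturbations here to mimic the drift pattern described above.
prev = [torch.randn(8, 32, 64) for _ in range(4)]
curr = [k + 0.01 * (i + 1) * torch.randn_like(k) for i, k in enumerate(prev)]
for layer, d in enumerate(kv_drift(prev, curr)):
    print(f"layer {layer}: key drift = {d:.4f}")
```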
*Comprehensive benchmark results on the LLaDA-1.5 suite. Each cell shows accuracy (top) and decoding throughput in tokens/sec with the relative speedup over the LLaDA baseline (bottom; blue: tokens/sec, orange: speedup). Bold cells denote the highest throughput and speedup per configuration.*
- **Training-Free**: No model modifications or retraining required
- **Architecture-Agnostic**: Works with LLaDA, Dream, LLaDA-V, and other diffusion LLMs
- **Scalable**: Speedups grow with longer sequences
- **Controllable Trade-offs**: Tunable balance between accuracy and latency
```bash
# Clone the repository
git clone https://github.com/VILA-Lab/elastic-cache.git
cd elastic-cache

# Install dependencies
pip install -r requirements.txt
```

| Parameter | Description |
|---|---|
| `--gen_length` | Maximum length of generated text |
| `--window_size` | Sliding window length (≤ `--gen_length`) |
| `--threshold` | Confidence-aware decoding threshold |
| `--gamma` | Cache update trigger threshold |
| `--track_num` | Number of most-attended tokens used to trigger cache updates |
| `--block_caching` | Block caching for far-away [MASK] tokens |
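For orientation, a hypothetical wiring of these flags is sketched below. The parser, default values, and the assumption that `--block_caching` is a boolean switch are illustrative and not taken from the repository; the `eval_{task}.sh` scripts below show the actual usage.

```python
# Illustrative argparse wiring for the flags above (not the repository's entry point;
# defaults are placeholders, and --block_caching is assumed to be a boolean switch).
import argparse

parser = argparse.ArgumentParser(description="Elastic-Cache decoding options (illustrative)")
parser.add_argument("--gen_length", type=int, default=512, help="maximum length of generated text")
parser.add_argument("--window_size", type=int, default=64, help="sliding window length (<= gen_length)")
parser.add_argument("--threshold", type=float, default=0.9, help="confidence-aware decoding threshold")
parser.add_argument("--gamma", type=float, default=0.01, help="cache update trigger threshold")
parser.add_argument("--track_num", type=int, default=1, help="most-attended tokens tracked per layer")
parser.add_argument("--block_caching", action="store_true", help="cache far-away [MASK] tokens in blocks")

args = parser.parse_args(["--gen_length", "256", "--window_size", "32"])  # example values
print(args)
```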
LLaDA Model:
```bash
cd llada
bash eval_{task}.sh
```

Dream Model:
```bash
cd dream
bash eval_{task}.sh
```

- [✅] Serve diffusion LLMs with Elastic-Cache and batch inference
- [🚀] Triton implementation
- [🚀] Integrate into additional models (e.g., MMaDA)
- [🚀] Elastic-Cache v2
```bibtex
@article{nguyen2025attention,
  title={Attention is All You Need for {KV} Cache in Diffusion {LLMs}},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2510.14973},
  year={2025}
}
```

This repository is built upon LLaDA, Dream, LLaDA-V, and lm-evaluation-harness.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
