Attention Is All You Need for KV Cache in Diffusion LLMs

Quan Nguyen-Tri *   Mukul Ranjan *   Zhiqiang Shen  

*Equal Contribution



Elastic-Cache

Elastic-Cache is a training-free framework that accelerates diffusion language models through intelligent KV caching. It achieves up to a 45× speedup while maintaining or even improving accuracy.

Key Results

| Metric | Value |
| --- | --- |
| Speedup | Up to 45.1× (GSM8K, 512 tokens) |
| Accuracy | 81.50% vs. 80.36% baseline |
| Code Generation | 5× faster (HumanEval) |

Method Overview

Illustration of the KV cache method for diffusion LLMs. (a) The Fast-dLLM (Wu et al., 2025) block-wise decoding method caches the key-value states of all tokens outside the current block at each step; the KV cache is updated after a block of decoding completes. (b) Our proposed method, Elastic-Cache, caches the key-value states of tokens outside a sliding window that moves flexibly through the sentence from left to right at each iteration. When the attention weights of the most-attended tokens (one per layer) change significantly at a layer l, we recompute the KV cache from layer l + 1 to the last layer.

Our approach introduces three complementary strategies:

  1. Sliding Window Decoding - Flexible window that caches distant MASK tokens while computing attention for active tokens
  2. Attention-Aware Monitoring - Track most-attended tokens and trigger updates based on attention pattern changes
  3. Layer-Aware Updates - Selective cache refresh starting from deeper layers where changes are most significant
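
Below is a minimal PyTorch sketch of how the attention-aware trigger and layer-aware update fit together. The function name, shapes, and default gamma are illustrative assumptions, not the repository's actual API.

import torch

def first_drifted_layer(prev_attn, curr_attn, gamma=0.02):
    # prev_attn / curr_attn: one tensor per layer holding the attention weights
    # received by the tracked (most-attended) cached tokens at the previous and
    # current decoding step. Returns the first layer whose drift exceeds gamma,
    # meaning the KV cache should be recomputed from that layer + 1 to the last
    # layer; returns None if the cached states can still be reused.
    for layer_idx, (p, c) in enumerate(zip(prev_attn, curr_attn)):
        if (c - p).abs().max().item() > gamma:
            return layer_idx
    return None

# Toy usage with random per-layer attention weights (track_num tokens per layer).
num_layers, track_num = 32, 1
prev = [torch.rand(track_num) for _ in range(num_layers)]
curr = [a + 0.01 * torch.randn(track_num) for a in prev]

layer = first_drifted_layer(prev, curr, gamma=0.02)
if layer is None:
    print("Attention is stable: reuse the cached KV states this step")
else:
    print(f"Recompute the KV cache from layer {layer + 1} to the last layer")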

Empirical Motivation

Visualization of our motivation. (a) MASK tokens located near each other receive high attention, while those situated far apart have minimal influence. (b) Over time, the representations in the KV states of cached tokens evolve, with deeper layers experiencing more substantial changes. (c) The changes in attention weights of most-attended tokens follow patterns similar to the changes in KV states of all cached tokens. (d) The KV states of the most-attended tokens change the least.

Our design is motivated by three key observations:

  • Spatial Locality: Distant MASK tokens have minimal attention influence
  • Layer-wise KV Drift: Deeper layers exhibit more significant changes over time
  • Attention Stability: Most-attended tokens show smallest changes, serving as reliable cache validity indicators
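
As an illustration of the layer-wise KV drift observation, the following sketch (not code from this repository) quantifies how much cached key/value states differ from freshly recomputed ones, per layer, using cosine distance.

import torch
import torch.nn.functional as F

def kv_drift_per_layer(cached_kv, fresh_kv):
    # cached_kv / fresh_kv: per layer, a (key, value) pair of tensors with shape
    # [num_tokens, head_dim]. Returns the mean cosine distance between cached
    # and recomputed states for each layer.
    drifts = []
    for (k_old, v_old), (k_new, v_new) in zip(cached_kv, fresh_kv):
        k_drift = 1.0 - F.cosine_similarity(k_old, k_new, dim=-1).mean()
        v_drift = 1.0 - F.cosine_similarity(v_old, v_new, dim=-1).mean()
        drifts.append((0.5 * (k_drift + v_drift)).item())
    return drifts

# Toy example where deeper layers are perturbed more, mimicking the observation.
num_layers, num_tokens, head_dim = 8, 16, 64
cached = [(torch.randn(num_tokens, head_dim), torch.randn(num_tokens, head_dim))
          for _ in range(num_layers)]
fresh = [(k + 0.05 * (l + 1) * torch.randn_like(k),
          v + 0.05 * (l + 1) * torch.randn_like(v))
         for l, (k, v) in enumerate(cached)]
print(kv_drift_per_layer(cached, fresh))  # drift grows with layer depth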

Performance Results

Comprehensive benchmark results on the LLaDA-1.5 suite. Each cell shows accuracy (top) and decoding throughput in tokens/sec with speedup relative to the LLaDA baseline (bottom; blue: tokens/sec, orange: speedup). Bold cells denote the highest throughput and speedup per configuration.

Highlights

  • Training-Free: No model modifications or retraining required
  • Architecture-Agnostic: Works with LLaDA, Dream, LLaDA-V, and other diffusion LLMs
  • Scalable: Better performance with longer sequences
  • Controllable Trade-offs: Tune between accuracy and latency

Installation

# Clone the repository
git clone https://github.com/VILA-Lab/elastic-cache.git
cd elastic-cache

# Install dependencies
pip install -r requirements.txt

Usage

Parameters

| Parameter | Description |
| --- | --- |
| --gen_length | Maximum length of generated text |
| --window_size | Sliding window length (≤ gen_length) |
| --threshold | Confidence-aware decoding threshold |
| --gamma | Cache update trigger threshold |
| --track_num | Number of most-attended tokens for the cache update trigger |
| --block_caching | Block caching for far-away [MASK] tokens |
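
As a sketch of how these flags fit together, here is a minimal argparse parser mirroring the documented options. The defaults and the assumption that --block_caching is a boolean switch are illustrative, not taken from this repository; see the eval_{task}.sh scripts for the actual entry points.

import argparse

# Illustrative parser for the documented flags; default values are assumptions.
parser = argparse.ArgumentParser(description="Elastic-Cache decoding options (sketch)")
parser.add_argument("--gen_length", type=int, default=512,
                    help="Maximum length of generated text")
parser.add_argument("--window_size", type=int, default=64,
                    help="Sliding window length (must be <= gen_length)")
parser.add_argument("--threshold", type=float, default=0.9,
                    help="Confidence-aware decoding threshold")
parser.add_argument("--gamma", type=float, default=0.02,
                    help="Cache update trigger threshold on attention drift")
parser.add_argument("--track_num", type=int, default=1,
                    help="Number of most-attended tokens for the cache update trigger")
parser.add_argument("--block_caching", action="store_true",
                    help="Block caching for far-away [MASK] tokens (assumed boolean switch)")

args = parser.parse_args()
assert args.window_size <= args.gen_length, "window_size must not exceed gen_length"
print(vars(args))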

Running Experiments

LLaDA Model:

cd llada
bash eval_{task}.sh

Dream Model:

cd dream
bash eval_{task}.sh

Roadmap

  • [✅] Serve diffusion LLMs with Elastic-Cache and batch inference
  • [🚀] Triton implementation
  • [🚀] Integrate into additional models (e.g., MMaDA)
  • [🚀] Elastic-Cache v2

Citation

@article{nguyen2025attention,
  title={Attention is all you need for kv cache in diffusion llms},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2510.14973},
  year={2025}
}

Acknowledgements

This repository is built upon LLaDA, Dream, LLaDA-V, and lm-evaluation-harness.


License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
