kv-cache-rl

RL environment for KV-cache eviction policy optimization in LLM inference serving.

Inspiration

This env borrows ideas from NVIDIA Dynamo's March 2026 post, "Full-Stack Optimizations for Agentic Inference with Dynamo":

Blog: https://docs.nvidia.com/dynamo/dev/blog/agentic-inference

Adapted from the post:

KV block value differences (system/context high value vs reasoning/ephemeral lower value)
Priority/retention-style cache policy ideas
Agent lifecycle angle (ephemeral work should be easier to evict)

This is a simplified RL simulator inspired by those ideas, not Dynamo itself.

Problem

The agent manages GPU KV-cache pressure with multiple actions per step:

Action	Effect
`evict(seq_id)`	Drops the sequence and frees its GPU memory
`compress(seq_id)`	Quantizes the sequence down one tier (fp16→int8→int4); each tier halves its memory cost
`swap_to_cpu(seq_id)`	Frees the sequence's GPU memory but adds swap access latency
`keep()`	No action

The model can call multiple tools per step. Under high pressure it should batch 2-3 evictions or compressions to create headroom.

Goal: maximize throughput while minimizing failures, latency, and unsafe evictions.
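
The action semantics above can be illustrated with a toy cache. This is a minimal sketch, not the code in simulator.py: the `Seq` dataclass, the `apply_actions` helper, and the memory units are hypothetical, and swap latency is not modeled here.

```python
from dataclasses import dataclass

# Precision tiers in compression order, with memory multipliers relative to fp16.
TIERS = ["fp16", "int8", "int4"]
MULTIPLIER = {"fp16": 1.0, "int8": 0.5, "int4": 0.25}


@dataclass
class Seq:
    """Hypothetical cache entry; base_mem is the fp16 footprint (arbitrary units)."""
    base_mem: float
    tier: str = "fp16"
    on_cpu: bool = False

    @property
    def gpu_mem(self) -> float:
        # A swapped sequence occupies no GPU memory in this sketch.
        return 0.0 if self.on_cpu else self.base_mem * MULTIPLIER[self.tier]


def apply_actions(cache, actions):
    """Apply a batch of (name, seq_id) tool calls in order; return GPU usage."""
    for name, seq_id in actions:
        if name == "keep" or seq_id not in cache:
            continue  # keep is a no-op; unknown IDs are ignored (assumption)
        if name == "evict":
            del cache[seq_id]
        elif name == "compress":
            seq = cache[seq_id]
            idx = TIERS.index(seq.tier)
            if idx < len(TIERS) - 1:  # int4 is the lowest tier
                seq.tier = TIERS[idx + 1]
        elif name == "swap_to_cpu":
            cache[seq_id].on_cpu = True
    return sum(s.gpu_mem for s in cache.values())


cache = {"a": Seq(0.5), "b": Seq(0.25), "c": Seq(0.25)}
usage = apply_actions(
    cache, [("compress", "a"), ("swap_to_cpu", "b"), ("evict", "c")]
)
print(usage)  # GPU usage after one batched step
```

Batching three calls in one step takes this toy cache from 1.0 to 0.25 units of GPU memory, which is the kind of headroom a single action could not create.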

├── __init__.py            # Compatibility exports for env loading
└── kv_cache_rl/
    ├── kv_cache_eviction.py   # StatefulToolEnv + rewards + load_environment()
    ├── simulator.py           # KVCacheSimulator + dynamics + pressure tracking
    └── scenarios.py           # Scenario generation + compact prompt builders

Simulator Highlights

Episode length: 15 steps
Difficulty tiers:
- Easy: memory budget 0.90-1.00, low/medium arrivals
- Medium: memory budget 0.75-0.90, medium/high arrivals
- Hard: memory budget 0.65-0.80, high/bursty arrivals
Block types: system, context, generation, reasoning, ephemeral
eviction_value: how safe a block is to evict, from 0.0 (worst to evict) to 1.0 (safest to evict)
Compression applies fixed quantization multipliers to memory cost (fp16=1.0, int8=0.5, int4=0.25)
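
As an illustration of the difficulty tiers, the budget ranges can be written as a small lookup table. This is a sketch under assumptions: the ranges come from the list above, but the uniform draw and the `sample_memory_budget` helper are hypothetical and may not match how scenarios.py generates scenarios.

```python
import random

# Memory budget ranges per difficulty tier, taken from the list above.
BUDGET_RANGES = {
    "easy": (0.90, 1.00),
    "medium": (0.75, 0.90),
    "hard": (0.65, 0.80),
}


def sample_memory_budget(difficulty, rng):
    """Draw a budget uniformly from the tier's range (the distribution is an assumption)."""
    lo, hi = BUDGET_RANGES[difficulty]
    return rng.uniform(lo, hi)


rng = random.Random(42)  # seeded for reproducibility
budgets = {d: sample_memory_budget(d, rng) for d in BUDGET_RANGES}
for name, value in budgets.items():
    print(f"{name}: {value:.3f}")
```

Note that the medium and hard ranges overlap between 0.75 and 0.80, so memory budget alone does not always separate the two tiers; the arrival pattern also differs.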

Pressure Signals

The simulator tracks and exposes:

time_above_0_95: number of steps with normalized memory usage above 0.95
pending_queue_growth: step-to-step change in pending queue size
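
Both signals can be derived from per-step traces. The following is a minimal sketch, assuming memory usage is already normalized and that "above" means strictly greater than the threshold; the `pressure_signals` helper and the trace values are hypothetical.

```python
def pressure_signals(memory_usage, pending_sizes, threshold=0.95):
    """Return (time_above_threshold, per-step pending queue growth)."""
    # Count steps whose normalized usage is strictly above the threshold.
    time_above = sum(1 for u in memory_usage if u > threshold)
    # Growth is the difference between consecutive pending queue sizes.
    growth = [b - a for a, b in zip(pending_sizes, pending_sizes[1:])]
    return time_above, growth


usage_trace = [0.90, 0.96, 0.99, 0.95, 0.80]
pending_trace = [0, 2, 5, 4, 4]
time_above_0_95, pending_queue_growth = pressure_signals(usage_trace, pending_trace)
print(time_above_0_95)       # 2
print(pending_queue_growth)  # [2, 3, -1, 0]
```

A positive growth value means requests are arriving faster than they are admitted, which is an early warning before memory usage itself crosses the threshold.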

Observation Format (Compact)

Prompts include a compact summary instead of full raw cache dumps to reduce token usage.

Top-level fields include:

memory_usage, memory_budget, episode_remaining
time_above_0_95, pending_queue_growth
cache_overview (block counts, compressed/swapped counts)
top_memory_entries (largest entries)
top_eviction_candidates (ranked by safety/priority/progress)
pending_summary
visible_seq_ids (recommended IDs for tool actions)
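
For orientation, a compact observation might look like the following. The top-level field names are the ones listed above; every value and every nested key (`blocks`, `compressed`, `swapped`, `seq_id`, `mem`, `count`) is invented for illustration and may differ from what the prompt builders actually emit.

```python
import json

# Hypothetical observation: top-level keys from the list above, values made up.
observation = {
    "memory_usage": 0.78,
    "memory_budget": 0.80,
    "episode_remaining": 9,
    "time_above_0_95": 1,
    "pending_queue_growth": 2,
    "cache_overview": {"blocks": 12, "compressed": 3, "swapped": 1},
    "top_memory_entries": [{"seq_id": "s7", "mem": 0.12}],
    "top_eviction_candidates": [{"seq_id": "s3", "eviction_value": 0.9}],
    "pending_summary": {"count": 4},
    "visible_seq_ids": ["s3", "s7"],
}

# Compact separators drop whitespace, which keeps the prompt short.
prompt_json = json.dumps(observation, separators=(",", ":"))
print(len(prompt_json), "characters")
```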

Reward Functions

Function	Weight	Description
`failure_penalty`	0.40	Accelerating penalty: `min(f*0.15 + max(0,f-2)*0.1, 1.0)`
`throughput_reward`	0.25	`min(total_tokens / 500, 1.0)`
`headroom_bonus`	0.18	Per-step memory tracking, rewards staying below 0.8/0.95/1.0 thresholds
`memory_efficiency`	0.07	Step function on final memory state (0.2 if <0.85, -0.2 if overflow)
`eviction_quality`	0.05	Rewards preserving high-value blocks (system, context)
`latency_penalty`	0.03	`-min(total_swap_latency * 0.2, 0.3)`
`risky_eviction_penalty`	0.02	Penalizes evicting system/high-priority/high-progress blocks

Tracked metrics (weight=0): total_tokens_metric, total_failures_metric, total_latency_metric, pressure_steps_metric.
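
The three closed-form rewards in the table can be written out directly. This is a sketch rather than the code in kv_cache_eviction.py: it reads the failure formula as `f * 0.15 + max(0, f - 2) * 0.1` with `f` assumed to be the failure count, and it returns the failure penalty as a magnitude because the table does not state how its sign is applied.

```python
# Weights from the table above.
WEIGHTS = {
    "failure_penalty": 0.40,
    "throughput_reward": 0.25,
    "headroom_bonus": 0.18,
    "memory_efficiency": 0.07,
    "eviction_quality": 0.05,
    "latency_penalty": 0.03,
    "risky_eviction_penalty": 0.02,
}


def failure_penalty(f):
    """Penalty magnitude; grows faster once f (assumed failure count) exceeds 2."""
    return min(f * 0.15 + max(0, f - 2) * 0.1, 1.0)


def throughput_reward(total_tokens):
    """Saturates at 1.0 once 500 tokens have been produced."""
    return min(total_tokens / 500, 1.0)


def latency_penalty(total_swap_latency):
    """Negative reward, capped at -0.3."""
    return -min(total_swap_latency * 0.2, 0.3)


for f in (1, 2, 3, 6):
    print(f, round(failure_penalty(f), 2))
print(throughput_reward(250), latency_penalty(1.0))
```

Under this reading, the first two failures cost 0.15 each and every later failure costs 0.25, so the penalty saturates at 1.0 by the fifth failure. The seven weights sum to 1.0.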

Usage

# Install local environment
prime env install ./environments/kv_cache_eviction

# Run eval
prime eval run kv_cache_eviction -m gpt-5.4-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -n 5 -r 1 --max-concurrent 1

# Hard-only scenarios
prime eval run kv_cache_eviction --env-args '{"difficulty":"hard"}'

Environment Arguments

Arg	Type	Default	Description
`difficulty`	str	`"all"`	One of `easy`, `medium`, `hard`, `all`
`num_examples`	int	`-1`	Number of scenarios to use (`-1` = all)
`seed`	int	`42`	Scenario generation seed
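
When scripting evaluations, the `--env-args` JSON can be generated instead of typed by hand. The sketch below only builds the command string shown under Usage; `build_eval_command` is a hypothetical helper, and the validation mirrors the table above rather than the environment's own checks.

```python
import json
import shlex

VALID_DIFFICULTIES = {"easy", "medium", "hard", "all"}


def build_eval_command(difficulty="all", num_examples=-1, seed=42):
    """Return a shell-safe `prime eval run` command with JSON-encoded env args."""
    if difficulty not in VALID_DIFFICULTIES:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    env_args = {
        "difficulty": difficulty,
        "num_examples": num_examples,
        "seed": seed,
    }
    encoded = json.dumps(env_args, separators=(",", ":"))
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return " ".join(
        ["prime", "eval", "run", "kv_cache_eviction", "--env-args", shlex.quote(encoded)]
    )


command = build_eval_command(difficulty="hard", num_examples=20)
print(command)
```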
