RL environment for KV-cache eviction policy optimization in LLM inference serving.
This env borrows ideas from NVIDIA Dynamo's March 2026 post, "Full-Stack Optimizations for Agentic Inference with Dynamo":
- Blog: https://docs.nvidia.com/dynamo/dev/blog/agentic-inference
Adapted from the post:
- KV block value differences (`system`/`context` high value vs `reasoning`/`ephemeral` lower value)
- Priority/retention-style cache policy ideas
- Agent lifecycle angle (ephemeral work should be easier to evict)
This is a simplified RL simulator inspired by those ideas, not Dynamo itself.
The agent manages GPU KV-cache pressure with multiple actions per step:
| Action | Effect |
|---|---|
| `evict(seq_id)` | Removes a sequence, frees GPU memory, sequence is dropped |
| `compress(seq_id)` | Quantizes one tier (fp16→int8→int4), each halving memory cost |
| `swap_to_cpu(seq_id)` | Frees GPU memory, adds swap access latency |
| `keep()` | No action |
The model can call multiple tools per step. Under high pressure, it should batch 2-3 evictions or compressions to create headroom.
Goal: maximize throughput while minimizing failures, latency, and unsafe evictions.
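As a rough illustration of how batched actions relieve pressure, here is a minimal sketch. The `Entry` class, its fields, and the helper functions are hypothetical and are not taken from `simulator.py`; only the tier order and the fp16/int8/int4 multipliers come from this README.

```python
from dataclasses import dataclass

# Precision tiers in compression order; multipliers are relative to fp16.
TIERS = ["fp16", "int8", "int4"]
MULTIPLIER = {"fp16": 1.0, "int8": 0.5, "int4": 0.25}


@dataclass
class Entry:
    """Hypothetical cache entry; field names are illustrative only."""
    base_size: float      # memory cost at fp16, in normalized units
    tier: str = "fp16"
    on_gpu: bool = True

    @property
    def gpu_cost(self) -> float:
        # Swapped-out entries occupy no GPU memory.
        return self.base_size * MULTIPLIER[self.tier] if self.on_gpu else 0.0


def compress(entry: Entry) -> bool:
    """Move one tier down (fp16 -> int8 -> int4); False if already at int4."""
    idx = TIERS.index(entry.tier)
    if idx == len(TIERS) - 1:
        return False
    entry.tier = TIERS[idx + 1]
    return True


def swap_to_cpu(entry: Entry) -> None:
    """Free the entry's GPU memory (swap latency is not modeled here)."""
    entry.on_gpu = False


cache = {"a": Entry(0.30), "b": Entry(0.20), "c": Entry(0.40)}

# Batch several actions within a single step.
compress(cache["c"])      # fp16 -> int8
compress(cache["c"])      # int8 -> int4
swap_to_cpu(cache["b"])   # no longer counted against GPU memory
del cache["a"]            # evict: the sequence is dropped entirely

usage = sum(e.gpu_cost for e in cache.values())
```

In this toy setup, GPU usage falls from 0.90 to roughly 0.10 in one step, at the cost of one dropped sequence and one swapped sequence.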
├── __init__.py # Compatibility exports for env loading
└── kv_cache_rl/
    ├── kv_cache_eviction.py # StatefulToolEnv + rewards + load_environment()
    ├── simulator.py # KVCacheSimulator + dynamics + pressure tracking
    └── scenarios.py # Scenario generation + compact prompt builders
- Episode length: 15 steps
- Difficulty tiers:
  - Easy: memory budget 0.90-1.00, low/medium arrivals
  - Medium: memory budget 0.75-0.90, medium/high arrivals
  - Hard: memory budget 0.65-0.80, high/bursty arrivals
- Block types: `system`, `context`, `generation`, `reasoning`, `ephemeral`
- `eviction_value` indicates eviction safety (`0.0` worst to evict, `1.0` safest)
- Compression uses actual quantization multipliers (fp16=1.0, int8=0.5, int4=0.25)
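The tier settings above could be captured in a small config table. The sketch below is an assumption about layout, not the contents of `scenarios.py`: the names `BUDGET_RANGE` and `sample_memory_budget` are hypothetical, and uniform sampling within each range is a guess at how a budget might be drawn.

```python
import random

EPISODE_LENGTH = 15  # steps per episode, as listed above

# Memory budget ranges per difficulty tier (values from this README).
BUDGET_RANGE = {
    "easy": (0.90, 1.00),
    "medium": (0.75, 0.90),
    "hard": (0.65, 0.80),
}


def sample_memory_budget(difficulty: str, rng: random.Random) -> float:
    """Draw a budget uniformly from the tier's range (sampling law is assumed)."""
    low, high = BUDGET_RANGE[difficulty]
    return rng.uniform(low, high)


rng = random.Random(42)
budgets = {tier: sample_memory_budget(tier, rng) for tier in BUDGET_RANGE}
```

Note that the medium and hard ranges overlap between 0.75 and 0.80, so a sampled budget alone does not identify the tier; arrival patterns differ as well.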
The simulator tracks and exposes:
- `time_above_0_95`: number of steps with normalized memory usage above 0.95
- `pending_queue_growth`: step-to-step change in pending queue size
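A minimal sketch of how these two signals could be maintained per step follows. The `PressureTracker` class is hypothetical, and it assumes "normalized" means usage divided by the memory budget; the actual simulator may define both differently.

```python
class PressureTracker:
    """Hypothetical per-episode tracker for the two pressure signals."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.time_above_0_95 = 0        # count of steps above the threshold
        self.pending_queue_growth = 0   # change in queue size since last step
        self._last_pending = 0

    def update(self, memory_usage: float, memory_budget: float,
               pending_count: int) -> None:
        # Assumption: normalized usage = usage / budget.
        normalized = memory_usage / memory_budget
        if normalized > self.threshold:
            self.time_above_0_95 += 1
        self.pending_queue_growth = pending_count - self._last_pending
        self._last_pending = pending_count


tracker = PressureTracker()
for usage, pending in [(0.90, 2), (0.97, 5), (0.96, 4)]:
    tracker.update(usage, memory_budget=1.0, pending_count=pending)
```

After these three steps, the tracker has counted two steps above the threshold and reports a queue change of -1 for the last step.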
Prompts include a compact summary instead of full raw cache dumps to reduce token usage.
Top-level fields include:
- `memory_usage`, `memory_budget`, `episode_remaining`
- `time_above_0_95`, `pending_queue_growth`
- `cache_overview` (block counts, compressed/swapped counts)
- `top_memory_entries` (largest entries)
- `top_eviction_candidates` (ranked by safety/priority/progress)
- `pending_summary`
- `visible_seq_ids` (recommended IDs for tool actions)
| Function | Weight | Description |
|---|---|---|
| `failure_penalty` | 0.40 | Accelerating penalty: `min(f*0.15 + max(0,f-2)*0.1, 1.0)` |
| `throughput_reward` | 0.25 | `min(total_tokens / 500, 1.0)` |
| `headroom_bonus` | 0.18 | Per-step memory tracking, rewards staying below 0.8/0.95/1.0 thresholds |
| `memory_efficiency` | 0.07 | Step function on final memory state (0.2 if <0.85, -0.2 if overflow) |
| `eviction_quality` | 0.05 | Rewards preserving high-value blocks (`system`, `context`) |
| `latency_penalty` | 0.03 | `-min(total_swap_latency * 0.2, 0.3)` |
| `risky_eviction_penalty` | 0.02 | Penalizes evicting system/high-priority/high-progress blocks |
Tracked metrics (weight=0): `total_tokens_metric`, `total_failures_metric`, `total_latency_metric`, `pressure_steps_metric`.
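The three rows that state a closed-form expression can be written out directly. The sketch below only transcribes those formulas; the function names with a `_term` or `_magnitude` suffix are hypothetical, and how the environment signs and combines the terms is not specified here, so no total reward is computed.

```python
# Weights as listed in the table above.
WEIGHTS = {
    "failure_penalty": 0.40,
    "throughput_reward": 0.25,
    "headroom_bonus": 0.18,
    "memory_efficiency": 0.07,
    "eviction_quality": 0.05,
    "latency_penalty": 0.03,
    "risky_eviction_penalty": 0.02,
}


def failure_penalty_magnitude(f: int) -> float:
    """Table formula: linear in f, with an extra 0.1 per failure beyond two."""
    return min(f * 0.15 + max(0, f - 2) * 0.1, 1.0)


def throughput_term(total_tokens: float) -> float:
    """Table formula: saturates at 500 tokens."""
    return min(total_tokens / 500, 1.0)


def latency_term(total_swap_latency: float) -> float:
    """Table formula: non-positive, floored at -0.3."""
    return -min(total_swap_latency * 0.2, 0.3)


# How the penalty accelerates with the failure count.
for failures in (0, 2, 3, 5):
    print(failures, round(failure_penalty_magnitude(failures), 2))
```

The listed weights sum to 1.0, and the failure expression reaches its cap of 1.0 at five failures (0.75 + 0.30 before clamping).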
# Install local environment
prime env install ./environments/kv_cache_eviction
# Run eval
prime eval run kv_cache_eviction -m gpt-5.4-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -n 5 -r 1 --max-concurrent 1
# Hard-only scenarios
prime eval run kv_cache_eviction --env-args '{"difficulty":"hard"}'

| Arg | Type | Default | Description |
|---|---|---|---|
| `difficulty` | str | `"all"` | One of `easy`, `medium`, `hard`, `all` |
| `num_examples` | int | `-1` | Number of scenarios to use (`-1` = all) |
| `seed` | int | `42` | Scenario generation seed |
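To make the defaults concrete, the sketch below shows one way a JSON `--env-args` string could be merged with the documented defaults and validated. `resolve_env_args` is a hypothetical helper written for illustration; it is not part of this package or of the `prime` CLI.

```python
import json

# Defaults and allowed values as documented in the table above.
DEFAULTS = {"difficulty": "all", "num_examples": -1, "seed": 42}
VALID_DIFFICULTIES = {"easy", "medium", "hard", "all"}


def resolve_env_args(raw: str = "{}") -> dict:
    """Merge a JSON args string over the defaults and validate it (hypothetical)."""
    args = {**DEFAULTS, **json.loads(raw)}
    unknown = set(args) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown env args: {sorted(unknown)}")
    if args["difficulty"] not in VALID_DIFFICULTIES:
        raise ValueError(f"invalid difficulty: {args['difficulty']!r}")
    return args


hard_args = resolve_env_args('{"difficulty":"hard"}')
# hard_args keeps the default num_examples and seed.
```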