semioz/kv-cache-env

kv-cache-rl

RL environment for KV-cache eviction policy optimization in LLM inference serving.

Inspiration

This env borrows ideas from NVIDIA Dynamo's March 2026 post, "Full-Stack Optimizations for Agentic Inference with Dynamo":

  • Blog: https://docs.nvidia.com/dynamo/dev/blog/agentic-inference

Adapted from the post:

  • KV block value differences (system/context high value vs reasoning/ephemeral lower value)
  • Priority/retention-style cache policy ideas
  • Agent lifecycle angle (ephemeral work should be easier to evict)

This is a simplified RL simulator inspired by those ideas, not Dynamo itself.

Problem

The agent manages GPU KV-cache pressure with multiple actions per step:

| Action | Effect |
| --- | --- |
| evict(seq_id) | Drops the sequence from the cache and frees its GPU memory |
| compress(seq_id) | Quantizes the sequence down one tier (fp16→int8→int4); each tier halves its memory cost |
| swap_to_cpu(seq_id) | Frees GPU memory but adds swap access latency |
| keep() | No action |

The model can call multiple tools per step. Under high pressure it should batch 2-3 evictions/compressions to create headroom.
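As a rough illustration of how a batch of tool calls might change GPU memory in one step, here is a minimal sketch. The `Entry` class and `apply_actions` function are hypothetical names and are not taken from `simulator.py`. The sketch also ignores the int4 floor on compression and treats unknown sequence IDs as no-ops, both of which are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical cache entry; not the simulator's real data model."""
    gpu_mem: float          # normalized GPU memory currently held
    swapped: bool = False   # True once the entry lives on CPU


def apply_actions(cache: dict[str, Entry],
                  actions: list[tuple[str, str | None]]) -> float:
    """Apply a batch of (action, seq_id) pairs and return GPU memory freed."""
    freed = 0.0
    for name, seq_id in actions:
        if name == "keep":
            continue                      # keep() is a no-op
        entry = cache.get(seq_id)
        if entry is None:
            continue                      # assumption: unknown IDs are ignored
        if name == "evict":
            freed += entry.gpu_mem
            del cache[seq_id]             # sequence is dropped entirely
        elif name == "compress":
            freed += entry.gpu_mem / 2    # one tier halves the memory cost
            entry.gpu_mem /= 2
        elif name == "swap_to_cpu":
            freed += entry.gpu_mem        # GPU memory freed, latency paid later
            entry.gpu_mem = 0.0
            entry.swapped = True
    return freed
```

Batching an eviction, a compression, and a swap in a single call to `apply_actions` mirrors the "2-3 actions under high pressure" behavior described above.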

Goal: maximize throughput while minimizing failures, latency, and unsafe evictions.

├── __init__.py            # Compatibility exports for env loading
└── kv_cache_rl/
    ├── kv_cache_eviction.py   # StatefulToolEnv + rewards + load_environment()
    ├── simulator.py           # KVCacheSimulator + dynamics + pressure tracking
    └── scenarios.py           # Scenario generation + compact prompt builders

Simulator Highlights

  • Episode length: 15 steps
  • Difficulty tiers:
    • Easy: memory budget 0.90-1.00, low/medium arrivals
    • Medium: memory budget 0.75-0.90, medium/high arrivals
    • Hard: memory budget 0.65-0.80, high/bursty arrivals
  • Block types: system, context, generation, reasoning, ephemeral
  • eviction_value indicates eviction safety (0.0 worst to evict, 1.0 safest)
  • Compression uses actual quantization multipliers (fp16=1.0, int8=0.5, int4=0.25)
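The quantization multipliers listed above can be sketched as a small lookup. The names `QUANT_MULTIPLIER`, `compress_once`, and `memory_cost` are illustrative, and treating a compress call on an int4 entry as a no-op is an assumption rather than documented simulator behavior.

```python
# Multipliers as listed above: memory cost relative to fp16.
QUANT_MULTIPLIER = {"fp16": 1.0, "int8": 0.5, "int4": 0.25}

# One compress step moves down a single tier.
NEXT_TIER = {"fp16": "int8", "int8": "int4"}


def compress_once(tier: str) -> str:
    """Return the next lower-precision tier (assumption: int4 stays int4)."""
    return NEXT_TIER.get(tier, tier)


def memory_cost(base_fp16_cost: float, tier: str) -> float:
    """Memory cost of an entry at the given tier, relative to its fp16 cost."""
    return base_fp16_cost * QUANT_MULTIPLIER[tier]
```

For example, an entry that costs 0.4 of the budget at fp16 would cost 0.2 at int8 and 0.1 at int4 under these multipliers.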

Pressure Signals

The simulator tracks and exposes:

  • time_above_0_95: number of steps with normalized memory usage above 0.95
  • pending_queue_growth: step-to-step change in pending queue size
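A minimal sketch of how these two signals could be maintained per step is shown below. `PressureTracker` is a hypothetical class, not the simulator's implementation. It assumes the memory usage passed in is already normalized and that growth is reported as 0 on the first step.

```python
from __future__ import annotations


class PressureTracker:
    """Hypothetical tracker for the two pressure signals described above."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.time_above_0_95 = 0        # steps with usage above the threshold
        self.pending_queue_growth = 0   # change in queue size since last step
        self._prev_pending: int | None = None

    def update(self, normalized_usage: float, pending_size: int) -> None:
        """Record one simulator step."""
        if normalized_usage > self.threshold:
            self.time_above_0_95 += 1
        if self._prev_pending is None:
            self.pending_queue_growth = 0   # assumption: no growth on step 1
        else:
            self.pending_queue_growth = pending_size - self._prev_pending
        self._prev_pending = pending_size
```

A negative `pending_queue_growth` means the queue shrank since the previous step.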

Observation Format (Compact)

Prompts include a compact summary instead of full raw cache dumps to reduce token usage.

Top-level fields include:

  • memory_usage, memory_budget, episode_remaining
  • time_above_0_95, pending_queue_growth
  • cache_overview (block counts, compressed/swapped counts)
  • top_memory_entries (largest entries)
  • top_eviction_candidates (ranked by safety/priority/progress)
  • pending_summary
  • visible_seq_ids (recommended IDs for tool actions)
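To make the ranking of `top_eviction_candidates` concrete, the sketch below orders entries by safety, then priority, then progress. The sort order and the `priority` and `progress` field names are assumptions for illustration; only `eviction_value` (1.0 = safest to evict) is described in this README.

```python
def top_eviction_candidates(entries: list[dict], k: int = 3) -> list[str]:
    """Return up to k seq_ids, safest-to-evict first.

    Assumed ordering: higher eviction_value first, then lower priority,
    then lower progress. The real prompt builder may rank differently.
    """
    ranked = sorted(
        entries,
        key=lambda e: (-e["eviction_value"], e["priority"], e["progress"]),
    )
    return [e["seq_id"] for e in ranked[:k]]
```

Truncating to the top `k` entries is what keeps the observation compact compared with a full cache dump.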

Reward Functions

| Function | Weight | Description |
| --- | --- | --- |
| failure_penalty | 0.40 | Accelerating penalty: min(f*0.15 + max(0,f-2)*0.1, 1.0) |
| throughput_reward | 0.25 | min(total_tokens / 500, 1.0) |
| headroom_bonus | 0.18 | Per-step memory tracking, rewards staying below 0.8/0.95/1.0 thresholds |
| memory_efficiency | 0.07 | Step function on final memory state (0.2 if <0.85, -0.2 if overflow) |
| eviction_quality | 0.05 | Rewards preserving high-value blocks (system, context) |
| latency_penalty | 0.03 | -min(total_swap_latency * 0.2, 0.3) |
| risky_eviction_penalty | 0.02 | Penalizes evicting system/high-priority/high-progress blocks |

Tracked metrics (weight=0): total_tokens_metric, total_failures_metric, total_latency_metric, pressure_steps_metric.
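The three reward terms with explicit formulas can be written out directly, as in the sketch below. The function bodies follow the formulas in the table; the standalone function signatures are an assumption (the real reward functions in `kv_cache_eviction.py` likely take environment state). The sketch returns the failure term's magnitude exactly as written in the table and does not assume how its sign is applied in the weighted sum.

```python
# Weights as listed in the table above.
WEIGHTS = {
    "failure_penalty": 0.40,
    "throughput_reward": 0.25,
    "headroom_bonus": 0.18,
    "memory_efficiency": 0.07,
    "eviction_quality": 0.05,
    "latency_penalty": 0.03,
    "risky_eviction_penalty": 0.02,
}


def failure_penalty(f: int) -> float:
    """Accelerating term: each failure beyond the second adds an extra 0.1."""
    return min(f * 0.15 + max(0, f - 2) * 0.1, 1.0)


def throughput_reward(total_tokens: int) -> float:
    """Saturates at 1.0 once 500 tokens have been produced."""
    return min(total_tokens / 500, 1.0)


def latency_penalty(total_swap_latency: float) -> float:
    """Non-positive term, capped at -0.3."""
    return -min(total_swap_latency * 0.2, 0.3)
```

Under the formula, four failures give 4*0.15 + 2*0.1 = 0.8, and the term saturates at 1.0 from six failures onward. The listed weights sum to 1.0.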

Usage

# Install local environment
prime env install ./environments/kv_cache_eviction

# Run eval
prime eval run kv_cache_eviction -m gpt-5.4-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -n 5 -r 1 --max-concurrent 1

# Hard-only scenarios
prime eval run kv_cache_eviction --env-args '{"difficulty":"hard"}'

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| difficulty | str | "all" | One of easy, medium, hard, all |
| num_examples | int | -1 | Number of scenarios to use (-1 = all) |
| seed | int | 42 | Scenario generation seed |
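When scripting evaluations, it can help to build the JSON string for `--env-args` programmatically. The helper below is a hypothetical convenience, not part of this repository. It only validates `difficulty` against the values in the table and serializes the three documented arguments with their documented defaults.

```python
import json

VALID_DIFFICULTIES = {"easy", "medium", "hard", "all"}


def build_env_args(difficulty: str = "all",
                   num_examples: int = -1,
                   seed: int = 42) -> str:
    """Return a JSON string suitable for passing to --env-args."""
    if difficulty not in VALID_DIFFICULTIES:
        raise ValueError(f"difficulty must be one of {sorted(VALID_DIFFICULTIES)}")
    return json.dumps(
        {"difficulty": difficulty, "num_examples": num_examples, "seed": seed}
    )
```

The returned string can be passed as the value of `--env-args` in the `prime eval run` commands shown under Usage.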
