Feature/observability metrics by chanhopark1 · Pull Request #5 · moreh-dev/LMCache

chanhopark1 · 2026-03-10T01:33:59Z

What this PR does / why we need it:
This PR updates the observability and telemetry subsystems to track metrics by storage tier.

When we use multiple storage backends (CPU, disk, remote) or async retrieve operations, aggregating all hits into a single counter hides which tier served them, which makes performance bottlenecks hard to debug.

Main changes:

Tiered hit tracking: Added tier="cpu", tier="disk", and tier="remote" labels to Prometheus hit token counters.
Disk I/O metrics: Added specific metrics for local disk reads, writes, latencies, and evictions.
Async propagation: Passed backend names through the async prefetch and layerwise retrieve pipelines so we correctly attribute hits even when operations happen in the background.
Safer defaults: Added default values (0, 0.0, field(default_factory=list)) to LMCacheStats fields to fix dataclass initialization errors.

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

Instrument LocalDiskBackend.read_file() and write_file(), which already compute timing and bandwidth internally (logged via logger.debug).

New metrics (mirrors remote_* pattern):
lmcache:local_disk_read_bytes_total Counter
lmcache:local_disk_write_bytes_total Counter
lmcache:local_disk_read_latency Histogram (seconds)
lmcache:local_disk_write_latency Histogram (seconds)

Track disk backend evictions via lmcache:local_disk_evict_count counter. Mirrors existing local_cpu_evict_count pattern. Called from LocalDiskBackend.remove() when cached entries are evicted.

Split aggregate hit tokens into per-backend (cpu/disk/remote) counters. _process_tokens_internal already iterates block_mapping keyed by backend name. This PR counts tokens per backend and exposes them via the num_hit_tokens counter with tier={cpu,disk,remote} labels.

Before:
lmcache:num_hit_tokens{tier="local"} 5000

After:
lmcache:num_hit_tokens{tier="cpu"} 4200
lmcache:num_hit_tokens{tier="disk"} 800
lmcache:num_hit_tokens{tier="remote"} 0
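The per-backend counting described above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the shape of block_mapping (backend name mapped to a list of token ranges), the BACKEND_TO_TIER table, and the function name count_hit_tokens_per_tier are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical mapping from backend names to tier labels; the real keys of
# block_mapping may differ.
BACKEND_TO_TIER = {
    "LocalCPUBackend": "cpu",
    "LocalDiskBackend": "disk",
    "RemoteBackend": "remote",
}


def count_hit_tokens_per_tier(block_mapping):
    """Sum hit tokens per tier.

    block_mapping is assumed to map a backend name to a list of
    (start, end) token ranges served by that backend (end exclusive).
    """
    hits = Counter({"cpu": 0, "disk": 0, "remote": 0})
    for backend_name, ranges in block_mapping.items():
        tier = BACKEND_TO_TIER.get(backend_name)
        if tier is None:
            continue  # unknown backend: not attributed to any tier
        hits[tier] += sum(end - start for start, end in ranges)
    return dict(hits)


example = {
    "LocalCPUBackend": [(0, 4096), (4096, 4200)],
    "LocalDiskBackend": [(4200, 5000)],
}
per_tier = count_hit_tokens_per_tier(example)
```

Each entry of per_tier would then be added to the num_hit_tokens counter under the matching tier label. Tiers with no hits stay at zero so that all three series are always present.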

Time each batched_get call per backend in _process_tokens_internal. Exposes lmcache:tier_get_latency{tier="cpu|disk|remote"} histogram so operators can identify which cache tier causes retrieval latency.

New metric:
lmcache:tier_get_latency{tier="cpu"} Histogram (seconds)
lmcache:tier_get_latency{tier="disk"} Histogram (seconds)
lmcache:tier_get_latency{tier="remote"} Histogram (seconds)
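One way to time each per-backend call is a small context manager around the get. The sketch below uses only the standard library and stores samples in a dict of lists; the names timed_get, tier_get_latency, and fake_batched_get are hypothetical, and a real implementation would observe the duration into the histogram rather than keep a list.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Latency samples in seconds, keyed by tier label. This stands in for a
# labelled histogram and is an assumption of this sketch.
tier_get_latency = defaultdict(list)


@contextmanager
def timed_get(tier):
    """Record the wall-clock duration of the enclosed block under `tier`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the enclosed call raises.
        tier_get_latency[tier].append(time.perf_counter() - start)


def fake_batched_get(keys):
    # Stand-in for a backend's batched_get; returns one object per key.
    return [f"memobj:{k}" for k in keys]


with timed_get("disk"):
    objs = fake_batched_get(["k1", "k2"])
```

Recording in the finally clause means failed gets are timed as well, which may or may not be what the actual metric does.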

Track which cache tier served each retrieve request:
lmcache:request_tier_served{tier="cpu|disk|remote|mixed|miss"} Counter
lmcache:request_tier_hit_tokens{tier="cpu|disk|remote"} Histogram

Dominant tier = tier serving >50% of tokens. "mixed" = no majority. Enables tier migration heatmaps and per-request routing analysis.
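The dominant-tier rule can be written as a small pure function. This is a sketch under one assumption that the description leaves open: the 50% threshold is taken relative to the total number of hit tokens, not the number of requested tokens. The function name dominant_tier is illustrative.

```python
def dominant_tier(hit_tokens):
    """Classify a request by the tier that served most of its hit tokens.

    hit_tokens maps a tier label ("cpu", "disk", "remote") to a token count.
    Returns the tier serving strictly more than 50% of the hit tokens,
    "mixed" if no tier has a majority, or "miss" if nothing was hit.
    """
    total = sum(hit_tokens.values())
    if total == 0:
        return "miss"
    for tier, tokens in hit_tokens.items():
        # Integer comparison avoids floating-point rounding at exactly 50%.
        if tokens * 2 > total:
            return tier
    return "mixed"
```

With the counts from the example above, dominant_tier({"cpu": 4200, "disk": 800, "remote": 0}) returns "cpu", while an even split such as {"cpu": 500, "disk": 500} returns "mixed" because neither tier exceeds half.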

Layerwise path (retrieve_layer): location is already known via storage_manager.contains(). Set cpu/disk/remote_hit_tokens on retrieve_stats before on_retrieve_finished.

Async path (_async_process_tokens_internal): track which backend each key came from via a key_to_backend map built during result unpacking. Propagate per-backend token counts to retrieve_stats.

Also update cleanup_memory_objs to handle the new result format from gather_with_keys, which now includes backend names.

Modify gather_with_keys() to include backend_name alongside each tier's key-memobj pairs, enabling per-tier hit attribution in the async path. Update prefetch_all_done_callback to unpack the new tuple format. Update existing tests to match.
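To illustrate how a backend name carried alongside each tier's results enables attribution, here is a minimal sketch. The exact tuple layout returned by gather_with_keys is not shown in this description, so the (backend_name, [(key, memory_obj), ...]) shape, the helper names, and the fixed tokens-per-key value are all assumptions.

```python
from collections import Counter


def build_key_to_backend(tier_results):
    """Map each key to the backend that returned it.

    tier_results is assumed to be a list of
    (backend_name, [(key, memory_obj), ...]) tuples, one per tier.
    If a key appears in several tiers, the first tier listed wins.
    """
    key_to_backend = {}
    for backend_name, pairs in tier_results:
        for key, _memory_obj in pairs:
            key_to_backend.setdefault(key, backend_name)
    return key_to_backend


def tokens_per_backend(key_to_backend, tokens_per_key):
    """Sum token counts per backend, assuming a fixed chunk size per key."""
    counts = Counter()
    for backend_name in key_to_backend.values():
        counts[backend_name] += tokens_per_key
    return dict(counts)


results = [
    ("LocalCPUBackend", [("chunk-0", b"..."), ("chunk-1", b"...")]),
    ("LocalDiskBackend", [("chunk-2", b"...")]),
]
mapping = build_key_to_backend(results)
counts = tokens_per_backend(mapping, tokens_per_key=256)
```

Callers that previously iterated over plain key-memobj pairs would need to unpack the extra backend_name element, which is the kind of change the callback and test updates mentioned above would cover.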

Clean up comments referencing internal PR numbers and reuse disk_latency_buckets for tier_get_latency histogram.

Add default values to LMCacheStats dataclass fields. This prevents dataclass initialization errors by providing sensible defaults for all fields (0 for ints, 0.0 for floats, field(default_factory=list) for lists).
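The default-value pattern looks roughly like the sketch below. StatsSketch and its field names are hypothetical stand-ins, not the actual LMCacheStats definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatsSketch:
    # Hypothetical fields standing in for LMCacheStats; names are illustrative.
    cpu_hit_tokens: int = 0
    disk_hit_tokens: int = 0
    cache_hit_rate: float = 0.0
    # Mutable defaults must use default_factory; writing `= []` here raises
    # ValueError when the class is defined.
    disk_read_latencies: List[float] = field(default_factory=list)


a = StatsSketch()                      # no arguments required
b = StatsSketch(cpu_hit_tokens=4200)   # override only what is known
a.disk_read_latencies.append(0.003)    # does not affect b's list
```

Because every field has a default, adding a new metric field no longer forces every call site that constructs the dataclass to pass a value for it, and default_factory gives each instance its own list.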

Change DEFAULT_PROMETHEUS_CONFIG to enabled=True so metrics are collected out of the box.

chanhopark1 added 9 commits March 10, 2026 10:01

Add local disk eviction counter

2ea29fa

Remove internal PR references and deduplicate bucket list

d7213c7

Add default values to LMCacheStats dataclass fields

52afd9c

gitgod-bot assigned chanhopark1 Mar 10, 2026

chanhopark1 force-pushed the feature/observability-metrics branch 8 times, most recently from 3e84e60 to f4cf7c2 Compare March 10, 2026 06:56

Enable Prometheus metrics by default

1a07771

chanhopark1 force-pushed the feature/observability-metrics branch from f4cf7c2 to 1a07771 Compare March 10, 2026 07:00

chanhopark1 wants to merge 10 commits into dev from feature/observability-metrics
