Feature/observability metrics #5

Open
chanhopark1 wants to merge 10 commits into dev from feature/observability-metrics

What this PR does / why we need it:
This PR updates the observability and telemetry subsystems to track metrics by storage tier.

When multiple storage backends (CPU, disk, remote) or async retrieve operations are in use, aggregating all hits into a single counter hides which tier served a request, which makes performance bottlenecks hard to debug.

Main changes:

  • Tiered hit tracking: Added tier="cpu", tier="disk", and tier="remote" labels to Prometheus hit token counters.
  • Disk I/O metrics: Added specific metrics for local disk reads, writes, latencies, and evictions.
  • Async propagation: Passed backend names through the async prefetch and layerwise retrieve pipelines so we correctly attribute hits even when operations happen in the background.
  • Safer defaults: Added default values (0, 0.0, field(default_factory=list)) to LMCacheStats fields to fix dataclass initialization errors.

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Instrument LocalDiskBackend.read_file() and write_file(), which already
compute timing and bandwidth internally (logged via logger.debug).

New metrics (mirroring the remote_* pattern):
  lmcache:local_disk_read_bytes_total   Counter
  lmcache:local_disk_write_bytes_total  Counter
  lmcache:local_disk_read_latency       Histogram (seconds)
  lmcache:local_disk_write_latency      Histogram (seconds)

Track disk backend evictions via lmcache:local_disk_evict_count counter.
Mirrors existing local_cpu_evict_count pattern. Called from
LocalDiskBackend.remove() when cached entries are evicted.

Split aggregate hit tokens into per-backend (cpu/disk/remote) counters.
_process_tokens_internal already iterates block_mapping keyed by backend
name. This PR counts tokens per backend and exposes them via the
num_hit_tokens counter with tier={cpu,disk,remote} labels.

Before: lmcache:num_hit_tokens{tier="local"} 5000
After:  lmcache:num_hit_tokens{tier="cpu"} 4200
        lmcache:num_hit_tokens{tier="disk"} 800
        lmcache:num_hit_tokens{tier="remote"} 0
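The per-backend counting described above could look roughly like the sketch below. The shape of block_mapping (backend name mapped to a list of (key, num_tokens) pairs) and the BACKEND_TO_TIER table are assumptions made for illustration; they are not taken from the LMCache source.

```python
from collections import Counter

# Hypothetical mapping from backend name to the value of the `tier` label.
BACKEND_TO_TIER = {
    "LocalCPUBackend": "cpu",
    "LocalDiskBackend": "disk",
    "RemoteBackend": "remote",
}


def count_hit_tokens_by_tier(block_mapping):
    """Sum hit tokens per tier; tiers without hits are reported as 0."""
    hits = Counter({tier: 0 for tier in BACKEND_TO_TIER.values()})
    for backend_name, blocks in block_mapping.items():
        tier = BACKEND_TO_TIER.get(backend_name)
        if tier is None:
            continue  # unknown backend: not attributed to any tier
        hits[tier] += sum(num_tokens for _key, num_tokens in blocks)
    return dict(hits)


# Assumed input shape: backend name -> [(key, num_tokens), ...]
block_mapping = {
    "LocalCPUBackend": [("k0", 2048), ("k1", 2152)],
    "LocalDiskBackend": [("k2", 800)],
}
hits_by_tier = count_hit_tokens_by_tier(block_mapping)
```

Each entry of hits_by_tier would then be added to the counter series carrying the matching tier label. Reporting tiers with zero hits explicitly keeps all three series present, as in the "After" output above.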
Time each batched_get call per backend in _process_tokens_internal.
Exposes lmcache:tier_get_latency{tier="cpu|disk|remote"} histogram
so operators can identify which cache tier causes retrieval latency.

New metric:
  lmcache:tier_get_latency{tier="cpu"}    Histogram (seconds)
  lmcache:tier_get_latency{tier="disk"}   Histogram (seconds)
  lmcache:tier_get_latency{tier="remote"} Histogram (seconds)
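One way to time each per-backend get is a small context manager, sketched below. The tier_get_latency dict stands in for the labelled histogram, and fake_batched_get is a placeholder; neither name comes from the LMCache code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Stand-in for a labelled histogram: tier -> list of observed seconds.
tier_get_latency = defaultdict(list)


@contextmanager
def timed_tier_get(tier):
    """Record the wall-clock duration of the enclosed block under `tier`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the enclosed call raises, so failures are timed too.
        tier_get_latency[tier].append(time.perf_counter() - start)


def fake_batched_get(keys):
    # Placeholder for a backend's batched_get call.
    return [f"memobj:{k}" for k in keys]


with timed_tier_get("disk"):
    objs = fake_batched_get(["k0", "k1"])
```

Whether failed gets should be observed at all is a design choice; this sketch records them, which keeps slow failures visible in the latency distribution.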
Track which cache tier served each retrieve request:
  lmcache:request_tier_served{tier="cpu|disk|remote|mixed|miss"} Counter
  lmcache:request_tier_hit_tokens{tier="cpu|disk|remote"} Histogram

Dominant tier = tier serving >50% of tokens. "mixed" = no majority.
Enables tier migration heatmaps and per-request routing analysis.
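The dominant-tier rule can be sketched as a small pure function. Two points are assumptions not stated above: the 50% threshold is taken relative to the request's total hit tokens, and a request with no hit tokens is labelled "miss".

```python
def classify_request_tier(hit_tokens):
    """Return the tier label for one retrieve request.

    hit_tokens maps a tier name ("cpu", "disk", "remote") to the number
    of tokens that tier served for this request.
    """
    total = sum(hit_tokens.values())
    if total == 0:
        return "miss"  # assumption: no hit tokens at all counts as a miss
    for tier, tokens in hit_tokens.items():
        if tokens * 2 > total:  # strictly more than 50% of hit tokens
            return tier
    return "mixed"  # hits exist, but no tier holds a majority


dominant = classify_request_tier({"cpu": 4200, "disk": 800, "remote": 0})
even_split = classify_request_tier({"cpu": 500, "disk": 500, "remote": 0})
no_hits = classify_request_tier({"cpu": 0, "disk": 0, "remote": 0})
```

Under these assumptions an exact 50/50 split is "mixed", because neither tier serves strictly more than half of the tokens.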
Layerwise path (retrieve_layer): location is already known via
storage_manager.contains(). Set cpu/disk/remote_hit_tokens on
retrieve_stats before on_retrieve_finished.

Async path (_async_process_tokens_internal): track which backend
each key came from via key_to_backend map built during result
unpacking. Propagate per-backend token counts to retrieve_stats.

Also update cleanup_memory_objs to handle the new result format
from gather_with_keys, which now includes backend names.

Modify gather_with_keys() to include backend_name alongside each
tier's key-memobj pairs, enabling per-tier hit attribution in the
async path. Update prefetch_all_done_callback to unpack the new
tuple format. Update existing tests to match.
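A sketch of unpacking the new result format into a key_to_backend map follows. The exact tuple layout, a list of (backend_name, [(key, memobj), ...]) entries, is an assumption about what gather_with_keys() returns, and build_key_to_backend is a hypothetical helper.

```python
def build_key_to_backend(tier_results):
    """Map each key to the backend that served it.

    tier_results: list of (backend_name, [(key, memobj), ...]) tuples
    (assumed layout).
    """
    key_to_backend = {}
    for backend_name, key_memobj_pairs in tier_results:
        for key, _memobj in key_memobj_pairs:
            # Keep the first backend that returned the key.
            key_to_backend.setdefault(key, backend_name)
    return key_to_backend


tier_results = [
    ("LocalCPUBackend", [("k0", b"..."), ("k1", b"...")]),
    ("LocalDiskBackend", [("k2", b"...")]),
]
key_to_backend = build_key_to_backend(tier_results)
```

With this map in hand, per-backend token counts can be accumulated while iterating over the retrieved keys and then copied onto retrieve_stats.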
Clean up comments referencing internal PR numbers and
reuse disk_latency_buckets for the tier_get_latency histogram.

Prevents dataclass initialization errors by providing sensible
defaults for all fields (0 for ints, 0.0 for floats, field(default_factory=list) for lists).
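The defaults pattern, shown on a stand-in dataclass. StatsSketch and its field names are hypothetical; the real LMCacheStats fields may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatsSketch:
    # Hypothetical field names, one per default kind used in this PR.
    num_hit_tokens: int = 0
    cache_hit_rate: float = 0.0
    time_to_retrieve: List[float] = field(default_factory=list)


# With defaults on every field, construction needs no arguments.
a = StatsSketch()
b = StatsSketch()
a.time_to_retrieve.append(0.25)  # only `a` is affected
```

field(default_factory=list) gives every instance its own list. A bare `= []` default is rejected by the dataclasses module with a ValueError, so the factory form is required for list fields.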
chanhopark1 force-pushed the feature/observability-metrics branch 8 times, most recently from 3e84e60 to f4cf7c2 (March 10, 2026 06:56)

Change DEFAULT_PROMETHEUS_CONFIG to enabled=True so metrics
are collected out of the box.

chanhopark1 force-pushed the feature/observability-metrics branch from f4cf7c2 to 1a07771 (March 10, 2026 07:00)