Redis-backed application cache for hot read paths #388

Open
ptrlrd wants to merge 2 commits into main from feat/redis-cache

ptrlrd (Owner) commented Jun 1, 2026

Run uploads went from ~9k/week to ~130k/week. Every uncached read is amplifying Mongo load proportionally. PR #385 added a Mongo-backed lazy cache (stats_summary) which handles the worst-case "first user pays 5-10 s" path; this PR puts an even faster, configurable, cross-endpoint layer in front of it and the other hot reads.

What

A fail-safe Redis layer (services/cache.py) plus wiring into the four hottest read paths.

Cache substrate

  • services/cache.py — lazy redis-py client, get_json / set_json / delete / delete_pattern (SCAN-based). Every operation is fail-safe: Redis being unavailable returns None on get and no-ops on set, so the caller falls back to its existing data source. The cache is always an optimization, never load-bearing.
  • metrics.py: spire_codex_cache_hits_total / _misses_total / _errors_total, labeled by key namespace (stats, leaderboard, run, entity_scores).
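A minimal sketch of the fail-safe behaviour described above, not the actual contents of services/cache.py. The class name `FailSafeCache`, the injectable `client` argument, and the 0.25 s socket timeout are assumptions made for illustration; only `get_json` / `set_json` and the "return None on get, no-op on set" contract come from this PR.

```python
import json
from typing import Any, Optional


class FailSafeCache:
    """Hypothetical wrapper: every method swallows Redis errors."""

    def __init__(self, url: Optional[str] = None, client: Any = None):
        self._url = url
        self._client = client  # injectable so the sketch can run without Redis
        self._init_tried = client is not None

    def _get_client(self) -> Any:
        # Lazy: build the client on first use, and only if a URL was configured.
        if not self._init_tried:
            self._init_tried = True
            if self._url:
                try:
                    import redis  # redis-py, imported lazily

                    self._client = redis.Redis.from_url(
                        self._url, socket_timeout=0.25, decode_responses=True
                    )
                except Exception:
                    self._client = None
        return self._client

    def get_json(self, key: str) -> Optional[Any]:
        client = self._get_client()
        if client is None:
            return None  # cache disabled: behaves like a miss
        try:
            raw = client.get(key)
            return None if raw is None else json.loads(raw)
        except Exception:
            return None  # Redis down or bad payload: also a miss

    def set_json(self, key: str, value: Any, ttl_seconds: int) -> None:
        client = self._get_client()
        if client is None:
            return
        try:
            client.set(key, json.dumps(value), ex=ttl_seconds)
        except Exception:
            pass  # writes are best-effort
```

With no URL configured, `FailSafeCache().get_json("stats:all")` simply returns None, so the caller proceeds to its existing data source exactly as it did before the cache existed.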

Wired in

| Endpoint | Layer 0 | TTL | Why |
| --- | --- | --- | --- |
| /api/runs/stats | Redis check before stats_summary | 60s | Matches refresher cycle; cluster-wide instead of per-worker |
| /api/runs/leaderboard | Redis check before leaderboard_summary | 60s | Same reasoning; covers today / paginated combos that aren't in leaderboard_summary |
| /api/runs/shared/{hash} | Redis check before disk | 15min | Runs are immutable, but I don't want every viewed run squatting in cache forever as the collection grows. 15min absorbs share-link bursts on a hot URL and lets cold runs drop off the LRU |
| /api/runs/scores/{entity_type} | Redis check before snapshot read | 5min | Hit constantly by tier-list + detail-page sort columns |
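The "layer 0" pattern in the table can be sketched as a small read-through helper. This is an illustration under assumptions, not the endpoint code in this PR: `cached_read`, the `namespace:suffix` key format, and the in-memory `DictCache` stand-in are hypothetical; the namespaces and TTLs are the ones listed in the table.

```python
from typing import Any, Callable, Dict, Optional

# TTLs per key namespace, taken from the table above.
TTL_SECONDS: Dict[str, int] = {
    "stats": 60,
    "leaderboard": 60,
    "run": 15 * 60,
    "entity_scores": 5 * 60,
}


class DictCache:
    """In-memory stand-in so the sketch runs without Redis; ignores TTL expiry."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def get_json(self, key: str) -> Optional[Any]:
        return self.data.get(key)

    def set_json(self, key: str, value: Any, ttl_seconds: int) -> None:
        self.data[key] = value


def cached_read(cache: Any, namespace: str, suffix: str,
                load: Callable[[], Optional[Any]]) -> Optional[Any]:
    """Check the cache first; on a miss, call the existing loader and store the result."""
    key = f"{namespace}:{suffix}"  # assumed key layout
    hit = cache.get_json(key)
    if hit is not None:
        return hit
    value = load()  # existing path: stats_summary, disk, snapshot, ...
    if value is not None:
        cache.set_json(key, value, ttl_seconds=TTL_SECONDS[namespace])
    return value
```

A stats handler could then call something like `cached_read(cache, "stats", "all", load_stats_summary)`, where `load_stats_summary` stands for whatever the endpoint already does today. Because a failed `get_json` looks like a miss, the loader path is always reachable.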

Infra

  • redis:7-alpine service in docker-compose.yml (dev), docker-compose.beta.yml, docker-compose.prod.yml. allkeys-lru eviction, AOF off (rebuildable from Mongo on miss), maxmemory 512 MB in prod / 128 MB in beta+dev. Healthcheck in prod.
  • REDIS_URL env var passed into the backend in each compose, with sensible defaults that point at the bundled redis service. When unset, every cache call no-ops -- the existing data paths run unchanged.
  • redis==5.2.1 added to requirements.txt.

Operational story

  • Redis down or unreachable → all reads miss, all writes no-op. Every endpoint still works, just at pre-PR latency. No 500s.
  • Cache cold after restart / deploy → the existing Mongo materialization (stats_summary, leaderboard_summary, entity_stats_snapshot) is still warm, so first requests hit ~ms (Mongo find_one) rather than the live aggregation. Redis warms up naturally on traffic.
  • Memory pressure → allkeys-lru evicts the least recently used keys; the maxmemory cap protects the host.

What this unlocks for the next PR

Adding a new cache target is app_cache.get_json(key) / app_cache.set_json(key, val, ttl_seconds=N) -- no additional infrastructure. Obvious next wins:

  • /api/cards/*, /api/relics/*, /api/potions/* list responses (high QPS, deterministic per (entity, lang) between deploys; could go on a multi-hour TTL with a deploy-time delete_pattern).
  • /api/auth/me (one Mongo find_one per request becomes one Redis GET; cookie-keyed).
  • slowapi rate-limit storage so limits become cluster-wide and survive worker restarts (slowapi[redis] + a config line).
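For the deploy-time invalidation mentioned in the first bullet, a SCAN-based `delete_pattern` might look roughly like the sketch below. It is an assumption about shape, not the helper shipped in services/cache.py: the explicit `client` parameter, the `batch_size` default, and the integer return value are illustrative. `scan_iter` and `delete` are standard redis-py client methods.

```python
from typing import Any, List


def delete_pattern(client: Any, pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching a glob pattern using SCAN rather than KEYS.

    Returns the number of keys deleted, or 0 if the client is missing
    or Redis raises (fail-safe, like the rest of the cache).
    """
    if client is None:
        return 0
    deleted = 0
    batch: List[Any] = []
    try:
        # SCAN walks the keyspace incrementally, so it does not block
        # the server the way a single KEYS call can on a large keyspace.
        for key in client.scan_iter(match=pattern, count=batch_size):
            batch.append(key)
            if len(batch) >= batch_size:
                deleted += client.delete(*batch)
                batch = []
        if batch:
            deleted += client.delete(*batch)
    except Exception:
        pass  # report what was deleted before the failure
    return deleted
```

A deploy hook could call `delete_pattern(client, "cards:*")` after new data ships, assuming list responses are stored under a `cards:` prefix; the prefix is hypothetical here.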

Compose deploy note

When this lands, the prod compose changes (Redis service + REDIS_URL passthrough) require a docker compose -f docker-compose.prod.yml up -d on the box to bring up the new spire-codex-redis container. The backend image alone won't pick up Redis until the compose is re-applied.

ptrlrd added 2 commits May 31, 2026 23:07
130k uploads/week from 9k a week earlier means every uncached query is
amplifying Mongo load. This adds a Redis layer (per-namespace, fail-safe,
opt-in via REDIS_URL) and wires it into the four hottest read paths.

Cache substrate
- backend/app/services/cache.py: lazy redis-py client, JSON+raw helpers,
  glob-pattern invalidation via SCAN. Every operation is fail-safe --
  Redis being unavailable returns None on get / no-ops on set, so the
  caller falls back to its existing data source. The cache is always an
  optimization, never load-bearing.
- backend/app/metrics.py: Prometheus hit/miss/error counters keyed on
  the key namespace (stats / leaderboard / run / entity_scores).

Wired in
- /api/runs/stats: layer 0 ahead of stats_summary; same write-through
  shape we already use for the lazy stats materialization. 60s TTL.
- /api/runs/leaderboard: 60s TTL matching the leader-refresher cycle.
- /api/runs/shared/{hash}: 6h TTL; runs are immutable once submitted so
  this turns share-link scrapes into pure Redis reads.
- /api/runs/scores/{entity_type}: 5m TTL; tier-list / detail-page sort
  columns hit this constantly between snapshot rebuilds.

Infra
- redis service in dev / beta / prod compose. allkeys-lru eviction,
  AOF off (rebuildable from Mongo on miss), 512m cap in prod / 128m in
  beta+dev. Healthcheck on prod.
- REDIS_URL passed into the backend service in each compose. Empty
  default in dev is fine -- the client init no-ops.

Forward path
Adding a new cache target is `app_cache.get_json(key)` /
`app_cache.set_json(key, val, ttl_seconds=N)` -- no further
infrastructure work. Next obvious wins: /api/cards/* and /api/relics/*
list responses (high QPS, deterministic per (entity, lang) between
deploys), /api/auth/me (one find_one per request becomes one Redis
GET), and slowapi rate-limit storage so limits become cluster-wide.

Per follow-up on #388: runs are immutable, but caching every run that's
ever been viewed for 6h means the runs collection growth pulls Redis
memory along with it. 15min absorbs Discord share-link bursts onto a
single URL (which is the main reuse pattern), keeps username
claims/renames propagating quickly, and lets cold runs naturally drop
off the LRU instead of squatting.