perf(api): stop TrendingJob from continuously rewriting score tables by raymondjacobson · Pull Request #890 · AudiusProject/api

raymondjacobson · 2026-06-01T20:34:47Z

Summary

TrendingJob was the dominant source of IO starvation on the indexer. It ran on a 10s schedule, but each run took minutes and did two expensive things on every cycle:

Two non-concurrent REFRESH MATERIALIZED VIEW calls (aggregate_interval_plays, trending_params). A plain refresh takes an ACCESS EXCLUSIVE lock for the duration of the (multi-minute, plays-table-scanning) rebuild — blocking every reader, including the block-indexing loop's downstream queries.
A full DELETE-then-INSERT of the ~16GB track_trending_scores table (and the analogous playlist table), rewriting ~12.7M rows × 6 indexes per cycle.

On a 10s schedule with multi-minute runtimes, this ran effectively continuously and IO-starved block indexing.

Changes

Concurrent matview refresh. Use REFRESH MATERIALIZED VIEW CONCURRENTLY, with a graceful fallback to a blocking refresh when the matview is unpopulated (fresh/test DB → SQLSTATE 0A000) or the required unique index isn't present yet (pre-migration window of a rolling deploy → SQLSTATE 55000). This avoids a hard ordering dependency on the migration.
Migration 0214 promotes the track_id indexes on aggregate_interval_plays and trending_params to UNIQUE (a prerequisite for CONCURRENTLY). Verified in prod that both matviews are exactly one row per track (COUNT(*) == COUNT(DISTINCT track_id)), so the unique constraint is valid. Statements are intentionally not wrapped in a transaction so each index build autocommits.
Upsert + prune instead of delete + insert. Scores are staged into an ON COMMIT DROP temp table, then merged with INSERT ... ON CONFLICT (...) DO UPDATE ... WHERE col IS DISTINCT FROM EXCLUDED.col (skip-unchanged), followed by an anti-join DELETE ... WHERE NOT EXISTS to prune rows that left the source set. This stops the per-cycle full-table churn that bloated the table and its indexes.
Hourly cadence. Schedule TrendingJob every 1h instead of 10s, matching discovery's effective cadence: apps' celery beat ticked index_trending every 10s but an internal gate (trending_refresh_seconds, default 3600) only recomputed once an hour. The vendored ETL port had dropped that gate.

Notes

sql/01_schema.sql (the pg_dump used to build the test template) will pick up the unique indexes on the next make test-schema regen. Until then, the code's CONCURRENTLY→blocking fallback keeps tests green.
Overlaps with perf(indexer): size jobs pool, per-cadence challenges, throttle aggregates #889 on indexer/indexer.go — both edit startParityJobs. Minor merge conflict expected; whoever merges second rebases. perf(indexer): size jobs pool, per-cadence challenges, throttle aggregates #889 leaves TrendingJob untouched, so the two changes are complementary.

Test plan

go test ./jobs/... — existing TestTrendingJob_PopulatesScores passes; new TestTrendingJob_PrunesStaleRows covers the anti-join prune (seed track, run, delete track, re-run, assert rows pruned).
go vet ./jobs/... clean; touched files gofmt-clean.
Watch indexer IO / block lag after deploy to confirm starvation is resolved.

🤖 Generated with Claude Code

TrendingJob ran on a 10s schedule but each run took minutes: it did two non-concurrent REFRESH MATERIALIZED VIEW (AccessExclusiveLock, blocking all readers including the block-indexing loop) and then a full DELETE+INSERT of the ~16GB track_trending_scores table plus 6 indexes. Effectively continuous, it IO-starved the indexer. Changes: - Refresh matviews with REFRESH MATERIALIZED VIEW CONCURRENTLY, falling back to a blocking refresh if the matview is unpopulated (fresh/test DB) or the unique index isn't present yet (pre-migration rolling deploy). - Migration 0214 promotes the track_id indexes on aggregate_interval_plays and trending_params to UNIQUE so CONCURRENTLY is possible (both matviews are one row per track). - Rewrite the score recompute as upsert-into-staging-temp-table + ON CONFLICT DO UPDATE (skip-unchanged) + anti-join prune, instead of DELETE-then-INSERT of every row each cycle. - Schedule TrendingJob hourly to match discovery's effective cadence (trending_refresh_seconds=3600); the vendored port had dropped that gate. sql/01_schema.sql picks up the unique indexes on the next `make test-schema` regen; the code falls back safely until then. Adds a prune-coverage test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Condense the package doc, refreshMatview doc, migration 0214 header, and the indexer cadence comment to the essential rationale. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Summary - #890 swapped the trending score-table rewrite for a temp-table stage + `INSERT ... ON CONFLICT DO UPDATE` (skip-unchanged) + anti-join prune, to avoid rewriting the ~16GB `*_trending_scores` tables every run. - In prod that upsert does one unique-index probe + `IS DISTINCT FROM` comparison **per row over ~4M rows**. It took **40+ minutes** and never finished inside the hourly window — so trending scores went stale (observed 12–18h old) and `isRunning` blocked every subsequent tick. - This reverts `computeTrendingTracks` / `computeTrendingPlaylists` to the strategy discovery's Python ran for years: within the single transaction already opened, `DELETE` the `(type, version)` slice then repopulate with plain bulk `INSERT`s. ## Why this is correct - **No empty-table window:** the `DELETE` and `INSERT`s are in one transaction, so the `DELETE` isn't visible to readers until `COMMIT` — exactly how Python's `update_track_score_query` behaved. - **Scores unchanged:** the score expressions are byte-for-byte identical to before (and to apps' `pnagD`/`AnlGe` strategies); only bind-parameter numbering changed because `type`/`version`/`created_at` moved back into the `SELECT`. - **Concurrent matview refresh from #890 is kept** — that part was a genuine improvement over Python (which refreshed non-concurrently). ## Tradeoff The bulk rewrite produces more dead tuples per run than the skip-unchanged upsert. Autovacuum handled this under Python for years. The alternative — a job that never completes — is strictly worse. ## Evidence - Live prod query showed the upsert `INSERT INTO track_trending_scores` active for **2650s (44 min)** on `IO/DataFileRead`, not blocked, still on step 1 of 4. - `track_trending_scores` max `created_at` frozen 12–18h while the pod had been up >1h and the matview refreshes succeeded. ## Test plan - [x] `go build ./jobs/` clean - [x] `gofmt` clean - [x] `go test ./jobs/ -run Trending` passes, incl. `TestTrendingJob_PrunesStaleRows` (deleted tracks still drop out, now via the full DELETE) - [ ] After deploy: confirm `max(created_at)` in `track_trending_scores` advances each cycle and a full cycle completes in seconds 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

raymondjacobson and others added 2 commits June 1, 2026 13:34

docs(trending): trim verbose comments

f1ff626

Condense the package doc, refreshMatview doc, migration 0214 header, and the indexer cadence comment to the essential rationale. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

raymondjacobson mentioned this pull request Jun 1, 2026

perf(indexer): size jobs pool, per-cadence challenges, throttle aggregates #889

Open

5 tasks

raymondjacobson merged commit b0bba05 into main Jun 1, 2026
5 checks passed

raymondjacobson deleted the api/p1-trending-perf branch June 1, 2026 22:13

raymondjacobson mentioned this pull request Jun 2, 2026

perf(api): revert TrendingJob to bulk DELETE+INSERT score rewrite #894

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(api): stop TrendingJob from continuously rewriting score tables#890

perf(api): stop TrendingJob from continuously rewriting score tables#890
raymondjacobson merged 2 commits into
mainfrom
api/p1-trending-perf

raymondjacobson commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raymondjacobson commented Jun 1, 2026

Summary

Changes

Notes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant