perf(api): make play-count milestones incremental, single pass by raymondjacobson · Pull Request #896 · AudiusProject/api

raymondjacobson · 2026-06-02T02:30:45Z

Summary

The p1/p2/p3 play milestones (250/1k/10k plays, verified artists only) each ran a non-incremental GROUP BY every tick: a Parallel Seq Scan of ~485k tracks producing ~1,867 rows in ~49s, times three tiers. According to pg_stat_statements, this one processor was the dominant cost of IndexChallengesJob and the reason a per-cadence "fast/medium/slow" job taxonomy looked tempting (see #889). Fixing the processor instead keeps the single serial loop simple.

Merge NewPlayCount250/1000/10000Processor into one PlayCountMilestonesProcessor that checkpoints plays.id and recomputes only the verified artists whose tracks were played since the checkpoint (the reconcileIncrementalUsers dirty-set pattern), deriving all three tiers in a single pass.
Tier eligibility gates on the live count reaching the previous tier's threshold (p2 ≥250, p3 ≥1000). A milestone completes exactly when the count crosses its step_count, so this is equivalent to Python's "previous milestone complete" check — and it removes the old cross-tick cascade (p2 no longer has to wait a tick for p1 to commit).
Migration 0215 seeds the new checkpoint to max(plays.id) so prod starts "from now" rather than backfilling the whole plays table (mirrors 0208).
IndexChallengesJob keeps its single serial loop — three registrations collapse to one.
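
The tier gating described above can be sketched in a few lines of Go. This is a minimal illustration, not the code from the PR: the `tier` and `tierState` types and the `deriveTiers` function are hypothetical names, and only the thresholds (250/1k/10k) and gates (p2 at 250, p3 at 1000) come from the description.

```go
package main

// tier describes one play-count milestone. The IDs and thresholds follow the
// PR description; the struct itself is a hypothetical stand-in.
type tier struct {
	ID        string
	StepCount int // plays needed to complete this tier
	Gate      int // live count needed before this tier is eligible
}

// Each gate equals the previous tier's StepCount (p1 has no gate).
var tiers = []tier{
	{ID: "p1", StepCount: 250, Gate: 0},
	{ID: "p2", StepCount: 1000, Gate: 250},
	{ID: "p3", StepCount: 10000, Gate: 1000},
}

// tierState is the derived state of one eligible tier for one artist.
type tierState struct {
	ID       string
	Complete bool
}

// deriveTiers returns a state for every tier that the live play count makes
// eligible. All tiers are derived from the same count in one pass, so a
// higher tier never waits for a lower tier's row to be committed first.
func deriveTiers(playCount int) []tierState {
	var out []tierState
	for _, t := range tiers {
		if playCount < t.Gate {
			break // tiers are ordered, so every later tier is gated too
		}
		out = append(out, tierState{ID: t.ID, Complete: playCount >= t.StepCount})
	}
	return out
}
```

Under these assumptions, an artist at 300 plays yields a completed p1 and an in-progress p2 from a single call, which is the behavior the old three-processor setup needed two ticks to reach.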

aggregate_monthly_plays has no monotonic column (updated in place), so plays.id is the high-water mark; the dirty scan joins through tracks/users to surface only verified, non-deactivated artists.
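
As a rough in-memory model of the checkpoint and dirty-set scan, the sketch below walks plays above a checkpoint, joins through tracks and users, and returns the distinct verified, non-deactivated artists plus the advanced checkpoint. Everything here is an assumption made for illustration: the struct fields, the `dirtyArtists` name, and the choice to apply the limit to scanned plays rather than to distinct artists. The real `reconcileIncrementalUsers` helper runs SQL and is not shown in this PR page.

```go
package main

import "sort"

// Hypothetical in-memory stand-ins for the plays, tracks, and users tables.
type play struct {
	ID      int64
	TrackID int64
}

type track struct {
	OwnerID int64
}

type user struct {
	Verified    bool
	Deactivated bool
}

// dirtyArtists scans at most limit plays with ID > checkpoint (plays must be
// sorted by ID ascending) and returns the distinct owners that are verified
// and not deactivated, together with the new high-water mark.
func dirtyArtists(plays []play, tracks map[int64]track, users map[int64]user,
	checkpoint int64, limit int) ([]int64, int64) {
	seen := map[int64]bool{}
	next := checkpoint
	scanned := 0
	for _, p := range plays {
		if p.ID <= checkpoint {
			continue // already processed in an earlier tick
		}
		if scanned == limit {
			break // bounded batch; the rest is picked up next tick
		}
		scanned++
		next = p.ID // advance the checkpoint past every scanned play
		t, ok := tracks[p.TrackID]
		if !ok {
			continue
		}
		u, ok := users[t.OwnerID]
		if !ok || !u.Verified || u.Deactivated {
			continue
		}
		seen[t.OwnerID] = true
	}
	ids := make([]int64, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids, next
}
```

The point of the design is that the per-tick work scales with the number of new plays, not with the size of the tracks table; only the returned artists need their play counts recomputed.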

Measurements (read-only `EXPLAIN ANALYZE` on the write primary)

| | before | after |
| --- | --- | --- |
| dirty scan (200k-play catch-up window, LIMIT 5000) | n/a | ~1.4s (far less per normal 30s tick) |
| recompute (50 sampled artists, larger than a typical dirty set) | n/a | ~113ms |
| per tick | 48.7s × 3 tiers, every tick | proportional to new plays |

Test plan

go build ./..., go vet ./jobs/...
go test ./jobs/challenges/ — updated play_count_milestones_test.go to seed plays rows (drive the dirty scan) and assert the merged single-pass tier gating: verified-only, p1→p2→p3 thresholds in one run
After deploy: confirm IndexChallengesJob "Reconciled challenges" duration drops and milestone user_challenges rows keep landing

🤖 Generated with Claude Code

The p1/p2/p3 play milestones (250/1k/10k plays, verified artists only) each ran a non-incremental GROUP BY every tick: a Parallel Seq Scan of ~485k tracks producing ~1,867 rows in ~49s, times three tiers. That one processor was the dominant cost of IndexChallengesJob.

Merge the three into a single PlayCountMilestonesProcessor that checkpoints plays.id and recomputes only the verified artists whose tracks were played since the checkpoint, deriving all three tiers in one pass. Tier eligibility gates on the live count reaching the previous tier's threshold, which is equivalent to Python's "previous milestone complete" check and removes the cross-tick cascade.

Against prod (read-only EXPLAIN ANALYZE): the dirty scan is ~1.4s for a 200k-play catch-up window (far less per normal tick) and the bounded recompute is ~113ms for 50 users, versus 48.7s x 3 every tick before.

Migration 0215 seeds the new checkpoint to max(plays.id) so prod starts "from now" rather than backfilling the whole plays table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The challenge loop runs ~20 processors serially per tick but logged only the total cycle duration, so the loop's ~150s/tick cost in prod couldn't be attributed to a specific processor. After #896 removed the heaviest (play-count milestones), the remaining cost is unidentified.

Time each processor, log any over a 1s threshold with its challenge_id, and add the slowest processor + its duration to the cycle summary line. Sub-second processors stay quiet to avoid a line-per-tick per processor.

Observability only: no change to processing behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Summary

`IndexChallengesJob` runs ~20 challenge processors serially per tick but logged only the **total** cycle duration. In prod the cycle steadily takes ~150s/tick. After #896 made the play-count milestones incremental (was ~49s×3/tick → ~2s), the remaining ~150s is unattributed, and the loop gives us no way to see which processor owns it.

This adds per-processor timing so the next bottleneck is found by measurement, not guessing:

- Time each `runProcessor` call. Log any processor over a **1s** threshold with its `challenge_id` and duration. Sub-second processors stay quiet so the loop logs at most a few lines per tick.
- Add `slowest_id` + `slowest_duration` to the existing "Reconciled challenges" cycle summary line.
- Include `duration` on the existing `processor failed` error log too.

Observability only: no change to processing behavior or transaction handling.

## Test plan

- [x] `go build ./...`, `go vet ./jobs/`
- [ ] After deploy: confirm the "slow challenge processor" lines identify which processor(s) dominate the ~150s cycle (suspected: the trending challenge processors)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
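
The per-processor timing described in this summary could look roughly like the following. It is a sketch under stated assumptions, not the merged code: `processor`, `cycleSummary`, `runCycle`, and `stepClock` are hypothetical names, the standard `log` package stands in for whatever logger the job actually uses, and the clock is injected only so the behavior can be checked deterministically. The log messages and field names (`challenge_id`, `duration`, `slowest_id`, `slowest_duration`) are taken from the text above.

```go
package main

import (
	"log"
	"time"
)

// slowThreshold matches the 1s threshold named in the summary.
const slowThreshold = time.Second

// processor is a hypothetical stand-in for one challenge processor.
type processor struct {
	ChallengeID string
	Run         func() error
}

// cycleSummary carries the fields added to the cycle summary line.
type cycleSummary struct {
	SlowestID       string
	SlowestDuration time.Duration
	SlowIDs         []string // processors that exceeded slowThreshold
}

// runCycle runs the processors serially and times each one with the
// injected clock. Sub-threshold processors produce no log line of their own.
func runCycle(procs []processor, now func() time.Time) cycleSummary {
	var s cycleSummary
	for _, p := range procs {
		start := now()
		err := p.Run()
		d := now().Sub(start)

		if err != nil {
			log.Printf("processor failed challenge_id=%s duration=%s err=%v", p.ChallengeID, d, err)
		}
		if d > slowThreshold {
			log.Printf("slow challenge processor challenge_id=%s duration=%s", p.ChallengeID, d)
			s.SlowIDs = append(s.SlowIDs, p.ChallengeID)
		}
		if d > s.SlowestDuration {
			s.SlowestID, s.SlowestDuration = p.ChallengeID, d
		}
	}
	log.Printf("Reconciled challenges slowest_id=%s slowest_duration=%s", s.SlowestID, s.SlowestDuration)
	return s
}

// stepClock returns a fake clock that yields the given millisecond offsets
// in order, one per call. It is only a helper for exercising runCycle.
func stepClock(offsetsMs ...int64) func() time.Time {
	base := time.Unix(0, 0)
	i := 0
	return func() time.Time {
		t := base.Add(time.Duration(offsetsMs[i]) * time.Millisecond)
		i++
		return t
	}
}
```

In production the clock argument would simply be `time.Now`. With `stepClock(0, 200, 200, 2200)` and two processors, the first takes 200ms and stays quiet, while the second takes 2s, is logged as slow, and is reported as the slowest in the summary.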

## Summary

`FirstWeeklyComment` (challenge `c`) was the dominant cost of `IndexChallengesJob`: ~100s per 30s tick in prod (whole loop ~120-133s, surfaced by the per-processor timing added in #897). Every tick it rescanned the **entire** `comments` table (~321k rows), re-derived `DISTINCT (user_id, ISO-year, ISO-week)` across all history, and re-upserted a `user_challenge` row for every (user, week) pair.

This converts it to the same checkpoint + dirty-set pattern already used by `TrackUpload` (#875) and `PlayCountMilestones` (#896): checkpoint `comments.blocknumber` (monotonic, bumped on every insert/update/delete) and recompute only the users whose comments changed since the checkpoint. The existing `EXTRACT(ISOYEAR)/EXTRACT(WEEK)` logic and `is_delete=false AND is_visible=true` filter are preserved exactly for parity; the per-user recompute just adds `AND user_id = ANY($1)`.

This is the last big full-rescan challenge processor. `ProfileVerified` (`v`, ~4-17s) and `Tastemaker` (`t`, ~8-9s) are secondary and out of scope.

## Changes

- `jobs/challenges/first_weekly_comment.go`: `Reconcile` now `LoadChallenge` → `reconcileIncrementalUsers(commentCheckpoint, commentDirtySQL, recompute)`, mirroring `track_upload.go`. New const `commentCheckpoint = "challenges:c:last_blocknumber"`.
- `ddl/migrations/0216_comments_blocknumber_idx.sql`: `CREATE INDEX CONCURRENTLY` on `comments(blocknumber)`. Unlike the other source tables, `comments` had no blocknumber index, so the dirty scan would otherwise be a 321k-row seq scan each tick. Mirrors #897's `user_events` index migration (0209).
- `ddl/migrations/0217_seed_first_weekly_comment_checkpoint.sql`: seeds `challenges:c:last_blocknumber` to `max(comments.blocknumber)` so prod starts "from now" instead of re-deriving all history. Mirrors 0215 (`ON CONFLICT DO NOTHING`).
- `first_weekly_comment_test.go`: seeded comments now carry a `blocknumber`; tests don't run migrations, so the checkpoint stays 0 and the dirty scan picks up all fixtures. Existing assertions unchanged.

## Test plan

- [x] `go build ./...`
- [x] `go vet ./jobs/...`
- [x] `go test ./jobs/challenges/` (incl. `TestFirstWeeklyComment_OneRowPerUserPerWeek`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
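
To make the per-user recompute concrete, here is a small Go model of the `DISTINCT (user_id, ISO-year, ISO-week)` derivation restricted to a dirty set. It is a sketch, not the processor's SQL: the `comment` and `userWeek` types and the `weeklyKeys` and `day` helpers are hypothetical, and it assumes comment timestamps are bucketed in UTC. Go's `time.Time.ISOWeek` follows ISO 8601, the same week numbering that `EXTRACT(ISOYEAR)` and `EXTRACT(WEEK)` use in PostgreSQL.

```go
package main

import "time"

// comment is a hypothetical in-memory stand-in for a comments row.
type comment struct {
	UserID    int64
	CreatedAt time.Time
	IsDelete  bool
	IsVisible bool
}

// userWeek identifies one (user, ISO year, ISO week) pair, i.e. one
// user_challenge row for the weekly comment challenge.
type userWeek struct {
	UserID  int64
	ISOYear int
	ISOWeek int
}

// weeklyKeys returns the distinct (user, ISO year, ISO week) pairs for live
// comments, restricted to the users in the dirty set. The dirty filter plays
// the role of the added "AND user_id = ANY($1)" clause.
func weeklyKeys(comments []comment, dirty map[int64]bool) map[userWeek]bool {
	out := map[userWeek]bool{}
	for _, c := range comments {
		if c.IsDelete || !c.IsVisible || !dirty[c.UserID] {
			continue
		}
		y, w := c.CreatedAt.UTC().ISOWeek()
		out[userWeek{UserID: c.UserID, ISOYear: y, ISOWeek: w}] = true
	}
	return out
}

// day builds a UTC timestamp at noon on the given date (helper for examples).
func day(y, m, d int) time.Time {
	return time.Date(y, time.Month(m), d, 12, 0, 0, 0, time.UTC)
}
```

Pairing the ISO year with the ISO week matters at year boundaries: 2024-12-30 and 2025-01-01 both fall in ISO week 1 of ISO year 2025, so two comments on those days count as one week for the same user.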

raymondjacobson merged commit 9c911ed into main Jun 2, 2026
5 checks passed

raymondjacobson deleted the incremental-play-count-milestones branch June 2, 2026 02:36

raymondjacobson mentioned this pull request Jun 2, 2026

obs(api): log per-processor timing in IndexChallengesJob #897

Merged

2 tasks

raymondjacobson mentioned this pull request Jun 2, 2026

perf(api): make FirstWeeklyComment challenge incremental #899

Merged

3 tasks

raymondjacobson merged 1 commit into main from incremental-play-count-milestones
