obs(api): log per-processor timing in IndexChallengesJob by raymondjacobson · Pull Request #897 · AudiusProject/api

raymondjacobson · 2026-06-02T03:00:41Z

Summary

IndexChallengesJob runs ~20 challenge processors serially per tick but logged only the total cycle duration. In prod the cycle steadily takes ~150s/tick. After #896 made the play-count milestones incremental (was ~49s×3/tick → ~2s), the remaining ~150s is unattributed — and the loop gives us no way to see which processor owns it.

This adds per-processor timing so the next bottleneck is found by measurement, not guessing:

Time each runProcessor call. Log any processor over a 1s threshold with its challenge_id and duration. Sub-second processors stay quiet so the loop logs at most a few lines per tick.
Add slowest_id + slowest_duration to the existing "Reconciled challenges" cycle summary line.
Include duration on the existing processor failed error log too.

Observability only — no change to processing behavior or transaction handling.

Test plan

go build ./..., go vet ./jobs/
After deploy: confirm the "slow challenge processor" lines identify which processor(s) dominate the ~150s cycle (suspected: the trending challenge processors)

🤖 Generated with Claude Code

The challenge loop runs ~20 processors serially per tick but logged only the total cycle duration, so the loop's ~150s/tick cost in prod couldn't be attributed to a specific processor. After #896 removed the heaviest (play-count milestones), the remaining cost is unidentified. Time each processor, log any over a 1s threshold with its challenge_id, and add the slowest processor + its duration to the cycle summary line. Sub-second processors stay quiet to avoid a line-per-tick per processor. Observability only — no change to processing behavior. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Summary `FirstWeeklyComment` (challenge `c`) was the dominant cost of `IndexChallengesJob` — ~100s per 30s tick in prod (whole loop ~120-133s, surfaced by the per-processor timing added in #897). Every tick it rescanned the **entire** `comments` table (~321k rows), re-derived `DISTINCT (user_id, ISO-year, ISO-week)` across all history, and re-upserted a `user_challenge` row for every (user, week) pair. This converts it to the same checkpoint + dirty-set pattern already used by `TrackUpload` (#875) and `PlayCountMilestones` (#896): checkpoint `comments.blocknumber` (monotonic, bumped on every insert/update/delete) and recompute only the users whose comments changed since the checkpoint. The existing `EXTRACT(ISOYEAR)/EXTRACT(WEEK)` logic and `is_delete=false AND is_visible=true` filter are preserved exactly for parity; the per-user recompute just adds `AND user_id = ANY($1)`. This is the last big full-rescan challenge processor. `ProfileVerified` (`v`, ~4-17s) and `Tastemaker` (`t`, ~8-9s) are secondary and out of scope. ## Changes - `jobs/challenges/first_weekly_comment.go` — `Reconcile` now `LoadChallenge` → `reconcileIncrementalUsers(commentCheckpoint, commentDirtySQL, recompute)`, mirroring `track_upload.go`. New const `commentCheckpoint = "challenges:c:last_blocknumber"`. - `ddl/migrations/0216_comments_blocknumber_idx.sql` — `CREATE INDEX CONCURRENTLY` on `comments(blocknumber)`. Unlike the other source tables, `comments` had no blocknumber index, so the dirty scan would otherwise be a 321k-row seq scan each tick. Mirrors #897's `user_events` index migration (0209). - `ddl/migrations/0217_seed_first_weekly_comment_checkpoint.sql` — seeds `challenges:c:last_blocknumber` to `max(comments.blocknumber)` so prod starts "from now" instead of re-deriving all history. Mirrors 0215 (`ON CONFLICT DO NOTHING`). - `first_weekly_comment_test.go` — seeded comments now carry a `blocknumber`; tests don't run migrations, so the checkpoint stays 0 and the dirty scan picks up all fixtures. Existing assertions unchanged. ## Test plan - [x] `go build ./...` - [x] `go vet ./jobs/...` - [x] `go test ./jobs/challenges/` (incl. `TestFirstWeeklyComment_OneRowPerUserPerWeek`) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…top (#898) ## Summary Replaces the per-write `count(*)` recounts in `handle_repost` / `handle_save` with O(1) deltas (consistency with `handle_follow`), and adds a periodic reconciliation job as the drift backstop for all aggregate counts. ### 1. Delta-based aggregate maintenance (hot path) - `handle_repost.sql` / `handle_save.sql`: every `count(*)` recount → `<col> = <col> + delta`. - Delta is now **transition-aware**: `delta = (new active?1:0) − (UPDATE & old active?1:0)`, where `active = NOT is_delete`. Correct for upsert-in-place toggles; idempotent (delta 0) on no-op re-delivery. - All three triggers (`handle_repost`, `handle_save`, `handle_follow`) move from `AFTER INSERT` → `AFTER INSERT OR UPDATE` so the entity_manager upsert path maintains counts. `handle_follow` gets the same transition-delta for consistency. ### 2. Reconciliation backstop (off hot path) - New `jobs/reconcile_aggregates.go` (`ReconcileAggregatesJob`), modeled on `prune_plays.go`. Ports the three full-recompute queries from discovery's `update_aggregates.py` (user / track / playlist counts + `dominant_genre`). - Scheduled every **10 min** in `indexer.go` (matches discovery's celery cadence). Writes only the count columns — **column-disjoint** from the score-only `AggregatesCalculator`, so they run concurrently without collision. - Faithful 1:1 port of the SQL, with one deliberate improvement: nullable `dominant_genre` / `dominant_genre_count` comparisons use `is distinct from` instead of Python's `!=`. `!=` returns NULL on a genre flipping to/from NULL, so such a row is silently skipped until a count also changes; `is distinct from` converges it. Never writes a different value — strictly catches one edge case Python misses. ### 3. Packaged etl dependency - Bumps `go-openaudio` + `go-openaudio/pkg/etl` to `5ed068b`, which includes the now-merged **OpenAudio/go-openaudio#331** (upsert social writes in place instead of demote-then-insert). - This is a **hard requirement** for section 1: the `AFTER INSERT OR UPDATE` triggers only track `is_delete` transitions correctly when the indexer upserts the single `is_current` row in place. The dependency ordering is satisfied by this bump — #331 is merged and packaged here, so this PR is safe to merge on its own. ## Test plan - [x] `go build ./...` passes against the bumped etl. - [x] Rebased onto current `main`; dropped a stale copy of the per-processor-timing change already merged via #897 (would have double-declared `slowProcessorThreshold`). - [ ] After deploy: confirm repost/save/follow counts increment/decrement on toggle and stay correct on re-delivery. - [ ] Confirm `ReconcileAggregatesJob` logs only when it corrects drift (`corrected > 0`) and otherwise stays quiet. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

raymondjacobson force-pushed the challenge-per-processor-timing branch from 68a5ace to 4710e1c Compare June 2, 2026 03:51

raymondjacobson merged commit df00b5a into main Jun 2, 2026
5 checks passed

raymondjacobson deleted the challenge-per-processor-timing branch June 2, 2026 03:56

raymondjacobson mentioned this pull request Jun 2, 2026

perf(api): make FirstWeeklyComment challenge incremental #899

Merged

3 tasks

raymondjacobson mentioned this pull request Jun 2, 2026

perf(api): delta-based social aggregate counts + reconciliation backstop #898

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

obs(api): log per-processor timing in IndexChallengesJob#897

obs(api): log per-processor timing in IndexChallengesJob#897
raymondjacobson merged 1 commit into
mainfrom
challenge-per-processor-timing

raymondjacobson commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raymondjacobson commented Jun 2, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant