obs(api): log per-processor timing in IndexChallengesJob#897
Merged
Conversation
The challenge loop runs ~20 processors serially per tick but logged only the total cycle duration, so the loop's ~150s/tick cost in prod couldn't be attributed to a specific processor. After #896 removed the heaviest (play-count milestones), the remaining cost is unidentified. Time each processor, log any over a 1s threshold with its challenge_id, and add the slowest processor + its duration to the cycle summary line. Sub-second processors stay quiet to avoid a line-per-tick per processor. Observability only — no change to processing behavior. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
68a5ace to
4710e1c
Compare
3 tasks
raymondjacobson
added a commit
that referenced
this pull request
Jun 2, 2026
## Summary `FirstWeeklyComment` (challenge `c`) was the dominant cost of `IndexChallengesJob` — ~100s per 30s tick in prod (whole loop ~120-133s, surfaced by the per-processor timing added in #897). Every tick it rescanned the **entire** `comments` table (~321k rows), re-derived `DISTINCT (user_id, ISO-year, ISO-week)` across all history, and re-upserted a `user_challenge` row for every (user, week) pair. This converts it to the same checkpoint + dirty-set pattern already used by `TrackUpload` (#875) and `PlayCountMilestones` (#896): checkpoint `comments.blocknumber` (monotonic, bumped on every insert/update/delete) and recompute only the users whose comments changed since the checkpoint. The existing `EXTRACT(ISOYEAR)/EXTRACT(WEEK)` logic and `is_delete=false AND is_visible=true` filter are preserved exactly for parity; the per-user recompute just adds `AND user_id = ANY($1)`. This is the last big full-rescan challenge processor. `ProfileVerified` (`v`, ~4-17s) and `Tastemaker` (`t`, ~8-9s) are secondary and out of scope. ## Changes - `jobs/challenges/first_weekly_comment.go` — `Reconcile` now `LoadChallenge` → `reconcileIncrementalUsers(commentCheckpoint, commentDirtySQL, recompute)`, mirroring `track_upload.go`. New const `commentCheckpoint = "challenges:c:last_blocknumber"`. - `ddl/migrations/0216_comments_blocknumber_idx.sql` — `CREATE INDEX CONCURRENTLY` on `comments(blocknumber)`. Unlike the other source tables, `comments` had no blocknumber index, so the dirty scan would otherwise be a 321k-row seq scan each tick. Mirrors #897's `user_events` index migration (0209). - `ddl/migrations/0217_seed_first_weekly_comment_checkpoint.sql` — seeds `challenges:c:last_blocknumber` to `max(comments.blocknumber)` so prod starts "from now" instead of re-deriving all history. Mirrors 0215 (`ON CONFLICT DO NOTHING`). - `first_weekly_comment_test.go` — seeded comments now carry a `blocknumber`; tests don't run migrations, so the checkpoint stays 0 and the dirty scan picks up all fixtures. Existing assertions unchanged. ## Test plan - [x] `go build ./...` - [x] `go vet ./jobs/...` - [x] `go test ./jobs/challenges/` (incl. `TestFirstWeeklyComment_OneRowPerUserPerWeek`) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
raymondjacobson
added a commit
that referenced
this pull request
Jun 2, 2026
…top (#898) ## Summary Replaces the per-write `count(*)` recounts in `handle_repost` / `handle_save` with O(1) deltas (consistency with `handle_follow`), and adds a periodic reconciliation job as the drift backstop for all aggregate counts. ### 1. Delta-based aggregate maintenance (hot path) - `handle_repost.sql` / `handle_save.sql`: every `count(*)` recount → `<col> = <col> + delta`. - Delta is now **transition-aware**: `delta = (new active?1:0) − (UPDATE & old active?1:0)`, where `active = NOT is_delete`. Correct for upsert-in-place toggles; idempotent (delta 0) on no-op re-delivery. - All three triggers (`handle_repost`, `handle_save`, `handle_follow`) move from `AFTER INSERT` → `AFTER INSERT OR UPDATE` so the entity_manager upsert path maintains counts. `handle_follow` gets the same transition-delta for consistency. ### 2. Reconciliation backstop (off hot path) - New `jobs/reconcile_aggregates.go` (`ReconcileAggregatesJob`), modeled on `prune_plays.go`. Ports the three full-recompute queries from discovery's `update_aggregates.py` (user / track / playlist counts + `dominant_genre`). - Scheduled every **10 min** in `indexer.go` (matches discovery's celery cadence). Writes only the count columns — **column-disjoint** from the score-only `AggregatesCalculator`, so they run concurrently without collision. - Faithful 1:1 port of the SQL, with one deliberate improvement: nullable `dominant_genre` / `dominant_genre_count` comparisons use `is distinct from` instead of Python's `!=`. `!=` returns NULL on a genre flipping to/from NULL, so such a row is silently skipped until a count also changes; `is distinct from` converges it. Never writes a different value — strictly catches one edge case Python misses. ### 3. Packaged etl dependency - Bumps `go-openaudio` + `go-openaudio/pkg/etl` to `5ed068b`, which includes the now-merged **OpenAudio/go-openaudio#331** (upsert social writes in place instead of demote-then-insert). - This is a **hard requirement** for section 1: the `AFTER INSERT OR UPDATE` triggers only track `is_delete` transitions correctly when the indexer upserts the single `is_current` row in place. The dependency ordering is satisfied by this bump — #331 is merged and packaged here, so this PR is safe to merge on its own. ## Test plan - [x] `go build ./...` passes against the bumped etl. - [x] Rebased onto current `main`; dropped a stale copy of the per-processor-timing change already merged via #897 (would have double-declared `slowProcessorThreshold`). - [ ] After deploy: confirm repost/save/follow counts increment/decrement on toggle and stay correct on re-delivery. - [ ] Confirm `ReconcileAggregatesJob` logs only when it corrects drift (`corrected > 0`) and otherwise stays quiet. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
IndexChallengesJobruns ~20 challenge processors serially per tick but logged only the total cycle duration. In prod the cycle steadily takes ~150s/tick. After #896 made the play-count milestones incremental (was ~49s×3/tick → ~2s), the remaining ~150s is unattributed — and the loop gives us no way to see which processor owns it.This adds per-processor timing so the next bottleneck is found by measurement, not guessing:
runProcessorcall. Log any processor over a 1s threshold with itschallenge_idand duration. Sub-second processors stay quiet so the loop logs at most a few lines per tick.slowest_id+slowest_durationto the existing "Reconciled challenges" cycle summary line.durationon the existingprocessor failederror log too.Observability only — no change to processing behavior or transaction handling.
Test plan
go build ./...,go vet ./jobs/🤖 Generated with Claude Code