Skip to content

obs(api): log per-processor timing in IndexChallengesJob#897

Merged
raymondjacobson merged 1 commit into
mainfrom
challenge-per-processor-timing
Jun 2, 2026
Merged

obs(api): log per-processor timing in IndexChallengesJob#897
raymondjacobson merged 1 commit into
mainfrom
challenge-per-processor-timing

Conversation

@raymondjacobson
Copy link
Copy Markdown
Member

Summary

IndexChallengesJob runs ~20 challenge processors serially per tick but logged only the total cycle duration. In prod the cycle steadily takes ~150s/tick. After #896 made the play-count milestones incremental (was ~49s×3/tick → ~2s), the remaining ~150s is unattributed — and the loop gives us no way to see which processor owns it.

This adds per-processor timing so the next bottleneck is found by measurement, not guessing:

  • Time each runProcessor call. Log any processor over a 1s threshold with its challenge_id and duration. Sub-second processors stay quiet so the loop logs at most a few lines per tick.
  • Add slowest_id + slowest_duration to the existing "Reconciled challenges" cycle summary line.
  • Include duration on the existing processor failed error log too.

Observability only — no change to processing behavior or transaction handling.

Test plan

  • go build ./..., go vet ./jobs/
  • After deploy: confirm the "slow challenge processor" lines identify which processor(s) dominate the ~150s cycle (suspected: the trending challenge processors)

🤖 Generated with Claude Code

The challenge loop runs ~20 processors serially per tick but logged only
the total cycle duration, so the loop's ~150s/tick cost in prod couldn't
be attributed to a specific processor. After #896 removed the heaviest
(play-count milestones), the remaining cost is unidentified.

Time each processor, log any over a 1s threshold with its challenge_id,
and add the slowest processor + its duration to the cycle summary line.
Sub-second processors stay quiet to avoid a line-per-tick per processor.
Observability only — no change to processing behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@raymondjacobson raymondjacobson force-pushed the challenge-per-processor-timing branch from 68a5ace to 4710e1c Compare June 2, 2026 03:51
@raymondjacobson raymondjacobson merged commit df00b5a into main Jun 2, 2026
5 checks passed
@raymondjacobson raymondjacobson deleted the challenge-per-processor-timing branch June 2, 2026 03:56
raymondjacobson added a commit that referenced this pull request Jun 2, 2026
## Summary

`FirstWeeklyComment` (challenge `c`) was the dominant cost of
`IndexChallengesJob` — ~100s per 30s tick in prod (whole loop ~120-133s,
surfaced by the per-processor timing added in #897). Every tick it
rescanned the **entire** `comments` table (~321k rows), re-derived
`DISTINCT (user_id, ISO-year, ISO-week)` across all history, and
re-upserted a `user_challenge` row for every (user, week) pair.

This converts it to the same checkpoint + dirty-set pattern already used
by `TrackUpload` (#875) and `PlayCountMilestones` (#896): checkpoint
`comments.blocknumber` (monotonic, bumped on every insert/update/delete)
and recompute only the users whose comments changed since the
checkpoint. The existing `EXTRACT(ISOYEAR)/EXTRACT(WEEK)` logic and
`is_delete=false AND is_visible=true` filter are preserved exactly for
parity; the per-user recompute just adds `AND user_id = ANY($1)`.

This is the last big full-rescan challenge processor. `ProfileVerified`
(`v`, ~4-17s) and `Tastemaker` (`t`, ~8-9s) are secondary and out of
scope.

## Changes

- `jobs/challenges/first_weekly_comment.go` — `Reconcile` now
`LoadChallenge` → `reconcileIncrementalUsers(commentCheckpoint,
commentDirtySQL, recompute)`, mirroring `track_upload.go`. New const
`commentCheckpoint = "challenges:c:last_blocknumber"`.
- `ddl/migrations/0216_comments_blocknumber_idx.sql` — `CREATE INDEX
CONCURRENTLY` on `comments(blocknumber)`. Unlike the other source
tables, `comments` had no blocknumber index, so the dirty scan would
otherwise be a 321k-row seq scan each tick. Mirrors #897's `user_events`
index migration (0209).
- `ddl/migrations/0217_seed_first_weekly_comment_checkpoint.sql` — seeds
`challenges:c:last_blocknumber` to `max(comments.blocknumber)` so prod
starts "from now" instead of re-deriving all history. Mirrors 0215 (`ON
CONFLICT DO NOTHING`).
- `first_weekly_comment_test.go` — seeded comments now carry a
`blocknumber`; tests don't run migrations, so the checkpoint stays 0 and
the dirty scan picks up all fixtures. Existing assertions unchanged.

## Test plan

- [x] `go build ./...`
- [x] `go vet ./jobs/...`
- [x] `go test ./jobs/challenges/` (incl.
`TestFirstWeeklyComment_OneRowPerUserPerWeek`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
raymondjacobson added a commit that referenced this pull request Jun 2, 2026
…top (#898)

## Summary

Replaces the per-write `count(*)` recounts in `handle_repost` /
`handle_save` with O(1) deltas (consistency with `handle_follow`), and
adds a periodic reconciliation job as the drift backstop for all
aggregate counts.

### 1. Delta-based aggregate maintenance (hot path)
- `handle_repost.sql` / `handle_save.sql`: every `count(*)` recount →
`<col> = <col> + delta`.
- Delta is now **transition-aware**: `delta = (new active?1:0) − (UPDATE
& old active?1:0)`, where `active = NOT is_delete`. Correct for
upsert-in-place toggles; idempotent (delta 0) on no-op re-delivery.
- All three triggers (`handle_repost`, `handle_save`, `handle_follow`)
move from `AFTER INSERT` → `AFTER INSERT OR UPDATE` so the
entity_manager upsert path maintains counts. `handle_follow` gets the
same transition-delta for consistency.

### 2. Reconciliation backstop (off hot path)
- New `jobs/reconcile_aggregates.go` (`ReconcileAggregatesJob`), modeled
on `prune_plays.go`. Ports the three full-recompute queries from
discovery's `update_aggregates.py` (user / track / playlist counts +
`dominant_genre`).
- Scheduled every **10 min** in `indexer.go` (matches discovery's celery
cadence). Writes only the count columns — **column-disjoint** from the
score-only `AggregatesCalculator`, so they run concurrently without
collision.
- Faithful 1:1 port of the SQL, with one deliberate improvement:
nullable `dominant_genre` / `dominant_genre_count` comparisons use `is
distinct from` instead of Python's `!=`. `!=` returns NULL on a genre
flipping to/from NULL, so such a row is silently skipped until a count
also changes; `is distinct from` converges it. Never writes a different
value — strictly catches one edge case Python misses.

### 3. Packaged etl dependency
- Bumps `go-openaudio` + `go-openaudio/pkg/etl` to `5ed068b`, which
includes the now-merged **OpenAudio/go-openaudio#331** (upsert social
writes in place instead of demote-then-insert).
- This is a **hard requirement** for section 1: the `AFTER INSERT OR
UPDATE` triggers only track `is_delete` transitions correctly when the
indexer upserts the single `is_current` row in place. The dependency
ordering is satisfied by this bump — #331 is merged and packaged here,
so this PR is safe to merge on its own.

## Test plan
- [x] `go build ./...` passes against the bumped etl.
- [x] Rebased onto current `main`; dropped a stale copy of the
per-processor-timing change already merged via #897 (would have
double-declared `slowProcessorThreshold`).
- [ ] After deploy: confirm repost/save/follow counts
increment/decrement on toggle and stay correct on re-delivery.
- [ ] Confirm `ReconcileAggregatesJob` logs only when it corrects drift
(`corrected > 0`) and otherwise stays quiet.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant