Smoother completion: Earlier backpressure + possible early completion + fewer warnings#1292
Smoother completion: Earlier backpressure + possible early completion + fewer warnings#1292brandur wants to merge 1 commit into
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 168f657fb4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| c.setStateParamsMu.Unlock() | ||
|
|
||
| <-waitChan |
There was a problem hiding this comment.
Release backlog waiters when a batch fails
When backlog pressure is active and JobSetStateIfRunningMany fails after retries, handleBatch returns before the only close(c.waitOnBacklogChan) path. This newly added wait means the goroutine that enabled pressure has not enqueued its completion yet, so a transient DB error at the 10k threshold leaves it and any later callers blocked on the same channel indefinitely. Please either release the wait channel on error/requeue the batch, or let the initiating completion fall through as before.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed and w/ regression test.
168f657 to
d177425
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d177425483
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| jobRows, err = completeSubBatch(params) | ||
| if err != nil { | ||
| c.requeueBatch(ctx, setStateBatch, setStateStartTimes) | ||
| return err |
There was a problem hiding this comment.
Avoid requeueing non-retryable completion errors
When JobSetStateIfRunningMany returns context.Canceled or riverdriver.ErrClosedPool while the batch completer is still running, withRetries returns immediately without any backoff because those errors are considered non-retryable. Requeueing the same batch here puts it straight back into setStateParams; for batches at or above the run loop's batchCompleterStartThreshold (100 jobs), Start immediately calls handleBatch again, creating a tight error loop until the service is stopped instead of giving up or backing off.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
handleBatch will now distinguish between retryable and non-retryable (context.Canceled, riverdriver.ErrClosedPool) errors and give up on the batch in the case of the latter.
d177425 to
fb7ef2c
Compare
|
@codex review |
fb7ef2c to
4acc752
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. 👍 Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
… + fewer warnings
This one's largely aimed at trying to resolve a common problem that we
see during benchmarks, in which the `BatchCompleter` becomes completely
saturated and starts producing noisy warnings saying that it'll produce
backpressure while waiting for completions:
$ go run ./cmd/river bench --database-url $DATABASE_URL --num-total-jobs 300_000
bench: jobs worked [ 0 ], inserted [ 300000 ], job/sec [ 0.0 ] [0s]
Jun 20 07:33:22.258 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
Jun 20 07:33:23.464 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [ 74578 ], inserted [ 0 ], job/sec [ 37289.0 ] [2s]
Jun 20 07:33:24.480 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [ 87308 ], inserted [ 0 ], job/sec [ 43654.0 ] [2s]
Jun 20 07:33:25.873 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
Jun 20 07:33:27.060 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [ 113346 ], inserted [ 0 ], job/sec [ 56673.0 ] [2s]
bench: jobs worked [ 24768 ], inserted [ 0 ], job/sec [ 12384.0 ] [2s]
bench: total jobs worked [ 300000 ], total jobs inserted [ 300000 ], overall job/sec [ 46803.8 ], running 6.409735541s
Here, we modify our approach so that instead of waiting until we're at a
backlog of 20k jobs in the completer, we start to apply backpressure at
10k jobs, and reserve the warnings for if we get to the 20k level, which
will be much more difficult to get to. This has the effect of producing
much fewer warnings as we're running a benchmark:
$ go run ./cmd/river bench --database-url $DATABASE_URL --num-total-jobs 300_000
bench: jobs worked [ 0 ], inserted [ 300000 ], job/sec [ 0.0 ] [0s]
bench: jobs worked [ 79849 ], inserted [ 0 ], job/sec [ 39924.5 ] [2s]
bench: jobs worked [ 96590 ], inserted [ 0 ], job/sec [ 48295.0 ] [2s]
bench: jobs worked [ 105726 ], inserted [ 0 ], job/sec [ 52863.0 ] [2s]
bench: jobs worked [ 17835 ], inserted [ 0 ], job/sec [ 8917.5 ] [2s]
bench: total jobs worked [ 300000 ], total jobs inserted [ 300000 ], overall job/sec [ 47026.1 ], running 6.379439334s
In my tests, throughput is roughly the same. Before at 20k the completer
was the bottleneck, and it's still the bottleneck. We do put in one
minor optimization in which `BatchCompleter` will start completing as
soon as it's got a full batch rather than having to wait for a 50 ms
tick. This doesn't have a super significant effect on numbers though.
An argument against this change is that to some degree, we'd be sweeping
the slow completer under the rug. I'm receptive to that, but a couple
things:
* It looks bad when the benchmark process regularly produces warnings as
it runs.
* I think we can do better than warnings by putting in a more
comprehensive stats layer that tracks specific durations for each
phase of a job's life cycle (time to lock, time to work, time waiting
to complete, time to complete, etc.). This seems like a much better
alternative to the status quo.
4acc752 to
2bb559c
Compare
This one's largely aimed at trying to resolve a common problem that we
see during benchmarks, in which the
BatchCompleterbecomes completelysaturated and starts producing noisy warnings saying that it'll produce
backpressure while waiting for completions:
Here, we modify our approach so that instead of waiting until we're at a
backlog of 20k jobs in the completer, we start to apply backpressure at
10k jobs, and reserve the warnings for if we get to the 20k level, which
will be much more difficult to get to. This has the effect of producing
much fewer warnings as we're running a benchmark:
In my tests, throughput is roughly the same. Before at 20k the completer
was the bottleneck, and it's still the bottleneck. We do put in one
minor optimization in which
BatchCompleterwill start completing assoon as it's got a full batch rather than having to wait for a 50 ms
tick. This doesn't have a super significant effect on numbers though.
An argument against this change is that to some degree, we'd be sweeping
the slow completer under the rug. I'm receptive to that, but a couple
things:
It looks bad when the benchmark process regularly produces warnings as
it runs.
I think we can do better than warnings by putting in a more
comprehensive stats layer that tracks specific durations for each
phase of a job's life cycle (time to lock, time to work, time waiting
to complete, time to complete, etc.). This seems like a much better
alternative to the status quo.