Skip to content

Smoother completion: Earlier backpressure + possible early completion + fewer warnings#1292

Open
brandur wants to merge 1 commit into
masterfrom
brandur-smoother-completion
Open

Smoother completion: Earlier backpressure + possible early completion + fewer warnings#1292
brandur wants to merge 1 commit into
masterfrom
brandur-smoother-completion

Conversation

@brandur

@brandur brandur commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

This one's largely aimed at trying to resolve a common problem that we
see during benchmarks, in which the BatchCompleter becomes completely
saturated and starts producing noisy warnings saying that it'll produce
backpressure while waiting for completions:

$ go run ./cmd/river bench --database-url $DATABASE_URL --num-total-jobs 300_000
bench: jobs worked [          0 ], inserted [     300000 ], job/sec [        0.0 ] [0s]
Jun 20 07:33:22.258 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
Jun 20 07:33:23.464 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [      74578 ], inserted [          0 ], job/sec [    37289.0 ] [2s]
Jun 20 07:33:24.480 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [      87308 ], inserted [          0 ], job/sec [    43654.0 ] [2s]
Jun 20 07:33:25.873 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
Jun 20 07:33:27.060 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     113346 ], inserted [          0 ], job/sec [    56673.0 ] [2s]
bench: jobs worked [      24768 ], inserted [          0 ], job/sec [    12384.0 ] [2s]
bench: total jobs worked [     300000 ], total jobs inserted [     300000 ], overall job/sec [    46803.8 ], running 6.409735541s

Here, we modify our approach so that instead of waiting until we're at a
backlog of 20k jobs in the completer, we start to apply backpressure at
10k jobs, and reserve the warnings for if we get to the 20k level, which
will be much more difficult to get to. This has the effect of producing
much fewer warnings as we're running a benchmark:

$ go run ./cmd/river bench --database-url $DATABASE_URL --num-total-jobs 300_000
bench: jobs worked [          0 ], inserted [     300000 ], job/sec [        0.0 ] [0s]
bench: jobs worked [      79849 ], inserted [          0 ], job/sec [    39924.5 ] [2s]
bench: jobs worked [      96590 ], inserted [          0 ], job/sec [    48295.0 ] [2s]
bench: jobs worked [     105726 ], inserted [          0 ], job/sec [    52863.0 ] [2s]
bench: jobs worked [      17835 ], inserted [          0 ], job/sec [     8917.5 ] [2s]
bench: total jobs worked [     300000 ], total jobs inserted [     300000 ], overall job/sec [    47026.1 ], running 6.379439334s

In my tests, throughput is roughly the same. Before at 20k the completer
was the bottleneck, and it's still the bottleneck. We do put in one
minor optimization in which BatchCompleter will start completing as
soon as it's got a full batch rather than having to wait for a 50 ms
tick. This doesn't have a super significant effect on numbers though.

An argument against this change is that to some degree, we'd be sweeping
the slow completer under the rug. I'm receptive to that, but a couple
things:

  • It looks bad when the benchmark process regularly produces warnings as
    it runs.

  • I think we can do better than warnings by putting in a more
    comprehensive stats layer that tracks specific durations for each
    phase of a job's life cycle (time to lock, time to work, time waiting
    to complete, time to complete, etc.). This seems like a much better
    alternative to the status quo.

@brandur

brandur commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 168f657fb4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/jobcompleter/job_completer.go Outdated
}
c.setStateParamsMu.Unlock()

<-waitChan

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Release backlog waiters when a batch fails

When backlog pressure is active and JobSetStateIfRunningMany fails after retries, handleBatch returns before the only close(c.waitOnBacklogChan) path. This newly added wait means the goroutine that enabled pressure has not enqueued its completion yet, so a transient DB error at the 10k threshold leaves it and any later callers blocked on the same channel indefinitely. Please either release the wait channel on error/requeue the batch, or let the initiating completion fall through as before.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and w/ regression test.

@brandur brandur force-pushed the brandur-smoother-completion branch from 168f657 to d177425 Compare June 20, 2026 16:07
@brandur

brandur commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d177425483

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/jobcompleter/job_completer.go Outdated
Comment on lines 481 to 484
jobRows, err = completeSubBatch(params)
if err != nil {
c.requeueBatch(ctx, setStateBatch, setStateStartTimes)
return err

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid requeueing non-retryable completion errors

When JobSetStateIfRunningMany returns context.Canceled or riverdriver.ErrClosedPool while the batch completer is still running, withRetries returns immediately without any backoff because those errors are considered non-retryable. Requeueing the same batch here puts it straight back into setStateParams; for batches at or above the run loop's batchCompleterStartThreshold (100 jobs), Start immediately calls handleBatch again, creating a tight error loop until the service is stopped instead of giving up or backing off.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleBatch will now distinguish between retryable and non-retryable (context.Canceled, riverdriver.ErrClosedPool) errors and give up on the batch in the case of the latter.

@brandur brandur force-pushed the brandur-smoother-completion branch from d177425 to fb7ef2c Compare June 20, 2026 18:26
@brandur

brandur commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@brandur brandur force-pushed the brandur-smoother-completion branch from fb7ef2c to 4acc752 Compare June 20, 2026 19:17
@brandur

brandur commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 👍

Reviewed commit: 4acc7526c2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@brandur brandur requested a review from bgentry June 20, 2026 19:57
… + fewer warnings

This one's largely aimed at trying to resolve a common problem that we
see during benchmarks, in which the `BatchCompleter` becomes completely
saturated and starts producing noisy warnings saying that it'll produce
backpressure while waiting for completions:

    $ go run ./cmd/river bench --database-url $DATABASE_URL --num-total-jobs 300_000
    bench: jobs worked [          0 ], inserted [     300000 ], job/sec [        0.0 ] [0s]
    Jun 20 07:33:22.258 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
    Jun 20 07:33:23.464 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
    bench: jobs worked [      74578 ], inserted [          0 ], job/sec [    37289.0 ] [2s]
    Jun 20 07:33:24.480 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
    bench: jobs worked [      87308 ], inserted [          0 ], job/sec [    43654.0 ] [2s]
    Jun 20 07:33:25.873 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
    Jun 20 07:33:27.060 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
    bench: jobs worked [     113346 ], inserted [          0 ], job/sec [    56673.0 ] [2s]
    bench: jobs worked [      24768 ], inserted [          0 ], job/sec [    12384.0 ] [2s]
    bench: total jobs worked [     300000 ], total jobs inserted [     300000 ], overall job/sec [    46803.8 ], running 6.409735541s

Here, we modify our approach so that instead of waiting until we're at a
backlog of 20k jobs in the completer, we start to apply backpressure at
10k jobs, and reserve the warnings for if we get to the 20k level, which
will be much more difficult to get to. This has the effect of producing
much fewer warnings as we're running a benchmark:

    $ go run ./cmd/river bench --database-url $DATABASE_URL --num-total-jobs 300_000
    bench: jobs worked [          0 ], inserted [     300000 ], job/sec [        0.0 ] [0s]
    bench: jobs worked [      79849 ], inserted [          0 ], job/sec [    39924.5 ] [2s]
    bench: jobs worked [      96590 ], inserted [          0 ], job/sec [    48295.0 ] [2s]
    bench: jobs worked [     105726 ], inserted [          0 ], job/sec [    52863.0 ] [2s]
    bench: jobs worked [      17835 ], inserted [          0 ], job/sec [     8917.5 ] [2s]
    bench: total jobs worked [     300000 ], total jobs inserted [     300000 ], overall job/sec [    47026.1 ], running 6.379439334s

In my tests, throughput is roughly the same. Before at 20k the completer
was the bottleneck, and it's still the bottleneck. We do put in one
minor optimization in which `BatchCompleter` will start completing as
soon as it's got a full batch rather than having to wait for a 50 ms
tick. This doesn't have a super significant effect on numbers though.

An argument against this change is that to some degree, we'd be sweeping
the slow completer under the rug. I'm receptive to that, but a couple
things:

* It looks bad when the benchmark process regularly produces warnings as
  it runs.

* I think we can do better than warnings by putting in a more
  comprehensive stats layer that tracks specific durations for each
  phase of a job's life cycle (time to lock, time to work, time waiting
  to complete, time to complete, etc.). This seems like a much better
  alternative to the status quo.
@brandur brandur force-pushed the brandur-smoother-completion branch from 4acc752 to 2bb559c Compare June 20, 2026 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant