Add Config.HardStopTimeout to perform a "hard stop" setting jobs errored #1289
Open
brandur wants to merge 1 commit into
1234376 to
c5151af
Compare
brandur
commented
Jun 19, 2026
    var setStateParams *riverdriver.JobSetStateIfRunningParams
    if job.Attempt >= job.MaxAttempts {
        setStateParams = riverdriver.JobSetStateDiscarded(job.ID, now, errData, nil)
Contributor
Author
This mirrors existing behavior where a soft stop will set an error and potentially send the job to discarded, but looking at this again, this existing behavior does seem potentially wrong.
…rored

Here, add a new `Config.HardStopTimeout` on top of the existing `SoftStopTimeout` whose job it is to recover badly behaving jobs as much as possible before coming to a full stop. Currently, if a client is stopping and is running jobs that don't respond to context cancellation, those jobs end up getting left in a `running` state, which means that they won't be recoverable again until they're rescued an hour later.

`HardStopTimeout` engages after soft stop, and has each producer perform a "hard stop", which means setting any jobs still running to an error state. Because they're errored, they'll get to run immediately the next time a client starts up.

Ideally, users don't need to depend on this functionality since the "correct" behavior would be to make sure that all jobs are able to respond to context cancellation, so we make this new feature optional.
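To make the difference concrete, here is a minimal, self-contained sketch of the idea and not the code in this PR. Everything in it is a hypothetical stand-in: `tracker` replaces the jobs table, `jobState` replaces persisted job states, and the fixed `grace` wait only approximates when a hard stop might engage. It runs one job that ignores context cancellation and reports the state it is left in with and without a hard stop.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// jobState is a hypothetical stand-in for a job's persisted state.
type jobState string

const (
	stateRunning   jobState = "running"
	stateCompleted jobState = "completed"
	stateErrored   jobState = "errored"
)

// tracker is a hypothetical in-memory substitute for the jobs table.
type tracker struct {
	mu     sync.Mutex
	states map[int]jobState
}

// setIfRunning transitions a job only if it is still marked running, so a job
// that returns late cannot overwrite the result of a hard stop.
func (t *tracker) setIfRunning(id int, s jobState) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.states[id] == stateRunning {
		t.states[id] = s
	}
}

// hardStop marks every job that is still running as errored.
func (t *tracker) hardStop() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id, s := range t.states {
		if s == stateRunning {
			t.states[id] = stateErrored
		}
	}
}

// get returns the current state of a job.
func (t *tracker) get(id int) jobState {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.states[id]
}

// stuckJobStateAfterStop runs one job that ignores context cancellation, stops
// it, and returns the state the job is left in at the end of the stop.
func stuckJobStateAfterStop(withHardStop bool) jobState {
	const grace = 20 * time.Millisecond // assumed wait before the hard stop engages

	t := &tracker{states: map[int]jobState{1: stateRunning}}
	ctx, cancel := context.WithCancel(context.Background())
	release := make(chan struct{})
	finished := make(chan struct{})

	go func(_ context.Context) { // badly behaved: never checks the context
		defer close(finished)
		<-release
		t.setIfRunning(1, stateCompleted)
	}(ctx)

	cancel() // soft stop: cancel the job's context
	select {
	case <-finished: // the job returned on its own
	case <-time.After(grace):
		if withHardStop {
			t.hardStop()
		}
	}

	state := t.get(1) // state as the client would leave it
	close(release)    // let the goroutine exit so the sketch doesn't leak it
	<-finished
	return state
}

func main() {
	fmt.Println("without hard stop:", stuckJobStateAfterStop(false))
	fmt.Println("with hard stop:   ", stuckJobStateAfterStop(true))
}
```

In this toy model the job stays `running` without a hard stop and ends up `errored` with one. The conditional update in `setIfRunning` is only there so the late-returning goroutine can't clobber the hard stop's result.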
c5151af to
7342765
Compare
brandur
added a commit
that referenced
this pull request
Jun 19, 2026
While working on #1289, I realized that jobs which are "soft stopped" via context cancellation are still prone to the same side effects as if they errored in any other way:

* Their number of attempts is incremented.
* They may be discarded if reaching max attempts.
* They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually misbehave in any way, but were rather just slow-to-run jobs that couldn't finish cleanly inside the default stop allowance while a client was restarting or being deployed.

The proper behavior should probably be more like a snooze, i.e. the soft timeout cancellation doesn't count and the jobs get a chance to be retried immediately. Here, make that change.
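The contrast between the two behaviors can be sketched roughly as below. This is an illustrative model only, not the referenced commit: the `job` and `outcome` types, the `resolve` function, the state strings, and the linear `retryDelay` policy are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// job is a hypothetical, pared-down job record.
type job struct {
	Attempt     int // attempts made so far, including the interrupted one
	MaxAttempts int
}

// outcome is a hypothetical description of what happens to the job next.
type outcome struct {
	State   string
	Attempt int
	Delay   time.Duration // how long until the job may run again
}

// retryDelay is a placeholder retry policy, assumed for illustration.
func retryDelay(attempt int) time.Duration {
	return time.Duration(attempt) * time.Minute
}

// resolve decides a job's next state after it returns an error. When the error
// came from a soft stop, it is treated like a snooze: the attempt doesn't
// count and the job can be worked again right away.
func resolve(j job, softStopped bool) outcome {
	if softStopped {
		return outcome{State: "available", Attempt: j.Attempt - 1, Delay: 0}
	}
	if j.Attempt >= j.MaxAttempts {
		return outcome{State: "discarded", Attempt: j.Attempt}
	}
	return outcome{State: "retryable", Attempt: j.Attempt, Delay: retryDelay(j.Attempt)}
}

func main() {
	j := job{Attempt: 3, MaxAttempts: 3}
	fmt.Printf("ordinary error: %+v\n", resolve(j, false))
	fmt.Printf("soft stopped:   %+v\n", resolve(j, true))
}
```

With a job on its last attempt, the ordinary error path discards it, while the snooze-like path hands the attempt back and makes the job available with no delay.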
brandur
added a commit
that referenced
this pull request
Jun 20, 2026