Skip to content

Add Config.HardStopTimeout to perform a "hard stop" setting jobs errored#1289

Open
brandur wants to merge 1 commit into
masterfrom
brandur-hard-stop-timeout
Open

Add Config.HardStopTimeout to perform a "hard stop" setting jobs errored#1289
brandur wants to merge 1 commit into
masterfrom
brandur-hard-stop-timeout

Conversation

@brandur

@brandur brandur commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Here, add a new Config.HardStopTimeout on top of the existing
SoftStopTimeout whose job it is to recover badly behaving job as much
as possible before coming to a full stop. Currently, if a client is
stopping and is running jobs that don't respond to context cancellation,
those jobs end up getting left in a running state, which means that
they won't be recoverable again until they're rescued an hour later.

HardStopTimeout engages after soft stop, and has each producer perform
a "hard stop", which means to have it set any jobs still running to an
error state. Because they're errored, they'll get to run immediately the
next time a client starts up.

Ideally, users don't need to depend on this functionality since the
"correct" behavior would be to make sure that all jobs are able to
respond to context cancellation, so we make this new feature optional.

@brandur brandur marked this pull request as draft June 19, 2026 15:08
@brandur brandur force-pushed the brandur-hard-stop-timeout branch 2 times, most recently from 1234376 to c5151af Compare June 19, 2026 15:16
Comment thread producer.go

var setStateParams *riverdriver.JobSetStateIfRunningParams
if job.Attempt >= job.MaxAttempts {
setStateParams = riverdriver.JobSetStateDiscarded(job.ID, now, errData, nil)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mirrors existing behavior where a soft stop will set an error and potentially send the job to discarded, but looking at this again, this existing behavior does seem potentially wrong.

…rored

Here, add a new `Config.HardStopTimeout` on top of the existing
`SoftStopTimeout` whose job it is to recover badly behaving job as much
as possible before coming to a full stop. Currently, if a client is
stopping and is running jobs that don't respond to context cancellation,
those jobs end up getting left in a `running` state, which means that
they won't be recoverable again until they're rescued an hour later.

`HardStopTimeout` engages after soft stop, and has each producer perform
a "hard stop", which means to have it set any jobs still running to an
error state. Because they're errored, they'll get to run immediately the
next time a client starts up.

Ideally, users don't need to depend on this functionality since the
"correct" behavior would be to make sure that all jobs are able to
respond to context cancellation, so we make this new feature optional.
@brandur brandur force-pushed the brandur-hard-stop-timeout branch from c5151af to 7342765 Compare June 19, 2026 17:44
brandur added a commit that referenced this pull request Jun 19, 2026
While working on #1289, I realized that jobs which are "soft stopped"
via context cancellation are still prone to the same side effects as if
they errored in any other way:

* Their number of attempts is incremented.
* They may be discarded if reaching max attempts.
* They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually
misbehave in any way, but were rather just slow-to-run jobs that
couldn't finish cleanly inside the default stop allowance while a client
was restarting or being deployed.

The proper behavior should probably be more like a snooze. i.e. The soft
timeout cancellation doesn't count and the jobs get a chance to be
retried immediately. Here, make that change.
brandur added a commit that referenced this pull request Jun 20, 2026
While working on #1289, I realized that jobs which are "soft stopped"
via context cancellation are still prone to the same side effects as if
they errored in any other way:

* Their number of attempts is incremented.
* They may be discarded if reaching max attempts.
* They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually
misbehave in any way, but were rather just slow-to-run jobs that
couldn't finish cleanly inside the default stop allowance while a client
was restarting or being deployed.

The proper behavior should probably be more like a snooze. i.e. The soft
timeout cancellation doesn't count and the jobs get a chance to be
retried immediately. Here, make that change.
@brandur brandur marked this pull request as ready for review June 20, 2026 03:46
@brandur brandur requested a review from bgentry June 20, 2026 03:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant