Skip to content

Make jobs cancelled due to a soft stop immediately available#1290

Open
brandur wants to merge 1 commit into
masterfrom
brandur-soft-stopped-jobs-immediately-available
Open

Make jobs cancelled due to a soft stop immediately available#1290
brandur wants to merge 1 commit into
masterfrom
brandur-soft-stopped-jobs-immediately-available

Conversation

@brandur

@brandur brandur commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

While working on #1289, I realized that jobs which are "soft stopped"
via context cancellation are still prone to the same side effects as if
they errored in any other way:

  • Their number of attempts is incremented.
  • They may be discarded if reaching max attempts.
  • They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually
misbehave in any way, but were rather just slow-to-run jobs that
couldn't finish cleanly inside the default stop allowance while a client
was restarting or being deployed.

The proper behavior should probably be more like a snooze. i.e. The soft
timeout cancellation doesn't count and the jobs get a chance to be
retried immediately. Here, make that change.

@brandur brandur force-pushed the brandur-soft-stopped-jobs-immediately-available branch from 00b244b to 3c09064 Compare June 19, 2026 20:16
@brandur brandur requested a review from bgentry June 19, 2026 20:16
@brandur

brandur commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

@bgentry Do you agree with the rationale for this one?

@brandur

brandur commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c090646e1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/jobexecutor/job_executor.go Outdated
}

if softStopped {
params := riverdriver.JobSetStateErrorAvailable(jobRow.ID, now, ptrutil.Ptr(max(jobRow.Attempt-1, 0)), errData, metadataUpdates)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep soft-stop records out of retry backoff

For jobs that are soft-stopped and then later fail for a real reason, this path resets attempt but still appends errData to errors. The default retry policy uses len(job.Errors)+1 rather than attempt, so a deployment stop still changes future retry behavior: the first genuine failure after one soft stop is treated like the second error and backs off for 16s instead of 1s, with repeated stops inflating it further. This means soft-stop cancellations still count for retry scheduling even though the attempt count is restored.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Changed this to be more like snooze and not persist the cancellation error at all.

While working on #1289, I realized that jobs which are "soft stopped"
via context cancellation are still prone to the same side effects as if
they errored in any other way:

* Their number of attempts is incremented.
* They may be discarded if reaching max attempts.
* They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually
misbehave in any way, but were rather just slow-to-run jobs that
couldn't finish cleanly inside the default stop allowance while a client
was restarting or being deployed.

The proper behavior should probably be more like a snooze. i.e. The soft
timeout cancellation doesn't count and the jobs get a chance to be
retried immediately. Here, make that change.
@brandur brandur force-pushed the brandur-soft-stopped-jobs-immediately-available branch from 3c09064 to 2e49010 Compare June 20, 2026 03:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant