Skip to content

fix(schedule): always refresh Action.Args on existing schedules#73

Merged
bakayolo merged 1 commit intomainfrom
bena/fix-schedule-stale-args
Apr 30, 2026
Merged

fix(schedule): always refresh Action.Args on existing schedules#73
bakayolo merged 1 commit intomainfrom
bena/fix-schedule-stale-args

Conversation

@bakayolo
Copy link
Copy Markdown
Collaborator

Background

While debugging a 2-day staging snapshot gap (no snapshots/2026/04/29 or 04/30 writes), I traced the failure to the cron-scheduled orchestrator. Datadog showed the workflow logging exactly one line — INFO Starting orchestrator workflow — and then nothing else. The first WorkflowTask completed in 137 µs and Temporal recorded the run as WORKFLOW_EXECUTION_STATUS_FAILED.

Decoding the staging Schedule's input payload confirmed the issue:

data: "eyJTY2FuSUQiOiIiLCJSZXNvdXJjZVR5cGVzIjpudWxsfQ=="
       └─ {"ScanID":"","ResourceTypes":null}

That ResourceTypes:null is exactly what triggers orchestrator.ErrNoResourceTypes (added in #55). The orchestrator returns the error immediately after the Starting orchestrator workflow log — no error log line is emitted, hence the one log then silence symptom.

Root cause

EnsureSchedule short-circuited when the existing Schedule's cron+jitter matched and never touched Action.Args. So a Schedule created on a pre-#55 revision (when the orchestrator carried a hardcoded fallback resource list and ResourceTypes could safely be empty) kept that empty/null ResourceTypes baked into its Action.Args forever. Subsequent worker restarts ran the new code that requires ResourceTypes, but the Schedule's args were never refreshed.

The bug is silent: nothing in the worker startup logs flags the drift, the Schedule itself looks healthy in the Temporal UI, and the failure only surfaces as workflows that fail before logging anything past the first line.

Fix

Drop the skip when same optimization in the existing-Schedule branch and always rewrite Spec + Action.Args + Action.TaskQueue from the current cfg whenever EnsureSchedule finds an existing Schedule.

Args are stored in the Schedule as opaque payloads, so we cannot reliably diff them against cfg.ResourceTypes here. One Update RPC per pod startup is a trivial cost compared to silent arg drift causing every cron firing to fail.

Tests

  • New: TestEnsureSchedule_Update_RefreshesActionArgs simulates a Schedule whose existing Action.Args contains a stale ResourceTypes payload and asserts that EnsureSchedule rewrites Args (and TaskQueue) with values from the current cfg.
  • Updated: TestEnsureSchedule_AlreadyExists_AlwaysUpdates (renamed from ...SameCron) — the contract changed from don't update when cron matches to always update so Args is refreshed.
  • Removed: TestEnsureSchedule_DescribeError — the update path no longer calls Describe, so the test is obsolete.
$ go test ./pkg/schedule/...
ok  	github.com/block/Version-Guard/pkg/schedule	0.812s

Operational follow-up

Anyone running this service needs to verify their existing Temporal Schedule's Action.Args is non-null. Quick check via Temporal CLI:

temporal schedule describe --schedule-id <your-schedule-id> -o json \
  | jq -r '.schedule.action.startWorkflow.input.payloads[0].data' \
  | base64 -d

If it shows {"ScanID":"","ResourceTypes":null}, delete the Schedule and restart the worker — the next startup with this fix will recreate it correctly. (For staging at Block, I already did this manually.)

Considered alternatives

  • Compare Args and skip when equal. Rejected: Args come back from Describe as opaque *commonpb.Payloads and would need data-converter-aware unmarshalling to compare against the typed orchestrator.WorkflowInput. Not worth the complexity for a once-per-startup RPC.
  • Refuse to start when stale Args detected. Rejected: leaves the service down until a human deletes the Schedule. Auto-healing on every startup is safer.

The previous update path skipped when cron+jitter matched and never
touched Action.Args, so a schedule created on an older code revision
(when the orchestrator carried a hardcoded fallback resource list)
kept a stale ResourceTypes:null in its Args forever. After PR #55
made empty ResourceTypes a hard error (ErrNoResourceTypes), every
cron firing failed instantly with no log past 'Starting orchestrator
workflow' — the workflow returned the error before logging it.

Fix: drop the skip-when-same optimization and always rewrite Spec
+ Action.Args + TaskQueue from the current cfg whenever an existing
schedule is found. Args are encoded as opaque payloads in the
Schedule and cannot be reliably diffed; one Update RPC per pod
startup is a trivial cost compared to the silent-arg-drift outage.

Adds a regression test that simulates a stale Args payload and
asserts the rewrite produces the current cfg's ResourceTypes.

Amp-Thread-ID: https://ampcode.com/threads/T-019ddff6-22a2-764e-bc7e-da55598b7ccc
Co-authored-by: Amp <amp@ampcode.com>
@bakayolo bakayolo requested a review from a team as a code owner April 30, 2026 21:42
@bakayolo bakayolo merged commit 2f94add into main Apr 30, 2026
7 checks passed
@bakayolo bakayolo deleted the bena/fix-schedule-stale-args branch April 30, 2026 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants