fix(schedule): always refresh Action.Args on existing schedules by bakayolo · Pull Request #73 · block/version-guard

bakayolo · 2026-04-30T21:42:43Z

Background

While debugging a 2-day staging snapshot gap (no snapshots/2026/04/29 or 04/30 writes), I traced the failure to the cron-scheduled orchestrator. Datadog showed the workflow logging exactly one line — INFO Starting orchestrator workflow — and then nothing else. The first WorkflowTask completed in 137 µs and Temporal recorded the run as WORKFLOW_EXECUTION_STATUS_FAILED.

Decoding the staging Schedule's input payload confirmed the issue:

data: "eyJTY2FuSUQiOiIiLCJSZXNvdXJjZVR5cGVzIjpudWxsfQ=="
       └─ {"ScanID":"","ResourceTypes":null}

That ResourceTypes:null is exactly what triggers orchestrator.ErrNoResourceTypes (added in #55). The orchestrator returns the error immediately after the Starting orchestrator workflow log — no error log line is emitted, hence the one log then silence symptom.

Root cause

EnsureSchedule short-circuited when the existing Schedule's cron+jitter matched and never touched Action.Args. So a Schedule created on a pre-#55 revision (when the orchestrator carried a hardcoded fallback resource list and ResourceTypes could safely be empty) kept that empty/null ResourceTypes baked into its Action.Args forever. Subsequent worker restarts ran the new code that requires ResourceTypes, but the Schedule's args were never refreshed.

The bug is silent: nothing in the worker startup logs flags the drift, the Schedule itself looks healthy in the Temporal UI, and the failure only surfaces as workflows that fail before logging anything past the first line.

Fix

Drop the skip when same optimization in the existing-Schedule branch and always rewrite Spec + Action.Args + Action.TaskQueue from the current cfg whenever EnsureSchedule finds an existing Schedule.

Args are stored in the Schedule as opaque payloads, so we cannot reliably diff them against cfg.ResourceTypes here. One Update RPC per pod startup is a trivial cost compared to silent arg drift causing every cron firing to fail.

Tests

New: TestEnsureSchedule_Update_RefreshesActionArgs simulates a Schedule whose existing Action.Args contains a stale ResourceTypes payload and asserts that EnsureSchedule rewrites Args (and TaskQueue) with values from the current cfg.
Updated: TestEnsureSchedule_AlreadyExists_AlwaysUpdates (renamed from ...SameCron) — the contract changed from don't update when cron matches to always update so Args is refreshed.
Removed: TestEnsureSchedule_DescribeError — the update path no longer calls Describe, so the test is obsolete.

$ go test ./pkg/schedule/...
ok  	github.com/block/Version-Guard/pkg/schedule	0.812s

Operational follow-up

Anyone running this service needs to verify their existing Temporal Schedule's Action.Args is non-null. Quick check via Temporal CLI:

temporal schedule describe --schedule-id <your-schedule-id> -o json \
  | jq -r '.schedule.action.startWorkflow.input.payloads[0].data' \
  | base64 -d

If it shows {"ScanID":"","ResourceTypes":null}, delete the Schedule and restart the worker — the next startup with this fix will recreate it correctly. (For staging at Block, I already did this manually.)

Considered alternatives

Compare Args and skip when equal. Rejected: Args come back from Describe as opaque *commonpb.Payloads and would need data-converter-aware unmarshalling to compare against the typed orchestrator.WorkflowInput. Not worth the complexity for a once-per-startup RPC.
Refuse to start when stale Args detected. Rejected: leaves the service down until a human deletes the Schedule. Auto-healing on every startup is safer.

The previous update path skipped when cron+jitter matched and never touched Action.Args, so a schedule created on an older code revision (when the orchestrator carried a hardcoded fallback resource list) kept a stale ResourceTypes:null in its Args forever. After PR #55 made empty ResourceTypes a hard error (ErrNoResourceTypes), every cron firing failed instantly with no log past 'Starting orchestrator workflow' — the workflow returned the error before logging it. Fix: drop the skip-when-same optimization and always rewrite Spec + Action.Args + TaskQueue from the current cfg whenever an existing schedule is found. Args are encoded as opaque payloads in the Schedule and cannot be reliably diffed; one Update RPC per pod startup is a trivial cost compared to the silent-arg-drift outage. Adds a regression test that simulates a stale Args payload and asserts the rewrite produces the current cfg's ResourceTypes. Amp-Thread-ID: https://ampcode.com/threads/T-019ddff6-22a2-764e-bc7e-da55598b7ccc Co-authored-by: Amp <amp@ampcode.com>

bakayolo requested a review from a team as a code owner April 30, 2026 21:42

revied approved these changes Apr 30, 2026

View reviewed changes

bakayolo merged commit 2f94add into main Apr 30, 2026
7 checks passed

bakayolo deleted the bena/fix-schedule-stale-args branch April 30, 2026 21:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(schedule): always refresh Action.Args on existing schedules#73

fix(schedule): always refresh Action.Args on existing schedules#73
bakayolo merged 1 commit intomainfrom
bena/fix-schedule-stale-args

bakayolo commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bakayolo commented Apr 30, 2026

Background

Root cause

Fix

Tests

Operational follow-up

Considered alternatives

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants