fix(schedule): always refresh Action.Args on existing schedules #73
Merged
The previous update path skipped when cron+jitter matched and never touched Action.Args, so a schedule created on an older code revision (when the orchestrator carried a hardcoded fallback resource list) kept a stale ResourceTypes:null in its Args forever. After PR #55 made empty ResourceTypes a hard error (ErrNoResourceTypes), every cron firing failed instantly with no log past 'Starting orchestrator workflow': the workflow returned the error before logging it.
Fix: drop the skip-when-same optimization and always rewrite Spec + Action.Args + TaskQueue from the current cfg whenever an existing schedule is found. Args are encoded as opaque payloads in the Schedule and cannot be reliably diffed; one Update RPC per pod startup is a trivial cost compared to the silent-arg-drift outage.
Adds a regression test that simulates a stale Args payload and asserts the rewrite produces the current cfg's ResourceTypes.
Amp-Thread-ID: https://ampcode.com/threads/T-019ddff6-22a2-764e-bc7e-da55598b7ccc
Co-authored-by: Amp <amp@ampcode.com>
revied
approved these changes
Apr 30, 2026
Background
While debugging a 2-day staging snapshot gap (no snapshots/2026/04/29 or 04/30 writes), I traced the failure to the cron-scheduled orchestrator. Datadog showed the workflow logging exactly one line, "INFO Starting orchestrator workflow", and then nothing else. The first WorkflowTask completed in 137 µs and Temporal recorded the run as WORKFLOW_EXECUTION_STATUS_FAILED.
Decoding the staging Schedule's input payload confirmed the issue:
That ResourceTypes:null is exactly what triggers orchestrator.ErrNoResourceTypes (added in #55). The orchestrator returns the error immediately after the "Starting orchestrator workflow" log; no error log line is emitted, hence the "one log then silence" symptom.
Root cause
EnsureSchedule short-circuited when the existing Schedule's cron+jitter matched and never touched Action.Args. So a Schedule created on a pre-#55 revision (when the orchestrator carried a hardcoded fallback resource list and ResourceTypes could safely be empty) kept that empty/null ResourceTypes baked into its Action.Args forever. Subsequent worker restarts ran the new code that requires ResourceTypes, but the Schedule's args were never refreshed.
The bug is silent: nothing in the worker startup logs flags the drift, the Schedule itself looks healthy in the Temporal UI, and the failure only surfaces as workflows that fail before logging anything past the first line.
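The drift described above can be reproduced in a small stdlib-only sketch. Everything here is a stand-in rather than the real code: `schedule`, `ensureOld`, `fire`, and `staleSchedule` are hypothetical names, the field types of `WorkflowInput` are assumed from the decoded payload, the cron, jitter, and resource type values are illustrative, and the Args are assumed to be JSON-encoded.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// WorkflowInput mirrors the two fields seen in the decoded payload.
// The field types are assumptions.
type WorkflowInput struct {
	ScanID        string
	ResourceTypes []string
}

// ErrNoResourceTypes stands in for orchestrator.ErrNoResourceTypes.
var ErrNoResourceTypes = errors.New("no resource types configured")

// schedule is a hypothetical in-memory stand-in for a Temporal Schedule.
type schedule struct {
	Cron   string
	Jitter time.Duration
	Args   []byte // JSON-encoded WorkflowInput, opaque to the update path
}

// ensureOld models the pre-fix update path: when cron and jitter match it
// returns early, so Args keep whatever value they had at creation time.
func ensureOld(existing *schedule, cron string, jitter time.Duration, in WorkflowInput) error {
	if existing.Cron == cron && existing.Jitter == jitter {
		return nil // short-circuit: Args are never looked at
	}
	args, err := json.Marshal(in)
	if err != nil {
		return fmt.Errorf("encode args: %w", err)
	}
	existing.Cron, existing.Jitter, existing.Args = cron, jitter, args
	return nil
}

// fire models one cron firing: decode the stored Args, then apply the
// guard that makes empty ResourceTypes a hard error.
func fire(s *schedule) error {
	var in WorkflowInput
	if err := json.Unmarshal(s.Args, &in); err != nil {
		return fmt.Errorf("decode args: %w", err)
	}
	if len(in.ResourceTypes) == 0 {
		return ErrNoResourceTypes
	}
	return nil
}

// staleSchedule returns a schedule as an older revision would have left it.
func staleSchedule() *schedule {
	return &schedule{
		Cron:   "0 * * * *", // illustrative value
		Jitter: time.Minute, // illustrative value
		Args:   []byte(`{"ScanID":"","ResourceTypes":null}`),
	}
}

func main() {
	s := staleSchedule()
	current := WorkflowInput{ResourceTypes: []string{"bucket", "instance"}}
	if err := ensureOld(s, s.Cron, s.Jitter, current); err != nil {
		panic(err)
	}
	// The worker restarted with a valid config, yet the firing still fails.
	fmt.Println(fire(s))
}
```

Because JSON `null` decodes into a nil slice, the guard trips on every firing until something rewrites the stored Args.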
Fix
Drop the "skip when same" optimization in the existing-Schedule branch and always rewrite Spec + Action.Args + Action.TaskQueue from the current cfg whenever EnsureSchedule finds an existing Schedule.
Args are stored in the Schedule as opaque payloads, so we cannot reliably diff them against cfg.ResourceTypes here. One Update RPC per pod startup is a trivial cost compared to silent arg drift causing every cron firing to fail.
Tests
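The always-update contract that the tests in this section pin down can be sketched against an in-memory stand-in. This is not the real EnsureSchedule or the real test: `Config`, `schedule`, `ensureAlways`, and `decodeArgs` are hypothetical names, the Spec is reduced to a cron string, the queue and resource type values are illustrative, and the Args are assumed to be JSON-encoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WorkflowInput mirrors the fields seen in the stale payload; the field
// types are assumptions.
type WorkflowInput struct {
	ScanID        string
	ResourceTypes []string
}

// Config is a hypothetical subset of the worker's cfg.
type Config struct {
	Cron          string
	TaskQueue     string
	ResourceTypes []string
}

// schedule is a hypothetical in-memory stand-in for a Temporal Schedule.
type schedule struct {
	Cron      string
	TaskQueue string
	Args      []byte // JSON-encoded WorkflowInput
}

// ensureAlways models the post-fix branch for an existing schedule: no
// comparison, just rewrite spec, task queue, and args from the current cfg.
func ensureAlways(existing *schedule, cfg Config) error {
	args, err := json.Marshal(WorkflowInput{ResourceTypes: cfg.ResourceTypes})
	if err != nil {
		return fmt.Errorf("encode args: %w", err)
	}
	existing.Cron = cfg.Cron
	existing.TaskQueue = cfg.TaskQueue
	existing.Args = args
	return nil
}

// decodeArgs reads the stored payload back into the typed input.
func decodeArgs(s *schedule) (WorkflowInput, error) {
	var in WorkflowInput
	if err := json.Unmarshal(s.Args, &in); err != nil {
		return WorkflowInput{}, fmt.Errorf("decode args: %w", err)
	}
	return in, nil
}

func main() {
	// Same cron as cfg, but stale args and task queue (illustrative values).
	existing := &schedule{
		Cron:      "0 * * * *",
		TaskQueue: "old-queue",
		Args:      []byte(`{"ScanID":"","ResourceTypes":null}`),
	}
	cfg := Config{Cron: "0 * * * *", TaskQueue: "snapshots", ResourceTypes: []string{"bucket"}}

	if err := ensureAlways(existing, cfg); err != nil {
		panic(err)
	}
	in, err := decodeArgs(existing)
	if err != nil {
		panic(err)
	}
	fmt.Println(existing.TaskQueue, in.ResourceTypes)
}
```

The point of the sketch is that the matching cron no longer matters: the stale payload is overwritten on every call.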
TestEnsureSchedule_Update_RefreshesActionArgs simulates a Schedule whose existing Action.Args contains a stale ResourceTypes payload and asserts that EnsureSchedule rewrites Args (and TaskQueue) with values from the current cfg.
TestEnsureSchedule_AlreadyExists_AlwaysUpdates (renamed from ...SameCron): the contract changed from "don't update when cron matches" to "always update so Args is refreshed".
TestEnsureSchedule_DescribeError: the update path no longer calls Describe, so the test is obsolete.
Operational follow-up
Anyone running this service needs to verify that ResourceTypes in their existing Temporal Schedule's Action.Args is non-null. Quick check via Temporal CLI:
If it shows {"ScanID":"","ResourceTypes":null}, delete the Schedule and restart the worker; the next startup with this fix will recreate it correctly. (For staging at Block, I already did this manually.)
Considered alternatives
Diff Args and skip when equal. Rejected: Args come back from Describe as opaque *commonpb.Payloads and would need data-converter-aware unmarshalling to compare against the typed orchestrator.WorkflowInput. Not worth the complexity for a once-per-startup RPC.
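For a sense of what the rejected diff would involve, here is a minimal sketch that assumes the stored payload is plain JSON and compares it with reflect.DeepEqual. `argsEqual` is a hypothetical helper, the `WorkflowInput` field types are assumed, and a real version would have to decode through the configured data converter instead of calling encoding/json directly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// WorkflowInput mirrors the fields seen in the stored payload; the field
// types are assumptions.
type WorkflowInput struct {
	ScanID        string
	ResourceTypes []string
}

// argsEqual decodes a stored payload and compares it to the wanted input.
// Assumption: the payload is plain JSON, which only holds for one encoding.
func argsEqual(stored []byte, want WorkflowInput) (bool, error) {
	var got WorkflowInput
	if err := json.Unmarshal(stored, &got); err != nil {
		return false, fmt.Errorf("decode stored args: %w", err)
	}
	return reflect.DeepEqual(got, want), nil
}

func main() {
	stale := []byte(`{"ScanID":"","ResourceTypes":null}`)

	// Stale payload against a populated input: correctly reported as different.
	same, err := argsEqual(stale, WorkflowInput{ResourceTypes: []string{"bucket"}})
	fmt.Println(same, err)

	// Pitfall: JSON null decodes to a nil slice, and DeepEqual treats a nil
	// slice and an empty slice as different even though both mean "none".
	same, err = argsEqual(stale, WorkflowInput{ResourceTypes: []string{}})
	fmt.Println(same, err)
}
```

Even in this simplified form the comparison needs care around nil versus empty slices, which is part of why a single unconditional Update was preferred.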