[Scheduler] Rework and instrument CHASM scheduler tasks, add a bunch of coverage#10369
Open
lina-temporal wants to merge 5 commits into
lina-temporal
commented
May 23, 2026
Comment on lines +267 to 284

    for _, start := range i.GetBufferedStarts() {
    	if start.GetRunId() != "" || start.GetCompleted() != nil || start.GetAttempt() < 0 {
    		continue
    	}
    	switch {
    	case start.GetAttempt() == 0:
    		pending++
    		hasUnprocessed = true
    	case start.GetBackoffTime().AsTime().After(lastProcessed):
    		pending++
    		backoff := start.GetBackoffTime().AsTime()
    		if nextDeadline.IsZero() || backoff.Before(nextDeadline) {
    			nextDeadline = backoff
    		}
    	default:
    		eligible++
    	}
    }
This isn't a bug fix, but it avoids an unnecessary transaction/task execution. Previously, every deferred start would get a task added, which was then cancelled out when the task itself ran. Here, the gate to add a task is just a bit more careful.
I wanted to rework the CHASM scheduler's tasks to harden them against the possibility of being dropped, or at least to give us a good amount of visibility into when they're dropped. I've added functional test coverage for the "edge case" schedules we've recently encountered, which uncovered a few bugs that are fixed here. Unfortunately, this is something of a large PR, because some of the bugs and edge cases tested only become green with all of the fixes applied; most of the delta, though, is tests.
Please ignore the logging, in particular, for now: once the EventLog PR is merged, I'm going to do a full logging + eventlogging pass. Any statements here are just a few small breadcrumbs/not exhaustive.
What changed?
Generator
- `GeneratorTask` now keeps running while paused (advances the HWM, updates `FutureActionTimes`, drops actions at the source). This is an improvement over V1.
- `isManualOnly`/`isHeldOpen` helpers carve out schedules that have no automated wakeup source (e.g., empty spec, manual-only, pending backfill) so the idle timer can't silently close them.

Idle task
- `isManualOnly`/`isHeldOpen` helpers, plus a `.After()` deadline compare so sub-precision drift no longer drops valid tasks.
- `recentActions` (a closing workflow could regress observed time; not user-visible in practice but cleaner).

Invoker
- `ExecuteTask` captures the framework clock at read time and threads it through `applyBackoff`, so `BackoffTime` stamps stay consistent with `LastProcessedTime` (no more wall-clock drift in tests/replay).
- `ProcessBufferTask` now checks the catchup window before consuming a `LimitedActions` slot.
- `recordExecuteResult` is idempotent per `RequestId`: a duplicate `CompletedStart` or `RetryableStart` whose live entry already has `RunId` set is dropped (without stomping `RunId`/`StartTime`/`HasCallback` or rewinding `Attempt`/`BackoffTime`).
- `addTasks` consolidated into a single-pass classifier; the broadened re-arm gate fixes the stranded-retry bug.

Backfiller
- `hasMoreAllowAllBackfills` was a misreading of the original code; it's gone. Any backfill (not just `ALLOW_ALL`) holds a paused schedule open while idling out.

Tests / infra
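As an illustration of the per-`RequestId` idempotency rule listed under Invoker above, the following sketch shows one way a duplicate result can be dropped without overwriting an entry that already has a run ID. The types and the `recordCompleted` function are hypothetical and heavily simplified; they are not the PR's `recordExecuteResult`, and they model only the `RunId` field.

```go
package main

import "fmt"

// bufferedStart is a hypothetical, simplified live buffer entry.
type bufferedStart struct {
	RequestID string
	RunID     string
}

// completedStart is a hypothetical result record keyed by RequestID.
type completedStart struct {
	RequestID string
	RunID     string
}

// recordCompleted applies results to the live buffer. A result whose live
// entry already has a RunID is counted as dropped and leaves the entry
// untouched, so replaying the same result is harmless.
func recordCompleted(buffer []*bufferedStart, results []completedStart) (applied, dropped int) {
	byID := make(map[string]*bufferedStart, len(buffer))
	for _, s := range buffer {
		byID[s.RequestID] = s
	}
	for _, r := range results {
		live, ok := byID[r.RequestID]
		if !ok {
			continue // no live entry for this request; nothing to record
		}
		if live.RunID != "" {
			dropped++ // already recorded; do not overwrite
			continue
		}
		live.RunID = r.RunID
		applied++
	}
	return applied, dropped
}

func main() {
	entry := &bufferedStart{RequestID: "req-1"}
	applied, dropped := recordCompleted(
		[]*bufferedStart{entry},
		[]completedStart{
			{RequestID: "req-1", RunID: "run-a"},
			{RequestID: "req-1", RunID: "run-b"}, // duplicate for the same request
		},
	)
	fmt.Printf("applied=%d dropped=%d runID=%s\n", applied, dropped, entry.RunID)
	// prints "applied=1 dropped=1 runID=run-a"
}
```

Keying the check on the live entry's state, rather than on the order in which results arrive, is what makes the operation safe to repeat.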
New metrics
All counters are tagged with `namespace` and `schedule_backend` via `newTaggedMetricsHandler`. Reason-tagged counters use a `reason` tag with limited cardinality.

Generator
- `schedule_generator_ticks`
- `schedule_generator_paused_ticks`
- `scheduler_generator_loop_completed`

Idle
- `schedule_idle_task_fired`
- `schedule_idle_task_invalidated` (`reason`: `held_open` / `expiration_shift` / `closed`)

Invoker
- `schedule_invoker_process_buffer_fired`
- `schedule_invoker_process_buffer_invalidated` (`reason`: `stale_hwm`)
- `schedule_invoker_execute_fired`
- `schedule_invoker_execute_invalidated` (`reason`: `no_work` (Validate) or `already_recorded` (concurrent ExecuteTask already wrote `RunId`))
- `schedule_buffered_start_dropped` (`reason`: `missed_catchup_window` or `paused_or_limited`)

Backfiller
- `schedule_backfiller_fired`
- `schedule_backfiller_invalidated` (`reason`: `stale_hwm`)
- `schedule_backfiller_completed`

New tests
Unit tests (`chasm/lib/scheduler/`)

Idle (`scheduler_idle_tasks_test.go`)
- `TestIdleTask_Validate_ManualOnlyHeldOpen`: empty-spec held open instead of auto-closed.
- `TestIdleTask_Validate_HasBackfillHeldOpen`: any pending backfill blocks close.
- `TestIdleTask_Validate_ExpirationShiftedLater`: deadline moved later drops the task.
- `TestIdleTask_Validate_ExpirationStableFires`: equal-time deadline still fires.
- `TestIdleTask_Validate_SentinelNotHeldOpen`: sentinels close regardless of otherwise-blocking state.
- `TestIdleTask_Validate_MetricReasons`: each invalidation path emits the correct `reason` tag.

Generator (`generator_tasks_test.go`)
- `TestGeneratorTask_PausedDropsActionsAdvancesHWM`: paused fire produces no starts but advances HWM.

Invoker `ExecuteTask` (`invoker_execute_task_test.go`)
- `TestExecuteTask_Validate`: table of 6 cases (empty / eligible / running / backing-off / cancel-pending / terminate-pending).
- `TestExecuteTask_Validate_BackoffEqualToLPTIsEligible`: boundary regression for `.Before` → `!.After`.
- `TestExecuteTask_BackoffUsesFrameworkClock`: advances `TimeSource` by 24h; asserts `BackoffTime` tracks the framework clock.
- `TestExecuteTask_RecordResultIdempotentOnRace`: concurrent CompletedStart for an already-running entry: no double-count, no stomp of `RunId`/`StartTime`/`HasCallback`.
- `TestExecuteTask_RecordResultIdempotentOnRetryableRace`: concurrent RetryableStart for an already-running entry: `Attempt`/`BackoffTime` are not rewound.

Invoker `ProcessBufferTask` (`invoker_process_buffer_task_test.go`)
- `TestProcessBufferTask_Validate`: table of 5 cases covering `validateTaskHighWaterMark` branches.
- `TestProcessBufferTask_MissedCatchupPreservesRemainingActions`: regression test, but otherwise not reachable via the public API.
- `TestProcessBufferTask_PausedDropsAutomatedKeepsManual`: paused schedule drops automated start, promotes manual to `Attempt=1`.
- `TestProcessBufferTask_RearmsBackedOffRetry`: no-op ProcessBuffer with an eligible backed-off start still emits an `ExecuteTask`.
- `TestInvoker_AddTasks_AllDeferredEmitsNothing`: buffer of only deferred starts emits no `ProcessBufferTask`.

Nexus completion (`scheduler_nexus_completion_test.go`)
- `TestHandleNexusCompletion_ReenablesDeferredStarts`: full defer → re-enable → promote cascade on completion.

Functional tests (`tests/schedule_test.go`)

Shared (run against both V1 and CHASM backends)
- `testBufferOneDeferredFiresAfterCompletion`: BUFFER_ONE deferred start fires when the running workflow completes.
- `testTriggerImmediatelyOnPausedSchedule`: manual trigger fires despite pause.
- `testTriggerImmediatelyAfterActionsExhausted`: manual trigger fires despite `RemainingActions=0`.
- `testBackfillWithSkipOverlap`: SKIP collapses a 5-tick backfill to a single execution.
- `testBackfillWithBufferOneOverlap` (skipped): pins expected BUFFER_ONE behavior; currently fails on V1 and CHASM (deferred starts not re-enabling for backfill-Manual entries). Worth a separate investigation.
- `testMultiRangeBackfillCountedExactlyOnce`: concurrent BackfillerTasks don't inflate `ActionCount` (mirrors the upstream Go SDK feature test).
- `testBackfillRangeSmallerThanInterval`: narrow backfill produces no actions and drains cleanly.
- `testUpdateScheduleRequestIDTooLong` / `testUpdateScheduleBlobSizeLimit`: input validation.

CHASM-only
- `testPausedDropsCatchup`: pause-window ticks don't replay on unpause.
- `testPausedScheduleNeverIdles`: paused schedules survive past `IdleTime`.
- `testPausedEmptySpecStaysOpen`: empty-spec paused schedule survives past `IdleTime`.
- `testManualOnlyUnpausedStaysOpen`: manual-only carve-out: unpaused empty-spec stays open past `IdleTime`.
- `testScheduleClosesFromIdle`: parameterized over 4 spec shapes (single-date, multi-date, limited-actions, interval-end-time); each closes from idle.
- `testPauseDuringIdleWindow`: `held_open` invalidation + re-arm on unpause.
- `testUpdateDuringIdleWindow`: `expiration_shift` invalidation + re-arm at the new deadline.
- `testBackfillBlocksIdleClose`: pending backfill blocks idle close; closes when backfill drains.
- `testBackfillOnPausedSchedule`: backfill against a paused schedule still fires.
- `testRecentActionsAdvanceWhilePaused`: workflow status transitions RUNNING → COMPLETED in ListSchedules during pause.
- `testFutureActionTimesAdvanceWhilePaused`: projected times re-publish while paused.

Potential risks
This changes a substantial amount of scheduler surface area, but the bulk of the diff is tests and the fixes are individually validated.