[Scheduler] Rework and instrument CHASM scheduler tasks, add a bunch of coverage #10369

Open
lina-temporal wants to merge 5 commits into main from sched-task-rework

lina-temporal commented May 23, 2026

I wanted to rework the CHASM scheduler's tasks to harden them against being dropped or, at the least, to give us good visibility into when they are dropped. I've added functional test coverage for the "edge case" schedules we've recently encountered, which uncovered a few bugs that are fixed here. Unfortunately, this is a fairly large PR, because some of the tested bugs and edge cases only go green with all of the fixes applied; most of the delta, though, is tests.

Please ignore the logging for now: once the EventLog PR is merged, I'm going to do a full logging + event-logging pass. The log statements here are just a few small breadcrumbs, not an exhaustive set.

What changed?

Generator

  • GeneratorTask now keeps running while paused (advances the HWM, updates FutureActionTimes, drops actions at the source). This is an improvement over V1.
  • New isManualOnly / isHeldOpen helpers carve out schedules that have no automated wakeup source (e.g., empty spec, manual-only, pending backfill) so the idle timer can't silently close them.
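A minimal sketch of how the two carve-outs above might compose. Only the helper names `isManualOnly` and `isHeldOpen` come from this PR; the `scheduleState` struct, its fields, and the exact conditions are hypothetical stand-ins inferred from the test descriptions, and the real signatures may differ.

```go
package main

import "fmt"

// scheduleState is a hypothetical, simplified view of a schedule.
// It is not the CHASM scheduler's actual state type.
type scheduleState struct {
	SpecEntries      int  // number of interval/calendar entries in the spec
	PendingBackfills int  // backfills that have not drained yet
	Paused           bool // schedule is paused
	Sentinel         bool // sentinel schedules always close (assumption)
}

// isManualOnly reports whether the schedule has no automated wakeup
// source. Assumption: an empty spec is what makes a schedule manual-only.
func isManualOnly(s scheduleState) bool {
	return s.SpecEntries == 0
}

// isHeldOpen reports whether the idle timer must not close the schedule.
// Assumption: paused, manual-only, and pending-backfill schedules are all
// held open, and sentinels are never held open.
func isHeldOpen(s scheduleState) bool {
	if s.Sentinel {
		return false
	}
	return s.Paused || isManualOnly(s) || s.PendingBackfills > 0
}

func main() {
	fmt.Println(isHeldOpen(scheduleState{SpecEntries: 0}))                      // true
	fmt.Println(isHeldOpen(scheduleState{SpecEntries: 1, PendingBackfills: 1})) // true
	fmt.Println(isHeldOpen(scheduleState{SpecEntries: 1}))                      // false
}
```

Keeping both predicates as pure functions of state would let the Generator and the idle task share one definition of "held open".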

Idle task

  • Reworked to use the new isManualOnly / isHeldOpen helpers, plus a .After() deadline compare so sub-precision drift no longer drops valid tasks.
  • Fixed a non-monotonicity bug in recentActions (a closing workflow could regress observed time; not user-visible in practice but cleaner).
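The deadline compare can be illustrated with a small sketch. It assumes the idle task carries the deadline it was scheduled for and is dropped only when the schedule's current deadline has moved later. The function names are hypothetical, the direction of the drift is an assumption, and the strict-equality variant is shown only for contrast, not as a claim about the previous code.

```go
package main

import (
	"fmt"
	"time"
)

// base is an arbitrary reference instant used by the example.
var base = time.Date(2026, 5, 23, 0, 0, 0, 0, time.UTC)

// idleTaskStillValid keeps the task unless the current deadline is
// strictly after the deadline the task was created for.
func idleTaskStillValid(taskDeadline, currentDeadline time.Time) bool {
	return !currentDeadline.After(taskDeadline)
}

// strictEqualValid is a hypothetical brittle alternative: any difference,
// however small, invalidates the task.
func strictEqualValid(taskDeadline, currentDeadline time.Time) bool {
	return currentDeadline.Equal(taskDeadline)
}

func main() {
	// The task timestamp lands 500ns after the recomputed deadline.
	drifted := base.Add(500 * time.Nanosecond)
	fmt.Println(idleTaskStillValid(drifted, base)) // true: drift tolerated
	fmt.Println(strictEqualValid(drifted, base))   // false: task dropped

	// A deadline that really moved later still invalidates the task.
	fmt.Println(idleTaskStillValid(base, base.Add(time.Hour))) // false
}
```

Note that an `.After()` compare of this shape only tolerates drift in one direction; an equal-time deadline still fires, matching TestIdleTask_Validate_ExpirationStableFires.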

Invoker

  • ExecuteTask captures the framework clock at read time and threads it through applyBackoff, so BackoffTime stamps stay consistent with LastProcessedTime (no more wall-clock drift in tests/replay).
  • ProcessBufferTask now checks the catchup window before consuming a LimitedActions slot.
  • recordExecuteResult is idempotent per RequestId: a duplicate CompletedStart or RetryableStart whose live entry already has RunId set is dropped (without stomping RunId/StartTime/HasCallback or rewinding Attempt/BackoffTime).
  • addTasks consolidated into a single-pass classifier; the broadened re-arm gate fixes the stranded-retry bug.
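The per-RequestId idempotency described above can be sketched as follows. This is an illustration under assumptions: `bufferedStart` is a simplified stand-in for the real buffered-start message, and `recordCompletedStart` is a hypothetical helper, not the PR's `recordExecuteResult`.

```go
package main

import "fmt"

// bufferedStart is a simplified stand-in for the real buffered start.
type bufferedStart struct {
	RequestID string
	RunID     string
	Attempt   int64
}

// recordCompletedStart applies a start result to the live entry with the
// same RequestID. It returns false when the result is dropped, either
// because the entry already has a RunID (duplicate) or because no live
// entry matches.
func recordCompletedStart(starts []*bufferedStart, requestID, runID string) bool {
	for _, s := range starts {
		if s.RequestID != requestID {
			continue
		}
		if s.RunID != "" {
			return false // already recorded; do not stomp the RunID
		}
		s.RunID = runID
		return true
	}
	return false
}

func main() {
	starts := []*bufferedStart{{RequestID: "req-1", Attempt: 1}}
	fmt.Println(recordCompletedStart(starts, "req-1", "run-a")) // true
	fmt.Println(recordCompletedStart(starts, "req-1", "run-b")) // false
	fmt.Println(starts[0].RunID)                                // run-a
}
```

The same early return would also protect fields such as Attempt and BackoffTime from being rewound by a late RetryableStart.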

Backfiller

  • hasMoreAllowAllBackfills was a misreading of the original code; it's gone. Any backfill (not just ALLOW_ALL) holds a paused schedule open while idling out.
  • On completion, the backfiller revives the Generator so it can re-evaluate idle/close eligibility.

Tests / infra

  • The schedule functional suite now runs in parallel (367s → 67s). Sorry to omnibus that in, but it was painful without it.

New metrics

All counters are tagged with namespace and schedule_backend via newTaggedMetricsHandler. Reason-tagged counters use a reason tag with limited cardinality.
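One way to keep a reason tag's cardinality bounded is to make the reason a closed set of typed constants. The sketch below uses a hypothetical in-memory counter, not Temporal's metrics handler or `newTaggedMetricsHandler`; the `"chasm"` backend value is likewise an assumption.

```go
package main

import "fmt"

// reason is a closed set of values, which bounds the tag's cardinality.
type reason string

const (
	reasonHeldOpen        reason = "held_open"
	reasonExpirationShift reason = "expiration_shift"
	reasonClosed          reason = "closed"
)

// counterKey identifies one time series: metric name plus tag values.
type counterKey struct {
	Name      string
	Namespace string
	Backend   string
	Reason    reason
}

// taggedCounters is a hypothetical stand-in for a metrics handler that
// always attaches namespace and schedule_backend.
type taggedCounters struct {
	namespace string
	backend   string
	counts    map[counterKey]int
}

func newTaggedCounters(namespace, backend string) *taggedCounters {
	return &taggedCounters{namespace: namespace, backend: backend, counts: map[counterKey]int{}}
}

func (c *taggedCounters) key(name string, r reason) counterKey {
	return counterKey{Name: name, Namespace: c.namespace, Backend: c.backend, Reason: r}
}

// inc bumps the counter for the given metric name and reason.
func (c *taggedCounters) inc(name string, r reason) { c.counts[c.key(name, r)]++ }

// get reads the counter for the given metric name and reason.
func (c *taggedCounters) get(name string, r reason) int { return c.counts[c.key(name, r)] }

func main() {
	m := newTaggedCounters("default", "chasm")
	m.inc("schedule_idle_task_invalidated", reasonHeldOpen)
	m.inc("schedule_idle_task_invalidated", reasonHeldOpen)
	fmt.Println(m.get("schedule_idle_task_invalidated", reasonHeldOpen)) // prints 2
}
```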

Generator

| Metric | Purpose |
| --- | --- |
| schedule_generator_ticks | Every Generator fire (baseline for paused-vs-active attribution) |
| schedule_generator_paused_ticks | Generator fires while paused (HWM advanced, no actions buffered) |
| scheduler_generator_loop_completed | Generator stopped rescheduling without arming idle; held open for an external trigger |

Idle

| Metric | Purpose |
| --- | --- |
| schedule_idle_task_fired | Idle task fired and closed the schedule |
| schedule_idle_task_invalidated | Idle task dropped, tagged reason: held_open / expiration_shift / closed |

Invoker

| Metric | Purpose |
| --- | --- |
| schedule_invoker_process_buffer_fired | Each ProcessBuffer execute |
| schedule_invoker_process_buffer_invalidated | ProcessBuffer dropped by Validate, reason: stale_hwm |
| schedule_invoker_execute_fired | Each Execute side-effect task |
| schedule_invoker_execute_invalidated | Execute work dropped, reason: no_work (Validate) or already_recorded (concurrent ExecuteTask already wrote RunId) |
| schedule_buffered_start_dropped | Buffered start dropped, reason: missed_catchup_window or paused_or_limited |

Backfiller

| Metric | Purpose |
| --- | --- |
| schedule_backfiller_fired | Each Backfiller execute |
| schedule_backfiller_invalidated | Backfiller dropped by Validate, reason: stale_hwm |
| schedule_backfiller_completed | Backfiller drained and deleted itself (end-to-end lifecycle signal) |

New tests

Unit tests (chasm/lib/scheduler/)

Idle (scheduler_idle_tasks_test.go)

  • TestIdleTask_Validate_ManualOnlyHeldOpen — empty-spec held open instead of auto-closed.
  • TestIdleTask_Validate_HasBackfillHeldOpen — any pending backfill blocks close.
  • TestIdleTask_Validate_ExpirationShiftedLater — deadline moved later drops the task.
  • TestIdleTask_Validate_ExpirationStableFires — equal-time deadline still fires.
  • TestIdleTask_Validate_SentinelNotHeldOpen — sentinels close regardless of otherwise-blocking state.
  • TestIdleTask_Validate_MetricReasons — each invalidation path emits the correct reason tag.

Generator (generator_tasks_test.go)

  • TestGeneratorTask_PausedDropsActionsAdvancesHWM — paused fire produces no starts but advances HWM.

Invoker ExecuteTask (invoker_execute_task_test.go)

  • TestExecuteTask_Validate — table of 6 cases (empty / eligible / running / backing-off / cancel-pending / terminate-pending).
  • TestExecuteTask_Validate_BackoffEqualToLPTIsEligible — boundary regression for the .Before vs. !.After comparison.
  • TestExecuteTask_BackoffUsesFrameworkClock — advances TimeSource by 24h; asserts BackoffTime tracks the framework clock.
  • TestExecuteTask_RecordResultIdempotentOnRace — concurrent CompletedStart for an already-running entry: no double-count, no stomp of RunId/StartTime/HasCallback.
  • TestExecuteTask_RecordResultIdempotentOnRetryableRace — concurrent RetryableStart for an already-running entry: Attempt/BackoffTime are not rewound.
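The framework-clock test above boils down to reading the clock once and passing that value along. A minimal sketch, assuming a fake clock and a linear backoff policy; both are hypothetical, and the real `applyBackoff` and TimeSource types will differ.

```go
package main

import (
	"fmt"
	"time"
)

// base is an arbitrary reference instant used by the example.
var base = time.Date(2026, 5, 23, 0, 0, 0, 0, time.UTC)

// fakeClock is a hypothetical controllable time source for tests.
type fakeClock struct{ now time.Time }

func (c *fakeClock) Now() time.Time          { return c.now }
func (c *fakeClock) Advance(d time.Duration) { c.now = c.now.Add(d) }

// retryState is a simplified stand-in for a buffered start's retry fields.
type retryState struct {
	Attempt           int64
	BackoffTime       time.Time
	LastProcessedTime time.Time
}

// applyBackoff stamps BackoffTime relative to the supplied instant rather
// than calling time.Now(). The linear delay is a placeholder policy.
func applyBackoff(s *retryState, now time.Time) {
	s.Attempt++
	s.BackoffTime = now.Add(time.Duration(s.Attempt) * time.Second)
}

// execute captures the clock once and threads it through, so both stamps
// derive from the same reading.
func execute(clock *fakeClock, s *retryState) {
	now := clock.Now()
	s.LastProcessedTime = now
	applyBackoff(s, now)
}

func main() {
	clock := &fakeClock{now: base}
	clock.Advance(24 * time.Hour)
	s := &retryState{}
	execute(clock, s)
	fmt.Println(s.BackoffTime.Sub(s.LastProcessedTime)) // prints 1s
}
```

Had `applyBackoff` read the wall clock instead, the difference printed here would depend on when the test ran.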

Invoker ProcessBufferTask (invoker_process_buffer_task_test.go)

  • TestProcessBufferTask_Validate — table of 5 cases covering validateTaskHighWaterMark branches.
  • TestProcessBufferTask_MissedCatchupPreservesRemainingActions — regression test for a case that is otherwise not reachable via the public API.
  • TestProcessBufferTask_PausedDropsAutomatedKeepsManual — paused schedule drops automated start, promotes manual to Attempt=1.
  • TestProcessBufferTask_RearmsBackedOffRetry — no-op ProcessBuffer with an eligible backed-off start still emits an ExecuteTask.
  • TestInvoker_AddTasks_AllDeferredEmitsNothing — buffer of only deferred starts emits no ProcessBufferTask.
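The ordering that TestProcessBufferTask_MissedCatchupPreservesRemainingActions pins down can be sketched as below: check the catchup window first, and only then spend a limited-actions slot. The `actionLimit` type and `admitStart` function are hypothetical; the reason strings are the tag values listed under schedule_buffered_start_dropped. The sketch covers automated starts only.

```go
package main

import (
	"fmt"
	"time"
)

// base is an arbitrary reference instant used by the example.
var base = time.Date(2026, 5, 23, 0, 0, 0, 0, time.UTC)

// actionLimit is a simplified stand-in for the LimitedActions state.
type actionLimit struct {
	Limited   bool
	Remaining int64
}

// admitStart decides whether an automated start may run. The catchup
// window is checked before a slot is consumed, so a start that is too old
// does not reduce Remaining.
func admitStart(nominal, now time.Time, catchupWindow time.Duration, lim *actionLimit) (bool, string) {
	if now.Sub(nominal) > catchupWindow {
		return false, "missed_catchup_window"
	}
	if lim.Limited {
		if lim.Remaining <= 0 {
			return false, "paused_or_limited"
		}
		lim.Remaining--
	}
	return true, ""
}

func main() {
	lim := &actionLimit{Limited: true, Remaining: 1}
	now := base.Add(2 * time.Minute)

	ok, why := admitStart(base, now, time.Minute, lim)
	fmt.Printf("%v %q %d\n", ok, why, lim.Remaining) // false "missed_catchup_window" 1

	ok, why = admitStart(now, now, time.Minute, lim)
	fmt.Printf("%v %q %d\n", ok, why, lim.Remaining) // true "" 0
}
```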

Nexus completion (scheduler_nexus_completion_test.go)

  • TestHandleNexusCompletion_ReenablesDeferredStarts — full defer → re-enable → promote cascade on completion.

Functional tests (tests/schedule_test.go)

Shared (run against both V1 and CHASM backends)

  • testBufferOneDeferredFiresAfterCompletion — BUFFER_ONE deferred start fires when the running workflow completes.
  • testTriggerImmediatelyOnPausedSchedule — manual trigger fires despite pause.
  • testTriggerImmediatelyAfterActionsExhausted — manual trigger fires despite RemainingActions=0.
  • testBackfillWithSkipOverlap — SKIP collapses a 5-tick backfill to a single execution.
  • testBackfillWithBufferOneOverlap (skipped) — pins expected BUFFER_ONE behavior; currently fails on V1 and CHASM (deferred starts not re-enabling for backfill-Manual entries). Worth a separate investigation.
  • testMultiRangeBackfillCountedExactlyOnce — concurrent BackfillerTasks don't inflate ActionCount (mirrors the upstream Go SDK feature test).
  • testBackfillRangeSmallerThanInterval — narrow backfill produces no actions and drains cleanly.
  • testUpdateScheduleRequestIDTooLong / testUpdateScheduleBlobSizeLimit — input validation.

CHASM-only

  • testPausedDropsCatchup — pause-window ticks don't replay on unpause.
  • testPausedScheduleNeverIdles — paused schedules survive past IdleTime.
  • testPausedEmptySpecStaysOpen — empty-spec paused schedule survives past IdleTime.
  • testManualOnlyUnpausedStaysOpen — manual-only carve-out: unpaused empty-spec stays open past IdleTime.
  • testScheduleClosesFromIdle — parameterized over 4 spec shapes (single-date, multi-date, limited-actions, interval-end-time); each closes from idle.
  • testPauseDuringIdleWindow — held_open invalidation + re-arm on unpause.
  • testUpdateDuringIdleWindow — expiration_shift invalidation + re-arm at the new deadline.
  • testBackfillBlocksIdleClose — pending backfill blocks idle close; closes when backfill drains.
  • testBackfillOnPausedSchedule — backfill against a paused schedule still fires.
  • testRecentActionsAdvanceWhilePaused — workflow status transitions RUNNING → COMPLETED in ListSchedules during pause.
  • testFutureActionTimesAdvanceWhilePaused — projected times re-publish while paused.

Potential risks

This changes a substantial amount of scheduler surface area, but the bulk of the diff is tests and the fixes are individually validated.

Comment on lines +267 to 284
for _, start := range i.GetBufferedStarts() {
	if start.GetRunId() != "" || start.GetCompleted() != nil || start.GetAttempt() < 0 {
		continue
	}
	switch {
	case start.GetAttempt() == 0:
		pending++
		hasUnprocessed = true
	case start.GetBackoffTime().AsTime().After(lastProcessed):
		pending++
		backoff := start.GetBackoffTime().AsTime()
		if nextDeadline.IsZero() || backoff.Before(nextDeadline) {
			nextDeadline = backoff
		}
	default:
		eligible++
	}
}

This isn't a bug fix, but it avoids an unnecessary transaction/task execution. Previously, every deferred start would get a task added, and that task would then cancel itself out when it ran. Here, the gate to add a task is just a bit more careful.
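For readers who want to run the loop above in isolation, here is a self-contained sketch of the same single-pass classification over plain structs instead of the generated protobuf getters. The `bufferedStart` and `tally` types are hypothetical, and treating `Attempt < 0` as "deferred" is an inference from the surrounding discussion, not something the diff states.

```go
package main

import (
	"fmt"
	"time"
)

// base is an arbitrary reference instant used by the example.
var base = time.Date(2026, 5, 23, 0, 0, 0, 0, time.UTC)

// bufferedStart is a simplified stand-in for the real buffered start.
type bufferedStart struct {
	RunID       string
	Completed   bool
	Attempt     int64 // < 0 is assumed to mean "deferred"
	BackoffTime time.Time
}

// tally collects what the classifier learned in one pass.
type tally struct {
	Pending        int
	Eligible       int
	HasUnprocessed bool
	NextDeadline   time.Time
}

// classify mirrors the loop in the diff: running, completed, and deferred
// starts are skipped; the rest are unprocessed, backing off, or eligible.
func classify(starts []bufferedStart, lastProcessed time.Time) tally {
	var out tally
	for _, s := range starts {
		if s.RunID != "" || s.Completed || s.Attempt < 0 {
			continue
		}
		switch {
		case s.Attempt == 0:
			out.Pending++
			out.HasUnprocessed = true
		case s.BackoffTime.After(lastProcessed):
			out.Pending++
			if out.NextDeadline.IsZero() || s.BackoffTime.Before(out.NextDeadline) {
				out.NextDeadline = s.BackoffTime
			}
		default:
			out.Eligible++
		}
	}
	return out
}

func main() {
	starts := []bufferedStart{
		{Attempt: -1},                                    // deferred: skipped
		{Attempt: 0},                                     // unprocessed
		{Attempt: 2, BackoffTime: base.Add(time.Minute)}, // still backing off
		{Attempt: 1, BackoffTime: base},                  // backoff == lastProcessed: eligible
		{RunID: "run-a", Attempt: 1},                     // running: skipped
	}
	got := classify(starts, base)
	fmt.Println(got.Pending, got.Eligible, got.HasUnprocessed,
		got.NextDeadline.Equal(base.Add(time.Minute))) // prints 2 1 true true
}
```

A buffer holding only deferred starts yields an all-zero tally, so a caller gating on these fields has nothing to schedule.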

lina-temporal marked this pull request as ready for review May 23, 2026 03:08
lina-temporal requested review from a team as code owners May 23, 2026 03:08