feat: add failed_retryable state and recovery loop#50
Draft
Conversation
State changes: - Add Apply.FailedRetryable and Task.FailedRetryable states - DeriveApplyState: FailedRetryable propagates from tasks to apply (lower priority than permanent Failed) - IsTerminalApplyState/IsTerminalTaskState: FailedRetryable is NOT terminal — recovery loop can re-dispatch - NormalizeTaskStatus/normalizeApplyState: handle FAILED_RETRYABLE Schema changes: - Add attempt column to applies and tasks tables (INT UNSIGNED DEFAULT 0) - Self-bootstrapping via EnsureSchema picks up the new column on deploy Engine: - Add RetryableError type for engines to wrap transient failures - Add IsRetryable() helper for the apply orchestration to classify errors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Unit tests for DeriveApplyState with FailedRetryable (priority, normalization, permanent Failed overrides FailedRetryable) - Unit tests for IsRetryable and RetryableError unwrapping - IsTerminalApplyState test for FailedRetryable (not terminal) - README.md updated with FailedRetryable in state tables, diagrams, and derivation rules Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add attempt to applyColumns/taskColumns and scan functions - Update applies.Update to persist attempt counter - ClaimForRecovery now picks up failed_retryable applies (attempt < 5) in addition to stale active applies - Increment attempt on claim so each recovery try is tracked - Add maxRecoveryAttempts constant (5) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- failApplyWithTasks classifies errors via IsRetryable to decide between failed vs failed_retryable state - PlanetScale engine wraps transient errors (snapshot, network, rate limit) as RetryableError when engine retries are exhausted - isRetryableEngineError classifies known transient PlanetScale errors - Recovery loop (StartRecoveryLoop) polls for failed_retryable applies via ClaimForRecovery and re-dispatches them through executeApplyAtomic - parseOptionsMap converts stored options back to string map for re-dispatch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a failed_retryable execution state and foundational plumbing to automatically re-dispatch transient schema-change failures, including attempt tracking and retryability classification at the engine and orchestration layers.
Changes:
- Introduces
Apply.FailedRetryable/Task.FailedRetryablestates and updates state derivation/normalization. - Adds
attemptcolumns/fields and extends MySQL storage scanning/updating + recovery claiming logic. - Adds
engine.RetryableError+engine.IsRetryable()and PlanetScale retryable error wrapping; introduces a LocalClient recovery polling loop.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/tern/local_client.go | Adds LocalClient recovery loop to claim and re-dispatch retryable applies. |
| pkg/tern/local_apply.go | Marks apply/tasks as failed vs failed_retryable based on engine retryability. |
| pkg/storage/types.go | Adds Attempt fields to Apply and Task models. |
| pkg/storage/mysqlstore/applies.go | Adds attempt to applies selection/update, introduces max attempt constant, extends ClaimForRecovery selection and increments attempt. |
| pkg/storage/mysqlstore/tasks.go | Adds attempt to tasks selection/scanning. |
| pkg/state/apply.go | Adds FailedRetryable apply state and derivation/normalization support. |
| pkg/state/task.go | Adds FailedRetryable task state and normalization support. |
| pkg/state/apply_test.go | Adds tests for apply state derivation/terminal/normalization around failed_retryable. |
| pkg/state/README.md | Documents the new failed_retryable state and recovery transitions. |
| pkg/schema/mysql/applies.sql | Adds attempt column to applies schema. |
| pkg/schema/mysql/tasks.sql | Adds attempt column to tasks schema. |
| pkg/engine/engine.go | Introduces RetryableError, NewRetryableError, and IsRetryable. |
| pkg/engine/engine_test.go | Adds unit tests for retryable error detection and unwrap behavior. |
| pkg/engine/planetscale/planetscale.go | Wraps exhausted transient engine failures as RetryableError to trigger failed_retryable state. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Tests the full recovery flow: 1. Creates a plan (CREATE TABLE) via API 2. Inserts an apply record directly in failed_retryable state (simulating a transient engine failure) 3. Starts the recovery loop with a fast poll interval 4. Verifies the recovery loop claims the apply, re-dispatches it, and the apply completes with the table created in MySQL Also adds SetRecoveryInterval for test control of poll frequency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace string matching with ps.Error.Code checks: - ErrRetry (422), ErrInternal (500), ErrResponseMalformed → retryable - ErrNotFound, ErrPermission, ErrInvalid → permanent - Fall through to message matching for non-SDK errors (snapshot, network, context deadline) Extract isTransientNetworkError for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reconcile with existing error code system: - engine_error: permanent engine failure - engine_error_retryable: transient failure, recovery loop will retry - deriveErrorCode returns the retryable code for failed_retryable state - IsRetryableErrorCode includes the new code Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. ClaimForRecovery only claims failed_retryable applies (not stale active applies which are handled by heartbeat/conflict detection) 2. Verify plan/tasks load BEFORE transitioning to running — on failure mark permanently failed instead of leaving stuck in running 3. Don't set CompletedAt on retryable task failures (cleared on retry) 4. Reset task states and clear error messages on recovery retry 5. Protect cancelRecovery with mutex, make StartRecoveryLoop idempotent 6. Add ExpireRetryable to transition exhausted retries to permanent failed 7. Add failRecovery helper for clean permanent failure on recovery errors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aparajon
added a commit
that referenced
this pull request
Apr 28, 2026
Replace string-matching isDeployValidating with isRetryablePSError that uses the PlanetScale SDK's typed error codes (ps.ErrRetry for 422, ps.ErrInternal for 500, ps.ErrResponseMalformed for parse failures). Falls back to isSnapshotInProgress for errors outside the SDK. Aligns with the error classification in PR #50. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Foundation for automatic retry of transient apply failures at scale
State changes:
Apply.FailedRetryableandTask.FailedRetryablestatesFailedRetryableis NOT terminal — recovery loop re-dispatchesDeriveApplyState: permanentFailedoverridesFailedRetryableattemptcolumn on applies and tasks tablesError classification:
engine.RetryableErrortype wraps transient failuresengine.IsRetryable()checks error chainfailApplyWithTasksusesIsRetryableto choose stateRecovery loop:
StartRecoveryLooppolls every 30s forfailed_retryableappliesClaimForRecoveryusesFOR UPDATE SKIP LOCKEDfor safe multi-pod concurrencyfailedexecuteApplyAtomicwith resume stateStill needed:
failed_retryablestateStartRecoveryLoopinto server startup🤖 Generated with Claude Code