Skip to content

feat: add failed_retryable state and recovery loop#50

Draft
aparajon wants to merge 8 commits intomainfrom
armand/retryable
Draft

feat: add failed_retryable state and recovery loop#50
aparajon wants to merge 8 commits intomainfrom
armand/retryable

Conversation

@aparajon
Copy link
Copy Markdown
Collaborator

@aparajon aparajon commented Apr 28, 2026

Summary

Foundation for automatic retry of transient apply failures at scale

State changes:

  • Add Apply.FailedRetryable and Task.FailedRetryable states
  • FailedRetryable is NOT terminal — recovery loop re-dispatches
  • DeriveApplyState: permanent Failed overrides FailedRetryable
  • Schema: attempt column on applies and tasks tables

Error classification:

  • engine.RetryableError type wraps transient failures
  • engine.IsRetryable() checks error chain
  • PlanetScale engine: snapshot, network, rate limit errors → retryable
  • failApplyWithTasks uses IsRetryable to choose state

Recovery loop:

  • StartRecoveryLoop polls every 30s for failed_retryable applies
  • ClaimForRecovery uses FOR UPDATE SKIP LOCKED for safe multi-pod concurrency
  • Increments attempt counter on each claim
  • Max 5 recovery attempts before transitioning to permanent failed
  • Re-dispatches via executeApplyAtomic with resume state

Still needed:

  • Integration test: verify recovery loop picks up and completes a retryable apply
  • Spirit engine: classify binlog/lock errors as retryable
  • CLI UX: show "Failed (retrying...)" for failed_retryable state
  • Wire StartRecoveryLoop into server startup

🤖 Generated with Claude Code

aparajon and others added 4 commits April 27, 2026 23:12
State changes:
- Add Apply.FailedRetryable and Task.FailedRetryable states
- DeriveApplyState: FailedRetryable propagates from tasks to apply
  (lower priority than permanent Failed)
- IsTerminalApplyState/IsTerminalTaskState: FailedRetryable is NOT
  terminal — recovery loop can re-dispatch
- NormalizeTaskStatus/normalizeApplyState: handle FAILED_RETRYABLE

Schema changes:
- Add attempt column to applies and tasks tables (INT UNSIGNED DEFAULT 0)
- Self-bootstrapping via EnsureSchema picks up the new column on deploy

Engine:
- Add RetryableError type for engines to wrap transient failures
- Add IsRetryable() helper for the apply orchestration to classify errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Unit tests for DeriveApplyState with FailedRetryable (priority,
  normalization, permanent Failed overrides FailedRetryable)
- Unit tests for IsRetryable and RetryableError unwrapping
- IsTerminalApplyState test for FailedRetryable (not terminal)
- README.md updated with FailedRetryable in state tables, diagrams,
  and derivation rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add attempt to applyColumns/taskColumns and scan functions
- Update applies.Update to persist attempt counter
- ClaimForRecovery now picks up failed_retryable applies (attempt < 5)
  in addition to stale active applies
- Increment attempt on claim so each recovery try is tracked
- Add maxRecoveryAttempts constant (5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- failApplyWithTasks classifies errors via IsRetryable to decide
  between failed vs failed_retryable state
- PlanetScale engine wraps transient errors (snapshot, network, rate
  limit) as RetryableError when engine retries are exhausted
- isRetryableEngineError classifies known transient PlanetScale errors
- Recovery loop (StartRecoveryLoop) polls for failed_retryable applies
  via ClaimForRecovery and re-dispatches them through executeApplyAtomic
- parseOptionsMap converts stored options back to string map for re-dispatch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 28, 2026 13:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a failed_retryable execution state and foundational plumbing to automatically re-dispatch transient schema-change failures, including attempt tracking and retryability classification at the engine and orchestration layers.

Changes:

  • Introduces Apply.FailedRetryable / Task.FailedRetryable states and updates state derivation/normalization.
  • Adds attempt columns/fields and extends MySQL storage scanning/updating + recovery claiming logic.
  • Adds engine.RetryableError + engine.IsRetryable() and PlanetScale retryable error wrapping; introduces a LocalClient recovery polling loop.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/tern/local_client.go Adds LocalClient recovery loop to claim and re-dispatch retryable applies.
pkg/tern/local_apply.go Marks apply/tasks as failed vs failed_retryable based on engine retryability.
pkg/storage/types.go Adds Attempt fields to Apply and Task models.
pkg/storage/mysqlstore/applies.go Adds attempt to applies selection/update, introduces max attempt constant, extends ClaimForRecovery selection and increments attempt.
pkg/storage/mysqlstore/tasks.go Adds attempt to tasks selection/scanning.
pkg/state/apply.go Adds FailedRetryable apply state and derivation/normalization support.
pkg/state/task.go Adds FailedRetryable task state and normalization support.
pkg/state/apply_test.go Adds tests for apply state derivation/terminal/normalization around failed_retryable.
pkg/state/README.md Documents the new failed_retryable state and recovery transitions.
pkg/schema/mysql/applies.sql Adds attempt column to applies schema.
pkg/schema/mysql/tasks.sql Adds attempt column to tasks schema.
pkg/engine/engine.go Introduces RetryableError, NewRetryableError, and IsRetryable.
pkg/engine/engine_test.go Adds unit tests for retryable error detection and unwrap behavior.
pkg/engine/planetscale/planetscale.go Wraps exhausted transient engine failures as RetryableError to trigger failed_retryable state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/storage/mysqlstore/applies.go Outdated
Comment thread pkg/storage/mysqlstore/tasks.go
Comment thread pkg/tern/local_apply.go Outdated
Comment thread pkg/tern/local_client.go
Comment thread pkg/tern/local_client.go Outdated
Comment thread pkg/tern/local_client.go Outdated
aparajon and others added 4 commits April 28, 2026 10:03
Tests the full recovery flow:
1. Creates a plan (CREATE TABLE) via API
2. Inserts an apply record directly in failed_retryable state
   (simulating a transient engine failure)
3. Starts the recovery loop with a fast poll interval
4. Verifies the recovery loop claims the apply, re-dispatches it,
   and the apply completes with the table created in MySQL

Also adds SetRecoveryInterval for test control of poll frequency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace string matching with ps.Error.Code checks:
- ErrRetry (422), ErrInternal (500), ErrResponseMalformed → retryable
- ErrNotFound, ErrPermission, ErrInvalid → permanent
- Fall through to message matching for non-SDK errors (snapshot,
  network, context deadline)

Extract isTransientNetworkError for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reconcile with existing error code system:
- engine_error: permanent engine failure
- engine_error_retryable: transient failure, recovery loop will retry
- deriveErrorCode returns the retryable code for failed_retryable state
- IsRetryableErrorCode includes the new code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. ClaimForRecovery only claims failed_retryable applies (not stale
   active applies which are handled by heartbeat/conflict detection)
2. Verify plan/tasks load BEFORE transitioning to running — on failure
   mark permanently failed instead of leaving stuck in running
3. Don't set CompletedAt on retryable task failures (cleared on retry)
4. Reset task states and clear error messages on recovery retry
5. Protect cancelRecovery with mutex, make StartRecoveryLoop idempotent
6. Add ExpireRetryable to transition exhausted retries to permanent failed
7. Add failRecovery helper for clean permanent failure on recovery errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aparajon aparajon changed the title Add failed_retryable state and recovery loop feat: add failed_retryable state and recovery loop Apr 28, 2026
aparajon added a commit that referenced this pull request Apr 28, 2026
Replace string-matching isDeployValidating with isRetryablePSError
that uses the PlanetScale SDK's typed error codes (ps.ErrRetry for
422, ps.ErrInternal for 500, ps.ErrResponseMalformed for parse
failures). Falls back to isSnapshotInProgress for errors outside
the SDK. Aligns with the error classification in PR #50.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants