direct: Fix WAL corruption after two consecutive failed deploys#5598

Open
denik wants to merge 4 commits into main from denik/wal-bug

@denik denik commented Jun 14, 2026

Contributor

Changes

Two failed deploys in a row left the direct-engine state WAL with a serial
ahead of the committed state, after which every bundle command failed WAL
recovery (WAL serial (N) is ahead of expected) until the WAL was deleted by
hand.

  • Don't open the WAL for write when planning already failed, so a failed plan
    no longer leaves a header-only WAL behind.
  • Recovering a header-only WAL no longer advances the serial, so a crash
    between UpgradeToWrite and Finalize can't wedge later deploys either.
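
The second change can be pictured with a small model. This is a minimal sketch, not the code in bundle/direct/dstate/state.go: the types `walFile` and `committedState`, the function `recoverWAL`, the rule that a WAL may be at most one serial ahead of the committed state, and the error text are all assumptions made for illustration.

```go
package main

import "fmt"

// walFile is a hypothetical model of a write-ahead log: a header carrying the
// serial the interrupted deploy meant to commit, plus zero or more entries.
type walFile struct {
	Serial  int
	Entries []string
}

// committedState is a hypothetical model of the committed local state.
type committedState struct {
	Serial    int
	Resources []string
}

// recoverWAL folds a leftover WAL into the committed state.
// Assumption: a valid WAL is at most one serial ahead of the committed state.
func recoverWAL(s committedState, w *walFile) (committedState, error) {
	if w == nil {
		return s, nil // nothing to recover
	}
	if w.Serial > s.Serial+1 {
		// Illustrative message only; not the CLI's real error text.
		return s, fmt.Errorf("wal serial %d is ahead of expected %d", w.Serial, s.Serial+1)
	}
	if len(w.Entries) == 0 {
		// Header-only WAL: no change was recorded, so keep the serial as is.
		return s, nil
	}
	// Copy before appending so the caller's slice is never modified.
	merged := append(append([]string{}, s.Resources...), w.Entries...)
	return committedState{Serial: w.Serial, Resources: merged}, nil
}

func main() {
	s := committedState{Serial: 3, Resources: []string{"job.a"}}

	// Two crashes in a row, each leaving only a header behind.
	for i := 0; i < 2; i++ {
		var err error
		s, err = recoverWAL(s, &walFile{Serial: s.Serial + 1})
		if err != nil {
			panic(err)
		}
	}
	fmt.Println(s.Serial) // prints 3

	// A WAL with entries still advances the serial.
	s, _ = recoverWAL(s, &walFile{Serial: 4, Entries: []string{"job.b"}})
	fmt.Println(s.Serial, len(s.Resources)) // prints 4 2
}
```

In this model, if the header-only branch also set the serial to the WAL's serial, each recovered crash would move the committed serial forward with no matching change recorded, which is the kind of drift the fix is meant to rule out.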

Why

Previously, an error during plan or deploy could leave the bundle locally undeployable until the .wal file was removed manually.

Tests

  • bundle/deploy/wal/two-failed-deploys: two plan failures (injected fault) no
    longer leave a WAL; the next deploy succeeds.
  • bundle/deploy/wal/two-crashed-deploys: two deploys killed mid-apply recover
    without wedging.
  • Unit test: TestHeaderOnlyWALRecoveryDoesNotAdvanceSerial.

Each test was confirmed to fail when its corresponding fix is reverted.

This pull request and its description were written by Isaac.

github-actions Bot commented Jun 14, 2026

Contributor

Approval status: pending

/acceptance/bundle/ - needs approval

10 files changed
Suggested: @janniklasrose
Also eligible: @pietern, @shreyas-goenka, @lennartkats-db, @andrewnester, @anton-107

/bundle/ - needs approval

Files: bundle/direct/dstate/state.go, bundle/direct/dstate/state_test.go, bundle/phases/deploy.go
Suggested: @janniklasrose
Also eligible: @pietern, @shreyas-goenka, @lennartkats-db, @andrewnester, @anton-107

General files (require maintainer)

Files: NEXT_CHANGELOG.md
Based on git history:

  • @pietern -- recent work in ./, bundle/direct/dstate/, bundle/phases/

Any maintainer (@andrewnester, @anton-107, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

denik added 3 commits June 14, 2026 15:38
Two consecutive failed deploys left the local state WAL with a serial
ahead of the committed state, after which every bundle command failed
WAL recovery until the WAL was deleted by hand.

- Don't open the WAL for write when planning already failed, so a failed
  plan no longer leaves a header-only WAL behind.
- Don't advance the serial when recovering a header-only WAL, so a crash
  between UpgradeToWrite and Finalize can't wedge later deploys.

Co-authored-by: Isaac

InitForApply receives ctx and could log a diagnostic without returning an
error, so the call site cannot prove it never will. Re-check logdiag before
deploying. UpgradeToWrite takes no ctx and thus cannot log, so the earlier
check alone is enough to guard opening the WAL.
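
The ordering described here can be sketched as follows. This is a hypothetical model, not the code in bundle/phases/deploy.go: `deployRun`, its fields, and the lowercase methods are illustrative names, the error counter stands in for logdiag (whose real API is not shown in this PR), and placing the write upgrade before the init step is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// deployRun is a hypothetical model of one deploy attempt.
type deployRun struct {
	planFails bool // planning logs an error diagnostic
	initFails bool // the init step logs an error diagnostic without returning an error
	errCount  int  // stand-in for the logged-error counter
	walOpened bool // true once the WAL has been opened for write
	applied   bool // true once the apply step has run
}

func (d *deployRun) plan() {
	if d.planFails {
		d.errCount++
	}
}

// upgradeToWrite takes no context, so in this model it cannot log anything.
func (d *deployRun) upgradeToWrite() { d.walOpened = true }

func (d *deployRun) initForApply() {
	if d.initFails {
		d.errCount++
	}
}

func (d *deployRun) deploy() error {
	d.plan()
	if d.errCount > 0 {
		return errors.New("planning failed") // the WAL is never opened
	}
	d.upgradeToWrite()
	d.initForApply()
	if d.errCount > 0 {
		return errors.New("init failed") // re-check before applying
	}
	d.applied = true
	return nil
}

func main() {
	failedPlan := &deployRun{planFails: true}
	fmt.Println(failedPlan.deploy(), failedPlan.walOpened) // prints planning failed false

	ok := &deployRun{}
	fmt.Println(ok.deploy(), ok.walOpened, ok.applied) // prints <nil> true true
}
```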

Co-authored-by: Isaac

Drop the hand-written resources.json.tmpl so the test no longer depends on
the internal state-file format. Deploy the job normally, then inject a fault
on the plan-stage refresh GET so the next two deploys fail while planning and
the last one recovers.

Co-authored-by: Isaac
@denik denik temporarily deployed to test-trigger-is June 14, 2026 22:40 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 14, 2026 22:48 — with GitHub Actions Inactive
@eng-dev-ecosystem-bot

Collaborator

Integration test report

Commit: 9d377c6

Run: 27514518440

| Env | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- | --- |
| 💚 aws linux | 7 | 15 | 264 | 979 | 7:48 |
| 💚 aws windows | 7 | 15 | 266 | 977 | 11:59 |
| 💚 aws-ucws linux | 7 | 15 | 360 | 893 | 7:18 |
| 💚 aws-ucws windows | 7 | 15 | 362 | 891 | 12:13 |
| 💚 azure linux | 1 | 17 | 267 | 977 | 8:08 |
| 💚 azure windows | 1 | 17 | 269 | 975 | 11:12 |
| 💚 azure-ucws linux | 1 | 17 | 365 | 889 | 7:06 |
| 💚 azure-ucws windows | 1 | 17 | 367 | 887 | 12:27 |
| 💚 gcp linux | 1 | 17 | 263 | 980 | 7:40 |
| 💚 gcp windows | 1 | 17 | 265 | 978 | 13:20 |

22 interesting tests: 15 SKIP, 7 RECOVERED
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
💚​ TestAccept 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/grants/select 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
Top 30 slowest tests (at least 2 minutes):
duration env testname
6:16 aws windows TestAccept
6:07 azure windows TestAccept
6:05 gcp windows TestAccept
5:58 aws-ucws windows TestAccept
5:57 azure-ucws windows TestAccept
4:48 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:17 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:07 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:55 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:22 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:20 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:15 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:12 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:12 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:11 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:59 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:58 aws linux TestAccept
2:54 azure linux TestAccept
2:51 gcp linux TestAccept
2:48 aws-ucws linux TestAccept
2:46 azure-ucws linux TestAccept
2:43 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:43 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:43 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:42 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:38 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:28 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:28 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:25 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:23 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

2 participants