🔧 chore(ci): harden E2E/CI against transient flakes #436

Merged: scttbnsn merged 4 commits into main from ci/harden-flakes on Jun 15, 2026

Conversation

@scttbnsn

Contributor

Why

The rc.36 release cut was blocked by a CI flake: the E2E bootstrap failed with "drydock only has 6/7 ready containers after 90s". This was a registry-latency race, not a code bug (the same tree passed minutes earlier on the PR). A CI-audit workflow root-caused the failure and surfaced related latent flake sources. This PR hardens them at the source so transient failures stop blocking releases.

Changes

Kills the confirmed 6/7-in-90s race:

  • scripts/start-drydock.sh: discovery wait 90s→150s, DD_WATCHER_LOCAL_JITTER=0 (removes up-to-60s watcher jitter), DD_READY_TOLERANCE=1 (a single slow registry call no longer hard-fails the gate; overridable to 0)
  • app/registries/token-bucket.ts: Docker Hub + GHCR burst 5→10 (aligns to the default used by every other registry; absorbs the concurrent cold-start scan wave that was being throttled); token-bucket tests updated
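
To illustrate why a larger burst absorbs a cold-start wave, here is a minimal token-bucket sketch. It is not the code in app/registries/token-bucket.ts: the `TokenBucket` class, the `admittedAtOnce` helper, the refill rate of 1 token/s, and the wave size of 7 are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch of a token bucket; names and rates are illustrative only.
class TokenBucket {
  private readonly burst: number;        // bucket capacity (max tokens)
  private readonly refillPerSec: number; // steady-state refill rate (assumed)
  private tokens: number;
  private lastRefillMs: number;

  constructor(burst: number, refillPerSec: number, nowMs: number = 0) {
    this.burst = burst;
    this.refillPerSec = refillPerSec;
    this.tokens = burst; // start full so a cold start can spend the whole burst
    this.lastRefillMs = nowMs;
  }

  // Try to take one token at time nowMs; returns false when throttled.
  tryAcquire(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Count how many of `wave` simultaneous requests pass without waiting.
function admittedAtOnce(burst: number, wave: number): number {
  const bucket = new TokenBucket(burst, 1);
  let admitted = 0;
  for (let i = 0; i < wave; i++) {
    if (bucket.tryAcquire(0)) admitted++;
  }
  return admitted;
}

console.log(admittedAtOnce(5, 7));  // 5: two requests are throttled
console.log(admittedAtOnce(10, 7)); // 7: the whole wave is admitted
```

Under these assumptions, a burst of 5 delays part of a 7-request wave until tokens refill, while a burst of 10 lets the wave through in one pass.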

Retry transient steps (mirrors the existing Playwright retries:1):

  • ci-verify.yml: Cucumber --retry 1; retry-wrap setup-test-containers.sh and the two load-test npm ci steps with the already-pinned nick-fields/retry
  • QA health-wait 120s→180s + compose log dump on failure (ci-verify.yml DAST job + e2e-playwright.yml)
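
For reference, "retry 1" in these tools means one extra attempt after the first failure, so at most two attempts in total. A minimal sketch of that semantics follows; `runWithRetries` and `flakyOnce` are hypothetical helpers, kept synchronous for brevity, and do not reflect how Cucumber, Playwright, or nick-fields/retry are implemented.

```typescript
// Hypothetical helper: run `step`, retrying up to `retries` more times on failure.
function runWithRetries<T>(step: (attempt: number) => T, retries: number): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    try {
      return step(attempt);
    } catch (err) {
      lastError = err; // remember the failure; loop again if budget remains
    }
  }
  throw lastError; // every attempt failed, so surface the last error
}

// Simulated transient step: fails on the first attempt, then succeeds.
function flakyOnce(attempt: number): string {
  if (attempt === 1) throw new Error("transient failure");
  return `passed on attempt ${attempt}`;
}

console.log(runWithRetries(flakyOnce, 1)); // passed on attempt 2
```

A step that fails on every attempt still fails the job, so a single retry absorbs one-off transients without hiding persistent breakage.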

Release resilience:

  • release-cut.yml: CI polling budget 60→90 min (covers a full CI re-run; no effect on first-pass green)

Misc:

  • ui/vitest.config.ts: retry: 1 (mirrors app/vitest.config.ts)
  • dropped a redundant waitForTimeout(500) in containers.spec.ts (the following web-first assertion already waits)

Notes

  • Targets main for rc.37; rc.36 is already cut and unaffected.
  • Deliberately did not soft-warn the actions-tab update-control assertion (a separate audit suggestion): the Redis Cache fixture include already makes it deterministic, and downgrading it would gut the test that caught the "Can't set up auto updates to work" (#321) regression.

Refs: #321

Root-causes the release-blocking "6/7 containers ready after 90s" race and
related flake sources surfaced by a CI audit:

- start-drydock.sh: discovery wait 90s→150s, DD_WATCHER_LOCAL_JITTER=0,
  DD_READY_TOLERANCE=1 (tolerate one slow registry call)
- token-bucket: raise Docker Hub/GHCR burst 5→10 (align to default; absorbs
  the concurrent cold-start scan wave that was being throttled)
- ci-verify.yml: Cucumber --retry 1, retry-wrap setup-test-containers,
  retry-wrap load-test npm ci, longer QA health-wait + log dump
- e2e-playwright.yml: longer QA health-wait + log dump
- release-cut.yml: CI polling budget 60→90min
- ui vitest retry:1 (mirror app); drop a redundant Playwright sleep
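
The interaction between the discovery wait and the readiness tolerance can be modelled roughly as below. This is a sketch, not the logic in scripts/start-drydock.sh: `waitForReady`, the 30 s poll interval, and the `slowRegistry` timeline are invented for illustration.

```typescript
// Hypothetical model of a bootstrap gate that polls until enough containers are ready.
// Returns the time (in seconds) at which the gate passes, or null on timeout.
function waitForReady(
  readyAt: (tSec: number) => number, // ready-container count observed at time t
  expected: number,
  tolerance: number,
  timeoutSec: number,
  intervalSec: number = 30,
): number | null {
  for (let t = 0; t <= timeoutSec; t += intervalSec) {
    if (readyAt(t) >= expected - tolerance) return t;
  }
  return null;
}

// Invented timeline: the seventh container only becomes ready after 120 s.
const slowRegistry = (tSec: number): number => (tSec < 60 ? 5 : tSec < 120 ? 6 : 7);

console.log(waitForReady(slowRegistry, 7, 0, 90));  // null: 6/7 ready when the 90 s wait ends
console.log(waitForReady(slowRegistry, 7, 0, 150)); // 120: the longer wait covers the slow call
console.log(waitForReady(slowRegistry, 7, 1, 90));  // 60: tolerance 1 accepts 6/7
```

In this model a longer wait and a non-zero tolerance both clear the gate, but only the longer wait guarantees that all seven containers are actually present.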
@vercel

vercel Bot commented Jun 15, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| drydock-website | Ready | Preview, Comment | Jun 15, 2026 6:49pm |
| drydockdemo-website | Ready | Preview, Comment | Jun 15, 2026 6:49pm |

@biggest-littlest biggest-littlest left a comment

Member


CI flake-hardening: root-causes the 6/7-in-90s race + related sources. Reviewed the diff; targeted, low-risk, well-scoped.

ALARGECOMPANY previously approved these changes Jun 15, 2026

@ALARGECOMPANY ALARGECOMPANY left a comment

Member


Approved. Script/CI hardening + token-bucket burst aligned to default; deferred the riskier items appropriately.

@codecov

codecov Bot commented Jun 15, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.


scttbnsn added 2 commits June 15, 2026 13:04
…he full set

Root-causes the Cucumber `api-container.feature` ">= 7 entries" flake (CI run
27559941520): the home-assistant and radarr fixtures run their real application
entrypoints, which crash/exit in CI (need writable config dirs). drydock's docker
watcher lists only running containers, so an exited fixture vanishes from
/api/containers β€” dropping the count below 7 and making `hub_homeassistant_latest`
unfindable by name.
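
The failure mode described above can be sketched as follows. The fixture names come from this PR, but the `Fixture` type and `visibleContainers` function are hypothetical stand-ins for a watcher that lists running containers only, and the three-item list is a subset chosen for illustration.

```typescript
type FixtureState = "running" | "exited";

interface Fixture {
  name: string;
  state: FixtureState;
}

// Hypothetical stand-in for a watcher that reports running containers only.
function visibleContainers(fixtures: Fixture[]): string[] {
  return fixtures.filter((f) => f.state === "running").map((f) => f.name);
}

// Two fixtures whose real entrypoints exited in CI, plus one that stayed up.
const fixtures: Fixture[] = [
  { name: "hub_homeassistant_latest", state: "exited" },
  { name: "trueforge_radarr", state: "exited" },
  { name: "quay_prometheus", state: "running" },
];

const visible = visibleContainers(fixtures);
console.log(visible.length);                               // 1
console.log(visible.includes("hub_homeassistant_latest")); // false
```

An exited fixture lowers the count and also breaks any lookup by name, which matches both symptoms seen in the flaky run.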

- 🧪 test(e2e): override entrypoint to `tail -f /dev/null` for hub_homeassistant_202161,
  hub_homeassistant_latest, and trueforge_radarr. drydock only reads the image
  reference for metadata; the container does not need to run the real app.
- 🔧 chore(ci): revert READY_TOLERANCE default 1 → 0 (strict). The tolerance masked
  a missing container at the bootstrap gate, then the feature's zero-tolerance ">= 7"
  assertion failed downstream. With fixtures kept alive deterministically, strict is
  correct and matches the feature gate.
- 🐛 fix(ci): correct DEFAULT_EXPECTED off-by-one (10 → 11); quay_prometheus was added
  without bumping the baseline, so the bootstrap targeted one fewer container than CI
  actually starts.
- 🐛 fix(e2e): drop broken `\b...\b` anchors in openAnyContainerDetail (a word boundary
  after an escaped ")" never matches "Nginx (Hooked)"), and replace waitForTimeout(300)
  with deterministic waits (input cleared + rows rendered) before the non-auto-waiting
  count() lookups.
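
The word-boundary problem from the last item can be reproduced in isolation. This is a sketch, not the actual openAnyContainerDetail code: `escapeRegExp` is a hypothetical helper, and only the label "Nginx (Hooked)" is taken from the commit message.

```typescript
// Escape regex metacharacters so a label can be embedded in a pattern.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const label = "Nginx (Hooked)";

// \b needs a word character on exactly one side. After ")" both sides are
// non-word (or end of input), so the trailing \b cannot match here.
const anchored = new RegExp(`\\b${escapeRegExp(label)}\\b`);
const unanchored = new RegExp(escapeRegExp(label));

console.log(anchored.test(label));                // false
console.log(anchored.test("Nginx (Hooked) row")); // false
console.log(unanchored.test(label));              // true
```

Dropping the anchors trades strict whole-word matching for a pattern that can actually match labels ending in punctuation.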
GHSA-96hv-2xvq-fx4p: ws 8.0.0–8.20.x is vulnerable to memory exhaustion from
tiny WebSocket fragments; first fixed in 8.21.0. Pulled in transitively via
artillery and pinned by the e2e `overrides` block. Test-only dev dependency
(not shipped in the drydock image); surfaced by osv-scanner in the qlty gate.
…aces

osv-scanner (qlty gate) flagged a batch of CVEs disclosed 2026-06-15 in transitive
dependencies of app, ui, root, e2e, and apps/demo. None are shipped runtime-
exploitable, but they fail the required Lint gate. Patched via overrides (or direct
bump where the dep is direct):

- vite 8.0.14/8.0.12 → 8.0.16 (root, app, ui): CVE-2026-53571, CVE-2026-53632
- vite 7.3.2 → 7.3.5 (apps/demo, direct devDep): CVE-2026-53571, CVE-2026-53632
- @babel/core 7.29.0 → 7.29.6 (app, ui): CVE-2026-49356
- form-data 4.0.5 → 4.0.6 (app, e2e): CVE-2026-12143
- protobufjs 7.5.8/7.6.1 → 7.6.3 (app, e2e): CVE-2026-54269

js-yaml@3.14.2 (e2e, via artillery) is triaged in .qlty/qlty.toml: its only fix is
js-yaml 4.2.0, which removes the safeLoad() artillery calls (breaking), and the DoS
is unreachable since artillery parses only trusted in-repo load-test configs.

@biggest-littlest biggest-littlest left a comment

Member


CI fully green on 528f66d: Lint (fresh OSV) clean, Cucumber + Playwright e2e pass (flake fix validated), 100% coverage. Covers the flake-hardening and today's CVE batch. LGTM.

@ALARGECOMPANY ALARGECOMPANY left a comment

Member


Verified all required checks pass on the latest push. Keep-alive e2e fixtures resolve the Cucumber discovery flake; CVE batch closed across all workspaces. Approving.

@scttbnsn scttbnsn merged commit 0490d7e into main Jun 15, 2026
24 checks passed
@scttbnsn scttbnsn deleted the ci/harden-flakes branch June 15, 2026 19:12
scttbnsn added a commit that referenced this pull request Jun 15, 2026
…rdening) (#437)

Release prep for **v1.5.0-rc.37**. Adds the CHANGELOG entry required by
`release-cut.yml`'s validator.

rc.37 = rc.36 + #436 (`🔧 chore(ci): harden E2E/CI against transient
flakes`):

- **Security**: patched the 2026-06-15 transitive CVE batch (`vite`
CVE-2026-53571/53632, `@babel/core` CVE-2026-49356, `form-data`
CVE-2026-12143, `protobufjs` CVE-2026-54269, `ws` CVE-2026-48779) via
overrides/direct bumps across root/app/ui/e2e/apps/demo; `js-yaml`
(artillery, test-only) triaged unreachable.
- **Changed**: registry rate-limiter burst 5→10 for ghcr.io / Docker
Hub; E2E/CI flake hardening (keep-alive fixtures so the watcher
discovers the full set, strict bootstrap readiness, deterministic
Playwright waits).

Changelog-only; no source changes in this PR. Once merged I'll dispatch
`release-cut.yml --ref main -f release_tag=v1.5.0-rc.37`.