🔧 chore(ci): harden E2E/CI against transient flakes #436
Root-causes the release-blocking "6/7 containers ready after 90s" race and related flake sources surfaced by a CI audit:
- start-drydock.sh: discovery wait 90s→150s, DD_WATCHER_LOCAL_JITTER=0, DD_READY_TOLERANCE=1 (tolerate one slow registry call)
- token-bucket: raise Docker Hub/GHCR burst 5→10 (align to default; absorbs the concurrent cold-start scan wave that was being throttled)
- ci-verify.yml: Cucumber --retry 1, retry-wrap setup-test-containers, retry-wrap load-test npm ci, longer QA health-wait + log dump
- e2e-playwright.yml: longer QA health-wait + log dump
- release-cut.yml: CI polling budget 60→90min
- ui vitest retry:1 (mirror app); drop a redundant Playwright sleep
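To illustrate why a burst of 5 could throttle a concurrent cold-start scan wave while a burst of 10 absorbs it, here is a minimal token-bucket sketch. It is an illustration only, not the contents of `app/registries/token-bucket.ts`: the class shape, the `throttledInWave` helper, the refill rate of 1 token per second, and the wave size of 7 are all assumptions.

```typescript
// Hypothetical minimal token bucket (assumed shape, not drydock's implementation).
class TokenBucket {
  private readonly burst: number; // maximum tokens the bucket can hold
  private readonly refillPerSecond: number; // tokens added per second
  private tokens: number;
  private lastRefillMs: number;

  constructor(burst: number, refillPerSecond: number, nowMs: number) {
    this.burst = burst;
    this.refillPerSecond = refillPerSecond;
    this.tokens = burst; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if a request may proceed now, false if it would be throttled.
  tryTake(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Counts how many of `requests` simultaneous calls (all at time 0) are throttled.
function throttledInWave(burst: number, requests: number): number {
  const bucket = new TokenBucket(burst, 1, 0);
  let throttled = 0;
  for (let i = 0; i < requests; i++) {
    if (!bucket.tryTake(0)) {
      throttled++;
    }
  }
  return throttled;
}
```

With these assumed numbers, `throttledInWave(5, 7)` returns 2 and `throttledInWave(10, 7)` returns 0: the larger burst lets the whole wave through without waiting for a refill.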
biggest-littlest
left a comment
CI flake-hardening: root-causes the 6/7-in-90s race + related sources. Reviewed the diff; targeted, low-risk, well-scoped.
ALARGECOMPANY
left a comment
Approved. Script/CI hardening + token-bucket burst aligned to default; deferred the riskier items appropriately.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
…he full set

Root-causes the Cucumber `api-container.feature` ">= 7 entries" flake (CI run 27559941520): the home-assistant and radarr fixtures run their real application entrypoints, which crash/exit in CI (they need writable config dirs). drydock's docker watcher lists only running containers, so an exited fixture vanishes from /api/containers, dropping the count below 7 and making `hub_homeassistant_latest` unfindable by name.
- 🧪 test(e2e): override entrypoint to `tail -f /dev/null` for hub_homeassistant_202161, hub_homeassistant_latest, and trueforge_radarr. drydock only reads the image reference for metadata; the container does not need to run the real app.
- 🔧 chore(ci): revert READY_TOLERANCE default 1 → 0 (strict). The tolerance masked a missing container at the bootstrap gate, then the feature's zero-tolerance ">= 7" assertion failed downstream. With fixtures kept alive deterministically, strict is correct and matches the feature gate.
- 🐛 fix(ci): correct DEFAULT_EXPECTED off-by-one (10 → 11); quay_prometheus was added without bumping the baseline, so the bootstrap targeted one fewer container than CI actually starts.
- 🐛 fix(e2e): drop broken `\b...\b` anchors in openAnyContainerDetail (a word boundary after an escaped ")" never matches "Nginx (Hooked)"), and replace waitForTimeout(300) with deterministic waits (input cleared + rows rendered) before the non-auto-waiting count() lookups.
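The word-boundary problem named in the last bullet can be reproduced in isolation. The sketch below assumes the matcher was built by escaping a display name and wrapping it in `\b...\b`; the helper names are hypothetical and this is not the actual `openAnyContainerDetail` code.

```typescript
// Escapes regex metacharacters so a display name can be embedded in a pattern.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed shape of the old matcher: the escaped label wrapped in \b ... \b.
function matchesWithWordBoundaries(label: string, rowText: string): boolean {
  return new RegExp(`\\b${escapeRegExp(label)}\\b`).test(rowText);
}

// Assumed shape of the fix: the same escaped label without the anchors.
function matchesWithoutAnchors(label: string, rowText: string): boolean {
  return new RegExp(escapeRegExp(label)).test(rowText);
}

const hookedLabel = "Nginx (Hooked)";

// \b needs a word character on exactly one side. After ")" there is either the
// end of the string or whitespace, so both sides are non-word and \b fails.
const anchoredResult = matchesWithWordBoundaries(hookedLabel, hookedLabel); // false
const unanchoredResult = matchesWithoutAnchors(hookedLabel, hookedLabel); // true

// A label ending in a word character is unaffected by the anchors.
const plainResult = matchesWithWordBoundaries("Nginx", hookedLabel); // true
```

The anchors only break labels that begin or end with a non-word character, which is why the failure would show up for some containers and not others.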
GHSA-96hv-2xvq-fx4p: ws 8.0.0–8.20.x is vulnerable to memory exhaustion from tiny WebSocket fragments; first fixed in 8.21.0. Pulled in transitively via artillery and pinned by the e2e `overrides` block. Test-only dev dependency (not shipped in the drydock image); surfaced by osv-scanner in the qlty gate.
827a6da
…aces

osv-scanner (qlty gate) flagged a batch of CVEs disclosed 2026-06-15 in transitive dependencies of app, ui, root, e2e, and apps/demo. None are shipped runtime-exploitable, but they fail the required Lint gate. Patched via overrides (or direct bump where the dep is direct):
- vite 8.0.14/8.0.12 → 8.0.16 (root, app, ui): CVE-2026-53571, CVE-2026-53632
- vite 7.3.2 → 7.3.5 (apps/demo, direct devDep): CVE-2026-53571, CVE-2026-53632
- @babel/core 7.29.0 → 7.29.6 (app, ui): CVE-2026-49356
- form-data 4.0.5 → 4.0.6 (app, e2e): CVE-2026-12143
- protobufjs 7.5.8/7.6.1 → 7.6.3 (app, e2e): CVE-2026-54269

js-yaml@3.14.2 (e2e, via artillery) is triaged in .qlty/qlty.toml: its only fix is js-yaml 4.2.0, which removes the safeLoad() artillery calls (breaking), and the DoS is unreachable since artillery parses only trusted in-repo load-test configs.
biggest-littlest
left a comment
CI fully green on 528f66d: Lint (fresh OSV) clean, Cucumber + Playwright e2e pass (flake fix validated), 100% coverage. Flake-hardening + today's CVE batch. LGTM.
ALARGECOMPANY
left a comment
Verified all required checks pass on the latest push. Keep-alive e2e fixtures resolve the Cucumber discovery flake; CVE batch closed across all workspaces. Approving.
…rdening) (#437)

Release prep for **v1.5.0-rc.37**. Adds the CHANGELOG entry required by `release-cut.yml`'s validator.

rc.37 = rc.36 + #436 (`🔧 chore(ci): harden E2E/CI against transient flakes`):
- **Security**: patched the 2026-06-15 transitive CVE batch (`vite` CVE-2026-53571/53632, `@babel/core` CVE-2026-49356, `form-data` CVE-2026-12143, `protobufjs` CVE-2026-54269, `ws` CVE-2026-48779) via overrides/direct bumps across root/app/ui/e2e/apps/demo; `js-yaml` (artillery, test-only) triaged unreachable.
- **Changed**: registry rate-limiter burst 5→10 for ghcr.io / Docker Hub; E2E/CI flake hardening (keep-alive fixtures so the watcher discovers the full set, strict bootstrap readiness, deterministic Playwright waits).

Changelog-only; no source changes in this PR. Once merged I'll dispatch `release-cut.yml --ref main -f release_tag=v1.5.0-rc.37`.
Why
The rc.36 release cut was blocked by a CI flake: the E2E bootstrap failed with `drydock only has 6/7 ready containers after 90s`, a registry-latency race, not a code bug (the same tree passed minutes earlier on the PR). A CI-audit workflow root-caused this and surfaced related latent flake sources. This PR hardens them at the source so transient failures stop blocking releases.

Changes

Kills the confirmed 6/7-in-90s race:
- `scripts/start-drydock.sh`: discovery wait 90s→150s, `DD_WATCHER_LOCAL_JITTER=0` (removes up-to-60s watcher jitter), `DD_READY_TOLERANCE=1` (a single slow registry call no longer hard-fails the gate; overrideable to 0)
- `app/registries/token-bucket.ts`: Docker Hub + GHCR burst 5→10 (aligns to the default used by every other registry; absorbs the concurrent cold-start scan wave that was being throttled); token-bucket tests updated

Retry transient steps (mirrors the existing Playwright `retries:1`):
- `ci-verify.yml`: Cucumber `--retry 1`; retry-wrap `setup-test-containers.sh` and the two load-test `npm ci` steps with the already-pinned `nick-fields/retry`
- Longer QA health-wait + log dump (`ci-verify.yml` DAST job + `e2e-playwright.yml`)

Release resilience:
- `release-cut.yml`: CI polling budget 60→90 min (covers a full CI re-run; no effect on first-pass green)

Misc:
- `ui/vitest.config.ts`: `retry: 1` (mirrors `app/vitest.config.ts`)
- Drop the redundant `waitForTimeout(500)` in `containers.spec.ts` (the following web-first assertion already waits)

Notes
Refs: #321
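
The effect of `DD_READY_TOLERANCE` on the bootstrap gate can be sketched as below. This is a hedged TypeScript illustration of the semantics described in this PR, not the shell logic in `scripts/start-drydock.sh`; `checkReadiness`, `ReadinessResult`, and `featureAssertionPasses` are hypothetical names, and the assumption is that the gate passes when the number of missing containers does not exceed the tolerance.

```typescript
interface ReadinessResult {
  ready: boolean; // whether the bootstrap gate lets the run continue
  missing: number; // how many expected containers were not ready
}

// Assumed gate semantics: pass when at most `tolerance` containers are missing.
function checkReadiness(readyCount: number, expected: number, tolerance: number): ReadinessResult {
  const missing = Math.max(0, expected - readyCount);
  return { ready: missing <= tolerance, missing };
}

// Stand-in for a downstream zero-tolerance check such as ">= 7 entries".
function featureAssertionPasses(readyCount: number, minimum: number): boolean {
  return readyCount >= minimum;
}

// Tolerance 1: a 6/7 bootstrap passes the gate...
const tolerantGate = checkReadiness(6, 7, 1);
// ...but a strict downstream assertion on the same count still fails.
const downstreamOk = featureAssertionPasses(6, 7);

// Tolerance 0 (strict): the same 6/7 state fails at the gate itself.
const strictGate = checkReadiness(6, 7, 0);
```

Under these assumptions a nonzero tolerance moves a 6/7 failure from the bootstrap gate to whichever later step requires the full set, so the setting only helps when nothing downstream is strict about the count.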