
Comprehensive review, hardening, and a platform-wide design-system rework#354

Merged
enyineer merged 24 commits into main from chore/comprehensive-review-and-improvements
Jun 21, 2026

Conversation

@enyineer

Owner

A large, verified body of work spanning six areas - correctness, security, testing, UX, docs, and a premium, consistent UI design language - plus a low-noise retune of anomaly-detection defaults.

Design system (premium UI rework)

  • New @checkstack/ui foundation: surface elevation tokens, an aurora gradient signature, a colorblind-safe status triad, a comfortable/compact density model (provider + user toggle), polished skeleton/empty/error states, and honest token-driven chart primitives (time series, sparkline, radial gauge, request waterfall, uptime ribbon).
  • A signature aurora page-header (icon-stroke gradient) + deeper cards, an elevated app shell, and reskinned dashboard / health-check / SLO views.
  • Every plugin frontend adopts the tokens; the highest-impact surfaces are then redesigned to a premium bar (depth, number-led hierarchy, multi-encoded status). Pure tone/format logic extracted into unit-tested modules.
  • The system-health dashboard widget was reworked (actionable headline + proportional composition bar + a legend that doubles as filters, with an empty state when a filter has no matches).
  • All alerts unified onto one premium Alert.

Warning

Breaking change: the duplicate InfoBanner component (and its sub-components) is removed - use Alert, a drop-in replacement with the same variants and composable parts.

Anomaly-detection defaults (low-noise problem detection)

  • Reviewed all 264 metrics across every health-check strategy + the hardware collector: 94 noisy/un-baselineable ones default-disabled (raw counts/identifiers, config echoes, payload sizes, deterministic values like certificate days-remaining that a static health threshold already governs), 80 kept-and-hardened with confirmation windows + practical-significance floors, the rest already correct. Disabled metrics stay chartable and opt-in. No engine/schema changes.
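
To make "confirmation windows" and "practical-significance floors" concrete, here is a minimal sketch of how the two could combine. The option names and the `shouldFlag` helper are hypothetical and are not Checkstack's anomaly configuration schema; the per-sample detector is passed in as a stand-in for whatever the engine actually computes.

```typescript
// Illustrative only: these option names are NOT Checkstack's anomaly config schema.
type Hardening = {
  confirmations: number; // consecutive anomalous samples required before flagging
  minAbsoluteDelta: number; // practical-significance floor, in the metric's own unit
};

// `samples` are recent observations, oldest first; `baseline` is the expected value.
// `isStatisticallyAnomalous` stands in for the engine's per-sample verdict.
function shouldFlag(
  samples: number[],
  baseline: number,
  isStatisticallyAnomalous: (value: number) => boolean,
  hardening: Hardening,
): boolean {
  const needed = Math.max(1, hardening.confirmations);
  if (samples.length < needed) return false;
  const recent = samples.slice(-needed);
  // Every sample in the window must be anomalous AND far enough from the baseline to matter.
  return recent.every(
    (value) =>
      isStatisticallyAnomalous(value) &&
      Math.abs(value - baseline) >= hardening.minAbsoluteDelta,
  );
}
```

Under these assumptions a single spike never flags, and a sustained but tiny deviation does not flag either.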

Security hardening

  • At-rest encryption with key rotation + fail-loud decryption, brute-force/token-timing fixes, HTTP-collector SSRF guard, fail-closed plugin supply-chain integrity pinning, SQL plugin-schema identifier hardening, notification email HTML sanitization, per-assignment satellite result authorization, and a first-run onboarding TOCTOU guard.

Testing

  • A real end-to-end suite (Playwright + Testcontainers Postgres) covering the authenticated app, made a blocking CI job, plus extracted pure-logic unit tests throughout.

UX & accessibility

  • Form quality across editors, mobile responsiveness and touch targets, accessible overlays/forms, list/loading/empty/error state consistency, onboarding + point-of-use coaching, wider command-palette coverage, and sidebar IA.

Refactors & docs

  • Typed router-factory args + structured logging, typed Drizzle JSON columns, shared formatting helpers, removal of boundary casts, and same-PR docs/AI updates.

Notes

  • Changesets included; platform is in BETA so all bumps are minor/patch (no major). Design-system and anomaly changesets are consolidated.
  • Verified throughout: typecheck, lint, unit tests, and the full e2e suite green at each stage.

🤖 Generated with Claude Code

https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE

@changeset-bot

changeset-bot Bot commented Jun 20, 2026


🦋 Changeset detected

Latest commit: a02d2bd

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 98 packages
Name Type
@checkstack/status-page-frontend Minor
@checkstack/auth-frontend Minor
@checkstack/ai-backend Minor
@checkstack/ai-common Minor
@checkstack/healthcheck-http-backend Minor
@checkstack/healthcheck-dns-backend Patch
@checkstack/healthcheck-grpc-backend Minor
@checkstack/healthcheck-ping-backend Patch
@checkstack/healthcheck-tcp-backend Patch
@checkstack/healthcheck-tls-backend Patch
@checkstack/healthcheck-redis-backend Patch
@checkstack/healthcheck-postgres-backend Patch
@checkstack/healthcheck-mysql-backend Patch
@checkstack/healthcheck-ssh-backend Patch
@checkstack/healthcheck-script-backend Minor
@checkstack/healthcheck-jenkins-backend Minor
@checkstack/healthcheck-rcon-backend Patch
@checkstack/collector-hardware-backend Patch
@checkstack/api-docs-frontend Minor
@checkstack/auth-backend Minor
@checkstack/automation-backend Patch
@checkstack/dependency-backend Patch
@checkstack/status-page-backend Patch
@checkstack/satellite-backend Patch
@checkstack/gitops-backend Patch
@checkstack/secrets-backend Patch
@checkstack/notification-backend Patch
@checkstack/script-packages-backend Patch
@checkstack/ui Minor
@checkstack/pluginmanager-frontend Patch
@checkstack/command-frontend Patch
@checkstack/backend-api Minor
@checkstack/dependency-frontend Patch
@checkstack/about-frontend Patch
@checkstack/ai-frontend Patch
@checkstack/announcement-frontend Patch
@checkstack/anomaly-frontend Patch
@checkstack/automation-frontend Minor
@checkstack/cache-frontend Patch
@checkstack/catalog-frontend Patch
@checkstack/dashboard-frontend Minor
@checkstack/frontend Minor
@checkstack/gitops-frontend Patch
@checkstack/healthcheck-frontend Minor
@checkstack/incident-frontend Minor
@checkstack/infrastructure-frontend Patch
@checkstack/integration-frontend Patch
@checkstack/maintenance-frontend Minor
@checkstack/notification-frontend Patch
@checkstack/queue-frontend Patch
@checkstack/satellite-frontend Patch
@checkstack/script-packages-frontend Patch
@checkstack/secrets-frontend Patch
@checkstack/slo-frontend Minor
@checkstack/theme-frontend Minor
@checkstack/tips-frontend Patch
@checkstack/backend Minor
@checkstack/healthcheck-backend Minor
@checkstack/anomaly-backend Patch
@checkstack/catalog-backend Patch
@checkstack/incident-backend Patch
@checkstack/maintenance-backend Patch
@checkstack/slo-backend Patch
@checkstack/announcement-backend Patch
@checkstack/theme-backend Patch
@checkstack/tips-backend Patch
@checkstack/auth-credential-backend Patch
@checkstack/auth-github-backend Patch
@checkstack/auth-ldap-backend Patch
@checkstack/auth-saml-backend Patch
@checkstack/integration-jira-backend Patch
@checkstack/integration-script-backend Patch
@checkstack/integration-teams-backend Patch
@checkstack/integration-webex-backend Patch
@checkstack/integration-webhook-backend Patch
@checkstack/integration-backend Patch
@checkstack/secrets-backend-local Patch
@checkstack/secrets-backend-vault Patch
@checkstack/notification-backstage-backend Patch
@checkstack/notification-discord-backend Patch
@checkstack/notification-gotify-backend Patch
@checkstack/notification-pushover-backend Patch
@checkstack/notification-slack-backend Patch
@checkstack/notification-smtp-backend Patch
@checkstack/notification-teams-backend Patch
@checkstack/notification-telegram-backend Patch
@checkstack/notification-webex-backend Patch
@checkstack/satellite Patch
@checkstack/script-packages-store-postgres Patch
@checkstack/script-packages-store-s3 Patch
@checkstack/cache-backend Patch
@checkstack/command-backend Patch
@checkstack/queue-backend Patch
@checkstack/signal-backend Patch
@checkstack/test-utils-backend Patch
@checkstack/cache-memory-backend Patch
@checkstack/queue-bullmq-backend Patch
@checkstack/queue-memory-backend Patch


@github-actions

github-actions Bot commented Jun 20, 2026

Contributor

📦 Changeset Coverage Incomplete

The following packages have code changes but are not included in any changeset:

  • @checkstack/test-utils-backend
  • @checkstack/queue-memory-backend

⚠️ Please add a changeset entry for each of the listed packages before merging.

…tem rework

A large, verified body of work bringing Checkstack up across six areas:
correctness, security, testing, UX, docs, and AI - capped by a premium,
consistent UI design language.

Design system (premium UI rework)
- New `@checkstack/ui` foundation: surface elevation tokens, aurora gradient,
  colorblind-safe status triad, density model (comfortable/compact) + provider
  and user toggle, polished skeleton/empty/error states, and honest token-driven
  chart primitives (time series, sparkline, radial gauge, request waterfall,
  uptime ribbon).
- A signature aurora page-header + deeper cards, an elevated app shell, and
  reskinned dashboard / health-check / SLO views.
- Every plugin frontend adopts the tokens, then its highest-impact surfaces are
  redesigned to a premium bar (depth, number-led hierarchy, multi-encoded
  status). Pure tone/format logic extracted into unit-tested modules.
- All alerts unified onto one premium `Alert`; the duplicate `InfoBanner` is
  removed (BREAKING: use `Alert`).

Security hardening
- At-rest encryption with key rotation + fail-loud decryption, brute-force /
  token-timing fixes, HTTP-collector SSRF guard, fail-closed plugin
  supply-chain integrity pinning, SQL plugin-schema identifier hardening,
  notification email HTML sanitization, per-assignment satellite result
  authorization, and a first-run onboarding TOCTOU guard.

Testing
- A real end-to-end suite (Playwright + Testcontainers Postgres) covering the
  authenticated app, made a blocking CI job, plus extracted pure-logic unit
  tests throughout.

UX & accessibility
- All review-surfaced UX improvements: form quality across editors, mobile
  responsiveness and touch targets, accessible overlays/forms, list/loading/
  empty/error state consistency, onboarding + point-of-use coaching, wider
  command-palette coverage, and sidebar IA.

Refactors & docs
- Typed router-factory args + structured logging, typed Drizzle JSON columns,
  shared formatting helpers, removal of boundary casts, and same-PR docs/AI
  updates.

Reliability
- Retune anomaly-detection defaults across every health-check strategy and the
  hardware collector for a low-noise posture: noisy or un-baselineable metrics
  (raw counts, config echoes, payload sizes, deterministic values) default to
  off, while latency, availability, and saturation-percent signals are kept and
  hardened with confirmation windows and practical-significance floors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
enyineer force-pushed the chore/comprehensive-review-and-improvements branch from 7a0cbeb to 831ad49 on June 20, 2026 09:43
enyineer and others added 10 commits June 20, 2026 12:10
…ssertable results

A collector must fail only when the transport could not complete (DNS/connect/
TLS failure, timeout, aborted, unspawnable process). A successfully-received
result that is simply "not what you hoped" - an HTTP 4xx/5xx, a gRPC NOT_SERVING,
offline Jenkins nodes, a non-zero script exit - is an ASSERTABLE METRIC, not a
collector failure; the user's assertions (or the no-assertion default) decide
health.

Fixes the HTTP collector hard-failing on 404 (now a successful collection with
statusCode exposed and assertable), plus gRPC, Jenkins node-health, and the
script execute collector. Audited every other strategy: they already failed only
on genuine transport failures. Adds regression tests, docs, a new project rule
(.claude/rules/healthcheck-collectors.md), and a changeset (BREAKING: affected
checks now need an explicit assertion to fail on a non-OK result).
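
The rule above can be sketched as a small classification step. The type and function names here are hypothetical, not the collector's real API, and the no-assertion default (a completed collection counts as healthy) is an assumption drawn from the breaking-change note.

```typescript
// Hypothetical types: illustrative names, not Checkstack's actual API.
type TransportFailure = { kind: "transport-failure"; reason: string };
type Collected = { kind: "collected"; metrics: { statusCode: number } };
type CollectorResult = TransportFailure | Collected;

// What a transport layer might report: a response arrived, or it did not.
type HttpOutcome =
  | { received: true; statusCode: number }
  | { received: false; reason: "dns" | "connect" | "tls" | "timeout" | "aborted" };

// The collector fails only when no response was received.
function toCollectorResult(outcome: HttpOutcome): CollectorResult {
  if (outcome.received === false) {
    return { kind: "transport-failure", reason: outcome.reason };
  }
  // A 404 or 500 is still a successful collection; statusCode is just a metric.
  return { kind: "collected", metrics: { statusCode: outcome.statusCode } };
}

// An assertion, not the collector, decides health.
function isHealthy(result: CollectorResult, expectedStatus?: number): boolean {
  if (result.kind === "transport-failure") return false;
  // Assumed no-assertion default: a completed collection is healthy.
  if (expectedStatus === undefined) return true;
  return result.metrics.statusCode === expectedStatus;
}
```

With this shape, a 404 stays healthy until the user asserts on `statusCode`, which matches the breaking-change note.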

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…unchanged)

Records an optional structured metadata.timings per run (DNS, connect, TLS,
wait/TTFB, transfer, and a processing catch-all for non-HTTP operation time);
the run-detail view renders the present phases in transport order and falls back
to the old Connection+Processing split for older runs.

HTTP: the request is byte-for-byte the same fetch path (IP-pinned, original
Host + SNI) - request behavior is unchanged. Timing is measured around it: fetch
resolves at the response headers so wait (TTFB) and transfer (body) are exact on
the request, DNS is timed at the resolve step, and connect/TLS come from a
short-lived best-effort raw net/tls probe to the same validated IP (Bun's fetch
socket emits no connect/handshake events; raw sockets do). The probe is
timing-only and never fails the check. Other transports surface the connect and
operation times they already measure.

Also fixes the run-detail "slowest" badge colliding with the bar, and shows a
genuinely sub-millisecond phase as "<1 ms".
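
A minimal sketch of the rendering side, assuming the phases are stored as optional millisecond values keyed by name. The phase names follow the list above; the `Timings` shape, the helper names, and placing `processing` last are assumptions.

```typescript
// Phase names follow the commit message; shapes and function names are hypothetical.
const PHASE_ORDER = ["dns", "connect", "tls", "wait", "transfer", "processing"] as const;
type Phase = (typeof PHASE_ORDER)[number];
type Timings = Partial<Record<Phase, number>>; // milliseconds per phase

// Keep only the phases a run actually recorded, in transport order.
function presentPhases(timings: Timings): Array<{ phase: Phase; ms: number }> {
  const rows: Array<{ phase: Phase; ms: number }> = [];
  for (const phase of PHASE_ORDER) {
    const ms = timings[phase];
    if (ms !== undefined) rows.push({ phase, ms });
  }
  return rows;
}

// Show a genuinely sub-millisecond phase as "<1 ms" rather than rounding it to "0 ms".
function formatPhaseMs(ms: number): string {
  if (ms > 0 && ms < 1) return "<1 ms";
  return `${Math.round(ms)} ms`;
}
```

Iterating over a fixed order, rather than over the object's own keys, keeps the display stable regardless of how a collector populated the record.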

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fixes HTTP/2 404s)

The SSRF hardening added in this branch pinned the request to the resolved IP:
it rewrote the URL host to the IP literal and moved the hostname into the `Host`
header. That breaks HTTP/2 origins, whose authority is carried in the
`:authority` pseudo-header (derived from the URL) rather than in `Host`, so real
hosts such as google.com started answering 404/429 instead of 200.

Keep the SSRF guard as a pre-flight validation (still rejects cloud-metadata /
link-local and operator-denied ranges, and direct denied IP literals) but drop
the pin and `fetch` the original URL verbatim, byte-identical to a plain fetch.
The resolved IP is reused only for the best-effort timing probe. The only thing
lost is DNS-rebind TOCTOU protection - a narrow window whose price was breaking
every legitimate HTTP/2 request.

Verified: example.com and cloudflare.com (both HTTP/2) return 200 with the full
timing breakdown intact; SSRF guard tests still pass.
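
For illustration, a pre-flight check of this kind could look like the sketch below. It is IPv4-only, the helper names are hypothetical, and treating unparsable input as denied is a simplification chosen for the example rather than the guard's documented behavior.

```typescript
// Parse a dotted-quad IPv4 address into an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside `cidr` (for example "169.254.0.0/16").
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  if (base === undefined || bitsText === undefined || !/^\d+$/.test(bitsText)) return false;
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  const bits = Number(bitsText);
  if (ipInt === null || baseInt === null || bits > 32) return false;
  const size = 2 ** (32 - bits); // number of addresses in the block
  return Math.floor(ipInt / size) === Math.floor(baseInt / size);
}

// Link-local range, which contains the 169.254.169.254 cloud-metadata address.
const ALWAYS_DENIED = ["169.254.0.0/16"];

// Validate the resolved address before the original URL is fetched verbatim.
function isDeniedAddress(ip: string, operatorDenied: string[] = []): boolean {
  // Simplification for this sketch: anything that is not valid IPv4 is denied.
  if (ipv4ToInt(ip) === null) return true;
  return [...ALWAYS_DENIED, ...operatorDenied].some((cidr) => inCidr(ip, cidr));
}
```

Because the check runs before the request and the request itself is left untouched, it cannot close the DNS-rebind window described above; it only rejects targets that resolve into a denied range at validation time.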

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
The e2e suite hung indefinitely at "stopping ephemeral Postgres...". Bisected
locally: container.stop({ timeout: 0 }) resolves in ~180ms, but the process
never exits. Testcontainers' Ryuk reaper keeps a persistent socket open to its
sidecar for the process lifetime and relies on socket.unref() so it does not
block exit - the Bun runtime does not honor that unref, so the socket keeps the
event loop alive forever after the suite finishes. In CI the step pipes through
`tee`, which only ends when our stdout closes, hence the indefinite hang.

Disable Ryuk in the harness: the wrapper already stops and removes the container
deterministically in `finally` on every exit path (verified the container is
gone after stop() with Ryuk off), and CI runners are ephemeral, so the reaper is
unnecessary. The process now exits naturally - no force-exit workaround.
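
The deterministic cleanup that makes the reaper redundant can be sketched as a `try`/`finally` wrapper. The `StoppableContainer` interface and `withEphemeralContainer` helper are hypothetical stand-ins for the harness. Testcontainers reads the `TESTCONTAINERS_RYUK_DISABLED` environment variable to turn Ryuk off; whether the harness sets it exactly that way is an assumption.

```typescript
// Minimal shape of a started container; hypothetical, modeled on a stop() method.
interface StoppableContainer {
  stop(): Promise<void>;
}

// Run `body` against a freshly started container and always stop it afterwards,
// whether `body` resolves or throws.
async function withEphemeralContainer<C extends StoppableContainer, T>(
  start: () => Promise<C>,
  body: (container: C) => Promise<T>,
): Promise<T> {
  const container = await start();
  try {
    return await body(container);
  } finally {
    // Runs on success and on failure, so no background reaper is needed for cleanup.
    await container.stop();
  }
}
```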

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
The assistant had healthcheck.status (every check globally) but no way to map a
check to a system, so it had to GUESS which check monitored a given system (e.g.
"google.com expecting 201" vs "expecting 200"). Project the existing
getSystemConfigurations query as the read-only AI tool healthcheck.listSystemChecks:
given a systemId (resolved from a name via catalog.listSystems), it returns the
checks assigned to that system - id, name, strategy, interval, collectors/
assertions, and paused state.

The tool inherits the source procedure's system-scoped configuration.read gate
(parentScope on catalog.system), so it stays team-scoped and needs no new
permission. Adds a projection test mirroring the sibling tools, documents the
system-scoped read pattern in registering-tools.md, and regenerates the docs
index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
Closing a downtime window depended entirely on catching the system's transient
health-recovery edge (onEntityChanged), which is emitted only by a check RUN.
Fixing, pausing, deleting, or unassigning the offending check just invalidates
the read cache and emits no edge, and even a plain edit can lose the single
recovery delivery - leaving the open window orphaned until the once-daily
reconcile. Result: the SLO read 100% availability (live health is authoritative
for the budget) while "Recent Downtime Events" still showed a 25-day "ongoing"
window. The two views disagreed.

The user-facing SLO reads now reconcile against live health before reporting:
getDowntimeEvents and the status reads void an orphaned open window when the
system is currently healthy, reusing the same voidOrphanedDowntime the daily job
runs. The dashboard self-heals the moment it is viewed instead of waiting for
midnight. The reactive entity read / computeStatus stays side-effect-free; the
reconcile is a cheap no-op when there are no open events. The void primitive is
already unit-tested; the router change is thin glue over it (no router harness
exists in this package).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
The chat loop replays earlier tool results verbatim with no age annotation, and
the system prompt injected "current time" but never how long the thread had been
idle. So when an old conversation was resumed, the model answered from stale
captured data (a check's old name, a "failing" status) instead of current state.

streamTurn now measures the idle gap before the message (the conversation's
last-activity timestamp, captured before the new user message bumps updatedAt)
and, once it exceeds 10 minutes, folds a "Data freshness" directive into the
system prompt telling the model to re-call the relevant read tools for current
state rather than trust earlier-turn results. The directive sits at the volatile
end of the prompt (next to the time line) so the cache-friendly stable prefix is
unaffected; an active back-and-forth never sees it.
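
A minimal sketch of that prompt assembly, assuming a hypothetical `buildSystemPrompt` helper. The 10-minute threshold and the placement at the volatile end come from the description above; the directive wording is invented for the example.

```typescript
const IDLE_THRESHOLD_MS = 10 * 60 * 1000; // 10 minutes, per the commit message

// Hypothetical helper: assemble the system prompt for one chat turn.
// `lastActivityAt` must be read BEFORE the new user message bumps it.
function buildSystemPrompt(stablePrefix: string, now: Date, lastActivityAt: Date): string {
  const volatileLines: string[] = [`Current time: ${now.toISOString()}`];
  const idleMs = now.getTime() - lastActivityAt.getTime();
  if (idleMs > IDLE_THRESHOLD_MS) {
    const idleMinutes = Math.floor(idleMs / 60000);
    volatileLines.push(
      `Data freshness: this conversation was idle for about ${idleMinutes} minutes. ` +
        "Re-call the relevant read tools for current state instead of trusting earlier tool results.",
    );
  }
  // The stable prefix stays byte-identical across turns; volatile lines come last.
  return [stablePrefix, ...volatileLines].join("\n");
}
```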

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
The reconcile-on-read fix DELETED orphaned downtime windows (when the system is
healthy but the recovery edge was missed). For a genuine multi-day outage whose
recovery edge was lost, that erased real downtime - so the SLO read a false 100%
availability with full error budget, even though the system had been down for
~25 days.

Reconcile now PRESERVES the downtime: it resolves the system's actual recovery
time (the first healthy run on/after the window opened, via healthcheck
getHistory) and CLOSES the window at that instant, so the real outage is counted
against availability and the error budget. It only DELETES as a fallback when the
recovery time can't be determined (e.g. run history pruned), where the unprovable
downtime must not be counted.

- closeDowntimeEvent gains an optional explicit endTime (clamped >= startTime).
- SloEngine gains a recovery-time resolver, wired in afterPluginsReady from the
  healthcheck run history; voidOrphanedDowntime -> reconcileOrphanedDowntime.
- Forward-only: already-written daily snapshots are not retroactively corrected.
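
The close-or-delete decision can be sketched as a pure function. The `Run` and `Reconcile` shapes and the function name are hypothetical; the real `SloEngine` types are not shown in this PR.

```typescript
// Hypothetical shapes; the real SloEngine types are not shown in this PR text.
type Run = { at: Date; healthy: boolean };
type Reconcile = { action: "close"; endTime: Date } | { action: "delete" };

// Decide what to do with an open downtime window for a system that is healthy now.
function reconcileOrphanedWindow(startTime: Date, history: Run[]): Reconcile {
  // Recovery time: the first healthy run on or after the moment the window opened.
  const candidates = history
    .filter((run) => run.healthy && run.at.getTime() >= startTime.getTime())
    .sort((a, b) => a.at.getTime() - b.at.getTime());

  // Fallback: without a provable recovery time, the downtime must not be counted.
  if (candidates.length === 0) return { action: "delete" };

  // Redundant given the filter, but mirrors the clamp described for closeDowntimeEvent.
  const endMs = Math.max(candidates[0].at.getTime(), startTime.getTime());
  return { action: "close", endTime: new Date(endMs) };
}
```

Closing at the first healthy run, rather than at the time of the read, keeps a window that was discovered late from counting more downtime than actually occurred.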

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
A chat Skill (e.g. "write like a redneck") held during tool-calling steps but
normalized back to professional tone in the synthesized reply. The multi-step
loop's forced final-answer step replaces the whole system prompt with a
tool-less "answer now, be concise" instruction, dropping the skill preamble on
the exact step that writes the user-visible answer.

prepareFinalAnswerStep now accepts persistent guidance (the skill preamble) and
appends it after the base final-answer instruction, so the skill's voice governs
the synthesized reply too. The headless runner passes none, so it is unchanged.
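
As a rough sketch of the change: only the function name comes from the description above, while the return shape and the base instruction text are assumptions made for the example.

```typescript
// Hypothetical base instruction; the real wording is not shown in this PR.
const FINAL_ANSWER_INSTRUCTION = "Answer now using the results gathered so far. Be concise.";

// Hypothetical signature; only the function name comes from the commit message.
function prepareFinalAnswerStep(persistentGuidance?: string): { system: string; tools: string[] } {
  const parts = [FINAL_ANSWER_INSTRUCTION];
  // Appended after the base instruction so the skill's voice also governs the final reply.
  if (persistentGuidance !== undefined && persistentGuidance.trim() !== "") {
    parts.push(persistentGuidance);
  }
  // The final step is tool-less: the model must answer, not call more tools.
  return { system: parts.join("\n\n"), tools: [] };
}
```

Calling it with no argument, as the headless runner is described as doing, yields the base instruction alone.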

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
Asked "how do I add a system to the catalog?", the assistant answered with the
internal tool name (catalog.createSystem) and its input JSON schema - but the
operator cannot call tools and never sees them; that is the assistant's own
mechanism, not a workflow.

The chat system prompt now states tools are the assistant's own (not a public
API), and a how-to must be answered in product terms (the UI, grounded in docs)
and/or by offering to do it for the operator - never by presenting tool names,
tool input JSON, or parameter schemas as steps to follow. Chat-only; the
headless runner is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
Comment thread plugins/healthcheck-http-backend/src/connect-probe.ts Fixed
enyineer and others added 2 commits June 20, 2026 16:02
CodeQL flagged the connect-probe's `rejectUnauthorized: false` ("disabling
certificate validation is strongly discouraged"). The probe is timing-only, but
disabling validation is unnecessary: it dials the validated IP with the original
hostname as SNI, so a valid cert verifies against `servername`, and the real
`fetch` already validates strictly (a bad cert fails the check regardless). Drop
the override; if the handshake can't complete (invalid/self-signed cert) the
existing error handler resolves with just the TCP `connectMs` and no `tlsMs` -
timing stays best-effort, never fatal.

Verified: valid hosts still report tlsMs; a cert/servername mismatch degrades to
connectMs only with no crash.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

❌ PR Checks Failed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ❌ Failed
❌ E2E Failures
... (truncated 494 lines)
[1/5] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/5] [chromium] › tests/queue.spec.ts:33:3 › Queue admin area › opens the Queue runtime panel from the infrastructure shell
[3/5] [chromium] › tests/queue.spec.ts:65:3 › Queue admin area › the runtime panel shows the instance-scope banner and count tiles
[4/5] [chromium] › tests/queue.spec.ts:93:3 › Queue admin area › the job-state sub-tabs render their listings
[5/5] [chromium] › tests/queue.spec.ts:131:3 › Queue admin area › the Queue tab also exposes the configuration sub-section
  5 passed (16.2s)

========== satellite.spec.ts ==========

Running 4 tests using 1 worker

[1/4] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/4] [chromium] › tests/satellite.spec.ts:14:3 › satellites › renders the page chrome with title, subtitle and create action
[3/4] [chromium] › tests/satellite.spec.ts:43:3 › satellites › shows the onboarding empty state when no satellites are registered
[4/4] [chromium] › tests/satellite.spec.ts:77:3 › satellites › the create affordance opens the registration dialog
  4 passed (17.3s)

========== script-packages.spec.ts ==========

Running 7 tests using 1 worker

[1/7] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/7] [chromium] › tests/script-packages.spec.ts:32:3 › Script Packages settings › renders with the install-state card and an empty allowlist
[3/7] [chromium] › tests/script-packages.spec.ts:68:3 › Script Packages settings › Add is disabled until a name and a valid pinned version are present
[4/7] [chromium] › tests/script-packages.spec.ts:99:3 › Script Packages settings › the Advanced section exposes the registry-URL config
[5/7] [chromium] › tests/script-packages.spec.ts:126:3 › Script Sandbox settings › renders the global policy editor with its key controls
[6/7] [chromium] › tests/script-packages.spec.ts:159:3 › Script Sandbox settings › switching network mode to allowlist reveals the destinations field
[7/7] [chromium] › tests/script-packages.spec.ts:180:3 › Script Sandbox settings › saving the policy surfaces the success confirmation
  7 passed (18.8s)

========== secrets.spec.ts ==========

Running 7 tests using 1 worker

[1/7] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/7] [chromium] › tests/secrets.spec.ts:44:3 › admin secrets › shows the empty state with no secrets
[3/7] [chromium] › tests/secrets.spec.ts:62:3 › admin secrets › disables the create button until name and value are provided
[4/7] [chromium] › tests/secrets.spec.ts:86:3 › admin secrets › creates a secret and lists it without ever exposing the value
[5/7] [chromium] › tests/secrets.spec.ts:113:3 › admin secrets › creating with an existing name rotates it rather than duplicating
[6/7] [chromium] › tests/secrets.spec.ts:141:3 › admin secrets › rotates a secret via the dialog without revealing the value
[7/7] [chromium] › tests/secrets.spec.ts:182:3 › admin secrets › deletes a secret only after confirming
  7 passed (20.2s)

========== slo.spec.ts ==========

Running 5 tests using 1 worker

[1/5] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/5] [chromium] › tests/slo.spec.ts:23:3 › SLOs › overview renders its empty state when no SLOs exist
[3/5] [chromium] › tests/slo.spec.ts:48:3 › SLOs › config page renders its empty state when no objectives exist
[4/5] [chromium] › tests/slo.spec.ts:65:3 › SLOs › create flow validates required system and target range
[5/5] [chromium] › tests/slo.spec.ts:165:3 › SLOs › overview lists the created SLO and links to its detail page
  5 passed (18.4s)

========== smoke.spec.ts ==========

Running 2 tests using 1 worker

[1/2] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/2] [chromium] › tests/smoke.spec.ts:11:3 › app shell › boots and renders the dashboard chrome
  2 passed (12.5s)

========== status-page.spec.ts ==========

Running 4 tests using 1 worker

[1/4] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/4] [chromium] › tests/status-page.spec.ts:31:3 › Status pages › list renders its empty state when no pages exist
[3/4] [chromium] › tests/status-page.spec.ts:43:3 › Status pages › an unpublished page is NOT served on the public route
[4/4] [chromium] › tests/status-page.spec.ts:68:3 › Status pages › operator builds, publishes, and the public page serves the content
  4 passed (19.4s)

========== theme.spec.ts ==========

Running 5 tests using 1 worker

[1/5] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/5] [chromium] › tests/theme.spec.ts:46:3 › theme / dark-mode switcher › the dark-mode switch lives in the user menu and reflects the applied theme
[3/5] [chromium] › tests/theme.spec.ts:68:3 › theme / dark-mode switcher › toggling to dark applies the `dark` class on <html>
[4/5] [chromium] › tests/theme.spec.ts:98:3 › theme / dark-mode switcher › toggling back to light reverts the `dark` class
[5/5] [chromium] › tests/theme.spec.ts:126:3 › theme / dark-mode switcher › the chosen theme persists across reload (localStorage + backend)
  5 passed (17.7s)

========== user-guide.spec.ts ==========

Running 5 tests using 1 worker

[1/5] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/5] [chromium] › tests/user-guide.spec.ts:10:3 › in-app user guide › serves the Starlight docs at /checkstack/user-guide/ (not a 404)
[3/5] [chromium] › tests/user-guide.spec.ts:24:3 › in-app user guide › a deep-linked docs page resolves in-app
[4/5] [chromium] › tests/user-guide.spec.ts:33:3 › in-app user guide › the sidebar Docs link targets the user guide
[5/5] [chromium] › tests/user-guide.spec.ts:45:3 › in-app user guide › an unknown /checkstack/ path returns the docs 404, not the SPA shell
  5 passed (13.0s)

================ summary ================
passed: 30/31
FAILED: catalog.spec.ts
[e2e] stopping ephemeral Postgres...
[e2e] teardown complete.

How to fix: These are the Playwright end-to-end tests. Reproduce locally with bun run --filter @checkstack/e2e test:e2e (it provisions an ephemeral Postgres via Testcontainers, so Docker must be running). Read the failing assertions and uploaded traces, then fix the implementation or the selectors so the flows pass. Do not weaken or skip the tests.

@enyineer The above code quality issues were found in this PR. Please fix them before merging.

@github-actions

Contributor

✅ All PR Checks Passed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ✅ Passed

@enyineer All quality checks have passed. This PR is ready for your review.

The catalog spec runs as a serial group with retries:2, but the e2e DB is reset
only per file boot, NOT per retry, and a serial group retries from the top. So a
flake in any later test (e.g. "edits a system name") re-ran the WHOLE group
against an already-populated catalog: the global empty-state assertions then
hard-failed ("No systems in the catalog yet" is gone), and creates would collide
on the fixed name suffix. That turned a single transient into a red E2E job.

Two changes make it retry-safe:
- Split the two read-only empty-state tests into catalog-empty.spec.ts. run-all
  boots a fresh, migration-empty DB per file and this file creates nothing, so
  the empty state holds on every attempt (a retry re-asserts against the same
  empty DB; the mutating spec runs in a separate invocation and can't pollute
  it).
- Key the mutating chain's created names to the retry attempt (`-r<n>`, via
  test.info().retry) so a group retry runs in its own namespace and never
  collides with the previous attempt's leftover rows. Drop the delete test's
  global "No systems yet" assertion (can't hold against retry leftovers; the
  empty-state file owns that check).

Verified structurally: playwright --list discovers all tests in both files; lint
and typecheck pass. Full behavioral verification is via the e2e CI run.
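
The retry-scoped naming can be sketched as a one-line helper. The helper name and the example base name are hypothetical; the `-r<n>` suffix and the use of `test.info().retry` come from the description above.

```typescript
// Hypothetical helper: suffix created names with the retry attempt so a group
// retry never collides with rows left behind by the previous attempt.
function retryScopedName(base: string, retry: number): string {
  return `${base}-r${retry}`;
}
```

Inside a Playwright test this would be called as `retryScopedName("e2e-system", test.info().retry)`, where `retry` is 0 on the first attempt and increments on each retry, so each attempt of the serial group writes into its own namespace.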

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

✅ All PR Checks Passed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ✅ Passed

@enyineer All quality checks have passed. This PR is ready for your review.

enyineer and others added 2 commits June 20, 2026 18:25
The E2E job ran all ~32 spec files serially on one runner (~12 min). run-all.ts
already supports round-robin sharding (CHECKSTACK_E2E_SHARD_INDEX/TOTAL +
selectShard); this wires it into the workflow as a 3-way matrix.

Each shard is an independent runner with its OWN ephemeral Postgres
(Testcontainers), booting one backend at a time - the proven single-Postgres /
one-backend-per-runner model is unchanged, so there's no new cross-test
contention; we only split the FILE list across runners, cutting test wall-clock
~linearly (verified 32 specs split 11/11/10, each spec exactly once).

- matrix shard [1,2,3], fail-fast:false; CHECKSTACK_E2E_SHARD_TOTAL uses
  ${{ strategy.job-total }} so the matrix size is the single source of truth.
- Per-shard artifact names (e2e-output-<n>, e2e-traces-<n>): v4+ artifacts
  reject duplicate names across parallel legs.
- report job: download e2e-output-* with merge-multiple; readOutput now
  concatenates all .txt in a job's artifact dir (single-output jobs unchanged,
  sharded E2E gets every shard's tail). needs.e2e.result already aggregates the
  matrix legs, so the pass/fail gate is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
…hards

Follow-up to the e2e sharding so nothing is hand-maintained and CI minutes
aren't wasted:

- Shard COUNT is now derived from the actual spec files. A tiny e2e_matrix job
  counts core/e2e/tests/*.spec.ts and emits a 1-based JSON shard array
  (~11 files/shard, capped at 5 runners); the e2e job consumes it via
  fromJSON(needs.e2e_matrix.outputs.shards). Adding/removing a spec needs no
  workflow edit. (The file LIST was already auto-discovered by run-all.ts; this
  removes the last hand-maintained literal, the shard count.) Portable array
  build via `seq | paste -sd,` to avoid the BSD `seq -s` trailing-comma quirk.

- Build the frontend + docs ONCE in a new e2e_build job and upload the two dist
  dirs the backend serves (core/frontend/dist, docs/dist) as an artifact; each
  shard DOWNLOADS it instead of rebuilding. Removes the per-shard build (the
  dominant cost) and drops git-LFS from the shards (images are baked into the
  built docs/dist). e2e now `needs: [e2e_matrix, e2e_build]`.

- report job: add e2e_matrix + e2e_build to needs and require both
  success/skipped, so a generator/build failure (which skips e2e) can't read as
  a false green.
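
The shard-count derivation in the first bullet can be sketched as follows. The workflow itself builds the array in shell; this TypeScript version only illustrates the arithmetic. `shardMatrix` is a hypothetical name, and rounding up with a minimum of one shard is an assumption, since the commit only states "~11 files/shard, capped at 5 runners".

```typescript
// Hypothetical illustration of the shard-count arithmetic (the workflow does
// this in shell). ASSUMPTIONS: round up, and never emit fewer than one shard.
function shardMatrix(
  specCount: number,
  filesPerShard = 11,
  maxShards = 5,
): number[] {
  const wanted = Math.ceil(specCount / filesPerShard);
  const count = Math.min(maxShards, Math.max(1, wanted));
  // 1-based shard ids, ready to serialize for fromJSON(...) in the matrix.
  return Array.from({ length: count }, (_, i) => i + 1);
}

console.log(JSON.stringify(shardMatrix(32))); // [1,2,3]
console.log(JSON.stringify(shardMatrix(100))); // [1,2,3,4,5]
```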

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

❌ PR Checks Failed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ❌ Failed
❌ E2E Failures
... (truncated 527 lines)
[29/11] (retries) [chromium] › tests/catalog.spec.ts:298:3 › Systems & Catalog › creates an environment and attaches a system to it (retry #2)
[30/11] (retries) [chromium] › tests/catalog.spec.ts:340:3 › Systems & Catalog › filtered browse shows a no-matches state with clear-filters (retry #2)
[31/11] (retries) [chromium] › tests/catalog.spec.ts:377:3 › Systems & Catalog › deletes a system with confirmation (retry #2)
  1 failed
    [chromium] › tests/catalog.spec.ts:224:3 › Systems & Catalog › edits a system name ─────────────
  1 flaky
    [chromium] › tests/catalog.spec.ts:115:3 › Systems & Catalog › creating a system adds it to management and browse 
  5 did not run
  4 passed (36.2s)

========== dependency.spec.ts ==========

Running 4 tests using 1 worker

[1/4] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/4] [chromium] › tests/dependency.spec.ts:64:3 › dependency map › renders the map page with its instructional header and graph toolbar
[3/4] [chromium] › tests/dependency.spec.ts:94:3 › dependency map › shows an empty graph when there are no systems
[4/4] [chromium] › tests/dependency.spec.ts:111:3 › dependency map › reflects a dependency created between two systems
  4 passed (17.6s)

========== incident.spec.ts ==========

Running 8 tests using 1 worker

[1/8] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/8] [chromium] › tests/incident.spec.ts:100:3 › incidents › shows the empty incidents state on a fresh database
[3/8] [chromium] › tests/incident.spec.ts:117:3 › incidents › validates that an incident requires at least one system
[4/8] [chromium] › tests/incident.spec.ts:159:3 › incidents › creates a system via the catalog so incidents can target it
[5/8] [chromium] › tests/incident.spec.ts:193:3 › incidents › creates an incident against the system
[6/8] [chromium] › tests/incident.spec.ts:220:3 › incidents › opens the incident detail page via the system history
[7/8] [chromium] › tests/incident.spec.ts:247:3 › incidents › resolves the incident from the detail page and reflects the new status
[8/8] [chromium] › tests/incident.spec.ts:269:3 › incidents › resolved incident is hidden by default and visible via 'Show resolved'
  8 passed (25.1s)

========== maintenance.spec.ts ==========

Running 9 tests using 1 worker

[1/9] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/9] [chromium] › tests/maintenance.spec.ts:59:3 › maintenance windows › creates the prerequisite system via the catalog UI
[3/9] [chromium] › tests/maintenance.spec.ts:84:3 › maintenance windows › resolves the created system's id from the catalog browse row
[4/9] [chromium] › tests/maintenance.spec.ts:109:3 › maintenance windows › shows the empty state before any maintenance exists
[5/9] [chromium] › tests/maintenance.spec.ts:124:3 › maintenance windows › validates required fields and end-before-start in the editor
[6/9] [chromium] › tests/maintenance.spec.ts:186:3 › maintenance windows › creates a maintenance window and lists it
[7/9] [chromium] › tests/maintenance.spec.ts:224:3 › maintenance windows › opens the detail page from the system history
[8/9] [chromium] › tests/maintenance.spec.ts:254:3 › maintenance windows › edits an existing maintenance window
[9/9] [chromium] › tests/maintenance.spec.ts:283:3 › maintenance windows › deletes a maintenance window with confirmation
  9 passed (21.5s)

========== permissions.spec.ts ==========

Running 5 tests using 1 worker

[1/5] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/5] [setup-member] › tests/member.setup.ts:14:1 › register a non-admin member
[3/5] [member] › tests/permissions.spec.ts:21:3 › UI permissions (non-admin member) › the member is signed in as themselves, not the admin
[4/5] [member] › tests/permissions.spec.ts:33:3 › UI permissions (non-admin member) › an admin-only route renders the Access Denied gate
[5/5] [member] › tests/permissions.spec.ts:48:3 › UI permissions (non-admin member) › admin-only navigation is not rendered for the member
  5 passed (15.6s)

========== queue.spec.ts ==========

Running 5 tests using 1 worker

[1/5] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/5] [chromium] › tests/queue.spec.ts:33:3 › Queue admin area › opens the Queue runtime panel from the infrastructure shell
[3/5] [chromium] › tests/queue.spec.ts:65:3 › Queue admin area › the runtime panel shows the instance-scope banner and count tiles
[4/5] [chromium] › tests/queue.spec.ts:93:3 › Queue admin area › the job-state sub-tabs render their listings
[5/5] [chromium] › tests/queue.spec.ts:131:3 › Queue admin area › the Queue tab also exposes the configuration sub-section
  5 passed (15.2s)

========== secrets.spec.ts ==========

Running 7 tests using 1 worker

[1/7] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/7] [chromium] › tests/secrets.spec.ts:44:3 › admin secrets › shows the empty state with no secrets
[3/7] [chromium] › tests/secrets.spec.ts:62:3 › admin secrets › disables the create button until name and value are provided
[4/7] [chromium] › tests/secrets.spec.ts:86:3 › admin secrets › creates a secret and lists it without ever exposing the value
[5/7] [chromium] › tests/secrets.spec.ts:113:3 › admin secrets › creating with an existing name rotates it rather than duplicating
[6/7] [chromium] › tests/secrets.spec.ts:141:3 › admin secrets › rotates a secret via the dialog without revealing the value
[7/7] [chromium] › tests/secrets.spec.ts:182:3 › admin secrets › deletes a secret only after confirming
  7 passed (19.2s)

========== status-page.spec.ts ==========

Running 4 tests using 1 worker

[1/4] [setup-admin] › tests/auth.setup.ts:15:1 › authenticate (onboard first admin)
[2/4] [chromium] › tests/status-page.spec.ts:31:3 › Status pages › list renders its empty state when no pages exist
[3/4] [chromium] › tests/status-page.spec.ts:43:3 › Status pages › an unpublished page is NOT served on the public route
[4/4] [chromium] › tests/status-page.spec.ts:68:3 › Status pages › operator builds, publishes, and the public page serves the content
  4 passed (17.2s)

================ summary ================
passed: 9/10
FAILED: catalog.spec.ts
[e2e] stopping ephemeral Postgres...
[e2e] teardown complete.

How to fix: These are the Playwright end-to-end tests. Reproduce locally with bun run --filter @checkstack/e2e test:e2e (it provisions an ephemeral Postgres via Testcontainers, so Docker must be running). Read the failing assertions and uploaded traces, then fix the implementation or the selectors so the flows pass. Do not weaken or skip the tests.

@enyineer The above code quality issues were found in this PR. Please fix them before merging.

…ety)

The catalog retry-safety fix keyed system/group/env NAMES to the retry attempt
but left SYSTEM_DESCRIPTION a constant. On a serial-group retry the management
table lists every system - including the previous attempt's leftover row - so a
shared description matched two rows: `getByText(SYSTEM_DESCRIPTION)` tripped
strict mode ("resolved to 2 elements") and failed E2E shard 3. The local run
never retried, so it didn't surface.

Make the description per-attempt (`-r<n>`) like the names, so every value an
assertion matches on is unique to the attempt and a retry's leftover rows can't
collide. Audited all getByText/name assertions in the spec: the description was
the only remaining fixed data value.
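
The collision can be reproduced with a toy model of a strict locator. This is a sketch, not Playwright's implementation: `resolveStrict` and the row strings are hypothetical, and the only behavior modeled is that a strict match must resolve to exactly one element.

```typescript
// Toy model of a strict locator (hypothetical, not Playwright's code):
// a match must resolve to exactly one row, otherwise it throws.
function resolveStrict(rows: readonly string[], text: string): string {
  const matches = rows.filter((row) => row.includes(text));
  const [only] = matches;
  if (matches.length !== 1 || only === undefined) {
    throw new Error(
      `strict mode violation: "${text}" resolved to ${matches.length} elements`,
    );
  }
  return only;
}

// After a serial-group retry, attempt 0's leftover row is still listed.
// Names are per-attempt, but the description is shared, so it matches twice.
const sharedRows = ["sys-r0 | E2E description", "sys-r1 | E2E description"];
try {
  resolveStrict(sharedRows, "E2E description");
} catch (err) {
  console.log((err as Error).message);
}

// With a per-attempt description, the retry's value matches only its own row.
const scopedRows = ["sys-r0 | E2E description-r0", "sys-r1 | E2E description-r1"];
console.log(resolveStrict(scopedRows, "E2E description-r1"));
```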

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

✅ All PR Checks Passed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ✅ Passed

@enyineer All quality checks have passed. This PR is ready for your review.

…tempt

Makes every spec retry-safe by construction and ends the per-spec whack-a-mole
(empty-state-first + serial-mutate specs - catalog, incident, maintenance,
secrets, status-page - all shared the same latent fragility).

Root cause: the e2e DB is reset per FILE boot, not per Playwright retry, and a
serial group retries from the top. So in-process retries re-ran against the
previous attempt's polluted DB - global empty-state assertions failed, and
fixed names/descriptions collided with leftover rows.

Move retries from Playwright (same DB) to run-all at the FILE level: set
Playwright retries:0, and on a spec failure re-run the whole `playwright test
<file>` invocation (up to 3 attempts in CI). Each invocation re-boots the
backend (webServer reuseExistingServer:false), which DROP/CREATEs the e2e DB, so
every attempt starts from a fresh, empty, migration-reset database - the serial
chain simply re-runs clean. A trace is captured only on a retry, so the happy
path keeps no per-test tracing overhead.

Verified locally with an induced flake: attempt 1 fails (no in-process retry),
run-all re-boots, attempt 2 passes -> "all spec files green".
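
A minimal sketch of the file-level retry loop, assuming the runner is injected so the loop can be exercised without spawning Playwright. `runSpecWithFileRetries` and `RunOnce` are hypothetical names; in run-all each call to the runner would correspond to one full `playwright test <file>` invocation, which is what re-boots the backend and resets the database.

```typescript
// Hypothetical runner type: resolves true when one whole-file invocation passes.
// In run-all this would spawn `playwright test <file>`; each spawn re-boots the
// backend, so every attempt starts from a fresh database.
type RunOnce = (file: string, attempt: number) => Promise<boolean>;

// Retry at the FILE level: re-run the whole invocation, never a single test.
// Resolves with the attempt number that passed, or rejects after maxAttempts.
async function runSpecWithFileRetries(
  file: string,
  runOnce: RunOnce,
  maxAttempts = 3,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await runOnce(file, attempt)) return attempt;
  }
  throw new Error(`${file} failed after ${maxAttempts} attempts`);
}

// Induced flake: the first attempt fails, the second passes.
const inducedFlake: RunOnce = async (_file, attempt) => attempt >= 2;
runSpecWithFileRetries("tests/catalog.spec.ts", inducedFlake).then((attempt) =>
  console.log(`passed on attempt ${attempt}`),
);
```

Injecting the runner keeps the retry policy separate from process spawning, so the policy can be unit-tested with a fake like `inducedFlake`.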

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

✅ All PR Checks Passed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ✅ Passed

@enyineer All quality checks have passed. This PR is ready for your review.

…B per attempt"

This reverts commit b549ae7.

The file-level retry was unnecessary machinery: the suite was already green on
the previous commit (the empty-state split + per-attempt naming) using
Playwright's built-in retries. Retrying harder masks transient flakiness rather
than fixing it - the real fix is the per-spec robustness (test isolation +
idempotent assertions), which removes the DETERMINISTIC fragility. Keep
Playwright's standard CI retries as the thin safety net for genuine transient
browser races; do not re-run whole files on a fresh DB.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

✅ All PR Checks Passed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ✅ Passed

@enyineer All quality checks have passed. This PR is ready for your review.

enyineer and others added 2 commits June 20, 2026 19:52
Audit showed the retry-fragility is suite-wide (~15 mutating serial specs), not
a handful: a serial group retries from the top against a DB reset only per file
boot, so empty-state assertions and fixed-value matches collide with the prior
attempt's leftover rows. Hardening each spec by hand (extract empty-state file +
per-attempt naming) is whack-a-mole that every future spec would also need.

Reinstate the structural fix instead (reverts the earlier revert 9592c2b): set
Playwright retries:0 and retry a failed spec at the FILE level in run-all (3
attempts in CI). Each invocation re-boots the backend, which DROP/CREATEs the
e2e DB - so every retry starts from a fresh, empty, migration-reset database.
This makes the retries we already keep honor the suite's own per-file-fresh-DB
design, fixing all specs (and future ones) uniformly. It is not "retry harder" -
it gives each retry a clean slate, which is the actual root-cause fix. Verified
locally with an induced flake: attempt 1 fails, run-all re-boots, attempt 2
passes.

(catalog.spec keeps its per-test idempotency from the earlier commits as
harmless defense-in-depth; no other spec needs per-spec changes now.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
… it)

With the file-level retry giving every attempt a fresh DB, the per-spec
robustness added to catalog earlier (empty-state split into catalog-empty.spec
+ per-attempt naming) is no longer needed. Restore catalog.spec.ts to its
original inline form and remove catalog-empty.spec.ts, so the suite has ONE
retry-safety mechanism (fresh DB per attempt) instead of a mix - nothing for
future specs to cargo-cult.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

✅ All PR Checks Passed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ✅ Passed
Security ✅ Passed
E2E ✅ Passed

@enyineer All quality checks have passed. This PR is ready for your review.

Each spec file boots the backend, which re-ran ALL migrations (~100+ across ~25
plugin schemas) on every boot because the reset created an EMPTY database. Build
the migrated schema ONCE per run and clone it instead.

- template-db.ts: build the template by booting the REAL backend once against an
  empty DB (the exact production migration path + idempotent role/access-rule
  seeding), wait for readiness, stop it, and drain its connections so the
  template can be a CREATE DATABASE ... TEMPLATE source. Built from current
  migrations every run -> drift-proof, no checked-in dump. No admin user is
  seeded, so per-file onboarding is unchanged.
- with-e2e-postgres.ts: build the template once after Postgres is up, before the
  spec loop (inside the try, so a build failure still tears the container down
  and fails loudly).
- start-e2e-server.ts: reset by `CREATE DATABASE ... TEMPLATE` when the template
  exists (file copy -> boot-time migrations no-op), falling back to empty-create
  + migrate when it doesn't (direct test:e2e:file runs).
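
The reset decision in start-e2e-server.ts can be sketched as a function that picks the SQL to run. This is an illustration under assumptions: `resetStatements`, `quoteIdent`, and the database names are hypothetical, and the actual file is not shown here. `CREATE DATABASE ... TEMPLATE` is standard PostgreSQL; it requires that no other session is connected to the template while it is copied, which is why the template build drains its connections.

```typescript
// Quote a Postgres identifier by doubling any embedded double quotes.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Hypothetical sketch of the per-file reset: clone the migrated template when
// one exists, otherwise fall back to an empty database that the backend will
// migrate on boot.
function resetStatements(db: string, template: string | undefined): string[] {
  const drop = `DROP DATABASE IF EXISTS ${quoteIdent(db)}`;
  const create =
    template === undefined
      ? `CREATE DATABASE ${quoteIdent(db)}`
      : `CREATE DATABASE ${quoteIdent(db)} TEMPLATE ${quoteIdent(template)}`;
  return [drop, create];
}

// Database names below are placeholders, not the project's real names.
const statements = resetStatements("checkstack_e2e", "checkstack_e2e_template");
for (const sql of statements) console.log(sql);
```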

Verified locally: template builds in ~3s, catalog spec passes through the clone
path. Green CI proves the path is active (build failure would fail loudly).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

❌ PR Checks Failed

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ❌ Failed
Security ✅ Passed
E2E ✅ Passed
❌ Integration Test Failures
... (truncated 127 lines)
(skip) MCP Streamable-HTTP conformance > tools/list never lists an out-of-scope tool
(skip) MCP Streamable-HTTP conformance > tools/call for an out-of-scope tool is REFUSED 403 (not merely hidden)
(skip) MCP Streamable-HTTP conformance > tools/call for a mutating tool is refused by the structural effect-gate

::endgroup::

::group::core/backend/src/public-host/routing.e2e.it.test.ts:
(pass) custom-domain host (status.fake.test) — locked down > 404s an admin data endpoint [5.97ms]
(pass) custom-domain host (status.fake.test) — locked down > 404s REST and platform endpoints [0.58ms]
(pass) custom-domain host (status.fake.test) — locked down > allows the single public read [0.42ms]
(pass) custom-domain host (status.fake.test) — locked down > /api/config returns the custom origin + publicHost (never the admin origin) [0.41ms]
(pass) custom-domain host (status.fake.test) — locked down > serves the PUBLIC bundle for navigational routes [0.57ms]
(pass) admin host (admin.fake.test) — unaffected > admin data endpoint is reachable [0.26ms]
(pass) admin host (admin.fake.test) — unaffected > serves the ADMIN bundle and admin config [0.73ms]
(pass) unknown host — no regression (admin behavior) > is not locked down [0.57ms]

::endgroup::

::group::core/catalog-backend/src/services/entity-service.it.test.ts:
(pass) EntityService (real Postgres) > getSystemByName (case-insensitive uniqueness lookup) > matches regardless of case so 'Api' collides with 'api' [3.60ms]
(pass) EntityService (real Postgres) > getSystemByName (case-insensitive uniqueness lookup) > returns undefined when the name is free [1.07ms]
(pass) EntityService (real Postgres) > removeContact (compound id + systemId scoping) > does not delete a contact when the systemId does not match [7.30ms]
(pass) EntityService (real Postgres) > removeLink (compound id + systemId scoping) > does not delete a link when the systemId does not match [6.67ms]

::endgroup::

::group::core/automation-backend/src/dispatch/stage1.it.test.ts:
(pass) Stage-1 routing exactly-once (real Redis) > one ENTITY_CHANGED-style job runs the routing handler exactly once across two workers [1525.45ms]

::endgroup::

::group::core/automation-backend/src/dispatch/dwell.it.test.ts:
(pass) dwell-store atomic claim (real Postgres) > two concurrent delete(id) calls → exactly one returns a row [20.78ms]

::endgroup::

::group::core/automation-backend/src/dispatch/stage2-stalled.it.test.ts:
(pass) Stage-2 stalled redelivery (real Redis) > a dead worker's job is redelivered to another worker and completed once [2074.41ms]

::endgroup::

::group::core/automation-backend/src/entity/wake-index.it.test.ts:
(pass) wake-index arm race + intersection lookup (real Postgres) > intersection lookup returns the owning until-lock for a concrete ref [20.89ms]
(pass) wake-index arm race + intersection lookup (real Postgres) > matches a kind-level wildcard wait [4.38ms]
(pass) wake-index arm race + intersection lookup (real Postgres) > concurrent same-(lock, ref) inserts leave exactly one row [26.99ms]

::endgroup::

::group::core/automation-backend/src/entity/cross-pod-read-consistency.it.test.ts:
(pass) cross-pod reactive-entity read consistency (real Postgres) > durable kind: a write on pod A is visible to pod B's read + getMany [22.87ms]
(pass) cross-pod reactive-entity read consistency (real Postgres) > NEGATIVE CONTROL: a pod-local read does NOT see another pod's write (proves teeth) [5.78ms]

::endgroup::

::group::core/backend-api/src/script-sandbox/rootless-egress.it.test.ts:
(skip) rootless egress (real slirp4netns) > delivers filtered egress: blocks a non-allowlisted destination
(skip) rootless egress (real slirp4netns) > the network decision picks the rootless path on this host

::endgroup::

::group::core/backend-api/src/script-sandbox/forkbomb.it.test.ts:
(skip) per-run fork-bomb containment (real bwrap) > caps a shell fork bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > caps an ESM spawn-loop bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > still runs a benign script to success under the same fail-closed default

::endgroup::

::group::core/backend/src/services/plugin-installers/install-from-tarball.it.test.ts:
(pass) installBundleFromArtifacts (real bun install) > resolves an intra-bundle sibling dependency without a registry [34.63ms]

::endgroup::

16 tests skipped:
(skip) external plugin install (real instance + UI) > (unnamed)
(skip) external plugin install (real instance + UI) > installs the packaged plugin via the UI; frontend + backend + core plugins load
(skip) external plugin install (real instance + UI) > (unnamed)
(skip) MCP Streamable-HTTP conformance > initialize advertises a protocol version and a session id
(skip) MCP Streamable-HTTP conformance > initialize echoes a negotiated protocol version
(skip) MCP Streamable-HTTP conformance > tools/list WITHOUT a session id is refused (session enforced, not cosmetic)
(skip) MCP Streamable-HTTP conformance > tools/list returns the read-only tool surface
(skip) MCP Streamable-HTTP conformance > tools/call returns a non-error content block
(skip) MCP Streamable-HTTP conformance > tools/list never lists an out-of-scope tool
(skip) MCP Streamable-HTTP conformance > tools/call for an out-of-scope tool is REFUSED 403 (not merely hidden)
(skip) MCP Streamable-HTTP conformance > tools/call for a mutating tool is refused by the structural effect-gate
(skip) rootless egress (real slirp4netns) > delivers filtered egress: blocks a non-allowlisted destination
(skip) rootless egress (real slirp4netns) > the network decision picks the rootless path on this host
(skip) per-run fork-bomb containment (real bwrap) > caps a shell fork bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > caps an ESM spawn-loop bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > still runs a benign script to success under the same fail-closed default


1 tests failed:
(fail) external plugin lifecycle (published tarballs) > installs from the local registry [35553.00ms]

 62 pass
 16 skip
 1 fail
 209 expect() calls
Ran 79 tests across 24 files. [70.10s]

How to fix: These are the real-services integration tests (*.it.test.ts). To reproduce locally, start the dev services with docker compose -f docker-compose-dev.yml up -d postgres redis, then run CHECKSTACK_IT=1 bun test it.test. Read the failing assertions and fix the implementation so the tests pass against real Postgres/Redis. Do not weaken or skip the tests.

@enyineer The above code quality issues were found in this PR. Please fix them before merging.

@github-actions

Contributor

❌ PR Checks Failed

⚠️ Escalation: Automated fixes have not resolved the issues after 3 attempts. Manual intervention is required.

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ❌ Failed
Security ✅ Passed
E2E ✅ Passed
❌ Integration Test Failures
... (truncated 167 lines)
(skip) MCP Streamable-HTTP conformance > tools/call for a mutating tool is refused by the structural effect-gate

::endgroup::

::group::core/backend/src/public-host/routing.e2e.it.test.ts:
(pass) custom-domain host (status.fake.test) — locked down > 404s an admin data endpoint [3.97ms]
(pass) custom-domain host (status.fake.test) — locked down > 404s REST and platform endpoints [0.54ms]
(pass) custom-domain host (status.fake.test) — locked down > allows the single public read [0.41ms]
(pass) custom-domain host (status.fake.test) — locked down > /api/config returns the custom origin + publicHost (never the admin origin) [0.51ms]
(pass) custom-domain host (status.fake.test) — locked down > serves the PUBLIC bundle for navigational routes [0.53ms]
(pass) admin host (admin.fake.test) — unaffected > admin data endpoint is reachable [0.32ms]
(pass) admin host (admin.fake.test) — unaffected > serves the ADMIN bundle and admin config [1.83ms]
(pass) unknown host — no regression (admin behavior) > is not locked down [0.58ms]

::endgroup::

::group::core/catalog-backend/src/services/entity-service.it.test.ts:
(pass) EntityService (real Postgres) > getSystemByName (case-insensitive uniqueness lookup) > matches regardless of case so 'Api' collides with 'api' [4.75ms]
(pass) EntityService (real Postgres) > getSystemByName (case-insensitive uniqueness lookup) > returns undefined when the name is free [1.11ms]
(pass) EntityService (real Postgres) > removeContact (compound id + systemId scoping) > does not delete a contact when the systemId does not match [7.91ms]
(pass) EntityService (real Postgres) > removeLink (compound id + systemId scoping) > does not delete a link when the systemId does not match [7.65ms]

::endgroup::

::group::core/automation-backend/src/dispatch/stage1.it.test.ts:
(pass) Stage-1 routing exactly-once (real Redis) > one ENTITY_CHANGED-style job runs the routing handler exactly once across two workers [1528.55ms]

::endgroup::

::group::core/automation-backend/src/dispatch/dwell.it.test.ts:
(pass) dwell-store atomic claim (real Postgres) > two concurrent delete(id) calls → exactly one returns a row [19.63ms]

::endgroup::

::group::core/automation-backend/src/dispatch/stage2-stalled.it.test.ts:
(pass) Stage-2 stalled redelivery (real Redis) > a dead worker's job is redelivered to another worker and completed once [2068.65ms]

::endgroup::

::group::core/automation-backend/src/entity/wake-index.it.test.ts:
(pass) wake-index arm race + intersection lookup (real Postgres) > intersection lookup returns the owning until-lock for a concrete ref [20.01ms]
(pass) wake-index arm race + intersection lookup (real Postgres) > matches a kind-level wildcard wait [5.00ms]
(pass) wake-index arm race + intersection lookup (real Postgres) > concurrent same-(lock, ref) inserts leave exactly one row [24.02ms]

::endgroup::

::group::core/automation-backend/src/entity/cross-pod-read-consistency.it.test.ts:
(pass) cross-pod reactive-entity read consistency (real Postgres) > durable kind: a write on pod A is visible to pod B's read + getMany [20.88ms]
(pass) cross-pod reactive-entity read consistency (real Postgres) > NEGATIVE CONTROL: a pod-local read does NOT see another pod's write (proves teeth) [3.48ms]

::endgroup::

::group::core/backend-api/src/script-sandbox/rootless-egress.it.test.ts:
(skip) rootless egress (real slirp4netns) > delivers filtered egress: blocks a non-allowlisted destination
(skip) rootless egress (real slirp4netns) > the network decision picks the rootless path on this host

::endgroup::

::group::core/backend-api/src/script-sandbox/forkbomb.it.test.ts:
(skip) per-run fork-bomb containment (real bwrap) > caps a shell fork bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > caps an ESM spawn-loop bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > still runs a benign script to success under the same fail-closed default

::endgroup::

::group::core/backend/src/services/plugin-installers/install-from-tarball.it.test.ts:
(pass) installBundleFromArtifacts (real bun install) > resolves an intra-bundle sibling dependency without a registry [33.50ms]

::endgroup::

16 tests skipped:
(skip) external plugin install (real instance + UI) > (unnamed)
(skip) external plugin install (real instance + UI) > installs the packaged plugin via the UI; frontend + backend + core plugins load
(skip) external plugin install (real instance + UI) > (unnamed)
(skip) MCP Streamable-HTTP conformance > initialize advertises a protocol version and a session id
(skip) MCP Streamable-HTTP conformance > initialize echoes a negotiated protocol version
(skip) MCP Streamable-HTTP conformance > tools/list WITHOUT a session id is refused (session enforced, not cosmetic)
(skip) MCP Streamable-HTTP conformance > tools/list returns the read-only tool surface
(skip) MCP Streamable-HTTP conformance > tools/call returns a non-error content block
(skip) MCP Streamable-HTTP conformance > tools/list never lists an out-of-scope tool
(skip) MCP Streamable-HTTP conformance > tools/call for an out-of-scope tool is REFUSED 403 (not merely hidden)
(skip) MCP Streamable-HTTP conformance > tools/call for a mutating tool is refused by the structural effect-gate
(skip) rootless egress (real slirp4netns) > delivers filtered egress: blocks a non-allowlisted destination
(skip) rootless egress (real slirp4netns) > the network decision picks the rootless path on this host
(skip) per-run fork-bomb containment (real bwrap) > caps a shell fork bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > caps an ESM spawn-loop bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > still runs a benign script to success under the same fail-closed default


3 tests failed:
(fail) external plugin lifecycle (published tarballs) > installs from the local registry [33404.69ms]
(fail) external plugin lifecycle (published tarballs) > validates, packs, and bundles (workspace rewrite is a no-op for @checkstack deps) [5.36ms]
(fail) external plugin lifecycle (published tarballs) > boots the dev server and serves POST /api/<pluginId>/* == 200 with a JSON array [1006.34ms]

 60 pass
 16 skip
 3 fail
 200 expect() calls
Ran 79 tests across 24 files. [47.37s]

How to fix: These are the real-services integration tests (*.it.test.ts). To reproduce locally, start the dev services with docker compose -f docker-compose-dev.yml up -d postgres redis, then run CHECKSTACK_IT=1 bun test it.test. Read the failing assertions and fix the implementation so the tests pass against real Postgres/Redis. Do not weaken or skip the tests.

@enyineer The above code quality issues were found in this PR. Automated fixes have not resolved them after 3 attempts. Manual intervention is required.

…le resets"

This reverts commit 4e3202c.

Measured against the no-template baseline, the template clone gave no reliable
CI speedup (within run-to-run noise): migrations were never the bottleneck - the
per-file FULL backend boot (initializing ~50 plugins to readiness) + onboarding
dominate the ~24s/file, and the template only removes the small migration slice.
Per the decision, drop the template-DB complexity and pursue boot-once (boot the
backend once + isolate test data per worker) as the real lever instead. Also
corrects a stale changeset entry that still described catalog's reverted
per-spec retry-safety.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

❌ PR Checks Failed

⚠️ Escalation: Automated fixes have not resolved the issues after 3 attempts. Manual intervention is required.

Check Status
Typecheck ✅ Passed
Lint ✅ Passed
Deps ✅ Passed
Test ✅ Passed
Integration ❌ Failed
Security ✅ Passed
E2E ✅ Passed
❌ Integration Test Failures
... (truncated 164 lines)
(skip) MCP Streamable-HTTP conformance > tools/call for a mutating tool is refused by the structural effect-gate

::endgroup::

::group::core/backend/src/public-host/routing.e2e.it.test.ts:
(pass) custom-domain host (status.fake.test) — locked down > 404s an admin data endpoint [3.52ms]
(pass) custom-domain host (status.fake.test) — locked down > 404s REST and platform endpoints [0.43ms]
(pass) custom-domain host (status.fake.test) — locked down > allows the single public read [0.32ms]
(pass) custom-domain host (status.fake.test) — locked down > /api/config returns the custom origin + publicHost (never the admin origin) [0.40ms]
(pass) custom-domain host (status.fake.test) — locked down > serves the PUBLIC bundle for navigational routes [0.42ms]
(pass) admin host (admin.fake.test) — unaffected > admin data endpoint is reachable [0.23ms]
(pass) admin host (admin.fake.test) — unaffected > serves the ADMIN bundle and admin config [0.41ms]
(pass) unknown host — no regression (admin behavior) > is not locked down [0.38ms]

::endgroup::

::group::core/catalog-backend/src/services/entity-service.it.test.ts:
(pass) EntityService (real Postgres) > getSystemByName (case-insensitive uniqueness lookup) > matches regardless of case so 'Api' collides with 'api' [3.32ms]
(pass) EntityService (real Postgres) > getSystemByName (case-insensitive uniqueness lookup) > returns undefined when the name is free [0.88ms]
(pass) EntityService (real Postgres) > removeContact (compound id + systemId scoping) > does not delete a contact when the systemId does not match [7.42ms]
(pass) EntityService (real Postgres) > removeLink (compound id + systemId scoping) > does not delete a link when the systemId does not match [4.61ms]

::endgroup::

::group::core/automation-backend/src/dispatch/stage1.it.test.ts:
(pass) Stage-1 routing exactly-once (real Redis) > one ENTITY_CHANGED-style job runs the routing handler exactly once across two workers [1522.09ms]

::endgroup::

::group::core/automation-backend/src/dispatch/dwell.it.test.ts:
(pass) dwell-store atomic claim (real Postgres) > two concurrent delete(id) calls → exactly one returns a row [24.23ms]

::endgroup::

::group::core/automation-backend/src/dispatch/stage2-stalled.it.test.ts:
(pass) Stage-2 stalled redelivery (real Redis) > a dead worker's job is redelivered to another worker and completed once [2065.75ms]

::endgroup::

::group::core/automation-backend/src/entity/wake-index.it.test.ts:
(pass) wake-index arm race + intersection lookup (real Postgres) > intersection lookup returns the owning until-lock for a concrete ref [16.57ms]
(pass) wake-index arm race + intersection lookup (real Postgres) > matches a kind-level wildcard wait [3.82ms]
(pass) wake-index arm race + intersection lookup (real Postgres) > concurrent same-(lock, ref) inserts leave exactly one row [18.91ms]

::endgroup::

::group::core/automation-backend/src/entity/cross-pod-read-consistency.it.test.ts:
(pass) cross-pod reactive-entity read consistency (real Postgres) > durable kind: a write on pod A is visible to pod B's read + getMany [20.26ms]
(pass) cross-pod reactive-entity read consistency (real Postgres) > NEGATIVE CONTROL: a pod-local read does NOT see another pod's write (proves teeth) [2.76ms]

::endgroup::

::group::core/backend-api/src/script-sandbox/rootless-egress.it.test.ts:
(skip) rootless egress (real slirp4netns) > delivers filtered egress: blocks a non-allowlisted destination
(skip) rootless egress (real slirp4netns) > the network decision picks the rootless path on this host

::endgroup::

::group::core/backend-api/src/script-sandbox/forkbomb.it.test.ts:
(skip) per-run fork-bomb containment (real bwrap) > caps a shell fork bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > caps an ESM spawn-loop bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > still runs a benign script to success under the same fail-closed default

::endgroup::

::group::core/backend/src/services/plugin-installers/install-from-tarball.it.test.ts:
(pass) installBundleFromArtifacts (real bun install) > resolves an intra-bundle sibling dependency without a registry [27.33ms]

::endgroup::

16 tests skipped:
(skip) external plugin install (real instance + UI) > (unnamed)
(skip) external plugin install (real instance + UI) > installs the packaged plugin via the UI; frontend + backend + core plugins load
(skip) external plugin install (real instance + UI) > (unnamed)
(skip) MCP Streamable-HTTP conformance > initialize advertises a protocol version and a session id
(skip) MCP Streamable-HTTP conformance > initialize echoes a negotiated protocol version
(skip) MCP Streamable-HTTP conformance > tools/list WITHOUT a session id is refused (session enforced, not cosmetic)
(skip) MCP Streamable-HTTP conformance > tools/list returns the read-only tool surface
(skip) MCP Streamable-HTTP conformance > tools/call returns a non-error content block
(skip) MCP Streamable-HTTP conformance > tools/list never lists an out-of-scope tool
(skip) MCP Streamable-HTTP conformance > tools/call for an out-of-scope tool is REFUSED 403 (not merely hidden)
(skip) MCP Streamable-HTTP conformance > tools/call for a mutating tool is refused by the structural effect-gate
(skip) rootless egress (real slirp4netns) > delivers filtered egress: blocks a non-allowlisted destination
(skip) rootless egress (real slirp4netns) > the network decision picks the rootless path on this host
(skip) per-run fork-bomb containment (real bwrap) > caps a shell fork bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > caps an ESM spawn-loop bomb and keeps the supervisor alive
(skip) per-run fork-bomb containment (real bwrap) > still runs a benign script to success under the same fail-closed default


3 tests failed:
(fail) external plugin lifecycle (published tarballs) > installs from the local registry [29471.67ms]
(fail) external plugin lifecycle (published tarballs) > validates, packs, and bundles (workspace rewrite is a no-op for @checkstack deps) [4.21ms]
(fail) external plugin lifecycle (published tarballs) > boots the dev server and serves POST /api/<pluginId>/* == 200 with a JSON array [1004.03ms]

 60 pass
 16 skip
 3 fail
 200 expect() calls
Ran 79 tests across 24 files. [43.39s]

How to fix: These are the real-services integration tests (`*.it.test.ts`). To reproduce locally, start the dev services with `docker compose -f docker-compose-dev.yml up -d postgres redis`, then run `CHECKSTACK_IT=1 bun test it.test`. Read the failing assertions and fix the implementation so the tests pass against real Postgres/Redis. Do not weaken or skip the tests.

@enyineer The above code quality issues were found in this PR. Automated fixes have not resolved them after 3 attempts. Manual intervention is required.

The old harness (run-all.ts) rebooted the backend and reset the DB once PER SPEC
FILE and ran files serially - ~24s/file of pure reboot overhead. A measured PoC
showed the per-file reboot, not migrations, was the bottleneck (the template-DB
approach moved nothing). So boot the backend ONCE per run/shard and run every
spec in PARALLEL against one shared DB.

This works because every spec is now DATA-ISOLATED:
- Each spec namespaces the entities it creates (`const NS = ...`; unique suffix),
  so parallel specs sharing the DB never collide.
- No spec asserts global / whole-DB state (empty lists, global counts).
- Onboarding / "fresh install" empty-state assertions moved to a dedicated
  PRISTINE phase: `*.empty.spec.ts` in an `empty-state` Playwright project that
  the data specs depend on, so it runs first on the clean DB. dashboard, ai,
  notification, infrastructure, queue, gitops became `.empty` specs; the
  per-domain empties deleted during isolation are reconstructed in
  onboarding.empty.spec.ts.
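
The per-spec namespacing described above could look roughly like the sketch
below. The helper names (`makeNamespace`, `scoped`) and the suffix format are
hypothetical and only illustrate the `const NS = ...` pattern; the real specs
may build their suffix differently.

```typescript
// Hypothetical helpers; names and suffix layout are assumptions, not repo code.
let sequence = 0;

/** Builds a namespace that is unique per call, e.g. "catalog-lx3k2a-0-9f2c1d". */
function makeNamespace(specName: string): string {
  const time = Date.now().toString(36);      // differs between runs
  const seq = (sequence++).toString(36);     // differs within one process
  const rand = Math.random().toString(36).slice(2, 8); // differs across workers
  return `${specName}-${time}-${seq}-${rand}`;
}

/** Prefixes an entity name so two specs never create the same row. */
function scoped(ns: string, name: string): string {
  return `${ns}-${name}`;
}

// Typical use at the top of a spec file:
const NS = makeNamespace("catalog");
const systemName = scoped(NS, "payments-api");
```

A spec that only creates and queries names carrying its own `NS` prefix has no
reason to assert on whole-DB state, which is what makes the shared database safe.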

Harness:
- playwright.config.ts: setup-admin -> empty-state -> chromium (parallel) ->
  member, with fullyParallel + workers.
- with-e2e-postgres.ts: runs `playwright test` once; forwards `--shard=i/N`.
- CI e2e job shards with Playwright's NATIVE --shard (matrix size = job-total).
- Because data-isolated specs make in-process retries safe again, the file-level
  retry runner is retired: run-all.ts, shard.ts, shard.test.ts, and the PoC
  scaffolding are removed.
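
As an illustration of the project chain listed above, a Playwright config
could be shaped like this. The project names come from this commit message;
the `testMatch` / `testIgnore` patterns, the `.setup.ts` and `.member.spec.ts`
file suffixes, and the worker count are assumptions, so treat it as a sketch
rather than the actual playwright.config.ts.

```typescript
// Sketch only: file-name patterns and worker count are assumed.
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  fullyParallel: true, // run tests within a file in parallel as well
  workers: 4,          // assumed; matches the local verification run
  projects: [
    // Creates the admin user once, before anything else.
    { name: "setup-admin", testMatch: /.*\.setup\.ts/ },
    {
      // PRISTINE phase: empty-state assertions against the clean DB.
      name: "empty-state",
      testMatch: /.*\.empty\.spec\.ts/,
      dependencies: ["setup-admin"],
    },
    {
      // Data-isolated specs; start only after the empty-state project passes.
      name: "chromium",
      use: { ...devices["Desktop Chrome"] },
      testIgnore: /.*\.(empty|member)\.spec\.ts/,
      dependencies: ["empty-state"],
    },
    {
      name: "member",
      testMatch: /.*\.member\.spec\.ts/,
      dependencies: ["chromium"],
    },
  ],
});
```

Project `dependencies` are what enforce the ordering: Playwright runs a
project only after the projects it depends on have completed successfully.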

Verified: the full suite (168 tests, 34 files) passes locally in boot-once mode
at workers=4 in ~80s on a single machine, versus minutes per shard before.
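
The `--shard=i/N` forwarding mentioned in the harness notes might be done
along these lines. `buildPlaywrightArgs` is a hypothetical function, not the
contents of with-e2e-postgres.ts; it only shows passing a validated shard
selector through to a single `playwright test` invocation.

```typescript
// Hypothetical sketch; the function name and argv layout are assumptions.

/** Returns the argv for one `playwright test` run, forwarding any --shard=i/N. */
function buildPlaywrightArgs(cliArgs: string[]): string[] {
  const shardPattern = /^--shard=(\d+)\/(\d+)$/;
  const args = ["playwright", "test"];

  for (const arg of cliArgs) {
    const match = shardPattern.exec(arg);
    if (!match) continue; // ignore everything that is not a shard selector

    const index = Number(match[1]);
    const total = Number(match[2]);
    // Playwright shard indices are 1-based: 1 <= i <= N.
    if (index < 1 || index > total) {
      throw new Error(`invalid shard "${arg}": expected 1 <= i <= N`);
    }
    args.push(arg);
  }
  return args;
}
```

In CI, each matrix job would pass its own index, for example `--shard=2/4`,
so the matrix size and the shard total stay the same number.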

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XVkYp7R1AtNoBziSNUpTQE
@github-actions

Contributor

✅ All PR Checks Passed

| Check | Status |
| --- | --- |
| Typecheck | ✅ Passed |
| Lint | ✅ Passed |
| Deps | ✅ Passed |
| Test | ✅ Passed |
| Integration | ✅ Passed |
| Security | ✅ Passed |
| E2E | ✅ Passed |

@enyineer All quality checks have passed. This PR is ready for your review.

@enyineer enyineer merged commit 8cad340 into main Jun 21, 2026
17 checks passed