Skip to content

feat: durable execution model#11

Merged
raisedadead merged 78 commits into
mainfrom
refactor/durable-exec
Jun 4, 2026
Merged

feat: durable execution model#11
raisedadead merged 78 commits into
mainfrom
refactor/durable-exec

Conversation

@raisedadead
Copy link
Copy Markdown
Member

@raisedadead raisedadead commented Jun 3, 2026

No description provided.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a more durable execution model by adding Postgres-backed state (deploy/alias metadata, outbox, tombstones, repo-request queue) and wiring it into a worker/engine workflow system (Hatchet), plus adding GC/reconcile workflows and local/e2e/integration tooling to exercise the full stack.

Changes:

  • Add Postgres schema + repositories for deploy metadata, aliases (CAS), tombstones, outbox events, registry storage/import, and repo-request queue.
  • Introduce a worker runtime abstraction with metrics + an outbox relay, plus Hatchet adapter + integration tests that validate concurrency/poison/at-least-once semantics.
  • Add GC/reconcile/backfill logic, new delete/purge endpoints, durable GitHub team caching, and local + e2e harness scripts/compose files.

Reviewed changes

Copilot reviewed 94 out of 96 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/integration/hatchet/compose.hatchet.yaml Compose stack for live Hatchet-lite integration tests.
test/e2e/helpers_test.go E2E helper utilities for site registration/deploy minting/PG outbox waits.
test/e2e/harness_test.go E2E harness for HTTP/R2/PG connectivity and request helpers.
test/e2e/compose.e2e.yaml Full-stack compose for e2e (pg/valkey/minio/hatchet/fakegithub/artemis).
scripts/loadgen.sh Local load harness runner with ephemeral Postgres.
scripts/e2e-local.sh Local end-to-end runner that boots stack, mints keys/tokens, runs e2e tests.
justfile Adds hatchet-integration, e2e-local, and loadgen convenience targets.
internal/worker/runtime.go Worker runtime abstraction around an engine with workflow registration.
internal/worker/runtime_test.go Tests runtime registration/dup detection/engine delegation.
internal/worker/relay.go Outbox relay that publishes events and marks them published.
internal/worker/relay_test.go Tests relay ordering, partial progress, and at-least-once behavior.
internal/worker/metrics.go Prometheus metrics for worker runs/DLQ/relay.
internal/worker/metrics_test.go Tests worker metrics behavior and nil-safety.
internal/worker/deployflows.go Helper to register deploy workflows with per-site concurrency.
internal/worker/deployflows_test.go Tests deploy workflows register with the site concurrency key.
internal/worker/debounce.go Debouncer for coalescing site-change signals per site.
internal/worker/debounce_test.go Tests debouncer coalescing/retrigger/stop behavior.
internal/teamcache/teamcache.go Valkey-backed durable cache for GitHub team membership lists.
internal/teamcache/teamcache_test.go Tests teamcache hit/miss/TTL/error behavior.
internal/server/server.go Adds deploy delete route to the HTTP API.
internal/registry/valkey/store.go Adds pub-sub publish helpers for registry change notifications.
internal/registry/valkey/reader.go Refactors reader to support separate source-of-truth vs pubsub transport.
internal/registry/valkey/cutover_test.go Tests PG-source + Valkey-pubsub cutover behavior.
internal/r2/r2.go Adds delete/move/list helpers for prefixes and site listing.
internal/pg/saga.go Adds atomic finalize flow that writes deploy + alias + outbox in one tx.
internal/pg/saga_test.go Tests finalize atomicity/idempotency/outbox emission.
internal/pg/repoqueue.go Postgres implementation of repo-request queue with CAS transitions.
internal/pg/repoqueue_test.go Tests repo-request queue behavior and name-claim semantics.
internal/pg/repo.go Postgres deploy/alias/tombstone repository operations.
internal/pg/repo_test.go Tests deploy/alias roundtrip and tombstone lifecycle with testcontainers.
internal/pg/registry.go Postgres registry store + import-on-boot support and onChange hook.
internal/pg/registry_test.go Tests registry CRUD and onChange hook firing.
internal/pg/registry_import_test.go Tests registry import behavior and timestamp preservation.
internal/pg/pg.go Postgres DB wrapper with connect/ping/close.
internal/pg/outbox.go Outbox enqueue/fetch/mark-published implementation.
internal/pg/outbox_test.go Tests outbox atomicity with tx rollback and publish marking.
internal/pg/migrations/0003_repo_requests.sql Adds repo_requests table + name-claim and ordering indexes.
internal/pg/migrations/0002_registry.sql Adds sites table for Postgres-backed registry.
internal/pg/migrations/0001_init.sql Adds deploys/aliases/tombstones/outbox tables + indexes.
internal/pg/migrate.go Embedded SQL migrations runner with advisory lock + ledger.
internal/pg/migrate_test.go Tests migrations are applied and idempotent.
internal/pg/alias.go Adds alias CAS update with row-level locking and outbox emission.
internal/pg/alias_test.go Tests alias CAS no-lost-update and drift rejection.
internal/hatchet/integration_relay_test.go Integration test: relay drains outbox across engine restart.
internal/hatchet/integration_poison_test.go Integration test: poison workflow deadletters without blocking key.
internal/hatchet/integration_harness_test.go Integration test harness for Hatchet workflows and observability.
internal/hatchet/integration_concurrency_test.go Integration tests for per-site concurrency and parallelism across sites.
internal/hatchet/adapter.go Hatchet adapter implementing worker.Engine and Publisher.
internal/hatchet/adapter_test.go Tests adapter registration, publish-before-start, and bad addr handling.
internal/handler/test_helpers_test.go Extends fake R2 store with MovePrefix for purge/tombstone tests.
internal/handler/site.go Emits site.changed on promote/rollback.
internal/handler/site_register.go Adds site purge option that moves all site bytes to trash and records tombstone.
internal/handler/site_purge_test.go Tests site purge moves bytes to trash and records a site-level tombstone.
internal/handler/readyz.go Adds PG health probing and “degraded” readiness response.
internal/handler/readyz_test.go Adds tests for PG degraded readiness behavior.
internal/handler/outbox_emit_test.go Tests site.changed emission on finalize/promote.
internal/handler/handler.go Adds TombstoneStore/Outbox/PGHealth hooks and site.changed emitter helper.
internal/handler/deploy.go Writes a deploy completion marker and emits site.changed on finalize.
internal/handler/deploy_test.go Tests marker creation on finalize.
internal/handler/deploy_delete.go Adds deploy delete endpoint (tombstone move with alias-protection).
internal/handler/deploy_delete_test.go Tests deploy delete tombstoning, alias conflict, bad id, and authz.
internal/gc/tombstone.go Implements tombstone purge pass for hard-delete past recovery window.
internal/gc/tombstone_test.go Tests tombstone purge behavior including site-level trash layout.
internal/gc/retain.go Implements retention policy logic for keep vs delete sets.
internal/gc/retain_test.go Tests retention policy invariants (keepN/grace/alias pin/cache TTL).
internal/gc/reconcile.go Implements R2↔PG drift reconciliation and safe orphan tombstoning.
internal/gc/reconcile_test.go Tests reconcile behaviors and drift metrics.
internal/gc/plan.go Builds delete plans with blast-cap protection.
internal/gc/plan_test.go Tests plan sizing and blast-cap abort behavior.
internal/gc/metrics.go Prometheus metrics for GC workflows and drift.
internal/gc/metrics_test.go Tests GC metrics increments and nil-safety.
internal/gc/marker.go Defines completion marker object name.
internal/gc/gcsite.go Implements per-site GC execution with TOCTOU alias re-check.
internal/gc/gcsite_test.go Tests GC behavior (alias pin, inflight protection, dry-run, blast-cap).
internal/config/config.go Adds DB/hatchet/cleanup config + GH_API_BASE validation + cleanup parsing.
internal/config/config_test.go Adds tests for GH_API_BASE validation and cleanup config invariants.
internal/backfill/backfill.go Implements R2→PG backfill/indexing for deploys and aliases.
internal/backfill/backfill_test.go Tests backfill indexing, markers, and alias key behavior.
internal/auth/github.go Adds durable team-cache integration and improves body read/parse error handling.
internal/auth/github_test.go Adds tests for malformed membership body and durable team cache behavior.
docker-compose.yml Adds Postgres and adjusts fakegithub networking/GH_API_BASE for local dev.
cmd/artemis/relayloop_test.go Tests the relay loop drains outbox and records metrics.
cmd/artemis/main_test.go Tests DB boot/migrations gating when DATABASE_URL is empty vs set.
cmd/artemis/gcworkflows.go Wires relay loop + GC/reconcile/tombstone workflows with metrics/observability.
cmd/artemis/gcworkflows_test.go Tests workflow definitions and site input parsing.
cmd/artemis/gcwire.go Wires GC layout/policy and validates feature gating (repo queue requires DB).
cmd/artemis/gcwire_test.go Tests layout parsing and GC policy derivation.
.gitignore Ignores the loadgen binary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/worker/debounce.go
Comment thread internal/pg/migrations/0001_init.sql
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ec3010b0-813b-4ef3-b0eb-7d8a66f12352

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 95 out of 97 changed files in this pull request and generated 1 comment.

Comment thread internal/handler/site_register.go
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/handler/site_register.go (1)

193-227: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Partial failure: registry deletion precedes R2 purge.

If h.Registry.Delete succeeds but the subsequent MovePrefix or RecordTombstone fails, the site is removed from the registry while its R2 data remains in place (not moved to trash). This leaves orphaned data that won't be discoverable for cleanup.

Consider reordering to move data to trash first, then delete from registry, or wrapping both in a transaction/saga if available.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/handler/site_register.go` around lines 193 - 227, Registry deletion
happens before the R2 purge so a partial failure can leave orphaned data; change
the flow to perform h.R2.MovePrefix(r.Context(), slug+"/", base+slug+"/") and
then h.Tombstones.RecordTombstone(r.Context(), slug, "", 0) before calling
h.Registry.Delete(r.Context(), slug), propagating errors from
MovePrefix/RecordTombstone with the existing writeUpstreamError handling and
only deleting the registry entry if both data-move and tombstone recording
succeed; after successful delete keep the same logging and response behavior
(use the same symbols: h.R2.MovePrefix, h.Tombstones.RecordTombstone,
h.Registry.Delete, TrashPrefixBase) and ensure you preserve the original
not-found handling for Delete when appropriate.
🧹 Nitpick comments (9)
internal/hatchet/integration_concurrency_test.go (1)

61-74: ⚡ Quick win

Strengthen the distinct-sites test to assert real overlap, not just eventual starts.

TestR3DistinctSitesRunConcurrent currently passes even if execution is serialized (both sites can “eventually” start). Please add an explicit overlap/concurrency assertion so the test actually enforces the contract it names.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/hatchet/integration_concurrency_test.go` around lines 61 - 74,
TestR3DistinctSitesRunConcurrent currently only waits for each site to start;
modify the test to assert real concurrency by recording start timestamps in the
test observer (use newObserver/obs) or by exposing start events from
instrumented handlers, then after firing both workflows (h.fire) and waiting for
starts (h.waitStarts), assert that the start times for siteA and siteB overlap
(e.g., the difference between their start timestamps is less than the handler
duration 1500ms or that the second site started before the first site finished).
Update the observer/instrumented helper to capture per-site start and finish
times and add an assertion in TestR3DistinctSitesRunConcurrent that verifies
overlap rather than only eventual starts.
internal/auth/github.go (1)

368-376: 💤 Low value

Durable-cache write failure discards successfully fetched teams.

When fetchUserTeams succeeds but teamCacheDurable.Set fails, the error is returned and the caller receives no teams. However, fetchUserTeams already stored the teams in the in-memory cache (lines 450-455), so the data is technically available. Consider whether a durable-cache write failure should be non-fatal, logging a warning instead and returning the teams.

♻️ Optional: log warning instead of failing
 	teams, err := c.fetchUserTeams(ctx, cacheKey, token)
 	if err != nil {
 		return nil, err
 	}
 	if err := c.teamCacheDurable.Set(ctx, login, teams); err != nil {
-		return nil, err
+		// Non-fatal: in-memory cache is populated; log and continue.
+		slog.Warn("durable team cache write failed", "login", login, "err", err)
 	}
 	return teams, nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/auth/github.go` around lines 368 - 376, The durable-cache write in
the function that calls c.fetchUserTeams currently returns an error when
c.teamCacheDurable.Set fails, discarding the successfully fetched teams; change
this to make the durable write non-fatal: after calling teams, err :=
c.fetchUserTeams(...), keep the existing error return if fetchUserTeams fails,
but when c.teamCacheDurable.Set(ctx, login, teams) returns an error, log a
warning including login and the error (use the existing logger on the receiver,
e.g. c.logger or similar) and then return the teams instead of returning the
error; keep the original behavior only for fetchUserTeams errors and ensure any
in-memory cache (teamCache) behavior remains unchanged.
test/e2e/harness_test.go (1)

78-80: ⚡ Quick win

Explicitly set MinVersion in TLS configuration.

The TLS config lacks an explicit MinVersion, defaulting to TLS 1.2 for clients. For test harnesses connecting to local environments, this may be acceptable, but explicitly setting the minimum version improves security posture and makes the intention clear.

🔒 Proposed fix to set TLS 1.2 or 1.3 explicitly
-	c.Transport = &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}}
+	c.Transport = &http.Transport{TLSClientConfig: &tls.Config{
+		RootCAs:    pool,
+		MinVersion: tls.VersionTLS12,
+	}}

If all test environments support TLS 1.3, consider using tls.VersionTLS13 instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/harness_test.go` around lines 78 - 80, The TLS config assigned to
c.Transport (http.Transport -> tls.Config) does not set MinVersion; update the
tls.Config used in c.Transport to include an explicit MinVersion (e.g.,
tls.VersionTLS12 or tls.VersionTLS13) to make the minimum allowed TLS version
explicit and secure while keeping RootCAs: pool intact.
internal/gc/tombstone.go (1)

52-81: ⚡ Quick win

Consider accumulating partial work before returning errors.

If DeletePrefix or ClearTombstone fails midway through the loop, successfully purged tombstones are not reflected in the result or metrics. In GC workflows this is eventually consistent (the next run retries), but tracking partial success would improve observability and allow incremental progress to be reported.

♻️ Proposed enhancement to track partial success
 	for _, t := range expired {
 		label := t.Site + "/" + t.ID
 		if dryRun {
 			res.Purged = append(res.Purged, label)
 			continue
 		}
 		if _, err := p.Deleter.DeletePrefix(ctx, p.trashPrefix(t)); err != nil {
+			// Partial work is reflected in res; caller can decide whether to retry or log progress
 			return res, fmt.Errorf("tombstone-purge: delete %s: %w", label, err)
 		}
 		if err := p.Store.ClearTombstone(ctx, t.Site, t.ID); err != nil {
 			return res, fmt.Errorf("tombstone-purge: clear %s: %w", label, err)
 		}
 		res.Purged = append(res.Purged, label)
 		res.BytesReclaimed += t.Bytes
 	}
 
 	if !dryRun {
 		p.Metrics.reclaimed(res.BytesReclaimed)
 		p.Metrics.run(WorkflowTombstonePurgeLabel, "ok")
 	}

The change ensures res.Purged and res.BytesReclaimed reflect all successfully purged tombstones even when an error occurs partway through.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/gc/tombstone.go` around lines 52 - 81, The loop in
TombstonePurge.Run currently only appends to res.Purged and res.BytesReclaimed
after both DeletePrefix and ClearTombstone succeed, so if ClearTombstone fails
you lose the partial success; modify the loop so that after a successful
p.Deleter.DeletePrefix(ctx, p.trashPrefix(t)) you immediately append label to
res.Purged and add t.Bytes to res.BytesReclaimed (still attempt
p.Store.ClearTombstone afterwards and return the error if it fails), and before
returning an error (when not dryRun) call
p.Metrics.reclaimed(res.BytesReclaimed) and
p.Metrics.run(WorkflowTombstonePurgeLabel, "error") and log the current res
values so partial progress is reflected in the result and metrics; target the
TombstonePurge.Run function and the DeletePrefix / ClearTombstone handling, and
update where metrics and logging are emitted.
internal/hatchet/adapter.go (1)

10-12: ⚡ Quick win

Reduce reliance on Hatchet legacy pkg/client and tighten worker shutdown behavior

  • internal/hatchet/adapter.go configures hsdk.NewClient(opts...) via github.com/hatchet-dev/hatchet/pkg/client (v0Client.WithToken/v0Client.WithHostPort). If avoiding the legacy pkg/client import is the goal, use the Go SDK’s env-based config (HATCHET_CLIENT_TOKEN, HATCHET_CLIENT_HOST_PORT) and initialize with hsdk.NewClient() (no options).
  • Adapter.Stop is a no-op (return nil) and isn’t called from production code; shutdown is driven by StartBlocking(rootCtx) exiting when rootCtx is canceled. If Engine.Stop is meant to actively stop the Hatchet worker independent of context cancellation, implement it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/hatchet/adapter.go` around lines 10 - 12, Replace usage of the
legacy v0Client options when creating the Hatchet SDK client in
internal/hatchet/adapter.go by relying on the Go SDK's environment-based
configuration: call hsdk.NewClient() with no options and remove the import/usage
of v0Client (and related types if unused), ensuring HATCHET_CLIENT_TOKEN and
HATCHET_CLIENT_HOST_PORT are used instead; additionally, implement Adapter.Stop
(and the Engine.Stop method if present) so it actively stops the worker instead
of being a no-op — e.g., expose/hold an internal cancel function or worker
shutdown handle when starting StartBlocking(rootCtx) and have Stop invoke that
shutdown path (safely and idempotently) so the worker can be stopped
independently of rootCtx cancellation.
internal/hatchet/integration_relay_test.go (1)

115-118: ⚡ Quick win

Bound the docker restart command with a timeout.

exec.Command here can hang indefinitely on daemon/compose issues. Use exec.CommandContext with a timeout so this test fails fast and diagnostically.

Suggested fix
-	cmd := exec.Command("docker", "compose", "-f", composeFile, "restart", "hatchet-lite")
+	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
+	defer cancel()
+	cmd := exec.CommandContext(ctx, "docker", "compose", "-f", composeFile, "restart", "hatchet-lite")
 	out, err := cmd.CombinedOutput()
+	if ctx.Err() == context.DeadlineExceeded {
+		t.Fatalf("restart hatchet-lite timed out: %s", string(out))
+	}
 	require.NoErrorf(t, err, "restart hatchet-lite: %s", string(out))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/hatchet/integration_relay_test.go` around lines 115 - 118, Replace
the blocking exec.Command call with exec.CommandContext using a context with a
deadline (e.g., 30s) so the restart cannot hang indefinitely: create a context
via context.WithTimeout, use exec.CommandContext(ctx, "docker", "compose", "-f",
composeFile, "restart", "hatchet-lite") to build cmd, call cmd.CombinedOutput(),
and ensure you defer cancel() and assert errors appropriately (handle
context.DeadlineExceeded as a diagnostic timeout failure) so the test fails fast
and cleanly.
internal/worker/relay_test.go (1)

64-107: ⚡ Quick win

Add regression tests for nil relay wiring preconditions.

Given RunOnce is called via exported struct wiring, add tests for nil Source/Publisher (and nil receiver if you support it) so panic-prone misconfigurations stay guarded.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/worker/relay_test.go` around lines 64 - 107, Add regression tests in
relay_test.go to ensure RunOnce guards against nil wiring: add cases that
construct Relay with nil Source, Relay with nil Publisher, and a nil *Relay
receiver (var r *Relay; r.RunOnce(...)), calling RunOnce and asserting it
returns a non-nil error (and does not panic) rather than panicking; reference
the Relay type and its RunOnce method and the Source and Publisher fields so the
tests explicitly exercise those nil configurations.
internal/pg/registry.go (1)

122-134: 💤 Low value

imported count includes conflict-skipped rows.

The counter increments for every loop iteration, but ON CONFLICT DO NOTHING may skip rows that already exist (e.g., from a concurrent partial import). This means the returned count reflects "attempts" rather than "newly inserted" rows. If the count is used for metrics or logging, it may be misleading.

To accurately count inserts, check tag.RowsAffected():

Proposed fix
 	for _, site := range sites {
 		teams := append([]string(nil), site.Teams...)
-		_, err := conn.Exec(ctx,
+		tag, err := conn.Exec(ctx,
 			`INSERT INTO sites (slug, teams, created_at, updated_at, created_by)
 			 VALUES ($1, $2, $3, $4, $5)
 			 ON CONFLICT (slug) DO NOTHING`,
 			site.Slug, teams, site.CreatedAt, site.UpdatedAt, site.CreatedBy)
 		if err != nil {
 			return imported, fmt.Errorf("pg registry import %s: %w", site.Slug, err)
 		}
-		imported++
+		imported += int(tag.RowsAffected())
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/pg/registry.go` around lines 122 - 134, The loop currently counts
every attempted insert even when ON CONFLICT DO NOTHING skips rows; change the
Exec call in internal/pg/registry.go to capture the command tag (e.g., tag, err
:= conn.Exec(...)) and only increment imported when tag.RowsAffected() > 0 (so
imported reflects actual inserts). Keep the existing error wrapping (fmt.Errorf)
for non-nil err and leave the rest of the insert logic unchanged.
internal/pg/repoqueue.go (1)

128-140: 💤 Low value

Mismatch error may be misleading for MarkActive and MarkFailed.

Both methods expect StatusApproved but return ErrNotPending on mismatch. If a request is still pending and MarkActive is called, the error message "not pending" is confusing since the actual requirement is "must be approved."

Consider introducing ErrNotApproved or a more generic ErrInvalidTransition for clarity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/pg/repoqueue.go` around lines 128 - 140, The error returned on
status mismatch in RepoQueue.MarkActive and MarkFailed is misleading (they pass
reporequest.ErrNotPending while expecting reporequest.StatusApproved); add a
clearer error constant (e.g., reporequest.ErrNotApproved or
reporequest.ErrInvalidTransition) in the reporequest package and update both
calls to transition(ctx, id, reporequest.StatusApproved,
reporequest.ErrNotApproved, ...) so the mismatch message accurately reflects the
required approved state (leave transition(...) behavior unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/loadgen/main.go`:
- Around line 157-171: The SiteGC instance created in runGCPlan omits the
Metrics field causing a nil pointer panic because gc.SiteGC.Run calls
g.Metrics.run and g.Metrics.tombstoned; fix by supplying a non-nil Metrics
implementation (either a no-op metrics struct implementing the same methods or
the real metrics collector) when constructing gc.SiteGC in runGCPlan so
g.Metrics is always valid (update the runGCPlan initializer to set Metrics to
the no-op or real Metrics implementation).

In `@docs/design/0001-durable-execution-model.md`:
- Around line 21-34: The markdown diagram block starting with the line "commands
(CLI / CI / curl)" is an unlabeled fenced code block causing MD040; add a
neutral language tag (e.g., change opening "```" to "```text") so the fence
reads "```text" and leave the closing "```" unchanged to satisfy the linter
while preserving the ASCII architecture diagram.

In `@docs/design/0002-scalability-capacity.md`:
- Around line 68-70: The markdown file uses unlabeled fenced code blocks which
trigger MD040; update the fences for the specific snippets so they include
language identifiers: mark the DATABASE_URL and loadgen command blocks with
```bash, and mark the Go struct snippet starting with types.Concurrency{ with
```go; ensure the three other snippets containing the loadgen examples and
LOADGEN_DATABASE_URL are labeled as bash as well so all occurrences (the block
with "DATABASE_URL=postgres://...", the block with "types.Concurrency{", and the
blocks with "just loadgen" / "LOADGEN_DATABASE_URL=... just loadgen") use the
appropriate language fences.

In `@internal/backfill/backfill.go`:
- Around line 36-39: Run dereferences potentially nil dependencies (b.Lister and
b.Now) in Backfill.Run causing panics; add nil checks at the start of Run to
validate dependencies and return a clear typed error instead of panicking.
Specifically, in func (b *Backfill) Run(ctx context.Context) check that b !=
nil, b.Lister != nil and b.Now != nil (or provide a default Now function) and
return a descriptive error (e.g., ErrMissingDependency) if any are nil; update
callers/tests if they rely on default behavior and ensure the references around
ListSites, and the b.Now() invocations at the sites processing (the sites
iteration around lines where b.Now is used) use the validated non-nil values.

In `@internal/handler/site.go`:
- Line 135: The call to h.emitSiteChanged currently uses the incoming request
context (r.Context()), so if the client disconnects the emission can be canceled
after PutAlias succeeds; change the emission to use a non-cancelable context
(e.g., context.Background() or a derived context with a long timeout) when
calling h.emitSiteChanged so the post-write consistency event is delivered
regardless of request cancellation; update both places that call
h.emitSiteChanged(r.Context(), site) to use the detached context and ensure any
necessary logging or error handling still runs.

In `@internal/hatchet/adapter.go`:
- Around line 53-71: The Adapter.Start implementation calls w.StartBlocking(ctx)
but Adapter.Stop is a no-op, so shutdown is never signaled; update Adapter.Stop
to honor the engine lifecycle by shutting down the worker and cancelling any
running context: store a cancellable context when starting (e.g., create ctx,
cancel := context.WithCancel(ctx) inside Adapter.Start before calling
w.StartBlocking), save the cancel function and the worker instance (a.worker and
a.cancel), and implement Adapter.Stop to call a.cancel() and invoke the worker
shutdown API (e.g., a.worker.Stop or the appropriate non-blocking stop method
provided by hsdk) and wait for termination if available; ensure error handling
from worker stop/close is returned from Adapter.Stop.

In `@internal/hatchet/integration_poison_test.go`:
- Around line 17-30: poisonSeen and healthySeen are incremented concurrently in
the handler closures (poison, healthy) without synchronization, causing a data
race; change their declarations to an atomic-backed integer (e.g., int32/int64)
or protect them with a mutex, then replace the increments inside the poison and
healthy functions with atomic.Add... (or mutex-protected increments) and use
atomic.Load... (or mutex-protected reads) where the counters are inspected later
so all accesses are race-free.

In `@internal/pg/migrate.go`:
- Around line 29-31: The deferred unlock currently uses the possibly canceled
caller ctx which can cause pg_advisory_unlock (invoked via conn.Exec with
migrateAdvisoryLockKey) to fail and leak the advisory lock; change the defer to
run the unlock with a new non-cancelable context (e.g., context.Background()
with a short timeout) so the unlock runs even if the original ctx is canceled,
and remove any direct conn.Conn().Close() calls—return the connection to the
pool by calling conn.Release() instead; update the defer to create the timeout
context, call conn.Exec with that context and migrateAdvisoryLockKey, cancel the
timeout context, and ensure only conn.Release() is used to release the pooled
connection.

In `@internal/teamcache/teamcache.go`:
- Around line 55-68: GetOrFetch currently allows parallel cache misses for the
same login to call fetch concurrently; change it to coalesce concurrent misses
using a single-flight mechanism: add a singleflight.Group (or an equivalent
mutex+map) as a field on Cache and, after detecting a miss via Cache.Get, call
group.Do(login, func() (interface{}, error) { teams, err := fetch(ctx); if err
!= nil { return nil, err }; _ = c.Set(ctx, login, teams); return teams, nil })
so only one fetch/Set runs per login and the returned value/error is shared to
all waiters; ensure the returned interface is cast back to []string and errors
propagate unchanged.

In `@internal/worker/relay.go`:
- Around line 28-35: RunOnce currently assumes Relay and its dependencies exist
and will panic if miswired; add explicit nil guards at the start of
Relay.RunOnce to return descriptive errors instead of panicking: check if r ==
nil, then verify r.Source != nil before calling r.Source.FetchUnpublished and
r.Source.MarkPublished, and verify r.Publisher != nil before calling
r.Publisher.Publish; return wrapped errors like "relay: missing Source" /
"relay: missing Publisher" so callers get a clear error path rather than a
runtime panic.

In `@internal/worker/runtime.go`:
- Around line 40-42: NewRuntime currently allows construction with a nil Engine
which leads to panics in runtime lifecycle methods; change NewRuntime to
validate the engine and return (*Runtime, error) (e.g., return nil,
fmt.Errorf("nil engine")) when engine is nil, update callers to handle the
error, and keep the Runtime struct and methods Register, Start, and Stop
assuming a non-nil engine; also add defensive nil checks in Register/Start/Stop
to return errors if somehow called on a Runtime with nil engine to avoid panics.

In `@scripts/e2e-local.sh`:
- Line 45: The private key is being made world-readable; tighten its permissions
by setting "$CERTS_DIR/private.key" to 0600 (owner read/write only) while
keeping "$CERTS_DIR/public.crt" at 0644; update the chmod invocation that
currently targets both files to apply 0600 to the private key and 0644 to the
public cert (referencing the existing "$CERTS_DIR/private.key" and
"$CERTS_DIR/public.crt" symbols).

In `@scripts/loadgen.sh`:
- Around line 29-34: The readiness loop using seq 1 30 and docker exec
"$CONTAINER" pg_isready should fail fast if Postgres never becomes ready: after
the for loop that breaks on success, check whether pg_isready actually succeeded
and if not emit a clear error (including $CONTAINER and that pg_isready timed
out) and exit non-zero (e.g., exit 1) so the script stops rather than continuing
to run the load; update the block containing the for loop, CONTAINER, and
pg_isready checks to perform this post-loop assertion and error log/exit.

---

Outside diff comments:
In `@internal/handler/site_register.go`:
- Around line 193-227: Registry deletion happens before the R2 purge so a
partial failure can leave orphaned data; change the flow to perform
h.R2.MovePrefix(r.Context(), slug+"/", base+slug+"/") and then
h.Tombstones.RecordTombstone(r.Context(), slug, "", 0) before calling
h.Registry.Delete(r.Context(), slug), propagating errors from
MovePrefix/RecordTombstone with the existing writeUpstreamError handling and
only deleting the registry entry if both data-move and tombstone recording
succeed; after successful delete keep the same logging and response behavior
(use the same symbols: h.R2.MovePrefix, h.Tombstones.RecordTombstone,
h.Registry.Delete, TrashPrefixBase) and ensure you preserve the original
not-found handling for Delete when appropriate.

---

Nitpick comments:
In `@internal/auth/github.go`:
- Around line 368-376: The durable-cache write in the function that calls
c.fetchUserTeams currently returns an error when c.teamCacheDurable.Set fails,
discarding the successfully fetched teams; change this to make the durable write
non-fatal: after calling teams, err := c.fetchUserTeams(...), keep the existing
error return if fetchUserTeams fails, but when c.teamCacheDurable.Set(ctx,
login, teams) returns an error, log a warning including login and the error (use
the existing logger on the receiver, e.g. c.logger or similar) and then return
the teams instead of returning the error; keep the original behavior only for
fetchUserTeams errors and ensure any in-memory cache (teamCache) behavior
remains unchanged.

In `@internal/gc/tombstone.go`:
- Around line 52-81: The loop in TombstonePurge.Run currently only appends to
res.Purged and res.BytesReclaimed after both DeletePrefix and ClearTombstone
succeed, so if ClearTombstone fails you lose the partial success; modify the
loop so that after a successful p.Deleter.DeletePrefix(ctx, p.trashPrefix(t))
you immediately append label to res.Purged and add t.Bytes to res.BytesReclaimed
(still attempt p.Store.ClearTombstone afterwards and return the error if it
fails), and before returning an error (when not dryRun) call
p.Metrics.reclaimed(res.BytesReclaimed) and
p.Metrics.run(WorkflowTombstonePurgeLabel, "error") and log the current res
values so partial progress is reflected in the result and metrics; target the
TombstonePurge.Run function and the DeletePrefix / ClearTombstone handling, and
update where metrics and logging are emitted.

In `@internal/hatchet/adapter.go`:
- Around line 10-12: Replace usage of the legacy v0Client options when creating
the Hatchet SDK client in internal/hatchet/adapter.go by relying on the Go SDK's
environment-based configuration: call hsdk.NewClient() with no options and
remove the import/usage of v0Client (and related types if unused), ensuring
HATCHET_CLIENT_TOKEN and HATCHET_CLIENT_HOST_PORT are used instead;
additionally, implement Adapter.Stop (and the Engine.Stop method if present) so
it actively stops the worker instead of being a no-op — e.g., expose/hold an
internal cancel function or worker shutdown handle when starting
StartBlocking(rootCtx) and have Stop invoke that shutdown path (safely and
idempotently) so the worker can be stopped independently of rootCtx
cancellation.

In `@internal/hatchet/integration_concurrency_test.go`:
- Around line 61-74: TestR3DistinctSitesRunConcurrent currently only waits for
each site to start; modify the test to assert real concurrency by recording
start timestamps in the test observer (use newObserver/obs) or by exposing start
events from instrumented handlers, then after firing both workflows (h.fire) and
waiting for starts (h.waitStarts), assert that the start times for siteA and
siteB overlap (e.g., the difference between their start timestamps is less than
the handler duration 1500ms or that the second site started before the first
site finished). Update the observer/instrumented helper to capture per-site
start and finish times and add an assertion in TestR3DistinctSitesRunConcurrent
that verifies overlap rather than only eventual starts.

In `@internal/hatchet/integration_relay_test.go`:
- Around line 115-118: Replace the blocking exec.Command call with
exec.CommandContext using a context with a deadline (e.g., 30s) so the restart
cannot hang indefinitely: create a context via context.WithTimeout, use
exec.CommandContext(ctx, "docker", "compose", "-f", composeFile, "restart",
"hatchet-lite") to build cmd, call cmd.CombinedOutput(), and ensure you defer
cancel() and assert errors appropriately (handle context.DeadlineExceeded as a
diagnostic timeout failure) so the test fails fast and cleanly.

In `@internal/pg/registry.go`:
- Around line 122-134: The loop currently counts every attempted insert even
when ON CONFLICT DO NOTHING skips rows; change the Exec call in
internal/pg/registry.go to capture the command tag (e.g., tag, err :=
conn.Exec(...)) and only increment imported when tag.RowsAffected() > 0 (so
imported reflects actual inserts). Keep the existing error wrapping (fmt.Errorf)
for non-nil err and leave the rest of the insert logic unchanged.

In `@internal/pg/repoqueue.go`:
- Around line 128-140: The error returned on status mismatch in
RepoQueue.MarkActive and MarkFailed is misleading (they pass
reporequest.ErrNotPending while expecting reporequest.StatusApproved); add a
clearer error constant (e.g., reporequest.ErrNotApproved or
reporequest.ErrInvalidTransition) in the reporequest package and update both
calls to transition(ctx, id, reporequest.StatusApproved,
reporequest.ErrNotApproved, ...) so the mismatch message accurately reflects the
required approved state (leave transition(...) behavior unchanged).

In `@internal/worker/relay_test.go`:
- Around line 64-107: Add regression tests in relay_test.go to ensure RunOnce
guards against nil wiring: add cases that construct Relay with nil Source, Relay
with nil Publisher, and a nil *Relay receiver (var r *Relay; r.RunOnce(...)),
calling RunOnce and asserting it returns a non-nil error (and does not panic)
rather than panicking; reference the Relay type and its RunOnce method and the
Source and Publisher fields so the tests explicitly exercise those nil
configurations.

In `@test/e2e/harness_test.go`:
- Around line 78-80: The TLS config assigned to c.Transport (http.Transport ->
tls.Config) does not set MinVersion; update the tls.Config used in c.Transport
to include an explicit MinVersion (e.g., tls.VersionTLS12 or tls.VersionTLS13)
to make the minimum allowed TLS version explicit and secure while keeping
RootCAs: pool intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f2de9c42-4c18-448c-b8f3-fe6ed2a9a201

📥 Commits

Reviewing files that changed from the base of the PR and between e7a9e31 and 26a3225.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (96)
  • .gitignore
  • cmd/artemis/gcwire.go
  • cmd/artemis/gcwire_test.go
  • cmd/artemis/gcworkflows.go
  • cmd/artemis/gcworkflows_test.go
  • cmd/artemis/main.go
  • cmd/artemis/main_test.go
  • cmd/artemis/relayloop_test.go
  • cmd/loadgen/main.go
  • docker-compose.yml
  • docs/design/0001-durable-execution-model.md
  • docs/design/0002-scalability-capacity.md
  • go.mod
  • internal/auth/github.go
  • internal/auth/github_test.go
  • internal/backfill/backfill.go
  • internal/backfill/backfill_test.go
  • internal/config/config.go
  • internal/config/config_test.go
  • internal/gc/gcsite.go
  • internal/gc/gcsite_test.go
  • internal/gc/marker.go
  • internal/gc/metrics.go
  • internal/gc/metrics_test.go
  • internal/gc/plan.go
  • internal/gc/plan_test.go
  • internal/gc/reconcile.go
  • internal/gc/reconcile_test.go
  • internal/gc/retain.go
  • internal/gc/retain_test.go
  • internal/gc/tombstone.go
  • internal/gc/tombstone_test.go
  • internal/handler/deploy.go
  • internal/handler/deploy_delete.go
  • internal/handler/deploy_delete_test.go
  • internal/handler/deploy_test.go
  • internal/handler/handler.go
  • internal/handler/outbox_emit_test.go
  • internal/handler/readyz.go
  • internal/handler/readyz_test.go
  • internal/handler/site.go
  • internal/handler/site_purge_test.go
  • internal/handler/site_register.go
  • internal/handler/test_helpers_test.go
  • internal/hatchet/adapter.go
  • internal/hatchet/adapter_test.go
  • internal/hatchet/integration_concurrency_test.go
  • internal/hatchet/integration_harness_test.go
  • internal/hatchet/integration_poison_test.go
  • internal/hatchet/integration_relay_test.go
  • internal/pg/alias.go
  • internal/pg/alias_test.go
  • internal/pg/migrate.go
  • internal/pg/migrate_test.go
  • internal/pg/migrations/0001_init.sql
  • internal/pg/migrations/0002_registry.sql
  • internal/pg/migrations/0003_repo_requests.sql
  • internal/pg/migrations/0004_outbox_id_index.sql
  • internal/pg/outbox.go
  • internal/pg/outbox_test.go
  • internal/pg/pg.go
  • internal/pg/registry.go
  • internal/pg/registry_import_test.go
  • internal/pg/registry_test.go
  • internal/pg/repo.go
  • internal/pg/repo_test.go
  • internal/pg/repoqueue.go
  • internal/pg/repoqueue_test.go
  • internal/pg/saga.go
  • internal/pg/saga_test.go
  • internal/r2/r2.go
  • internal/r2/r2_test.go
  • internal/registry/valkey/cutover_test.go
  • internal/registry/valkey/reader.go
  • internal/registry/valkey/store.go
  • internal/server/server.go
  • internal/teamcache/teamcache.go
  • internal/teamcache/teamcache_test.go
  • internal/worker/debounce.go
  • internal/worker/debounce_test.go
  • internal/worker/deployflows.go
  • internal/worker/deployflows_test.go
  • internal/worker/metrics.go
  • internal/worker/metrics_test.go
  • internal/worker/relay.go
  • internal/worker/relay_test.go
  • internal/worker/runtime.go
  • internal/worker/runtime_test.go
  • justfile
  • scripts/e2e-local.sh
  • scripts/loadgen.sh
  • test/e2e/catalog_test.go
  • test/e2e/compose.e2e.yaml
  • test/e2e/harness_test.go
  • test/e2e/helpers_test.go
  • test/integration/hatchet/compose.hatchet.yaml

Comment thread cmd/loadgen/main.go
Comment thread docs/design/0001-durable-execution-model.md Outdated
Comment thread docs/design/0002-scalability-capacity.md Outdated
Comment thread internal/backfill/backfill.go
Comment thread internal/gc/reconcile.go
Comment thread internal/teamcache/teamcache.go
Comment thread internal/worker/relay.go
Comment thread internal/worker/runtime.go
Comment thread scripts/e2e-local.sh
Comment thread scripts/loadgen.sh
@raisedadead
Copy link
Copy Markdown
Member Author

Dispositions for the review-body items (posted outside inline threads):

Outside-diff — site_register.go purge ordering: fixed in b94f581 together with Copilot's inline comment — purge work (MovePrefix + tombstone) now precedes registry deletion; regression test TestSitePurge_FailedMoveKeepsSiteRetryable added.

Nitpicks:

  • integration_concurrency_test.go overlap assertion: fixed in c06a5c9 — observer tracks cross-site peak concurrency; the distinct-sites test now asserts real overlap.
  • auth/github.go durable-cache write failure: fixed in 531f491Set failure logs a warning and returns the fetched teams (regression test added).
  • e2e harness TLS MinVersion: fixed in 2e8556f.
  • integration_relay_test.go docker restart: fixed in e963f80CommandContext with 60s timeout.
  • pg/registry.go import count: fixed in a5db15b — counts RowsAffected() so conflict-skips are excluded. The skip path is unreachable today (advisory lock + empty-table gate), but this counter feeds the cutover boot-log gate, so corrected for accuracy.
  • gc/tombstone.go partial accounting: declining — purge is eventually consistent by design (the next run retries); partial-progress accounting adds churn without changing reclaim behavior.
  • worker/relay_test.go nil-wiring tests: declining — deps are wired fail-loud at boot and never nil; testing panic paths would enshrine a pattern we deliberately avoid.
  • hatchet/adapter.go legacy pkg/client: declininghsdk.NewClient's official options surface is pkg/client's ClientOpt; explicit Config{Token, Addr} beats env-based wiring for testability (the integration suite injects per-container addresses).
  • pg/repoqueue.go ErrNotPending naming: declining — the sentinel is shared across the transition contract (pg + valkey implementations + handler mapping); renaming churns three implementations for no behavior change.

@raisedadead raisedadead merged commit 4433050 into main Jun 4, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants