feat: durable execution model#11
Conversation
…claim, CAS transitions)
There was a problem hiding this comment.
Pull request overview
This PR introduces a more durable execution model by adding Postgres-backed state (deploy/alias metadata, outbox, tombstones, repo-request queue) and wiring it into a worker/engine workflow system (Hatchet), plus adding GC/reconcile workflows and local/e2e/integration tooling to exercise the full stack.
Changes:
- Add Postgres schema + repositories for deploy metadata, aliases (CAS), tombstones, outbox events, registry storage/import, and repo-request queue.
- Introduce a worker runtime abstraction with metrics + an outbox relay, plus Hatchet adapter + integration tests that validate concurrency/poison/at-least-once semantics.
- Add GC/reconcile/backfill logic, new delete/purge endpoints, durable GitHub team caching, and local + e2e harness scripts/compose files.
Reviewed changes
Copilot reviewed 94 out of 96 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/integration/hatchet/compose.hatchet.yaml | Compose stack for live Hatchet-lite integration tests. |
| test/e2e/helpers_test.go | E2E helper utilities for site registration/deploy minting/PG outbox waits. |
| test/e2e/harness_test.go | E2E harness for HTTP/R2/PG connectivity and request helpers. |
| test/e2e/compose.e2e.yaml | Full-stack compose for e2e (pg/valkey/minio/hatchet/fakegithub/artemis). |
| scripts/loadgen.sh | Local load harness runner with ephemeral Postgres. |
| scripts/e2e-local.sh | Local end-to-end runner that boots stack, mints keys/tokens, runs e2e tests. |
| justfile | Adds hatchet-integration, e2e-local, and loadgen convenience targets. |
| internal/worker/runtime.go | Worker runtime abstraction around an engine with workflow registration. |
| internal/worker/runtime_test.go | Tests runtime registration/dup detection/engine delegation. |
| internal/worker/relay.go | Outbox relay that publishes events and marks them published. |
| internal/worker/relay_test.go | Tests relay ordering, partial progress, and at-least-once behavior. |
| internal/worker/metrics.go | Prometheus metrics for worker runs/DLQ/relay. |
| internal/worker/metrics_test.go | Tests worker metrics behavior and nil-safety. |
| internal/worker/deployflows.go | Helper to register deploy workflows with per-site concurrency. |
| internal/worker/deployflows_test.go | Tests deploy workflows register with the site concurrency key. |
| internal/worker/debounce.go | Debouncer for coalescing site-change signals per site. |
| internal/worker/debounce_test.go | Tests debouncer coalescing/retrigger/stop behavior. |
| internal/teamcache/teamcache.go | Valkey-backed durable cache for GitHub team membership lists. |
| internal/teamcache/teamcache_test.go | Tests teamcache hit/miss/TTL/error behavior. |
| internal/server/server.go | Adds deploy delete route to the HTTP API. |
| internal/registry/valkey/store.go | Adds pub-sub publish helpers for registry change notifications. |
| internal/registry/valkey/reader.go | Refactors reader to support separate source-of-truth vs pubsub transport. |
| internal/registry/valkey/cutover_test.go | Tests PG-source + Valkey-pubsub cutover behavior. |
| internal/r2/r2.go | Adds delete/move/list helpers for prefixes and site listing. |
| internal/pg/saga.go | Adds atomic finalize flow that writes deploy + alias + outbox in one tx. |
| internal/pg/saga_test.go | Tests finalize atomicity/idempotency/outbox emission. |
| internal/pg/repoqueue.go | Postgres implementation of repo-request queue with CAS transitions. |
| internal/pg/repoqueue_test.go | Tests repo-request queue behavior and name-claim semantics. |
| internal/pg/repo.go | Postgres deploy/alias/tombstone repository operations. |
| internal/pg/repo_test.go | Tests deploy/alias roundtrip and tombstone lifecycle with testcontainers. |
| internal/pg/registry.go | Postgres registry store + import-on-boot support and onChange hook. |
| internal/pg/registry_test.go | Tests registry CRUD and onChange hook firing. |
| internal/pg/registry_import_test.go | Tests registry import behavior and timestamp preservation. |
| internal/pg/pg.go | Postgres DB wrapper with connect/ping/close. |
| internal/pg/outbox.go | Outbox enqueue/fetch/mark-published implementation. |
| internal/pg/outbox_test.go | Tests outbox atomicity with tx rollback and publish marking. |
| internal/pg/migrations/0003_repo_requests.sql | Adds repo_requests table + name-claim and ordering indexes. |
| internal/pg/migrations/0002_registry.sql | Adds sites table for Postgres-backed registry. |
| internal/pg/migrations/0001_init.sql | Adds deploys/aliases/tombstones/outbox tables + indexes. |
| internal/pg/migrate.go | Embedded SQL migrations runner with advisory lock + ledger. |
| internal/pg/migrate_test.go | Tests migrations are applied and idempotent. |
| internal/pg/alias.go | Adds alias CAS update with row-level locking and outbox emission. |
| internal/pg/alias_test.go | Tests alias CAS no-lost-update and drift rejection. |
| internal/hatchet/integration_relay_test.go | Integration test: relay drains outbox across engine restart. |
| internal/hatchet/integration_poison_test.go | Integration test: poison workflow deadletters without blocking key. |
| internal/hatchet/integration_harness_test.go | Integration test harness for Hatchet workflows and observability. |
| internal/hatchet/integration_concurrency_test.go | Integration tests for per-site concurrency and parallelism across sites. |
| internal/hatchet/adapter.go | Hatchet adapter implementing worker.Engine and Publisher. |
| internal/hatchet/adapter_test.go | Tests adapter registration, publish-before-start, and bad addr handling. |
| internal/handler/test_helpers_test.go | Extends fake R2 store with MovePrefix for purge/tombstone tests. |
| internal/handler/site.go | Emits site.changed on promote/rollback. |
| internal/handler/site_register.go | Adds site purge option that moves all site bytes to trash and records tombstone. |
| internal/handler/site_purge_test.go | Tests site purge moves bytes to trash and records a site-level tombstone. |
| internal/handler/readyz.go | Adds PG health probing and “degraded” readiness response. |
| internal/handler/readyz_test.go | Adds tests for PG degraded readiness behavior. |
| internal/handler/outbox_emit_test.go | Tests site.changed emission on finalize/promote. |
| internal/handler/handler.go | Adds TombstoneStore/Outbox/PGHealth hooks and site.changed emitter helper. |
| internal/handler/deploy.go | Writes a deploy completion marker and emits site.changed on finalize. |
| internal/handler/deploy_test.go | Tests marker creation on finalize. |
| internal/handler/deploy_delete.go | Adds deploy delete endpoint (tombstone move with alias-protection). |
| internal/handler/deploy_delete_test.go | Tests deploy delete tombstoning, alias conflict, bad id, and authz. |
| internal/gc/tombstone.go | Implements tombstone purge pass for hard-delete past recovery window. |
| internal/gc/tombstone_test.go | Tests tombstone purge behavior including site-level trash layout. |
| internal/gc/retain.go | Implements retention policy logic for keep vs delete sets. |
| internal/gc/retain_test.go | Tests retention policy invariants (keepN/grace/alias pin/cache TTL). |
| internal/gc/reconcile.go | Implements R2↔PG drift reconciliation and safe orphan tombstoning. |
| internal/gc/reconcile_test.go | Tests reconcile behaviors and drift metrics. |
| internal/gc/plan.go | Builds delete plans with blast-cap protection. |
| internal/gc/plan_test.go | Tests plan sizing and blast-cap abort behavior. |
| internal/gc/metrics.go | Prometheus metrics for GC workflows and drift. |
| internal/gc/metrics_test.go | Tests GC metrics increments and nil-safety. |
| internal/gc/marker.go | Defines completion marker object name. |
| internal/gc/gcsite.go | Implements per-site GC execution with TOCTOU alias re-check. |
| internal/gc/gcsite_test.go | Tests GC behavior (alias pin, inflight protection, dry-run, blast-cap). |
| internal/config/config.go | Adds DB/hatchet/cleanup config + GH_API_BASE validation + cleanup parsing. |
| internal/config/config_test.go | Adds tests for GH_API_BASE validation and cleanup config invariants. |
| internal/backfill/backfill.go | Implements R2→PG backfill/indexing for deploys and aliases. |
| internal/backfill/backfill_test.go | Tests backfill indexing, markers, and alias key behavior. |
| internal/auth/github.go | Adds durable team-cache integration and improves body read/parse error handling. |
| internal/auth/github_test.go | Adds tests for malformed membership body and durable team cache behavior. |
| docker-compose.yml | Adds Postgres and adjusts fakegithub networking/GH_API_BASE for local dev. |
| cmd/artemis/relayloop_test.go | Tests the relay loop drains outbox and records metrics. |
| cmd/artemis/main_test.go | Tests DB boot/migrations gating when DATABASE_URL is empty vs set. |
| cmd/artemis/gcworkflows.go | Wires relay loop + GC/reconcile/tombstone workflows with metrics/observability. |
| cmd/artemis/gcworkflows_test.go | Tests workflow definitions and site input parsing. |
| cmd/artemis/gcwire.go | Wires GC layout/policy and validates feature gating (repo queue requires DB). |
| cmd/artemis/gcwire_test.go | Tests layout parsing and GC policy derivation. |
| .gitignore | Ignores the loadgen binary. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Comment |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/handler/site_register.go (1)
193-227:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftPartial failure: registry deletion precedes R2 purge.
If
h.Registry.Deletesucceeds but the subsequentMovePrefixorRecordTombstonefails, the site is removed from the registry while its R2 data remains in place (not moved to trash). This leaves orphaned data that won't be discoverable for cleanup.Consider reordering to move data to trash first, then delete from registry, or wrapping both in a transaction/saga if available.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/handler/site_register.go` around lines 193 - 227, Registry deletion happens before the R2 purge so a partial failure can leave orphaned data; change the flow to perform h.R2.MovePrefix(r.Context(), slug+"/", base+slug+"/") and then h.Tombstones.RecordTombstone(r.Context(), slug, "", 0) before calling h.Registry.Delete(r.Context(), slug), propagating errors from MovePrefix/RecordTombstone with the existing writeUpstreamError handling and only deleting the registry entry if both data-move and tombstone recording succeed; after successful delete keep the same logging and response behavior (use the same symbols: h.R2.MovePrefix, h.Tombstones.RecordTombstone, h.Registry.Delete, TrashPrefixBase) and ensure you preserve the original not-found handling for Delete when appropriate.
🧹 Nitpick comments (9)
internal/hatchet/integration_concurrency_test.go (1)
61-74: ⚡ Quick winStrengthen the distinct-sites test to assert real overlap, not just eventual starts.
TestR3DistinctSitesRunConcurrentcurrently passes even if execution is serialized (both sites can “eventually” start). Please add an explicit overlap/concurrency assertion so the test actually enforces the contract it names.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/hatchet/integration_concurrency_test.go` around lines 61 - 74, TestR3DistinctSitesRunConcurrent currently only waits for each site to start; modify the test to assert real concurrency by recording start timestamps in the test observer (use newObserver/obs) or by exposing start events from instrumented handlers, then after firing both workflows (h.fire) and waiting for starts (h.waitStarts), assert that the start times for siteA and siteB overlap (e.g., the difference between their start timestamps is less than the handler duration 1500ms or that the second site started before the first site finished). Update the observer/instrumented helper to capture per-site start and finish times and add an assertion in TestR3DistinctSitesRunConcurrent that verifies overlap rather than only eventual starts.internal/auth/github.go (1)
368-376: 💤 Low valueDurable-cache write failure discards successfully fetched teams.
When
fetchUserTeamssucceeds butteamCacheDurable.Setfails, the error is returned and the caller receives no teams. However,fetchUserTeamsalready stored the teams in the in-memory cache (lines 450-455), so the data is technically available. Consider whether a durable-cache write failure should be non-fatal, logging a warning instead and returning the teams.♻️ Optional: log warning instead of failing
teams, err := c.fetchUserTeams(ctx, cacheKey, token) if err != nil { return nil, err } if err := c.teamCacheDurable.Set(ctx, login, teams); err != nil { - return nil, err + // Non-fatal: in-memory cache is populated; log and continue. + slog.Warn("durable team cache write failed", "login", login, "err", err) } return teams, nil🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/auth/github.go` around lines 368 - 376, The durable-cache write in the function that calls c.fetchUserTeams currently returns an error when c.teamCacheDurable.Set fails, discarding the successfully fetched teams; change this to make the durable write non-fatal: after calling teams, err := c.fetchUserTeams(...), keep the existing error return if fetchUserTeams fails, but when c.teamCacheDurable.Set(ctx, login, teams) returns an error, log a warning including login and the error (use the existing logger on the receiver, e.g. c.logger or similar) and then return the teams instead of returning the error; keep the original behavior only for fetchUserTeams errors and ensure any in-memory cache (teamCache) behavior remains unchanged.test/e2e/harness_test.go (1)
78-80: ⚡ Quick winExplicitly set
MinVersionin TLS configuration.The TLS config lacks an explicit
MinVersion, defaulting to TLS 1.2 for clients. For test harnesses connecting to local environments, this may be acceptable, but explicitly setting the minimum version improves security posture and makes the intention clear.🔒 Proposed fix to set TLS 1.2 or 1.3 explicitly
- c.Transport = &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}} + c.Transport = &http.Transport{TLSClientConfig: &tls.Config{ + RootCAs: pool, + MinVersion: tls.VersionTLS12, + }}If all test environments support TLS 1.3, consider using
tls.VersionTLS13instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/harness_test.go` around lines 78 - 80, The TLS config assigned to c.Transport (http.Transport -> tls.Config) does not set MinVersion; update the tls.Config used in c.Transport to include an explicit MinVersion (e.g., tls.VersionTLS12 or tls.VersionTLS13) to make the minimum allowed TLS version explicit and secure while keeping RootCAs: pool intact.internal/gc/tombstone.go (1)
52-81: ⚡ Quick winConsider accumulating partial work before returning errors.
If
DeletePrefixorClearTombstonefails midway through the loop, successfully purged tombstones are not reflected in the result or metrics. In GC workflows this is eventually consistent (the next run retries), but tracking partial success would improve observability and allow incremental progress to be reported.♻️ Proposed enhancement to track partial success
for _, t := range expired { label := t.Site + "/" + t.ID if dryRun { res.Purged = append(res.Purged, label) continue } if _, err := p.Deleter.DeletePrefix(ctx, p.trashPrefix(t)); err != nil { + // Partial work is reflected in res; caller can decide whether to retry or log progress return res, fmt.Errorf("tombstone-purge: delete %s: %w", label, err) } if err := p.Store.ClearTombstone(ctx, t.Site, t.ID); err != nil { return res, fmt.Errorf("tombstone-purge: clear %s: %w", label, err) } res.Purged = append(res.Purged, label) res.BytesReclaimed += t.Bytes } if !dryRun { p.Metrics.reclaimed(res.BytesReclaimed) p.Metrics.run(WorkflowTombstonePurgeLabel, "ok") }The change ensures
res.Purgedandres.BytesReclaimedreflect all successfully purged tombstones even when an error occurs partway through.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/gc/tombstone.go` around lines 52 - 81, The loop in TombstonePurge.Run currently only appends to res.Purged and res.BytesReclaimed after both DeletePrefix and ClearTombstone succeed, so if ClearTombstone fails you lose the partial success; modify the loop so that after a successful p.Deleter.DeletePrefix(ctx, p.trashPrefix(t)) you immediately append label to res.Purged and add t.Bytes to res.BytesReclaimed (still attempt p.Store.ClearTombstone afterwards and return the error if it fails), and before returning an error (when not dryRun) call p.Metrics.reclaimed(res.BytesReclaimed) and p.Metrics.run(WorkflowTombstonePurgeLabel, "error") and log the current res values so partial progress is reflected in the result and metrics; target the TombstonePurge.Run function and the DeletePrefix / ClearTombstone handling, and update where metrics and logging are emitted.internal/hatchet/adapter.go (1)
10-12: ⚡ Quick winReduce reliance on Hatchet legacy
pkg/clientand tighten worker shutdown behavior
internal/hatchet/adapter.goconfigureshsdk.NewClient(opts...)viagithub.com/hatchet-dev/hatchet/pkg/client(v0Client.WithToken/v0Client.WithHostPort). If avoiding the legacypkg/clientimport is the goal, use the Go SDK’s env-based config (HATCHET_CLIENT_TOKEN,HATCHET_CLIENT_HOST_PORT) and initialize withhsdk.NewClient()(no options).Adapter.Stopis a no-op (return nil) and isn’t called from production code; shutdown is driven byStartBlocking(rootCtx)exiting whenrootCtxis canceled. IfEngine.Stopis meant to actively stop the Hatchet worker independent of context cancellation, implement it.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/hatchet/adapter.go` around lines 10 - 12, Replace usage of the legacy v0Client options when creating the Hatchet SDK client in internal/hatchet/adapter.go by relying on the Go SDK's environment-based configuration: call hsdk.NewClient() with no options and remove the import/usage of v0Client (and related types if unused), ensuring HATCHET_CLIENT_TOKEN and HATCHET_CLIENT_HOST_PORT are used instead; additionally, implement Adapter.Stop (and the Engine.Stop method if present) so it actively stops the worker instead of being a no-op — e.g., expose/hold an internal cancel function or worker shutdown handle when starting StartBlocking(rootCtx) and have Stop invoke that shutdown path (safely and idempotently) so the worker can be stopped independently of rootCtx cancellation.internal/hatchet/integration_relay_test.go (1)
115-118: ⚡ Quick winBound the docker restart command with a timeout.
exec.Commandhere can hang indefinitely on daemon/compose issues. Useexec.CommandContextwith a timeout so this test fails fast and diagnostically.Suggested fix
- cmd := exec.Command("docker", "compose", "-f", composeFile, "restart", "hatchet-lite") + ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second) + defer cancel() + cmd := exec.CommandContext(ctx, "docker", "compose", "-f", composeFile, "restart", "hatchet-lite") out, err := cmd.CombinedOutput() + if ctx.Err() == context.DeadlineExceeded { + t.Fatalf("restart hatchet-lite timed out: %s", string(out)) + } require.NoErrorf(t, err, "restart hatchet-lite: %s", string(out))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/hatchet/integration_relay_test.go` around lines 115 - 118, Replace the blocking exec.Command call with exec.CommandContext using a context with a deadline (e.g., 30s) so the restart cannot hang indefinitely: create a context via context.WithTimeout, use exec.CommandContext(ctx, "docker", "compose", "-f", composeFile, "restart", "hatchet-lite") to build cmd, call cmd.CombinedOutput(), and ensure you defer cancel() and assert errors appropriately (handle context.DeadlineExceeded as a diagnostic timeout failure) so the test fails fast and cleanly.internal/worker/relay_test.go (1)
64-107: ⚡ Quick winAdd regression tests for nil relay wiring preconditions.
Given
RunOnceis called via exported struct wiring, add tests for nilSource/Publisher(and nil receiver if you support it) so panic-prone misconfigurations stay guarded.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/worker/relay_test.go` around lines 64 - 107, Add regression tests in relay_test.go to ensure RunOnce guards against nil wiring: add cases that construct Relay with nil Source, Relay with nil Publisher, and a nil *Relay receiver (var r *Relay; r.RunOnce(...)), calling RunOnce and asserting it returns a non-nil error (and does not panic) rather than panicking; reference the Relay type and its RunOnce method and the Source and Publisher fields so the tests explicitly exercise those nil configurations.internal/pg/registry.go (1)
122-134: 💤 Low value
importedcount includes conflict-skipped rows.The counter increments for every loop iteration, but
ON CONFLICT DO NOTHINGmay skip rows that already exist (e.g., from a concurrent partial import). This means the returned count reflects "attempts" rather than "newly inserted" rows. If the count is used for metrics or logging, it may be misleading.To accurately count inserts, check
tag.RowsAffected():Proposed fix
for _, site := range sites { teams := append([]string(nil), site.Teams...) - _, err := conn.Exec(ctx, + tag, err := conn.Exec(ctx, `INSERT INTO sites (slug, teams, created_at, updated_at, created_by) VALUES ($1, $2, $3, $4, $5) ON CONFLICT (slug) DO NOTHING`, site.Slug, teams, site.CreatedAt, site.UpdatedAt, site.CreatedBy) if err != nil { return imported, fmt.Errorf("pg registry import %s: %w", site.Slug, err) } - imported++ + imported += int(tag.RowsAffected()) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/pg/registry.go` around lines 122 - 134, The loop currently counts every attempted insert even when ON CONFLICT DO NOTHING skips rows; change the Exec call in internal/pg/registry.go to capture the command tag (e.g., tag, err := conn.Exec(...)) and only increment imported when tag.RowsAffected() > 0 (so imported reflects actual inserts). Keep the existing error wrapping (fmt.Errorf) for non-nil err and leave the rest of the insert logic unchanged.internal/pg/repoqueue.go (1)
128-140: 💤 Low valueMismatch error may be misleading for
MarkActiveandMarkFailed.Both methods expect
StatusApprovedbut returnErrNotPendingon mismatch. If a request is stillpendingandMarkActiveis called, the error message "not pending" is confusing since the actual requirement is "must be approved."Consider introducing
ErrNotApprovedor a more genericErrInvalidTransitionfor clarity.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/pg/repoqueue.go` around lines 128 - 140, The error returned on status mismatch in RepoQueue.MarkActive and MarkFailed is misleading (they pass reporequest.ErrNotPending while expecting reporequest.StatusApproved); add a clearer error constant (e.g., reporequest.ErrNotApproved or reporequest.ErrInvalidTransition) in the reporequest package and update both calls to transition(ctx, id, reporequest.StatusApproved, reporequest.ErrNotApproved, ...) so the mismatch message accurately reflects the required approved state (leave transition(...) behavior unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/loadgen/main.go`:
- Around line 157-171: The SiteGC instance created in runGCPlan omits the
Metrics field causing a nil pointer panic because gc.SiteGC.Run calls
g.Metrics.run and g.Metrics.tombstoned; fix by supplying a non-nil Metrics
implementation (either a no-op metrics struct implementing the same methods or
the real metrics collector) when constructing gc.SiteGC in runGCPlan so
g.Metrics is always valid (update the runGCPlan initializer to set Metrics to
the no-op or real Metrics implementation).
In `@docs/design/0001-durable-execution-model.md`:
- Around line 21-34: The markdown diagram block starting with the line "commands
(CLI / CI / curl)" is an unlabeled fenced code block causing MD040; add a
neutral language tag (e.g., change opening "```" to "```text") so the fence
reads "```text" and leave the closing "```" unchanged to satisfy the linter
while preserving the ASCII architecture diagram.
In `@docs/design/0002-scalability-capacity.md`:
- Around line 68-70: The markdown file uses unlabeled fenced code blocks which
trigger MD040; update the fences for the specific snippets so they include
language identifiers: mark the DATABASE_URL and loadgen command blocks with
```bash, and mark the Go struct snippet starting with types.Concurrency{ with
```go; ensure the three other snippets containing the loadgen examples and
LOADGEN_DATABASE_URL are labeled as bash as well so all occurrences (the block
with "DATABASE_URL=postgres://...", the block with "types.Concurrency{", and the
blocks with "just loadgen" / "LOADGEN_DATABASE_URL=... just loadgen") use the
appropriate language fences.
In `@internal/backfill/backfill.go`:
- Around line 36-39: Run dereferences potentially nil dependencies (b.Lister and
b.Now) in Backfill.Run causing panics; add nil checks at the start of Run to
validate dependencies and return a clear typed error instead of panicking.
Specifically, in func (b *Backfill) Run(ctx context.Context) check that b !=
nil, b.Lister != nil and b.Now != nil (or provide a default Now function) and
return a descriptive error (e.g., ErrMissingDependency) if any are nil; update
callers/tests if they rely on default behavior and ensure the references around
ListSites, and the b.Now() invocations at the sites processing (the sites
iteration around lines where b.Now is used) use the validated non-nil values.
In `@internal/handler/site.go`:
- Line 135: The call to h.emitSiteChanged currently uses the incoming request
context (r.Context()), so if the client disconnects the emission can be canceled
after PutAlias succeeds; change the emission to use a non-cancelable context
(e.g., context.Background() or a derived context with a long timeout) when
calling h.emitSiteChanged so the post-write consistency event is delivered
regardless of request cancellation; update both places that call
h.emitSiteChanged(r.Context(), site) to use the detached context and ensure any
necessary logging or error handling still runs.
In `@internal/hatchet/adapter.go`:
- Around line 53-71: The Adapter.Start implementation calls w.StartBlocking(ctx)
but Adapter.Stop is a no-op, so shutdown is never signaled; update Adapter.Stop
to honor the engine lifecycle by shutting down the worker and cancelling any
running context: store a cancellable context when starting (e.g., create ctx,
cancel := context.WithCancel(ctx) inside Adapter.Start before calling
w.StartBlocking), save the cancel function and the worker instance (a.worker and
a.cancel), and implement Adapter.Stop to call a.cancel() and invoke the worker
shutdown API (e.g., a.worker.Stop or the appropriate non-blocking stop method
provided by hsdk) and wait for termination if available; ensure error handling
from worker stop/close is returned from Adapter.Stop.
In `@internal/hatchet/integration_poison_test.go`:
- Around line 17-30: poisonSeen and healthySeen are incremented concurrently in
the handler closures (poison, healthy) without synchronization, causing a data
race; change their declarations to an atomic-backed integer (e.g., int32/int64)
or protect them with a mutex, then replace the increments inside the poison and
healthy functions with atomic.Add... (or mutex-protected increments) and use
atomic.Load... (or mutex-protected reads) where the counters are inspected later
so all accesses are race-free.
In `@internal/pg/migrate.go`:
- Around line 29-31: The deferred unlock currently uses the possibly canceled
caller ctx which can cause pg_advisory_unlock (invoked via conn.Exec with
migrateAdvisoryLockKey) to fail and leak the advisory lock; change the defer to
run the unlock with a new non-cancelable context (e.g., context.Background()
with a short timeout) so the unlock runs even if the original ctx is canceled,
and remove any direct conn.Conn().Close() calls—return the connection to the
pool by calling conn.Release() instead; update the defer to create the timeout
context, call conn.Exec with that context and migrateAdvisoryLockKey, cancel the
timeout context, and ensure only conn.Release() is used to release the pooled
connection.
In `@internal/teamcache/teamcache.go`:
- Around line 55-68: GetOrFetch currently allows parallel cache misses for the
same login to call fetch concurrently; change it to coalesce concurrent misses
using a single-flight mechanism: add a singleflight.Group (or an equivalent
mutex+map) as a field on Cache and, after detecting a miss via Cache.Get, call
group.Do(login, func() (interface{}, error) { teams, err := fetch(ctx); if err
!= nil { return nil, err }; _ = c.Set(ctx, login, teams); return teams, nil })
so only one fetch/Set runs per login and the returned value/error is shared to
all waiters; ensure the returned interface is cast back to []string and errors
propagate unchanged.
In `@internal/worker/relay.go`:
- Around line 28-35: RunOnce currently assumes Relay and its dependencies exist
and will panic if miswired; add explicit nil guards at the start of
Relay.RunOnce to return descriptive errors instead of panicking: check if r ==
nil, then verify r.Source != nil before calling r.Source.FetchUnpublished and
r.Source.MarkPublished, and verify r.Publisher != nil before calling
r.Publisher.Publish; return wrapped errors like "relay: missing Source" /
"relay: missing Publisher" so callers get a clear error path rather than a
runtime panic.
In `@internal/worker/runtime.go`:
- Around line 40-42: NewRuntime currently allows construction with a nil Engine
which leads to panics in runtime lifecycle methods; change NewRuntime to
validate the engine and return (*Runtime, error) (e.g., return nil,
fmt.Errorf("nil engine")) when engine is nil, update callers to handle the
error, and keep the Runtime struct and methods Register, Start, and Stop
assuming a non-nil engine; also add defensive nil checks in Register/Start/Stop
to return errors if somehow called on a Runtime with nil engine to avoid panics.
In `@scripts/e2e-local.sh`:
- Line 45: The private key is being made world-readable; tighten its permissions
by setting "$CERTS_DIR/private.key" to 0600 (owner read/write only) while
keeping "$CERTS_DIR/public.crt" at 0644; update the chmod invocation that
currently targets both files to apply 0600 to the private key and 0644 to the
public cert (referencing the existing "$CERTS_DIR/private.key" and
"$CERTS_DIR/public.crt" symbols).
In `@scripts/loadgen.sh`:
- Around line 29-34: The readiness loop using seq 1 30 and docker exec
"$CONTAINER" pg_isready should fail fast if Postgres never becomes ready: after
the for loop that breaks on success, check whether pg_isready actually succeeded
and if not emit a clear error (including $CONTAINER and that pg_isready timed
out) and exit non-zero (e.g., exit 1) so the script stops rather than continuing
to run the load; update the block containing the for loop, CONTAINER, and
pg_isready checks to perform this post-loop assertion and error log/exit.
---
Outside diff comments:
In `@internal/handler/site_register.go`:
- Around line 193-227: Registry deletion happens before the R2 purge so a
partial failure can leave orphaned data; change the flow to perform
h.R2.MovePrefix(r.Context(), slug+"/", base+slug+"/") and then
h.Tombstones.RecordTombstone(r.Context(), slug, "", 0) before calling
h.Registry.Delete(r.Context(), slug), propagating errors from
MovePrefix/RecordTombstone with the existing writeUpstreamError handling and
only deleting the registry entry if both data-move and tombstone recording
succeed; after successful delete keep the same logging and response behavior
(use the same symbols: h.R2.MovePrefix, h.Tombstones.RecordTombstone,
h.Registry.Delete, TrashPrefixBase) and ensure you preserve the original
not-found handling for Delete when appropriate.
---
Nitpick comments:
In `@internal/auth/github.go`:
- Around line 368-376: The durable-cache write in the function that calls
c.fetchUserTeams currently returns an error when c.teamCacheDurable.Set fails,
discarding the successfully fetched teams; change this to make the durable write
non-fatal: after calling teams, err := c.fetchUserTeams(...), keep the existing
error return if fetchUserTeams fails, but when c.teamCacheDurable.Set(ctx,
login, teams) returns an error, log a warning including login and the error (use
the existing logger on the receiver, e.g. c.logger or similar) and then return
the teams instead of returning the error; keep the original behavior only for
fetchUserTeams errors and ensure any in-memory cache (teamCache) behavior
remains unchanged.
In `@internal/gc/tombstone.go`:
- Around line 52-81: The loop in TombstonePurge.Run currently only appends to
res.Purged and res.BytesReclaimed after both DeletePrefix and ClearTombstone
succeed, so if ClearTombstone fails you lose the partial success; modify the
loop so that after a successful p.Deleter.DeletePrefix(ctx, p.trashPrefix(t))
you immediately append label to res.Purged and add t.Bytes to res.BytesReclaimed
(still attempt p.Store.ClearTombstone afterwards and return the error if it
fails), and before returning an error (when not dryRun) call
p.Metrics.reclaimed(res.BytesReclaimed) and
p.Metrics.run(WorkflowTombstonePurgeLabel, "error") and log the current res
values so partial progress is reflected in the result and metrics; target the
TombstonePurge.Run function and the DeletePrefix / ClearTombstone handling, and
update where metrics and logging are emitted.
In `@internal/hatchet/adapter.go`:
- Around line 10-12: Replace usage of the legacy v0Client options when creating
the Hatchet SDK client in internal/hatchet/adapter.go by relying on the Go SDK's
environment-based configuration: call hsdk.NewClient() with no options and
remove the import/usage of v0Client (and related types if unused), ensuring
HATCHET_CLIENT_TOKEN and HATCHET_CLIENT_HOST_PORT are used instead;
additionally, implement Adapter.Stop (and the Engine.Stop method if present) so
it actively stops the worker instead of being a no-op — e.g., expose/hold an
internal cancel function or worker shutdown handle when starting
StartBlocking(rootCtx) and have Stop invoke that shutdown path (safely and
idempotently) so the worker can be stopped independently of rootCtx
cancellation.
In `@internal/hatchet/integration_concurrency_test.go`:
- Around line 61-74: TestR3DistinctSitesRunConcurrent currently only waits for
each site to start; modify the test to assert real concurrency by recording
start timestamps in the test observer (use newObserver/obs) or by exposing start
events from instrumented handlers, then after firing both workflows (h.fire) and
waiting for starts (h.waitStarts), assert that the start times for siteA and
siteB overlap (e.g., the difference between their start timestamps is less than
the handler duration 1500ms or that the second site started before the first
site finished). Update the observer/instrumented helper to capture per-site
start and finish times and add an assertion in TestR3DistinctSitesRunConcurrent
that verifies overlap rather than only eventual starts.
In `@internal/hatchet/integration_relay_test.go`:
- Around line 115-118: Replace the blocking exec.Command call with
exec.CommandContext using a context with a deadline (e.g., 30s) so the restart
cannot hang indefinitely: create a context via context.WithTimeout, use
exec.CommandContext(ctx, "docker", "compose", "-f", composeFile, "restart",
"hatchet-lite") to build cmd, call cmd.CombinedOutput(), and ensure you defer
cancel() and assert errors appropriately (handle context.DeadlineExceeded as a
diagnostic timeout failure) so the test fails fast and cleanly.
In `@internal/pg/registry.go`:
- Around line 122-134: The loop currently counts every attempted insert even
when ON CONFLICT DO NOTHING skips rows; change the Exec call in
internal/pg/registry.go to capture the command tag (e.g., tag, err :=
conn.Exec(...)) and only increment imported when tag.RowsAffected() > 0 (so
imported reflects actual inserts). Keep the existing error wrapping (fmt.Errorf)
for non-nil err and leave the rest of the insert logic unchanged.
In `@internal/pg/repoqueue.go`:
- Around line 128-140: The error returned on status mismatch in
RepoQueue.MarkActive and MarkFailed is misleading (they pass
reporequest.ErrNotPending while expecting reporequest.StatusApproved); add a
clearer error constant (e.g., reporequest.ErrNotApproved or
reporequest.ErrInvalidTransition) in the reporequest package and update both
calls to transition(ctx, id, reporequest.StatusApproved,
reporequest.ErrNotApproved, ...) so the mismatch message accurately reflects the
required approved state (leave transition(...) behavior unchanged).
In `@internal/worker/relay_test.go`:
- Around line 64-107: Add regression tests in relay_test.go to ensure RunOnce
guards against nil wiring: add cases that construct Relay with nil Source, Relay
with nil Publisher, and a nil *Relay receiver (var r *Relay; r.RunOnce(...)),
calling RunOnce and asserting it returns a non-nil error (and does not panic)
rather than panicking; reference the Relay type and its RunOnce method and the
Source and Publisher fields so the tests explicitly exercise those nil
configurations.
In `@test/e2e/harness_test.go`:
- Around line 78-80: The TLS config assigned to c.Transport (http.Transport ->
tls.Config) does not set MinVersion; update the tls.Config used in c.Transport
to include an explicit MinVersion (e.g., tls.VersionTLS12 or tls.VersionTLS13)
to make the minimum allowed TLS version explicit and secure while keeping
RootCAs: pool intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: f2de9c42-4c18-448c-b8f3-fe6ed2a9a201
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (96)
.gitignorecmd/artemis/gcwire.gocmd/artemis/gcwire_test.gocmd/artemis/gcworkflows.gocmd/artemis/gcworkflows_test.gocmd/artemis/main.gocmd/artemis/main_test.gocmd/artemis/relayloop_test.gocmd/loadgen/main.godocker-compose.ymldocs/design/0001-durable-execution-model.mddocs/design/0002-scalability-capacity.mdgo.modinternal/auth/github.gointernal/auth/github_test.gointernal/backfill/backfill.gointernal/backfill/backfill_test.gointernal/config/config.gointernal/config/config_test.gointernal/gc/gcsite.gointernal/gc/gcsite_test.gointernal/gc/marker.gointernal/gc/metrics.gointernal/gc/metrics_test.gointernal/gc/plan.gointernal/gc/plan_test.gointernal/gc/reconcile.gointernal/gc/reconcile_test.gointernal/gc/retain.gointernal/gc/retain_test.gointernal/gc/tombstone.gointernal/gc/tombstone_test.gointernal/handler/deploy.gointernal/handler/deploy_delete.gointernal/handler/deploy_delete_test.gointernal/handler/deploy_test.gointernal/handler/handler.gointernal/handler/outbox_emit_test.gointernal/handler/readyz.gointernal/handler/readyz_test.gointernal/handler/site.gointernal/handler/site_purge_test.gointernal/handler/site_register.gointernal/handler/test_helpers_test.gointernal/hatchet/adapter.gointernal/hatchet/adapter_test.gointernal/hatchet/integration_concurrency_test.gointernal/hatchet/integration_harness_test.gointernal/hatchet/integration_poison_test.gointernal/hatchet/integration_relay_test.gointernal/pg/alias.gointernal/pg/alias_test.gointernal/pg/migrate.gointernal/pg/migrate_test.gointernal/pg/migrations/0001_init.sqlinternal/pg/migrations/0002_registry.sqlinternal/pg/migrations/0003_repo_requests.sqlinternal/pg/migrations/0004_outbox_id_index.sqlinternal/pg/outbox.gointernal/pg/outbox_test.gointernal/pg/pg.gointernal/pg/registry.gointernal/pg/registry_import_test.gointernal/pg/registry_test.gointernal/pg/repo.gointernal/pg/repo_test.gointernal/pg/repoqueue.gointernal/pg/repoqueue_test.gointernal/pg/saga.gointernal/pg/saga_test.gointernal/r2/r2.gointernal/r2/r2_test.gointernal/registry/valkey/cutover_test.gointernal/registry/valkey/reader.gointernal/registry/valkey/store.gointernal/server/server.gointernal/teamcache/teamcache.gointernal/teamcache/teamcache_test.gointernal/worker/debounce.gointernal/worker/debounce_test.gointernal/worker/deployflows.gointernal/worker/deployflows_test.gointernal/worker/metrics.gointernal/worker/metrics_test.gointernal/worker/relay.gointernal/worker/relay_test.gointernal/worker/runtime.gointernal/worker/runtime_test.gojustfilescripts/e2e-local.shscripts/loadgen.shtest/e2e/catalog_test.gotest/e2e/compose.e2e.yamltest/e2e/harness_test.gotest/e2e/helpers_test.gotest/integration/hatchet/compose.hatchet.yaml
|
Dispositions for the review-body items (posted outside inline threads): Outside-diff — Nitpicks:
|
No description provided.