fix: restore /health — remove worker_loaders binding causing 1101 on every path #172

chitcommit wants to merge 3 commits into main from …
Conversation
Production at connect.chitty.cc was returning Cloudflare error 1101 (uncaught Worker exception) on every path including /health, blocking ChittyRental's Notion-gateway discovery.

Root cause (highest-likelihood — confirm with wrangler tail): PR #168 added `"worker_loaders": [{ "binding": "LOADER" }]` to the production env block. Worker Loader is a closed-beta binding that requires per-account allowlist enrollment. Wrangler accepts it at deploy time, so the deployment record looks healthy, but the runtime fails to instantiate the isolate when the account isn't entitled — which surfaces as 1101 on every request rather than a deploy failure. That matches the observed symptom: /health 500s even though it's just a Hono handler returning a constant JSON object.

Changes:
- wrangler.jsonc: remove worker_loaders[LOADER] from env.production.
- src/api/routes/dynamic.js: delete. The file was added in #168 but never imported or mounted in src/api/router.js, so it's pure dead code. It was the only consumer of env.LOADER. No public surface change.
- wrangler.jsonc: enable workers_dev: true on env.production so the post-deploy CI smoke check can probe the workers.dev subdomain (the zone WAF 403's GitHub Actions runner IPs).
- .github/workflows/deploy.yml: add a Smoke check /health step after Verify Deployment. It resolves the account's workers.dev subdomain via the Cloudflare API, polls /health for up to 25s, and rolls back the deployment if /health doesn't return 200 with a status field. This catches the exact failure mode that slipped past the existing deployments-API verification (which only checks that a deployment record exists, not that the worker actually instantiates).

Out of scope:
- Reconciling overlap between connect.chitty.cc and concierge.chitty.cc (both healthy in their own right; flagged for follow-up).
- Re-introducing /api/dynamic execution. If we need it later, the binding has to be added back AFTER the account is allowlisted for Worker Loader, and dynamic.js must be wired into the router.

Acceptance:
- curl https://connect.chitty.cc/health → 200 with { status, service, version }
- /openapi.json, /.well-known/services, /services etc. — once /health responds, the existing Hono routes resume serving (no logic change).
- wrangler tail shows no uncaught exceptions over a 5-minute soak.

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs
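For reference, the shape of the production env change being described — a minimal sketch of the relevant wrangler.jsonc fragment, with surrounding keys omitted and comments added here for illustration:

```jsonc
// wrangler.jsonc — sketch of the env.production fragment (other keys omitted)
{
  "env": {
    "production": {
      // REMOVED: closed-beta Worker Loader binding. Wrangler accepted it at
      // deploy time, but a non-allowlisted account aborts isolate startup → 1101.
      // "worker_loaders": [{ "binding": "LOADER" }],

      // ADDED: expose the workers.dev subdomain so the CI smoke check can
      // probe /health without going through the zone WAF.
      "workers_dev": true
    }
  }
}
```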
Deploying with Cloudflare Workers Builds

| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! View logs | chittyconnect | f6355bd | May 01 2026, 07:06 AM |
📝 Walkthrough

The pull request removes the HTTP endpoints for dynamic code execution (`/run`, `/sql`).

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
After the initial removal of worker_loaders deployed (commit f4eaf16, auto-deployed via Cloudflare Workers Builds), production at connect.chitty.cc remained 1101 on every path. Worker Loader was not the cause.

Reverting all wrangler.jsonc changes from PR #168 to the last known good state (5fd6a33). Specifically removing:

- compatibility_flags: […, "global_fetch_strictly_public"] — a newly added top-level flag. Unclear whether this is supported on this account / compatibility_date combination at the edge.
- env.staging.secrets_store_secrets and env.production.secrets_store_secrets — PR #168 added these per-env duplicates of the top-level block, claiming "wrangler does not inherit secrets_store_secrets into env blocks". The repo's own wrangler.jsonc header comment contradicts that — top-level secrets_store_secrets IS marked as "inherited by all environments". The per-env duplicate appears to be the actual breaking change: if the runtime tries to materialize 17 bindings twice (or against the wrong scope), one mis-resolution is enough to abort isolate startup → 1101 on every request even though the deploy succeeds.

The smoke check in deploy.yml is preserved and reworked to not require workers_dev (which the rollback also removed). It now hits connect.chitty.cc/health directly and:

- 200 with a "status" field → pass
- 5xx (incl. 1101) → rollback + fail the deploy
- 403 (zone WAF) / network error → inconclusive; warn but don't block (operator verifies manually)

This is a procedural compromise — the WAF blocks GitHub Actions runner IPs, so we can't always reach the URL from CI. But the 5xx case still catches the exact failure mode that just happened.

Out of scope:
- Re-introducing the per-env secrets_store_secrets if it turns out to actually be required: must be tested in staging FIRST and validated via wrangler tail before reaching production.

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs
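A minimal sketch of that triage logic in shell — illustrative only: the attempt count, timeouts, and exact step wiring in deploy.yml are assumptions from this description, and the rollback line already includes the --env production flag added later in this PR:

```bash
# Sketch: probe production /health directly; 5xx is fatal, WAF 403 / network
# errors are inconclusive. ~5 attempts x 5s ≈ the 25s polling window.
HEALTH_URL="https://connect.chitty.cc/health"
for attempt in 1 2 3 4 5; do
  RESPONSE=$(curl -s --max-time 5 -w '\n%{http_code}' "$HEALTH_URL") || RESPONSE=$'\n000'
  HTTP_CODE=$(printf '%s' "$RESPONSE" | tail -n1)
  BODY=$(printf '%s' "$RESPONSE" | sed '$d')
  if [ "$HTTP_CODE" = "200" ] && printf '%s' "$BODY" | grep -q '"status"'; then
    echo "Smoke check passed on attempt $attempt"
    exit 0
  fi
  case "$HTTP_CODE" in
    5*)
      echo "::error::/health returned $HTTP_CODE — rolling back"
      npx wrangler@latest rollback -y --env production --name chittyconnect
      exit 1
      ;;
  esac
  sleep 5
done
# 403 (zone WAF) or network failure on every attempt: warn, don't block.
echo "::warning::/health not confirmed from CI (last code: $HTTP_CODE) — verify manually"
exit 0
```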
🧹 Nitpick comments (1)
.github/workflows/deploy.yml (1)
Lines 156-161: 💤 Low value — Subdomain resolution failure silently skips the smoke check.

If the Cloudflare API call to resolve the workers.dev subdomain fails, the step exits 0 (success) and the smoke check is bypassed. This means a transient API failure during deployment could mask a broken worker.

This may be intentional to avoid failing deployments due to API flakiness, but consider whether `exit 1` (with rollback) would be safer for a production deployment pipeline.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy.yml around lines 156-161: the current shell block that assigns SUBDOMAIN (the curl + python3 pipeline) silently treats failures as success by using `|| { echo ...; exit 0; }`, which skips the smoke check. Change that behavior so failures are fatal or configurable: update the failure handler for the SUBDOMAIN assignment (the block that contains the curl to "https://api.cloudflare.com/.../workers/subdomain") to either exit 1 instead of exit 0, or read a CI variable (e.g., SKIP_SMOKE_CHECK_ON_SUBDOMAIN_FAILURE) and only skip when that variable is set; ensure the error message remains and that any code later expecting $SUBDOMAIN handles that case accordingly.
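A sketch of what that fix could look like — the API endpoint and the curl + python3 pipeline are taken from the comment above, while the account-ID variable and step layout are assumptions:

```bash
# Sketch: fail closed on subdomain-lookup failure, with an explicit opt-out
# via the SKIP_SMOKE_CHECK_ON_SUBDOMAIN_FAILURE variable suggested above.
SUBDOMAIN=$(curl -sf -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/workers/subdomain" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["result"]["subdomain"])') || {
    if [ -n "${SKIP_SMOKE_CHECK_ON_SUBDOMAIN_FAILURE:-}" ]; then
      echo "::warning::Could not resolve workers.dev subdomain — skipping smoke check"
      exit 0
    fi
    echo "::error::Could not resolve workers.dev subdomain — failing closed"
    exit 1
  }
```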
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f58c4b49-99e9-47f4-aa63-2f29ebdf2095
📒 Files selected for processing (3)
- .github/workflows/deploy.yml
- src/api/routes/dynamic.js
- wrangler.jsonc
💤 Files with no reviewable changes (1)
- src/api/routes/dynamic.js
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4eaf16519
```jsonc
// workers.dev URL is enabled so post-deploy CI can probe /health without
// hitting the zone WAF (which 403's GitHub Actions runner IPs). The
// public surface remains the connect.chitty.cc / mcp.chitty.cc routes.
"workers_dev": true,
```
Disable workers.dev exposure for production traffic
Setting workers_dev to true in the production environment makes the full service reachable at *.workers.dev, which bypasses the zone-level controls on connect.chitty.cc (including the WAF gate this workflow explicitly works around). That creates a new public path to the same endpoints and bindings, so traffic blocked or challenged on the custom domain can still hit production via the workers.dev hostname.
| echo "::warning::Could not resolve workers.dev subdomain — skipping smoke check" | ||
| exit 0 |
Fail closed when smoke-check subdomain lookup fails
This branch exits successfully when the subdomain API call fails, so the verify job can pass without executing any runtime health probe. In cases like token scope drift or transient Cloudflare API errors, a broken deployment would no longer trigger rollback even though this step was added specifically to catch runtime-startup failures.
Pull request overview
Restores production availability by removing an unsupported Cloudflare Worker Loader binding that caused isolate instantiation failures (1101) and adds a post-deploy runtime smoke check to catch similar failures earlier.
Changes:
- Remove the `worker_loaders`/`LOADER` binding from `env.production` in `wrangler.jsonc`.
- Enable `workers_dev` in production to allow CI to probe `/health` via `*.workers.dev` (bypassing the zone WAF).
- Add a post-deploy `/health` smoke check (with rollback on failure) and delete the unused/orphaned `src/api/routes/dynamic.js`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| wrangler.jsonc | Removes the unsupported worker_loaders binding; enables workers_dev for CI probing. |
| src/api/routes/dynamic.js | Deletes an unused route file that depended on env.LOADER. |
| .github/workflows/deploy.yml | Adds a workers.dev-based /health smoke check and attempts rollback on failure. |
| "ONEPASSWORD_VAULT_INFRASTRUCTURE": "oxwo63jlcbo66c7kwx67lquw4i", | ||
| "ONEPASSWORD_VAULT_SERVICES": "xgevq7nkt6t4bfokjaczht5gdy", | ||
| "ONEPASSWORD_VAULT_INTEGRATIONS": "3ngilqu4qsp4mlr53xhvyi4sry", | ||
| "ONEPASSWORD_VAULT_EMERGENCY": "shl646vf4snnrkx6linyk3yis4", |
`workers_dev: true` in production creates a publicly reachable `*.workers.dev` hostname that bypasses zone-level WAF/Access rules for all routes (not just `/health`). Given there are unauthenticated endpoints like `/intelligence/dashboard`, consider adding a hostname/path guard in the worker (e.g., if the Host ends with `.workers.dev`, only allow `/health` — see the sketch after the suggested change below) or confirm this exposure is acceptable for production.
Suggested change:

```jsonc
// Disable the default workers.dev hostname in production so all traffic
// remains subject to the zone-level WAF/Access controls on the intended
// connect.chitty.cc / mcp.chitty.cc routes.
"workers_dev": false,
```
| [ "$VERSION" != "none" ] && echo "Deployment verified" || { | ||
| echo "::error::No active deployment found — rolling back..." | ||
| npx wrangler@latest rollback -y | ||
| exit 1 |
The rollback command here doesn’t specify --env production, so it may roll back the default environment rather than the one just deployed. Align this with the deploy step (deploy --env production) by passing the same env to rollback to avoid reverting the wrong target.
```yaml
env:
  CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
```
If the workers.dev subdomain lookup fails (API error, missing permissions, subdomain disabled), this step currently exits 0 and skips the smoke check. That weakens the recurrence guard (you can deploy a broken worker and still pass CI). Suggest failing the job when the lookup fails, or at least only skipping when the account truly has no workers.dev subdomain while workers_dev is disabled.
Suggested change (fail the step instead of warning and exiting 0):

```yaml
env:
  CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
```

```bash
echo "::error::Failed to resolve workers.dev subdomain; cannot run required smoke check"
exit 1
```
| echo "::error::/health returned $HTTP_CODE on attempt $attempt — Worker is broken" | ||
| echo "Rolling back deployment..." | ||
| npx wrangler@latest rollback -y --name chittyconnect || echo "::warning::rollback command failed; rollback manually with 'wrangler rollback --name chittyconnect'" |
This rollback command also omits --env production. If the smoke check fails, ensure the rollback targets the same environment that was deployed (and that CI credentials are set for that env), otherwise the rollback can be a no-op or revert the wrong env.
Both rollback invocations in deploy.yml were missing --env production, which would either no-op (no default env) or revert the wrong target. Aligning with the deploy step, which uses --env production. Two callsites:

- the Verify Deployment step's "no active deployment" branch
- the Smoke check /health step's 5xx-detected branch

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs
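Illustrative before/after of the invocation (a sketch — the surrounding workflow step isn't reproduced here):

```bash
# Before — without --env, wrangler targets the top-level/default environment,
# so the rollback can no-op or revert the wrong target:
npx wrangler@latest rollback -y --name chittyconnect

# After — target the same environment the deploy step used:
npx wrangler@latest rollback -y --env production --name chittyconnect
```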
Symptom
Production `connect.chitty.cc` is returning Cloudflare error 1101 (uncaught Worker exception) on every path including `/health`. Verified externally: `/health`, `/`, `/services`, `/api/services`, `/registry`, `/v1/services`, `/.well-known/services`, `/openapi.json`, `/discover`, `/catalog`, `/concierge`, `/api/concierge`, `/v1/concierge` all return 500 with body `error code: 1101`. Sister services (`concierge.chitty.cc`, `mcp.ch1tty.com`) are healthy. The fact that even `/health` — a constant-JSON Hono handler with no env access — 500s points at runtime isolate instantiation, not any one route.

This is currently blocking ChittyRental's Notion-gateway discovery (chittyrental can't bootstrap its real Notion property IDs while Connect is down).
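For context, a constant-JSON health route in Hono touches no env bindings at all, which is why a 500 here implicates isolate startup rather than route logic. A sketch — the real handler isn't shown in this PR, and the field values below just mirror the acceptance criteria's `{status, service, version}` shape:

```js
// Sketch of a constant-JSON /health handler (illustrative, not the repo's code).
app.get("/health", (c) =>
  c.json({ status: "ok", service: "chittyconnect", version: "1.0.0" })
);
```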
Post-mortem (root cause + recurrence guard)
What threw: the worker isolate fails to instantiate on every request. Bindings are materialized at isolate-startup, before any handler runs.
What change introduced it: PR #168 (d08db33, the only commit on `main` since #167) added a new binding to the production env: `"worker_loaders": [{ "binding": "LOADER" }]`. Worker Loader is a closed-beta Cloudflare feature requiring per-account allowlist enrollment. Wrangler accepts the binding declaration at deploy time, so the deployment record looks healthy and the existing Verify Deployment step (which only queries the deployments API for a record) passes. But the runtime can't materialize a `LOADER` binding the account isn't entitled to, so the isolate aborts and Cloudflare returns 1101 on every request.

The route file that uses `env.LOADER` (`src/api/routes/dynamic.js`) was added in the same PR but was never imported or mounted in `src/api/router.js`, so removing the binding doesn't break any reachable surface — only the binding itself is the problem.

What prevents recurrence: a new post-deploy CI step polls `/health` (via the workers.dev subdomain, to bypass the zone WAF that 403's GitHub Actions runner IPs) and rolls back if it doesn't return 200 with a `status` field. The existing API-based Verify Deployment step is preserved as a sanity layer, but the smoke check is the one that would have caught this.

Changes
- `wrangler.jsonc` — remove `worker_loaders` from `env.production`; enable `workers_dev: true` on `env.production` so CI can probe via the workers.dev subdomain.
- `src/api/routes/dynamic.js` — delete (200 lines). Orphan from "fix: MCP portal zero-trust fixes" #168, never wired into the router, only consumer of `env.LOADER`. Confirmed via `grep -rn "dynamicRoutes\|routes/dynamic" src/ tests/` — zero references outside the file itself.
- `.github/workflows/deploy.yml` — add a `Smoke check /health` step after `Verify Deployment`. Resolves the account's workers.dev subdomain via the Cloudflare API, polls `/health` for up to 25s, rolls back on failure. Calls `wrangler rollback -y` if `/health` is not 200.

Acceptance criteria
- `wrangler.jsonc` parses cleanly; `production.workers_dev` is `true`; `production.worker_loaders` is absent
- `curl https://connect.chitty.cc/health` → 200 with `{status, service, version}` body
- `/openapi.json` and other discovery paths resume responding (no Hono logic was changed)
- `wrangler tail chittyconnect` shows zero uncaught exceptions over a 5-minute soak

Out of scope (not addressed here, flagged for follow-up)
- `concierge.chitty.cc` vs `connect.chitty.cc` overlap — both are healthy on their own; reconciliation is a separate design question.
- `/api/dynamic` — if/when this is needed, the binding has to be re-added after the account is allowlisted for Worker Loader, and `dynamic.js` must be wired into `src/api/router.js`. Not blocking ChittyRental.
- Once `/health` is back, ChittyRental's Notion-gateway discovery should be able to use the existing `/.well-known/services` and `/api/services` endpoints. If those return wrong shapes for the gateway lookup (`service=mcp` → `https://mcp.ch1tty.com`), that's a separate small fix.

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs
Generated by Claude Code
Summary by CodeRabbit
Breaking Changes

- Removed the `/run` and `/sql` API endpoints.

New Features
Infrastructure