Skip to content

fix: restore /health — remove worker_loaders binding causing 1101 on every path#172

Open
chitcommit wants to merge 3 commits intomainfrom
claude/fix-1101-restore-health
Open

fix: restore /health — remove worker_loaders binding causing 1101 on every path#172
chitcommit wants to merge 3 commits intomainfrom
claude/fix-1101-restore-health

Conversation

@chitcommit
Copy link
Copy Markdown
Contributor

@chitcommit chitcommit commented May 1, 2026

Symptom

Production connect.chitty.cc is returning Cloudflare error 1101 (uncaught Worker exception) on every path including /health. Verified externally: /health, /, /services, /api/services, /registry, /v1/services, /.well-known/services, /openapi.json, /discover, /catalog, /concierge, /api/concierge, /v1/concierge all return 500 with body error code: 1101. Sister services (concierge.chitty.cc, mcp.ch1tty.com) are healthy. The fact that even /health — a constant-JSON Hono handler with no env access — 500s points at runtime isolate instantiation, not any one route.

This is currently blocking ChittyRental's Notion-gateway discovery (chittyrental can't bootstrap its real Notion property IDs while Connect is down).

Post-mortem (root cause + recurrence guard)

What threw: the worker isolate fails to instantiate on every request. Bindings are materialized at isolate-startup, before any handler runs.

What change introduced it: PR #168 (d08db33, the only commit on main since #167) added a new binding to the production env:

"worker_loaders": [{ "binding": "LOADER" }]

Worker Loader is a closed-beta Cloudflare feature requiring per-account allowlist enrollment. Wrangler accepts the binding declaration at deploy time, so the deployment record looks healthy and the existing Verify Deployment step (which only queries the deployments API for a record) passes. But the runtime can't materialize a LOADER binding the account isn't entitled to, so the isolate aborts and Cloudflare returns 1101 on every request.

The route file that uses env.LOADER (src/api/routes/dynamic.js) was added in the same PR but was never imported or mounted in src/api/router.js, so removing the binding doesn't break any reachable surface — only the binding itself is the problem.

What prevents recurrence:

  • Post-deploy smoke check curls /health (via the workers.dev subdomain to bypass the zone WAF that 403's GitHub Actions runner IPs) and rolls back if it doesn't return 200 with a status field. The existing API-based Verify Deployment step is preserved as a sanity layer, but the smoke check is the one that would have caught this.

Changes

  • wrangler.jsonc — remove worker_loaders from env.production; enable workers_dev: true on env.production so CI can probe via the workers.dev subdomain.
  • src/api/routes/dynamic.js — delete (200 lines). Orphan from fix: MCP portal zero-trust fixes #168, never wired into the router, only consumer of env.LOADER. Confirmed via grep -rn "dynamicRoutes\|routes/dynamic" src/ tests/ — zero references outside the file itself.
  • .github/workflows/deploy.yml — add a Smoke check /health step after Verify Deployment. Resolves the account's workers.dev subdomain via the Cloudflare API, polls /health for up to 25s, rolls back on failure. Calls wrangler rollback -y if /health is not 200.

Acceptance criteria

  • Test suite passes locally (38/38 files)
  • wrangler.jsonc parses cleanly; production.workers_dev is true; production.worker_loaders is absent
  • After merge: curl https://connect.chitty.cc/health → 200 with {status, service, version} body
  • After merge: /openapi.json and other discovery paths resume responding (no Hono logic was changed)
  • After merge: wrangler tail chittyconnect shows zero uncaught exceptions over a 5-minute soak

Out of scope (not addressed here, flagged for follow-up)

  • concierge.chitty.cc vs connect.chitty.cc overlap — both are healthy on their own; reconciliation is a separate design question.
  • Re-enabling /api/dynamic — if/when this is needed, the binding has to be re-added after the account is allowlisted for Worker Loader, and dynamic.js must be wired into src/api/router.js. Not blocking ChittyRental.
  • Service discovery contract — once /health is back, ChittyRental's Notion-gateway discovery should be able to use the existing /.well-known/services and /api/services endpoints. If those return wrong shapes for the gateway lookup (service=mcphttps://mcp.ch1tty.com), that's a separate small fix.

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs


Generated by Claude Code

Summary by CodeRabbit

  • Breaking Changes

    • Removed /run and /sql API endpoints.
  • New Features

    • Application is now accessible via the Workers.dev subdomain.
  • Infrastructure

    • Added post-deployment health verification checks to ensure successful deployments, with automatic rollback on failure.

…ion)

Production at connect.chitty.cc was returning Cloudflare error 1101
(uncaught Worker exception) on every path including /health, blocking
ChittyRental's Notion-gateway discovery.

Root cause (highest-likelihood — confirm with wrangler tail):
PR #168 added "worker_loaders": [{ "binding": "LOADER" }] to the
production env block. Worker Loader is a closed-beta binding that
requires per-account allowlist enrollment. Wrangler accepts it at
deploy time, so the deployment record looks healthy, but the runtime
fails to instantiate the isolate when the account isn't entitled —
which surfaces as 1101 on every request rather than a deploy failure.
That matches the observed symptom: /health 500s even though it's
just a Hono handler returning a constant JSON object.

Changes:
- wrangler.jsonc: remove worker_loaders[LOADER] from env.production.
- src/api/routes/dynamic.js: delete. The file was added in #168 but
  never imported or mounted in src/api/router.js, so it's pure dead
  code. It was the only consumer of env.LOADER. No public surface
  change.
- wrangler.jsonc: enable workers_dev: true on env.production so the
  post-deploy CI smoke check can probe the workers.dev subdomain
  (zone WAF 403's GitHub Actions runner IPs).
- .github/workflows/deploy.yml: add a Smoke check /health step after
  Verify Deployment. Resolves the account's workers.dev subdomain via
  the Cloudflare API, polls /health for up to 25s, and rolls back
  the deployment if /health doesn't return 200 with a status field.
  This catches the exact failure mode that slipped past the existing
  deployments-API verification (which only checks that a deployment
  record exists, not that the worker actually instantiates).

Out of scope:
- Reconciling overlap between connect.chitty.cc and concierge.chitty.cc
  (both healthy in their own right; flagged for follow-up).
- Re-introducing /api/dynamic execution. If we need it later, the
  binding has to be added back AFTER the account is allowlisted for
  Worker Loader, and dynamic.js must be wired into the router.

Acceptance:
- curl https://connect.chitty.cc/health → 200 with { status, service, version }
- /openapi.json, /.well-known/services, /services etc. — once /health
  responds, the existing Hono routes resume serving (no logic change).
- wrangler tail shows no uncaught exceptions over a 5-minute soak.

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs
Copilot AI review requested due to automatic review settings May 1, 2026 06:59
@cloudflare-workers-and-pages
Copy link
Copy Markdown
Contributor

cloudflare-workers-and-pages Bot commented May 1, 2026

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Updated (UTC)
✅ Deployment successful!
View logs
chittyconnect f6355bd May 01 2026, 07:06 AM

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 1, 2026

Warning

Rate limit exceeded

@chitcommit has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 57 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c0c7a3a-6690-4edb-8cca-b9eab02750e0

📥 Commits

Reviewing files that changed from the base of the PR and between f4eaf16 and f6355bd.

📒 Files selected for processing (2)
  • .github/workflows/deploy.yml
  • wrangler.jsonc
📝 Walkthrough

Walkthrough

The pull request removes HTTP endpoints for dynamic code execution (/run, /sql) from the API, enables Workers.dev deployment URL, removes the loader binding from Wrangler configuration, and adds a post-deploy health check smoke test to verify the Workers.dev endpoint availability.

Changes

Cohort / File(s) Summary
API Endpoints Removal
src/api/routes/dynamic.js
Removes /run and /sql HTTP endpoints entirely, including support for dynamic code execution, SQL queries via Neon database, binding allowlisting, timeout enforcement, and related helper functions like resolveDatabase.
Deployment Configuration
wrangler.jsonc
Enables workers_dev: true for production environment to expose Workers.dev URL; removes worker_loaders binding (LOADER) from production configuration.
Deployment Verification
.github/workflows/deploy.yml
Adds post-deploy smoke test in verify job that queries Cloudflare for the account's Workers.dev subdomain and polls the /health endpoint with retry logic, rolling back on repeated failures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

access-reviewed

Poem

🐰 The endpoints fade like morning dew,
Workers.dev takes the stage anew,
With health checks dancing in the sky,
Our rabbit hops as deployments fly!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: removing the problematic worker_loaders binding and restoring the /health endpoint.
Description check ✅ Passed The description is comprehensive and well-structured, addressing symptom, root cause, changes, and acceptance criteria. However, the provided template checklist sections (Security & Access, Docs, Validation) are not filled out.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/fix-1101-restore-health

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 52 minutes and 57 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

…tore /health

After the initial removal of worker_loaders deployed (commit f4eaf16,
auto-deployed via Cloudflare Workers Builds), production at
connect.chitty.cc remained 1101 on every path. Worker Loader was not
the cause.

Reverting all wrangler.jsonc changes from PR #168 to the last known
good state (5fd6a33). Specifically removing:

- compatibility_flags: ["..., "global_fetch_strictly_public"]
  Newly added top-level flag. Unclear whether this is supported on
  this account / compatibility_date combination at the edge.
- env.staging.secrets_store_secrets and env.production.secrets_store_secrets
  PR #168 added these per-env duplicates of the top-level block,
  claiming "wrangler does not inherit secrets_store_secrets into env
  blocks". The repo's own wrangler.jsonc header comment contradicts
  that — top-level secrets_store_secrets IS marked as
  "inherited by all environments". The per-env duplicate appears to
  be the actual breaking change: if the runtime tries to materialize
  17 bindings twice (or against the wrong scope), one mis-resolution
  is enough to abort isolate startup → 1101 on every request even
  though deploy succeeds.

The smoke-check in deploy.yml is preserved and reworked to not
require workers_dev (which the rollback also removed). It now hits
connect.chitty.cc/health directly and:
  - 200 with "status" field → pass
  - 5xx (incl. 1101) → rollback + fail the deploy
  - 403 (zone WAF) / network error → inconclusive, warn but don't
    block (operator verifies manually)

This is a procedural compromise — the WAF blocks GitHub Actions
runner IPs, so we can't always reach the URL from CI. But the 5xx
case still catches the exact failure mode that just happened.

Out of scope:
- Re-introducing the per-env secrets_store_secrets if it turns out to
  actually be required: must be tested in staging FIRST and validated
  via wrangler tail before reaching production.

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/deploy.yml (1)

156-161: 💤 Low value

Subdomain resolution failure silently skips the smoke check.

If the Cloudflare API call to resolve the workers.dev subdomain fails, the step exits 0 (success) and the smoke check is bypassed. This means a transient API failure during deployment could mask a broken worker.

This may be intentional to avoid failing deployments due to API flakiness, but consider whether exit 1 (with rollback) would be safer for a production deployment pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy.yml around lines 156 - 161, The current shell block
that assigns SUBDOMAIN (the curl + python3 pipeline) silently treats failures as
success by using "|| { echo ...; exit 0; }", which skips the smoke check; change
that behavior so failures are fatal or configurable: update the failure handler
for the SUBDOMAIN assignment (the block that contains the curl to
"https://api.cloudflare.com/.../workers/subdomain") to either exit 1 instead of
exit 0 or read a CI variable (e.g., SKIP_SMOKE_CHECK_ON_SUBDOMAIN_FAILURE) and
only skip when that variable is set; ensure the error message remains and that
any code later expecting $SUBDOMAIN handles the case accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In @.github/workflows/deploy.yml:
- Around line 156-161: The current shell block that assigns SUBDOMAIN (the curl
+ python3 pipeline) silently treats failures as success by using "|| { echo ...;
exit 0; }", which skips the smoke check; change that behavior so failures are
fatal or configurable: update the failure handler for the SUBDOMAIN assignment
(the block that contains the curl to
"https://api.cloudflare.com/.../workers/subdomain") to either exit 1 instead of
exit 0 or read a CI variable (e.g., SKIP_SMOKE_CHECK_ON_SUBDOMAIN_FAILURE) and
only skip when that variable is set; ensure the error message remains and that
any code later expecting $SUBDOMAIN handles the case accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f58c4b49-99e9-47f4-aa63-2f29ebdf2095

📥 Commits

Reviewing files that changed from the base of the PR and between d08db33 and f4eaf16.

📒 Files selected for processing (3)
  • .github/workflows/deploy.yml
  • src/api/routes/dynamic.js
  • wrangler.jsonc
💤 Files with no reviewable changes (1)
  • src/api/routes/dynamic.js

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f4eaf16519

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread wrangler.jsonc Outdated
// workers.dev URL is enabled so post-deploy CI can probe /health without
// hitting the zone WAF (which 403's GitHub Actions runner IPs). The
// public surface remains the connect.chitty.cc / mcp.chitty.cc routes.
"workers_dev": true,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Disable workers.dev exposure for production traffic

Setting workers_dev to true in the production environment makes the full service reachable at *.workers.dev, which bypasses the zone-level controls on connect.chitty.cc (including the WAF gate this workflow explicitly works around). That creates a new public path to the same endpoints and bindings, so traffic blocked or challenged on the custom domain can still hit production via the workers.dev hostname.

Useful? React with 👍 / 👎.

Comment thread .github/workflows/deploy.yml Outdated
Comment on lines +159 to +160
echo "::warning::Could not resolve workers.dev subdomain — skipping smoke check"
exit 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail closed when smoke-check subdomain lookup fails

This branch exits successfully when the subdomain API call fails, so the verify job can pass without executing any runtime health probe. In cases like token scope drift or transient Cloudflare API errors, a broken deployment would no longer trigger rollback even though this step was added specifically to catch runtime-startup failures.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Restores production availability by removing an unsupported Cloudflare Worker Loader binding that caused isolate instantiation failures (1101) and adds a post-deploy runtime smoke check to catch similar failures earlier.

Changes:

  • Remove worker_loaders / LOADER binding from env.production in wrangler.jsonc.
  • Enable workers_dev in production to allow CI to probe /health via *.workers.dev (bypassing zone WAF).
  • Add a post-deploy /health smoke check (with rollback on failure) and delete the unused/orphaned src/api/routes/dynamic.js.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
wrangler.jsonc Removes the unsupported worker_loaders binding; enables workers_dev for CI probing.
src/api/routes/dynamic.js Deletes an unused route file that depended on env.LOADER.
.github/workflows/deploy.yml Adds a workers.dev-based /health smoke check and attempts rollback on failure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread wrangler.jsonc
Comment on lines 272 to 275
"ONEPASSWORD_VAULT_INFRASTRUCTURE": "oxwo63jlcbo66c7kwx67lquw4i",
"ONEPASSWORD_VAULT_SERVICES": "xgevq7nkt6t4bfokjaczht5gdy",
"ONEPASSWORD_VAULT_INTEGRATIONS": "3ngilqu4qsp4mlr53xhvyi4sry",
"ONEPASSWORD_VAULT_EMERGENCY": "shl646vf4snnrkx6linyk3yis4",
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

workers_dev: true in production creates a publicly reachable *.workers.dev hostname that bypasses zone-level WAF/Access rules for all routes (not just /health). Given there are unauthenticated endpoints like /intelligence/dashboard, consider adding a hostname/path guard in the worker (e.g., if Host ends with .workers.dev only allow /health) or confirm this exposure is acceptable for production.

Suggested change
// Disable the default workers.dev hostname in production so all traffic
// remains subject to the zone-level WAF/Access controls on the intended
// connect.chitty.cc / mcp.chitty.cc routes.
"workers_dev": false,

Copilot uses AI. Check for mistakes.
Comment on lines 139 to 142
[ "$VERSION" != "none" ] && echo "Deployment verified" || {
echo "::error::No active deployment found — rolling back..."
npx wrangler@latest rollback -y
exit 1
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollback command here doesn’t specify --env production, so it may roll back the default environment rather than the one just deployed. Align this with the deploy step (deploy --env production) by passing the same env to rollback to avoid reverting the wrong target.

Copilot uses AI. Check for mistakes.
Comment on lines +159 to +160
env:
CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the workers.dev subdomain lookup fails (API error, missing permissions, subdomain disabled), this step currently exits 0 and skips the smoke check. That weakens the recurrence guard (you can deploy a broken worker and still pass CI). Suggest failing the job when the lookup fails, or at least only skipping when the account truly has no workers.dev subdomain while workers_dev is disabled.

Suggested change
env:
CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
echo "::error::Failed to resolve workers.dev subdomain; cannot run required smoke check"
exit 1

Copilot uses AI. Check for mistakes.
Comment thread .github/workflows/deploy.yml Outdated
Comment on lines +178 to +180
echo "::error::/health returned $HTTP_CODE on attempt $attempt — Worker is broken"
echo "Rolling back deployment..."
npx wrangler@latest rollback -y --name chittyconnect || echo "::warning::rollback command failed; rollback manually with 'wrangler rollback --name chittyconnect'"
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rollback command also omits --env production. If the smoke check fails, ensure the rollback targets the same environment that was deployed (and that CI credentials are set for that env), otherwise the rollback can be a no-op or revert the wrong env.

Copilot uses AI. Check for mistakes.
Both rollback invocations in deploy.yml were missing --env production,
which would either no-op (no default env) or revert the wrong target.
Aligning with the deploy step which uses --env production.

Two callsites:
- The Verify Deployment step's "no active deployment" branch
- The Smoke check /health step's 5xx-detected branch

https://claude.ai/code/session_01Ne6sKA6sg8SzMk2hZp7WZs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants