fix(swarm): production deploy bugs (volumes, sessions, providers, theme, missing assets) by MarcelocardosoLeal · Pull Request #16 · EvolutionAPI/evo-nexus

MarcelocardosoLeal · 2026-04-18T21:55:00Z

Summary

Bugs found and fixed during a Docker Swarm production deployment of EvoNexus at evonexus.advancedbot.com.br.

1. ANTHROPIC_API_KEY missing from ALLOWED_VARS in claude-bridge.js

The API key was silently filtered out by the env var allowlist, causing Claude Code to fall back to the OAuth login flow on every terminal session start — even when the key was properly configured in the Providers page.

Fix: Added ANTHROPIC_API_KEY to the ALLOWED_VARS set in _loadProviderConfig().
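
A minimal sketch of how an allowlist like this can drop a key without any error. Only ALLOWED_VARS, ANTHROPIC_API_KEY and _loadProviderConfig() come from the PR. The filterEnv helper, the other variable names and the shape of the provider config are assumptions made for illustration, not the actual claude-bridge.js code.

```javascript
// Hypothetical allowlist; the real set lives in claude-bridge.js.
const ALLOWED_VARS = new Set([
  "ANTHROPIC_API_KEY", // the entry this PR adds
  "ANTHROPIC_BASE_URL", // assumed neighbour, for illustration only
]);

// Hypothetical helper: keep only allowlisted, non-empty string values.
// Anything not in the set is dropped silently, which is how the bug hid.
function filterEnv(providerEnv, allowed = ALLOWED_VARS) {
  const result = {};
  for (const [name, value] of Object.entries(providerEnv)) {
    if (allowed.has(name) && typeof value === "string" && value !== "") {
      result[name] = value;
    }
  }
  return result;
}

const configured = { ANTHROPIC_API_KEY: "sk-example", UNRELATED_SECRET: "x" };

// Before the fix (key absent from the allowlist): nothing reaches the process.
const before = filterEnv(configured, new Set(["ANTHROPIC_BASE_URL"]));

// After the fix: the key passes through, unrelated variables still do not.
const after = filterEnv(configured);
```

Because the filter never raises an error on a dropped name, the only visible symptom is the downstream fallback to OAuth.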

2. "Session already exists" crash on orphaned PTY sessions

When a Claude Code process exited unexpectedly (crash, network issue), the session stayed in the bridge's in-memory Map as an inactive entry. The next time the user opened a chat, startSession() hit the duplicate-check and threw — requiring a full container restart to recover.

Fix: Detect orphaned dead sessions on startSession(), kill the stale PTY process, and clean up the entry before creating a fresh session.
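
The cleanup described above could look roughly like the sketch below. The SessionBridge class, the injected spawnPty factory and the session shape are hypothetical. Only the startSession name, the in-memory Map and the "already exists" error come from the PR. Active duplicates still throw here, as they did at this stage of the PR.

```javascript
// Hypothetical in-memory bridge; not the actual claude-bridge.js code.
class SessionBridge {
  constructor(spawnPty) {
    this.sessions = new Map(); // sessionId -> { pty, active }
    this.spawnPty = spawnPty; // injected so the sketch runs without a real PTY
  }

  startSession(sessionId) {
    const existing = this.sessions.get(sessionId);
    if (existing && !existing.active) {
      // Orphan: the process is gone but its entry was never removed.
      try {
        existing.pty.kill();
      } catch (err) {
        // The process may already be dead; continue with the cleanup.
      }
      this.sessions.delete(sessionId);
    } else if (existing) {
      throw new Error(`Session ${sessionId} already exists`);
    }
    const session = { pty: this.spawnPty(), active: true };
    this.sessions.set(sessionId, session);
    return session;
  }
}

// Usage: simulate a crash by marking the first session inactive.
const killed = [];
const bridge = new SessionBridge(() => ({ kill: () => killed.push("killed") }));
const first = bridge.startSession("agent-1");
first.active = false;
const second = bridge.startSession("agent-1"); // replaces the orphan
```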

3. Local SQLite DB baked into Docker image

dashboard/data/ and workspace/ were not in .dockerignore, so a pre-seeded DB (with local users and config) was baked into the image. Every fresh deployment of the published image got the developer's local accounts instead of a clean setup wizard.

Fix: Added dashboard/data/ and workspace/ to .dockerignore.

4. Claude Code OAuth tokens not persisted in Swarm deployment

The Portainer stack file was missing a volume for /root/.claude (where Claude Code stores OAuth tokens). Without this, users had to re-authenticate on every redeploy.

Fix: Added evonexus_claude_auth:/root/.claude to all three services (dashboard, telegram, scheduler) in both evonexus.portainer.stack.yml and the official evonexus.stack.yml template.

5. Theme picker shown on every agent terminal

Each agent runs in its own working directory, which Claude Code treats as a separate project. Without a global theme set, the user is asked to "Choose the text style that looks best with your terminal" on every single agent — annoying after the second time.

Fix: Pre-seed /root/.claude/settings.json with theme + onboarding flags during container startup (start-dashboard.sh). Only writes the file if it doesn't already exist (preserves user-chosen overrides).

6. "Session already exists" toast on WebSocket reconnect

The earlier fix for #2 only handled inactive orphans. The actual production trigger is different: when a WebSocket reconnects through Traefik, the frontend can re-send start_claude before learning the session is still alive. The bridge then threw on a duplicate active session, and the user saw "Failed to start Claude Code: Session ... already exists" in the corner of a perfectly working terminal.

Fix:

- claude-bridge.js: startSession is now idempotent. If the session is already active, it returns the existing entry instead of throwing.
- server.js: startClaude responds with type:'claude_started' (instead of type:'error') when the session is already active, so the frontend updates the UI to "running" and replays the buffer.
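
A rough sketch of the idempotent behaviour, under stated assumptions. The function signatures, the reused flag and the buffer field are hypothetical. Only the startSession and startClaude names and the type:'claude_started' message come from the PR. The orphan cleanup from item 2 is left out to keep the example short, so inactive entries are simply replaced.

```javascript
// Hypothetical bridge function: an active session is returned, not rejected.
function startSession(sessions, sessionId, spawnPty) {
  const existing = sessions.get(sessionId);
  if (existing && existing.active) {
    return { session: existing, reused: true };
  }
  const session = { pty: spawnPty(), active: true, buffer: [] };
  sessions.set(sessionId, session);
  return { session, reused: false };
}

// Hypothetical server handler: a duplicate start yields the same message
// type as a fresh start, so the frontend never sees an error.
function handleStartClaude(sessions, sessionId, spawnPty) {
  const { session, reused } = startSession(sessions, sessionId, spawnPty);
  return {
    type: "claude_started",
    sessionId,
    reused,
    buffer: session.buffer.slice(), // copy of the output to replay
  };
}

// Usage: a reconnect re-sends the start request for the same session.
const sessions = new Map();
let spawned = 0;
const spawnPty = () => ({ id: ++spawned });
const firstReply = handleStartClaude(sessions, "agent-1", spawnPty);
const secondReply = handleStartClaude(sessions, "agent-1", spawnPty);
```

Returning the same message type for both cases keeps the reconnect path and the first-start path identical from the frontend's point of view.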

7. /root/.claude.json not persisted across redeploys (theme picker comes back)

Claude Code stores its main config (theme, OAuth state, per-project bookkeeping) at /root/.claude.json, which is a sibling of the /root/.claude/ directory, not a file inside it. The Swarm volume mounts only the /root/.claude/ directory, so .claude.json lives in the container's writable layer and is wiped on every redeploy. The visible symptom: the theme picker appears again on every agent after a release, along with a console message saying "Claude configuration file not found at: /root/.claude.json — A backup file exists at: /root/.claude/backups/.claude.json.backup."

Fix: start-dashboard.sh now restores /root/.claude.json from the most recent backup in /root/.claude/backups/ (which IS persisted by the volume) when the main file is missing on container start. If no backup exists, it seeds a minimal config with theme + onboarding flags so first-run prompts are skipped.
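
The restore-or-seed decision can be summarised as a small pure function. The PR implements this in start-dashboard.sh (shell). The sketch below is JavaScript only for consistency with the other examples. The planClaudeConfig name, the { path, mtimeMs } shape, the backup file names, the seed field names and the use of modification time to pick the "most recent" backup are all assumptions.

```javascript
// Decide what to do about /root/.claude.json at container start.
// backups: array of { path, mtimeMs } found in /root/.claude/backups/
// (hypothetical shape; the PR does not describe how "most recent" is chosen).
function planClaudeConfig(mainExists, backups) {
  if (mainExists) {
    return { action: "keep" };
  }
  if (backups.length > 0) {
    // Pick the backup with the largest modification time.
    const latest = backups.reduce((a, b) => (b.mtimeMs > a.mtimeMs ? b : a));
    return { action: "restore", from: latest.path };
  }
  // Placeholder fields: the PR only says "theme + onboarding flags".
  return {
    action: "seed",
    config: { theme: "dark", hasCompletedOnboarding: true },
  };
}

// Usage with two hypothetical backup files.
const plan = planClaudeConfig(false, [
  { path: "/root/.claude/backups/older", mtimeMs: 1000 },
  { path: "/root/.claude/backups/newer", mtimeMs: 2000 },
]);
```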

8. .claude/ and docs/ never copied into the dashboard image — UI shows "No agents found"

Dockerfile.dashboard only copied dashboard/backend/, social-auth/, scheduler.py and the built frontend. The .claude/ tree (agents, skills, commands, templates, rules) and docs/ were never copied. Result on a fresh Swarm deploy: /api/agents, /api/skills, /api/commands and /api/templates all returned empty lists, and the UI showed "No agents found — Add agent files to .claude/agents/ to get started" with all 38 agents missing. Local development worked because uv runs the backend with cwd at the repo root, where these dirs exist.

Fix: Added COPY .claude/ .claude/ and COPY docs/ docs/ to Dockerfile.dashboard. .claude/agent-memory and .claude/.env remain excluded via .dockerignore so user data and secrets stay out of the image.

Additional fixes in evonexus.portainer.stack.yml:

- Added priority=1 to the main Traefik router (prevents the terminal router from intercepting the root path)
- Added passHostHeader=true to both load balancers

Test plan

- Fresh Swarm deploy shows the setup wizard (no pre-seeded accounts)
- Anthropic API key set in Providers page → terminal session starts without OAuth prompt
- Crashing and reopening a chat does not throw "Session already exists"
- After claude OAuth login, tokens survive a redeploy
- Opening multiple agents in sequence does not show the theme picker on each
- Reconnecting through Traefik does not produce a "Session already exists" toast in the running terminal
- After a redeploy, no "Claude configuration file not found" message appears and the theme picker does not reappear
- /agents page shows the 38 built-in agents (16 business + 22 engineering) on a fresh deploy

🤖 Generated with Claude Code

1. Add ANTHROPIC_API_KEY to ALLOWED_VARS in claude-bridge.js

The env var was silently filtered out, causing Claude Code to fall back to OAuth login on every session start instead of using the API key configured in the Providers page.

2. Fix orphaned session crash ("Session already exists")

When a Claude process died without firing the PTY onExit event, the session remained in the bridge's in-memory Map as inactive. The next start attempt threw "already exists". Now detects dead sessions, cleans them up, and restarts normally.

3. Exclude dashboard/data/ and workspace/ from Docker build context

Without these entries in .dockerignore, the local SQLite database (with hashed passwords) and workspace files were baked into the image. On first Swarm deploy, the volume was seeded from the image, making login impossible with any other credentials.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sourcery-ai

Sorry @MarcelocardosoLeal, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

- Add evonexus_claude_auth:/root/.claude to all three Swarm services (dashboard, telegram, scheduler) so Claude Code OAuth tokens persist across redeploys, avoiding re-authentication on every deploy
- docker-compose.yml: use Dockerfile.swarm.dashboard, expose terminal port 32352, add claude-auth volume, fix config mount (remove :ro so providers.json can be written by the UI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add evonexus_claude_auth:/root/.claude to all three services in evonexus.stack.yml so Claude Code OAuth tokens persist across redeploys. Same fix applied to evonexus.portainer.stack.yml in the previous commit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Bug 1: Theme picker on every agent

Each agent runs in its own working directory, which Claude Code treats as a separate project. Without a global theme set, the user is asked to choose a theme on every single agent terminal. Pre-seed /root/.claude/settings.json with theme + onboarding flags during container startup so the first-run prompts are skipped. Only writes the file if it doesn't exist (preserves user-chosen overrides).

Bug 2: "Session already exists" error toast

The previous fix only cleaned up *inactive* orphans. The actual production trigger is different: when a WebSocket reconnects through Traefik, the frontend can re-send start_claude before learning the session is still alive. The bridge's startSession then threw on a duplicate active session. Make startSession idempotent: if the session is already active, return the existing entry instead of throwing.

Bug 3: Misleading error on duplicate start

Server.startClaude() responded with type:'error' "An agent is already running" when the session was active. From the user's perspective this looked like a failure even though everything was working. Send type:'claude_started' instead so the frontend updates UI to "running" and replays the buffer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Claude Code stores its main config at /root/.claude.json — a SIBLING of the /root/.claude/ directory, not inside it. The Swarm volume mounts /root/.claude/ only, so .claude.json sits in the container's writable layer and is wiped on every redeploy. Result: theme picker and onboarding reappear on every release, even though the OAuth tokens (in /root/.claude/) survive. Claude Code itself writes timestamped backups into /root/.claude/backups/ (which IS in the volume), so we just need to restore the latest one on startup when the main file is missing. If no backup exists either, seed a minimal config so first-run prompts are skipped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The Dockerfile only copied dashboard/backend/, social-auth/, scheduler.py and the built frontend. .claude/ (agents, skills, commands, templates, rules) and docs/ were never copied, so on a fresh deploy the backend's WORKSPACE / ".claude" / "agents" path was empty. Result: /api/agents, /api/skills, /api/commands and /api/templates all returned empty lists, and the UI showed "No agents found — Add agent files to .claude/agents/ to get started" on every clean Swarm deploy. Local development worked because uv runs the backend with cwd at the repo root, where .claude/ and docs/ exist. .claude/agent-memory and .claude/.env stay excluded by .dockerignore so user data and secrets remain out of the image. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sourcery-ai bot reviewed Apr 18, 2026

MarcelocardosoLeal and others added 3 commits April 18, 2026 19:28

MarcelocardosoLeal changed the title ~~fix(swarm): three production deploy bugs~~ fix(swarm): production deploy bugs (volumes, sessions, providers, theme) Apr 18, 2026

MarcelocardosoLeal and others added 2 commits April 18, 2026 22:19

MarcelocardosoLeal changed the title ~~fix(swarm): production deploy bugs (volumes, sessions, providers, theme)~~ fix(swarm): production deploy bugs (volumes, sessions, providers, theme, missing assets) Apr 19, 2026

MarcelocardosoLeal wants to merge 6 commits into EvolutionAPI:main from MarcelocardosoLeal:fix/swarm-deploy-bugs
