If you're running the same Cloudflare Tunnel on two or more machines for redundancy, Cloudflare will automatically fail over to another connector when one goes down. But if the machine is up and only your service is broken, Cloudflare keeps routing traffic to it, serving errors to your users.
This watchdog polls a local health endpoint and stops the `cloudflared` service when your service is unhealthy. Cloudflare detects the lost connector and routes traffic to a healthy machine. When your service recovers, `cloudflared` is started again automatically. It works for any service that exposes a health endpoint or similar.
Cloudflare's Load Balancer with health checks solves this natively, but at $5/month per hostname + other fees, it adds up fast. This watchdog gives you pretty much the same core behaviour, for free, with a single bash script and a cron job.
- Polls the health endpoint every minute.
- On failure, increments a counter. On success, resets it.
- Once failures hit `FAIL_THRESHOLD`, stops the `cloudflared` service. Cloudflare detects the lost connector and fails over.
- Once the service is stopped, starts counting consecutive successes instead.
- Once successes hit `RECOVER_THRESHOLD`, starts the `cloudflared` service again.
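
The steps above amount to a small two-state counter. The following is a minimal bash sketch of that decision logic only, not the actual `cloudflared-watchdog.sh`. The function name `next_state`, the mode names `up`/`down`, and the choice to reset the success streak on a failed probe while stopped are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the watchdog's decision logic (not the real script).
# "up" means cloudflared is running; "down" means the watchdog stopped it.
# COUNT is the current streak: failures while up, successes while down.

FAIL_THRESHOLD="${FAIL_THRESHOLD:-3}"
RECOVER_THRESHOLD="${RECOVER_THRESHOLD:-2}"

# next_state MODE COUNT RESULT  ->  prints "NEWMODE NEWCOUNT ACTION"
# RESULT is "ok" or "fail"; ACTION is "stop", "start", or "none".
next_state() {
  local mode="$1" count="$2" result="$3" action="none"
  if [ "$mode" = "up" ]; then
    # Tunnel running: count consecutive failures, reset on success.
    if [ "$result" = "fail" ]; then count=$((count + 1)); else count=0; fi
    if [ "$count" -ge "$FAIL_THRESHOLD" ]; then
      mode="down"; count=0; action="stop"
    fi
  else
    # Tunnel stopped: count consecutive successes, reset on failure (assumed).
    if [ "$result" = "ok" ]; then count=$((count + 1)); else count=0; fi
    if [ "$count" -ge "$RECOVER_THRESHOLD" ]; then
      mode="up"; count=0; action="start"
    fi
  fi
  echo "$mode $count $action"
}
```

A wrapper run from cron would need to persist the mode and count between runs (for example in a small state file) and map the `stop` and `start` actions to `systemctl stop "$SERVICE"` and `systemctl start "$SERVICE"`.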
- Cron's minimum interval is 1 minute, so worst-case failover time is `FAIL_THRESHOLD` minutes (3 by default).
- If your health endpoint itself is slow or flaky, tune `TIMEOUT` and `FAIL_THRESHOLD` accordingly to avoid false positives.
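
To see how `TIMEOUT` bounds a single probe, here is a minimal sketch of a health check in bash. It assumes `curl` is the HTTP client, which the script may or may not use, and the function name `probe_health` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical probe sketch; the real script may check health differently.

HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"
TIMEOUT="${TIMEOUT:-5}"

# Returns 0 only if the endpoint answers with an HTTP status below 400
# within TIMEOUT seconds. Timeouts, refused connections, and 4xx/5xx
# responses all return non-zero and count as a failed probe.
probe_health() {
  curl --fail --silent --output /dev/null --max-time "$TIMEOUT" "${1:-$HEALTH_URL}"
}
```

With a probe like this, a slow endpoint that sometimes takes longer than `TIMEOUT` seconds registers as a failure, which is why a flaky endpoint calls for a larger `TIMEOUT`, a larger `FAIL_THRESHOLD`, or both.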

```bash
sudo install -m 755 cloudflared-watchdog.sh /usr/local/bin/cloudflared-watchdog
echo "* * * * * root /usr/local/bin/cloudflared-watchdog" | sudo tee /etc/cron.d/cloudflared-watchdog
sudo chmod 644 /etc/cron.d/cloudflared-watchdog
```

Runs every minute. Logs via `sudo journalctl -t cloudflared-watchdog -f`.

Override defaults via env vars in the cron line:

```
* * * * * root HEALTH_URL=http://localhost:8080/health/ FAIL_THRESHOLD=5 /usr/local/bin/cloudflared-watchdog
```

| Variable | Default | Description |
|---|---|---|
| `HEALTH_URL` | `http://localhost:8080/health` | Health endpoint to poll |
| `SERVICE` | `cloudflared` | systemd service to manage |
| `FAIL_THRESHOLD` | `3` | Consecutive failures before stopping the service |
| `RECOVER_THRESHOLD` | `2` | Consecutive successes before restarting the service |
| `TIMEOUT` | `5` | HTTP request timeout in seconds |

`SERVICE` is particularly useful if you're running multiple tunnels on the same machine under different service names.
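
As a rough illustration of the multi-tunnel case, the bash helper below prints one cron entry per tunnel. The helper name `cron_line`, the service names `cloudflared-app` and `cloudflared-admin`, and the port `9090` are made up for the example. It also assumes the script keeps its counters separately for each `SERVICE`, which is worth confirming before running two entries side by side.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a cron.d entry for one tunnel.
# $1 = systemd service name, $2 = health endpoint for the app behind it.
cron_line() {
  printf '* * * * * root SERVICE=%s HEALTH_URL=%s /usr/local/bin/cloudflared-watchdog\n' "$1" "$2"
}

# Example usage (service names and ports are placeholders):
#   { cron_line cloudflared-app   http://localhost:8080/health
#     cron_line cloudflared-admin http://localhost:9090/health
#   } | sudo tee /etc/cron.d/cloudflared-watchdog
```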