fix: throttle health-check self-heal by EtanHey · Pull Request #521 · EtanHey/brainlayer

EtanHey · 2026-06-19T16:06:23Z

Summary

Gate health-check --heal kickstarts behind consecutive failures for the same issue, persisted in the existing health-check state file.
Keep health alarms and exit-code behavior unchanged while logging every actual heal action to stderr with label, issue, and consecutive-failure count.
Add coverage for first-failure no-op, repeated-failure kickstart, heal logging, canary throttling, and env override.

Test plan

pytest tests/test_stability_health_check.py::test_backlog_batch_zero_alarms_but_waits_until_repeated_failure_to_kickstart_hotlane -q (failed before implementation, then passed)
pytest tests/test_stability_health_check.py tests/test_launchd_hygiene.py -q
ruff check src/brainlayer/health_check.py tests/test_stability_health_check.py
ruff format --check src/brainlayer/health_check.py tests/test_stability_health_check.py
coderabbit review --agent (local, findings: 0)
BRAINLAYER_PREPUSH=1 bash scripts/run_tests.sh (passed after rerunning one transient CLI p95 performance failure; final run passed: 3011 passed, 9 skipped, 61 deselected, 1 xfailed; MCP registration, isolated eval/hook routing, bun, and regression shell passed)

Note

Medium Risk
Changes when automated launchd restarts fire for live BrainBar/hotlane services; mis-tuned thresholds could delay recovery, though defaults only add one check cycle (~5 minutes).
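
The recovery delay mentioned above can be worked out with a one-line formula. This is a rough sketch that assumes checks run at a fixed interval and that a heal fires on the cycle where the counter first reaches the threshold; the function name and parameters are illustrative and not taken from the repository.

```python
def added_heal_delay_minutes(min_consecutive_failures: int, check_interval_minutes: float) -> float:
    """Extra wait before a heal, compared with healing on the first failure.

    Assumes a fixed check interval; both parameter names are hypothetical.
    """
    if min_consecutive_failures < 1:
        raise ValueError("threshold must be at least 1")
    # The first failure is seen at cycle 1; the heal fires at cycle N,
    # so the added wait is N - 1 intervals.
    return (min_consecutive_failures - 1) * check_interval_minutes


# Default threshold of 2 with a roughly 5-minute cycle adds one cycle.
print(added_heal_delay_minutes(2, 5))
```

With a threshold of 1 the function returns 0, which matches the "immediate heal" case covered by the tests.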

Overview
Health-check --heal no longer kickstarts launchd on the first sighting of an issue. Each issue code gets a consecutive-failure counter in the existing health-check state file (heal_failures); kickstart runs only when that count reaches a threshold (default 2, overridable via BRAINLAYER_HEAL_MIN_CONSECUTIVE_FAILURES or HealthCheckConfig.heal_min_consecutive_failures).

Alarms, exit behavior, and the mapping from labels to issues are unchanged. When a heal actually runs, a heal action line is logged to stderr with the label, issue code, and consecutive-failure count.
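
The gating described above might look roughly like the following. This is a minimal sketch, not the code in health_check.py: the function names, the issue-to-label mapping, the log line format, and the choice of `>=` (so a heal keeps firing while the issue persists) are all assumptions made for illustration.

```python
import sys
from typing import Dict, Iterable, List, Tuple


def update_heal_failures(previous: Dict[str, int], current_issues: Iterable[str]) -> Dict[str, int]:
    """Return new per-issue consecutive-failure counts (hypothetical helper).

    Issues absent from this cycle are dropped, so their count implicitly
    restarts from zero the next time they appear.
    """
    return {issue: previous.get(issue, 0) + 1 for issue in current_issues}


def select_heals(
    failures: Dict[str, int],
    issue_to_label: Dict[str, str],
    min_consecutive: int = 2,
) -> List[Tuple[str, str, int]]:
    """Pick (label, issue, count) triples whose count reached the threshold."""
    heals = []
    for issue, count in failures.items():
        label = issue_to_label.get(issue)
        if label is not None and count >= min_consecutive:
            # Log format is illustrative; the PR only says label, issue,
            # and consecutive-failure count go to stderr.
            print(
                f"heal action: label={label} issue={issue} consecutive_failures={count}",
                file=sys.stderr,
            )
            heals.append((label, issue, count))
    return heals


# Hypothetical issue code and launchd label, two check cycles in a row.
labels = {"example_issue": "com.example.service"}
first = update_heal_failures({}, ["example_issue"])
print(select_heals(first, labels))   # first sighting: nothing to heal
second = update_heal_failures(first, ["example_issue"])
print(select_heals(second, labels))  # second sighting: heal selected
```

Rebuilding the map from only the current cycle's issues is one way to get the reset without an explicit clear step.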

Tests were updated for two-run throttling (hotlane backlog zero, brainbar canary failure), heal logging, env override, and immediate heal when threshold is set to 1.

^{Reviewed by Cursor Bugbot for commit e76d16a. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Throttle health-check self-heal by requiring consecutive failures before kickstart

Healing kickstarts are now deferred until a configurable number of consecutive failures (heal_min_consecutive_failures, default 2) is reached per issue code, preventing spurious restarts on transient issues.
Consecutive failure counts are persisted in state under heal_failures and incremented each check cycle; counts reset implicitly when the issue clears.
A new BRAINLAYER_HEAL_MIN_CONSECUTIVE_FAILURES env var overrides the default threshold via _env_int in health_check.py.
When a heal fires, a diagnostic line is printed to stderr with the label, issue code, and consecutive failure count.
Behavioral Change: heal kickstarts that previously fired on the first detected failure now require at least 2 consecutive failures by default.
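
For illustration, an environment override of the threshold could be read as below. The PR names `_env_int` but does not show its signature, so this sketch uses its own hypothetical function; falling back to the default on unparsable input and clamping to a minimum of 1 are assumptions, not confirmed behavior.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "BRAINLAYER_HEAL_MIN_CONSECUTIVE_FAILURES"


def read_min_consecutive_failures(
    default: int = 2, environ: Optional[Mapping[str, str]] = None
) -> int:
    """Read the heal threshold from the environment (hypothetical helper)."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None:
        return default
    try:
        value = int(raw.strip())
    except ValueError:
        # Assumption: unparsable values fall back to the default.
        return default
    # Assumption: a threshold below 1 makes no sense, so clamp it.
    return max(1, value)


# Passing a mapping keeps the example independent of the real environment.
print(read_min_consecutive_failures(environ={ENV_VAR: "1"}))
print(read_min_consecutive_failures(environ={}))
```

Setting the variable to 1 would restore the previous heal-on-first-failure behavior.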

^{Macroscope summarized 83e094b.}

coderabbitai · 2026-06-19T16:06:31Z

Warning

Review limit reached

@EtanHey, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 12 minutes and 48 seconds. Learn how PR review limits work.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 99ded951-c4c0-4a2a-aeae-ad58a5df7e01

📥 Commits

Reviewing files that changed from the base of the PR and between ecedfa0 and 83e094b.

📒 Files selected for processing (2)

src/brainlayer/health_check.py
tests/test_stability_health_check.py

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

EtanHey · 2026-06-19T16:06:33Z

@codex review

EtanHey · 2026-06-19T16:06:33Z

@coderabbitai review

EtanHey · 2026-06-19T16:06:33Z

@cursor @BugBot review

cursor · 2026-06-19T16:06:36Z

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

coderabbitai · 2026-06-19T16:06:39Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit e76d16a. Configure here.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e76d16af84

chatgpt-codex-connector · 2026-06-19T16:08:48Z

+        config=config,
+        command_runner=command_runner,
+    )
+    if result.missing_vectors is not None or heal_failures:


Preserve missing-vector state on count failures

When SQLite counting fails (for example during a transient DB lock), result.missing_vectors is None but _apply_heals still returns a nonempty map for the missing_embeddings_count_failed issue. This new or heal_failures branch then rewrites the state file without missing_vectors/stalled_ticks, so the next successful tick has no previous baseline and cannot detect climbing or stalled embeddings. Please skip this write or merge the existing progress fields when the count is unavailable.

Fixed in commit 83e094b: the state write now merges from the loaded state first, then refreshes heal_failures/ts and only overwrites missing_vectors/stalled_ticks when a fresh count is available. Added regression coverage for count-failure plus heal counter persistence.
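
The merge-then-refresh approach described in the fix can be sketched as follows. The state keys (`heal_failures`, `ts`, `missing_vectors`, `stalled_ticks`) come from this thread; the function name, its parameters, and the use of a Unix timestamp for `ts` are assumptions for the sake of a runnable example.

```python
import time
from typing import Any, Dict, Optional


def build_next_state(
    loaded: Dict[str, Any],
    heal_failures: Dict[str, int],
    missing_vectors: Optional[int],
    stalled_ticks: Optional[int] = None,
    now: Optional[float] = None,
) -> Dict[str, Any]:
    """Merge the previous state with this cycle's results (hypothetical helper)."""
    # Start from the loaded state so existing progress fields survive.
    state = dict(loaded)
    # Always refresh the heal counters and the timestamp.
    state["heal_failures"] = dict(heal_failures)
    state["ts"] = time.time() if now is None else now
    # Only overwrite the embedding progress fields when a fresh count exists.
    if missing_vectors is not None:
        state["missing_vectors"] = missing_vectors
        if stalled_ticks is not None:
            state["stalled_ticks"] = stalled_ticks
    return state


# A failed count (None) keeps the previous baseline intact.
previous = {"missing_vectors": 10, "stalled_ticks": 2, "ts": 1.0}
print(build_next_state(previous, {"missing_embeddings_count_failed": 1}, None, now=5.0))
```

Because the baseline is kept, the next successful tick can still compare its count against the earlier `missing_vectors` value.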

EtanHey · 2026-06-19T16:16:16Z

@cursor @BugBot re-review

cursor · 2026-06-19T16:16:19Z

You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

fix: throttle health-check self-heal

e76d16a

greptile-apps Bot reviewed Jun 19, 2026

cursor Bot reviewed Jun 19, 2026

Comment thread src/brainlayer/health_check.py

chatgpt-codex-connector Bot reviewed Jun 19, 2026

fix: preserve health-check state history

83e094b

macroscopeapp Bot reviewed Jun 19, 2026

Comment thread src/brainlayer/health_check.py

greptile-apps Bot reviewed Jun 19, 2026

EtanHey merged commit 1374c7d into main Jun 19, 2026
7 checks passed
