
fix: throttle health-check self-heal #521

Merged: EtanHey merged 2 commits into main from fix/health-check-heal-throttle on Jun 19, 2026
Conversation

@EtanHey

@EtanHey EtanHey commented Jun 19, 2026

Owner

Summary

  • Gate health-check --heal kickstarts behind consecutive failures for the same issue, persisted in the existing health-check state file.
  • Keep health alarms and exit-code behavior unchanged while logging every actual heal action to stderr with label, issue, and consecutive-failure count.
  • Add coverage for first-failure no-op, repeated-failure kickstart, heal logging, canary throttling, and env override.

Test plan

  • pytest tests/test_stability_health_check.py::test_backlog_batch_zero_alarms_but_waits_until_repeated_failure_to_kickstart_hotlane -q (failed before implementation, then passed)
  • pytest tests/test_stability_health_check.py tests/test_launchd_hygiene.py -q
  • ruff check src/brainlayer/health_check.py tests/test_stability_health_check.py
  • ruff format --check src/brainlayer/health_check.py tests/test_stability_health_check.py
  • coderabbit review --agent (local, findings: 0)
  • BRAINLAYER_PREPUSH=1 bash scripts/run_tests.sh (passed after rerunning one transient CLI p95 performance failure; final run passed: 3011 passed, 9 skipped, 61 deselected, 1 xfailed; MCP registration, isolated eval/hook routing, bun, and regression shell passed)

Note

Medium Risk
Changes when automated launchd restarts fire for live BrainBar/hotlane services; mis-tuned thresholds could delay recovery, though defaults only add one check cycle (~5 minutes).

Overview
Health-check --heal no longer kickstarts launchd on the first sighting of an issue. Each issue code gets a consecutive-failure counter in the existing health-check state file (heal_failures); kickstart runs only when that count reaches a threshold (default 2, overridable via BRAINLAYER_HEAL_MIN_CONSECUTIVE_FAILURES or HealthCheckConfig.heal_min_consecutive_failures).
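The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code from health_check.py: the helper name update_heal_failures, its return shape, the placeholder issue code, and the choice of ">=" for "reaches the threshold" are all assumptions. Only the heal_failures state key and the default threshold of 2 come from the description.

```python
from typing import Dict, Iterable, List, Tuple


def update_heal_failures(
    previous: Dict[str, int],
    active_issues: Iterable[str],
    min_consecutive_failures: int = 2,
) -> Tuple[Dict[str, int], List[str]]:
    """Return (new counters, issue codes that are due for a heal).

    Hypothetical helper. Counters are keyed by issue code; an issue that
    is not active this cycle is simply left out of the new map, which is
    one way to get the "implicit reset" behaviour.
    """
    counters: Dict[str, int] = {}
    due: List[str] = []
    # dict.fromkeys de-duplicates while keeping first-seen order.
    for issue in dict.fromkeys(active_issues):
        count = previous.get(issue, 0) + 1
        counters[issue] = count
        if count >= min_consecutive_failures:
            due.append(issue)
    return counters, due


# "example_issue" is a placeholder issue code, not one from the project.
state = {"heal_failures": {}}

# First sighting: the failure is counted, but no heal is due yet.
state["heal_failures"], due = update_heal_failures(state["heal_failures"], ["example_issue"])
print(due)  # []

# Second consecutive sighting: the heal is now due.
state["heal_failures"], due = update_heal_failures(state["heal_failures"], ["example_issue"])
print(due)  # ['example_issue']
```

Passing min_consecutive_failures=1 in this sketch reproduces the old heal-on-first-failure behaviour, which matches the "immediate heal when threshold is set to 1" test case mentioned in this PR.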

Alarms, exit behavior, and the mapping from labels to issues are unchanged. When a heal actually runs, a line is logged to stderr with the heal action's label, issue code, and consecutive-failure count.

Tests were updated for two-run throttling (hotlane backlog zero, brainbar canary failure), heal logging, env override, and immediate heal when threshold is set to 1.

Reviewed by Cursor Bugbot for commit e76d16a. Bugbot is set up for automated code reviews on this repo.

Note

Throttle health-check self-heal by requiring consecutive failures before kickstart

  • Healing kickstarts are now deferred until a configurable number of consecutive failures (heal_min_consecutive_failures, default 2) is reached per issue code, preventing spurious restarts on transient issues.
  • Consecutive failure counts are persisted in state under heal_failures and incremented on each check cycle in which the issue is seen; a count resets implicitly when its issue clears.
  • A new BRAINLAYER_HEAL_MIN_CONSECUTIVE_FAILURES env var overrides the default threshold via _env_int in health_check.py.
  • When a heal fires, a diagnostic line is printed to stderr with the label, issue code, and consecutive failure count.
  • Behavioral Change: heal kickstarts that previously fired on the first detected failure now require at least 2 consecutive failures by default.
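As an illustration of the env override in the list above, here is a minimal sketch of how an integer threshold could be read from the environment. The real _env_int helper is not shown in this thread, so the functions below (env_int, heal_threshold) are hypothetical stand-ins. The fallback on invalid values and the clamp to a minimum of 1 are assumptions, not confirmed behaviour.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "BRAINLAYER_HEAL_MIN_CONSECUTIVE_FAILURES"
DEFAULT_THRESHOLD = 2  # default stated in the PR description


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer from the environment, falling back to `default`.

    Hypothetical stand-in for `_env_int`; the real helper may handle
    missing or malformed values differently.
    """
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw.strip())
    except ValueError:
        return default


def heal_threshold(env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the heal threshold; clamping to >= 1 is an assumption."""
    return max(1, env_int(ENV_NAME, DEFAULT_THRESHOLD, env))


print(heal_threshold({}))               # 2
print(heal_threshold({ENV_NAME: "1"}))  # 1
```

Accepting an explicit mapping instead of always reading os.environ keeps the sketch easy to test without mutating the process environment.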

Macroscope summarized 83e094b.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026


Warning

Review limit reached

@EtanHey, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 12 minutes and 48 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 99ded951-c4c0-4a2a-aeae-ad58a5df7e01

📥 Commits

Reviewing files that changed from the base of the PR and between ecedfa0 and 83e094b.

📒 Files selected for processing (2)
  • src/brainlayer/health_check.py
  • tests/test_stability_health_check.py
@greptile-apps greptile-apps Bot left a comment

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@EtanHey

EtanHey commented Jun 19, 2026

Owner Author

@codex review

@EtanHey

EtanHey commented Jun 19, 2026

Owner Author

@coderabbitai review

@EtanHey

EtanHey commented Jun 19, 2026

Owner Author

@cursor @BugBot review

@cursor

cursor Bot commented Jun 19, 2026


You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e76d16a. Configure here.

Comment thread src/brainlayer/health_check.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e76d16af84

config=config,
command_runner=command_runner,
)
if result.missing_vectors is not None or heal_failures:

P2: Preserve missing-vector state on count failures

When SQLite counting fails (for example during a transient DB lock), result.missing_vectors is None but _apply_heals still returns a nonempty map for the missing_embeddings_count_failed issue. The new "or heal_failures" condition then rewrites the state file without missing_vectors/stalled_ticks, so the next successful tick has no previous baseline and cannot detect climbing or stalled embeddings. Please skip this write, or merge the existing progress fields, when the count is unavailable.

Owner Author

Fixed in commit 83e094b: the state write now merges from the loaded state first, then refreshes heal_failures/ts and only overwrites missing_vectors/stalled_ticks when a fresh count is available. Added regression coverage for count-failure plus heal counter persistence.
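The merge-then-refresh approach described in this reply could look roughly like the sketch below. It is an illustration under assumptions, not the code from commit 83e094b: the function name merge_state and its parameters are hypothetical, while the keys heal_failures, ts, missing_vectors, and stalled_ticks are the ones named in the thread.

```python
import time
from typing import Any, Dict, Optional


def merge_state(
    loaded: Dict[str, Any],
    heal_failures: Dict[str, int],
    missing_vectors: Optional[int],
    stalled_ticks: Optional[int] = None,
    now: Optional[float] = None,
) -> Dict[str, Any]:
    """Build the state to persist without losing the previous baseline.

    Hypothetical helper: start from the loaded state, always refresh
    heal_failures and ts, and overwrite the progress fields only when a
    fresh count is available (missing_vectors is not None).
    """
    state = dict(loaded)  # copy so the loaded state is not mutated
    state["heal_failures"] = dict(heal_failures)
    state["ts"] = time.time() if now is None else now
    if missing_vectors is not None:
        state["missing_vectors"] = missing_vectors
        if stalled_ticks is not None:
            state["stalled_ticks"] = stalled_ticks
    return state


previous = {"missing_vectors": 10, "stalled_ticks": 3, "ts": 1.0}

# Count failed this tick: the baseline survives, the heal counter is stored.
after_failure = merge_state(
    previous, {"missing_embeddings_count_failed": 1}, missing_vectors=None, now=2.0
)
print(after_failure["missing_vectors"], after_failure["stalled_ticks"])  # 10 3
```

Because the previous missing_vectors and stalled_ticks values are carried through the failed tick, the next successful count still has a baseline to compare against.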

Comment thread src/brainlayer/health_check.py
@EtanHey

EtanHey commented Jun 19, 2026

Owner Author

@cursor @BugBot re-review

@greptile-apps greptile-apps Bot left a comment

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@cursor

cursor Bot commented Jun 19, 2026


You need to increase your spend limit or enable usage-based billing to run background agents. Go to Cursor

@EtanHey EtanHey merged commit 1374c7d into main Jun 19, 2026
7 checks passed