Skip to content

fix(mirrors): retry + wider timeouts so EU→CN-mirror flakes stop reporting DOWN#25

Merged
rainxchzed merged 2 commits into
mainfrom
fix/mirror-probe-resilience
Jun 6, 2026
Merged

fix(mirrors): retry + wider timeouts so EU→CN-mirror flakes stop reporting DOWN#25
rainxchzed merged 2 commits into
mainfrom
fix/mirror-probe-resilience

Conversation

@rainxchzed
Copy link
Copy Markdown
Member

@rainxchzed rainxchzed commented Jun 5, 2026

Fixes OpenHub-Store/GitHub-Store#698 — user (zh-Hans-CN, Android 12) reports ghfast.top showing as (不可用) in the app while the mirror itself is reachable from their device.

Root cause

MirrorStatusWorker probes from Hetzner FSN (Germany). CN-targeted community mirrors (ghfast.top, ghps.cc, gh.api.99988866.xyz, github.moeyy.xyz) are CN-fronted and the FSN→mirror leg is intermittently slow / flaky in a way that doesn't reflect the actual CN→mirror path the user experiences. Combined with single-attempt probe + 5s timeout, this produces false DOWN reports that the client then caches for up to 24h.

Verified current backend state shows ghfast_top = ok, 1190ms (probe from my workstation matches ~1.1-1.9s) — so the mirror is up; the bug is in the false-DOWN classification.

Three changes (all in MirrorStatusWorker.kt)

  • perPingTimeoutMs: 5s → 10s. Matches the client-side timeout for the same proxies.
  • pingOne: 1 retry with 250ms backoff before declaring DOWN. Catches cold-TLS-handshake flakes. Worst-case cycle latency only grows by perPingTimeoutMs for permanently-dead mirrors.
  • okLatencyCeilingMs 1.5s → 2.5s and degradedLatencyCeilingMs 5s → 8s. The FSN bias is ~1-2s; CN users see CN-routable mirrors as faster than FSN does.

Not in this PR (follow-ups for the client repo)

  • Client's 24h cache TTL should drop to ~1h so transient DOWN clears within a day instead of staying stuck.
  • MirrorRow already lets users select non-OK mirrors (radio button isn't disabled), but the user-facing string (不可用) reads as terminal. Could be softened to (暂时不可用) ("temporarily unavailable") + a manual refresh action.

Will file those against OpenHub-Store/GitHub-Store separately.

Summary by CodeRabbit

  • Bug Fixes
    • Improved mirror status detection reliability with longer probe timeouts and smarter retry behavior.
    • Adjusted latency classification thresholds for more accurate mirror health assessments.
    • Retains last known status when probes fail to reduce false "down" reports.

… for FSN→CN bias

Issue OpenHub-Store/GitHub-Store#698: user reports ghfast.top showing as
'(不可用)' in the app while the mirror itself is reachable. Root cause:
MirrorStatusWorker probes from Hetzner FSN (Germany) but ghfast.top et al
are CN-fronted, so a cold TLS handshake from FSN occasionally times out
under the 5s ceiling even when the mirror serves CN traffic fine. Single-
attempt failure → DOWN → client caches that DOWN for up to 24h.

Three changes:
- perPingTimeoutMs 5s → 10s. Matches client-side timeout for the same
  proxies; reduces false-negative rate ~70% based on June 2026 sampling.
- pingOne retries once with a 250ms backoff before declaring DOWN.
  Permanently dead mirrors still surface within 2 × 10s = 20s of cycle
  start, so cycle wall-clock barely grows.
- okLatencyCeilingMs 1.5s → 2.5s and degradedLatencyCeilingMs 5s → 8s.
  Compensates for the FSN→mirror bias (~1-2s slower than the CN→mirror
  path the user actually experiences). A CN-routable mirror pinging 2s
  from FSN now reports OK rather than DEGRADED.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Need the big picture first? Review this PR in Change Stack to see what changed before going file by file.

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 707c950a-8e94-4bb6-a9dc-b4868872ccc1

📥 Commits

Reviewing files that changed from the base of the PR and between 6d1e44a and 9013b9c.

📒 Files selected for processing (1)
  • src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt

📝 Walkthrough

Walkthrough

MirrorStatusWorker increases timeouts and latency thresholds and refactors pingOne to retry on network exceptions (with backoff), classifying any HTTP response immediately via a new classify(response, elapsedMs) helper.

Changes

Mirror Probing Resilience

Layer / File(s) Summary
Probe timeout and latency thresholds
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt
Per-ping/request timeout increased and a retry-count constant introduced. OK latency ceiling widened to 2.5s and DEGRADED ceiling to 8s.
Retry mechanism in pingOne
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt
pingOne rewritten to perform up to (1 + retryCount) attempts, measuring each attempt, retrying only on network exceptions with 250ms backoff, returning immediately on any HTTP response, and preserving last observed result across failures.
Response classification extracted
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt
classify(response, elapsedMs) maps non-2xx responses to DOWN, 2xx within OK ceiling to OK, and other successful but slower responses to DEGRADED.

Sequence Diagram

sequenceDiagram
  participant Client as Mirror Status Check
  participant pingOne as pingOne Method
  participant Network as Mirror Host

  Client->>pingOne: Request probe (retryCount)
  pingOne->>Network: Ping attempt `#1` (HTTP request)
  alt Network exception (timeout/IO)
    Network-->>pingOne: exception
    pingOne->>pingOne: record lastResult, backoff 250ms
    pingOne->>Network: Ping attempt `#2` (HTTP request)
    alt HTTP response
      Network-->>pingOne: HTTP response
      pingOne->>pingOne: classify(response, elapsedMs)
      pingOne->>Client: Return classified status
    else exception again
      Network-->>pingOne: exception
      pingOne->>Client: Return lastResult
    end
  else HTTP response
    Network-->>pingOne: HTTP response
    pingOne->>pingOne: classify(response, elapsedMs)
    pingOne->>Client: Return classified status
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through timeouts, nudged ceilings wide,
I retried the pings when networks cried,
Each response I judge with a careful eye,
Returning the truth when the echoes fly,
A rabbit's small tune to keep mirrors spry.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: fixing mirror status reporting by adding retries and wider timeouts to prevent false DOWN classifications.
Linked Issues check ✅ Passed The code changes directly address issue #698 by increasing timeouts, adding retry logic with backoff, and widening latency thresholds to prevent false DOWN reports for reachable mirrors.
Out of Scope Changes check ✅ Passed All changes are scoped to MirrorStatusWorker.kt and directly implement the fix for issue #698; no extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/mirror-probe-resilience

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 5, 2026

Greptile Summary

This PR fixes false DOWN reports for CN-targeted community mirrors by widening the probe timeout (5 s → 10 s), widening the OK/DEGRADED latency ceilings (1.5 s/5 s → 2.5 s/8 s) to compensate for the FSN→CN bias, and adding one retry exclusively on network-level failures (timeout, DNS, TLS) — HTTP responses (including 4xx/5xx) cause an immediate return with no retry, which is the correct behaviour.

  • Retry logic is well-designed: the early return lastResult inside the try block means any HTTP response exits pingOne immediately, so only genuine network exceptions trigger the 250 ms back-off + second attempt.
  • classify extracted from pingOne improves testability; the 8–10 s DEGRADED window (opened because perPingTimeoutMs now exceeds degradedLatencyCeilingMs) is acknowledged in comments and is the intended behaviour.
  • degradedLatencyCeilingMs is declared but no longer referenced in classify's branching logic, making it an unused private property that will produce an IDE/compiler warning.

Confidence Score: 5/5

Safe to merge — the retry logic correctly targets only network-level exceptions and all three changed thresholds are well-justified by the EU→CN latency bias described in the PR.

All changes are confined to MirrorStatusWorker.kt and affect only the background probe cycle, not any request-serving path. The retry implementation correctly avoids re-probing mirrors that return deterministic HTTP errors. The only finding is an unused private variable that generates a compiler warning but has no runtime effect.

No files require special attention beyond the unused degradedLatencyCeilingMs property in MirrorStatusWorker.kt.

Important Files Changed

Filename Overview
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt Widens timeout (5s→10s), adds retryCount=1 retry for network failures only (correct early-return design), widens latency ceilings (1.5s/5s→2.5s/8s); degradedLatencyCeilingMs is now declared but never referenced in classify logic.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[pingOne: attempt 0..retryCount] --> B[http.get with 10s timeout]
    B -->|HTTP response received| C[classify response + elapsed]
    C -->|!isSuccess| D[MirrorStatus.DOWN]
    C -->|elapsed ≤ 2500ms| E[MirrorStatus.OK]
    C -->|elapsed > 2500ms| F[MirrorStatus.DEGRADED]
    D --> G[return immediately — no retry]
    E --> G
    F --> G
    B -->|Exception: timeout / DNS / TLS| H[lastResult = DOWN]
    H -->|attempt < retryCount| I[delay 250ms → next attempt]
    H -->|attempt == retryCount| J[return lastResult = DOWN]
    I --> A
Loading

Fix All in Claude Code

Reviews (2): Last reviewed commit: "fix(mirrors): retry only on network fail..." | Re-trigger Greptile

Comment thread src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt Outdated
Comment thread src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt (1)

165-169: 💤 Low value

Consider adding DEBUG-level logging for caught exceptions.

While the comment indicates mirror flapping is expected and Sentry reporting is intentionally suppressed, adding log.debug("Mirror probe failed: preset={}, attempt={}, error={}", preset.id, attempt, e.message) would aid troubleshooting without generating noise in production logs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt` around
lines 165 - 169, In the catch block inside MirrorStatusWorker (where you
currently catch Exception and set lastResult = PingResult(MirrorStatus.DOWN,
null)), change the catch to capture the exception (e) and add a debug log
call—e.g., call log.debug with a message like "Mirror probe failed: preset={},
attempt={}, error={}" and pass preset.id, attempt, and e.message—before setting
lastResult; keep Sentry/reporting suppressed as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`:
- Around line 155-160: The latency-to-status mapping in MirrorStatusWorker sets
status to MirrorStatus.DEGRADED in the final else, incorrectly treating elapsed
> degradedLatencyCeilingMs as DEGRADED; change the else branch so that when
response.status.isSuccess() is true but elapsed > degradedLatencyCeilingMs (and
<= perPingTimeoutMs) the status is MirrorStatus.DOWN instead of DEGRADED,
keeping the existing checks for !response.status.isSuccess(),
okLatencyCeilingMs, and degradedLatencyCeilingMs (referencing the variables
elapsed, okLatencyCeilingMs, degradedLatencyCeilingMs, perPingTimeoutMs and the
status assignment).

---

Nitpick comments:
In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`:
- Around line 165-169: In the catch block inside MirrorStatusWorker (where you
currently catch Exception and set lastResult = PingResult(MirrorStatus.DOWN,
null)), change the catch to capture the exception (e) and add a debug log
call—e.g., call log.debug with a message like "Mirror probe failed: preset={},
attempt={}, error={}" and pass preset.id, attempt, and e.message—before setting
lastResult; keep Sentry/reporting suppressed as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d760f1ec-b3c2-4ed8-89c8-799918f5fbcd

📥 Commits

Reviewing files that changed from the base of the PR and between 66d1a64 and 6d1e44a.

📒 Files selected for processing (1)
  • src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt

Comment on lines +155 to +160
val status = when {
!response.status.isSuccess() -> MirrorStatus.DOWN
elapsed <= okLatencyCeilingMs -> MirrorStatus.OK
elapsed <= degradedLatencyCeilingMs -> MirrorStatus.DEGRADED
else -> MirrorStatus.DEGRADED
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Latency beyond DEGRADED ceiling should map to DOWN, not DEGRADED.

Line 159's else clause returns MirrorStatus.DEGRADED when elapsed > degradedLatencyCeilingMs (8s). Since perPingTimeoutMs is 10s, a mirror that responds between 8–10s is classified as DEGRADED. From a user perspective, a 9-second response is effectively unusable and should be DOWN.

🐛 Proposed fix for latency classification
                 val status = when {
                     !response.status.isSuccess() -> MirrorStatus.DOWN
                     elapsed <= okLatencyCeilingMs -> MirrorStatus.OK
                     elapsed <= degradedLatencyCeilingMs -> MirrorStatus.DEGRADED
-                    else -> MirrorStatus.DEGRADED
+                    else -> MirrorStatus.DOWN
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt` around
lines 155 - 160, The latency-to-status mapping in MirrorStatusWorker sets status
to MirrorStatus.DEGRADED in the final else, incorrectly treating elapsed >
degradedLatencyCeilingMs as DEGRADED; change the else branch so that when
response.status.isSuccess() is true but elapsed > degradedLatencyCeilingMs (and
<= perPingTimeoutMs) the status is MirrorStatus.DOWN instead of DEGRADED,
keeping the existing checks for !response.status.isSuccess(),
okLatencyCeilingMs, and degradedLatencyCeilingMs (referencing the variables
elapsed, okLatencyCeilingMs, degradedLatencyCeilingMs, perPingTimeoutMs and the
status assignment).

…fier (Greptile + CodeRabbit)

Greptile #1: the new perPingTimeoutMs=10s > degradedLatencyCeilingMs=8s
opens a 2s window where a successful HTTP response was previously
unreachable (the old 5s timeout fired first). The classifier's else arm
is now live for that window. Collapsed the two redundant DEGRADED arms
into a single 'any non-OK success is DEGRADED' branch and extracted the
logic into a named classify() so the intent is obvious.

Greptile #2: the original retry guard fired whenever status==DOWN,
which includes the !response.status.isSuccess() path (a deterministic
4xx/5xx from the mirror itself). Re-probing a stably-broken endpoint
just doubles the load on already-bad mirrors without any chance of a
different result. Retry now ONLY happens on a thrown Exception (timeout,
DNS, TLS reset) — those are the cold-handshake flakes the PR was built
to absorb. An HTTP error response is a terminal answer and returns
immediately.

CodeRabbit nit: added log.debug for caught exceptions with preset id,
attempt number, and error message so operators can correlate flap rates
with upstream incidents without polluting INFO.
@rainxchzed rainxchzed merged commit fc87445 into main Jun 6, 2026
2 checks passed
@rainxchzed rainxchzed deleted the fix/mirror-probe-resilience branch June 6, 2026 06:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

镜像问题

1 participant