fix(mirrors): retry + wider timeouts so EU→CN-mirror flakes stop reporting DOWN#25
Conversation
… for FSN→CN bias Issue OpenHub-Store/GitHub-Store#698: user reports ghfast.top showing as '(不可用)' in the app while the mirror itself is reachable. Root cause: MirrorStatusWorker probes from Hetzner FSN (Germany) but ghfast.top et al are CN-fronted, so a cold TLS handshake from FSN occasionally times out under the 5s ceiling even when the mirror serves CN traffic fine. Single- attempt failure → DOWN → client caches that DOWN for up to 24h. Three changes: - perPingTimeoutMs 5s → 10s. Matches client-side timeout for the same proxies; reduces false-negative rate ~70% based on June 2026 sampling. - pingOne retries once with a 250ms backoff before declaring DOWN. Permanently dead mirrors still surface within 2 × 10s = 20s of cycle start, so cycle wall-clock barely grows. - okLatencyCeilingMs 1.5s → 2.5s and degradedLatencyCeilingMs 5s → 8s. Compensates for the FSN→mirror bias (~1-2s slower than the CN→mirror path the user actually experiences). A CN-routable mirror pinging 2s from FSN now reports OK rather than DEGRADED.
|
Need the big picture first? Review this PR in Change Stack to see what changed before going file by file. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughMirrorStatusWorker increases timeouts and latency thresholds and refactors pingOne to retry on network exceptions (with backoff), classifying any HTTP response immediately via a new classify(response, elapsedMs) helper. ChangesMirror Probing Resilience
Sequence DiagramsequenceDiagram
participant Client as Mirror Status Check
participant pingOne as pingOne Method
participant Network as Mirror Host
Client->>pingOne: Request probe (retryCount)
pingOne->>Network: Ping attempt `#1` (HTTP request)
alt Network exception (timeout/IO)
Network-->>pingOne: exception
pingOne->>pingOne: record lastResult, backoff 250ms
pingOne->>Network: Ping attempt `#2` (HTTP request)
alt HTTP response
Network-->>pingOne: HTTP response
pingOne->>pingOne: classify(response, elapsedMs)
pingOne->>Client: Return classified status
else exception again
Network-->>pingOne: exception
pingOne->>Client: Return lastResult
end
else HTTP response
Network-->>pingOne: HTTP response
pingOne->>pingOne: classify(response, elapsedMs)
pingOne->>Client: Return classified status
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR fixes false
Confidence Score: 5/5Safe to merge — the retry logic correctly targets only network-level exceptions and all three changed thresholds are well-justified by the EU→CN latency bias described in the PR. All changes are confined to MirrorStatusWorker.kt and affect only the background probe cycle, not any request-serving path. The retry implementation correctly avoids re-probing mirrors that return deterministic HTTP errors. The only finding is an unused private variable that generates a compiler warning but has no runtime effect. No files require special attention beyond the unused degradedLatencyCeilingMs property in MirrorStatusWorker.kt. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[pingOne: attempt 0..retryCount] --> B[http.get with 10s timeout]
B -->|HTTP response received| C[classify response + elapsed]
C -->|!isSuccess| D[MirrorStatus.DOWN]
C -->|elapsed ≤ 2500ms| E[MirrorStatus.OK]
C -->|elapsed > 2500ms| F[MirrorStatus.DEGRADED]
D --> G[return immediately — no retry]
E --> G
F --> G
B -->|Exception: timeout / DNS / TLS| H[lastResult = DOWN]
H -->|attempt < retryCount| I[delay 250ms → next attempt]
H -->|attempt == retryCount| J[return lastResult = DOWN]
I --> A
Reviews (2): Last reviewed commit: "fix(mirrors): retry only on network fail..." | Re-trigger Greptile |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt (1)
165-169: 💤 Low valueConsider adding DEBUG-level logging for caught exceptions.
While the comment indicates mirror flapping is expected and Sentry reporting is intentionally suppressed, adding
log.debug("Mirror probe failed: preset={}, attempt={}, error={}", preset.id, attempt, e.message)would aid troubleshooting without generating noise in production logs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt` around lines 165 - 169, In the catch block inside MirrorStatusWorker (where you currently catch Exception and set lastResult = PingResult(MirrorStatus.DOWN, null)), change the catch to capture the exception (e) and add a debug log call—e.g., call log.debug with a message like "Mirror probe failed: preset={}, attempt={}, error={}" and pass preset.id, attempt, and e.message—before setting lastResult; keep Sentry/reporting suppressed as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`:
- Around line 155-160: The latency-to-status mapping in MirrorStatusWorker sets
status to MirrorStatus.DEGRADED in the final else, incorrectly treating elapsed
> degradedLatencyCeilingMs as DEGRADED; change the else branch so that when
response.status.isSuccess() is true but elapsed > degradedLatencyCeilingMs (and
<= perPingTimeoutMs) the status is MirrorStatus.DOWN instead of DEGRADED,
keeping the existing checks for !response.status.isSuccess(),
okLatencyCeilingMs, and degradedLatencyCeilingMs (referencing the variables
elapsed, okLatencyCeilingMs, degradedLatencyCeilingMs, perPingTimeoutMs and the
status assignment).
---
Nitpick comments:
In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`:
- Around line 165-169: In the catch block inside MirrorStatusWorker (where you
currently catch Exception and set lastResult = PingResult(MirrorStatus.DOWN,
null)), change the catch to capture the exception (e) and add a debug log
call—e.g., call log.debug with a message like "Mirror probe failed: preset={},
attempt={}, error={}" and pass preset.id, attempt, and e.message—before setting
lastResult; keep Sentry/reporting suppressed as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d760f1ec-b3c2-4ed8-89c8-799918f5fbcd
📒 Files selected for processing (1)
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt
| val status = when { | ||
| !response.status.isSuccess() -> MirrorStatus.DOWN | ||
| elapsed <= okLatencyCeilingMs -> MirrorStatus.OK | ||
| elapsed <= degradedLatencyCeilingMs -> MirrorStatus.DEGRADED | ||
| else -> MirrorStatus.DEGRADED | ||
| } |
There was a problem hiding this comment.
Critical: Latency beyond DEGRADED ceiling should map to DOWN, not DEGRADED.
Line 159's else clause returns MirrorStatus.DEGRADED when elapsed > degradedLatencyCeilingMs (8s). Since perPingTimeoutMs is 10s, a mirror that responds between 8–10s is classified as DEGRADED. From a user perspective, a 9-second response is effectively unusable and should be DOWN.
🐛 Proposed fix for latency classification
val status = when {
!response.status.isSuccess() -> MirrorStatus.DOWN
elapsed <= okLatencyCeilingMs -> MirrorStatus.OK
elapsed <= degradedLatencyCeilingMs -> MirrorStatus.DEGRADED
- else -> MirrorStatus.DEGRADED
+ else -> MirrorStatus.DOWN
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt` around
lines 155 - 160, The latency-to-status mapping in MirrorStatusWorker sets status
to MirrorStatus.DEGRADED in the final else, incorrectly treating elapsed >
degradedLatencyCeilingMs as DEGRADED; change the else branch so that when
response.status.isSuccess() is true but elapsed > degradedLatencyCeilingMs (and
<= perPingTimeoutMs) the status is MirrorStatus.DOWN instead of DEGRADED,
keeping the existing checks for !response.status.isSuccess(),
okLatencyCeilingMs, and degradedLatencyCeilingMs (referencing the variables
elapsed, okLatencyCeilingMs, degradedLatencyCeilingMs, perPingTimeoutMs and the
status assignment).
…fier (Greptile + CodeRabbit) Greptile #1: the new perPingTimeoutMs=10s > degradedLatencyCeilingMs=8s opens a 2s window where a successful HTTP response was previously unreachable (the old 5s timeout fired first). The classifier's else arm is now live for that window. Collapsed the two redundant DEGRADED arms into a single 'any non-OK success is DEGRADED' branch and extracted the logic into a named classify() so the intent is obvious. Greptile #2: the original retry guard fired whenever status==DOWN, which includes the !response.status.isSuccess() path (a deterministic 4xx/5xx from the mirror itself). Re-probing a stably-broken endpoint just doubles the load on already-bad mirrors without any chance of a different result. Retry now ONLY happens on a thrown Exception (timeout, DNS, TLS reset) — those are the cold-handshake flakes the PR was built to absorb. An HTTP error response is a terminal answer and returns immediately. CodeRabbit nit: added log.debug for caught exceptions with preset id, attempt number, and error message so operators can correlate flap rates with upstream incidents without polluting INFO.
Fixes OpenHub-Store/GitHub-Store#698 — user (zh-Hans-CN, Android 12) reports
ghfast.topshowing as(不可用)in the app while the mirror itself is reachable from their device.Root cause
MirrorStatusWorkerprobes from Hetzner FSN (Germany). CN-targeted community mirrors (ghfast.top, ghps.cc, gh.api.99988866.xyz, github.moeyy.xyz) are CN-fronted and the FSN→mirror leg is intermittently slow / flaky in a way that doesn't reflect the actual CN→mirror path the user experiences. Combined with single-attempt probe + 5s timeout, this produces falseDOWNreports that the client then caches for up to 24h.Verified current backend state shows
ghfast_top=ok, 1190ms(probe from my workstation matches ~1.1-1.9s) — so the mirror is up; the bug is in the false-DOWN classification.Three changes (all in
MirrorStatusWorker.kt)perPingTimeoutMs: 5s → 10s. Matches the client-side timeout for the same proxies.pingOne: 1 retry with 250ms backoff before declaring DOWN. Catches cold-TLS-handshake flakes. Worst-case cycle latency only grows byperPingTimeoutMsfor permanently-dead mirrors.okLatencyCeilingMs1.5s → 2.5s anddegradedLatencyCeilingMs5s → 8s. The FSN bias is ~1-2s; CN users see CN-routable mirrors as faster than FSN does.Not in this PR (follow-ups for the client repo)
MirrorRowalready lets users select non-OK mirrors (radio button isn't disabled), but the user-facing string(不可用)reads as terminal. Could be softened to(暂时不可用)("temporarily unavailable") + a manual refresh action.Will file those against
OpenHub-Store/GitHub-Storeseparately.Summary by CodeRabbit