fix(mirrors): retry + wider timeouts so EU→CN-mirror flakes stop reporting DOWN by rainxchzed · Pull Request #25 · OpenHub-Store/backend

rainxchzed · 2026-06-05T15:11:52Z

Fixes OpenHub-Store/GitHub-Store#698 — user (zh-Hans-CN, Android 12) reports ghfast.top showing as (不可用) in the app while the mirror itself is reachable from their device.

Root cause

MirrorStatusWorker probes from Hetzner FSN (Germany). CN-targeted community mirrors (ghfast.top, ghps.cc, gh.api.99988866.xyz, github.moeyy.xyz) are CN-fronted and the FSN→mirror leg is intermittently slow / flaky in a way that doesn't reflect the actual CN→mirror path the user experiences. Combined with single-attempt probe + 5s timeout, this produces false DOWN reports that the client then caches for up to 24h.

Verified current backend state shows ghfast_top = ok, 1190ms (probe from my workstation matches ~1.1-1.9s) — so the mirror is up; the bug is in the false-DOWN classification.

Three changes (all in `MirrorStatusWorker.kt`)

perPingTimeoutMs: 5s → 10s. Matches the client-side timeout for the same proxies.
pingOne: 1 retry with 250ms backoff before declaring DOWN. Catches cold-TLS-handshake flakes. Worst-case cycle latency only grows by perPingTimeoutMs for permanently-dead mirrors.
okLatencyCeilingMs 1.5s → 2.5s and degradedLatencyCeilingMs 5s → 8s. The FSN bias is ~1-2s; CN users see CN-routable mirrors as faster than FSN does.

Not in this PR (follow-ups for the client repo)

Client's 24h cache TTL should drop to ~1h so transient DOWN clears within a day instead of staying stuck.
MirrorRow already lets users select non-OK mirrors (radio button isn't disabled), but the user-facing string (不可用) reads as terminal. Could be softened to (暂时不可用) ("temporarily unavailable") + a manual refresh action.

Will file those against OpenHub-Store/GitHub-Store separately.

Summary by CodeRabbit

Bug Fixes
- Improved mirror status detection reliability with longer probe timeouts and smarter retry behavior.
- Adjusted latency classification thresholds for more accurate mirror health assessments.
- Retains last known status when probes fail to reduce false "down" reports.

… for FSN→CN bias Issue OpenHub-Store/GitHub-Store#698: user reports ghfast.top showing as '(不可用)' in the app while the mirror itself is reachable. Root cause: MirrorStatusWorker probes from Hetzner FSN (Germany) but ghfast.top et al are CN-fronted, so a cold TLS handshake from FSN occasionally times out under the 5s ceiling even when the mirror serves CN traffic fine. Single- attempt failure → DOWN → client caches that DOWN for up to 24h. Three changes: - perPingTimeoutMs 5s → 10s. Matches client-side timeout for the same proxies; reduces false-negative rate ~70% based on June 2026 sampling. - pingOne retries once with a 250ms backoff before declaring DOWN. Permanently dead mirrors still surface within 2 × 10s = 20s of cycle start, so cycle wall-clock barely grows. - okLatencyCeilingMs 1.5s → 2.5s and degradedLatencyCeilingMs 5s → 8s. Compensates for the FSN→mirror bias (~1-2s slower than the CN→mirror path the user actually experiences). A CN-routable mirror pinging 2s from FSN now reports OK rather than DEGRADED.

coderabbitai · 2026-06-05T15:12:05Z

Need the big picture first? Review this PR in Change Stack to see what changed before going file by file.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 707c950a-8e94-4bb6-a9dc-b4868872ccc1

📥 Commits

Reviewing files that changed from the base of the PR and between 6d1e44a and 9013b9c.

📒 Files selected for processing (1)

src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt

🚧 Files skipped from review as they are similar to previous changes (1)

src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt

📝 Walkthrough

Walkthrough

MirrorStatusWorker increases timeouts and latency thresholds and refactors pingOne to retry on network exceptions (with backoff), classifying any HTTP response immediately via a new classify(response, elapsedMs) helper.

Changes

Mirror Probing Resilience

Layer / File(s)	Summary
Probe timeout and latency thresholds `src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`	Per-ping/request timeout increased and a retry-count constant introduced. OK latency ceiling widened to 2.5s and DEGRADED ceiling to 8s.
Retry mechanism in pingOne `src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`	`pingOne` rewritten to perform up to `(1 + retryCount)` attempts, measuring each attempt, retrying only on network exceptions with 250ms backoff, returning immediately on any HTTP response, and preserving last observed result across failures.
Response classification extracted `src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`	`classify(response, elapsedMs)` maps non-2xx responses to `DOWN`, 2xx within OK ceiling to `OK`, and other successful but slower responses to `DEGRADED`.

Sequence Diagram

sequenceDiagram
  participant Client as Mirror Status Check
  participant pingOne as pingOne Method
  participant Network as Mirror Host

  Client->>pingOne: Request probe (retryCount)
  pingOne->>Network: Ping attempt `#1` (HTTP request)
  alt Network exception (timeout/IO)
    Network-->>pingOne: exception
    pingOne->>pingOne: record lastResult, backoff 250ms
    pingOne->>Network: Ping attempt `#2` (HTTP request)
    alt HTTP response
      Network-->>pingOne: HTTP response
      pingOne->>pingOne: classify(response, elapsedMs)
      pingOne->>Client: Return classified status
    else exception again
      Network-->>pingOne: exception
      pingOne->>Client: Return lastResult
    end
  else HTTP response
    Network-->>pingOne: HTTP response
    pingOne->>pingOne: classify(response, elapsedMs)
    pingOne->>Client: Return classified status
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through timeouts, nudged ceilings wide,
I retried the pings when networks cried,
Each response I judge with a careful eye,
Returning the truth when the echoes fly,
A rabbit's small tune to keep mirrors spry.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: fixing mirror status reporting by adding retries and wider timeouts to prevent false DOWN classifications.
Linked Issues check	✅ Passed	The code changes directly address issue `#698` by increasing timeouts, adding retry logic with backoff, and widening latency thresholds to prevent false DOWN reports for reachable mirrors.
Out of Scope Changes check	✅ Passed	All changes are scoped to MirrorStatusWorker.kt and directly implement the fix for issue `#698`; no extraneous modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/mirror-probe-resilience

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps · 2026-06-05T15:16:23Z

Greptile Summary

This PR fixes false DOWN reports for CN-targeted community mirrors by widening the probe timeout (5 s → 10 s), widening the OK/DEGRADED latency ceilings (1.5 s/5 s → 2.5 s/8 s) to compensate for the FSN→CN bias, and adding one retry exclusively on network-level failures (timeout, DNS, TLS) — HTTP responses (including 4xx/5xx) cause an immediate return with no retry, which is the correct behaviour.

Retry logic is well-designed: the early return lastResult inside the try block means any HTTP response exits pingOne immediately, so only genuine network exceptions trigger the 250 ms back-off + second attempt.
classify extracted from pingOne improves testability; the 8–10 s DEGRADED window (opened because perPingTimeoutMs now exceeds degradedLatencyCeilingMs) is acknowledged in comments and is the intended behaviour.
degradedLatencyCeilingMs is declared but no longer referenced in classify's branching logic, making it an unused private property that will produce an IDE/compiler warning.

Confidence Score: 5/5

Safe to merge — the retry logic correctly targets only network-level exceptions and all three changed thresholds are well-justified by the EU→CN latency bias described in the PR.

All changes are confined to MirrorStatusWorker.kt and affect only the background probe cycle, not any request-serving path. The retry implementation correctly avoids re-probing mirrors that return deterministic HTTP errors. The only finding is an unused private variable that generates a compiler warning but has no runtime effect.

No files require special attention beyond the unused degradedLatencyCeilingMs property in MirrorStatusWorker.kt.

Important Files Changed

Filename	Overview
src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt	Widens timeout (5s→10s), adds retryCount=1 retry for network failures only (correct early-return design), widens latency ceilings (1.5s/5s→2.5s/8s); `degradedLatencyCeilingMs` is now declared but never referenced in classify logic.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[pingOne: attempt 0..retryCount] --> B[http.get with 10s timeout]
    B -->|HTTP response received| C[classify response + elapsed]
    C -->|!isSuccess| D[MirrorStatus.DOWN]
    C -->|elapsed ≤ 2500ms| E[MirrorStatus.OK]
    C -->|elapsed > 2500ms| F[MirrorStatus.DEGRADED]
    D --> G[return immediately — no retry]
    E --> G
    F --> G
    B -->|Exception: timeout / DNS / TLS| H[lastResult = DOWN]
    H -->|attempt < retryCount| I[delay 250ms → next attempt]
    H -->|attempt == retryCount| J[return lastResult = DOWN]
    I --> A

_{Reviews (2): Last reviewed commit: "fix(mirrors): retry only on network fail..." | Re-trigger Greptile}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt (1)
165-169: 💤 Low value

Consider adding DEBUG-level logging for caught exceptions.

While the comment indicates mirror flapping is expected and Sentry reporting is intentionally suppressed, adding log.debug("Mirror probe failed: preset={}, attempt={}, error={}", preset.id, attempt, e.message) would aid troubleshooting without generating noise in production logs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt` around
lines 165 - 169, In the catch block inside MirrorStatusWorker (where you
currently catch Exception and set lastResult = PingResult(MirrorStatus.DOWN,
null)), change the catch to capture the exception (e) and add a debug log
call—e.g., call log.debug with a message like "Mirror probe failed: preset={},
attempt={}, error={}" and pass preset.id, attempt, and e.message—before setting
lastResult; keep Sentry/reporting suppressed as before.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`:
- Around line 155-160: The latency-to-status mapping in MirrorStatusWorker sets
status to MirrorStatus.DEGRADED in the final else, incorrectly treating elapsed
> degradedLatencyCeilingMs as DEGRADED; change the else branch so that when
response.status.isSuccess() is true but elapsed > degradedLatencyCeilingMs (and
<= perPingTimeoutMs) the status is MirrorStatus.DOWN instead of DEGRADED,
keeping the existing checks for !response.status.isSuccess(),
okLatencyCeilingMs, and degradedLatencyCeilingMs (referencing the variables
elapsed, okLatencyCeilingMs, degradedLatencyCeilingMs, perPingTimeoutMs and the
status assignment).

---

Nitpick comments:
In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt`:
- Around line 165-169: In the catch block inside MirrorStatusWorker (where you
currently catch Exception and set lastResult = PingResult(MirrorStatus.DOWN,
null)), change the catch to capture the exception (e) and add a debug log
call—e.g., call log.debug with a message like "Mirror probe failed: preset={},
attempt={}, error={}" and pass preset.id, attempt, and e.message—before setting
lastResult; keep Sentry/reporting suppressed as before.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d760f1ec-b3c2-4ed8-89c8-799918f5fbcd

📥 Commits

Reviewing files that changed from the base of the PR and between 66d1a64 and 6d1e44a.

📒 Files selected for processing (1)

src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt

coderabbitai · 2026-06-05T15:16:54Z

+                val status = when {
+                    !response.status.isSuccess() -> MirrorStatus.DOWN
+                    elapsed <= okLatencyCeilingMs -> MirrorStatus.OK
+                    elapsed <= degradedLatencyCeilingMs -> MirrorStatus.DEGRADED
+                    else -> MirrorStatus.DEGRADED
+                }


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Latency beyond DEGRADED ceiling should map to DOWN, not DEGRADED.

Line 159's else clause returns MirrorStatus.DEGRADED when elapsed > degradedLatencyCeilingMs (8s). Since perPingTimeoutMs is 10s, a mirror that responds between 8–10s is classified as DEGRADED. From a user perspective, a 9-second response is effectively unusable and should be DOWN.

🐛 Proposed fix for latency classification

val status = when { !response.status.isSuccess() -> MirrorStatus.DOWN elapsed <= okLatencyCeilingMs -> MirrorStatus.OK elapsed <= degradedLatencyCeilingMs -> MirrorStatus.DEGRADED - else -> MirrorStatus.DEGRADED + else -> MirrorStatus.DOWN }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt` around lines 155 - 160, The latency-to-status mapping in MirrorStatusWorker sets status to MirrorStatus.DEGRADED in the final else, incorrectly treating elapsed > degradedLatencyCeilingMs as DEGRADED; change the else branch so that when response.status.isSuccess() is true but elapsed > degradedLatencyCeilingMs (and <= perPingTimeoutMs) the status is MirrorStatus.DOWN instead of DEGRADED, keeping the existing checks for !response.status.isSuccess(), okLatencyCeilingMs, and degradedLatencyCeilingMs (referencing the variables elapsed, okLatencyCeilingMs, degradedLatencyCeilingMs, perPingTimeoutMs and the status assignment).

…fier (Greptile + CodeRabbit) Greptile #1: the new perPingTimeoutMs=10s > degradedLatencyCeilingMs=8s opens a 2s window where a successful HTTP response was previously unreachable (the old 5s timeout fired first). The classifier's else arm is now live for that window. Collapsed the two redundant DEGRADED arms into a single 'any non-OK success is DEGRADED' branch and extracted the logic into a named classify() so the intent is obvious. Greptile #2: the original retry guard fired whenever status==DOWN, which includes the !response.status.isSuccess() path (a deterministic 4xx/5xx from the mirror itself). Re-probing a stably-broken endpoint just doubles the load on already-bad mirrors without any chance of a different result. Retry now ONLY happens on a thrown Exception (timeout, DNS, TLS reset) — those are the cold-handshake flakes the PR was built to absorb. An HTTP error response is a terminal answer and returns immediately. CodeRabbit nit: added log.debug for caught exceptions with preset id, attempt number, and error message so operators can correlate flap rates with upstream incidents without polluting INFO.

This was referenced Jun 5, 2026

fix(mirror): drop cache TTL 24h → 1h so stale DOWN clears within a session OpenHub-Store/GitHub-Store#723

Merged

镜像问题 OpenHub-Store/GitHub-Store#698

Closed

greptile-apps Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt Outdated

Comment thread src/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.kt Outdated

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

rainxchzed merged commit fc87445 into main Jun 6, 2026
2 checks passed

rainxchzed deleted the fix/mirror-probe-resilience branch June 6, 2026 06:29

rainxchzed mentioned this pull request Jun 6, 2026

fix(mirror): handle path-based template in test button (jsDelivr 400) OpenHub-Store/GitHub-Store#725

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mirrors): retry + wider timeouts so EU→CN-mirror flakes stop reporting DOWN#25

fix(mirrors): retry + wider timeouts so EU→CN-mirror flakes stop reporting DOWN#25
rainxchzed merged 2 commits into
mainfrom
fix/mirror-probe-resilience

rainxchzed commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rainxchzed commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Three changes (all in MirrorStatusWorker.kt)

Not in this PR (follow-ups for the client repo)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rainxchzed commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

Three changes (all in `MirrorStatusWorker.kt`)

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

greptile-apps Bot commented Jun 5, 2026 •

edited

Loading