Skip to content

Fix flaky TestNetworkEvents (checkpoint before triggering events)#292

Merged
rgarcia merged 1 commit into
mainfrom
hypeship/fix-testnetworkevents-flake
Jun 22, 2026
Merged

Fix flaky TestNetworkEvents (checkpoint before triggering events)#292
rgarcia merged 1 commit into
mainfrom
hypeship/fix-testnetworkevents-flake

Conversation

@rgarcia

@rgarcia rgarcia commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

TestNetworkEvents intermittently failed in CI on the loading_failed and binary_resource_skips_body subtests, each timing out after a suspiciously round ~2s / ~3s — the waitForNew deadlines.

Root cause

eventCollector.waitForNew snapshotted the current event count (skip := len(c.events)) inside the call, i.e. after the test had already sent the messages that trigger the awaited event. Events are published asynchronously relative to the test goroutine — readLoop dispatches CDP messages on its own goroutine, and handleLoadingFinished publishes network_response from an asyncWg.Go goroutine. So the triggering event could be appended to c.events before waitForNew recorded its checkpoint. The just-published event then fell below skip, was treated as "old", and the wait blocked until its deadline.

Only subtests where a same-typed event had already been emitted earlier in the shared monitor were affected (request_and_response uses waitFor, which scans all events, so it was immune). On fast unloaded machines the publish reliably lost the race against the checkpoint, so it passed locally; under CI load it didn't.

This was confirmed by injecting a small delay before the checkpoint, which reproduces the exact two failing subtests at ~2.04s / ~3.04s; with the fix the same delay passes.

Fix

Add eventCollector.checkpoint(), taken before the triggering messages are sent, and have waitForNew wait for events at or after that index. Updated all four call sites. No production code changed.

Test plan

  • go test ./lib/cdpmonitor/ -run TestNetworkEvents -count=100 -race passes
  • go test ./lib/cdpmonitor/ -run TestNetworkEvents -count=100 -race -cpu=1 passes
  • Full package go test ./lib/cdpmonitor/ -race -count=10 passes
  • go vet ./lib/cdpmonitor/ clean
  • Verified the fix closes the race: a delay injected before the checkpoint reproduces the original failures, and passes once the checkpoint precedes the trigger

🤖 Generated with Claude Code


Note

Low Risk
Test harness changes only; no runtime CDP monitor behavior is modified.

Overview
Fixes intermittent CI timeouts in CDP monitor handler tests by changing how eventCollector.waitForNew decides which events count as "new."

checkpoint() snapshots the current event slice length; callers take it before sending CDP messages that should produce the awaited event. waitForNew now takes that index and only matches events at or after it, instead of recording the skip count inside waitForNew after triggers may already have published asynchronously (read loop / async response-body work).

handlers_test.go updates four sites: console non-string args, network loading failed / binary image response, and per-session network_idle attribution (checkpoint after sess-b's idle, then finish sess-a's request).

No production cdpmonitor code changes.

Reviewed by Cursor Bugbot for commit 6640ed2. Bugbot is set up for automated code reviews on this repo. Configure here.

waitForNew snapshotted the event count after the messages that trigger
the awaited event were already sent. Events are published asynchronously
(readLoop dispatch and the async response-body goroutine), so the event
could be appended before the snapshot, get treated as old, and the wait
would block until its deadline. This caused intermittent ~2s/~3s timeouts
in the loading_failed and binary_resource_skips_body subtests.

Replace the internal snapshot with an explicit checkpoint() taken before
the trigger, and have waitForNew wait for events at or after it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rgarcia rgarcia marked this pull request as ready for review June 21, 2026 12:59
@firetiger-agent

Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR is in the kernel repo but affects only the cdpmonitor test package with no changes to production code, so it's unclear if deploy monitoring is needed for a test-only fix.

To monitor this PR anyway, reply with @firetiger monitor this.

@rgarcia rgarcia requested a review from archandatta June 21, 2026 12:59
@rgarcia rgarcia merged commit ef737d3 into main Jun 22, 2026
10 checks passed
@rgarcia rgarcia deleted the hypeship/fix-testnetworkevents-flake branch June 22, 2026 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants