Fix flaky TestNetworkEvents (checkpoint before triggering events) by rgarcia · Pull Request #292 · kernel/kernel-images

rgarcia · 2026-06-21T12:40:04Z

Summary

TestNetworkEvents intermittently failed in CI on the loading_failed and binary_resource_skips_body subtests, each timing out after a suspiciously round ~2s / ~3s — the waitForNew deadlines.

Root cause

eventCollector.waitForNew snapshotted the current event count (skip := len(c.events)) inside the call, i.e. after the test had already sent the messages that trigger the awaited event. Events are published asynchronously relative to the test goroutine — readLoop dispatches CDP messages on its own goroutine, and handleLoadingFinished publishes network_response from an asyncWg.Go goroutine. So the triggering event could be appended to c.events before waitForNew recorded its checkpoint. The just-published event then fell below skip, was treated as "old", and the wait blocked until its deadline.

Only subtests where a same-typed event had already been emitted earlier in the shared monitor were affected (request_and_response uses waitFor, which scans all events, so it was immune). On fast unloaded machines the publish reliably lost the race against the checkpoint, so it passed locally; under CI load it didn't.

This was confirmed by injecting a small delay before the checkpoint, which reproduces the exact two failing subtests at ~2.04s / ~3.04s; with the fix the same delay passes.

Fix

Add eventCollector.checkpoint(), taken before the triggering messages are sent, and have waitForNew wait for events at or after that index. Updated all four call sites. No production code changed.

Test plan

go test ./lib/cdpmonitor/ -run TestNetworkEvents -count=100 -race passes
go test ./lib/cdpmonitor/ -run TestNetworkEvents -count=100 -race -cpu=1 passes
Full package go test ./lib/cdpmonitor/ -race -count=10 passes
go vet ./lib/cdpmonitor/ clean
Verified the fix closes the race: a delay injected before the checkpoint reproduces the original failures, and passes once the checkpoint precedes the trigger

🤖 Generated with Claude Code

Note

Low Risk
Test harness changes only; no runtime CDP monitor behavior is modified.

Overview
Fixes intermittent CI timeouts in CDP monitor handler tests by changing how eventCollector.waitForNew decides which events count as "new."

checkpoint() snapshots the current event slice length; callers take it before sending CDP messages that should produce the awaited event. waitForNew now takes that index and only matches events at or after it, instead of recording the skip count inside waitForNew after triggers may already have published asynchronously (read loop / async response-body work).

handlers_test.go updates four sites: console non-string args, network loading failed / binary image response, and per-session network_idle attribution (checkpoint after sess-b's idle, then finish sess-a's request).

No production cdpmonitor code changes.

^{Reviewed by Cursor Bugbot for commit 6640ed2. Bugbot is set up for automated code reviews on this repo. Configure here.}

waitForNew snapshotted the event count after the messages that trigger the awaited event were already sent. Events are published asynchronously (readLoop dispatch and the async response-body goroutine), so the event could be appended before the snapshot, get treated as old, and the wait would block until its deadline. This caused intermittent ~2s/~3s timeouts in the loading_failed and binary_resource_skips_body subtests. Replace the internal snapshot with an explicit checkpoint() taken before the trigger, and have waitForNew wait for events at or after it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

firetiger-agent · 2026-06-21T12:59:18Z

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR is in the kernel repo but affects only the cdpmonitor test package with no changes to production code, so it's unclear if deploy monitoring is needed for a test-only fix.

To monitor this PR anyway, reply with @firetiger monitor this.

rgarcia marked this pull request as ready for review June 21, 2026 12:59

rgarcia requested a review from archandatta June 21, 2026 12:59

archandatta approved these changes Jun 22, 2026

View reviewed changes

rgarcia merged commit ef737d3 into main Jun 22, 2026
10 checks passed

rgarcia deleted the hypeship/fix-testnetworkevents-flake branch June 22, 2026 14:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky TestNetworkEvents (checkpoint before triggering events)#292

Fix flaky TestNetworkEvents (checkpoint before triggering events)#292
rgarcia merged 1 commit into
mainfrom
hypeship/fix-testnetworkevents-flake

rgarcia commented Jun 21, 2026 •

edited by cursor Bot

Loading

Uh oh!

firetiger-agent Bot commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rgarcia commented Jun 21, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Test plan

Uh oh!

firetiger-agent Bot commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rgarcia commented Jun 21, 2026 •

edited by cursor Bot

Loading