Skip to content

fix(storage): resolve flaky integration tests and Surefire crashes in CI#13349

Open
nidhiii-27 wants to merge 6 commits into
mainfrom
fix-flaky-storage-tests
Open

fix(storage): resolve flaky integration tests and Surefire crashes in CI#13349
nidhiii-27 wants to merge 6 commits into
mainfrom
fix-flaky-storage-tests

Conversation

@nidhiii-27
Copy link
Copy Markdown
Contributor

This pull request resolves the severe, intermittent CI flakiness and Maven Surefire JVM crashes in google-cloud-storage under parallel Kokoro environments.

Modifications:

  1. Strategy 1 (Port Allocation & Isolation): Dynamically allocate HTTP/gRPC ports and unique container names (fork-<port>) for each parallel JVM fork in Registry.java and TestBench.java.
  2. Strategy 2 (Console Stream Pollution): Removed raw print stack traces from test files/mock servers, and redirected container logs to target files.
  3. Strategy 3 (Resource & Thread Leaks): Wrapped channel instantiations in try-with-resources across all test files and removed obsolete compiler warnings suppression annotations.
  4. Graceful Stream and Thread Teardown: Updated channel close() to be idempotent and cleanly close outbound observers and cancel backoff executor retry futures.

Verified locally with a clean compilation and all tests passing (BUILD SUCCESS).

@nidhiii-27 nidhiii-27 requested review from a team as code owners June 3, 2026 16:46
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces robustness improvements to Google Cloud Storage writable byte channels and refactors associated tests to use try-with-resources blocks for safer resource management. Key changes include canceling pending reconciliations on stream close, handling stream errors during channel flushes, and preventing double-close operations. Additionally, test logging was modernized, and dynamic port allocation was introduced for the test bench. Feedback on these changes highlights a potential TOCTOU race condition in the new port allocation helper, warns that wrapping caught exceptions in a cancelled status obscures original gRPC error codes, and suggests using unique identifiers for copied test bench logs to prevent parallel test runs from overwriting each other.

Comment on lines +337 to +344
private static int findFreePort() {
try (java.net.ServerSocket socket = new java.net.ServerSocket(0)) {
socket.setReuseAddress(true);
return socket.getLocalPort();
} catch (java.io.IOException e) {
throw new RuntimeException("Failed to allocate a free port", e);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The findFreePort() method has a classic Time-of-Check to Time-of-Use (TOCTOU) race condition. Because the socket is closed immediately after retrieving the port, another thread, process, or parallel JVM fork can bind to the same port before the TestBench container starts and binds to it. This can lead to intermittent port collision failures under heavy parallel CI runs.

If TestBench is managed via Testcontainers, it is highly recommended to let Testcontainers dynamically allocate and bind the ports (which avoids the race entirely) and then query the mapped ports using container.getMappedPort(...). If you are manually starting the container via Docker CLI, you can map to port 0 (e.g., -p 0:8080) and inspect the container to find the allocated host port.

Comment on lines +254 to +259
if (stream != null) {
try {
stream.onError(io.grpc.Status.CANCELLED.withCause(t).asRuntimeException());
} catch (Exception ignored) {
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Wrapping the caught Throwable t in io.grpc.Status.CANCELLED.withCause(t).asRuntimeException() obscures the original gRPC status code (such as UNAVAILABLE, PERMISSION_DENIED, etc.) that might be present in t. This can break retry policies or error handling downstream that rely on the specific status code of the failure.

Consider passing the original exception t directly to stream.onError(t) to preserve the original error type and status code.

Suggested change
if (stream != null) {
try {
stream.onError(io.grpc.Status.CANCELLED.withCause(t).asRuntimeException());
} catch (Exception ignored) {
}
}
if (stream != null) {
try {
stream.onError(t);
} catch (Exception ignored) {
}
}

Comment on lines +389 to +393
Path targetLogsDir = java.nio.file.Paths.get("target/testbench-logs");
java.nio.file.Files.createDirectories(targetLogsDir);
java.nio.file.Files.copy(outFile, targetLogsDir.resolve(outFile.getFileName()), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
java.nio.file.Files.copy(errFile, targetLogsDir.resolve(errFile.getFileName()), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
LOGGER.warn("Copied Testbench logs to target directory: {}", targetLogsDir.toAbsolutePath());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When copying the log files to target/testbench-logs, using outFile.getFileName() (which is likely a generic name like stdout.log or stderr.log) will cause parallel test forks to overwrite each other's log files in the shared target/testbench-logs directory.

To prevent log overwrites and preserve logs from all parallel forks, consider incorporating a unique identifier (such as the container name, port, or parent directory name) into the destination file name.

      Path targetLogsDir = java.nio.file.Paths.get("target/testbench-logs");
      java.nio.file.Files.createDirectories(targetLogsDir);
      String uniqueId = outFile.getParent() != null ? outFile.getParent().getFileName().toString() : java.util.UUID.randomUUID().toString();
      java.nio.file.Files.copy(outFile, targetLogsDir.resolve(uniqueId + "-" + outFile.getFileName().toString()), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
      java.nio.file.Files.copy(errFile, targetLogsDir.resolve(uniqueId + "-" + errFile.getFileName().toString()), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
      LOGGER.warn("Copied Testbench logs to target directory: {}", targetLogsDir.toAbsolutePath());

@nidhiii-27 nidhiii-27 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 4, 2026
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants