fix(storage): resolve flaky integration tests and Surefire crashes in CI#13349
fix(storage): resolve flaky integration tests and Surefire crashes in CI#13349nidhiii-27 wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces robustness improvements to Google Cloud Storage writable byte channels and refactors associated tests to use try-with-resources blocks for safer resource management. Key changes include canceling pending reconciliations on stream close, handling stream errors during channel flushes, and preventing double-close operations. Additionally, test logging was modernized, and dynamic port allocation was introduced for the test bench. Feedback on these changes highlights a potential TOCTOU race condition in the new port allocation helper, warns that wrapping caught exceptions in a cancelled status obscures original gRPC error codes, and suggests using unique identifiers for copied test bench logs to prevent parallel test runs from overwriting each other.
| private static int findFreePort() { | ||
| try (java.net.ServerSocket socket = new java.net.ServerSocket(0)) { | ||
| socket.setReuseAddress(true); | ||
| return socket.getLocalPort(); | ||
| } catch (java.io.IOException e) { | ||
| throw new RuntimeException("Failed to allocate a free port", e); | ||
| } | ||
| } |
There was a problem hiding this comment.
The findFreePort() method has a classic Time-of-Check to Time-of-Use (TOCTOU) race condition. Because the socket is closed immediately after retrieving the port, another thread, process, or parallel JVM fork can bind to the same port before the TestBench container starts and binds to it. This can lead to intermittent port collision failures under heavy parallel CI runs.
If TestBench is managed via Testcontainers, it is highly recommended to let Testcontainers dynamically allocate and bind the ports (which avoids the race entirely) and then query the mapped ports using container.getMappedPort(...). If you are manually starting the container via Docker CLI, you can map to port 0 (e.g., -p 0:8080) and inspect the container to find the allocated host port.
| if (stream != null) { | ||
| try { | ||
| stream.onError(io.grpc.Status.CANCELLED.withCause(t).asRuntimeException()); | ||
| } catch (Exception ignored) { | ||
| } | ||
| } |
There was a problem hiding this comment.
Wrapping the caught Throwable t in io.grpc.Status.CANCELLED.withCause(t).asRuntimeException() obscures the original gRPC status code (such as UNAVAILABLE, PERMISSION_DENIED, etc.) that might be present in t. This can break retry policies or error handling downstream that rely on the specific status code of the failure.
Consider passing the original exception t directly to stream.onError(t) to preserve the original error type and status code.
| if (stream != null) { | |
| try { | |
| stream.onError(io.grpc.Status.CANCELLED.withCause(t).asRuntimeException()); | |
| } catch (Exception ignored) { | |
| } | |
| } | |
| if (stream != null) { | |
| try { | |
| stream.onError(t); | |
| } catch (Exception ignored) { | |
| } | |
| } |
| Path targetLogsDir = java.nio.file.Paths.get("target/testbench-logs"); | ||
| java.nio.file.Files.createDirectories(targetLogsDir); | ||
| java.nio.file.Files.copy(outFile, targetLogsDir.resolve(outFile.getFileName()), java.nio.file.StandardCopyOption.REPLACE_EXISTING); | ||
| java.nio.file.Files.copy(errFile, targetLogsDir.resolve(errFile.getFileName()), java.nio.file.StandardCopyOption.REPLACE_EXISTING); | ||
| LOGGER.warn("Copied Testbench logs to target directory: {}", targetLogsDir.toAbsolutePath()); |
There was a problem hiding this comment.
When copying the log files to target/testbench-logs, using outFile.getFileName() (which is likely a generic name like stdout.log or stderr.log) will cause parallel test forks to overwrite each other's log files in the shared target/testbench-logs directory.
To prevent log overwrites and preserve logs from all parallel forks, consider incorporating a unique identifier (such as the container name, port, or parent directory name) into the destination file name.
Path targetLogsDir = java.nio.file.Paths.get("target/testbench-logs");
java.nio.file.Files.createDirectories(targetLogsDir);
String uniqueId = outFile.getParent() != null ? outFile.getParent().getFileName().toString() : java.util.UUID.randomUUID().toString();
java.nio.file.Files.copy(outFile, targetLogsDir.resolve(uniqueId + "-" + outFile.getFileName().toString()), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
java.nio.file.Files.copy(errFile, targetLogsDir.resolve(uniqueId + "-" + errFile.getFileName().toString()), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
LOGGER.warn("Copied Testbench logs to target directory: {}", targetLogsDir.toAbsolutePath());…WritableByteChannelTest
This pull request resolves the severe, intermittent CI flakiness and Maven Surefire JVM crashes in google-cloud-storage under parallel Kokoro environments.
Modifications:
fork-<port>) for each parallel JVM fork inRegistry.javaandTestBench.java.close()to be idempotent and cleanly close outbound observers and cancel backoff executor retry futures.Verified locally with a clean compilation and all tests passing (BUILD SUCCESS).