Reservoir coupling: add a slave DATA file parsing error test case #6951
hakonhagland wants to merge 4 commits into OPM:master
Conversation
jenkins build this serial please
    WORKING_DIRECTORY
      ${CMAKE_CURRENT_BINARY_DIR}/tests/rescoup/slave_parse_error
    )
    set_tests_properties(rc_slave_parsing_err PROPERTIES TIMEOUT 30)
I don't think we want to run a test that is supposed to take 30 secs on every ctest invocation on the CI. I appreciate that the test is useful, however. I suggest you add it to an Integration ctest configuration. Then you have to do
ctest -C Integration -V -R rc_slave_parsing_err
to have it executed. We can hook this up to the jenkins trigger so you can selectively run it from there when necessary. I also suggest using the flow binary, since flow_blackoil isn't necessarily built.
> I suggest you add it to an Integration ctest configuration.
@akva2 Good idea, I'll move it to CONFIGURATIONS Integration and switch from flow_blackoil to flow.
Regarding the Jenkins build failure, the CI's OpenMPI detected the slave exit and terminated with exit code 1, which differs from my local OpenMPI (v5.0.8) where the master just hangs. Do you happen to know the OpenMPI version on the CI, and whether mpiexec is invoked with
any specific options (e.g. --mca orte_abort_on_non_zero_status)? Knowing this would help me set the right pass conditions for the test across environments.
It's the default OpenMPI on Ubuntu 24.04, i.e. 4.1.6-7ubuntu2. There are no fancy-schmancy parameters, just a few settings: https://github.com/OPM/opm-utilities/blob/master/opm-ci-builder/scripts/setup_system_options.sh#L5
> I suggest you add it to an Integration ctest configuration.
@akva2 Added a commit to address this. Let me know if I should squash the two commits into one before merging.
Force-pushed 6d8ca87 to f89c9ed
jenkins build this serial please
Please add a link to the OpenMPI bug report.
Adds a CTest integration test (rc_slave_parsing_err) that documents and detects the current hang behavior when a reservoir coupling slave process fails to parse its DATA file. The test consists of a minimal 1x1x1 master deck that spawns one slave whose DATA file contains a deliberate parse error (missing INCLUDE file). The slave exits immediately after the parse failure, but the master blocks on MPI_Recv waiting for the slave's initial data exchange, which never arrives.

Expected behavior (pre-fix) depends on the MPI implementation:

- Custom MPICH (v4.3.2): hydra kills the master immediately (exit code 9)
- System MPICH (v4.2.1): master hangs until the 10s timeout (exit code 124)
- OpenMPI (v5.0.8): master hangs until the 10s timeout (exit code 124)

The test passes for both exit codes (9 and 124), failing only if the master completes successfully (exit 0), which would indicate the slave error was silently swallowed.

New files:

- tests/rescoup/slave_parse_error/RC_MASTER.DATA: minimal master deck
- tests/rescoup/slave_parse_error/slave/RC_SLAVE.DATA: broken slave deck
- tests/rescoup/slave_parse_error/run_test.sh: manual test runner with human-readable output including the slave log
- tests/rescoup/slave_parse_error/run_ctest.sh: automated wrapper that exits 0/1 for use by ctest
- rescoupTests.cmake: CMake registration; also handles the case where MPIEXEC_EXECUTABLE points to a different MPI implementation than the one the binaries were compiled against

Run with:

    ninja test_rc_slave_parsing_err   # copies data to build tree
    ctest -V -R rc_slave_parsing_err
Move rc_slave_parsing_err to CONFIGURATIONS Integration so it only runs with `ctest -C Integration`, not on every CI ctest invocation. Use `flow` instead of `flow_blackoil`, since the latter is not necessarily built.

Simplify the pass condition in run_ctest.sh: any non-zero exit code is now treated as PASS. The CI's OpenMPI 4.1.6 exits with code 1 (it detects the child process death), which differs from the local OpenMPI 5.0.8 (hangs) and the custom MPICH 4.3.2 (exit code 9). The only meaningful invariant is that exit 0 means the master completed despite the slave parse error, which is a failure.
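The simplified pass condition can be sketched as a small shell function. This is a hypothetical illustration, not the actual run_ctest.sh: any non-zero exit code from the timeout-wrapped master run counts as PASS, because exit 0 means the master completed despite the slave parse error.

```shell
# Hypothetical sketch of the simplified pass condition (not the real
# run_ctest.sh): exit 0 from the master is the only failure mode, since
# it means the slave parse error was silently swallowed.
check_result() {
  rc="$1"
  if [ "$rc" -eq 0 ]; then
    echo "FAIL: master completed despite slave parse error"
    return 1
  fi
  echo "PASS: master aborted or timed out (exit code $rc)"
  return 0
}
```

It would be invoked after the master run, e.g. `timeout 10 mpiexec -np 1 flow RC_MASTER.DATA; check_result $?` (launcher and binary names taken from the PR description, the exact invocation is an assumption).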
OpenMPI 5.x registers only the IPv6 address for dual-stack interfaces, causing MPI_Comm_spawn TCP connections to fail when the child connects back via IPv4.

Document the root cause (btl_tcp_component.c first-match address selection) and the workaround (btl_tcp_disable_family=6) in run_ctest.sh. Deduplicate the MPI behavior comments between run_test.sh and run_ctest.sh by having run_test.sh reference run_ctest.sh.
On dual-stack interfaces, OpenMPI 5.x BTL TCP registers only the IPv6 address when built with --enable-ipv6. This can cause MPI_Comm_spawn TCP connections to fail when the child connects from an IPv4 address the parent doesn't recognize. Observed with the Ubuntu 25.10 system package (5.0.8-8ubuntu1) but not reproducible with custom source builds of the same version, suggesting that additional configure flags (--with-verbs, --with-libfabric) affect connection routing at runtime.

Add a source code link to the acknowledged limitation in mca_btl_tcp_create() and document the workaround (btl_tcp_disable_family=6). Deduplicate the MPI behavior comments between run_test.sh and run_ctest.sh.
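The workaround documented in run_ctest.sh can be sketched as a launcher wrapper: disabling the IPv6 address family in OpenMPI's TCP BTL so the spawned slave's IPv4 connect-back matches an address the parent registered. Only the MCA parameter comes from the text above; the function name, launcher invocation, and DRY_RUN switch are illustrative assumptions.

```shell
# Sketch of the btl_tcp_disable_family=6 workaround (hypothetical
# wrapper; only the MCA parameter is taken from the commit message).
# DRY_RUN=1 prints the command instead of executing it, which is also
# how this sketch can be checked without an MPI installation.
run_with_ipv4_only() {
  set -- mpiexec --mca btl_tcp_disable_family 6 -np 1 "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}
```

Example: `DRY_RUN=1 run_with_ipv4_only flow RC_MASTER.DATA` prints the full mpiexec command line without running it.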
Force-pushed 766d067 to cbf212c
@blattms I was not able to find an official bug report. I searched the OpenMPI GitHub issues and the closest are #585 (closed, about IPv6 link-local addresses) and #6960 (open, general IPv4/IPv6 coexistence), but neither describes this specific dual-stack address selection problem with MPI_Comm_spawn.

I built a debug version of OpenMPI 5.0.8 (the same version as the Ubuntu system package) to verify the exact issue, but the debug build failed to reproduce the connection failure — despite registering the same IPv6 address for the WiFi interface. The bug only manifests with the Ubuntu system package (5.0.8-8ubuntu1).

The OpenMPI source code does acknowledge the underlying limitation in mca_btl_tcp_create().
jenkins build this serial please
Summary
- Adds a CTest integration test (rc_slave_parsing_err) that reproduces and detects the hang that occurs when a reservoir coupling slave fails to parse its DATA file
- The slave's DATA file contains a deliberate parse error (a missing INCLUDE file); the slave exits immediately but the master blocks indefinitely on MPI_Recv waiting for the slave's initial data exchange
- Run with ninja test_rc_slave_parsing_err && ctest -R rc_slave_parsing_err
When MPI_Comm_spawn is used to launch a slave process and that slave exits before completing its initial handshake, the master blocks forever on MPI_Recv. The test documents this as the current known behavior and provides a reproducible regression baseline.

Observed behavior varies by MPI implementation:

- Custom MPICH (v4.3.2): hydra kills the master immediately (exit code 9)
- System MPICH (v4.2.1): master hangs until the 10s timeout (exit code 124)
- OpenMPI (v5.0.8): master hangs until the 10s timeout (exit code 124)
The test passes for both outcomes (exit 9 and exit 124), and fails only if the master exits cleanly (exit 0), which would mean the slave parse error was silently ignored.
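The mechanism that turns the pre-fix hang into a testable exit code is worth seeing concretely. This minimal demonstration assumes GNU coreutils timeout (which the 10s limit described above presumably relies on); short durations stand in for the real limit and the hung master.

```shell
# Demonstration (assumption: GNU coreutils timeout) of how a hang
# becomes a testable exit code: timeout kills the still-running command
# at the deadline and itself exits with status 124. Here "sleep 5"
# stands in for the hung master and "1" for the test's 10s limit.
timeout 1 sleep 5
echo "exit code: $?"   # prints "exit code: 124"
```

This is why the hang cases in the table above all surface as exit code 124, while a launcher that kills the job itself (hydra) reports its own code instead.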
Future work
A follow-up PR will fix the hang by having the slave send a "parse failed" signal as its first MPI message after being spawned, before reaching sendAndReceiveInitialData(). The master will check for this signal and call sendTerminateAndDisconnect() before exiting with a clear error. This fix will be applied for both OpenMPI and MPICH so that all implementations produce a clean exit with a meaningful error message. Once the fix lands, this test will be updated to expect exit code 1 on all implementations.

New files
- tests/rescoup/slave_parse_error/RC_MASTER.DATA — minimal 1×1×1 master deck with a single slave reference
- tests/rescoup/slave_parse_error/slave/RC_SLAVE.DATA — slave deck with deliberate parse error
- tests/rescoup/slave_parse_error/run_test.sh — manual runner with human-readable output and slave log display
- tests/rescoup/slave_parse_error/run_ctest.sh — automated wrapper (exits 0/1) used by CTest
- rescoupTests.cmake — CTest registration; includes logic to select the correct MPI launcher when MPIEXEC_EXECUTABLE and MPI_C_COMPILER come from different MPI installations