Reservoir coupling: add a slave DATA file parsing error test case#6951

Open
hakonhagland wants to merge 4 commits into OPM:master from hakonhagland:rc_testcase

Conversation

@hakonhagland
Contributor

Summary

  • Adds a minimal test case (rc_slave_parsing_err) that reproduces and detects the hang that occurs when a reservoir coupling slave fails to parse its DATA file
  • The master spawns a slave whose deck contains a deliberate parse error (missing INCLUDE file); the slave exits immediately but the master blocks indefinitely on MPI_Recv waiting for the slave's initial data exchange
  • Registers the test as a proper CTest so it runs in CI and can be executed with `ninja test_rc_slave_parsing_err && ctest -R rc_slave_parsing_err`

Background

When MPI_Comm_spawn is used to launch a slave process and that slave exits before completing its initial handshake, the master blocks forever on MPI_Recv. The test documents this as the current known behavior and provides a reproducible regression baseline.
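The exit codes the test checks for build on the standard `timeout(1)` convention: a process killed by `timeout` exits with code 124. A minimal stand-in (no MPI involved) illustrates the convention:

```shell
# 'sleep 5' stands in for a master process blocked forever in MPI_Recv;
# GNU timeout kills it after 1 second and reports exit code 124.
timeout 1 sleep 5
echo "exit code: $?"   # prints "exit code: 124"
```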

Observed behavior varies by MPI implementation:

| MPI                 | Master outcome              | Exit code |
|---------------------|-----------------------------|-----------|
| Custom MPICH v4.3.2 | Killed by hydra immediately | 9         |
| System MPICH v4.2.1 | Hangs until timeout         | 124       |
| OpenMPI v5.0.8      | Hangs until timeout         | 124       |

The test passes for both outcomes (exit 9 and exit 124), and fails only if the master exits cleanly (exit 0), which would mean the slave parse error was silently ignored.
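This pass condition could be expressed in shell roughly as follows; the function name `check_exit` is illustrative and not taken from run_ctest.sh:

```shell
#!/bin/sh
# Hypothetical sketch of the pass condition described above:
# exit 9 (killed by hydra) and 124 (killed by timeout) both count as PASS;
# exit 0 means the slave parse error was silently ignored -> FAIL.
check_exit() {
    case "$1" in
        9|124) echo "PASS" ;;
        0)     echo "FAIL: master ignored slave parse error" ;;
        *)     echo "FAIL: unexpected exit code $1" ;;
    esac
}

check_exit 124   # prints "PASS"
check_exit 0     # prints "FAIL: master ignored slave parse error"
```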

Future work

A follow-up PR will fix the hang by having the slave send a "parse failed" signal as its first MPI message after being spawned, before reaching sendAndReceiveInitialData(). The master will check for this signal and call sendTerminateAndDisconnect() before exiting with a clear error. This fix will be applied for both OpenMPI and MPICH so that all implementations produce a clean exit with a meaningful error message. Once the fix lands, this test will be updated to expect exit code 1 on all implementations.

New files

  • tests/rescoup/slave_parse_error/RC_MASTER.DATA — minimal 1×1×1 master deck with a single slave reference
  • tests/rescoup/slave_parse_error/slave/RC_SLAVE.DATA — slave deck with deliberate parse error
  • tests/rescoup/slave_parse_error/run_test.sh — manual runner with human-readable output and slave log display
  • tests/rescoup/slave_parse_error/run_ctest.sh — automated wrapper (exits 0/1) used by CTest
  • rescoupTests.cmake — CTest registration; includes logic to select the correct MPI launcher when MPIEXEC_EXECUTABLE and MPI_C_COMPILER come from different MPI installations
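Based on the snippet quoted in the review thread and standard CTest commands, the registration in rescoupTests.cmake presumably looks something like this sketch (the WORKING_DIRECTORY and TIMEOUT values appear in the review excerpt; the other arguments are assumptions):

```cmake
# Sketch of the CTest registration; exact paths and arguments are assumptions.
add_test(NAME rc_slave_parsing_err
         COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/tests/rescoup/slave_parse_error/run_ctest.sh
         WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/tests/rescoup/slave_parse_error)
set_tests_properties(rc_slave_parsing_err PROPERTIES TIMEOUT 30)
```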

@hakonhagland hakonhagland added the manual:irrelevant This PR is a minor fix and should not appear in the manual label Mar 19, 2026
@hakonhagland
Contributor Author

jenkins build this serial please

Comment thread rescoupTests.cmake
WORKING_DIRECTORY
${CMAKE_CURRENT_BINARY_DIR}/tests/rescoup/slave_parse_error
)
set_tests_properties(rc_slave_parsing_err PROPERTIES TIMEOUT 30)
Member

@akva2 akva2 Mar 19, 2026

I don't think we want to run a test that is supposed to take 30 secs on every ctest invocation on the CI. I appreciate that the test is useful, however. I suggest you add it to an Integration ctest configuration. Then you have to run

 ctest -C Integration -V -R rc_slave_parsing_err

to have it executed. We can hook this up to the jenkins trigger so you can selectively run it from there when necessary. I also suggest using the flow binary, since flow_blackoil isn't necessarily built.
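Restricting a test to an Integration configuration is done with the CONFIGURATIONS option of CMake's add_test; a minimal sketch (test name and script path taken from the PR, the rest is illustrative):

```cmake
# With CONFIGURATIONS set, a plain `ctest` run skips this test;
# it only executes under `ctest -C Integration`.
add_test(NAME rc_slave_parsing_err
         COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/tests/rescoup/slave_parse_error/run_ctest.sh
         CONFIGURATIONS Integration)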

Contributor Author


> I suggest you add it to an Integration ctest configuration.

@akva2 Good idea, I'll move it to CONFIGURATIONS Integration and switch from flow_blackoil to flow.

Regarding the Jenkins build failure, the CI's OpenMPI detected the slave exit and terminated with exit code 1, which differs from my local OpenMPI (v5.0.8) where the master just hangs. Do you happen to know the OpenMPI version on the CI, and whether mpiexec is invoked with any specific options (e.g. --mca orte_abort_on_non_zero_status)? Knowing this would help me set the right pass conditions for the test across environments.

Member


It's the default OpenMPI on Ubuntu 24.04, i.e. 4.1.6-7ubuntu2. There are no fancy-schmancy parameters, just a few settings: https://github.com/OPM/opm-utilities/blob/master/opm-ci-builder/scripts/setup_system_options.sh#L5

Contributor Author


> I suggest you add it to an Integration ctest configuration.

@akva2 Added a commit to address this. Let me know if I should squash the two commits into one before merging.

@hakonhagland
Contributor Author

jenkins build this serial please

@hakonhagland
Contributor Author

jenkins build this serial please

@blattms
Member

blattms commented Mar 24, 2026

Please add a link to the OpenMPI bug report.

Commit messages

Adds a CTest integration test (rc_slave_parsing_err) that documents
and detects the current hang behavior when a reservoir coupling slave
process fails to parse its DATA file.

The test consists of a minimal 1x1x1 master deck that spawns one slave
whose DATA file contains a deliberate parse error (missing INCLUDE
file). The slave exits immediately after the parse failure, but the
master blocks on MPI_Recv waiting for the slave's initial data
exchange, which never arrives.

Expected behavior (pre-fix) depends on the MPI implementation:
- Custom MPICH (v4.3.2): hydra kills master immediately (exit code 9)
- System MPICH (v4.2.1): master hangs until 10s timeout (exit code 124)
- OpenMPI (v5.0.8): master hangs until 10s timeout (exit code 124)

The test passes for both exit codes (9 and 124), failing only if the
master completes successfully (exit 0), which would indicate the slave
error was silently swallowed.

New files:
- tests/rescoup/slave_parse_error/RC_MASTER.DATA: minimal master deck
- tests/rescoup/slave_parse_error/slave/RC_SLAVE.DATA: broken slave deck
- tests/rescoup/slave_parse_error/run_test.sh: manual test runner with
  human-readable output including slave log
- tests/rescoup/slave_parse_error/run_ctest.sh: automated wrapper that
  exits 0/1 for use by ctest
- rescoupTests.cmake: CMake registration; also handles the case where
  MPIEXEC_EXECUTABLE points to a different MPI implementation than the
  one the binaries were compiled against

Run with:
  ninja test_rc_slave_parsing_err  # copies data to build tree
  ctest -V -R rc_slave_parsing_err

Move rc_slave_parsing_err to CONFIGURATIONS Integration so it
only runs with `ctest -C Integration`, not on every CI ctest
invocation. Use `flow` instead of `flow_blackoil` since it is
not necessarily built.

Simplify the pass condition in run_ctest.sh: any non-zero exit
code is now treated as PASS. The CI's OpenMPI 4.1.6 exits with
code 1 (detects child process death), which differs from local
OpenMPI 5.0.8 (hangs) and custom MPICH 4.3.2 (exit code 9).
The only meaningful invariant is that exit 0 means the master
completed despite the slave parse error, which is a failure.
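The simplified invariant reduces to a single non-zero check, sketched here (loop and variable names are illustrative, not taken from run_ctest.sh):

```shell
#!/bin/sh
# Hypothetical sketch of the simplified pass condition: any non-zero
# exit code (1, 9, or 124) passes; only a clean exit 0 fails.
for rc in 0 1 9 124; do
    if [ "$rc" -ne 0 ]; then
        echo "rc=$rc: PASS"
    else
        echo "rc=$rc: FAIL (master ignored slave parse error)"
    fi
done
```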

OpenMPI 5.x registers only the IPv6 address for dual-stack
interfaces, causing MPI_Comm_spawn TCP connections to fail
when the child connects back via IPv4. Document the root
cause (btl_tcp_component.c first-match address selection)
and workaround (btl_tcp_disable_family=6) in run_ctest.sh.
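The workaround can be applied per invocation with `mpiexec --mca`, or via OpenMPI's standard `OMPI_MCA_*` environment-variable convention for MCA parameters; a sketch (the flow command line is illustrative):

```shell
#!/bin/sh
# Option 1: pass the MCA parameter on the mpiexec command line
# (not executed here, since it needs an MPI installation):
#   mpiexec --mca btl_tcp_disable_family 6 -np 1 flow RC_MASTER.DATA

# Option 2: OpenMPI reads any MCA parameter from an OMPI_MCA_<name> variable.
export OMPI_MCA_btl_tcp_disable_family=6
echo "workaround active: OMPI_MCA_btl_tcp_disable_family=$OMPI_MCA_btl_tcp_disable_family"
```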

Deduplicate the MPI behavior comments between run_test.sh
and run_ctest.sh by having run_test.sh reference run_ctest.sh.

On dual-stack interfaces, OpenMPI 5.x BTL TCP registers only
the IPv6 address when built with --enable-ipv6. This can cause
MPI_Comm_spawn TCP connections to fail when the child connects
from an IPv4 address the parent doesn't recognize.

Observed with the Ubuntu 25.10 system package (5.0.8-8ubuntu1)
but not reproducible with custom source builds of the same
version, suggesting additional configure flags (--with-verbs,
--with-libfabric) affect connection routing at runtime.

Add source code link to the acknowledged limitation in
mca_btl_tcp_create() and document the workaround
(btl_tcp_disable_family=6). Deduplicate MPI behavior comments
between run_test.sh and run_ctest.sh.
@hakonhagland
Contributor Author

> Please add a link to the OpenMPI bug report.

@blattms I was not able to find an official bug report. I searched the OpenMPI GitHub issues and the closest are #585 (closed, about IPv6 link-local addresses) and #6960 (open, general IPv4/IPv6 coexistence), but neither describes this specific dual-stack address selection problem with MPI_Comm_spawn.

I built a debug version of OpenMPI 5.0.8 (the same version as the Ubuntu system package) to verify the exact issue, but the debug build failed to reproduce the connection failure — despite registering the same IPv6 address for the WiFi interface. The bug only manifests with the Ubuntu system package (5.0.8-8ubuntu1), which is built with --enable-ipv6 --with-verbs --with-libfabric --with-psm --with-psm2 --with-ucx. Custom source builds (with and without --enable-ipv6, with and without --enable-debug) all connect successfully, suggesting additional runtime factors from the transport providers affect connection routing.

The OpenMPI source code does acknowledge the underlying limitation in mca_btl_tcp_create() — the address selection loop takes the first address per interface regardless of which address family matched the if_include filter. I've added this source link to the comment in run_ctest.sh.

@hakonhagland
Contributor Author

jenkins build this serial please

