Throw from readDeck() instead of std::exit()#6935
Conversation
|
jenkins build this please |
|
I still maintain that the unit test justification is just wrong. You should never add a unit test that has parse errors. |
Yes, but in my opinion during development it will be helpful to have this to catch deck errors quickly. |
And you're making everything more difficult for the slight convenience of something that happens very infrequently. You should use other tools when developing. |
|
The implementation here is better than in #6933 as it is less intrusive, but the MPI cleanup is worse: We do want cooperative cleanup and MPI_Finalize() when we can, instead of MPI_Abort(). So the code inside readDeck.cpp doing this is exactly as is should be, unlike the code in Main.hpp that just aborts! I agree that we SHOULD move the finalize and exit out of readDeck.cpp, that function may instead return a status flag or similar. But in the end all processes must communicate to cooperatively finalize and end, as would happen in the current master. The implementation in this PR would cooperate to establish that there is an error, all would throw, then whichever process reaches MPI_Abort() first gets everyone killed. |
|
One more on this: The destructor of Main performs If you look at the no-MPI version, that is exactly what happens! I believe that is actually correct and that we should remove the |
Replace MPI_Finalize + std::exit(EXIT_FAILURE) with throw std::runtime_error in readDeck() on parse failure. This lets callers handle the error: Main::initialize_() catches it and calls MPI_Abort(), while Boost.Test catches it and reports a proper test failure instead of a silent exit. Widen the catch clause in Main::initialize_() from std::invalid_argument to std::exception to cover both existing and new exception types.
Remove MPI_Abort() from Main::initialize_() catch block. All exceptions from readDeck() are synchronized across ranks via comm.min(), so returning false is safe and lets Main::~Main() call MPI_Finalize() for cooperative shutdown.
9a8e11b to
98dafe2
Compare
@atgeirr Good suggestion. I've added a commit that removes Let me know if I should squash the two commits into one before merging. |
|
jenkins build this please |
|
This is better, although I think we should try one more improvement: the catch in Main.hpp does not have an associated communication to ensure all ranks stop if only one throws. The mechanism in place in readDeck.cpp ensure that for those failures, all ranks throw and will enter the catch block. However, other exceptions may still only be thrown on only one or some ranks (I do not have a particular candidate, but it would include anything the catch on (old) line 322 was designed to handle). Therefore that code must also have a comm.min() etc. to ensure cooperative shutdown, just as the code in readDeck.cpp has. Maybe that also allows us to remove that handling from readDeck.cpp, in total essentially moving it to Main.hpp? I have not thought it through completely, but it would be nice to handle the communication of failures as far out as possible. |
Note that the The natural synchronization point is inside The rank-0-only exceptions during parsing (e.g., schedule consistency checks, satellite group validation) are caught by an inner try-catch inside |
| // exceptions are caught inside readDeck() and converted to a | ||
| // synchronized throw via comm.min()). Returning false lets | ||
| // Main::~Main() call MPI_Finalize() for cooperative shutdown. | ||
| catch (const std::exception& e) |
There was a problem hiding this comment.
I would prefer two catch clauses with different specific error messages. std::exception can really be anything and does not have to be related to EclipseState.
| #if HAVE_MPI | ||
| MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE); | ||
| #endif |
There was a problem hiding this comment.
I think this is problematic (in combination with std::exception) as long as we do not know what state the system is in. With our current exception handling we do not know that as we only catch vanilla exceptions from the std library. E.g. memory might be exhausted, a fatal unrecoverable mpi problem has happened, a network problem etc.
If you only catch std::exception then actually MPI_Abort makes sense. If you you have no information about what happened, then you need to take the last available weapon that might somehow kill the application. If you want to get partly rid of this then you need to carefully refactor to use custom exceptions and act upon them accordingly. This would be a rather huge task and you would still use MPI_Abort on occasions.
Even if we argue that on an ideal system nothing will happen and the current code does makes sure exceptions are propagated to all ranks, the latter might easily change and boom we have a deadlock.
I have read all the arguments of the two different PRs. To me the other one seems less risky as it just adds some special casing. I have spent quite some time fixing deadlocks and problems in this area and would rather avoid us from spending more time.
There was a problem hiding this comment.
I would prefer two catch clauses with different specific error messages. std::exception can really be anything and does not have to be related to EclipseState.
@blattms Good point. I've added a commit that splits the catch into two clauses:
-
catch (const std::runtime_error& e): for synchronizedreadDeck()parse failures. These are safe for cooperative shutdown viareturn falsesince all ranks throw together (synchronized viacomm.min()insidereadDeck()). -
catch (const std::exception& e): safety net for unexpected exceptions (e.g.,std::bad_alloc, network errors) that may not be synchronized across ranks. This retainsMPI_Abort()to prevent deadlocks.
Each clause also has its own error message: "Failed to create valid EclipseState object" for parse failures vs. "Unexpected error during initialization" for other exceptions.
Use std::runtime_error catch for synchronized readDeck() parse failures with cooperative shutdown via return false. Keep std::exception catch with MPI_Abort() as safety net for unexpected unsynchronized exceptions (e.g. bad_alloc) that could cause deadlocks if not forcefully terminated.
|
jenkins build this please |
|
Seems like everything has been addressed here. Still some reservations but no real objections. |
|
jenkins build this serial please |
Alternative to #6933.
Problem
When
readDeck()encounters unrecoverable parse errors, it callsMPI_Finalize()+std::exit(EXIT_FAILURE). This is problematic because:std::exit()bypasses Boost.Test, producing exit code 1 with no test failure reportMPI_Finalize()requires cooperative shutdown from all ranks; on an error pathMPI_Abort()is more appropriateSolution
readDeck.cpp: ReplaceMPI_Finalize()+std::exit(EXIT_FAILURE)withthrow std::runtime_error(failureMessage)Main.hpp: Widen catch inMain::initialize_()fromstd::invalid_argumenttostd::exceptionso it catches both existing and new exception types. The existingMPI_Abort()call in the catch block handles MPI cleanup correctly.Impact
MPI_Finalize+std::exit(1)Main::initialize_(),MPI_Abort, return falsestd::runtime_error, reports failure with messageCloses #6933