@janhoy
Contributor

@janhoy janhoy commented Dec 18, 2025

Analysis:

Root Cause

LeaderTragicEventTest fails during class-level shutdown when Jetty's Server.doStop() exceeds its internal timeout and throws ExecutionException(TimeoutException). After tragic events corrupt cores, shutdown naturally takes longer and can time out; this is expected behavior, not a test failure. See the Develocity logs here.

Screenshot 2025-12-18 at 13.05.34

Fix

Added shutdownTimeoutIsError configuration to MiniSolrCloudCluster:

  • Default: true - normal tests fail on unexpected timeouts
  • LeaderTragicEventTest: false - accepts timeouts as expected outcome

Implementation:

  • Added a 60s timeout to the shutdown process (2x Jetty's internal timeout)
  • checkForExceptions() treats ExecutionException(TimeoutException) as a warning when shutdownTimeoutIsError=false
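
The filtering rule described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual MiniSolrCloudCluster code: the class `ShutdownExceptionCheck` and the methods `isShutdownTimeout` and `failuresToReport` are hypothetical names, and only the `shutdownTimeoutIsError` flag and the ExecutionException(TimeoutException) shape come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the check; not the real MiniSolrCloudCluster code. */
class ShutdownExceptionCheck {

  /** True if the failure is an ExecutionException caused by a TimeoutException. */
  static boolean isShutdownTimeout(Throwable t) {
    return t instanceof ExecutionException && t.getCause() instanceof TimeoutException;
  }

  /**
   * Returns the failures that should fail the test. Shutdown timeouts are
   * downgraded to a warning only when shutdownTimeoutIsError is false.
   */
  static List<Throwable> failuresToReport(List<Throwable> failures, boolean shutdownTimeoutIsError) {
    List<Throwable> errors = new ArrayList<>();
    for (Throwable t : failures) {
      if (!shutdownTimeoutIsError && isShutdownTimeout(t)) {
        // Tolerated: log and move on instead of failing the test.
        System.err.println("WARN: node shutdown timed out (tolerated): " + t.getCause());
      } else {
        errors.add(t);
      }
    }
    return errors;
  }

  public static void main(String[] args) {
    List<Throwable> failures = List.of(
        new ExecutionException(new TimeoutException("jetty stop")),
        new IllegalStateException("unrelated failure"));
    System.out.println("strict mode errors:  " + failuresToReport(failures, true).size());
    System.out.println("lenient mode errors: " + failuresToReport(failures, false).size());
  }
}
```

Note that in this sketch any other exception still counts as an error in both modes, so the flag only relaxes the one timeout case.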

https://issues.apache.org/jira/browse/SOLR-18025

@github-actions github-actions bot added dependencies Dependency upgrades tool:build labels Dec 18, 2025
@github-actions github-actions bot added the tests label Dec 18, 2025
@github-actions github-actions bot removed dependencies Dependency upgrades tool:build labels Dec 18, 2025
@janhoy janhoy requested a review from dsmiley December 18, 2025 12:08
Contributor

@dsmiley dsmiley left a comment

Was an analysis done as to the state of the various threads when the test timed out (I'm assuming test timeout was the ultimate symptom)? Hopefully it would show a clue as to a thread busy or waiting that is preventing the node it lives on from shutting down.
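
For illustration, one way to capture thread states from inside the JVM is sketched below, using only the standard `Thread.getAllStackTraces()` call. The class `ThreadStateDump` and its `dump()` method are hypothetical names, and where a test hook would call it during shutdown is an assumption. Running the JDK's `jstack` against the hung process should give comparable information.

```java
import java.util.Map;

/** Hypothetical helper: renders the state and stack of every live thread. */
class ThreadStateDump {

  static String dump() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
      Thread t = e.getKey();
      // The thread state (RUNNABLE, WAITING, BLOCKED, ...) hints at who is stuck.
      sb.append('"').append(t.getName()).append('"')
          .append(" state=").append(t.getState())
          .append(" daemon=").append(t.isDaemon())
          .append('\n');
      for (StackTraceElement frame : e.getValue()) {
        sb.append("    at ").append(frame).append('\n');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(dump());
  }
}
```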

A few weeks ago, I noticed another test (ugh, I forget which) reliably taking a long time to shut down (I forget if it led to a failure or not) and partially root caused it in this way. I have a shelved change to ZkContainer.close() to call shutdownNowAndAwaitTermination (with the "Now" in there, which wasn't there before). I noticed a test trying to shut down had cores that were stuck registering in ZK for some reason. I suppose that's unrelated to the failure here but without seeing the threads -- who knows.
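
To show what the "Now" changes, here is a plain-JDK sketch of the idea. It is not Solr's own helper and not the shelved ZkContainer change; `ExecutorShutdownSketch` and `shutdownNowAndAwait` are hypothetical names. Unlike `shutdown()`, `shutdownNow()` interrupts running tasks and discards queued ones, so a task blocked in an interruptible wait gets a chance to exit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not Solr's helper of a similar name. */
class ExecutorShutdownSketch {

  /** Interrupts running tasks, then waits up to the timeout; true if the pool terminated. */
  static boolean shutdownNowAndAwait(ExecutorService pool, long timeout, TimeUnit unit) {
    pool.shutdownNow(); // interrupt workers and drop queued tasks
    try {
      return pool.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return false;
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Simulates a task that would otherwise block shutdown for a minute.
    pool.submit(() -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    System.out.println("terminated: " + shutdownNowAndAwait(pool, 5, TimeUnit.SECONDS));
  }
}
```

This only helps when the stuck task responds to interruption; a thread blocked in a non-interruptible call would still hold up termination.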

@janhoy
Contributor Author

janhoy commented Dec 18, 2025

> Was an analysis done as to the state of the various threads when the test timed out

No, I have not dug into the cause of the hung nodes. I appreciate that all these failures may be symptoms of a real bug that prevents Solr from shutting down gracefully and giving up control / releasing ZK.

I'll mark this as a draft and allow more time to fix the root cause instead of the symptom, then...

@janhoy janhoy marked this pull request as draft December 18, 2025 21:39