fix(scheduler): catch StaleDataError in verify_integrity to prevent scheduler crash #64503
kalluripradeep wants to merge 1 commit into apache:main
…cheduler crash

When LocalExecutor runs with high parallelism, a task instance can be completed/deleted between the time _check_for_removed_or_restored_tasks loads TIs into the session and the time session.flush() is called in _create_task_instances. This causes a StaleDataError (ORM optimistic locking violation) that was previously uncaught, crashing the scheduler.

StaleDataError is not a subclass of DBAPIError, so it bypassed both the except IntegrityError guard in _create_task_instances and the tenacity retry wrapper in run_with_db_retries.

Fix by:
1. Catching StaleDataError alongside IntegrityError in _create_task_instances and rolling back the session safely
2. Adding StaleDataError to the tenacity retry list in run_with_db_retries so the scheduling loop retries the transient race condition

Tests added:
- test_verify_integrity_handles_stale_data_error
- test_retry_db_transaction_with_stale_data_error

Fixes apache#63926
Triggered a re-run of the failed CI job.
When LocalExecutor runs with high parallelism, a race condition can occur: a task instance is completed/deleted between the time _check_for_removed_or_restored_tasks loads TIs into the session and the time session.flush() is called inside _create_task_instances. This raises a StaleDataError (SQLAlchemy ORM optimistic locking violation), which was previously uncaught, crashing the scheduler entirely instead of recovering gracefully.
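The race can be illustrated with a minimal sketch that uses only the standard library. It is not Airflow or SQLAlchemy code: the `task_instance` table layout, the `flush_state_update` helper, and the local `StaleDataError` class are hypothetical stand-ins. The sketch assumes the ORM detects staleness by comparing the number of rows an UPDATE matched against the number it expected.

```python
import sqlite3


class StaleDataError(Exception):
    """Hypothetical stand-in for sqlalchemy.orm.exc.StaleDataError."""


def flush_state_update(conn, ti_id, new_state):
    """Emulate an ORM flush: the UPDATE is expected to match exactly one row."""
    cur = conn.execute(
        "UPDATE task_instance SET state = ? WHERE id = ?", (new_state, ti_id)
    )
    if cur.rowcount != 1:
        raise StaleDataError(
            f"expected to update 1 row; {cur.rowcount} were matched"
        )


def simulate_race():
    """Load a row, delete it "concurrently", then try to flush an update."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
    conn.execute("INSERT INTO task_instance (id, state) VALUES (1, 'scheduled')")

    # Step 1: the scheduler loads the task instance into its session.
    loaded_id = conn.execute("SELECT id FROM task_instance").fetchone()[0]

    # Step 2: a worker finishes and the row disappears before the flush.
    conn.execute("DELETE FROM task_instance WHERE id = ?", (loaded_id,))

    # Step 3: the flush tries to update a row that no longer exists.
    message = None
    try:
        flush_state_update(conn, loaded_id, "removed")
    except StaleDataError as exc:
        message = str(exc)
    conn.close()
    return message
```

Calling `simulate_race()` returns the error message rather than `None`, showing that the update matched zero rows.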
The key reason it slipped through: StaleDataError is not a subclass of DBAPIError, so it bypassed both the except IntegrityError guard in _create_task_instances and the tenacity retry wrapper in run_with_db_retries.

Changes:
1. Catch StaleDataError alongside IntegrityError in _create_task_instances and roll back the session safely
2. Add StaleDataError to the tenacity retry list in run_with_db_retries so the scheduling loop retries the transient race condition
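The two changes can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: the exception classes are local stand-ins that mirror the relevant part of SQLAlchemy's hierarchy, `FakeSession` is a hypothetical substitute for a real session, and `run_with_retries` is a hand-rolled loop swapped in for the tenacity-based run_with_db_retries.

```python
# Local stand-ins mirroring the relevant SQLAlchemy exception hierarchy.
class SQLAlchemyError(Exception): ...
class DBAPIError(SQLAlchemyError): ...
class IntegrityError(DBAPIError): ...
class StaleDataError(SQLAlchemyError): ...  # note: NOT a DBAPIError


class FakeSession:
    """Hypothetical session whose flush() raises a preset error."""

    def __init__(self, flush_error=None):
        self.flush_error = flush_error
        self.rolled_back = False

    def flush(self):
        if self.flush_error is not None:
            raise self.flush_error

    def rollback(self):
        self.rolled_back = True


def create_task_instances_old(session):
    """Old guard: only IntegrityError is caught, so StaleDataError escapes."""
    try:
        session.flush()
    except IntegrityError:
        session.rollback()


def create_task_instances_fixed(session):
    """Fixed guard: StaleDataError is caught too and the session rolled back."""
    try:
        session.flush()
    except (IntegrityError, StaleDataError):
        session.rollback()


# Assumed retry list: DBAPIError (as before) plus StaleDataError.
RETRYABLE = (DBAPIError, StaleDataError)


def run_with_retries(func, max_attempts=3):
    """Call func, retrying on retryable errors; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == max_attempts:
                raise
```

With `FakeSession(StaleDataError("stale"))`, the old guard lets the exception propagate, while the fixed guard swallows it and sets `rolled_back`.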
Tests added:
- test_verify_integrity_handles_stale_data_error: verifies StaleDataError during session.flush() is caught and session.rollback() is called
- test_retry_db_transaction_with_stale_data_error: verifies StaleDataError is retried 3 times by run_with_db_retries

Fixes #63926
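The shape of the two tests might look roughly like the sketch below, which uses `unittest.mock` from the standard library. The functions under test (`create_task_instances`, `call_with_retries`) and the `StaleDataError` class are hypothetical stand-ins defined here so the example is self-contained. The tests in the PR exercise the real Airflow code and may be structured differently.

```python
from unittest import mock


class StaleDataError(Exception):
    """Hypothetical stand-in for sqlalchemy.orm.exc.StaleDataError."""


def create_task_instances(session):
    """Stand-in for the guarded flush: roll back when the flush goes stale."""
    try:
        session.flush()
    except StaleDataError:
        session.rollback()


def call_with_retries(func, attempts=3):
    """Stand-in retry wrapper: try up to `attempts` times, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except StaleDataError:
            if attempt == attempts:
                raise


def check_flush_error_triggers_rollback():
    """A StaleDataError raised by flush() should lead to exactly one rollback."""
    session = mock.MagicMock()
    session.flush.side_effect = StaleDataError("0 rows matched")
    create_task_instances(session)  # must not raise
    session.rollback.assert_called_once()
    return True


def check_retried_three_times():
    """A function that always fails should be attempted three times in total."""
    func = mock.MagicMock(side_effect=StaleDataError("0 rows matched"))
    try:
        call_with_retries(func)
    except StaleDataError:
        pass  # expected once the attempts are exhausted
    return func.call_count
```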