fix(scheduler): catch StaleDataError in verify_integrity to prevent scheduler crash #64503

Open
kalluripradeep wants to merge 1 commit into apache:main from kalluripradeep:fix/63926-stale-data-error-scheduler-crash

@kalluripradeep
Contributor

When LocalExecutor runs with high parallelism, a race condition can occur:
a task instance is completed/deleted between the time
_check_for_removed_or_restored_tasks loads TIs into the session and
the time session.flush() is called inside _create_task_instances.

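This raises a StaleDataError (an ORM optimistic locking violation:
SQLAlchemy raises it when a flushed UPDATE or DELETE matches a different
number of rows than expected). The error was previously uncaught, so it
crashed the scheduler entirely instead of letting it recover gracefully.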

The key reason it slipped through: StaleDataError is not a
subclass of DBAPIError, so it bypassed both the
except IntegrityError guard in _create_task_instances and the
tenacity retry wrapper in run_with_db_retries.
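
The hierarchy point can be illustrated without a database. The classes below are stand-ins that mirror the relevant shape of SQLAlchemy's exception tree (there, `IntegrityError` derives from `DBAPIError`, while `sqlalchemy.orm.exc.StaleDataError` derives from `SQLAlchemyError` directly). The function names are hypothetical and are not Airflow code; this is only a sketch of why a guard written for `IntegrityError` lets the error through.

```python
# Stand-in classes that mirror the shape of SQLAlchemy's exception tree.
# They are NOT the real SQLAlchemy classes.
class SQLAlchemyError(Exception):
    pass


class DBAPIError(SQLAlchemyError):
    pass


class IntegrityError(DBAPIError):
    pass


class StaleDataError(SQLAlchemyError):  # sibling of DBAPIError, not a child
    pass


def flush_that_goes_stale():
    """Hypothetical flush: the row changed or vanished before it ran."""
    raise StaleDataError("row changed or disappeared before flush")


def guarded_by_integrity_error_only():
    """Mimics a guard that only names IntegrityError."""
    try:
        flush_that_goes_stale()
    except IntegrityError:
        return "handled"
    return "flushed"


def outcome():
    """Report whether the guard caught the error or let it escape."""
    try:
        return guarded_by_integrity_error_only()
    except SQLAlchemyError as exc:
        return f"escaped: {type(exc).__name__}"
```

Calling `outcome()` returns `"escaped: StaleDataError"`: the `except IntegrityError` clause never matches, and a retry filter keyed on `DBAPIError` would not match either.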

Changes:

  • Catch StaleDataError alongside IntegrityError in
    _create_task_instances and roll back the session safely
  • Add StaleDataError to the tenacity retry list in
    run_with_db_retries so the scheduling loop retries the transient
    race condition
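
The two changes can be sketched roughly as follows. This is not the PR diff: the exception classes are stand-ins, `FakeSession` is a hypothetical test double, and `run_with_retries` is a plain loop substituted for the tenacity-based wrapper so the example runs on the standard library alone.

```python
# Stand-in exception classes (not the real SQLAlchemy ones).
class SQLAlchemyError(Exception):
    pass


class DBAPIError(SQLAlchemyError):
    pass


class IntegrityError(DBAPIError):
    pass


class StaleDataError(SQLAlchemyError):
    pass


class FakeSession:
    """Hypothetical session whose first `fail_times` flushes go stale."""

    def __init__(self, fail_times):
        self.fail_times = fail_times
        self.flushes = 0
        self.rollbacks = 0

    def flush(self):
        self.flushes += 1
        if self.flushes <= self.fail_times:
            raise StaleDataError("row changed or disappeared before flush")

    def rollback(self):
        self.rollbacks += 1


def create_task_instances(session):
    """Change 1 (sketch): handle StaleDataError like IntegrityError."""
    try:
        session.flush()
        return True
    except (IntegrityError, StaleDataError):
        session.rollback()  # discard the stale state, keep the loop alive
        return False


# Change 2 (sketch): StaleDataError joins the retryable exception types.
RETRYABLE = (DBAPIError, StaleDataError)


def run_with_retries(fn, attempts=3):
    """Plain-loop stand-in for a retry wrapper; re-raises on the last try."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts:
                raise
```

With `FakeSession(fail_times=1)`, the first `create_task_instances` call rolls back and returns `False`, and a second call succeeds. With `FakeSession(fail_times=2)`, `run_with_retries(session.flush)` succeeds on the third attempt.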

Tests added:

  • test_verify_integrity_handles_stale_data_error — verifies
    StaleDataError during session.flush() is caught and
    session.rollback() is called
  • test_retry_db_transaction_with_stale_data_error — verifies
    StaleDataError is retried 3 times by run_with_db_retries
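
The shape of these two checks can be approximated with `unittest.mock`. The sketch below is not the test code from this PR: `StaleDataError` is a stand-in class, and `flush_or_rollback` and `call_with_retries` are hypothetical helpers standing in for the code under test.

```python
from unittest import mock


class StaleDataError(Exception):
    """Stand-in for sqlalchemy.orm.exc.StaleDataError."""


def flush_or_rollback(session):
    """Hypothetical code under test: roll back when the flush goes stale."""
    try:
        session.flush()
    except StaleDataError:
        session.rollback()


def call_with_retries(fn, attempts=3):
    """Hypothetical retry helper (expects attempts >= 1)."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except StaleDataError as exc:
            last_exc = exc
    raise last_exc


def check_rollback_on_stale_flush():
    """A flush that raises should lead to exactly one rollback."""
    session = mock.MagicMock()
    session.flush.side_effect = StaleDataError("stale")
    flush_or_rollback(session)
    session.rollback.assert_called_once()
    return session.rollback.call_count


def check_retried_three_times():
    """A callable that always fails should be attempted three times."""
    fn = mock.Mock(side_effect=StaleDataError("stale"))
    try:
        call_with_retries(fn, attempts=3)
    except StaleDataError:
        pass
    return fn.call_count
```

Setting `side_effect` to an exception instance makes the mock raise on every call, which is what lets both the rollback count and the attempt count be asserted.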

Fixes #63926

…cheduler crash

When LocalExecutor runs with high parallelism, a task instance can be
completed/deleted between the time _check_for_removed_or_restored_tasks
loads TIs into the session and the time session.flush() is called in
_create_task_instances. This causes a StaleDataError (ORM optimistic
locking violation) that was previously uncaught, crashing the scheduler.

StaleDataError is not a subclass of DBAPIError, so it bypassed both the
except IntegrityError guard in _create_task_instances and the tenacity
retry wrapper in run_with_db_retries.

Fix by:
1. Catching StaleDataError alongside IntegrityError in _create_task_instances
   and rolling back the session safely
2. Adding StaleDataError to the tenacity retry list in run_with_db_retries
   so the scheduling loop retries the transient race condition

Tests added:
- test_verify_integrity_handles_stale_data_error
- test_retry_db_transaction_with_stale_data_error

Fixes apache#63926
@kalluripradeep kalluripradeep force-pushed the fix/63926-stale-data-error-scheduler-crash branch from 20bea6c to e106b50 Compare March 30, 2026 20:15
@kalluripradeep
Contributor Author

Hey @ashb / @XD-DENG — one Helm test job failed due to a GitHub infra fluke (TCP connection reset while downloading the CI image artifact, not related to code changes). Could you re-run that failed job when you get a chance? Thanks!

@XD-DENG
Member

XD-DENG commented Mar 30, 2026

Triggered a re-run of the failed CI job.


Development

Successfully merging this pull request may close these issues.

[Airflow 3.x] Scheduler Crash: StaleDataError in verify_integrity during high-frequency DAG runs (~5k runs/8hrs)

2 participants