fix(scheduler): catch StaleDataError in verify_integrity to prevent scheduler crash by kalluripradeep · Pull Request #64503 · apache/airflow

kalluripradeep · 2026-03-30T19:09:09Z

When LocalExecutor runs with high parallelism, a race condition can occur:
a task instance is completed/deleted between the time
_check_for_removed_or_restored_tasks loads TIs into the session and
the time session.flush() is called inside _create_task_instances.

This raises a StaleDataError (SQLAlchemy ORM optimistic locking
violation) which was previously uncaught — crashing the scheduler
entirely instead of recovering gracefully.

The key reason it slipped through: StaleDataError is not a
subclass of DBAPIError, so it bypassed both the
except IntegrityError guard in _create_task_instances and the
tenacity retry wrapper in run_with_db_retries.
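For illustration, here is a minimal sketch of why the old guard missed the error. The classes are stand-ins that mirror the relevant part of SQLAlchemy's exception hierarchy (the real ones live in `sqlalchemy.exc` and `sqlalchemy.orm.exc`). The helper names are hypothetical and not part of Airflow.

```python
# Stand-ins mirroring SQLAlchemy's hierarchy: IntegrityError descends from
# DBAPIError, while StaleDataError hangs directly off SQLAlchemyError.
class SQLAlchemyError(Exception): ...
class DBAPIError(SQLAlchemyError): ...
class IntegrityError(DBAPIError): ...
class StaleDataError(SQLAlchemyError): ...


def flush_guarded_by_integrity_error(flush):
    """Hypothetical helper: call flush() with only the pre-fix guard in place."""
    try:
        flush()
    except IntegrityError:
        return "handled"
    return "ok"


def raise_integrity():
    raise IntegrityError("duplicate key")


def raise_stale():
    raise StaleDataError("row changed or vanished before flush")


def outcome(flush):
    """Report whether the guard handled the error or let it escape."""
    try:
        return flush_guarded_by_integrity_error(flush)
    except StaleDataError:
        return "escaped"
```

Under these assumptions, `outcome(raise_integrity)` is handled by the guard, while `outcome(raise_stale)` escapes it, which is the path that took the scheduler down.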

Changes:

- Catch StaleDataError alongside IntegrityError in _create_task_instances and roll back the session safely
- Add StaleDataError to the tenacity retry list in run_with_db_retries so the scheduling loop retries the transient race condition
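The shape of the two changes can be sketched roughly as follows. This is not the actual diff: the exception classes are stand-ins for SQLAlchemy's, `FakeSession` is a hypothetical test double, the retry loop is hand-rolled rather than tenacity, and the `RETRYABLE` tuple is an assumption about what the retry list looks like after the change.

```python
# Stand-ins for the SQLAlchemy exception classes involved.
class SQLAlchemyError(Exception): ...
class DBAPIError(SQLAlchemyError): ...
class IntegrityError(DBAPIError): ...
class StaleDataError(SQLAlchemyError): ...


class FakeSession:
    """Hypothetical session whose flush() raises the queued errors in order."""

    def __init__(self, failures=()):
        self.failures = list(failures)
        self.rollbacks = 0

    def flush(self):
        if self.failures:
            raise self.failures.pop(0)

    def rollback(self):
        self.rollbacks += 1


def create_task_instances(session):
    """Flush pending rows; roll back instead of crashing on either error."""
    try:
        session.flush()
        return True
    except (IntegrityError, StaleDataError):
        session.rollback()
        return False


# Assumed retry list: StaleDataError added next to the DBAPI errors.
RETRYABLE = (DBAPIError, StaleDataError)


def run_with_retries(fn, attempts=3):
    """Plain retry loop standing in for the tenacity wrapper."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
```

With a session created as `FakeSession([StaleDataError("stale")])`, the first `create_task_instances` call rolls back and returns `False`, and the next call succeeds because the transient failure has been consumed.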

Tests added:

- test_verify_integrity_handles_stale_data_error: verifies that a StaleDataError during session.flush() is caught and session.rollback() is called
- test_retry_db_transaction_with_stale_data_error: verifies that StaleDataError is retried 3 times by run_with_db_retries
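The two tests could take roughly this shape, using `unittest.mock`. This is a sketch and not the tests in this PR: `StaleDataError` is a stand-in class, and `flush_or_rollback` and `call_with_retries` are hypothetical helpers that replace the real Airflow functions.

```python
from unittest import mock


class StaleDataError(Exception):
    """Stand-in for sqlalchemy.orm.exc.StaleDataError."""


def flush_or_rollback(session):
    """Hypothetical code under test: roll back when flush() hits stale data."""
    try:
        session.flush()
    except StaleDataError:
        session.rollback()


def call_with_retries(fn, attempts=3):
    """Hypothetical retry helper: re-raise once the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except StaleDataError:
            if attempt == attempts:
                raise


def test_flush_error_triggers_rollback():
    session = mock.MagicMock()
    session.flush.side_effect = StaleDataError("row vanished")
    flush_or_rollback(session)
    session.rollback.assert_called_once()


def test_stale_error_is_retried_three_times():
    fn = mock.Mock(side_effect=StaleDataError("row vanished"))
    try:
        call_with_retries(fn)
    except StaleDataError:
        pass  # expected after the final attempt
    assert fn.call_count == 3
```

Mocking `flush` with a `side_effect` keeps the test independent of a real database, at the cost of not exercising the actual race.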

Fixes #63926

…cheduler crash

When LocalExecutor runs with high parallelism, a task instance can be completed/deleted between the time _check_for_removed_or_restored_tasks loads TIs into the session and the time session.flush() is called in _create_task_instances. This causes a StaleDataError (ORM optimistic locking violation) that was previously uncaught, crashing the scheduler.

StaleDataError is not a subclass of DBAPIError, so it bypassed both the except IntegrityError guard in _create_task_instances and the tenacity retry wrapper in run_with_db_retries.

Fix by:
1. Catching StaleDataError alongside IntegrityError in _create_task_instances and rolling back the session safely
2. Adding StaleDataError to the tenacity retry list in run_with_db_retries so the scheduling loop retries the transient race condition

Tests added:
- test_verify_integrity_handles_stale_data_error
- test_retry_db_transaction_with_stale_data_error

Fixes apache#63926

kalluripradeep · 2026-03-30T21:31:44Z

Hey @ashb / @XD-DENG — one Helm test job failed due to a GitHub infra fluke (TCP connection reset while downloading the CI image artifact, not related to code changes). Could you re-run that failed job when you get a chance? Thanks!

XD-DENG · 2026-03-30T21:33:10Z

Triggered to re-run the failed CI job

kalluripradeep requested review from XD-DENG and ashb as code owners March 30, 2026 19:09

kalluripradeep mentioned this pull request Mar 30, 2026

[Airflow 3.x] Scheduler Crash: StaleDataError in verify_integrity during high-frequency DAG runs (~5k runs/8hrs) #63926

Open

2 tasks

kalluripradeep force-pushed the fix/63926-stale-data-error-scheduler-crash branch from d8275d1 to 20bea6c Compare March 30, 2026 20:13

kalluripradeep force-pushed the fix/63926-stale-data-error-scheduler-crash branch from 20bea6c to e106b50 Compare March 30, 2026 20:15

kalluripradeep wants to merge 1 commit into apache:main from kalluripradeep:fix/63926-stale-data-error-scheduler-crash
