-
Notifications
You must be signed in to change notification settings - Fork 325
Open
Description
Race Condition in Virtual Layer Update
Summary
Non-deterministic failure during sqlmesh plan execution due to race condition when updating virtual layer. The plan fails on first execution but succeeds on second execution without any changes to code or data.
Environment
- SQLMesh Version: 0.228.1
- Python Version: 3.13.1
- Database: PostgreSQL
- Execution Mode: ThreadPoolExecutor (concurrent)
Expected Behavior
Running sqlmesh plan should be idempotent and deterministic. The same command with the same state should produce the same result every time.
Actual Behavior
First execution fails with UndefinedTable error during virtual layer update. Second execution (without any changes) succeeds completely.
Steps to Reproduce
- Add multiple new external models to
external_models.yaml(~28 new tables) - Run
sqlmesh plan - Answer "y" to "Apply - Virtual Update"
- Observe failure at ~31% completion during "Updating virtual layer"
- Run
sqlmesh planagain without making any changes - Observe successful completion
Error Details
First Execution (Failed)
Differences from the `dev` environment:
Models:
├── Added:
│ ├── schema1__dev.table_001
│ ├── .... 26 more ....
│ └── schema1__dev.table_028
└── Removed:
└── schema0__dev.external_table_old
Apply - Virtual Update [y/n]: y
Updating virtual layer ━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.2% • 5/16 • 0:00:00
Error: Execution failed for node SnapshotId<"database"."schema2"."model_name": 1477141055>
Log Analysis
Error from logs:
psycopg2.errors.UndefinedTable: relation "sqlmesh__schema2.schema2__model_name__3109983560" does not exist
LINE 1: ..."schema2__dev"."model_name" AS SELECT * FROM "database"."sqlmesh__schema2"."schema2__model_name__3109983560"
^
Sequence of events:
- Virtual layer update phase begins
- SQLMesh attempts to create dev view:
CREATE VIEW "schema2__dev"."model_name" AS SELECT * FROM "database"."sqlmesh__schema2"."schema2__model_name__3109983560"
- Error: The referenced view
schema2__model_name__3109983560does not exist yet - This view is supposed to be created later in the Backfill phase
From first run log (09:16:35):
2026-01-03 09:16:35,430 - ThreadPoolExecutor-4_2 - sqlmesh.core.snapshot.evaluator - INFO - Updating view 'database.schema2__dev.model_name' to point at table 'database.sqlmesh__schema2.schema2__model_name__3109983560' (evaluator.py:1988)
2026-01-03 09:16:35,430 - ThreadPoolExecutor-4_2 - sqlmesh.core.engine_adapter.base - INFO - Executing SQL: DROP VIEW IF EXISTS "schema2__dev"."model_name" CASCADE (base.py:2626)
2026-01-03 09:16:35,431 - ThreadPoolExecutor-4_2 - sqlmesh.core.engine_adapter.base - INFO - Executing SQL: CREATE VIEW "schema2__dev"."model_name" AS SELECT * FROM "database"."sqlmesh__schema2"."schema2__model_name__3109983560" (base.py:2626)
[FAILURE - Table does not exist]
From second run log (09:20:48) - Successful:
2026-01-03 09:20:48,604 - ThreadPoolExecutor-3_2 - sqlmesh.core.snapshot.evaluator - INFO - Evaluating snapshot SnapshotId<"database"."schema2"."model_name": 1477141055> (evaluator.py:711)
2026-01-03 09:20:48,612 - ThreadPoolExecutor-3_2 - sqlmesh.core.snapshot.evaluator - INFO - Inserting data for snapshot (evaluator.py:952)
2026-01-03 09:20:48,628 - ThreadPoolExecutor-3_2 - sqlmesh.core.snapshot.evaluator - INFO - Replacing view 'database.sqlmesh__schema2.schema2__model_name__3109983560' (evaluator.py:2676)
2026-01-03 09:20:48,632 - ThreadPoolExecutor-3_2 - sqlmesh.core.engine_adapter.base - INFO - Executing SQL: CREATE VIEW "sqlmesh__schema2"."schema2__model_name__3109983560" (...) (base.py:2626)
[SUCCESS - View created during backfill phase]
Second Execution (Succeeded)
Differences from the `dev` environment:
Models needing backfill:
├── schema3__dev.snapshot_001: [2025-12-27 - 2025-12-30, 2026-01-01 - 2026-01-01] (preview)
├── schema3__dev.snapshot_002: [2025-12-27 - 2025-12-30, 2026-01-01 - 2026-01-01] (preview)
├── schema2__dev.model_001: [recreate view]
├── schema2__dev.model_002: [recreate view]
└── schema2__dev.model_name: [recreate view]
Apply - Backfill Tables [y/n]: y
[SUCCESS]
Root Cause Analysis
Race Condition in Execution Order:
- SQLMesh uses ThreadPoolExecutor for concurrent execution
- During virtual layer update, it attempts to create dev views pointing to underlying physical views
- The underlying physical views are created in the backfill phase, which should execute before the virtual layer update
- Due to concurrent execution and possibly incorrect dependency ordering, the virtual layer update executed before the view was created
- First run fails, but saves partial progress to state tables
- Second run starts with different state and executes in correct order
Additional Context
- The failure occurs specifically when adding many new external models (~28 tables)
- The error happens at exactly 31.2% progress (5 out of 16 models processed)
- The failure is not caused by missing data or database connection issues
- The underlying external table exists and has data
- Manual queries against the external table work correctly
- The issue appears to be purely a SQLMesh execution ordering problem
Impact
- Severity: Medium
- Frequency: Reproducible on first run, resolved on retry
- Workaround: Re-run
sqlmesh plan(succeeds on second attempt) - Impact on Production: Low (can work around by retrying), but concerning for CI/CD pipelines that expect idempotent operations
Expected Fix
SQLMesh should:
- Ensure proper dependency ordering between plan stages
- Create physical views/tables before creating dev views that reference them
- Maintain idempotency - same command should produce same result
- Provide better error messages when dependency ordering issues occur
Reproducibility
- Reproducible consistently on first run
- Resolves on second run without changes
- Error logs available
- State persists between runs causing non-deterministic behavior
Questions for Maintainers
- Is this a known issue with concurrent execution in virtual layer updates?
- Should the backfill phase complete before virtual layer update begins?
- Is there a configuration option to disable concurrent execution to avoid this race condition?
- Should SQLMesh validate dependency ordering before execution?
Attachments:
- Full error logs available upon request (redacted for privacy)
- SQLMesh state table dumps available if helpful for debugging
Metadata
Metadata
Assignees
Labels
No labels