Skip to content

Race Condition in Virtual Layer Update #5643

@rm-minus-r-star

Description

@rm-minus-r-star

Race Condition in Virtual Layer Update

Summary

Non-deterministic failure during sqlmesh plan execution due to race condition when updating virtual layer. The plan fails on first execution but succeeds on second execution without any changes to code or data.

Environment

  • SQLMesh Version: 0.228.1
  • Python Version: 3.13.1
  • Database: PostgreSQL
  • Execution Mode: ThreadPoolExecutor (concurrent)

Expected Behavior

Running sqlmesh plan should be idempotent and deterministic. The same command with the same state should produce the same result every time.

Actual Behavior

First execution fails with UndefinedTable error during virtual layer update. Second execution (without any changes) succeeds completely.

Steps to Reproduce

  1. Add multiple new external models to external_models.yaml (~28 new tables)
  2. Run sqlmesh plan
  3. Answer "y" to "Apply - Virtual Update"
  4. Observe failure at ~31% completion during "Updating virtual layer"
  5. Run sqlmesh plan again without making any changes
  6. Observe successful completion

Error Details

First Execution (Failed)

Differences from the `dev` environment:

Models:
├── Added:
│   ├── schema1__dev.table_001
│   ├── .... 26 more ....
│   └── schema1__dev.table_028
└── Removed:
    └── schema0__dev.external_table_old

Apply - Virtual Update [y/n]: y
Updating virtual layer  ━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.2% •  5/16 • 0:00:00
Error: Execution failed for node SnapshotId<"database"."schema2"."model_name": 1477141055>

Log Analysis

Error from logs:

psycopg2.errors.UndefinedTable: relation "sqlmesh__schema2.schema2__model_name__3109983560" does not exist
LINE 1: ..."schema2__dev"."model_name" AS SELECT * FROM "database"."sqlmesh__schema2"."schema2__model_name__3109983560"
                                                             ^

Sequence of events:

  1. Virtual layer update phase begins
  2. SQLMesh attempts to create dev view:
    CREATE VIEW "schema2__dev"."model_name"
    AS SELECT * FROM "database"."sqlmesh__schema2"."schema2__model_name__3109983560"
  3. Error: The referenced view schema2__model_name__3109983560 does not exist yet
  4. This view is supposed to be created later in the Backfill phase

From first run log (09:16:35):

2026-01-03 09:16:35,430 - ThreadPoolExecutor-4_2 - sqlmesh.core.snapshot.evaluator - INFO - Updating view 'database.schema2__dev.model_name' to point at table 'database.sqlmesh__schema2.schema2__model_name__3109983560' (evaluator.py:1988)
2026-01-03 09:16:35,430 - ThreadPoolExecutor-4_2 - sqlmesh.core.engine_adapter.base - INFO - Executing SQL: DROP VIEW IF EXISTS "schema2__dev"."model_name" CASCADE (base.py:2626)
2026-01-03 09:16:35,431 - ThreadPoolExecutor-4_2 - sqlmesh.core.engine_adapter.base - INFO - Executing SQL: CREATE VIEW "schema2__dev"."model_name" AS SELECT * FROM "database"."sqlmesh__schema2"."schema2__model_name__3109983560" (base.py:2626)
[FAILURE - Table does not exist]

From second run log (09:20:48) - Successful:

2026-01-03 09:20:48,604 - ThreadPoolExecutor-3_2 - sqlmesh.core.snapshot.evaluator - INFO - Evaluating snapshot SnapshotId<"database"."schema2"."model_name": 1477141055> (evaluator.py:711)
2026-01-03 09:20:48,612 - ThreadPoolExecutor-3_2 - sqlmesh.core.snapshot.evaluator - INFO - Inserting data for snapshot (evaluator.py:952)
2026-01-03 09:20:48,628 - ThreadPoolExecutor-3_2 - sqlmesh.core.snapshot.evaluator - INFO - Replacing view 'database.sqlmesh__schema2.schema2__model_name__3109983560' (evaluator.py:2676)
2026-01-03 09:20:48,632 - ThreadPoolExecutor-3_2 - sqlmesh.core.engine_adapter.base - INFO - Executing SQL: CREATE VIEW "sqlmesh__schema2"."schema2__model_name__3109983560" (...) (base.py:2626)
[SUCCESS - View created during backfill phase]

Second Execution (Succeeded)

Differences from the `dev` environment:

Models needing backfill:
├── schema3__dev.snapshot_001: [2025-12-27 - 2025-12-30, 2026-01-01 - 2026-01-01] (preview)
├── schema3__dev.snapshot_002: [2025-12-27 - 2025-12-30, 2026-01-01 - 2026-01-01] (preview)
├── schema2__dev.model_001: [recreate view]
├── schema2__dev.model_002: [recreate view]
└── schema2__dev.model_name: [recreate view]

Apply - Backfill Tables [y/n]: y
[SUCCESS]

Root Cause Analysis

Race Condition in Execution Order:

  1. SQLMesh uses ThreadPoolExecutor for concurrent execution
  2. During virtual layer update, it attempts to create dev views pointing to underlying physical views
  3. The underlying physical views are created in the backfill phase, which should execute before the virtual layer update
  4. Due to concurrent execution and possibly incorrect dependency ordering, the virtual layer update executed before the view was created
  5. First run fails, but saves partial progress to state tables
  6. Second run starts with different state and executes in correct order

Additional Context

  • The failure occurs specifically when adding many new external models (~28 tables)
  • The error happens at exactly 31.2% progress (5 out of 16 models processed)
  • The failure is not caused by missing data or database connection issues
  • The underlying external table exists and has data
  • Manual queries against the external table work correctly
  • The issue appears to be purely a SQLMesh execution ordering problem

Impact

  • Severity: Medium
  • Frequency: Reproducible on first run, resolved on retry
  • Workaround: Re-run sqlmesh plan (succeeds on second attempt)
  • Impact on Production: Low (can work around by retrying), but concerning for CI/CD pipelines that expect idempotent operations

Expected Fix

SQLMesh should:

  1. Ensure proper dependency ordering between plan stages
  2. Create physical views/tables before creating dev views that reference them
  3. Maintain idempotency - same command should produce same result
  4. Provide better error messages when dependency ordering issues occur

Reproducibility

  • Reproducible consistently on first run
  • Resolves on second run without changes
  • Error logs available
  • State persists between runs causing non-deterministic behavior

Questions for Maintainers

  1. Is this a known issue with concurrent execution in virtual layer updates?
  2. Should the backfill phase complete before virtual layer update begins?
  3. Is there a configuration option to disable concurrent execution to avoid this race condition?
  4. Should SQLMesh validate dependency ordering before execution?

Attachments:

  • Full error logs available upon request (redacted for privacy)
  • SQLMesh state table dumps available if helpful for debugging

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions