Fix workflow execution state reconciliation #9199

Open
JPPhoto wants to merge 1 commit into invoke-ai:main from JPPhoto:workflow-execution-state-reconciliation

Conversation

@JPPhoto
Collaborator

@JPPhoto JPPhoto commented May 18, 2026

Summary

Fixes stale frontend workflow execution state when fast workflows complete before all socket events are reflected in the workflow editor.

This is best done with a formal state model, so this PR adds a small workflow execution state model for queue and invocation events. Socket events now pass through the model before mutating node execution state, and completed workflow queue items are reconciled from the authoritative persisted session.

| Area | State | Event | Result |
| --- | --- | --- | --- |
| Queue | null | queue_item_status_changed: pending | Apply pending |
| Queue | null | queue_item_status_changed: in_progress | Apply in_progress |
| Queue | null | queue_item_status_changed: completed | Apply completed |
| Queue | null | queue_item_status_changed: failed | Apply failed |
| Queue | null | queue_item_status_changed: canceled | Apply canceled |
| Queue | pending | queue_item_status_changed: in_progress | Apply in_progress |
| Queue | pending | queue_item_status_changed: completed | Apply completed |
| Queue | pending | queue_item_status_changed: failed | Apply failed |
| Queue | pending | queue_item_status_changed: canceled | Apply canceled |
| Queue | in_progress | queue_item_status_changed: completed | Apply completed |
| Queue | in_progress | queue_item_status_changed: failed | Apply failed |
| Queue | in_progress | queue_item_status_changed: canceled | Apply canceled |
| Queue | completed | queue_item_status_changed: pending | Ignore stale event |
| Queue | completed | queue_item_status_changed: in_progress | Ignore stale event |
| Queue | failed | queue_item_status_changed: pending | Ignore stale event |
| Queue | failed | queue_item_status_changed: in_progress | Ignore stale event |
| Queue | canceled | queue_item_status_changed: pending | Ignore stale event |
| Queue | canceled | queue_item_status_changed: in_progress | Ignore stale event |
| Invocation | unknown | invocation_started | Apply in_progress |
| Invocation | unknown | invocation_progress | Apply in_progress |
| Invocation | unknown | invocation_complete | Apply completed |
| Invocation | unknown | invocation_error | Apply failed |
| Invocation | in_progress | invocation_complete | Apply completed |
| Invocation | in_progress | invocation_error | Apply failed |
| Invocation | completed | invocation_started | Ignore stale event |
| Invocation | completed | invocation_progress | Ignore stale event |
| Invocation | completed | invocation_error | Ignore stale event |
| Invocation | failed | invocation_started | Ignore stale event |
| Invocation | failed | invocation_progress | Ignore stale event |
| Invocation | failed | invocation_complete | Ignore stale event |
| Queue completed | invocation not terminal | invocation_complete | Apply completed |
| Queue completed | any invocation | invocation_started | Ignore stale event |
| Queue completed | any invocation | invocation_progress | Ignore stale event |
| Queue completed | any invocation | invocation_error | Ignore stale event |
| Queue failed | invocation not terminal | invocation_error | Apply failed |
| Queue failed | any invocation | invocation_started | Ignore stale event |
| Queue failed | any invocation | invocation_progress | Ignore stale event |
| Queue failed | any invocation | invocation_complete | Ignore stale event |
| Queue canceled | any invocation | any invocation event | Ignore stale event |
| Reconciliation | completed queue item | completed_session_reconciled | Mark queue completed; mark persisted prepared invocation IDs completed; rebuild node outputs from session |
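
As a rough illustration of the table above, the transitions could be written as two pure functions. This is a minimal sketch, not the code in workflowExecutionState.ts: the names transitionQueueStatus and transitionInvocation and the Transition type are hypothetical, and any combination the table does not list (for example a repeated invocation_started while already in_progress, or two terminal queue events in a row) is assumed to be applied.

```typescript
// Hypothetical sketch of the transition table; not the PR's actual implementation.
type QueueStatus = 'pending' | 'in_progress' | 'completed' | 'failed' | 'canceled';
type InvocationStatus = 'in_progress' | 'completed' | 'failed';
type InvocationEvent =
  | 'invocation_started'
  | 'invocation_progress'
  | 'invocation_complete'
  | 'invocation_error';

type Transition<S> = { shouldApply: boolean; next: S };

const isTerminalQueueStatus = (status: QueueStatus | null): boolean =>
  status === 'completed' || status === 'failed' || status === 'canceled';

// Queue rows: a terminal item ignores late non-terminal status events.
// Assumption: every other combination, including unlisted ones, is applied.
function transitionQueueStatus(
  current: QueueStatus | null,
  incoming: QueueStatus
): Transition<QueueStatus | null> {
  if (isTerminalQueueStatus(current) && !isTerminalQueueStatus(incoming)) {
    return { shouldApply: false, next: current };
  }
  return { shouldApply: true, next: incoming };
}

// Invocation rows plus the "Queue completed / failed / canceled" rows.
// `current` is undefined for an invocation not seen yet ("unknown" in the table).
function transitionInvocation(
  queue: QueueStatus | null,
  current: InvocationStatus | undefined,
  event: InvocationEvent
): Transition<InvocationStatus | undefined> {
  const ignore = { shouldApply: false, next: current };
  // A terminal invocation never changes again.
  if (current === 'completed' || current === 'failed') {
    return ignore;
  }
  // A canceled queue item ignores every invocation event.
  if (queue === 'canceled') {
    return ignore;
  }
  // A completed queue item only accepts late invocation_complete events.
  if (queue === 'completed' && event !== 'invocation_complete') {
    return ignore;
  }
  // A failed queue item only accepts late invocation_error events.
  if (queue === 'failed' && event !== 'invocation_error') {
    return ignore;
  }
  const next: InvocationStatus =
    event === 'invocation_complete'
      ? 'completed'
      : event === 'invocation_error'
        ? 'failed'
        : 'in_progress';
  return { shouldApply: true, next };
}
```

Keeping both functions free of side effects means each row of the table can be checked directly in a unit test, with the socket handlers deciding separately what to do when shouldApply is false.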

Related Issues / Discussions

This attempts to finally resolve issues with execution state partially resolved in #9043 and others.

QA Instructions

Run tests from invokeai/frontend/web:

pnpm exec vitest run src/services/events/workflowExecutionState.test.ts src/services/events/nodeExecutionState.test.ts src/services/events/invocationTracking.test.ts

Merge Plan

Normal merge.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@JPPhoto JPPhoto requested a review from blessedcoolant as a code owner May 18, 2026 16:38
@JPPhoto JPPhoto added the 6.13.5 Library Updates label May 18, 2026
@JPPhoto JPPhoto moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap May 18, 2026
@JPPhoto JPPhoto added the frontend PRs that change frontend files label May 18, 2026
@JPPhoto JPPhoto force-pushed the workflow-execution-state-reconciliation branch 9 times, most recently from 2aadadc to c60ee2a Compare May 19, 2026 00:10
@Pfannkuchensack Pfannkuchensack self-assigned this May 19, 2026
@JPPhoto JPPhoto force-pushed the workflow-execution-state-reconciliation branch from c60ee2a to 61c33bc Compare May 22, 2026 03:37
@Pfannkuchensack
Collaborator

Findings

  • High: invokeai/frontend/web/src/services/events/setEventListeners.tsx:544-566 introduces an async reconciliation fetch for completed workflow queue items, but its upsertExecutionState side effect is keyed only by nodeId (the source-node id) in the global $nodeExecutionStates store, with no guard that the reconciled item is still the current run. Scenario: queue item 1 (workflow) reaches completed, the reconciliation dispatch(queueApi.endpoints.getQueueItem.initiate(item_id, { forceRefetch: true, subscribe: false })) is dispatched but not yet resolved. Queue item 2 for the same workflow then starts; on queue_item_status_changed with status === 'in_progress', invokeai/frontend/web/src/services/events/setEventListeners.tsx:516-528 resets all node states to PENDING, and subsequent invocation_started events transition source nodes to IN_PROGRESS. The pending fetch for item 1 finally resolves and calls upsertExecutionState for each source node in item 1's session with status COMPLETED and item 1's outputs, blowing away item 2's IN_PROGRESS (or partial progress) state. Evidence chain: setEventListeners.tsx:559-561 -> invokeai/frontend/web/src/features/nodes/hooks/useNodeExecutionState.ts:40-47 (upsertExecutionState does { ...state, ...updates } with no item id check). The previous code never wrote node state from an async reconciliation, so this is a regression introduced by this branch.
    To expose this issue, add a test that drives the listener through (1) queue_item_status_changed(item_id=1, status=completed, origin=workflows) with a deferred getQueueItem response, (2) queue_item_status_changed(item_id=2, status=in_progress) followed by invocation_started for the same source nodes, then (3) resolve item 1's getQueueItem mock with completed results, and assert $nodeExecutionStates still reflects item 2's IN_PROGRESS state rather than item 1's COMPLETED outputs.

  • High: invokeai/frontend/web/src/services/events/workflowExecutionState.ts:76-101 now gates invocation_complete through the state machine, returning shouldApply: false when either the per-invocation status is already terminal or the queue status is failed/canceled. invokeai/frontend/web/src/services/events/setEventListeners.tsx:213-224 short-circuits before calling onInvocationComplete, which is the only place that runs addImagesToGallery, clearCanvasWorkflowIntegrationProcessing, and $lastProgressEvent.set(null) (invokeai/frontend/web/src/services/events/onInvocationComplete.tsx:247-275). Scenario A: a workflow item enters failed (one invocation errored) but other sibling invocations had already produced images; if their invocation_complete events arrive after the queue_item_status_changed(failed) event (which is a documented race the previous LRU cache was sized to handle), the new gate drops them and the generated images are never added to the gallery, counted in board totals, or auto-switched to. The previous handler explicitly excluded invocation_complete from the finished-item filter (the removed shouldIgnoreFinishedQueueItemInvocationEvent returned false for invocation_error and never even saw invocation_complete), so the regression is direct. Scenario B: when reconciliation marks invocations completed in the state machine before a late invocation_complete arrives for the same prepared id (e.g., a fast getQueueItem resolution), the per-invocation terminal gate at workflowExecutionState.ts:76-79 drops the late event and the image still never reaches the gallery (reconciliation only calls upsertExecutionState, never the gallery path).
    To expose this issue, add a test that fires queue_item_status_changed(item_id=1, status=failed) (or pre-applies completed_session_reconciled) and then invocation_complete for a sibling invocation with an ImageOutput result, and asserts that onInvocationComplete ran (boards/image cache updates, last-progress cleared) instead of being swallowed.

  • Medium: invokeai/frontend/web/src/services/events/setEventListeners.tsx:544 only triggers reconciliation when status === 'completed' && origin === 'workflows'. For partial-success runs that end in failed or canceled, the persisted session.results may still contain completed prepared invocations whose invocation_complete events were dropped by the gate described in the previous finding. Those results are not reconciled at all, so successful sibling outputs disappear from the UI on any race-affected failed run. The branch's stated goal is "workflow execution state reconciliation", yet the failure path it most needs to cover is excluded.
    To expose this issue, add a test that simulates a workflow where some invocations succeed and one fails with the invocation_complete events arriving after queue_item_status_changed(failed), and asserts the surviving node outputs are reconciled (either via the same getQueueItem path or via not gating invocation_complete in failed state).

  • Medium: invokeai/frontend/web/src/services/events/setEventListeners.tsx:435-444 plus invokeai/frontend/web/src/services/events/workflowExecutionState.ts:55-62 short-circuit queue_item_status_changed only when the cached status was terminal AND the new status is non-terminal. Two consecutive terminal events (e.g., a re-delivered completed, or a backend that transitions completed -> failed) both return shouldApply: true. That re-runs the entire handler: another getQueueItem force-refetch, another full tag invalidation set, and another reconciliation pass. On a completed -> failed flip it can also overwrite queueStatus with failed, after which a duplicate completed flips it back again. The previous finishedQueueItemIds.has(...) check rejected any repeat terminal event outright. This is a behavioral change with no test.
    To expose this issue, add a test that fires two queue_item_status_changed events with status: 'completed' for the same item id and asserts the reconciliation dispatch and tag invalidations only happen once.

  • Medium: invokeai/frontend/web/src/services/events/setEventListeners.tsx:79 sets the cache size at max: 100, the same as the previous finished-id cache, but each entry now stores a full WorkflowExecutionState including a Record<string, InvocationStatus> keyed by every prepared invocation id seen. For workflows with many prepared nodes (iterate/batch expansion) this is materially larger memory-per-entry than a boolean, and there is no scheduled cleanup once a queue item reaches a terminal state. invokeai/frontend/web/src/services/events/setEventListeners.tsx:530 only clears completedInvocationKeysByItemId, not workflowExecutionStates. Long-lived sessions with large batched workflows can grow this map until LRU eviction kicks in, and there is no test demonstrating memory bounds for the new structure.

  • Medium: invokeai/frontend/web/src/services/events/setEventListeners.tsx:544-566 calls getQueueItem.initiate(..., { subscribe: false }). With no subscription added, RTK Query may evict the entry before the rest of the app sees it; more importantly, no unsubscribe/abort is wired up, so if the user disconnects/reconnects mid-reconciliation the resolved callback will still call transitionWorkflowEvent and upsertExecutionState against the new socket session's state. Combined with the cross-queue-item race in the first finding, this widens the window where stale reconciliation results win.

  • Low: invokeai/frontend/web/src/services/events/workflowExecutionState.ts:64-74's completed_session_reconciled branch unconditionally forces queueStatus = 'completed' whenever it is applied. The only invocation site is invokeai/frontend/web/src/services/events/setEventListeners.tsx:554-558, immediately after a completed status event, so today this is a no-op redundancy. But the function is exported and named generically; any future caller that fires completed_session_reconciled for a non-terminal state will silently mark the queue completed. The state machine should at least verify state.queueStatus === 'completed' before promoting.

  • Low: invokeai/frontend/web/src/services/events/nodeExecutionState.ts:87-110's getNodeExecutionStatesFromCompletedSession skips nodes that have no result rows, but does not handle the case where a session has stored a result for a NON-source prepared id (e.g., subsequent edits to source_prepared_mapping). It iterates Object.entries(session.source_prepared_mapping), so any persisted result whose prepared id is missing from the mapping is silently dropped from reconciliation. There is no test for that branch.
    To expose this issue, add a test where session.results contains a prepared id not present in source_prepared_mapping and assert it is either reconciled or explicitly ignored as intended.

  • Low: invokeai/frontend/web/src/services/events/setEventListeners.tsx:189-210 now gates invocation_error through the state machine. invokeai/frontend/web/src/services/events/workflowExecutionState.ts:89-95 drops the event when state.queueStatus is 'completed' or 'canceled'. The previous code unconditionally ran the error handler, which dispatches canvasWorkflowIntegrationProcessingCompleted() at setEventListeners.tsx:207-209. With this branch, a late invocation_error for a canvas_workflow_integration origin that arrives after a canceled/completed queue status will leave the canvas modal stuck on its loading spinner.
    To expose this issue, add a test that fires queue_item_status_changed(canceled) and then invocation_error(origin=canvas_workflow_integration) and asserts canvasWorkflowIntegrationProcessingCompleted is still dispatched.

  • Low: invokeai/frontend/web/src/services/events/workflowExecutionState.test.ts has good unit coverage of the pure reducer but no integration test exercises invokeai/frontend/web/src/services/events/setEventListeners.tsx's actual socket-handler wiring. None of the regressions in the High/Medium findings above can be caught by the current vitest suite. Frontend repo policy (invokeai/frontend/web/CLAUDE.md) excludes DOM tests, but these are pure state-machine + listener interactions and could be covered with vitest against a mocked Socket plus dispatch.
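
One possible shape for the guard that the first High finding asks for is sketched below. It is only a sketch under stated assumptions: nodeStates, runGeneration, onQueueItemInProgress, and beginReconciliation are hypothetical stand-ins, not the real $nodeExecutionStates store, upsertExecutionState, or the listener code in setEventListeners.tsx. The idea is to capture a run generation when the completed event is handled, and to discard the fetched result if a newer run has started by the time it resolves.

```typescript
// Hypothetical stand-ins for the real node execution state store.
type NodeStatus = 'PENDING' | 'IN_PROGRESS' | 'COMPLETED' | 'FAILED';

const nodeStates = new Map<string, NodeStatus>();

// Bumped every time a queue item starts running, so that older async work
// can tell it has been superseded.
let runGeneration = 0;

// Called when a queue item reports in_progress: start a new run and reset nodes.
function onQueueItemInProgress(nodeIds: string[]): void {
  runGeneration += 1;
  for (const nodeId of nodeIds) {
    nodeStates.set(nodeId, 'PENDING');
  }
}

// Called when a queue item reports completed, before the async fetch starts.
// Returns a callback to run with the fetched result; the callback writes
// node state only if no newer run has started in the meantime.
function beginReconciliation(): (completedNodeIds: string[]) => boolean {
  const generationAtCompletion = runGeneration;
  return (completedNodeIds: string[]): boolean => {
    if (generationAtCompletion !== runGeneration) {
      // A newer queue item owns the node state now; drop the stale result.
      return false;
    }
    for (const nodeId of completedNodeIds) {
      nodeStates.set(nodeId, 'COMPLETED');
    }
    return true;
  };
}
```

In a listener, the callback returned by beginReconciliation would be invoked from the resolution handler of the queue item fetch. Because the generation is captured at the completed event rather than at in_progress, a fast workflow whose in_progress event was never observed is still reconciled, while a result that resolves after the next run has started is ignored.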

Open Questions

  • Is the backend currently emitting queue_item_status_changed events strictly after all invocation_complete events for the same item, or can they be interleaved on the wire? The previous comment at the deleted setEventListeners.tsx:75-78 explicitly asserted that out-of-order delivery does occur, which is why all three High/Medium findings rely on it. If the server now guarantees ordering, this branch's regressions narrow significantly; if it does not, every finding above is reachable in production.
  • The reconciliation fetch uses forceRefetch: true, subscribe: false. Was subscribe: true (with explicit unsubscribe) considered, so the cache result is retained for other consumers and the request can be aborted on disconnect?

Labels

6.13.5 Library Updates frontend PRs that change frontend files

Projects

Status: 6.13.5 LIBRARY UPDATES

2 participants