fix: drive storage cleanup through reset before discovery (#5555701) by williampnvidia · Pull Request #1748 · NVIDIA/infra-controller-core

williampnvidia · 2026-05-17T21:21:55Z

Summary

- Stop Scout's discovery/no-API path from performing hidden NVMe/HDD/SAS cleanup;
  discovery now leaves destructive storage cleanup to the API-directed reset path
- Add cleanup context to WaitingForCleanup so the state machine can distinguish
  initial discovery cleanup from deprovision cleanup and route recovery correctly
- Drive initial discovery through WaitingForCleanup { InitialDiscovery } before
  returning Scout Action::Discovery
- Preserve HostInit cleanup context across repeated NVMECleanFailed retries so a
  successful retry returns to HostInit/WaitingForDiscovery, not the deprovision flow
- Add regressions covering firmware-upgrade Scout boots, assigned discovery-image
  boots, and repeated HostInit cleanup failures
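The routing described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `CleanupContext`, `HostState`, `DeprovisionContinues`, `action_for`, and `on_cleanup_result` are hypothetical names, and handing out a reset again while in the failed state is an assumption about how retries are driven.

```rust
/// Hypothetical: why the host is waiting for storage cleanup.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CleanupContext {
    InitialDiscovery,
    Deprovision,
}

/// Hypothetical, heavily reduced host state for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HostState {
    WaitingForCleanup { context: CleanupContext },
    WaitingForDiscovery,
    /// Stand-in for whatever follows cleanup in the deprovision flow.
    DeprovisionContinues,
    NvmeCleanFailed { context: CleanupContext },
}

/// Action handed to Scout on boot (mirrors RESET / DISCOVERY in the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Reset,
    Discovery,
}

/// Cleanup states get a reset; discovery is only returned once cleanup is done.
/// States this sketch does not model return `None`.
fn action_for(state: HostState) -> Option<Action> {
    match state {
        // Assumption: a failed cleanup is retried through another reset.
        HostState::WaitingForCleanup { .. } | HostState::NvmeCleanFailed { .. } => {
            Some(Action::Reset)
        }
        HostState::WaitingForDiscovery => Some(Action::Discovery),
        HostState::DeprovisionContinues => None,
    }
}

/// Applies a cleanup result. The context is carried through failures, so a
/// later successful retry still returns to the flow that started the cleanup.
fn on_cleanup_result(state: HostState, succeeded: bool) -> HostState {
    match state {
        HostState::WaitingForCleanup { context } | HostState::NvmeCleanFailed { context } => {
            if !succeeded {
                return HostState::NvmeCleanFailed { context };
            }
            match context {
                CleanupContext::InitialDiscovery => HostState::WaitingForDiscovery,
                CleanupContext::Deprovision => HostState::DeprovisionContinues,
            }
        }
        // Cleanup results are ignored in states that are not cleaning up.
        other => other,
    }
}
```

In this sketch, two consecutive failures starting from `WaitingForCleanup { context: InitialDiscovery }` stay in `NvmeCleanFailed` with the same context, and the next success lands in `WaitingForDiscovery` rather than the deprovision stand-in.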

Recovery context note

NVMECleanFailed recovery currently uses FailureSource::StateMachineArea(HostInit)
to remember that the failure happened during initial discovery cleanup. That keeps
this fix scoped and avoids changing serialized FailureCause shape in this PR.
A follow-up should consider storing cleanup context directly on the storage-cleanup
failure cause instead of inferring it from FailureSource.
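The inference described in this note might look something like the sketch below. Only `FailureSource::StateMachineArea(HostInit)` comes from the PR text; the `StateMachineArea` enum name, its `Other` variant, the `Scout` variant, `CleanupContext`, and `cleanup_context_from` are hypothetical stand-ins, and treating every other source as deprovision cleanup is an assumption.

```rust
/// Hypothetical: areas of the state machine a failure can be attributed to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StateMachineArea {
    HostInit,
    /// Placeholder for every other area; not a real variant name.
    Other,
}

/// Reduced stand-in for the failure source recorded with a failure cause.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureSource {
    StateMachineArea(StateMachineArea),
    /// Hypothetical: a failure reported by Scout itself.
    Scout,
}

/// Hypothetical cleanup context, matching the two flows named in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CleanupContext {
    InitialDiscovery,
    Deprovision,
}

/// Infers which cleanup flow failed from the recorded failure source.
/// Assumption: anything not tagged HostInit belongs to deprovision cleanup.
fn cleanup_context_from(source: FailureSource) -> CleanupContext {
    match source {
        FailureSource::StateMachineArea(StateMachineArea::HostInit) => {
            CleanupContext::InitialDiscovery
        }
        _ => CleanupContext::Deprovision,
    }
}
```

The catch-all arm shows why the note suggests a follow-up: the context is implied by where the failure was recorded rather than stated on the storage-cleanup failure cause itself.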

Dev environment validation

- Deployed the local Carbide build into the local-dev control-plane
  environment and verified the updated pods rolled out
- Normal delete path: provisioned a host, deleted it, observed API cleanup
  state, Scout RESET, NVMe cleanup, successful cleanup_machine_completed,
  and no destructive cleanup during the later DISCOVERY
- Force-delete/re-ingest path: force-deleted the host, observed the predicted
  host move through WaitingForCleanup { InitialDiscovery }, the actual host
  receive RESET, cleanup complete successfully, then DISCOVERY log the
  no-API cleanup skip message
- Injected an NVMe cleanup failure in Scout for the delete path and verified
  the machine moved to NVMECleanFailed
- Restored the non-failing Scout artifact, rebooted the host, and verified the
  machine recovered from NVMECleanFailed
- Repeated the injected NVMe cleanup failure and recovery validation for the
  force-delete/re-ingest path
- Verified API state transitions and Scout cleanup behavior in Loki logs
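The Scout-side behavior exercised above can be approximated with a small sketch. Everything here is hypothetical: `run_scout`, `RunLog`, and the message strings are illustrative and do not reproduce Scout's real functions or log lines. The cleanup step is passed in as a closure so that a failure can be injected, loosely mirroring the injected NVMe failure used in validation.

```rust
/// Action Scout was told to perform (mirrors RESET / DISCOVERY in the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Reset,
    Discovery,
}

/// Records what a run did, so the sketch can be checked without touching disks.
#[derive(Debug, Default, PartialEq, Eq)]
struct RunLog {
    storage_cleaned: bool,
    /// Placeholder messages; the wording is not Scout's actual log output.
    messages: Vec<String>,
}

/// Runs one Scout boot. Storage cleanup happens only on `Reset`; on
/// `Discovery` the cleanup closure is never called.
fn run_scout<F>(action: Action, cleanup: F) -> Result<RunLog, String>
where
    F: FnOnce() -> Result<(), String>,
{
    let mut log = RunLog::default();
    match action {
        Action::Reset => {
            // A cleanup error is propagated so the caller can record the failure.
            cleanup()?;
            log.storage_cleaned = true;
            log.messages.push("reset: storage cleanup completed".to_string());
        }
        Action::Discovery => {
            log.messages.push("discovery: skipping storage cleanup".to_string());
        }
    }
    Ok(log)
}
```

Calling `run_scout(Action::Discovery, ...)` with a closure that always fails still returns `Ok` with `storage_cleaned == false`, which is the property the DISCOVERY checks above look for; the same closure under `Action::Reset` surfaces the error.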

Test plan

cargo +1.90.0 fmt --all
git diff --check

Signed-off-by: Josh P <williamp@nvidia.com>

williampnvidia requested a review from a team as a code owner May 17, 2026 21:21

williampnvidia requested review from ajf, krish-nvidia, martinraumann and stoo-davies May 17, 2026 21:29

williampnvidia added 4 commits May 18, 2026 14:16

fix: keep storage cleanup out of discovery

d6f94a0

Signed-off-by: Josh P <williamp@nvidia.com>

fix: drive initial cleanup through reset

5060943

Signed-off-by: Josh P <williamp@nvidia.com>

Add storage cleanup recovery regressions

53af9ae

test: drive initial cleanup in host fixture

f96cbec

williampnvidia force-pushed the fix/scout-discovery-no-storage-cleanup branch from 9f63f6f to f96cbec May 18, 2026 21:19

williampnvidia added 2 commits May 18, 2026 15:08

test: update host readiness metrics for cleanup

cd6d0a4

test: stabilize cleanup fixture tests

6b6bb64

williampnvidia wants to merge 6 commits into NVIDIA:main from williampnvidia:fix/scout-discovery-no-storage-cleanup
