fix: drive storage cleanup through reset before discovery (#5555701) #1748

Open
williampnvidia wants to merge 6 commits into NVIDIA:main from williampnvidia:fix/scout-discovery-no-storage-cleanup

Conversation

@williampnvidia
Contributor

@williampnvidia williampnvidia commented May 17, 2026

Summary

  • Stop Scout's discovery/no-API path from performing hidden NVMe/HDD/SAS cleanup;
    discovery now leaves destructive storage cleanup to the API-directed reset path
  • Add cleanup context to WaitingForCleanup so the state machine can distinguish
    initial discovery cleanup from deprovision cleanup and route recovery correctly
  • Drive initial discovery through WaitingForCleanup { InitialDiscovery } before
    returning Scout Action::Discovery
  • Preserve HostInit cleanup context across repeated NVMECleanFailed retries so a
    successful retry returns to HostInit/WaitingForDiscovery, not the deprovision flow
  • Add regression tests covering firmware-upgrade Scout boots, assigned
    discovery-image boots, and repeated HostInit cleanup failures
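
Illustrative sketch of the routing described above. This is not the code in this PR: the `State`, `Action`, and `DeprovisionComplete` names are simplified stand-ins, and only `WaitingForCleanup`, `InitialDiscovery`, `WaitingForDiscovery`, and `Action::Discovery` come from the summary. It assumes the cleanup context is carried on the waiting state and consulted when cleanup completes.

```rust
/// Why the machine is waiting for storage cleanup.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CleanupContext {
    InitialDiscovery,
    Deprovision,
}

/// Simplified, hypothetical subset of the machine states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    WaitingForCleanup(CleanupContext),
    WaitingForDiscovery,
    /// Hypothetical stand-in for the remainder of the deprovision flow.
    DeprovisionComplete,
}

/// Simplified, hypothetical subset of the actions returned to Scout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Reset,
    Discovery,
    Noop,
}

/// Scout is told to reset (and clean storage) while cleanup is pending;
/// discovery is only handed out once cleanup is done.
fn next_action(state: State) -> Action {
    match state {
        State::WaitingForCleanup(_) => Action::Reset,
        State::WaitingForDiscovery => Action::Discovery,
        State::DeprovisionComplete => Action::Noop,
    }
}

/// The context decides where a completed cleanup returns to.
fn on_cleanup_completed(state: State) -> State {
    match state {
        State::WaitingForCleanup(CleanupContext::InitialDiscovery) => State::WaitingForDiscovery,
        State::WaitingForCleanup(CleanupContext::Deprovision) => State::DeprovisionComplete,
        other => other,
    }
}
```

In this sketch a host entering through initial discovery receives `Reset` first and only receives `Discovery` after `on_cleanup_completed` has moved it to `WaitingForDiscovery`.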

Recovery context note

NVMECleanFailed recovery currently uses FailureSource::StateMachineArea(HostInit)
to remember that the failure happened during initial discovery cleanup. That keeps
this fix scoped and avoids changing serialized FailureCause shape in this PR.
A follow-up should consider storing cleanup context directly on the storage-cleanup
failure cause instead of inferring it from FailureSource.
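
A rough sketch of the inference described in this note, for reviewers who want to see the shape of it. The types are simplified assumptions rather than the real Carbide definitions: `Area`, the `Scout` variant, `DeprovisionFlow`, and `on_retry` are hypothetical, and the real `FailureSource` and `FailureCause` carry more data.

```rust
/// Hypothetical subset of state machine areas.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Area {
    HostInit,
    Other,
}

/// Simplified stand-in for the real FailureSource.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureSource {
    StateMachineArea(Area),
    /// Hypothetical variant for failures reported by Scout itself.
    Scout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CleanupContext {
    InitialDiscovery,
    Deprovision,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    NvmeCleanFailed { source: FailureSource },
    WaitingForDiscovery,
    /// Hypothetical stand-in for the deprovision flow.
    DeprovisionFlow,
}

/// The context is inferred from the failure source instead of being stored.
fn cleanup_context(source: FailureSource) -> CleanupContext {
    match source {
        FailureSource::StateMachineArea(Area::HostInit) => CleanupContext::InitialDiscovery,
        _ => CleanupContext::Deprovision,
    }
}

/// A failed retry keeps the same source, so the context survives repeated
/// failures; a successful retry routes by the inferred context.
fn on_retry(state: State, cleanup_ok: bool) -> State {
    match state {
        State::NvmeCleanFailed { source } if !cleanup_ok => State::NvmeCleanFailed { source },
        State::NvmeCleanFailed { source } => match cleanup_context(source) {
            CleanupContext::InitialDiscovery => State::WaitingForDiscovery,
            CleanupContext::Deprovision => State::DeprovisionFlow,
        },
        other => other,
    }
}
```

The follow-up suggested above would replace `cleanup_context` with a field stored on the failure cause, at the cost of changing its serialized shape.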

Dev environment validation

  • Deployed the local Carbide build into the local-dev control-plane
    environment and verified the updated pods rolled out
  • Normal delete path: provisioned a host, deleted it, observed API cleanup
    state, Scout RESET, NVMe cleanup, successful cleanup_machine_completed,
    and no destructive cleanup during the later DISCOVERY
  • Force-delete/re-ingest path: force-deleted the host, then observed the
    predicted host move through WaitingForCleanup { InitialDiscovery }, the
    actual host receive RESET, cleanup complete successfully, and the later
    DISCOVERY log the no-API cleanup skip message
  • Injected an NVMe cleanup failure in Scout for the delete path and verified
    the machine moved to NVMECleanFailed
  • Restored the non-failing Scout artifact, rebooted the host, and verified the
    machine recovered from NVMECleanFailed
  • Repeated the injected NVMe cleanup failure and recovery validation for the
    force-delete/re-ingest path
  • Verified API state transitions and Scout cleanup behavior in Loki logs
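
For context on what the validation above exercises on the Scout side, here is a minimal sketch of the intended behavior: RESET runs storage cleanup, DISCOVERY skips it and logs that it did. The function name, the closure-based cleanup hook, and the log strings are all assumptions made for illustration; they are not Scout's actual API or log output.

```rust
/// Simplified, hypothetical subset of the actions Scout can receive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Reset,
    Discovery,
}

/// Runs one action and returns the log lines it would emit.
/// `cleanup` stands in for the NVMe/HDD/SAS cleanup routine.
fn run_scout_action<F>(action: Action, cleanup: F) -> Result<Vec<&'static str>, String>
where
    F: FnOnce() -> Result<(), String>,
{
    let mut log = Vec::new();
    match action {
        Action::Reset => {
            // Destructive cleanup happens only on the API-directed reset path.
            // An error here propagates to the caller (the injected-failure case).
            cleanup()?;
            log.push("storage cleanup completed"); // hypothetical message
        }
        Action::Discovery => {
            // Discovery never touches storage; it only records the skip.
            log.push("skipping storage cleanup during discovery"); // hypothetical message
        }
    }
    Ok(log)
}
```

In this sketch, an injected cleanup failure is simply a closure that returns `Err`, which surfaces as a failed RESET and leaves DISCOVERY unaffected.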

Test plan

  • cargo +1.90.0 fmt --all
  • git diff --check

@williampnvidia williampnvidia force-pushed the fix/scout-discovery-no-storage-cleanup branch from 9f63f6f to f96cbec May 18, 2026 21:19