
Conversation

Contributor

@andrewjstone andrewjstone commented Jan 31, 2026

This commit adds a three-phase mechanism for sled expungement.

The first phase is to remove the sled from the latest trust quorum configuration via omdb. The second phase is to poll until the configuration containing that removal has committed, then reboot the sled. The third phase is to issue the existing omdb expunge command, which changes the sled policy as before.

The first and second phases remove the need to physically pull the sled before expungement. They act as a software gate that prevents the sled-agent on that sled from restarting and doing work when the sled should be treated as "absent". We've discussed this numerous times in the update huddle and it is finally arriving!

The third phase is what informs reconfigurator that the sled is gone. It remains the same as before, except for an extra sanity check that the last committed trust quorum configuration does not contain the sled that is to be expunged.

The removed sled may be added back to this rack, or to another, after being clean-slated. I tested this in a4x2 by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled, and it worked.

This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it until the key-rotation work from #9737 is merged.

This also builds on #9741 and should merge after that PR.

I tested this out by first trying to abort and watching it fail because
there is no trust quorum configuration. Then I issued an LRTQ upgrade,
which failed because I didn't restart the sled-agents to pick up the
LRTQ shares. Then I aborted that configuration, which was stuck in prepare. Lastly,
I successfully issued a new LRTQ upgrade after restarting the sled-agents
and watched it commit.

Here are the external API calls:

```
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
error; status code: 404 Not Found
{
  "error_code": "Not Found",
  "message": "No trust quorum configuration exists for this rack",
  "request_id": "819eb6ab-3f04-401c-af5f-663bb15fb029"
}
error
➜  oxide.rs git:(main) ✗
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
{
  "members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "rack_id": "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6",
  "state": "aborted",
  "time_aborted": "2026-01-29T01:54:02.590683Z",
  "time_committed": null,
  "time_created": "2026-01-29T01:37:07.476451Z",
  "unacknowledged_members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "version": 2
}
```

Here are the omdb calls:

```
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Error: lrtq upgrade

Caused by:
    Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "8503cd68-7ff4-4bf1-b358-0e70279c6347", "content-length": "124", "date": "Thu, 29 Jan 2026 01:37:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "8503cd68-7ff4-4bf1-b358-0e70279c6347" }

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: Aborted,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: Some(
        2026-01-29T01:54:02.590683Z,
    ),
    abort_reason: Some(
        "Aborted via API request",
    ),
}

root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Started LRTQ upgrade at epoch 3

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: Committed,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: Some(
        EncryptedRackSecrets {
            salt: Salt(
                [
                    143,
                    198,
                    3,
                    63,
                    136,
                    48,
                    212,
                    180,
                    101,
                    106,
                    50,
                    2,
                    251,
                    84,
                    234,
                    25,
                    46,
                    39,
                    139,
                    46,
                    29,
                    99,
                    252,
                    166,
                    76,
                    146,
                    78,
                    238,
                    28,
                    146,
                    191,
                    126,
                ],
            ),
            data: [
                167,
                223,
                29,
                18,
                50,
                230,
                103,
                71,
                159,
                77,
                118,
                39,
                173,
                97,
                16,
                92,
                27,
                237,
                125,
                173,
                53,
                51,
                96,
                242,
                203,
                70,
                36,
                188,
                200,
                59,
                251,
                53,
                126,
                48,
                182,
                141,
                216,
                162,
                240,
                5,
                4,
                255,
                145,
                106,
                97,
                62,
                91,
                161,
                51,
                110,
                220,
                16,
                132,
                29,
                147,
                60,
            ],
        },
    ),
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 13c0a6113e55963ed35b275e49df4c3f0b3221143ea674bb1bd5188f4dac84,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.792674Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 8557d74f678fa4e8278714d917f14befd88ed1411f27c57d641d4bf6c77f3b,
            ),
            time_prepared: Some(
                2026-01-29T02:20:47.236089Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: d61888c42a1b5e83adcb5ebe29d8c6c66dc586d451652e4e1a92befe41719cd,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.809779Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:52.248351Z,
            ),
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: Some(
        2026-01-29T02:20:47.597276Z,
    ),
    time_committed: Some(
        2026-01-29T02:21:52.263198Z,
    ),
    time_aborted: None,
    abort_reason: None,
}
```
After chatting with @davepacheco, I changed the authz checks in the
datastore to do lookups with Rack scope. This fixed the test bug, but is
only a shortcut. Trust quorum should have its own authz object, and I'm
going to open an issue for that.

Additionally, for methods that already took an authorized connection, I
removed the unnecessary authz checks and opctx parameter.
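
For illustration, a minimal sketch of that change (the struct, method, and type names below are placeholders, not the PR's actual code): a method that receives a connection which is only handed out after an authz check no longer needs an opctx parameter or its own authorize call.

```rust
// Hypothetical sketch only; names and types are made up for illustration.
struct AuthzCheckedConn;   // stands in for a connection that is only obtained
                           // after the caller has already passed an authz check
struct TrustQuorumConfig;  // placeholder for the real config type
struct DataStore;

impl DataStore {
    // Before (roughly):
    //   async fn tq_insert_config(&self, opctx: &OpContext,
    //       conn: &AuthzCheckedConn, cfg: &TrustQuorumConfig) -> Result<(), String>
    //   { opctx.authorize(...).await?; /* insert */ }
    //
    // After: the authorized connection is taken as proof of authorization, so
    // both the opctx parameter and the redundant check are gone.
    async fn tq_insert_config(
        &self,
        _conn: &AuthzCheckedConn,
        _cfg: &TrustQuorumConfig,
    ) -> Result<(), String> {
        // ... perform the insert using the connection ...
        Ok(())
    }
}
```
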
@andrewjstone andrewjstone marked this pull request as draft January 31, 2026 00:20
the Reconfigurator will not yet know the sled is expunged and may \
still try to use it.

Therefore, you must treat this action in conjunction with a reboot as \
Contributor Author

This formatting is definitely off. We should consider what we want this to look like. My guess is to just remove the \ from the formatting altogether so we don't end up with very long lines and weird indents.

Contributor

Maybe either:

  • three separate println!s so we're not mixing "long strings that need \-continuations" with "I actually want newlines here"
  • \ on every line, with explicit \n\ on lines where you want newlines

?
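
For concreteness, a rough sketch of what the two options might look like (the strings and line breaks here are illustrative, not the PR's exact message):

```rust
fn main() {
    // Option 1: separate println!s, so newlines come from the macro calls and
    // no `\` continuations are needed at all.
    println!("the Reconfigurator will not yet know the sled is expunged and may still try to use it.");
    println!();
    println!("Therefore, you must treat this action in conjunction with a reboot as ...");

    // Option 2: one literal with a `\` continuation on every line, plus an
    // explicit `\n\` wherever a real newline is wanted.
    println!(
        "the Reconfigurator will not yet know the sled is expunged and may \
         still try to use it.\n\
         \n\
         Therefore, you must treat this action in conjunction with a reboot as ..."
    );
}
```
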

Contributor Author

I did the separate println! version in 2c55d34

@andrewjstone andrewjstone mentioned this pull request Jan 31, 2026
eprintln!(
"WARNING: Are you sure that you have removed this sled from the latest \
trust quorum configuration for rack {}?. Please double check with: \
`omdb nexus trust-quorum get-config <RACK_ID> latest`\n",
Contributor

Could we check this here ourselves? I know we couldn't definitively confirm "the sled is off / clean-slated" in the LRTQ case, but in this case it seems like we could?

Contributor Author

@andrewjstone andrewjstone Feb 2, 2026

We do the check in the app layer and will reject the expungement if they ignore this. There's an inherent TOCTTOU here, and I think it's valid to ask them to check their work. The more caution the better.

result
}

// `omdb nexus trust-quorum remove-sled`, but borrowing a datastore
Contributor

This seems fine but I expected there to be some other caller that already had a datastore - will there be one in the future?

Contributor Author

I don't follow what you are asking here. This is the inner method that does the work; the wrapper, cmd_nexus_trust_quorum_remove_sled, ensures datastore termination. I did this to emulate the behavior of expunge in this file.
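
A rough sketch of that wrapper/inner shape (the inner function name and types are illustrative, not the PR's actual code):

```rust
// Hypothetical sketch of the wrapper/inner split described above.
struct DataStore;
impl DataStore {
    async fn terminate(&self) {}
}

// Outer omdb command: owns the datastore and guarantees it is terminated.
async fn cmd_nexus_trust_quorum_remove_sled(
    datastore: DataStore,
) -> Result<(), String> {
    // Do the real work with a borrowed datastore, then terminate the
    // datastore regardless of outcome, mirroring how expunge is structured
    // in this file.
    let result = trust_quorum_remove_sled_with_datastore(&datastore).await;
    datastore.terminate().await;
    result
}

// Inner function that does the actual work while borrowing the datastore.
async fn trust_quorum_remove_sled_with_datastore(
    _datastore: &DataStore,
) -> Result<(), String> {
    Ok(())
}
```
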

println!(
"About to start the trust quorum reconfiguration to remove the sled.

If this operation with a timeout, please check the latest trust quorum \
Contributor

If this operation with a timeout,

Missing a word here?

Contributor Author

Thanks! Fixed in 2c55d34

expungement.

You can poll the trust quorum reconfiguration with \
`omdb nexus trust-quorum get-config <RACK_ID> <EPOCH | latest>`\n"
Contributor

This last bit looks duplicated by the next println!

Contributor Author

Good call. I removed the following println!

.context("trust quorum remove sled")?
.into_inner();

println!("Started trust quorum reconfiguration at epoch {epoch}\n");
Contributor

Could / should we go into a polling loop here so the operator doesn't have to do it manually (in the happy path)?

Contributor Author

I thought we discussed this and explicitly decided to make this operation not block. I'm not sure how we can tell the happy path from the sad path here besides a timeout, and I'm not sure what timeout to use.

While I would really like to make all these operations one shot and simple to use, they inherently are not that way.

let rack_id = RackUuid::from_untyped_uuid(sled.rack_id);

// If the sled still exists in the latest committed trust quorum
// configuration, it cannot be expunged.
Contributor

What does this do on racks that are still running in LRTQ?

Contributor Author

An error will be returned:

return Err(Error::invalid_request(format!(
       "Missing trust quorum configurations for rack {rack_id}. \
        Upgrade to trust quorum required."
)));
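
Putting both cases together, a rough sketch of the expungement sanity check (placeholder types and helper names; the first error message is quoted from the reply above, the second is illustrative):

```rust
// Hypothetical sketch of the expungement sanity check; types are placeholders.
use std::collections::BTreeMap;

struct Error(String);
impl Error {
    fn invalid_request(msg: String) -> Self { Error(msg) }
}

struct TrustQuorumConfig {
    // Keyed by baseboard id in the real code.
    members: BTreeMap<String, ()>,
}

fn check_sled_removed_from_trust_quorum(
    rack_id: &str,
    sled_baseboard: &str,
    latest_committed: Option<&TrustQuorumConfig>,
) -> Result<(), Error> {
    // Racks still running LRTQ have no real trust quorum configuration at all.
    let Some(config) = latest_committed else {
        return Err(Error::invalid_request(format!(
            "Missing trust quorum configurations for rack {rack_id}. \
             Upgrade to trust quorum required."
        )));
    };

    // If the sled is still a member of the latest committed configuration, it
    // has not yet been removed via trust quorum reconfiguration, so it cannot
    // be expunged.
    if config.members.contains_key(sled_baseboard) {
        return Err(Error::invalid_request(format!(
            "sled {sled_baseboard} is still in the latest committed trust \
             quorum configuration for rack {rack_id}"
        )));
    }

    Ok(())
}
```
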

else {
return Err(Error::internal_error(&format!(
"Cannot retrieve newly inserted trust quorum \
configuration for rack {rack_id}, epoch {new_epoch}."
Contributor

rustfmt sigh

Suggested change
configuration for rack {rack_id}, epoch {new_epoch}."
configuration for rack {rack_id}, epoch {new_epoch}."

Contributor Author

Fixed in 2c55d34


// Now send the reconfiguration request to the coordinator. We do
// this directly in the API handler because this is a non-idempotent
// operation and we only want to issue it once.
Contributor

How do we proceed if we've inserted the latest config in the db, but fail before successfully sending the reconfigure operation? (Or if the sled receives the reconfigure then immediately crashes or whatever)

Contributor Author

This is exactly the reason the abort operation exists. The problem there is that it's difficult (if not impossible) to always abort correctly in an automated fashion. So the customer has to be like: "I've been waiting 20 minutes, wtf is going on" and then try to abort. This is easier to tell with full trust quorum status in omdb since we differentiate prepare from commit phases. And you can also get some information by looking at inventory.

Once you abort, you can go ahead and try to expunge again. A new coordinator will be chosen randomly from members of the latest committed trust quorum.
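
A tiny sketch of that re-selection (illustrative only; it assumes the committed membership lives in an ordered set and uses rand's IteratorRandom, which may not match the PR's actual mechanism):

```rust
use rand::seq::IteratorRandom;
use std::collections::BTreeSet;

// Hypothetical sketch: after an abort, a retried reconfiguration picks a new
// coordinator uniformly at random from the latest committed membership.
fn pick_coordinator<'a, R: rand::Rng>(
    committed_members: &'a BTreeSet<String>,
    rng: &mut R,
) -> Option<&'a String> {
    committed_members.iter().choose(rng)
}
```
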

Base automatically changed from tq-abort to main February 3, 2026 00:49