TQ: Add External API for aborting a membership change #9741

Conversation
I tested this out by first trying to abort and watching it fail because
there was no trust quorum configuration. Then I issued an LRTQ upgrade,
which got stuck because I didn't restart the sled-agents to pick up the
LRTQ shares. Then I aborted that configuration, which was stuck in prepare.
Lastly, I restarted the sled-agents, successfully issued a new LRTQ
upgrade, and watched it commit.
Here are the external API calls:
```
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
error; status code: 404 Not Found
{
"error_code": "Not Found",
"message": "No trust quorum configuration exists for this rack",
"request_id": "819eb6ab-3f04-401c-af5f-663bb15fb029"
}
error
➜ oxide.rs git:(main) ✗
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
{
"members": [
{
"part_number": "913-0000019",
"serial_number": "20000000"
},
{
"part_number": "913-0000019",
"serial_number": "20000001"
},
{
"part_number": "913-0000019",
"serial_number": "20000003"
}
],
"rack_id": "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6",
"state": "aborted",
"time_aborted": "2026-01-29T01:54:02.590683Z",
"time_committed": null,
"time_created": "2026-01-29T01:37:07.476451Z",
"unacknowledged_members": [
{
"part_number": "913-0000019",
"serial_number": "20000000"
},
{
"part_number": "913-0000019",
"serial_number": "20000001"
},
{
"part_number": "913-0000019",
"serial_number": "20000003"
}
],
"version": 2
}
```
Here are the omdb calls:
```
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Error: lrtq upgrade
Caused by:
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "8503cd68-7ff4-4bf1-b358-0e70279c6347", "content-length": "124", "date": "Thu, 29 Jan 2026 01:37:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "8503cd68-7ff4-4bf1-b358-0e70279c6347" }
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
2,
),
last_committed_epoch: None,
state: PreparingLrtqUpgrade,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T01:37:07.476451Z,
time_committing: None,
time_committed: None,
time_aborted: None,
abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
2,
),
last_committed_epoch: None,
state: Aborted,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T01:37:07.476451Z,
time_committing: None,
time_committed: None,
time_aborted: Some(
2026-01-29T01:54:02.590683Z,
),
abort_reason: Some(
"Aborted via API request",
),
}
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Started LRTQ upgrade at epoch 3
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
3,
),
last_committed_epoch: None,
state: PreparingLrtqUpgrade,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T02:20:03.848507Z,
time_committing: None,
time_committed: None,
time_aborted: None,
abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
3,
),
last_committed_epoch: None,
state: Committed,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: Some(
EncryptedRackSecrets {
salt: Salt(
[
143,
198,
3,
63,
136,
48,
212,
180,
101,
106,
50,
2,
251,
84,
234,
25,
46,
39,
139,
46,
29,
99,
252,
166,
76,
146,
78,
238,
28,
146,
191,
126,
],
),
data: [
167,
223,
29,
18,
50,
230,
103,
71,
159,
77,
118,
39,
173,
97,
16,
92,
27,
237,
125,
173,
53,
51,
96,
242,
203,
70,
36,
188,
200,
59,
251,
53,
126,
48,
182,
141,
216,
162,
240,
5,
4,
255,
145,
106,
97,
62,
91,
161,
51,
110,
220,
16,
132,
29,
147,
60,
],
},
),
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: 13c0a6113e55963ed35b275e49df4c3f0b3221143ea674bb1bd5188f4dac84,
),
time_prepared: Some(
2026-01-29T02:20:46.792674Z,
),
time_committed: Some(
2026-01-29T02:21:49.503179Z,
),
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: 8557d74f678fa4e8278714d917f14befd88ed1411f27c57d641d4bf6c77f3b,
),
time_prepared: Some(
2026-01-29T02:20:47.236089Z,
),
time_committed: Some(
2026-01-29T02:21:49.503179Z,
),
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: d61888c42a1b5e83adcb5ebe29d8c6c66dc586d451652e4e1a92befe41719cd,
),
time_prepared: Some(
2026-01-29T02:20:46.809779Z,
),
time_committed: Some(
2026-01-29T02:21:52.248351Z,
),
},
},
time_created: 2026-01-29T02:20:03.848507Z,
time_committing: Some(
2026-01-29T02:20:47.597276Z,
),
time_committed: Some(
2026-01-29T02:21:52.263198Z,
),
time_aborted: None,
abort_reason: None,
}
```
jgallagher left a comment:
Handler LGTM; will defer to @ahl for the phrasing of the external API endpoint.
```rust
    tags = ["experimental"],
    versions = VERSION_TRUST_QUORUM_ABORT_CONFIG..
}]
async fn rack_membership_abort(
```
Should this take a specific `RackMembershipVersionParam` like `..._status` does, instead of implicitly aborting the latest?
You can only abort the latest membership, as only one trust quorum reconfiguration is allowed at a time.

I can see how that could be worrisome in the case of dueling administrators. However, an abort can only occur during the trust quorum prepare phase, which should be very short (in a future PR I need to activate the bg task immediately rather than waiting on a timeout), so it's hard to do the wrong thing on human timescales. Additionally, even if an abort were issued by accident: no harm, no foul. An admin can just reissue the last command to add the sleds again and it will work.
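To make the "latest only" semantics concrete, here is a minimal self-contained sketch of the guard the handler effectively applies; the names are illustrative stand-ins, not the actual Nexus types, and the error strings echo the transcript above:

```rust
#[derive(Debug, PartialEq)]
enum ConfigState {
    Preparing, // e.g. PreparingLrtqUpgrade in the omdb output above
    Committing,
    Committed,
    Aborted,
}

/// Abort only applies to the latest configuration, and only while it is
/// still in the prepare phase (one plausible reading of the semantics).
fn can_abort(latest: Option<&ConfigState>) -> Result<(), &'static str> {
    match latest {
        None => Err("No trust quorum configuration exists for this rack"),
        Some(ConfigState::Preparing) => Ok(()),
        Some(_) => Err("latest configuration is not in the prepare phase"),
    }
}

fn main() {
    assert!(can_abort(None).is_err()); // the 404 case from the transcript
    assert!(can_abort(Some(&ConfigState::Preparing)).is_ok());
    assert!(can_abort(Some(&ConfigState::Committed)).is_err());
}
```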
After chatting with @davepacheco, I changed the authz checks in the datastore to do lookups with Rack scope. This fixed the test bug, but it is only a shortcut: trust quorum should have its own authz object, and I'm going to open an issue for that. Additionally, for methods that already took an authorized connection, I removed the unnecessary authz checks and the opctx parameter.
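A minimal sketch of the "authorized connection" pattern described above, with stand-in types rather than the real omicron ones: inner datastore helpers take the connection alone, and possession of it is the proof that an authz check already happened.

```rust
struct OpContext;       // stand-in for the real OpContext
struct AuthorizedConn;  // stand-in: only obtainable via authorize()

// Stand-in for the Rack-scoped lookup + authz check done at the entry point.
fn authorize(_opctx: &OpContext) -> Result<AuthorizedConn, &'static str> {
    Ok(AuthorizedConn)
}

// Inner helper: no opctx parameter and no redundant authz check, because the
// caller could only have obtained `AuthorizedConn` through authorize().
fn tq_abort_on_conn(_conn: &AuthorizedConn) -> Result<(), &'static str> {
    Ok(())
}

fn main() {
    let opctx = OpContext;
    let conn = authorize(&opctx).expect("authz check failed");
    tq_abort_on_conn(&conn).expect("abort failed");
}
```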
This commit adds a three-phase mechanism for sled expungement:

1. Remove the sled from the latest trust quorum configuration via omdb.
2. Reboot the sled after polling for the trust quorum removal to commit.
3. Issue the existing omdb expunge command, which changes the sled policy as before.

The first and second phases remove the need to physically pull the sled before expungement. They act as a software mechanism that gates the sled-agent from restarting on the sled and doing work when it should be treated as "absent". We've discussed this numerous times in the update huddle and it is finally arriving! The third phase is what informs reconfigurator that the sled is gone; it remains the same except for an extra sanity check that the last committed trust quorum configuration does not contain the sled that is to be expunged (see the sketch below). The removed sled may be added back to this rack or another after being clean-slated. I tested this by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled in a4x2, and it worked.

This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it until the key-rotation work from #9737 is merged. This also builds on #9741 and should merge after that PR.
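A minimal sketch of the phase-three sanity check mentioned above, with stand-in types in place of the real omicron ones: expungement refuses to proceed while the sled is still a member of the last committed trust quorum configuration.

```rust
use std::collections::BTreeSet;

// Stand-in for the real BaseboardId type.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct BaseboardId {
    part_number: String,
    serial_number: String,
}

#[derive(Debug)]
enum ExpungeError {
    // Phases 1 and 2 must finish first: remove the sled from the trust
    // quorum configuration and wait for that removal to commit.
    StillInTrustQuorum(BaseboardId),
}

fn check_sled_removed_from_tq(
    committed_members: &BTreeSet<BaseboardId>,
    sled: &BaseboardId,
) -> Result<(), ExpungeError> {
    if committed_members.contains(sled) {
        return Err(ExpungeError::StillInTrustQuorum(sled.clone()));
    }
    Ok(())
}

fn main() {
    let sled = BaseboardId {
        part_number: "913-0000019".into(),
        serial_number: "20000003".into(),
    };
    let members = BTreeSet::from([sled.clone()]);
    // Expunging a sled still in the committed membership must fail.
    assert!(check_sled_removed_from_tq(&members, &sled).is_err());
}
```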
This looks great. It has me thinking about how we explain all this to customers... keeping in mind that only a small handful of operators at each customer will ever be exposed to this workflow, and it's exactly the kind of uncommon workflow for which a console/UI wizard-style interface makes sense (i.e., the API is 99.99% going to be used by the console).
```
add a sled <---------+
    |                |
    v                |
  abort <---- wait   |
    |          |     |
    v          v     |
  wait ---> complete |
    |                |
    +----------------+
```
I don't know that that's particularly illustrative or accurate, but an operator would add some pile of sleds; they might abort that operation if it... gets stuck? or they wait for it to complete. If they try to add more sleds before it completes... I think that API call fails. At any time one can get status which tells us the current "rack membership"--essentially, which sleds are part of the resource pool.
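For what it's worth, the "wait" boxes in that diagram reduce to a status-polling loop. A self-contained sketch, with illustrative names rather than the real SDK types:

```rust
use std::{thread, time::Duration};

#[derive(Debug, Clone, Copy, PartialEq)]
enum ChangeState {
    Preparing, // membership change still gathering prepare acks
    Committed, // the added sleds are now part of the resource pool
    Aborted,   // an operator aborted the change
}

/// Poll the membership status until the change leaves the prepare phase;
/// it settles as either Committed or Aborted.
fn wait_for_settled(mut poll_status: impl FnMut() -> ChangeState) -> ChangeState {
    loop {
        match poll_status() {
            ChangeState::Preparing => thread::sleep(Duration::from_secs(1)),
            settled => return settled,
        }
    }
}

fn main() {
    // Simulate a change that commits on the third poll.
    let mut polls = 0;
    let settled = wait_for_settled(|| {
        polls += 1;
        if polls < 3 { ChangeState::Preparing } else { ChangeState::Committed }
    });
    assert_eq!(settled, ChangeState::Committed);
}
```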
```
probe_delete           DELETE  /experimental/v1/probes/{probe}
probe_list             GET     /experimental/v1/probes
probe_view             GET     /experimental/v1/probes/{probe}
rack_membership_abort  POST    /v1/system/hardware/racks/{rack_id}/membership/abort
```
This is what we discussed, and I think it makes sense adjacent to `add`.
```rust
    req: TypedBody<params::RackMembershipAddSledsRequest>,
) -> Result<HttpResponseOk<RackMembershipStatus>, HttpError>;

/// Abort the latest rack membership change
```
- Worth documenting whether this operation is synchronous or asynchronous?
- What are the semantics if the latest operation was already completed (does the abort error or succeed)?
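For reference, piecing together the fragments quoted in this review with the path from the endpoint table above, the new trait method plausibly has roughly this shape; the path-parameter type and parameter names are my assumptions, not confirmed by the diff:

```rust
/// Abort the latest rack membership change
#[endpoint {
    method = POST,
    path = "/v1/system/hardware/racks/{rack_id}/membership/abort",
    tags = ["experimental"],
    versions = VERSION_TRUST_QUORUM_ABORT_CONFIG..
}]
async fn rack_membership_abort(
    rqctx: RequestContext<Self::Context>,
    path_params: Path<params::RackPath>, // assumed: carries rack_id
) -> Result<HttpResponseOk<RackMembershipStatus>, HttpError>;
```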