[inventory] Add unhealthy zpools from each sled #9615
base: main
Conversation
illumos-utils/src/zpool.rs (outdated)
#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub struct UnhealthyZpoolsResult {
    pub zpools: Vec<String>,
I went back and forth between just having a list of unhealthy zpools and associating each zpool with its state. In the end I went with listing the zpools only, but I'm not convinced. We'll be including the health check information in the support bundle, and it'd be useful for them to be able to see what state each zpool is in. Thoughts? @davepacheco @jgallagher
Associating each zpool with its state sounds good to me; having an explicit entry for "this zpool was healthy" seems safer than inferring "any zpool that isn't explicitly listed as unhealthy must have been healthy".
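As a rough illustration of that suggestion, one possible shape pairs each pool with its reported state; the type and variant names below are assumptions for the sketch, not what the PR actually adopts:

use serde::{Deserialize, Serialize};

// Hypothetical sketch only: pair each zpool with its reported state so a
// pool's health (including "online") is always explicit.
#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize)]
#[serde(rename_all = "snake_case")]
pub enum ZpoolState {
    Online,
    Degraded,
    Faulted,
    Offline,
    Unavail,
    Removed,
}

#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize)]
#[serde(rename_all = "snake_case")]
pub struct ZpoolHealthEntry {
    pub zpool: String,
    pub state: ZpoolState,
}

#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize)]
#[serde(rename_all = "snake_case")]
pub struct ZpoolHealthResult {
    pub zpools: Vec<ZpoolHealthEntry>,
}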
Hmmm, I was thinking of only including the "unhealthy" zpools with their associated statuses in this list. Similarly, for svcs_in_maintenance I only added the services in maintenance with their associated zones. If I were to include the "healthy" zpools then it wouldn't really be consistent with the services in maintenance, no?
My take on the health checks is to only report on things that are in an unhealthy state. Thoughts?
If I were to include the "healthy" zpools then it wouldn't really be consistent with the services in maintenance no?
I get this, but for some reason it doesn't bother me? I think because (a) there are many fewer zpools, and (b) a zpool definitely can "just not be there" (e.g., if the drive is physically gone or not accessible or whatever), whereas an SMF service can't really, so it seems safer to assume "no report about the SMF service" implies "service is healthy".
My take on the health checks is to only report on things that are in an unhealthy state. Thoughts?
Maybe worth discussing live. I just remembered we already do report some zpool information in inventory, and this has some nontrivial differences that we might want to address?
- The top-level inventory zpool list is the list of zpools our sled config tells us to use; I think this branch will report all zpools that are present. (I'm vaguely inclined to say we shouldn't report an unhealthy zpool if our sled config instructs us not to use that disk - it being unhealthy seems likely to be the reason we dropped it from the config?)
- The top-level inventory zpool list runs zpool ... commands on demand when an /inventory request comes in; this seems much worse than this branch's "run periodically in the background, and /inventory gets to clone the contents of a watch channel" pattern.
- The top-level inventory zpool list gathers each zpool's name and total size, but doesn't include any other status information, even though the zpool ... command we're invoking returns it.
I'm wondering if it would make sense to reconcile these differences and then remove the top-level inventory zpools entry, and update any consumers of it to consume the health information instead?
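A minimal sketch of that background pattern, assuming tokio's watch channel; the task, interval length, and type names here are illustrative, not this branch's actual code:

use std::time::Duration;
use tokio::sync::watch;

// Hypothetical snapshot type; the real health monitor carries richer data.
#[derive(Debug, Clone, Default)]
struct ZpoolHealthSnapshot {
    unhealthy: Vec<String>,
}

async fn check_zpools() -> ZpoolHealthSnapshot {
    // Placeholder for the real `zpool ...` invocation and parsing.
    ZpoolHealthSnapshot::default()
}

fn spawn_health_monitor() -> watch::Receiver<ZpoolHealthSnapshot> {
    let (tx, rx) = watch::channel(ZpoolHealthSnapshot::default());
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(60));
        loop {
            ticker.tick().await;
            // Refresh in the background; an /inventory handler only clones
            // the latest snapshot out of the channel instead of running
            // `zpool` itself.
            if tx.send(check_zpools().await).is_err() {
                break; // all receivers dropped
            }
        }
    });
    rx
}

// An /inventory handler would then do something like:
//     let snapshot = rx.borrow().clone();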
I'm not that familiar with which consumers use the current zpool information. But I wonder if it makes sense to add more information to the health checks than is necessary for their purpose. In the case of services in maintenance we stripped a whole bunch of information out just to report a list of services with their zones.
That said, this sounds like an interesting idea.
I'm vaguely inclined to say we shouldn't report an unhealthy zpool if our sled config instructs us not to use that disk - it being unhealthy seems likely to be the reason we dropped it from the config?
What happens with zpools that would be in this state?
The top-level inventory zpool list runs zpool ... commands on demand when an /inventory request comes in; this seems much worse than this branch's "run periodically in the background, and /inventory gets to clone the contents of a watch channel" pattern.
Yeah, it feels messy to be calling zpool in different places for the same inventory collection.
But yeah, better to chat about this live. I'll be at the next update watercooler!
Thanks for taking a look @jgallagher @davepacheco! I think I've addressed all of the comments and this is ready for another review. I also incorporated the feedback from #9589 (comment) here. I have updated the PR's description to show the output of the health checks.
In the end I decided to make the SMF services in maintenance an Option<Result<SvcsInMaintenanceResult, String>>.
if let Some(svcs) = &health_monitor.smf_services_in_maintenance {
    match svcs {
        Ok(svcs) => {
            if !svcs.is_empty() {
                if let Some(time_of_status) = &svcs.time_of_status {
                    writeln!(
                        indent2,
                        "SMF services in maintenance at {}:",
                        time_of_status.to_rfc3339_opts(
                            SecondsFormat::Millis,
                            /* use_z */ true,
                        )
                    )?;
                }
                let mut indent3 =
                    IndentWriter::new(" ", &mut indent2);
                for svc in &svcs.services {
                    writeln!(indent3, "{svc}")?;
                }
            }
        }
        Err(e) => {
            writeln!(
                indent2,
                "failed to retrieve SMF services in maintenance: {e}"
            )?;
        }
    }
}
I am changing all of this in https://github.com/oxidecomputer/omicron/pull/9589/files#diff-7aa92821ab1567814797c276e521417c9ecb6ee700d2b76679ea5e3daedc5904R1105-R1183 anyway. These changes are only there to avoid compilation errors with the new Option<Result<SvcsInMaintenanceResult, String>> type.
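As a side note, this is roughly how a consumer might read the three states of that Option<Result<SvcsInMaintenanceResult, String>> field (None: the checks haven't run yet; Some(Ok): a report, possibly empty; Some(Err): the check itself failed). The stand-in struct below is an assumption so the sketch compiles on its own:

// Stand-in only: the real result pairs each service's FMRI with its zone
// and carries a timestamp.
struct SvcsInMaintenanceResult {
    services: Vec<String>,
}

fn describe(svcs: &Option<Result<SvcsInMaintenanceResult, String>>) -> String {
    match svcs {
        // The health monitor hasn't produced a report yet.
        None => "health monitor checks haven't run yet".to_string(),
        // The check ran; an empty list means everything is healthy.
        Some(Ok(result)) => {
            format!("{} service(s) in maintenance", result.services.len())
        }
        // The check ran but failed; surface the error text.
        Some(Err(e)) => format!("failed to check SMF services: {e}"),
    }
}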
@davepacheco @jgallagher tiny ping to see if I could get some eyes on this?
#[serde(rename_all = "snake_case")]
#[serde(rename = "OptionResult{T}Or{E}")]
#[serde(untagged)]
pub enum SnakeCaseOptionResult<T, E> {
Oof, sorry, I didn't realize proposing Option<Result<_, _>> would lead to this.
Would it be cleaner to squish this down into an enum of our own? Something like this? (I don't have strong feelings either way.)
enum HealthMonitorResult<T> {
    NotYetKnown,
    Ok(T),
    Err(String),
}
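For reference, a tagged rendering of that suggestion could use serde's adjacent tagging, which is what produces the "result"/"value" shape visible in the curl output further down; this is an illustrative sketch, not code from the PR:

use serde::Serialize;

// Illustrative only: an adjacently tagged version of the suggested enum.
#[derive(Debug, Serialize)]
#[serde(tag = "result", content = "value", rename_all = "snake_case")]
enum HealthMonitorResult<T> {
    NotYetKnown,
    Ok(T),
    Err(String),
}

fn main() {
    let ok: HealthMonitorResult<Vec<String>> =
        HealthMonitorResult::Ok(vec!["svc:/site/fake-service:default".into()]);
    // Prints: {"result":"ok","value":["svc:/site/fake-service:default"]}
    println!("{}", serde_json::to_string(&ok).unwrap());

    // A unit variant serializes with the tag only: {"result":"not_yet_known"}
    let pending: HealthMonitorResult<Vec<String>> = HealthMonitorResult::NotYetKnown;
    println!("{}", serde_json::to_string(&pending).unwrap());
}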
Hm, the code is cleaner yes, but the OpenAPI linter forces me to use tags and the result of this ends up being a somewhat weird looking API 😄
I tried out how the API would look using the enum with services in maintenance and the Option<Result<T>> with the unhealthy zpools, and I think the API is much cleaner when using the Option<Result<T>>:
$ curl -H "api-version: 18.0.0" http://[::1]:55444/inventory | jq .health_monitor
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 21757 100 21757 0 0 8901k 0 --:--:-- --:--:-- --:--:-- 10.3M
{
"smf_services_in_maintenance": {
"result": "ok",
"value": {
"services": [
{
"fmri": "svc:/site/fake-service2:default",
"zone": "global"
},
{
"fmri": "svc:/site/fake-service:default",
"zone": "global"
}
],
"errors": [],
"time_of_status": "2026-02-03T03:56:03.594752466Z"
}
},
"unhealthy_zpools": {
"ok": {
"zpools": [
{
"zpool": "fakepool1",
"state": "degraded"
},
{
"zpool": "fakepool2",
"state": "degraded"
}
],
"errors": [],
"time_of_status": "2026-02-03T03:56:03.560975738Z"
}
}
}
If it's not super important, perhaps we can keep it as is?
This is the second PR for #9412.
This commit adds a list of unhealthy zpools to the sled agent inventory's health monitor. In a follow-up PR, this information will be added to the DB.
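The collection code itself isn't shown in this description; as a rough sketch of the kind of check involved (an illustration, not necessarily what this PR does), one could shell out to zpool list -H -o name,health and keep every pool whose health isn't ONLINE:

use std::process::Command;

// Rough illustration only: list pools whose reported health isn't ONLINE.
fn unhealthy_zpools() -> std::io::Result<Vec<(String, String)>> {
    // -H suppresses headers and uses tab separators; -o restricts the
    // output to the pool name and its health property.
    let output = Command::new("zpool")
        .args(["list", "-H", "-o", "name,health"])
        .output()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    let mut unhealthy = Vec::new();
    for line in stdout.lines() {
        let mut fields = line.split('\t');
        if let (Some(name), Some(health)) = (fields.next(), fields.next()) {
            if health != "ONLINE" {
                unhealthy.push((name.to_string(), health.to_string()));
            }
        }
    }
    Ok(unhealthy)
}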
Successful unhealthy zpool and SMF service in maintenance retrieval:
Response contains errors:
All zpools and services are healthy:
Health monitor checks aren't running: