
Conversation

@karencfv (Contributor) commented Jan 9, 2026

This is the second PR for #9412

This commit adds a list of unhealthy zpools to the sled agent inventory's health monitor. In a follow-up PR this information will be added to the DB.

Successful retrieval of unhealthy zpools and SMF services in maintenance:

$ zpool list -Hpo health,name
DEGRADED        fakepool1
DEGRADED        fakepool2
ONLINE  rpool

$ curl -H "api-version: 16.0.0"  http://[::1]:41364/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          },
          {
            "fmri": "svc:/site/fake-service:default",
            "zone": "global"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.713176608Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [
          {
            "zpool": "fakepool1",
            "state": "degraded"
          },
          {
            "zpool": "fakepool2",
            "state": "degraded"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.690492960Z"
      }
    }
  }
}

Response contains errors:

$ curl -H "api-version: 16.0.0"  http://[::1]:38614/inventory | jq
<...>
"health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [],
        "errors": [
          <...>
          "Found a service with an unknown state: global",
          "Found a service with an unknown state: global"
        ],
        "time_of_status": "2026-01-19T06:08:11.141551661Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [],
        "errors": [
          "Failed to parse output: Unrecognized zpool 'health': fakepool1",
          "Failed to parse output: Unrecognized zpool 'health': fakepool2",
          "Failed to parse output: Unrecognized zpool 'health': rpool"
        ],
        "time_of_status": "2026-01-19T06:08:11.133593564Z"
      }
    }
  }
}
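
(The "Failed to parse output" errors above come from reading the `zpool list -Hpo health,name` output shown earlier. As a rough, hedged sketch of what such parsing could look like — illustrative only, not the sled agent's actual code, and with a made-up function name:)

    // Hypothetical sketch of parsing `zpool list -Hpo health,name` output into
    // a list of (pool name, state) pairs for unhealthy pools plus any parse
    // errors. Illustrative only; not the sled agent's actual implementation.
    fn parse_unhealthy_zpools(output: &str) -> (Vec<(String, String)>, Vec<String>) {
        let mut unhealthy = Vec::new();
        let mut errors = Vec::new();
        for line in output.lines() {
            // `-H` prints the requested columns (health, name) separated by tabs.
            let mut cols = line.split_whitespace();
            match (cols.next(), cols.next()) {
                (Some(health), Some(name)) if health != "ONLINE" => {
                    unhealthy.push((name.to_string(), health.to_lowercase()));
                }
                (Some(_), Some(_)) => {} // healthy pool, nothing to report
                _ => errors.push(format!("Failed to parse output: {line}")),
            }
        }
        (unhealthy, errors)
    }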

All zpools and services are healthy:

$ zpool list -Hpo health,name
ONLINE  rpool

$ curl -H "api-version: 16.0.0"  http://[::1]:39273/inventory | jq
<...>
"health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [],
        "errors": [],
        "time_of_status": "2026-01-19T06:14:24.219297126Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [],
        "errors": [],
        "time_of_status": "2026-01-19T06:14:24.194600484Z"
      }
    }
  }
}

Health monitor checks aren't running:

$ curl -H "api-version: 16.0.0"  http://[::1]:40386/inventory | jq
<...>
"health_monitor": {
  "smf_services_in_maintenance": null,
  "unhealthy_zpools": null
}
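
(For reference, a rough sketch of the SMF-side payload implied by the JSON above. SvcsInMaintenanceResult is named in the discussion below; the SmfServiceInMaintenance name and exact field types are assumptions, not copied from the omicron source. The ok/err/null envelope comes from the Option/Result-style wrapper discussed further down.)

    use chrono::{DateTime, Utc};
    use schemars::JsonSchema;
    use serde::{Deserialize, Serialize};

    // Sketch only: field and type names are inferred, not authoritative.
    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct SvcsInMaintenanceResult {
        /// SMF services currently in the maintenance state, with their zone.
        pub services: Vec<SmfServiceInMaintenance>,
        /// Non-fatal problems hit while collecting the data.
        pub errors: Vec<String>,
        /// When the status was last gathered; None until the first check runs.
        pub time_of_status: Option<DateTime<Utc>>,
    }

    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct SmfServiceInMaintenance {
        pub fmri: String,
        pub zone: String,
    }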

@karencfv marked this pull request as ready for review January 12, 2026 07:29
#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub struct UnhealthyZpoolsResult {
    pub zpools: Vec<String>,
@karencfv (Contributor, Author):

I went back and forth between just having a list of unhealthy zpools, or associating each zpool with its state. In the end I went with listing the zpools only, but I'm not convinced. We'll be including the health check information in the support bundle, and it'd be useful to be able to see what state each zpool is in. Thoughts? @davepacheco @jgallagher

Contributor:

Associating each zpool with its state sounds good to me; having an explicit entry for "this zpool was healthy" seems safer than inferring "any zpool that isn't explicitly listed as unhealthy must have been healthy".
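
(For illustration, one possible shape for "each zpool with its state", matching the {"zpool": ..., "state": ...} entries in the PR description's JSON. UnhealthyZpoolsResult appears in the excerpt above; the UnhealthyZpool name and field types are assumptions, not the actual PR diff.)

    use chrono::{DateTime, Utc};
    use schemars::JsonSchema;
    use serde::{Deserialize, Serialize};

    // Illustrative sketch; names are not taken from the actual PR diff.
    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct UnhealthyZpool {
        pub zpool: String,
        /// Health as reported by `zpool list`, e.g. "degraded" or "faulted".
        pub state: String,
    }

    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct UnhealthyZpoolsResult {
        pub zpools: Vec<UnhealthyZpool>,
        pub errors: Vec<String>,
        pub time_of_status: Option<DateTime<Utc>>,
    }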

@karencfv (Contributor, Author):

Hmmm, I was thinking of only including the "unhealthy" zpools with their associated statuses in this list. Similarly with the svcs_in_maintenance I only added the services in maintenance with their associated zones. If I were to include the "healthy" zpools then it wouldn't really be consistent with the services in maintenance no?

My take on the health checks is to only report on things that are in an unhealthy state. Thoughts?

Contributor:

> If I were to include the "healthy" zpools then it wouldn't really be consistent with the services in maintenance, no?

I get this, but for some reason it doesn't bother me? I think because (a) there are many fewer zpools, and (b) a zpool definitely can "just not be there" (e.g., if the drive is physically gone or not accessible or whatever), whereas an SMF service can't really, so it seems safer to assume "no report about the SMF service" implies "service is healthy".

> My take on the health checks is to only report on things that are in an unhealthy state. Thoughts?

Maybe worth discussing live. I just remembered we already do report some zpool information in inventory, and this has some nontrivial differences that we might want to address?

  1. The top-level inventory zpool list only contains the zpools our sled config tells us to use; I think this branch will report all zpools that are present. (I'm vaguely inclined to say we shouldn't report an unhealthy zpool if our sled config instructs us not to use that disk - it being unhealthy seems likely to be the reason we dropped it from the config?)
  2. The top-level inventory zpool list runs zpool ... commands on demand when an /inventory request comes in; this seems much worse than this branch's "run periodically in the background, and /inventory gets to clone the contents of a watch channel" pattern (sketched below).
  3. The top-level inventory zpool list gathers each zpool's name and total size, but doesn't include any other status information, even though the zpool ... command we're invoking returns it.

I'm wondering if it would make sense to reconcile these differences and then remove the top-level inventory zpools entry, and update any consumers of it to consume the health information instead?
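
(A minimal sketch of the "run periodically in the background, /inventory clones a watch channel" pattern from item 2 above, assuming tokio; every name below is illustrative rather than the sled agent's actual health monitor code.)

    use std::time::Duration;
    use tokio::sync::watch;

    #[derive(Clone, Debug, Default)]
    struct HealthSnapshot {
        unhealthy_zpools: Vec<String>,
        errors: Vec<String>,
    }

    fn spawn_health_monitor() -> watch::Receiver<Option<HealthSnapshot>> {
        // None until the first check completes, mirroring the `null` fields in
        // the "checks aren't running" example in the PR description.
        let (tx, rx) = watch::channel(None);
        tokio::spawn(async move {
            let mut ticker = tokio::time::interval(Duration::from_secs(60));
            loop {
                ticker.tick().await;
                let snapshot = run_zpool_check().await; // hypothetical helper
                if tx.send(Some(snapshot)).is_err() {
                    break; // all receivers dropped; stop the background task
                }
            }
        });
        rx
    }

    async fn run_zpool_check() -> HealthSnapshot {
        // Placeholder: a real check would invoke `zpool list -Hpo health,name`.
        HealthSnapshot::default()
    }

    // An /inventory handler holding the Receiver just clones the latest value:
    //     let latest: Option<HealthSnapshot> = rx.borrow().clone();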

@karencfv (Contributor, Author):

I'm not that familiar with which consumers use the current zpool information. But I wonder if it makes sense to add more information to the health checks than is necessary for their purpose. In the case of services in maintenance we stripped a whole bunch of information out just to report a list of services with their zones.

That said, this sounds like an interesting idea.

> I'm vaguely inclined to say we shouldn't report an unhealthy zpool if our sled config instructs us not to use that disk - it being unhealthy seems likely to be the reason we dropped it from the config?

What happens with zpools that would be in this state?

> The top-level inventory zpool list runs zpool ... commands on demand when an /inventory request comes in; this seems much worse than this branch's "run periodically in the background, and /inventory gets to clone the contents of a watch channel" pattern.

Yeah, it feels messy to be calling zpool in different places for the same inventory collection.

But yeah, better to chat about this live. I'll be at the next update watercooler!

@karencfv (Contributor, Author) commented Jan 19, 2026

Thanks for taking a look, @jgallagher @davepacheco! I think I've addressed all of the comments and this is ready for another review. I also incorporated the feedback from #9589 (comment) here.

I have updated the PR's description to show the output of /inventory after the changes.

@karencfv requested a review from jgallagher January 19, 2026 02:38
@karencfv (Contributor, Author) commented Jan 19, 2026

In the end I decided to make the SMF services in maintenance an Option<T> as part of this PR. Otherwise the versioning between the two PRs was getting annoying. I edited the PR's description to include testing for them as well.

Comment on lines +916 to +942
if let Some(svcs) = &health_monitor.smf_services_in_maintenance {
    match svcs {
        Ok(svcs) => {
            if !svcs.is_empty() {
                if let Some(time_of_status) = &svcs.time_of_status {
                    writeln!(
                        indent2,
                        "SMF services in maintenance at {}:",
                        time_of_status.to_rfc3339_opts(
                            SecondsFormat::Millis,
                            /* use_z */ true,
                        )
                    )?;
                }
                let mut indent3 = IndentWriter::new(" ", &mut indent2);
                for svc in &svcs.services {
                    writeln!(indent3, "{svc}")?;
                }
            }
        }
        Err(e) => {
            writeln!(
                indent2,
                "failed to retrieve SMF services in maintenance: {e}"
            )?;
        }
    }
}
@karencfv (Contributor, Author), Jan 19, 2026:

I am changing all of this in https://github.com/oxidecomputer/omicron/pull/9589/files#diff-7aa92821ab1567814797c276e521417c9ecb6ee700d2b76679ea5e3daedc5904R1105-R1183 anyway. These changes are only here to avoid compilation errors with the new Option<Result<SvcsInMaintenanceResult, String>> type.

@karencfv (Contributor, Author) commented Feb 2, 2026

@davepacheco @jgallagher tiny ping to see if I could get some eyes on this?

#[serde(rename_all = "snake_case")]
#[serde(rename = "OptionResult{T}Or{E}")]
#[serde(untagged)]
pub enum SnakeCaseOptionResult<T, E> {
Contributor:

Oof, sorry, I didn't realize proposing Option<Result<_, _>> would lead to this.

Would it be cleaner to squish this down into an enum of our own? Something like

enum HealthMonitorResult<T> {
    NotYetKnown,
    Ok(T),
    Err(String),
}

? (I don't have strong feelings either way.)

@karencfv (Contributor, Author):

Hm, the code is cleaner, yes, but the OpenAPI linter forces me to use tags, and the result ends up being a somewhat weird-looking API 😄

I tried out how the API would look using the enum for services in maintenance and the Option<Result<T>> for the unhealthy zpools, and I think the API is much cleaner when using the Option<Result<T>>:

$ curl -H "api-version: 18.0.0"  http://[::1]:55444/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21757  100 21757    0     0  8901k      0 --:--:-- --:--:-- --:--:-- 10.3M
{
  "smf_services_in_maintenance": {
    "result": "ok",
    "value": {
      "services": [
        {
          "fmri": "svc:/site/fake-service2:default",
          "zone": "global"
        },
        {
          "fmri": "svc:/site/fake-service:default",
          "zone": "global"
        }
      ],
      "errors": [],
      "time_of_status": "2026-02-03T03:56:03.594752466Z"
    }
  },
  "unhealthy_zpools": {
    "ok": {
      "zpools": [
        {
          "zpool": "fakepool1",
          "state": "degraded"
        },
        {
          "zpool": "fakepool2",
          "state": "degraded"
        }
      ],
      "errors": [],
      "time_of_status": "2026-02-03T03:56:03.560975738Z"
    }
  }
}

If it's not super important, perhaps we can keep it as is?
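
(For anyone curious why the two shapes differ: with serde's default externally-tagged enum representation plus snake_case variant names, a Result-like wrapper already serializes as {"ok": ...} / {"err": ...}, and wrapping it in Option gives null for "checks haven't run yet". The sketch below is purely illustrative — CheckOutcome is a made-up name, not the actual SnakeCaseOptionResult, whose definition is only partially visible above.)

    use serde::{Deserialize, Serialize};

    // Illustrative only; not the PR's actual SnakeCaseOptionResult type.
    #[derive(Debug, Clone, Serialize, Deserialize)]
    #[serde(rename_all = "snake_case")]
    enum CheckOutcome<T, E> {
        // Externally tagged (serde's default), so this serializes as
        // {"ok": <T>} rather than {"result": "ok", "value": <T>}.
        Ok(T),
        Err(E),
    }

    fn main() {
        let some: Option<CheckOutcome<Vec<String>, String>> =
            Some(CheckOutcome::Ok(vec!["fakepool1".to_string()]));
        // Prints: {"ok":["fakepool1"]}
        println!("{}", serde_json::to_string(&some).unwrap());

        let none: Option<CheckOutcome<Vec<String>, String>> = None;
        // Prints: null, matching the "checks aren't running" example.
        println!("{}", serde_json::to_string(&none).unwrap());
    }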

