
Conversation

@karencfv (Contributor) commented Jan 9, 2026

This is the second PR for #9412

This commit adds a list of unhealthy zpools to the sled agent inventory's health monitor. In a follow-up PR this information will be added to the DB.

Successful retrieval of unhealthy zpools and SMF services in maintenance:

$ zpool list -Hpo health,name
DEGRADED        fakepool1
DEGRADED        fakepool2
ONLINE  rpool

$ curl -H "api-version: 16.0.0"  http://[::1]:41364/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          },
          {
            "fmri": "svc:/site/fake-service:default",
            "zone": "global"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.713176608Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [
          {
            "zpool": "fakepool1",
            "state": "degraded"
          },
          {
            "zpool": "fakepool2",
            "state": "degraded"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.690492960Z"
      }
    }
  }
}

Response contains errors:

$ curl -H "api-version: 16.0.0"  http://[::1]:38614/inventory | jq
<...>
"health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [],
        "errors": [
          <...>
          "Found a service with an unknown state: global",
          "Found a service with an unknown state: global"
        ],
        "time_of_status": "2026-01-19T06:08:11.141551661Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [],
        "errors": [
          "Failed to parse output: Unrecognized zpool 'health': fakepool1",
          "Failed to parse output: Unrecognized zpool 'health': fakepool2",
          "Failed to parse output: Unrecognized zpool 'health': rpool"
        ],
        "time_of_status": "2026-01-19T06:08:11.133593564Z"
      }
    }
  }
}
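
(The "Failed to parse output" errors above come from reading the `zpool list -Hpo health,name` output shown earlier. As a rough, hedged sketch of what such parsing could look like — illustrative only, not the sled agent's actual code, and with a made-up function name:)

    // Hypothetical sketch of parsing `zpool list -Hpo health,name` output into
    // a list of (pool name, state) pairs for unhealthy pools plus any parse
    // errors. Illustrative only; not the sled agent's actual implementation.
    fn parse_unhealthy_zpools(output: &str) -> (Vec<(String, String)>, Vec<String>) {
        let mut unhealthy = Vec::new();
        let mut errors = Vec::new();
        for line in output.lines() {
            // `-H` prints the requested columns (health, name) separated by tabs.
            let mut cols = line.split_whitespace();
            match (cols.next(), cols.next()) {
                (Some(health), Some(name)) if health != "ONLINE" => {
                    unhealthy.push((name.to_string(), health.to_lowercase()));
                }
                (Some(_), Some(_)) => {} // healthy pool, nothing to report
                _ => errors.push(format!("Failed to parse output: {line}")),
            }
        }
        (unhealthy, errors)
    }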

All zpools and services are healthy:

$ zpool list -Hpo health,name
ONLINE  rpool

$ curl -H "api-version: 16.0.0"  http://[::1]:39273/inventory | jq
<...>
"health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [],
        "errors": [],
        "time_of_status": "2026-01-19T06:14:24.219297126Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [],
        "errors": [],
        "time_of_status": "2026-01-19T06:14:24.194600484Z"
      }
    }
  }
}

Health monitor checks aren't running:

$ curl -H "api-version: 16.0.0"  http://[::1]:40386/inventory | jq
<...>
"health_monitor": {
  "smf_services_in_maintenance": null,
  "unhealthy_zpools": null
}
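
(For reference, a rough sketch of the SMF-side payload implied by the JSON above. SvcsInMaintenanceResult is named in the discussion below; the SmfServiceInMaintenance name and exact field types are assumptions, not copied from the omicron source. The ok/err/null envelope comes from the Option/Result-style wrapper discussed further down.)

    use chrono::{DateTime, Utc};
    use schemars::JsonSchema;
    use serde::{Deserialize, Serialize};

    // Sketch only: field and type names are inferred, not authoritative.
    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct SvcsInMaintenanceResult {
        /// SMF services currently in the maintenance state, with their zone.
        pub services: Vec<SmfServiceInMaintenance>,
        /// Non-fatal problems hit while collecting the data.
        pub errors: Vec<String>,
        /// When the status was last gathered; None until the first check runs.
        pub time_of_status: Option<DateTime<Utc>>,
    }

    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct SmfServiceInMaintenance {
        pub fmri: String,
        pub zone: String,
    }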

@karencfv marked this pull request as ready for review January 12, 2026 07:29
#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub struct UnhealthyZpoolsResult {
    pub zpools: Vec<String>,
@karencfv (Contributor, Author):

I went back and forth between just having a list of unhealthy zpools, or associating each zpool with its state. In the end I went with listing the zpools only, but I'm not convinced. We'll be including the health check information in the support bundle, and it'd be useful to be able to see what state each zpool is in. Thoughts? @davepacheco @jgallagher

Contributor:

Associating each zpool with its state sounds good to me; having an explicit entry for "this zpool was healthy" seems safer than inferring "any zpool that isn't explicitly listed as unhealthy must have been healthy".
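
(For illustration, one possible shape for "each zpool with its state", matching the {"zpool": ..., "state": ...} entries in the PR description's JSON. UnhealthyZpoolsResult appears in the excerpt above; the UnhealthyZpool name and field types are assumptions, not the actual PR diff.)

    use chrono::{DateTime, Utc};
    use schemars::JsonSchema;
    use serde::{Deserialize, Serialize};

    // Illustrative sketch; names are not taken from the actual PR diff.
    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct UnhealthyZpool {
        pub zpool: String,
        /// Health as reported by `zpool list`, e.g. "degraded" or "faulted".
        pub state: String,
    }

    #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)]
    #[serde(rename_all = "snake_case")]
    pub struct UnhealthyZpoolsResult {
        pub zpools: Vec<UnhealthyZpool>,
        pub errors: Vec<String>,
        pub time_of_status: Option<DateTime<Utc>>,
    }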

@karencfv (Contributor, Author):

Hmmm, I was thinking of only including the "unhealthy" zpools with their associated statuses in this list. Similarly with the svcs_in_maintenance I only added the services in maintenance with their associated zones. If I were to include the "healthy" zpools then it wouldn't really be consistent with the services in maintenance no?

My take on the health checks is to only report on things that are in an unhealthy state. Thoughts?

Contributor:

> If I were to include the "healthy" zpools then it wouldn't really be consistent with the services in maintenance, no?

I get this, but for some reason it doesn't bother me? I think because (a) there are many fewer zpools, and (b) a zpool definitely can "just not be there" (e.g., if the drive is physically gone or not accessible or whatever), whereas an SMF service can't really, so it seems safer to assume "no report about the SMF service" implies "service is healthy".

> My take on the health checks is to only report on things that are in an unhealthy state. Thoughts?

Maybe worth discussing live. I just remembered we already do report some zpool information in inventory, and this has some nontrivial differences that we might want to address?

  1. The top-level inventory zpool list only contains the zpools our sled config tells us to use; I think this branch will report all zpools that are present. (I'm vaguely inclined to say we shouldn't report an unhealthy zpool if our sled config instructs us not to use that disk - it being unhealthy seems likely to be the reason we dropped it from the config?)
  2. The top-level inventory zpool list runs zpool ... commands on demand when an /inventory request comes in; this seems much worse than this branch's "run periodically in the background, and /inventory gets to clone the contents of a watch channel" pattern (sketched below).
  3. The top-level inventory zpool list gathers each zpool's name and total size, but doesn't include any other status information, even though the zpool ... command we're invoking returns it.

I'm wondering if it would make sense to reconcile these differences and then remove the top-level inventory zpools entry, and update any consumers of it to consume the health information instead?
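
(A minimal sketch of the "run periodically in the background, /inventory clones a watch channel" pattern from item 2 above, assuming tokio; every name below is illustrative rather than the sled agent's actual health monitor code.)

    use std::time::Duration;
    use tokio::sync::watch;

    #[derive(Clone, Debug, Default)]
    struct HealthSnapshot {
        unhealthy_zpools: Vec<String>,
        errors: Vec<String>,
    }

    fn spawn_health_monitor() -> watch::Receiver<Option<HealthSnapshot>> {
        // None until the first check completes, mirroring the `null` fields in
        // the "checks aren't running" example in the PR description.
        let (tx, rx) = watch::channel(None);
        tokio::spawn(async move {
            let mut ticker = tokio::time::interval(Duration::from_secs(60));
            loop {
                ticker.tick().await;
                let snapshot = run_zpool_check().await; // hypothetical helper
                if tx.send(Some(snapshot)).is_err() {
                    break; // all receivers dropped; stop the background task
                }
            }
        });
        rx
    }

    async fn run_zpool_check() -> HealthSnapshot {
        // Placeholder: a real check would invoke `zpool list -Hpo health,name`.
        HealthSnapshot::default()
    }

    // An /inventory handler holding the Receiver just clones the latest value:
    //     let latest: Option<HealthSnapshot> = rx.borrow().clone();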

@karencfv (Contributor, Author):

I'm not that familiar with which consumers use the current zpool information. But I wonder if it makes sense to add more information to the health checks than is necessary for their purpose. In the case of services in maintenance we stripped a whole bunch of information out just to report a list of services with their zones.

That said, this sounds like an interesting idea.

> I'm vaguely inclined to say we shouldn't report an unhealthy zpool if our sled config instructs us not to use that disk - it being unhealthy seems likely to be the reason we dropped it from the config?

What happens with zpools that would be in this state?

> The top-level inventory zpool list runs zpool ... commands on demand when an /inventory request comes in; this seems much worse than this branch's "run periodically in the background, and /inventory gets to clone the contents of a watch channel" pattern.

Yeah, it feels messy to be calling zpool in different places for the same inventory collection.

But yeah, better to chat about this live. I'll be at the next update watercooler!

@karencfv (Contributor, Author) commented Jan 19, 2026

Thanks for taking a look, @jgallagher @davepacheco! I think I've addressed all of the comments and this is ready for another review. I also incorporated the feedback from #9589 (comment) here.

I have updated the PR's description to show the output of /inventory after the changes.

@karencfv requested a review from jgallagher January 19, 2026 02:38
@karencfv (Contributor, Author) commented Jan 19, 2026

In the end I decided to make the SMF services in maintenance an Option<T> as part of this PR. Otherwise the versioning between the two PRs was getting annoying. I edited the PR's description to include testing for them as well.

Comment on lines +916 to +942
if let Some(svcs) = &health_monitor.smf_services_in_maintenance {
    match svcs {
        Ok(svcs) => {
            if !svcs.is_empty() {
                if let Some(time_of_status) = &svcs.time_of_status {
                    writeln!(
                        indent2,
                        "SMF services in maintenance at {}:",
                        time_of_status.to_rfc3339_opts(
                            SecondsFormat::Millis,
                            /* use_z */ true,
                        )
                    )?;
                }
                let mut indent3 = IndentWriter::new(" ", &mut indent2);
                for svc in &svcs.services {
                    writeln!(indent3, "{svc}")?;
                }
            }
        }
        Err(e) => {
            writeln!(
                indent2,
                "failed to retrieve SMF services in maintenance: {e}"
            )?;
        }
    }
}
@karencfv (Contributor, Author), Jan 19, 2026:

I am changing all of this in https://github.com/oxidecomputer/omicron/pull/9589/files#diff-7aa92821ab1567814797c276e521417c9ecb6ee700d2b76679ea5e3daedc5904R1105-R1183 anyway. These changes are only here to avoid compilation errors with the new Option<Result<SvcsInMaintenanceResult, String>> type.

@karencfv (Contributor, Author) commented Feb 2, 2026

@davepacheco @jgallagher tiny ping to see if I could get some eyes on this?

#[serde(rename_all = "snake_case")]
#[serde(rename = "OptionResult{T}Or{E}")]
#[serde(untagged)]
pub enum SnakeCaseOptionResult<T, E> {
Contributor:

Oof, sorry, I didn't realize proposing Option<Result<_, _>> would lead to this.

Would it be cleaner to squish this down into an enum of our own? Something like

enum HealthMonitorResult<T> {
    NotYetKnown,
    Ok(T),
    Err(String),
}

? (I don't have strong feelings either way.)

@karencfv (Contributor, Author):

Hm, the code is cleaner, yes, but the OpenAPI linter forces me to use tags, and the result ends up being a somewhat weird-looking API 😄

I tried out how the API would look using the enum for services in maintenance and the Option<Result<T>> for the unhealthy zpools, and I think the API is much cleaner when using the Option<Result<T>>:

$ curl -H "api-version: 18.0.0"  http://[::1]:55444/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21757  100 21757    0     0  8901k      0 --:--:-- --:--:-- --:--:-- 10.3M
{
  "smf_services_in_maintenance": {
    "result": "ok",
    "value": {
      "services": [
        {
          "fmri": "svc:/site/fake-service2:default",
          "zone": "global"
        },
        {
          "fmri": "svc:/site/fake-service:default",
          "zone": "global"
        }
      ],
      "errors": [],
      "time_of_status": "2026-02-03T03:56:03.594752466Z"
    }
  },
  "unhealthy_zpools": {
    "ok": {
      "zpools": [
        {
          "zpool": "fakepool1",
          "state": "degraded"
        },
        {
          "zpool": "fakepool2",
          "state": "degraded"
        }
      ],
      "errors": [],
      "time_of_status": "2026-02-03T03:56:03.560975738Z"
    }
  }
}

If it's not super important, perhaps we can keep it as is?
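
(For anyone curious why the two shapes differ: with serde's default externally-tagged enum representation plus snake_case variant names, a Result-like wrapper already serializes as {"ok": ...} / {"err": ...}, and wrapping it in Option gives null for "checks haven't run yet". The sketch below is purely illustrative — CheckOutcome is a made-up name, not the actual SnakeCaseOptionResult, whose definition is only partially visible above.)

    use serde::{Deserialize, Serialize};

    // Illustrative only; not the PR's actual SnakeCaseOptionResult type.
    #[derive(Debug, Clone, Serialize, Deserialize)]
    #[serde(rename_all = "snake_case")]
    enum CheckOutcome<T, E> {
        // Externally tagged (serde's default), so this serializes as
        // {"ok": <T>} rather than {"result": "ok", "value": <T>}.
        Ok(T),
        Err(E),
    }

    fn main() {
        let some: Option<CheckOutcome<Vec<String>, String>> =
            Some(CheckOutcome::Ok(vec!["fakepool1".to_string()]));
        // Prints: {"ok":["fakepool1"]}
        println!("{}", serde_json::to_string(&some).unwrap());

        let none: Option<CheckOutcome<Vec<String>, String>> = None;
        // Prints: null, matching the "checks aren't running" example.
        println!("{}", serde_json::to_string(&none).unwrap());
    }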

