Skip to content

[fm] Guard alert creation with SitrepGuardedInsert.#10533

Open
mergeconflict wants to merge 1 commit into
mergeconflict/fm-sitrepguardedinsertfrom
mergeconflict/fm-sitrepguardedinsert-alerts
Open

[fm] Guard alert creation with SitrepGuardedInsert.#10533
mergeconflict wants to merge 1 commit into
mergeconflict/fm-sitrepguardedinsertfrom
mergeconflict/fm-sitrepguardedinsert-alerts

Conversation

@mergeconflict

@mergeconflict mergeconflict commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Wire the alert resource through SitrepGuardedInsert:

  • impl SitrepGuardedResource for Alert;
  • schema: alert_generation on fm_sitrep and the rendezvous_alert_created marker table (migration fm-alert-resource-deletion);
  • alert_create's FM path routes through the combinator, surfacing a stale sitrep as Error::Conflict;
  • SitrepBuilder tracks alert_generation, bumping it when the outstanding alert-request set changes; the closed-case carry-forward filter drops fully-satisfied closed cases and keeps those with unsatisfied alert requests;
  • fm_rendezvous reads the expected generation and aborts the alert loop on a stale mismatch; fm_analysis loads existing markers to drive carry-forward; omdb displays the new status fields and generation.

Context: #10248, builds on #10564. Note, garbage collection for rendezvous_alert_created rows is not in this PR, that should be coming later this week if all goes well.

@mergeconflict mergeconflict self-assigned this Jun 2, 2026
@mergeconflict mergeconflict added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Jun 2, 2026
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from 1bc2f60 to 62cc3dd Compare June 2, 2026 19:32
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert branch from 6663e70 to c6e5904 Compare June 2, 2026 19:49
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from 62cc3dd to 3d5af68 Compare June 2, 2026 19:50
@AlejandroME AlejandroME added this to the 21 milestone Jun 4, 2026
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch 4 times, most recently from f3832d4 to cbc506b Compare June 8, 2026 16:41
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert branch from c4925ea to 0aded13 Compare June 8, 2026 16:41
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from cbc506b to 59680b0 Compare June 8, 2026 16:52
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert branch from 22bdf33 to 21c72b0 Compare June 8, 2026 17:57
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from 59680b0 to b6f5bfe Compare June 8, 2026 17:57
Comment thread nexus/src/app/alert.rs
/// to do that, the event remains durably in the database to be dispatched
/// and delivered by someone else.
///
/// Note: as of this writing, this method has no production callers; it

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment useful?? it feels extremely likely to get out-of-date soon, and overlooked

comment: "test sitrep 1".to_string(),
time_created: Utc::now(),
next_inv_min_time_started: Utc::now(),
alert_generation: Generation::from_u32(0),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generation::new starts at 1 - as do your schema changes -- any reason we're starting from 0 here and elsewhere in this PR?

Comment on lines +145 to +146
/// row in `rendezvous_alert_created`, and therefore don't force a closed
/// case to be copied forward.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first half of this comment makes sense, but I'm a little thrown off by the second half.

We're saying: "If a case is closed, but has remaining work, we want to copy it forward", and the lack of presence in this set is one of many possible signals of "remaining work", right?

@@ -0,0 +1,4 @@
CREATE TABLE IF NOT EXISTS omicron.public.rendezvous_alert_created (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a problem that all pre-existing FM alerts might "exist but be marker-less" here? maybe a question for @hawkw

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, since there aren't any, it isn't actually a problem in practice, even if it would be a problem if it was real...

Comment on lines +249 to +254
let unmarked_alert_requests: BTreeSet<AlertUuid> = case
.alerts_requested
.iter()
.map(|r| r.id)
.filter(|id| !self.existing_alerts.contains(id))
.collect();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a case where "alert exists but has no marker" would appear as pending work forever, if we have any alerts that exist without markers right now, correct?

Comment on lines +258 to +261
// Carry the case forward intact: satisfied alert
// requests are harmless to keep around (the marker
// prevents resurrection) and pruning them would force a
// generation bump that buys us only earlier marker GC.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is true if either of the outstanding_work elements are incomplete, right?

(like, we carry forward the case if we have ereport work to do, OR alert work to do -- they all can be GC'd once the case is finally complete)

/// case must be carried forward intact (both alert requests still present)
/// and `alerts_changed` must remain `false`.
///
/// Pre-this-change the absence of unmarked ereports alone would have

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really confusing to see "Pre-this-change" comments embedded into source code which will outlast a review of this PR. Is this a claude artifact?

@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from b6f5bfe to a7d6f65 Compare June 8, 2026 20:08
Wire the alert resource through `SitrepGuardedInsert`:

  - `impl SitrepGuardedResource for Alert`;
  - schema: `alert_generation` on `fm_sitrep` and the
    `rendezvous_alert_created` marker table (migration
    fm-alert-resource-deletion);
  - `alert_create`'s FM path routes through the combinator, surfacing a
    stale sitrep as a monomorphic `Error::Conflict`;
  - `SitrepBuilder` tracks `alert_generation`, bumping it when the
    outstanding alert-request set changes; the closed-case carry-forward
    filter drops fully-satisfied closed cases;
  - fm_rendezvous reads the expected generation and aborts the alert loop
    on a stale mismatch; fm_analysis loads existing markers to drive
    carry-forward; omdb displays the new status fields and generation.
@mergeconflict mergeconflict force-pushed the mergeconflict/fm-sitrepguardedinsert-alerts branch from a7d6f65 to aee00ec Compare June 8, 2026 20:12

@hawkw hawkw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a lot of code here, and i haven't yet managed to read over all of it, but here are some initial thoughts.

}
}
println!(" {NEXT_INV_MIN_START:>WIDTH$}: {next_inv_min_time_started}");
println!(" {ALERT_GEN:>WIDTH$}: {alert_generation}");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be nice if all the different generation numbers were displayed under their own heading here, like:

    rendezvous resource generation numbers:
               alert generation: 69
      support bundle generation: 420

or some such

Comment on lines +178 to +182
/// Returns the subset of `candidates` for which a marker row exists in
/// `rendezvous_alert_created`. Used by FM analysis to see which alert
/// requests on closed cases have been satisfied, to determine whether any
/// cases can be dropped (see `nexus_fm::analysis_input::Builder::build`).
pub async fn alert_markers_existing_in(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neither the name of the function nor the commentary feel particularly clear to me. in particular:

  • i would like for the name to make it clear that this is an operation on the FM rendezvous_alert_created and is FM-specific, which is not obvious from the name
  • i don't really get the in.
  • essentially you are saying "which of these alert UUIDs exist in the marker table", can we make the combination of the documentation and function name just like, say that?

Comment on lines 52 to 123
@@ -48,6 +117,7 @@ impl DataStore {
"alert_class" => %alert.class,
"alert_case_id" => ?alert.case_id,
"time_created" => ?alert.identity.time_created,
"fm" => matches!(provenance, AlertProvenance::Fm { .. }),
);

Ok(alert)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels complex enough that i kinda wonder if it's better to just have separate alert_create and fm_rendezvous_alert_create functions, or something, which correctly populate the right provenance and do the other stuff. this feels like a lot of runtime dynamism just to be able to share the code

       let insert = diesel::insert_into(alert_dsl::alert)
            .values(alert)
            .on_conflict_do_nothing()
            .returning(Alert::as_returning());

between both paths...which could also just be stuffed into a separate function. personally, i would have written something more like this:

Suggested change
pub async fn alert_create(
&self,
opctx: &OpContext,
alert: Alert,
) -> CreateResult<Alert> {
let conn = self.pool_connection_authorized(&opctx).await?;
let alert = Self::alert_create_query(alert)
.get_result_async(&*conn)
.await
.map_err(|e| match e {
DieselError::DatabaseError(
DatabaseErrorKind::UniqueViolation,
_,
) => Err(Error::ObjectAlreadyExists {
type_name: ResourceType::Alert,
object_name: alert_id.to_string(),
}),
e => public_error_from_diesel(e, ErrorHandler::Server),
})?;
slog::debug!(
&opctx.log,
"published alert";
"alert_id" => ?alert.id(),
"alert_class" => %alert.class,
"alert_case_id" => ?alert.case_id,
"time_created" => ?alert.identity.time_created,
);
Ok(alert)
}
pub async fn fm_rendezvous_alert_create(
&self,
opctx: &OpContext,
alert: Alert,
alert_generation: Generation,
) -> CreateResult<Alert> {
let conn = self.pool_connection_authorized(&opctx).await?;
let guarded = SitrepGuardedInsert::<Alert, _>::new(
alert_id.into_untyped_uuid(),
expected_alert_generation.into(),
Self::alert_create_query(alert),
);
match guarded.execute_async(&conn).await.map_err(|e| {
public_error_from_diesel(e, ErrorHandler::Server)
})? {
SitrepGuardedInsertOutcome::Created(alert) => {
slog::debug!(
&opctx.log,
"published alert for FM alert request";
"alert_id" => ?alert.id(),
"alert_class" => %alert.class,
"alert_case_id" => ?alert.case_id,
"time_created" => ?alert.identity.time_created,
);
Ok(alert)
}
SitrepGuardedInsertOutcome::AlreadyExists => {
Err(Error::ObjectAlreadyExists {
type_name: ResourceType::Alert,
object_name: alert_id.to_string(),
})
}
SitrepGuardedInsertOutcome::StaleSitrep => {
// We signal stale sitrep to the caller as a `Conflict`
// error. This is unambiguous; no other path produces
// `Conflict`.
Err(Error::conflict(
"cannot create alert for stale sitrep",
))
}
};
}
fn alert_create_query(alert: Alert) -> impl RunnableQuery<Alert> {
diesel::insert_into(alert_dsl::alert)
.values(alert)
.returning(Alert::as_returning())
}

which feels a lot less convoluted, at least to me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also obviates the need for the AlertProvenance enum, which makes the diff a bit smaller by not having to change any of the existing alert tests and so on (and lets us delete a bunch of comments from Claude calling me out about how we don't currently actually publish any alerts, which would sure make me feel better... :P)

}

#[tokio::test]
async fn test_fm_sitrep_insert_persists_generations() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this...well, i'm not going to tell you that you should delete a test, but it sure seems like something we did not need to have a test for. this basically appears to be testing "does diesel work?" --- if we were to declare a model struct with the fields on a database row, and roundtrip that struct through the database, and find that one of those rows did not survive the roundtrip through the database, i am pretty sure that basically all our tests would break.

i assume claude wrote this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually need to declare a model struct for this table? As far as I can tell, we are never actually using this struct, since there's one query that reads from the table and it just selects the alert IDs, and the insert is generated by the SitrepGuardedInsert query so it also doesn't actually use the model struct, right? AFAICT this is entirely dead code

Comment on lines +223 to +226
// - The case has been closed, but still has outstanding work: either an
// ereport which has not yet been marked as "seen" in the database, or
// an alert request whose `rendezvous_alert_created` marker is not yet
// present.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally i would kinda have kept the previous bullet point and added a new one for rendezvous_alert_created but 🤷‍♀️

Comment thread nexus/fm/src/builder.rs
Comment on lines +202 to +208
/// Finalize the builder with throwaway values for `creator_id` and
/// `time_created` (which these tests don't exercise).
fn build_sitrep(builder: SitrepBuilder<'_>) -> fm::Sitrep {
let (sitrep, _) =
builder.build(OmicronZoneUuid::new_v4(), chrono::Utc::now());
sitrep
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this...really doesn't feel like it's saving us much over just calling it with a new_v4() and now()...why do we need a helper for that?

Comment thread nexus/fm/src/builder.rs
Comment on lines +191 to +194
/// A logger that discards all output.
fn test_log() -> Logger {
slog::Logger::root(slog::Discard, slog::o!())
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we should be discarding all log output in tests. i would much prefer we use the omicron_test_utils::dev::test_setup_log function like most of the other tests in Omicron do. that way, we still have the logs from the test if it fails, so you can go look at them to debug what happened.

@@ -0,0 +1,4 @@
CREATE TABLE IF NOT EXISTS omicron.public.rendezvous_alert_created (

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, since there aren't any, it isn't actually a problem in practice, even if it would be a problem if it was real...

Comment on lines +23 to +26
ereports not yet marked seen:
* ereport dddddddd-dddd-dddd-dddd-dddddddddddd:2
alert requests not yet satisfied:
* alert 66666666-6666-6666-6666-666666666666

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would really like it if this still noted explicitly that the case has been copied forwards, like the old formatter did. maybe we could do something like this:

Suggested change
ereports not yet marked seen:
* ereport dddddddd-dddd-dddd-dddd-dddddddddddd:2
alert requests not yet satisfied:
* alert 66666666-6666-6666-6666-666666666666
copied forwards due to:
ereports not yet marked seen:
* ereport dddddddd-dddd-dddd-dddd-dddddddddddd:2
alert requests not yet satisfied:
* alert 66666666-6666-6666-6666-666666666666

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants