[fm] Guard alert creation with SitrepGuardedInsert.#10533
Conversation
1bc2f60 to
62cc3dd
Compare
6663e70 to
c6e5904
Compare
62cc3dd to
3d5af68
Compare
f3832d4 to
cbc506b
Compare
c4925ea to
0aded13
Compare
cbc506b to
59680b0
Compare
22bdf33 to
21c72b0
Compare
59680b0 to
b6f5bfe
Compare
| /// to do that, the event remains durably in the database to be dispatched | ||
| /// and delivered by someone else. | ||
| /// | ||
| /// Note: as of this writing, this method has no production callers; it |
There was a problem hiding this comment.
Is this comment useful?? it feels extremely likely to get out-of-date soon, and overlooked
| comment: "test sitrep 1".to_string(), | ||
| time_created: Utc::now(), | ||
| next_inv_min_time_started: Utc::now(), | ||
| alert_generation: Generation::from_u32(0), |
There was a problem hiding this comment.
Generation::new starts at 1 - as do your schema changes -- any reason we're starting from 0 here and elsewhere in this PR?
| /// row in `rendezvous_alert_created`, and therefore don't force a closed | ||
| /// case to be copied forward. |
There was a problem hiding this comment.
The first half of this comment makes sense, but I'm a little thrown off by the second half.
We're saying: "If a case is closed, but has remaining work, we want to copy it forward", and the lack of presence in this set is one of many possible signals of "remaining work", right?
| @@ -0,0 +1,4 @@ | |||
| CREATE TABLE IF NOT EXISTS omicron.public.rendezvous_alert_created ( | |||
There was a problem hiding this comment.
Is it a problem that all pre-existing FM alerts might "exist but be marker-less" here? maybe a question for @hawkw
There was a problem hiding this comment.
well, since there aren't any, it isn't actually a problem in practice, even if it would be a problem if it was real...
| let unmarked_alert_requests: BTreeSet<AlertUuid> = case | ||
| .alerts_requested | ||
| .iter() | ||
| .map(|r| r.id) | ||
| .filter(|id| !self.existing_alerts.contains(id)) | ||
| .collect(); |
There was a problem hiding this comment.
This is a case where "alert exists but has no marker" would appear as pending work forever, if we have any alerts that exist without markers right now, correct?
| // Carry the case forward intact: satisfied alert | ||
| // requests are harmless to keep around (the marker | ||
| // prevents resurrection) and pruning them would force a | ||
| // generation bump that buys us only earlier marker GC. |
There was a problem hiding this comment.
This comment is true if either of the outstanding_work elements are incomplete, right?
(like, we carry forward the case if we have ereport work to do, OR alert work to do -- they all can be GC'd once the case is finally complete)
| /// case must be carried forward intact (both alert requests still present) | ||
| /// and `alerts_changed` must remain `false`. | ||
| /// | ||
| /// Pre-this-change the absence of unmarked ereports alone would have |
There was a problem hiding this comment.
Really confusing to see "Pre-this-change" comments embedded into source code which will outlast a review of this PR. Is this a claude artifact?
b6f5bfe to
a7d6f65
Compare
Wire the alert resource through `SitrepGuardedInsert`:
- `impl SitrepGuardedResource for Alert`;
- schema: `alert_generation` on `fm_sitrep` and the
`rendezvous_alert_created` marker table (migration
fm-alert-resource-deletion);
- `alert_create`'s FM path routes through the combinator, surfacing a
stale sitrep as a monomorphic `Error::Conflict`;
- `SitrepBuilder` tracks `alert_generation`, bumping it when the
outstanding alert-request set changes; the closed-case carry-forward
filter drops fully-satisfied closed cases;
- fm_rendezvous reads the expected generation and aborts the alert loop
on a stale mismatch; fm_analysis loads existing markers to drive
carry-forward; omdb displays the new status fields and generation.
a7d6f65 to
aee00ec
Compare
hawkw
left a comment
There was a problem hiding this comment.
there's a lot of code here, and i haven't yet managed to read over all of it, but here are some initial thoughts.
| } | ||
| } | ||
| println!(" {NEXT_INV_MIN_START:>WIDTH$}: {next_inv_min_time_started}"); | ||
| println!(" {ALERT_GEN:>WIDTH$}: {alert_generation}"); |
There was a problem hiding this comment.
it might be nice if all the different generation numbers were displayed under their own heading here, like:
rendezvous resource generation numbers:
alert generation: 69
support bundle generation: 420
or some such
| /// Returns the subset of `candidates` for which a marker row exists in | ||
| /// `rendezvous_alert_created`. Used by FM analysis to see which alert | ||
| /// requests on closed cases have been satisfied, to determine whether any | ||
| /// cases can be dropped (see `nexus_fm::analysis_input::Builder::build`). | ||
| pub async fn alert_markers_existing_in( |
There was a problem hiding this comment.
neither the name of the function nor the commentary feel particularly clear to me. in particular:
- i would like for the name to make it clear that this is an operation on the FM
rendezvous_alert_createdand is FM-specific, which is not obvious from the name - i don't really get the
in. - essentially you are saying "which of these alert UUIDs exist in the marker table", can we make the combination of the documentation and function name just like, say that?
| @@ -48,6 +117,7 @@ impl DataStore { | |||
| "alert_class" => %alert.class, | |||
| "alert_case_id" => ?alert.case_id, | |||
| "time_created" => ?alert.identity.time_created, | |||
| "fm" => matches!(provenance, AlertProvenance::Fm { .. }), | |||
| ); | |||
|
|
|||
| Ok(alert) | |||
There was a problem hiding this comment.
this feels complex enough that i kinda wonder if it's better to just have separate alert_create and fm_rendezvous_alert_create functions, or something, which correctly populate the right provenance and do the other stuff. this feels like a lot of runtime dynamism just to be able to share the code
let insert = diesel::insert_into(alert_dsl::alert)
.values(alert)
.on_conflict_do_nothing()
.returning(Alert::as_returning());between both paths...which could also just be stuffed into a separate function. personally, i would have written something more like this:
| pub async fn alert_create( | |
| &self, | |
| opctx: &OpContext, | |
| alert: Alert, | |
| ) -> CreateResult<Alert> { | |
| let conn = self.pool_connection_authorized(&opctx).await?; | |
| let alert = Self::alert_create_query(alert) | |
| .get_result_async(&*conn) | |
| .await | |
| .map_err(|e| match e { | |
| DieselError::DatabaseError( | |
| DatabaseErrorKind::UniqueViolation, | |
| _, | |
| ) => Err(Error::ObjectAlreadyExists { | |
| type_name: ResourceType::Alert, | |
| object_name: alert_id.to_string(), | |
| }), | |
| e => public_error_from_diesel(e, ErrorHandler::Server), | |
| })?; | |
| slog::debug!( | |
| &opctx.log, | |
| "published alert"; | |
| "alert_id" => ?alert.id(), | |
| "alert_class" => %alert.class, | |
| "alert_case_id" => ?alert.case_id, | |
| "time_created" => ?alert.identity.time_created, | |
| ); | |
| Ok(alert) | |
| } | |
| pub async fn fm_rendezvous_alert_create( | |
| &self, | |
| opctx: &OpContext, | |
| alert: Alert, | |
| alert_generation: Generation, | |
| ) -> CreateResult<Alert> { | |
| let conn = self.pool_connection_authorized(&opctx).await?; | |
| let guarded = SitrepGuardedInsert::<Alert, _>::new( | |
| alert_id.into_untyped_uuid(), | |
| expected_alert_generation.into(), | |
| Self::alert_create_query(alert), | |
| ); | |
| match guarded.execute_async(&conn).await.map_err(|e| { | |
| public_error_from_diesel(e, ErrorHandler::Server) | |
| })? { | |
| SitrepGuardedInsertOutcome::Created(alert) => { | |
| slog::debug!( | |
| &opctx.log, | |
| "published alert for FM alert request"; | |
| "alert_id" => ?alert.id(), | |
| "alert_class" => %alert.class, | |
| "alert_case_id" => ?alert.case_id, | |
| "time_created" => ?alert.identity.time_created, | |
| ); | |
| Ok(alert) | |
| } | |
| SitrepGuardedInsertOutcome::AlreadyExists => { | |
| Err(Error::ObjectAlreadyExists { | |
| type_name: ResourceType::Alert, | |
| object_name: alert_id.to_string(), | |
| }) | |
| } | |
| SitrepGuardedInsertOutcome::StaleSitrep => { | |
| // We signal stale sitrep to the caller as a `Conflict` | |
| // error. This is unambiguous; no other path produces | |
| // `Conflict`. | |
| Err(Error::conflict( | |
| "cannot create alert for stale sitrep", | |
| )) | |
| } | |
| }; | |
| } | |
| fn alert_create_query(alert: Alert) -> impl RunnableQuery<Alert> { | |
| diesel::insert_into(alert_dsl::alert) | |
| .values(alert) | |
| .returning(Alert::as_returning()) | |
| } |
which feels a lot less convoluted, at least to me.
There was a problem hiding this comment.
this also obviates the need for the AlertProvenance enum, which makes the diff a bit smaller by not having to change any of the existing alert tests and so on (and lets us delete a bunch of comments from Claude calling me out about how we don't currently actually publish any alerts, which would sure make me feel better... :P)
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_fm_sitrep_insert_persists_generations() { |
There was a problem hiding this comment.
this...well, i'm not going to tell you that you should delete a test, but it sure seems like something we did not need to have a test for. this basically appears to be testing "does diesel work?" --- if we were to declare a model struct with the fields on a database row, and roundtrip that struct through the database, and find that one of those rows did not survive the roundtrip through the database, i am pretty sure that basically all our tests would break.
i assume claude wrote this.
There was a problem hiding this comment.
do we actually need to declare a model struct for this table? As far as I can tell, we are never actually using this struct, since there's one query that reads from the table and it just selects the alert IDs, and the insert is generated by the SitrepGuardedInsert query so it also doesn't actually use the model struct, right? AFAICT this is entirely dead code
| // - The case has been closed, but still has outstanding work: either an | ||
| // ereport which has not yet been marked as "seen" in the database, or | ||
| // an alert request whose `rendezvous_alert_created` marker is not yet | ||
| // present. |
There was a problem hiding this comment.
personally i would kinda have kept the previous bullet point and added a new one for rendezvous_alert_created but 🤷♀️
| /// Finalize the builder with throwaway values for `creator_id` and | ||
| /// `time_created` (which these tests don't exercise). | ||
| fn build_sitrep(builder: SitrepBuilder<'_>) -> fm::Sitrep { | ||
| let (sitrep, _) = | ||
| builder.build(OmicronZoneUuid::new_v4(), chrono::Utc::now()); | ||
| sitrep | ||
| } |
There was a problem hiding this comment.
this...really doesn't feel like it's saving us much over just calling it with a new_v4() and now()...why do we need a helper for that?
| /// A logger that discards all output. | ||
| fn test_log() -> Logger { | ||
| slog::Logger::root(slog::Discard, slog::o!()) | ||
| } |
There was a problem hiding this comment.
i don't think we should be discarding all log output in tests. i would much prefer we use the omicron_test_utils::dev::test_setup_log function like most of the other tests in Omicron do. that way, we still have the logs from the test if it fails, so you can go look at them to debug what happened.
| @@ -0,0 +1,4 @@ | |||
| CREATE TABLE IF NOT EXISTS omicron.public.rendezvous_alert_created ( | |||
There was a problem hiding this comment.
well, since there aren't any, it isn't actually a problem in practice, even if it would be a problem if it was real...
| ereports not yet marked seen: | ||
| * ereport dddddddd-dddd-dddd-dddd-dddddddddddd:2 | ||
| alert requests not yet satisfied: | ||
| * alert 66666666-6666-6666-6666-666666666666 |
There was a problem hiding this comment.
i would really like it if this still noted explicitly that the case has been copied forwards, like the old formatter did. maybe we could do something like this:
| ereports not yet marked seen: | |
| * ereport dddddddd-dddd-dddd-dddd-dddddddddddd:2 | |
| alert requests not yet satisfied: | |
| * alert 66666666-6666-6666-6666-666666666666 | |
| copied forwards due to: | |
| ereports not yet marked seen: | |
| * ereport dddddddd-dddd-dddd-dddd-dddddddddddd:2 | |
| alert requests not yet satisfied: | |
| * alert 66666666-6666-6666-6666-666666666666 |
Wire the alert resource through
SitrepGuardedInsert:impl SitrepGuardedResource for Alert;alert_generationonfm_sitrepand therendezvous_alert_createdmarker table (migration fm-alert-resource-deletion);alert_create's FM path routes through the combinator, surfacing a stale sitrep asError::Conflict;SitrepBuildertracksalert_generation, bumping it when the outstanding alert-request set changes; the closed-case carry-forward filter drops fully-satisfied closed cases and keeps those with unsatisfied alert requests;Context: #10248, builds on #10564. Note, garbage collection for
rendezvous_alert_createdrows is not in this PR, that should be coming later this week if all goes well.