orchestrator: refuse to persist a snapshot when findings are missing#75
Closed
orchestrator: refuse to persist a snapshot when findings are missing#75
Conversation
Background: on 2026-05-04 a scheduled scan in cash-utility-stage silently produced an empty snapshot.json in S3. The pod was OOMKilled mid-scan (exit 137 at 512Mi); after restart, Temporal retried the `CreateSnapshot` activity on a fresh worker whose in-memory finding store was empty. The activity happily reported `count 0` for every resource type and persisted an all-null snapshot, overwriting the previous good `latest.json`. Fix: `CreateSnapshot` now compares the per-resource-type findings count it reads from the in-memory store against the count each detection workflow reported. If they disagree the activity errors out *before* calling SaveSnapshot, so nothing is written to S3. Temporal retries the activity (per the existing retry policy); when a fresh worker can't recover the lost findings, the workflow fails and the next scheduled run starts cleanly. The orchestrator workflow already collects each detection workflow's `FindingsCount` from its WorkflowOutput, so the expected counts are populated for free without any new RPCs. Tests: - TestActivities_CreateSnapshot_FindingsCountMismatch_FailsWithoutSaving asserts the new failure path doesn't call SaveSnapshot. - The existing happy-path / empty-findings tests are updated to pass matching expected counts. Amp-Thread-ID: https://ampcode.com/threads/T-019df3d1-962c-7051-a156-8cf596d7dc1d Co-authored-by: Amp <amp@ampcode.com>
Collaborator
Author
|
Closing — the count-equality check doesn't actually defend against the failure modes that matter (e.g. detection returning a spurious 0 still overwrites a good snapshot). Will revisit with a proper fix (durable findings store and/or a sanity check against the previous latest.json). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
On 2026-05-04 a scheduled scan in cash-utility-stage silently produced an empty
snapshots/latest.jsonin S3. Investigation:06:00:01orchestrator starts (10 detection workflows)06:00:07pod is OOMKilled (exitCode: 137) at the 512Mi limit while parsing Wiz inventory (~28k findings observed locally)06:00:08new container starts; the in-memory finding store is empty06:05:08Temporal retriesCreateSnapshot(Attempt 2) on the fresh worker; the activity reads 0 findings for every resource type and persists an all-null snapshot, overwriting the previous goodlatest.json.The memory limit is being raised in block-platform-core-infrastructure#15044 so the OOM stops happening, but the underlying design flaw is that
CreateSnapshotcannot tell a 'genuinely empty' inventory apart from 'a worker died and lost the findings between detection and snapshot'.Change
CreateSnapshotnow compares the per-resource-type findings count it reads from the in-memory store against the count each detection workflow reported. If they disagree the activity errors out beforeSaveSnapshotis called, so nothing is written to S3.The orchestrator workflow already collects each detection workflow's
FindingsCountfrom itsWorkflowOutput, so the expected counts come along for free — no new RPCs.After this:
latest.jsonis preserved.Tests
TestActivities_CreateSnapshot_FindingsCountMismatch_FailsWithoutSavingasserts the new failure path doesn't callSaveSnapshot.make testandgo vet ./...clean.Out of scope
A real self-healing fix would persist findings durably (S3, DB) so a worker restart can recover them. That's a bigger restructure — this PR just stops the silent corruption.