
orchestrator: refuse to persist a snapshot when findings are missing #75

Closed
bakayolo wants to merge 1 commit into main from bena/snapshot-validate-counts

Conversation

Collaborator

bakayolo commented May 4, 2026

Background

On 2026-05-04 a scheduled scan in cash-utility-stage silently produced an empty snapshots/latest.json in S3. Investigation:

  • 06:00:01 orchestrator starts (10 detection workflows)
  • 06:00:07 pod is OOMKilled (exitCode: 137) at the 512Mi limit while parsing Wiz inventory (~28k findings observed locally)
  • 06:00:08 new container starts; the in-memory finding store is empty
  • 06:05:08 Temporal retries CreateSnapshot (Attempt 2) on the fresh worker; the activity reads 0 findings for every resource type and persists an all-null snapshot, overwriting the previous good latest.json.

The memory limit is being raised in block-platform-core-infrastructure#15044, which stops the OOM itself, but the underlying design flaw remains: CreateSnapshot cannot distinguish a genuinely empty inventory from one where a worker died and lost the findings between detection and snapshot.

Change

CreateSnapshot now compares the per-resource-type findings count it reads from the in-memory store against the count each detection workflow reported. If they disagree, the activity errors out before SaveSnapshot is called, so nothing is written to S3.

The orchestrator workflow already collects each detection workflow's FindingsCount from its WorkflowOutput, so the expected counts come along for free — no new RPCs.

if expected, ok := input.ExpectedFindingsCounts[resourceType]; ok && len(findings) != expected {
    // Fail before SaveSnapshot so the previous good latest.json survives.
    return nil, fmt.Errorf("findings count mismatch for %s: expected %d (reported by detection), found %d (in store)", resourceType, expected, len(findings))
}
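
For context, a sketch of the orchestrator side, with assumed names (detectionWorkflows, WorkflowOutput fields, CreateSnapshotInput); the real workflow code may differ, but the shape is the same: the expected counts come from child-workflow results the orchestrator already awaits.

// Sketch with assumed names; uses go.temporal.io/sdk/workflow.
expected := make(map[string]int)
for _, d := range detectionWorkflows {
    var out WorkflowOutput
    if err := workflow.ExecuteChildWorkflow(ctx, d.Name, d.Input).Get(ctx, &out); err != nil {
        return err
    }
    expected[d.ResourceType] = out.FindingsCount // already reported; no extra RPC
}
return workflow.ExecuteActivity(ctx, "CreateSnapshot", CreateSnapshotInput{
    ExpectedFindingsCounts: expected,
}).Get(ctx, nil)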

After this:

  • The activity retries per the existing retry policy; on a worker that has lost the data, it keeps failing instead of writing garbage (see the illustrative options after this list).
  • When retries are exhausted, the orchestrator workflow fails and the next scheduled scan starts fresh.
  • The previous good latest.json is preserved.
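
Illustrative only: the kind of Temporal activity options under which "retries are exhausted" applies. The real values live in the orchestrator's config; the 5m interval below merely matches the 06:00:08 to 06:05:08 gap in the timeline above.

// Hypothetical values; uses go.temporal.io/sdk/workflow and
// go.temporal.io/sdk/temporal.
ao := workflow.ActivityOptions{
    StartToCloseTimeout: 10 * time.Minute,
    RetryPolicy: &temporal.RetryPolicy{
        InitialInterval:    5 * time.Minute, // consistent with the Attempt 2 at 06:05:08
        BackoffCoefficient: 2.0,
        MaximumAttempts:    3, // after this the workflow fails; next scheduled scan starts fresh
    },
}
ctx = workflow.WithActivityOptions(ctx, ao)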

Tests

  • New TestActivities_CreateSnapshot_FindingsCountMismatch_FailsWithoutSaving asserts the new failure path doesn't call SaveSnapshot (a distilled sketch follows this list).
  • The existing happy-path and empty-findings tests are updated to pass matching expected counts.
  • make test and go vet ./... are clean.
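
A distilled, self-contained version of the new test; the real test exercises the repo's Activities struct and its mocks, and the names below (fakeSaver, createSnapshot) are stand-ins.

package orchestrator_test

import (
    "context"
    "fmt"
    "strings"
    "testing"
)

type fakeSaver struct{ calls int }

func (f *fakeSaver) Save(context.Context) error { f.calls++; return nil }

// createSnapshot mirrors the activity's new guard: validate counts
// first, save only if they all match.
func createSnapshot(ctx context.Context, findings map[string][]string, expected map[string]int, s *fakeSaver) error {
    for rt, fs := range findings {
        if want, ok := expected[rt]; ok && len(fs) != want {
            return fmt.Errorf("findings count mismatch for %s: expected %d, found %d", rt, want, len(fs))
        }
    }
    return s.Save(ctx)
}

func TestCreateSnapshot_FindingsCountMismatch_FailsWithoutSaving(t *testing.T) {
    saver := &fakeSaver{}
    findings := map[string][]string{"s3_bucket": {}} // restarted worker: store is empty
    expected := map[string]int{"s3_bucket": 28000}   // detection reported ~28k

    err := createSnapshot(context.Background(), findings, expected, saver)
    if err == nil || !strings.Contains(err.Error(), "findings count mismatch") {
        t.Fatalf("want count-mismatch error, got %v", err)
    }
    if saver.calls != 0 {
        t.Fatalf("Save must not be called on mismatch; got %d calls", saver.calls)
    }
}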

Out of scope

A real self-healing fix would persist findings durably (S3, DB) so a worker restart can recover them. That's a bigger restructure — this PR just stops the silent corruption.

bakayolo requested a review from a team as a code owner May 4, 2026 17:40
Collaborator Author

bakayolo commented May 4, 2026

Closing. The count-equality check doesn't actually defend against the failure modes that matter: if a detection workflow reports a spurious 0, the empty store matches the expected count, so the check passes and a good snapshot still gets overwritten. Will revisit with a proper fix (durable findings store and/or a sanity check against the previous latest.json; a rough sketch of the latter below).
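
One hypothetical shape for that latest.json sanity check; nothing here is implemented, and loadLatestSnapshot / TotalFindings are made-up names.

// Hypothetical, not part of this PR: refuse to overwrite a non-empty
// latest.json with a suddenly empty snapshot.
prev, err := loadLatestSnapshot(ctx, bucket) // assumed helper reading snapshots/latest.json
if err == nil && prev.TotalFindings() > 0 && next.TotalFindings() == 0 {
    return fmt.Errorf("refusing to overwrite a snapshot with %d findings with an empty one", prev.TotalFindings())
}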

bakayolo closed this May 4, 2026
