
orchestrator: refuse to persist a snapshot when findings are missing #75

Closed
bakayolo wants to merge 1 commit into main from bena/snapshot-validate-counts

Conversation

Collaborator

bakayolo commented May 4, 2026

Background

On 2026-05-04 a scheduled scan in cash-utility-stage silently produced an empty snapshots/latest.json in S3. Investigation:

  • 06:00:01 orchestrator starts (10 detection workflows)
  • 06:00:07 pod is OOMKilled (exitCode: 137) at the 512Mi limit while parsing Wiz inventory (~28k findings observed locally)
  • 06:00:08 new container starts; the in-memory finding store is empty
  • 06:05:08 Temporal retries CreateSnapshot (Attempt 2) on the fresh worker; the activity reads 0 findings for every resource type and persists an all-null snapshot, overwriting the previous good latest.json.

The memory limit is being raised in block-platform-core-infrastructure#15044, which stops the OOM itself, but the underlying design flaw remains: CreateSnapshot cannot distinguish a genuinely empty inventory from one where a worker died and lost the findings between detection and snapshot.

Change

CreateSnapshot now compares the per-resource-type findings count it reads from the in-memory store against the count each detection workflow reported. If they disagree, the activity errors out before SaveSnapshot is called, so nothing is written to S3.

The orchestrator workflow already collects each detection workflow's FindingsCount from its WorkflowOutput, so the expected counts come along for free — no new RPCs.

if expected, ok := input.ExpectedFindingsCounts[resourceType]; ok && len(findings) != expected {
    // Fail before SaveSnapshot so the previous good latest.json survives.
    return nil, fmt.Errorf("findings count mismatch for %s: expected %d (reported by detection), found %d (in store)", resourceType, expected, len(findings))
}
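
For context, a sketch of the orchestrator side, with assumed names (detectionWorkflows, WorkflowOutput fields, CreateSnapshotInput); the real workflow code may differ, but the shape is the same: the expected counts come from child-workflow results the orchestrator already awaits.

// Sketch with assumed names; uses go.temporal.io/sdk/workflow.
expected := make(map[string]int)
for _, d := range detectionWorkflows {
    var out WorkflowOutput
    if err := workflow.ExecuteChildWorkflow(ctx, d.Name, d.Input).Get(ctx, &out); err != nil {
        return err
    }
    expected[d.ResourceType] = out.FindingsCount // already reported; no extra RPC
}
return workflow.ExecuteActivity(ctx, "CreateSnapshot", CreateSnapshotInput{
    ExpectedFindingsCounts: expected,
}).Get(ctx, nil)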

After this:

  • The activity retries per the existing retry policy; on a worker that has lost the data, it keeps failing instead of writing garbage (see the illustrative options after this list).
  • When retries are exhausted, the orchestrator workflow fails and the next scheduled scan starts fresh.
  • The previous good latest.json is preserved.
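
Illustrative only: the kind of Temporal activity options under which "retries are exhausted" applies. The real values live in the orchestrator's config; the 5m interval below merely matches the 06:00:08 to 06:05:08 gap in the timeline above.

// Hypothetical values; uses go.temporal.io/sdk/workflow and
// go.temporal.io/sdk/temporal.
ao := workflow.ActivityOptions{
    StartToCloseTimeout: 10 * time.Minute,
    RetryPolicy: &temporal.RetryPolicy{
        InitialInterval:    5 * time.Minute, // consistent with the Attempt 2 at 06:05:08
        BackoffCoefficient: 2.0,
        MaximumAttempts:    3, // after this the workflow fails; next scheduled scan starts fresh
    },
}
ctx = workflow.WithActivityOptions(ctx, ao)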

Tests

  • New TestActivities_CreateSnapshot_FindingsCountMismatch_FailsWithoutSaving asserts the new failure path doesn't call SaveSnapshot (a distilled sketch follows this list).
  • The existing happy-path and empty-findings tests are updated to pass matching expected counts.
  • make test and go vet ./... are clean.
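
A distilled, self-contained version of the new test; the real test exercises the repo's Activities struct and its mocks, and the names below (fakeSaver, createSnapshot) are stand-ins.

package orchestrator_test

import (
    "context"
    "fmt"
    "strings"
    "testing"
)

type fakeSaver struct{ calls int }

func (f *fakeSaver) Save(context.Context) error { f.calls++; return nil }

// createSnapshot mirrors the activity's new guard: validate counts
// first, save only if they all match.
func createSnapshot(ctx context.Context, findings map[string][]string, expected map[string]int, s *fakeSaver) error {
    for rt, fs := range findings {
        if want, ok := expected[rt]; ok && len(fs) != want {
            return fmt.Errorf("findings count mismatch for %s: expected %d, found %d", rt, want, len(fs))
        }
    }
    return s.Save(ctx)
}

func TestCreateSnapshot_FindingsCountMismatch_FailsWithoutSaving(t *testing.T) {
    saver := &fakeSaver{}
    findings := map[string][]string{"s3_bucket": {}} // restarted worker: store is empty
    expected := map[string]int{"s3_bucket": 28000}   // detection reported ~28k

    err := createSnapshot(context.Background(), findings, expected, saver)
    if err == nil || !strings.Contains(err.Error(), "findings count mismatch") {
        t.Fatalf("want count-mismatch error, got %v", err)
    }
    if saver.calls != 0 {
        t.Fatalf("Save must not be called on mismatch; got %d calls", saver.calls)
    }
}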

Out of scope

A real self-healing fix would persist findings durably (S3, DB) so a worker restart can recover them. That's a bigger restructure — this PR just stops the silent corruption.

bakayolo requested a review from a team as a code owner May 4, 2026 17:40
Collaborator Author

bakayolo commented May 4, 2026

Closing. The count-equality check doesn't actually defend against the failure modes that matter: if a detection workflow reports a spurious 0, the empty store matches the expected count, so the check passes and a good snapshot still gets overwritten. Will revisit with a proper fix (durable findings store and/or a sanity check against the previous latest.json; a rough sketch of the latter below).
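
One hypothetical shape for that latest.json sanity check; nothing here is implemented, and loadLatestSnapshot / TotalFindings are made-up names.

// Hypothetical, not part of this PR: refuse to overwrite a non-empty
// latest.json with a suddenly empty snapshot.
prev, err := loadLatestSnapshot(ctx, bucket) // assumed helper reading snapshots/latest.json
if err == nil && prev.TotalFindings() > 0 && next.TotalFindings() == 0 {
    return fmt.Errorf("refusing to overwrite a snapshot with %d findings with an empty one", prev.TotalFindings())
}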

bakayolo closed this May 4, 2026
