fix(cloudformation): persist provisioned resources across restart by WarpRat · Pull Request #1767 · faiscadev/fakecloud

WarpRat · 2026-06-18T19:59:34Z

Hey @vieiralucas - another fakecloud fix from our SAM-on-Kubernetes setup. Heads up that this one
is bigger than my usual: the persistence gap turned out to be provisioner-general rather than
lambda-specific, so the fix touches every snapshot-backed service. I tried to keep that surface
mechanical and uniform (one small snapshot_hook() per service, reusing each service's existing
persist path), with the real logic concentrated in CloudFormation and S3. Happy to adjust anything
if you'd shape it differently.

If you could cut a new release after this merges it'd help us pick it up cleanly. No worries at
all if it's not convenient right now though - we can pin a head image or build our own in the
meantime. Thanks as always for being so welcoming to these.

Fixes #1766

Summary

With FAKECLOUD_STORAGE_MODE=persistent, resources created by the CloudFormation provisioner
were not written through to the persistent store. The provisioner mutates each service's
in-memory state directly and never triggered that service's snapshot/persist path, so a
CFN/SAM-deployed resource worked normally in the running process but was silently lost on the
next restart - while the CFN stack metadata persisted, leaving a CREATE_COMPLETE stack
whose physical resources no longer existed. The same resources created via their direct
service APIs persisted correctly, so the gap was the CFN provisioning path, not the resource
type.

Whether a CFN-provisioned resource survived a restart was therefore accidental - it depended on
whether its owning service happened to be re-mutated (re-serializing that service's whole state
to disk) before the restart:

- Deterministically lost: AWS::Lambda::Function (plus AWS::Lambda::Permission /
  AWS::Lambda::Url) and AWS::SecretsManager::Secret.
- Survived only incidentally: AWS::SQS::Queue and AWS::StepFunctions::StateMachine.
- S3 variant: the provisioner inserted into the in-memory S3 map and bypassed the bespoke
  S3Store disk path the real CreateBucket uses, so CFN-created buckets were lost.

Root cause

Persistence is write-through inside each service's handle() (under if mutates && success) via
a private save_snapshot() that re-serializes the whole service state; S3/IAM use bespoke stores
(S3Store per-object writes; IAM a whole-state snapshot fn). There is no periodic or on-shutdown
flush. The CFN ResourceProvisioner holds the shared service state directly and mutates it in
place (e.g. create_lambda_function -> state.functions.insert(...), create_s3_bucket ->
state.buckets.insert(...)); these provisioner fns are synchronous and never call the async
save_snapshot() (and bypass S3Store for buckets), so the resource lands in live memory but is
never written to disk.
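
To make the gap concrete, here is a minimal sketch of the two paths. It is a toy model, not fakecloud's code: Disk, Service, handle_create, provision_without_persist and names_after_restart are hypothetical names, and the "snapshot" is an in-memory stand-in for the on-disk store.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a service's on-disk snapshot.
#[derive(Default)]
pub struct Disk {
    pub snapshot: Mutex<Option<Vec<String>>>,
}

/// Hypothetical stand-in for one snapshot-backed service (e.g. Lambda).
pub struct Service {
    pub functions: Arc<Mutex<BTreeMap<String, String>>>,
    pub disk: Arc<Disk>,
}

impl Service {
    /// Re-serializes the whole service state, as the private save_snapshot() is described to do.
    fn save_snapshot(&self) {
        let names: Vec<String> = self.functions.lock().unwrap().keys().cloned().collect();
        *self.disk.snapshot.lock().unwrap() = Some(names);
    }

    /// API path: mutate, then write through.
    pub fn handle_create(&self, name: &str) {
        self.functions
            .lock()
            .unwrap()
            .insert(name.to_string(), "code".to_string());
        self.save_snapshot();
    }
}

/// Provisioner path before the fix: mutates the shared state and never persists.
pub fn provision_without_persist(functions: &Mutex<BTreeMap<String, String>>, name: &str) {
    functions
        .lock()
        .unwrap()
        .insert(name.to_string(), "code".to_string());
}

/// What a restart would reload: only what reached the snapshot.
pub fn names_after_restart(disk: &Disk) -> Vec<String> {
    disk.snapshot.lock().unwrap().clone().unwrap_or_default()
}
```

In this model a function added through provision_without_persist is visible in live state but absent from names_after_restart until some later handle_create re-serializes the whole state, which is the "survived only incidentally" behavior described in the Summary.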

Fix

After the CFN provisioner mutates service state, trigger the same persistence the API handlers
use - for create, update, and delete - for every snapshot-backed service the stack op
touched.

- New SnapshotHook type (fakecloud-persistence): a type-erased async persist closure.
  Each snapshot-backed service exposes snapshot_hook(), built from its own state + snapshot
  store + serializing lock, so the serialization stays in the owning crate (it reuses the
  service's existing save_snapshot path - no duplicated snapshot logic). Covers lambda, sqs,
  sns, secretsmanager, dynamodb, stepfunctions, eventbridge, ssm, logs, kms, kinesis, ses,
  cognito, rds, elasticache, ecr, ecs, cloudwatch, apigateway, apigatewayv2, bedrock, scheduler,
  and iam.
- Server wiring collects each service's hook into a service-name -> hook map and hands it
  (plus the already-built S3Store) to the CloudFormation service.
- CloudFormation handler: after a stack create/update/delete, it maps the touched resource
  types to their owning services (service_key_for_type, e.g. AWS::Events::* -> eventbridge)
  and invokes each touched service's hook exactly once - writing that whole state through to
  disk. The await happens after every RwLockWriteGuard is released so the handler future stays
  Send.
- S3 is routed through the same S3Store path the real CreateBucket/DeleteBucket use
  (bucket create/delete, and bucket-policy create/update/delete write through as subresources),
  so a CFN-provisioned bucket lands on disk and a CFN-deleted one does not reappear. In memory
  mode the store is a MemoryS3Store, so these writes are no-ops.
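
The hook mechanism can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the exact SnapshotHook signature, the HookFuture alias, counting_hook, apply_stack_op and the tiny block_on executor are all hypothetical, and std::sync::RwLock stands in for whatever lock type the crates actually use.

```rust
use std::collections::{BTreeSet, HashMap};
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};
use std::task::{Context, Poll, Wake, Waker};

pub type HookFuture = Pin<Box<dyn Future<Output = ()> + Send>>;
/// Assumed shape of a type-erased async persist closure.
pub type SnapshotHook = Arc<dyn Fn() -> HookFuture + Send + Sync>;

/// Test double: counts invocations instead of writing a snapshot.
pub fn counting_hook(counter: Arc<AtomicUsize>) -> SnapshotHook {
    Arc::new(move || -> HookFuture {
        let counter = counter.clone();
        Box::pin(async move {
            counter.fetch_add(1, Ordering::SeqCst);
        })
    })
}

/// Mutates state under the write lock, drops the guard, then awaits each
/// touched service's hook once. Services without a hook (e.g. "s3") are skipped.
pub async fn apply_stack_op(
    stack: &RwLock<Vec<String>>,
    hooks: &HashMap<&'static str, SnapshotHook>,
    touched_services: &[&'static str],
) {
    let touched: BTreeSet<&'static str> = {
        let mut guard = stack.write().unwrap();
        guard.extend(touched_services.iter().map(|s| s.to_string()));
        touched_services.iter().copied().collect()
    }; // write guard is released here, before any await

    for key in touched {
        if let Some(hook) = hooks.get(key) {
            hook().await;
        }
    }
}

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, adequate only for futures that never park.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Deduplicating through a set before the loop is what gives "exactly once per touched service" even when a stack contains several resources owned by the same service; scoping the guard in its own block is one way to keep it from living across the await.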

StepFunctions (which previously only persisted incidentally on the next execution) is now
deterministic too; the fix is provisioner-general, not lambda-specific.

Tests

- Unit (fakecloud-cloudformation): provisioning via CFN fires the persist hook once per
  touched service (create/update/delete), skips services without a registered hook, writes the
  bucket (and bucket policy) through the S3Store, and removes them on delete; plus a
  service_key_for_type table test (direct services, the Events/Logs aliases, S3 -> none,
  and malformed/non-AWS types).
- Per-service: each snapshot-backed crate gets a snapshot_hook() test (None in memory mode;
  fires and persists when a store is set).
- E2E (fakecloud-e2e/tests/cfn_provisioner_persistence.rs, persistent mode + real restart):
  creates a Lambda function, a Secret, an SQS queue and an S3 bucket both via CloudFormation
  and via the direct service APIs, restarts the server against the same data dir, and asserts
  all eight survive (the CFN lambda + secret are the deterministic regression guards; the
  API-created set is the control). A second test asserts a CFN-deleted lambda + bucket stay gone
  after a restart.
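
For reference, the kind of mapping the service_key_for_type table test exercises might look like the sketch below. Only the AWS::Events::* -> eventbridge pairing and "S3 -> none" come from this description; the other rows, the parsing rules, and the signature are assumptions for illustration.

```rust
/// Maps a CloudFormation resource type to the key of its owning
/// snapshot-backed service. Hypothetical table; the real one covers more services.
pub fn service_key_for_type(resource_type: &str) -> Option<&'static str> {
    let mut parts = resource_type.split("::");
    let (vendor, service, kind) = (parts.next()?, parts.next()?, parts.next()?);
    // Reject non-AWS vendors, empty kinds, and types with extra segments.
    if vendor != "AWS" || kind.is_empty() || parts.next().is_some() {
        return None;
    }
    match service {
        "Lambda" => Some("lambda"),
        "SQS" => Some("sqs"),
        "SecretsManager" => Some("secretsmanager"),
        "StepFunctions" => Some("stepfunctions"),
        // Alias: the CFN namespace differs from the service name.
        "Events" => Some("eventbridge"),
        "Logs" => Some("logs"),
        // S3 goes through the S3Store path instead of a snapshot hook.
        "S3" => None,
        _ => None,
    }
}
```

Returning None for S3 keeps the hook dispatch from trying to snapshot a service that persists through its own store.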

Validated with cargo test --workspace (incl. fakecloud-e2e), cargo clippy --workspace --all-targets -- -D warnings, and cargo fmt --all --check.

Scope note

This persists on the successful create/update/delete paths. On a failed stack op the behavior
follows fakecloud's existing no-rollback model: snapshot-backed services are persisted only on
success, while S3 buckets write through eagerly (same as the real CreateBucket handler), so a
bucket created before a sibling resource fails can persist while the stack is CREATE_FAILED. The
AWS-correct answer here is rollback (deleting the partials), which is a larger change I left out of
scope. Happy to align the failure-path behavior whichever way you prefer.
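
The failure-path asymmetry can be modeled in a few lines. This is a toy simulation with hypothetical names (Step, Disk, create_stack), not the handler's code: buckets write through as soon as they are created, while snapshot-backed resources are flushed only if the whole op succeeds.

```rust
#[derive(Debug, PartialEq)]
pub enum StackStatus {
    CreateComplete,
    CreateFailed,
}

/// Hypothetical stand-in for the persistent store.
#[derive(Default)]
pub struct Disk {
    pub buckets: Vec<String>,
    pub snapshots: Vec<String>,
}

pub enum Step {
    Bucket(&'static str),
    Function { name: &'static str, fails: bool },
}

pub fn create_stack(steps: &[Step], disk: &mut Disk) -> StackStatus {
    let mut pending = Vec::new();
    for step in steps {
        match step {
            // Eager write-through, like the real CreateBucket handler.
            Step::Bucket(name) => disk.buckets.push(name.to_string()),
            // No rollback: stop here and leave earlier partials in place.
            Step::Function { fails: true, .. } => return StackStatus::CreateFailed,
            Step::Function { name, .. } => pending.push(name.to_string()),
        }
    }
    // Snapshot-backed services persist only on success.
    disk.snapshots.extend(pending);
    StackStatus::CreateComplete
}
```

Running it with a bucket followed by a failing function leaves the bucket on the simulated disk while the stack reports CreateFailed, matching the case described above.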

Fixes the reported failure mode: a Kubernetes spot-instance reschedule of the fakecloud pod
wiped the SAM-deployed workflow Lambda (so every lambda:invoke Step Functions task then failed
with Lambda.ResourceNotFoundException) and the sam deploy --resolve-s3 managed artifact bucket
vanished while its aws-sam-cli-managed-default stack stayed CREATE_COMPLETE.

Summary by cubic

Persist CloudFormation-provisioned resources across restarts in persistent mode by writing through to disk after stack create/update/delete and routing S3 bucket ops via the on-disk store. This keeps CFN stacks and their physical resources in sync and matches persistence of direct service API calls.

Bug Fixes
- Added SnapshotHook in fakecloud-persistence; each snapshot-backed service exposes snapshot_hook() that reuses its own snapshot path.
- Server collects hooks and passes them to CloudFormation; after a successful stack op, CFN maps touched resource types to owning services and invokes each hook once to persist.
- S3: CFN bucket create/delete and bucket-policy create/update/delete now use S3Store, so buckets persist correctly and deletes don’t reappear.
- Applies across all snapshot-backed services (e.g., Lambda, Secrets Manager, SQS, Step Functions, Logs, IAM, API Gateway v1/v2, DynamoDB, EventBridge, KMS, Kinesis, SSM, SNS, SES, ECS, ECR, RDS, ElastiCache, Bedrock, CloudWatch).

Dependencies
- Added tempfile as a dev-dependency in fakecloud-cloudformation.

Written for commit 818e961. Summary will update on new commits.

The CloudFormation resource provisioner mutated each service's in-memory state directly and never triggered that service's snapshot/persist path, so CFN/SAM-provisioned resources were lost on restart in persistent mode while the stack itself stayed CREATE_COMPLETE. Whether a resource survived was accidental: it depended on whether its owning service happened to be re-mutated before the restart (so AWS::Lambda::Function and AWS::SecretsManager::Secret were lost deterministically; SQS/StepFunctions survived only incidentally; S3 buckets bypassed the S3Store disk path entirely).

Add a type-erased SnapshotHook that each snapshot-backed service builds from its own state, snapshot store, and serializing lock (reusing its existing save_snapshot path). The CloudFormation handler maps the resource types a stack op touched to their owning services and invokes each hook once, on create, update, and delete. Route S3 bucket create/delete and bucket-policy create/update/delete through the same S3Store write-through path the real CreateBucket/DeleteBucket handlers use.

Together these make every provisioned resource land on disk, matching the persistence of the same resource created via its direct service API, and ensure a CFN-deleted resource does not reappear.

codecov · 2026-06-18T22:23:04Z

Codecov Report

❌ Patch coverage is 86.98678% with 187 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/fakecloud-server/src/main.rs | 0.00% | 64 Missing ⚠️ |
| crates/fakecloud-apigateway/src/service.rs | 84.74% | 9 Missing ⚠️ |
| crates/fakecloud-cognito/src/service/mod.rs | 78.57% | 9 Missing ⚠️ |
| crates/fakecloud-dynamodb/src/service/mod.rs | 78.57% | 9 Missing ⚠️ |
| crates/fakecloud-eventbridge/src/service.rs | 78.57% | 9 Missing ⚠️ |
| crates/fakecloud-kinesis/src/service.rs | 78.57% | 9 Missing ⚠️ |
| crates/fakecloud-kms/src/service.rs | 78.57% | 9 Missing ⚠️ |
| crates/fakecloud-logs/src/service/mod.rs | 83.63% | 9 Missing ⚠️ |
| crates/fakecloud-sns/src/service.rs | 78.57% | 9 Missing ⚠️ |
| crates/fakecloud-sqs/src/service/mod.rs | 78.57% | 9 Missing ⚠️ |

... and 13 more


vieiralucas merged commit 5cb3061 into faiscadev:main Jun 18, 2026
67 of 69 checks passed

vieiralucas mentioned this pull request Jun 18, 2026

chore(release): bump workspace to v0.20.0 #1769

Merged
