fix(cloudformation): persist provisioned resources across restart #1767
Merged
vieiralucas merged 1 commit on Jun 18, 2026
The CloudFormation resource provisioner mutated each service's in-memory state directly and never triggered that service's snapshot/persist path, so CFN/SAM-provisioned resources were lost on restart in persistent mode while the stack itself stayed CREATE_COMPLETE. Whether a resource survived was accidental: it depended on whether its owning service happened to be re-mutated before the restart (so AWS::Lambda::Function and AWS::SecretsManager::Secret were lost deterministically; SQS/StepFunctions survived only incidentally; S3 buckets bypassed the S3Store disk path entirely). Add a type-erased SnapshotHook that each snapshot-backed service builds from its own state, snapshot store, and serializing lock (reusing its existing save_snapshot path). The CloudFormation handler maps the resource types a stack op touched to their owning services and invokes each hook once, on create, update, and delete. Route S3 bucket create/delete and bucket-policy create/update/delete through the same S3Store write-through path the real CreateBucket/DeleteBucket handlers use. Together these make every provisioned resource land on disk, matching the persistence of the same resource created via its direct service API, and ensure a CFN-deleted resource does not reappear.
Hey @vieiralucas - another fakecloud fix from our SAM-on-Kubernetes setup. Heads up that this one
is bigger than my usual: the persistence gap turned out to be provisioner-general rather than
lambda-specific, so the fix touches every snapshot-backed service. I tried to keep that surface
mechanical and uniform (one small `snapshot_hook()` per service, reusing each service's existing
persist path), with the real logic concentrated in CloudFormation and S3. Happy to adjust anything
if you'd shape it differently.
If you could cut a new release after this merges it'd help us pick it up cleanly. No worries at
all if it's not convenient right now though - we can pin a head image or build our own in the
meantime. Thanks as always for being so welcoming to these.
Fixes #1766
Summary
With `FAKECLOUD_STORAGE_MODE=persistent`, resources created by the CloudFormation provisioner
were not written through to the persistent store. The provisioner mutated each service's
in-memory state directly and never triggered that service's snapshot/persist path, so a
CFN/SAM-deployed resource worked normally in the running process but was silently lost on the
next restart - while the CFN stack metadata persisted, leaving a `CREATE_COMPLETE` stack
whose physical resources no longer existed. The same resources created via their direct
service APIs persisted correctly, so the gap was the CFN provisioning path, not the resource
type.
Whether a CFN-provisioned resource survived a restart was therefore accidental - it depended on
whether its owning service happened to be re-mutated (re-serializing that service's whole state
to disk) before the restart:
- Lost deterministically: `AWS::Lambda::Function` (plus `AWS::Lambda::Permission` / `AWS::Lambda::Url`) and `AWS::SecretsManager::Secret`.
- Survived only incidentally: `AWS::SQS::Queue` and `AWS::StepFunctions::StateMachine`.
- S3 buckets bypassed the `S3Store` disk path the real `CreateBucket` uses, so CFN-created buckets were lost.

Root cause
Persistence is write-through inside each service's `handle()` (under `if mutates && success`) via
a private `save_snapshot()` that re-serializes the whole service state; S3/IAM use bespoke stores
(`S3Store` per-object writes; IAM a whole-state snapshot fn). There is no periodic or on-shutdown
flush. The CFN `ResourceProvisioner` holds the shared service state directly and mutates it in
place (e.g. `create_lambda_function` -> `state.functions.insert(...)`, `create_s3_bucket` ->
`state.buckets.insert(...)`); these provisioner fns are synchronous and never call the async
`save_snapshot()` (and bypass `S3Store` for buckets), so the resource lands in live memory but is
never written to disk.
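To make the gap concrete, here is a minimal synchronous sketch. It is a toy model, not fakecloud's code: `Service`, `handle_create`, `provision_create`, `restart`, and `survivors` are hypothetical names, and the real `save_snapshot()` is async and private to each service.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for one snapshot-backed service (hypothetical, not fakecloud's types).
#[derive(Default)]
struct Service {
    /// Live in-memory state, e.g. function names.
    state: BTreeSet<String>,
    /// What the last snapshot wrote; stands in for the on-disk store.
    disk: BTreeSet<String>,
}

impl Service {
    /// Re-serializes the whole state, like the write-through in `handle()`.
    fn save_snapshot(&mut self) {
        self.disk = self.state.clone();
    }

    /// API path: mutate, then persist.
    fn handle_create(&mut self, name: &str) {
        self.state.insert(name.to_string());
        self.save_snapshot();
    }

    /// Pre-fix provisioner path: mutate in place, never persist.
    fn provision_create(&mut self, name: &str) {
        self.state.insert(name.to_string());
    }

    /// Simulated restart: memory is rebuilt from disk only.
    fn restart(&mut self) {
        self.state = self.disk.clone();
    }
}

/// Returns the resources that survive a restart, in sorted order.
fn survivors(via_cfn: &str, via_api: Option<&str>) -> Vec<String> {
    let mut svc = Service::default();
    svc.provision_create(via_cfn);
    if let Some(name) = via_api {
        // Incidental re-mutation: the whole state, CFN resource included, is saved.
        svc.handle_create(name);
    }
    svc.restart();
    svc.state.into_iter().collect()
}
```

With no later API call, `survivors("cfn-fn", None)` is empty; with one, the CFN resource rides along in the next whole-state snapshot, which is the accidental survival described in the Summary.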
Fix
After the CFN provisioner mutates service state, trigger the same persistence the API handlers
use - for create, update, and delete - for every snapshot-backed service the stack op
touched.
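As a rough illustration of "every snapshot-backed service the stack op touched", the sketch below maps resource types to service keys and deduplicates them. Only the `AWS::Events::*` -> `eventbridge` alias, `S3` -> none, and the rejection of malformed/non-AWS types come from this PR; lowercasing the service segment for everything else, and the helper `touched_services`, are assumptions made for the example.

```rust
use std::collections::BTreeSet;

/// Hypothetical re-creation of the type -> service mapping.
/// Assumption: services other than `Events` and `S3` map to their lowercased segment.
fn service_key_for_type(resource_type: &str) -> Option<String> {
    let mut parts = resource_type.split("::");
    let (vendor, service, kind) = (parts.next()?, parts.next()?, parts.next()?);
    // Reject non-AWS vendors, empty segments, and extra trailing segments.
    if vendor != "AWS" || service.is_empty() || kind.is_empty() || parts.next().is_some() {
        return None;
    }
    match service {
        // S3 has its own write-through store, so it has no snapshot hook.
        "S3" => None,
        "Events" => Some("eventbridge".to_string()),
        other => Some(other.to_ascii_lowercase()),
    }
}

/// Collapses the touched resource types into a sorted, duplicate-free list of
/// service keys, so each service's hook would be invoked once (hypothetical helper).
fn touched_services(resource_types: &[&str]) -> Vec<String> {
    let keys: BTreeSet<String> = resource_types
        .iter()
        .filter_map(|t| service_key_for_type(t))
        .collect();
    keys.into_iter().collect()
}
```

A stack containing a Lambda function, a Lambda permission, an SQS queue, and an S3 bucket would yield just `lambda` and `sqs` here: two hook invocations, with the bucket handled separately through the S3 store.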
- `SnapshotHook` type (`fakecloud-persistence`): a type-erased async persist closure.
- Each snapshot-backed service exposes `snapshot_hook()`, built from its own state + snapshot
  store + serializing lock, so the serialization stays in the owning crate (it reuses the
  service's existing `save_snapshot` path - no duplicated snapshot logic). Covers lambda, sqs,
  sns, secretsmanager, dynamodb, stepfunctions, eventbridge, ssm, logs, kms, kinesis, ses,
  cognito, rds, elasticache, ecr, ecs, cloudwatch, apigateway, apigatewayv2, bedrock, scheduler,
  and iam.
- A `service-name -> hook` map is built and handed (plus the already-built `S3Store`) to the
  CloudFormation service.
- The CloudFormation handler maps the touched resource types to their owning services
  (`service_key_for_type`, e.g. `AWS::Events::*` -> `eventbridge`) and invokes each touched
  service's hook exactly once - writing that whole state through to disk. The await happens
  after every `RwLockWriteGuard` is released so the handler future stays `Send`.
- S3 bucket operations go through the same `S3Store` path the real `CreateBucket` / `DeleteBucket`
  use (bucket create/delete, and bucket-policy create/update/delete write through as subresources),
  so a CFN-provisioned bucket lands on disk and a CFN-deleted one does not reappear. In memory
  mode the store is a `MemoryS3Store`, so these writes are no-ops.
- `StepFunctions` (which previously only persisted incidentally on the next execution) is now
  deterministic too; the fix is provisioner-general, not lambda-specific.
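The hook shape and the guard-before-await ordering described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the crate's actual definitions: the real `SnapshotHook` may be declared differently, the `Vec<String>` state and `Mutex` "disk" are stand-ins, and `provision_then_persist`, `assert_send`, `block_on`, and `run_once` are hypothetical helpers.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex, RwLock};
use std::task::{Context, Poll, Wake, Waker};

/// Boxed, `Send` future returned by a hook.
type HookFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

/// Type-erased async persist closure (sketch; the real type may differ).
type SnapshotHook = Arc<dyn Fn() -> HookFuture + Send + Sync>;

/// Builds a hook from a service's own state and a stand-in "disk".
fn snapshot_hook(state: Arc<RwLock<Vec<String>>>, disk: Arc<Mutex<Vec<String>>>) -> SnapshotHook {
    Arc::new(move || -> HookFuture {
        let (state, disk) = (Arc::clone(&state), Arc::clone(&disk));
        Box::pin(async move {
            // Clone under the read lock, then write the whole state through.
            let snapshot = state.read().unwrap().clone();
            *disk.lock().unwrap() = snapshot;
        })
    })
}

/// Mutates under the write guard, drops it, and only then awaits the hook.
async fn provision_then_persist(state: Arc<RwLock<Vec<String>>>, hook: SnapshotHook, name: String) {
    {
        let mut guard = state.write().unwrap();
        guard.push(name);
    } // guard released here, before the await, so the future stays `Send`
    hook().await;
}

/// Compile-time check that a value (here, a future) is `Send`.
fn assert_send<T: Send>(value: T) -> T {
    value
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling executor so the sketch needs no async runtime.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut future = Box::pin(future);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}

/// Runs one provision + persist round and returns what reached "disk".
fn run_once(name: &str) -> Vec<String> {
    let state = Arc::new(RwLock::new(Vec::new()));
    let disk = Arc::new(Mutex::new(Vec::new()));
    let hook = snapshot_hook(Arc::clone(&state), Arc::clone(&disk));
    block_on(assert_send(provision_then_persist(state, hook, name.to_string())));
    let persisted = disk.lock().unwrap().clone();
    persisted
}
```

If the inner block around `guard` were removed so the guard stayed alive across `hook().await`, `assert_send` would stop compiling, because a `std::sync::RwLockWriteGuard` is not `Send`. Keeping the hook type-erased also means the caller never needs to name any service's state type.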
Tests
- In `fakecloud-cloudformation`: provisioning via CFN fires the persist hook once per
  touched service (create/update/delete), skips services without a registered hook, writes the
  bucket (and bucket policy) through the `S3Store`, and removes them on delete; plus a
  `service_key_for_type` table test (direct services, the `Events` / `Logs` aliases, `S3` -> none,
  and malformed/non-AWS types).
- A `snapshot_hook()` test (`None` in memory mode; fires and persists when a store is set).
- End-to-end (`fakecloud-e2e/tests/cfn_provisioner_persistence.rs`, persistent mode + real restart):
  creates a Lambda function, a Secret, an SQS queue and an S3 bucket both via CloudFormation
  and via the direct service APIs, restarts the server against the same data dir, and asserts
  all eight survive (the CFN lambda + secret are the deterministic regression guards; the
  API-created set is the control). A second test asserts a CFN-deleted lambda + bucket stay gone
  after a restart.

Validated with `cargo test --workspace` (incl. `fakecloud-e2e`),
`cargo clippy --workspace --all-targets -- -D warnings`, and `cargo fmt --all --check`.

Scope note
This persists on the successful create/update/delete paths. On a failed stack op the behavior
follows fakecloud's existing no-rollback model: snapshot-backed services are persisted only on
success, while S3 buckets write through eagerly (same as the real `CreateBucket` handler), so a
bucket created before a sibling resource fails can persist while the stack is `CREATE_FAILED`. The
AWS-correct answer here is rollback (deleting the partials), which is a larger change I left out of
scope. Happy to align the failure-path behavior whichever way you prefer.
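A small model of that failure-path asymmetry, for illustration only: `Resource`, `Outcome`, and `create_stack` are hypothetical and much simpler than the real provisioner, which handles many more resource types.

```rust
/// One resource in a toy stack (hypothetical model).
enum Resource {
    /// An S3 bucket, written through to the store as soon as it is created.
    Bucket(&'static str),
    /// A snapshot-backed resource; `ok == false` simulates a provisioning failure.
    Snapshotted { ok: bool },
}

/// What is on disk after the stack op finishes.
#[derive(Debug, PartialEq)]
struct Outcome {
    status: &'static str,
    /// Buckets that reached the store (eager write-through).
    buckets_on_disk: Vec<String>,
    /// Whether the snapshot hooks ran (success path only).
    snapshots_persisted: bool,
}

/// Provisions resources in order with no rollback on failure.
fn create_stack(resources: &[Resource]) -> Outcome {
    let mut buckets_on_disk = Vec::new();
    for resource in resources {
        match resource {
            // Eager: the bucket is on disk before later resources are attempted.
            Resource::Bucket(name) => buckets_on_disk.push(name.to_string()),
            Resource::Snapshotted { ok: true } => {}
            Resource::Snapshotted { ok: false } => {
                // No rollback: earlier buckets stay, snapshot hooks never run.
                return Outcome {
                    status: "CREATE_FAILED",
                    buckets_on_disk,
                    snapshots_persisted: false,
                };
            }
        }
    }
    Outcome {
        status: "CREATE_COMPLETE",
        buckets_on_disk,
        snapshots_persisted: true,
    }
}
```

A bucket followed by a failing sibling ends as `CREATE_FAILED` with the bucket still on disk, which is the partial state a rollback implementation would have to clean up.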
Fixes the reported failure mode: a Kubernetes spot-instance reschedule of the fakecloud pod
wiped the SAM-deployed workflow Lambda (so every `lambda:invoke` Step Functions task then failed
with `Lambda.ResourceNotFoundException`) and the `sam deploy --resolve-s3` managed artifact bucket
vanished while its `aws-sam-cli-managed-default` stack stayed `CREATE_COMPLETE`.

Summary by cubic
Persist CloudFormation-provisioned resources across restarts in persistent mode by writing through to disk after stack create/update/delete and routing S3 bucket ops via the on-disk store. This keeps CFN stacks and their physical resources in sync and matches persistence of direct service API calls.
Bug Fixes
- Added `SnapshotHook` in `fakecloud-persistence`; each snapshot-backed service exposes `snapshot_hook()` that reuses its own snapshot path.
- Routed S3 bucket operations through `S3Store`, so buckets persist correctly and deleted buckets don't reappear.

Dependencies
- Added `tempfile` as a dev-dependency in `fakecloud-cloudformation`.

Written for commit 818e961.