fix(cloudformation): persist provisioned resources across restart #1767

Merged: vieiralucas merged 1 commit into faiscadev:main from WarpRat:fix/cfn-provisioner-persistence on Jun 18, 2026

WarpRat (Contributor) commented Jun 18, 2026

Hey @vieiralucas - another fakecloud fix from our SAM-on-Kubernetes setup. Heads up that this one
is bigger than my usual: the persistence gap turned out to be provisioner-general rather than
lambda-specific, so the fix touches every snapshot-backed service. I tried to keep that surface
mechanical and uniform (one small snapshot_hook() per service, reusing each service's existing
persist path), with the real logic concentrated in CloudFormation and S3. Happy to adjust anything
if you'd shape it differently.

If you could cut a new release after this merges it'd help us pick it up cleanly. No worries at
all if it's not convenient right now though - we can pin a head image or build our own in the
meantime. Thanks as always for being so welcoming to these.

Fixes #1766

Summary

With FAKECLOUD_STORAGE_MODE=persistent, resources created by the CloudFormation provisioner
were not written through to the persistent store. The provisioner mutates each service's
in-memory state directly and never triggered that service's snapshot/persist path, so a
CFN/SAM-deployed resource worked normally in the running process but was silently lost on the
next restart - while the CFN stack metadata persisted, leaving a CREATE_COMPLETE stack
whose physical resources no longer existed. The same resources created via their direct
service APIs persisted correctly, so the gap was the CFN provisioning path, not the resource
type.

Whether a CFN-provisioned resource survived a restart was therefore accidental - it depended on
whether its owning service happened to be re-mutated (re-serializing that service's whole state
to disk) before the restart:

  • Deterministically lost: AWS::Lambda::Function (plus AWS::Lambda::Permission /
    AWS::Lambda::Url) and AWS::SecretsManager::Secret.
  • Survived only incidentally: AWS::SQS::Queue and AWS::StepFunctions::StateMachine.
  • S3 variant: the provisioner inserted into the in-memory S3 map and bypassed the bespoke
    S3Store disk path the real CreateBucket uses, so CFN-created buckets were lost.

Root cause

Persistence is write-through inside each service's handle() (under if mutates && success) via
a private save_snapshot() that re-serializes the whole service state; S3/IAM use bespoke stores
(S3Store per-object writes; IAM a whole-state snapshot fn). There is no periodic or on-shutdown
flush. The CFN ResourceProvisioner holds the shared service state directly and mutates it in
place (e.g. create_lambda_function -> state.functions.insert(...), create_s3_bucket ->
state.buckets.insert(...)); these provisioner fns are synchronous and never call the async
save_snapshot() (and bypass S3Store for buckets), so the resource lands in live memory but is
never written to disk.
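
To make the gap concrete, here is a minimal sketch of the two paths. It is not the fakecloud code: State, Disk, Service and persisted are hypothetical stand-ins, and save_snapshot is synchronous here for brevity, whereas the real one is async.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for one service's in-memory state.
#[derive(Default)]
pub struct State {
    pub functions: BTreeMap<String, String>,
}

/// Hypothetical stand-in for the on-disk snapshot of that state.
pub type Disk = Arc<Mutex<Option<Vec<String>>>>;

pub struct Service {
    pub state: Arc<Mutex<State>>,
    pub disk: Disk,
}

impl Service {
    /// Re-serializes the whole service state, not just the changed entry.
    fn save_snapshot(&self) {
        let names: Vec<String> = self.state.lock().unwrap().functions.keys().cloned().collect();
        *self.disk.lock().unwrap() = Some(names);
    }

    /// API path: mutate, then write through only when the call mutated and succeeded.
    pub fn handle(&self, name: &str, code: &str) -> bool {
        let mutates = true;
        let success = self
            .state
            .lock()
            .unwrap()
            .functions
            .insert(name.to_string(), code.to_string())
            .is_none();
        if mutates && success {
            self.save_snapshot();
        }
        success
    }
}

/// Provisioner path before the fix: mutates the shared state in place.
pub fn create_lambda_function(state: &Arc<Mutex<State>>, name: &str, code: &str) {
    state.lock().unwrap().functions.insert(name.to_string(), code.to_string());
    // No save_snapshot() here, so the on-disk copy is now stale.
}

/// What a restart would load back from disk.
pub fn persisted(disk: &Disk) -> Vec<String> {
    disk.lock().unwrap().clone().unwrap_or_default()
}
```

In this sketch a function inserted by create_lambda_function only reaches disk if a later handle() call on the same service happens to re-serialize the whole state, which is the "survived only incidentally" behavior described in the Summary.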

Fix

After the CFN provisioner mutates service state, trigger the same persistence the API handlers
use - for create, update, and delete - for every snapshot-backed service the stack op
touched.

  • New SnapshotHook type (fakecloud-persistence): a type-erased async persist closure.
  • Each snapshot-backed service exposes snapshot_hook(), built from its own state + snapshot
    store + serializing lock, so the serialization stays in the owning crate (it reuses the
    service's existing save_snapshot path - no duplicated snapshot logic). Covers lambda, sqs,
    sns, secretsmanager, dynamodb, stepfunctions, eventbridge, ssm, logs, kms, kinesis, ses,
    cognito, rds, elasticache, ecr, ecs, cloudwatch, apigateway, apigatewayv2, bedrock, scheduler,
    and iam.
  • Server wiring collects each service's hook into a service-name -> hook map and hands it
    (plus the already-built S3Store) to the CloudFormation service.
  • CloudFormation handler: after a stack create/update/delete, it maps the touched resource
    types to their owning services (service_key_for_type, e.g. AWS::Events::* -> eventbridge)
    and invokes each touched service's hook exactly once - writing that whole state through to
    disk. The await happens after every RwLockWriteGuard is released so the handler future stays
    Send.
  • S3 is routed through the same S3Store path the real CreateBucket/DeleteBucket use
    (bucket create/delete, and bucket-policy create/update/delete write through as subresources),
    so a CFN-provisioned bucket lands on disk and a CFN-deleted one does not reappear. In memory
    mode the store is a MemoryS3Store, so these writes are no-ops.
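
The hook mechanism in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the definitions from this PR: the exact SnapshotHook signature is a guess at what "type-erased async persist closure" means, and BoxFuture, counting_hook, persist_touched and block_on are hypothetical helpers (block_on is a tiny std-only poller so the sketch needs no async runtime).

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

pub type BoxFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

/// Assumed shape: a cloneable closure that returns a boxed Send future.
pub type SnapshotHook = Arc<dyn Fn() -> BoxFuture + Send + Sync>;

/// Hypothetical service hook: captures its own state and "persists" by
/// bumping a counter, standing in for the service's save_snapshot path.
pub fn counting_hook(writes: Arc<AtomicUsize>) -> SnapshotHook {
    Arc::new(move || -> BoxFuture {
        let writes = Arc::clone(&writes);
        Box::pin(async move {
            writes.fetch_add(1, Ordering::SeqCst);
        })
    })
}

/// Invokes the hook of each touched service exactly once and skips services
/// with no registered hook. Returns how many hooks fired.
pub async fn persist_touched(hooks: &BTreeMap<String, SnapshotHook>, touched: &[&str]) -> usize {
    let unique: BTreeSet<&str> = touched.iter().copied().collect();
    let mut fired = 0;
    for key in unique {
        if let Some(hook) = hooks.get(key) {
            hook().await;
            fired += 1;
        }
    }
    fired
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, for demonstration only.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

Because the hook is type-erased, the CloudFormation crate only needs the service-name -> hook map and never has to name a service's state or snapshot types, which is what lets the serialization stay in the owning crate.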

StepFunctions (which previously only persisted incidentally on the next execution) is now
deterministic too; the fix is provisioner-general, not lambda-specific.

Tests

  • Unit (fakecloud-cloudformation): provisioning via CFN fires the persist hook once per
    touched service (create/update/delete), skips services without a registered hook, writes the
    bucket (and bucket policy) through the S3Store, and removes them on delete; plus a
    service_key_for_type table test (direct services, the Events/Logs aliases, S3 -> none,
    and malformed/non-AWS types).
  • Per-service: each snapshot-backed crate gets a snapshot_hook() test (None in memory mode;
    fires and persists when a store is set).
  • E2E (fakecloud-e2e/tests/cfn_provisioner_persistence.rs, persistent mode + real restart):
    creates a Lambda function, a Secret, an SQS queue and an S3 bucket both via CloudFormation
    and via the direct service APIs, restarts the server against the same data dir, and asserts
    all eight survive (the CFN lambda + secret are the deterministic regression guards; the
    API-created set is the control). A second test asserts a CFN-deleted lambda + bucket stay gone
    after a restart.
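
For readers who have not opened the diff, the mapping that the service_key_for_type table test exercises could look something like the sketch below. It is a guess at the shape, not the function from this PR: only the AWS::Events::* -> eventbridge alias, the S3 -> none case, and the malformed/non-AWS cases come from the description above. Lowercasing the service segment for direct services is an assumption, and the Logs alias is left out because its target is not stated here.

```rust
/// Hypothetical re-creation of the resource-type -> service-key mapping.
/// Returns None when no snapshot hook should fire for the type.
pub fn service_key_for_type(resource_type: &str) -> Option<String> {
    let mut parts = resource_type.split("::");
    // A well-formed type has at least three segments: AWS::<Service>::<Kind>.
    let (vendor, service, kind) = (parts.next()?, parts.next()?, parts.next()?);
    if vendor != "AWS" || service.is_empty() || kind.is_empty() {
        return None;
    }
    match service {
        // S3 goes through the S3Store path instead of a snapshot hook.
        "S3" => None,
        // Alias named in the PR description.
        "Events" => Some("eventbridge".to_string()),
        // Assumption: direct services use the lowercased service segment.
        other => Some(other.to_ascii_lowercase()),
    }
}
```

Under these assumptions, AWS::Lambda::Function maps to lambda, AWS::Events::Rule to eventbridge, and AWS::S3::Bucket, Custom::Thing and AWS::Lambda all map to none.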

Validated with cargo test --workspace (incl. fakecloud-e2e), cargo clippy --workspace --all-targets -- -D warnings, and cargo fmt --all --check.

Scope note

This persists on the successful create/update/delete paths. On a failed stack op the behavior
follows fakecloud's existing no-rollback model: snapshot-backed services are persisted only on
success, while S3 buckets write through eagerly (same as the real CreateBucket handler), so a
bucket created before a sibling resource fails can persist while the stack is CREATE_FAILED. The
AWS-correct answer here is rollback (deleting the partials), which is a larger change I left out of
scope. Happy to align the failure-path behavior whichever way you prefer.
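
The failure-path asymmetry is easier to see in a toy model. The sketch below is illustrative only: Disk, Resource and create_stack are hypothetical names, and provisioning is reduced to set inserts. It shows buckets writing through eagerly while snapshot-backed services are persisted only once the whole stack op succeeds.

```rust
use std::collections::BTreeSet;

/// Hypothetical view of what is on disk after a stack op.
#[derive(Default, Debug)]
pub struct Disk {
    pub buckets: BTreeSet<String>,
    pub snapshots: BTreeSet<String>,
}

pub enum Resource {
    Bucket(String),
    Function(String),
    /// A resource whose creation fails, to exercise the failure path.
    Failing,
}

/// Provisions resources in order and returns the resulting stack status.
pub fn create_stack(resources: &[Resource], disk: &mut Disk) -> &'static str {
    let mut touched = BTreeSet::new();
    for resource in resources {
        match resource {
            // Eager write-through, like the real CreateBucket handler.
            Resource::Bucket(name) => {
                disk.buckets.insert(name.clone());
            }
            // In memory only for now; the owning service is persisted below.
            Resource::Function(_) => {
                touched.insert("lambda".to_string());
            }
            // No rollback: partial writes made so far stay where they are.
            Resource::Failing => return "CREATE_FAILED",
        }
    }
    // Success path only: persist each touched snapshot-backed service once.
    disk.snapshots.extend(touched);
    "CREATE_COMPLETE"
}
```

Creating a bucket, a function and then a failing resource leaves the bucket on disk with no lambda snapshot written, which matches the CREATE_FAILED case described above.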

Fixes the reported failure mode: a Kubernetes spot-instance reschedule of the fakecloud pod
wiped the SAM-deployed workflow Lambda (so every lambda:invoke Step Functions task then failed
with Lambda.ResourceNotFoundException) and the sam deploy --resolve-s3 managed artifact bucket
vanished while its aws-sam-cli-managed-default stack stayed CREATE_COMPLETE.


Summary by cubic

Persist CloudFormation-provisioned resources across restarts in persistent mode by writing through to disk after stack create/update/delete and routing S3 bucket ops via the on-disk store. This keeps CFN stacks and their physical resources in sync and matches persistence of direct service API calls.

  • Bug Fixes

    • Added SnapshotHook in fakecloud-persistence; each snapshot-backed service exposes snapshot_hook() that reuses its own snapshot path.
    • Server collects hooks and passes them to CloudFormation; after a successful stack op, CFN maps touched resource types to owning services and invokes each hook once to persist.
    • S3: CFN bucket create/delete and bucket-policy create/update/delete now use S3Store, so buckets persist correctly and deletes don’t reappear.
    • Applies across all snapshot-backed services (e.g., Lambda, Secrets Manager, SQS, Step Functions, Logs, IAM, API Gateway v1/v2, DynamoDB, EventBridge, KMS, Kinesis, SSM, SNS, SES, ECS, ECR, RDS, ElastiCache, Bedrock, CloudWatch).
  • Dependencies

    • Added tempfile as a dev-dependency in fakecloud-cloudformation.

Written for commit 818e961. Summary will update on new commits.

The CloudFormation resource provisioner mutated each service's in-memory state
directly and never triggered that service's snapshot/persist path, so
CFN/SAM-provisioned resources were lost on restart in persistent mode while the
stack itself stayed CREATE_COMPLETE. Whether a resource survived was accidental:
it depended on whether its owning service happened to be re-mutated before the
restart (so AWS::Lambda::Function and AWS::SecretsManager::Secret were lost
deterministically; SQS/StepFunctions survived only incidentally; S3 buckets
bypassed the S3Store disk path entirely).

Add a type-erased SnapshotHook that each snapshot-backed service builds from its
own state, snapshot store, and serializing lock (reusing its existing
save_snapshot path). The CloudFormation handler maps the resource types a stack
op touched to their owning services and invokes each hook once, on create,
update, and delete. Route S3 bucket create/delete and bucket-policy
create/update/delete through the same S3Store write-through path the real
CreateBucket/DeleteBucket handlers use. Together these make every provisioned
resource land on disk, matching the persistence of the same resource created via
its direct service API, and ensure a CFN-deleted resource does not reappear.
@vieiralucas vieiralucas merged commit 5cb3061 into faiscadev:main Jun 18, 2026
67 of 69 checks passed
Development

Successfully merging this pull request may close these issues.

CloudFormation-provisioned resources are not persisted and vanish on restart (persistent mode)
