Skip to content

Fix issue with reconciliation of resources using UseAsync=true#562

Open
fabiencastarede wants to merge 4 commits intocrossplane:mainfrom
fabiencastarede:fix-async-res-reconcile
Open

Fix issue with reconciliation of resources using UseAsync=true#562
fabiencastarede wants to merge 4 commits intocrossplane:mainfrom
fabiencastarede:fix-async-res-reconcile

Conversation

@fabiencastarede
Copy link
Copy Markdown

Description

This PR fixes a critical issue #561 where resources configured with UseAsync=true create duplicate cloud resources after provider pod restarts or Kubernetes cluster backup/restore operations (e.g., Velero).

Problem

Resources with UseAsync = true store Terraform workspace state in ephemeral pod storage (/tmp/<workspace-id>/). When the provider pod restarts:

  • Workspace state is lost
  • Provider treats the resource as not existing in Terraform state
  • Creates a duplicate resource in the cloud provider
  • Updates external-name with new resource ID, orphaning the original resource

Reproduction

# 1. Create an async resource
apiVersion: kube.ovh.example.io/v1alpha1
kind: Cluster
metadata:
  name: test-cluster
spec:
  forProvider:
    serviceName: "my-project"
    region: "EU-WEST-PAR"
# 2. Wait for creation
kubectl wait --for=condition=Ready cluster/test-cluster

# 3. Note the external-name (resource ID)
kubectl get cluster test-cluster -o jsonpath='{.metadata.annotations.crossplane\.io/external-name}'
# Output: abc123-original-id

# 4. Restart provider pod
kubectl delete pod -n crossplane-system -l pkg.crossplane.io/provider=provider-ovh

# 5. Bug: external-name changes, duplicate created
kubectl get cluster test-cluster -o jsonpath='{.metadata.annotations.crossplane\.io/external-name}'
# Output: xyz789-new-duplicate-id
# Result: Two clusters now exist in cloud provider!

Solution

When an async resource has the external-create-succeeded annotation (indicating prior successful creation) but workspace state is missing, use Import instead of Refresh to reconstruct state directly from the cloud provider API.

Code Changes

File: pkg/controller/external.go

Added logic in the Observe() function before calling Refresh():

// For async resources that were previously created, use Import instead
// of Refresh if the resource has been successfully created before.
// This prevents duplicate resource creation after provider pod restarts
// when the ephemeral workspace state in /tmp is lost.
// The external-create-succeeded annotation persists in Kubernetes and
// indicates the resource was successfully created or imported previously.
if e.config.UseAsync && meta.GetExternalName(tr) != "" {
    annotations := tr.GetAnnotations()
    if _, hasCreateSucceeded := annotations["crossplane.io/external-create-succeeded"]; hasCreateSucceeded {
        e.logger.Debug("Using Import instead of Refresh for async resource with external-create-succeeded annotation",
            "external-name", meta.GetExternalName(tr))
        return e.Import(ctx, tr)
    }
    e.logger.Debug("Async resource missing external-create-succeeded annotation, using Refresh",
        "external-name", meta.GetExternalName(tr),
        "annotations", annotations)
} else {
    e.logger.Debug("Not using Import fallback",
        "useAsync", e.config.UseAsync,
        "externalName", meta.GetExternalName(tr))
}

Required import added:

"github.com/crossplane/crossplane-runtime/v2/pkg/meta"

How It Works

  1. Detection: Checks if resource is async AND has external-create-succeeded annotation
  2. Import: Uses Import() to reconstruct Terraform state from cloud provider API using the external-name as resource ID
  3. State Reconstruction: Import queries cloud provider API and rebuilds the tfstate file
  4. Normal Flow: After Import succeeds, reconciliation continues normally with proper state

Why This Works

  • external-name annotation persists in Kubernetes (not lost on pod restart)
  • external-create-succeeded annotation persists in Kubernetes (backed up by Velero)
  • ✅ Import reconstructs state directly from cloud provider API
  • ✅ No duplicate resources created
  • ✅ No manual intervention required
  • ✅ Works with Velero backup/restore
  • ✅ Minimal code change, low risk
  • ✅ Only affects async resources with confirmed prior creation

Testing

Tested successfully with provider-ovh managing OVH Managed Kubernetes Clusters (a resource with UseAsync = true).

Test Scenarios

Scenario Steps Result
Provider pod restart 1. Create async resource
2. Verify creation succeeded
3. Delete provider pod
4. Wait for pod restart
5. Check for duplicates
✅ No duplicate created
✅ external-name unchanged
✅ Resource synced
Velero backup/restore 1. Create async resource
2. Backup with Velero
3. Reset Kubernetes cluster
4. Restore with Velero
5. Check for duplicates
✅ No duplicate created
✅ external-name preserved
✅ Resource synced with existing cloud resource
Node failure 1. Create async resource on node A
2. Drain/cordon node A
3. Pod reschedules to node B
4. Check for duplicates
✅ No duplicate created
✅ Resource remains synced

Debug Logs Confirming Fix

After applying the fix, provider logs show:

Using Import instead of Refresh for async resource with external-create-succeeded annotation external-name=abc123-original-id
Successfully imported resource into Terraform state
Resource observation completed, upToDate=true

Impact

  • Severity: Critical bug fix for async resources
  • Scope: Only affects resources with UseAsync = true that have been previously created
  • Risk: Low - only changes behavior when workspace state is missing AND resource has external-create-succeeded annotation
  • Breaking Changes: None
  • Performance: Minimal - Import is only used when state is missing (pod restart scenario)

Alternative Solutions Considered

1. PersistentVolume for Terraform Workspaces

Rejected: Adds infrastructure complexity, requires storage provisioning, doesn't scale well with multiple provider instances

2. Store tfstate in Kubernetes Secrets

Rejected: Large state files could exceed secret size limits (1MB), performance concerns with frequent updates

3. Disable UseAsync

Rejected: Removes async operation tracking capability, breaks long-running operations

4. Velero Filesystem Backup

Rejected: Only solves Velero restore case, doesn't help with pod restarts or node failures

5. Pre-Create Existence Check

Partially Rejected: Doesn't handle all edge cases, Import is more robust and already well-tested

Additional Notes

Provider Implementation Considerations

When using this fix, ensure your provider's GetIDFn configurations handle empty externalName values correctly during initial resource creation:

GetIDFn: func(ctx context.Context, externalName string, parameters map[string]any, providerConfig map[string]any) (string, error) {
    // Return empty string if external-name is not set yet (resource being created)
    if externalName == "" {
        return "", nil
    }
    // ... construct composite ID for Import
    return fmt.Sprintf("%s/%s", serviceName, externalName), nil
}

This prevents incomplete IDs (e.g., service_name/ instead of service_name/resource_id) from being set in tfstate before resource creation completes.

Related Issues

This fix addresses duplicate resource creation issues that have been reported by multiple users of upjet-based providers, particularly for resources requiring long-running operations such as:

  • Managed Kubernetes clusters
  • RDS database instances
  • Virtual machines with custom images
  • Any resource with complex provisioning workflows

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Comments added to explain complex logic
  • Changes tested with real provider (provider-ovh)
  • All test scenarios passing (pod restart, Velero restore, node failure)
  • No breaking changes introduced
  • Logging added for debugging

References

When a provider pod restarts, the ephemeral Terraform workspace state
stored in /tmp/<uid>/ is lost. For resources with UseAsync=true, this
causes the Refresh operation to fail to detect existing resources,
triggering duplicate resource creation.

This fix detects async resources that were previously created by
checking for the crossplane.io/external-create-succeeded annotation
and uses Import instead of Refresh. Import reconstructs state directly
from the cloud provider API, avoiding the duplicate creation issue.

Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>
Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>
Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>
The Import function was always returning ResourceUpToDate: true
without checking if the desired state matches the actual state.
This caused updates to resources to not be detected after they
were successfully created and the import fallback was used.

Now Import calls workspace.Plan() to properly detect drift,
just like the normal Refresh flow does.

Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>
@fabiencastarede fabiencastarede force-pushed the fix-async-res-reconcile branch from 51a58fc to d900f98 Compare March 31, 2026 07:55
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

The controller's Observe method now conditionally routes async resources with the crossplane.io/external-create-succeeded annotation to an Import path instead of the standard refresh flow. The Import method additionally runs drift detection via workspace.Plan after resource import and updates the up-to-date condition accordingly.

Changes

Cohort / File(s) Summary
Async Resource Handling & Drift Detection
pkg/controller/external.go
Modified Observe to conditionally route async resources with external-create-succeeded annotation to Import fallback path. Enhanced Import to run workspace.Plan for drift detection and set up-to-date condition via resource.SetUpToDateCondition instead of always returning true.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: fixing a critical issue with UseAsync=true resource reconciliation after pod restarts.
Description check ✅ Passed The description comprehensively relates to the changeset, detailing the problem, solution, code changes, testing, and impact of the UseAsync reconciliation fix.
Configuration Api Breaking Changes ✅ Passed No files in pkg/config/** were modified; changes isolated to pkg/controller/external.go using existing API without altering public types or function signatures.
Generated Code Manual Edits ✅ Passed The pull request modifies only pkg/controller/external.go, which does not match the zz_*.go pattern for generated code files. The changes implement logic for handling async resources through the Observe and Import methods, which is appropriate manual editing of controller source code.
Template Breaking Changes ✅ Passed external.go is an implementation file, not a template file; changes are narrowly scoped to async resources with specific annotation and provide backward-compatible safeguards.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/controller/external.go (1)

219-234: Template change affecting all generated providers — thanks for this solid fix!

This is a valuable solution for preventing duplicate resources after pod restarts. A few points to consider:

  1. Hardcoded annotation key: The code uses the literal string "crossplane.io/external-create-succeeded" directly. Does crossplane-runtime define a constant for this? Using a constant would help maintain consistency across the ecosystem and prevent typos.

  2. Debug log verbosity: Line 231 logs the entire annotations map, which could be very verbose for resources with many annotations. Consider logging just the annotation keys or a count instead to keep logs manageable.

  3. Multi-provider testing: Since this is a template file under pkg/controller/external*.go that affects all generated providers, could you confirm this has been tested with multiple provider implementations (not just provider-ovh)? Template changes should ideally be validated across different provider types before merging.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fb832136-83f0-41e6-9ed6-d23da6f7cdb9

📥 Commits

Reviewing files that changed from the base of the PR and between 52796ea and d900f98.

📒 Files selected for processing (1)
  • pkg/controller/external.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant