Skip to content

Flaky E2E: DPA reconciliation race condition causes NoDefaultBackupLocation test failure #2236

@kaovilai

Description

@kaovilai

Summary

The E2E test "DPA CR with NoDefaultBackupLocation and with BackupImages false" fails intermittently due to optimistic locking conflicts during DPA spec changes. The Consistently(IsReconciledTrue(), 1min, 15s) check catches transient Reconciled=False states caused by concurrent reconciliation loops conflicting on the same DPA resource.

Evidence

From PR #2234 CI run:

  • JUnit: Failed after 0.063s. Expected <bool>: false to be true at dpa_deployment_suite_test.go:166
  • Controller logs: Multiple "Operation cannot be fulfilled on dataprotectionapplications.oadp.openshift.io: the object has been modified; please apply your changes to the latest version and try again" errors
  • Timeline: Conflicts occur when test switches from DPA with BackupLocations to NoDefaultBackupLocation=true, BackupLocations=[]

Root Cause

When the DPA spec changes to NoDefaultBackupLocation=true, the reconciler tries to:

  1. Delete BackupStorageLocation(s) that no longer match the spec
  2. Delete registry secrets associated with those BSLs
  3. Delete the node-agent daemonset
  4. Update DPA status

Multiple reconciliation loops fire simultaneously, hitting optimistic locking conflicts. The status update at dataprotectionapplication_controller.go:152-155:

statusErr := r.Client.Status().Update(ctx, r.dpa)
if err == nil { // Don't mask previous error
    err = statusErr
}

When this fails due to a conflict, controller-runtime requeues with exponential backoff, but the Reconciled condition may be transiently False or stale.

The test at dpa_deployment_suite_test.go:166:

gomega.Consistently(dpaCR.IsReconciledTrue(), time.Minute*1, time.Second*15).Should(gomega.BeTrue())

This catches the transient False during one of its 15-second polls.

Possible Fixes

Operator side (preferred):

  • Re-fetch the DPA object before Status().Update() to reduce optimistic lock conflicts
  • Use client.MergeFrom patch instead of full status update to avoid conflicts
  • Add retry logic around the status update

Test side (workaround):

  • Replace Consistently with a pattern that tolerates brief transient states during spec transitions
  • Add a short Eventually(IsReconciledTrue()) after CreateOrUpdate before the Consistently check (already done at line 164, but the 2min timeout may not be enough for conflict resolution)

Note

Responses generated with Claude

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions