fix(): vpn key rotation race condition #458

Open
gourishkb wants to merge 4 commits into master from hotfix-vpn-key-rotation

@gourishkb
Contributor

Description

Fixes #

How Has This Been Tested?

Checklist:

  • The title of the PR states what changed and the related issue number (used for the release note).
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have run go fmt
  • I have updated the helm chart as required by this PR.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have tested it for all user roles.
  • I have added all the required unit test cases.
  • I have verified the E2E test cases with new code changes.
  • I have added all the required E2E test cases.

Does this PR introduce a breaking change?


rajendra-avesha and others added 3 commits January 22, 2026 11:11
…tion (#453)

* feat(): add custom ns labels/annotations to appns

The user-defined CR labels/annotations config is fetched from the cluster CR,
and the configmap "namespace-config-labels" is created from it.
This configmap is used to apply labels/annotations to any slice appns.

When a slice appns is unbound, the custom labels/annotations
are also removed.

vendor/ changes will be updated with the apis repo tag once the changes are
approved.

Signed-off-by: gourishkb <104021126+gourishkb@users.noreply.github.com>

* Revert "feat(): add custom ns labels/annotations to appns"

This reverts commit 254111a.

* fix(): Add validation and retry logic for VPN key rotation race condition

This fix addresses a race condition during concurrent VPN key rotation and
gateway certificate recycling operations that caused ~129 errors per rotation.

Root Cause:
- VPN key rotation triggered FSM before gateway pods finished reloading certificates
- GetPeerGwPodName() was called when pod status was incomplete
- Result: gRPC marshaling errors, tunnel failures, and connection context issues

Changes:
1. Enhanced GetPeerGwPodName() with detailed error messages
2. Added ValidateGatewayPodReadiness() to check pod readiness before FSM trigger
   - Validates pod exists in gateway status
   - Ensures tunnel is UP and TUN interface is configured
   - Verifies peer pod information is available
3. Modified reconciler to validate all pods before triggering FSM
4. Added retry logic with appropriate delays for transient errors

Fixes all 5 error types:
- gRPC Marshal nil errors (~35 per rotation)
- TUN interface not found errors (~50 per rotation)
- Tunnel not up errors (~20 per rotation)
- Connection context failures (~15 per rotation)
- RouteAdd file exists errors (~9 per rotation)

Testing:
- Added 7 unit tests for ValidateGatewayPodReadiness()
- All tests passing

---------

Signed-off-by: gourishkb <104021126+gourishkb@users.noreply.github.com>
Co-authored-by: gourishkb <104021126+gourishkb@users.noreply.github.com>
Co-authored-by: Rajendra <rajendra@Rajendras-MacBook-Pro.local>
Signed-off-by: gourishkb <gourish@aveshasystems.com>
latency/txrate status changes should not trigger reconciliation

Signed-off-by: gourishkb <gourish@aveshasystems.com>
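
One way to keep latency/txrate updates from triggering reconciliation is to compare the old and new status with those fields masked out. The sketch below shows that comparison only. TunnelStatus, its field names, and ShouldReconcile are hypothetical and are not taken from this PR. Assuming the operator is built on controller-runtime, a function like this could be called from an update predicate.

```go
package main

// TunnelStatus is a hypothetical, simplified tunnel status. Latency and
// TxRate stand in for the frequently changing metric fields.
type TunnelStatus struct {
	IntfName string
	LocalIP  string
	RemoteIP string
	Status   string
	Latency  uint64
	TxRate   uint64
}

// ShouldReconcile reports whether an update changes anything other than the
// metric fields. The arguments are passed by value, so zeroing the metrics
// here does not modify the caller's copies.
func ShouldReconcile(oldS, newS TunnelStatus) bool {
	oldS.Latency, newS.Latency = 0, 0
	oldS.TxRate, newS.TxRate = 0, 0
	return oldS != newS
}
```

With this shape, a change to Status or to an IP address still triggers reconciliation, while an update that differs only in Latency or TxRate is ignored.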
gourishkb enabled auto-merge (squash) February 17, 2026 07:46
Signed-off-by: gourishkb <gourish@aveshasystems.com>

3 participants