Skip to content

fix(security_group): treat omitted GroupID in self-referencing rules as equal to self-id#325

Open
kbujanecki-dt wants to merge 2 commits into
aws-controllers-k8s:mainfrom
kbujanecki-dt:fix/security-group-self-ref-diff
Open

fix(security_group): treat omitted GroupID in self-referencing rules as equal to self-id#325
kbujanecki-dt wants to merge 2 commits into
aws-controllers-k8s:mainfrom
kbujanecki-dt:fix/security-group-self-ref-diff

Conversation

@kbujanecki-dt
Copy link
Copy Markdown

Issue

Fixes aws-controllers-k8s/community#2822

Description

SecurityGroup rules whose userIDGroupPairs[*].groupID is omitted — the canonical way to express a self-reference, since the SG's own ID is unknown until AWS assigns it — are recreated on every reconciliation cycle.

Root cause is in compareIPPermission (pkg/resource/security_group/sdk.go), which compares UserIDGroupPairs with equality.Semantic.Equalities.DeepEqual. On DescribeSecurityGroups read-back AWS auto-fills GroupId, UserId, and (optionally) GroupName for self-referencing pairs, while the desired spec retains them as nil. The two sides therefore never compare equal → permanent diff → RevokeSecurityGroupIngress + AuthorizeSecurityGroupIngress on every reconcile.

This is the bug reported in #2822, with the additional operational severity that a controller crash between the Revoke and Authorize leaves the rule missing — degrading any workload that depends on the SG (CoreDNS node-to-node traffic in the reporter's reproducer).

Fix

Add a small helper normalizeSelfRefRules(r *resource) *resource in pkg/resource/security_group/hooks.go and call it on both desired and latest at the top of syncSGRules.

Behaviour:

  • Returns a deep copy. The caller's desired / latest are never mutated, so downstream code (status patching, late-initialisation, requeue) cannot accidentally lose user-set GroupID/UserID values.
  • Identifies a pair as a self-reference when GroupID is nil or equals r.ko.Status.ID.
  • Clears the server-fillable fields (GroupID, UserID, GroupName) for self-ref pairs only. Description, ports, protocol, IP/IPv6 ranges, prefix lists, cross-SG pairs, and IP-CIDR-only rules are untouched.

This makes the two sides semantically equivalent for the diff check while leaving the actual AWS API calls untouched: the existing outbound path (createSecurityGroupRules further down hooks.go) already canonicalises nil → self-id before AuthorizeSecurityGroup{Ingress,Egress}. The change is symmetric — the inbound side now matches the outbound convention.

The fix lives in the manually-maintained hooks.go rather than in the generated compareIPPermission in sdk.go, so it survives code-generator regenerations.

Scope and follow-up

  • Only self-references are addressed. Cross-SG / cross-account pairs that omit server-filled fields (e.g. UserID, VPCID) may still exhibit perpetual diffs; out of scope here.
  • The runtime-level Delta computed above customUpdateSecurityGroup still observes the raw (unfilled vs filled) Spec.IngressRules each cycle and dispatches into syncSGRules. That dispatch is now harmless: the normalised diff produces empty toAdd/toDelete sets and no AWS API call is made. Eliminating that cosmetic log noise requires a normalisation hook above the generated update path — left for a follow-up.

Verification

Unit tests

pkg/resource/security_group/hooks_test.go:

TestNormalizeSelfRefRules (10 subtests):

  • nil resource returns nil
  • nil status ID returns a copy untouched
  • does not mutate the input resource (deep-copy guarantee)
  • strips server-fillable fields on self-ref ingress pair
  • strips server-fillable fields on self-ref egress pair
  • preserves cross-SG pair untouched
  • nil GroupID counts as self-ref and clears UserID/GroupName
  • mixed pair: only self stripped, cross-SG kept
  • handles nil rule and nil pair entries
  • preserves description on self-ref pair

TestContainsRule_SelfRef_AfterNormalisation — locks the fix in at the comparison layer:

  1. Asserts the pre-condition (containsRule returns false for canonical desired vs AWS-filled latest without normalisation — the bug exists),
  2. Asserts the fix (containsRule returns true after normalizeSelfRefRules on both sides),
  3. Asserts symmetry (both desired→latest and latest→desired directions match, so neither branch of syncSGRules schedules churn).
$ go test -v -run 'TestNormalizeSelfRefRules|TestContainsRule_SelfRef_AfterNormalisation' \
    ./pkg/resource/security_group/
PASS
ok  github.com/aws-controllers-k8s/ec2-controller/pkg/resource/security_group  0.589s

Full package test suite (go test ./pkg/resource/security_group/...) and module-wide go build ./... both pass.

End-to-end against real AWS

Built the controller from this branch into a local image, deployed to k3d v1.34, applied the exact SecurityGroup CR pattern from #2822 (self-referencing ingress on TCP/53, groupID omitted, only description set) against a sandbox AWS account in eu-central-1. Observed across multiple reconcile cycles (requeue period 30s):

Check Pre-fix Post-fix (this branch)
Ready=True time <1s <1s
securityGroupRuleID in .status.rules changes every cycle stable
AuthorizeSecurityGroupIngress API calls (post-create) every cycle 0
RevokeSecurityGroupIngress API calls every cycle 0

Pre-fix log excerpt (rule ID changing between cycles 30s apart):

11:40:47  securityGroupRuleID: sgr-0261c49254bedc422
11:43:21  securityGroupRuleID: sgr-020e2ebec1bc13397

Post-fix log excerpt (same workload, multi-cycle window):

13:23:??  sgr-0d71004b23dcd7309
13:25:??  sgr-0d71004b23dcd7309   ← unchanged
$ kubectl logs ... | grep -E 'AuthorizeSecurityGroup|RevokeSecurityGroup'
(no output)

Backwards compatibility

None. The helper operates on a deep copy used only for diff comparison — it does not alter what is sent to AWS. Non-self-referencing rules (cross-SG, IP/IPv6 CIDR, prefix-list) are untouched. Behaviour for existing CRs that already specified groupID explicitly to a different SG is unchanged.

Files changed

  • pkg/resource/security_group/hooks.go — adds normalizeSelfRefRules and rebinds desired/latest to its result at the top of syncSGRules.
  • pkg/resource/security_group/hooks_test.go — new file with TestNormalizeSelfRefRules (10 subtests) and TestContainsRule_SelfRef_AfterNormalisation.

…efore diff

AWS auto-fills GroupId (and UserId, optionally GroupName) on
DescribeSecurityGroups read-back when a rule references the SG itself.
Specs typically omit these for self-references — the SG's own ID is
unknown at apply time — so compareIPPermission's DeepEqual on
UserIDGroupPairs sees a perpetual diff and syncSGRules schedules a
Revoke + Authorize on every reconcile. A crash between the two leaves
the rule missing (CoreDNS outage in the reporter's case).

Mirror the outbound-path canonicalisation symmetrically: in syncSGRules,
deep-copy desired and latest, and for any UserIDGroupPair identified as
a self-reference (GroupID nil or equal to self-id), clear the
server-fillable fields (GroupID, UserID, GroupName). Description is
preserved. Originals are never mutated.

Scope: cross-SG/cross-account pairs that omit server-filled fields are
not addressed. The runtime-level Delta above customUpdateSecurityGroup
still logs a Spec.IngressRules diff each cycle, but syncSGRules now
produces empty toAdd/toDelete sets so no AWS API call is made — pushing
canonicalisation up to the Delta layer is left for a follow-up.

Verified on a live k3d cluster: rule ID stable across reconciles and
zero Revoke/Authorize API calls observed over multiple cycles.

Tests:
 - TestNormalizeSelfRefRules: ingress/egress, deep-copy guarantee,
   nil-status no-op, cross-SG preservation, nil GroupID as self-ref,
   mixed pairs, nil rule/pair entries, description preservation.
 - TestContainsRule_SelfRef_AfterNormalisation: locks in the fix at
   the comparison layer, symmetric.

Fixes aws-controllers-k8s/community#2822
@ack-prow ack-prow Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 27, 2026
@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented Apr 27, 2026

Hi @kbujanecki-dt. Thanks for your PR.

I'm waiting for a aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ack-prow ack-prow Bot requested a review from gustavodiaz7722 April 27, 2026 08:53
@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented Apr 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kbujanecki-dt
Once this PR has been reviewed and has the lgtm label, please assign knottnt for approval by writing /assign @knottnt in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow Bot requested a review from sapphirew April 27, 2026 08:53
Copy link
Copy Markdown
Member

@michaelhtm michaelhtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kbujanecki-dt
I just have one comment
/test all

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also fix how we compare rules? We could have a custom compare, to avoid getting diffs

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelhtm Good call - done in the latest push.

@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented May 12, 2026

@kbujanecki-dt: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ec2-kind-e2e 8be9533 link true /test ec2-kind-e2e

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented May 14, 2026

@kbujanecki-dt: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ack-prow
Copy link
Copy Markdown

ack-prow Bot commented May 14, 2026

@kbujanecki-dt: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/test ec2-kind-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…are hook

Addresses review on aws-controllers-k8s#325: normalise self-referencing UserIDGroupPairs at
the runtime Delta layer instead of only inside syncSGRules, so
newResourceDelta no longer reports a spurious Spec.IngressRules /
Spec.EgressRules diff on every reconcile.

 - generator.yaml: register  -> customPreCompare.
 - delta.go: regenerated; newResourceDelta now calls customPreCompare.
 - hooks.go: add customPreCompare; convert normalizeSelfRefRules to
   in-place mutation (matches RouteTable/NetworkAcl/VPC convention;
   safe — Spec is never patched back, createSecurityGroupRules already
   auto-fills nil GroupID, desired is reread each cycle); drop the
   redundant normalisation in syncSGRules.
 - hooks_test.go: update subtests for in-place signature; add
   TestCustomPreCompare_{SelfRef_SuppressesDelta,RealDiff_StillFires,
   CrossSGRef_NotSuppressed,Mutates_a_and_b}.
@kbujanecki-dt kbujanecki-dt requested a review from michaelhtm May 19, 2026 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ec2-controller] SecurityGroup rules with self-reference recreated with each reconciliation cycle

2 participants