fix(security_group): treat omitted GroupID in self-referencing rules as equal to self-id#325
Conversation
…efore diff AWS auto-fills GroupId (and UserId, optionally GroupName) on DescribeSecurityGroups read-back when a rule references the SG itself. Specs typically omit these for self-references — the SG's own ID is unknown at apply time — so compareIPPermission's DeepEqual on UserIDGroupPairs sees a perpetual diff and syncSGRules schedules a Revoke + Authorize on every reconcile. A crash between the two leaves the rule missing (CoreDNS outage in the reporter's case). Mirror the outbound-path canonicalisation symmetrically: in syncSGRules, deep-copy desired and latest, and for any UserIDGroupPair identified as a self-reference (GroupID nil or equal to self-id), clear the server-fillable fields (GroupID, UserID, GroupName). Description is preserved. Originals are never mutated. Scope: cross-SG/cross-account pairs that omit server-filled fields are not addressed. The runtime-level Delta above customUpdateSecurityGroup still logs a Spec.IngressRules diff each cycle, but syncSGRules now produces empty toAdd/toDelete sets so no AWS API call is made — pushing canonicalisation up to the Delta layer is left for a follow-up. Verified on a live k3d cluster: rule ID stable across reconciles and zero Revoke/Authorize API calls observed over multiple cycles. Tests: - TestNormalizeSelfRefRules: ingress/egress, deep-copy guarantee, nil-status no-op, cross-SG preservation, nil GroupID as self-ref, mixed pairs, nil rule/pair entries, description preservation. - TestContainsRule_SelfRef_AfterNormalisation: locks in the fix at the comparison layer, symmetric. Fixes aws-controllers-k8s/community#2822
|
Hi @kbujanecki-dt. Thanks for your PR. I'm waiting for a aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kbujanecki-dt The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
michaelhtm
left a comment
There was a problem hiding this comment.
Thanks @kbujanecki-dt
I just have one comment
/test all
There was a problem hiding this comment.
Can we also fix how we compare rules? We could have a custom compare, to avoid getting diffs
|
@kbujanecki-dt: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@kbujanecki-dt: Cannot trigger testing until a trusted user reviews the PR and leaves an DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@kbujanecki-dt: Cannot trigger testing until a trusted user reviews the PR and leaves an DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
…are hook Addresses review on aws-controllers-k8s#325: normalise self-referencing UserIDGroupPairs at the runtime Delta layer instead of only inside syncSGRules, so newResourceDelta no longer reports a spurious Spec.IngressRules / Spec.EgressRules diff on every reconcile. - generator.yaml: register -> customPreCompare. - delta.go: regenerated; newResourceDelta now calls customPreCompare. - hooks.go: add customPreCompare; convert normalizeSelfRefRules to in-place mutation (matches RouteTable/NetworkAcl/VPC convention; safe — Spec is never patched back, createSecurityGroupRules already auto-fills nil GroupID, desired is reread each cycle); drop the redundant normalisation in syncSGRules. - hooks_test.go: update subtests for in-place signature; add TestCustomPreCompare_{SelfRef_SuppressesDelta,RealDiff_StillFires, CrossSGRef_NotSuppressed,Mutates_a_and_b}.
Issue
Fixes aws-controllers-k8s/community#2822
Description
SecurityGrouprules whoseuserIDGroupPairs[*].groupIDis omitted — the canonical way to express a self-reference, since the SG's own ID is unknown until AWS assigns it — are recreated on every reconciliation cycle.Root cause is in
compareIPPermission(pkg/resource/security_group/sdk.go), which comparesUserIDGroupPairswithequality.Semantic.Equalities.DeepEqual. OnDescribeSecurityGroupsread-back AWS auto-fillsGroupId,UserId, and (optionally)GroupNamefor self-referencing pairs, while the desired spec retains them asnil. The two sides therefore never compare equal → permanent diff →RevokeSecurityGroupIngress+AuthorizeSecurityGroupIngresson every reconcile.This is the bug reported in #2822, with the additional operational severity that a controller crash between the Revoke and Authorize leaves the rule missing — degrading any workload that depends on the SG (CoreDNS node-to-node traffic in the reporter's reproducer).
Fix
Add a small helper
normalizeSelfRefRules(r *resource) *resourceinpkg/resource/security_group/hooks.goand call it on bothdesiredandlatestat the top ofsyncSGRules.Behaviour:
desired/latestare never mutated, so downstream code (status patching, late-initialisation, requeue) cannot accidentally lose user-setGroupID/UserIDvalues.GroupIDisnilor equalsr.ko.Status.ID.GroupID,UserID,GroupName) for self-ref pairs only.Description, ports, protocol, IP/IPv6 ranges, prefix lists, cross-SG pairs, and IP-CIDR-only rules are untouched.This makes the two sides semantically equivalent for the diff check while leaving the actual AWS API calls untouched: the existing outbound path (
createSecurityGroupRulesfurther downhooks.go) already canonicalisesnil→ self-id beforeAuthorizeSecurityGroup{Ingress,Egress}. The change is symmetric — the inbound side now matches the outbound convention.The fix lives in the manually-maintained
hooks.gorather than in the generatedcompareIPPermissioninsdk.go, so it survives code-generator regenerations.Scope and follow-up
UserID,VPCID) may still exhibit perpetual diffs; out of scope here.Deltacomputed abovecustomUpdateSecurityGroupstill observes the raw (unfilled vs filled)Spec.IngressRuleseach cycle and dispatches intosyncSGRules. That dispatch is now harmless: the normalised diff produces emptytoAdd/toDeletesets and no AWS API call is made. Eliminating that cosmetic log noise requires a normalisation hook above the generated update path — left for a follow-up.Verification
Unit tests
pkg/resource/security_group/hooks_test.go:TestNormalizeSelfRefRules(10 subtests):nil resource returns nilnil status ID returns a copy untoucheddoes not mutate the input resource (deep-copy guarantee)strips server-fillable fields on self-ref ingress pairstrips server-fillable fields on self-ref egress pairpreserves cross-SG pair untouchednil GroupID counts as self-ref and clears UserID/GroupNamemixed pair: only self stripped, cross-SG kepthandles nil rule and nil pair entriespreserves description on self-ref pairTestContainsRule_SelfRef_AfterNormalisation— locks the fix in at the comparison layer:containsRulereturnsfalsefor canonical desired vs AWS-filled latest without normalisation — the bug exists),containsRulereturnstrueafternormalizeSelfRefRuleson both sides),desired→latestandlatest→desireddirections match, so neither branch ofsyncSGRulesschedules churn).Full package test suite (
go test ./pkg/resource/security_group/...) and module-widego build ./...both pass.End-to-end against real AWS
Built the controller from this branch into a local image, deployed to k3d v1.34, applied the exact
SecurityGroupCR pattern from #2822 (self-referencing ingress on TCP/53,groupIDomitted, onlydescriptionset) against a sandbox AWS account ineu-central-1. Observed across multiple reconcile cycles (requeue period 30s):Ready=TruetimesecurityGroupRuleIDin.status.rulesAuthorizeSecurityGroupIngressAPI calls (post-create)RevokeSecurityGroupIngressAPI callsPre-fix log excerpt (rule ID changing between cycles 30s apart):
Post-fix log excerpt (same workload, multi-cycle window):
Backwards compatibility
None. The helper operates on a deep copy used only for diff comparison — it does not alter what is sent to AWS. Non-self-referencing rules (cross-SG, IP/IPv6 CIDR, prefix-list) are untouched. Behaviour for existing CRs that already specified
groupIDexplicitly to a different SG is unchanged.Files changed
pkg/resource/security_group/hooks.go— addsnormalizeSelfRefRulesand rebindsdesired/latestto its result at the top ofsyncSGRules.pkg/resource/security_group/hooks_test.go— new file withTestNormalizeSelfRefRules(10 subtests) andTestContainsRule_SelfRef_AfterNormalisation.