Skip to content

feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement#469

Open
IngmarVG-IB wants to merge 6 commits into
NVIDIA:mainfrom
IngmarVG-IB:feat/dns-aid-discovery
Open

feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement#469
IngmarVG-IB wants to merge 6 commits into
NVIDIA:mainfrom
IngmarVG-IB:feat/dns-aid-discovery

Conversation

@IngmarVG-IB
Copy link
Copy Markdown

@IngmarVG-IB IngmarVG-IB commented Mar 27, 2026

Summary

  • Adds DNS-AID agent discovery via SVCB records (RFC 9460) with private-use SvcParamKeys per draft-mozleywilliams-dnsop-dnsaid-01
  • Adds OpenShell caller-side policy enforcement (Layer 1) consuming dns-aid-core's PolicyDocument schema
  • Integrates both into the /v1/agents endpoint with feature gating

Story: "OpenShell secures the agent boundary; DNS-AID resolves what's inside it."

Motivation / Context

AICR agents in Kubernetes need a standard discovery layer to find peers and a security boundary to control which connections are permitted. DNS-AID provides discovery via DNS SVCB lookups; OpenShell evaluates target agents' published policy documents before allowing the calling agent to connect.

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • API server (/v1/agents endpoint)
  • Core libraries (pkg/discovery, pkg/openshell, pkg/errors, pkg/defaults)

Implementation Notes

DNS-AID Discovery (pkg/discovery/)

  • Discoverer: DNS SVCB queries for _{name}._{protocol}._agents.{domain}
  • Publisher: K8s ConfigMap-based agent registration with create-or-update semantics
  • TXT-based index at _agents.{domain} for listing all agents
  • Server lifecycle hooks (OnStart/OnShutdown) for auto-publish/deregister
  • Feature-gated via AICR_DISCOVERY_ENABLED=true

OpenShell Policy Enforcement (pkg/openshell/)

  • All 16 native policy rules matching dns-aid-core's schema
  • Three enforcement modes: strict (deny on violation), permissive (log + allow, default), disabled
  • Fail-open on policy fetch errors (matches dns-aid-core behavior)
  • SSRF-protected fetcher: HTTPS-only, rejects private/loopback IPs
  • TTL-based cache (5m default) with bounded eviction (max 1024 entries) and singleflight coalescing
  • Enforcement layer filtering — Layer 1 only evaluates applicable rules
  • Realm isolation detection with cross-realm access logging
  • CEL custom rules deferred to future phase (logged as unsupported)
  • Controlled via OPENSHELL_MODE env var with startup validation

Integration

  • /v1/agents evaluates OpenShell policy for each discovered agent with a policy URI
  • Guard accepts nil — zero behavioral change when discovery is disabled
  • New error codes: POLICY_DENIED, POLICY_FETCH
  • New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB)

Key Design Decisions

  • Fail-open: If a policy document can't be fetched, the connection is allowed. This matches dns-aid-core and is correct for a discovery system where policy is advisory at Layer 1.
  • No Python dependency: OpenShell is a pure Go reimplementation consuming the same JSON schema as dns-aid-core's Python SDK.
  • Numeric TLS version comparison: Parses major.minor as integers (not lexicographic) to correctly handle versions like 1.12.

Testing

make test  # all tests pass with -race
  • pkg/openshell: 84.2% coverage (30 tests)
    • Evaluator: 15 table-driven tests covering all 16 rules, layer filtering, multiple violations
    • Fetcher: SSRF validation, caching, content-type/size rejection, non-200 handling
    • Guard: all 3 modes, fail-open, compliant caller, mode validation
    • Availability: midnight wrap, timezone handling, malformed input fail-open
    • TLS version: numeric comparison including double-digit minor versions
  • pkg/discovery: Mock DNS server tests for SVCB parsing, index listing, publisher
  • pkg/api: Handler tests with nil guard, domain validation, empty index

Risk Assessment

Medium — New packages with no existing consumers beyond the /v1/agents endpoint. Feature-gated behind AICR_DISCOVERY_ENABLED and OPENSHELL_MODE env vars, so zero impact on existing functionality when disabled. Default mode is permissive (log-only), providing a safe rollout path.

Checklist

  • Tests pass locally with -race
  • Linter passes (go vet clean, gofmt clean)
  • No tests skipped or disabled
  • New tests added for new functionality
  • Changes follow existing patterns (functional options, pkg/errors, slog, pkg/defaults)
  • Commits are cryptographically signed (git commit -S)

@IngmarVG-IB IngmarVG-IB requested a review from a team as a code owner March 27, 2026 23:10
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Mar 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown
Contributor

Welcome to AICR, @IngmarVG-IB! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@IngmarVG-IB IngmarVG-IB changed the title feat(discovery): add DNS-AID agent discovery and server lifecycle hooks feat(discovery+openshell): DNS-AID agent discovery with OpenShell policy enforcement Mar 28, 2026
@mchmarny mchmarny requested a review from lockwobr April 2, 2026 11:25
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR — clean separation between discovery, policy, and API layers. The code follows project conventions (functional options, pkg/errors, pkg/defaults, slog, table-driven tests, create-or-update K8s semantics). Feature gating via env vars is a good rollout strategy.

Key issues to address:

  1. SSRF bypass via hostname (fetcher.go) — The private IP check only catches raw IP literals in URLs. A hostname resolving to 127.0.0.1 or a link-local address passes validation. Since policy URIs come from DNS records (untrusted input), this needs a DialContext-level check or at minimum a documented limitation.

  2. Silent data loss on context cancellation (resolver.go)DiscoverAll swallows all errors including context cancellation, returning an empty list with nil error. Callers can't distinguish "no agents" from "timed out mid-discovery."

  3. Confusing dual-return on fail-open (guard.go)Check returns (allowed=true, err!=nil) on fetch failure. This contract is easy to misuse.

  4. bapVersions v-prefix handling (types.go) — A v-prefixed version like v2.1.0 produces mcp/v2 instead of mcp/2.

Minor/cleanup:

  • Dead code: g.Wait() error check in DiscoverAll can never trigger
  • Unnecessary loop variable capture (Go 1.22+ scoping)
  • Duplicated startMockDNS test helper across packages
  • PORT env var parsed in two places
  • .claude/agents/k8s-discovery.md may be unintentional

Overall the architecture is solid. The SSRF issue is the most important one to fix before merge.

Comment thread pkg/discovery/resolver.go
Comment thread pkg/discovery/resolver.go
Comment thread pkg/discovery/resolver.go
Comment thread pkg/openshell/guard.go
Comment thread pkg/openshell/guard.go
Comment thread pkg/discovery/publisher.go
Comment thread pkg/discovery/types.go Outdated
Comment thread pkg/openshell/fetcher.go
Comment thread pkg/api/agents_test.go
Comment thread .claude/agents/k8s-discovery.md Outdated
@github-actions
Copy link
Copy Markdown
Contributor

@IngmarVG-IB this PR now has merge conflicts with main. Please rebase to resolve them.

@IngmarVG-IB IngmarVG-IB force-pushed the feat/dns-aid-discovery branch from 3b66bf9 to ab2b0f4 Compare April 27, 2026 16:02
IngmarVG-IB added a commit to IngmarVG-IB/aicr that referenced this pull request Apr 27, 2026
Address reviewer (mchmarny) findings on PR NVIDIA#469:

Security:
- openshell/fetcher: close SSRF hostname bypass by adding a safeDialContext
  on the HTTP transport that rejects connections to private/loopback IPs
  even when the URL hostname resolves to one. Adds isBlockedIP helper and
  a regression test that fails closed against an httptest TLS server.

Correctness:
- discovery/resolver: propagate context cancellation in DiscoverAll so
  callers can distinguish "no agents" from "timed out mid-discovery".
  Remove the dead g.Wait() error path now that goroutines genuinely
  return ctx.Err() on cancellation.
- openshell/guard: simplify the Check contract — fetch failures fail
  open with (Allowed=true, nil error) rather than the dual-return that
  was easy to misuse. Return an error only on programmer errors (nil
  record). Update agents.go to match.
- openshell/guard: remove panic from NewGuard; document that callers
  must validate via ValidMode before construction (server.go already
  does this).
- discovery/types: strip leading semver "v" prefix in bapVersions so
  "v2.1.0" produces "mcp/2" rather than "mcp/v2".

Cleanup:
- discovery/resolver: remove redundant `entry := entry` loop variable
  capture (Go 1.22+ scoping makes it unnecessary).
- openshell/evaluator: restructure rate_limits to a clearer
  switch (violation vs. warning) so future map changes do the right thing.
- server/config + api/server: extract a single ResolvePort helper so
  agent registration and the HTTP server agree on the bound port.
- internal/testdns: extract the duplicated startMockDNS helper into a
  shared internal package; pkg/discovery and pkg/api now both import it.
- openshell/fetcher: document the singleflight + context coalescing
  trade-off (3s timeout makes shared cancellation acceptable).
- Remove .claude/agents/k8s-discovery.md (local Claude Code config,
  unintentionally committed).

Tests:
- New: TestFetcherSSRFDialBlocksPrivateIP, TestIsBlockedIP,
  TestGuardNilRecordReturnsError, expanded TestBapVersions cases.
- Updated: TestGuardFailOpenOnFetchError now asserts (Allowed=true, nil)
  rather than the old dual-return; TestGuardInvalidModePanics removed
  (no longer panics).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@IngmarVG-IB has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 57 minutes and 13 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 43156e03-040b-4585-b003-5a5435473663

📥 Commits

Reviewing files that changed from the base of the PR and between 02f7453 and 6f2d22f.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (31)
  • go.mod
  • internal/testdns/testdns.go
  • pkg/api/agents.go
  • pkg/api/agents_test.go
  • pkg/api/server.go
  • pkg/defaults/timeouts.go
  • pkg/discovery/doc.go
  • pkg/discovery/options.go
  • pkg/discovery/publisher.go
  • pkg/discovery/publisher_test.go
  • pkg/discovery/resolver.go
  • pkg/discovery/resolver_test.go
  • pkg/discovery/svcb.go
  • pkg/discovery/svcb_test.go
  • pkg/discovery/testhelper_test.go
  • pkg/discovery/types.go
  • pkg/errors/errors.go
  • pkg/errors/exitcode.go
  • pkg/openshell/doc.go
  • pkg/openshell/evaluator.go
  • pkg/openshell/evaluator_test.go
  • pkg/openshell/fetcher.go
  • pkg/openshell/fetcher_test.go
  • pkg/openshell/guard.go
  • pkg/openshell/guard_test.go
  • pkg/openshell/options.go
  • pkg/openshell/types.go
  • pkg/server/config.go
  • pkg/server/errors.go
  • pkg/server/lifecycle_test.go
  • pkg/server/server.go
📝 Walkthrough

<review_stack_artifact>

</review_stack_artifact>

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: DNS-AID agent discovery and OpenShell policy enforcement, which are the core features added throughout the changeset.
Description check ✅ Passed The description provides comprehensive context about the DNS-AID discovery implementation, OpenShell policy enforcement, integration approach, and testing, all directly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 24

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/server/server.go (1)

163-172: ⚠️ Potential issue | 🟡 Minor

Readiness flips to true before the listener is actually bound — small but real race.

s.httpServer.ListenAndServe() runs in a goroutine, and the listener may not yet be bound when s.setReady(true) runs on Line 172. In a Kubernetes readiness-probe scenario, /ready could return 200 while connections are still being refused, causing flapping or 502s during a rolling restart. Notably, the new lifecycle_test.go tests poll TCP with net.DialTimeout rather than checking readiness state, which suggests this gap is already known.

Bind the listener synchronously before going async, e.g.:

♻️ Proposed fix sketch
-	// Start server in goroutine
-	errChan := make(chan error, 1)
-	go func() {
-		if err := s.httpServer.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
-			errChan <- err
-		}
-	}()
-
-	// Mark ready after hooks and listener are up
-	s.setReady(true)
+	// Bind listener synchronously so readiness reflects actual accept state
+	ln, err := net.Listen("tcp", s.httpServer.Addr)
+	if err != nil {
+		return aicrerrors.Wrap(aicrerrors.ErrCodeInternal, "listen failed", err)
+	}
+
+	errChan := make(chan error, 1)
+	go func() {
+		if err := s.httpServer.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
+			errChan <- err
+		}
+	}()
+
+	// Mark ready after hooks succeeded and the listener is actually bound
+	s.setReady(true)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/server/server.go` around lines 163 - 172, The readiness is set before the
HTTP listener is bound; change startup to synchronously create/bind the listener
and only set readiness after bind succeeds, then serve in a goroutine: create a
net.Listener (or call s.httpServer.Listen on a previously created listener) and
check for bind errors before calling s.setReady(true), then start serving with
s.httpServer.Serve(listener) inside the existing goroutine and forward Serve
errors to errChan (preserve the existing ListenAndServe error handling logic but
use Serve to avoid the race). Ensure you reference and modify the existing
s.httpServer.ListenAndServe usage, errChan, and s.setReady calls so readiness
flips only after successful listener bind.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/api/agents_test.go`:
- Line 52: Replace the needless fmt.Sprintf call used for the Target field —
change Target: fmt.Sprintf("aicrd.default.svc.cluster.local.") to use the string
literal directly and remove the fmt import if that was its last usage; look for
the fmt.Sprintf usage in the test (Target field) and the fmt import at the top
of the file to update accordingly.

In `@pkg/api/agents.go`:
- Around line 93-128: The per-agent serial policy checks in the loop over
records (where resp := agentResponse{...} and guard.Check(ctx, rec) is called)
can block under fanout; refactor to perform guard.Check concurrently using
errgroup.WithContext(ctx) and a small concurrency semaphore (e.g., buffered
channel) to cap parallelism, collect results back into the corresponding
agentResponse (including building policyResultSummary from result.Violations)
and attach resp.PolicyResult; ensure you propagate context cancellation and
handle individual check errors by logging them (as currently done) without
failing the whole handler so the outer handler still respects
DiscoveryHandlerTimeout.

In `@pkg/api/server.go`:
- Line 109: The current check if os.Getenv("AICR_DISCOVERY_ENABLED") == "true"
rejects other common truthy values; replace this strict comparison with
strconv.ParseBool(os.Getenv("AICR_DISCOVERY_ENABLED")) and use the returned bool
to gate the discovery block (the same if that currently encloses discovery
setup). If ParseBool returns an error, log a warning (using the existing logger
used in this file) that the env value could not be parsed and treat it as false;
proceed only when the parsed bool is true. Ensure you import strconv and keep
the AICR_DISCOVERY_ENABLED env var name and the existing discovery setup code
intact.
- Around line 134-151: podNamespace currently returns the raw bytes from the
serviceaccount namespace file which can include trailing newlines and
whitespace; update podNamespace to trim whitespace from the file contents (use
strings.TrimSpace on the string(data)) before returning and ensure the "strings"
package is added to imports so discoveryDomain (which calls podNamespace)
receives a clean namespace string.

In `@pkg/discovery/publisher.go`:
- Around line 192-204: The create-then-update path in ensureAgentConfigMap
currently does an Update on IsAlreadyExists without re-fetching, which can race
and cause Conflict or clobbering; change the IsAlreadyExists branch so you first
Get the existing ConfigMap via
p.clientset.CoreV1().ConfigMaps(p.namespace).Get(ctx, cmName,
metav1.GetOptions{}), merge the labels/annotations onto that existing object,
copy its ResourceVersion into your update object, then call Update; wrap the
Update in a bounded retry loop that retries on k8serrors.IsConflict (similar to
the retry logic used by updateIndex) to handle concurrent modifications and
avoid lost updates.

In `@pkg/discovery/resolver_test.go`:
- Line 237: Replace the unnecessary call
fmt.Sprintf("aicrd.default.svc.cluster.local.") used as the Target value with
the plain string literal "aicrd.default.svc.cluster.local." (i.e., remove
fmt.Sprintf usage) and then remove the unused fmt import from
pkg/discovery/resolver_test.go so the test no longer imports fmt for no reason.

In `@pkg/discovery/resolver.go`:
- Around line 147-184: In ListIndex, add context cancellation checks inside the
outer loop over resp.Answer and the inner loop over txt.Txt so parsing can abort
promptly if ctx is cancelled: use a non-blocking select on ctx.Done() (or check
ctx.Err()) at the start of each loop iteration and return nil, ctx.Err() (or an
appropriate wrapped error) when cancelled; update handling around variables
resp, rr, txt, and s to bail out cleanly instead of continuing to parse.
- Around line 42-48: The doc comment on NewDiscoverer is misleading: it claims
the default DNS server is "the system resolver at 127.0.0.1:53" but the code
hard-codes dnsServer = "127.0.0.1:53"; either update the comment to say it
defaults to localhost (127.0.0.1:53) or change initialization to parse the
system resolver via miekg/dns.ClientConfigFromFile("/etc/resolv.conf") and set
Discoverer.dnsServer from the first nameserver with a fallback to
"127.0.0.1:53"; locate NewDiscoverer and the Discoverer.dnsServer field to apply
the fix and keep defaults.DiscoveryDNSTimeout and dns.Client usage unchanged.
- Around line 153-161: After the ExchangeContext call, add a defensive nil check
for resp before reading resp.Rcode: if resp is nil, return a wrapped
errors.ErrCodeUnavailable (with a message like "nil DNS response for <fqdn>") to
avoid a panic; apply the same nil-check pattern in querySVCB where
ExchangeContext is used so both places guard against nil responses from
d.dnsClient.ExchangeContext before accessing resp.Rcode.

In `@pkg/discovery/svcb.go`:
- Around line 41-60: ParseSVCBAnswer currently selects the first-seen dns.SVCB
with the lowest svcb.Priority and leaves ties non-deterministic; update
ParseSVCBAnswer to break ties when svcb.Priority == best.Priority by applying a
deterministic tiebreaker (e.g., compare svcb.TargetName lexicographically or
compare a canonicalized set of SvcParams) or implement RFC 9460 §2.4.3
randomized selection (seeded per-query) before calling parseSVCBRecord; ensure
the comparison logic references msg.Answer, dns.SVCB, svcb.Priority,
best.Priority and still skips alias mode (Priority == 0).
- Around line 97-98: The BAP parsing currently does params.BAP =
strings.Split(val, ",") which makes an empty string produce []string{""}; change
the logic in the SvcParamKeyBAP handling so that when val == "" (or
strings.TrimSpace(val) == "") you set params.BAP to nil or an empty slice,
otherwise split the value; update the SvcParamKeyBAP branch that assigns
params.BAP to perform this check so downstream checks like
AgentRegistration.bapVersions() using len(r.BAP) behave correctly.

In `@pkg/errors/errors.go`:
- Around line 41-44: The constant ErrCodePolicyDenied is defined but unused;
either remove ErrCodePolicyDenied from pkg/errors/errors.go or add a consumer
that returns/checks it (e.g., in the OpenShell policy evaluation path). If
removing, delete the ErrCodePolicyDenied line and keep ErrCodePolicyFetch which
is used by pkg/openshell/fetcher.go; if adding a consumer, update the OpenShell
policy evaluation or HTTP strict-mode handler to return an errors.Error with
Code set to ErrCodePolicyDenied when a policy evaluation explicitly denies a
request, and ensure callers check for ErrCodePolicyDenied where denial semantics
are handled.

In `@pkg/openshell/evaluator.go`:
- Around line 106-121: The current appliesToLayer("rate_limits", layer) branch
unconditionally appends a deny PolicyViolation for LayerTarget; change this so
we don't silently deny: either implement a real comparison between
rules.RateLimits and runtime counters on PolicyContext (e.g.,
requestsPerMinute/requestsPerHour fields) inside Evaluate (so you compare
rules.RateLimits thresholds to the PolicyContext counters and only append a
violation when exceeded), or if those counters aren't available yet, replace the
unconditional violation with an explicit “not implemented” outcome (e.g., append
a PolicyViolation or warning with Rule:"rate_limits", Detail:"rate limits
enforcement not implemented for LayerTarget" and/or log a clear TODO) so
Guard.Check/Evaluate won't silently deny once LayerTarget enforcement is wired.
- Around line 273-287: In parseTLSVersion, avoid shadowing predeclared
identifiers by assigning directly to the named return variables (major, minor,
ok) instead of creating local vars like maj/min; change "maj, err :=
strconv.Atoi(parts[0])" to "major, err = strconv.Atoi(parts[0])" and "min, err
:= strconv.Atoi(parts[1])" to "minor, err = strconv.Atoi(parts[1])", keep the
existing error checks, and return the named returns (major, minor, true) so no
local min/maj variables are introduced.

In `@pkg/openshell/fetcher_test.go`:
- Around line 151-173: The test currently fails at validateFetchURL because
srv.URL uses a raw IP; change the test so the request URL contains a hostname
(not an IP) that will resolve to a private/loopback address at dial time so
safeDialContext is exercised. Concretely: keep using NewFetcher() and the custom
transport from newSafeTransport(false), but set f.client.Transport.DialContext
to a custom DialContext that resolves the test hostname (e.g., "localtest.me" or
"example.test") to the httptest server's loopback address/port
(srv.Listener.Addr()) so DNS resolution happens at dial and safeDialContext is
invoked; ensure the URL passed to f.Fetch uses that hostname (not 127.0.0.1) so
validateFetchURL does not reject it up front. Reference functions/values:
TestFetcherSSRFDialBlocksPrivateIP, validateFetchURL, safeDialContext,
NewFetcher, newSafeTransport, f.client.Transport.DialContext, and srv.

In `@pkg/openshell/fetcher.go`:
- Around line 81-88: The http.Client created in the fetcher (where f.client is
set) follows redirects without re-validating the redirected URL, allowing TLS
downgrade or host changes; update the client initialization to set CheckRedirect
to a function that calls validateFetchURL (or at minimum enforces scheme ==
"https") for the redirect Location, and return an error to stop the redirect if
validation fails. Implement the CheckRedirect closure alongside
Timeout/Transport in the f.client initialization so redirects are rejected when
validateFetchURL or an https-only check fails; keep
safeDialContext/newSafeTransport as-is for post-resolution checks.
- Around line 99-117: The current t.DialContext checks IPs via
net.DefaultResolver.LookupIPAddr then calls dialer.DialContext(ctx, network,
addr) which re-resolves the hostname (TOCTOU); instead, after filtering ips with
isBlockedIP and allowPrivate, attempt to dial each validated IP literal directly
(use ipa.IP.String() + the original port via net.JoinHostPort) with the existing
dialer.DialContext to avoid a second DNS lookup; on first successful connection
return it, otherwise try the next IP and finally return the last error if all
fail; keep using the same ctx and network and preserve the original hostname for
TLS/SNI handling (do not alter request Host elsewhere).

In `@pkg/openshell/guard.go`:
- Around line 132-136: Guard.Check() is building a PolicyContext (pctx) with
only CallerID, CallerDomain and Protocol which means Layer 1 evaluation will see
zero-values for AuthType, HasMutualTLS, TLSVersion, DNSSECValidated,
CallerTrustScore, Method, Intent, GeoCountry and ConsentToken and thus
ModeStrict will unconditionally deny policies that require those fields; update
the evaluator in Guard.Check()/the policy evaluation path to treat
unavailable/zero-valued signals as "unknown" instead of outright
failing—specifically, detect when pctx fields are not populated and either skip
those specific checks (emit a warning/log when ModePermissive or when skipping
in ModeStrict) or convert them to non-failing tri-state checks so
required_auth_types, require_dnssec, require_mutual_tls, min_tls_version,
required_caller_trust_score, allowed_methods, allowed_intents, geo_restrictions
and consent_required do not cause automatic deny at Layer 1; include clear log
messages referencing PolicyContext and ModeStrict/ModePermissive to aid
debugging.

In `@pkg/openshell/types.go`:
- Around line 60-80: The struct comment for PolicyRules currently states "All
fields are pointers or slices to distinguish 'not set' from zero values" but
RequireDNSSEC, RequireMutualTLS, and ConsentRequired are plain bools; either
update the comment to call out these three fields as intentional plain booleans
(and briefly state that "not set" == false semantics) or change those three
fields (RequireDNSSEC, RequireMutualTLS, ConsentRequired) to *bool and add
`omitempty` to their json tags so consumers can distinguish unset vs false;
apply the chosen change consistently in pkg/openshell/types.go near the
PolicyRules type.

In `@pkg/server/config.go`:
- Around line 39-51: Replace the tolerant fmt.Sscanf parsing in ResolvePort with
strict numeric parsing: import strconv, use strconv.Atoi(portStr) to parse the
env var (handle the returned error and the parsed port bounds as before), and
keep logging the error and returning defaultPort on any parse failure; retain
fmt import only if other functions (e.g., parseConfig/SHUTDOWN_TIMEOUT_SECONDS)
still need it.

In `@pkg/server/lifecycle_test.go`:
- Around line 202-230: The test TestMultipleHooksExecuteInOrder currently
invokes hooks directly; change it to register two OnStart hooks that append to a
shared slice and then call s.Start(context.Background()) (using the same s
created with WithOnStart) to drive hook execution via Start(); after Start
returns assert the slice length and ordering (1 then 2) and fail the test if
Start returns an error. Also add a companion test that registers multiple
WithOnShutdown hooks, calls s.Shutdown(ctx) (or the appropriate shutdown entry
point) and asserts shutdown hooks run in registration order and that execution
stops on the first error where applicable. Ensure you reference s.config.OnStart
/ OnShutdown only to confirm registration counts if needed but rely on
Start()/Shutdown() for actual execution verification.
- Around line 27-36: The helper freePort currently ignores the error returned by
l.Close(); modify freePort to capture the Close() error from the net.Listener
variable l and fail the test if Close() returns an error (e.g., call err =
l.Close() and t.Fatalf/t.Errorf with the error). Update the freePort function so
the listener close is checked and reported, referencing the freePort function,
the listener variable l, and the l.Close() call.

In `@pkg/server/server.go`:
- Around line 156-161: The OnStart hook loop (iterating s.config.OnStart and
calling each hook(ctx)) can block indefinitely; wrap each hook invocation in a
bounded context using context.WithTimeout and cancel (e.g., derive ctx with a
startup timeout constant like defaults.StartupTimeout) and pass that derived
context into hook(ctxWithTimeout), then handle the case where the timeout causes
the hook to return a context.DeadlineExceeded error and wrap/return it as
before; additionally update the LifecycleHook documentation to state that hooks
must respect the provided context (or, if you prefer the alternate contract,
require each hook to set its own timeout and document that) so callers know the
enforced timeout policy.

---

Outside diff comments:
In `@pkg/server/server.go`:
- Around line 163-172: The readiness is set before the HTTP listener is bound;
change startup to synchronously create/bind the listener and only set readiness
after bind succeeds, then serve in a goroutine: create a net.Listener (or call
s.httpServer.Listen on a previously created listener) and check for bind errors
before calling s.setReady(true), then start serving with
s.httpServer.Serve(listener) inside the existing goroutine and forward Serve
errors to errChan (preserve the existing ListenAndServe error handling logic but
use Serve to avoid the race). Ensure you reference and modify the existing
s.httpServer.ListenAndServe usage, errChan, and s.setReady calls so readiness
flips only after successful listener bind.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 25f44d89-fc1e-4cb9-afe0-c7e85f6ea6d0

📥 Commits

Reviewing files that changed from the base of the PR and between 8d0168e and ab2b0f4.

⛔ Files ignored due to path filters (271)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/miekg/dns/.codecov.yml is excluded by !vendor/**
  • vendor/github.com/miekg/dns/.gitignore is excluded by !vendor/**
  • vendor/github.com/miekg/dns/AUTHORS is excluded by !vendor/**
  • vendor/github.com/miekg/dns/CODEOWNERS is excluded by !vendor/**
  • vendor/github.com/miekg/dns/CONTRIBUTORS is excluded by !vendor/**
  • vendor/github.com/miekg/dns/COPYRIGHT is excluded by !vendor/**
  • vendor/github.com/miekg/dns/LICENSE is excluded by !vendor/**
  • vendor/github.com/miekg/dns/Makefile.fuzz is excluded by !vendor/**
  • vendor/github.com/miekg/dns/Makefile.release is excluded by !vendor/**
  • vendor/github.com/miekg/dns/README.md is excluded by !vendor/**
  • vendor/github.com/miekg/dns/acceptfunc.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/client.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/clientconfig.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dane.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/defaults.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dns.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec_keygen.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec_keyscan.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec_privkey.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/doc.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/duplicate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/edns.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/format.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/fuzz.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/generate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/hash.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/labels.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/listen_no_socket_options.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/listen_socket_options.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/msg.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/msg_helpers.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/msg_truncate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/nsecx.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/privaterr.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/reverse.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/sanitize.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/scan.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/scan_rr.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/serve_mux.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/server.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/sig0.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/smimea.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/svcb.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/tlsa.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/tools.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/tsig.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/types.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/udp.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/udp_no_control.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/update.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/version.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/xfr.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/zduplicate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/zmsg.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/ztypes.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/doc.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/forward_requirements.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go.tmpl is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require_forward.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require_forward.go.tmpl is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/requirements.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/asm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/constants.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/doc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/instructions.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/setter.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/vm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/vm_instructions.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/iana/const.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_linux_32bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_linux_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_solaris_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/complete_dontwait.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/complete_nodontwait.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/empty.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/error_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/error_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_32bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_solaris_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/mmsghdr_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/mmsghdr_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_bsdvar.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_linux_32bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_linux_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_openbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_solaris_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/norace.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/race.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_mmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_msg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_nommsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_nomsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/socket.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_const_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_386.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_loong64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mips.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mips64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mipsle.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_ppc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_ppc64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_s390x.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_netbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_posix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_zos_s390x.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_aix_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_darwin_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_darwin_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_dragonfly_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_loong64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mips.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mips64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mipsle.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_ppc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_ppc64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_solaris_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/batch.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_pktinfo.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_zos.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/dgramopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/doc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/endpoint.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/genericopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/header.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/helper.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/iana.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/icmp.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/icmp_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/icmp_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/packet.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/payload.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/payload_cmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/payload_nocmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sockopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sockopt_posix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sockopt_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_aix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreq.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreq_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreqn.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreqn_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_bpf.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_bpf_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_darwin.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_dragonfly.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_freebsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_solaris.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_ssmreq.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_ssmreq_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_zos.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_aix_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_darwin.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_dragonfly.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_loong64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mips.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mips64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mipsle.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_ppc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_ppc64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_netbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_openbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_solaris.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/batch.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control_rfc2292_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control_rfc3542_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/dgramopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/doc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/endpoint.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/genericopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/header.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/helper.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/iana.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/icmp.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/icmp_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/icmp_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/icmp_solaris.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/icmp_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/icmp_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/icmp_zos.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/payload.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/payload_cmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/payload_nocmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sockopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sockopt_posix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sockopt_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sys_aix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sys_asmreq.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sys_asmreq_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sys_bpf.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/sys_bpf_stub.go is excluded by !vendor/**
📒 Files selected for processing (29)
  • go.mod
  • internal/testdns/testdns.go
  • pkg/api/agents.go
  • pkg/api/agents_test.go
  • pkg/api/server.go
  • pkg/defaults/timeouts.go
  • pkg/discovery/doc.go
  • pkg/discovery/options.go
  • pkg/discovery/publisher.go
  • pkg/discovery/publisher_test.go
  • pkg/discovery/resolver.go
  • pkg/discovery/resolver_test.go
  • pkg/discovery/svcb.go
  • pkg/discovery/svcb_test.go
  • pkg/discovery/testhelper_test.go
  • pkg/discovery/types.go
  • pkg/errors/errors.go
  • pkg/openshell/doc.go
  • pkg/openshell/evaluator.go
  • pkg/openshell/evaluator_test.go
  • pkg/openshell/fetcher.go
  • pkg/openshell/fetcher_test.go
  • pkg/openshell/guard.go
  • pkg/openshell/guard_test.go
  • pkg/openshell/options.go
  • pkg/openshell/types.go
  • pkg/server/config.go
  • pkg/server/lifecycle_test.go
  • pkg/server/server.go

Comment thread pkg/api/agents_test.go Outdated
Comment thread pkg/api/agents_test.go
Comment thread pkg/api/agents.go Outdated
Comment thread pkg/api/server.go Outdated
Comment thread pkg/api/server.go
Comment thread pkg/openshell/types.go
Comment thread pkg/server/config.go Outdated
Comment thread pkg/server/lifecycle_test.go
Comment thread pkg/server/lifecycle_test.go Outdated
Comment thread pkg/server/server.go Outdated
@IngmarVG-IB
Copy link
Copy Markdown
Author

@mchmarny thanks for the thorough review. All findings addressed in ab2b0f4e (and the branch is now rebased onto main):

Key issues

  1. SSRF bypass via hostname (pkg/openshell/fetcher.go) — Added safeDialContext on the HTTP transport that resolves the hostname and rejects connections to loopback/private/link-local/unspecified IPs post-resolution. The URL-level check is kept as a fast pre-flight, with a comment pointing to the dial-time check. New regression test TestFetcherSSRFDialBlocksPrivateIP exercises an httptest TLS server with the safe transport and confirms the dial fails. TestIsBlockedIP covers the IP classification helper directly.

  2. Silent context cancellation in DiscoverAll (pkg/discovery/resolver.go) — Goroutines now return gctx.Err() when the context is canceled, and DiscoverAll also re-checks ctx.Err() after g.Wait() to surface late cancellations. Returns a wrapped ErrCodeTimeout so callers can distinguish "no agents" from "interrupted before completion."

  3. Dual-return contract on fail-open (pkg/openshell/guard.go) — Check now returns (allowed=true, nil) on fetch failures; the fetch error is logged internally for observability. A non-nil error is returned only on programmer errors (e.g. nil record). Updated pkg/api/agents.go and TestGuardFailOpenOnFetchError to match. New TestGuardNilRecordReturnsError covers the new error path.

  4. bapVersions v-prefix (pkg/discovery/types.go) — Strips a leading v/V only when followed by a digit (semver-style), so v2.1.0 produces mcp/2. Non-semver strings like vendor1 are left untouched. Added test cases for v-prefix, V-prefix, simple v4, and the non-semver case.

Cleanup

  • Dead g.Wait() error check — Removed; the rebuilt loop now returns real errors on cancellation, so the error path is meaningful.
  • Loop variable capture — Removed entry := entry; relying on Go 1.22+ per-iteration scoping.
  • rate_limits two-branch logic — Restructured to a switch: appliesToLayer → violation, else LayerCaller → warning. Future map changes won't accidentally double-fire.
  • PORT parsed twice — Extracted server.ResolvePort(default int) int as the single source of truth; both pkg/server/config.go and pkg/api/server.go now use it. Adds basic port <= 0 || > 65535 validation that the original Sscanf lacked.
  • Duplicated startMockDNS — Extracted to internal/testdns.Start; both pkg/discovery and pkg/api now import the shared helper.
  • .claude/agents/k8s-discovery.md — Removed (was indeed an unintentional Claude Code config file).

Acknowledged but not changed

  • DNS regex underscores (pkg/api/agents.go:31) — Deliberate for K8s service domains (*.svc.cluster.local); kept as-is.
  • O(N) index rebuild on every publish (pkg/discovery/publisher.go:213) — Acceptable at current scale; will revisit when we exceed ~100 agents/namespace.
  • Singleflight-shared context (pkg/openshell/fetcher.go:103) — Added a comment explaining the trade-off; with the 3s PolicyFetchTimeout the shared cancellation is acceptable in practice.

All tests pass with -race, go vet and gofmt clean. Branch rebased onto current main.

@IngmarVG-IB
Copy link
Copy Markdown
Author

@CodeRabbit thanks for the deep dive. All 24 threads addressed in a864d661:

Security (must-fix)

  • TOCTOU SSRF (fetcher.go:117) — Closed by pinning the dial to the validated IP literal so the stdlib dialer can't re-resolve to a different (potentially private) address. Hostname verification still uses the original URL via TLS SNI.
  • HTTPS redirect downgrade (fetcher.go:88) — Added CheckRedirect that re-applies the HTTPS-only check and bounds redirect chains to 10. New test TestFetcherRejectsNonHTTPSRedirect.
  • fmt.Sscanf accepts trailing garbage (config.go:51) — Switched both ResolvePort and SHUTDOWN_TIMEOUT_SECONDS to strconv.Atoi.
  • SA-file whitespace (server.go:151) — strings.TrimSpace on both the env var and SA file content; empty post-trim falls through to "default".
  • BAP empty value (svcb.go:98) — Empty string yields nil instead of [""].
  • Resolver nil-check (resolver.go:161) — resp == nil guard added on both ListIndex and querySVCB.
  • SSRF test was a no-op (fetcher_test.go:173) — Replaced with a real test that uses the hostname localhost, so URL validation accepts it and the dial-time guard becomes the only gate.

Correctness

  • Sequential per-agent policy checks (agents.go:128) — Now fans out via errgroup.WithContext with a maxPolicyCheckConcurrency = 8 cap.
  • AICR_DISCOVERY_ENABLED parsing (server.go:109) — Uses strconv.ParseBool so 1, True, TRUE all work.
  • Publisher Create-then-Update race (publisher.go:204) — Now Create → on AlreadyExists Get/Update with bounded conflict retries; merges labels/data into the existing CM rather than blind-overwriting.
  • TXT parse loop ignored ctx (resolver.go:184) — Added ctx.Err() check in the index parse loop.
  • Lifecycle hooks unbounded (server.go:161) — Each hook now runs in context.WithTimeout(parent, defaults.ServerHookTimeout). Doc comment on LifecycleHook updated to make this contract explicit.
  • min shadow (evaluator.go:287) — Renamed return values to majorVer/minorVer.
  • rate_limits stub denial (evaluator.go:121) — Reworded to "not yet implemented at this layer" + TODO; added comment that LayerTarget enforcement requires per-caller counters not yet in PolicyContext. Today only LayerCaller is invoked, so this is dormant.
  • Layer 1 strict-mode unconditional denies (guard.go:136) — Documented the limitation in the Guard doc comment with three explicit recommendations (constrain caller-side rules, re-evaluate at Layer 2, or use ModePermissive at Layer 1).
  • TestDiscoveryDomain host-dependent (agents_test.go:189) — Extracted saNamespacePath as a package variable; test now redirects it for deterministic coverage of all three branches. Added TestPodNamespaceTrimsWhitespace.
  • TestMultipleHooksExecuteInOrder didn't drive Start() (lifecycle_test.go:230) — Rewrote to actually start the server. Added paired TestMultipleOnShutdownHooksExecuteInOrder.

Cleanup / nits

  • Unused ErrCodePolicyDenied (errors.go:44) — Removed.
  • PolicyRules doc comment vs struct (types.go:80) — Comment now explicitly notes the three boolean rules as a deliberate exception.
  • Equal-priority SVCB tie-break (svcb.go:60) — Added comment citing RFC 9460 §2.4.3 and the typical AICR deployment shape.
  • Default DNS server comment (resolver.go:48) — Reworded; 127.0.0.1:53 is the K8s pod resolver, not the host's.
  • S1039 fmt.Sprintf wrappers (agents_test.go:52, resolver_test.go:237) — Removed; dropped now-unused fmt imports.
  • freePort ignored Close() (lifecycle_test.go:36) — Now t.Fatal on close error.
  • Singleflight ctx caveat (fetcher.go:103) — Comment in earlier commit; preserved.

All 6 affected packages pass go test -race, go vet clean, gofmt clean.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.

@mchmarny
Copy link
Copy Markdown
Member

@IngmarVG-IB looks like most of the issues are now addressed. Only the readiness-bind race (outside-diff) is fully unaddressed; the Layer 1 strict-mode issue has been downgraded to documented limitation rather than fixed. So once you rebase (dirty state) we should be good to merge. Thanks for your work on this.

@IngmarVG-IB IngmarVG-IB force-pushed the feat/dns-aid-discovery branch from a864d66 to be4e034 Compare May 14, 2026 15:01
IngmarVG-IB added a commit to IngmarVG-IB/aicr that referenced this pull request May 14, 2026
Address reviewer (mchmarny) findings on PR NVIDIA#469:

Security:
- openshell/fetcher: close SSRF hostname bypass by adding a safeDialContext
  on the HTTP transport that rejects connections to private/loopback IPs
  even when the URL hostname resolves to one. Adds isBlockedIP helper and
  a regression test that fails closed against an httptest TLS server.

Correctness:
- discovery/resolver: propagate context cancellation in DiscoverAll so
  callers can distinguish "no agents" from "timed out mid-discovery".
  Remove the dead g.Wait() error path now that goroutines genuinely
  return ctx.Err() on cancellation.
- openshell/guard: simplify the Check contract — fetch failures fail
  open with (Allowed=true, nil error) rather than the dual-return that
  was easy to misuse. Return an error only on programmer errors (nil
  record). Update agents.go to match.
- openshell/guard: remove panic from NewGuard; document that callers
  must validate via ValidMode before construction (server.go already
  does this).
- discovery/types: strip leading semver "v" prefix in bapVersions so
  "v2.1.0" produces "mcp/2" rather than "mcp/v2".

Cleanup:
- discovery/resolver: remove redundant `entry := entry` loop variable
  capture (Go 1.22+ scoping makes it unnecessary).
- openshell/evaluator: restructure rate_limits to a clearer
  switch (violation vs. warning) so future map changes do the right thing.
- server/config + api/server: extract a single ResolvePort helper so
  agent registration and the HTTP server agree on the bound port.
- internal/testdns: extract the duplicated startMockDNS helper into a
  shared internal package; pkg/discovery and pkg/api now both import it.
- openshell/fetcher: document the singleflight + context coalescing
  trade-off (3s timeout makes shared cancellation acceptable).
- Remove .claude/agents/k8s-discovery.md (local Claude Code config,
  unintentionally committed).

Tests:
- New: TestFetcherSSRFDialBlocksPrivateIP, TestIsBlockedIP,
  TestGuardNilRecordReturnsError, expanded TestBapVersions cases.
- Updated: TestGuardFailOpenOnFetchError now asserts (Allowed=true, nil)
  rather than the old dual-return; TestGuardInvalidModePanics removed
  (no longer panics).
IngmarVG-IB added a commit to IngmarVG-IB/aicr that referenced this pull request May 14, 2026
Address the May 11 PR NVIDIA#469 follow-up review.

Readiness-bind race (the only fully-unaddressed item from round 2):

  pkg/server/server.go Start() previously kicked off ListenAndServe in
  a goroutine and immediately flipped readiness on the main goroutine,
  opening a window where /ready returned 200 while the kernel was still
  binding the socket. Probes during that window would 200 then get
  connection-refused on the next request.

  Bind the listener synchronously (net.ListenConfig.Listen — context-
  aware, satisfies noctx) BEFORE setReady(true), then start
  http.Server.Serve(ln) in a goroutine. If the bind fails the server
  never advertises ready. New regression test
  TestReadyImpliesListenerBound asserts that a dial against the port
  succeeds the moment ready flips to true.

golangci-lint failures cleared (the CI gate this PR currently fails on):

- exhaustive (3 sites): pkg/errors/exitcode.go, pkg/server/errors.go —
  ErrCodePolicyFetch is now covered in the exit-code, HTTP-status, and
  retryable switches. Treated as upstream-unavailable (ExitUnavailable
  / 503 / retryable=true) since policy fetch hits an external URI.

- gosec G115 (int→uint16): pkg/server/config.go ResolvePort now
  returns uint16 directly, matching net's port representation. Single
  cast still needed in parseConfig because config.Port is int for
  back-compat with existing tests that set cfg.Port = port; gated with
  a nolint comment since ResolvePort guarantees the range.
  pkg/api/server.go drops the int→uint16 cast at the registration site.

- gosec G704 (SSRF taint): pkg/openshell/fetcher.go Fetch annotated
  with nolint pointing at the three layers already mitigating the
  finding (URL pre-flight, post-resolution dial guard, redirect
  check) and the regression test that exercises the dial path. Also
  unblocks the CodeQL job which was firing on the same taint analysis.

- noctx: internal/testdns/testdns.go Start switches from net.ListenPacket
  to (*net.ListenConfig).ListenPacket with context.Background() per
  noctx lint requirement.

Layer 1 strict-mode policy fields remain a documented limitation in
Guard's doc comment (per the reviewer's "downgraded to documented
limitation rather than fixed" note). Tracked in a follow-up issue.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/api/agents.go`:
- Around line 40-41: The dnsNameRegex currently allows underscores in labels
(character class `[a-z0-9_]`) which is non‑standard; update the regex named
dnsNameRegex so label character classes use `[a-z0-9-]` (and ensure the
leading-character class likewise disallows `_`) to enforce RFC 1035 rules for
domain labels, or if underscores were intentionally permitted (e.g., for
Kubernetes SRV/prefixed names), add a concise inline comment above dnsNameRegex
explaining that deliberate deviation and why; then run/update any unit tests
that validate domain parsing to reflect the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f736c478-2c31-4e82-9f11-69a42bff6271

📥 Commits

Reviewing files that changed from the base of the PR and between ab2b0f4 and be4e034.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • go.mod
  • internal/testdns/testdns.go
  • pkg/api/agents.go
  • pkg/api/agents_test.go
  • pkg/api/server.go

Comment thread pkg/api/agents.go
Comment on lines +40 to +41
// Each label: 1-63 chars of [a-z0-9_-], total <= 253 chars.
var dnsNameRegex = regexp.MustCompile(`^[a-z0-9_][a-z0-9_\-]{0,62}(\.[a-z0-9_][a-z0-9_\-]{0,62})*\.?$`)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Clarify underscore allowance in DNS label validation.

The regex permits underscores ([a-z0-9_]) in DNS labels, which is non-standard per RFC 1035 (labels should be [a-z0-9-] only). Underscores are valid in SRV-style prefixed names (e.g., _svc._tcp) but not in general domain labels. Since this validates the user-supplied domain query parameter (a domain suffix, not an underscore-prefixed service name), the leniency may be unintentional. If this is deliberate for Kubernetes compatibility or another reason, a brief inline comment would clarify the intent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/api/agents.go` around lines 40 - 41, The dnsNameRegex currently allows
underscores in labels (character class `[a-z0-9_]`) which is non‑standard;
update the regex named dnsNameRegex so label character classes use `[a-z0-9-]`
(and ensure the leading-character class likewise disallows `_`) to enforce RFC
1035 rules for domain labels, or if underscores were intentionally permitted
(e.g., for Kubernetes SRV/prefixed names), add a concise inline comment above
dnsNameRegex explaining that deliberate deviation and why; then run/update any
unit tests that validate domain parsing to reflect the change.

@IngmarVG-IB
Copy link
Copy Markdown
Author

@mchmarny thanks — pushed round-2 fixes as be4e0342 on a freshly rebased branch:

Readiness-bind race (the only fully-unaddressed item)

pkg/server/server.go::Start now binds the listener synchronously via net.ListenConfig.Listen(ctx, ...) before flipping setReady(true), then runs http.Server.Serve(ln) in a goroutine. The previous code kicked off ListenAndServe in a goroutine and immediately set ready=true, opening the connection-refused window you flagged. New regression test TestReadyImpliesListenerBound asserts that a net.DialTimeout against the port succeeds the instant readiness flips.

Lint failures cleared (this PR's tests / Lint and CodeQL should now pass)

Linter Site Fix
exhaustive ×3 pkg/errors/exitcode.go, pkg/server/errors.go (×2) ErrCodePolicyFetch covered in exit-code / HTTP-status / retryable switches — treated as upstream-unavailable (ExitUnavailable / 503 / retryable=true)
gosec G115 pkg/api/server.go:208 server.ResolvePort now returns uint16 directly, matching net's port representation; cast at the registration site dropped
gosec G704 pkg/openshell/fetcher.go:221 //nolint:gosec with a 6-line block comment naming the three SSRF mitigations (URL pre-flight + post-resolution dial guard + redirect check) and the regression test that exercises them. This also unblocks the CodeQL job firing on the same taint analysis.
noctx internal/testdns/testdns.go:35 net.ListenPacket(*net.ListenConfig).ListenPacket(ctx, ...)

Layer 1 strict-mode — kept as a documented limitation in Guard's doc comment, per your note. Tracking the code fix in #885 so it doesn't get lost.

Verified locally

  • golangci-lint run on every changed package: 0 issues
  • make test: full suite passes, 75.1% coverage
  • Branch rebased onto current main, force-pushed

The pre-existing pkg/snapshotter/agent.go:187 G118 finding shows up on upstream/main too (the cancel is called from a deferred closure rather than defer cancelLogs(), which gosec doesn't trace); not part of this PR.

IngmarVG-IB added a commit to IngmarVG-IB/aicr that referenced this pull request May 14, 2026
Address reviewer (mchmarny) findings on PR NVIDIA#469:

Security:
- openshell/fetcher: close SSRF hostname bypass by adding a safeDialContext
  on the HTTP transport that rejects connections to private/loopback IPs
  even when the URL hostname resolves to one. Adds isBlockedIP helper and
  a regression test that fails closed against an httptest TLS server.

Correctness:
- discovery/resolver: propagate context cancellation in DiscoverAll so
  callers can distinguish "no agents" from "timed out mid-discovery".
  Remove the dead g.Wait() error path now that goroutines genuinely
  return ctx.Err() on cancellation.
- openshell/guard: simplify the Check contract — fetch failures fail
  open with (Allowed=true, nil error) rather than the dual-return that
  was easy to misuse. Return an error only on programmer errors (nil
  record). Update agents.go to match.
- openshell/guard: remove panic from NewGuard; document that callers
  must validate via ValidMode before construction (server.go already
  does this).
- discovery/types: strip leading semver "v" prefix in bapVersions so
  "v2.1.0" produces "mcp/2" rather than "mcp/v2".

Cleanup:
- discovery/resolver: remove redundant `entry := entry` loop variable
  capture (Go 1.22+ scoping makes it unnecessary).
- openshell/evaluator: restructure rate_limits to a clearer
  switch (violation vs. warning) so future map changes do the right thing.
- server/config + api/server: extract a single ResolvePort helper so
  agent registration and the HTTP server agree on the bound port.
- internal/testdns: extract the duplicated startMockDNS helper into a
  shared internal package; pkg/discovery and pkg/api now both import it.
- openshell/fetcher: document the singleflight + context coalescing
  trade-off (3s timeout makes shared cancellation acceptable).
- Remove .claude/agents/k8s-discovery.md (local Claude Code config,
  unintentionally committed).

Tests:
- New: TestFetcherSSRFDialBlocksPrivateIP, TestIsBlockedIP,
  TestGuardNilRecordReturnsError, expanded TestBapVersions cases.
- Updated: TestGuardFailOpenOnFetchError now asserts (Allowed=true, nil)
  rather than the old dual-return; TestGuardInvalidModePanics removed
  (no longer panics).
IngmarVG-IB added a commit to IngmarVG-IB/aicr that referenced this pull request May 14, 2026
Address the May 11 PR NVIDIA#469 follow-up review.

Readiness-bind race (the only fully-unaddressed item from round 2):

  pkg/server/server.go Start() previously kicked off ListenAndServe in
  a goroutine and immediately flipped readiness on the main goroutine,
  opening a window where /ready returned 200 while the kernel was still
  binding the socket. Probes during that window would 200 then get
  connection-refused on the next request.

  Bind the listener synchronously (net.ListenConfig.Listen — context-
  aware, satisfies noctx) BEFORE setReady(true), then start
  http.Server.Serve(ln) in a goroutine. If the bind fails the server
  never advertises ready. New regression test
  TestReadyImpliesListenerBound asserts that a dial against the port
  succeeds the moment ready flips to true.

golangci-lint failures cleared (the CI gate this PR currently fails on):

- exhaustive (3 sites): pkg/errors/exitcode.go, pkg/server/errors.go —
  ErrCodePolicyFetch is now covered in the exit-code, HTTP-status, and
  retryable switches. Treated as upstream-unavailable (ExitUnavailable
  / 503 / retryable=true) since policy fetch hits an external URI.

- gosec G115 (int→uint16): pkg/server/config.go ResolvePort now
  returns uint16 directly, matching net's port representation. Single
  cast still needed in parseConfig because config.Port is int for
  back-compat with existing tests that set cfg.Port = port; gated with
  a nolint comment since ResolvePort guarantees the range.
  pkg/api/server.go drops the int→uint16 cast at the registration site.

- gosec G704 (SSRF taint): pkg/openshell/fetcher.go Fetch annotated
  with nolint pointing at the three layers already mitigating the
  finding (URL pre-flight, post-resolution dial guard, redirect
  check) and the regression test that exercises the dial path. Also
  unblocks the CodeQL job which was firing on the same taint analysis.

- noctx: internal/testdns/testdns.go Start switches from net.ListenPacket
  to (*net.ListenConfig).ListenPacket with context.Background() per
  noctx lint requirement.

Layer 1 strict-mode policy fields remain a documented limitation in
Guard's doc comment (per the reviewer's "downgraded to documented
limitation rather than fixed" note). Tracked in a follow-up issue.
@IngmarVG-IB IngmarVG-IB force-pushed the feat/dns-aid-discovery branch from be4e034 to 02f7453 Compare May 14, 2026 16:59
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (2)
pkg/api/server.go (1)

219-219: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

/v1/agents is hard-wired to query 127.0.0.1:53.

setupDiscovery() builds the discoverer with no WithDNSServer override, so the handler always dials localhost DNS. In most Kubernetes pods /etc/resolv.conf points at the cluster DNS service IP, not a listener on 127.0.0.1, so discovery will fail as soon as this feature is enabled unless every pod also runs a local DNS forwarder.

Does github.com/miekg/dns Client.ExchangeContext use the explicit server address passed to it (for example "127.0.0.1:53"), or does it consult /etc/resolv.conf? Also, what nameserver is typically written to /etc/resolv.conf inside a Kubernetes pod?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/api/server.go` at line 219, The discoverer is being created without a DNS
server override so discovery always dials 127.0.0.1:53; update setupDiscovery()
to pass a realistic DNS server via discovery.WithDNSServer by reading the
nameserver from the host's resolver (e.g., parse /etc/resolv.conf for the first
"nameserver" entry) or by exposing a configurable option/environment var and
fall back to the parsed resolver; call discovery.NewDiscoverer(...,
discovery.WithDNSServer(resolverAddr)) so the discoverer uses the cluster DNS IP
(as found in pod /etc/resolv.conf) instead of hard-coded 127.0.0.1:53.
pkg/api/agents.go (1)

39-41: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten domain validation to real DNS label syntax.

This regex still accepts invalid suffixes like _foo.example and foo-.example. Since domain is a user-supplied domain suffix, not an underscore-prefixed service label, letting those through means we do discovery lookups for names we should reject at the handler boundary.

🔧 Suggested regex
-var dnsNameRegex = regexp.MustCompile(`^[a-z0-9_][a-z0-9_\-]{0,62}(\.[a-z0-9_][a-z0-9_\-]{0,62})*\.?$`)
+var dnsNameRegex = regexp.MustCompile(`^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?)(?:\.(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?))*\.?$`)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/api/agents.go` around lines 39 - 41, The current dnsNameRegex accepts
labels starting/ending with underscores or hyphens which allows invalid domains;
update the dnsNameRegex constant used to validate the user-supplied domain (the
`domain` check in pkg/api/agents.go) to a stricter DNS label regex that: forbids
underscores, requires each label to start and end with an alphanumeric
character, allows hyphens only internally, limits labels to 1–63 chars, allows
dot-separated labels up to the 253-char total and an optional trailing dot;
replace the existing regexp.MustCompile pattern for dnsNameRegex with that
stricter pattern and ensure all places that reference dnsNameRegex continue to
use it for handler-side validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/api/server.go`:
- Around line 120-123: The current error handling when calling setupDiscovery()
wraps all errors as ErrCodeInternal and loses existing structured error codes;
change the return to use errors.PropagateOrWrap so that setupDiscovery's coded
errors (e.g., invalid OPENSHELL_MODE) are propagated and only fallback to
ErrCodeInternal when necessary — replace the errors.Wrap call in the
setupDiscovery() error branch with errors.PropagateOrWrap(err,
errors.ErrCodeInternal, "failed to setup discovery") so existing structured
errors are preserved.

In `@pkg/discovery/options.go`:
- Around line 46-52: WithLabels currently writes into p.labels without guarding
against a nil map which can panic if Publisher wasn't initialized; update the
WithLabels(PublisherOption) closure to check if p.labels == nil and if so
allocate a new map before the for loop, ensuring p.labels is always non-nil when
assigning keys (reference the WithLabels function and the Publisher.labels
field).

In `@pkg/discovery/publisher.go`:
- Around line 261-271: ConfigMap listing order is unstable so build a
deterministic index by sorting the entries before joining; after accumulating
entries into the lines slice in the loop that iterates cmList.Items (the block
that sets name/proto and appends fmt.Sprintf("%s:%s", name, proto)), call
sort.Strings(lines) and then compute agentsData := strings.Join(lines, "\n");
also add the required import for the sort package to ensure compilation.

In `@pkg/discovery/resolver.go`:
- Around line 109-153: The current concurrent collection appends discoveries
under a mutex which yields non-deterministic order; change to allocate a
per-input result slot and compact after g.Wait(): iterate with for i, entry :=
range entries and capture i in the closure, create records :=
make([]AgentRecord, len(entries)) plus a parallel filled := make([]bool,
len(entries)) (or store pointers/nils) and inside the g.Go closure write
records[i] = *record and set filled[i] = true under mu after a successful
d.Discover call; after g.Wait() build the returned slice by iterating the filled
flags in order and appending only successful entries so the final order matches
the original entries order.

In `@pkg/openshell/options.go`:
- Around line 77-83: The exported option WithAllowPrivate exposes a test-only
SSRF bypass; remove it from the production API by either making it unexported
(rename WithAllowPrivate -> withAllowPrivate) or moving its implementation into
test-only code (a test helper) so callers cannot disable SSRF checks at runtime;
update any references to WithAllowPrivate in tests to use the new unexported
variant via test package or the test helper, and ensure the Fetcher struct field
allowPrivate remains internal (Fetcher.allowPrivate) with no exported setters.

---

Duplicate comments:
In `@pkg/api/agents.go`:
- Around line 39-41: The current dnsNameRegex accepts labels starting/ending
with underscores or hyphens which allows invalid domains; update the
dnsNameRegex constant used to validate the user-supplied domain (the `domain`
check in pkg/api/agents.go) to a stricter DNS label regex that: forbids
underscores, requires each label to start and end with an alphanumeric
character, allows hyphens only internally, limits labels to 1–63 chars, allows
dot-separated labels up to the 253-char total and an optional trailing dot;
replace the existing regexp.MustCompile pattern for dnsNameRegex with that
stricter pattern and ensure all places that reference dnsNameRegex continue to
use it for handler-side validation.

In `@pkg/api/server.go`:
- Line 219: The discoverer is being created without a DNS server override so
discovery always dials 127.0.0.1:53; update setupDiscovery() to pass a realistic
DNS server via discovery.WithDNSServer by reading the nameserver from the host's
resolver (e.g., parse /etc/resolv.conf for the first "nameserver" entry) or by
exposing a configurable option/environment var and fall back to the parsed
resolver; call discovery.NewDiscoverer(...,
discovery.WithDNSServer(resolverAddr)) so the discoverer uses the cluster DNS IP
(as found in pod /etc/resolv.conf) instead of hard-coded 127.0.0.1:53.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 85ab687d-edf7-4916-945a-434f44cd615d

📥 Commits

Reviewing files that changed from the base of the PR and between be4e034 and 02f7453.

⛔ Files ignored due to path filters (242)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/miekg/dns/.codecov.yml is excluded by !vendor/**
  • vendor/github.com/miekg/dns/.gitignore is excluded by !vendor/**
  • vendor/github.com/miekg/dns/AUTHORS is excluded by !vendor/**
  • vendor/github.com/miekg/dns/CODEOWNERS is excluded by !vendor/**
  • vendor/github.com/miekg/dns/CONTRIBUTORS is excluded by !vendor/**
  • vendor/github.com/miekg/dns/COPYRIGHT is excluded by !vendor/**
  • vendor/github.com/miekg/dns/LICENSE is excluded by !vendor/**
  • vendor/github.com/miekg/dns/Makefile.fuzz is excluded by !vendor/**
  • vendor/github.com/miekg/dns/Makefile.release is excluded by !vendor/**
  • vendor/github.com/miekg/dns/README.md is excluded by !vendor/**
  • vendor/github.com/miekg/dns/acceptfunc.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/client.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/clientconfig.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dane.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/defaults.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dns.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec_keygen.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec_keyscan.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/dnssec_privkey.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/doc.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/duplicate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/edns.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/format.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/fuzz.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/generate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/hash.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/labels.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/listen_no_socket_options.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/listen_socket_options.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/msg.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/msg_helpers.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/msg_truncate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/nsecx.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/privaterr.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/reverse.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/sanitize.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/scan.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/scan_rr.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/serve_mux.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/server.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/sig0.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/smimea.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/svcb.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/tlsa.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/tools.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/tsig.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/types.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/udp.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/udp_no_control.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/update.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/version.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/xfr.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/zduplicate.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/zmsg.go is excluded by !vendor/**
  • vendor/github.com/miekg/dns/ztypes.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/doc.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/forward_requirements.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require.go.tmpl is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require_forward.go is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/require_forward.go.tmpl is excluded by !vendor/**
  • vendor/github.com/stretchr/testify/require/requirements.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/asm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/constants.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/doc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/instructions.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/setter.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/vm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/bpf/vm_instructions.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/iana/const.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_linux_32bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_linux_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_solaris_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/cmsghdr_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/complete_dontwait.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/complete_nodontwait.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/empty.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/error_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/error_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_32bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_solaris_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/iovec_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/mmsghdr_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/mmsghdr_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_bsdvar.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_linux_32bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_linux_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_openbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_solaris_64bit.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/msghdr_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/norace.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/race.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_mmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_msg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_nommsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/rawconn_nomsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/socket.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_const_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_386.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_loong64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mips.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mips64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_mipsle.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_ppc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_ppc64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_linux_s390x.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_netbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_posix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/sys_zos_s390x.s is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_aix_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_darwin_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_darwin_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_dragonfly_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_freebsd_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_loong64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mips.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mips64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_mipsle.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_ppc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_ppc64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_linux_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_netbsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_openbsd_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_solaris_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/internal/socket/zsys_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/batch.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_pktinfo.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_unix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/control_zos.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/dgramopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/doc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/endpoint.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/genericopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/header.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/helper.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/iana.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/icmp.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/icmp_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/icmp_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/packet.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/payload.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/payload_cmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/payload_nocmsg.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sockopt.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sockopt_posix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sockopt_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_aix.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreq.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreq_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreqn.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_asmreqn_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_bpf.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_bpf_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_bsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_darwin.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_dragonfly.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_freebsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_linux.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_solaris.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_ssmreq.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_ssmreq_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_stub.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_windows.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/sys_zos.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_aix_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_darwin.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_dragonfly.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_freebsd_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_386.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_amd64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_arm.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_arm64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_loong64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mips.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mips64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mips64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_mipsle.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_ppc.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_ppc64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_ppc64le.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_riscv64.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_linux_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_netbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_openbsd.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_solaris.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv4/zsys_zos_s390x.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/batch.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control.go is excluded by !vendor/**
  • vendor/golang.org/x/net/ipv6/control_rfc2292_unix.go is excluded by !vendor/**
📒 Files selected for processing (31)
  • go.mod
  • internal/testdns/testdns.go
  • pkg/api/agents.go
  • pkg/api/agents_test.go
  • pkg/api/server.go
  • pkg/defaults/timeouts.go
  • pkg/discovery/doc.go
  • pkg/discovery/options.go
  • pkg/discovery/publisher.go
  • pkg/discovery/publisher_test.go
  • pkg/discovery/resolver.go
  • pkg/discovery/resolver_test.go
  • pkg/discovery/svcb.go
  • pkg/discovery/svcb_test.go
  • pkg/discovery/testhelper_test.go
  • pkg/discovery/types.go
  • pkg/errors/errors.go
  • pkg/errors/exitcode.go
  • pkg/openshell/doc.go
  • pkg/openshell/evaluator.go
  • pkg/openshell/evaluator_test.go
  • pkg/openshell/fetcher.go
  • pkg/openshell/fetcher_test.go
  • pkg/openshell/guard.go
  • pkg/openshell/guard_test.go
  • pkg/openshell/options.go
  • pkg/openshell/types.go
  • pkg/server/config.go
  • pkg/server/errors.go
  • pkg/server/lifecycle_test.go
  • pkg/server/server.go

Comment thread pkg/api/server.go
Comment on lines +120 to +123
opts, agentHandler, err := setupDiscovery()
if err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to setup discovery", err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Don’t squash structured discovery setup errors to internal.

setupDiscovery() already returns coded errors (for example invalid OPENSHELL_MODE). Wrapping everything here as ErrCodeInternal loses that classification; errors.PropagateOrWrap(...) preserves structured errors and only applies the fallback when needed.

♻️ Minimal fix
-			return errors.Wrap(errors.ErrCodeInternal, "failed to setup discovery", err)
+			return errors.PropagateOrWrap(err, errors.ErrCodeInternal, "failed to setup discovery")

Based on learnings: use errors.PropagateOrWrap(err, errCode, message) so existing structured errors are propagated as-is.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/api/server.go` around lines 120 - 123, The current error handling when
calling setupDiscovery() wraps all errors as ErrCodeInternal and loses existing
structured error codes; change the return to use errors.PropagateOrWrap so that
setupDiscovery's coded errors (e.g., invalid OPENSHELL_MODE) are propagated and
only fallback to ErrCodeInternal when necessary — replace the errors.Wrap call
in the setupDiscovery() error branch with errors.PropagateOrWrap(err,
errors.ErrCodeInternal, "failed to setup discovery") so existing structured
errors are preserved.

Comment thread pkg/discovery/options.go
Comment on lines +46 to +52
// WithLabels sets additional labels on created Kubernetes resources.
func WithLabels(labels map[string]string) PublisherOption {
return func(p *Publisher) {
for k, v := range labels {
p.labels[k] = v
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider defensive initialization of p.labels.

WithLabels writes directly to p.labels[k] without checking if the map is nil. While the constructor likely initializes this field (tests confirm it works), defensive coding would add a nil-check guard to prevent panics if the option is ever applied to an improperly initialized Publisher.

🛡️ Optional defensive guard
 func WithLabels(labels map[string]string) PublisherOption {
 	return func(p *Publisher) {
+		if p.labels == nil {
+			p.labels = make(map[string]string)
+		}
 		for k, v := range labels {
 			p.labels[k] = v
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/discovery/options.go` around lines 46 - 52, WithLabels currently writes
into p.labels without guarding against a nil map which can panic if Publisher
wasn't initialized; update the WithLabels(PublisherOption) closure to check if
p.labels == nil and if so allocate a new map before the for loop, ensuring
p.labels is always non-nil when assigning keys (reference the WithLabels
function and the Publisher.labels field).

Comment on lines +261 to +271
// Build index as "name:protocol" entries, one per line
var lines []string
for _, cm := range cmList.Items {
name := cm.Labels[labelName]
proto := cm.Labels[labelProtocol]
if name != "" && proto != "" {
lines = append(lines, fmt.Sprintf("%s:%s", name, proto))
}
}

agentsData := strings.Join(lines, "\n")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sort index entries before writing the shared ConfigMap.

ConfigMaps().List(...) does not guarantee a stable order, so equivalent agent sets can produce different agents payloads and trigger needless index updates/conflicts. Sorting lines first keeps the published TXT index deterministic.

♻️ Minimal fix
+	sort.Strings(lines)
 	agentsData := strings.Join(lines, "\n")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/discovery/publisher.go` around lines 261 - 271, ConfigMap listing order
is unstable so build a deterministic index by sorting the entries before
joining; after accumulating entries into the lines slice in the loop that
iterates cmList.Items (the block that sets name/proto and appends
fmt.Sprintf("%s:%s", name, proto)), call sort.Strings(lines) and then compute
agentsData := strings.Join(lines, "\n"); also add the required import for the
sort package to ensure compilation.

Comment thread pkg/discovery/resolver.go
Comment on lines +109 to +153
var (
mu sync.Mutex
records []AgentRecord
)

g, gctx := errgroup.WithContext(ctx)
g.SetLimit(maxDiscoverConcurrency)

// Loop variables are per-iteration scoped under Go 1.22+; no shadowing needed.
for _, entry := range entries {
g.Go(func() error {
record, err := d.Discover(gctx, entry.Name, entry.Protocol, domain)
if err != nil {
// Treat context cancellation as fatal so the caller learns the
// result is incomplete; everything else is logged and skipped.
if gctx.Err() != nil {
return gctx.Err()
}
slog.Warn("failed to resolve agent from index",
slog.String("name", entry.Name),
slog.String("protocol", string(entry.Protocol)),
slog.String("error", err.Error()),
)
return nil
}
mu.Lock()
records = append(records, *record)
mu.Unlock()
return nil
})
}

// g.Wait() may return ctx.Err() if any goroutine propagated cancellation.
if err := g.Wait(); err != nil {
return nil, errors.Wrap(errors.ErrCodeTimeout,
"agent discovery interrupted before completion", err)
}
// Defensive: if the parent context was canceled after all goroutines
// returned successfully but before we got here, surface that too.
if err := ctx.Err(); err != nil {
return nil, errors.Wrap(errors.ErrCodeTimeout,
"agent discovery interrupted before completion", err)
}

return records, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve index order when collecting concurrent discoveries.

append under a mutex makes the returned slice depend on DNS response timing, so /v1/agents order will flap between identical requests. Keep a result slot per input index and compact after g.Wait() instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/discovery/resolver.go` around lines 109 - 153, The current concurrent
collection appends discoveries under a mutex which yields non-deterministic
order; change to allocate a per-input result slot and compact after g.Wait():
iterate with for i, entry := range entries and capture i in the closure, create
records := make([]AgentRecord, len(entries)) plus a parallel filled :=
make([]bool, len(entries)) (or store pointers/nils) and inside the g.Go closure
write records[i] = *record and set filled[i] = true under mu after a successful
d.Discover call; after g.Wait() build the returned slice by iterating the filled
flags in order and appending only successful entries so the final order matches
the original entries order.

Comment thread pkg/openshell/options.go
Comment on lines +77 to +83
// WithAllowPrivate disables SSRF private/loopback IP checks.
// This must only be used in tests.
func WithAllowPrivate(allow bool) FetcherOption {
return func(f *Fetcher) {
f.allowPrivate = allow
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the private-IP bypass out of the production API surface.

The fetcher is otherwise positioned as SSRF-protected, but exporting WithAllowPrivate(true) lets any non-test caller disable those guards with one option. If this is truly test-only, move it to a test helper or make the option unexported.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/openshell/options.go` around lines 77 - 83, The exported option
WithAllowPrivate exposes a test-only SSRF bypass; remove it from the production
API by either making it unexported (rename WithAllowPrivate -> withAllowPrivate)
or moving its implementation into test-only code (a test helper) so callers
cannot disable SSRF checks at runtime; update any references to WithAllowPrivate
in tests to use the new unexported variant via test package or the test helper,
and ensure the Fetcher struct field allowPrivate remains internal
(Fetcher.allowPrivate) with no exported setters.

@github-actions
Copy link
Copy Markdown
Contributor

@IngmarVG-IB this PR now has merge conflicts with main. Please rebase to resolve them.

@IngmarVG-IB IngmarVG-IB force-pushed the feat/dns-aid-discovery branch from 02f7453 to ff4b714 Compare May 19, 2026 20:55
IngmarVG-IB added a commit to IngmarVG-IB/aicr that referenced this pull request May 19, 2026
Address reviewer (mchmarny) findings on PR NVIDIA#469:

Security:
- openshell/fetcher: close SSRF hostname bypass by adding a safeDialContext
  on the HTTP transport that rejects connections to private/loopback IPs
  even when the URL hostname resolves to one. Adds isBlockedIP helper and
  a regression test that fails closed against an httptest TLS server.

Correctness:
- discovery/resolver: propagate context cancellation in DiscoverAll so
  callers can distinguish "no agents" from "timed out mid-discovery".
  Remove the dead g.Wait() error path now that goroutines genuinely
  return ctx.Err() on cancellation.
- openshell/guard: simplify the Check contract — fetch failures fail
  open with (Allowed=true, nil error) rather than the dual-return that
  was easy to misuse. Return an error only on programmer errors (nil
  record). Update agents.go to match.
- openshell/guard: remove panic from NewGuard; document that callers
  must validate via ValidMode before construction (server.go already
  does this).
- discovery/types: strip leading semver "v" prefix in bapVersions so
  "v2.1.0" produces "mcp/2" rather than "mcp/v2".

Cleanup:
- discovery/resolver: remove redundant `entry := entry` loop variable
  capture (Go 1.22+ scoping makes it unnecessary).
- openshell/evaluator: restructure rate_limits to a clearer
  switch (violation vs. warning) so future map changes do the right thing.
- server/config + api/server: extract a single ResolvePort helper so
  agent registration and the HTTP server agree on the bound port.
- internal/testdns: extract the duplicated startMockDNS helper into a
  shared internal package; pkg/discovery and pkg/api now both import it.
- openshell/fetcher: document the singleflight + context coalescing
  trade-off (3s timeout makes shared cancellation acceptable).
- Remove .claude/agents/k8s-discovery.md (local Claude Code config,
  unintentionally committed).

Tests:
- New: TestFetcherSSRFDialBlocksPrivateIP, TestIsBlockedIP,
  TestGuardNilRecordReturnsError, expanded TestBapVersions cases.
- Updated: TestGuardFailOpenOnFetchError now asserts (Allowed=true, nil)
  rather than the old dual-return; TestGuardInvalidModePanics removed
  (no longer panics).
IngmarVG-IB added a commit to IngmarVG-IB/aicr that referenced this pull request May 19, 2026
Address the May 11 PR NVIDIA#469 follow-up review.

Readiness-bind race (the only fully-unaddressed item from round 2):

  pkg/server/server.go Start() previously kicked off ListenAndServe in
  a goroutine and immediately flipped readiness on the main goroutine,
  opening a window where /ready returned 200 while the kernel was still
  binding the socket. Probes during that window would 200 then get
  connection-refused on the next request.

  Bind the listener synchronously (net.ListenConfig.Listen — context-
  aware, satisfies noctx) BEFORE setReady(true), then start
  http.Server.Serve(ln) in a goroutine. If the bind fails the server
  never advertises ready. New regression test
  TestReadyImpliesListenerBound asserts that a dial against the port
  succeeds the moment ready flips to true.

golangci-lint failures cleared (the CI gate this PR currently fails on):

- exhaustive (3 sites): pkg/errors/exitcode.go, pkg/server/errors.go —
  ErrCodePolicyFetch is now covered in the exit-code, HTTP-status, and
  retryable switches. Treated as upstream-unavailable (ExitUnavailable
  / 503 / retryable=true) since policy fetch hits an external URI.

- gosec G115 (int→uint16): pkg/server/config.go ResolvePort now
  returns uint16 directly, matching net's port representation. Single
  cast still needed in parseConfig because config.Port is int for
  back-compat with existing tests that set cfg.Port = port; gated with
  a nolint comment since ResolvePort guarantees the range.
  pkg/api/server.go drops the int→uint16 cast at the registration site.

- gosec G704 (SSRF taint): pkg/openshell/fetcher.go Fetch annotated
  with nolint pointing at the three layers already mitigating the
  finding (URL pre-flight, post-resolution dial guard, redirect
  check) and the regression test that exercises the dial path. Also
  unblocks the CodeQL job which was firing on the same taint analysis.

- noctx: internal/testdns/testdns.go Start switches from net.ListenPacket
  to (*net.ListenConfig).ListenPacket with context.Background() per
  noctx lint requirement.

Layer 1 strict-mode policy fields remain a documented limitation in
Guard's doc comment (per the reviewer's "downgraded to documented
limitation rather than fixed" note). Tracked in a follow-up issue.
Introduce pkg/discovery for DNS-AID based agent discovery using SVCB
records (RFC 9460) with private-use SvcParamKeys per
draft-mozleywilliams-dnsop-dnsaid-01. Agents publish themselves via K8s
ConfigMaps and discover peers through DNS lookups.

- pkg/discovery: Discoverer (DNS SVCB resolver), Publisher (K8s
  ConfigMap-based registration), SVCB parser for keys 65400-65408
- pkg/server: WithOnStart/WithOnShutdown lifecycle hooks as functional
  options, executed during server start/shutdown respectively
- pkg/api: /v1/agents endpoint listing discovered agents in a domain,
  with input validation and feature-gated via AICR_DISCOVERY_ENABLED
- pkg/defaults: discovery timeout constants and ServerDefaultPort
Strip rdatapolicy plugin mentions from pkg/discovery doc comments
and the k8s-discovery agent definition. These features are under
embargo and should not be referenced in public code.
OpenShell evaluates target agents' policy documents (served at their
dns-aid-core policy URI, SvcParamKey 65403) as Layer 1 caller-side
enforcement before allowing connections. This completes the security
story: "OpenShell secures the agent boundary; DNS-AID resolves what's
inside it."

New pkg/openshell package implements:
- All 16 native policy rules matching dns-aid-core's PolicyDocument schema
- Three enforcement modes: strict (deny), permissive (log, default), disabled
- Fail-open on policy fetch errors (matches dns-aid-core behavior)
- SSRF-protected fetcher with HTTPS-only, private IP rejection
- TTL-based cache with bounded eviction and singleflight coalescing
- Enforcement layer filtering (Layer 0/1/2 per rule)
- Realm isolation detection with cross-realm access logging

Integration:
- /v1/agents endpoint evaluates OpenShell policy per discovered agent
- OPENSHELL_MODE env var controls enforcement (strict/permissive/disabled)
- Guard accepts nil safely — zero behavioral change when discovery disabled

New error codes: POLICY_DENIED, POLICY_FETCH
New defaults: PolicyFetchTimeout (3s), PolicyCacheTTL (5m), PolicyMaxBytes (64KB)

Test coverage: 84.2% (30 tests across evaluator, fetcher, guard)
Address reviewer (mchmarny) findings on PR NVIDIA#469:

Security:
- openshell/fetcher: close SSRF hostname bypass by adding a safeDialContext
  on the HTTP transport that rejects connections to private/loopback IPs
  even when the URL hostname resolves to one. Adds isBlockedIP helper and
  a regression test that fails closed against an httptest TLS server.

Correctness:
- discovery/resolver: propagate context cancellation in DiscoverAll so
  callers can distinguish "no agents" from "timed out mid-discovery".
  Remove the dead g.Wait() error path now that goroutines genuinely
  return ctx.Err() on cancellation.
- openshell/guard: simplify the Check contract — fetch failures fail
  open with (Allowed=true, nil error) rather than the dual-return that
  was easy to misuse. Return an error only on programmer errors (nil
  record). Update agents.go to match.
- openshell/guard: remove panic from NewGuard; document that callers
  must validate via ValidMode before construction (server.go already
  does this).
- discovery/types: strip leading semver "v" prefix in bapVersions so
  "v2.1.0" produces "mcp/2" rather than "mcp/v2".

Cleanup:
- discovery/resolver: remove redundant `entry := entry` loop variable
  capture (Go 1.22+ scoping makes it unnecessary).
- openshell/evaluator: restructure rate_limits to a clearer
  switch (violation vs. warning) so future map changes do the right thing.
- server/config + api/server: extract a single ResolvePort helper so
  agent registration and the HTTP server agree on the bound port.
- internal/testdns: extract the duplicated startMockDNS helper into a
  shared internal package; pkg/discovery and pkg/api now both import it.
- openshell/fetcher: document the singleflight + context coalescing
  trade-off (3s timeout makes shared cancellation acceptable).
- Remove .claude/agents/k8s-discovery.md (local Claude Code config,
  unintentionally committed).

Tests:
- New: TestFetcherSSRFDialBlocksPrivateIP, TestIsBlockedIP,
  TestGuardNilRecordReturnsError, expanded TestBapVersions cases.
- Updated: TestGuardFailOpenOnFetchError now asserts (Allowed=true, nil)
  rather than the old dual-return; TestGuardInvalidModePanics removed
  (no longer panics).
Security:
- openshell/fetcher: close TOCTOU between LookupIPAddr and dialer.DialContext
  by pinning the dial to the validated IP literal so the stdlib dialer cannot
  re-resolve the hostname to a now-private address (DNS rebinding mitigation).
- openshell/fetcher: add CheckRedirect that re-validates each redirect target
  scheme so a 30x cannot downgrade the connection from HTTPS.
- server/config: switch ResolvePort and SHUTDOWN_TIMEOUT_SECONDS parsing from
  fmt.Sscanf to strconv.Atoi — Sscanf accepts trailing garbage ("8080abc"
  parses as 8080), Atoi rejects it.
- api/server: trim whitespace on SA-namespace file read (CRI runtimes append
  trailing newline) and on AICR_DISCOVERY_NAMESPACE; switch
  AICR_DISCOVERY_ENABLED to strconv.ParseBool so "1"/"True"/"TRUE" work.
- discovery/svcb: guard BAP parsing against empty string so "" yields nil
  (treat as unset) rather than [""], which would shadow protocol defaults.
- discovery/resolver: nil-check resp before reading resp.Rcode in both
  ListIndex and querySVCB; miekg/dns can return (nil, nil) on certain header
  failures.

Correctness:
- api/agents: fan out per-agent OpenShell policy checks via errgroup with a
  concurrency cap (8). Sequential checks would scale as N×PolicyFetchTimeout
  (3s) on a cold cache and risk blowing past DiscoveryHandlerTimeout.
- discovery/publisher: replace Create-then-blind-Update with Create →
  AlreadyExists → Get/Update with bounded retries on Conflict so concurrent
  publishers (or admission-controller mutations) produce a clean retry
  rather than silent clobber.
- discovery/resolver: add ctx.Done() check inside the TXT parse loop in
  ListIndex — pathologically large index responses must remain interruptible.
- server/server: bound each OnStart and OnShutdown lifecycle hook in a
  per-hook context (defaults.ServerHookTimeout = 30s) so a misbehaving hook
  cannot stall startup or consume the whole shutdown budget.
- openshell/evaluator: parseTLSVersion no longer shadows the Go 1.21+
  predeclared identifier "min".
- openshell/evaluator: rate_limits violation path now reports
  "not yet implemented at this layer" with a TODO; the previous "exceeded"
  message would have unconditionally denied at LayerTarget once Layer 2
  shipped because PolicyContext has no request counters.

Cleanup / docs:
- openshell/guard: document Layer 1 limitations explicitly — many policy
  fields cannot be populated pre-connection, so ModeStrict against
  connection-state rules will deny by zero value.
- openshell/types: clarify the doc comment on PolicyRules that the three
  bool fields are an intentional exception to the pointer-or-slice convention.
- errors: drop unused ErrCodePolicyDenied (no consumer in this PR).
- discovery/svcb: comment the equal-priority SVCB tie-break (RFC 9460 §2.4.3).
- discovery/resolver: replace misleading "system resolver" comment on the
  127.0.0.1:53 default — that's the K8s pod resolver, not the host's.
- api/server: extract saNamespacePath as a package var so tests can redirect
  it; TestDiscoveryDomain is now host-independent and asserts the exact
  expected default ("default.svc.cluster.local."). Adds
  TestPodNamespaceTrimsWhitespace.
- server/lifecycle_test: TestMultipleOnStartHooksExecuteInOrder now drives
  Start() rather than calling hooks manually so it actually covers the
  lifecycle dispatcher; pair test added for OnShutdown ordering.
- server/lifecycle_test: check Close() error in freePort.
- agents_test, resolver_test: drop staticcheck S1039 fmt.Sprintf wrappers.
- openshell/fetcher_test: replace stub SSRF test with one that points at the
  hostname "localhost" so it actually exercises the dial-time guard rather
  than URL-level validation. Add TestFetcherRejectsNonHTTPSRedirect.

Tests: all packages pass with -race. No behavior change when discovery is
disabled (the AICR_DISCOVERY_ENABLED gate still defaults to off).
Address the May 11 PR NVIDIA#469 follow-up review.

Readiness-bind race (the only fully-unaddressed item from round 2):

  pkg/server/server.go Start() previously kicked off ListenAndServe in
  a goroutine and immediately flipped readiness on the main goroutine,
  opening a window where /ready returned 200 while the kernel was still
  binding the socket. Probes during that window would 200 then get
  connection-refused on the next request.

  Bind the listener synchronously (net.ListenConfig.Listen — context-
  aware, satisfies noctx) BEFORE setReady(true), then start
  http.Server.Serve(ln) in a goroutine. If the bind fails the server
  never advertises ready. New regression test
  TestReadyImpliesListenerBound asserts that a dial against the port
  succeeds the moment ready flips to true.

golangci-lint failures cleared (the CI gate this PR currently fails on):

- exhaustive (3 sites): pkg/errors/exitcode.go, pkg/server/errors.go —
  ErrCodePolicyFetch is now covered in the exit-code, HTTP-status, and
  retryable switches. Treated as upstream-unavailable (ExitUnavailable
  / 503 / retryable=true) since policy fetch hits an external URI.

- gosec G115 (int→uint16): pkg/server/config.go ResolvePort now
  returns uint16 directly, matching net's port representation. Single
  cast still needed in parseConfig because config.Port is int for
  back-compat with existing tests that set cfg.Port = port; gated with
  a nolint comment since ResolvePort guarantees the range.
  pkg/api/server.go drops the int→uint16 cast at the registration site.

- gosec G704 (SSRF taint): pkg/openshell/fetcher.go Fetch annotated
  with nolint pointing at the three layers already mitigating the
  finding (URL pre-flight, post-resolution dial guard, redirect
  check) and the regression test that exercises the dial path. Also
  unblocks the CodeQL job which was firing on the same taint analysis.

- noctx: internal/testdns/testdns.go Start switches from net.ListenPacket
  to (*net.ListenConfig).ListenPacket with context.Background() per
  noctx lint requirement.

Layer 1 strict-mode policy fields remain a documented limitation in
Guard's doc comment (per the reviewer's "downgraded to documented
limitation rather than fixed" note). Tracked in a follow-up issue.
@IngmarVG-IB IngmarVG-IB force-pushed the feat/dns-aid-discovery branch from ff4b714 to 6f2d22f Compare May 19, 2026 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants