chore: main to test by mayankpande88 · Pull Request #202 · nudgebee/node-agent

mayankpande88 · 2026-02-17T10:15:24Z

No description provided.

…200)

* fix: resolve external IPs to FQDNs in metric labels using DNS cache

  DestinationKey is set at TCP connection time, but DNS may not be cached yet due to per-CPU perf buffer race conditions. This causes all metric labels (destination, destination_workload_name) to show raw IPs instead of FQDNs for external services. Add enrichDestinationKey() that checks the DNS cache at metric emission time and substitutes the FQDN when available. Applied to all TCP metrics (bytes, connections, retransmits) and all L7 protocol metrics (HTTP, Postgres, Redis, etc).

* fix: aggregate connection stats by enriched key to prevent duplicate metrics

  Multiple IPs resolving to the same FQDN (e.g., Google's shared IPs for monitoring.googleapis.com) caused duplicate metric errors when enriched to the same label set. Aggregate stats by enriched DestinationKey before emitting TCP metrics.

* fix: migrate connection DestinationKey when DNS becomes available

  Root cause fix for duplicate metric errors. When a TCP connection opens before DNS is cached (per-CPU perf buffer race), the DestinationKey stores the raw IP. Later connections get an FQDN-based key. Both enrich to the same FQDN at emit time, causing "collected before with same label values" errors. migrateConnectionKeyIfNeeded updates conn.DestinationKey in-place and migrates connectionStats from the old IP-based key to the FQDN-based key. Called from updateConnectionTrafficStats and onL7RequestWithResult. Collect() aggregation kept as safety net for stale entries.

* fix: use minimal structs and O(1) pod lookup in ip_resolver

  Replace full K8s API objects with minimal structs (MinimalOwnerInfo, MinimalService, MinimalNode) to reduce memory ~10x per resource. Add PodNameIndex for O(1) ResolvePodOwner instead of O(pods) scan. Deduplicate service resolution into resolveServiceWorkload(). Simplify getControllerOfOwner from 7 switch cases to unified lookup.

* fix: eliminate duplicate HTTP parsing and remove dead code

  Remove 3 unused functions (ParseHTTPResponse, ParseHttpResponse, ParseHostFromHttpRequest). Fix parseRequest() to call ParseHTTPRequest once instead of both ParseHttp + ParseHTTPRequest. Pass pre-parsed method/path to observe() instead of re-parsing payload a third time. Fix pre-existing RawPath bug in ParseHTTPRequest URL construction.

* fix: race conditions, missing informer, stale IP cleanup, and eBPF bounds checks

  - Fix concurrent map access in getMounts/getListens/ping (copy c.processes under lock)
  - Fix conn.Closed written outside lock in onConnectionClose
  - Add missing CronJob informer handler for owner chain resolution
  - Fix InstanceMeta missing Instance field in initial snapshot
  - Fix storeWorkloadsIP check-then-act race with mutex
  - Clean stale ClusterIPs on Service update and old pod IPs on Pod update
  - Add bounds checks in memcached/redis eBPF parsers to prevent OOB reads
  - Remove misleading defer req.Body.Close() in ParseHTTPRequest

* fix: correct ssl_st struct offsets for OpenSSL 3.x TLS capture

  In OpenSSL 3.x, the SSL struct was split into ssl_st (base) and ssl_connection_st (connection data). rbio/wbio moved from offset 16/24 to 80/88. The eBPF code was reading garbage pointers for the BIO, causing FD extraction to fail (CONN_NOT_FOUND with invalid FDs like 2868022864). HTTP payload was captured correctly since it comes from function parameters, but the connection couldn't be matched. Adds GET_FD_V3 macro that reads rbio/wbio at the correct offsets for OpenSSL 3.x (verified against OpenSSL 3.5.4 with a test program).

* fix: prefer system libssl over psycopg2 bundled lib for TLS uprobes

  Processes like the notification server load both psycopg2's bundled libssl-*.so and the system /usr/lib/libssl.so.3. The agent was picking the first match from /proc/pid/maps (psycopg2's), but Python's ssl module uses the system lib for HTTPS. This caused SSL uprobes to fire on the wrong library, missing Slack/Teams API calls.

* fix: reduce log noise by raising debug log levels

  Per-event hot-path logs (L7_EVENT_REGISTRY, CONTAINER_FOUND, TIMESTAMP_MISMATCH) were at V(2), producing ~15K lines/min with KLOG_V=2. Raised to V(5). Per-request logs (HTTP2_COMPLETED, LLM_TRACK) raised to V(4). Occasional logs raised to V(3). TLS per-process debug logs moved from unconditional Infof to V(2)/V(3).

* fix: align L7 metric labels with TCP and remove dead connection fields

  - Rename L7 labels to match TCP: destination_region → destination_workload_region, destination_az → destination_workload_az, destination_instance → actual_destination_instance
  - Add missing actual_destination_region and actual_destination_az labels to L7 metrics
  - Fix L7 destination_workload_region/az to use destinationWorkload (was using actualDestWorkload)
  - Remove dead fields from ConnectionKey (srcWorkload, dstWorkload, actualDestWorkload — never set)
  - Remove dead fields from ActiveConnection (dstWorkload, actualDestWorkload — set but never read)
  - Remove redundant DNS re-enrichment in trace creation (migrateConnectionKeyIfNeeded already handles it)

* style: fix gofmt formatting in container.go and http.go
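
To make the enrichment and aggregation fixes above more concrete, here is a minimal Go sketch of the idea: keys are enriched from a DNS cache at emit time, and stats whose keys collapse to the same FQDN are summed so each label set is emitted once. The types DestinationKey and ConnStats, the helper names, and the 203.0.113.x addresses are simplified placeholders for illustration, not the agent's actual definitions.

```go
package main

import "fmt"

// DestinationKey is a simplified, hypothetical stand-in for the agent's key type.
type DestinationKey struct {
	Host string // raw IP or FQDN
	Port uint16
}

// ConnStats is a hypothetical per-destination counter set.
type ConnStats struct {
	BytesSent     uint64
	BytesReceived uint64
}

// enrich substitutes the FQDN for a raw IP when the DNS cache knows it.
// Keys that already hold an FQDN, or unknown IPs, are returned unchanged.
func enrich(k DestinationKey, dnsCache map[string]string) DestinationKey {
	if fqdn, ok := dnsCache[k.Host]; ok {
		k.Host = fqdn
	}
	return k
}

// aggregateByEnrichedKey sums stats whose keys enrich to the same value,
// so every resulting label set appears exactly once.
func aggregateByEnrichedKey(stats map[DestinationKey]ConnStats, dnsCache map[string]string) map[DestinationKey]ConnStats {
	out := make(map[DestinationKey]ConnStats, len(stats))
	for k, s := range stats {
		ek := enrich(k, dnsCache)
		agg := out[ek]
		agg.BytesSent += s.BytesSent
		agg.BytesReceived += s.BytesReceived
		out[ek] = agg
	}
	return out
}

func main() {
	// Placeholder documentation-range IPs that resolve to one shared FQDN.
	dnsCache := map[string]string{
		"203.0.113.10": "monitoring.googleapis.com",
		"203.0.113.11": "monitoring.googleapis.com",
	}
	stats := map[DestinationKey]ConnStats{
		{Host: "203.0.113.10", Port: 443}: {BytesSent: 100, BytesReceived: 10},
		{Host: "203.0.113.11", Port: 443}: {BytesSent: 50, BytesReceived: 5},
		{Host: "10.0.0.7", Port: 5432}:    {BytesSent: 7, BytesReceived: 3},
	}
	out := aggregateByEnrichedKey(stats, dnsCache)
	g := out[DestinationKey{Host: "monitoring.googleapis.com", Port: 443}]
	fmt.Println(len(out), g.BytesSent, g.BytesReceived) // 2 150 15
}
```

Aggregating at emit time only hides the duplicate keys; per the commit message, migrating the connection's key once DNS becomes available is the root-cause fix and the aggregation stays as a safety net.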

gemini-code-assist · 2026-02-17T10:16:48Z

Summary of Changes

Hello @mayankpande88, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a series of enhancements aimed at improving the node agent's efficiency, accuracy, and robustness. Significant work has been done to optimize Kubernetes resource handling, refine L7 tracing for better metric aggregation and FQDN resolution, and bolster the reliability of eBPF uprobe attachments across various runtimes. Additionally, systemd service detection and filtering have been improved, and the build process updated for better compatibility.

Highlights

Dockerfile and Build Process: Updated the Dockerfile to use debian:bullseye as the base image and manually install Go 1.24.9, addressing potential glibc compatibility issues with the official Golang image.
Kubernetes IP Resolver Optimization: Refactored the Kubernetes IP resolver to use minimal structs for OwnerInfo, Service, and Node within the clusterSnapshot, significantly reducing memory footprint. A PodNameIndex was also added for faster pod lookups by name and namespace.
cgroup Path Recognition and Systemd Service Handling: Expanded cgroup path regex to include kube.slice and azure.slice for improved Kubernetes and Azure service detection. Systemd property retrieval was refactored using a dedicated D-Bus client and a new SystemdProperties struct, enabling better introspection and the ability to skip well-known system services.
L7 Tracing and Metric Aggregation Improvements: Enhanced L7 tracing by implementing connection key enrichment and migration. This resolves IP-based workload names to FQDNs when DNS information becomes available, preventing duplicate metrics for connections to the same FQDN via different IPs. LLM-specific metrics were removed from the Container struct, and logging verbosity for L7 events was adjusted.
eBPF Uprobe Attachment Robustness: Introduced new ELFFile and Symbol utilities to dynamically discover function entry and return points within binaries. This improves the reliability of uprobe and uretprobe attachments for Go TLS, Python, and Node.js applications, especially for OpenSSL 3.x, making them more resilient to compiler optimizations and varying library versions.
GPU Monitoring and Configuration: Added a new command-line flag, --disable-gpu-monitoring, allowing users to optionally disable GPU monitoring via NVML.
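
As a rough illustration of the PodNameIndex highlight, the sketch below keeps a name-to-UID index next to a UID-keyed owner map, so resolving a pod's owner by namespace and name is two map lookups rather than a scan over all pods. The podIndex type, its fields, and the OwnerRef struct are hypothetical names chosen for this example; the real structures in common/ip_resolver.go may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// OwnerRef is a hypothetical minimal owner record (kind and name only).
type OwnerRef struct {
	Kind string
	Name string
}

// podIndex is a hypothetical layout: byName maps "namespace/name" to a pod
// UID, and owners maps that UID to the pod's owner.
type podIndex struct {
	byName sync.Map // string ("namespace/name") -> string (UID)
	owners sync.Map // string (UID) -> OwnerRef
}

func podKey(namespace, name string) string { return namespace + "/" + name }

// add records a pod in both maps.
func (p *podIndex) add(namespace, name, uid string, owner OwnerRef) {
	p.byName.Store(podKey(namespace, name), uid)
	p.owners.Store(uid, owner)
}

// remove drops the pod from both maps so stale names do not linger.
func (p *podIndex) remove(namespace, name string) {
	if uid, ok := p.byName.LoadAndDelete(podKey(namespace, name)); ok {
		p.owners.Delete(uid)
	}
}

// resolveOwner performs two constant-time lookups instead of iterating pods.
func (p *podIndex) resolveOwner(namespace, name string) (OwnerRef, bool) {
	uid, ok := p.byName.Load(podKey(namespace, name))
	if !ok {
		return OwnerRef{}, false
	}
	o, ok := p.owners.Load(uid)
	if !ok {
		return OwnerRef{}, false
	}
	return o.(OwnerRef), true
}

func main() {
	idx := &podIndex{}
	idx.add("default", "web-7c9f", "uid-1", OwnerRef{Kind: "ReplicaSet", Name: "web"})
	owner, ok := idx.resolveOwner("default", "web-7c9f")
	fmt.Println(owner.Kind, owner.Name, ok) // ReplicaSet web true
}
```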

Changelog

Dockerfile
- Updated base image from golang:1.24.7-trixie to debian:bullseye.
- Manually installed Go 1.24.9 to address glibc compatibility issues.
cgroup/cgroup.go
- Expanded systemSliceIdRegexp to include kube.slice and azure.slice.
- Added systemd to the list of prefixes for standalone process detection.
cgroup/cgroup_test.go
- Added new test cases for kube.slice, azure.slice, and .scope paths in containerByCgroup.
common/ip_resolver.go
- Removed import of k8s.io/api/batch/v1beta1.
- Introduced MinimalOwnerInfo, MinimalService, and MinimalNode structs to store only essential fields.
- Updated clusterSnapshot to use the new minimal structs and added PodNameIndex for efficient pod lookups.
- Added ipsMapMu mutex to protect concurrent access to ipsMap.
- Refactored addServiceHandlers to use MinimalService and directly store resolved workloads.
- Modified handlePodAdd to store pod UIDs in PodNameIndex.
- Updated handleNodeEvent to store MinimalNode objects.
- Adjusted getFullClusterSnapshot to populate minimal structs and PodNameIndex.
- Rewrote updateIpMapping to use MinimalService and resolveServiceWorkload.
- Updated getControllerOfOwner to work with MinimalOwnerInfo and a generic sync.Map lookup.
- Simplified ResolvePodOwner to use PodNameIndex for direct lookup.
common/net.go
- Removed debug logging for NewDestinationKey.
- Added WithResolvedDomain method to DestinationKey for FQDN enrichment.
containers/container.go
- Removed systemdTriggeredBy field from ContainerMetadata and replaced it with a SystemdProperties struct.
- Removed LLMStats struct and llmStats map from Container.
- Simplified ConnectionKey and ActiveConnection structs by removing redundant workload fields.
- Updated Collect method to use SystemdProperties for container info metrics.
- Implemented aggregation of connection stats by enriched destination key to prevent duplicate metrics.
- Adjusted logging verbosity for L7 connection events.
- Added enrichDestinationKey and migrateConnectionKeyIfNeeded methods to handle FQDN resolution for connection keys.
- Removed redundant DNS enrichment logic from onL7RequestWithResult as it's now handled by migrateConnectionKeyIfNeeded.
- Updated l7Stats.observe calls to pass method and path arguments.
- Removed HTTP/2 LLM tracking logic from onL7RequestWithResult as it's handled by the stream tracker.
- Adjusted logging verbosity for HTTP2_COMPLETED_REQUEST and HTTP2_CONNECTIONLESS events.
- Added thread-safe copying of c.processes and c.listens maps before iteration in getMounts and getListens.
containers/containerd.go
- Updated JSON unmarshalling to use GetValue() instead of Value for OCI spec and metadata.
containers/http_processor.go
- Improved HTTP request parsing by using l7.ParseHTTPRequest first, falling back to l7.ParseHttp for malformed URIs.
containers/journald.go
- Removed dependency on cgroup package.
- Changed JournaldSubscribe and JournaldUnsubscribe to use unit string instead of cgroup.Cgroup.
- Added journaldPollTimeout constant and improved follow loop to handle inotify limitations.
containers/l7.go
- Added mu (mutex) to L7Stats for thread-safe access.
- Updated observe method signature to include path and changed receiver to pointer.
- Adjusted label values for metrics to include destination_workload_region and destination_workload_az.
- Modified observe to use method and path arguments for HTTP protocol.
- Added mutex locks around L7Stats map access in observe and collect methods.
- Changed ensureInitialized and collect receivers to pointers.
containers/llm_stream.go
- Adjusted logging verbosity for LLM_STREAM_START and LLM_STREAM_FIRST_TOKEN events.
containers/metrics.go
- Removed LLMRequests, LLMTokensUsed, and LLMLatency metrics.
- Added systemd_type label to ContainerInfo metric.
containers/process.go
- Improved instrumentPython to handle empty cmdFields gracefully.
- Modified Close method to close uprobes in a separate goroutine to prevent blocking.
containers/registry.go
- Removed redundant klog.Infoln for TCP connection errors from unknown containers.
- Adjusted logging verbosity for L7 events in handleEvents, processL7Event, queueL7EventForRetry, L7_EVENT_EXPIRED, L7_EVENT_MAX_RETRIES, and L7_EVENT_RETRY_SUCCESS.
- Added a check in getOrCreateContainer to return nil if cgroup ID ends with (deleted).
- Implemented skipping of systemd system services based on the SkipSystemdSystemServices flag.
- Updated getContainerMetadata to use the new getSystemdProperties function.
containers/systemd.go
- Removed init function for D-Bus connection.
- Introduced DbusClient struct for managing D-Bus connections and caching properties.
- Added SystemdProperties struct to encapsulate unit, triggered by, and type information.
- Implemented IsEmpty() and IsSystemService() methods for SystemdProperties.
- Refactored SystemdTriggeredBy into getSystemdProperties to retrieve full systemd properties.
- Added systemServicePrefixes for identifying common system services.
ebpftracer/ebpf/l7/memcached.c
- Added a buffer size check (buf_size < 5) before reading memcached response to prevent out-of-bounds access.
ebpftracer/ebpf/l7/openssl.c
- Added SSL_V3_RBIO_OFFSET and SSL_V3_WBIO_OFFSET macros for OpenSSL 3.x.
- Introduced GET_FD_V3 macro to correctly extract file descriptors for OpenSSL 3.x.
- Updated openssl_SSL_write_enter_v3_0, openssl_SSL_read_enter_v3_0, and openssl_SSL_read_ex_enter_v3_0 to use GET_FD_V3 and ensure connection tracking.
ebpftracer/ebpf/l7/redis.c
- Added a buffer size check (buf_size < 3) before reading redis response to prevent out-of-bounds access.
ebpftracer/elf.go
- Added new file elf.go containing ELFFile and Symbol structs for robust ELF parsing, symbol lookup, and dynamic uprobe/uretprobe attachment based on instruction analysis.
ebpftracer/l7/http.go
- Added RawPath to url.URL in ParseHTTPRequest.
- Removed ParseHostFromHttpRequest and ParseHTTPResponse functions.
ebpftracer/nodejs.go
- Refactored AttachNodejsProbes to use attachNodejsUprobes helper function.
- Updated attachNodejsUprobes to leverage ELFFile and Symbol for more reliable uprobe and uretprobe attachments.
ebpftracer/python.go
- Updated muslRegexp to ld-musl[.-] for better matching.
- Refactored AttachPythonThreadLockProbes to use attachPythonUprobes helper function.
- Updated attachPythonUprobes to leverage ELFFile and Symbol for more reliable uprobe and uretprobe attachments.
ebpftracer/tls.go
- Removed getReturnOffsets function.
- Refactored AttachOpenSslUprobes and AttachGoTlsUprobes to use the new ELFFile and Symbol utilities for dynamic symbol resolution and uprobe/uretprobe attachment.
- Adjusted logging verbosity for Go TLS attachment attempts and filtered errors.
flags/flags.go
- Added DisableGPUMonitoring flag to control GPU monitoring.
- Added SkipSystemdSystemServices flag to skip monitoring of common systemd services.
- Added LogPatternsPerContainer flag to limit unique log patterns.
go.mod
- Updated github.com/cilium/ebpf to v0.20.0.
- Updated github.com/containerd/cgroups to v1.1.0.
- Updated github.com/containerd/containerd to v1.7.29.
- Updated github.com/coreos/go-systemd/v22 to v22.7.0.
- Updated github.com/godbus/dbus/v5 to v5.2.2.
- Updated github.com/opencontainers/runtime-spec to v1.1.0.
- Updated golang.org/x/sys to v0.40.0.
- Updated golang.org/x/tools to v0.38.0.
- Removed cloud.google.com/go/compute/metadata dependency.
- Updated various indirect dependencies.
gpu/gpu.go
- Added flags import.
- Added a check for flags.DisableGPUMonitoring in NewCollector to optionally skip GPU initialization.
- Added a nil check for c.iface in Close method.
install.sh
- Updated ENV_VARS regex to include DISABLE_GPU_MONITORING.
logs/journald_reader.go
- Added journaldPollTimeout constant.
- Improved error logging for journald initialization and usage checks.
- Modified follow loop to handle inotify limitations by sleeping if journal.Wait returns 0.
- Changed subscriber key from SD_JOURNAL_FIELD_SYSTEMD_CGROUP to SD_JOURNAL_FIELD_SYSTEMD_UNIT.
node/metadata/gcp.go
- Removed dependency on cloud.google.com/go/compute/metadata.
- Implemented custom GCP metadata retrieval using net/http client with network namespace execution.
node/net.go
- Expanded netDeviceFilterRe regex to include br (bridge) and vmbr (VM bridge) devices.
- Made netDeviceFilter case-insensitive.
node/net_test.go
- Updated TestNetDeviceFilter with new test cases for enX0, br0, and vmbr0.
proc/fd.go
- Adjusted error handling in ReadFds to only log warnings for non-os.IsNotExist errors.


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

The pull request introduces significant refactoring and improvements across several components. The Dockerfile is updated to use Debian Bullseye as a base image, manually installing Go 1.24.9 and build tools to address glibc compatibility issues. The cgroup parsing logic is enhanced to recognize kube.slice and azure.slice for systemd services and to classify .scope cgroups as standalone processes, with corresponding unit tests added.

The Kubernetes IP resolver (common/ip_resolver.go) undergoes a major overhaul to reduce memory footprint by storing only minimal necessary fields from Kubernetes objects (Pods, Nodes, ReplicaSets, DaemonSets, StatefulSets, Jobs, Services, Deployments, CronJobs), adding CronJob support, and optimizing lookup logic with a new PodNameIndex. The common/net.go file removes debug logging and adds a helper for DestinationKey to update with resolved FQDNs.

The containers/container.go file sees extensive changes, including the removal of LLMStats and simplification of connection structs. It addresses duplicate metrics by enriching destination keys with FQDNs using new enrichDestinationKey and migrateConnectionKeyIfNeeded methods, and adjusts logging verbosity. Concurrent map access issues in getMounts, getListens, and ping are resolved using read locks. Systemd integration is improved by using SystemdProperties struct to store unit, triggered-by, and type information, and journald subscriptions now use the systemd unit name. The containers/systemd.go file introduces a DbusClient with caching and retry logic for systemd properties, and defines SystemdProperties to encapsulate systemd service details, including a mechanism to skip well-known system services.

The ebpftracer package includes a new elf.go file for parsing ELF binaries and attaching uprobes/uretprobes by symbol name and return offsets, which is then utilized by nodejs.go, python.go, and tls.go for more robust and version-agnostic tracing. Minor fixes are applied to ebpftracer/ebpf/l7/memcached.c and redis.c for buffer size checks, and openssl.c is updated to support OpenSSL 3.x.

The flags/flags.go file adds a new flag to disable GPU monitoring and another to skip systemd system services. logs/journald_reader.go improves error logging and handling of journald polling. node/metadata/gcp.go modifies GCP metadata retrieval to use a custom HTTP client operating in the host network namespace. node/net.go expands the network device filter regex. Finally, go.mod updates several dependencies.

A review comment highlights a potential infinite loop in ebpftracer/elf.go during x86_64 instruction decoding if x86asm.Decode returns an error with ins.Len being 0, suggesting a fix to always advance the index. Another comment points out that the DbusClient cache in containers/systemd.go could lead to unbounded memory growth in environments with many transient services, recommending a size limit or TTL for cache entries.
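
The infinite-loop concern raised for ebpftracer/elf.go can be sketched as follows. This is only an illustration of the "always advance the index" pattern: decodeFunc and toyDecode are hypothetical stand-ins for a real instruction decoder, and the actual loop in elf.go is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeFunc is a hypothetical stand-in for an instruction decoder: it
// reports the length of the instruction at the start of code, or an error.
type decodeFunc func(code []byte) (length int, err error)

// scanOffsets returns the offsets of instructions that decoded successfully.
// On an error or a non-positive length it skips one byte, so the loop index
// always advances and the scan terminates for any input.
func scanOffsets(code []byte, decode decodeFunc) []int {
	var offsets []int
	for i := 0; i < len(code); {
		n, err := decode(code[i:])
		if err != nil || n <= 0 {
			i++ // guaranteed progress even when the decoder reports length 0
			continue
		}
		offsets = append(offsets, i)
		i += n
	}
	return offsets
}

var errUndecodable = errors.New("undecodable byte")

// toyDecode is a made-up decoder for the example: 0x90 and 0xC3 are one-byte
// instructions, 0xE8 is a five-byte instruction, anything else fails with
// length 0 (the case that would otherwise stall the loop).
func toyDecode(code []byte) (int, error) {
	switch code[0] {
	case 0x90, 0xC3:
		return 1, nil
	case 0xE8:
		if len(code) < 5 {
			return 0, errUndecodable
		}
		return 5, nil
	default:
		return 0, errUndecodable
	}
}

func main() {
	code := []byte{0x90, 0xFF, 0xE8, 1, 2, 3, 4, 0xC3}
	fmt.Println(scanOffsets(code, toyDecode)) // [0 2 7]
}
```

Skipping a single byte after a failed decode is a heuristic: it guarantees termination, but the scan may resynchronize on a byte that is not a true instruction boundary.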

#203)

* fix: add missing patternsPerLevelLimit param to logparser.NewParser calls

* fix: update logparser dep, sanitize log paths, replace test credential

* fix: update logparser dep to fix sensitive pattern loading

  The previous version had a corrupted sensitive_patterns.json that caused "cannot unmarshal object into Go value of type []SensitivePattern" errors, breaking sensitive data detection in log parsing.

mayankpande88 added 3 commits February 16, 2026 09:15

merge upstream coroot/coroot-node-agent changes

add3067

merge upstream coroot/coroot-node-agent changes (#201)

0445d57

gemini-code-assist Bot reviewed Feb 17, 2026

Comment thread ebpftracer/elf.go

Comment thread containers/systemd.go
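
One way to act on the unbounded-cache concern raised for containers/systemd.go is a cache with both a TTL and a size cap. The sketch below is an assumption about how that could look, not the DbusClient implementation: ttlCache, its fields, and the string value type are hypothetical, and eviction when full simply drops expired entries first and then an arbitrary one.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry pairs a cached value with its expiry time.
type entry struct {
	value   string
	expires time.Time
}

// ttlCache is a hypothetical bounded cache; maxSize should be at least 1.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	maxSize int
	items   map[string]entry
}

func newTTLCache(ttl time.Duration, maxSize int) *ttlCache {
	return &ttlCache{ttl: ttl, maxSize: maxSize, items: make(map[string]entry)}
}

// get returns the value if present and not expired; expired entries are
// deleted on access.
func (c *ttlCache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !time.Now().Before(e.expires) {
		delete(c.items, key)
		return "", false
	}
	return e.value, true
}

// put stores a value. When inserting a new key into a full cache it first
// drops expired entries, then evicts one arbitrary entry if still full.
func (c *ttlCache) put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	if _, exists := c.items[key]; !exists && len(c.items) >= c.maxSize {
		for k, e := range c.items {
			if !now.Before(e.expires) {
				delete(c.items, k)
			}
		}
		if len(c.items) >= c.maxSize {
			for k := range c.items {
				delete(c.items, k)
				break
			}
		}
	}
	c.items[key] = entry{value: value, expires: now.Add(c.ttl)}
}

func main() {
	c := newTTLCache(time.Hour, 2)
	c.put("a.service", "simple")
	c.put("b.service", "oneshot")
	c.put("c.service", "notify") // cache is full, so one older entry is evicted
	v, ok := c.get("c.service")
	fmt.Println(len(c.items), v, ok) // 2 notify true
}
```

Arbitrary eviction keeps the sketch short; an LRU policy would retain hot entries better at the cost of extra bookkeeping.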

RamanKharchee approved these changes Feb 23, 2026

blue4209211 approved these changes Feb 23, 2026

blue4209211 merged commit d0f0bcb into test Feb 23, 2026
4 checks passed

blue4209211 merged 4 commits into test from main

3 participants
