chore: main to test#202

Merged
blue4209211 merged 4 commits into test from main
Feb 23, 2026

Conversation

@mayankpande88

Contributor

No description provided.

…200)

* fix: resolve external IPs to FQDNs in metric labels using DNS cache

DestinationKey is set at TCP connection time, but DNS may not be cached
yet due to per-CPU perf buffer race conditions. This causes all metric
labels (destination, destination_workload_name) to show raw IPs instead
of FQDNs for external services.

Add enrichDestinationKey() that checks the DNS cache at metric emission
time and substitutes the FQDN when available. Applied to all TCP metrics
(bytes, connections, retransmits) and all L7 protocol metrics (HTTP,
Postgres, Redis, etc).

* fix: aggregate connection stats by enriched key to prevent duplicate metrics

Multiple IPs resolving to the same FQDN (e.g., Google's shared IPs for
monitoring.googleapis.com) caused duplicate metric errors when enriched
to the same label set. Aggregate stats by enriched DestinationKey before
emitting TCP metrics.
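
A minimal sketch of the enrich-then-aggregate idea described in the two commits above. The types, field names, and the map-based DNS cache are simplified assumptions for illustration; they are not the agent's actual `DestinationKey` or `enrichDestinationKey()` definitions.

```go
package main

import "fmt"

// DestinationKey is a hypothetical, simplified stand-in for the agent's key type.
type DestinationKey struct {
	Host string // raw IP or FQDN
	Port uint16
}

// ConnStats holds illustrative per-destination counters.
type ConnStats struct {
	BytesSent     uint64
	BytesReceived uint64
}

// enrich returns the key with its IP replaced by the cached FQDN, if one is known.
// The DNS cache is modeled as a plain IP -> FQDN map (an assumption).
func enrich(k DestinationKey, dnsCache map[string]string) DestinationKey {
	if fqdn, ok := dnsCache[k.Host]; ok {
		k.Host = fqdn
	}
	return k
}

// aggregateByEnrichedKey merges stats whose keys enrich to the same label set,
// so each label set appears exactly once in the result.
func aggregateByEnrichedKey(stats map[DestinationKey]ConnStats, dnsCache map[string]string) map[DestinationKey]ConnStats {
	out := make(map[DestinationKey]ConnStats, len(stats))
	for k, s := range stats {
		ek := enrich(k, dnsCache)
		agg := out[ek]
		agg.BytesSent += s.BytesSent
		agg.BytesReceived += s.BytesReceived
		out[ek] = agg
	}
	return out
}

func main() {
	// Two IPs that resolve to the same FQDN, plus one unresolved IP.
	dns := map[string]string{
		"203.0.113.1": "monitoring.example.test",
		"203.0.113.2": "monitoring.example.test",
	}
	stats := map[DestinationKey]ConnStats{
		{Host: "203.0.113.1", Port: 443}: {BytesSent: 100, BytesReceived: 10},
		{Host: "203.0.113.2", Port: 443}: {BytesSent: 50, BytesReceived: 5},
		{Host: "10.0.0.7", Port: 5432}:   {BytesSent: 7, BytesReceived: 3},
	}
	for k, s := range aggregateByEnrichedKey(stats, dns) {
		fmt.Printf("%s:%d sent=%d recv=%d\n", k.Host, k.Port, s.BytesSent, s.BytesReceived)
	}
}
```

Summing before emission means two IP-based entries that map to one FQDN produce a single sample instead of two samples with identical labels.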

* fix: migrate connection DestinationKey when DNS becomes available

Root cause fix for duplicate metric errors. When a TCP connection opens
before DNS is cached (per-CPU perf buffer race), the DestinationKey
stores the raw IP. Later connections get an FQDN-based key. Both enrich
to the same FQDN at emit time, causing "collected before with same label
values" errors.

migrateConnectionKeyIfNeeded updates conn.DestinationKey in-place and
migrates connectionStats from the old IP-based key to the FQDN-based
key. Called from updateConnectionTrafficStats and onL7RequestWithResult.
Collect() aggregation kept as safety net for stale entries.
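
The migration step could look roughly like the sketch below. It assumes a simplified connection type, a map-based DNS cache, and a stats map keyed by destination; `migrateKeyIfNeeded` is an illustrative name, not the actual `migrateConnectionKeyIfNeeded` implementation.

```go
package main

import "fmt"

// DestinationKey, ConnStats and Conn are hypothetical simplified types.
type DestinationKey struct {
	Host string
	Port uint16
}

type ConnStats struct{ Bytes uint64 }

type Conn struct {
	RemoteIP string
	Key      DestinationKey
}

// migrateKeyIfNeeded rewrites c.Key in place once an FQDN for the remote IP is
// cached, and moves any stats stored under the old key to the new key.
// It reports whether a migration happened.
func migrateKeyIfNeeded(c *Conn, stats map[DestinationKey]*ConnStats, dnsCache map[string]string) bool {
	fqdn, ok := dnsCache[c.RemoteIP]
	if !ok || c.Key.Host == fqdn {
		return false // DNS still unknown, or the key is already FQDN-based
	}
	oldKey := c.Key
	newKey := DestinationKey{Host: fqdn, Port: oldKey.Port}
	c.Key = newKey
	if old, found := stats[oldKey]; found {
		if cur, exists := stats[newKey]; exists {
			cur.Bytes += old.Bytes // merge into the existing FQDN entry
		} else {
			stats[newKey] = old
		}
		delete(stats, oldKey)
	}
	return true
}

func main() {
	dns := map[string]string{}
	conn := &Conn{RemoteIP: "203.0.113.1", Key: DestinationKey{Host: "203.0.113.1", Port: 443}}
	stats := map[DestinationKey]*ConnStats{conn.Key: {Bytes: 100}}

	fmt.Println(migrateKeyIfNeeded(conn, stats, dns)) // DNS not cached yet

	dns["203.0.113.1"] = "api.example.test" // DNS response arrives later
	fmt.Println(migrateKeyIfNeeded(conn, stats, dns), conn.Key.Host, len(stats))
}
```

In this sketch the old entry is deleted as soon as one connection migrates. Other connections that still hold the IP-based key would migrate on their next traffic update, which is one reason an aggregation pass at collection time remains useful as a safety net.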

* fix: use minimal structs and O(1) pod lookup in ip_resolver

Replace full K8s API objects with minimal structs (MinimalOwnerInfo,
MinimalService, MinimalNode) to reduce memory ~10x per resource.
Add PodNameIndex for O(1) ResolvePodOwner instead of O(pods) scan.
Deduplicate service resolution into resolveServiceWorkload().
Simplify getControllerOfOwner from 7 switch cases to unified lookup.
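
To illustrate the O(1) lookup, here is a sketch of a name index that maps namespace/name to a pod UID and then to a small owner record. The struct fields and method names are assumptions; only the names `MinimalOwnerInfo`, `PodNameIndex`, and `ResolvePodOwner` come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// podRef identifies a pod by namespace and name (hypothetical key type).
type podRef struct{ Namespace, Name string }

// MinimalOwnerInfo keeps only the fields needed for owner resolution.
// The exact field set here is an assumption.
type MinimalOwnerInfo struct {
	Kind string
	Name string
}

// clusterIndex is an illustrative stand-in for the cluster snapshot.
type clusterIndex struct {
	mu           sync.RWMutex
	podNameIndex map[podRef]string           // namespace/name -> pod UID
	owners       map[string]MinimalOwnerInfo // pod UID -> owner
}

func newClusterIndex() *clusterIndex {
	return &clusterIndex{
		podNameIndex: make(map[podRef]string),
		owners:       make(map[string]MinimalOwnerInfo),
	}
}

// AddPod records the pod UID under its namespace/name and stores its owner.
func (c *clusterIndex) AddPod(namespace, name, uid string, owner MinimalOwnerInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.podNameIndex[podRef{namespace, name}] = uid
	c.owners[uid] = owner
}

// ResolvePodOwner does two map lookups instead of scanning every pod.
func (c *clusterIndex) ResolvePodOwner(namespace, name string) (MinimalOwnerInfo, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	uid, ok := c.podNameIndex[podRef{namespace, name}]
	if !ok {
		return MinimalOwnerInfo{}, false
	}
	owner, ok := c.owners[uid]
	return owner, ok
}

func main() {
	idx := newClusterIndex()
	idx.AddPod("prod", "api-7d9f", "uid-1", MinimalOwnerInfo{Kind: "Deployment", Name: "api"})
	owner, ok := idx.ResolvePodOwner("prod", "api-7d9f")
	fmt.Println(owner.Kind, owner.Name, ok)
}
```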

* fix: eliminate duplicate HTTP parsing and remove dead code

Remove 3 unused functions (ParseHTTPResponse, ParseHttpResponse,
ParseHostFromHttpRequest). Fix parseRequest() to call ParseHTTPRequest
once instead of both ParseHttp + ParseHTTPRequest. Pass pre-parsed
method/path to observe() instead of re-parsing payload a third time.
Fix pre-existing RawPath bug in ParseHTTPRequest URL construction.

* fix: race conditions, missing informer, stale IP cleanup, and eBPF bounds checks

- Fix concurrent map access in getMounts/getListens/ping (copy c.processes under lock)
- Fix conn.Closed written outside lock in onConnectionClose
- Add missing CronJob informer handler for owner chain resolution
- Fix InstanceMeta missing Instance field in initial snapshot
- Fix storeWorkloadsIP check-then-act race with mutex
- Clean stale ClusterIPs on Service update and old pod IPs on Pod update
- Add bounds checks in memcached/redis eBPF parsers to prevent OOB reads
- Remove misleading defer req.Body.Close() in ParseHTTPRequest
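
The "copy under lock" fix from the first bullet follows a common Go pattern, sketched below with a hypothetical `Container` type whose `processes` map is reduced to PID-to-name pairs. Callers iterate over the snapshot, so writers can keep mutating the original map without a data race.

```go
package main

import (
	"fmt"
	"sync"
)

// Container is a hypothetical, reduced version of the agent's container type.
type Container struct {
	mu        sync.Mutex
	processes map[uint32]string // PID -> process name (simplified)
}

// AddProcess mutates the map while holding the lock.
func (c *Container) AddProcess(pid uint32, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.processes[pid] = name
}

// snapshotProcesses copies the map under the lock. The caller can then range
// over the copy without holding the lock.
func (c *Container) snapshotProcesses() map[uint32]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[uint32]string, len(c.processes))
	for pid, name := range c.processes {
		out[pid] = name
	}
	return out
}

func main() {
	c := &Container{processes: make(map[uint32]string)}
	c.AddProcess(1, "init")
	snap := c.snapshotProcesses()
	c.AddProcess(2, "worker") // does not affect the snapshot
	fmt.Println(len(snap), len(c.snapshotProcesses()))
}
```

The copy costs one allocation per call, which is usually acceptable for slow paths such as mount or listen discovery; a hot path might prefer a read lock held for the duration of the iteration instead.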

* fix: correct ssl_st struct offsets for OpenSSL 3.x TLS capture

In OpenSSL 3.x, the SSL struct was split into ssl_st (base) and
ssl_connection_st (connection data). rbio/wbio moved from offset
16/24 to 80/88. The eBPF code was reading garbage pointers for the
BIO, causing FD extraction to fail (CONN_NOT_FOUND with invalid FDs
like 2868022864). HTTP payload was captured correctly since it comes
from function parameters, but the connection couldn't be matched.

Adds GET_FD_V3 macro that reads rbio/wbio at the correct offsets
for OpenSSL 3.x (verified against OpenSSL 3.5.4 with a test program).

* fix: prefer system libssl over psycopg2 bundled lib for TLS uprobes

Processes like the notification server load both psycopg2's bundled
libssl-*.so and the system /usr/lib/libssl.so.3. The agent was picking
the first match from /proc/pid/maps (psycopg2's), but Python's ssl
module uses the system lib for HTTPS. This caused SSL uprobes to fire
on the wrong library, missing Slack/Teams API calls.
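
One way to express "prefer the system libssl" is sketched below. The heuristic (a basename starting with `libssl.so` counts as the system library, any other `libssl*` is a fallback) is an assumption made for illustration; the commit does not state the exact rule the agent uses. The input is the text of `/proc/<pid>/maps`, where the sixth whitespace-separated field is the mapped path.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// pickLibssl scans /proc/<pid>/maps content and returns the libssl path to
// attach uprobes to. It prefers an unhashed "libssl.so*" name and falls back
// to the first bundled "libssl-*" match. Paths containing spaces are not
// handled in this sketch.
func pickLibssl(maps string) string {
	var fallback string
	for _, line := range strings.Split(maps, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 6 {
			continue // anonymous mapping or malformed line
		}
		p := fields[5]
		base := path.Base(p)
		if !strings.HasPrefix(base, "libssl") {
			continue
		}
		if strings.HasPrefix(base, "libssl.so") {
			return p // system-style name wins immediately
		}
		if fallback == "" {
			fallback = p
		}
	}
	return fallback
}

func main() {
	maps := `7f00-7f10 r-xp 00000000 08:01 11 /app/site-packages/psycopg2_binary.libs/libssl-0331cfe8.so.1.1
7f20-7f30 r-xp 00000000 08:01 12 /usr/lib/libssl.so.3
7f40-7f50 rw-p 00000000 00:00 0`
	fmt.Println(pickLibssl(maps))
}
```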

* fix: reduce log noise by raising debug log levels

Per-event hot-path logs (L7_EVENT_REGISTRY, CONTAINER_FOUND,
TIMESTAMP_MISMATCH) were at V(2), producing ~15K lines/min with
KLOG_V=2. Raised to V(5). Per-request logs (HTTP2_COMPLETED, LLM_TRACK)
raised to V(4). Occasional logs raised to V(3). TLS per-process debug
logs moved from unconditional Infof to V(2)/V(3).

* fix: align L7 metric labels with TCP and remove dead connection fields

- Rename L7 labels to match TCP: destination_region → destination_workload_region,
  destination_az → destination_workload_az, destination_instance → actual_destination_instance
- Add missing actual_destination_region and actual_destination_az labels to L7 metrics
- Fix L7 destination_workload_region/az to use destinationWorkload (was using actualDestWorkload)
- Remove dead fields from ConnectionKey (srcWorkload, dstWorkload, actualDestWorkload — never set)
- Remove dead fields from ActiveConnection (dstWorkload, actualDestWorkload — set but never read)
- Remove redundant DNS re-enrichment in trace creation (migrateConnectionKeyIfNeeded already handles it)

* style: fix gofmt formatting in container.go and http.go
@gemini-code-assist


Summary of Changes

Hello @mayankpande88, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a series of enhancements aimed at improving the node agent's efficiency, accuracy, and robustness. Significant work has been done to optimize Kubernetes resource handling, refine L7 tracing for better metric aggregation and FQDN resolution, and bolster the reliability of eBPF uprobe attachments across various runtimes. Additionally, systemd service detection and filtering have been improved, and the build process updated for better compatibility.

Highlights

  • Dockerfile and Build Process: Updated the Dockerfile to use debian:bullseye as the base image and manually install Go 1.24.9, addressing potential glibc compatibility issues with the official Golang image.
  • Kubernetes IP Resolver Optimization: Refactored the Kubernetes IP resolver to use minimal structs for OwnerInfo, Service, and Node within the clusterSnapshot, significantly reducing memory footprint. A PodNameIndex was also added for faster pod lookups by name and namespace.
  • cgroup Path Recognition and Systemd Service Handling: Expanded cgroup path regex to include kube.slice and azure.slice for improved Kubernetes and Azure service detection. Systemd property retrieval was refactored using a dedicated D-Bus client and a new SystemdProperties struct, enabling better introspection and the ability to skip well-known system services.
  • L7 Tracing and Metric Aggregation Improvements: Enhanced L7 tracing by implementing connection key enrichment and migration. This resolves IP-based workload names to FQDNs when DNS information becomes available, preventing duplicate metrics for connections to the same FQDN via different IPs. LLM-specific metrics were removed from the Container struct, and logging verbosity for L7 events was adjusted.
  • eBPF Uprobe Attachment Robustness: Introduced new ELFFile and Symbol utilities to dynamically discover function entry and return points within binaries. This improves the reliability of uprobe and uretprobe attachments for Go TLS, Python, and Node.js applications, especially for OpenSSL 3.x, making them more resilient to compiler optimizations and varying library versions.
  • GPU Monitoring and Configuration: Added a new command-line flag, --disable-gpu-monitoring, allowing users to optionally disable GPU monitoring via NVML.

Changelog
  • Dockerfile
    • Updated base image from golang:1.24.7-trixie to debian:bullseye.
    • Manually installed Go 1.24.9 to address glibc compatibility issues.
  • cgroup/cgroup.go
    • Expanded systemSliceIdRegexp to include kube.slice and azure.slice.
    • Added systemd to the list of prefixes for standalone process detection.
  • cgroup/cgroup_test.go
    • Added new test cases for kube.slice, azure.slice, and .scope paths in containerByCgroup.
  • common/ip_resolver.go
    • Removed import of k8s.io/api/batch/v1beta1.
    • Introduced MinimalOwnerInfo, MinimalService, and MinimalNode structs to store only essential fields.
    • Updated clusterSnapshot to use the new minimal structs and added PodNameIndex for efficient pod lookups.
    • Added ipsMapMu mutex to protect concurrent access to ipsMap.
    • Refactored addServiceHandlers to use MinimalService and directly store resolved workloads.
    • Modified handlePodAdd to store pod UIDs in PodNameIndex.
    • Updated handleNodeEvent to store MinimalNode objects.
    • Adjusted getFullClusterSnapshot to populate minimal structs and PodNameIndex.
    • Rewrote updateIpMapping to use MinimalService and resolveServiceWorkload.
    • Updated getControllerOfOwner to work with MinimalOwnerInfo and a generic sync.Map lookup.
    • Simplified ResolvePodOwner to use PodNameIndex for direct lookup.
  • common/net.go
    • Removed debug logging for NewDestinationKey.
    • Added WithResolvedDomain method to DestinationKey for FQDN enrichment.
  • containers/container.go
    • Removed systemdTriggeredBy field from ContainerMetadata and replaced it with a SystemdProperties struct.
    • Removed LLMStats struct and llmStats map from Container.
    • Simplified ConnectionKey and ActiveConnection structs by removing redundant workload fields.
    • Updated Collect method to use SystemdProperties for container info metrics.
    • Implemented aggregation of connection stats by enriched destination key to prevent duplicate metrics.
    • Adjusted logging verbosity for L7 connection events.
    • Added enrichDestinationKey and migrateConnectionKeyIfNeeded methods to handle FQDN resolution for connection keys.
    • Removed redundant DNS enrichment logic from onL7RequestWithResult as it's now handled by migrateConnectionKeyIfNeeded.
    • Updated l7Stats.observe calls to pass method and path arguments.
    • Removed HTTP/2 LLM tracking logic from onL7RequestWithResult as it's handled by the stream tracker.
    • Adjusted logging verbosity for HTTP2_COMPLETED_REQUEST and HTTP2_CONNECTIONLESS events.
    • Added thread-safe copying of c.processes and c.listens maps before iteration in getMounts and getListens.
  • containers/containerd.go
    • Updated JSON unmarshalling to use GetValue() instead of Value for OCI spec and metadata.
  • containers/http_processor.go
    • Improved HTTP request parsing by using l7.ParseHTTPRequest first, falling back to l7.ParseHttp for malformed URIs.
  • containers/journald.go
    • Removed dependency on cgroup package.
    • Changed JournaldSubscribe and JournaldUnsubscribe to use unit string instead of cgroup.Cgroup.
    • Added journaldPollTimeout constant and improved follow loop to handle inotify limitations.
  • containers/l7.go
    • Added mu (mutex) to L7Stats for thread-safe access.
    • Updated observe method signature to include path and changed receiver to pointer.
    • Adjusted label values for metrics to include destination_workload_region and destination_workload_az.
    • Modified observe to use method and path arguments for HTTP protocol.
    • Added mutex locks around L7Stats map access in observe and collect methods.
    • Changed ensureInitialized and collect receivers to pointers.
  • containers/llm_stream.go
    • Adjusted logging verbosity for LLM_STREAM_START and LLM_STREAM_FIRST_TOKEN events.
  • containers/metrics.go
    • Removed LLMRequests, LLMTokensUsed, and LLMLatency metrics.
    • Added systemd_type label to ContainerInfo metric.
  • containers/process.go
    • Improved instrumentPython to handle empty cmdFields gracefully.
    • Modified Close method to close uprobes in a separate goroutine to prevent blocking.
  • containers/registry.go
    • Removed redundant klog.Infoln for TCP connection errors from unknown containers.
    • Adjusted logging verbosity for L7 events in handleEvents, processL7Event, queueL7EventForRetry, L7_EVENT_EXPIRED, L7_EVENT_MAX_RETRIES, and L7_EVENT_RETRY_SUCCESS.
    • Added a check in getOrCreateContainer to return nil if cgroup ID ends with (deleted).
    • Implemented skipping of systemd system services based on the SkipSystemdSystemServices flag.
    • Updated getContainerMetadata to use the new getSystemdProperties function.
  • containers/systemd.go
    • Removed init function for D-Bus connection.
    • Introduced DbusClient struct for managing D-Bus connections and caching properties.
    • Added SystemdProperties struct to encapsulate unit, triggered by, and type information.
    • Implemented IsEmpty() and IsSystemService() methods for SystemdProperties.
    • Refactored SystemdTriggeredBy into getSystemdProperties to retrieve full systemd properties.
    • Added systemServicePrefixes for identifying common system services.
  • ebpftracer/ebpf/l7/memcached.c
    • Added a buffer size check (buf_size < 5) before reading memcached response to prevent out-of-bounds access.
  • ebpftracer/ebpf/l7/openssl.c
    • Added SSL_V3_RBIO_OFFSET and SSL_V3_WBIO_OFFSET macros for OpenSSL 3.x.
    • Introduced GET_FD_V3 macro to correctly extract file descriptors for OpenSSL 3.x.
    • Updated openssl_SSL_write_enter_v3_0, openssl_SSL_read_enter_v3_0, and openssl_SSL_read_ex_enter_v3_0 to use GET_FD_V3 and ensure connection tracking.
  • ebpftracer/ebpf/l7/redis.c
    • Added a buffer size check (buf_size < 3) before reading redis response to prevent out-of-bounds access.
  • ebpftracer/elf.go
    • Added new file elf.go containing ELFFile and Symbol structs for robust ELF parsing, symbol lookup, and dynamic uprobe/uretprobe attachment based on instruction analysis.
  • ebpftracer/l7/http.go
    • Added RawPath to url.URL in ParseHTTPRequest.
    • Removed ParseHostFromHttpRequest and ParseHTTPResponse functions.
  • ebpftracer/nodejs.go
    • Refactored AttachNodejsProbes to use attachNodejsUprobes helper function.
    • Updated attachNodejsUprobes to leverage ELFFile and Symbol for more reliable uprobe and uretprobe attachments.
  • ebpftracer/python.go
    • Updated muslRegexp to ld-musl[.-] for better matching.
    • Refactored AttachPythonThreadLockProbes to use attachPythonUprobes helper function.
    • Updated attachPythonUprobes to leverage ELFFile and Symbol for more reliable uprobe and uretprobe attachments.
  • ebpftracer/tls.go
    • Removed getReturnOffsets function.
    • Refactored AttachOpenSslUprobes and AttachGoTlsUprobes to use the new ELFFile and Symbol utilities for dynamic symbol resolution and uprobe/uretprobe attachment.
    • Adjusted logging verbosity for Go TLS attachment attempts and filtered errors.
  • flags/flags.go
    • Added DisableGPUMonitoring flag to control GPU monitoring.
    • Added SkipSystemdSystemServices flag to skip monitoring of common systemd services.
    • Added LogPatternsPerContainer flag to limit unique log patterns.
  • go.mod
    • Updated github.com/cilium/ebpf to v0.20.0.
    • Updated github.com/containerd/cgroups to v1.1.0.
    • Updated github.com/containerd/containerd to v1.7.29.
    • Updated github.com/coreos/go-systemd/v22 to v22.7.0.
    • Updated github.com/godbus/dbus/v5 to v5.2.2.
    • Updated github.com/opencontainers/runtime-spec to v1.1.0.
    • Updated golang.org/x/sys to v0.40.0.
    • Updated golang.org/x/tools to v0.38.0.
    • Removed cloud.google.com/go/compute/metadata dependency.
    • Updated various indirect dependencies.
  • gpu/gpu.go
    • Added flags import.
    • Added a check for flags.DisableGPUMonitoring in NewCollector to optionally skip GPU initialization.
    • Added a nil check for c.iface in Close method.
  • install.sh
    • Updated ENV_VARS regex to include DISABLE_GPU_MONITORING.
  • logs/journald_reader.go
    • Added journaldPollTimeout constant.
    • Improved error logging for journald initialization and usage checks.
    • Modified follow loop to handle inotify limitations by sleeping if journal.Wait returns 0.
    • Changed subscriber key from SD_JOURNAL_FIELD_SYSTEMD_CGROUP to SD_JOURNAL_FIELD_SYSTEMD_UNIT.
  • node/metadata/gcp.go
    • Removed dependency on cloud.google.com/go/compute/metadata.
    • Implemented custom GCP metadata retrieval using net/http client with network namespace execution.
  • node/net.go
    • Expanded netDeviceFilterRe regex to include br (bridge) and vmbr (VM bridge) devices.
    • Made netDeviceFilter case-insensitive.
  • node/net_test.go
    • Updated TestNetDeviceFilter with new test cases for enX0, br0, and vmbr0.
  • proc/fd.go
    • Adjusted error handling in ReadFds to only log warnings for non-os.IsNotExist errors.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request introduces significant refactoring and improvements across several components. The Dockerfile is updated to use Debian Bullseye as a base image, manually installing Go 1.24.9 and build tools to address glibc compatibility issues. The cgroup parsing logic is enhanced to recognize kube.slice and azure.slice for systemd services and to classify .scope cgroups as standalone processes, with corresponding unit tests added.

The Kubernetes IP resolver (common/ip_resolver.go) undergoes a major overhaul to reduce memory footprint by storing only minimal necessary fields from Kubernetes objects (Pods, Nodes, ReplicaSets, DaemonSets, StatefulSets, Jobs, Services, Deployments, CronJobs), adding CronJob support, and optimizing lookup logic with a new PodNameIndex. The common/net.go file removes debug logging and adds a helper for DestinationKey to update with resolved FQDNs.

The containers/container.go file sees extensive changes, including the removal of LLMStats and simplification of connection structs. It addresses duplicate metrics by enriching destination keys with FQDNs using new enrichDestinationKey and migrateConnectionKeyIfNeeded methods, and adjusts logging verbosity. Concurrent map access issues in getMounts, getListens, and ping are resolved using read locks.

Systemd integration is improved by using SystemdProperties struct to store unit, triggered-by, and type information, and journald subscriptions now use the systemd unit name. The containers/systemd.go file introduces a DbusClient with caching and retry logic for systemd properties, and defines SystemdProperties to encapsulate systemd service details, including a mechanism to skip well-known system services.

The ebpftracer package includes a new elf.go file for parsing ELF binaries and attaching uprobes/uretprobes by symbol name and return offsets, which is then utilized by nodejs.go, python.go, and tls.go for more robust and version-agnostic tracing. Minor fixes are applied to ebpftracer/ebpf/l7/memcached.c and redis.c for buffer size checks, and openssl.c is updated to support OpenSSL 3.x.

The flags/flags.go file adds a new flag to disable GPU monitoring and another to skip systemd system services. logs/journald_reader.go improves error logging and handling of journald polling. node/metadata/gcp.go modifies GCP metadata retrieval to use a custom HTTP client operating in the host network namespace. node/net.go expands the network device filter regex. Finally, go.mod updates several dependencies.

A review comment highlights a potential infinite loop in ebpftracer/elf.go during x86_64 instruction decoding if x86asm.Decode returns an error with ins.Len being 0, suggesting a fix to always advance the index. Another comment points out that the DbusClient cache in containers/systemd.go could lead to unbounded memory growth in environments with many transient services, recommending a size limit or TTL for cache entries.
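
The infinite-loop concern raised in the review can be illustrated with a sketch that always advances the scan index. The decoder is passed in as a function with a hypothetical signature, and `toyDecode` is a made-up decoder that understands only three opcodes; neither is the code in ebpftracer/elf.go, and the real x86asm decoder is not used here.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeFunc is a hypothetical decoder signature: it returns the instruction
// length, whether the instruction is a return, and an error.
type decodeFunc func(code []byte) (length int, isRet bool, err error)

// findReturnOffsets scans code and collects offsets of return instructions.
// On a decode error or a non-positive length it skips one byte, so the loop
// index always moves forward and the scan is guaranteed to terminate.
func findReturnOffsets(code []byte, decode decodeFunc) []int {
	var offsets []int
	for i := 0; i < len(code); {
		n, isRet, err := decode(code[i:])
		if err != nil || n <= 0 {
			i++ // always advance, even when the decoder reports length 0
			continue
		}
		if isRet {
			offsets = append(offsets, i)
		}
		i += n
	}
	return offsets
}

// toyDecode is an illustrative decoder: 0xC3 is a 1-byte return, 0x90 a
// 1-byte no-op, 0xB0 a 2-byte instruction; everything else is an error.
func toyDecode(code []byte) (int, bool, error) {
	if len(code) == 0 {
		return 0, false, errors.New("empty input")
	}
	switch code[0] {
	case 0xC3:
		return 1, true, nil
	case 0x90:
		return 1, false, nil
	case 0xB0:
		if len(code) < 2 {
			return 0, false, errors.New("truncated instruction")
		}
		return 2, false, nil
	default:
		return 0, false, errors.New("unknown opcode")
	}
}

func main() {
	// 0xFF is undecodable; the 0xC3 after 0xB0 is an operand, not a return.
	code := []byte{0x90, 0xFF, 0xB0, 0xC3, 0xC3}
	fmt.Println(findReturnOffsets(code, toyDecode))
}
```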

Comment thread ebpftracer/elf.go
Comment thread containers/systemd.go
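
The cache concern from the review (unbounded growth with many transient services) is commonly addressed with a TTL plus a size cap. The sketch below shows one such design; `ttlCache`, its fields, and the eviction policy are assumptions for illustration and are not the DbusClient implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

// ttlCache is a hypothetical bounded cache: entries expire after ttl, and the
// map never holds more than maxEntries items (maxEntries should be positive).
type ttlCache struct {
	mu         sync.Mutex
	ttl        time.Duration
	maxEntries int
	now        func() time.Time
	items      map[string]entry
}

func newTTLCache(ttl time.Duration, maxEntries int, now func() time.Time) *ttlCache {
	return &ttlCache{ttl: ttl, maxEntries: maxEntries, now: now, items: make(map[string]entry)}
}

// Get returns a cached value unless it is missing or expired.
func (c *ttlCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expires) {
		delete(c.items, key) // drop the expired entry lazily
		return "", false
	}
	return e.value, true
}

// Put stores a value. When the cache is full it first drops expired entries,
// then evicts arbitrary entries until there is room.
func (c *ttlCache) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := c.now()
	if _, exists := c.items[key]; !exists && len(c.items) >= c.maxEntries {
		for k, e := range c.items {
			if !now.Before(e.expires) {
				delete(c.items, k)
			}
		}
		for k := range c.items {
			if len(c.items) < c.maxEntries {
				break
			}
			delete(c.items, k)
		}
	}
	c.items[key] = entry{value: value, expires: now.Add(c.ttl)}
}

// Len reports the number of stored entries, including not-yet-collected expired ones.
func (c *ttlCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.items)
}

// manualClock is a controllable clock for demos and tests.
type manualClock struct{ t time.Time }

func (m *manualClock) Now() time.Time          { return m.t }
func (m *manualClock) Advance(d time.Duration) { m.t = m.t.Add(d) }

func main() {
	clk := &manualClock{}
	c := newTTLCache(time.Minute, 2, clk.Now)
	c.Put("a.service", "simple")
	c.Put("b.service", "oneshot")
	c.Put("c.service", "notify") // forces an eviction; size stays at 2
	fmt.Println(c.Len())
	clk.Advance(2 * time.Minute)
	_, ok := c.Get("c.service")
	fmt.Println(ok)
}
```

Arbitrary eviction keeps the sketch short; an LRU policy would retain hot entries better at the cost of extra bookkeeping.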
#203)

* fix: add missing patternsPerLevelLimit param to logparser.NewParser calls

* fix: update logparser dep, sanitize log paths, replace test credential

* fix: update logparser dep to fix sensitive pattern loading

The previous version had a corrupted sensitive_patterns.json that caused
"cannot unmarshal object into Go value of type []SensitivePattern" errors,
breaking sensitive data detection in log parsing.
@blue4209211 merged commit d0f0bcb into test on Feb 23, 2026
4 checks passed

3 participants