chore: main to test#202
Conversation
…200) * fix: resolve external IPs to FQDNs in metric labels using DNS cache DestinationKey is set at TCP connection time, but DNS may not be cached yet due to per-CPU perf buffer race conditions. This causes all metric labels (destination, destination_workload_name) to show raw IPs instead of FQDNs for external services. Add enrichDestinationKey() that checks the DNS cache at metric emission time and substitutes the FQDN when available. Applied to all TCP metrics (bytes, connections, retransmits) and all L7 protocol metrics (HTTP, Postgres, Redis, etc). * fix: aggregate connection stats by enriched key to prevent duplicate metrics Multiple IPs resolving to the same FQDN (e.g., Google's shared IPs for monitoring.googleapis.com) caused duplicate metric errors when enriched to the same label set. Aggregate stats by enriched DestinationKey before emitting TCP metrics. * fix: migrate connection DestinationKey when DNS becomes available Root cause fix for duplicate metric errors. When a TCP connection opens before DNS is cached (per-CPU perf buffer race), the DestinationKey stores the raw IP. Later connections get an FQDN-based key. Both enrich to the same FQDN at emit time, causing "collected before with same label values" errors. migrateConnectionKeyIfNeeded updates conn.DestinationKey in-place and migrates connectionStats from the old IP-based key to the FQDN-based key. Called from updateConnectionTrafficStats and onL7RequestWithResult. Collect() aggregation kept as safety net for stale entries. * fix: use minimal structs and O(1) pod lookup in ip_resolver Replace full K8s API objects with minimal structs (MinimalOwnerInfo, MinimalService, MinimalNode) to reduce memory ~10x per resource. Add PodNameIndex for O(1) ResolvePodOwner instead of O(pods) scan. Deduplicate service resolution into resolveServiceWorkload(). Simplify getControllerOfOwner from 7 switch cases to unified lookup. * fix: eliminate duplicate HTTP parsing and remove dead code Remove 3 unused functions (ParseHTTPResponse, ParseHttpResponse, ParseHostFromHttpRequest). Fix parseRequest() to call ParseHTTPRequest once instead of both ParseHttp + ParseHTTPRequest. Pass pre-parsed method/path to observe() instead of re-parsing payload a third time. Fix pre-existing RawPath bug in ParseHTTPRequest URL construction. * fix: race conditions, missing informer, stale IP cleanup, and eBPF bounds checks - Fix concurrent map access in getMounts/getListens/ping (copy c.processes under lock) - Fix conn.Closed written outside lock in onConnectionClose - Add missing CronJob informer handler for owner chain resolution - Fix InstanceMeta missing Instance field in initial snapshot - Fix storeWorkloadsIP check-then-act race with mutex - Clean stale ClusterIPs on Service update and old pod IPs on Pod update - Add bounds checks in memcached/redis eBPF parsers to prevent OOB reads - Remove misleading defer req.Body.Close() in ParseHTTPRequest * fix: correct ssl_st struct offsets for OpenSSL 3.x TLS capture In OpenSSL 3.x, the SSL struct was split into ssl_st (base) and ssl_connection_st (connection data). rbio/wbio moved from offset 16/24 to 80/88. The eBPF code was reading garbage pointers for the BIO, causing FD extraction to fail (CONN_NOT_FOUND with invalid FDs like 2868022864). HTTP payload was captured correctly since it comes from function parameters, but the connection couldn't be matched. Adds GET_FD_V3 macro that reads rbio/wbio at the correct offsets for OpenSSL 3.x (verified against OpenSSL 3.5.4 with a test program). * fix: prefer system libssl over psycopg2 bundled lib for TLS uprobes Processes like the notification server load both psycopg2's bundled libssl-*.so and the system /usr/lib/libssl.so.3. The agent was picking the first match from /proc/pid/maps (psycopg2's), but Python's ssl module uses the system lib for HTTPS. This caused SSL uprobes to fire on the wrong library, missing Slack/Teams API calls. * fix: reduce log noise by raising debug log levels Per-event hot-path logs (L7_EVENT_REGISTRY, CONTAINER_FOUND, TIMESTAMP_MISMATCH) were at V(2), producing ~15K lines/min with KLOG_V=2. Raised to V(5). Per-request logs (HTTP2_COMPLETED, LLM_TRACK) raised to V(4). Occasional logs raised to V(3). TLS per-process debug logs moved from unconditional Infof to V(2)/V(3). * fix: align L7 metric labels with TCP and remove dead connection fields - Rename L7 labels to match TCP: destination_region → destination_workload_region, destination_az → destination_workload_az, destination_instance → actual_destination_instance - Add missing actual_destination_region and actual_destination_az labels to L7 metrics - Fix L7 destination_workload_region/az to use destinationWorkload (was using actualDestWorkload) - Remove dead fields from ConnectionKey (srcWorkload, dstWorkload, actualDestWorkload — never set) - Remove dead fields from ActiveConnection (dstWorkload, actualDestWorkload — set but never read) - Remove redundant DNS re-enrichment in trace creation (migrateConnectionKeyIfNeeded already handles it) * style: fix gofmt formatting in container.go and http.go
Summary of ChangesHello @mayankpande88, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a series of enhancements aimed at improving the node agent's efficiency, accuracy, and robustness. Significant work has been done to optimize Kubernetes resource handling, refine L7 tracing for better metric aggregation and FQDN resolution, and bolster the reliability of eBPF uprobe attachments across various runtimes. Additionally, systemd service detection and filtering have been improved, and the build process updated for better compatibility. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request introduces significant refactoring and improvements across several components. The Dockerfile is updated to use Debian Bullseye as a base image, manually installing Go 1.24.9 and build tools to address glibc compatibility issues. The cgroup parsing logic is enhanced to recognize kube.slice and azure.slice for systemd services and to classify .scope cgroups as standalone processes, with corresponding unit tests added. The Kubernetes IP resolver (common/ip_resolver.go) undergoes a major overhaul to reduce memory footprint by storing only minimal necessary fields from Kubernetes objects (Pods, Nodes, ReplicaSets, DaemonSets, StatefulSets, Jobs, Services, Deployments, CronJobs), adding CronJob support, and optimizing lookup logic with a new PodNameIndex. The common/net.go file removes debug logging and adds a helper for DestinationKey to update with resolved FQDNs. The containers/container.go file sees extensive changes, including the removal of LLMStats and simplification of connection structs. It addresses duplicate metrics by enriching destination keys with FQDNs using new enrichDestinationKey and migrateConnectionKeyIfNeeded methods, and adjusts logging verbosity. Concurrent map access issues in getMounts, getListens, and ping are resolved using read locks. Systemd integration is improved by using SystemdProperties struct to store unit, triggered-by, and type information, and journald subscriptions now use the systemd unit name. The containers/systemd.go file introduces a DbusClient with caching and retry logic for systemd properties, and defines SystemdProperties to encapsulate systemd service details, including a mechanism to skip well-known system services. The ebpftracer package includes a new elf.go file for parsing ELF binaries and attaching uprobes/uretprobes by symbol name and return offsets, which is then utilized by nodejs.go, python.go, and tls.go for more robust and version-agnostic tracing. Minor fixes are applied to ebpftracer/ebpf/l7/memcached.c and redis.c for buffer size checks, and openssl.c is updated to support OpenSSL 3.x. The flags/flags.go file adds a new flag to disable GPU monitoring and another to skip systemd system services. logs/journald_reader.go improves error logging and handling of journald polling. node/metadata/gcp.go modifies GCP metadata retrieval to use a custom HTTP client operating in the host network namespace. node/net.go expands the network device filter regex. Finally, go.mod updates several dependencies. A review comment highlights a potential infinite loop in ebpftracer/elf.go during x86_64 instruction decoding if x86asm.Decode returns an error with ins.Len being 0, suggesting a fix to always advance the index. Another comment points out that the DbusClient cache in containers/systemd.go could lead to unbounded memory growth in environments with many transient services, recommending a size limit or TTL for cache entries.
#203) * fix: add missing patternsPerLevelLimit param to logparser.NewParser calls * fix: update logparser dep, sanitize log paths, replace test credential * fix: update logparser dep to fix sensitive pattern loading The previous version had a corrupted sensitive_patterns.json that caused "cannot unmarshal object into Go value of type []SensitivePattern" errors, breaking sensitive data detection in log parsing.
No description provided.