From 06cd59f3100a7bdfd4a9ac2ac1c2adf449378be5 Mon Sep 17 00:00:00 2001
From: Joel Takvorian
Date: Tue, 10 Feb 2026 14:14:31 +0100
Subject: [PATCH 01/10] WIP: CNCF project GTR

---
 CNCF project - GTR.md | 328 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 328 insertions(+)
 create mode 100644 CNCF project - GTR.md

diff --git a/CNCF project - GTR.md b/CNCF project - GTR.md
new file mode 100644
index 0000000..4912b8c
--- /dev/null
+++ b/CNCF project - GTR.md
@@ -0,0 +1,328 @@
+
+
+# General Technical Review - NetObserv / Sandbox
+
+- **Project:** NetObserv
+- **Project Version:** 1.11 and above
+- **Website:** https://netobserv.io/
+- **Date Updated:** 2026-02-10
+- **Template Version:** v1.0
+- **Description:**
+
+NetObserv is a set of components used to observe network traffic by generating NetFlows from eBPF agents with zero instrumentation, enriching those flows using a Kubernetes-aware configurable pipeline, exporting them in various ways (logs, metrics, Kafka, IPFIX...), and finally providing a comprehensive visualization tool for making sense of that data, a network health dashboard, and a CLI. Those components are mainly designed to be deployed in Kubernetes via an integrated Operator, although they can also be used standalone.
+
+The enriched NetFlows consist of basic 5-tuple information (IPs, ports…), metrics (bytes, packets, drops, latency…), Kubernetes metadata (pods, namespaces, services, owners), cloud data (zones), CNI data (network policy events), DNS data (codes, qnames) and more.
+
+The Network Health dashboard comes with its own set of health information derived from NetObserv data, and can also integrate data from other or third-party components, or customized data from users. An API is also provided for users to fully customize the generated metrics for their own use (e.g. for customized alerts).
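+
+As an illustration of that metrics customization API, a hypothetical resource counting egress bytes per destination namespace could look like the sketch below (the API version, field names and label names are illustrative, following the FlowMetric documentation, and may differ across versions):
+
+```yaml
+apiVersion: flows.netobserv.io/v1alpha1
+kind: FlowMetric
+metadata:
+  name: egress-bytes-per-namespace
+  namespace: netobserv
+spec:
+  metricName: egress_bytes_per_namespace_total  # exposed as a Prometheus metric
+  type: Counter
+  valueField: Bytes            # sum flow bytes rather than counting flows
+  direction: Egress
+  labels: [DstK8S_Namespace]   # keep label cardinality low
+```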
+
+The CLI is a separate tool, independent from the Operator, that provides similar functionality but is tailored for on-demand monitoring (as opposed to 24/7), and adds packet capture (pcap) functionality.
+
+NetObserv is largely CNI-agnostic, although some specific features relate to a particular CNI (e.g. getting network events from ovn-kubernetes).
+
+
+## Day 0 - Planning Phase
+
+### Scope
+
+ * Describe the roadmap process, how scope is determined for mid to long term features, as well as how the roadmap maps back to current contributions and maintainer ladder?
+
+NetObserv is the upstream of Red Hat [Network Observability](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/network_observability/index) for OpenShift. As such, a large part of the roadmap comes from the requirements of that downstream product, while the upstream benefits from it equally (there are no downstream-only features).
+
+TBC...
+
+ * Describe the target persona or user(s) for the project?
+
+The project targets both cluster administrators and project teams. Cluster administrators have a cluster-wide view over all the network traffic, the full topology, and access to metrics and alerts. They can run packet captures, and they configure the cluster-scoped flow collection process.
+
+Through multi-tenancy, project teams have access to a subset of the traffic and the related topology. They have limited configuration options, such as per-namespace sampling or traffic flagging.
+
+ * Explain the primary use case for the project. What additional use cases are supported by the project?
+
+Observing the network runtime traffic with different levels of granularity and aggregation; receiving network health information such as saturation, degraded latency, DNS issues, etc. Troubleshooting network issues, narrowing down to specific pods or services, deep-diving into NetFlow data or pcap. Being alerted.
+
+With the OVN-Kubernetes CNI, network policy troubleshooting and network isolation (UDN) visualization.
+
+ * Explain which use cases have been identified as unsupported by the project.
+
+Currently, network policy troubleshooting with CNIs other than OVN-Kubernetes is not supported.
+
+L7 observability is not planned to date (no insight into HTTP-specific data such as error codes or URLs; NetObserv operates at a lower level).
+
+ * Describe the intended types of organizations who would benefit from adopting this project. (i.e. financial services, any software manufacturer, organizations providing platform engineering services)?
+
+All types of organizations may benefit from network observability.
+
+ * Please describe any completed end user research and link to any reports.
+
+### Usability
+
+* How should the target personas interact with your project?
+
+Configuration is done entirely through the CRD APIs, managed by a Kubernetes operator. It is GitOps-friendly. A web console is provided for network traffic and network health visualization. Metrics and alerts are provided for Prometheus, meaning that users can leverage their existing tooling if they already have it. A command-line interface tool is also provided, independently from the operator, allowing users to troubleshoot the network from the command line.
+
+* Describe the user experience (UX) and user interface (UI) of the project.
+
+The provided web console offers two views: Network Traffic (flow visualization) and Network Health (health rules and alerts visualization). The Network Traffic view itself consists of three subviews:
+ - an overview of the traffic, showing various charts
+ - a table displaying a flat list of network flows
+ - a topology graph
+
+In all these views, traffic can be filtered by any data (e.g. by pod, namespace, IP, port, drop cause, DNS error, etc.)
+
+In the traffic overview and topology, traffic can be aggregated at different levels (e.g.
per namespace, per cloud availability zone, etc.)
+
+Special attention is paid to the UX, with many small details: quickly filtering on a displayed element, stepping into an aggregated topology element, etc.
+
+* Describe how this project integrates with other projects in a production environment.
+
+NetObserv can generate many metrics, ingested by Prometheus, and alerting rules for Alertmanager. Users who already run them can leverage their existing setup.
+
+For comprehensive observability, NetObserv can also send the network flows to Grafana Loki, and/or export them to other systems by different means: using the IPFIX standard, the OpenTelemetry protocol (as logs or as metrics), or a Kafka broker. Those exporting options allow integration with many different systems (Splunk, Elasticsearch, etc.)
+
+In OpenShift, the web console comes as a plugin for the OpenShift Console, ensuring a smooth integration.
+
+In the future, we may investigate other UI integrations, such as with Headlamp.
+
+### Design
+
+ * Explain the design principles and best practices the project is following.
+
+The project design principles and best practices are broadly shared with many Red Hat products. The development philosophy is "upstream first", meaning that there is no hidden code or feature that only downstream users would get. In fact, there is not even a downstream-specific repository.
+
+All contributions happen on our GitHub repositories, which are public, and go through code review, automated testing, and generally manual testing. Special attention is given to performance: regressions are tracked with several tools based on kube-burner.
+
+We expect a reasonably high code quality standard, without being too picky on style matters. The goal is not to discourage new contributors.
+
+All architectural decisions are made with care, weighing benefits against drawbacks. For significant decisions, we expect a thoughtful discussion of the pros and cons.
One aspect that is often overlooked at first is the impact on the maintenance and support workloads.
+
+ * Outline or link to the project’s architecture requirements? Describe how they differ for Proof of Concept, Development, Test and Production environments, as applicable.
+
+??
+
+ * Define any specific service dependencies the project relies on in the cluster.
+
+Both the NetObserv operator and the `flowlogs-pipeline` component interact with the Kube API server to watch resources and, for the operator, to create or update them.
+
+As mentioned before, NetObserv has dependencies on Loki and Prometheus. NetObserv does not install either of them; they must be installed separately (except for Loki when configured in "demo" mode). The provided Helm chart includes those dependencies as optional, to simplify the installation, but they remain unmanaged. Using Loki is not required, though: it can be disabled in the configuration, in which case NetObserv relies solely on Prometheus metrics, losing precision in the process (data in Prometheus is more aggregated).
+
+Optionally, Kafka can be used at a pre-ingestion stage for a production-grade, high-availability deployment (e.g. using Strimzi).
+
+Finally, several services require TLS certificates, which are generally provided by cert-manager or OpenShift Service Certificates.
+
+ * Describe how the project implements Identity and Access Management.
+
+On the ingestion side, there is no Identity and Access Management beyond the components' service accounts themselves, associated with RBAC permissions.
+
+On the consuming side, NetObserv does not implement Identity and Access Management itself; however, all queries run against Loki or Prometheus forward the Authorization header, delegating this aspect to those backends. In a production-grade environment, Thanos and the Loki Operator can be used to enable multi-tenancy. This is how it is implemented in OpenShift.
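+
+As an illustration, granting a tenant read access in such a multi-tenant setup then boils down to standard Kubernetes RBAC. The sketch below binds a hypothetical group to a reader ClusterRole; both names are illustrative, check the NetObserv and Loki Operator documentation for the actual roles:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: team-a-netobserv-reader
+subjects:
+- kind: Group
+  name: team-a               # hypothetical tenant group
+  apiGroup: rbac.authorization.k8s.io
+roleRef:
+  kind: ClusterRole
+  name: netobserv-reader     # illustrative role name
+  apiGroup: rbac.authorization.k8s.io
+```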
+
+ * Describe how the project has addressed sovereignty.
+
+Being fully open-source addresses independence.
+
+NetObserv does not store any data directly; storage is delegated to Loki and/or Prometheus, and to the aforementioned exporting methods. All these options offer decent flexibility and interoperability in terms of storage, which should not cause any sovereignty blockers.
+
+ * Describe any compliance requirements addressed by the project.
+
+??
+
+Downstream builds are FIPS compliant (those build recipes are open-source as well).
+
+ * Describe the project’s High Availability requirements.
+
+High availability can be implemented by using Kafka deployment model (e.g. with Strimzi), and using an autoscaler for the `flowlogs-pipeline` component. Loki and Prometheus should be configured for high availability as well (this aspect is not managed by NetObserv itself; using Thanos and the Loki Operator can serve this purpose).
+
+ * Describe the project’s resource requirements, including CPU, Network and Memory.
+
+Resource requirements highly depend on the cluster network topology: how many nodes and pods you have, how much traffic, etc. While eBPF ensures a minimal impact on workload performance, the generated network flows can represent a significant amount of data, which impact CPU, memory and bandwitdh. Some [recommendations](https://github.com/netobserv/network-observability-operator/blob/main/config/descriptions/ocp.md#resource-considerations) are provided, but your mileage may and will vary. Some statistics are documented [here](https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/network_observability/configuring-network-observability-operators#network-observability-resource-recommendations_network_observability).
+
+Mitigating high resource requirements can be done in several ways, such as by increasing the sampling interval, adding filters, or considering whether to use Loki or not.
More information [here](https://github.com/netobserv/network-observability-operator/tree/main?tab=readme-ov-file#configuration).
+
+ * Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage.
+
+Storage is not directly managed by NetObserv; it is configured via Prometheus and/or Loki. Retention (TTL) is important to consider. Loki is often configured with an S3 storage backend, but other options exist, such as ODF. Just like memory, storage requirements highly depend on the cluster network topology, and can be mitigated in the same ways as mentioned above.
+
+ * Please outline the project’s API Design:
+
+NetObserv defines several APIs:
+- The [FlowCollector CRD](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md) contains the main, cluster-wide configuration for NetObserv.
+- The [FlowMetric CRD](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowMetric.md) allows users to entirely customize metrics derived from network flows.
+- The [FlowCollectorSlice CRD](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollectorSlice.md) allows cluster admins to delegate some of the configuration to project teams.
+- The [flows format reference](https://github.com/netobserv/network-observability-operator/blob/main/docs/flows-format.adoc) describes the structure and content of a network flow, which can be consumed in various ways.
+- Additionally, [there is some documentation](https://github.com/netobserv/network-observability-operator/blob/main/docs/HealthRules.md#creating-your-own-rules-that-contribute-to-the-health-dashboard) on how users can leverage the Network Health dashboard with customized metrics and alerts, involving a less formal API.
+
+The project CRDs follow the standard Kubernetes API conventions, as well as the OpenShift ones, on a best-effort basis. Deviating from them is not impossible but must be justified.
+
+The project is designed to work well out of the box with minimal configuration. This is especially true in OpenShift, thanks to its opinionated nature, but less so in other environments.
+
+The default configuration is designed to work well on small to mid-sized clusters, i.e. between roughly 5 and 50 nodes, with a default sampling interval set to 50 in order to limit resource usage (as opposed to an interval of 1, which would capture all the traffic). On bigger cluster topologies, careful tuning is recommended.
+
+A best effort is made to achieve security by default, but this is sometimes too dependent on the environment. For instance, while a network policy is installed by default in OpenShift, it is not when running in a different environment, as this may break with some CNIs. In that case, the network policy must be enabled explicitly, or the user can configure their own policy.
+
+Loki must be configured according to its installation, disabled, or enabled in "demo" mode. The Prometheus querier URL must be configured. It is recommended to enable the embedded network policy, or to install one. In OpenShift, Prometheus and the network policy are enabled and configured automatically.
+
+ * Describe any new or changed API types and calls \- including to cloud providers \- that will result from this project being enabled and used
+ * Describe compatibility of any new or changed APIs with API servers, including the Kubernetes API server
+ * Describe versioning of any new or changed APIs, including how breaking changes are handled
+
+The project release process is split between upstream and downstream releases. For both of them, content can be tracked from the repositories, which are public.
+
+Upstream releases happen from the `main` branches without a well-defined cadence. They use GitHub workflows to generate images and artifacts, triggered by git tags. Versions are suffixed with `-community`, e.g. `v1.11.0-community`.
A Helm chart is manually updated after each component is released. The release process is described [here](https://github.com/netobserv/network-observability-operator/blob/main/RELEASE.md).
+
+Downstream releases happen from release branches (e.g. `release-1.11`) and use Konflux / Tekton. They produce an OLM bundle and OLM catalog fragments. They are loosely aligned with OpenShift releases.
+
+Versioning upstream and downstream is aligned on "major.minor", but not necessarily on ".patch". For instance, downstream `v1.2.3` and `v1.2.3-community` should have the same features (in `1.2`) but not necessarily the same fixes/patches (in `.3`).
+
+### Installation
+
+Upstream releases can be installed via Helm, as [documented here](https://github.com/netobserv/network-observability-operator/blob/main/README.md#getting-started). From a fresh/vanilla cluster (e.g. using KIND), it can be done in 5 commands (installing cert-manager, installing NetObserv, configuring a `FlowCollector`).
+
+Testing and validating the installation can be done by port-forwarding the web console URL and checking its content. This is described in the same link above.
+
+### Security
+
+
+
+## Day 1 \- Installation and Deployment Phase
+
+### Project Installation and Configuration
+
+
+
+### Project Enablement and Rollback
+
+
+
+### Rollout, Upgrade and Rollback Planning
+
+
+
+## Day 2 \- Day-to-Day Operations Phase
+
+### Scalability/Reliability
+
+
+
+Load tests are performed regularly on different cluster sizes (25 and 250 nodes) to track any performance regression, using prow and kube-burner-ocp. Not all configurations can be tested this way, so the focus is set on the high end of production-grade installations, with Kafka, the Loki Operator, all features enabled, and maximum sampling (capturing all the traffic).
+
+[This page](https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/network_observability/configuring-network-observability-operators) shows a short summary of these tests, along with resource limit recommendations. More information can be obtained from prow runs, which are publicly available ([here's an example](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-eng-ocp-qe-perfscale-ci-netobserv-perf-tests-netobserv-aws-4.21-nightly-x86-node-density-heavy-25nodes/2020627868538638336/artifacts/node-density-heavy-25nodes/openshift-qe-orion/artifacts/data-netobserv-perf-node-density-heavy-AWS-25w.csv)).
+
+
+### Observability Requirements
+
+
+
+NetObserv's own observability relies heavily on Prometheus metrics and, to a lesser extent, on unstructured logs and profiling. There is no plan at this time to bake tracing or structured logs directly into the code.
+
+Of the four components (the eBPF agent, flowlogs-pipeline, the web console and the operator itself), the eBPF agent and flowlogs-pipeline are the two most critical to observe. They both provide metrics such as:
+- Error counters, labeled by code and component.
+- Gauges tracking persistent data structure sizes.
+- Message/event counters.
+- Some histograms tracking operation latency.
+
+In OpenShift, a Health dashboard is provided to track the most meaningful metrics, along with more general ones (CPU, memory, file descriptors, goroutines...). For non-OpenShift environments, a similar dashboard could be created.
+
+Two Prometheus alerting rules are created to detect the absence of flows: one for flows received by flowlogs-pipeline, the other for flows written to Loki. Those alerts fire when something prevents NetObserv from running normally.
+
+In addition to the metrics, potential configuration or deployment issues are reported as FlowCollector Conditions by the operator.
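+
+In the same spirit as the built-in alerts, users can define their own rules through standard PrometheusRule resources. A hypothetical sketch, where the metric name and thresholds are illustrative rather than taken from the actual flowlogs-pipeline metrics:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: netobserv-custom-alerts
+  namespace: netobserv
+spec:
+  groups:
+  - name: netobserv-health
+    rules:
+    - alert: NetObservNoFlows
+      # Illustrative metric name: check the flowlogs-pipeline metrics reference
+      expr: rate(netobserv_ingest_flows_processed[5m]) == 0
+      for: 10m
+      labels:
+        severity: warning
+      annotations:
+        summary: NetObserv is not receiving any flows
+```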
+ +Profiling (pprof) can be enabled by configuring ports in FlowCollector. It triggers a restart of the profiled workloads. + +### Dependencies + + + +### Troubleshooting + + + +### Compliance + + + + + +### Security + + From d472edcf488c3ba471469f8ae193f2eaef857c68 Mon Sep 17 00:00:00 2001 From: Joel Takvorian Date: Tue, 10 Feb 2026 22:43:52 +0100 Subject: [PATCH 02/10] WIP: CNCF project GTR + self assessment --- .../CNCF project - GTR.md | 26 ++- cncf/Security Self-Assessment.md | 179 ++++++++++++++++++ 2 files changed, 203 insertions(+), 2 deletions(-) rename CNCF project - GTR.md => cncf/CNCF project - GTR.md (91%) create mode 100644 cncf/Security Self-Assessment.md diff --git a/CNCF project - GTR.md b/cncf/CNCF project - GTR.md similarity index 91% rename from CNCF project - GTR.md rename to cncf/CNCF project - GTR.md index 4912b8c..972e71f 100644 --- a/CNCF project - GTR.md +++ b/cncf/CNCF project - GTR.md @@ -147,9 +147,9 @@ High availability can be implemented by using Kafka deployment model (e.g. with * Describe the project’s resource requirements, including CPU, Network and Memory. -Resource requirements highly depend on the cluster network topology: how many nodes and pods you have, how much traffic, etc. While eBPF ensures a minimal impact on workload performance, the generated network flows can represent a significant amount of data, which impact CPU, memory and bandwitdh. Some [recommendations](https://github.com/netobserv/network-observability-operator/blob/main/config/descriptions/ocp.md#resource-considerations) are provided, but your mileage may and will vary. Some statistics are documented [here](https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/network_observability/configuring-network-observability-operators#network-observability-resource-recommendations_network_observability). +Resource requirements highly depend on the cluster network topology: how many nodes and pods you have, how much traffic, etc. 
While eBPF ensures a minimal impact on workload performance, the generated network flows can represent a significant amount of data, which impacts node CPU, memory and bandwidth. Some [recommendations](https://github.com/netobserv/network-observability-operator/blob/main/config/descriptions/ocp.md#resource-considerations) are provided, but your mileage may and will vary. Some statistics are documented [here](https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/network_observability/configuring-network-observability-operators#network-observability-resource-recommendations_network_observability).
 
-Mitigating high resource requirements can be done in several ways, such as by increasing the sampling interval, adding filters, or considering whether to use Loki or not. More information [here](https://github.com/netobserv/network-observability-operator/tree/main?tab=readme-ov-file#configuration).
+Mitigating high resource requirements can be done in several ways, such as by increasing the sampling interval, adding filters, or considering whether or not to use Loki. More information [here](https://github.com/netobserv/network-observability-operator/tree/main?tab=readme-ov-file#configuration).
 
  * Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage.
 
@@ -208,6 +208,28 @@ Testing and validating the installation can be done by port-forwarding the web c
  * Describe how the project is handling certificate rotation and mitigates any issues with certificates.
  * Describe how the project is following and implementing [secure software supply chain best practices](https://project.linuxfoundation.org/hubfs/CNCF\_SSCP\_v1.pdf) -->
+- [Self assessment](./Security%20Self-Assessment.md)
+- On the TAG Security whitepaper:
+1. Make security a design requirement
+Security measures have been baked in from day 0 of GA, and continuously improved over time.
For instance, from day 0, TLS / mTLS has been recommended through Kafka; RBAC and multi-tenancy are supported via the Loki Operator; eBPF agents, running with elevated privileges, are segregated in a dedicated namespace; and fine-grained capabilities are favored whenever possible. Threat modeling has been done internally at Red Hat.
+2. Applying secure configuration has the best user experience
+Security by default is preferred, although not always possible. Servers use TLS by default. eBPF agents run in non-privileged mode by default.
+Network policy is unfortunately not always installed by default, as it may unexpectedly block communications with some CNIs, but it is in OpenShift.
+3. Selecting insecure configuration is a conscious decision
+Features that require the eBPF agent's privileged mode will not automatically enable it: it remains a conscious decision.
+4. Transition from insecure to secure state is possible
+All the configuration is managed through the Operator with a typical reconciliation, which ensures transitions work seamlessly, in one way or another.
+5. Secure defaults are inherited
+NetObserv does not override any known secure defaults.
+6. Exception lists have first class support
+N/A
+7. Secure defaults protect against pervasive vulnerability exploits.
+Containers run as non-root; the release pipeline includes vulnerability scans.
+8. Security limitations of a system are explainable
+While security limitations are not hidden, they may not be very visible. This is something to add to the roadmap.
+
+TBC
+
 ## Day 1 \- Installation and Deployment Phase
diff --git a/cncf/Security Self-Assessment.md b/cncf/Security Self-Assessment.md
new file mode 100644
index 0000000..d5f4da9
--- /dev/null
+++ b/cncf/Security Self-Assessment.md
@@ -0,0 +1,179 @@
+# NetObserv Self-Assessment
+
+Security reviewers: Joël Takvorian
+
+This document is the Security Self-Assessment required for CNCF sandbox projects.
+
+## Table of Contents
+
+* [Metadata](#metadata)
+  * [Security links](#security-links)
+* [Overview](#overview)
+  * [Actors](#actors)
+  * [Actions](#actions)
+  * [Background](#background)
+  * [Goals](#goals)
+  * [Non-goals](#non-goals)
+* [Self-assessment use](#self-assessment-use)
+* [Security functions and features](#security-functions-and-features)
+* [Project compliance](#project-compliance)
+* [Secure development practices](#secure-development-practices)
+* [Security issue resolution](#security-issue-resolution)
+* [Appendix](#appendix)
+
+## Metadata
+
+### Software
+
+- https://github.com/netobserv/network-observability-operator
+- https://github.com/netobserv/flowlogs-pipeline
+- https://github.com/netobserv/netobserv-ebpf-agent
+- https://github.com/netobserv/network-observability-console-plugin
+- https://github.com/netobserv/network-observability-cli
+
+### Security Provider?
+
+No.
+
+### Languages
+
+- Go
+- TypeScript
+- C (eBPF)
+- Bash
+
+### Software Bill of Materials
+
+SBOMs of downstream builds are publicly available (e.g. https://quay.io/repository/redhat-user-workloads/ocp-network-observab-tenant/network-observability-operator-ystream, see the `.sbom`-suffixed tags). While upstream builds don't have SBOMs attached, they should be mostly identical, as upstream and downstream builds share the same code and base images. Minor differences should be expected, though.
+
+### Security Links
+
+TODO
+
+## Overview
+
+NetObserv is a set of components used to observe network traffic by generating NetFlows from eBPF agents, enriching those flows with Kubernetes metadata, exporting them in various ways (logs, metrics, Kafka, IPFIX...), and finally providing a comprehensive visualization tool for making sense of that data, a network health dashboard, and a CLI. Those components are mainly designed to be deployed in Kubernetes via an integrated Operator.
+
+### Background
+
+Kubernetes can be complex, and so can Kubernetes networking.
This is especially true since networking differs from one CNI to another. Cluster admins often find it important to have good observability over the network that clearly maps to Kubernetes resources (Services, Pods, Nodes...). This is what NetObserv aims to offer. Additionally, it aims at identifying network issues and raising alerts. While it is not designed as a security tool, the data that it provides can be leveraged, for instance, to detect network threat patterns.
+
+### Actors
+
+1. The [operator](https://github.com/netobserv/network-observability-operator) orchestrates the deployment of all related components (listed below), based on the supplied configuration. It operates at the cluster scope.
+2. The [eBPF agent](https://github.com/netobserv/netobserv-ebpf-agent) and [flowlogs-pipeline](https://github.com/netobserv/flowlogs-pipeline) collect network flows from the hosts (nodes) and process them, before sending them to storage or custom exporters.
+3. The [web console](https://github.com/netobserv/network-observability-console-plugin) reads data from the stores to display dashboards.
+4. The [CLI](https://github.com/netobserv/network-observability-cli) is an independent piece that also starts the eBPF agents and flowlogs-pipeline for on-demand monitoring, from the command line.
+
+### Actions
+
+The operator reads the main configuration (FlowCollector CRD) to determine how to deploy and configure the related components.
+
+The eBPF agents are deployed one per node (DaemonSet) with elevated privileges; they load their eBPF payload in the host kernel and start collecting network flows. Those flows are sent to flowlogs-pipeline, which correlates them with Kubernetes resources and performs various transformations, before sending them to a log store (Loki) and/or exposing them as Prometheus metrics. Other exporting options exist. Loki, Prometheus and any receiving system are not part of the NetObserv payload; they must be installed and managed separately.
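+
+As an illustration, the FlowCollector configuration mentioned above is a single cluster-scoped resource; here is a hypothetical minimal example (the API version and field names follow the FlowCollector documentation and may differ across versions):
+
+```yaml
+apiVersion: flows.netobserv.io/v1beta2
+kind: FlowCollector
+metadata:
+  name: cluster            # FlowCollector is a cluster-scoped singleton
+spec:
+  deploymentModel: Direct  # "Kafka" inserts a broker between agents and the pipeline
+  agent:
+    ebpf:
+      sampling: 50         # 1 = capture all traffic, at a higher resource cost
+  loki:
+    enable: true
+```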
+
+Optionally, Apache Kafka can be used as an intermediate between the eBPF agents and flowlogs-pipeline.
+
+The web console fetches the network flows from the stores (Loki and/or Prometheus) to display dashboards. It does not connect directly to other NetObserv components.
+
+The architecture is described in more detail [here](https://github.com/netobserv/network-observability-operator/blob/main/docs/Architecture.md), with diagrams included.
+
+### Goals
+
+NetObserv intends to provide visibility into the cluster network traffic, and to help troubleshoot network issues.
+
+In terms of security, because the NetObserv operator has cluster-wide access to many resources, and because the eBPF agents have elevated privileges on nodes, neither must be accessible by non-admins.
+
+Additionally, NetObserv MUST NOT
+
+- Leak any network data or metadata to unauthorized users.
+- Cause any harm by being gamed when reading (untrusted) network packets.
+- Allow connections from untrusted workloads to flowlogs-pipeline.
+
+### Non-Goals
+
+- Enforce RBAC when querying backend stores: this is the responsibility of the components that manage those stores (e.g. the Loki Operator comes with a Gateway that enforces RBAC; NetObserv connects to that Gateway).
+
+## Self-assessment Use
+
+This self-assessment is created by the NetObserv team to perform an internal analysis of the project's security. It is not intended to provide a security audit of NetObserv, or function as an independent assessment or attestation of NetObserv's security health.
+
+This document serves to provide NetObserv users with an initial understanding of NetObserv's security: where to find existing security documentation, NetObserv's plans for security, and a general overview of its security practices, both for the development of NetObserv and for the security of NetObserv itself.
+
+This document provides NetObserv maintainers and stakeholders with additional context to help inform the roadmap creation process, so that security and feature improvements can be prioritized accordingly.
+
+## Security functions and features
+
+| Component | Applicability | Description of Importance |
+| ------------------------- | ---------------- | --------------------------------------------------------------------------------- |
+| Namespace segregation | Critical | For hardened security, the components that require elevated privileges are deployed in their own namespace, flagged as privileged, which should only be accessible by cluster admins. |
+| Non-root eBPF agents | SecurityRelevant | Whenever possible, the eBPF agents run with fine-grained privileges (e.g. CAP_BPF) instead of full privileges. Some features, however, do require full privileges. |
+| Network policies | SecurityRelevant | A network policy can be installed automatically to better isolate the communications of the NetObserv workloads. However, due to policies being somewhat CNI-dependent and the inherent risk of breaking communications with untested CNIs, this feature is not enabled by default, except in OpenShift. |
+| Encrypted traffic | SecurityRelevant | All servers are configured with TLS by default. |
+| Authorized traffic (mTLS) | SecurityRelevant | Traffic between the eBPF agents and flowlogs-pipeline can be authorized on both sides (mTLS) when used with Kafka. It is planned to bring mTLS to other modes, without Kafka. When not using mTLS, it is highly recommended to protect the netobserv namespace with a network policy. |
+| RBAC-enforced stores | SecurityRelevant | Multi-tenancy can be achieved when supported by the backend stores: e.g. Loki with the Loki Operator, Prometheus with Thanos. In that case, NetObserv can be configured to forward user tokens. |
+
+## Project Compliance
+
+N/A: the project has not been evaluated against compliance standards as of today.
+ +### Future State + +Compliance can be evaluated based on demand. + +## Secure Development Practices + +A high security standard is observed, enforced by company policy (Red Hat). + +### Deployment Pipeline + +In order to secure the SDLC from development to deployment, the following measures are in place: + +- Branch protection on the default (`main`) branch, and release branches (`release-*`): + - Require a pull request before merging + - Require approvals: 1 + - Dismiss stale pull request approvals when new commits are pushed + - Require review from Code Owners + - Require status checks to pass before merging + - Build, linting, tests, clean state checks must pass + - In the eBPF agent, BPF bytecode is verified + - Force-push not allowed +- Code owners need to have 2FA enabled. +- Vulnerabilities in dependencies, and dependency upgrades, are managed via Dependabot and Renovate. +- Some weaknesses are reported by linters (golangci-lint, eslint). + - `govulncheck` use to be added to the roadmap. +- Downstream release process is automated. + - It includes vulnerability scans, FIPS-compliance checks, immutable images, SBOM, signing. +- Upstream release process is partly automated (the helm chart bundling is not, at this time). + - More security measures to be added to the roadmap. + +### Communication Channels + +- Internal communications among NetObserv maintainers working at Red Hat happen in private Slack channels. +- Communications with maintainers external to Red Hat happen in the public Slack channel (`#netobserv-project` on http://cloud-native.slack.com/). +- Inbound communications are accepted through that same channel, or through GitHub Issues, or the GitHub discussion pages. +- Outbound messages to users can be made via documentation, release notes, blogs, social media and the public slack channel. 
+
+## Security Issue Resolution
+
+As a Red Hat product, security issues and procedures are described on the [Security Contacts and Procedures](https://access.redhat.com/security/team/contact/?extIdCarryOver=true&sc_cid=701f2000001Css5AAC) page.
+
+### Responsible Disclosure Practice
+
+The same page mentioned above describes the Responsible Disclosure Practice. An email should be sent to the Red Hat Product Security team, who will engage in discussion with the project maintainers and respond to the reporter.
+
+### Incident Response
+
+In the event that a vulnerability is reported, the maintainer team, the Red Hat Product Security team and the reporter will collaborate to determine the validity and criticality of the report. Based on these findings, the fix will be triaged and the maintainer team will work to issue a patch in a timely manner.
+
+Patches will be made to the `main` and the latest release branches, and new releases (upstream and downstream) will be triggered. Information will be disseminated to the community through all appropriate outbound channels as soon as possible based on the circumstance.
+
+## Appendix
+
+- Known Issues Over Time
+  - Known issues are currently tracked in the project roadmap. There are currently no known vulnerabilities in the current supported version.
+- OpenSSF Best Practices
+  - The process to get a Best Practices badge is not yet on the roadmap.
+- Case Studies
+  - TBC
+- Related Projects / Vendors
+  - Similar to: Cilium Hubble, Pixie, Microsoft Retina. A differentiator is that NetObserv is fully open-source, CNI-independent, and actively maintained. It has some unique features, such as its FlowMetrics API. It also tries to differentiate with a polished UX.
From 567ee76f4a3817f593160935e9fe8504f6f93cdc Mon Sep 17 00:00:00 2001
From: Joel Takvorian
Date: Wed, 11 Feb 2026 12:56:37 +0100
Subject: [PATCH 03/10] Apply suggestion from @jotak

---
 cncf/Security Self-Assessment.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cncf/Security Self-Assessment.md b/cncf/Security Self-Assessment.md
index d5f4da9..67f267d 100644
--- a/cncf/Security Self-Assessment.md
+++ b/cncf/Security Self-Assessment.md
@@ -87,7 +87,7 @@ Additionally, NetObserv MUST NOT

 - Leak any network data or metadata to unauthorized users.
 - Cause any harm by being gamed when reading network packets (untrusted).
-- Allow connections from untrusted workloads to flowlogs-pipeline.
+- Allow connections from untrusted workloads to any ingest-side component that could alter the data produced.

 ### Non-Goals

From 1e829c11aaeccd8d63e181d0e3cef7f3722b1c78 Mon Sep 17 00:00:00 2001
From: Joel Takvorian
Date: Fri, 13 Feb 2026 15:48:03 +0100
Subject: [PATCH 04/10] Add roadmap, more info, address feedback:

- Add community roadmap
- Discuss production-readiness
- Mention bpfman

---
 CONTRIBUTING.md                  |  2 +-
 cncf/CNCF project - GTR.md       |  8 +++--
 cncf/Security Self-Assessment.md | 30 ++++++++++--------
 cncf/roadmap.md                  | 53 ++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+), 17 deletions(-)
 create mode 100644 cncf/roadmap.md

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index fcbb956..5e3d2a4 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -8,7 +8,7 @@ There should be mutual respect between contributors and maintainers.

 ### Security vulnerabilities

-Unlike other contributions, if you think you discovered a security vulnerability, please do not report it publicly or even fix it publicly. Please follow the instructions described on the [Red Hat Customer Portal](https://access.redhat.com/security/team/contact/?extIdCarryOver=true&sc_cid=701f2000001Css5AAC) instead.
+Unlike other contributions, if you think you discovered a security vulnerability, please do not report it publicly or even fix it publicly. Please follow the instructions described on the [Red Hat Customer Portal](https://access.redhat.com/security/team/contact/) instead. ### Documentation contributions diff --git a/cncf/CNCF project - GTR.md b/cncf/CNCF project - GTR.md index 972e71f..3f32c02 100644 --- a/cncf/CNCF project - GTR.md +++ b/cncf/CNCF project - GTR.md @@ -23,7 +23,7 @@ These questions are to gather knowledge about the project. Project maintainers a - **Template Version:** v1.0 - **Description:** -NetObserv is a set of components used to observe network traffic by generating NetFlows from eBPF agents with zero-instrumentation, enriching those flows using a Kubernetes-aware configurable pipeline, exporting them in various ways (logs, metrics, Kafka, IPFIX...), and finally providing a comprehensive visualization tool for making sense of that data, a network health dashboard, and a CLI. Those components are mainly designed to be deployed in Kubernetes via an integrated Operator, although they can also be used as standalones. +NetObserv is a set of components used to observe network traffic by generating NetFlows from eBPF agents with zero-instrumentation, enriching those flows using a Kubernetes-aware configurable pipeline, exporting them in various ways (to Loki, Prometheus, Kafka, OpenTelemetry or IPFIX), providing a comprehensive web UI for visualization, and a CLI. Those components are mainly designed to be deployed in Kubernetes via an integrated Operator, although they can also be used as standalones. The enriched NetFlows consist of basic 5-tuples information (IPs, ports…), metrics (bytes, packets, drops, latency…), kube metadata (pods, namespaces, services, owners), cloud data (zones), CNI data (network policy events), DNS (codes, qname) and more. 
@@ -42,7 +42,7 @@ NetObserv is largely CNI-agnostic, although some specific features can relate to

 NetObserv is the upstream of Red Hat [Network Observability](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/network_observability/index) for OpenShift. As such, a large part of the roadmap comes from the requirements on that downstream product, while it benefits equally to the upstream (there are no downstream-only features).

-TBC...
+While the downstream product is considered *production-ready* today, that is not yet the case for the upstream project. We have created [a specific roadmap](./roadmap.md) to address the known issues in the upstream project, and eventually fill the gap for production readiness.

  * Describe the target persona or user(s) for the project?

@@ -121,6 +121,8 @@ As mentioned before, NetObserv has dependencies on Loki and Prometheus. NetObser

 Optionally, Kafka can be used at a pre-ingestion stage for a production-grade, high-availability deployment (e.g., using Strimzi).

+Optionally, the eBPF agents can integrate with [bpfman](https://bpfman.io/). By doing so, the highly privileged operations, such as loading BPF programs into the kernel, are delegated to bpfman. This allows the eBPF agent to run unprivileged.
+
 Finally, several services require TLS certificates, which are generally provided by cert-manager or OpenShift Service Certificates.

  * Describe how the project implements Identity and Access Management.

@@ -226,7 +228,7 @@ N/A

 7. Secure defaults protect against pervasive vulnerability exploits.
 Containers run as non-root; Release pipeline includes vulnerability scans.
 8. Security limitations of a system are explainable
-While security limitations are not hidden, they may not be very visible. This is something to add to the roadmap.
+While security limitations are not hidden, they may not be very visible. This is something added [to the roadmap](./roadmap.md).
TBC diff --git a/cncf/Security Self-Assessment.md b/cncf/Security Self-Assessment.md index 67f267d..44d3a17 100644 --- a/cncf/Security Self-Assessment.md +++ b/cncf/Security Self-Assessment.md @@ -44,7 +44,7 @@ No. ### Software Bill of Materials -SBOM of downstream builds are publicly available (e.g. https://quay.io/repository/redhat-user-workloads/ocp-network-observab-tenant/network-observability-operator-ystream, see .sbom suffixed tags). While upstream builds don't have SBOM attached, they should be mostly identical, as upstream and downstream builds share the same code and base images. Minor differences should be expected though. +SBOM of downstream builds are publicly available (e.g. https://quay.io/repository/redhat-user-workloads/ocp-network-observab-tenant/network-observability-operator-ystream, see .sbom suffixed tags). While upstream builds don't have SBOM attached, they should be mostly identical, as upstream and downstream builds share the same code and base images. Minor differences should be expected though. Adding SBOM to the upstream builds is part of the roadmap. ### Security Links @@ -56,12 +56,12 @@ NetObserv is a set of components used to observe network traffic by generating N ### Background -Kubernetes can be complex, and so does Kubernetes networking. Especially as it can differ from a CNI to another. Cluster admins often find important to have a good observability over the network, that clearly maps with Kubernetes resources (Services, Pods, Nodes...). This is what NetObserv aims to offer. Additionally, it aims at identifying network issues, and raising alerts. While it is not designed as a security tool, the data that it provides can be leveraged, for instance, to detect network threat patterns. +Kubernetes is complex, and so is Kubernetes networking, especially since it can differ from one CNI to another. Cluster admins often find it important to have good observability over the network that clearly maps with Kubernetes resources (e.g. 
services, pods, nodes). This is what NetObserv aims to offer. Additionally, it aims at identifying network issues and raising alerts. While it is not designed as a security tool, the data that it provides can be leveraged, for instance, to detect network threat patterns. It can also be used for auditing purposes.

 ### Actors

 1. The [operator](https://github.com/netobserv/network-observability-operator) orchestrates the deployment of all related components (listed below), based on the supplied configuration. It operates at the cluster scope.
-2. The [eBPF agent](https://github.com/netobserv/flowlogs-pipeline) and [flowlogs-pipeline](https://github.com/netobserv/flowlogs-pipeline) are collecting network flows from the hosts (nodes), processing them, before sending them to storage or custom exporters.
+2. The [eBPF agent](https://github.com/netobserv/netobserv-ebpf-agent) and [flowlogs-pipeline](https://github.com/netobserv/flowlogs-pipeline) are extracting and collecting network flows from the hosts (nodes), processing them, before sending them to storage or custom exporters.
 3. The [web console](https://github.com/netobserv/network-observability-console-plugin) reads data from the stores to display dashboards.
 4. The [CLI](https://github.com/netobserv/network-observability-cli) is an independent piece that also starts the eBPF agents and flowlogs-pipeline for on-demand monitoring, from the command line.

@@ -69,7 +69,7 @@ Kubernetes can be complex, and so does Kubernetes networking. Especially as it c

 The operator reads the main configuration (FlowCollector CRD) to determine how to deploy and configure the related components.

-The eBPF agents are deployed, one per node (DaemonSet), with elevated privleges, load their eBPF payload in the host kernel, and start collecting network flows.
Those flows are sent to flowlogs-pipeline, which correlate them with Kubernetes resources, and performs various transformations, before sending them to a log store (Loki) and/or expose them as Prometheus metrics. Other exporting options exist. Loki, Prometheus and any receiving system are not part of the NetObserv payload, they must be installed and managed separately.
+The eBPF agents are deployed one per node (DaemonSet) with elevated privileges; they load their eBPF payload in the host kernel and start extracting network flows. Those flows are sent to flowlogs-pipeline, which correlates them with Kubernetes resources, and performs various transformations, before sending them to a log store (Loki) and/or exposing them as Prometheus metrics. Other exporting options exist. Loki, Prometheus and any receiving system are not part of the NetObserv payload. They must be installed and managed separately.

 Optionally, Apache Kafka can be used as an intermediate between the eBPF agents and flowlogs-pipeline.

@@ -86,7 +86,7 @@ In terms of security, because the NetObserv operator has cluster-wide access to

 Additionally, NetObserv MUST NOT

 - Leak any network data or metadata to unauthorized users.
-- Cause any harm by being gamed when reading network packets (untrusted).
+- Cause any harm by being gamed when reading network packets (untrusted input).
- Allow connections from untrusted workloads to any ingest-side component that could alter the data produced.

 ### Non-Goals

@@ -106,15 +106,19 @@ This document provides NetObserv maintainers and stakeholders with additional co

 | Component | Applicability | Description of Importance |
 | ------------------------- | ---------------- | --------------------------------------------------------------------------------- |
 | Namespace segregation | Critical | For hardened security, the components that require elevated privileges are deployed in their own namespace, flagged as privileged, which should only be accessible by cluster admins.
|
+| TLS version restriction | Critical | When TLS is used, the minimum version is forced to 1.3. |
 | Non-root eBPF agents | SecurityRelevant | Whenever possible, the eBPF agents run with fine-grained privileges (e.g. CAP_BPF) instead of full privileges. Some features, however, do require full privileges. |
-| Network policies | SecurityRelevant | A network policy can be installed automatically to better isolate the communications of the NetObserv workloads. However, due to policies being somewhat CNI-dependent and the inherent risk of breaking communications with untested CNIs, this feature is not enabled by default, except in OpenShift. |
-| Encrypted traffic | SecurityRelevant | All servers are configured with TLS by default. |
+| Unprivileged eBPF agents | SecurityRelevant | Using bpfman (a CNCF project) makes it possible to run the agents without privileges. Operations requiring elevated privileges are delegated to bpfman. |
+| Network policies | SecurityRelevant | A network policy can be installed automatically to better isolate the communications of the NetObserv workloads. However, due to policies being somewhat CNI-dependent and the inherent risk of breaking communications with untested CNIs, this feature is not enabled by default, except with OVN-Kubernetes. |
+| Encrypted traffic | SecurityRelevant | All servers are configured with TLS by default. In OpenShift, certificates are generated automatically. Otherwise, they must be provided by the user. |
 | Authorized traffic (mTLS) | SecurityRelevant | Traffic between the eBPF agents and flowlogs-pipeline can be authorized on both sides (mTLS) when used with Kafka. It is planned to bring mTLS to other modes, without Kafka. When not using mTLS, it is highly recommended to protect the netobserv namespace with a network policy. |
 | RBAC-enforced stores | SecurityRelevant | Multi-tenancy can be achieved when supported by the backend stores: e.g. Loki with the Loki Operator, Prometheus with Thanos.
In that case, NetObserv can be configured to forward user tokens. |

 ## Project Compliance

-N/A: the project has not been evaluated against compliance standards as of today.
+Downstream builds are FIPS-140 compliant. Those build recipes are open-source and can be replicated.
+
+The project has not been evaluated against other compliance standards as of today.

 ### Future State

@@ -140,11 +144,11 @@ In order to secure the SDLC from development to deployment, the following measur

 - Code owners need to have 2FA enabled.
 - Vulnerabilities in dependencies, and dependency upgrades, are managed via Dependabot and Renovate.
 - Some weaknesses are reported by linters (golangci-lint, eslint).
-  - `govulncheck` use to be added to the roadmap.
+  - `govulncheck` usage is [in the roadmap](./roadmap.md).
 - Downstream release process is automated.
   - It includes vulnerability scans, FIPS-compliance checks, immutable images, SBOM, signing.
 - Upstream release process is partly automated (the helm chart bundling is not, at this time).
-  - More security measures to be added to the roadmap.
+  - Improvements are listed [in the roadmap](./roadmap.md).

 ### Communication Channels

@@ -155,7 +159,7 @@ In order to secure the SDLC from development to deployment, the following measur

 ## Security Issue Resolution

-As a Red Hat product, security issues and procedures are described on the [Security Contacts and Procedures](https://access.redhat.com/security/team/contact/?extIdCarryOver=true&sc_cid=701f2000001Css5AAC) page.
+As a Red Hat product, security issues and procedures are described on the [Security Contacts and Procedures](https://access.redhat.com/security/team/contact/) page. The GitHub repositories have a SECURITY.md file where reporters can find this information.
### Responsible Disclosure Practice

@@ -170,9 +174,9 @@ Patches will be made to the `main` and the latest release branches, and new rele

 ## Appendix

 - Known Issues Over Time
-  - Known issues are currently tracked in the project roadmap. There are currently no known vulnerabilities in the current supported version.
+  - Known issues are currently tracked in the [project roadmap](./roadmap.md). There are no known vulnerabilities in the currently supported version.
 - OpenSSF Best Practices
-  - The process to get a Best Practices badge is not yet on the roadmap.
+  - The process to get a Best Practices badge is not currently on the roadmap.
 - Case Studies
   - TBC
 - Related Projects / Vendors
diff --git a/cncf/roadmap.md b/cncf/roadmap.md
new file mode 100644
index 0000000..3045bcc
--- /dev/null
+++ b/cncf/roadmap.md
@@ -0,0 +1,53 @@
+# NetObserv: community roadmap
+
+NetObserv is an open-source project with upstream / community releases, and downstream / product releases often referred to as "Network Observability for OpenShift", which is a Red Hat product.
+
+This section describes a roadmap specifically for the upstream project. Improvements targeting downstream are not listed here, even though they will also benefit the upstream. The maintainer team uses GitHub issues for community-driven tasks, and a partially public [JIRA tracker](https://issues.redhat.com/projects/NETOBSERV/) for product-driven tasks.
+
+## Milestone: production-readiness
+
+### Current status
+
+Production-readiness is not equivalent between the upstream and the downstream versions. The downstream product is production-ready and actually deployed in production. A gap exists with the community releases, which is precisely what this roadmap aims to address. To date, deploying the community releases to production should be considered with care.
+
+### Next steps
+
+#### Security by default
+
+- It is currently up to the users to create Network Policies that restrict access to and from the NetObserv namespaces. It should be noted that the NetObserv operator *does* embed an optional network policy for that purpose; however, it has some known issues with CNIs other than OVN-Kubernetes (and possibly unknown issues as well).
+
+- The default "Service" deployment model requires manually configuring TLS or mTLS between NetObserv components, or disabling TLS. The "Kafka" mode (which is not the default) can also be configured manually with mTLS. Using Strimzi and KafkaUser makes it rather straightforward, though that is still not a default.
+
+*Why it differs from downstream:*
+
+When the OVN-Kubernetes CNI is detected, which is the default in OpenShift, a network policy is deployed by default, restricting access to and from the NetObserv namespaces. Additionally, the default "Service" deployment model uses TLS, automatically configured based on the OpenShift serving certificates feature.
+
+*Actions to take:*
+
+- Network policies
+  - Investigate the network policy issues and make the policy work for some common CNIs.
+  - Raise warnings / degraded conditions when no network policy is detected.
+  - Better document the required ACLs for NetObserv, for users whose CNIs don't support the embedded policy.
+- TLS / mTLS
+  - Integrate Trust-manager resources in the Helm chart by default ([issue #2360](https://github.com/netobserv/network-observability-operator/issues/2360)).
+  - Raise warnings / degraded conditions when no TLS is detected.
+
+#### Securing the upstream build and release processes
+
+Upstream builds and releases are done through GitHub workflows. The workflow generates OCI images, which are pushed to quay.io.
+
+Releases are triggered by a git tag, and the resulting images are also pushed to quay.io.
The community release process is described [here](https://github.com/netobserv/network-observability-operator/blob/main/RELEASE.md). Some manual steps are involved to publish the Helm chart.
+
+This process is functional but lacks some features needed to make it truly safe for production. OCI images are not signed, and no SBOM is generated.
+
+*Why it differs from downstream:*
+
+Downstream builds and releases use an entirely different workflow that is common across many Red Hat products, using [Konflux](https://konflux-ci.dev/docs/), with the highest security standards. It produces an OLM bundle rather than a Helm chart.
+
+*Actions to take:*
+
+- Release process improvement
+  - Generate SBOM
+  - Sign images
+  - Make releases immutable
+  - Automate the Helm chart generation

From 0c354e77ced134571d1c1234845c7ca5602ec358 Mon Sep 17 00:00:00 2001
From: Joel Takvorian
Date: Fri, 13 Feb 2026 18:17:00 +0100
Subject: [PATCH 05/10] continue filling the gtr

---
 cncf/CNCF project - GTR.md | 165 +++++++++++++++++++++----------------
 cncf/roadmap.md            |   4 +
 2 files changed, 100 insertions(+), 69 deletions(-)

diff --git a/cncf/CNCF project - GTR.md b/cncf/CNCF project - GTR.md
index 3f32c02..436d059 100644
--- a/cncf/CNCF project - GTR.md
+++ b/cncf/CNCF project - GTR.md
@@ -38,43 +38,45 @@ NetObserv is largely CNI-agnostic, although some specific features can relate to

 ### Scope

-  * Describe the roadmap process, how scope is determined for mid to long term features, as well as how the roadmap maps back to current contributions and maintainer ladder?
+  **Describe the roadmap process, how scope is determined for mid to long term features, as well as how the roadmap maps back to current contributions and maintainer ladder?**

 NetObserv is the upstream of Red Hat [Network Observability](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/network_observability/index) for OpenShift.
As such, a large part of the roadmap comes from the requirements on that downstream product, while it benefits equally to the upstream (there are no downstream-only features).

 While the downstream product is considered *production-ready* today, that is not yet the case for the upstream project. We have created [a specific roadmap](./roadmap.md) to address the known issues in the upstream project, and eventually fill the gap for production readiness.

-  * Describe the target persona or user(s) for the project?
+  **Describe the target persona or user(s) for the project?**

 The project targets both cluster administrators and project teams. Cluster administrators have a cluster-wide view over all the network traffic, full topology, access to metrics and alerts. They can run packet-capture, they configure the cluster-scoped flow collection process.

 Through multi-tenancy, project teams have access to a subset of the traffic and the related topology. They have limited configuration options, such as per-namespace sampling or traffic flagging.

-  * Explain the primary use case for the project. What additional use cases are supported by the project?
+  **Explain the primary use case for the project. What additional use cases are supported by the project?**

 Observing the network runtime traffic with different levels of granularity and aggregations, receiving network health info such as saturation, degraded latency, DNS issues, etc. Troubleshooting network issues, narrowing down to specific pods or services, deep-diving in netflow data or pcap. Being alerted.

 With OVN-Kubernetes CNI, network policy troubleshooting, and network isolation (UDN) visualization.

-  * Explain which use cases have been identified as unsupported by the project.
+  **Explain which use cases have been identified as unsupported by the project.**

 Currently, network policy troubleshooting with CNIs other than OVN-Kubernetes is not supported.
L7 observability is not planned to date (no insight into HTTP-specific data such as error codes or URLs; NetObserv operates at a lower level).

-  * Describe the intended types of organizations who would benefit from adopting this project. (i.e. financial services, any software manufacturer, organizations providing platform engineering services)?
+  **Describe the intended types of organizations who would benefit from adopting this project. (i.e. financial services, any software manufacturer, organizations providing platform engineering services)?**

 All types of organizations may benefit from network observability.

-  * Please describe any completed end user research and link to any reports.
+  **Please describe any completed end user research and link to any reports.**
+
+Unfortunately, no such end user research is publicly available.

 ### Usability

-* How should the target personas interact with your project?
+**How should the target personas interact with your project?**

 Configuration is done entirely through the CRD APIs, managed by a Kubernetes operator. It is GitOps-friendly.

 A web console is provided for the network traffic and network health visualization. Metrics and alerts are provided for Prometheus, meaning that users can leverage their existing tooling if they already have it.

 A command-line interface tool is also provided, independently from the operator, allowing users to troubleshoot the network from the command line.

-* Describe the user experience (UX) and user interface (UI) of the project.
+**Describe the user experience (UX) and user interface (UI) of the project.**

 The provided web console offers two views: Network Traffic (flows visualization) and Network Health (health rules and alerts visualization).
The Network Traffic view itself consists of three subviews:

- an overview of the traffic, showing various charts

@@ -87,7 +89,7 @@ In traffic overview and topology, traffic can be aggregated at different levels

 Special attention is paid to the UX with many small details, to quickly filter on a displayed element, step into an aggregated topology element, etc.

-* Describe how this project integrates with other projects in a production environment.
+**Describe how this project integrates with other projects in a production environment.**

 NetObserv can generate many metrics, ingested by Prometheus, and alerting rules for AlertManager. Users who already use them can leverage their existing setup.

 In the future, we may investigate other UI integrations, such as with Headlamp.

 ### Design

-  * Explain the design principles and best practices the project is following.
+  **Explain the design principles and best practices the project is following.**

 The project design principles and best practices are globally common to many Red Hat products. The development philosophy is "upstream first", meaning that there is no hidden code/feature that only downstream users would get. In fact, there is not even a specific repository for downstream.

 We expect a reasonably high code quality standard, without being too picky on style.

 All architectural decisions are made with care, and must be well balanced against their drawbacks. When that happens, we expect to discuss a list of pros and cons thoughtfully. One aspect that is often overlooked at first is the impact on the maintenance and support workloads.

-  * Outline or link to the project’s architecture requirements? Describe how they differ for Proof of Concept, Development, Test and Production environments, as applicable.
+  **Outline or link to the project’s architecture requirements? Describe how they differ for Proof of Concept, Development, Test and Production environments, as applicable.**

??
- * Define any specific service dependencies the project relies on in the cluster. + **Define any specific service dependencies the project relies on in the cluster.** Both the NetObserv operator and the `flowlogs-pipeline` component interact with the Kube API server to watch resources and, for the operator, to create or update them. @@ -125,39 +127,39 @@ Optionally, the eBPF agents can integrate with [bpfman](https://bpfman.io/). By Finally, several services require TLS certificates, which are generally provided by cert-manager or OpenShift Service Certificates. - * Describe how the project implements Identity and Access Management. + **Describe how the project implements Identity and Access Management.** On the ingestion side, there is no Identity and Access Management other than with the components service accounts themselves, associated with RBAC permissions. On the consuming side, NetObserv does not implement by itself Identity and Access Management, however all queries run against Loki or Prometheus forward the Authorization header, delegating this aspect to those backends. In a production-grade environment, Thanos and the Loki Operator can be used to enable multi-tenancy. This is how it is implemented in OpenShift. - * Describe how the project has addressed sovereignty. + **Describe how the project has addressed sovereignty.** Open-source addresses independence. NetObserv does not store any data directly, this is delegated to Loki and/or Prometheus and the aforementioned exporting methods. All these options offer a very decent flexibility in terms of storage options, with interoperability, which should not cause any independence blockers. - * Describe any compliance requirements addressed by the project. + **Describe any compliance requirements addressed by the project.** -?? +Downstream builds are FIPS-140 compliant. Those build recipes are open-source and can be replicated. 
-Downstream builds are FIPS compliant (those build recipes are open-source as well).
+The project has not been evaluated against other compliance standards as of today.

- * Describe the project’s High Availability requirements.
+ **Describe the project’s High Availability requirements.**

High availability can be implemented by using the Kafka deployment model (e.g. with Strimzi), and using an autoscaler for the `flowlogs-pipeline` component.

Loki and Prometheus should be configured for high availability as well (this aspect is not managed by NetObserv itself; using Thanos and the Loki Operator can serve this purpose).

- * Describe the project’s resource requirements, including CPU, Network and Memory.
+ **Describe the project’s resource requirements, including CPU, Network and Memory.**

Resource requirements highly depend on the cluster network topology: how many nodes and pods you have, how much traffic, etc. While eBPF ensures a minimal impact on workload performance, the generated network flows can represent a significant amount of data, which impacts node CPU, memory and bandwidth. Some [recommendations](https://github.com/netobserv/network-observability-operator/blob/main/config/descriptions/ocp.md#resource-considerations) are provided, but your mileage may and will vary. Some statistics are documented [here](https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/network_observability/configuring-network-observability-operators#network-observability-resource-recommendations_network_observability).

Mitigating high resource requirements can be done in several ways, such as by increasing the sampling interval, adding filters, or considering whether or not to use Loki. More information [here](https://github.com/netobserv/network-observability-operator/tree/main?tab=readme-ov-file#configuration).

- * Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage.
+ **Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage.**

Storage is not directly managed by NetObserv, and is to be configured via Prometheus and/or Loki. TTL is important to consider. Loki is often configured with an S3 storage backend, but other options exist, such as ODF. Just like memory, storage requirements highly depend on the cluster network topology, and can be mitigated the same way as mentioned above.

- * Please outline the project’s API Design:
+ **Please outline the project’s API Design:**

NetObserv defines several APIs:
- The [FlowCollector CRD](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md) contains the main, cluster-wide configuration for NetObserv.

@@ -176,9 +178,9 @@ Best effort is done to achieve security by default, but this is sometimes too de
Loki must be configured according to its installation, disabled, or enabled in "demo" mode. Prometheus querier URL must be configured. It is recommended to enable the embedded network policy, or to install one. In OpenShift, Prometheus and the network policy are enabled and configured automatically.

- * Describe any new or changed API types and calls \- including to cloud providers \- that will result from this project being enabled and used
- * Describe compatibility of any new or changed APIs with API servers, including the Kubernetes API server
- * Describe versioning of any new or changed APIs, including how breaking changes are handled
+ * Describe any new or changed API types and calls \- including to cloud providers \- that will result from this project being enabled and used
+ * Describe compatibility of any new or changed APIs with API servers, including the Kubernetes API server
+ * Describe versioning of any new or changed APIs, including how breaking changes are handled

The project release process is split between upstream and downstream releases.
For both of them, content can be tracked from the repositories, which are public.

@@ -196,22 +198,12 @@ Testing and validating the installation can be done by port-forwarding the web c
### Security

-
-- [Self assessment](./Security%20Self-Assessment.md)
-- On TAG Security whitepaper:
+**Please provide a link to the project’s cloud native [security self assessment](https://tag-security.cncf.io/community/assessments/).**
+
+=> [Security self assessment](./Security%20Self-Assessment.md)
+
+**Please review the [Cloud Native Security Tenets](https://github.com/cncf/tag-security/blob/main/community/resources/security-whitepaper/secure-defaults-cloud-native-8.md) from TAG Security.**
+
1. Make security a design requirement

Security measures have been baked in from GA day-0, and continuously improved over time. For instance, from day-0, TLS / mTLS has been recommended through Kafka; RBAC and multi-tenancy supported via the Loki Operator; eBPF agents, running with elevated privileges, are segregated in a different namespace; fine-grained capabilities are favored whenever possible. Threat modeling has been done internally at Red Hat.

2. Applying secure configuration has the best user experience

@@ -230,57 +222,92 @@ Containers run as non-root; Release pipeline includes vulnerability scans.
8. Security limitations of a system are explainable

While security limitations are not hidden, they may not be very visible. This is something added [to the roadmap](./roadmap.md).

-TBC
+**How do you recommend users alter security defaults in order to "loosen" the security of the project? Please link to any documentation the project has written concerning these use cases.**
+We are not currently emphasizing the security risks associated with relaxing the default settings. This is something to improve.
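To illustrate the secure-by-default posture described above, here is a sketch of what an mTLS-enabled Kafka transport can look like in the `FlowCollector` resource. This is indicative only: the secret names are hypothetical Strimzi-style examples, and field paths should be checked against the FlowCollector CRD reference for your installed version.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  deploymentModel: Kafka      # decouple agents from the pipeline through Kafka
  kafka:
    # hypothetical Strimzi bootstrap address; port 9093 is typically the TLS listener
    address: kafka-cluster-kafka-bootstrap.netobserv:9093
    topic: network-flows
    tls:
      enable: true
      caCert:
        type: secret
        name: kafka-cluster-cluster-ca-cert   # example CA secret from a Strimzi deployment
        certFile: ca.crt
      userCert:
        type: secret
        name: flp-kafka-user                  # example client certificate (KafkaUser) secret
        certFile: user.crt
        certKey: user.key
```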
+
+**Security Hygiene**
+
+
+**Cloud Native Threat Modeling**
+

## Day 1 \- Installation and Deployment Phase

### Project Installation and Configuration

-
+**Describe what project installation and configuration look like.**

### Project Enablement and Rollback

-
+**How can this project be enabled or disabled in a live cluster? Please describe any downtime required of the control plane or nodes.**
+
+**Describe how enabling the project changes any default behavior of the cluster or running workloads.**
+
+**Describe how the project tests enablement and disablement.**
+
+**How does the project clean up any resources created, including CRDs?**

### Rollout, Upgrade and Rollback Planning

-
+**How does the project intend to provide and maintain compatibility with infrastructure and orchestration management tools like Kubernetes and with what frequency?**
+
+**Describe how the project handles rollback procedures.**
+
+**How can a rollout or rollback fail? Describe any impact to already running workloads.**
+
+**Describe any specific metrics that should inform a rollback.**
+
+**Explain how upgrades and rollbacks were tested and how the upgrade-\>downgrade-\>upgrade path was tested.**
+
+**Explain how the project informs users of deprecations and removals of features and APIs.**
+
+**Explain how the project permits utilization of alpha and beta capabilities as part of a rollout.**

## Day 2 \- Day-to-Day Operations Phase

### Scalability/Reliability

-
+**Describe how the project increases the size or count of existing API objects.**
+
+Quite low: NetObserv's main API (`FlowCollector`) is a singleton, cluster-scoped resource, and is sufficient to run NetObserv.
+
+The other resources are optional:
+- `FlowMetrics` allows creating customized metrics; we expect barely a dozen at most.
+- `FlowCollectorSlice` allows delegating some parts of the `FlowCollector` config to project teams. If used extensively, it's probably capped at one per namespace.
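As an indicative sketch (not normative — see the FlowCollector reference linked earlier for authoritative fields), the singleton resource and an optional custom metric look roughly like this; the metric name and value field are illustrative:

```yaml
# Singleton, cluster-scoped configuration: the resource must be named "cluster"
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv     # namespace where the collector components are deployed
  agent:
    type: eBPF
---
# Optional: a customized metric derived from enriched flow data
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: cluster-external-ingress-bytes   # example name, for illustration
  namespace: netobserv
spec:
  metricName: cluster_external_ingress_bytes_total
  type: Counter
  valueField: Bytes        # aggregate the Bytes field of matching flows
```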
+
+**Describe how the project defines Service Level Objectives (SLOs) and Service Level Indicators (SLIs).**
+
+**Describe any operations that will increase in time covered by existing SLIs/SLOs.**
+
+**Describe the increase in resource usage in any components as a result of enabling this project, to include CPU, Memory, Storage, Throughput.**
+
+TBC
+
+**Describe which conditions enabling / using this project would result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)**
+
+TBC / Agent fd
+
+**Describe the load testing that has been performed on the project and the results.**

Load tests are performed very regularly on different cluster sizes (25 and 250 nodes) to track any performance regression, using prow and kube-burner-ocp. Not all configurations can be tested this way, so the focus is set on the very high range of production-grade installations, with Kafka, the Loki Operator, all features enabled, and maximum sampling (capturing all the traffic).

[This page](https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/network_observability/configuring-network-observability-operators) shows a short summary of these tests, along with resource limit recommendations. More information can be obtained from prow runs, publicly available ([here's an example](https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-eng-ocp-qe-perfscale-ci-netobserv-perf-tests-netobserv-aws-4.21-nightly-x86-node-density-heavy-25nodes/2020627868538638336/artifacts/node-density-heavy-25nodes/openshift-qe-orion/artifacts/data-netobserv-perf-node-density-heavy-AWS-25w.csv)).

+**Describe the recommended limits of users, requests, system resources, etc. and how they were obtained.**
+
+TBC
+
+**Describe which resilience pattern the project uses and how, including the circuit breaker pattern.**
+
+TBC Kafka, Loki rate limit / retries, ...
### Observability Requirements

diff --git a/cncf/roadmap.md b/cncf/roadmap.md
index 3045bcc..c2f8ed5 100644
--- a/cncf/roadmap.md
+++ b/cncf/roadmap.md
@@ -28,9 +28,13 @@ When the OVN-Kubernetes CNI is detected, which is the default in OpenShift, a ne
- Investigate on the network policy issues, make it work for some common CNIs.
- Raise warnings / degrade conditions when no network policy is detected.
- Better document the required ACLs for NetObserv, for users using CNIs that don't support the embedded policy.
+ - Document the risks associated with not having a network policy.
- TLS / mTLS
- Integrate Trust-manager resources in the Helm chart by default ([issue #2360](https://github.com/netobserv/network-observability-operator/issues/2360)).
- Raise warnings / degrade conditions when no TLS is detected.
+ - Document the risks associated with disabling TLS.
+- Documentation review
+ - Review all the documentation to further emphasize the risks when relaxing the secured defaults (e.g.: privileged agent).

#### Securing the upstream build and release processes

From 630048fc8678a36dd41ad2babfdb4b9e071ffa82 Mon Sep 17 00:00:00 2001
From: Joel Takvorian
Date: Mon, 16 Feb 2026 15:16:31 +0100
Subject: [PATCH 06/10] update sovereignty

---
cncf/CNCF project - GTR.md | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/cncf/CNCF project - GTR.md b/cncf/CNCF project - GTR.md
index 436d059..5783200 100644
--- a/cncf/CNCF project - GTR.md
+++ b/cncf/CNCF project - GTR.md
@@ -74,7 +74,7 @@ Unfortunately there is no such kind of end user research publicly available.

**How should the target personas interact with your project?**

-Configuration is done entirely through the CRD APIs, managed by an k8s operator. It is gitops-friendly. A web console is provided for the network traffic and network health visualization. Metrics and alerts are provided for Prometheus, meaning that the users can leverage their existing tooling if they already have it.
A command-line interface tool is also provided, independently from the operator, allowing users to troubleshoot network from the command line.
+Configuration is done entirely through the CRD APIs, managed by a k8s operator. It is gitops-friendly. A web console is provided for the network traffic and network health visualization. Metrics and alerts are provided for Prometheus, meaning that users can leverage their existing tooling if they already have it. A command-line interface tool is also provided, independently from the operator, allowing users to troubleshoot the network from the command line.

**Describe the user experience (UX) and user interface (UI) of the project.**

@@ -101,7 +101,7 @@ In the future, we may investigate other UI integration, such as with Headlamp.

### Design

- **Explain the design principles and best practices the project is following.**
+**Explain the design principles and best practices the project is following.**

The project design principles and best practices are globally common to many Red Hat products. The development philosophy is "upstream first", meaning that there is no hidden code/feature that only downstream users would get. In fact, there is not even a specific repository for downstream.

@@ -111,11 +111,11 @@ We expect a reasonably high code quality standard, without being too picky on st
All architectural decisions are made with care, and must be well balanced against their drawbacks. When trade-offs arise, we expect to discuss a list of pros and cons thoughtfully. One aspect that is often overlooked at first is the impact on the maintenance and support workloads.

- **Outline or link to the project’s architecture requirements? Describe how they differ for Proof of Concept, Development, Test and Production environments, as applicable.**
+**Outline or link to the project’s architecture requirements? Describe how they differ for Proof of Concept, Development, Test and Production environments, as applicable.**

-??
- **Define any specific service dependencies the project relies on in the cluster.**
+
+**Define any specific service dependencies the project relies on in the cluster.**

Both the NetObserv operator and the `flowlogs-pipeline` component interact with the Kube API server to watch resources and, for the operator, to create or update them.

@@ -127,39 +127,39 @@ Optionally, the eBPF agents can integrate with [bpfman](https://bpfman.io/). By
Finally, several services require TLS certificates, which are generally provided by cert-manager or OpenShift Service Certificates.

- **Describe how the project implements Identity and Access Management.**
+**Describe how the project implements Identity and Access Management.**

On the ingestion side, there is no Identity and Access Management other than the components' service accounts themselves, associated with RBAC permissions.

On the consuming side, NetObserv does not itself implement Identity and Access Management; however, all queries run against Loki or Prometheus forward the Authorization header, delegating this aspect to those backends. In a production-grade environment, Thanos and the Loki Operator can be used to enable multi-tenancy. This is how it is implemented in OpenShift.

- **Describe how the project has addressed sovereignty.**
+**Describe how the project has addressed sovereignty.**

-Open-source addresses independence.
+Open-source addresses independence. Being a vendor-neutral CNCF sandbox project would reinforce it.

NetObserv does not store any data directly; this is delegated to Loki and/or Prometheus and the aforementioned exporting methods. All these options offer very decent flexibility in terms of storage, with interoperability, which should not cause any independence blockers.

- **Describe any compliance requirements addressed by the project.**
+**Describe any compliance requirements addressed by the project.**

Downstream builds are FIPS-140 compliant.
Those build recipes are open-source and can be replicated. The project has not been evaluated against other compliance standards as of today.

- **Describe the project’s High Availability requirements.**
+**Describe the project’s High Availability requirements.**

High availability can be implemented by using the Kafka deployment model (e.g. with Strimzi), and using an autoscaler for the `flowlogs-pipeline` component.

Loki and Prometheus should be configured for high availability as well (this aspect is not managed by NetObserv itself; using Thanos and the Loki Operator can serve this purpose).

- **Describe the project’s resource requirements, including CPU, Network and Memory.**
+**Describe the project’s resource requirements, including CPU, Network and Memory.**

Resource requirements highly depend on the cluster network topology: how many nodes and pods you have, how much traffic, etc. While eBPF ensures a minimal impact on workload performance, the generated network flows can represent a significant amount of data, which impacts node CPU, memory and bandwidth. Some [recommendations](https://github.com/netobserv/network-observability-operator/blob/main/config/descriptions/ocp.md#resource-considerations) are provided, but your mileage may and will vary. Some statistics are documented [here](https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/network_observability/configuring-network-observability-operators#network-observability-resource-recommendations_network_observability).

Mitigating high resource requirements can be done in several ways, such as by increasing the sampling interval, adding filters, or considering whether or not to use Loki. More information [here](https://github.com/netobserv/network-observability-operator/tree/main?tab=readme-ov-file#configuration).
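As a hedged sketch of such mitigations expressed in the `FlowCollector` spec (field paths follow the CRD reference linked above; the sampling value is an arbitrary example — adjust for your cluster):

```yaml
spec:
  agent:
    ebpf:
      sampling: 400    # sample roughly 1 packet out of 400 instead of the default 50
  loki:
    enable: false      # metrics-only mode: keep Prometheus metrics, skip flow log storage
```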
- **Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage.**
+**Describe the project’s storage requirements, including its use of ephemeral and/or persistent storage.**

Storage is not directly managed by NetObserv, and is to be configured via Prometheus and/or Loki. TTL is important to consider. Loki is often configured with an S3 storage backend, but other options exist, such as ODF. Just like memory, storage requirements highly depend on the cluster network topology, and can be mitigated the same way as mentioned above.

- **Please outline the project’s API Design:**
+**Please outline the project’s API Design:**

NetObserv defines several APIs:
- The [FlowCollector CRD](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md) contains the main, cluster-wide configuration for NetObserv.

From bc05d2c1352801648d79d68bb881105d06ec942e Mon Sep 17 00:00:00 2001
From: Joel Takvorian
Date: Mon, 16 Feb 2026 15:40:10 +0100
Subject: [PATCH 07/10] Add Security Hygiene

---
cncf/CNCF project - GTR.md | 18 ++++++++++++------
cncf/Security Self-Assessment.md | 1 +
2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/cncf/CNCF project - GTR.md b/cncf/CNCF project - GTR.md
index 5783200..e799b17 100644
--- a/cncf/CNCF project - GTR.md
+++ b/cncf/CNCF project - GTR.md
@@ -174,13 +174,15 @@ The project configuration is designed to work well with minimal configuration. T

The default configuration is designed to work well on small/mid-sized clusters, i.e. between roughly 5 and 50 nodes, with a default sampling interval set to 50 in order to preserve resource usage (as opposed to an interval of 1, which would capture all the traffic). On bigger cluster topologies, it is recommended to optimize carefully.

-Best effort is done to achieve security by default, but this is sometimes too dependent on the environment.
For instance, while a network policy is installed by default in OpenShift, it is not when running in a different environment, as this may break with some CNIs. In that case, enabling the network policy must be done explicitely, or the user can configure their own policy.
+Best effort is done to achieve security by default, but this is sometimes too dependent on the environment. For instance, while a network policy is installed by default in OpenShift with OVN-Kubernetes, it is not when running in a different environment, as this may break with some CNIs. In that case, enabling the network policy must be done explicitly, or the user can configure their own policy.

Loki must be configured according to its installation, disabled, or enabled in "demo" mode. Prometheus querier URL must be configured. It is recommended to enable the embedded network policy, or to install one. In OpenShift, Prometheus and the network policy are enabled and configured automatically.

+
The project release process is split between upstream and downstream releases.

For both of them, content can be tracked from the repositories, which are public.

@@ -224,13 +226,17 @@ While security limitations are not hidden, they may not be very visible. This is

**How do you recommend users alter security defaults in order to "loosen" the security of the project? Please link to any documentation the project has written concerning these use cases.**

-We are not currently emphasizing the security risks associated with relaxing the default settings. This is something to improve.
+We are not currently emphasizing the security risks associated with relaxing the default settings. This is something we [plan to improve](./roadmap.md).

**Security Hygiene**

-
+
+We use SAST and DAST frameworks to maintain basic health and security:
+
+- The project uses GitHub pull requests for all contributions, configured with branch protections, enforced 2FA, required approvals, and CI job checks.
+  - CI jobs include compiling, linting, image-building and automated tests.
+- The maintainers team has access to a Snyk dashboard.
+- A combination of Dependabot and Renovate is used to track outdated dependencies.
+- Vulnerabilities in dependencies are reported using GitHub security advisories, as well as Clair.

**Cloud Native Threat Modeling**