From 8d540f1f09e4b18dac6aa8718730679081afb84f Mon Sep 17 00:00:00 2001
From: Miriam Weiss
Date: Tue, 19 May 2026 13:10:36 +0300
Subject: [PATCH] CNV-80564: Document Node Memory Overview dashboard for Virtualization

---
 modules/virt-node-memory-dashboard.adoc      | 96 ++++++++++++++++++++
 virt/monitoring/virt-prometheus-queries.adoc |  2 +
 2 files changed, 98 insertions(+)
 create mode 100644 modules/virt-node-memory-dashboard.adoc

diff --git a/modules/virt-node-memory-dashboard.adoc b/modules/virt-node-memory-dashboard.adoc
new file mode 100644
index 000000000000..cd4e765e7324
--- /dev/null
+++ b/modules/virt-node-memory-dashboard.adoc
@@ -0,0 +1,96 @@
+// Module included in the following assemblies:
+//
+// * virt/monitoring/virt-prometheus-queries.adoc
+
+:_mod-docs-content-type: REFERENCE
+[id="virt-node-memory-dashboard_{context}"]
+= Node Memory dashboard
+
+[role="_abstract"]
+The Node Memory dashboard displays physical and virtual memory utilization across the cluster, focusing on how virtual machines (VMs) affect node memory.
+
+Use this dashboard to monitor memory capacity, detect memory pressure, identify overcommitment risks, and validate that system processes do not exceed their reserved memory.
+
+You can access this dashboard from the web console in *Observe* -> *Dashboards (Perses)*.
+
+== Dashboard filters
+
+The dashboard provides two filter variables in the top toolbar:
+
+[cols="3",options="header"]
+|===
+| Filter | Description | Default
+| *Node* | Filters all panels to display information about one or more nodes in the cluster. | *All*
+| *role* | Filters all panels to display information about nodes with a specific Kubernetes role, populated from `kube_node_role`. | *worker*
+|===
+
+You can also change the time range displayed in the different panels in the top toolbar of the dashboard.
+
+== Summary section
+
+The *Summary* section provides a high-level overview of cluster memory health. The panels in this section provide general memory utilization data for the cluster and the nodes.
+
+== Cluster section
+
+The *Cluster* section provides a more detailed view of cluster-wide memory behavior over a range of time. This section includes the following panels:
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *Physical Memory Utilization & Requests* | Time series | Displays the total node memory capacity alongside actual memory utilization split into virtualization and non-virtualization workloads. This panel also shows the memory request plan (system-reserved requests and pod requests) so that you can compare actual usage to the scheduler expectations.
+| *Cluster Utilization* | Gauge | Displays the aggregated memory utilization of the nodes in the cluster. This panel also appears in the *Summary* section.
+| *Virtual Memory Assignment* | Time series | Displays the worst-case scenario in which all assigned virtual memory is in use at the same time. It shows total node capacity, utilization without virtualization, and non-virtualization utilization combined with the total assigned VM memory. If the *VM assigned virtual memory* line nears or exceeds *Node capacity*, the cluster risks out-of-memory conditions under full VM memory pressure.
+| *Cluster Virtual Committed* | Gauge | Displays the percentage of committed virtual memory out of all of the allocatable physical memory. This panel also appears in the *Summary* section.
+| *Cluster - Memory Pressure* | Time series | Shows cluster-level memory pressure stall information (PSI) rates for `Waiting` (processes delayed by memory) and `Stalled` (processes completely blocked).
+| *Cluster - Aggregated Swap* | Time series | Shows total swap capacity and usage across all the nodes that you select. Rising swap usage indicates memory pressure that has not yet caused PSI stalls.
+|===
+
+== Nodes section
+
+The *Nodes* section breaks down memory information per node to help identify imbalances. This section includes the following panels:
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *Utilization - Actual Overcommit Level* | Time series | Displays the per-node allocatable memory utilization minus the system-reserved memory.
+| *Node Utilization - min* | Gauge | Displays the lowest node utilization in the cluster. This panel also appears in the *Summary* section.
+| *Plan - Pod Requests per Node* | Time series | Shows the memory request fill level per node: the sum of all pod memory requests on a node divided by the node's allocatable memory.
+| *Node Requests - min/max* | Stat | Displays the minimum and maximum pod request ratios per node. Use this panel to quickly assess the cluster's remaining capacity to host new workloads.
+| *Plan - Virtual Memory Commit Level* | Time series | Displays the virtual memory commit ratio per node. The panel compares total active and assigned VM memory against the node's available memory.
+| *Node Virtual - min/max* | Stat | Displays the minimum and maximum virtual commit levels per node.
+| *Node - Pressure* | Time series | Displays the memory PSI waiting rate per node. Each line represents one node to help you identify which nodes experience memory contention.
+| *Node PSI - max* | Gauge | Displays the highest PSI value across all nodes. PSI shows the amount of time applications are stalled or delayed waiting for memory resources. This panel also appears in the *Summary* section.
+|===
+
+== System Reserved section
+
+The *System Reserved* section monitors whether the system (hypervisor) processes stay within their reserved memory budget. This section includes the following panels:
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *Utilization - Reserved System Memory* | Time series | Displays the top 5 nodes by system-reserved memory utilization. This metric compares active system process memory against the reserved memory budget.
+| *Utilization - min/max* | Time series | Displays the minimum and maximum system-reserved memory utilization across all nodes over time.
+| *System Exceeding Reservation* | Stat | Displays the percentage or number of nodes out of all monitored nodes that are currently triggering the `SystemMemoryExceedsReservation` alert. This panel also appears in the *Summary* section.
+|===
+
+== Workloads section
+
+The *Workloads* section, which is collapsed by default, focuses on individual VM memory behavior.
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *VM Overcommit Ratio* | Time series | Displays the ratio of assigned virtual memory to pod memory requests for each VM.
+| *VM Virtual Committed* | Gauge | Displays the average VM overcommit ratio, which includes launcher overhead. This panel also appears in the *Summary* section.
+| *VM Virtual Memory Utilization vs Host VM Utilization* | Time series | Displays the 10 VMs with the highest ratio between guest-reported memory usage and host-side container memory usage. Use this panel to identify VMs where guest-reported usage differs significantly from host-side accounting. This difference can indicate balloon driver effectiveness or memory accounting discrepancies.
+| *Number of Running VMs* | Time series | Displays the total number of running virtual machine instances (VMIs) on the cluster to provide context for the other workload panels.
+|===
+
+== Interpreting the dashboard
+
+Monitor the dashboard indicators in the *Summary* section to identify early warning signs of memory pressure and prevent critical out-of-memory events.
+
+Healthy state:: The *Cluster Utilization* panel gauge is green (below 70%), the *Cluster Virtual Committed* panel gauge is below 120%, the *Node PSI - max* values are near zero, and no nodes exceed their system reservation.
+Warning signs:: Utilization gauges turn amber (80% to 90%), the virtual commit approaches 150%, or individual nodes diverge significantly from the cluster average, which suggests imbalanced scheduling.
+Critical state:: Utilization gauges turn red (above 90%), PSI values exceed 0.5, system reservation is exceeded on any node, or virtual commit ratios per node exceed 200%. These conditions indicate that the cluster is at risk of out-of-memory events and VM eviction.
diff --git a/virt/monitoring/virt-prometheus-queries.adoc b/virt/monitoring/virt-prometheus-queries.adoc
index 867fb41c17a0..61688110171e 100644
--- a/virt/monitoring/virt-prometheus-queries.adoc
+++ b/virt/monitoring/virt-prometheus-queries.adoc
@@ -28,6 +28,8 @@ include::modules/virt-querying-metrics.adoc[leveloffset=+1]
 
 include::modules/virt-live-migration-metrics.adoc[leveloffset=+2]
 
+include::modules/virt-node-memory-dashboard.adoc[leveloffset=+1]
+
 [id="additional-resources_virt-prometheus-queries"]
 [role="_additional-resources"]
 == Additional resources