From 8d540f1f09e4b18dac6aa8718730679081afb84f Mon Sep 17 00:00:00 2001
From: Miriam Weiss <miweiss@redhat.com>
Date: Tue, 19 May 2026 13:10:36 +0300
Subject: [PATCH] CNV-80564: Document Node Memory Overview dashboard for
 Virtualization

---
 modules/virt-node-memory-dashboard.adoc      | 96 ++++++++++++++++++++
 virt/monitoring/virt-prometheus-queries.adoc |  2 +
 2 files changed, 98 insertions(+)
 create mode 100644 modules/virt-node-memory-dashboard.adoc

diff --git a/modules/virt-node-memory-dashboard.adoc b/modules/virt-node-memory-dashboard.adoc
new file mode 100644
index 000000000000..cd4e765e7324
--- /dev/null
+++ b/modules/virt-node-memory-dashboard.adoc
@@ -0,0 +1,96 @@
+// Module included in the following assemblies:
+//
+// * virt/monitoring/virt-prometheus-queries.adoc
+
+:_mod-docs-content-type: REFERENCE
+[id="virt-node-memory-dashboard_{context}"]
+= Node Memory dashboard
+
+[role="_abstract"]
+The Node Memory dashboard displays physical and virtual memory utilization across the cluster, focusing on how virtual machines (VMs) affect node memory.
+
+Use this dashboard to monitor memory capacity, detect memory pressure, identify overcommitment risks, and validate that system processes do not exceed their reserved memory.
+
+You can access this dashboard from the web console in *Observe* -> *Dashboards (Perses)*.
+
+== Dashboard filters
+
+The dashboard provides two filter variables in the top toolbar:
+
+[cols="3",options="header"]
+|===
+| Filter | Description | Default
+| *Node* | Filters all panels to display information about one or more nodes in the cluster. | *All*
+| *role* | Filters all panels to display information about nodes with a specific Kubernetes role, populated from `kube_node_role`. | *worker*
+|===
+
+You can also use the top toolbar of the dashboard to change the time range that the panels display.
+
+== Summary section
+
+The *Summary* section provides a high-level overview of cluster memory health. The panels in this section provide general memory utilization data for the cluster and the nodes.
+
+== Cluster section
+
+The *Cluster* section provides a more detailed view of cluster-wide memory behavior over a range of time. This section includes the following panels:
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *Physical Memory Utilization & Requests* | Time series | Displays the total node memory capacity alongside actual memory utilization, split into virtualization and non-virtualization workloads. This panel also shows the memory request plan (system-reserved requests and pod requests) so that you can compare actual usage with what the scheduler expects.
+| *Cluster Utilization* | Gauge | Displays the aggregated memory utilization of the nodes in the cluster. This panel also appears in the *Summary* section.
+| *Virtual Memory Assignment* | Time series | Displays the worst-case scenario in which VMs use all of their assigned virtual memory at the same time. It shows total node capacity, non-virtualization utilization, and non-virtualization utilization combined with the total assigned VM memory. If the *VM assigned virtual memory* line nears or exceeds *Node capacity*, the cluster risks out-of-memory conditions under full VM memory pressure.
+| *Cluster Virtual Committed* | Gauge | Displays the percentage of committed virtual memory out of all of the allocatable physical memory. This panel also appears in the *Summary* section.
+| *Cluster - Memory Pressure* | Time series | Displays cluster-level memory pressure stall information (PSI) rates for `Waiting` (processes delayed by memory) and `Stalled` (processes completely blocked).
+| *Cluster - Aggregated Swap* | Time series | Shows total swap capacity and usage across all the nodes that you select. Rising swap usage indicates memory pressure that has not yet caused PSI stalls.
+|===
+
+== Nodes section
+
+The *Nodes* section breaks down memory information per node to help identify imbalances. This section includes the following panels:
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *Utilization - Actual Overcommit Level* | Time series | Displays the per-node allocatable memory utilization minus the system-reserved memory.
+| *Node Utilization - min* | Gauge | Displays the lowest node utilization in the cluster. This panel also appears in the *Summary* section.
+| *Plan - Pod Requests per Node* | Time series | Shows the memory request fill level per node: the sum of all pod memory requests on a node divided by the node's allocatable memory.
+| *Node Requests - min/max* | Stat | Displays the minimum and maximum pod request ratios per node. Use this panel to quickly assess the cluster's remaining capacity to host new workloads.
+| *Plan - Virtual Memory Commit Level* | Time series | Displays the virtual memory commit ratio per node. The panel compares total active and assigned VM memory against the node's available memory.
+| *Node Virtual - min/max* | Stat | Displays the minimum and maximum virtual commit levels per node.
+| *Node - Pressure* | Time series | Displays the memory PSI waiting rate per node. Each line represents one node to help you identify which nodes experience memory contention.
+| *Node PSI - max* | Gauge | Displays the highest PSI value across all nodes. PSI shows the amount of time applications are stalled or delayed waiting for memory resources. This panel also appears in the *Summary* section.
+|===
+
+== System Reserved section
+
+The *System Reserved* section monitors whether the system (hypervisor) processes stay within their reserved memory budget. This section includes the following panels:
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *Utilization - Reserved System Memory* | Time series | Displays the top 5 nodes by system-reserved memory utilization. This metric compares active system process memory against the reserved memory budget.
+| *Utilization - min/max* | Time series | Displays the minimum and maximum system-reserved memory utilization across all nodes over time.
+| *System Exceeding Reservation* | Stat | Displays the percentage or number of nodes out of all monitored nodes that are currently triggering the `SystemMemoryExceedsReservation` alert. This panel also appears in the *Summary* section.
+|===
+
+== Workloads section
+
+The *Workloads* section, which is collapsed by default, focuses on individual VM memory behavior. This section includes the following panels:
+
+[cols="3",options="header"]
+|===
+| Panel | Type | Description
+| *VM Overcommit Ratio* | Time series | Displays the ratio of assigned virtual memory to pod memory requests for each VM.
+| *VM Virtual Committed* | Gauge | Displays the average VM overcommit ratio, which includes launcher overhead. This panel also appears in the *Summary* section.
+| *VM Virtual Memory Utilization vs Host VM Utilization* | Time series | Displays the 10 VMs with the highest ratio between guest-reported memory usage and host-side container memory usage. Use this panel to identify VMs where guest-reported usage differs significantly from host-side accounting. A large difference can reflect balloon driver activity or point to memory accounting discrepancies.
+| *Number of Running VMs* | Time series | Displays the total number of running virtual machine instances (VMIs) on the cluster to provide context for the other workload panels.
+|===
+
+== Interpreting the dashboard
+
+Monitor the dashboard indicators in the *Summary* section to identify early warning signs of memory pressure and prevent critical out-of-memory events.
+
+Healthy state:: The *Cluster Utilization* panel gauge is green (below 70%), the *Cluster Virtual Committed* panel gauge is below 120%, the *Node PSI - max* values are near zero, and no nodes exceed their system reservation.
+Warning signs:: Utilization gauges turn amber (80% to 90%), the virtual commit ratio approaches 150%, or individual nodes diverge significantly from the cluster average, which suggests imbalanced scheduling.
+Critical state:: Utilization gauges turn red (above 90%), PSI values exceed 0.5, system reservation is exceeded on any node, or virtual commit ratios per node exceed 200%. These conditions indicate that the cluster is at risk of out-of-memory events and VM eviction.
diff --git a/virt/monitoring/virt-prometheus-queries.adoc b/virt/monitoring/virt-prometheus-queries.adoc
index 867fb41c17a0..61688110171e 100644
--- a/virt/monitoring/virt-prometheus-queries.adoc
+++ b/virt/monitoring/virt-prometheus-queries.adoc
@@ -28,6 +28,8 @@ include::modules/virt-querying-metrics.adoc[leveloffset=+1]
 
 include::modules/virt-live-migration-metrics.adoc[leveloffset=+2]
 
+include::modules/virt-node-memory-dashboard.adoc[leveloffset=+1]
+
 [id="additional-resources_virt-prometheus-queries"]
 [role="_additional-resources"]
 == Additional resources