96 changes: 96 additions & 0 deletions modules/virt-node-memory-dashboard.adoc
@@ -0,0 +1,96 @@
// Module included in the following assemblies:
//
// * virt/monitoring/virt-prometheus-queries.adoc

:_mod-docs-content-type: REFERENCE
[id="virt-node-memory-dashboard_{context}"]
= Node Memory dashboard

[role="_abstract"]
The Node Memory dashboard displays physical and virtual memory utilization across the cluster, focusing on how virtual machines (VMs) affect node memory.

Use this dashboard to monitor memory capacity, detect memory pressure, identify overcommitment risks, and validate that system processes do not exceed their reserved memory.

You can access this dashboard from the web console in *Observe* -> *Dashboards (Perses)*.

== Dashboard filters

The dashboard provides two filter variables in the top toolbar:

[cols="3",options="header"]
|===
| Filter | Description | Default
| *Node* | Filters all panels to display information about one or more nodes in the cluster. | *All*
| *role* | Filters all panels to display information about nodes with a specific Kubernetes role, populated from `kube_node_role`. | *worker*
|===

You can also use the top toolbar of the dashboard to change the time range that the panels display.

== Summary section

The *Summary* section provides a high-level overview of cluster memory health. The panels in this section provide general memory utilization data for the cluster and the nodes.

== Cluster section

The *Cluster* section provides a more detailed view of cluster-wide memory behavior over a range of time. This section includes the following panels:

[cols="3",options="header"]
|===
| Panel | Type | Description
| *Physical Memory Utilization & Requests* | Time series | Displays the total node memory capacity alongside actual memory utilization split into virtualization and non-virtualization workloads. This panel also shows the memory request plan (system-reserved requests and pod requests) so that you can compare actual usage to the scheduler expectations.
| *Cluster Utilization* | Gauge | Displays the aggregated memory utilization of the nodes in the cluster. This panel also appears in the *Summary* section.
| *Virtual Memory Assignment* | Time series | Displays the worst-case scenario in which VMs use all of their assigned virtual memory at the same time. The panel shows total node capacity, utilization without virtualization, and non-virtualization utilization combined with the total assigned VM memory. If the *VM assigned virtual memory* line nears or exceeds *Node capacity*, the cluster risks out-of-memory conditions under full VM memory pressure.
| *Cluster Virtual Committed* | Gauge | Displays the percentage of committed virtual memory out of all of the allocatable physical memory. This panel also appears in the *Summary* section.
| *Cluster - Memory Pressure* | Time series | Displays cluster-level memory pressure stall information (PSI) rates for `Waiting` (processes delayed by memory) and `Stalled` (processes completely blocked).
| *Cluster - Aggregated Swap* | Time series | Shows total swap capacity and usage across all the nodes that you select. Rising swap usage indicates memory pressure that has not yet caused PSI stalls.
|===
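
The worst-case comparison behind the *Virtual Memory Assignment* panel can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual query: the function name, parameter names, and sample values are hypothetical, and it assumes that the worst case is simply non-virtualization usage plus all assigned VM memory.

```python
GIB = 1024 ** 3


def worst_case_headroom(node_capacity_bytes, non_virt_used_bytes, vm_assigned_bytes):
    """Return the memory left if every VM used all of its assigned memory.

    A negative result means the worst case exceeds node capacity.
    All names here are hypothetical and chosen for illustration only.
    """
    worst_case_bytes = non_virt_used_bytes + vm_assigned_bytes
    return node_capacity_bytes - worst_case_bytes


# Hypothetical sample values: 512 GiB capacity, 96 GiB non-VM usage,
# and 448 GiB of memory assigned to VMs.
headroom = worst_case_headroom(512 * GIB, 96 * GIB, 448 * GIB)
at_risk = headroom < 0
```

With these sample values the worst case is 544 GiB against 512 GiB of capacity, so the headroom is negative and the cluster would be at risk under full VM memory pressure.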

== Nodes section

The *Nodes* section breaks down memory information per node to help identify imbalances. This section includes the following panels:

[cols="3",options="header"]
|===
| Panel | Type | Description
| *Utilization - Actual Overcommit Level* | Time series | Displays the per-node allocatable memory utilization minus the system-reserved memory.
| *Node Utilization - min* | Gauge | Displays the lowest node utilization in the cluster. This panel also appears in the *Summary* section.
| *Plan - Pod Requests per Node* | Time series | Shows the memory request fill level per node: the sum of all pod memory requests on a node divided by the node's allocatable memory.
| *Node Requests - min/max* | Stat | Displays the minimum and maximum pod request ratios per node. Use this panel to quickly assess the cluster's remaining capacity to host new workloads.
| *Plan - Virtual Memory Commit Level* | Time series | Displays the virtual memory commit ratio per node. The panel compares total active and assigned VM memory against the node's available memory.
| *Node Virtual - min/max* | Stat | Displays the minimum and maximum virtual commit levels per node.
| *Node - Pressure* | Time series | Displays the memory PSI waiting rate per node. Each line represents one node to help you identify which nodes experience memory contention.
| *Node PSI - max* | Gauge | Displays the highest PSI value across all nodes. PSI shows the amount of time applications are stalled or delayed waiting for memory resources. This panel also appears in the *Summary* section.
|===
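
As a rough illustration of the request fill level described for the *Plan - Pod Requests per Node* and *Node Requests - min/max* panels, the following Python sketch divides the sum of pod memory requests by the allocatable memory of each node. The function name, node names, and values are hypothetical, and the dashboard itself computes these ratios from cluster metrics rather than from code like this.

```python
GIB = 1024 ** 3


def request_fill_ratio(pod_requests_bytes, allocatable_bytes):
    """Sum of pod memory requests on a node divided by its allocatable memory."""
    if allocatable_bytes <= 0:
        raise ValueError("allocatable memory must be positive")
    return sum(pod_requests_bytes) / allocatable_bytes


# Hypothetical nodes: (pod memory requests, allocatable memory).
nodes = {
    "node-a": ([8 * GIB, 16 * GIB, 4 * GIB], 64 * GIB),
    "node-b": ([48 * GIB], 64 * GIB),
}

ratios = {
    name: request_fill_ratio(requests, allocatable)
    for name, (requests, allocatable) in nodes.items()
}

# The min/max pair corresponds to what a min/max stat panel would summarize.
lowest, highest = min(ratios.values()), max(ratios.values())
```

A wide gap between `lowest` and `highest` suggests that requests are unevenly distributed across nodes.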

== System Reserved section

The *System Reserved* section monitors whether the system (hypervisor) processes stay within their reserved memory budget. This section includes the following panels:

[cols="3",options="header"]
|===
| Panel | Type | Description
| *Utilization - Reserved System Memory* | Time series | Displays the top 5 nodes by system-reserved memory utilization. This metric compares active system process memory against the reserved memory budget.
| *Utilization - min/max* | Time series | Displays the minimum and maximum system-reserved memory utilization across all nodes over time.
| *System Exceeding Reservation* | Stat | Displays the percentage or number of nodes out of all monitored nodes that are currently triggering the `SystemMemoryExceedsReservation` alert. This panel also appears in the *Summary* section.
|===

== Workloads section

The *Workloads* section, which is collapsed by default, focuses on individual VM memory behavior. This section includes the following panels:

[cols="3",options="header"]
|===
| Panel | Type | Description
| *VM Overcommit Ratio* | Time series | Displays the ratio of assigned virtual memory to pod memory requests for each VM.
| *VM Virtual Committed* | Gauge | Displays the average VM overcommit ratio, which includes launcher overhead. This panel also appears in the *Summary* section.
| *VM Virtual Memory Utilization vs Host VM Utilization* | Time series | Displays the 10 VMs with the highest ratio between guest-reported memory usage and host-side container memory usage. Use this panel to identify VMs where guest-reported usage differs significantly from host-side accounting. This difference indicates balloon driver effectiveness or memory accounting discrepancies.
| *Number of Running VMs* | Time series | Displays the total number of running virtual machine instances (VMIs) on the cluster to provide context for the other workload panels.
|===

== Interpreting the dashboard

Monitor the dashboard indicators in the *Summary* section to identify early warning signs of memory pressure and prevent critical out-of-memory events.

Healthy state:: The *Cluster Utilization* panel gauge is green (below 70%), the *Cluster Virtual Committed* panel gauge is below 120%, the *Node PSI - max* values are near zero, and no nodes exceed their system reservation.
Warning signs:: Utilization gauges turn amber (80% to 90%), the virtual commit approaches 150%, or individual nodes diverge significantly from the cluster average, which suggests imbalanced scheduling.
Critical state:: Utilization gauges turn red (above 90%), PSI values exceed 0.5, system reservation is exceeded on any node, or virtual commit ratios per node exceed 200%. These conditions indicate that the cluster is at risk of out-of-memory events and VM eviction.
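
The thresholds above can be combined into a simple classification, sketched here in Python. This is an illustrative reading of the guidance, not logic taken from the dashboard: the class, field, and function names are hypothetical, the `PSI_NEAR_ZERO` cutoff is an assumption because the text does not quantify "near zero", and any reading that is neither healthy nor critical is treated as a warning.

```python
from dataclasses import dataclass

# Assumption: "near zero" is not quantified in the text; 0.05 is a hypothetical cutoff.
PSI_NEAR_ZERO = 0.05


@dataclass
class SummaryReadings:
    """Hypothetical container for values read from the Summary section."""
    cluster_utilization_pct: float
    cluster_virtual_committed_pct: float
    max_node_virtual_commit_pct: float
    node_psi_max: float
    nodes_exceeding_reservation: int


def classify(r: SummaryReadings) -> str:
    # Any single critical condition is enough to report a critical state.
    if (
        r.cluster_utilization_pct > 90
        or r.node_psi_max > 0.5
        or r.nodes_exceeding_reservation > 0
        or r.max_node_virtual_commit_pct > 200
    ):
        return "critical"
    # The healthy state requires all healthy conditions to hold at once.
    if (
        r.cluster_utilization_pct < 70
        and r.cluster_virtual_committed_pct < 120
        and r.node_psi_max < PSI_NEAR_ZERO
    ):
        return "healthy"
    # Assumption: everything in between is reported as a warning.
    return "warning"
```

For example, `classify(SummaryReadings(85, 140, 150, 0.1, 0))` returns `"warning"` under these assumptions, because utilization is in the amber range and no critical condition is met.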
2 changes: 2 additions & 0 deletions virt/monitoring/virt-prometheus-queries.adoc
@@ -28,6 +28,8 @@ include::modules/virt-querying-metrics.adoc[leveloffset=+1]

include::modules/virt-live-migration-metrics.adoc[leveloffset=+2]

include::modules/virt-node-memory-dashboard.adoc[leveloffset=+1]

[id="additional-resources_virt-prometheus-queries"]
[role="_additional-resources"]
== Additional resources