CNV-80564: Document Node Memory Overview dashboard for Virtualization #111888
Open: MirzWeiss wants to merge 1 commit into openshift:main from MirzWeiss:CNV-80564
+98
−0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| // Module included in the following assemblies: | ||
| // | ||
| // * virt/support/virt-prometheus-queries.adoc | ||
|
|
||
| :_mod-docs-content-type: REFERENCE | ||
| [id="virt-node-memory-dashboard_{context}"] | ||
| = Node Memory dashboard | ||
|
|
||
| [role="_abstract"] | ||
| The Node Memory dashboard displays physical and virtual memory utilization across the cluster, focusing on how virtual machines (VMs) affect node memory. | ||
|
|
||
| Use this dashboard to monitor memory capacity, detect memory pressure, identify overcommitment risks, and validate that system processes do not exceed their reserved memory. | ||
|
|
||
| You can access this dashboard from the web console in *Observe* -> *Dashboards (Perses)*. | ||
|
|
||
| == Dashboard filters | ||
|
|
||
| The dashboard provides two filter variables in the top toolbar: | ||
|
|
||
| [cols="3",options="header"] | ||
| |=== | ||
| | Filter | Description | Default | ||
| | *Node* | Filters all panels to display information about one or more nodes in the cluster. | *All* | ||
| | *role* | Filters all panels to display information about nodes with a specific Kubernetes role, populated from `kube_node_role`. | *worker* | ||
| |=== | ||
|
|
||
| You can also use the top toolbar of the dashboard to change the time range that the panels display. | ||
|
|
||
| == Summary section | ||
|
|
||
| The *Summary* section provides a high-level overview of cluster memory health. The panels in this section provide general memory utilization data for the cluster and the nodes. | ||
|
|
||
| == Cluster section | ||
|
|
||
| The *Cluster* section provides a more detailed view of cluster-wide memory behavior over a range of time. This section includes the following panels: | ||
|
|
||
| [cols="3",options="header"] | ||
| |=== | ||
| | Panel | Type | Description | ||
| | *Physical Memory Utilization & Requests* | Time series | Displays the total node memory capacity alongside actual memory utilization split into virtualization and non-virtualization workloads. This panel also shows the memory request plan (system-reserved requests and pod requests) so that you can compare actual usage with what the scheduler expects. | ||
| | *Cluster Utilization* | Gauge | Displays the aggregated memory utilization of the nodes in the cluster. This panel also appears in the *Summary* section. | ||
| | *Virtual Memory Assignment* | Time series | Displays the worst-case scenario in which VMs use all of their assigned virtual memory at the same time. The panel shows total node capacity, utilization without virtualization, and non-virtualization utilization combined with the total assigned VM memory. If the *VM assigned virtual memory* line nears or exceeds *Node capacity*, the cluster risks out-of-memory conditions under full VM memory pressure. | ||
| | *Cluster Virtual Committed* | Gauge | Displays the percentage of committed virtual memory out of all of the allocatable physical memory. This panel also appears in the *Summary* section. | ||
| | *Cluster - Memory Pressure* | Time series | Shows cluster-level memory pressure stall information (PSI) rates for `Waiting` (processes delayed by memory) and `Stalled` (processes completely blocked). | ||
| | *Cluster - Aggregated Swap* | Time series | Shows total swap capacity and usage across all the nodes that you select. Rising swap usage indicates memory pressure that has not yet caused PSI stalls. | ||
| |=== | ||
|
|
||
| == Nodes section | ||
|
|
||
| The *Nodes* section breaks down memory information per node to help identify imbalances. This section includes the following panels: | ||
|
|
||
| [cols="3",options="header"] | ||
| |=== | ||
| | Panel | Type | Description | ||
| | *Utilization - Actual Overcommit Level* | Time series | Displays the per-node allocatable memory utilization minus the system-reserved memory. | ||
| | *Node Utilization - min* | Gauge | Displays the lowest node utilization in the cluster. This panel also appears in the *Summary* section. | ||
| | *Plan - Pod Requests per Node* | Time series | Shows the memory request fill level per node: the sum of all pod memory requests on a node divided by the node's allocatable memory. | ||
| | *Node Requests - min/max* | Stat | Displays the minimum and maximum pod request ratios per node. Use this panel to quickly assess the cluster's remaining capacity to host new workloads. | ||
| | *Plan - Virtual Memory Commit Level* | Time series | Displays the virtual memory commit ratio per node. The panel compares total active and assigned VM memory against the node's available memory. | ||
| | *Node Virtual - min/max* | Stat | Displays the minimum and maximum virtual commit levels per node. | ||
| | *Node - Pressure* | Time series | Displays the memory PSI waiting rate per node. Each line represents one node to help you identify which nodes experience memory contention. | ||
| | *Node PSI - max* | Gauge | Displays the highest PSI value across all nodes. PSI shows the amount of time applications are stalled or delayed waiting for memory resources. This panel also appears in the *Summary* section. | ||
| |=== | ||
|
|
||
| == System Reserved section | ||
|
|
||
| The *System Reserved* section monitors whether the system (hypervisor) processes stay within their reserved memory budget. This section includes the following panels: | ||
|
|
||
| [cols="3",options="header"] | ||
| |=== | ||
| | Panel | Type | Description | ||
| | *Utilization - Reserved System Memory* | Time series | Displays the top 5 nodes by system-reserved memory utilization. This metric compares active system process memory against the reserved memory budget. | ||
| | *Utilization - min/max* | Time series | Displays the minimum and maximum system-reserved memory utilization across all nodes over time. | ||
| | *System Exceeding Reservation* | Stat | Displays the percentage or number of nodes out of all monitored nodes that are currently triggering the `SystemMemoryExceedsReservation` alert. This panel also appears in the *Summary* section. | ||
| |=== | ||
|
|
||
| == Workloads section | ||
|
|
||
| The *Workloads* section, which is collapsed by default, focuses on individual VM memory behavior. | ||
|
|
||
| [cols="3",options="header"] | ||
| |=== | ||
| | Panel | Type | Description | ||
| | *VM Overcommit Ratio* | Time series | Displays the ratio of assigned virtual memory to pod memory requests for each VM. | ||
| | *VM Virtual Committed* | Gauge | Displays the average VM overcommit ratio, which includes launcher overhead. This panel also appears in the *Summary* section. | ||
| | *VM Virtual Memory Utilization vs Host VM Utilization* | Time series | Displays the 10 VMs with the highest ratio between guest-reported memory usage and host-side container memory usage. Use this panel to identify VMs where guest-reported usage differs significantly from host-side accounting. A large difference can indicate balloon driver effectiveness or memory accounting discrepancies. | ||
| | *Number of Running VMs* | Time series | Displays the total number of running virtual machine instances (VMIs) on the cluster to provide context for the other workload panels. | ||
| |=== | ||
|
|
||
| == Interpreting the dashboard | ||
|
|
||
| Monitor the dashboard indicators in the *Summary* section to identify early warning signs of memory pressure and prevent critical out-of-memory events. | ||
|
|
||
| Healthy state:: The *Cluster Utilization* panel gauge is green (below 70%), the *Cluster Virtual Committed* panel gauge is below 120%, the *Node PSI - max* values are near zero, and no nodes exceed their system reservation. | ||
| Warning signs:: Utilization gauges turn amber (80% to 90%), the virtual commit approaches 150%, or individual nodes diverge significantly from the cluster average, which suggests imbalanced scheduling. | ||
| Critical state:: Utilization gauges turn red (above 90%), PSI values exceed 0.5, system reservation is exceeded on any node, or virtual commit ratios per node exceed 200%. These conditions indicate that the cluster is at risk of out-of-memory events and VM eviction. | ||
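
The three states in the *Interpreting the dashboard* list can be read as a small decision rule. The Python sketch below is only an illustration of that reading, not part of the dashboard or of any OpenShift API. The class and field names are hypothetical, the `PSI_NEAR_ZERO` cutoff is an assumption (the text says "near zero" without giving a number), and readings that are neither clearly healthy nor critical are treated as warnings, which is also an assumption: the text does not say how to classify utilization between 70% and 80%, for example.

```python
from dataclasses import dataclass


@dataclass
class SummaryReading:
    """Values read off the Summary panels (all field names are hypothetical)."""
    cluster_utilization_pct: float        # Cluster Utilization gauge
    cluster_virtual_committed_pct: float  # Cluster Virtual Committed gauge
    max_node_virtual_commit_pct: float    # highest per-node virtual commit level
    node_psi_max: float                   # Node PSI - max gauge
    nodes_exceeding_reservation: int      # System Exceeding Reservation stat


# Assumption: the documentation only says "near zero"; 0.05 is a placeholder.
PSI_NEAR_ZERO = 0.05


def classify(r: SummaryReading) -> str:
    """Map a reading to 'healthy', 'warning', or 'critical'."""
    # Critical: any single condition from the "Critical state" entry is enough.
    if (
        r.cluster_utilization_pct > 90
        or r.node_psi_max > 0.5
        or r.nodes_exceeding_reservation > 0
        or r.max_node_virtual_commit_pct > 200
    ):
        return "critical"
    # Healthy: every condition from the "Healthy state" entry must hold.
    # (No node exceeds its reservation here, or we would have returned above.)
    if (
        r.cluster_utilization_pct < 70
        and r.cluster_virtual_committed_pct < 120
        and r.node_psi_max <= PSI_NEAR_ZERO
    ):
        return "healthy"
    # Assumption: anything in between is reported as a warning.
    return "warning"
```

For example, under these assumptions `classify(SummaryReading(55, 100, 110, 0.0, 0))` returns `"healthy"`, while the same reading with one node exceeding its system reservation returns `"critical"`.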