From 41b05a6addefda1c6f0ed111e565179c9a37aa5e Mon Sep 17 00:00:00 2001 From: Michael Hennecke Date: Tue, 5 May 2026 13:44:48 +0200 Subject: [PATCH 1/2] DAOS-18907 doc: markdown fixes markdown fixes in admin guide docs Doc-only: true Signed-off-by: Michael Hennecke --- docs/admin/administration.md | 110 ++++++++++++------- docs/admin/common_tasks.md | 18 ++- docs/admin/deployment.md | 64 +++++------ docs/admin/env_variables.md | 6 +- docs/admin/pool_operations.md | 54 ++++----- docs/admin/predeployment_check.md | 37 ++++--- docs/admin/telemetry_guide.md | 84 +++++++------- docs/admin/tiering_uns.md | 60 +++++----- docs/admin/troubleshooting.md | 175 +++++++++++++++++++++++------- docs/admin/vmd.md | 5 +- 10 files changed, 375 insertions(+), 238 deletions(-) diff --git a/docs/admin/administration.md b/docs/admin/administration.md index a13e223fc35..7d1d85db8d9 100644 --- a/docs/admin/administration.md +++ b/docs/admin/administration.md @@ -149,18 +149,21 @@ Help Options: If an arg is not passed, then that logging parameter for each engine process is reset to the values set in the server config file that was used when starting `daos_server`. + - `--masks` will be reset to the value of the engine config `log_mask` parameter. -- `--streams` will be reset to the `env_vars` `DD_MASK` environment variable value or to an empty -string if not set. -- `--subsystems` will be reset to the `env_vars` `DD_SUBSYS` environment variable value or to an -empty string if not set. +- `--streams` will be reset to the `env_vars` `DD_MASK` environment variable value + or to an empty string if not set. +- `--subsystems` will be reset to the `env_vars` `DD_SUBSYS` environment variable value + or to an empty string if not set. Example usage: + ``` dmg server set-logmasks -m DEBUG,MEM=ERR -d mgmt,md -s server,mgmt,bio,common ``` This example would be a runtime equivalent to setting the following in the server config file: + ``` ... engines: @@ -177,7 +180,7 @@ example given above. For more information on the usage of masks (`D_LOG_MASK`), streams (`DD_MASK`) and subsystems (`DD_SUBSYS`) parameters refer to the -[`Debugging System`](https://docs.daos.io/v2.6/admin/troubleshooting/#debugging-system) section. +[Debugging System](https://docs.daos.io/v2.6/admin/troubleshooting/#debugging-system) section. ## System Monitoring @@ -295,6 +298,7 @@ prometheus --config-file=$HOME/.prometheus.yml ## Storage Operations Storage subcommands can be used to operate on host storage. + ```bash $ dmg storage --help Usage: @@ -313,6 +317,7 @@ Available commands: Storage query subcommands can be used to get detailed information about how DAOS is using host storage. + ```bash $ dmg storage query --help Usage: @@ -332,6 +337,7 @@ To query SCM and NVMe storage space usage and show how much space is available t create new DAOS pools with, run the following command: - Query Per-Server Space Utilization: + ```bash $ dmg storage query usage --help Usage: @@ -344,6 +350,7 @@ The command output shows online DAOS storage utilization, only including storage statistics for devices that have been formatted by DAOS control-plane and assigned to a currently running rank of the DAOS system. This represents the storage that can host DAOS pools. 
+ ```bash $ dmg storage query usage Hosts SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used @@ -354,11 +361,11 @@ wolf-72 6.4 TB 2.0 TB 68 % 1.5 TB 1.1 TB 27 % Note that the table values are per-host (storage server) and SCM/NVMe capacity pool component values specified in -[`dmg pool create`](https://docs.daos.io/v2.6/admin/pool_operations/#pool-creationdestroy) +[dmg pool create](https://docs.daos.io/v2.6/admin/pool_operations/#pool-creationdestroy) are per rank. If multiple ranks (I/O processes) have been configured per host in the server configuration file -[`daos_server.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml) +[daos\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml) then the values supplied to `dmg pool create` should be a maximum of the SCM/NVMe free space divided by the number of ranks per host. @@ -376,6 +383,7 @@ overhead). Useful admin dmg commands to query NVMe SSD health: - Query Per-Server Metadata: + ```bash $ dmg storage query list-devices --help Usage: @@ -391,6 +399,7 @@ Usage: -u, --uuid= Device UUID (all devices if blank) -e, --show-evicted Show only evicted faulty devices ``` + ```bash $ dmg storage query list-pools --help Usage: @@ -409,10 +418,11 @@ stored SMD device and pool tables, respectively. The device table maps the inter device UUID to attached VOS target IDs. The rank number of the server where the device is located is also listed, along with the current device state. The current device states are the following: - - NORMAL: a fully functional device in-use by DAOS - - EVICTED: the device is no longer in-use by DAOS - - UNPLUGGED: the device is currently unplugged from the system (may or not be evicted) - - NEW: the device is plugged and available and not currently in-use by DAOS + +- NORMAL: a fully functional device in-use by DAOS +- EVICTED: the device is no longer in-use by DAOS +- UNPLUGGED: the device is currently unplugged from the system (may or not be evicted) +- NEW: the device is plugged and available and not currently in-use by DAOS To list only devices in the EVICTED state, use the (--show-evicted|-e) option to the list-devices command. @@ -426,6 +436,7 @@ are both VMD devices with transport addresses in the BDF format behind the VMD a The pool table maps the DAOS pool UUID to attached VOS target IDs and will list all of the server ranks that the pool is distributed on. With the additional verbose flag, the mapping of SPDK blob IDs to VOS target IDs will also be displayed. + ```bash $ dmg -l boro-11,boro-13 storage query list-devices ------- @@ -443,6 +454,7 @@ boro-11 UUID:2ccb8afb-5d32-454e-86e3-762ec5dca7be [TrAddr:5d0505:03:00.0] Targets:[1 3] Rank:1 State:NORMAL LED:OFF ``` + ```bash $ dmg -l boro-11,boro-13 storage query list-pools ------- @@ -465,6 +477,7 @@ boro-11 ``` - Query Storage Device Health Data: + ```bash $ dmg storage query list-devices --health --help Usage: @@ -480,6 +493,7 @@ Usage: -u, --uuid= Device UUID (all devices if blank) -e, --show-evicted Show only evicted faulty devices ``` + ```bash $ dmg storage scan --nvme-health --help Usage: @@ -505,6 +519,7 @@ Note: A reasonable timed workload > 60 min must be ran for the SMART stats to re (Raw values are 65535). Media wear percentage can be calculated by dividing by 1024 to find the percentage of the maximum rated cycles. 
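To make that arithmetic concrete (the raw counter value below is purely illustrative), the
percentage can be computed directly from a raw media wear reading:

```bash
# Hypothetical raw media-wear value taken from the device health output
raw=512
# Raw values are reported out of a maximum of 65535; dividing by 1024
# yields the percentage of the maximum rated cycles consumed
awk -v r="$raw" 'BEGIN { printf "%.2f%% of rated cycles\n", r / 1024 }'
```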
+ ```bash $ dmg -l boro-11 storage query list-devices --health --uuid=d5ec1227-6f39-40db-a1f6-70245aa079f1 ------- @@ -555,8 +570,8 @@ boro-11 PLL Lock Loss Count:0 NAND Bytes Written:244081 Host Bytes Written:52114 - ``` + #### Exclusion and Hotplug - Automatic exclusion of an NVMe SSD: @@ -613,6 +628,7 @@ applied: ``` - Manually exclude an NVMe SSD: + ```bash $ dmg storage set nvme-faulty --help Usage: @@ -628,6 +644,7 @@ Usage: To manually evict an NVMe SSD (auto eviction is covered later in this section), the device state needs to be set faulty by running the following command: + ```bash $ dmg storage set nvme-faulty --host=boro-11 --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 NOTICE: This command will permanently mark the device as unusable! @@ -635,6 +652,7 @@ Are you sure you want to continue? (yes/no) yes set-faulty operation performed successfully on the following host: wolf-310:10001 ``` + The device state will transition from "NORMAL" to "EVICTED" (shown above), during which time the faulty device reaction will have been triggered (all targets on the SSD will be rebuilt). The SSD will remain evicted until device replacement occurs. @@ -649,6 +667,7 @@ The LED of the VMD device will remain in this state until replaced by a new devi unbound from the kernel driver and bound instead to a user-space driver so that the device can be used with DAOS. To rebind an SSD on a single host, run the following command (replace SSD PCI address and hostname with appropriate values): + ```bash $ dmg storage nvme-rebind -a 0000:84:00.0 -l wolf-167 Command completed successfully @@ -659,6 +678,7 @@ DAOS I/O engine processes. Now the new device can be used in the following `dmg storage replace nvme` command. - Replace an excluded SSD with a New Device: + ```bash $ dmg storage replace nvme --help Usage: @@ -674,10 +694,12 @@ Usage: To replace an NVMe SSD with an evicted device and reintegrate it into use with DAOS, run the following command: + ```bash $ dmg storage replace nvme --host=boro-11 --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=80c9f1be-84b9-4318-a1be-c416c96ca48b dev-replace operation performed successfully on the following host: boro-11:10001 ``` + The old, now replaced device will remain in an "EVICTED" state until it is unplugged. The new device will transition from a "NEW" state to a "NORMAL" state (shown above). @@ -686,11 +708,13 @@ The new device will transition from a "NEW" state to a "NORMAL" state (shown abo In order to reuse a device that was previously set as FAULTY and evicted from the DAOS system, an admin can run the following command (setting the old device UUID to be the new device UUID): + ```bash $ dmg storage replace nvme --host=boro-11 ---old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 NOTICE: Attempting to reuse a previously set FAULTY device! dev-replace operation performed successfully on the following host: boro-11:10001 ``` + The FAULTY device will transition from an "EVICTED" state back to a "NORMAL" state, and will again be available for use with DAOS. The use case of this command will mainly be for testing or for accidental device eviction. @@ -704,6 +728,7 @@ The feature supports two LED device events: locating a healthy device and locati an evicted device. 
- Locate a Healthy SSD: + ```bash $ dmg storage led identify --help Usage: @@ -721,6 +746,7 @@ Usage: To identify a single SSD, any of the Device-UUIDs can be used which can be found from output of the `dmg storage query list-devices` command: + ```bash $ dmg -l boro-11 storage led identify 6fccb374-413b-441a-bfbe-860099ac5e8d --------- @@ -733,6 +759,7 @@ boro-11 The SSD PCI address can also be used in the command to identify a SSD. The PCI address should refer to a VMD backing device and can be found from either `dmg storage scan -v` or `dmg storage query list-devices` commands: + ```bash $ dmg -l boro-11 storage led identify 850505:0b:00.0 --------- @@ -744,6 +771,7 @@ boro-11 To identify multiple SSDs, supply a comma separated list of Device-UUIDs and/or PCI addresses, adding custom timeout of 5 minutes for LED identification (time to flash LED for): + ```bash $ dmg -l boro-11 storage led identify --timeout 5 850505:0a:00.0,6fccb374-413b-441a-bfbe-860099ac5e8d,850505:11:00.0 --------- @@ -781,6 +809,7 @@ no positional arguments are supplied. To verify the LED state of SSDs the following command can be used in a similar way to the identify command: + ```bash $ dmg -l boro-11 storage led check 850505:0a:00.0,6fccb374-413b-441a-bfbe-860099ac5e8d,850505:11:00.0 --------- @@ -820,15 +849,16 @@ removed and storage wiped. System commands will be handled by a DAOS Server acting as the MS leader and listening on the address specified in the DMG config file "hostlist" parameter. See -[`daos_control.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_control.yml) +[daos\_control.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_control.yml) for details. At least one of the addresses in the hostlist parameters should match one of the `mgmt_svc_replicas` addresses specified in the server config file -[`daos_server.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml) +[daos\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml) that is supplied when starting `daos_server` instances. - Commands used to manage a DAOS System: + ```bash $ dmg system --help Usage: @@ -852,6 +882,7 @@ The system membership refers to the DAOS engine processes that have registered, or joined, a specific DAOS system. - Query System Membership: + ```bash $ dmg system query --help Usage: @@ -880,8 +911,9 @@ from the pools it hosted, please check the pool operation section on how to reintegrate an excluded engine. After one or more DAOS engines being excluded, the DAOS agent cache needs to be -refreshed. For detailed information, please refer to the [1][System Deployment -documentation]. Before refreshing the DAOS Agent cache, it should be checked +refreshed. For detailed information, please refer to the +[1][System Deployment documentation]. +Before refreshing the DAOS Agent cache, it should be checked that the exclusion information has been spread to the Management Service leader. This could be done using the `dump-attachinfo` sub-command of the `daos_agent` executable: @@ -903,12 +935,12 @@ transport_config: log_file: /var/log/daos/daos_agent-tmp.log ``` - ### Shutdown When up and running, the entire system can be shutdown. - Stop a System: + ```bash $ dmg system stop --help Usage: @@ -949,6 +981,7 @@ This is useful to stop (and restart) misbehaving engines. The system can be started backup after a controlled shutdown. - Start a System: + ```bash $ dmg system start --help Usage: @@ -978,15 +1011,17 @@ be reintegrated. 
Please see the pool operation section for more information. To reformat the system after a controlled shutdown, run the command: -`$ dmg storage format --force` +``` +$ dmg storage format --force +``` -- `--force` flag indicates that a (re)format operation should be -performed disregarding existing filesystems -- if no record of previously running ranks can be found, reformat is -performed on the hosts that are specified in the `daos_control.yml` -config file's `hostlist` parameter. -- if system membership has records of previously running ranks, storage -allocated to those ranks will be formatted +- The `--force` flag indicates that a (re)format operation should be + performed disregarding existing filesystems +- If no record of previously running ranks can be found, reformat is + performed on the hosts that are specified in the `daos_control.yml` + config file's `hostlist` parameter. +- If system membership has records of previously running ranks, storage + allocated to those ranks will be formatted The output table will indicate action and result. @@ -1008,7 +1043,6 @@ DAOS I/O Engines will be started, and all DAOS pools will have been removed. ``` Then restart DAOS Servers and format. - ### Storage Format Replace If storage metadata for a rank is lost, for example after losing PMem contents after NVDIMM failure, @@ -1017,6 +1051,7 @@ the storage server has not changed the old rank can be "reused" by formatting us `dmg storage format --replace` option. An examples workflow would be: + - `daos_server` is running and PMem NVDIMM fails causing an engine to enter excluded state. - `daos_server` is stopped, storage server powered down, faulty PMem NVDIMM is replaced. - After powering up storage server, `daos_server scm prepare` command is used to repair PMem. @@ -1052,7 +1087,6 @@ formatted again by running `dmg storage format`. A reboot will be required to finalize the change of the PMem allocation goals. - ### System Extension To add a new server to an existing DAOS system, one should install: @@ -1101,6 +1135,7 @@ An administrator may add or remove hosts from the MS replica list. 5. Restart all `daos_server` and `daos_agent` processes. To verify that the updated MS replicas came up correctly: + 1. Use the `dmg system query` command to check that all expected ranks have come up in the Joined state. The command should not time out. 2. Use the `dmg system leader-query` to ensure a leader election has completed. @@ -1109,8 +1144,7 @@ To verify that the updated MS replicas came up correctly: When removing or replacing MS replicas, do *not* replace all old replicas with new ones. At least one old replica must remain in the list to act as a data source for - the new replicas. - + the new replicas. ## Software Upgrade @@ -1142,7 +1176,7 @@ The following table is intended to visually depict the interoperability policies for all major components in a DAOS system. -||Server
(daos_server)|Engine<br>(daos_engine)|Agent<br>(daos_agent)|Client<br>(libdaos)|Admin<br>(dmg)|
+||Server<br>(daos\_server)|Engine<br>(daos\_engine)|Agent<br>(daos\_agent)|Client<br>(libdaos)|Admin<br>
(dmg)| |:---|:---:|:---:|:---:|:---:|:---:| |Server|x.y.z|x.y.z|x.(y±1)|n/a|x.y| |Engine|x.y.z|x.y.z|n/a|x.(y±1)|n/a| @@ -1151,14 +1185,16 @@ policies for all major components in a DAOS system. |Admin|x.y|n/a|n/a|n/a|n/a| Key: - * x.y.z: Major.Minor.Patch must be equal - * x.y: Major.Minor must be equal - * x.(y±1): Major must be equal, Minor must be equal or -1/+1 release version - * n/a: Components do not communicate + +* x.y.z: Major.Minor.Patch must be equal +* x.y: Major.Minor must be equal +* x.(y±1): Major must be equal, Minor must be equal or -1/+1 release version +* n/a: Components do not communicate Examples: - * daos_server 2.4.0 is only compatible with daos_engine 2.4.0 - * daos_agent 2.6.0 is compatible with daos_server 2.4.0 (2.5 is a development version) - * dmg 2.4.1 is compatible with daos_server 2.4.0 + +* daos\_server 2.4.0 is only compatible with daos\_engine 2.4.0 +* daos\_agent 2.6.0 is compatible with daos\_server 2.4.0 (2.5 is a development version) +* dmg 2.4.1 is compatible with daos\_server 2.4.0 [1]: (Refresh DAOS Agent Cache) diff --git a/docs/admin/common_tasks.md b/docs/admin/common_tasks.md index a378c04b7ae..03d0cf63953 100644 --- a/docs/admin/common_tasks.md +++ b/docs/admin/common_tasks.md @@ -1,6 +1,10 @@ # DAOS Common Tasks -This section describes some of the common tasks handled by admins at a high level. See [System Deployment](./deployment.md#system-deployment), [DAOS System Administration](./administration.md#daos-system-administration), and [Pool Operations](./pool_operations.md#pool-operations) for more detailed explanations about each step. +This section describes some of the common tasks handled by admins at a high level. +See [System Deployment](./deployment.md#system-deployment), +[DAOS System Administration](./administration.md#daos-system-administration), +and [Pool Operations](./pool_operations.md#pool-operations) +for more detailed explanations about each step. ## Single host setup with PMEM and NVMe @@ -9,13 +13,12 @@ This section describes some of the common tasks handled by admins at a high leve 3. Install `daos-server` and `daos-client` RPMs. 4. Generate certificate files. 5. Copy one of the example configs from `utils/config/examples` to -`/etc/daos` and adjust it based on the environment. E.g., `mgmt_svc_replicas`, -`class`. + `/etc/daos` and adjust it based on the environment. E.g., `mgmt_svc_replicas`, + `class`. 6. Check that the directory where the log files will be created exists. E.g., -`control_log_file`, `log_file` field in `engines` section. + `control_log_file`, `log_file` field in `engines` section. 7. Start `daos_server`. -8. Use `dmg config generate` to generate the config file that contains PMEM and -NVMe. +8. Use `dmg config generate` to generate the config file that contains PMEM and NVMe. 9. Define the certificate files in the server config. 10. Start server with the generated config file. 11. Check that the server is waiting for SCM format. Call `dmg storage format`. @@ -62,6 +65,7 @@ server hosts. 1. Start DAOS server with PMEM + NVMe and format. 2. Create a pool with a size percentage. For example, + ``` dmg pool create --size=50% ``` @@ -72,6 +76,7 @@ The percentage is applied to the usable space. 1. Start DAOS server on one host. 2. Create a file that specifies the server host in `/etc/daos`. It's usually called `daos_control.yml`. 
Add the following:
+
```
hostlist:
- 
@@ -83,6 +88,7 @@ transport_config:
  cert: /etc/daos/certs/admin.crt
  key: /etc/daos/certs/admin.key
```
+
`server_host` is the hostname where the server is running. `group_name` is
usually `daos_server`. Match the `port` field defined in the server config.
Adjust `transport_config` accordingly.
diff --git a/docs/admin/deployment.md b/docs/admin/deployment.md
index 88e0c4bd16c..aa6669e492c 100644
--- a/docs/admin/deployment.md
+++ b/docs/admin/deployment.md
@@ -74,7 +74,7 @@ The configuration file location can be specified on the command line
(`/etc/daos/daos_server.yml`).

Parameter descriptions are specified in
-[`daos_server.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
+[daos\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
and example configuration files in the
[examples](https://github.com/daos-stack/daos/tree/master/utils/config/examples)
directory.
@@ -97,30 +97,30 @@ for the path specified through the -o option of the `daos_server` command line,
if unspecified then `/etc/daos/daos_server.yml` is used.

Refer to the example configuration file
-[`daos_server.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
+[daos\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
for latest information and examples.

#### MD-on-SSD Configuration

To enable MD-on-SSD, the Control-Plane-Metadata (`control_metadata`) global section of the
configuration file
-[`daos_server.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
+[daos\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
needs to specify a persistent location to store control-plane specific metadata (which would be
-stored on PMem in non MD-on-SSD mode). Either set 'control_metadata:path' to an existing (mounted)
-local filesystem path or set 'control_metadata:device' to a storage partition which can be mounted
+stored on PMem in non MD-on-SSD mode). Either set `control_metadata:path` to an existing (mounted)
+local filesystem path or set `control_metadata:device` to a storage partition which can be mounted
and formatted by the control-plane during storage format. In the latter case when specifying a
device the path parameter value will be used as the mountpoint path.

The MD-on-SSD code path will only be used if it is explicitly enabled by specifying the new
-'bdev_role' property for the NVMe storage tier(s) in the 'daos_server.yml' file. There are three
-types of 'bdev_role': wal, meta, and data. Each role must be assigned to exactly one NVMe tier.
+`bdev_role` property for the NVMe storage tier(s) in the `daos_server.yml` file. There are three
+types of `bdev_role`: wal, meta, and data. Each role must be assigned to exactly one NVMe tier.
Depending on the number of NVMe SSDs per DAOS engine there may be one, two or three NVMe tiers with
-different 'bdev_role' assignments.
+different `bdev_role` assignments.

For a complete server configuration file example enabling MD-on-SSD, see
-[`daos_server_mdonssd.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml).
+[daos\_server\_mdonssd.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml).
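Sketched below is a minimal MD-on-SSD configuration fragment, assuming a single NVMe tier that
carries all three roles; the mount path, device and PCI address are illustrative placeholders,
not recommendations:

```yaml
control_metadata:
  path: /var/daos/config      # mounted filesystem for control-plane metadata
  # device: /dev/sdb1         # alternatively, a partition to be mounted at 'path'

engines:
  - storage:
      - class: ram
        scm_mount: /mnt/daos0
      - class: nvme
        bdev_list: ["0000:81:00.0"]
        bdev_roles: [wal, meta, data]   # all three roles on one NVMe tier
```

The example file linked above remains the authoritative reference for the full schema.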
-Below are four different 'daos_server.yml' storage configuration snippets that represent scenarios +Below are four different `daos_server.yml` storage configuration snippets that represent scenarios for a DAOS engine with four NVMe SSDs and MD-on-SSD enabled. @@ -148,7 +148,6 @@ This example shows the typical use case for a DAOS server with a small number of only four or five NVMe SSDs per engine, it is natural to assign all three roles to all NVMe SSDs configured as a single NVMe tier. - 2. Two NVMe tiers, one SSD assigned wal role (tier-1) and three SSDs assigned both meta and data roles (tier-2): @@ -254,11 +253,11 @@ engine but maybe practical with a larger number of SSDs and so illustrated here DAOS can attempt to produce a server configuration file that makes optimal use of hardware on a given set of hosts either through the `dmg` or `daos_server` tools. -To generate an MD-on-SSD configurations set both '--control-metadata-path' and '--use-tmpfs-scm' +To generate an MD-on-SSD configurations set both `--control-metadata-path` and `--use-tmpfs-scm` options as detailed below. Note that due to the number of variables considered when generating a configuration automatically the result may not be the most optimal in all situations. -##### Generating Configuration File Using daos_server Tool +##### Generating Configuration File Using daos\_server Tool To generate a configuration file for a single storage server, run the `daos_server config generate` command locally. In this case, the `daos_server` service should not be running on the local host. @@ -399,7 +398,7 @@ storage tier. The RAM-disk sizes will be calculated based on the host's total me by `/proc/meminfo`). - `--control-metadata-path` specifies a persistent location to store control-plane metadata which -allows MD-on-SSD DAOS deployments to survive without data loss over 'daos_server' restarts. If +allows MD-on-SSD DAOS deployments to survive without data loss over `daos_server` restarts. If this option is set then a MD-on-SSD config will be generated. - `--fabric-ports` enables custom port numbers to be assigned to each engine's fabric settings. @@ -844,7 +843,7 @@ To set the addresses of which DAOS Servers to task, provide either: - `-l ` on the commandline when invoking, or - `hostlist: ` in the control configuration file - [`daos_control.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_control.yml) + [daos_control.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_control.yml) Where `` represents a slurm-style hostlist string e.g. `foo-1[28-63],bar[256-511]`. @@ -978,8 +977,8 @@ network. `daos_server (nvme|scm) scan` can be used to query storage on the local host directly. !!! note - 'daos_server' commands will refuse to run if a process with the same name exists (e.g. as a - systemd service under the 'daos_server' userid). + `daos_server` commands will refuse to run if a process with the same name exists (e.g. as a + systemd service under the `daos_server` userid). NVMe SSDs no longer need to be made accessible first by running `daos_server nvme prepare`, `daos_server nvme scan` will take the necessary steps to prepare the devices unless `--skip-prep` @@ -990,7 +989,7 @@ To use an alternative driver with SPDK, set `--disable-vfio` in the nvme prepare fallback to using UIO user-space driver with SPDK instead. !!! note - If UIO user-space driver is used instead of VFIO, 'daos_server' needs to be run as root. 
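For example (a sketch only; output is omitted and device details will differ per system), a
local scan can be run as follows:

```bash
# Scan NVMe SSDs on the local host; devices are prepared for SPDK automatically
$ daos_server nvme scan
# Skip the implicit prepare step if the devices are already bound for SPDK
$ daos_server nvme scan --skip-prep
```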
+ If UIO user-space driver is used instead of VFIO, `daos_server` needs to be run as root. The output will be equivalent running `dmg storage scan --verbose` remotely. @@ -1035,7 +1034,7 @@ manual reset to do so. !!! warning Due to [SPDK issue 2926](https://github.com/spdk/spdk/issues/2926), if VMD is enabled and - PCI_ALLOWED list is set to a subset of available VMD controllers (as specified in the server + PCI\_ALLOWED list is set to a subset of available VMD controllers (as specified in the server config file) then the backing devices of the unselected VMD controllers will be bound to no driver and therefore inaccessible from both OS and SPDK. Workaround is to run `daos_server nvme scan --ignore-config` to reset driver bindings for all VMD controllers. @@ -1279,7 +1278,7 @@ For class == "nvme", the following parameters should be populated: - `bdev_list` should be populated with NVMe PCI addresses. - `bdev_roles` optionally specifies a list of roles for this tier. By default, the DAOS server will assign roles to bdev tiers - automatically, so the bdev_roles directive is only needed when that + automatically, so the `bdev_roles` directive is only needed when that assignment doesn't match your use case. When "dcpm" is used for the first tier, this list should be omitted or @@ -1293,7 +1292,7 @@ For class == "nvme", the following parameters should be populated: will assign them. Otherwise all roles must be assigned to a tier. See the sample configuration file -[`daos_server.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml) +[daos\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml) and example configuration files in the [examples](https://github.com/daos-stack/daos/tree/master/utils/config/examples) directory for more details. @@ -1303,7 +1302,7 @@ To use an alternative driver with SPDK, set `disable_vfio: true` in the global s server config file to fallback to using UIO user-space driver with SPDK instead. !!! note - If UIO user-space driver is used instead of VFIO, 'daos_server' needs to be run as root. + If UIO user-space driver is used instead of VFIO, `daos_server` needs to be run as root. If VMD is enabled on a host, its usage will be enabled by default meaning that the `bdev_list` device addresses will be interpreted as VMD endpoints and storage scan will report the details of @@ -1311,14 +1310,6 @@ the physical NVMe backing devices that belong to each VMD endpoint. To disable t VMD-enabled host, set `disable_vmd: true` in the global section of the config to fallback to using physical NVMe devices only. -!!! warning - If upgrading from DAOS 2.0 to a greater version, the old 'enable_vmd' server config file - parameter is no longer honored and instead should be removed (or replaced by - `disable_vmd: true` if VMD is to be explicitly disabled). - - Otherwise 'daos_server' may fail config validation and not start after an update from 2.0 to a - greater version. - #### Example Configurations To illustrate, assume a cluster with homogeneous hardware configurations that @@ -1498,7 +1489,8 @@ This configuration yields the fastest access to that network device. Information about the network configuration is stored as metadata on the DAOS storage. -If, after initial deployment, the provider must be changed, please follow the directions to [`change fabric provider`](https://github.com/daos-stack/daos/blob/master/docs/admin/common_tasks.md#change-fabric-provider-on-a-daos-system). 
+If, after initial deployment, the provider must be changed, please follow the directions to +[change fabric provider](https://github.com/daos-stack/daos/blob/master/docs/admin/common_tasks.md#change-fabric-provider-on-a-daos-system). #### Provider Testing @@ -1547,10 +1539,10 @@ per four target threads, for example `targets: 16` and `nr_xs_helpers: 4`. The server should have sufficiently many physical cores to support the number of targets plus the additional service threads. -The 'targets:' and 'nr_xs_helpers:' requirement are mandatory, if the number +The `targets:` and `nr_xs_helpers:` requirement are mandatory, if the number of physical cores are not enough it will fail the starting of the daos engine (notes that 2 cores reserved for system service), or configures with ENV -"DAOS_TARGET_OVERSUBSCRIBE=1" to force starting daos engine (possibly hurts +`DAOS_TARGET_OVERSUBSCRIBE=1` to force starting daos engine (possibly hurts performance as multiple XS compete on same core). @@ -1742,7 +1734,9 @@ If you wish to use systemd with a development build, you must copy the Agent ser file from `utils/systemd/` to `/usr/lib/systemd/system/`. Then modify the `ExecStart` line to point to your Agent configuration file: -`ExecStart=/usr/bin/daos_agent -o <'path to agent configuration file/daos_agent.yml'>` +``` +ExecStart=/usr/bin/daos_agent -o <'path to agent configuration file/daos_agent.yml'> +``` Once the service file is installed and `systemctl daemon-reload` has been run to reload the configuration, the `daos_agent` can be started through systemd diff --git a/docs/admin/env_variables.md b/docs/admin/env_variables.md index 008a83ca4a5..12a00d406de 100644 --- a/docs/admin/env_variables.md +++ b/docs/admin/env_variables.md @@ -49,7 +49,7 @@ Environment variables in this section only apply to the server side. |DAOS\_SCHED\_RELAX\_MODE|The mode of CPU relaxing on idle. "disabled":disable relaxing; "net":wait on network request for INTVL; "sleep":sleep for INTVL. STRING. Default to "net"| |DAOS\_SCHED\_RELAX\_INTVL|CPU relax interval in milliseconds. INTEGER. Default to 1 ms.| |DAOS\_STRICT\_SHUTDOWN|Use the strict mode when shutting down engines. BOOL. Default to 0. In the strict mode, when certain resource leaks are detected, for instance, the engine will raise an assertion failure.| -|DAOS\_DTX\_AGG\_THD\_CNT|DTX aggregation count threshold. The valid range is [2^20, 2^24]. The default value is 2^19*7.| +|DAOS\_DTX\_AGG\_THD\_CNT|DTX aggregation count threshold. The valid range is [2^20, 2^24]. The default value is 2^19.| |DAOS\_DTX\_AGG\_THD\_AGE|DTX aggregation age threshold in seconds. The valid range is [210, 1830]. The default value is 630.| |DAOS\_DTX\_RPC\_HELPER\_THD|DTX RPC helper threshold. The valid range is [18, unlimited). The default value is 513.| |DAOS\_DTX\_BATCHED\_ULT\_MAX|The max count of DTX batched commit ULTs. The valid range is [0, unlimited). 0 means to commit DTX synchronously. The default value is 32.| @@ -80,10 +80,10 @@ Environment variables in this section only apply to the client side. |Variable |Description| |------------|-----------| -|D\_LOG\_FILE|DAOS debug logs (both server and client) are written to stdout by default. The debug location can be modified by setting this environment variable ("D\_LOG\_FILE=/tmp/daos_debug.log").| +|D\_LOG\_FILE|DAOS debug logs (both server and client) are written to stdout by default. 
The debug location can be modified by setting this environment variable (`D\_LOG\_FILE=/tmp/daos_debug.log`).| |D\_LOG\_FILE\_APPEND\_PID|If set and not 0, causes the main PID to be appended at the end of D\_LOG\_FILE path name (both server and client).| |D\_LOG\_STDERR\_IN\_LOG|If set and not 0, causes stderr messages to be merged in D\_LOG\_FILE.| -|D\_LOG\_SIZE|DAOS debug logs (both server and client) have a 1GB file size limit by default. When this limit is reached, the current log file is closed and renamed with a .old suffix, and a new one is opened. This mechanism will repeat each time the limit is reached, meaning that available saved log records could be found in both ${D_LOG_FILE} and last generation of ${D_LOG_FILE}.old files, to a maximum of the most recent 2*D_LOG_SIZE records. This can be modified by setting this environment variable ("D_LOG_SIZE=536870912"). Sizes can also be specified in human-readable form using `k`, `m`, `g`, `K`, `M`, and `G`. The lower-case specifiers are base-10 multipliers and the upper case specifiers are base-2 multipliers.| +|D\_LOG\_SIZE|DAOS debug logs (both server and client) have a 1GB file size limit by default. When this limit is reached, the current log file is closed and renamed with a .old suffix, and a new one is opened. This mechanism will repeat each time the limit is reached, meaning that available saved log records could be found in both ${D_LOG_FILE} and last generation of ${D_LOG_FILE}.old files, to a maximum of the most recent `2*D_LOG_SIZE` records. This can be modified by setting this environment variable ("D_LOG_SIZE=536870912"). Sizes can also be specified in human-readable form using `k`, `m`, `g`, `K`, `M`, and `G`. The lower-case specifiers are base-10 multipliers and the upper case specifiers are base-2 multipliers.| |D\_LOG\_FLUSH|Allows to specify a non-default logging level where flushing will occur. By default, only levels above WARN will cause an immediate flush instead of buffering.| |D\_LOG\_TRUNCATE|By default log is appended. But if set this variable will cause log to be truncated upon first open and logging start.| |DD\_SUBSYS |Used to specify which subsystems to enable. DD\_SUBSYS can be set to individual subsystems for finer-grained debugging ("DD\_SUBSYS=vos"), multiple facilities ("DD\_SUBSYS=bio,mgmt,misc,mem"), or all facilities ("DD\_SUBSYS=all") which is also the default setting. If a facility is not enabled, then only ERR messages or more severe messages will print.| diff --git a/docs/admin/pool_operations.md b/docs/admin/pool_operations.md index 1879f060502..516e6ba44d2 100644 --- a/docs/admin/pool_operations.md +++ b/docs/admin/pool_operations.md @@ -79,7 +79,7 @@ whether you want to specify an absolute size, use available capacity percentages - **Minimums:** - SCM: **16 MiB per target** (e.g., `--scm-size=256MiB` for 16 targets). - NVMe: Optional, but if set must be **≥ 1 GiB per target** (e.g., `--nvme-size=16GiB`). - - Total pool size is calculated as: + - Total pool size is calculated as: ``` total_size = (scm_size + nvme_size) × number_of_engines ``` @@ -101,6 +101,7 @@ whether you want to specify an absolute size, use available capacity percentages Examples: To create a pool labeled `tank`: + ```bash $ dmg pool create --size=${N}TB tank ``` @@ -189,6 +190,7 @@ rank/engine where bdev roles META and DATA are not shared. 
This is a snippet of the server config file engine section showing storage definitions with `bdev_roles` "meta" and "data" assigned to separate tiers: + ```bash storage: - @@ -206,6 +208,7 @@ definitions with `bdev_roles` "meta" and "data" assigned to separate tiers: This pool command requests to use all available storage and maintain a 1:1 Memory-File to Metadata-Storage size ratio (mem-ratio): + ```bash $ dmg pool create bob --size 100% --mem-ratio 100% @@ -224,6 +227,7 @@ Pool created with 15.91%,84.09% storage tier ratio Rough calculations: `dmg storage scan` shows that for each rank, one 800GB SSD is assigned for each tier (first: WAL+META, second: DATA). `df -h /mnt/daos*` reports usable ramdisk capacity for the single rank is 142 GiB (152 GB). + - Expected Data storage would then be 800GB for the pool (one rank). - Expected Meta storage at 100% mem-ratio would be the total ramdisk capacity. - Expected Memory-File size would be identical to Meta storage size. @@ -249,7 +253,6 @@ Pool created with 27.46%,72.54% storage tier ratio Memory File Size : 151 GB (151 GB / rank) ``` - 3. If we then try the same with bdev roles META and DATA are shared. Here we can illustrate how metadata overheads are accommodated for when the same devices share roles (and will be used to store both metadata and data). @@ -257,6 +260,7 @@ devices share roles (and will be used to store both metadata and data). This is a snippet of the server config file engine section showing storage definitions with `bdev_roles` "meta" and "data" assigned to the same (single) tier: + ```bash storage: - @@ -270,6 +274,7 @@ tier: This pool command requests to use all available storage and maintain a 1:1 Memory-File to Metadata-Storage size ratio (mem-ratio): + ```bash $ dmg pool create bob --size 100% --mem-ratio 100% @@ -289,7 +294,6 @@ Looking at this output and comparing with example no. 1 we observe that because both SSDs are sharing META and DATA roles, more capacity is available for DATA. - 4. If the `--mem-ratio` is then reduced to 50% in the above example, we end up with double the Metadata-Storage size which detracts from the DATA capacity. @@ -311,7 +315,6 @@ Pool created with 20.32%,79.68% storage tier ratio META has been doubled at the cost of DATA capacity. - 5. Adding another engine/rank on the same host results in more than double DATA capacity because RAM-disk capacity is halved across two engines/ranks on the same host and this results in a reduction of META and increase in DATA per-rank sizes. @@ -335,7 +338,6 @@ Pool created with 8.65%,91.35% storage tier ratio Memory File Size : 129 GB (64 GB / rank) ``` - 6. A larger pool with 6 engines/ranks across 3 hosts using the same shared-role configuration and pool-create commandline as the previous example. @@ -358,7 +360,6 @@ Pool created with 8.65%,91.35% storage tier ratio Here the size has increased linearly with the per-rank sizes remaining the same. - 7. Now for a more involved example with shared roles. Create two pools of roughly equal size each using half available capacity and a `--mem-ratio` of 50%. @@ -474,7 +475,6 @@ created pool is slightly smaller meaning the total cumulative pool size reading `4.5+3.8 == 8.3 TB` rather than `8.9 TB` which can be partly explained because of extra per-pool overheads and possible rounding in size calculations. - 8. Now for a similar experiment as example no. 8 but with separate META and DATA roles. @@ -564,7 +564,6 @@ some idea of how to use the tool commands. 
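As a compact recap of the pattern used in the examples above (the pool label and percentages
are illustrative):

```bash
# Use all available storage; a 50% mem-ratio sizes Metadata-Storage at
# twice the Memory-File size, trading DATA capacity for META capacity
$ dmg pool create tank --size 100% --mem-ratio 50%
```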
Capacity can be best utilized by understanding assignment of roles and SSDs across tiers and the tuning of the mem-ratio pool create option. - ### Listing Pools To see a list of the pools in the DAOS system: @@ -602,6 +601,7 @@ tank 8a05bf3a-a088-4a77-bb9f-df989fce7cc8 1-3 3 GB 10 kB 0% ``` In MD-on-SSD mode: + ```bash $ dmg pool list --verbose Label UUID SvcReps Meta Size Meta Used Meta Imbalance DATA Size DATA Used DATA Imbalance Disabled @@ -953,16 +953,16 @@ called aggregation. The reclaim property defines what strategy to use to reclaimed unused version. Three options are supported: -* "lazy" : Trigger aggregation only when there is no IO activities or SCM free space is under pressure (default strategy) -* "time" : Trigger aggregation regularly despite of IO activities. -* "disabled" : Never trigger aggregation. The system will eventually run out of space even if data is being deleted. +* `lazy` : Trigger aggregation only when there is no IO activities or SCM free space is under pressure (default strategy) +* `time` : Trigger aggregation regularly despite of IO activities. +* `disabled` : Never trigger aggregation. The system will eventually run out of space even if data is being deleted. ### Self-healing Policy (self\_heal) This property defines whether a failing engine is automatically evicted from the pool. Once excluded, the self-healing mechanism will be triggered to restore the pool data redundancy on the surviving storage engines. -Two options are supported: "exclude" (default strategy) and "rebuild". +Two options are supported: `exclude` (default strategy) and `rebuild`. ### Reserved Space for Rebuilds (space\_rb) @@ -1041,34 +1041,34 @@ This property controls how checkpoints are triggered for each target. When enab checkpointing will always trigger if there is space pressure in the WAL. There are three supported options: -* "timed" : Checkpointing is also triggered periodically (default option). -* "lazy" : Checkpointing is only triggered when there is WAL space pressure. -* "disabled" : Checkpointing is disabled. WAL space may be exhausted. +* `timed` : Checkpointing is also triggered periodically (default option). +* `lazy` : Checkpointing is only triggered when there is WAL space pressure. +* `disabled` : Checkpointing is disabled. WAL space may be exhausted. #### Checkpoint frequency (checkpoint\_freq) This property controls how often checkpoints are triggered. It is only relevant -if the checkpoint policy is "timed". The value is specified in seconds in the +if the checkpoint policy is `timed`. The value is specified in seconds in the range [1, 1000000] with a default of 5. Values outside the range are automatically adjusted. #### Checkpoint threshold (checkpoint\_thresh) This property controls the percentage of WAL usage to automatically trigger a checkpoint. -It is not relevant when the checkpoint policy is "disabled". The value is specified +It is not relevant when the checkpoint policy is `disabled`. The value is specified as a percentage in the range [10-75] with a default of 50. Values outside the range are automatically adjusted. #### Reintegration mode (reintegration) This property controls how reintegration will recover data. Three options are supported: -"data_sync" (default strategy) and "no_data_sync", "incremental". with "data_sync", reintegration -will discard pool data and trigger rebuild to sync data. With "no_data_sync", reintegration only -updates pool map to include rank. 
While with "incremental", reintegration will not discard pool +`data_sync` (default strategy) and `no_data_sync`, `incremental`. with `data_sync`, reintegration +will discard pool data and trigger rebuild to sync data. With `no_data_sync`, reintegration only +updates pool map to include rank. While with `incremental`, reintegration will not discard pool data but will trigger rebuild to sync data only beyond global stable epoch, the reintegration is incremental as old data below global stable epoch need not to be migrated. -NB: with "no_data_sync" enabled, containers will be turned to read-only, daos won't trigger +NB: with `no_data_sync` enabled, containers will be turned to read-only, daos won't trigger rebuild to restore the pool data redundancy on the surviving storage engines if there are dead rank events. @@ -1258,7 +1258,7 @@ to restore redundancy on the remaining engines. !!! note Exclusion may compromise the Pool Redundancy Factor (RF), potentially leading to data loss. If this is the case, the command will refuse to perform the exclusion - and return the error code -DER_RF. You can proceed with the exclusion by specifying + and return the error code -DER\_RF. You can proceed with the exclusion by specifying the --force option. Please note that forcing the operation may result in data loss, and it is strongly recommended to verify the RF status before proceeding. @@ -1304,7 +1304,7 @@ operation is ongoing. Drain additionally enables non-replicated data to be rebuilt onto another target whereas in a conventional failure scenario non-replicated data would not be integrated into a rebuild and would be lost. Drain operation is not allowed if there are other ongoing rebuild operations, otherwise -it will return -DER_BUSY. +it will return -DER\_BUSY. An operator can drain one or more engines or targets from a specific DAOS pool using the rank(s) where the target(s) reside, as well as the target index(es) on the rank(s). If a target idx list is @@ -1354,7 +1354,7 @@ original state. The operator can either reintegrate specific targets for an engine rank by supplying a target idx list, or reintegrate an entire engine rank by omitting the list. Reintegrate operation is not allowed if there are other ongoing rebuild operations, -otherwise it will return -DER_BUSY. +otherwise it will return -DER\_BUSY. An operator can reintegrate one or more engines or targets from a specific DAOS pool using the rank(s) where the target(s) reside, as well as the target index(es) on the rank(s). If a target idx @@ -1400,7 +1400,7 @@ $ dmg pool reintegrate $DAOS_POOL --ranks=5 --target-idx=0,1 While dmg pool query and list show how many targets are disabled for each pool, there is currently no way to list the targets that have actually been disabled. As a result, it is recommended for now to try to reintegrate all engine ranks reported as disabled in - `dmg pool query`. The string output after "Disabled ranks:" in the pool query output can + `dmg pool query`. The string output after `Disabled ranks:` in the pool query output can be used as the `--ranks=` option value in `dmg pool reintegrate` command. #### System Reintegrate @@ -1435,7 +1435,7 @@ pool. This will automatically trigger a server rebalance operation where objects within the extended pool will be rebalanced across the new storage. Extend operation is not allowed if there are other ongoing rebuild operations, -otherwise it will return -DER_BUSY. +otherwise it will return -DER\_BUSY. ``` $ dmg pool extend $DAOS_POOL --ranks=${rank1},${rank2}... 
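# For example, with a pool labeled 'tank' (the rank numbers are illustrative):
$ dmg pool extend tank --ranks=4,5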
@@ -1475,7 +1475,7 @@ those failures can be recovered and DAOS engines can be restarted and the system can function again. Administrator can set the default pool redundancy factor by environment variable -"DAOS_POOL_RF" in the server yaml file. If SWIM detects and reports an engine is +`DAOS_POOL_RF` in the server yaml file. If SWIM detects and reports an engine is dead and the number of failed fault domain exceeds or is going to exceed the pool redundancy factor, it will not change pool map immediately. Instead, it will give log messages: diff --git a/docs/admin/predeployment_check.md b/docs/admin/predeployment_check.md index 0c9c92a049f..67addee72eb 100644 --- a/docs/admin/predeployment_check.md +++ b/docs/admin/predeployment_check.md @@ -29,9 +29,9 @@ $ sudo reboot ``` !!! note - To force SPDK to use UIO rather than VFIO at daos_server runtime, set - 'disable_vfio' in the [server config file](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml#L109), - but note that this will require running daos_server as root. + To force SPDK to use UIO rather than VFIO at daos\_server runtime, set + `disable_vfio` in the [server config file](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml#L109), + but note that this will require running daos\_server as root. !!! warning If VFIO is not enabled on RHEL 8.x and derivatives, you will run into the issue described in: @@ -117,7 +117,7 @@ Some special configuration is required for the `verbs` provider to use librdmacm with multiple interfaces. The same configuration is required for the `tcp` provider and for all of the `ucx` provider options. -First, the accept_local feature must be enabled on the network interfaces +First, the `accept_local` feature must be enabled on the network interfaces to be used by DAOS. This can be done using the following command: ``` @@ -125,7 +125,7 @@ $ sudo sysctl -w net.ipv4.conf.all.accept_local=1 ``` Second, Linux must be configured to only send ARP replies on the interface -targeted in the ARP request. This is configured via the arp_ignore parameter. +targeted in the ARP request. This is configured via the `arp_ignore` parameter. This should be set to 2 if all the IPoIB interfaces on the client and storage nodes are in the same logical subnet (e.g. ib0 == 10.0.0.27, ib1 == 10.0.1.27, prefix=16). @@ -141,7 +141,7 @@ set to 1. $ sysctl -w net.ipv4.conf.all.arp_ignore=1 ``` -Finally, the rp_filter is set to 1 by default on several distributions (e.g. on +Finally, the `rp_filter` is set to 1 by default on several distributions (e.g. on CentOS 7 and EL 8) and should be set to either 0 or 2, with 2 being more secure. This is true even if the configuration uses a single logical subnet. must be replaced with the interface names) @@ -154,7 +154,8 @@ All those parameters can be made persistent in /etc/sysctl.conf by adding a new sysctl file under /etc/sysctl.d (e.g. /etc/sysctl.d/95-daos-net.conf) with all the relevant settings. -For more information, please refer to the [librdmacm documentation](https://github.com/linux-rdma/rdma-core/blob/master/Documentation/librdmacm.md) +For more information, please refer to the +[librdmacm documentation](https://github.com/linux-rdma/rdma-core/blob/master/Documentation/librdmacm.md) ### Firewall @@ -181,10 +182,12 @@ the necessary directories are setup. 
A sign that this step may have been missed is when starting daos_server or daos_agent, you may see the message: + ```bash $ mkdir /var/run/daos_server: permission denied Unable to create socket directory: /var/run/daos_server ``` + #### Non-default Directory By default, daos_server and daos_agent will use the directories @@ -211,14 +214,16 @@ therefore, if reboots are infrequent, an easy solution while still utilizing the default locations is to create the required directories manually. To do this execute the following commands. -daos_server: +daos\_server: + ```bash $ mkdir /var/run/daos_server $ chmod 0755 /var/run/daos_server $ chown user:user /var/run/daos_server (where user is the user you will run daos_server as) ``` -daos_agent: +daos\_agent: + ```bash $ mkdir /var/run/daos_agent $ chmod 0755 /var/run/daos_agent @@ -253,7 +258,7 @@ that require elevated privileges on behalf of `daos_server`. When DAOS is installed from RPM, the `daos_server_helper` helper is automatically installed -to the correct location with the correct permissions. The RPM creates a "daos_server" +to the correct location with the correct permissions. The RPM creates a `daos_server` system group and configures permissions such that `daos_server_helper` may only be invoked from `daos_server`. @@ -309,9 +314,9 @@ failures. For RPM installations, the `daos_server` and `daos_agent` services will typically be launched by `systemd` and its `LimitMEMLOCK` limit is set to `infinity` in the -[`daos_server.service`](https://github.com/daos-stack/daos/blob/master/utils/systemd/daos_server.service) +[daos\_server.service](https://github.com/daos-stack/daos/blob/master/utils/systemd/daos_server.service) and -[`daos_agent.service`](https://github.com/daos-stack/daos/blob/master/utils/systemd/daos_agent.service) +[daos\_agent.service](https://github.com/daos-stack/daos/blob/master/utils/systemd/daos_agent.service) unit files. (Note that values set in `/etc/security/limits.conf` are ignored by services launched through `systemd`.) @@ -390,6 +395,7 @@ is installed - `/install/share/spdk/scripts/setup.sh` after build from DAOS source Bind the SSDs with the following commands: + ```bash $ sudo /usr/share/spdk/scripts/setup.sh 0000:01:00.0 (8086 0953): nvme -> vfio-pci @@ -404,10 +410,12 @@ Now the SSDs can be accessed by SPDK we can use the `spdk_nvme_manage` tool to f the SSDs with a 4K block size. `spdk_nvme_manage` tool is provided by SPDK and will be found in the following locations: + - `/usr/bin/spdk_nvme_manage` if DAOS-maintained spdk-21.07-10 (or greater) RPM is installed - `/install/prereq/release/spdk/bin/spdk_nvme_manage` after build from DAOS source Choose to format a SSD, use option "6" for formatting: + ```bash $ sudo /usr/bin/spdk_nvme_manage NVMe Management Options @@ -425,6 +433,7 @@ NVMe Management Options Available SSDs will then be listed and you will be prompted to select one. Select the SSD to format, enter PCI Address "01:00.00": + ```bash 0000:01:00.00 INTEL SSDPEDMD800G4 CVFT45050002800CGN 0 Please Input PCI Address(domain:bus:dev.func): @@ -499,6 +508,7 @@ NVMe Management Options Controller details should show new "Current LBA Format". Verify "Current LBA Format" is set to "LBA Format #03": + ```bash ===================================================== NVMe Controller: 0000:01:00.00 @@ -532,7 +542,6 @@ Displayed details for controller show LBA format is now "#03". Perform the above process for all SSDs that will be used by DAOS. 
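To double-check the result from the operating system side (a sketch, assuming the SSDs are
temporarily returned to the kernel NVMe driver; device names will differ), the reported
logical block size should now be 4096:

```bash
# Rebind the SSDs from the user-space driver back to the kernel nvme driver
$ sudo /usr/share/spdk/scripts/setup.sh reset
# A 4K-formatted namespace reports a 4096-byte logical block size
$ cat /sys/block/nvme0n1/queue/logical_block_size
4096
```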
- ## Hugepage allocation and memory fragmentation DAOS uses linux hugepages for DMA buffer allocation. If hugepage memory becomes fragmented, DMA @@ -549,7 +558,6 @@ requested. [See here for details of allocating hugepages at boot](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-memory-configuring-huge-pages) - ## Disabling transparent hugepage (THP) feature at boot-time Linux transparent hugepages feature can increase the likelihood of hugepage memory fragmentation @@ -557,4 +565,3 @@ and should be disabled for optimal performance of DAOS. [See here for details of how to disable THP on boot](https://docs.kernel.org/admin-guide/mm/transhuge.html#boot-parameters) - diff --git a/docs/admin/telemetry_guide.md b/docs/admin/telemetry_guide.md index eae31acdc8d..a9dd4d34318 100644 --- a/docs/admin/telemetry_guide.md +++ b/docs/admin/telemetry_guide.md @@ -1,11 +1,11 @@ # DAOS Telemetry Example -This document will help to run daos metrics command and collect some key metrics from the +This document will help to run daos metrics command and collect some key metrics from the server to help debug the issues and analyze the system behavior. ## How to run telemetry command: -### Directly on server using daos_metrics command as sudo user +### Directly on server using daos\_metrics command as sudo user - Example of collecting the pool query metrics on the servers using daos_metrics command. - daos_metrics -S will show telemetry data from First I/O Engine (default 0) @@ -48,7 +48,7 @@ connecting to brd-222:9191... - Some metrics are only available on pool leader rank so identify the leader rank for that pool from the pool query command. - Below is the example of pool query where leader rank is 1 - - Pool 55cc96d8-5c46-41f4-af29-881d293b6f6f, ntarget=48, disabled=0, `leader=1`, version=1, state=Ready + - Pool 55cc96d8-5c46-41f4-af29-881d293b6f6f, ntarget=48, disabled=0, `leader=1`, version=1, state=Ready ``` #sudo dmg pool query samir_pool @@ -68,7 +68,7 @@ Pool space info: Free: 598 GB, min:12 GB, max:12 GB, mean:12 GB ``` - - Find the leader rank address so that daos_metrics command can be run on that specific server. + - Find the leader rank address so that daos\_metrics command can be run on that specific server. In this example Rank 1 is on `brd-221.daos.hpc.amslabs.hpecorp.net` (`10.214.213.41`) ``` @@ -172,7 +172,7 @@ export MY_MOUNT=/tmp/daos_mount In case of no response to any dmg pool command or any I/O operation means any one of the single xstream might have stuck. Either ULT is stuck or NVMe cannot respond to I/O operation. To check if ULT is stuck, run metrics command on each server and check for sched/cycle_duration and sched/cycle_size. -In this case sched/cycle_duration and sched/cycle_size for stuck xstream counter value is higher (outlier) compared to other xstream and ULT count. +In this case sched/cycle_duration and sched/cycle_size for stuck xstream counter value is higher (outlier) compared to other xstream and ULT count. - sched/cycle_duration: Schedule cycle duration, units: ms - sched/cycle_size: Schedule cycle size, units: ULT @@ -182,7 +182,7 @@ Below is the example on real system where ULT was stuck and not responding. 
You **xs_3: 72508 ULT** ``` # sudo daos_metrics -C -S 0 | grep -e cycle - + cycle_duration xs_0: 4 ms [min: 0, max: 736, avg: 1, sum: 4374707, stddev: 2, samples: 4337768] xs_1: 0 ms [min: 0, max: 4, avg: 0, sum: 12, stddev: 1, samples: 57] @@ -243,45 +243,44 @@ This metrics are available in different IO size ranges from 256B to 4GB so looks ``` #sudo daos_metrics -C -S 0 | grep 'io/latency/update' -ID: 0/io/latency/update/4MB/tgt_0,16349826,733843,16349826,7329515.976190,42,307839671,4196687.177444 -ID: 0/io/latency/update/4MB/tgt_1,1260,1147,2191,1463.423077,52,76098,273.640909 -ID: 0/io/latency/update/4MB/tgt_2,1252,1122,2275,1452.000000,62,90024,272.896966 -ID: 0/io/latency/update/4MB/tgt_3,1637,1179,2639,1558.844444,45,70148,302.601219 -ID: 0/io/latency/update/4MB/tgt_4,1155,1119,2280,1496.857143,49,73346,281.746857 -ID: 0/io/latency/update/4MB/tgt_5,1804,1139,1920,1493.767442,43,64232,234.072520 -ID: 0/io/latency/update/4MB/tgt_6,1160,1136,2550,1560.862745,51,79604,293.899440 -ID: 0/io/latency/update/4MB/tgt_7,1399,1126,1969,1411.929825,57,80480,195.942125 -ID: 0/io/latency/update/4MB/tgt_8,15264368,857936,19645847,9109087.453125,64,582981597,5094157.829112 -ID: 0/io/latency/update/4MB/tgt_9,1601,1146,2455,1437.038462,52,74726,262.549712 -ID: 0/io/latency/update/4MB/tgt_10,1366,1138,2094,1459.828125,64,93429,228.692526 -ID: 0/io/latency/update/4MB/tgt_11,1118,1113,2742,1475.378788,66,97375,309.820731 -ID: 0/io/latency/update/4MB/tgt_12,1169,1158,2531,1492.392857,56,83574,270.312323 -ID: 0/io/latency/update/4MB/tgt_13,1477,1148,2204,1485.853659,41,60920,244.983118 -ID: 0/io/latency/update/4MB/tgt_14,1159,1159,2390,1523.333333,48,73120,318.466026 +ID: 0/io/latency/update/4MB/tgt_0,16349826,733843,16349826,7329515.976190,42,307839671,4196687.177444 +ID: 0/io/latency/update/4MB/tgt_1,1260,1147,2191,1463.423077,52,76098,273.640909 +ID: 0/io/latency/update/4MB/tgt_2,1252,1122,2275,1452.000000,62,90024,272.896966 +ID: 0/io/latency/update/4MB/tgt_3,1637,1179,2639,1558.844444,45,70148,302.601219 +ID: 0/io/latency/update/4MB/tgt_4,1155,1119,2280,1496.857143,49,73346,281.746857 +ID: 0/io/latency/update/4MB/tgt_5,1804,1139,1920,1493.767442,43,64232,234.072520 +ID: 0/io/latency/update/4MB/tgt_6,1160,1136,2550,1560.862745,51,79604,293.899440 +ID: 0/io/latency/update/4MB/tgt_7,1399,1126,1969,1411.929825,57,80480,195.942125 +ID: 0/io/latency/update/4MB/tgt_8,15264368,857936,19645847,9109087.453125,64,582981597,5094157.829112 +ID: 0/io/latency/update/4MB/tgt_9,1601,1146,2455,1437.038462,52,74726,262.549712 +ID: 0/io/latency/update/4MB/tgt_10,1366,1138,2094,1459.828125,64,93429,228.692526 +ID: 0/io/latency/update/4MB/tgt_11,1118,1113,2742,1475.378788,66,97375,309.820731 +ID: 0/io/latency/update/4MB/tgt_12,1169,1158,2531,1492.392857,56,83574,270.312323 +ID: 0/io/latency/update/4MB/tgt_13,1477,1148,2204,1485.853659,41,60920,244.983118 +ID: 0/io/latency/update/4MB/tgt_14,1159,1159,2390,1523.333333,48,73120,318.466026 ID: 0/io/latency/update/4MB/tgt_15,1511,1165,2318,1447.608696,46,66590,253.351094 #sudo daos_metrics -C -S 0 | grep 'io/latency/fetch' -ID: 0/io/latency/fetch/4MB/tgt_0,1390,1099,2169,1380.785714,42,57993,202.810200 -ID: 0/io/latency/fetch/4MB/tgt_1,1902,1413,2956,1845.769231,52,95980,313.043041 -ID: 0/io/latency/fetch/4MB/tgt_2,1741,1395,2493,1783.983871,62,110607,226.501945 -ID: 0/io/latency/fetch/4MB/tgt_3,1543,1241,2568,1824.800000,45,82116,281.414092 -ID: 0/io/latency/fetch/4MB/tgt_4,1705,1506,2426,1850.020408,49,90651,232.079413 -ID: 
0/io/latency/fetch/4MB/tgt_5,1579,1251,2396,1754.139535,43,75428,213.314275 -ID: 0/io/latency/fetch/4MB/tgt_6,1566,1262,2403,1747.823529,51,89139,260.631134 -ID: 0/io/latency/fetch/4MB/tgt_7,1663,1354,2912,1853.631579,57,105657,287.610267 -ID: 0/io/latency/fetch/4MB/tgt_8,1508,1051,2276,1417.562500,64,90724,271.118956 -ID: 0/io/latency/fetch/4MB/tgt_9,1508,1404,2468,1791.788462,52,93173,251.042324 -ID: 0/io/latency/fetch/4MB/tgt_10,1746,1453,2645,1796.203125,64,114957,230.458630 -ID: 0/io/latency/fetch/4MB/tgt_11,1695,1394,2416,1761.151515,66,116236,220.046376 -ID: 0/io/latency/fetch/4MB/tgt_12,1966,1396,2654,1740.464286,56,97466,238.501684 -ID: 0/io/latency/fetch/4MB/tgt_13,1915,1341,2613,1774.536585,41,72756,237.038298 -ID: 0/io/latency/fetch/4MB/tgt_14,1861,1337,2543,1807.625000,48,86766,279.890680 -ID: 0/io/latency/fetch/4MB/tgt_15,1740,1326,2420,1733.521739,46,79742,238.393674 +ID: 0/io/latency/fetch/4MB/tgt_0,1390,1099,2169,1380.785714,42,57993,202.810200 +ID: 0/io/latency/fetch/4MB/tgt_1,1902,1413,2956,1845.769231,52,95980,313.043041 +ID: 0/io/latency/fetch/4MB/tgt_2,1741,1395,2493,1783.983871,62,110607,226.501945 +ID: 0/io/latency/fetch/4MB/tgt_3,1543,1241,2568,1824.800000,45,82116,281.414092 +ID: 0/io/latency/fetch/4MB/tgt_4,1705,1506,2426,1850.020408,49,90651,232.079413 +ID: 0/io/latency/fetch/4MB/tgt_5,1579,1251,2396,1754.139535,43,75428,213.314275 +ID: 0/io/latency/fetch/4MB/tgt_6,1566,1262,2403,1747.823529,51,89139,260.631134 +ID: 0/io/latency/fetch/4MB/tgt_7,1663,1354,2912,1853.631579,57,105657,287.610267 +ID: 0/io/latency/fetch/4MB/tgt_8,1508,1051,2276,1417.562500,64,90724,271.118956 +ID: 0/io/latency/fetch/4MB/tgt_9,1508,1404,2468,1791.788462,52,93173,251.042324 +ID: 0/io/latency/fetch/4MB/tgt_10,1746,1453,2645,1796.203125,64,114957,230.458630 +ID: 0/io/latency/fetch/4MB/tgt_11,1695,1394,2416,1761.151515,66,116236,220.046376 +ID: 0/io/latency/fetch/4MB/tgt_12,1966,1396,2654,1740.464286,56,97466,238.501684 +ID: 0/io/latency/fetch/4MB/tgt_13,1915,1341,2613,1774.536585,41,72756,237.038298 +ID: 0/io/latency/fetch/4MB/tgt_14,1861,1337,2543,1807.625000,48,86766,279.890680 +ID: 0/io/latency/fetch/4MB/tgt_15,1740,1326,2420,1733.521739,46,79742,238.393674 ``` - ### NVMe Device Error Many times, NVMe device has error which can also be an indication for slow performance or system stuck issue. @@ -305,7 +304,7 @@ ID: 0/nvme/0000:83:00.0/vendor/crc_err_cnt_raw,0 ## Metrics Unit Type -daos_metrics output is available in multiple units. for example, Counters, Gauge. It can display the data based on different unit type. +daos\_metrics output is available in multiple units. for example, Counters, Gauge. It can display the data based on different unit type. ### Display Counter type metrics A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset or to zero on restart. @@ -338,9 +337,9 @@ ID: 0/net/ofi+tcp;ofi_rxm/hg/active_rpcs/ctx_3,0,,,,,,Mercury-layer count of act Gauge metrics units are in format where multiple values are display for number of samples. For example, update/fetch latency output. ``` - latency - update - 256B + latency + update + 256B tgt_0: 118 us [min: 15, max: 3703, avg: 100, sum: 200968, stddev: 124, samples: 2010] ``` @@ -359,6 +358,7 @@ Gauge metrics units are in format where multiple values are display for number o Metrics counter will be reset when system restarts or it can be reset using below command on individual servers. 
For Engines 0 and 1 (in case multiple engines are running on the same node):
+
```
sudo daos_metrics -S 0 -e; sudo daos_metrics -S 1 -e
```
diff --git a/docs/admin/tiering_uns.md b/docs/admin/tiering_uns.md
index d856d0a153f..877c6e1a7c9 100644
--- a/docs/admin/tiering_uns.md
+++ b/docs/admin/tiering_uns.md
@@ -37,42 +37,42 @@ The current state of work can be summarized as follows :
 ### Building and using a DAOS-aware Lustre version
 
 As indicated before, a Lustre Client patch (for LU-12682) has been developed
- to allow for the application's transparent access to the DAOS container's data
- from a Lustre foreign file/dir.
+to allow for the application's transparent access to the DAOS container's data
+from a Lustre foreign file/dir.
 
 This patch can be found at https://review.whamcloud.com/35856 and has
- been landed onto master but is still not integrated with an official
- Lustre version. This patch must be applied on top of the selected Lustre
- version's source tree.
+been landed onto master but is still not integrated with an official
+Lustre version. This patch must be applied on top of the selected Lustre
+version's source tree.
 
 After any conflicts are resolved, Lustre must be built and
- the generated RPMs installed on client nodes by following the instructions at
- https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source.
+the generated RPMs installed on client nodes by following the instructions at
+https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source.
 
 The Lustre client mount command must use the new
- `foreign_symlink=<prefix>` option to set the prefix to be used in
- front of the `<pool>/<cont>` relative path, based on pool/container
- information being extracted from the LOV/LMV foreign symlink EAs. This can
- be configured by dynamically modifying both `foreign_symlink_[enable,prefix]`
- parameters for each Lustre client mount, using the
- `lctl set_param llite/*/foreign_symlink_[enable,prefix]=[0|1,<prefix>]` command.
- The Dfuse instance will then use this prefix to mount/expose all
- DAOS pools, or use `<prefix>/<pool>[/<cont>]` to mount a
- single pool/container.
-
-To allow non-root/admin users to use the llapi_set_dirstripe()
- API (like the `daos cont create` command with `--path` option), or the
- `lfs setdirstripe` command, the Lustre MDS servers configuration must
- be modified accordingly by running the
- `lctl set_param mdt/*/enable_remote_dir_gid=-1` command.
-
- Additionally, there is a feature available to provide a customized format
- of LOV/LMV EAs, apart from the default `<pool>/<cont>`, through the
- `llite/*/foreign_symlink_upcall` tunable. This provides the path
- of a user-land upcall, that will indicate where to extract
- `<pool>` and `<cont>` in the LOV/LMV EAs, using a series of [pos, len]
- tuples and constant strings. `lustre/utils/l_foreign_symlink.c` is a helper
- example in the Lustre source code.
+`foreign_symlink=<prefix>` option to set the prefix to be used in
+front of the `<pool>/<cont>` relative path, based on pool/container
+information being extracted from the LOV/LMV foreign symlink EAs. This can
+be configured by dynamically modifying both `foreign_symlink_[enable,prefix]`
+parameters for each Lustre client mount, using the
+`lctl set_param llite/*/foreign_symlink_[enable,prefix]=[0|1,<prefix>]` command.
+The Dfuse instance will then use this prefix to mount/expose all
+DAOS pools, or use `<prefix>/<pool>[/<cont>]` to mount a
+single pool/container. 
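+
+As an illustration (a sketch only; the dfuse mountpoint `/mnt/dfuse` is a
+hypothetical path), the two tunables can be set on a client node as follows:
+
+```bash
+# Enable foreign-symlink resolution on all Lustre client mounts of this node
+$ sudo lctl set_param llite/*/foreign_symlink_enable=1
+# Point the prefix at a dfuse mountpoint that exposes the DAOS pools
+$ sudo lctl set_param llite/*/foreign_symlink_prefix=/mnt/dfuse
+```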
+
+To allow non-root/admin users to use the `llapi_set_dirstripe()`
+API (like the `daos cont create` command with `--path` option), or the
+`lfs setdirstripe` command, the Lustre MDS servers configuration must
+be modified accordingly by running the
+`lctl set_param mdt/*/enable_remote_dir_gid=-1` command.
+
+Additionally, there is a feature available to provide a customized format
+of LOV/LMV EAs, apart from the default `<pool>/<cont>`, through the
+`llite/*/foreign_symlink_upcall` tunable. This provides the path
+of a user-land upcall that will indicate where to extract
+`<pool>` and `<cont>` in the LOV/LMV EAs, using a series of [pos, len]
+tuples and constant strings. `lustre/utils/l_foreign_symlink.c` is a helper
+example in the Lustre source code.
 
 ## Data Migration
 
diff --git a/docs/admin/troubleshooting.md b/docs/admin/troubleshooting.md
index 8d615e4991f..053a92b33a6 100644
--- a/docs/admin/troubleshooting.md
+++ b/docs/admin/troubleshooting.md
@@ -58,7 +58,7 @@ For a full list of errors, please check
 (`DER_ERR_GURT_BASE` is equal to 1000, and `DER_ERR_DAOS_BASE` is equal
 to 2000).
 
-The function d_errstr() is provided in the API to convert an error
+The function `d_errstr()` is provided in the API to convert an error
 number to an error message.
 
 ## Log Files
@@ -87,7 +87,7 @@ levels may be set using the `control_log_mask` config parameter:
 Data Plane (`daos_engine`) logging is configured on a per-instance
 basis. In other words, each section under the `servers:` section must
 have its own logging configuration. The `log_file` config parameter
-is converted to a D_LOG_FILE environment variable value. For more
+is converted to a D\_LOG\_FILE environment variable value. For more
 detail, please see the [Debugging System](#debugging-system)
 section of this document.
 
@@ -118,8 +118,8 @@ The debug logging system includes a series of subsystems or facilities
 which define groups for related log messages (defined per source file).
 There are common facilities which are defined in GURT, as well as other
 facilities that can be defined on a per-project basis (such as those for
-CaRT and DAOS). DD_SUBSYS can be used to set which subsystems to enable
-logging. By default all subsystems are enabled ("DD_SUBSYS=all").
+CaRT and DAOS). DD\_SUBSYS can be used to select the subsystems for which
+logging is enabled. By default all subsystems are enabled (`DD_SUBSYS=all`).
 
 - DAOS Facilities:
   daos, array, kv, common, tree, vos, client, server, rdb, rsvc, pool, container, object,
@@ -203,7 +203,7 @@ composition of multiple individual bits.
 
 Please note: where in these examples the export command is shown setting an environment variable,
 this is intended to convey either that the variable is actually set (for the client environment), or
-configured for the engines in the `daos_server.yml` file (`log_mask` per engine, and env_vars
+configured for the engines in the `daos_server.yml` file (`log_mask` per engine, and `env_vars`
 values per engine for the `DD_SUBSYS` and `DD_MASK` variable assignments).
 
 - Generic setup for all messages (default settings)
@@ -220,7 +220,8 @@ values per engine for the `DD_SUBSYS` and `DD_MASK` variable assignments).
 
 - Gather daos metadata logs if a pool/container resource problem is observed, using the provided
 group mask
 
   D_LOG_MASK=DEBUG -> log at DEBUG level from all facilities
-  DD_MASK=group_metadata -> limit logging to include default and metadata-specific streams. Or, specify DD_MASK=group_metadata_only for just metadata-specific log entries. 
+  DD_MASK=group_metadata -> limit logging to include default and metadata-specific streams.
+  Or, specify DD_MASK=group_metadata_only for just metadata-specific log entries.
 
 - Disable a noisy debug logging subsystem
 
@@ -244,7 +245,9 @@ Refer to the DAOS Environment Variables document for more information
 about the debug system environment.
 
 ## Common DAOS Problems
-### Incompatible Agent ####
+
+### Incompatible Agent
+
 When DER_AGENT_INCOMPAT is received, it means that the client library libdaos.so
 is likely mismatched with the DAOS Agent. The libdaos.so, DAOS Agent and DAOS
 Server must be built from compatible sources so that the GetAttachInfo protocol
@@ -252,14 +255,16 @@ is the same between each component. Depending on your situation, you will need
 to either update the DAOS Agent or the libdaos.so to the newer version in order
 to maintain compatibility with each other.
 
-### HLC Sync ###
+### HLC Sync
+
 When DER_HLC_SYNC is received, it means that sender and receiver HLC timestamps
 are off by more than the maximum allowed system clock offset (1 second by default).
 
 In order to correct this situation, synchronize all server clocks to the same
 reference time, using services like NTP.
 
-### Shared Memory Errors ###
+### Shared Memory Errors
+
 When DER_SHMEM_PERMS is received it means that this I/O Engine lacked the
 permissions to access the shared memory segment left behind by a previous run
 of the I/O Engine on the same machine. This happens when the I/O Engine fails to
@@ -267,8 +272,8 @@ remove the shared memory segment upon shutdown, and there is a mismatch between
 the user/group used to launch the I/O Engine between these successive runs. To
 remedy the problem, manually identify the shared memory segment and remove it.
 
-Issue ```ipcs``` to view the Shared Memory Segments. The output will show a
-list of segments organized by ```key```.
+Issue `ipcs` to view the Shared Memory Segments. The output will show a
+list of segments organized by `key`.
 
 ```
 ipcs
 
------ Message Queues --------
key msqid owner perms used-bytes messages
 
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
 
------ Semaphore Arrays --------
key semid owner perms nsems
 ```
 
-Shared Memory Segments with keys [0x10242048 .. (0x10242048 + number of I/O
-Engines running)] are the segments that must be removed. Use ```ipcrm``` to
-remove the segment.
+Shared Memory Segments with keys
+[0x10242048 .. (0x10242048 + number of I/O Engines running)]
+are the segments that must be removed.
+Use `ipcrm` to remove the segment.
 
 For example, to remove the shared memory segment left behind by I/O Engine
 instance 0, issue:
+
 ```
 sudo ipcrm -M 0x10242048
 ```
+
 To remove the shared memory segment left behind by I/O Engine instance 1,
 issue:
+
 ```
 sudo ipcrm -M 0x10242049
 ```
 
 ### Server Start Issues
+
 1. Read the log located in the `control_log_file`.
 1. Verify that the `daos_server` process is not currently running.
 1. Check the SCM device path in /dev.
@@ -316,26 +326,37 @@ sudo ipcrm -M 0x10242049
 1. Verify that the `mgmt_svc_replicas` host is accessible and the port is not used.
 1. Check the `provider` entry. See the "Network Scan and Configuration" section of the admin guide for determining the right provider to use.
 1. Check `fabric_iface` in `engines`. They should be available and enabled.
-1. Check that `socket_dir` is writable by the daos_server.
+1. Check that `socket_dir` is writable by the daos\_server.
 
 ### Errors creating a Pool
+
 1. Check which engine ranks you want to create a pool on with `dmg system query --verbose` and verify that their State is Joined.
-1. `DER_NOSPACE(-1007)` appears: Check the size of the NVMe and PMem. 
Next, check the size of the existing pool. Then check that this new pool being created will fit into the remaining disk space.
+1. `DER_NOSPACE(-1007)` appears: Check the size of the NVMe and PMem. Next, check the size of the existing pool.
+   Then check that this new pool being created will fit into the remaining disk space.
 
 ### Problems creating a container
+
 1. Check that the path to daos is your intended binary. It's usually `/usr/bin/daos`.
 1. When the server configuration is changed, it's necessary to restart the agent.
-1. `DER_UNREACH(-1006)`: Check the socket ID consistency between PMem and NVMe. First, determine which socket you're using with `daos_server network scan -p all`. e.g., if the interface you're using in the engine section is eth0, find which NUMA Socket it belongs to. Next, determine the disks you can use with this socket by calling `daos_server nvme scan` or `dmg storage scan`. e.g., if eth0 belongs to NUMA Socket 0, use only the disks with 0 in the Socket ID column.
-1. Check the interface used in the server config (`fabric_iface`) also exists in the client and can communicate with the server.
+1. `DER_UNREACH(-1006)`: Check the socket ID consistency between PMem and NVMe.
+   First, determine which socket you're using with `daos_server network scan -p all`.
+   For example, if the interface you're using in the engine section is eth0,
+   find which NUMA Socket it belongs to. Next, determine the disks you can use with
+   this socket by calling `daos_server nvme scan` or `dmg storage scan`.
+   For example, if eth0 belongs to NUMA Socket 0, use only the disks with 0 in the Socket ID column.
+1. Check that the interface used in the server config (`fabric_iface`) also exists
+   in the client and can communicate with the server.
 1. Check that the `access_points` of the agent config point to the correct server hosts.
 1. Call `daos pool query` and check that the pool exists and has free space.
 
 ### Applications run slow
+
Verify that you're using Infiniband for `fabric_iface` in the server config. The IO will be
significantly slower with Ethernet.
 
## Common Errors and Workarounds
 
-### Use dmg command without daos_server_helper privilege
+### Use dmg command without daos\_server\_helper privilege
+
```
# Error message or timeout after dmg system query
$ dmg system query
@@ -355,7 +376,9 @@ ERROR: dmg: Unable to load Certificate Data: could not load cert: stat /etc/daos
# 2. Make sure the admin-host allow_insecure mode matches the applicable servers. 
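+
+# 3. (An additional check beyond the original list:) when running in insecure mode,
+#    verify that 'allow_insecure: true' is set under 'transport_config' in both the
+#    local daos_control.yml and the servers' daos_server.yml, then retry the command.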
``` -### use daos command before daos_agent started + +### use daos command before daos\_agent started + ``` $ daos cont create $DAOS_POOL daos ERR src/common/drpc.c:217 unixcomm_connect() Failed to connect to /var/run/daos_agent/daos_agent.sock, errno=2(No such file or directory) @@ -364,12 +387,15 @@ failed to initialize daos: Miscellaneous error (-1025) # Work around to check for daos_agent certification and start daos_agent + #check for /etc/daos/certs/daosCA.crt, agent.crt and agent.key $ sudo systemctl stop daos_agent.service $ sudo systemctl start daos_agent.service $ sudo systemctl status daos_agent.service ``` + ### use daos command with invalid or wrong parameters + ``` # Lack of providing daos pool_uuid $ daos pool list-cont @@ -407,7 +433,9 @@ use 'daos help RESOURCE' for resource specifics $ daos pool list-cont --pool=$DAOS_POOL bc4fe707-7470-4b7d-83bf-face75cc98fc ``` + ### dmg pool create failed due to no space + ``` $ dmg pool create --size=50G mypool Creating DAOS pool with automatic storage allocation: 50 GB NVMe + 6.00% SCM @@ -435,7 +463,9 @@ ERROR: dmg: pool create failed: DER_NOSPACE(-1007): No space on storage target ----- --------- -------- -------- ---------- --------- --------- boro-8 17 GB 2.9 GB 83 % 0 B 0 B N/A ``` + ### dmg pool destroy force + ``` # dmg pool destroy Timeout or failed due to pool having active connections # Workaround using pool destroy --force option @@ -443,7 +473,9 @@ ERROR: dmg: pool create failed: DER_NOSPACE(-1007): No space on storage target $ dmg pool destroy mypool --force Pool-destroy command succeeded ``` + ### dmg pool destroy recursive + ``` # dmg pool destroy Timeout or failed due to pool having associated container(s) # Workaround using pool destroy --recursive option @@ -451,7 +483,9 @@ ERROR: dmg: pool create failed: DER_NOSPACE(-1007): No space on storage target $ dmg pool destroy mypool --recursive Pool-destroy command succeeded ``` -### daos_engine fails to start with error "Address already in use" + +### daos\_engine fails to start with error "Address already in use" + ``` 09/26-15:25:38.06 node-1 DAOS[3851384/-1/0] external ERR # [4462751.071824] mercury->cls: [error] /builddir/build/BUILD/mercury-2.2.0/src/na/na_ofi.c:3638 na_ofi_basic_ep_open(): fi_enable() failed, rc: -98 (Address already in use) @@ -463,7 +497,8 @@ fabric_iface_port: 31316 # engine 1 fabric_iface_port: 31416 ``` -### daos_agent cache of engine URIs is stale + +### daos\_agent cache of engine URIs is stale The `daos_agent` cache may become invalid if `daos_engine` processes restart with different configurations or IP addresses, or if the DAOS system is reformatted. @@ -556,13 +591,36 @@ Alternately, the administrator may erase and re-format the DAOS system to start ### Engines become unavailable -Engines may become unavailable due to server power losses and reboots, network switch failures, etc. After staying unavailable for a certain period of time, these engines may become "excluded" or "errored" in `dmg system query` output. Once the states of all engines stabilize (see [`CRT_EVENT_DELAY`](env_variables.md)), each pool will check whether there is enough redundancy (see [Pool RF](pool_operations.md#pool-redundancy-factor)) to tolerate the unavailability of the "excluded" or "errored" engines. 
If there is enough redundancy, these engines will be excluded from the pool ("Disabled ranks" in `dmg pool query --health-only` output); otherwise, the pool will perform no exclusion ("Dead ranks" in `dmg pool query --health-only` output as described in [Querying a Pool](pool_operations.md#querying-a-pool)) and may become temporarily unavailable (as seen by timeouts of `dmg pool query`, `dmg pool list`, etc.). Similarly, when engines become available, whenever the states of all engines stabilize, each pool will perform the aforementioned check for any unavailable engines that remain. +Engines may become unavailable due to server power losses and reboots, +network switch failures, etc. After staying unavailable for a certain +period of time, these engines may become "excluded" or "errored" in +`dmg system query` output. Once the states of all engines stabilize +(see [CRT\_EVENT\_DELAY](env_variables.md)), each pool will check +whether there is enough redundancy +(see [Pool RF](pool_operations.md#pool-redundancy-factor)) +o tolerate the unavailability of the "excluded" or "errored" engines. +If there is enough redundancy, these engines will be excluded from the pool +("Disabled ranks" in `dmg pool query --health-only` output). +Otherwise, the pool will perform no exclusion +("Dead ranks" in `dmg pool query --health-only` output as described in +[Querying a Pool](pool_operations.md#querying-a-pool)) +and may become temporarily unavailable +(as seen by timeouts of `dmg pool query`, `dmg pool list`, etc.). +Similarly, when engines become available, whenever the states of all +engines stabilize, each pool will perform the aforementioned check for +any unavailable engines that remain. + +To restore availability as well as capacity and performance, +try to start all "excluded" or "errored" engines. +Starting all of them at the same time minimizes the chance of triggering +rebuild jobs. In many cases, the following command suffices: -To restore availability as well as capacity and performance, try to start all "excluded" or "errored" engines. Starting all of them at the same time minimizes the chance of triggering rebuild jobs. In many cases, the following command suffices: ``` $ dmg system start ``` + If some pools remain unavailable (e.g., `dmg pool list` keeps timing out) after the previous step, restart the whole system: + ``` $ dmg system stop --force $ dmg system start @@ -616,13 +674,14 @@ tmpfs 4096 0 4096 0% /sys/fs/cgroup wolf-1:/export/home/samirrav 6289369088 5927917792 361451296 95% /home/samirrav # ``` -### e2fsck +### e2fsck #### e2fsck command execution on non-corrupted file system. - "-f": Force check file system even it seems clean. - "-n": Use the option to assume an answer of 'no' to all questions. + ``` #/sbin/e2fsck -f -n /dev/pmem1 e2fsck 1.43.8 (1-Jan-2018) @@ -636,6 +695,7 @@ daos: 34/759040 files (0.0% non-contiguous), 8179728/777240064 blocks #echo $? 0 ``` + - Return Code: "0 - No errors" #### e2fsck command execution on corrupted file system. @@ -643,6 +703,7 @@ daos: 34/759040 files (0.0% non-contiguous), 8179728/777240064 blocks - "-f": Force check file system even it seems clean. - "-n": Use the option to assume an answer of 'no' to all questions. - "-C0": To monitored the progress of the filesystem check. + ``` # /sbin/e2fsck -f -n -C0 /dev/pmem1 e2fsck 1.43.8 (1-Jan-2018) @@ -677,6 +738,7 @@ daos: 13/759040 files (0.0% non-contiguous), 334428/777240064 blocks # echo $? 
4 ``` + - Return Code: "4 - File system errors left uncorrected" #### e2fsck command to repair and fixing the issue. @@ -684,6 +746,7 @@ daos: 13/759040 files (0.0% non-contiguous), 334428/777240064 blocks - "-f": Force check file system even it seems clean. - "-p": Automatically fix any filesystem problems that can be safely fixed without human intervention. - "-C0": To monitored the progress of the filesystem check. + ``` #/sbin/e2fsck -f -p -C0 /dev/pmem1 daos was not cleanly unmounted, check forced. @@ -701,8 +764,8 @@ daos: 13/759040 files (0.0% non-contiguous), 334428/777240064 blocks # echo $? 1 ``` -- Return Code: "1 - File system errors corrected" +- Return Code: "1 - File system errors corrected" ### ipmctl @@ -710,7 +773,8 @@ IPMCTL utility is used for Intel® Optane™ persistent memory for managing, dia [IPMCTL user guide](https://docs.pmem.io/ipmctl-user-guide/) has more details about the utility. DAOS user can use the [diagnostic](https://docs.pmem.io/ipmctl-user-guide/debug/run-diagnostic) and -[show error log](https://docs.pmem.io/ipmctl-user-guide/debug/show-error-log) functionality to debug the PMem related issues. +[show error log](https://docs.pmem.io/ipmctl-user-guide/debug/show-error-log) +functionality to debug the PMem related issues. #### ipmctl show command to get the DIMM ID connected to specific CPU. @@ -749,6 +813,7 @@ This test will verify the PMem health parameters are under acceptable values. It #### Run quick diagnostic test on specific dimm from socket. By default it will run diagnostic test for all dimms. * -dimm : DIMM ID from the ipmctl command above + ``` #ipmctl start -diagnostic quick -dimm 0x0001 --Test = Quick @@ -859,6 +924,7 @@ No errors found on PMem module 0x1121 0x1121 | 02/03/2022 16:50:17 | 0x04 - Locked/Illegal Access Show Error executed successfully ``` + ### ndctl NDCTL is another utility library for managing the PMem. The ndctl provides functional used for PMem and namespace management, @@ -877,10 +943,11 @@ This utility can be used after ipmctl where name space is already created by ipm Please refer the [ndctl-list](https://docs.pmem.io/ndctl-user-guide/ndctl-man-pages/ndctl-list) command guide for more details about the command options. -Total Sixteen PMem connected to single system. Eight PMem DIMMs are connected to single socket (reference "ipmctl show -topology" section under ipmctl). +Total Sixteen PMem connected to single system. Eight PMem DIMMs are connected to single socket (reference "ipmctl show -topology" section under ipmctl). The SCM modules are typically configured in AppDirect interleaved mode. They are thus presented to the operating system as a single PMem namespace per socket (in fsdax mode). * -M : Include Media Error + ``` # ndctl list -M [ @@ -931,11 +998,13 @@ This can lead to an application getting stuck in an infinite loop on IO operatio } } ] -``` +``` #### ndctl wait-scrub command ran on system which has bad blocks. -Please refer the [ndctl wait-scrub](https://docs.pmem.io/ndctl-user-guide/ndctl-man-pages/untitled-2) command guide for more information about the command options. +Please refer the [ndctl wait-scrub](https://docs.pmem.io/ndctl-user-guide/ndctl-man-pages/untitled-2) +command guide for more information about the command options. + ``` # ndctl wait-scrub [ @@ -959,7 +1028,8 @@ Please refer the [ndctl wait-scrub](https://docs.pmem.io/ndctl-user-guide/ndctl- #### command execution on system where bad blocks are scrubbed. 
-Please refer the [ndctl list](https://docs.pmem.io/ndctl-user-guide/ndctl-man-pages/ndctl-list) user guide for more information about the command options. +Please refer the [ndctl list](https://docs.pmem.io/ndctl-user-guide/ndctl-man-pages/ndctl-list) +user guide for more information about the command options. * -M : Include Media Error @@ -999,19 +1069,24 @@ Please refer the [ndctl list](https://docs.pmem.io/ndctl-user-guide/ndctl-man-pa } ] ``` + ### pmempool The pmempool is a management tool for Persistent Memory pool files created by PMDK libraries. DAOS uses the PMDK library to manage persistence inside ext4 files. -[pmempool](https://github.com/daos-stack/pmdk/blob/stable-2.1/doc/pmempool/pmempool-check.1.md) can check consistency of a given pool file. -It can be run with -r (repair) option which can fix some of the issues with pool file. DAOS will have more number of such pool file (vos-*), based -on number of targets mention per daos engine. User may need to check each vos pool file for corruption on faulty pool. +[pmempool](https://github.com/daos-stack/pmdk/blob/stable-2.1/doc/pmempool/pmempool-check.1.md) +can check consistency of a given pool file. +It can be run with -r (repair) option which can fix some of the issues with pool file. +DAOS will have more number of such pool file (`vos-*`), based +on number of targets mention per daos engine. +User may need to check each vos pool file for corruption on faulty pool. #### Unclean shutdown Example of the system which is not shutdown properly and that set the mode as dirty on the VOS pool file. * -v: More verbose. + ``` # pmempool check /mnt/daos0/0d977cd9-2571-49e8-902d-953f6adc6120/vos-0 -v checking shutdown state @@ -1028,6 +1103,7 @@ Example of check repair command ran on the system to fix the unclean shutdown. * -v: More verbose. * -r: repair the pool. * -y: Answer yes to all question. + ``` # pmempool check /mnt/daos0/894b94ee-cdb2-4241-943c-08769542d327/vos-0 -vry checking shutdown state @@ -1043,6 +1119,7 @@ pool header correct #### Check consistency. Check the consistency of the VOS pool file after repair. + ``` # pmempool check /mnt/daos0/894b94ee-cdb2-4241-943c-08769542d327/vos-0 -v checking shutdown state @@ -1056,7 +1133,7 @@ pool header correct ## Syslog -[`RAS events`](https://docs.daos.io/v2.6/admin/administration/#ras-events) are printed to the Syslog +[RAS events](https://docs.daos.io/v2.6/admin/administration/#ras-events) are printed to the Syslog by 'daos_server' processes via the Go standard library API. If no Syslog daemon is configured on the host, errors will be printed to the 'daos_server' log file: @@ -1097,6 +1174,7 @@ server package e.g. 'rsyslog'. ## Tools to debug connectivity issues across nodes ### ifconfig + ``` $ ifconfig lo: flags=73 mtu 65536 @@ -1124,9 +1202,11 @@ eth1: flags=4163 mtu 9000 TX packets 61 bytes 4156 (4.0 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 ``` + You can get the ip and network interface card (NIC) name with ifconfig. Important: Please run ifconfig on both DAOS server and client nodes to make sure mtu size are same for the network interfaces on different nodes. Mismatched mtu size could lead to DAOS hang on RDMA over converged Ethernet (RoCE) interfaces. ### lstopo-no-graphics + ``` $ lstopo-no-graphics ... @@ -1143,10 +1223,12 @@ $ lstopo-no-graphics OpenFabrics "mlx5_1" ... ``` + You can get the domain name and numa node information of your NICs. 
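+
+As a quick cross-check without lstopo (a sketch; it assumes a standard sysfs
+layout, and `eth1` stands in for your `fabric_iface`), the NUMA node of a NIC
+can also be read directly from sysfs:
+
+```
+$ cat /sys/class/net/eth1/device/numa_node
+0
+```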

In case lstopo-no-graphics is not installed, you can install package "hwloc" with yum/dnf or other package managers.
 
 ### ping
+
 ```
 client_node $ ping -c 3 -I eth1 10.165.192.121
 PING 10.165.192.121 (10.165.192.121) from 10.165.192.2 ens102: 56(84) bytes of data.
 64 bytes from 10.165.192.121: icmp_seq=2 ttl=64 time=0.120 ms
 64 bytes from 10.165.192.121: icmp_seq=3 ttl=64 time=0.083 ms
 ```
+
 Make sure ping can reach the NIC your DAOS server is bound to.
 
-### fi_pingpong
+### fi\_pingpong
+
 ```
 server_node $ fi_pingpong -p "tcp;ofi_rxm" -e rdm -d eth0
 client_node $ fi_pingpong -p "tcp;ofi_rxm" -e rdm -d eth0 ip_of_eth0_server
@@ -1179,19 +1263,24 @@ Make sure communications with verbs can go through.
 server_node $ fi_pingpong -p "verbs;ofi_rxm" -e rdm -d mlx5_0
 client_node $ fi_pingpong -p "verbs;ofi_rxm" -e rdm -d mlx5_0 ip_of_mlx5_0_server
 ```
-### ib_send_lat
+
+### ib\_send\_lat
+
 ```
 server_node $ ib_send_lat -d mlx5_0 -s 16384 -D 3
 client_node $ ib_send_lat -d mlx5_0 -s 16384 -D 3 ip_of_server
 ```
+
 This test checks whether verbs traffic goes through with Infiniband or RoCE cards. In case ib_send_lat is not installed, you can install package "perftest" with yum/dnf or other package managers.
 
 ## Tools to measure the network latency and bandwidth across nodes
 
 ### The tools in perftest for Infiniband and RoCE
+
 You can install package "perftest" with yum/dnf or other package managers if it is not available.
-Examples for measuring bandwidth,
+Examples for measuring bandwidth:
+
 ```
 ib_read_bw -a
 ib_read_bw -a 192.168.1.46
@@ -1214,23 +1303,29 @@ ib_send_lat -a
 ib_send_lat -a 192.168.1.46
 ```
 
-### fi_pingpong for Ethernet
+### fi\_pingpong for Ethernet
+
 You can install package "libfabric" with yum/dnf or other package managers if it is not available.
-Example,
+Example:
+
 ```
 server_node $ fi_pingpong -p "tcp;ofi_rxm" -e rdm -d eth0 -I 1000
 client_node $ fi_pingpong -p "tcp;ofi_rxm" -e rdm -d eth0 -I 1000 ip_of_eth0_server
 ```
+
 This reports network bandwidth. One can deduce the latency for a given packet size.
 
 ## Tools to diagnose network issues for a large cluster
 
 ### [Intel Cluster Checker](https://www.intel.com/content/www/us/en/developer/tools/oneapi/cluster-checker.html)
+
 This suite contains multiple useful tools including network_time_uniformity to debug network issues.
 
 ### [mpi-benchmarks](https://github.com/intel/mpi-benchmarks)
+
 Tools like IMB-P2P, IMB-MPI1, and IMB-RMA are helpful for a sanity check of latency and bandwidth.
+
 ```
 $ for((i=1;i<=65536;i*=4)); do echo "$i"; done &> msglen
 $ mpirun -np 4 -f hostlist ./IMB-P2P -msglen msglen PingPong
diff --git a/docs/admin/vmd.md b/docs/admin/vmd.md
index 5047c857220..76de24f3b7b 100644
--- a/docs/admin/vmd.md
+++ b/docs/admin/vmd.md
@@ -3,7 +3,7 @@
 [Intel VMD (Volume Management Device)](https://www.intel.com/content/www/us/en/architecture-and-technology/intel-volume-management-device-overview.html)
 is a feature introduced with the Intel Xeon Scalable processor family
 to help manage NVMe drives.
-It provides features such as **surprise hot plug, LED management, 
+It provides features such as **surprise hot plug, LED management,
 error isolation** and **bootable RAID**.
 
 The Intel VMD functionality is provided as part of the
@@ -23,7 +23,7 @@ in the servers' UEFI.
 It can then also be enabled in the `daos_server.yml` configuration file,
 as described below. 
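+
+As a quick way to confirm from the OS that the UEFI setting took effect (a
+sketch; the PCI ID `8086:201d` is the Intel VMD controller on Xeon Scalable
+platforms and may differ on other generations), list the VMD PCI devices:
+
+```
+$ lspci -d 8086:201d
+```
+
+An empty result usually means that VMD is still disabled in the UEFI.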
DAOS 2.2 did enable VMD-managed devices in the `daos_server.yml` configuration file
-(and as arguments to some DAOS management commands), but did 
+(and as arguments to some DAOS management commands), but did
 _not_ yet provide any additional functionality over non-VMD devices.
 
 DAOS 2.4 introduced the **LED management** feature that requires VMD.
@@ -35,7 +35,6 @@ Customers who intend to utilize DAOS capabilities that depend on VMD
 are encouraged to enable VMD when setting up the DAOS cluster,
 because changing from a non-VMD setup to VMD is not possible
 without reformatting the DAOS storage.
-
 ## NVMe view with VMD disabled (before binding to SPDK)
 
 The following is an example of the `lspci` view on a server with eight

From 8fa05be19cf9fbeb09b7339659c7e63a555fe956 Mon Sep 17 00:00:00 2001
From: Michael Hennecke
Date: Tue, 5 May 2026 16:56:44 +0200
Subject: [PATCH 2/2] DAOS-18907 doc: markdown fixes

fix typos

Doc-only: true

Signed-off-by: Michael Hennecke
---
 docs/admin/deployment.md      | 4 ++--
 docs/admin/troubleshooting.md | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/admin/deployment.md b/docs/admin/deployment.md
index aa6669e492c..633a48a83c8 100644
--- a/docs/admin/deployment.md
+++ b/docs/admin/deployment.md
@@ -97,7 +97,7 @@ for the path specified through the -o option of the `daos_server` command line,
 if unspecified then `/etc/daos/daos_server.yml` is used.
 
 Refer to the example configuration file
-[daosi\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
+[daos\_server.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
 for the latest information and examples.
 
 #### MD-on-SSD Configuration
@@ -118,7 +118,7 @@ Depending on the number of NVMe SSDs per DAOS engine there may be one, two or th
 different `bdev_role` assignments.
 
 For a complete server configuration file example enabling MD-on-SSD, see
-[daos\_server_mdonssd.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml).
+[daos\_server\_mdonssd.yml](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server_mdonssd.yml).
 
 Below are four different `daos_server.yml` storage configuration snippets that represent scenarios
 for a DAOS engine with four NVMe SSDs and MD-on-SSD enabled.
diff --git a/docs/admin/troubleshooting.md b/docs/admin/troubleshooting.md
index 053a92b33a6..d9f6763de73 100644
--- a/docs/admin/troubleshooting.md
+++ b/docs/admin/troubleshooting.md
@@ -598,7 +598,7 @@ period of time, these engines may become "excluded" or "errored" in
 (see [CRT\_EVENT\_DELAY](env_variables.md)), each pool will check
 whether there is enough redundancy
 (see [Pool RF](pool_operations.md#pool-redundancy-factor))
-o tolerate the unavailability of the "excluded" or "errored" engines.
+to tolerate the unavailability of the "excluded" or "errored" engines.
 If there is enough redundancy, these engines will be excluded from the pool
 ("Disabled ranks" in `dmg pool query --health-only` output).
 Otherwise, the pool will perform no exclusion