From 6bc9a6d2e0840dcd5879fd4703e235ddfd3cfdf1 Mon Sep 17 00:00:00 2001 From: rgurunathan Date: Mon, 11 May 2026 17:10:33 -0400 Subject: [PATCH 1/4] #15 - Add C++ bench infrastructure for DGX Spark performance report Lays the groundwork for the DGX Spark performance report (issue #15). Code and doc-skeleton only; per-cell numbers fill in via a follow-on commit on the same PR. Bench changes (examples/ only, no src/ touched): - raw_bench_common: add TokenBucketPacer (--target-gbps software pacer) and parse_target_gbps helper; emit seconds= in the shared rx_count_worker print; build the print in a stringstream so concurrent RX/TX worker output doesn't interleave on stdout. - raw_gpudirect_bench: wire pacer into the TX worker; track packet/byte counts; print TX complete: with seconds=. - rdma_bench: pacer for SEND path only; split send/recv completion counts and bytes per role; seconds= in server/client complete lines. - socket_bench: pacer; make iteration cap opt-in (iterations <= 0 means time-bounded by --seconds); sent_bytes/recv_bytes tracking. - CMakeLists: link raw_bench_common.cpp + CUDA::cudart into rdma/socket bench targets now that they consume the pacer helper. Drop accounting (without modifying managers, per zero-src/ constraint): - DPDK: parsed from DAQIRI_LOG_INFO output of PrintDpdkStats by wrapper. - RDMA: parsed from "CQ error" lines in DAQIRI_LOG_ERROR by wrapper. Bench-side CQE counting isn't possible -- the manager filters error completions before they reach get_rx_burst. - Socket UDP: kernel drops via /proc/net/udp diff by wrapper. - Socket TCP: nstat retrans/inerrs diff by wrapper (TCP has no clean "drops" semantic). New tooling: - examples/bench_capture_environment.sh: snapshots uname, kernel cmdline, CPU/NUMA, hugepages, NIC/PCIe state, OFED, GPU, governor, isolcpus, IRQ affinity, git rev once per result set. - examples/run_spark_bench.sh: sweep wrapper (smoke|sweep|drop-curve modes) per backend; runs bench under mpstat + nvidia-smi dmon; computes pps/Gbps in one place from packets=/bytes=/seconds= stdout fields; writes one CSV row per cell. Documentation: - docs/performance-dgx-spark.md: complete skeleton with TBD cells. Two headline tables (native-shape peak + matched 8 KB op size) so cross-backend comparison is honest. Per-backend sweep dimensions table addresses the DPDK packets / RoCE messages / TCP stream semantic mismatch. Documents that HDS is deferred on GB10 (host_pinned collapses the HDS vs. plain GPUDirect distinction). - mkdocs.yml, docs/index.html, README.md, AGENTS.md: wire the new doc into nav, landing-page News card, README Documentation table, AGENTS Documentation section + drift-hotspot list. - .gitignore: bench-results/ (wrapper output). Verified on a DGX Spark inside the project container: DPDK loopback smoke test produces correctly-formatted TX/RX complete: lines at ~84 Gbps cross-chip; pacer holds 1.055 Gbps when --target-gbps 1 is set (within the +/-5% software pacer accuracy documented in the report); socket UDP smoke produces the new sent_packets/recv_packets/seconds output format; wrapper produces a clean CSV row with non-zero gbps and drops=0. RDMA single-host smoke is deferred -- kernel routes local local IPs through lo (not the RoCE NICs), which requires a network namespace to test on one host; that setup lands with the data-fill commit. Planned follow-up PRs that build on this one: - This PR (#15), second commit: run the full sweep via run_spark_bench.sh and populate the C++ loopback cells in docs/performance-dgx-spark.md. Lands before this PR is marked ready-for-review. - PR for #15 workloads: introduce examples/bench_post_process.{h,cu} (cuFFT + cuBLAS post-process layer, examples-only -- no library dependency change), refactor rx_count_worker to defer burst free until the post-process CUDA event signals, and fill FFT/GEMM cells in the performance doc. - PR for #16 (Python loopback): add daqiri_bench_rdma.py and daqiri_bench_socket.py mirroring the C++ benches' --target-gbps and stdout format; reuses the existing pybind11 surface, no new bindings. Fills Python loopback cells. - PR for #16 workloads: pybind11 wrappers for the PR 2 post-process layer in python/bench_post_process_pybind.cpp; fills Python FFT/GEMM cells and completes the Spark v1 report. Co-Authored-By: Claude Opus 4.7 (1M context) Signed-off-by: rgurunathan --- .gitignore | 1 + AGENTS.md | 4 +- README.md | 1 + docs/index.html | 6 + docs/performance-dgx-spark.md | 312 ++++++++++++++++++++++++++ examples/CMakeLists.txt | 6 +- examples/bench_capture_environment.sh | 116 ++++++++++ examples/raw_bench_common.cpp | 55 ++++- examples/raw_bench_common.h | 30 +++ examples/raw_gpudirect_bench.cpp | 33 ++- examples/rdma_bench.cpp | 42 +++- examples/run_spark_bench.sh | 289 ++++++++++++++++++++++++ examples/socket_bench.cpp | 49 +++- mkdocs.yml | 2 + 14 files changed, 921 insertions(+), 25 deletions(-) create mode 100644 docs/performance-dgx-spark.md create mode 100755 examples/bench_capture_environment.sh create mode 100755 examples/run_spark_bench.sh diff --git a/.gitignore b/.gitignore index e221700..dc4cf82 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,6 @@ build*/ site/ +bench-results/ # macOS .DS_Store diff --git a/AGENTS.md b/AGENTS.md index c45ef72..79d9568 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -93,14 +93,16 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf - `docs/index.html` — custom HTML landing page (not generated by MkDocs, hand-maintained) - `docs/daqiri-api.html` — standalone HTML API reference (hand-maintained) - `docs/api-guide.md`, `docs/getting-started.md`, `docs/configuration.md` — core markdown docs +- `docs/performance-dgx-spark.md` — per-platform performance report (DGX Spark; more platforms to follow) - `docs/tutorials/` — tutorial walkthroughs (background, system config, benchmarking, config files) - `docs/stylesheets/extra.css` — custom theme overrides **Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots: - **Backend list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/configuration.md` - **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section -- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage) +- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage), and per-platform performance docs (`docs/performance-*.md`) - **Public API** (`src/common.h`, `src/types.h`, `src/manager.h`) — `docs/api-guide.md`, `docs/daqiri-api.html` +- **Bench CLI flags or output format** (`examples/raw_bench_common.{h,cpp}`, `*_bench.cpp`) — per-platform performance docs' Methodology section, `examples/run_spark_bench.sh` parsing logic - **Doc reorganization** (any rename in `docs/`) — `docs/index.html` landing page, `mkdocs.yml` nav, README Documentation table The full mapping with rationale lives in the docs-sync agent rule. Internal-link, anchor, and nav drift is enforced by CI (`.github/workflows/docs.yml`); content drift (stale binary names, defaults) is still a manual check at commit time. diff --git a/README.md b/README.md index 50112c6..30c17e1 100644 --- a/README.md +++ b/README.md @@ -81,6 +81,7 @@ Reference material for the DAQIRI codebase: - [Getting Started](docs/getting-started.md) — System requirements, build/install instructions, and CMake options - [Configuration Reference](docs/configuration.md) — Full YAML config reference for all backends - [API Guide](docs/api-guide.md) — BurstParams, RX/TX workflows, buffer lifecycle, status codes +- [Performance: DGX Spark](docs/performance-dgx-spark.md) — Per-platform throughput, drop, and utilization numbers for all backends - [Contributing](CONTRIBUTING.md) — Contribution guidelines, coding standards, DCO sign-off ## Tutorials diff --git a/docs/index.html b/docs/index.html index 61a9eb2..53f2ffe 100644 --- a/docs/index.html +++ b/docs/index.html @@ -595,6 +595,12 @@

News

+
+
Performance2026
+
DAQIRI Performance on DGX Spark
+
NVIDIA — Throughput, drops, and resource utilization for DPDK GPUDirect, RoCE, and socket backends measured on a DGX Spark (GB10) workstation. First in a series of per-platform performance reports.
+ +
GitHub2025
DAQIRI Open-Sourced on GitHub
diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md new file mode 100644 index 0000000..1e74cd6 --- /dev/null +++ b/docs/performance-dgx-spark.md @@ -0,0 +1,312 @@ +# Performance: DGX Spark + +This page reports DAQIRI throughput, drop, and resource-utilization numbers +measured on a DGX Spark (GB10) workstation. It is the first in a series of +per-platform performance reports; the same section layout will be reused for +IGX, x86-server, and other targets. + +The numbers below are reproducible — every cell is generated by +[`examples/run_spark_bench.sh`](https://github.com/nvidia/daqiri/blob/main/examples/run_spark_bench.sh) +against the YAML configs in `examples/`, with the system state captured by +[`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh) +alongside each result set. + +## Summary + +Two headline tables. **Native-shape peak** reports each backend at its +best-case operation size (the configuration the backend was designed for). +**Matched 8 KB** drives DPDK / RoCE / Socket UDP at a common ~8 KB unit of +work so the cross-backend comparison is apples-to-apples; TCP is omitted from +that table since it has no operation boundary. + +### Native-shape peak — max no-drop throughput (Gbps) + +| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| --------------------- | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- | +| DPDK GPUDirect (8 KB) | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| RoCE (8 MB SEND) | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| Socket UDP (MTU) | _TBD (PR 1)_ | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | +| Socket TCP (stream) | _TBD (PR 1)_ | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | + +### Matched 8 KB operation — cross-backend Gbps + +| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| ------------------------ | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- | +| DPDK GPUDirect | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| RoCE | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| Socket UDP (1472 B, MTU) | _TBD (PR 1)_ | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | + +!!! note "Why two tables" + A single Gbps number isn't enough to compare backends fairly. A DPDK + backend at peak with 8 KB packets is doing very different work than a RoCE + backend at peak with 8 MB messages, even when both report the same Gbps. + The matched 8 KB table makes the cross-backend comparison honest; the + native-shape table shows each backend's design ceiling. + +## Known limitations on this platform + +- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses + `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory + for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable — + so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no + longer changes the memory path; it only changes the segment partition, + which makes "HDS vs. plain GPUDirect" a non-distinction on this platform. + HDS is characterized when this report extends to IGX and x86-server + platforms where device memory works. See + [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking. +- **p99/p999 latency is not in v1.** The bench output captures throughput, + drops, and resource utilization. Per-burst RX timestamping and percentile + aggregation are deferred to a follow-up issue. + +## System under test + +The reproducibility appendix has the full capture. Key fields: + +| Field | Value | +| ----------------- | ------------------------------------ | +| Platform | NVIDIA DGX Spark (GB10) | +| GPU | Blackwell (compute capability 12.1) | +| NIC | ConnectX-7 (two ports, tied / QSFP loopback) | +| Topology | `0000:01:00.0` (mlx5_0) → `0002:01:00.0` (mlx5_2), physical loopback cable | +| OS | _captured in `environment.txt`_ | +| Kernel | _captured in `environment.txt`_ | +| CUDA driver | _captured in `environment.txt`_ | +| DPDK | patched per `dpdk_patches/` (container build) | +| DAQIRI commit | _captured in `environment.txt`_ | + +## Methodology + +### Bench commands + +Each backend has a dedicated bench executable in `examples/`. PR 1 numbers +come from these: + +```bash +# DPDK GPUDirect — physical loopback +./build/examples/daqiri_bench_raw_gpudirect \ + examples/daqiri_bench_raw_tx_rx_spark.yaml \ + --seconds 30 [--target-gbps G] + +# RoCE — same NIC, two ports +./build/examples/daqiri_bench_rdma \ + examples/daqiri_bench_rdma_tx_rx_spark.yaml \ + --seconds 30 --mode both [--target-gbps G] + +# Socket UDP / TCP — localhost +./build/examples/daqiri_bench_socket \ + examples/daqiri_bench_socket_udp_tx_rx.yaml \ + --seconds 30 --mode both [--target-gbps G] +``` + +The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC: + +```bash +ETH_DST_ADDR="$(cat /sys/class/net//address)" +``` + +### Per-backend sweep dimensions + +The "payload × batch" sweep doesn't map uniformly across backends. Each +backend has its own sweep: + +| Backend | Sweep dim 1 | Sweep dim 2 | Native-shape cell | Matched-size cell | +| ------------------ | ------------------------------------------------------ | -------------------------------------- | -------------------- | ---------------------- | +| DPDK GPUDirect | payload_size ∈ {64, 256, 1024, 4096, 8000} B | batch_size ∈ {256, 1024, 4096, 10240} | (8000, 10240) | (8000, 10240) — same | +| RoCE | message_size ∈ {4 K, 64 K, 1 M, 8 M} B | batch_size ∈ {1, 4, 16} | (8 M, 1) | (8 K, 16) | +| Socket UDP | payload_size ∈ {64, 256, 1024, 1472} B (MTU-bound) | batch_size ∈ {1, 32, 256} | (1472, 256) | (1472, 256) — closest to 8 K under MTU cap | +| Socket TCP | message_size ∈ {1 K, 64 K, 1 M} B | n/a (single stream) | (64 K) | n/a | + +### "No-drop" threshold + +A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run. +The headline tables report the highest target rate at which a run was still +drop-free under this threshold. The methodology does not use a percentile cap +— either there are drops or there are not. + +### Drop-curve sweep + +The drop curve sweeps `--target-gbps` while holding the native-shape cell +constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`) +adds a software-paced sleep after each burst; accuracy is ~±5 % at high rates +due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK +`accurate_send`) is unused but would tighten DPDK-only precision; deferred to +a follow-up. + +### Drop sources per backend + +- **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO` + output (the bench's stderr). +- **RoCE** — count of `CQ error` lines in `DAQIRI_LOG_ERROR` output (RDMA + manager filters error completions but logs each one). +- **Socket UDP** — diff of the `drops` column in `/proc/net/udp` over the run. +- **Socket TCP** — `nstat -a` diff of `TcpExtTCPLostRetransmit` / + `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is + the closest proxy. + +### External captures per run + +Each run records, in parallel with the bench: + +- `mpstat -P ALL 1 ` — per-core CPU busy%. +- `nvidia-smi dmon -s pucvmet -c ` — GPU SM%, mem%, DRAM bandwidth. + +Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, hugepages, +GPU state, DAQIRI commit) is captured once per result set by +`bench_capture_environment.sh`. + +## Results — DPDK GPUDirect + +_PR 1: this section is populated after the first sweep run._ + +### Drop curve at native shape + +| target_gbps | achieved Gbps | RX pps | drops | +| ----------- | ------------- | ------- | ----- | +| _TBD (PR 1)_ | | | | + +### Payload × batch sweep + +| payload | batch | Gbps | pps | drops | +| ------- | ----- | ---- | --- | ----- | +| _TBD (PR 1)_ | | | | | + +### CPU and GPU utilization (headline cell) + +| Resource | Value | +| --------------- | --------------- | +| Master core % | _TBD_ | +| TX core % | _TBD_ | +| RX core % | _TBD_ | +| GPU SM % | _TBD (near 0; GPU is DMA target, not compute)_ | +| GPU mem BW % | _TBD_ | + +## Results — RoCE + +_PR 1: this section is populated after the first sweep run._ + +### Drop curve at native shape + +| target_gbps | achieved Gbps | completions/s | drops (CQ errors) | +| ----------- | ------------- | ------------- | ----------------- | +| _TBD (PR 1)_ | | | | + +### Message-size × batch sweep + +| message_size | batch | Gbps | completions/s | drops | +| ------------ | ----- | ---- | ------------- | ----- | +| _TBD (PR 1)_ | | | | | + +### CPU and GPU utilization (headline cell) + +| Resource | Value | +| --------------- | ----- | +| Server core % | _TBD_ | +| Client core % | _TBD_ | +| GPU SM % | _TBD_ | +| GPU mem BW % | _TBD_ | + +## Results — Socket + +_PR 1: this section is populated after the first sweep run. GPU rows N/A._ + +### UDP — payload × batch sweep + +| payload | batch | Gbps | pps | drops | +| ------- | ----- | ---- | --- | ----- | +| _TBD (PR 1)_ | | | | | + +### TCP — message-size sweep + +| message_size | Gbps | retrans/inerrs | +| ------------ | ---- | -------------- | +| _TBD (PR 1)_ | | | + +## Workload variants (FFT, GEMM) + +The post-process layer ([PR 2](https://github.com/NVIDIA/daqiri/issues/15)) +adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` / +`cuBLAS` on the received GPU-resident payload, and reports the resulting +throughput delta and GPU utilization. + +**Sizes (representative):** + +- FFT: 1D complex-to-complex, length 1024. +- GEMM: fp32 square, N = 44 (largest tile that fits in an 8 KB payload). + +### DPDK GPUDirect + +_TBD (PR 2)._ + +### RoCE + +_TBD (PR 2)._ Note the unit-of-work mismatch when comparing across backends: +RoCE applies the post-process kernel once per ~8 MB SEND; DPDK applies it +once per packet. The throughput numbers are comparable; "operations per +burst" is not. + +## Python results + +The Python benches ([PR 3](https://github.com/NVIDIA/daqiri/issues/16)) mirror +the C++ benches' CLI and stdout format, using the existing pybind11 bindings. + +### Loopback + +_TBD (PR 3)._ + +### FFT / GEMM via pybind of the C++ post-process layer + +_TBD (PR 4)._ + +## Reproducibility appendix + +### Container + +All commands below assume execution inside the project container, as required +by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md). The +container must be started in privileged mode with all GPUs and hugepage +mounts passed through. + +### Full result regeneration + +```bash +# 1. Build inside the container (root). +cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \ + -DDAQIRI_MGR="dpdk socket rdma" +cmake --build build -j + +# 2. Snapshot environment + run the sweeps. +export DAQIRI_BUILD_DIR="$PWD/build" +export ETH_DST_ADDR="$(cat /sys/class/net//address)" + +# Native-shape headline cell: +./examples/run_spark_bench.sh dpdk smoke +./examples/run_spark_bench.sh rdma smoke +./examples/run_spark_bench.sh socket-udp smoke +./examples/run_spark_bench.sh socket-tcp smoke + +# Full payload × batch sweep: +./examples/run_spark_bench.sh dpdk sweep +./examples/run_spark_bench.sh rdma sweep +./examples/run_spark_bench.sh socket-udp sweep +./examples/run_spark_bench.sh socket-tcp sweep + +# Drop curve (sweeps --target-gbps): +./examples/run_spark_bench.sh dpdk drop-curve +./examples/run_spark_bench.sh rdma drop-curve +./examples/run_spark_bench.sh socket-udp drop-curve +./examples/run_spark_bench.sh socket-tcp drop-curve +``` + +### Environment-only capture + +Useful for filing a bug report or comparing two Spark units: + +```bash +./examples/bench_capture_environment.sh /tmp/spark-env +``` + +### Tuning prerequisites + +System tuning is required before the numbers in this report are reproducible. +See [`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md) +for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity. diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt index 4bc11aa..4807b85 100644 --- a/examples/CMakeLists.txt +++ b/examples/CMakeLists.txt @@ -71,14 +71,16 @@ add_daqiri_raw_bench(daqiri_bench_raw_reorder_quantize raw_reorder_quantize_benc add_daqiri_raw_bench(daqiri_example_gds_write gds_write_example.cpp) add_daqiri_raw_bench(daqiri_example_pcap_writer pcap_writer_example.cpp) -add_executable(daqiri_bench_rdma rdma_bench.cpp) +add_executable(daqiri_bench_rdma rdma_bench.cpp raw_bench_common.cpp) link_daqiri_bench(daqiri_bench_rdma) +target_link_libraries(daqiri_bench_rdma PRIVATE CUDA::cudart) set_target_properties(daqiri_bench_rdma PROPERTIES BUILD_RPATH "$ORIGIN/../src;$ORIGIN/../src/third_party/yaml-cpp" ) -add_executable(daqiri_bench_socket socket_bench.cpp) +add_executable(daqiri_bench_socket socket_bench.cpp raw_bench_common.cpp) link_daqiri_bench(daqiri_bench_socket) +target_link_libraries(daqiri_bench_socket PRIVATE CUDA::cudart) foreach(cfg IN LISTS DAQIRI_BENCH_CONFIGS) configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${cfg} ${CMAKE_CURRENT_BINARY_DIR}/${cfg} COPYONLY) diff --git a/examples/bench_capture_environment.sh b/examples/bench_capture_environment.sh new file mode 100755 index 0000000..3314b08 --- /dev/null +++ b/examples/bench_capture_environment.sh @@ -0,0 +1,116 @@ +#!/usr/bin/env bash +# +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Capture host/NIC/GPU/build state for a benchmark run, so numbers are +# reproducible across machines and over time. Writes one structured text file +# with named sections. +# +# Usage: ./bench_capture_environment.sh +# Default output dir: bench-results// + +set -u + +OUT_DIR="${1:-bench-results/$(date -u +%Y%m%dT%H%M%SZ)}" +mkdir -p "$OUT_DIR" +OUT="$OUT_DIR/environment.txt" + +# Run a command, capturing exit status. Always write a header so the section is +# present even when the command is missing or fails — silent absence is harder +# to debug than an explicit "command not found". +run_section() { + local label="$1"; shift + { + echo "==========================================================" + echo "[$label]" + echo " cmd: $*" + echo "==========================================================" + if command -v "$1" >/dev/null 2>&1 || [[ "$1" == /* || "$1" == ./* ]]; then + "$@" 2>&1 + echo " (exit: $?)" + else + echo " (command not found in PATH: $1)" + fi + echo + } >> "$OUT" +} + +# Cat a file/glob; write a header either way. +cat_section() { + local label="$1"; shift + { + echo "==========================================================" + echo "[$label]" + echo " paths: $*" + echo "==========================================================" + for p in "$@"; do + if compgen -G "$p" >/dev/null; then + for f in $p; do + echo "----- $f -----" + cat "$f" 2>&1 + done + else + echo " (no match: $p)" + fi + done + echo + } >> "$OUT" +} + +: > "$OUT" + +echo "DAQIRI benchmark environment capture" >> "$OUT" +echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$OUT" +echo "Host: $(hostname)" >> "$OUT" +echo "Output: $OUT" >> "$OUT" +echo >> "$OUT" + +# --- Kernel / OS --- +run_section "uname" uname -a +cat_section "kernel-cmdline" /proc/cmdline +cat_section "os-release" /etc/os-release +run_section "lsb-release" lsb_release -a +run_section "clocksource" cat /sys/devices/system/clocksource/clocksource0/current_clocksource + +# --- CPU / NUMA / IRQ --- +run_section "numactl" numactl --show +run_section "lscpu" lscpu +cat_section "cpu-isolated" /sys/devices/system/cpu/isolated +run_section "cpufreq-info" cpupower frequency-info +cat_section "cpu-governor" /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +run_section "irq-mlx5" bash -c "grep mlx5 /proc/interrupts || true" + +# --- Hugepages --- +cat_section "hugepages" /sys/kernel/mm/hugepages/*/nr_hugepages +run_section "free-h" free -h + +# --- PCIe topology --- +run_section "lspci-mellanox" bash -c "lspci -vvv -d 15b3: 2>/dev/null" +run_section "lspci-nvidia" bash -c "lspci -vvv -d 10de: 2>/dev/null" + +# --- NIC: OFED / firmware / DPDK binding --- +run_section "ofed-info" ofed_info -s +run_section "mlxfwmanager" mlxfwmanager --query +run_section "dpdk-devbind" dpdk-devbind.py --status +# Per-iface ethtool — iterate over the daqiri-tx/rx names if present, else all mlx5. +for iface in daqiri-tx daqiri-rx $(ls /sys/class/net 2>/dev/null | grep -E '^(enP|enp|eth)' || true); do + [[ -d "/sys/class/net/$iface" ]] || continue + run_section "ethtool-i:$iface" ethtool -i "$iface" + run_section "ethtool-g:$iface" ethtool -g "$iface" + run_section "ethtool-l:$iface" ethtool -l "$iface" + cat_section "iface-mtu:$iface" "/sys/class/net/$iface/mtu" + cat_section "iface-mac:$iface" "/sys/class/net/$iface/address" +done + +# --- GPU --- +run_section "nvidia-smi-q" nvidia-smi -q +run_section "nvidia-smi-tempclk" nvidia-smi --query-gpu=name,driver_version,temperature.gpu,clocks.current.sm,clocks.current.memory --format=csv + +# --- Build state --- +DAQIRI_DIR="$(git -C "$(dirname "$0")/.." rev-parse --show-toplevel 2>/dev/null || pwd)" +run_section "git-rev-parse" git -C "$DAQIRI_DIR" rev-parse HEAD +run_section "git-status" git -C "$DAQIRI_DIR" status --short +run_section "git-describe" git -C "$DAQIRI_DIR" describe --always --dirty + +echo "Capture complete: $OUT" diff --git a/examples/raw_bench_common.cpp b/examples/raw_bench_common.cpp index 405c500..bb1e42c 100644 --- a/examples/raw_bench_common.cpp +++ b/examples/raw_bench_common.cpp @@ -19,10 +19,12 @@ #include +#include #include #include #include #include +#include #include #include @@ -97,6 +99,44 @@ int parse_run_seconds(int argc, char **argv) { return run_seconds; } +double parse_target_gbps(int argc, char **argv) { + double target_gbps = 0.0; + for (int i = 2; i + 1 < argc; i += 2) { + if (std::string(argv[i]) == "--target-gbps") { + target_gbps = std::stod(argv[i + 1]); + } + } + return target_gbps; +} + +TokenBucketPacer::TokenBucketPacer(double target_gbps) + : target_bps_(target_gbps > 0.0 ? target_gbps * 1e9 : 0.0), + t0_(std::chrono::steady_clock::now()) {} + +void TokenBucketPacer::wait_for_bytes(size_t bytes, std::atomic &stop) { + if (target_bps_ <= 0.0) { + return; + } + total_bytes_ += bytes; + const double scheduled_secs = (total_bytes_ * 8.0) / target_bps_; + const auto scheduled = t0_ + std::chrono::duration_cast< + std::chrono::steady_clock::duration>( + std::chrono::duration(scheduled_secs)); + // Slice the wait into 10 ms chunks so a stop flag (--seconds expiry or + // Ctrl-C) can break us out promptly. The total slept across the slices + // accumulates to the scheduled deadline, so pacing remains accurate. + constexpr auto kSlice = std::chrono::milliseconds(10); + while (!stop.load()) { + const auto now = std::chrono::steady_clock::now(); + if (scheduled <= now) { + return; + } + const auto remaining = scheduled - now; + std::this_thread::sleep_for( + std::min(remaining, kSlice)); + } +} + bool has_bench_rx(const YAML::Node &root) { return root["bench_rx"] && root["bench_rx"]["interface_name"]; } @@ -287,6 +327,7 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic &stop) { uint64_t pkts = 0; uint64_t bytes = 0; uint64_t bursts = 0; + const auto t0 = std::chrono::steady_clock::now(); while (!stop.load()) { const auto num_rx_queues = static_cast(daqiri::get_num_rx_queues(port_id)); @@ -307,9 +348,17 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic &stop) { std::this_thread::sleep_for(std::chrono::microseconds(100)); } } - - std::cout << "RX complete: packets=" << pkts << " bytes=" << bytes - << " bursts=" << bursts << "\n"; + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - t0) + .count(); + + // Build the line in a stringstream so the print to stdout is a single + // write(). RX and TX workers race at end-of-run and naive `cout <<` can + // interleave their output (corrupting downstream parsers). + std::ostringstream oss; + oss << "RX complete: packets=" << pkts << " bytes=" << bytes + << " bursts=" << bursts << " seconds=" << secs << "\n"; + std::cout << oss.str() << std::flush; } } // namespace daqiri::bench diff --git a/examples/raw_bench_common.h b/examples/raw_bench_common.h index d53a4f9..cf0ee0c 100644 --- a/examples/raw_bench_common.h +++ b/examples/raw_bench_common.h @@ -21,6 +21,7 @@ #include #include +#include #include #include #include @@ -28,6 +29,34 @@ namespace daqiri::bench { +// Software token-bucket pacer used by the bench TX workers. When +// target_gbps == 0 the wait_for_bytes() call is a no-op early return, so the +// pacer adds no overhead when --target-gbps is unset. +// +// Accuracy: ~5% at high rates due to Linux nanosleep granularity and scheduler +// jitter. Acceptable for drop-curve sweeps; tighter pacing would require +// hardware TX timestamping (DAQIRI's accurate_send YAML flag), deferred. +class TokenBucketPacer { +public: + TokenBucketPacer() = default; + explicit TokenBucketPacer(double target_gbps); + + // Call after each TX burst. Sleeps in short slices until the pacer's notion + // of "time the configured target rate would have taken to send the + // accumulated bytes" catches up, OR `stop` flips true. Slicing keeps the + // bench responsive to --seconds expiry / Ctrl-C without truncating the total + // sleep (which would silently break pacing for low target rates). + void wait_for_bytes(size_t bytes, std::atomic &stop); + + bool enabled() const { return target_bps_ > 0.0; } + double target_gbps() const { return target_bps_ / 1e9; } + +private: + double target_bps_ = 0.0; // 0 means disabled + uint64_t total_bytes_ = 0; + std::chrono::steady_clock::time_point t0_; +}; + struct RawBenchTxConfig { std::string interface_name = "tx_port"; uint32_t batch_size = 1024; @@ -68,6 +97,7 @@ class PinnedHostBuffer { }; int parse_run_seconds(int argc, char **argv); +double parse_target_gbps(int argc, char **argv); bool has_bench_rx(const YAML::Node &root); bool has_bench_tx(const YAML::Node &root); RawBenchRxConfig parse_rx(const YAML::Node &root); diff --git a/examples/raw_gpudirect_bench.cpp b/examples/raw_gpudirect_bench.cpp index e93d0c8..cbfa9f9 100644 --- a/examples/raw_gpudirect_bench.cpp +++ b/examples/raw_gpudirect_bench.cpp @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -35,6 +36,7 @@ namespace { void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, + daqiri::bench::TokenBucketPacer &pacer, std::atomic &stop) { const int port_id = daqiri::get_port_id(cfg.interface_name); if (port_id < 0) { @@ -67,6 +69,12 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, std::unordered_set initialized_tx_buffers; + uint64_t tx_packets = 0; + uint64_t tx_bytes = 0; + uint64_t tx_bursts = 0; + const auto t0 = std::chrono::steady_clock::now(); + const uint32_t pkt_bytes = cfg.header_size + cfg.payload_size; + while (!stop.load()) { auto *msg = daqiri::create_tx_burst_params(); daqiri::set_header(msg, static_cast(port_id), 0, cfg.batch_size, @@ -109,7 +117,7 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, } if (daqiri::set_packet_lengths( - msg, i, {static_cast(cfg.header_size + cfg.payload_size)}) != + msg, i, {static_cast(pkt_bytes)}) != daqiri::Status::SUCCESS) { failed = true; break; @@ -121,18 +129,34 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, continue; } daqiri::send_tx_burst(msg); + const uint64_t burst_bytes = + static_cast(num_pkts) * pkt_bytes; + tx_packets += static_cast(num_pkts); + tx_bytes += burst_bytes; + ++tx_bursts; + pacer.wait_for_bytes(burst_bytes, stop); } + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - t0) + .count(); + // Single-write print so concurrent RX worker output doesn't interleave. + std::ostringstream oss; + oss << "TX complete: packets=" << tx_packets << " bytes=" << tx_bytes + << " bursts=" << tx_bursts << " seconds=" << secs << "\n"; + std::cout << oss.str() << std::flush; } } // namespace int main(int argc, char **argv) { if (argc < 2) { - std::cerr << "Usage: " << argv[0] << " [--seconds N]\n"; + std::cerr << "Usage: " << argv[0] + << " [--seconds N] [--target-gbps G]\n"; return 1; } const int run_seconds = daqiri::bench::parse_run_seconds(argc, argv); + const double target_gbps = daqiri::bench::parse_target_gbps(argc, argv); const auto root = YAML::LoadFile(argv[1]); if (daqiri::daqiri_init(argv[1]) != daqiri::Status::SUCCESS) { std::cerr << "daqiri_init failed\n"; @@ -150,14 +174,15 @@ int main(int argc, char **argv) { std::atomic stop{false}; std::thread tx_thread; std::thread rx_thread; + daqiri::bench::TokenBucketPacer tx_pacer(target_gbps); if (has_rx) { rx_thread = std::thread(daqiri::bench::rx_count_worker, daqiri::bench::parse_rx(root), std::ref(stop)); } if (has_tx) { - tx_thread = - std::thread(tx_worker, daqiri::bench::parse_tx(root), std::ref(stop)); + tx_thread = std::thread(tx_worker, daqiri::bench::parse_tx(root), + std::ref(tx_pacer), std::ref(stop)); } daqiri::bench::wait_for_stop(run_seconds, stop); diff --git a/examples/rdma_bench.cpp b/examples/rdma_bench.cpp index dbe8186..5b9bcfd 100644 --- a/examples/rdma_bench.cpp +++ b/examples/rdma_bench.cpp @@ -25,6 +25,7 @@ #include #include +#include "raw_bench_common.h" #include "src/common.h" namespace { @@ -48,6 +49,8 @@ struct RdmaBenchConfig { struct RdmaWorkerStats { uint64_t send_completions = 0; uint64_t recv_completions = 0; + uint64_t send_bytes = 0; + uint64_t recv_bytes = 0; }; RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) { @@ -62,7 +65,8 @@ RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) { return cfg; } -void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorkerStats& stats) { +void rdma_worker(const RdmaBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer, + std::atomic& stop, RdmaWorkerStats& stats) { static constexpr int kMaxOutstanding = 5; int outstanding_send = 0; int outstanding_recv = 0; @@ -116,6 +120,10 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker } outstanding++; wr_id++; + // Only meter actual byte transmissions (SENDs), not RECEIVE-side posts. + if (op == daqiri::RDMAOpCode::SEND) { + pacer.wait_for_bytes(static_cast(cfg.message_size), stop); + } }; if (cfg.send) { post_req(outstanding_send, send_wr_id, daqiri::RDMAOpCode::SEND, send_mr); } @@ -127,10 +135,12 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::SEND && outstanding_send > 0) { outstanding_send--; stats.send_completions++; + stats.send_bytes += static_cast(cfg.message_size); } else if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::RECEIVE && outstanding_recv > 0) { outstanding_recv--; stats.recv_completions++; + stats.recv_bytes += static_cast(cfg.message_size); } daqiri::free_tx_burst(completion); } else { @@ -143,17 +153,21 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker int main(int argc, char** argv) { if (argc < 2) { - std::cerr << "Usage: " << argv[0] << " [--seconds N] [--mode server|client|both]\n"; + std::cerr << "Usage: " << argv[0] + << " [--seconds N] [--mode server|client|both] [--target-gbps G]\n"; return 1; } int run_seconds = 10; + double target_gbps = 0.0; std::string mode = "both"; for (int i = 2; i + 1 < argc; i += 2) { if (std::string(argv[i]) == "--seconds") { run_seconds = std::stoi(argv[i + 1]); } else if (std::string(argv[i]) == "--mode") { mode = argv[i + 1]; + } else if (std::string(argv[i]) == "--target-gbps") { + target_gbps = std::stod(argv[i + 1]); } } @@ -168,18 +182,22 @@ int main(int argc, char** argv) { std::thread client_thread; RdmaWorkerStats server_stats; RdmaWorkerStats client_stats; + daqiri::bench::TokenBucketPacer server_pacer(target_gbps); + daqiri::bench::TokenBucketPacer client_pacer(target_gbps); bool run_server = false; bool run_client = false; if ((mode == "server" || mode == "both") && root["rdma_bench_server"]) { run_server = true; server_thread = std::thread( - rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]), std::ref(stop), std::ref(server_stats)); + rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]), + std::ref(server_pacer), std::ref(stop), std::ref(server_stats)); } if ((mode == "client" || mode == "both") && root["rdma_bench_client"]) { run_client = true; client_thread = std::thread( - rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]), std::ref(stop), std::ref(client_stats)); + rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]), + std::ref(client_pacer), std::ref(stop), std::ref(client_stats)); } if (!server_thread.joinable() && !client_thread.joinable()) { @@ -203,11 +221,23 @@ int main(int argc, char** argv) { if (server_thread.joinable()) { server_thread.join(); } if (client_thread.joinable()) { client_thread.join(); } + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - start) + .count(); + if (run_server) { - std::cout << "Server received messages: " << server_stats.recv_completions << '\n'; + std::cout << "Server complete: send_completions=" << server_stats.send_completions + << " recv_completions=" << server_stats.recv_completions + << " send_bytes=" << server_stats.send_bytes + << " recv_bytes=" << server_stats.recv_bytes + << " seconds=" << secs << '\n'; } if (run_client) { - std::cout << "Client received messages: " << client_stats.recv_completions << '\n'; + std::cout << "Client complete: send_completions=" << client_stats.send_completions + << " recv_completions=" << client_stats.recv_completions + << " send_bytes=" << client_stats.send_bytes + << " recv_bytes=" << client_stats.recv_bytes + << " seconds=" << secs << '\n'; } daqiri::print_stats(); diff --git a/examples/run_spark_bench.sh b/examples/run_spark_bench.sh new file mode 100755 index 0000000..e85df69 --- /dev/null +++ b/examples/run_spark_bench.sh @@ -0,0 +1,289 @@ +#!/usr/bin/env bash +# +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Sweep wrapper for DAQIRI benchmarks on DGX Spark. Runs the bench across a +# matrix of (payload/message size, batch size, target-gbps), captures per-run +# CPU/GPU/NIC counters, and emits one CSV row per cell into bench-results/. +# +# Drop sources per backend (per the report methodology): +# DPDK : grep imissed/ierrors/rx_nombuf from bench log (DAQIRI_LOG_INFO). +# RDMA : grep "CQ error" lines from bench log (DAQIRI_LOG_ERROR). +# socket : diff /proc/net/udp drops column (UDP); nstat -a (TCP retransmits). +# +# Usage: +# ./run_spark_bench.sh [mode] +# backend ∈ {dpdk, rdma, socket-udp, socket-tcp} +# mode ∈ {smoke, sweep, drop-curve} (default: smoke) +# +# Required environment in current shell: +# DAQIRI_BUILD_DIR — path to the cmake build dir (defaults to ../build). +# ETH_DST_ADDR — required for dpdk backend (the RX iface MAC). +# RX_IFACE — kernel name of the RX interface for /proc/net/udp diff +# (e.g. enP2p1s0f0np0); required for socket-udp. +# +# Run inside the project container as root (per AGENTS.md). + +set -u +set -o pipefail + +# -------------------------------------------------------------------------- +# Configuration +# -------------------------------------------------------------------------- + +BACKEND="${1:-}" +MODE="${2:-smoke}" +if [[ -z "$BACKEND" ]]; then + echo "Usage: $0 [smoke|sweep|drop-curve]" >&2 + exit 1 +fi + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +BUILD_DIR="${DAQIRI_BUILD_DIR:-$SCRIPT_DIR/../build}" +TS="$(date -u +%Y%m%dT%H%M%SZ)" +OUT_DIR="$SCRIPT_DIR/../bench-results/$TS-$BACKEND-$MODE" +mkdir -p "$OUT_DIR" + +CSV="$OUT_DIR/runs.csv" +echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_busy_mean,gpu_sm_pct,gpu_mem_bw" > "$CSV" + +# Capture slow-moving environment state once per result set. +"$SCRIPT_DIR/bench_capture_environment.sh" "$OUT_DIR" + +RUN_SECONDS=30 +DRIVER_LOG="$OUT_DIR/last_run.stderr" + +# Per-backend sweep matrices (see docs/performance-dgx-spark.md methodology). +# Native-shape sizes are the leftmost entry; "matched 8K" cell is also included. +case "$BACKEND" in + dpdk) + PAYLOADS_SWEEP=(8000 4096 1024 256 64) + BATCHES_SWEEP=(10240 4096 1024 256) + PAYLOADS_HEADLINE=(8000) + BATCHES_HEADLINE=(10240) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_raw_tx_rx_spark.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect" + : "${ETH_DST_ADDR:?ETH_DST_ADDR must be set for dpdk backend (cat /sys/class/net//address)}" + ;; + rdma) + PAYLOADS_SWEEP=(8000000 1048576 65536 8192 4096) + BATCHES_SWEEP=(1) + PAYLOADS_HEADLINE=(8000000) + BATCHES_HEADLINE=(1) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_rdma_tx_rx_spark.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_rdma" + ;; + socket-udp) + PAYLOADS_SWEEP=(1472 1024 256 64) + BATCHES_SWEEP=(256 32 1) + PAYLOADS_HEADLINE=(1472) + BATCHES_HEADLINE=(256) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_udp_tx_rx.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket" + ;; + socket-tcp) + PAYLOADS_SWEEP=(1048576 65536 1024) + BATCHES_SWEEP=(1) + PAYLOADS_HEADLINE=(65536) + BATCHES_HEADLINE=(1) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_tcp_tx_rx.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket" + ;; + *) echo "Unknown backend: $BACKEND" >&2; exit 1 ;; +esac + +DROP_CURVE_TARGETS=(1 5 10 25 50 75 100 0) # 0 means unpaced (line rate) + +# -------------------------------------------------------------------------- +# Helpers +# -------------------------------------------------------------------------- + +# Read a scalar field from a `key=value` style stdout line. +# usage: extract_field +extract_field() { + local prefix="$1" field="$2" file="$3" + grep -E "^$prefix" "$file" | tail -n1 | grep -oE " $field=[^ ]+" | head -n1 | sed -E "s/.*$field=//" +} + +# Sum DPDK drop counters from the manager log emitted via DAQIRI_LOG_INFO. +parse_dpdk_drops() { + local log="$1" + local sum=0 v + for key in imissed ierrors rx_nombuf; do + v="$(grep -oE "$key=[0-9]+" "$log" 2>/dev/null | tail -n1 | sed -E "s/.*=//" || true)" + [[ -n "${v:-}" ]] && sum=$((sum + v)) + done + echo "$sum" +} + +# Count RDMA CQ errors in the manager log. +parse_rdma_drops() { + local log="$1" + grep -c 'CQ error' "$log" 2>/dev/null || echo 0 +} + +# Snapshot socket drops on the kernel side. +snapshot_proc_net_udp() { + awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0 +} +snapshot_nstat() { + nstat -a 2>/dev/null | awk '/TcpExtTCPLostRetransmit|TcpRetransSegs|TcpInErrs/ { s += $2 } END { print s+0 }' || echo 0 +} + +# Substitute payload / batch into the base YAML and write a temp config. +generate_yaml() { + local out="$1" payload="$2" batch="$3" + case "$BACKEND" in + dpdk) + sed -E \ + -e "s|^( *payload_size: ).*|\1$payload|" \ + -e "s|^( *batch_size: ).*|\1$batch|" \ + -e "s|<00:00:00:00:00:00>|$ETH_DST_ADDR|g" \ + "$BASE_YAML" > "$out" + ;; + rdma) + sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out" + ;; + socket-udp|socket-tcp) + sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out" + ;; + esac +} + +# Run one cell. Echoes the CSV row to stdout. +run_cell() { + local lang="$1" payload="$2" batch="$3" target_gbps="$4" + local cell="$lang-$BACKEND-p$payload-b$batch-g$target_gbps" + local cell_dir="$OUT_DIR/$cell" + mkdir -p "$cell_dir" + + local yaml="$cell_dir/config.yaml" + generate_yaml "$yaml" "$payload" "$batch" + + # Snapshot kernel-side drop counters. + local udp_before tcp_before + udp_before="$(snapshot_proc_net_udp)" + tcp_before="$(snapshot_nstat)" + + # Background captures. + ( mpstat -P ALL 1 "$RUN_SECONDS" > "$cell_dir/mpstat.txt" 2>&1 ) & + local mpstat_pid=$! + ( nvidia-smi dmon -s pucvmet -c "$RUN_SECONDS" > "$cell_dir/nvidia_smi_dmon.txt" 2>&1 ) & + local dmon_pid=$! + + # Run the bench. Stderr captures DAQIRI_LOG_* output (DPDK/RDMA drop sources). + local stdout="$cell_dir/stdout.txt" + local stderr="$cell_dir/stderr.txt" + local args=("$yaml" --seconds "$RUN_SECONDS") + [[ "$target_gbps" != "0" ]] && args+=(--target-gbps "$target_gbps") + [[ "$BACKEND" == "rdma" || "$BACKEND" =~ ^socket- ]] && args+=(--mode both) + + "$BENCH_BIN" "${args[@]}" > "$stdout" 2> "$stderr" || true + cp "$stderr" "$DRIVER_LOG" + + # Stop background captures (they self-terminate at -c , but reap if needed). + wait "$mpstat_pid" 2>/dev/null || true + wait "$dmon_pid" 2>/dev/null || true + + # Parse bench stdout. For RX-bearing benches "RX complete" is authoritative; + # for TX-only configs fall back to "TX complete". + local pkts bytes secs + pkts="$(extract_field 'RX complete' packets "$stdout")" + bytes="$(extract_field 'RX complete' bytes "$stdout")" + secs="$(extract_field 'RX complete' seconds "$stdout")" + if [[ -z "$pkts" ]]; then + pkts="$(extract_field 'TX complete' packets "$stdout")" + bytes="$(extract_field 'TX complete' bytes "$stdout")" + secs="$(extract_field 'TX complete' seconds "$stdout")" + fi + if [[ -z "$pkts" ]]; then + # RDMA prints "Client/Server complete: ... send_completions=N send_bytes=N seconds=S" + pkts="$(extract_field 'Client complete' send_completions "$stdout")" + bytes="$(extract_field 'Client complete' send_bytes "$stdout")" + secs="$(extract_field 'Client complete' seconds "$stdout")" + fi + pkts="${pkts:-0}"; bytes="${bytes:-0}"; secs="${secs:-0}" + + local pps gbps + pps="$(awk -v p="$pkts" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.0f", p/s; else print 0 }')" + gbps="$(awk -v b="$bytes" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.3f", (b*8.0)/s/1e9; else print 0 }')" + + # Drops per backend. + local drops drops_kind + case "$BACKEND" in + dpdk) + drops="$(parse_dpdk_drops "$stderr")" + drops_kind="dpdk-imissed+ierrors+nombuf" + ;; + rdma) + drops="$(parse_rdma_drops "$stderr")" + drops_kind="rdma-cqe-error" + ;; + socket-udp) + local udp_after; udp_after="$(snapshot_proc_net_udp)" + drops="$((udp_after - udp_before))" + drops_kind="udp-proc-net-udp-drops" + ;; + socket-tcp) + local tcp_after; tcp_after="$(snapshot_nstat)" + drops="$((tcp_after - tcp_before))" + drops_kind="tcp-nstat-retrans+inerrs" + ;; + esac + + # Mean CPU busy% across all cores (mpstat row "all"). Simple aggregate; per-core + # in mpstat.txt for deeper review. + local cpu_busy_mean + cpu_busy_mean="$(awk '/Average:.*all/ { print 100 - $NF; exit }' "$cell_dir/mpstat.txt" 2>/dev/null || echo 0)" + cpu_busy_mean="${cpu_busy_mean:-0}" + + # GPU SM% and memory BW from nvidia-smi dmon. dmon output columns vary by + # driver; sm is typically column 5 in `-s pucvmet`, mem is column 6. + local gpu_sm gpu_mem + gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \ + "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)" + gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \ + "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)" + + echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_busy_mean,$gpu_sm,$gpu_mem" \ + | tee -a "$CSV" +} + +# -------------------------------------------------------------------------- +# Driver +# -------------------------------------------------------------------------- + +case "$MODE" in + smoke) + # One cell, native-shape, unpaced. + for p in "${PAYLOADS_HEADLINE[@]}"; do + for b in "${BATCHES_HEADLINE[@]}"; do + run_cell cpp "$p" "$b" 0 + done + done + ;; + sweep) + # Full payload × batch matrix at line rate. + for p in "${PAYLOADS_SWEEP[@]}"; do + for b in "${BATCHES_SWEEP[@]}"; do + run_cell cpp "$p" "$b" 0 + done + done + ;; + drop-curve) + # Hold native-shape constant, sweep target_gbps. + for p in "${PAYLOADS_HEADLINE[@]}"; do + for b in "${BATCHES_HEADLINE[@]}"; do + for g in "${DROP_CURVE_TARGETS[@]}"; do + run_cell cpp "$p" "$b" "$g" + done + done + done + ;; + *) echo "Unknown mode: $MODE" >&2; exit 1 ;; +esac + +echo +echo "Results in: $OUT_DIR" +echo "CSV: $CSV" diff --git a/examples/socket_bench.cpp b/examples/socket_bench.cpp index 5fda5f5..8ebb27f 100644 --- a/examples/socket_bench.cpp +++ b/examples/socket_bench.cpp @@ -26,6 +26,7 @@ #include #include +#include "raw_bench_common.h" #include "src/common.h" namespace { @@ -50,6 +51,8 @@ struct SocketBenchConfig { struct SocketWorkerStats { uint64_t sent_packets = 0; uint64_t received_packets = 0; + uint64_t sent_bytes = 0; + uint64_t received_bytes = 0; }; SocketBenchConfig parse_socket_cfg(const YAML::Node& node) { @@ -65,7 +68,8 @@ SocketBenchConfig parse_socket_cfg(const YAML::Node& node) { return cfg; } -void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, SocketWorkerStats& stats) { +void socket_worker(const SocketBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer, + std::atomic& stop, SocketWorkerStats& stats) { uintptr_t conn_id = 0; uint16_t port = 0; uint16_t queue = 0; @@ -92,8 +96,14 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket } } - const bool send_done = !cfg.send || stats.sent_packets >= static_cast(cfg.iterations); - const bool recv_done = !cfg.receive || stats.received_packets >= static_cast(cfg.iterations); + // When cfg.iterations <= 0, the loop is time-bounded (driven by stop.load() + // set by --seconds). Otherwise the iteration cap applies as before. + const bool send_done = !cfg.send || + (cfg.iterations > 0 && + stats.sent_packets >= static_cast(cfg.iterations)); + const bool recv_done = !cfg.receive || + (cfg.iterations > 0 && + stats.received_packets >= static_cast(cfg.iterations)); if (send_done && recv_done) { break; } if (cfg.send && !send_done) { @@ -110,6 +120,8 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket if (daqiri::send_tx_burst(msg) == daqiri::Status::SUCCESS) { stats.sent_packets++; + stats.sent_bytes += static_cast(cfg.message_size); + pacer.wait_for_bytes(static_cast(cfg.message_size), stop); } } else { daqiri::free_tx_metadata(msg); @@ -120,7 +132,9 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket daqiri::BurstParams* burst = nullptr; if (daqiri::get_rx_burst(&burst, conn_id, cfg.server) == daqiri::Status::SUCCESS && burst != nullptr) { - stats.received_packets += static_cast(daqiri::get_num_packets(burst)); + const uint64_t rx_pkts = static_cast(daqiri::get_num_packets(burst)); + stats.received_packets += rx_pkts; + stats.received_bytes += daqiri::get_burst_tot_byte(burst); daqiri::free_all_packets_and_burst_rx(burst); } else { std::this_thread::sleep_for(std::chrono::microseconds(100)); @@ -134,17 +148,20 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket int main(int argc, char** argv) { if (argc < 2) { std::cerr << "Usage: " << argv[0] - << " [--seconds N] [--mode server|client|both]\n"; + << " [--seconds N] [--mode server|client|both] [--target-gbps G]\n"; return 1; } int run_seconds = 10; + double target_gbps = 0.0; std::string mode = "both"; for (int i = 2; i + 1 < argc; i += 2) { if (std::string(argv[i]) == "--seconds") { run_seconds = std::stoi(argv[i + 1]); } else if (std::string(argv[i]) == "--mode") { mode = argv[i + 1]; + } else if (std::string(argv[i]) == "--target-gbps") { + target_gbps = std::stod(argv[i + 1]); } } @@ -159,6 +176,8 @@ int main(int argc, char** argv) { std::thread client_thread; SocketWorkerStats server_stats; SocketWorkerStats client_stats; + daqiri::bench::TokenBucketPacer server_pacer(target_gbps); + daqiri::bench::TokenBucketPacer client_pacer(target_gbps); bool run_server = false; bool run_client = false; @@ -166,6 +185,7 @@ int main(int argc, char** argv) { run_server = true; server_thread = std::thread(socket_worker, parse_socket_cfg(root["socket_bench_server"]), + std::ref(server_pacer), std::ref(stop), std::ref(server_stats)); } @@ -173,6 +193,7 @@ int main(int argc, char** argv) { run_client = true; client_thread = std::thread(socket_worker, parse_socket_cfg(root["socket_bench_client"]), + std::ref(client_pacer), std::ref(stop), std::ref(client_stats)); } @@ -199,13 +220,23 @@ int main(int argc, char** argv) { if (server_thread.joinable()) { server_thread.join(); } if (client_thread.joinable()) { client_thread.join(); } + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - start) + .count(); + if (run_server) { - std::cout << "Server sent packets: " << server_stats.sent_packets - << ", received packets: " << server_stats.received_packets << '\n'; + std::cout << "Server complete: sent_packets=" << server_stats.sent_packets + << " recv_packets=" << server_stats.received_packets + << " sent_bytes=" << server_stats.sent_bytes + << " recv_bytes=" << server_stats.received_bytes + << " seconds=" << secs << '\n'; } if (run_client) { - std::cout << "Client sent packets: " << client_stats.sent_packets - << ", received packets: " << client_stats.received_packets << '\n'; + std::cout << "Client complete: sent_packets=" << client_stats.sent_packets + << " recv_packets=" << client_stats.received_packets + << " sent_bytes=" << client_stats.sent_bytes + << " recv_bytes=" << client_stats.received_bytes + << " seconds=" << secs << '\n'; } daqiri::print_stats(); diff --git a/mkdocs.yml b/mkdocs.yml index 4b80201..81d2dcf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -35,6 +35,8 @@ nav: - Getting Started: getting-started.md - Configuration: configuration.md - API Reference: api-guide.md + - Performance: + - DGX Spark: performance-dgx-spark.md - Tutorials: - Background: tutorials/background.md - System Configuration: tutorials/system_configuration.md From 54b77f58f3f91459034c66e7332230d6590ca4dd Mon Sep 17 00:00:00 2001 From: rgurunathan Date: Wed, 13 May 2026 09:24:40 -0400 Subject: [PATCH 2/4] #15 - Add DPDK data-fill and drop-curve-matrix mode to perf doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fills the DPDK GPUDirect numbers (sweep, drop curve, headline tables) into docs/performance-dgx-spark.md from the 2026-05-12 bench runs, and adds the supporting infra for ongoing data fills: - New examples/run_spark_bench.sh mode `drop-curve-matrix` sweeps payload x target_gbps at the headline batch (feeds an upcoming payload x target_gbps heatmap, distinct from the existing 1D drop curve). - New scripts/spark_data_fill.sh one-shot driver runs the DPDK + socket sweep and drop-curve modes back-to-back, with pre-flight checks and orphan-hugepage cleanup between runs. RDMA is deferred from PR 1 (single-host loopback over the cable needs a netns + two-process refactor; tracked separately). - docs/stylesheets/extra.css adds a .perf-matrix heatmap (green / yellow / red cells) used by the payload x batch matrix and upcoming target_gbps matrix. - mkdocs.yml enables the footnotes extension so the deferred-row footnotes ([^1]/[^2]/[^3]) render. - Container snippet in the perf doc now auto-injects ETH_DST_ADDR (and RX_IFACE) from the host so the DPDK benches just work after `docker run`. RoCE and socket rows in the headline tables are marked deferred with footnotes; the corresponding follow-up issues (drafts at claude_plans/daqiri-pr15-spark-bench-followups.md) are: [^1] RoCE single-host loopback shortcuts through `lo` because both endpoints live in the root netns, so the QSFP cable carries no traffic; fix is an examples-only two-netns + two-process orchestration in run_spark_bench.sh. [^2] Socket UDP --mode both deadlocks on peer learning — both ends spin send-then-receive in one process, the server never learns a peer, and only ~1000 packets / 30 s trickle through under a flood of "no learned peer" ERROR spam. [^3] Socket TCP --mode both aborts immediately after the second accept with a glibc malloc.c:2599 (sysmalloc) heap-integrity assertion — likely a double-free or OOB write in the TCP socket-mgr init path. Issue numbers will be back-filled into the footnotes once filed. Co-Authored-By: Claude Opus 4.7 Signed-off-by: rgurunathan --- docs/performance-dgx-spark.md | 352 ++++++++++++++++++++++++---------- docs/stylesheets/extra.css | 60 ++++++ examples/run_spark_bench.sh | 75 ++++++-- mkdocs.yml | 1 + scripts/spark_data_fill.sh | 143 ++++++++++++++ 5 files changed, 511 insertions(+), 120 deletions(-) create mode 100755 scripts/spark_data_fill.sh diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md index 1e74cd6..28930c7 100644 --- a/docs/performance-dgx-spark.md +++ b/docs/performance-dgx-spark.md @@ -21,20 +21,24 @@ that table since it has no operation boundary. ### Native-shape peak — max no-drop throughput (Gbps) -| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | -| --------------------- | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- | -| DPDK GPUDirect (8 KB) | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | -| RoCE (8 MB SEND) | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | -| Socket UDP (MTU) | _TBD (PR 1)_ | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | -| Socket TCP (stream) | _TBD (PR 1)_ | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | +| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| --------------------- | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- | +| DPDK GPUDirect (8 KB) | **96.4** | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| RoCE (8 MB SEND) | _deferred_[^1] | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| Socket UDP (MTU) | _deferred_[^2] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | +| Socket TCP (stream) | _deferred_[^3] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | ### Matched 8 KB operation — cross-backend Gbps -| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | -| ------------------------ | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- | -| DPDK GPUDirect | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | -| RoCE | _TBD (PR 1)_ | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | -| Socket UDP (1472 B, MTU) | _TBD (PR 1)_ | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | +| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| ------------------------ | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- | +| DPDK GPUDirect | **96.4** | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| RoCE | _deferred_[^1] | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| Socket UDP (1472 B, MTU) | _deferred_[^2] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | + +[^1]: RoCE single-host loopback is deferred from PR 1 — the bench's `--mode both` runs both ends in one process; with both IPs locally bound, the kernel shortcuts RC connection setup through `lo` rather than the cable. Real loopback measurement requires two netns (one per port), `rdma system set netns exclusive`, RDMA-device netns transfer, a YAML split, and two-process orchestration. Tracked in a follow-up issue. +[^2]: Socket UDP `--mode both` deadlocks on peer learning: both server and client try to transmit before either has received, so neither side learns its peer. Only ~1000 packets / 30 s trickle through. Tracked in a follow-up issue. +[^3]: Socket TCP `--mode both` aborts with a glibc heap-corruption assertion (`malloc.c:2599`) immediately after the second TCP connection accept. Bench is unrunnable. Tracked in a follow-up issue. !!! note "Why two tables" A single Gbps number isn't enough to compare backends fairly. A DPDK @@ -54,6 +58,17 @@ that table since it has no operation boundary. HDS is characterized when this report extends to IGX and x86-server platforms where device memory works. See [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking. +- **RoCE single-host loopback is deferred from this report.** See footnote [^1] + on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in + `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0) + and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the + RC connection through `lo` and the QSFP cable carries no traffic. A + follow-up will land the two-netns + two-process orchestration and re-fill + the RoCE rows. +- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3]. + Both are bench-side bugs uncovered during PR 1 verification on Spark: + UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with + a glibc heap-corruption assertion on init. Follow-up issues track each. - **p99/p999 latency is not in v1.** The bench output captures throughput, drops, and resource utilization. Per-burst RX timestamping and percentile aggregation are deferred to a follow-up issue. @@ -78,21 +93,23 @@ The reproducibility appendix has the full capture. Key fields: ### Bench commands -Each backend has a dedicated bench executable in `examples/`. PR 1 numbers -come from these: +Each backend has a dedicated bench executable in `examples/`. The DPDK +numbers in this report come from the first command; the RoCE and Socket +commands are listed here for documentation and will be the basis for the +future fill of those rows. ```bash -# DPDK GPUDirect — physical loopback +# DPDK GPUDirect — physical loopback (used in this report) ./build/examples/daqiri_bench_raw_gpudirect \ examples/daqiri_bench_raw_tx_rx_spark.yaml \ --seconds 30 [--target-gbps G] -# RoCE — same NIC, two ports +# RoCE — same NIC, two ports (deferred; see Known limitations) ./build/examples/daqiri_bench_rdma \ examples/daqiri_bench_rdma_tx_rx_spark.yaml \ --seconds 30 --mode both [--target-gbps G] -# Socket UDP / TCP — localhost +# Socket UDP / TCP — localhost (deferred; see Known limitations) ./build/examples/daqiri_bench_socket \ examples/daqiri_bench_socket_udp_tx_rx.yaml \ --seconds 30 --mode both [--target-gbps G] @@ -106,6 +123,15 @@ ETH_DST_ADDR="$(cat /sys/class/net//address)" ### Per-backend sweep dimensions +**Payload** is the user-data bytes in one packet (DPDK / Socket UDP), +one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI +hands to (or pulls from) the NIC in one `rte_eth_tx_burst` / +`rte_eth_rx_burst` call — the burst size knob, not a packet-size knob. +Larger batches amortize doorbell and API overhead per packet; smaller +batches keep per-call latency lower. Batch only matters when the bench +is not yet at the link ceiling — at saturation, the NIC is the +bottleneck and batch size has near-zero effect. + The "payload × batch" sweep doesn't map uniformly across backends. Each backend has its own sweep: @@ -145,81 +171,162 @@ a follow-up. ### External captures per run -Each run records, in parallel with the bench: - -- `mpstat -P ALL 1 ` — per-core CPU busy%. -- `nvidia-smi dmon -s pucvmet -c ` — GPU SM%, mem%, DRAM bandwidth. - -Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, hugepages, -GPU state, DAQIRI commit) is captured once per result set by +Each run records, alongside the bench: + +- `/proc/stat` snapshots before and after the bench process. The wrapper + computes per-core busy% for the master / TX / RX cores (cores 8 / 17 / 18 + on Spark) by delta over the run window. `mpstat` is not used — it is + often missing from minimal containers, and the per-core CPUs we care + about are pinned by the YAML so a targeted delta is more meaningful + than a system-wide average. +- `nvidia-smi dmon -s pucvmet -c ` — GPU SM%, mem-controller%, and + PCIe rxpci / txpci columns (the latter are reported as `-` on the + current Spark driver; SM% and mem% are near zero for plain GPUDirect + because the GPU is a DMA target, not a compute engine). + +Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, +hugepages, GPU state, DAQIRI commit) is captured once per result set by `bench_capture_environment.sh`. ## Results — DPDK GPUDirect -_PR 1: this section is populated after the first sweep run._ +Native shape on Spark is 8 KB payload, batch 10240 — the configuration the +DPDK backend was designed around. All cells below ran for 30 s with +`drops == 0`. ### Drop curve at native shape -| target_gbps | achieved Gbps | RX pps | drops | -| ----------- | ------------- | ------- | ----- | -| _TBD (PR 1)_ | | | | - -### Payload × batch sweep - -| payload | batch | Gbps | pps | drops | -| ------- | ----- | ---- | --- | ----- | -| _TBD (PR 1)_ | | | | | - -### CPU and GPU utilization (headline cell) - -| Resource | Value | -| --------------- | --------------- | -| Master core % | _TBD_ | -| TX core % | _TBD_ | -| RX core % | _TBD_ | -| GPU SM % | _TBD (near 0; GPU is DMA target, not compute)_ | -| GPU mem BW % | _TBD_ | +Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The +token-bucket pacer tracks target within ±0.02 Gbps until the link saturates +near 96 Gbps. Beyond that, target=100 and unpaced both report the +achievable ceiling. TX and RX cores spin in poll-mode regardless of target +rate (visible in the CPU table below). + +| target Gbps | achieved Gbps | RX pps | drops | TX core % | RX core % | +| ----------- | ------------- | --------- | ----- | --------- | --------- | +| 1 | 1.011 | 15,678 | 0 | 92.0 | 92.0 | +| 5 | 5.012 | 77,697 | 0 | 91.8 | 91.8 | +| 10 | 10.001 | 155,032 | 0 | 92.5 | 92.5 | +| 25 | 25.008 | 387,650 | 0 | 91.9 | 91.9 | +| 50 | 50.001 | 775,062 | 0 | 92.8 | 92.8 | +| 75 | 74.999 | 1,162,551 | 0 | 91.7 | 91.7 | +| 100 | 96.370 | 1,493,823 | 0 | 91.6 | 91.6 | +| unpaced | 95.897 | 1,486,498 | 0 | 91.6 | 90.5 | + +### Payload × batch matrix + +Each cell shows the achieved Gbps and drops over a 30 s unpaced run. +Coloring is relative to the global max across the matrix (here +**104.989 Gbps** at payload 4096 B, batch 4096): + +
+ green — no drops, Gbps ≥ 90% of max + yellow — no drops, Gbps ≥ 70% of max + red — drops, or Gbps < 70% of max +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
payloadbatch (packets per burst)
2561024409610240
8000 B96.4 Gbps0 drops96.4 Gbps0 drops96.4 Gbps0 drops95.9 Gbps0 drops
4096 B104.9 Gbps0 drops104.9 Gbps0 drops105.0 Gbps0 drops103.0 Gbps0 drops
1024 B64.9 Gbps0 drops86.2 Gbps0 drops71.0 Gbps0 drops72.7 Gbps0 drops
256 B24.5 Gbps0 drops24.8 Gbps0 drops29.3 Gbps0 drops21.1 Gbps0 drops
64 B10.1 Gbps0 drops10.3 Gbps0 drops9.8 Gbps0 drops10.3 Gbps0 drops
+ +**Reading the matrix.** At payload ≥ 4 KB the link saturates (~96–105 +Gbps) and batch size barely moves the number, so every cell is green. +The 1 KB row is the transition: pps and Gbps both matter, batch size +starts to influence which side dominates, and most cells fall under 70% +of the global max. At ≤ 256 B the bottleneck is packets-per-second +(~10 M pps ceiling at 64 B), so effective Gbps stays well below the +link ceiling regardless of batch. Run-to-run variance on the unpaced +cells is ~0.5 Gbps; row-internal Gbps differences smaller than that +should be treated as noise. + +### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced) + +| Resource | Value | Note | +| -------------------- | ----- | ------------------------------------------ | +| Master core (CPU 8) | 3.3% | Mostly idle; orchestration only | +| TX core (CPU 17) | 91.4% | Poll-mode spin; rate-independent | +| RX core (CPU 18) | 91.4% | Poll-mode spin; rate-independent | +| GPU SM % | 0.0% | GPU is a DMA target, no compute | +| GPU mem-ctrl % | 0.0% | Payload writes traverse PCIe, not the GPU memory controller | + +The TX/RX cores stay at ~92% across every drop-curve step from 1 Gbps to +line rate — characteristic of DPDK's poll-mode driver, which spins waiting +for descriptors regardless of offered load. The master core handles +configuration only and idles below 5 % at the headline shape; at smaller +payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts +flow through the orchestration path. That asymmetry is data, not a bug, +and is captured in the per-cell artifacts under `bench-results/`. ## Results — RoCE -_PR 1: this section is populated after the first sweep run._ - -### Drop curve at native shape - -| target_gbps | achieved Gbps | completions/s | drops (CQ errors) | -| ----------- | ------------- | ------------- | ----------------- | -| _TBD (PR 1)_ | | | | - -### Message-size × batch sweep - -| message_size | batch | Gbps | completions/s | drops | -| ------------ | ----- | ---- | ------------- | ----- | -| _TBD (PR 1)_ | | | | | - -### CPU and GPU utilization (headline cell) - -| Resource | Value | -| --------------- | ----- | -| Server core % | _TBD_ | -| Client core % | _TBD_ | -| GPU SM % | _TBD_ | -| GPU mem BW % | _TBD_ | +**Deferred from this report.** See [headline-table footnote 1](#fn:1) and +the Known limitations section. Single-host RoCE loopback on Spark requires a +two-netns + two-process orchestration that the wrapper does not yet +implement. The RoCE rows will be filled when the follow-up issue lands. ## Results — Socket -_PR 1: this section is populated after the first sweep run. GPU rows N/A._ - -### UDP — payload × batch sweep - -| payload | batch | Gbps | pps | drops | -| ------- | ----- | ---- | --- | ----- | -| _TBD (PR 1)_ | | | | | +**Deferred from this report.** See [headline-table footnotes 2 and 3](#fn:2) +and the Known limitations section. Both backends produced unusable data on +Spark during PR 1 verification: -### TCP — message-size sweep +- **Socket UDP** in `--mode both` deadlocks on peer learning — both ends try + to transmit before either has received, only ~1000 packets per 30 s get + through (≈ 390 kbps). Visible as repeated + `[ERROR] UDP server has no learned peer yet; cannot transmit` lines in + the bench's stderr. +- **Socket TCP** in `--mode both` aborts with a glibc heap-corruption + assertion (`Fatal glibc error: malloc.c:2599 (sysmalloc)`) immediately + after the second TCP connection accept. The bench crashes before any + send / recv completes, so no completion line is printed. -| message_size | Gbps | retrans/inerrs | -| ------------ | ---- | -------------- | -| _TBD (PR 1)_ | | | +Both are tracked as separate follow-up issues; the Socket rows here will be +filled once the underlying bench bugs are fixed. ## Workload variants (FFT, GEMM) @@ -261,45 +368,79 @@ _TBD (PR 4)._ ### Container -All commands below assume execution inside the project container, as required -by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md). The -container must be started in privileged mode with all GPUs and hugepage -mounts passed through. +All commands below assume execution inside the project container, as +required by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md). +On the host, launch the container in privileged mode with all GPUs, +hugepage mounts, and `/sys` passed through, and bind-mount the repo at +`/workspace`: -### Full result regeneration +```bash +# RX-side NIC; auto-injects ETH_DST_ADDR for the DPDK bench wrappers. +RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}" +sudo docker run --rm -it \ + --net host --ipc=host \ + --runtime=nvidia --gpus all \ + --privileged \ + --ulimit memlock=-1 --ulimit stack=67108864 \ + -v "$(pwd):/workspace" \ + -v /dev/hugepages:/dev/hugepages \ + -v /mnt/huge:/mnt/huge \ + -v /sys:/sys \ + -w /workspace \ + -e ETH_DST_ADDR="$(cat /sys/class/net/$RX_IFACE/address)" \ + -e RX_IFACE="$RX_IFACE" \ + daqiri:local \ + bash +``` + +### Build + +Inside the container: ```bash -# 1. Build inside the container (root). cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \ -DDAQIRI_MGR="dpdk socket rdma" cmake --build build -j +``` + +### One-shot driver + +The whole DPDK matrix (sweep + drop-curve) is driven by a single script +which handles pre-flight checks (hugepage availability, RX iface MAC, +link state), orphan-hugepage cleanup between cells, and a final summary of +per-backend result directories: + +```bash +./scripts/spark_data_fill.sh dpdk +``` + +The script defaults to `dpdk socket-udp socket-tcp` if invoked with no +arguments; on Spark, the socket backends will fail their own pre-flight +once the follow-up issues land. RDMA is currently rejected by the +pre-flight (see Known limitations). + +### Per-backend wrapper invocations -# 2. Snapshot environment + run the sweeps. +For ad-hoc runs of a single cell or a single mode: + +```bash export DAQIRI_BUILD_DIR="$PWD/build" -export ETH_DST_ADDR="$(cat /sys/class/net//address)" - -# Native-shape headline cell: -./examples/run_spark_bench.sh dpdk smoke -./examples/run_spark_bench.sh rdma smoke -./examples/run_spark_bench.sh socket-udp smoke -./examples/run_spark_bench.sh socket-tcp smoke - -# Full payload × batch sweep: -./examples/run_spark_bench.sh dpdk sweep -./examples/run_spark_bench.sh rdma sweep -./examples/run_spark_bench.sh socket-udp sweep -./examples/run_spark_bench.sh socket-tcp sweep - -# Drop curve (sweeps --target-gbps): -./examples/run_spark_bench.sh dpdk drop-curve -./examples/run_spark_bench.sh rdma drop-curve -./examples/run_spark_bench.sh socket-udp drop-curve -./examples/run_spark_bench.sh socket-tcp drop-curve +export ETH_DST_ADDR="$(cat /sys/class/net//address)" # DPDK only + +./examples/run_spark_bench.sh dpdk smoke # native-shape headline cell +./examples/run_spark_bench.sh dpdk sweep # full payload × batch matrix +./examples/run_spark_bench.sh dpdk drop-curve # sweep --target-gbps ``` +Each invocation emits `bench-results/-dpdk-/` containing +one subdirectory per cell (stdout / stderr / config / dmon / cpu_stat +captures), an `environment.txt` snapshot, and a `runs.csv` aggregating +the cell-level metrics. + ### Environment-only capture -Useful for filing a bug report or comparing two Spark units: +Useful for filing a bug report or comparing two Spark units without +running the bench: ```bash ./examples/bench_capture_environment.sh /tmp/spark-env @@ -307,6 +448,7 @@ Useful for filing a bug report or comparing two Spark units: ### Tuning prerequisites -System tuning is required before the numbers in this report are reproducible. -See [`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md) +System tuning is required before the numbers in this report are +reproducible. See +[`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md) for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity. diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index b5ce45f..c3e156a 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -50,3 +50,63 @@ [data-md-color-scheme="slate"] .md-footer { background: #111; } + +/* ── Performance-report heatmap cells ───────────────────────────────── */ +/* Used by the payload×batch and payload×target_gbps matrices in + docs/performance-*.md. Threshold logic (vs. matrix-global max): + green = no drops AND Gbps ≥ 90% of max + yellow = no drops AND Gbps ≥ 70% of max + red = any drops OR Gbps < 70% of max */ +.md-typeset table.perf-matrix { + width: 85%; + table-layout: fixed; + border-collapse: separate; + border-spacing: 5px; + font-size: 0.64rem; +} +.md-typeset table.perf-matrix th, +.md-typeset table.perf-matrix td { + text-align: center; + vertical-align: middle; + font-variant-numeric: tabular-nums; + padding: 0.55em 0.6em; + border-radius: 4px; +} +.md-typeset table.perf-matrix td small { + display: block; + opacity: 0.75; + font-size: 0.85em; + margin-top: 0.25em; +} +.md-typeset table.perf-matrix td.cell-green { + background-color: rgba(118, 185, 0, 0.28); + color: inherit; +} +.md-typeset table.perf-matrix td.cell-yellow { + background-color: rgba(255, 196, 0, 0.32); + color: inherit; +} +.md-typeset table.perf-matrix td.cell-red { + background-color: rgba(220, 60, 60, 0.32); + color: inherit; +} +.md-typeset table.perf-matrix th { + background-color: rgba(255, 255, 255, 0.05); + font-weight: 600; +} +/* Compact legend chips that pair with the matrix. */ +.md-typeset .perf-legend { + display: flex; + gap: 0.75em; + margin: 0.5em 0 1em 0; + font-size: 0.85em; + flex-wrap: wrap; +} +.md-typeset .perf-legend span { + padding: 0.1em 0.55em; + border-radius: 0.25em; + white-space: nowrap; +} +.md-typeset .perf-legend .cell-green { background-color: rgba(118, 185, 0, 0.28); } +.md-typeset .perf-legend .cell-yellow { background-color: rgba(255, 196, 0, 0.32); } +.md-typeset .perf-legend .cell-red { background-color: rgba(220, 60, 60, 0.32); } diff --git a/examples/run_spark_bench.sh b/examples/run_spark_bench.sh index e85df69..c2aa75a 100755 --- a/examples/run_spark_bench.sh +++ b/examples/run_spark_bench.sh @@ -15,7 +15,7 @@ # Usage: # ./run_spark_bench.sh [mode] # backend ∈ {dpdk, rdma, socket-udp, socket-tcp} -# mode ∈ {smoke, sweep, drop-curve} (default: smoke) +# mode ∈ {smoke, sweep, drop-curve, drop-curve-matrix} (default: smoke) # # Required environment in current shell: # DAQIRI_BUILD_DIR — path to the cmake build dir (defaults to ../build). @@ -35,7 +35,7 @@ set -o pipefail BACKEND="${1:-}" MODE="${2:-smoke}" if [[ -z "$BACKEND" ]]; then - echo "Usage: $0 [smoke|sweep|drop-curve]" >&2 + echo "Usage: $0 [smoke|sweep|drop-curve|drop-curve-matrix]" >&2 exit 1 fi @@ -46,7 +46,7 @@ OUT_DIR="$SCRIPT_DIR/../bench-results/$TS-$BACKEND-$MODE" mkdir -p "$OUT_DIR" CSV="$OUT_DIR/runs.csv" -echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_busy_mean,gpu_sm_pct,gpu_mem_bw" > "$CSV" +echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_master_pct,cpu_tx_pct,cpu_rx_pct,gpu_sm_pct,gpu_mem_pct" > "$CSV" # Capture slow-moving environment state once per result set. "$SCRIPT_DIR/bench_capture_environment.sh" "$OUT_DIR" @@ -64,6 +64,7 @@ case "$BACKEND" in BATCHES_HEADLINE=(10240) BASE_YAML="$SCRIPT_DIR/daqiri_bench_raw_tx_rx_spark.yaml" BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 : "${ETH_DST_ADDR:?ETH_DST_ADDR must be set for dpdk backend (cat /sys/class/net//address)}" ;; rdma) @@ -73,6 +74,7 @@ case "$BACKEND" in BATCHES_HEADLINE=(1) BASE_YAML="$SCRIPT_DIR/daqiri_bench_rdma_tx_rx_spark.yaml" BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_rdma" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 ;; socket-udp) PAYLOADS_SWEEP=(1472 1024 256 64) @@ -81,6 +83,7 @@ case "$BACKEND" in BATCHES_HEADLINE=(256) BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_udp_tx_rx.yaml" BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 ;; socket-tcp) PAYLOADS_SWEEP=(1048576 65536 1024) @@ -89,6 +92,7 @@ case "$BACKEND" in BATCHES_HEADLINE=(1) BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_tcp_tx_rx.yaml" BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 ;; *) echo "Unknown backend: $BACKEND" >&2; exit 1 ;; esac @@ -131,6 +135,31 @@ snapshot_nstat() { nstat -a 2>/dev/null | awk '/TcpExtTCPLostRetransmit|TcpRetransSegs|TcpInErrs/ { s += $2 } END { print s+0 }' || echo 0 } +# Snapshot /proc/stat per-cpu counters to a file. Mpstat is often not installed +# in the bench container; /proc/stat is always available. +snapshot_cpu_stat() { + awk '/^cpu[0-9]+/ { + total = $2+$3+$4+$5+$6+$7+$8 + busy = total - $5 - $6 + print $1, total, busy + }' /proc/stat > "$1" +} + +# Compute busy% for a single cpu index between two /proc/stat snapshots. +cpu_busy_pct() { + local before="$1" after="$2" cpu_idx="$3" + awk -v cpu="cpu$cpu_idx" ' + NR == FNR { b_total[$1] = $2; b_busy[$1] = $3; next } + { a_total[$1] = $2; a_busy[$1] = $3 } + END { + dt = a_total[cpu] - b_total[cpu] + db = a_busy[cpu] - b_busy[cpu] + if (dt > 0) printf "%.1f", (db * 100.0) / dt + else printf "0.0" + } + ' "$before" "$after" +} + # Substitute payload / batch into the base YAML and write a temp config. generate_yaml() { local out="$1" payload="$2" batch="$3" @@ -166,9 +195,10 @@ run_cell() { udp_before="$(snapshot_proc_net_udp)" tcp_before="$(snapshot_nstat)" - # Background captures. - ( mpstat -P ALL 1 "$RUN_SECONDS" > "$cell_dir/mpstat.txt" 2>&1 ) & - local mpstat_pid=$! + # Snapshot per-cpu stats just before the bench starts. + snapshot_cpu_stat "$cell_dir/cpu_stat.before" + + # Background GPU dmon (1-sec sample, RUN_SECONDS samples). ( nvidia-smi dmon -s pucvmet -c "$RUN_SECONDS" > "$cell_dir/nvidia_smi_dmon.txt" 2>&1 ) & local dmon_pid=$! @@ -182,8 +212,11 @@ run_cell() { "$BENCH_BIN" "${args[@]}" > "$stdout" 2> "$stderr" || true cp "$stderr" "$DRIVER_LOG" + # Snapshot per-cpu stats right after the bench exits (before background + # captures finish reaping, to bound the window). + snapshot_cpu_stat "$cell_dir/cpu_stat.after" + # Stop background captures (they self-terminate at -c , but reap if needed). - wait "$mpstat_pid" 2>/dev/null || true wait "$dmon_pid" 2>/dev/null || true # Parse bench stdout. For RX-bearing benches "RX complete" is authoritative; @@ -232,21 +265,23 @@ run_cell() { ;; esac - # Mean CPU busy% across all cores (mpstat row "all"). Simple aggregate; per-core - # in mpstat.txt for deeper review. - local cpu_busy_mean - cpu_busy_mean="$(awk '/Average:.*all/ { print 100 - $NF; exit }' "$cell_dir/mpstat.txt" 2>/dev/null || echo 0)" - cpu_busy_mean="${cpu_busy_mean:-0}" + # Per-core CPU busy% over the bench window. Cores defined per-backend + # (master/TX/RX) match the YAML so we measure the threads we actually pin. + local cpu_master_pct cpu_tx_pct cpu_rx_pct + cpu_master_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_MASTER")" + cpu_tx_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_TX")" + cpu_rx_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_RX")" - # GPU SM% and memory BW from nvidia-smi dmon. dmon output columns vary by - # driver; sm is typically column 5 in `-s pucvmet`, mem is column 6. + # GPU SM% (column 5) and memory-controller % (column 6) from nvidia-smi + # dmon -s pucvmet. These are near zero for GPUDirect workloads (GPU is a + # DMA target, not a compute engine). local gpu_sm gpu_mem gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \ "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)" gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \ "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)" - echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_busy_mean,$gpu_sm,$gpu_mem" \ + echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_master_pct,$cpu_tx_pct,$cpu_rx_pct,$gpu_sm,$gpu_mem" \ | tee -a "$CSV" } @@ -281,6 +316,16 @@ case "$MODE" in done done ;; + drop-curve-matrix) + # 2D drop curve: sweep payload × target_gbps at the headline batch. + for p in "${PAYLOADS_SWEEP[@]}"; do + for b in "${BATCHES_HEADLINE[@]}"; do + for g in "${DROP_CURVE_TARGETS[@]}"; do + run_cell cpp "$p" "$b" "$g" + done + done + done + ;; *) echo "Unknown mode: $MODE" >&2; exit 1 ;; esac diff --git a/mkdocs.yml b/mkdocs.yml index 81d2dcf..0fd6b1a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -47,6 +47,7 @@ markdown_extensions: - admonition - attr_list - def_list + - footnotes - md_in_html - tables - toc: diff --git a/scripts/spark_data_fill.sh b/scripts/spark_data_fill.sh new file mode 100755 index 0000000..0e4f81b --- /dev/null +++ b/scripts/spark_data_fill.sh @@ -0,0 +1,143 @@ +#!/usr/bin/env bash +# Drives the PR 1 data-fill bench runs for the DGX Spark performance report. +# +# Runs DPDK GPUDirect, socket-UDP, and socket-TCP through their sweep and +# drop-curve modes via examples/run_spark_bench.sh, with pre-flight checks +# and orphan-hugepage cleanup. RDMA is deferred from PR 1 (single-host +# loopback over the cable needs a netns+two-process refactor; tracked +# separately). +# +# Run inside the project container (privileged, --gpus all, /dev/hugepages +# and /mnt/huge mounted, repo at /workspace). +# +# Usage: +# ./scripts/spark_data_fill.sh # all three backends +# ./scripts/spark_data_fill.sh dpdk # just DPDK +# ./scripts/spark_data_fill.sh socket-udp socket-tcp +# +# Env overrides: +# ETH_DST_ADDR — RX-side MAC. Auto-detected from +# /sys/class/net/enP2p1s0f0np0/address if unset. +# RX_IFACE — RX netdev name (default enP2p1s0f0np0). +# DAQIRI_BUILD_DIR — defaults to ./build. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" +WRAPPER="$REPO_ROOT/examples/run_spark_bench.sh" +BUILD_DIR="${DAQIRI_BUILD_DIR:-$REPO_ROOT/build}" +RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}" + +BACKENDS=("$@") +[[ ${#BACKENDS[@]} -eq 0 ]] && BACKENDS=(dpdk socket-udp socket-tcp) + +# --- pre-flight ------------------------------------------------------------ + +preflight_fail() { echo "PREFLIGHT FAIL: $*" >&2; exit 1; } +note() { echo "[$(date -u +%H:%M:%SZ)] $*"; } + +[[ -x "$WRAPPER" ]] || preflight_fail "wrapper missing or not executable: $WRAPPER" + +for be in "${BACKENDS[@]}"; do + case "$be" in + dpdk) bin="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect" ;; + socket-udp|socket-tcp) bin="$BUILD_DIR/examples/daqiri_bench_socket" ;; + rdma) preflight_fail "RDMA is deferred from PR 1; see follow-up issue" ;; + *) preflight_fail "unknown backend: $be" ;; + esac + [[ -x "$bin" ]] || preflight_fail "missing bench binary: $bin (run cmake --build first)" +done + +# DPDK-only checks. +if [[ " ${BACKENDS[*]} " == *" dpdk "* ]]; then + free_hp="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)" + [[ "${free_hp:-0}" -ge 4 ]] || preflight_fail "HugePages_Free=$free_hp (need >=4); clean /mnt/huge and /dev/hugepages from prior runs" + + if [[ -z "${ETH_DST_ADDR:-}" ]]; then + mac_path="/sys/class/net/$RX_IFACE/address" + [[ -r "$mac_path" ]] || preflight_fail "cannot read $mac_path; set ETH_DST_ADDR explicitly" + ETH_DST_ADDR="$(cat "$mac_path")" + export ETH_DST_ADDR + note "ETH_DST_ADDR auto-detected from $RX_IFACE: $ETH_DST_ADDR" + fi + + carrier="$(cat "/sys/class/net/$RX_IFACE/carrier" 2>/dev/null || echo 0)" + [[ "$carrier" == "1" ]] || preflight_fail "RX iface $RX_IFACE has no carrier (cable unplugged or link down)" +fi + +note "Pre-flight OK. Backends: ${BACKENDS[*]}" +note "Build dir: $BUILD_DIR" +note "Repo root: $REPO_ROOT" + +# --- hugepage cleanup helper ---------------------------------------------- + +# DPDK leaves orphan rtemap_* files when a bench aborts. Clean between runs so +# we don't run out of hugepages mid-sweep. +clean_orphan_hugepages() { + local pre post freed + pre="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)" + : "${pre:=0}" + shopt -s nullglob + # DPDK uses a random per-process file prefix (override with --file-prefix); + # match anything ending in `map_` to catch the common shape without + # nuking unrelated files. Skip any that are still held by a live process. + for f in /dev/hugepages/*map_[0-9]* /mnt/huge/*map_[0-9]*; do + if ! fuser -- "$f" >/dev/null 2>&1; then + rm -f -- "$f" 2>/dev/null || true + fi + done + shopt -u nullglob + post="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)" + : "${post:=0}" + freed=$((post - pre)) + if [[ "$freed" -gt 0 ]]; then + note "Freed $freed orphan hugepages (now ${post} free)" + fi + return 0 +} + +# --- driver loop ----------------------------------------------------------- + +declare -a RESULT_DIRS + +run_backend_mode() { + local backend="$1" mode="$2" + note "=== Running: $backend $mode ===" + clean_orphan_hugepages + + # Stream wrapper output live (per-cell CSV rows appear as they finish) while + # also keeping a log for post-run parsing of the "Results in:" line. + local log="/tmp/spark_data_fill.$backend.$mode.log" + local rc=0 + "$WRAPPER" "$backend" "$mode" 2>&1 | tee "$log" || rc=$? + rc="${PIPESTATUS[0]:-$rc}" + + if [[ "$rc" -eq 0 ]]; then + local result_dir + result_dir="$(awk '/^Results in:/ { print $3 }' "$log" | tail -n1)" + [[ -n "$result_dir" ]] && RESULT_DIRS+=("$backend/$mode -> $result_dir") + note "$backend $mode complete" + else + note "$backend $mode FAILED (exit $rc); continuing" + tail -n 40 "$log" >&2 + fi + clean_orphan_hugepages +} + +for be in "${BACKENDS[@]}"; do + run_backend_mode "$be" sweep + run_backend_mode "$be" drop-curve +done + +# --- summary --------------------------------------------------------------- + +echo +echo "==========================================" +echo "Data-fill complete. Result directories:" +echo "==========================================" +for r in "${RESULT_DIRS[@]}"; do + echo " $r" +done +echo +echo "Next: aggregate CSVs and fill docs/performance-dgx-spark.md." From c5fffd5cfdf3ebf3382db962dd32bc0c76526099 Mon Sep 17 00:00:00 2001 From: rgurunathan Date: Wed, 13 May 2026 09:43:53 -0400 Subject: [PATCH 3/4] #15 - Ignore pcie_schematic.png generated by tune_system.py `python/tune_system.py` writes a PCIe topology schematic to `pcie_schematic.png` in the working directory by default (introduced in #61). Anyone who runs the tuner from the repo root then has it sitting untracked in `git status` forever. Ignore the default path so it stops showing up in working-tree status. Custom output paths passed via `--output` are unaffected. Co-Authored-By: Claude Opus 4.7 Signed-off-by: rgurunathan --- .gitignore | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.gitignore b/.gitignore index dc4cf82..472185d 100644 --- a/.gitignore +++ b/.gitignore @@ -2,5 +2,8 @@ build*/ site/ bench-results/ +# tune_system.py default output +pcie_schematic.png + # macOS .DS_Store From 14b6f0dd003bcd8f176cfd43b9a418f3ba21ab60 Mon Sep 17 00:00:00 2001 From: rgurunathan Date: Wed, 13 May 2026 10:33:41 -0400 Subject: [PATCH 4/4] #15 - Add payload x target_gbps matrix and restructure perf doc Adds the payload x target_gbps heatmap to the DPDK GPUDirect results (5 payloads x 8 target rates, batch=10240) from the 2026-05-13 drop-curve-matrix run. Coloring is relative to the effective target min(target_gbps, 96 Gbps link cap) so the heatmap reads as "does the configuration sustain the requested rate?" rather than "what's the absolute peak?". The 8000 B / 4096 B rows are all green; 1024 B is green-except-unpaced (where the master core peaks at 93%); 256 B and 64 B turn red once target crosses the PPS ceiling. Restructures the report around four top-level sections to keep backend results scannable as more platforms and workloads land: - Summary (headline tables; unchanged content, just stays at top) - Introduction (System under test + Methodology, demoted to H3) - C++ Results (DPDK / RoCE / Socket / Workload variants, demoted to H3; their per-cell subsections demoted to H4; workload variants renamed to "DPDK GPUDirect - FFT/GEMM" and "RoCE - FFT/GEMM" so anchors don't collide with the C++ Results backend headings) - Python Results (renamed from "Python results"; unchanged otherwise) - Reproduce these results (renamed from "Reproducibility appendix"; same content) - TODO: Not Yet Implemented / Known Limitations (moved from below Summary to the bottom; relabeled to make clear these are pending items, not platform constraints) Adds a compact in-page Contents list under the intro paragraphs linking to each top-level section. The Material sidebar TOC still works for fine-grained navigation; the inline list is for orienting the reader on first arrival. Plain-text references to "Known limitations" updated to "TODO / Known Limitations" with anchor links where appropriate (RoCE / Socket deferred-results subsections, one-shot driver note). Also tunes the .perf-matrix CSS so the new 9-column matrix fits within the content area without text overflow: width 100%, table-layout: auto, white-space: nowrap on cells, slightly tighter padding. The narrower 5-column payload x batch matrix continues to render fine under the same rule. Co-Authored-By: Claude Opus 4.7 Signed-off-by: rgurunathan --- docs/performance-dgx-spark.md | 216 ++++++++++++++++++++++++++-------- docs/stylesheets/extra.css | 7 +- 2 files changed, 168 insertions(+), 55 deletions(-) diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md index 28930c7..9854243 100644 --- a/docs/performance-dgx-spark.md +++ b/docs/performance-dgx-spark.md @@ -11,6 +11,15 @@ against the YAML configs in `examples/`, with the system state captured by [`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh) alongside each result set. +**Contents** + +- [Summary](#summary) +- [Introduction](#introduction) — system under test, methodology +- [C++ Results](#c-results) — DPDK, RoCE, Socket, workload variants +- [Python Results](#python-results) +- [Reproduce these results](#reproduce-these-results) +- [TODO: Not Yet Implemented / Known Limitations](#todo-not-yet-implemented-known-limitations) + ## Summary Two headline tables. **Native-shape peak** reports each backend at its @@ -47,33 +56,9 @@ that table since it has no operation boundary. The matched 8 KB table makes the cross-backend comparison honest; the native-shape table shows each backend's design ceiling. -## Known limitations on this platform - -- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses - `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory - for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable — - so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no - longer changes the memory path; it only changes the segment partition, - which makes "HDS vs. plain GPUDirect" a non-distinction on this platform. - HDS is characterized when this report extends to IGX and x86-server - platforms where device memory works. See - [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking. -- **RoCE single-host loopback is deferred from this report.** See footnote [^1] - on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in - `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0) - and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the - RC connection through `lo` and the QSFP cable carries no traffic. A - follow-up will land the two-netns + two-process orchestration and re-fill - the RoCE rows. -- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3]. - Both are bench-side bugs uncovered during PR 1 verification on Spark: - UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with - a glibc heap-corruption assertion on init. Follow-up issues track each. -- **p99/p999 latency is not in v1.** The bench output captures throughput, - drops, and resource utilization. Per-burst RX timestamping and percentile - aggregation are deferred to a follow-up issue. +## Introduction -## System under test +### System under test The reproducibility appendix has the full capture. Key fields: @@ -89,9 +74,9 @@ The reproducibility appendix has the full capture. Key fields: | DPDK | patched per `dpdk_patches/` (container build) | | DAQIRI commit | _captured in `environment.txt`_ | -## Methodology +### Methodology -### Bench commands +#### Bench commands Each backend has a dedicated bench executable in `examples/`. The DPDK numbers in this report come from the first command; the RoCE and Socket @@ -104,12 +89,12 @@ future fill of those rows. examples/daqiri_bench_raw_tx_rx_spark.yaml \ --seconds 30 [--target-gbps G] -# RoCE — same NIC, two ports (deferred; see Known limitations) +# RoCE — same NIC, two ports (deferred; see TODO / Known Limitations) ./build/examples/daqiri_bench_rdma \ examples/daqiri_bench_rdma_tx_rx_spark.yaml \ --seconds 30 --mode both [--target-gbps G] -# Socket UDP / TCP — localhost (deferred; see Known limitations) +# Socket UDP / TCP — localhost (deferred; see TODO / Known Limitations) ./build/examples/daqiri_bench_socket \ examples/daqiri_bench_socket_udp_tx_rx.yaml \ --seconds 30 --mode both [--target-gbps G] @@ -121,7 +106,7 @@ The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC: ETH_DST_ADDR="$(cat /sys/class/net//address)" ``` -### Per-backend sweep dimensions +#### Per-backend sweep dimensions **Payload** is the user-data bytes in one packet (DPDK / Socket UDP), one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI @@ -142,14 +127,14 @@ backend has its own sweep: | Socket UDP | payload_size ∈ {64, 256, 1024, 1472} B (MTU-bound) | batch_size ∈ {1, 32, 256} | (1472, 256) | (1472, 256) — closest to 8 K under MTU cap | | Socket TCP | message_size ∈ {1 K, 64 K, 1 M} B | n/a (single stream) | (64 K) | n/a | -### "No-drop" threshold +#### "No-drop" threshold A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run. The headline tables report the highest target rate at which a run was still drop-free under this threshold. The methodology does not use a percentile cap — either there are drops or there are not. -### Drop-curve sweep +#### Drop-curve sweep The drop curve sweeps `--target-gbps` while holding the native-shape cell constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`) @@ -158,7 +143,7 @@ due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK `accurate_send`) is unused but would tighten DPDK-only precision; deferred to a follow-up. -### Drop sources per backend +#### Drop sources per backend - **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO` output (the bench's stderr). @@ -169,7 +154,7 @@ a follow-up. `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is the closest proxy. -### External captures per run +#### External captures per run Each run records, alongside the bench: @@ -188,13 +173,15 @@ Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, hugepages, GPU state, DAQIRI commit) is captured once per result set by `bench_capture_environment.sh`. -## Results — DPDK GPUDirect +## C++ Results + +### DPDK GPUDirect Native shape on Spark is 8 KB payload, batch 10240 — the configuration the DPDK backend was designed around. All cells below ran for 30 s with `drops == 0`. -### Drop curve at native shape +#### Drop curve at native shape Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The token-bucket pacer tracks target within ±0.02 Gbps until the link saturates @@ -213,7 +200,104 @@ rate (visible in the CPU table below). | 100 | 96.370 | 1,493,823 | 0 | 91.6 | 91.6 | | unpaced | 95.897 | 1,486,498 | 0 | 91.6 | 90.5 | -### Payload × batch matrix +#### Payload × target_gbps matrix + +Holds batch=10240 constant; sweeps payload and `--target-gbps` +together. Each cell shows the achieved Gbps and drops over a 30 s +run. Coloring is relative to the **effective target** +`min(target_gbps, 96 Gbps link cap)`; the "unpaced" column uses the +link cap as its effective target: + +
+ green — no drops, achieved ≥ 95% of effective target + yellow — no drops, achieved ≥ 70% of effective target + red — drops, or achieved < 70% of effective target +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
payloadtarget Gbps
1510255075100unpaced
8000 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops25.0 Gbps0 drops50.0 Gbps0 drops75.0 Gbps0 drops95.9 Gbps0 drops96.4 Gbps0 drops
4096 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops25.0 Gbps0 drops50.0 Gbps0 drops75.0 Gbps0 drops100.0 Gbps0 drops102.7 Gbps0 drops
1024 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops24.9 Gbps0 drops49.8 Gbps0 drops74.2 Gbps0 drops94.0 Gbps0 drops68.2 Gbps0 drops
256 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops24.7 Gbps0 drops21.1 Gbps0 drops21.2 Gbps0 drops21.2 Gbps0 drops21.6 Gbps0 drops
64 B1.0 Gbps0 drops5.0 Gbps0 drops9.9 Gbps0 drops8.9 Gbps0 drops8.9 Gbps0 drops13.7 Gbps0 drops8.9 Gbps0 drops8.7 Gbps0 drops
+ +**Reading the matrix.** The token-bucket pacer tracks target within +sub-Gbps at payloads ≥ 1 KB right up to the link cap. The PPS +ceiling for each payload (~21 Gbps at 256 B, ~9 Gbps at 64 B) shows +up as a horizontal saturation band: once `target_gbps` crosses that +ceiling, increasing it further produces no additional throughput. +The 64 B / target=75 cell (13.7 Gbps) is a pacer transient when the +requested rate is well above achievable — the next cell at +target=100 falls back to the PPS ceiling. The 1024 B / unpaced cell +(68.2 Gbps, yellow) sits below its own paced cells; 1024 B is the +size where the master loop transitions from idle to fully busy +(`cpu_master_pct` jumps from ~3% to ~93% in the corresponding +unpaced run), and per-cell numbers in this regime carry roughly +±5 Gbps of run-to-run variance. + +#### Payload × batch matrix Each cell shows the achieved Gbps and drops over a 30 s unpaced run. Coloring is relative to the global max across the matrix (here @@ -284,7 +368,7 @@ link ceiling regardless of batch. Run-to-run variance on the unpaced cells is ~0.5 Gbps; row-internal Gbps differences smaller than that should be treated as noise. -### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced) +#### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced) | Resource | Value | Note | | -------------------- | ----- | ------------------------------------------ | @@ -302,18 +386,20 @@ payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts flow through the orchestration path. That asymmetry is data, not a bug, and is captured in the per-cell artifacts under `bench-results/`. -## Results — RoCE +### RoCE **Deferred from this report.** See [headline-table footnote 1](#fn:1) and -the Known limitations section. Single-host RoCE loopback on Spark requires a -two-netns + two-process orchestration that the wrapper does not yet -implement. The RoCE rows will be filled when the follow-up issue lands. +the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations) +section. Single-host RoCE loopback on Spark requires a two-netns + +two-process orchestration that the wrapper does not yet implement. The RoCE +rows will be filled when the follow-up issue lands. -## Results — Socket +### Socket **Deferred from this report.** See [headline-table footnotes 2 and 3](#fn:2) -and the Known limitations section. Both backends produced unusable data on -Spark during PR 1 verification: +and the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations) +section. Both backends produced unusable data on Spark during PR 1 +verification: - **Socket UDP** in `--mode both` deadlocks on peer learning — both ends try to transmit before either has received, only ~1000 packets per 30 s get @@ -328,7 +414,7 @@ Spark during PR 1 verification: Both are tracked as separate follow-up issues; the Socket rows here will be filled once the underlying bench bugs are fixed. -## Workload variants (FFT, GEMM) +### Workload variants (FFT, GEMM) The post-process layer ([PR 2](https://github.com/NVIDIA/daqiri/issues/15)) adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` / @@ -340,18 +426,18 @@ throughput delta and GPU utilization. - FFT: 1D complex-to-complex, length 1024. - GEMM: fp32 square, N = 44 (largest tile that fits in an 8 KB payload). -### DPDK GPUDirect +#### DPDK GPUDirect — FFT/GEMM _TBD (PR 2)._ -### RoCE +#### RoCE — FFT/GEMM _TBD (PR 2)._ Note the unit-of-work mismatch when comparing across backends: RoCE applies the post-process kernel once per ~8 MB SEND; DPDK applies it once per packet. The throughput numbers are comparable; "operations per burst" is not. -## Python results +## Python Results The Python benches ([PR 3](https://github.com/NVIDIA/daqiri/issues/16)) mirror the C++ benches' CLI and stdout format, using the existing pybind11 bindings. @@ -364,7 +450,7 @@ _TBD (PR 3)._ _TBD (PR 4)._ -## Reproducibility appendix +## Reproduce these results ### Container @@ -417,7 +503,7 @@ per-backend result directories: The script defaults to `dpdk socket-udp socket-tcp` if invoked with no arguments; on Spark, the socket backends will fail their own pre-flight once the follow-up issues land. RDMA is currently rejected by the -pre-flight (see Known limitations). +pre-flight (see [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)). ### Per-backend wrapper invocations @@ -452,3 +538,29 @@ System tuning is required before the numbers in this report are reproducible. See [`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md) for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity. + +## TODO: Not Yet Implemented / Known Limitations + +- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses + `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory + for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable — + so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no + longer changes the memory path; it only changes the segment partition, + which makes "HDS vs. plain GPUDirect" a non-distinction on this platform. + HDS is characterized when this report extends to IGX and x86-server + platforms where device memory works. See + [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking. +- **RoCE single-host loopback is deferred from this report.** See footnote [^1] + on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in + `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0) + and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the + RC connection through `lo` and the QSFP cable carries no traffic. A + follow-up will land the two-netns + two-process orchestration and re-fill + the RoCE rows. +- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3]. + Both are bench-side bugs uncovered during PR 1 verification on Spark: + UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with + a glibc heap-corruption assertion on init. Follow-up issues track each. +- **p99/p999 latency is not in v1.** The bench output captures throughput, + drops, and resource utilization. Per-burst RX timestamping and percentile + aggregation are deferred to a follow-up issue. diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index c3e156a..95af265 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -58,8 +58,8 @@ yellow = no drops AND Gbps ≥ 70% of max red = any drops OR Gbps < 70% of max */ .md-typeset table.perf-matrix { - width: 85%; - table-layout: fixed; + width: 100%; + table-layout: auto; border-collapse: separate; border-spacing: 5px; font-size: 0.64rem; @@ -69,8 +69,9 @@ text-align: center; vertical-align: middle; font-variant-numeric: tabular-nums; - padding: 0.55em 0.6em; + padding: 0.55em 0.5em; border-radius: 4px; + white-space: nowrap; } .md-typeset table.perf-matrix td small { display: block;