From 6bc9a6d2e0840dcd5879fd4703e235ddfd3cfdf1 Mon Sep 17 00:00:00 2001
From: rgurunathan <rgurunathan@nvidia.com>
Date: Mon, 11 May 2026 17:10:33 -0400
Subject: [PATCH 1/4] #15 - Add C++ bench infrastructure for DGX Spark
 performance report

Lays the groundwork for the DGX Spark performance report (issue #15). Code
and doc-skeleton only; per-cell numbers fill in via a follow-on commit on
the same PR.

Bench changes (examples/ only, no src/ touched):
- raw_bench_common: add TokenBucketPacer (--target-gbps software pacer) and
  parse_target_gbps helper; emit seconds= in the shared rx_count_worker
  print; build the print in a stringstream so concurrent RX/TX worker
  output doesn't interleave on stdout.
- raw_gpudirect_bench: wire pacer into the TX worker; track packet/byte
  counts; print TX complete: with seconds=.
- rdma_bench: pacer for SEND path only; split send/recv completion counts
  and bytes per role; seconds= in server/client complete lines.
- socket_bench: pacer; make iteration cap opt-in (iterations <= 0 means
  time-bounded by --seconds); sent_bytes/recv_bytes tracking.
- CMakeLists: link raw_bench_common.cpp + CUDA::cudart into rdma/socket
  bench targets now that they consume the pacer helper.

Drop accounting (without modifying managers, per zero-src/ constraint):
- DPDK: parsed from DAQIRI_LOG_INFO output of PrintDpdkStats by wrapper.
- RDMA: parsed from "CQ error" lines in DAQIRI_LOG_ERROR by wrapper.
  Bench-side CQE counting isn't possible -- the manager filters error
  completions before they reach get_rx_burst.
- Socket UDP: kernel drops via /proc/net/udp diff by wrapper.
- Socket TCP: nstat retrans/inerrs diff by wrapper (TCP has no clean
  "drops" semantic).

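Illustration only (not code from this patch): a minimal C++ sketch of the
Socket UDP accounting above, assuming the wrapper sums the last column
("drops") of every socket row in /proc/net/udp before and after the run.
sum_udp_drops and udp_drops_during_run are hypothetical names; the real
wrapper is a shell script and may filter rows differently.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Sum the last whitespace-separated column of every row after the header.
// In /proc/net/udp that column is the per-socket drop counter.
inline std::uint64_t sum_udp_drops(const std::string &proc_net_udp) {
  std::istringstream in(proc_net_udp);
  std::string row;
  std::uint64_t total = 0;
  std::getline(in, row);  // skip the column-header line
  while (std::getline(in, row)) {
    std::istringstream cols(row);
    std::string tok, last;
    while (cols >> tok) {
      last = tok;
    }
    try {
      total += std::stoull(last);
    } catch (const std::exception &) {
      // Blank or malformed row: ignore it.
    }
  }
  return total;
}

// Drops accumulated between two snapshots. Clamped at zero because a socket
// that closes mid-run takes its counter with it.
inline std::uint64_t udp_drops_during_run(const std::string &before,
                                          const std::string &after) {
  const std::uint64_t b = sum_udp_drops(before);
  const std::uint64_t a = sum_udp_drops(after);
  return a >= b ? a - b : 0;
}
```

Both functions take the file contents as strings, so the sketch can be
exercised without reading /proc.
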
New tooling:
- examples/bench_capture_environment.sh: snapshots uname, kernel cmdline,
  CPU/NUMA, hugepages, NIC/PCIe state, OFED, GPU, governor, isolcpus,
  IRQ affinity, git rev once per result set.
- examples/run_spark_bench.sh: sweep wrapper (smoke|sweep|drop-curve
  modes) per backend; runs bench under mpstat + nvidia-smi dmon;
  computes pps/Gbps in one place from packets=/bytes=/seconds= stdout
  fields; writes one CSV row per cell.
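
Illustration only (not code from this patch): a minimal C++ sketch of the
rate arithmetic implied by the packets=/bytes=/seconds= fields, assuming
Gbps = bytes * 8 / seconds / 1e9 and pps = packets / seconds. CellRates,
read_field and rates_from_line are hypothetical names; the wrapper itself
does this in shell.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Derived rates for one bench cell (hypothetical type, not in the patch).
struct CellRates {
  double gbps;  // bytes * 8 / seconds / 1e9
  double pps;   // packets / seconds
};

// Find the whitespace-delimited token "key=<number>" and parse the number.
// Returns false when the key is absent or the value does not start with a
// number.
inline bool read_field(const std::string &line, const std::string &key,
                       double &out) {
  assert(!key.empty());
  const std::string prefix = key + "=";
  std::istringstream in(line);
  std::string tok;
  while (in >> tok) {
    if (tok.compare(0, prefix.size(), prefix) != 0) {
      continue;
    }
    try {
      out = std::stod(tok.substr(prefix.size()));
      return true;
    } catch (const std::exception &) {
      return false;
    }
  }
  return false;
}

// Compute Gbps and pps from one "RX complete:" / "TX complete:" line.
// Returns false if any field is missing or seconds is not positive.
inline bool rates_from_line(const std::string &line, CellRates &out) {
  double packets = 0.0, bytes = 0.0, seconds = 0.0;
  if (!read_field(line, "packets", packets) ||
      !read_field(line, "bytes", bytes) ||
      !read_field(line, "seconds", seconds) || seconds <= 0.0) {
    return false;
  }
  out.gbps = bytes * 8.0 / seconds / 1e9;
  out.pps = packets / seconds;
  return true;
}
```

For the line "RX complete: packets=1000 bytes=8000000 bursts=4 seconds=2"
this yields 500 pps and 0.032 Gbps.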

Documentation:
- docs/performance-dgx-spark.md: complete skeleton with TBD cells.
  Two headline tables (native-shape peak + matched 8 KB op size) so
  cross-backend comparison is honest. Per-backend sweep dimensions
  table addresses the DPDK packets / RoCE messages / TCP stream
  semantic mismatch. Documents that HDS is deferred on GB10 (host_pinned
  collapses the HDS vs. plain GPUDirect distinction).
- mkdocs.yml, docs/index.html, README.md, AGENTS.md: wire the new doc
  into nav, landing-page News card, README Documentation table,
  AGENTS Documentation section + drift-hotspot list.
- .gitignore: bench-results/ (wrapper output).

Verified on a DGX Spark inside the project container: DPDK loopback
smoke test produces correctly-formatted TX/RX complete: lines at ~84
Gbps cross-chip; pacer holds 1.055 Gbps when --target-gbps 1 is set
(+5.5%, in line with the ~+/-5% software pacer accuracy in the report);
socket UDP smoke produces the new sent_packets/recv_packets/seconds
output format; wrapper produces a clean CSV row with non-zero gbps and
drops=0. RDMA single-host smoke is deferred -- the kernel routes local
IPs through lo (not the RoCE NICs), so testing on one host requires a
network namespace; that setup lands with the data-fill commit.

Planned follow-up PRs that build on this one:
- This PR (#15), second commit: run the full sweep via run_spark_bench.sh
  and populate the C++ loopback cells in docs/performance-dgx-spark.md.
  Lands before this PR is marked ready-for-review.
- PR for #15 workloads: introduce examples/bench_post_process.{h,cu}
  (cuFFT + cuBLAS post-process layer, examples-only -- no library
  dependency change), refactor rx_count_worker to defer burst free until
  the post-process CUDA event signals, and fill FFT/GEMM cells in the
  performance doc.
- PR for #16 (Python loopback): add daqiri_bench_rdma.py and
  daqiri_bench_socket.py mirroring the C++ benches' --target-gbps and
  stdout format; reuses the existing pybind11 surface, no new bindings.
  Fills Python loopback cells.
- PR for #16 workloads: pybind11 wrappers for the PR 2 post-process layer
  in python/bench_post_process_pybind.cpp; fills Python FFT/GEMM cells
  and completes the Spark v1 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
---
 .gitignore                            |   1 +
 AGENTS.md                             |   4 +-
 README.md                             |   1 +
 docs/index.html                       |   6 +
 docs/performance-dgx-spark.md         | 312 ++++++++++++++++++++++++++
 examples/CMakeLists.txt               |   6 +-
 examples/bench_capture_environment.sh | 116 ++++++++++
 examples/raw_bench_common.cpp         |  55 ++++-
 examples/raw_bench_common.h           |  30 +++
 examples/raw_gpudirect_bench.cpp      |  33 ++-
 examples/rdma_bench.cpp               |  42 +++-
 examples/run_spark_bench.sh           | 289 ++++++++++++++++++++++++
 examples/socket_bench.cpp             |  49 +++-
 mkdocs.yml                            |   2 +
 14 files changed, 921 insertions(+), 25 deletions(-)
 create mode 100644 docs/performance-dgx-spark.md
 create mode 100755 examples/bench_capture_environment.sh
 create mode 100755 examples/run_spark_bench.sh

diff --git a/.gitignore b/.gitignore
index e221700..dc4cf82 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,6 @@
 build*/
 site/
+bench-results/
 
 # macOS
 .DS_Store
diff --git a/AGENTS.md b/AGENTS.md
index c45ef72..79d9568 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -93,14 +93,16 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf
 - `docs/index.html` — custom HTML landing page (not generated by MkDocs, hand-maintained)
 - `docs/daqiri-api.html` — standalone HTML API reference (hand-maintained)
 - `docs/api-guide.md`, `docs/getting-started.md`, `docs/configuration.md` — core markdown docs
+- `docs/performance-dgx-spark.md` — per-platform performance report (DGX Spark; more platforms to follow)
 - `docs/tutorials/` — tutorial walkthroughs (background, system config, benchmarking, config files)
 - `docs/stylesheets/extra.css` — custom theme overrides
 
 **Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots:
 - **Backend list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/configuration.md`
 - **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section
-- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage)
+- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage), and per-platform performance docs (`docs/performance-*.md`)
 - **Public API** (`src/common.h`, `src/types.h`, `src/manager.h`) — `docs/api-guide.md`, `docs/daqiri-api.html`
+- **Bench CLI flags or output format** (`examples/raw_bench_common.{h,cpp}`, `*_bench.cpp`) — per-platform performance docs' Methodology section, `examples/run_spark_bench.sh` parsing logic
 - **Doc reorganization** (any rename in `docs/`) — `docs/index.html` landing page, `mkdocs.yml` nav, README Documentation table
 
 The full mapping with rationale lives in the docs-sync agent rule. Internal-link, anchor, and nav drift is enforced by CI (`.github/workflows/docs.yml`); content drift (stale binary names, defaults) is still a manual check at commit time.
diff --git a/README.md b/README.md
index 50112c6..30c17e1 100644
--- a/README.md
+++ b/README.md
@@ -81,6 +81,7 @@ Reference material for the DAQIRI codebase:
 - [Getting Started](docs/getting-started.md) — System requirements, build/install instructions, and CMake options
 - [Configuration Reference](docs/configuration.md) — Full YAML config reference for all backends
 - [API Guide](docs/api-guide.md) — BurstParams, RX/TX workflows, buffer lifecycle, status codes
+- [Performance: DGX Spark](docs/performance-dgx-spark.md) — Per-platform throughput, drop, and utilization numbers for all backends
 - [Contributing](CONTRIBUTING.md) — Contribution guidelines, coding standards, DCO sign-off
 
 ## Tutorials
diff --git a/docs/index.html b/docs/index.html
index 61a9eb2..53f2ffe 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -595,6 +595,12 @@ <h2 class="section-title">News</h2>
         </div>
       </div>
       <div class="pub-grid">
+        <div class="pub-card">
+          <div class="pub-venue"><span class="pub-badge">Performance</span><span class="pub-year">2026</span></div>
+          <div class="pub-title">DAQIRI Performance on DGX Spark</div>
+          <div class="pub-authors">NVIDIA — Throughput, drops, and resource utilization for DPDK GPUDirect, RoCE, and socket backends measured on a DGX Spark (GB10) workstation. First in a series of per-platform performance reports.</div>
+          <div class="pub-links"><a href="performance-dgx-spark/" class="pub-link">Read report →</a></div>
+        </div>
         <div class="pub-card">
           <div class="pub-venue"><span class="pub-badge">GitHub</span><span class="pub-year">2025</span></div>
           <div class="pub-title">DAQIRI Open-Sourced on GitHub</div>
diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md
new file mode 100644
index 0000000..1e74cd6
--- /dev/null
+++ b/docs/performance-dgx-spark.md
@@ -0,0 +1,312 @@
+# Performance: DGX Spark
+
+This page reports DAQIRI throughput, drop, and resource-utilization numbers
+measured on a DGX Spark (GB10) workstation. It is the first in a series of
+per-platform performance reports; the same section layout will be reused for
+IGX, x86-server, and other targets.
+
+The numbers below are reproducible — every cell is generated by
+[`examples/run_spark_bench.sh`](https://github.com/nvidia/daqiri/blob/main/examples/run_spark_bench.sh)
+against the YAML configs in `examples/`, with the system state captured by
+[`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh)
+alongside each result set.
+
+## Summary
+
+Two headline tables. **Native-shape peak** reports each backend at its
+best-case operation size (the configuration the backend was designed for).
+**Matched 8 KB** drives DPDK and RoCE at a common ~8 KB unit of work (Socket
+UDP stays at its 1472 B MTU cap) so the cross-backend comparison is closer to
+apples-to-apples; TCP is omitted since it has no operation boundary.
+
+### Native-shape peak — max no-drop throughput (Gbps)
+
+| Backend / Stack       | C++ loopback | C++ + FFT      | C++ + GEMM     | Python loopback | Python + FFT   | Python + GEMM  |
+| --------------------- | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- |
+| DPDK GPUDirect (8 KB) | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| RoCE (8 MB SEND)      | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| Socket UDP (MTU)      | _TBD (PR 1)_ | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
+| Socket TCP (stream)   | _TBD (PR 1)_ | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
+
+### Matched 8 KB operation — cross-backend Gbps
+
+| Backend / Stack          | C++ loopback | C++ + FFT      | C++ + GEMM     | Python loopback | Python + FFT   | Python + GEMM  |
+| ------------------------ | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- |
+| DPDK GPUDirect           | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| RoCE                     | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| Socket UDP (1472 B, MTU) | _TBD (PR 1)_ | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
+
+!!! note "Why two tables"
+    A single Gbps number isn't enough to compare backends fairly. A DPDK
+    backend at peak with 8 KB packets is doing very different work than a RoCE
+    backend at peak with 8 MB messages, even when both report the same Gbps.
+    The matched 8 KB table makes the cross-backend comparison honest; the
+    native-shape table shows each backend's design ceiling.
+
+## Known limitations on this platform
+
+- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses
+  `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory
+  for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable —
+  so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no
+  longer changes the memory path; it only changes the segment partition,
+  which makes "HDS vs. plain GPUDirect" a non-distinction on this platform.
+  HDS will be characterized when this report extends to IGX and x86-server
+  platforms where device memory works. See
+  [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking.
+- **p99/p999 latency is not in v1.** The bench output captures throughput,
+  drops, and resource utilization. Per-burst RX timestamping and percentile
+  aggregation are deferred to a follow-up issue.
+
+## System under test
+
+The reproducibility appendix has the full capture. Key fields:
+
+| Field             | Value                                |
+| ----------------- | ------------------------------------ |
+| Platform          | NVIDIA DGX Spark (GB10)              |
+| GPU               | Blackwell (compute capability 12.1)  |
+| NIC               | ConnectX-7 (two ports, tied / QSFP loopback) |
+| Topology          | `0000:01:00.0` (mlx5_0) → `0002:01:00.0` (mlx5_2), physical loopback cable |
+| OS                | _captured in `environment.txt`_      |
+| Kernel            | _captured in `environment.txt`_      |
+| CUDA driver       | _captured in `environment.txt`_      |
+| DPDK              | patched per `dpdk_patches/` (container build) |
+| DAQIRI commit     | _captured in `environment.txt`_      |
+
+## Methodology
+
+### Bench commands
+
+Each backend has a dedicated bench executable in `examples/`. PR 1 numbers
+come from these:
+
+```bash
+# DPDK GPUDirect — physical loopback
+./build/examples/daqiri_bench_raw_gpudirect \
+    examples/daqiri_bench_raw_tx_rx_spark.yaml \
+    --seconds 30 [--target-gbps G]
+
+# RoCE — same NIC, two ports
+./build/examples/daqiri_bench_rdma \
+    examples/daqiri_bench_rdma_tx_rx_spark.yaml \
+    --seconds 30 --mode both [--target-gbps G]
+
+# Socket UDP / TCP — localhost
+./build/examples/daqiri_bench_socket \
+    examples/daqiri_bench_socket_udp_tx_rx.yaml \
+    --seconds 30 --mode both [--target-gbps G]
+```
+
+The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC:
+
+```bash
+ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"
+```
+
+### Per-backend sweep dimensions
+
+The "payload × batch" sweep doesn't map uniformly across backends. Each
+backend has its own sweep:
+
+| Backend            | Sweep dim 1                                            | Sweep dim 2                            | Native-shape cell    | Matched-size cell      |
+| ------------------ | ------------------------------------------------------ | -------------------------------------- | -------------------- | ---------------------- |
+| DPDK GPUDirect     | payload_size ∈ {64, 256, 1024, 4096, 8000} B           | batch_size ∈ {256, 1024, 4096, 10240}  | (8000, 10240)        | (8000, 10240) — same   |
+| RoCE               | message_size ∈ {4 K, 64 K, 1 M, 8 M} B                 | batch_size ∈ {1, 4, 16}                | (8 M, 1)             | (8 K, 16)              |
+| Socket UDP         | payload_size ∈ {64, 256, 1024, 1472} B (MTU-bound)     | batch_size ∈ {1, 32, 256}              | (1472, 256)          | (1472, 256) — closest to 8 K under MTU cap |
+| Socket TCP         | message_size ∈ {1 K, 64 K, 1 M} B                      | n/a (single stream)                    | (64 K)               | n/a                    |
+
+### "No-drop" threshold
+
+A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run.
+The headline tables report the highest target rate at which a run was still
+drop-free under this threshold. The methodology does not use a percentile cap
+— either there are drops or there are not.
+
+### Drop-curve sweep
+
+The drop curve sweeps `--target-gbps` while holding the native-shape cell
+constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`)
+adds a software-paced sleep after each burst; accuracy is ~±5 % at high rates
+due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK
+`accurate_send`) is unused but would tighten DPDK-only precision; deferred to
+a follow-up.
+
+### Drop sources per backend
+
+- **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO`
+  output (the bench's stderr).
+- **RoCE** — count of `CQ error` lines in `DAQIRI_LOG_ERROR` output (RDMA
+  manager filters error completions but logs each one).
+- **Socket UDP** — diff of the `drops` column in `/proc/net/udp` over the run.
+- **Socket TCP** — `nstat -a` diff of `TcpExtTCPLostRetransmit` /
+  `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is
+  the closest proxy.
+
+### External captures per run
+
+Each run records, in parallel with the bench:
+
+- `mpstat -P ALL 1 <N>` — per-core CPU busy%.
+- `nvidia-smi dmon -s pucvmet -c <N>` — GPU SM%, mem%, DRAM bandwidth.
+
+Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, hugepages,
+GPU state, DAQIRI commit) is captured once per result set by
+`bench_capture_environment.sh`.
+
+## Results — DPDK GPUDirect
+
+_PR 1: this section is populated after the first sweep run._
+
+### Drop curve at native shape
+
+| target_gbps | achieved Gbps | RX pps  | drops |
+| ----------- | ------------- | ------- | ----- |
+| _TBD (PR 1)_ |               |         |       |
+
+### Payload × batch sweep
+
+| payload | batch | Gbps | pps | drops |
+| ------- | ----- | ---- | --- | ----- |
+| _TBD (PR 1)_ |    |      |     |       |
+
+### CPU and GPU utilization (headline cell)
+
+| Resource        | Value           |
+| --------------- | --------------- |
+| Master core %   | _TBD_           |
+| TX core %       | _TBD_           |
+| RX core %       | _TBD_           |
+| GPU SM %        | _TBD (near 0; GPU is DMA target, not compute)_ |
+| GPU mem BW %    | _TBD_           |
+
+## Results — RoCE
+
+_PR 1: this section is populated after the first sweep run._
+
+### Drop curve at native shape
+
+| target_gbps | achieved Gbps | completions/s | drops (CQ errors) |
+| ----------- | ------------- | ------------- | ----------------- |
+| _TBD (PR 1)_ |               |               |                   |
+
+### Message-size × batch sweep
+
+| message_size | batch | Gbps | completions/s | drops |
+| ------------ | ----- | ---- | ------------- | ----- |
+| _TBD (PR 1)_ |       |      |               |       |
+
+### CPU and GPU utilization (headline cell)
+
+| Resource        | Value |
+| --------------- | ----- |
+| Server core %   | _TBD_ |
+| Client core %   | _TBD_ |
+| GPU SM %        | _TBD_ |
+| GPU mem BW %    | _TBD_ |
+
+## Results — Socket
+
+_PR 1: this section is populated after the first sweep run. GPU rows N/A._
+
+### UDP — payload × batch sweep
+
+| payload | batch | Gbps | pps | drops |
+| ------- | ----- | ---- | --- | ----- |
+| _TBD (PR 1)_ |    |      |     |       |
+
+### TCP — message-size sweep
+
+| message_size | Gbps | retrans/inerrs |
+| ------------ | ---- | -------------- |
+| _TBD (PR 1)_ |      |                |
+
+## Workload variants (FFT, GEMM)
+
+The post-process layer ([PR 2](https://github.com/NVIDIA/daqiri/issues/15))
+adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` /
+`cuBLAS` on the received GPU-resident payload, and reports the resulting
+throughput delta and GPU utilization.
+
+**Sizes (representative):**
+
+- FFT: 1D complex-to-complex, length 1024.
+- GEMM: fp32 square, N = 44 (largest N with N × N × 4 B ≤ the 8000 B payload: 7744 B).
+
+### DPDK GPUDirect
+
+_TBD (PR 2)._
+
+### RoCE
+
+_TBD (PR 2)._ Note the unit-of-work mismatch when comparing across backends:
+RoCE applies the post-process kernel once per ~8 MB SEND; DPDK applies it
+once per packet. The throughput numbers are comparable; "operations per
+burst" is not.
+
+## Python results
+
+The Python benches ([PR 3](https://github.com/NVIDIA/daqiri/issues/16)) mirror
+the C++ benches' CLI and stdout format, using the existing pybind11 bindings.
+
+### Loopback
+
+_TBD (PR 3)._
+
+### FFT / GEMM via pybind of the C++ post-process layer
+
+_TBD (PR 4)._
+
+## Reproducibility appendix
+
+### Container
+
+All commands below assume execution inside the project container, as required
+by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md). The
+container must be started in privileged mode with all GPUs and hugepage
+mounts passed through.
+
+### Full result regeneration
+
+```bash
+# 1. Build inside the container (root).
+cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \
+    -DDAQIRI_MGR="dpdk socket rdma"
+cmake --build build -j
+
+# 2. Snapshot environment + run the sweeps.
+export DAQIRI_BUILD_DIR="$PWD/build"
+export ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"
+
+# Native-shape headline cell:
+./examples/run_spark_bench.sh dpdk       smoke
+./examples/run_spark_bench.sh rdma       smoke
+./examples/run_spark_bench.sh socket-udp smoke
+./examples/run_spark_bench.sh socket-tcp smoke
+
+# Full payload × batch sweep:
+./examples/run_spark_bench.sh dpdk       sweep
+./examples/run_spark_bench.sh rdma       sweep
+./examples/run_spark_bench.sh socket-udp sweep
+./examples/run_spark_bench.sh socket-tcp sweep
+
+# Drop curve (sweeps --target-gbps):
+./examples/run_spark_bench.sh dpdk       drop-curve
+./examples/run_spark_bench.sh rdma       drop-curve
+./examples/run_spark_bench.sh socket-udp drop-curve
+./examples/run_spark_bench.sh socket-tcp drop-curve
+```
+
+### Environment-only capture
+
+Useful for filing a bug report or comparing two Spark units:
+
+```bash
+./examples/bench_capture_environment.sh /tmp/spark-env
+```
+
+### Tuning prerequisites
+
+System tuning is required before the numbers in this report are reproducible.
+See [`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md)
+for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity.
diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
index 4bc11aa..4807b85 100644
--- a/examples/CMakeLists.txt
+++ b/examples/CMakeLists.txt
@@ -71,14 +71,16 @@ add_daqiri_raw_bench(daqiri_bench_raw_reorder_quantize raw_reorder_quantize_benc
 add_daqiri_raw_bench(daqiri_example_gds_write gds_write_example.cpp)
 add_daqiri_raw_bench(daqiri_example_pcap_writer pcap_writer_example.cpp)
 
-add_executable(daqiri_bench_rdma rdma_bench.cpp)
+add_executable(daqiri_bench_rdma rdma_bench.cpp raw_bench_common.cpp)
 link_daqiri_bench(daqiri_bench_rdma)
+target_link_libraries(daqiri_bench_rdma PRIVATE CUDA::cudart)
 set_target_properties(daqiri_bench_rdma PROPERTIES
   BUILD_RPATH "$ORIGIN/../src;$ORIGIN/../src/third_party/yaml-cpp"
 )
 
-add_executable(daqiri_bench_socket socket_bench.cpp)
+add_executable(daqiri_bench_socket socket_bench.cpp raw_bench_common.cpp)
 link_daqiri_bench(daqiri_bench_socket)
+target_link_libraries(daqiri_bench_socket PRIVATE CUDA::cudart)
 
 foreach(cfg IN LISTS DAQIRI_BENCH_CONFIGS)
   configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${cfg} ${CMAKE_CURRENT_BINARY_DIR}/${cfg} COPYONLY)
diff --git a/examples/bench_capture_environment.sh b/examples/bench_capture_environment.sh
new file mode 100755
index 0000000..3314b08
--- /dev/null
+++ b/examples/bench_capture_environment.sh
@@ -0,0 +1,116 @@
+#!/usr/bin/env bash
+#
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Capture host/NIC/GPU/build state for a benchmark run, so numbers are
+# reproducible across machines and over time. Writes one structured text file
+# with named sections.
+#
+# Usage: ./bench_capture_environment.sh [output_dir]
+#        Default output dir: bench-results/<UTC timestamp>/
+
+set -u
+
+OUT_DIR="${1:-bench-results/$(date -u +%Y%m%dT%H%M%SZ)}"
+mkdir -p "$OUT_DIR"
+OUT="$OUT_DIR/environment.txt"
+
+# Run a command, capturing exit status. Always write a header so the section is
+# present even when the command is missing or fails — silent absence is harder
+# to debug than an explicit "command not found".
+run_section() {
+  local label="$1"; shift
+  {
+    echo "=========================================================="
+    echo "[$label]"
+    echo "  cmd: $*"
+    echo "=========================================================="
+    if command -v "$1" >/dev/null 2>&1 || [[ "$1" == /* || "$1" == ./* ]]; then
+      "$@" 2>&1
+      echo "  (exit: $?)"
+    else
+      echo "  (command not found in PATH: $1)"
+    fi
+    echo
+  } >> "$OUT"
+}
+
+# Cat a file/glob; write a header either way.
+cat_section() {
+  local label="$1"; shift
+  {
+    echo "=========================================================="
+    echo "[$label]"
+    echo "  paths: $*"
+    echo "=========================================================="
+    for p in "$@"; do
+      if compgen -G "$p" >/dev/null; then
+        for f in $p; do
+          echo "----- $f -----"
+          cat "$f" 2>&1
+        done
+      else
+        echo "  (no match: $p)"
+      fi
+    done
+    echo
+  } >> "$OUT"
+}
+
+: > "$OUT"
+
+echo "DAQIRI benchmark environment capture" >> "$OUT"
+echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$OUT"
+echo "Host:      $(hostname)" >> "$OUT"
+echo "Output:    $OUT" >> "$OUT"
+echo >> "$OUT"
+
+# --- Kernel / OS ---
+run_section "uname"           uname -a
+cat_section "kernel-cmdline"  /proc/cmdline
+cat_section "os-release"      /etc/os-release
+run_section "lsb-release"     lsb_release -a
+run_section "clocksource"     cat /sys/devices/system/clocksource/clocksource0/current_clocksource
+
+# --- CPU / NUMA / IRQ ---
+run_section "numactl"         numactl --show
+run_section "lscpu"           lscpu
+cat_section "cpu-isolated"    /sys/devices/system/cpu/isolated
+run_section "cpufreq-info"    cpupower frequency-info
+cat_section "cpu-governor"    /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+run_section "irq-mlx5"        bash -c "grep mlx5 /proc/interrupts || true"
+
+# --- Hugepages ---
+cat_section "hugepages"       /sys/kernel/mm/hugepages/*/nr_hugepages
+run_section "free-h"          free -h
+
+# --- PCIe topology ---
+run_section "lspci-mellanox"  bash -c "lspci -vvv -d 15b3: 2>/dev/null"
+run_section "lspci-nvidia"    bash -c "lspci -vvv -d 10de: 2>/dev/null"
+
+# --- NIC: OFED / firmware / DPDK binding ---
+run_section "ofed-info"       ofed_info -s
+run_section "mlxfwmanager"    mlxfwmanager --query
+run_section "dpdk-devbind"    dpdk-devbind.py --status
+# Per-iface ethtool: the daqiri-tx/rx names if present, plus every enP*/enp*/eth* iface.
+for iface in daqiri-tx daqiri-rx $(ls /sys/class/net 2>/dev/null | grep -E '^(enP|enp|eth)' || true); do
+  [[ -d "/sys/class/net/$iface" ]] || continue
+  run_section "ethtool-i:$iface"  ethtool -i "$iface"
+  run_section "ethtool-g:$iface"  ethtool -g "$iface"
+  run_section "ethtool-l:$iface"  ethtool -l "$iface"
+  cat_section "iface-mtu:$iface"  "/sys/class/net/$iface/mtu"
+  cat_section "iface-mac:$iface"  "/sys/class/net/$iface/address"
+done
+
+# --- GPU ---
+run_section "nvidia-smi-q"        nvidia-smi -q
+run_section "nvidia-smi-tempclk"  nvidia-smi --query-gpu=name,driver_version,temperature.gpu,clocks.current.sm,clocks.current.memory --format=csv
+
+# --- Build state ---
+DAQIRI_DIR="$(git -C "$(dirname "$0")/.." rev-parse --show-toplevel 2>/dev/null || pwd)"
+run_section "git-rev-parse"   git -C "$DAQIRI_DIR" rev-parse HEAD
+run_section "git-status"      git -C "$DAQIRI_DIR" status --short
+run_section "git-describe"    git -C "$DAQIRI_DIR" describe --always --dirty
+
+echo "Capture complete: $OUT"
diff --git a/examples/raw_bench_common.cpp b/examples/raw_bench_common.cpp
index 405c500..bb1e42c 100644
--- a/examples/raw_bench_common.cpp
+++ b/examples/raw_bench_common.cpp
@@ -19,10 +19,12 @@
 
 #include <arpa/inet.h>
 
+#include <algorithm>
 #include <chrono>
 #include <csignal>
 #include <cstring>
 #include <iostream>
+#include <sstream>
 #include <stdexcept>
 #include <thread>
 
@@ -97,6 +99,44 @@ int parse_run_seconds(int argc, char **argv) {
   return run_seconds;
 }
 
+double parse_target_gbps(int argc, char **argv) {
+  double target_gbps = 0.0;
+  for (int i = 2; i + 1 < argc; i += 2) {
+    if (std::string(argv[i]) == "--target-gbps") {
+      target_gbps = std::stod(argv[i + 1]);
+    }
+  }
+  return target_gbps;
+}
+
+TokenBucketPacer::TokenBucketPacer(double target_gbps)
+    : target_bps_(target_gbps > 0.0 ? target_gbps * 1e9 : 0.0),
+      t0_(std::chrono::steady_clock::now()) {}
+
+void TokenBucketPacer::wait_for_bytes(size_t bytes, std::atomic<bool> &stop) {
+  if (target_bps_ <= 0.0) {
+    return;
+  }
+  total_bytes_ += bytes;
+  const double scheduled_secs = (total_bytes_ * 8.0) / target_bps_;
+  const auto scheduled = t0_ + std::chrono::duration_cast<
+                                   std::chrono::steady_clock::duration>(
+                                   std::chrono::duration<double>(scheduled_secs));
+  // Slice the wait into 10 ms chunks so a stop flag (--seconds expiry or
+  // Ctrl-C) can break us out promptly. The total slept across the slices
+  // accumulates to the scheduled deadline, so pacing remains accurate.
+  constexpr auto kSlice = std::chrono::milliseconds(10);
+  while (!stop.load()) {
+    const auto now = std::chrono::steady_clock::now();
+    if (scheduled <= now) {
+      return;
+    }
+    const auto remaining = scheduled - now;
+    std::this_thread::sleep_for(
+        std::min<std::chrono::steady_clock::duration>(remaining, kSlice));
+  }
+}
+
 bool has_bench_rx(const YAML::Node &root) {
   return root["bench_rx"] && root["bench_rx"]["interface_name"];
 }
@@ -287,6 +327,7 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic<bool> &stop) {
   uint64_t pkts = 0;
   uint64_t bytes = 0;
   uint64_t bursts = 0;
+  const auto t0 = std::chrono::steady_clock::now();
   while (!stop.load()) {
     const auto num_rx_queues =
         static_cast<int>(daqiri::get_num_rx_queues(port_id));
@@ -307,9 +348,17 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic<bool> &stop) {
       std::this_thread::sleep_for(std::chrono::microseconds(100));
     }
   }
-
-  std::cout << "RX complete: packets=" << pkts << " bytes=" << bytes
-            << " bursts=" << bursts << "\n";
+  const double secs =
+      std::chrono::duration<double>(std::chrono::steady_clock::now() - t0)
+          .count();
+
+  // Build the line in a stringstream so it reaches std::cout as one
+  // insertion. RX and TX workers race at end-of-run, and chained `cout <<`
+  // calls can interleave their output (corrupting downstream parsers).
+  std::ostringstream oss;
+  oss << "RX complete: packets=" << pkts << " bytes=" << bytes
+      << " bursts=" << bursts << " seconds=" << secs << "\n";
+  std::cout << oss.str() << std::flush;
 }
 
 } // namespace daqiri::bench
diff --git a/examples/raw_bench_common.h b/examples/raw_bench_common.h
index d53a4f9..cf0ee0c 100644
--- a/examples/raw_bench_common.h
+++ b/examples/raw_bench_common.h
@@ -21,6 +21,7 @@
 #include <yaml-cpp/yaml.h>
 
 #include <atomic>
+#include <chrono>
 #include <cstddef>
 #include <cstdint>
 #include <string>
@@ -28,6 +29,34 @@
 
 namespace daqiri::bench {
 
+// Software token-bucket pacer used by the bench TX workers. When
+// target_gbps == 0 the wait_for_bytes() call is a no-op early return, so the
+// pacer adds no overhead when --target-gbps is unset.
+//
+// Accuracy: ~5% at high rates due to Linux nanosleep granularity and scheduler
+// jitter. Acceptable for drop-curve sweeps; tighter pacing would require
+// hardware TX timestamping (DAQIRI's accurate_send YAML flag), deferred.
+class TokenBucketPacer {
+public:
+  TokenBucketPacer() = default;
+  explicit TokenBucketPacer(double target_gbps);
+
+  // Call after each TX burst. Sleeps in short slices until the pacer's notion
+  // of "time the configured target rate would have taken to send the
+  // accumulated bytes" catches up, OR `stop` flips true. Slicing keeps the
+  // bench responsive to --seconds expiry / Ctrl-C without truncating the total
+  // sleep (which would silently break pacing for low target rates).
+  void wait_for_bytes(size_t bytes, std::atomic<bool> &stop);
+
+  bool enabled() const { return target_bps_ > 0.0; }
+  double target_gbps() const { return target_bps_ / 1e9; }
+
+private:
+  double target_bps_ = 0.0;  // 0 means disabled
+  uint64_t total_bytes_ = 0;
+  std::chrono::steady_clock::time_point t0_;
+};
+
 struct RawBenchTxConfig {
   std::string interface_name = "tx_port";
   uint32_t batch_size = 1024;
@@ -68,6 +97,7 @@ class PinnedHostBuffer {
 };
 
 int parse_run_seconds(int argc, char **argv);
+double parse_target_gbps(int argc, char **argv);
 bool has_bench_rx(const YAML::Node &root);
 bool has_bench_tx(const YAML::Node &root);
 RawBenchRxConfig parse_rx(const YAML::Node &root);
diff --git a/examples/raw_gpudirect_bench.cpp b/examples/raw_gpudirect_bench.cpp
index e93d0c8..cbfa9f9 100644
--- a/examples/raw_gpudirect_bench.cpp
+++ b/examples/raw_gpudirect_bench.cpp
@@ -24,6 +24,7 @@
 #include <cstdint>
 #include <cstring>
 #include <iostream>
+#include <sstream>
 #include <string>
 #include <thread>
 #include <unordered_set>
@@ -35,6 +36,7 @@
 namespace {
 
 void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
+               daqiri::bench::TokenBucketPacer &pacer,
                std::atomic<bool> &stop) {
   const int port_id = daqiri::get_port_id(cfg.interface_name);
   if (port_id < 0) {
@@ -67,6 +69,12 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
 
   std::unordered_set<void *> initialized_tx_buffers;
 
+  uint64_t tx_packets = 0;
+  uint64_t tx_bytes = 0;
+  uint64_t tx_bursts = 0;
+  const auto t0 = std::chrono::steady_clock::now();
+  const uint32_t pkt_bytes = cfg.header_size + cfg.payload_size;
+
   while (!stop.load()) {
     auto *msg = daqiri::create_tx_burst_params();
     daqiri::set_header(msg, static_cast<uint16_t>(port_id), 0, cfg.batch_size,
@@ -109,7 +117,7 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
       }
 
       if (daqiri::set_packet_lengths(
-              msg, i, {static_cast<int>(cfg.header_size + cfg.payload_size)}) !=
+              msg, i, {static_cast<int>(pkt_bytes)}) !=
           daqiri::Status::SUCCESS) {
         failed = true;
         break;
@@ -121,18 +129,34 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
       continue;
     }
     daqiri::send_tx_burst(msg);
+    const uint64_t burst_bytes =
+        static_cast<uint64_t>(num_pkts) * pkt_bytes;
+    tx_packets += static_cast<uint64_t>(num_pkts);
+    tx_bytes += burst_bytes;
+    ++tx_bursts;
+    pacer.wait_for_bytes(burst_bytes, stop);
   }
+  const double secs =
+      std::chrono::duration<double>(std::chrono::steady_clock::now() - t0)
+          .count();
+  // Single-write print so concurrent RX worker output doesn't interleave.
+  std::ostringstream oss;
+  oss << "TX complete: packets=" << tx_packets << " bytes=" << tx_bytes
+      << " bursts=" << tx_bursts << " seconds=" << secs << "\n";
+  std::cout << oss.str() << std::flush;
 }
 
 } // namespace
 
 int main(int argc, char **argv) {
   if (argc < 2) {
-    std::cerr << "Usage: " << argv[0] << " <config.yaml> [--seconds N]\n";
+    std::cerr << "Usage: " << argv[0]
+              << " <config.yaml> [--seconds N] [--target-gbps G]\n";
     return 1;
   }
 
   const int run_seconds = daqiri::bench::parse_run_seconds(argc, argv);
+  const double target_gbps = daqiri::bench::parse_target_gbps(argc, argv);
   const auto root = YAML::LoadFile(argv[1]);
   if (daqiri::daqiri_init(argv[1]) != daqiri::Status::SUCCESS) {
     std::cerr << "daqiri_init failed\n";
@@ -150,14 +174,15 @@ int main(int argc, char **argv) {
   std::atomic<bool> stop{false};
   std::thread tx_thread;
   std::thread rx_thread;
+  daqiri::bench::TokenBucketPacer tx_pacer(target_gbps);
 
   if (has_rx) {
     rx_thread = std::thread(daqiri::bench::rx_count_worker,
                             daqiri::bench::parse_rx(root), std::ref(stop));
   }
   if (has_tx) {
-    tx_thread =
-        std::thread(tx_worker, daqiri::bench::parse_tx(root), std::ref(stop));
+    tx_thread = std::thread(tx_worker, daqiri::bench::parse_tx(root),
+                            std::ref(tx_pacer), std::ref(stop));
   }
 
   daqiri::bench::wait_for_stop(run_seconds, stop);
diff --git a/examples/rdma_bench.cpp b/examples/rdma_bench.cpp
index dbe8186..5b9bcfd 100644
--- a/examples/rdma_bench.cpp
+++ b/examples/rdma_bench.cpp
@@ -25,6 +25,7 @@
 #include <string>
 #include <thread>
 
+#include "raw_bench_common.h"
 #include "src/common.h"
 
 namespace {
@@ -48,6 +49,8 @@ struct RdmaBenchConfig {
 struct RdmaWorkerStats {
   uint64_t send_completions = 0;
   uint64_t recv_completions = 0;
+  uint64_t send_bytes = 0;
+  uint64_t recv_bytes = 0;
 };
 
 RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) {
@@ -62,7 +65,8 @@ RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) {
   return cfg;
 }
 
-void rdma_worker(const RdmaBenchConfig& cfg, std::atomic<bool>& stop, RdmaWorkerStats& stats) {
+void rdma_worker(const RdmaBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer,
+                 std::atomic<bool>& stop, RdmaWorkerStats& stats) {
   static constexpr int kMaxOutstanding = 5;
   int outstanding_send = 0;
   int outstanding_recv = 0;
@@ -116,6 +120,10 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic<bool>& stop, RdmaWorker
       }
       outstanding++;
       wr_id++;
+      // Only meter actual byte transmissions (SENDs), not RECEIVE-side posts.
+      if (op == daqiri::RDMAOpCode::SEND) {
+        pacer.wait_for_bytes(static_cast<size_t>(cfg.message_size), stop);
+      }
     };
 
     if (cfg.send) { post_req(outstanding_send, send_wr_id, daqiri::RDMAOpCode::SEND, send_mr); }
@@ -127,10 +135,12 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic<bool>& stop, RdmaWorker
       if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::SEND && outstanding_send > 0) {
         outstanding_send--;
         stats.send_completions++;
+        stats.send_bytes += static_cast<uint64_t>(cfg.message_size);
       } else if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::RECEIVE &&
                  outstanding_recv > 0) {
         outstanding_recv--;
         stats.recv_completions++;
+        stats.recv_bytes += static_cast<uint64_t>(cfg.message_size);
       }
       daqiri::free_tx_burst(completion);
     } else {
@@ -143,17 +153,21 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic<bool>& stop, RdmaWorker
 
 int main(int argc, char** argv) {
   if (argc < 2) {
-    std::cerr << "Usage: " << argv[0] << " <config.yaml> [--seconds N] [--mode server|client|both]\n";
+    std::cerr << "Usage: " << argv[0]
+              << " <config.yaml> [--seconds N] [--mode server|client|both] [--target-gbps G]\n";
     return 1;
   }
 
   int run_seconds = 10;
+  double target_gbps = 0.0;
   std::string mode = "both";
   for (int i = 2; i + 1 < argc; i += 2) {
     if (std::string(argv[i]) == "--seconds") {
       run_seconds = std::stoi(argv[i + 1]);
     } else if (std::string(argv[i]) == "--mode") {
       mode = argv[i + 1];
+    } else if (std::string(argv[i]) == "--target-gbps") {
+      target_gbps = std::stod(argv[i + 1]);
     }
   }
 
@@ -168,18 +182,22 @@ int main(int argc, char** argv) {
   std::thread client_thread;
   RdmaWorkerStats server_stats;
   RdmaWorkerStats client_stats;
+  daqiri::bench::TokenBucketPacer server_pacer(target_gbps);
+  daqiri::bench::TokenBucketPacer client_pacer(target_gbps);
   bool run_server = false;
   bool run_client = false;
 
   if ((mode == "server" || mode == "both") && root["rdma_bench_server"]) {
     run_server = true;
     server_thread = std::thread(
-        rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]), std::ref(stop), std::ref(server_stats));
+        rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]),
+        std::ref(server_pacer), std::ref(stop), std::ref(server_stats));
   }
   if ((mode == "client" || mode == "both") && root["rdma_bench_client"]) {
     run_client = true;
     client_thread = std::thread(
-        rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]), std::ref(stop), std::ref(client_stats));
+        rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]),
+        std::ref(client_pacer), std::ref(stop), std::ref(client_stats));
   }
 
   if (!server_thread.joinable() && !client_thread.joinable()) {
@@ -203,11 +221,23 @@ int main(int argc, char** argv) {
   if (server_thread.joinable()) { server_thread.join(); }
   if (client_thread.joinable()) { client_thread.join(); }
 
+  const double secs =
+      std::chrono::duration<double>(std::chrono::steady_clock::now() - start)
+          .count();
+
   if (run_server) {
-    std::cout << "Server received messages: " << server_stats.recv_completions << '\n';
+    std::cout << "Server complete: send_completions=" << server_stats.send_completions
+              << " recv_completions=" << server_stats.recv_completions
+              << " send_bytes=" << server_stats.send_bytes
+              << " recv_bytes=" << server_stats.recv_bytes
+              << " seconds=" << secs << '\n';
   }
   if (run_client) {
-    std::cout << "Client received messages: " << client_stats.recv_completions << '\n';
+    std::cout << "Client complete: send_completions=" << client_stats.send_completions
+              << " recv_completions=" << client_stats.recv_completions
+              << " send_bytes=" << client_stats.send_bytes
+              << " recv_bytes=" << client_stats.recv_bytes
+              << " seconds=" << secs << '\n';
   }
 
   daqiri::print_stats();
diff --git a/examples/run_spark_bench.sh b/examples/run_spark_bench.sh
new file mode 100755
index 0000000..e85df69
--- /dev/null
+++ b/examples/run_spark_bench.sh
@@ -0,0 +1,289 @@
+#!/usr/bin/env bash
+#
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Sweep wrapper for DAQIRI benchmarks on DGX Spark. Runs the bench across a
+# matrix of (payload/message size, batch size, target-gbps), captures per-run
+# CPU/GPU/NIC counters, and emits one CSV row per cell into bench-results/.
+#
+# Drop sources per backend (per the report methodology):
+#   DPDK    : grep imissed/ierrors/rx_nombuf from bench log (DAQIRI_LOG_INFO).
+#   RDMA    : grep "CQ error" lines from bench log (DAQIRI_LOG_ERROR).
+#   socket  : diff /proc/net/udp drops column (UDP); nstat -a (TCP retransmits).
+#
+# Usage:
+#   ./run_spark_bench.sh <backend> [mode]
+#     backend ∈ {dpdk, rdma, socket-udp, socket-tcp}
+#     mode    ∈ {smoke, sweep, drop-curve}  (default: smoke)
+#
+# Environment variables read from the current shell:
+#   DAQIRI_BUILD_DIR — path to the cmake build dir (defaults to ../build).
+#   ETH_DST_ADDR     — required for dpdk backend (the RX iface MAC).
+#   RX_IFACE         — kernel name of the RX interface for /proc/net/udp diff
+#                       (e.g. enP2p1s0f0np0); required for socket-udp.
+#
+# Run inside the project container as root (per AGENTS.md).
+
+set -u
+set -o pipefail
+
+# --------------------------------------------------------------------------
+# Configuration
+# --------------------------------------------------------------------------
+
+BACKEND="${1:-}"
+MODE="${2:-smoke}"
+if [[ -z "$BACKEND" ]]; then
+  echo "Usage: $0 <dpdk|rdma|socket-udp|socket-tcp> [smoke|sweep|drop-curve]" >&2
+  exit 1
+fi
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+BUILD_DIR="${DAQIRI_BUILD_DIR:-$SCRIPT_DIR/../build}"
+TS="$(date -u +%Y%m%dT%H%M%SZ)"
+OUT_DIR="$SCRIPT_DIR/../bench-results/$TS-$BACKEND-$MODE"
+mkdir -p "$OUT_DIR"
+
+CSV="$OUT_DIR/runs.csv"
+echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_busy_mean,gpu_sm_pct,gpu_mem_bw" > "$CSV"
+
+# Capture slow-moving environment state once per result set.
+"$SCRIPT_DIR/bench_capture_environment.sh" "$OUT_DIR"
+
+RUN_SECONDS=30
+DRIVER_LOG="$OUT_DIR/last_run.stderr"
+
+# Per-backend sweep matrices (see docs/performance-dgx-spark.md methodology).
+# Native-shape sizes are the leftmost entry; "matched 8K" cell is also included.
+case "$BACKEND" in
+  dpdk)
+    PAYLOADS_SWEEP=(8000 4096 1024 256 64)
+    BATCHES_SWEEP=(10240 4096 1024 256)
+    PAYLOADS_HEADLINE=(8000)
+    BATCHES_HEADLINE=(10240)
+    BASE_YAML="$SCRIPT_DIR/daqiri_bench_raw_tx_rx_spark.yaml"
+    BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect"
+    : "${ETH_DST_ADDR:?ETH_DST_ADDR must be set for dpdk backend (cat /sys/class/net/<rx-iface>/address)}"
+    ;;
+  rdma)
+    PAYLOADS_SWEEP=(8000000 1048576 65536 8192 4096)
+    BATCHES_SWEEP=(1)
+    PAYLOADS_HEADLINE=(8000000)
+    BATCHES_HEADLINE=(1)
+    BASE_YAML="$SCRIPT_DIR/daqiri_bench_rdma_tx_rx_spark.yaml"
+    BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_rdma"
+    ;;
+  socket-udp)
+    PAYLOADS_SWEEP=(1472 1024 256 64)
+    BATCHES_SWEEP=(256 32 1)
+    PAYLOADS_HEADLINE=(1472)
+    BATCHES_HEADLINE=(256)
+    BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_udp_tx_rx.yaml"
+    BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket"
+    ;;
+  socket-tcp)
+    PAYLOADS_SWEEP=(1048576 65536 1024)
+    BATCHES_SWEEP=(1)
+    PAYLOADS_HEADLINE=(65536)
+    BATCHES_HEADLINE=(1)
+    BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_tcp_tx_rx.yaml"
+    BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket"
+    ;;
+  *) echo "Unknown backend: $BACKEND" >&2; exit 1 ;;
+esac
+
+DROP_CURVE_TARGETS=(1 5 10 25 50 75 100 0)  # 0 means unpaced (line rate)
+
+# --------------------------------------------------------------------------
+# Helpers
+# --------------------------------------------------------------------------
+
+# Read a scalar field from a `key=value` style stdout line.
+# usage: extract_field <pattern-prefix> <field-name> <file>
+extract_field() {
+  local prefix="$1" field="$2" file="$3"
+  grep -E "^$prefix" "$file" | tail -n1 | grep -oE " $field=[^ ]+" | head -n1 | sed -E "s/.*$field=//"
+}
+
+# Sum DPDK drop counters from the manager log emitted via DAQIRI_LOG_INFO.
+parse_dpdk_drops() {
+  local log="$1"
+  local sum=0 v
+  for key in imissed ierrors rx_nombuf; do
+    v="$(grep -oE "$key=[0-9]+" "$log" 2>/dev/null | tail -n1 | sed -E "s/.*=//" || true)"
+    [[ -n "${v:-}" ]] && sum=$((sum + v))
+  done
+  echo "$sum"
+}
+
+# Count RDMA CQ errors in the manager log.
+parse_rdma_drops() {
+  local log="$1" n
+  n="$(grep -c 'CQ error' "$log" 2>/dev/null)" || true; echo "${n:-0}"
+}
+
+# Snapshot socket drops on the kernel side.
+snapshot_proc_net_udp() {
+  awk 'NR>1 { sum += $13 } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
+}
+snapshot_nstat() {
+  { nstat -a 2>/dev/null || true; } | awk '/TcpExtTCPLostRetransmit|TcpRetransSegs|TcpInErrs/ { s += $2 } END { print s+0 }'
+}
+
+# Substitute payload / batch into the base YAML and write a temp config.
+generate_yaml() {
+  local out="$1" payload="$2" batch="$3"
+  case "$BACKEND" in
+    dpdk)
+      sed -E \
+        -e "s|^( *payload_size: ).*|\1$payload|" \
+        -e "s|^( *batch_size: ).*|\1$batch|" \
+        -e "s|<00:00:00:00:00:00>|$ETH_DST_ADDR|g" \
+        "$BASE_YAML" > "$out"
+      ;;
+    rdma)
+      sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out"
+      ;;
+    socket-udp|socket-tcp)
+      sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out"
+      ;;
+  esac
+}
+
+# Run one cell. Echoes the CSV row to stdout.
+run_cell() {
+  local lang="$1" payload="$2" batch="$3" target_gbps="$4"
+  local cell="$lang-$BACKEND-p$payload-b$batch-g$target_gbps"
+  local cell_dir="$OUT_DIR/$cell"
+  mkdir -p "$cell_dir"
+
+  local yaml="$cell_dir/config.yaml"
+  generate_yaml "$yaml" "$payload" "$batch"
+
+  # Snapshot kernel-side drop counters.
+  local udp_before tcp_before
+  udp_before="$(snapshot_proc_net_udp)"
+  tcp_before="$(snapshot_nstat)"
+
+  # Background captures.
+  ( mpstat -P ALL 1 "$RUN_SECONDS" > "$cell_dir/mpstat.txt" 2>&1 ) &
+  local mpstat_pid=$!
+  ( nvidia-smi dmon -s pucvmet -c "$RUN_SECONDS" > "$cell_dir/nvidia_smi_dmon.txt" 2>&1 ) &
+  local dmon_pid=$!
+
+  # Run the bench. Stderr captures DAQIRI_LOG_* output (DPDK/RDMA drop sources).
+  local stdout="$cell_dir/stdout.txt"
+  local stderr="$cell_dir/stderr.txt"
+  local args=("$yaml" --seconds "$RUN_SECONDS")
+  [[ "$target_gbps" != "0" ]] && args+=(--target-gbps "$target_gbps")
+  [[ "$BACKEND" == "rdma" || "$BACKEND" =~ ^socket- ]] && args+=(--mode both)
+
+  "$BENCH_BIN" "${args[@]}" > "$stdout" 2> "$stderr" || true
+  cp "$stderr" "$DRIVER_LOG"
+
+  # Stop background captures (they self-terminate at -c <N>, but reap if needed).
+  wait "$mpstat_pid" 2>/dev/null || true
+  wait "$dmon_pid"  2>/dev/null || true
+
+  # Parse bench stdout. For RX-bearing benches "RX complete" is authoritative;
+  # for TX-only configs fall back to "TX complete".
+  local pkts bytes secs
+  pkts="$(extract_field 'RX complete' packets "$stdout")"
+  bytes="$(extract_field 'RX complete' bytes   "$stdout")"
+  secs="$(extract_field 'RX complete' seconds  "$stdout")"
+  if [[ -z "$pkts" ]]; then
+    pkts="$(extract_field 'TX complete' packets "$stdout")"
+    bytes="$(extract_field 'TX complete' bytes   "$stdout")"
+    secs="$(extract_field 'TX complete' seconds  "$stdout")"
+  fi
+  if [[ -z "$pkts" ]]; then
+    # "Client complete" carries send_completions/send_bytes (RDMA) or sent_packets/sent_bytes (socket).
+    pkts="$(extract_field 'Client complete' 'sen[dt]_(completions|packets)' "$stdout")"
+    bytes="$(extract_field 'Client complete' 'sen[dt]_bytes' "$stdout")"
+    secs="$(extract_field 'Client complete' seconds "$stdout")"
+  fi
+  pkts="${pkts:-0}"; bytes="${bytes:-0}"; secs="${secs:-0}"
+
+  local pps gbps
+  pps="$(awk -v p="$pkts" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.0f", p/s; else print 0 }')"
+  gbps="$(awk -v b="$bytes" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.3f", (b*8.0)/s/1e9; else print 0 }')"
+
+  # Drops per backend.
+  local drops drops_kind
+  case "$BACKEND" in
+    dpdk)
+      drops="$(parse_dpdk_drops "$stderr")"
+      drops_kind="dpdk-imissed+ierrors+nombuf"
+      ;;
+    rdma)
+      drops="$(parse_rdma_drops "$stderr")"
+      drops_kind="rdma-cqe-error"
+      ;;
+    socket-udp)
+      local udp_after; udp_after="$(snapshot_proc_net_udp)"
+      drops="$((udp_after - udp_before))"
+      drops_kind="udp-proc-net-udp-drops"
+      ;;
+    socket-tcp)
+      local tcp_after; tcp_after="$(snapshot_nstat)"
+      drops="$((tcp_after - tcp_before))"
+      drops_kind="tcp-nstat-retrans+inerrs"
+      ;;
+  esac
+
+  # Mean CPU busy% across all cores (mpstat row "all"). Simple aggregate; per-core
+  # in mpstat.txt for deeper review.
+  local cpu_busy_mean
+  cpu_busy_mean="$(awk '/Average:.*all/ { print 100 - $NF; exit }' "$cell_dir/mpstat.txt" 2>/dev/null || echo 0)"
+  cpu_busy_mean="${cpu_busy_mean:-0}"
+
+  # GPU SM% and memory utilization% from nvidia-smi dmon (`mem` is a busy
+  # percentage, not GB/s). Columns vary by driver; sm is typically $5, mem $6.
+  local gpu_sm gpu_mem
+  gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+               "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
+  gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+                "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
+
+  echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_busy_mean,$gpu_sm,$gpu_mem" \
+    | tee -a "$CSV"
+}
+
+# --------------------------------------------------------------------------
+# Driver
+# --------------------------------------------------------------------------
+
+case "$MODE" in
+  smoke)
+    # One cell, native-shape, unpaced.
+    for p in "${PAYLOADS_HEADLINE[@]}"; do
+      for b in "${BATCHES_HEADLINE[@]}"; do
+        run_cell cpp "$p" "$b" 0
+      done
+    done
+    ;;
+  sweep)
+    # Full payload × batch matrix at line rate.
+    for p in "${PAYLOADS_SWEEP[@]}"; do
+      for b in "${BATCHES_SWEEP[@]}"; do
+        run_cell cpp "$p" "$b" 0
+      done
+    done
+    ;;
+  drop-curve)
+    # Hold native-shape constant, sweep target_gbps.
+    for p in "${PAYLOADS_HEADLINE[@]}"; do
+      for b in "${BATCHES_HEADLINE[@]}"; do
+        for g in "${DROP_CURVE_TARGETS[@]}"; do
+          run_cell cpp "$p" "$b" "$g"
+        done
+      done
+    done
+    ;;
+  *) echo "Unknown mode: $MODE" >&2; exit 1 ;;
+esac
+
+echo
+echo "Results in: $OUT_DIR"
+echo "CSV:        $CSV"
diff --git a/examples/socket_bench.cpp b/examples/socket_bench.cpp
index 5fda5f5..8ebb27f 100644
--- a/examples/socket_bench.cpp
+++ b/examples/socket_bench.cpp
@@ -26,6 +26,7 @@
 #include <string>
 #include <thread>
 
+#include "raw_bench_common.h"
 #include "src/common.h"
 
 namespace {
@@ -50,6 +51,8 @@ struct SocketBenchConfig {
 struct SocketWorkerStats {
   uint64_t sent_packets = 0;
   uint64_t received_packets = 0;
+  uint64_t sent_bytes = 0;
+  uint64_t received_bytes = 0;
 };
 
 SocketBenchConfig parse_socket_cfg(const YAML::Node& node) {
@@ -65,7 +68,8 @@ SocketBenchConfig parse_socket_cfg(const YAML::Node& node) {
   return cfg;
 }
 
-void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, SocketWorkerStats& stats) {
+void socket_worker(const SocketBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer,
+                   std::atomic<bool>& stop, SocketWorkerStats& stats) {
   uintptr_t conn_id = 0;
   uint16_t port = 0;
   uint16_t queue = 0;
@@ -92,8 +96,14 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, Socket
       }
     }
 
-    const bool send_done = !cfg.send || stats.sent_packets >= static_cast<uint64_t>(cfg.iterations);
-    const bool recv_done = !cfg.receive || stats.received_packets >= static_cast<uint64_t>(cfg.iterations);
+    // When cfg.iterations <= 0, the loop is time-bounded (driven by stop.load()
+    // set by --seconds). Otherwise the iteration cap applies as before.
+    const bool send_done = !cfg.send ||
+                           (cfg.iterations > 0 &&
+                            stats.sent_packets >= static_cast<uint64_t>(cfg.iterations));
+    const bool recv_done = !cfg.receive ||
+                           (cfg.iterations > 0 &&
+                            stats.received_packets >= static_cast<uint64_t>(cfg.iterations));
     if (send_done && recv_done) { break; }
 
     if (cfg.send && !send_done) {
@@ -110,6 +120,8 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, Socket
 
         if (daqiri::send_tx_burst(msg) == daqiri::Status::SUCCESS) {
           stats.sent_packets++;
+          stats.sent_bytes += static_cast<uint64_t>(cfg.message_size);
+          pacer.wait_for_bytes(static_cast<size_t>(cfg.message_size), stop);
         }
       } else {
         daqiri::free_tx_metadata(msg);
@@ -120,7 +132,9 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, Socket
       daqiri::BurstParams* burst = nullptr;
       if (daqiri::get_rx_burst(&burst, conn_id, cfg.server) == daqiri::Status::SUCCESS &&
           burst != nullptr) {
-        stats.received_packets += static_cast<uint64_t>(daqiri::get_num_packets(burst));
+        const uint64_t rx_pkts = static_cast<uint64_t>(daqiri::get_num_packets(burst));
+        stats.received_packets += rx_pkts;
+        stats.received_bytes += daqiri::get_burst_tot_byte(burst);
         daqiri::free_all_packets_and_burst_rx(burst);
       } else {
         std::this_thread::sleep_for(std::chrono::microseconds(100));
@@ -134,17 +148,20 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic<bool>& stop, Socket
 int main(int argc, char** argv) {
   if (argc < 2) {
     std::cerr << "Usage: " << argv[0]
-              << " <config.yaml> [--seconds N] [--mode server|client|both]\n";
+              << " <config.yaml> [--seconds N] [--mode server|client|both] [--target-gbps G]\n";
     return 1;
   }
 
   int run_seconds = 10;
+  double target_gbps = 0.0;
   std::string mode = "both";
   for (int i = 2; i + 1 < argc; i += 2) {
     if (std::string(argv[i]) == "--seconds") {
       run_seconds = std::stoi(argv[i + 1]);
     } else if (std::string(argv[i]) == "--mode") {
       mode = argv[i + 1];
+    } else if (std::string(argv[i]) == "--target-gbps") {
+      target_gbps = std::stod(argv[i + 1]);
     }
   }
 
@@ -159,6 +176,8 @@ int main(int argc, char** argv) {
   std::thread client_thread;
   SocketWorkerStats server_stats;
   SocketWorkerStats client_stats;
+  daqiri::bench::TokenBucketPacer server_pacer(target_gbps);
+  daqiri::bench::TokenBucketPacer client_pacer(target_gbps);
   bool run_server = false;
   bool run_client = false;
 
@@ -166,6 +185,7 @@ int main(int argc, char** argv) {
     run_server = true;
     server_thread = std::thread(socket_worker,
                                 parse_socket_cfg(root["socket_bench_server"]),
+                                std::ref(server_pacer),
                                 std::ref(stop),
                                 std::ref(server_stats));
   }
@@ -173,6 +193,7 @@ int main(int argc, char** argv) {
     run_client = true;
     client_thread = std::thread(socket_worker,
                                 parse_socket_cfg(root["socket_bench_client"]),
+                                std::ref(client_pacer),
                                 std::ref(stop),
                                 std::ref(client_stats));
   }
@@ -199,13 +220,23 @@ int main(int argc, char** argv) {
   if (server_thread.joinable()) { server_thread.join(); }
   if (client_thread.joinable()) { client_thread.join(); }
 
+  const double secs =
+      std::chrono::duration<double>(std::chrono::steady_clock::now() - start)
+          .count();
+
   if (run_server) {
-    std::cout << "Server sent packets: " << server_stats.sent_packets
-              << ", received packets: " << server_stats.received_packets << '\n';
+    std::cout << "Server complete: sent_packets=" << server_stats.sent_packets
+              << " recv_packets=" << server_stats.received_packets
+              << " sent_bytes=" << server_stats.sent_bytes
+              << " recv_bytes=" << server_stats.received_bytes
+              << " seconds=" << secs << '\n';
   }
   if (run_client) {
-    std::cout << "Client sent packets: " << client_stats.sent_packets
-              << ", received packets: " << client_stats.received_packets << '\n';
+    std::cout << "Client complete: sent_packets=" << client_stats.sent_packets
+              << " recv_packets=" << client_stats.received_packets
+              << " sent_bytes=" << client_stats.sent_bytes
+              << " recv_bytes=" << client_stats.received_bytes
+              << " seconds=" << secs << '\n';
   }
 
   daqiri::print_stats();
diff --git a/mkdocs.yml b/mkdocs.yml
index 4b80201..81d2dcf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -35,6 +35,8 @@ nav:
   - Getting Started: getting-started.md
   - Configuration: configuration.md
   - API Reference: api-guide.md
+  - Performance:
+    - DGX Spark: performance-dgx-spark.md
   - Tutorials:
     - Background: tutorials/background.md
     - System Configuration: tutorials/system_configuration.md

From 54b77f58f3f91459034c66e7332230d6590ca4dd Mon Sep 17 00:00:00 2001
From: rgurunathan <rgurunathan@nvidia.com>
Date: Wed, 13 May 2026 09:24:40 -0400
Subject: [PATCH 2/4] #15 - Add DPDK data-fill and drop-curve-matrix mode to
 perf doc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fills the DPDK GPUDirect numbers (sweep, drop curve, headline tables)
into docs/performance-dgx-spark.md from the 2026-05-12 bench runs, and
adds the supporting infra for ongoing data fills:

- New examples/run_spark_bench.sh mode `drop-curve-matrix` sweeps
  payload x target_gbps at the headline batch (feeds an upcoming
  payload x target_gbps heatmap, distinct from the existing 1D drop
  curve).
- New scripts/spark_data_fill.sh one-shot driver runs the DPDK +
  socket sweep and drop-curve modes back-to-back, with pre-flight
  checks and orphan-hugepage cleanup between runs. RDMA is deferred
  from PR 1 (single-host loopback over the cable needs a netns +
  two-process refactor; tracked separately).
- docs/stylesheets/extra.css adds a .perf-matrix heatmap (green /
  yellow / red cells) used by the payload x batch matrix and
  upcoming target_gbps matrix.
- mkdocs.yml enables the footnotes extension so the deferred-row
  footnotes ([^1]/[^2]/[^3]) render.
- Container snippet in the perf doc now auto-injects ETH_DST_ADDR
  (and RX_IFACE) from the host so the DPDK benches just work after
  `docker run`.

RoCE and socket rows in the headline tables are marked deferred with
footnotes; the corresponding follow-up issues (drafts at
claude_plans/daqiri-pr15-spark-bench-followups.md) are:

  [^1] RoCE single-host loopback shortcuts through `lo` because both
       endpoints live in the root netns, so the QSFP cable carries no
       traffic; fix is an examples-only two-netns + two-process
       orchestration in run_spark_bench.sh.
  [^2] Socket UDP --mode both deadlocks on peer learning — both ends
       spin send-then-receive in one process, the server never learns
       a peer, and only ~1000 packets / 30 s trickle through under a
       flood of "no learned peer" ERROR spam.
  [^3] Socket TCP --mode both aborts immediately after the second
       accept with a glibc malloc.c:2599 (sysmalloc) heap-integrity
       assertion — likely a double-free or OOB write in the TCP
       socket-mgr init path.

Issue numbers will be back-filled into the footnotes once filed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
---
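Reviewer note (ignored by `git am`): the `drop-curve-matrix` mode described
above can be pictured as the existing `drop-curve` loop with the payload list
widened from the headline value to the full sweep list. The sketch below is
only an illustration under that assumption. The array values are copied from
the dpdk case in PATCH 1/4, while the `drop_curve_matrix` function name, the
loop order, and the stubbed `run_cell` are hypothetical and may not match the
code in this patch.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a payload x target_gbps sweep; not the script's code.
set -u

# Values mirror the dpdk arrays defined in run_spark_bench.sh (PATCH 1/4).
PAYLOADS_SWEEP=(8000 4096 1024 256 64)
BATCHES_HEADLINE=(10240)
DROP_CURVE_TARGETS=(1 5 10 25 50 75 100 0)  # 0 means unpaced

# Stub standing in for the real run_cell, which launches the bench binary.
# Here it only echoes the cell coordinates so the sweep order is visible.
run_cell() {
  local lang="$1" payload="$2" batch="$3" target_gbps="$4"
  echo "$lang,p$payload,b$batch,g$target_gbps"
}

# Assumed loop order: payload outermost, target rate innermost, batch held
# at the headline value.
drop_curve_matrix() {
  local p b g
  for p in "${PAYLOADS_SWEEP[@]}"; do
    for b in "${BATCHES_HEADLINE[@]}"; do
      for g in "${DROP_CURVE_TARGETS[@]}"; do
        run_cell cpp "$p" "$b" "$g"
      done
    done
  done
}

drop_curve_matrix
```

With these arrays the sketch visits 5 payloads x 1 batch x 8 target rates, so
one pass produces 40 cells; at the 30 s `RUN_SECONDS` from PATCH 1/4 that is
roughly 20 minutes of bench time before per-cell overhead.
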
 docs/performance-dgx-spark.md | 352 ++++++++++++++++++++++++----------
 docs/stylesheets/extra.css    |  60 ++++++
 examples/run_spark_bench.sh   |  75 ++++++--
 mkdocs.yml                    |   1 +
 scripts/spark_data_fill.sh    | 143 ++++++++++++++
 5 files changed, 511 insertions(+), 120 deletions(-)
 create mode 100755 scripts/spark_data_fill.sh

diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md
index 1e74cd6..28930c7 100644
--- a/docs/performance-dgx-spark.md
+++ b/docs/performance-dgx-spark.md
@@ -21,20 +21,24 @@ that table since it has no operation boundary.
 
 ### Native-shape peak — max no-drop throughput (Gbps)
 
-| Backend / Stack       | C++ loopback | C++ + FFT      | C++ + GEMM     | Python loopback | Python + FFT   | Python + GEMM  |
-| --------------------- | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- |
-| DPDK GPUDirect (8 KB) | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
-| RoCE (8 MB SEND)      | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
-| Socket UDP (MTU)      | _TBD (PR 1)_ | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
-| Socket TCP (stream)   | _TBD (PR 1)_ | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
+| Backend / Stack       | C++ loopback   | C++ + FFT      | C++ + GEMM     | Python loopback | Python + FFT   | Python + GEMM  |
+| --------------------- | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- |
+| DPDK GPUDirect (8 KB) | **96.4**       | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| RoCE (8 MB SEND)      | _deferred_[^1] | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| Socket UDP (MTU)      | _deferred_[^2] | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
+| Socket TCP (stream)   | _deferred_[^3] | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
 
 ### Matched 8 KB operation — cross-backend Gbps
 
-| Backend / Stack          | C++ loopback | C++ + FFT      | C++ + GEMM     | Python loopback | Python + FFT   | Python + GEMM  |
-| ------------------------ | ------------ | -------------- | -------------- | --------------- | -------------- | -------------- |
-| DPDK GPUDirect           | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
-| RoCE                     | _TBD (PR 1)_ | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
-| Socket UDP (1472 B, MTU) | _TBD (PR 1)_ | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
+| Backend / Stack          | C++ loopback   | C++ + FFT      | C++ + GEMM     | Python loopback | Python + FFT   | Python + GEMM  |
+| ------------------------ | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- |
+| DPDK GPUDirect           | **96.4**       | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| RoCE                     | _deferred_[^1] | _TBD (PR 2)_   | _TBD (PR 2)_   | _TBD (PR 3)_    | _TBD (PR 4)_   | _TBD (PR 4)_   |
+| Socket UDP (1472 B, MTU) | _deferred_[^2] | n/a            | n/a            | _TBD (PR 3)_    | n/a            | n/a            |
+
+[^1]: RoCE single-host loopback is deferred from PR 1 — the bench's `--mode both` runs both ends in one process; with both IPs locally bound, the kernel shortcuts RC connection setup through `lo` rather than the cable. Real loopback measurement requires two netns (one per port), `rdma system set netns exclusive`, RDMA-device netns transfer, a YAML split, and two-process orchestration. Tracked in a follow-up issue.
+[^2]: Socket UDP `--mode both` deadlocks on peer learning: both server and client try to transmit before either has received, so neither side learns its peer. Only ~1000 packets / 30 s trickle through. Tracked in a follow-up issue.
+[^3]: Socket TCP `--mode both` aborts with a glibc heap-corruption assertion (`malloc.c:2599`) immediately after the second TCP connection accept. Bench is unrunnable. Tracked in a follow-up issue.
 
 !!! note "Why two tables"
     A single Gbps number isn't enough to compare backends fairly. A DPDK
@@ -54,6 +58,17 @@ that table since it has no operation boundary.
   HDS is characterized when this report extends to IGX and x86-server
   platforms where device memory works. See
   [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking.
+- **RoCE single-host loopback is deferred from this report.** See footnote [^1]
+  on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in
+  `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0)
+  and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the
+  RC connection through `lo` and the QSFP cable carries no traffic. A
+  follow-up will land the two-netns + two-process orchestration and re-fill
+  the RoCE rows.
+- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3].
+  Both are bench-side bugs uncovered during PR 1 verification on Spark:
+  UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with
+  a glibc heap-corruption assertion on init. Follow-up issues track each.
 - **p99/p999 latency is not in v1.** The bench output captures throughput,
   drops, and resource utilization. Per-burst RX timestamping and percentile
   aggregation are deferred to a follow-up issue.
@@ -78,21 +93,23 @@ The reproducibility appendix has the full capture. Key fields:
 
 ### Bench commands
 
-Each backend has a dedicated bench executable in `examples/`. PR 1 numbers
-come from these:
+Each backend has a dedicated bench executable in `examples/`. The DPDK
+numbers in this report come from the first command; the RoCE and Socket
+commands are listed here for documentation and will be the basis for the
+future fill of those rows.
 
 ```bash
-# DPDK GPUDirect — physical loopback
+# DPDK GPUDirect — physical loopback (used in this report)
 ./build/examples/daqiri_bench_raw_gpudirect \
     examples/daqiri_bench_raw_tx_rx_spark.yaml \
     --seconds 30 [--target-gbps G]
 
-# RoCE — same NIC, two ports
+# RoCE — same NIC, two ports (deferred; see Known limitations)
 ./build/examples/daqiri_bench_rdma \
     examples/daqiri_bench_rdma_tx_rx_spark.yaml \
     --seconds 30 --mode both [--target-gbps G]
 
-# Socket UDP / TCP — localhost
+# Socket UDP / TCP — localhost (deferred; see Known limitations)
 ./build/examples/daqiri_bench_socket \
     examples/daqiri_bench_socket_udp_tx_rx.yaml \
     --seconds 30 --mode both [--target-gbps G]
@@ -106,6 +123,15 @@ ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"
 
 ### Per-backend sweep dimensions
 
+**Payload** is the user-data bytes in one packet (DPDK / Socket UDP),
+one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI
+hands to (or pulls from) the NIC in one `rte_eth_tx_burst` /
+`rte_eth_rx_burst` call — the burst size knob, not a packet-size knob.
+Larger batches amortize doorbell and API overhead per packet; smaller
+batches keep per-call latency lower. Batch only matters when the bench
+is not yet at the link ceiling — at saturation, the NIC is the
+bottleneck and batch size has near-zero effect.
+
 The "payload × batch" sweep doesn't map uniformly across backends. Each
 backend has its own sweep:
 
@@ -145,81 +171,162 @@ a follow-up.
 
 ### External captures per run
 
-Each run records, in parallel with the bench:
-
-- `mpstat -P ALL 1 <N>` — per-core CPU busy%.
-- `nvidia-smi dmon -s pucvmet -c <N>` — GPU SM%, mem%, DRAM bandwidth.
-
-Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, hugepages,
-GPU state, DAQIRI commit) is captured once per result set by
+Each run records, alongside the bench:
+
+- `/proc/stat` snapshots before and after the bench process. The wrapper
+  computes per-core busy% for the master / TX / RX cores (cores 8 / 17 / 18
+  on Spark) as a delta over the run window. `mpstat` is not used: it is
+  often missing from minimal containers, and the cores we care about are
+  pinned by the YAML, so a targeted delta is more meaningful than a
+  system-wide average.
+- `nvidia-smi dmon -s pucvmet -c <N>` — GPU SM%, mem-controller%, and
+  PCIe rxpci / txpci columns (the latter are reported as `-` on the
+  current Spark driver; SM% and mem% are near zero for plain GPUDirect
+  because the GPU is a DMA target, not a compute engine).
+
+Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA,
+hugepages, GPU state, DAQIRI commit) is captured once per result set by
 `bench_capture_environment.sh`.
 
 ## Results — DPDK GPUDirect
 
-_PR 1: this section is populated after the first sweep run._
+Native shape on Spark is an 8 KB (8000 B) payload at batch 10240, the
+configuration the DPDK backend was designed around. All cells below ran
+for 30 s with `drops == 0`.
 
 ### Drop curve at native shape
 
-| target_gbps | achieved Gbps | RX pps  | drops |
-| ----------- | ------------- | ------- | ----- |
-| _TBD (PR 1)_ |               |         |       |
-
-### Payload × batch sweep
-
-| payload | batch | Gbps | pps | drops |
-| ------- | ----- | ---- | --- | ----- |
-| _TBD (PR 1)_ |    |      |     |       |
-
-### CPU and GPU utilization (headline cell)
-
-| Resource        | Value           |
-| --------------- | --------------- |
-| Master core %   | _TBD_           |
-| TX core %       | _TBD_           |
-| RX core %       | _TBD_           |
-| GPU SM %        | _TBD (near 0; GPU is DMA target, not compute)_ |
-| GPU mem BW %    | _TBD_           |
+Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The
+token-bucket pacer tracks target within ±0.02 Gbps until the link saturates
+near 96 Gbps. Beyond that, target=100 and unpaced both report the
+achievable ceiling. TX and RX cores spin in poll-mode regardless of target
+rate (see the TX / RX core % columns below).
+
+| target Gbps | achieved Gbps | RX pps    | drops | TX core % | RX core % |
+| ----------- | ------------- | --------- | ----- | --------- | --------- |
+| 1           | 1.011         |    15,678 | 0     | 92.0      | 92.0      |
+| 5           | 5.012         |    77,697 | 0     | 91.8      | 91.8      |
+| 10          | 10.001        |   155,032 | 0     | 92.5      | 92.5      |
+| 25          | 25.008        |   387,650 | 0     | 91.9      | 91.9      |
+| 50          | 50.001        |   775,062 | 0     | 92.8      | 92.8      |
+| 75          | 74.999        | 1,162,551 | 0     | 91.7      | 91.7      |
+| 100         | 96.370        | 1,493,823 | 0     | 91.6      | 91.6      |
+| unpaced     | 95.897        | 1,486,498 | 0     | 91.6      | 90.5      |
+
+### Payload × batch matrix
+
+Each cell shows the achieved Gbps and drops over a 30 s unpaced run.
+Coloring is relative to the global max across the matrix (here
+**104.989 Gbps** at payload 4096 B, batch 4096):
+
+<div class="perf-legend" markdown="0">
+  <span class="cell-green">green — no drops, Gbps ≥ 90% of max</span>
+  <span class="cell-yellow">yellow — no drops, Gbps ≥ 70% of max</span>
+  <span class="cell-red">red — drops, or Gbps &lt; 70% of max</span>
+</div>
+
+<table class="perf-matrix" markdown="0">
+  <thead>
+    <tr>
+      <th rowspan="2">payload</th>
+      <th colspan="4">batch (packets per burst)</th>
+    </tr>
+    <tr>
+      <th>256</th><th>1024</th><th>4096</th><th>10240</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>8000 B</th>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+      <td class="cell-green">95.9 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>4096 B</th>
+      <td class="cell-green">104.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">104.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">105.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">103.0 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>1024 B</th>
+      <td class="cell-red">64.9 Gbps<small>0 drops</small></td>
+      <td class="cell-yellow">86.2 Gbps<small>0 drops</small></td>
+      <td class="cell-red">71.0 Gbps<small>0 drops</small></td>
+      <td class="cell-red">72.7 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>256 B</th>
+      <td class="cell-red">24.5 Gbps<small>0 drops</small></td>
+      <td class="cell-red">24.8 Gbps<small>0 drops</small></td>
+      <td class="cell-red">29.3 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.1 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>64 B</th>
+      <td class="cell-red">10.1 Gbps<small>0 drops</small></td>
+      <td class="cell-red">10.3 Gbps<small>0 drops</small></td>
+      <td class="cell-red">9.8 Gbps<small>0 drops</small></td>
+      <td class="cell-red">10.3 Gbps<small>0 drops</small></td>
+    </tr>
+  </tbody>
+</table>
+
+**Reading the matrix.** At payload ≥ 4 KB the link saturates (~96–105
+Gbps) and batch size barely moves the number, so every cell is green.
+The 1 KB row is the transition: pps and Gbps both matter, batch size
+starts to influence which side dominates, and most cells fall under 70%
+of the global max. At ≤ 256 B the bottleneck is packets-per-second
+(~10 M pps ceiling at 64 B), so effective Gbps stays well below the
+link ceiling regardless of batch. Run-to-run variance on the unpaced
+cells is ~0.5 Gbps; row-internal Gbps differences smaller than that
+should be treated as noise.
+
+### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced)
+
+| Resource             | Value | Note                                       |
+| -------------------- | ----- | ------------------------------------------ |
+| Master core (CPU 8)  |  3.3% | Mostly idle; orchestration only            |
+| TX core (CPU 17)     | 91.4% | Poll-mode spin; rate-independent           |
+| RX core (CPU 18)     | 91.4% | Poll-mode spin; rate-independent           |
+| GPU SM %             |  0.0% | GPU is a DMA target, no compute            |
+| GPU mem-ctrl %       |  0.0% | Payload writes traverse PCIe, not the GPU memory controller |
+
+The TX/RX cores stay at ~92% across every drop-curve step from 1 Gbps to
+line rate, which is characteristic of DPDK's poll-mode driver: it spins
+waiting for descriptors regardless of offered load. The master core handles
+orchestration only and idles below 5% at the headline shape; at smaller
+payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts
+flow through the orchestration path. That asymmetry is data, not a bug,
+and is captured in the per-cell artifacts under `bench-results/`.
 
 ## Results — RoCE
 
-_PR 1: this section is populated after the first sweep run._
-
-### Drop curve at native shape
-
-| target_gbps | achieved Gbps | completions/s | drops (CQ errors) |
-| ----------- | ------------- | ------------- | ----------------- |
-| _TBD (PR 1)_ |               |               |                   |
-
-### Message-size × batch sweep
-
-| message_size | batch | Gbps | completions/s | drops |
-| ------------ | ----- | ---- | ------------- | ----- |
-| _TBD (PR 1)_ |       |      |               |       |
-
-### CPU and GPU utilization (headline cell)
-
-| Resource        | Value |
-| --------------- | ----- |
-| Server core %   | _TBD_ |
-| Client core %   | _TBD_ |
-| GPU SM %        | _TBD_ |
-| GPU mem BW %    | _TBD_ |
+**Deferred from this report.** See [headline-table footnote 1](#fn:1) and
+the Known limitations section. Single-host RoCE loopback on Spark requires a
+two-netns + two-process orchestration that the wrapper does not yet
+implement. The RoCE rows will be filled when the follow-up issue lands.
 
 ## Results — Socket
 
-_PR 1: this section is populated after the first sweep run. GPU rows N/A._
-
-### UDP — payload × batch sweep
-
-| payload | batch | Gbps | pps | drops |
-| ------- | ----- | ---- | --- | ----- |
-| _TBD (PR 1)_ |    |      |     |       |
+**Deferred from this report.** See [headline-table footnotes 2 and 3](#fn:2)
+and the Known limitations section. Both backends produced unusable data on
+Spark during PR 1 verification:
 
-### TCP — message-size sweep
+- **Socket UDP** in `--mode both` deadlocks on peer learning: both ends try
+  to transmit before either has received, so only ~1000 packets per 30 s
+  get through (≈ 390 kbps). Visible as repeated
+  `[ERROR] UDP server has no learned peer yet; cannot transmit` lines in
+  the bench's stderr.
+- **Socket TCP** in `--mode both` aborts with a glibc heap-corruption
+  assertion (`Fatal glibc error: malloc.c:2599 (sysmalloc)`) immediately
+  after the second TCP connection accept. The bench crashes before any
+  send / recv completes, so no completion line is printed.
 
-| message_size | Gbps | retrans/inerrs |
-| ------------ | ---- | -------------- |
-| _TBD (PR 1)_ |      |                |
+Both are tracked as separate follow-up issues; the Socket rows here will be
+filled once the underlying bench bugs are fixed.
 
 ## Workload variants (FFT, GEMM)
 
@@ -261,45 +368,79 @@ _TBD (PR 4)._
 
 ### Container
 
-All commands below assume execution inside the project container, as required
-by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md). The
-container must be started in privileged mode with all GPUs and hugepage
-mounts passed through.
+All commands below assume execution inside the project container, as
+required by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md).
+On the host, launch the container in privileged mode with all GPUs,
+hugepage mounts, and `/sys` passed through, and bind-mount the repo at
+`/workspace`:
 
-### Full result regeneration
+```bash
+# RX-side NIC; auto-injects ETH_DST_ADDR for the DPDK bench wrappers.
+RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}"
+sudo docker run --rm -it \
+  --net host --ipc=host \
+  --runtime=nvidia --gpus all \
+  --privileged \
+  --ulimit memlock=-1 --ulimit stack=67108864 \
+  -v "$(pwd):/workspace" \
+  -v /dev/hugepages:/dev/hugepages \
+  -v /mnt/huge:/mnt/huge \
+  -v /sys:/sys \
+  -w /workspace \
+  -e ETH_DST_ADDR="$(cat /sys/class/net/$RX_IFACE/address)" \
+  -e RX_IFACE="$RX_IFACE" \
+  daqiri:local \
+  bash
+```
+
+### Build
+
+Inside the container:
 
 ```bash
-# 1. Build inside the container (root).
 cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \
     -DDAQIRI_MGR="dpdk socket rdma"
 cmake --build build -j
+```
+
+### One-shot driver
+
+The whole DPDK matrix (sweep + drop-curve) is driven by a single script
+which handles pre-flight checks (hugepage availability, RX iface MAC,
+link state), orphan-hugepage cleanup between cells, and a final summary of
+per-backend result directories:
+
+```bash
+./scripts/spark_data_fill.sh dpdk
+```
+
+The script defaults to `dpdk socket-udp socket-tcp` if invoked with no
+arguments; on Spark, the socket backends do not produce usable data until
+their follow-up issues are fixed, and RDMA is currently rejected by the
+pre-flight (see Known limitations).
+
+### Per-backend wrapper invocations
 
-# 2. Snapshot environment + run the sweeps.
+For ad-hoc runs of a single cell or a single mode:
+
+```bash
 export DAQIRI_BUILD_DIR="$PWD/build"
-export ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"
-
-# Native-shape headline cell:
-./examples/run_spark_bench.sh dpdk       smoke
-./examples/run_spark_bench.sh rdma       smoke
-./examples/run_spark_bench.sh socket-udp smoke
-./examples/run_spark_bench.sh socket-tcp smoke
-
-# Full payload × batch sweep:
-./examples/run_spark_bench.sh dpdk       sweep
-./examples/run_spark_bench.sh rdma       sweep
-./examples/run_spark_bench.sh socket-udp sweep
-./examples/run_spark_bench.sh socket-tcp sweep
-
-# Drop curve (sweeps --target-gbps):
-./examples/run_spark_bench.sh dpdk       drop-curve
-./examples/run_spark_bench.sh rdma       drop-curve
-./examples/run_spark_bench.sh socket-udp drop-curve
-./examples/run_spark_bench.sh socket-tcp drop-curve
+export ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"   # DPDK only
+
+./examples/run_spark_bench.sh dpdk smoke         # native-shape headline cell
+./examples/run_spark_bench.sh dpdk sweep         # full payload × batch matrix
+./examples/run_spark_bench.sh dpdk drop-curve    # sweep --target-gbps
 ```
 
+Each invocation emits `bench-results/<timestamp>-dpdk-<mode>/` containing
+one subdirectory per cell (stdout / stderr / config / dmon / cpu_stat
+captures), an `environment.txt` snapshot, and a `runs.csv` aggregating
+the cell-level metrics.
+
 ### Environment-only capture
 
-Useful for filing a bug report or comparing two Spark units:
+Useful for filing a bug report or comparing two Spark units without
+running the bench:
 
 ```bash
 ./examples/bench_capture_environment.sh /tmp/spark-env
@@ -307,6 +448,7 @@ Useful for filing a bug report or comparing two Spark units:
 
 ### Tuning prerequisites
 
-System tuning is required before the numbers in this report are reproducible.
-See [`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md)
+System tuning is required before the numbers in this report are
+reproducible. See
+[`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md)
 for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity.
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
index b5ce45f..c3e156a 100644
--- a/docs/stylesheets/extra.css
+++ b/docs/stylesheets/extra.css
@@ -50,3 +50,63 @@
 [data-md-color-scheme="slate"] .md-footer {
   background: #111;
 }
+
+/* ── Performance-report heatmap cells ───────────────────────────────── */
+/* Used by the payload×batch and payload×target_gbps matrices in
+   docs/performance-*.md. Threshold logic (vs. matrix-global max):
+     green  = no drops AND Gbps ≥ 90% of max
+     yellow = no drops AND Gbps ≥ 70% of max
+     red    = any drops OR Gbps < 70% of max                            */
+.md-typeset table.perf-matrix {
+  width: 85%;
+  table-layout: fixed;
+  border-collapse: separate;
+  border-spacing: 5px;
+  font-size: 0.64rem;
+}
+.md-typeset table.perf-matrix th,
+.md-typeset table.perf-matrix td {
+  text-align: center;
+  vertical-align: middle;
+  font-variant-numeric: tabular-nums;
+  padding: 0.55em 0.6em;
+  border-radius: 4px;
+}
+.md-typeset table.perf-matrix td small {
+  display: block;
+  opacity: 0.75;
+  font-size: 0.85em;
+  margin-top: 0.25em;
+}
+.md-typeset table.perf-matrix td.cell-green {
+  background-color: rgba(118, 185, 0, 0.28);
+  color: inherit;
+}
+.md-typeset table.perf-matrix td.cell-yellow {
+  background-color: rgba(255, 196, 0, 0.32);
+  color: inherit;
+}
+.md-typeset table.perf-matrix td.cell-red {
+  background-color: rgba(220, 60, 60, 0.32);
+  color: inherit;
+}
+.md-typeset table.perf-matrix th {
+  background-color: rgba(255, 255, 255, 0.05);
+  font-weight: 600;
+}
+/* Compact legend chips that pair with the matrix. */
+.md-typeset .perf-legend {
+  display: flex;
+  gap: 0.75em;
+  margin: 0.5em 0 1em 0;
+  font-size: 0.85em;
+  flex-wrap: wrap;
+}
+.md-typeset .perf-legend span {
+  padding: 0.1em 0.55em;
+  border-radius: 0.25em;
+  white-space: nowrap;
+}
+.md-typeset .perf-legend .cell-green  { background-color: rgba(118, 185, 0, 0.28); }
+.md-typeset .perf-legend .cell-yellow { background-color: rgba(255, 196, 0, 0.32); }
+.md-typeset .perf-legend .cell-red    { background-color: rgba(220, 60, 60, 0.32); }
diff --git a/examples/run_spark_bench.sh b/examples/run_spark_bench.sh
index e85df69..c2aa75a 100755
--- a/examples/run_spark_bench.sh
+++ b/examples/run_spark_bench.sh
@@ -15,7 +15,7 @@
 # Usage:
 #   ./run_spark_bench.sh <backend> [mode]
 #     backend ∈ {dpdk, rdma, socket-udp, socket-tcp}
-#     mode    ∈ {smoke, sweep, drop-curve}  (default: smoke)
+#     mode    ∈ {smoke, sweep, drop-curve, drop-curve-matrix}  (default: smoke)
 #
 # Required environment in current shell:
 #   DAQIRI_BUILD_DIR — path to the cmake build dir (defaults to ../build).
@@ -35,7 +35,7 @@ set -o pipefail
 BACKEND="${1:-}"
 MODE="${2:-smoke}"
 if [[ -z "$BACKEND" ]]; then
-  echo "Usage: $0 <dpdk|rdma|socket-udp|socket-tcp> [smoke|sweep|drop-curve]" >&2
+  echo "Usage: $0 <dpdk|rdma|socket-udp|socket-tcp> [smoke|sweep|drop-curve|drop-curve-matrix]" >&2
   exit 1
 fi
 
@@ -46,7 +46,7 @@ OUT_DIR="$SCRIPT_DIR/../bench-results/$TS-$BACKEND-$MODE"
 mkdir -p "$OUT_DIR"
 
 CSV="$OUT_DIR/runs.csv"
-echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_busy_mean,gpu_sm_pct,gpu_mem_bw" > "$CSV"
+echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_master_pct,cpu_tx_pct,cpu_rx_pct,gpu_sm_pct,gpu_mem_pct" > "$CSV"
 
 # Capture slow-moving environment state once per result set.
 "$SCRIPT_DIR/bench_capture_environment.sh" "$OUT_DIR"
@@ -64,6 +64,7 @@ case "$BACKEND" in
     BATCHES_HEADLINE=(10240)
     BASE_YAML="$SCRIPT_DIR/daqiri_bench_raw_tx_rx_spark.yaml"
     BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect"
+    CPU_MASTER=8; CPU_TX=17; CPU_RX=18
     : "${ETH_DST_ADDR:?ETH_DST_ADDR must be set for dpdk backend (cat /sys/class/net/<rx-iface>/address)}"
     ;;
   rdma)
@@ -73,6 +74,7 @@ case "$BACKEND" in
     BATCHES_HEADLINE=(1)
     BASE_YAML="$SCRIPT_DIR/daqiri_bench_rdma_tx_rx_spark.yaml"
     BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_rdma"
+    CPU_MASTER=8; CPU_TX=17; CPU_RX=18
     ;;
   socket-udp)
     PAYLOADS_SWEEP=(1472 1024 256 64)
@@ -81,6 +83,7 @@ case "$BACKEND" in
     BATCHES_HEADLINE=(256)
     BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_udp_tx_rx.yaml"
     BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket"
+    CPU_MASTER=8; CPU_TX=17; CPU_RX=18
     ;;
   socket-tcp)
     PAYLOADS_SWEEP=(1048576 65536 1024)
@@ -89,6 +92,7 @@ case "$BACKEND" in
     BATCHES_HEADLINE=(1)
     BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_tcp_tx_rx.yaml"
     BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket"
+    CPU_MASTER=8; CPU_TX=17; CPU_RX=18
     ;;
   *) echo "Unknown backend: $BACKEND" >&2; exit 1 ;;
 esac
@@ -131,6 +135,31 @@ snapshot_nstat() {
   nstat -a 2>/dev/null | awk '/TcpExtTCPLostRetransmit|TcpRetransSegs|TcpInErrs/ { s += $2 } END { print s+0 }' || echo 0
 }
 
+# Snapshot /proc/stat per-cpu counters to a file (mpstat is often not
+# installed in the bench container; /proc/stat is always available).
+snapshot_cpu_stat() {
+  awk '/^cpu[0-9]+/ {
+    total = $2+$3+$4+$5+$6+$7+$8
+    busy  = total - $5 - $6
+    print $1, total, busy
+  }' /proc/stat > "$1"
+}
+
+# Compute busy% for a single cpu index between two /proc/stat snapshots.
+cpu_busy_pct() {
+  local before="$1" after="$2" cpu_idx="$3"
+  awk -v cpu="cpu$cpu_idx" '
+    NR == FNR { b_total[$1] = $2; b_busy[$1] = $3; next }
+              { a_total[$1] = $2; a_busy[$1] = $3 }
+    END {
+      dt = a_total[cpu] - b_total[cpu]
+      db = a_busy[cpu]  - b_busy[cpu]
+      if (dt > 0) printf "%.1f", (db * 100.0) / dt
+      else        printf "0.0"
+    }
+  ' "$before" "$after"
+}
+
 # Substitute payload / batch into the base YAML and write a temp config.
 generate_yaml() {
   local out="$1" payload="$2" batch="$3"
@@ -166,9 +195,10 @@ run_cell() {
   udp_before="$(snapshot_proc_net_udp)"
   tcp_before="$(snapshot_nstat)"
 
-  # Background captures.
-  ( mpstat -P ALL 1 "$RUN_SECONDS" > "$cell_dir/mpstat.txt" 2>&1 ) &
-  local mpstat_pid=$!
+  # Snapshot per-cpu stats just before the bench starts.
+  snapshot_cpu_stat "$cell_dir/cpu_stat.before"
+
+  # Background GPU dmon (1-sec sample, RUN_SECONDS samples).
   ( nvidia-smi dmon -s pucvmet -c "$RUN_SECONDS" > "$cell_dir/nvidia_smi_dmon.txt" 2>&1 ) &
   local dmon_pid=$!
 
@@ -182,8 +212,11 @@ run_cell() {
   "$BENCH_BIN" "${args[@]}" > "$stdout" 2> "$stderr" || true
   cp "$stderr" "$DRIVER_LOG"
 
+  # Snapshot per-cpu stats right after the bench exits (before background
+  # captures finish reaping, to bound the window).
+  snapshot_cpu_stat "$cell_dir/cpu_stat.after"
+
   # Stop background captures (they self-terminate at -c <N>, but reap if needed).
-  wait "$mpstat_pid" 2>/dev/null || true
   wait "$dmon_pid"  2>/dev/null || true
 
   # Parse bench stdout. For RX-bearing benches "RX complete" is authoritative;
@@ -232,21 +265,23 @@ run_cell() {
       ;;
   esac
 
-  # Mean CPU busy% across all cores (mpstat row "all"). Simple aggregate; per-core
-  # in mpstat.txt for deeper review.
-  local cpu_busy_mean
-  cpu_busy_mean="$(awk '/Average:.*all/ { print 100 - $NF; exit }' "$cell_dir/mpstat.txt" 2>/dev/null || echo 0)"
-  cpu_busy_mean="${cpu_busy_mean:-0}"
+  # Per-core CPU busy% over the bench window. Cores defined per-backend
+  # (master/TX/RX) match the YAML so we measure the threads we actually pin.
+  local cpu_master_pct cpu_tx_pct cpu_rx_pct
+  cpu_master_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_MASTER")"
+  cpu_tx_pct="$(cpu_busy_pct     "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_TX")"
+  cpu_rx_pct="$(cpu_busy_pct     "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_RX")"
 
-  # GPU SM% and memory BW from nvidia-smi dmon. dmon output columns vary by
-  # driver; sm is typically column 5 in `-s pucvmet`, mem is column 6.
+  # GPU SM% (column 5) and memory-controller % (column 6) from nvidia-smi
+  # dmon -s pucvmet. These are near zero for GPUDirect workloads (GPU is a
+  # DMA target, not a compute engine).
   local gpu_sm gpu_mem
   gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
                "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
   gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
                 "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
 
-  echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_busy_mean,$gpu_sm,$gpu_mem" \
+  echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_master_pct,$cpu_tx_pct,$cpu_rx_pct,$gpu_sm,$gpu_mem" \
     | tee -a "$CSV"
 }
 
@@ -281,6 +316,16 @@ case "$MODE" in
       done
     done
     ;;
+  drop-curve-matrix)
+    # 2D drop curve: sweep payload × target_gbps at the headline batch.
+    for p in "${PAYLOADS_SWEEP[@]}"; do
+      for b in "${BATCHES_HEADLINE[@]}"; do
+        for g in "${DROP_CURVE_TARGETS[@]}"; do
+          run_cell cpp "$p" "$b" "$g"
+        done
+      done
+    done
+    ;;
   *) echo "Unknown mode: $MODE" >&2; exit 1 ;;
 esac
 
diff --git a/mkdocs.yml b/mkdocs.yml
index 81d2dcf..0fd6b1a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -47,6 +47,7 @@ markdown_extensions:
   - admonition
   - attr_list
   - def_list
+  - footnotes
   - md_in_html
   - tables
   - toc:
diff --git a/scripts/spark_data_fill.sh b/scripts/spark_data_fill.sh
new file mode 100755
index 0000000..0e4f81b
--- /dev/null
+++ b/scripts/spark_data_fill.sh
@@ -0,0 +1,143 @@
+#!/usr/bin/env bash
+# Drives the PR 1 data-fill bench runs for the DGX Spark performance report.
+#
+# Runs DPDK GPUDirect, socket-UDP, and socket-TCP through their sweep and
+# drop-curve modes via examples/run_spark_bench.sh, with pre-flight checks
+# and orphan-hugepage cleanup. RDMA is deferred from PR 1 (single-host
+# loopback over the cable needs a netns+two-process refactor; tracked
+# separately).
+#
+# Run inside the project container (privileged, --gpus all, /dev/hugepages
+# and /mnt/huge mounted, repo at /workspace).
+#
+# Usage:
+#   ./scripts/spark_data_fill.sh                 # all three backends
+#   ./scripts/spark_data_fill.sh dpdk            # just DPDK
+#   ./scripts/spark_data_fill.sh socket-udp socket-tcp
+#
+# Env overrides:
+#   ETH_DST_ADDR  — RX-side MAC. Auto-detected from
+#                   /sys/class/net/enP2p1s0f0np0/address if unset.
+#   RX_IFACE      — RX netdev name (default enP2p1s0f0np0).
+#   DAQIRI_BUILD_DIR — defaults to ./build.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+WRAPPER="$REPO_ROOT/examples/run_spark_bench.sh"
+BUILD_DIR="${DAQIRI_BUILD_DIR:-$REPO_ROOT/build}"
+RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}"
+
+BACKENDS=("$@")
+[[ ${#BACKENDS[@]} -eq 0 ]] && BACKENDS=(dpdk socket-udp socket-tcp)
+
+# --- pre-flight ------------------------------------------------------------
+
+preflight_fail() { echo "PREFLIGHT FAIL: $*" >&2; exit 1; }
+note() { echo "[$(date -u +%H:%M:%SZ)] $*"; }
+
+[[ -x "$WRAPPER" ]] || preflight_fail "wrapper missing or not executable: $WRAPPER"
+
+for be in "${BACKENDS[@]}"; do
+  case "$be" in
+    dpdk)       bin="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect" ;;
+    socket-udp|socket-tcp) bin="$BUILD_DIR/examples/daqiri_bench_socket" ;;
+    rdma)       preflight_fail "RDMA is deferred from PR 1; see follow-up issue" ;;
+    *)          preflight_fail "unknown backend: $be" ;;
+  esac
+  [[ -x "$bin" ]] || preflight_fail "missing bench binary: $bin (run cmake --build first)"
+done
+
+# DPDK-only checks.
+if [[ " ${BACKENDS[*]} " == *" dpdk "* ]]; then
+  free_hp="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)"
+  [[ "${free_hp:-0}" -ge 4 ]] || preflight_fail "HugePages_Free=$free_hp (need >=4); clean /mnt/huge and /dev/hugepages from prior runs"
+
+  if [[ -z "${ETH_DST_ADDR:-}" ]]; then
+    mac_path="/sys/class/net/$RX_IFACE/address"
+    [[ -r "$mac_path" ]] || preflight_fail "cannot read $mac_path; set ETH_DST_ADDR explicitly"
+    ETH_DST_ADDR="$(cat "$mac_path")"
+    export ETH_DST_ADDR
+    note "ETH_DST_ADDR auto-detected from $RX_IFACE: $ETH_DST_ADDR"
+  fi
+
+  carrier="$(cat "/sys/class/net/$RX_IFACE/carrier" 2>/dev/null || echo 0)"
+  [[ "$carrier" == "1" ]] || preflight_fail "RX iface $RX_IFACE has no carrier (cable unplugged or link down)"
+fi
+
+note "Pre-flight OK. Backends: ${BACKENDS[*]}"
+note "Build dir: $BUILD_DIR"
+note "Repo root: $REPO_ROOT"
+
+# --- hugepage cleanup helper ----------------------------------------------
+
+# DPDK leaves orphan rtemap_* files when a bench aborts. Clean between runs so
+# we don't run out of hugepages mid-sweep.
+clean_orphan_hugepages() {
+  local pre post freed
+  pre="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)"
+  : "${pre:=0}"
+  shopt -s nullglob
+  # DPDK uses a random per-process file prefix (override with --file-prefix);
+  # match anything ending in `map_<digit>` to catch the common shape without
+  # nuking unrelated files. Skip any that are still held by a live process.
+  for f in /dev/hugepages/*map_[0-9]* /mnt/huge/*map_[0-9]*; do
+    if ! fuser -- "$f" >/dev/null 2>&1; then
+      rm -f -- "$f" 2>/dev/null || true
+    fi
+  done
+  shopt -u nullglob
+  post="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)"
+  : "${post:=0}"
+  freed=$((post - pre))
+  if [[ "$freed" -gt 0 ]]; then
+    note "Freed $freed orphan hugepages (now ${post} free)"
+  fi
+  return 0
+}
+
+# --- driver loop -----------------------------------------------------------
+
+declare -a RESULT_DIRS
+
+run_backend_mode() {
+  local backend="$1" mode="$2"
+  note "=== Running: $backend $mode ==="
+  clean_orphan_hugepages
+
+  # Stream wrapper output live (per-cell CSV rows appear as they finish) while
+  # also keeping a log for post-run parsing of the "Results in:" line.
+  local log="/tmp/spark_data_fill.$backend.$mode.log"
+  local rc=0
+  # Capture the wrapper's status inside the || branch; any later command resets PIPESTATUS.
+  "$WRAPPER" "$backend" "$mode" 2>&1 | tee "$log" || rc="${PIPESTATUS[0]}"
+
+  if [[ "$rc" -eq 0 ]]; then
+    local result_dir
+    result_dir="$(awk '/^Results in:/ { print $3 }' "$log" | tail -n1)"
+    [[ -n "$result_dir" ]] && RESULT_DIRS+=("$backend/$mode -> $result_dir")
+    note "$backend $mode complete"
+  else
+    note "$backend $mode FAILED (exit $rc); continuing"
+    tail -n 40 "$log" >&2
+  fi
+  clean_orphan_hugepages
+}
+
+for be in "${BACKENDS[@]}"; do
+  run_backend_mode "$be" sweep
+  run_backend_mode "$be" drop-curve
+done
+
+# --- summary ---------------------------------------------------------------
+
+echo
+echo "=========================================="
+echo "Data-fill complete. Result directories:"
+echo "=========================================="
+for r in "${RESULT_DIRS[@]}"; do
+  echo "  $r"
+done
+echo
+echo "Next: aggregate CSVs and fill docs/performance-dgx-spark.md."

From c5fffd5cfdf3ebf3382db962dd32bc0c76526099 Mon Sep 17 00:00:00 2001
From: rgurunathan <rgurunathan@nvidia.com>
Date: Wed, 13 May 2026 09:43:53 -0400
Subject: [PATCH 3/4] #15 - Ignore pcie_schematic.png generated by
 tune_system.py

`python/tune_system.py` writes a PCIe topology schematic to
`pcie_schematic.png` in the working directory by default (introduced
in #61). Anyone who runs the tuner from the repo root then has it
sitting untracked in `git status` forever. Ignore the default path
so it stops showing up in working-tree status. Custom output paths
passed via `--output` are unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
---
 .gitignore | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/.gitignore b/.gitignore
index dc4cf82..472185d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,5 +2,8 @@ build*/
 site/
 bench-results/
 
+# tune_system.py default output
+pcie_schematic.png
+
 # macOS
 .DS_Store

From 14b6f0dd003bcd8f176cfd43b9a418f3ba21ab60 Mon Sep 17 00:00:00 2001
From: rgurunathan <rgurunathan@nvidia.com>
Date: Wed, 13 May 2026 10:33:41 -0400
Subject: [PATCH 4/4] #15 - Add payload x target_gbps matrix and restructure
 perf doc

Adds the payload x target_gbps heatmap to the DPDK GPUDirect results
(5 payloads x 8 target rates, batch=10240) from the 2026-05-13
drop-curve-matrix run. Coloring is relative to the effective target
min(target_gbps, 96 Gbps link cap) so the heatmap reads as "does the
configuration sustain the requested rate?" rather than "what's the
absolute peak?". The 8000 B / 4096 B rows are all green; 1024 B is
green-except-unpaced (where the master core peaks at 93%); 256 B and
64 B turn red once target crosses the PPS ceiling.
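
As a rough sketch of the coloring rule (not code from this PR): the
`classify_cell` helper below is hypothetical, the 95% / 70% thresholds
are the ones in the legend added to docs/performance-dgx-spark.md, and
the 96 Gbps cap is the figure quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of this PR.
LINK_CAP_GBPS=96   # link cap quoted in the commit message

# classify_cell TARGET ACHIEVED DROPS -> prints green | yellow | red
# TARGET is a rate in Gbps or the literal string "unpaced".
classify_cell() {
  local target="$1" achieved="$2" drops="$3"
  # The unpaced column uses the link cap as its effective target.
  if [[ "$target" == "unpaced" ]]; then
    target="$LINK_CAP_GBPS"
  fi
  awk -v t="$target" -v a="$achieved" -v d="$drops" -v cap="$LINK_CAP_GBPS" '
    BEGIN {
      eff = (t < cap) ? t : cap            # min(target_gbps, link cap)
      if (d > 0 || a < 0.70 * eff)  print "red"
      else if (a < 0.95 * eff)      print "yellow"
      else                          print "green"
    }'
}

classify_cell 100     94.0 0   # 1024 B, target=100  -> green
classify_cell unpaced 68.2 0   # 1024 B, unpaced     -> yellow
classify_cell 50      21.1 0   # 256 B, target=50    -> red
```

The three calls reuse cells from the matrix, so the printed colors can
be checked against the rendered table.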

Restructures the report around six top-level sections to keep
backend results scannable as more platforms and workloads land:

  - Summary (headline tables; unchanged content, just stays at top)
  - Introduction (System under test + Methodology, demoted to H3)
  - C++ Results (DPDK / RoCE / Socket / Workload variants, demoted
    to H3; their per-cell subsections demoted to H4; workload
    variants renamed to "DPDK GPUDirect - FFT/GEMM" and
    "RoCE - FFT/GEMM" so anchors don't collide with the C++ Results
    backend headings)
  - Python Results (renamed from "Python results"; unchanged otherwise)
  - Reproduce these results (renamed from "Reproducibility
    appendix"; same content)
  - TODO: Not Yet Implemented / Known Limitations (moved from below
    Summary to the bottom; relabeled to make clear these are pending
    items, not platform constraints)

Adds a compact in-page Contents list under the intro paragraphs
linking to each top-level section. The Material sidebar TOC still
works for fine-grained navigation; the inline list is for orienting
the reader on first arrival.

Plain-text references to "Known limitations" updated to
"TODO / Known Limitations" with anchor links where appropriate
(RoCE / Socket deferred-results subsections, one-shot driver note).

Also tunes the .perf-matrix CSS so the new 9-column matrix fits
within the content area without text overflow: width 100%,
table-layout: auto, white-space: nowrap on cells, slightly tighter
padding. The narrower 5-column payload x batch matrix continues to
render fine under the same rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
---
 docs/performance-dgx-spark.md | 216 ++++++++++++++++++++++++++--------
 docs/stylesheets/extra.css    |   7 +-
 2 files changed, 168 insertions(+), 55 deletions(-)

diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md
index 28930c7..9854243 100644
--- a/docs/performance-dgx-spark.md
+++ b/docs/performance-dgx-spark.md
@@ -11,6 +11,15 @@ against the YAML configs in `examples/`, with the system state captured by
 [`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh)
 alongside each result set.
 
+**Contents**
+
+- [Summary](#summary)
+- [Introduction](#introduction) — system under test, methodology
+- [C++ Results](#c-results) — DPDK, RoCE, Socket, workload variants
+- [Python Results](#python-results)
+- [Reproduce these results](#reproduce-these-results)
+- [TODO: Not Yet Implemented / Known Limitations](#todo-not-yet-implemented-known-limitations)
+
 ## Summary
 
 Two headline tables. **Native-shape peak** reports each backend at its
@@ -47,33 +56,9 @@ that table since it has no operation boundary.
     The matched 8 KB table makes the cross-backend comparison honest; the
     native-shape table shows each backend's design ceiling.
 
-## Known limitations on this platform
-
-- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses
-  `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory
-  for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable —
-  so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no
-  longer changes the memory path; it only changes the segment partition,
-  which makes "HDS vs. plain GPUDirect" a non-distinction on this platform.
-  HDS is characterized when this report extends to IGX and x86-server
-  platforms where device memory works. See
-  [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking.
-- **RoCE single-host loopback is deferred from this report.** See footnote [^1]
-  on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in
-  `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0)
-  and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the
-  RC connection through `lo` and the QSFP cable carries no traffic. A
-  follow-up will land the two-netns + two-process orchestration and re-fill
-  the RoCE rows.
-- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3].
-  Both are bench-side bugs uncovered during PR 1 verification on Spark:
-  UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with
-  a glibc heap-corruption assertion on init. Follow-up issues track each.
-- **p99/p999 latency is not in v1.** The bench output captures throughput,
-  drops, and resource utilization. Per-burst RX timestamping and percentile
-  aggregation are deferred to a follow-up issue.
+## Introduction
 
-## System under test
+### System under test
 
 The reproducibility appendix has the full capture. Key fields:
 
@@ -89,9 +74,9 @@ The reproducibility appendix has the full capture. Key fields:
 | DPDK              | patched per `dpdk_patches/` (container build) |
 | DAQIRI commit     | _captured in `environment.txt`_      |
 
-## Methodology
+### Methodology
 
-### Bench commands
+#### Bench commands
 
 Each backend has a dedicated bench executable in `examples/`. The DPDK
 numbers in this report come from the first command; the RoCE and Socket
@@ -104,12 +89,12 @@ future fill of those rows.
     examples/daqiri_bench_raw_tx_rx_spark.yaml \
     --seconds 30 [--target-gbps G]
 
-# RoCE — same NIC, two ports (deferred; see Known limitations)
+# RoCE — same NIC, two ports (deferred; see TODO / Known Limitations)
 ./build/examples/daqiri_bench_rdma \
     examples/daqiri_bench_rdma_tx_rx_spark.yaml \
     --seconds 30 --mode both [--target-gbps G]
 
-# Socket UDP / TCP — localhost (deferred; see Known limitations)
+# Socket UDP / TCP — localhost (deferred; see TODO / Known Limitations)
 ./build/examples/daqiri_bench_socket \
     examples/daqiri_bench_socket_udp_tx_rx.yaml \
     --seconds 30 --mode both [--target-gbps G]
@@ -121,7 +106,7 @@ The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC:
 ETH_DST_ADDR="$(cat /sys/class/net/<rx-iface>/address)"
 ```
 
-### Per-backend sweep dimensions
+#### Per-backend sweep dimensions
 
 **Payload** is the user-data bytes in one packet (DPDK / Socket UDP),
 one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI
@@ -142,14 +127,14 @@ backend has its own sweep:
 | Socket UDP         | payload_size ∈ {64, 256, 1024, 1472} B (MTU-bound)     | batch_size ∈ {1, 32, 256}              | (1472, 256)          | (1472, 256) — closest to 8 K under MTU cap |
 | Socket TCP         | message_size ∈ {1 K, 64 K, 1 M} B                      | n/a (single stream)                    | (64 K)               | n/a                    |
 
-### "No-drop" threshold
+#### "No-drop" threshold
 
 A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run.
 The headline tables report the highest target rate at which a run was still
 drop-free under this threshold. The methodology does not use a percentile cap
 — either there are drops or there are not.
 
-### Drop-curve sweep
+#### Drop-curve sweep
 
 The drop curve sweeps `--target-gbps` while holding the native-shape cell
 constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`)
@@ -158,7 +143,7 @@ due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK
 `accurate_send`) is unused but would tighten DPDK-only precision; deferred to
 a follow-up.
 
-### Drop sources per backend
+#### Drop sources per backend
 
 - **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO`
   output (the bench's stderr).
@@ -169,7 +154,7 @@ a follow-up.
   `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is
   the closest proxy.
 
-### External captures per run
+#### External captures per run
 
 Each run records, alongside the bench:
 
@@ -188,13 +173,15 @@ Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA,
 hugepages, GPU state, DAQIRI commit) is captured once per result set by
 `bench_capture_environment.sh`.
 
-## Results — DPDK GPUDirect
+## C++ Results
+
+### DPDK GPUDirect
 
 Native shape on Spark is 8 KB payload, batch 10240 — the configuration the
 DPDK backend was designed around. All cells below ran for 30 s with
 `drops == 0`.
 
-### Drop curve at native shape
+#### Drop curve at native shape
 
 Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The
 token-bucket pacer tracks target within ±0.02 Gbps until the link saturates
@@ -213,7 +200,104 @@ rate (visible in the CPU table below).
 | 100         | 96.370        | 1,493,823 | 0     | 91.6      | 91.6      |
 | unpaced     | 95.897        | 1,486,498 | 0     | 91.6      | 90.5      |
 
-### Payload × batch matrix
+#### Payload × target_gbps matrix
+
+Holds batch=10240 constant; sweeps payload and `--target-gbps`
+together. Each cell shows the achieved Gbps and drops over a 30 s
+run. Coloring is relative to the **effective target**
+`min(target_gbps, 96 Gbps link cap)`; the "unpaced" column uses the
+link cap as its effective target:
+
+<div class="perf-legend" markdown="0">
+  <span class="cell-green">green — no drops, achieved ≥ 95% of effective target</span>
+  <span class="cell-yellow">yellow — no drops, achieved ≥ 70% of effective target</span>
+  <span class="cell-red">red — drops, or achieved &lt; 70% of effective target</span>
+</div>
+
+<table class="perf-matrix" markdown="0">
+  <thead>
+    <tr>
+      <th rowspan="2">payload</th>
+      <th colspan="8">target Gbps</th>
+    </tr>
+    <tr>
+      <th>1</th><th>5</th><th>10</th><th>25</th><th>50</th><th>75</th><th>100</th><th>unpaced</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>8000 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">25.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">50.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">75.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">95.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">96.4 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>4096 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">25.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">50.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">75.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">100.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">102.7 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>1024 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">24.9 Gbps<small>0 drops</small></td>
+      <td class="cell-green">49.8 Gbps<small>0 drops</small></td>
+      <td class="cell-green">74.2 Gbps<small>0 drops</small></td>
+      <td class="cell-green">94.0 Gbps<small>0 drops</small></td>
+      <td class="cell-yellow">68.2 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>256 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">10.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">24.7 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.1 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.2 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.2 Gbps<small>0 drops</small></td>
+      <td class="cell-red">21.6 Gbps<small>0 drops</small></td>
+    </tr>
+    <tr>
+      <th>64 B</th>
+      <td class="cell-green">1.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">5.0 Gbps<small>0 drops</small></td>
+      <td class="cell-green">9.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">13.7 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.9 Gbps<small>0 drops</small></td>
+      <td class="cell-red">8.7 Gbps<small>0 drops</small></td>
+    </tr>
+  </tbody>
+</table>
+
+**Reading the matrix.** The token-bucket pacer tracks the target to
+within 1 Gbps at payloads ≥ 1 KB for every target up to 75 Gbps. The
+PPS ceiling for each small payload (~21 Gbps at 256 B, ~9 Gbps at
+64 B) shows up as a horizontal saturation band: once `target_gbps`
+crosses that ceiling, raising it further produces no additional
+throughput. The 64 B / target=75 cell (13.7 Gbps) is a pacer
+transient that appears when the requested rate is well above the
+achievable one; the next cell, at target=100, falls back to the PPS
+ceiling. The 1024 B / unpaced cell (68.2 Gbps, yellow) sits below its
+own paced cells: 1024 B is the size where the master loop transitions
+from idle to fully busy (`cpu_master_pct` jumps from ~3% to ~93% in
+the corresponding unpaced run), and per-cell numbers in this regime
+carry roughly ±5 Gbps of run-to-run variance.
+
+#### Payload × batch matrix
 
 Each cell shows the achieved Gbps and drops over a 30 s unpaced run.
 Coloring is relative to the global max across the matrix (here
@@ -284,7 +368,7 @@ link ceiling regardless of batch. Run-to-run variance on the unpaced
 cells is ~0.5 Gbps; row-internal Gbps differences smaller than that
 should be treated as noise.
 
-### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced)
+#### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced)
 
 | Resource             | Value | Note                                       |
 | -------------------- | ----- | ------------------------------------------ |
@@ -302,18 +386,20 @@ payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts
 flow through the orchestration path. That asymmetry is data, not a bug,
 and is captured in the per-cell artifacts under `bench-results/`.
 
-## Results — RoCE
+### RoCE
 
 **Deferred from this report.** See [headline-table footnote 1](#fn:1) and
-the Known limitations section. Single-host RoCE loopback on Spark requires a
-two-netns + two-process orchestration that the wrapper does not yet
-implement. The RoCE rows will be filled when the follow-up issue lands.
+the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)
+section. Single-host RoCE loopback on Spark requires a two-netns +
+two-process orchestration that the wrapper does not yet implement. The RoCE
+rows will be filled when the follow-up issue lands.
 
-## Results — Socket
+### Socket
 
 **Deferred from this report.** See [headline-table footnotes 2 and 3](#fn:2)
-and the Known limitations section. Both backends produced unusable data on
-Spark during PR 1 verification:
+and the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)
+section. Both backends produced unusable data on Spark during PR 1
+verification:
 
 - **Socket UDP** in `--mode both` deadlocks on peer learning — both ends try
   to transmit before either has received, only ~1000 packets per 30 s get
@@ -328,7 +414,7 @@ Spark during PR 1 verification:
 Both are tracked as separate follow-up issues; the Socket rows here will be
 filled once the underlying bench bugs are fixed.
 
-## Workload variants (FFT, GEMM)
+### Workload variants (FFT, GEMM)
 
 The post-process layer ([PR 2](https://github.com/NVIDIA/daqiri/issues/15))
 adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` /
@@ -340,18 +426,18 @@ throughput delta and GPU utilization.
 - FFT: 1D complex-to-complex, length 1024.
 - GEMM: fp32 square, N = 44 (largest tile that fits in an 8 KB payload).
 
-### DPDK GPUDirect
+#### DPDK GPUDirect — FFT/GEMM
 
 _TBD (PR 2)._
 
-### RoCE
+#### RoCE — FFT/GEMM
 
 _TBD (PR 2)._ Note the unit-of-work mismatch when comparing across backends:
 RoCE applies the post-process kernel once per ~8 MB SEND; DPDK applies it
 once per packet. The throughput numbers are comparable; "operations per
 burst" is not.
 
-## Python results
+## Python Results
 
 The Python benches ([PR 3](https://github.com/NVIDIA/daqiri/issues/16)) mirror
 the C++ benches' CLI and stdout format, using the existing pybind11 bindings.
@@ -364,7 +450,7 @@ _TBD (PR 3)._
 
 _TBD (PR 4)._
 
-## Reproducibility appendix
+## Reproduce these results
 
 ### Container
 
@@ -417,7 +503,7 @@ per-backend result directories:
 The script defaults to `dpdk socket-udp socket-tcp` if invoked with no
 arguments; on Spark, the socket backends will fail their own pre-flight
 once the follow-up issues land. RDMA is currently rejected by the
-pre-flight (see Known limitations).
+pre-flight (see [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)).
 
 ### Per-backend wrapper invocations
 
@@ -452,3 +538,29 @@ System tuning is required before the numbers in this report are
 reproducible. See
 [`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md)
 for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity.
+
+## TODO: Not Yet Implemented / Known Limitations
+
+- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses
+  `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory
+  for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable —
+  so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no
+  longer changes the memory path; it only changes the segment partition,
+  which makes "HDS vs. plain GPUDirect" a non-distinction on this platform.
+  HDS will be characterized once this report extends to IGX and x86-server
+  platforms where device memory works. See
+  [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking.
+- **RoCE single-host loopback is deferred from this report.** See footnote [^1]
+  on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in
+  `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0)
+  and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the
+  RC connection through `lo` and the QSFP cable carries no traffic. A
+  follow-up will land the two-netns + two-process orchestration and re-fill
+  the RoCE rows.
+- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3].
+  Both are bench-side bugs uncovered during PR 1 verification on Spark:
+  UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with
+  a glibc heap-corruption assertion on init. Follow-up issues track each.
+- **p99/p999 latency is not in v1.** The bench output captures throughput,
+  drops, and resource utilization. Per-burst RX timestamping and percentile
+  aggregation are deferred to a follow-up issue.
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
index c3e156a..95af265 100644
--- a/docs/stylesheets/extra.css
+++ b/docs/stylesheets/extra.css
@@ -58,8 +58,8 @@
      yellow = no drops AND Gbps ≥ 70% of max
      red    = any drops OR Gbps < 70% of max                            */
 .md-typeset table.perf-matrix {
-  width: 85%;
-  table-layout: fixed;
+  width: 100%;
+  table-layout: auto;
   border-collapse: separate;
   border-spacing: 5px;
   font-size: 0.64rem;
@@ -69,8 +69,9 @@
   text-align: center;
   vertical-align: middle;
   font-variant-numeric: tabular-nums;
-  padding: 0.55em 0.6em;
+  padding: 0.55em 0.5em;
   border-radius: 4px;
+  white-space: nowrap;
 }
 .md-typeset table.perf-matrix td small {
   display: block;