compilAR is a compiler and runtime for straggler-aware AllReduce over GPU clusters, inspired by the paper Efficient AllReduce with Stragglers (Devraj et al.). It takes a schedule produced by the StragglAR algorithm as input and emits a complete, standalone CUDA + MPI + NCCL implementation of that schedule for any number of GPUs.
The core problem it solves is that standard AllReduce algorithms (ring, tree, recursive-halving-doubling) stall every healthy rank until the slowest GPU catches up. StragglAR lets the N-1 healthy ranks make progress among themselves while the straggler is still computing, then merges the straggler's contribution with a minimal number of additional communication rounds once it is ready.
Given N GPUs with one designated straggler (rank N-1):

- Reduce-scatter phase: While the straggler is delayed, the N-1 healthy ranks run `ncclReduceScatter` among themselves over N-1 equal chunks of the buffer. After this, rank r holds the partial sum of chunk r across all healthy ranks.
- Straggler merge phase: The schedule synthesizer produces a sequence of rounds, each containing a batch of pairwise exchanges. Three exchange types are used:
  - StragglerMatching: a healthy rank and the straggler both hold a partial sum of the same chunk. They swap into a scratch buffer, then both call `reduce_add` to finalize.
  - OneWayMatching: a rank holding a fully-reduced chunk pushes it to a rank that does not. Plain copy.
  - TwoWayMatching: two ranks each hold a fully-reduced chunk the other lacks. They swap simultaneously.

After the last round, every rank holds every chunk fully reduced.
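To make the two phases concrete, here is a toy Python simulation for N=4. The merge schedule below is hand-written for illustration only; a real schedule comes from the synthesizers, and the contribution values are arbitrary:

```python
# Toy simulation of both phases for N=4, straggler = rank 3.
N = 4
STRAGGLER = N - 1
CHUNKS = N - 1

# contrib[r][c]: rank r's local contribution to chunk c (one float per chunk).
contrib = [[float(10 * r + c) for c in range(CHUNKS)] for r in range(N)]
expected = [sum(contrib[r][c] for r in range(N)) for c in range(CHUNKS)]

# state[r][c]: what rank r currently holds for chunk c (None = nothing useful).
state = [[None] * CHUNKS for _ in range(N)]

# Reduce-scatter phase among healthy ranks: rank r ends up with the partial
# sum of chunk r over ranks 0..N-2; the straggler still holds only its own data.
for r in range(CHUNKS):
    state[r][r] = sum(contrib[h][r] for h in range(N - 1))
for c in range(CHUNKS):
    state[STRAGGLER][c] = contrib[STRAGGLER][c]

def straggler_matching(a, b, c):
    # Both ranks hold a partial sum of chunk c; swap, then reduce_add.
    total = state[a][c] + state[b][c]
    state[a][c] = state[b][c] = total

def one_way(src, dst, c):
    # Plain copy of a fully-reduced chunk.
    state[dst][c] = state[src][c]

def two_way(a, ca, b, cb):
    # Simultaneous swap of two fully-reduced chunks.
    state[b][ca], state[a][cb] = state[a][ca], state[b][cb]

# Hand-written 4-round merge schedule (illustrative, not synthesizer output):
straggler_matching(0, STRAGGLER, 0)                        # round 1
straggler_matching(1, STRAGGLER, 1); one_way(0, 2, 0)      # round 2
straggler_matching(2, STRAGGLER, 2); two_way(0, 0, 1, 1)   # round 3
two_way(1, 1, 2, 2); one_way(STRAGGLER, 0, 2)              # round 4

assert all(state[r][c] == expected[c] for r in range(N) for c in range(CHUNKS))
print("all ranks hold all chunks fully reduced")
```

Note that each rank takes part in at most one exchange per round, which is what lets a round's matchings run as a single batch.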
compilAR.py takes a schedule file (from synthesizer_pow2.py or synthesizer_nonpow2.py) and generates a complete .cu source file. It uses allreduce_multinode.cu.template as its skeleton and substitutes:

- NUM_RANKS: number of GPUs, inferred from the schedule
- kStragglerRank: straggler rank, inferred from the schedule
- Body of `stragglar_allreduce_helper`: per-round NCCL group blocks, generated from the schedule matchings
Everything else in the template (MPI bootstrap, NCCL communicator init, reduce-scatter sub-communicator, benchmark loop, correctness check, cleanup) is agnostic of the number of GPUs.
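For intuition, the per-round blocks spliced into the helper look roughly like the output of this sketch (shown here for TwoWayMatching only). The identifiers `buf`, `chunkElems`, `comm`, and `stream` are assumptions, not the actual template's names; the point is the `ncclGroupStart()`/`ncclGroupEnd()` batching around paired `ncclSend`/`ncclRecv` calls:

```python
# Illustrative codegen sketch: one TwoWayMatching between ranks a and b,
# swapping chunk_a and chunk_b. Not the actual compilAR.py emitter.
def emit_exchange(a, b, chunk_a, chunk_b):
    branch = (
        "if (myRank == {me}) {{\n"
        "  NCCLCHECK(ncclSend(buf + {sc} * chunkElems, chunkElems, ncclFloat, {peer}, comm, stream));\n"
        "  NCCLCHECK(ncclRecv(buf + {rc} * chunkElems, chunkElems, ncclFloat, {peer}, comm, stream));\n"
        "}}"
    )
    return (branch.format(me=a, sc=chunk_a, rc=chunk_b, peer=b)
            + " else " + branch.format(me=b, sc=chunk_b, rc=chunk_a, peer=a))

def emit_round(exchanges):
    # All exchanges of one round go into a single NCCL group so they
    # are issued as one batch of point-to-point operations.
    body = "\n".join(emit_exchange(*e) for e in exchanges)
    return "NCCLCHECK(ncclGroupStart());\n" + body + "\nNCCLCHECK(ncclGroupEnd());"

print(emit_round([(0, 1, 0, 1), (2, 3, 2, 2)]))
```

Each round becomes one such group block; the schedule's matchings only change which ranks and chunk offsets appear inside it.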
All builds and runs happen inside an Apptainer container defined by stragglar/compilar.def. The container ships a coherent CUDA + NCCL + OpenMPI + PyTorch toolchain (NGC pytorch:23.10-py3), so the host only needs:
- An NVIDIA driver (≥ 535 for our base image)
- Apptainer ≥ 1.3
- An MPI launcher (host `mpirun` on a TCP cluster, or Slurm `srun` on an HPC cluster)
No host-side CUDA, NCCL, MPI, Python, or PyTorch installation is required.
```bash
cd stragglar
sudo apptainer build compilAR.sif compilar.def
# or, if sudo is unavailable:
apptainer build --fakeroot compilAR.sif compilar.def
```

Place compilAR.sif somewhere visible from every node (e.g. an NFS-shared $HOME).
```bash
cd stragglar/schedules
python synthesizer_pow2.py 8 > 8gpusched.txt
```

Pre-generated schedules for N=2, 4, 8 are already in schedules/.
```bash
cd stragglar
apptainer exec --nv compilAR.sif python3 compilAR.py schedules/8gpusched.txt generated_8gpu.cu
```

The container provides stragglar-build, which produces a fat binary that runs on every shipping NVIDIA GPU from Pascal (sm_61) through Hopper (sm_90), with PTX fallback for newer cards:

```bash
apptainer exec --nv compilAR.sif stragglar-build generated_8gpu.cu stragglar_8gpu
```

For a faster, GPU-specific build (host's GPU only):

```bash
apptainer exec --nv compilAR.sif stragglar-build --native generated_8gpu.cu stragglar_8gpu
```

```bash
mpirun -np 8 -H host1:4,host2:4 \
    --mca btl_tcp_if_include <iface> \
    -x NCCL_P2P_DISABLE=1 \
    -x PMIX_MCA_psec=native \
    -x PMIX_MCA_gds=hash \
    apptainer exec --nv compilAR.sif ./stragglar_8gpu 117440512 stragglar 10 100.0
```

The PMIX_MCA_* flags reconcile the host PMIx with the container's PMIx; drop them if your host and container OpenMPI versions match. NCCL_P2P_DISABLE=1 is needed only on consumer (GeForce) GPUs, where the driver disables PCIe P2P.
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00
srun --mpi=pmix \
    apptainer exec --nv \
        --bind /dev/infiniband --bind /sys/class/infiniband \
        compilAR.sif ./stragglar_8gpu 117440512 stragglar 10 100.0
```

Add `-x NCCL_DEBUG=INFO` (TCP) or `--export=ALL,NCCL_DEBUG=INFO` (Slurm) on the first run to confirm `Using network IB` (or `Using network Socket` on TCP) and reasonable bandwidth.
launch.sh runs a smoketester to identify the physically slow GPU and binds it to rank N-1, useful when you don't want to simulate a delay. It currently assumes a single-node, host-toolchain layout; if you use it, run it inside the container:

```bash
apptainer exec --nv compilAR.sif ./stragglar/launch.sh 8 ./stragglar_8gpu 117440512 stragglar 10 -1
```

The generated binary's arguments are `<buffer_bytes> <algorithm> <num_iters> <sleep_ms>`:
| Argument | Description |
|---|---|
| `buffer_bytes` | Total AllReduce buffer size in bytes. Must satisfy `(buffer_bytes / sizeof(float)) % (N - 1) == 0`; the binary errors out otherwise with suggested valid sizes. The examples above use 117440512 (112 MiB, valid for N=8) and 50331648 (48 MiB, valid for N=4); both give a 16 MiB chunk per rank. |
| `algorithm` | Must be `stragglar`. |
| `num_iters` | Timed iterations; the first is discarded as warmup. |
| `sleep_ms` | Milliseconds to delay rank N-1. `-1` skips reduce-scatter and runs the merge schedule only. |
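The divisibility rule for `buffer_bytes` fits in a few lines of Python. The helper names below are ours, not those of pad_buffers.py:

```python
# Validity check for buffer_bytes, per the documented constraint:
# the element count (buffer_bytes / sizeof(float)) must divide evenly
# into N-1 chunks.
SIZEOF_FLOAT = 4

def is_valid(buffer_bytes, n):
    return (buffer_bytes % SIZEOF_FLOAT == 0
            and (buffer_bytes // SIZEOF_FLOAT) % (n - 1) == 0)

def nearest_valid(buffer_bytes, n):
    # Snap to the nearest multiple of the smallest valid increment,
    # which is (N-1) * sizeof(float) bytes.
    step = SIZEOF_FLOAT * (n - 1)
    return round(buffer_bytes / step) * step

assert is_valid(117440512, 8)    # 112 MiB, valid for N=8
assert is_valid(50331648, 4)     # 48 MiB, valid for N=4
assert not is_valid(1 << 20, 8)  # 1 MiB is not a multiple of 28 bytes
```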
allreduce_multinode.cu (and all generated files) use one MPI rank per GPU. Each process initializes its NCCL communicator via ncclCommInitRank with a token distributed by MPI_Bcast. This scales to multi-node configurations.
The single-process variants in reference_code/ and allreduce_4GPU_rewrite.cu use ncclCommInitAll and only work on one host. They are kept for reference.
Two NCCL communicators are maintained per process:
- `comm`: all N ranks; used during the straggler merge schedule
- `subComm`: ranks 0 through N-2; used for the reduce-scatter phase while the straggler is delayed (built via `MPI_Comm_split`)
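The `MPI_Comm_split` membership can be sketched with its color semantics in plain Python. Giving the straggler no color (the `MPI_UNDEFINED` analog) is one plausible assignment; the template may instead give it a color of its own:

```python
# Color assignment producing the subComm described above: healthy ranks
# share color 0, the straggler opts out (None stands in for MPI_UNDEFINED).
def split_colors(n):
    return {rank: (0 if rank < n - 1 else None) for rank in range(n)}

colors = split_colors(8)
sub_ranks = [r for r, c in colors.items() if c == 0]
assert sub_ranks == list(range(7))   # subComm spans ranks 0..N-2
assert colors[7] is None             # straggler is excluded
```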
Each MPI process binds to its GPU via LOCAL_RANK (set by mpirun, torchrun, or Slurm). Without it, the process falls back to myRank % cudaDeviceCount. For accurate straggler behavior, rank N-1 must be bound to the physically slow GPU. launch.sh does this automatically.
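The binding rule above, plus the slowest-GPU-last ordering that launch.sh arranges, can be sketched as follows. Function names and the env-dict parameter are our illustration, not the template's or script's exact code:

```python
import os

def pick_device(my_rank, device_count, env=os.environ):
    # Prefer the launcher-provided LOCAL_RANK; otherwise fall back to
    # myRank % cudaDeviceCount, as the generated binary does.
    local_rank = env.get("LOCAL_RANK")
    return int(local_rank) if local_rank is not None else my_rank % device_count

def straggler_last_ordering(timings_ms):
    # Given one smoketest timing per physical GPU, order devices so the
    # slowest lands at rank N-1 (the mapping launch.sh aims for; its
    # actual mechanism may differ).
    return sorted(range(len(timings_ms)), key=lambda d: timings_ms[d])

print(pick_device(5, 4, env={}))                          # fallback: 5 % 4 = 1
print(straggler_last_ordering([11.2, 30.5, 10.9, 11.0]))  # slow device 1 last
```

Exporting the resulting ordering (e.g. via CUDA_VISIBLE_DEVICES) before launch is one way to realize the mapping.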
- `launch.sh` assumes single-node. The smoketester enumerates GPUs via `torch.cuda.device_count()` on one host. Multi-node runs need either a per-node aggregation step or manual `LOCAL_RANK` assignment.
- Data type is hardcoded to `float32`. Supporting fp16 / bf16 requires changes to both the template and the generator.
- Buffer size must be divisible by `(N-1) * sizeof(float)`. The binary checks at startup and exits with an error and suggested valid sizes if not. Note that round power-of-two sizes (1 MiB, 1 GiB) are not valid for non-power-of-two values of N-1: for N=4 use multiples of 12 bytes (e.g. 48 MiB, 192 MiB), for N=8 use multiples of 28 bytes (e.g. 112 MiB, 896 MiB). The helper `stragglar/smoketest/pad_buffers.py N` prints a list of valid sizes for a given N.
- Straggler is always rank N-1. This is baked into the generated code; `launch.sh` maps the physical straggler GPU to that rank at launch.
- Correctness check assumes the built-in fill pattern. `kExpectedSum = 6.0f` only holds when the straggler fills its buffer with the arbitrary `3.0f` and each non-straggler fills its own chunk. Replace this check when wiring real input data.
- Clock-based straggler delay is calibrated to device 0. On GPUs with heterogeneous clocks, `sleep_ms` won't match wall-clock milliseconds on other devices. This doesn't affect correctness.
The StragglAR algorithm and the schedule synthesizer in stragglar/schedules/ are the work of Devraj et al., Efficient AllReduce with Stragglers. This project is not the original algorithm, but a compiler and launch harness built around it.