
High-Performance Computing (HPC) Patterns

HPC is the discipline of getting close to peak hardware throughput on workloads that justify the engineering. This page covers the recurring structural patterns; the per-instruction tricks live in the other Memory & Performance docs.


1. The Three Walls

Every HPC project hits at least one of:

  • Compute wall — the algorithm is FLOPS-bound. Solution: vectorize, parallelize across cores/sockets/nodes/GPUs.
  • Memory-bandwidth wall — moving data between DRAM and cores is the bottleneck. Solution: better layout, blocking/tiling, in-place algorithms, fewer passes.
  • Communication wall — MPI / network / PCIe traffic dominates. Solution: overlap communication with computation, fewer/larger messages, better partitioning.

Most numerical kernels, like BLAS-2 matvec, are bandwidth-bound; BLAS-3 kernels like matmul can be made compute-bound with blocking. Always characterize which wall you are hitting before you optimize.

2. Embarrassingly Parallel / SPMD

When tasks are fully independent — Monte Carlo trials, image filters, particle updates — split the input and run on N threads:

#include <algorithm>
#include <execution>
#include <vector>

int main() {
  std::vector<int> items(1000, 1);
  std::for_each(std::execution::par_unseq, items.begin(), items.end(),
                [](int& x){ x = x * 2; });
}

Or with OpenMP:

#include <vector>

int compute(int x) { return x * x; }

int main() {
  size_t n = 1000;
  std::vector<int> input(n, 3), result(n);
  #pragma omp parallel for
  for (size_t i = 0; i < n; ++i) result[i] = compute(input[i]);
}

This is the easiest speedup in HPC and often the largest. Scales linearly until you hit memory bandwidth or coherence costs.

SPMD ("Single Program, Multiple Data") is the same idea expressed differently — every thread runs the same code, partitioned by an ID. CUDA, OpenMP, MPI all use this model.

3. Fork–Join

Recursive divide-and-conquer. Split a problem, solve each half in parallel, combine.

#include <future>
#include <numeric>
#include <vector>
#include <iostream>

int parallel_sum(const int* data, size_t n) {
  if (n < 4096) return std::accumulate(data, data + n, 0);
  size_t mid = n / 2;
  auto fut = std::async(std::launch::async, parallel_sum, data, mid);
  int rhs = parallel_sum(data + mid, n - mid);
  return fut.get() + rhs;
}

int main() {
  std::vector<int> v(100000, 1);
  std::cout << parallel_sum(v.data(), v.size()) << "\n"; // 100000
}

Cilk, Intel TBB, and OpenMP tasks provide first-class fork-join with much lower overhead than std::async. The grain size (the cutoff below which you run serially) matters a lot: too small and scheduling overhead dominates; too large and cores sit idle.
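
A sketch of the same sum with OpenMP tasks, keeping the 4096 cutoff from the example above as the grain size (tune it per machine); #pragma omp taskwait plays the role of fut.get():

#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

long task_sum(const int* data, size_t n) {
  if (n < 4096) return std::accumulate(data, data + n, 0L); // serial below the grain size
  size_t mid = n / 2;
  long lhs = 0, rhs = 0;
  #pragma omp task shared(lhs)         // fork: a child task sums the left half
  lhs = task_sum(data, mid);
  rhs = task_sum(data + mid, n - mid); // the parent sums the right half meanwhile
  #pragma omp taskwait                 // join: wait for the child before combining
  return lhs + rhs;
}

int main() {
  std::vector<int> v(100000, 1);
  long total = 0;
  #pragma omp parallel
  #pragma omp single                   // one thread seeds the task tree; the team executes it
  total = task_sum(v.data(), v.size());
  std::printf("%ld\n", total); // 100000
}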

4. Work Stealing

Each thread has its own deque of tasks. When a thread runs out, it steals from a random victim's deque. This balances irregular workloads automatically.

  • TBB, Cilk, Go's runtime, Tokio, .NET ThreadPool — all use work stealing.
  • Thieves take from the opposite end of the victim's deque from the one the owner pushes and pops, so owner and thief rarely touch the same cache lines (toy sketch below).

When work-stealing matters: irregular workloads where static partitioning would leave some cores idle. When it doesn't: regular workloads where SPMD with even chunks is optimal and has lower overhead.
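
A toy illustration of the deque discipline, using mutex-guarded deques for brevity; production schedulers such as TBB and Cilk use lock-free Chase-Lev deques and smarter termination detection:

#include <algorithm>
#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Each worker owns a deque. The owner pushes/pops at one end (LIFO, good
// locality); thieves steal from the opposite end, so they rarely fight the
// owner over the same element.
struct Worker {
  std::mutex m;
  std::deque<std::function<void()>> tasks;
};

int main() {
  const unsigned p = std::max(2u, std::thread::hardware_concurrency());
  std::vector<Worker> workers(p);
  std::atomic<int> remaining{1000};

  // Irregular workload, all seeded on worker 0: stealing spreads it out.
  for (int i = 0; i < 1000; ++i)
    workers[0].tasks.push_back([i, &remaining] {
      volatile double x = 0;
      for (int k = 0; k < (i % 50) * 1000; ++k) x += k; // uneven task cost
      --remaining;
    });

  auto run = [&](unsigned id) {
    std::mt19937 rng(id + 1);
    while (remaining.load() > 0) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lk(workers[id].m);   // own deque: pop from the back
        if (!workers[id].tasks.empty()) {
          task = std::move(workers[id].tasks.back());
          workers[id].tasks.pop_back();
        }
      }
      if (!task) {                                       // empty: steal from a victim's front
        unsigned victim = rng() % p;
        std::lock_guard<std::mutex> lk(workers[victim].m);
        if (!workers[victim].tasks.empty()) {
          task = std::move(workers[victim].tasks.front());
          workers[victim].tasks.pop_front();
        }
      }
      if (task) task(); else std::this_thread::yield();
    }
  };

  std::vector<std::thread> pool;
  for (unsigned id = 0; id < p; ++id) pool.emplace_back(run, id);
  for (auto& t : pool) t.join();
}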

See also Concurrency Patterns: Fork-Join and Work Stealing.

5. Map–Reduce

Two-phase: per-element transform (map), then combine (reduce):

#include <execution>
#include <numeric>
#include <vector>
#include <iostream>

int main() {
  std::vector<double> v{1.0, 2.0, 3.0, 4.0};
  double sum_of_squares = std::transform_reduce(
      std::execution::par_unseq,
      v.begin(), v.end(),
      0.0,
      std::plus{},
      [](double x){ return x * x; });
  std::cout << sum_of_squares << "\n"; // 30
}

Distributed map-reduce (Hadoop, Spark) extends this across machines with shuffle/sort between phases.

6. Stencil Computations

Update each cell as a function of itself and neighbors. Used in finite-difference solvers, image filters, cellular automata.

#include <vector>

int main() {
  size_t n = 64, m = 64;
  std::vector<std::vector<double>> cur(n, std::vector<double>(m, 1.0));
  std::vector<std::vector<double>> next(n, std::vector<double>(m, 0.0));

  for (size_t i = 1; i < n - 1; ++i)
    for (size_t j = 1; j < m - 1; ++j)
      next[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j] + cur[i][j-1] + cur[i][j+1]);
}

Optimization patterns:

  • Tiling/blocking. Process the grid in cache-sized tiles to maximize reuse (a tiled sketch follows this list).
  • Time-skewing. Run multiple time steps within a tile before moving on.
  • Double buffering. Read from cur, write to next, swap. Avoids in-place hazards.
  • Vectorization. Inner loops SIMD-vectorize naturally if data is contiguous.
  • Halo exchange. When parallelizing across nodes, each rank needs ghost rows/columns from neighbors.
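
A sketch of the tiling and double-buffering bullets on a flat row-major grid (contiguous, so the inner loop can vectorize); the 64-wide tile is a placeholder to tune against your cache sizes:

#include <algorithm>
#include <cstddef>
#include <vector>

// One Jacobi sweep over cache-sized tiles: finish a tile before moving on,
// so the rows it touches stay resident instead of being reloaded from DRAM.
void jacobi_tiled(const std::vector<double>& cur, std::vector<double>& next,
                  size_t n, size_t m, size_t tile = 64) {
  for (size_t ii = 1; ii < n - 1; ii += tile)
    for (size_t jj = 1; jj < m - 1; jj += tile)
      for (size_t i = ii; i < std::min(ii + tile, n - 1); ++i)
        for (size_t j = jj; j < std::min(jj + tile, m - 1); ++j)
          next[i * m + j] = 0.25 * (cur[(i - 1) * m + j] + cur[(i + 1) * m + j]
                                  + cur[i * m + j - 1] + cur[i * m + j + 1]);
}

int main() {
  size_t n = 256, m = 256;
  std::vector<double> a(n * m, 1.0), b(n * m, 0.0);
  for (int step = 0; step < 10; ++step) {
    jacobi_tiled(a, b, n, m);
    std::swap(a, b);  // double buffering: read one grid, write the other, swap
  }
}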

7. Reductions

Combining many values into one (sum, max, dot product). Three concerns:

Parallelization. Reduce per chunk; combine chunk results:

#include <execution>
#include <numeric>
#include <vector>
#include <iostream>

int main() {
  std::vector<double> v(1000, 0.5);
  double total = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
  std::cout << total << "\n"; // 500
}

Numerical stability. Floating-point addition is not associative, so different reduction orders give slightly different results. For higher precision, use Kahan (compensated) summation or pairwise reduction.
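
A Kahan summation sketch: a running error term recovers the low-order bits that a naive loop drops when adding small values into a large accumulator:

#include <cstdio>
#include <vector>

double kahan_sum(const std::vector<double>& v) {
  double sum = 0.0, c = 0.0;   // c holds the running compensation (lost low-order bits)
  for (double x : v) {
    double y = x - c;          // apply the correction to the next addend
    double t = sum + y;        // big + small: low-order bits of y may be lost here
    c = (t - sum) - y;         // algebraically zero; numerically, the part that was lost
    sum = t;
  }
  return sum;
}

int main() {
  std::vector<double> v(10'000'000, 0.1);
  std::printf("%.6f\n", kahan_sum(v));  // much closer to 1000000 than a naive sum
}

One caveat: flags like -ffast-math license the compiler to reassociate and simplify the compensation away, so build the reduction without them (or use pairwise reduction instead).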

False sharing. Per-thread accumulators must be on separate cache lines (see False Sharing & NUMA).
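
A sketch of per-thread accumulators padded with alignas(64) so each partial sum sits on its own cache line (a 64-byte line is assumed here):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Each partial sum occupies a full cache line, so neighboring threads'
// updates do not keep invalidating each other's lines.
struct alignas(64) PaddedSum {
  double value = 0.0;
};

int main() {
  const size_t n = 1'000'000;
  std::vector<double> data(n, 0.5);
  const unsigned p = std::max(1u, std::thread::hardware_concurrency());
  std::vector<PaddedSum> partial(p);

  std::vector<std::thread> threads;
  for (unsigned id = 0; id < p; ++id)
    threads.emplace_back([&, id] {
      size_t begin = n * id / p, end = n * (id + 1) / p;
      for (size_t i = begin; i < end; ++i) partial[id].value += data[i];
    });
  for (auto& t : threads) t.join();

  double total = 0.0;
  for (const auto& s : partial) total += s.value;  // cheap serial combine
  std::printf("%f\n", total); // 500000.000000
}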

8. Roofline Model — Knowing the Ceiling

The roofline model plots achievable performance against arithmetic intensity (FLOPs executed per byte of data moved). Two limits:

  • Memory bandwidth limit: peak_BW * intensity — slope going up from origin.
  • Peak compute limit: flat horizontal line.

Your kernel's performance is bounded by min(peak_compute, peak_BW * intensity). Plot both and you immediately see whether you're compute- or memory-bound, and how far below the ceiling you are; a back-of-the-envelope example follows the plot.

GFLOPS
  |               peak compute ─────
  |             /
  |           /  ← matmul lives here (compute-bound)
  |         /
  |       /  ← matvec / stencils (BW-bound)
  |_____/
       arithmetic intensity (FLOPS/byte)
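
To make the bound concrete, a back-of-the-envelope roofline for daxpy (y[i] = a*x[i] + y[i]): 2 FLOPs per element against 24 bytes of traffic (read x, read y, write y). The peak numbers below are placeholders; substitute your machine's:

#include <algorithm>
#include <cstdio>

int main() {
  double peak_gflops = 500.0;       // assumed peak compute, GFLOP/s
  double peak_bw_gbs = 100.0;       // assumed DRAM bandwidth, GB/s
  double intensity   = 2.0 / 24.0;  // FLOPs per byte for daxpy
  double ceiling = std::min(peak_gflops, peak_bw_gbs * intensity);
  std::printf("daxpy ceiling: %.1f GFLOP/s\n", ceiling); // ~8.3: firmly bandwidth-bound
}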

Tools: Intel Advisor's roofline view, NVIDIA Nsight Compute, the open-source Empirical Roofline Toolkit.

9. Hybrid Models: MPI + OpenMP + GPU

Real HPC apps are layered:

  • MPI between nodes. Explicit message passing, ranks own data partitions. Halo exchanges, allreduces, broadcasts.
  • OpenMP within a node. Threads share memory, work in parallel on the rank's chunk.
  • CUDA / HIP / SYCL on the GPU. Offload bandwidth-and-compute-heavy kernels.

Communication patterns to know:

  • Point-to-point. Direct send/receive between two ranks.
  • Collective. All-reduce, broadcast, gather, scatter — implemented in MPI as MPI_Allreduce etc.
  • Halo exchange. Each rank sends boundary cells to neighbors, receives ghost cells.
  • Asynchronous overlap. MPI_Isend/MPI_Irecv let computation and communication run concurrently (sketch below).
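
A sketch combining the last two bullets: a 1D halo exchange with nonblocking MPI calls, doing interior work while the ghost-cell messages are in flight (the 1024-cell local size and the tag values are arbitrary choices for the example):

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int local_n = 1024;
  std::vector<double> u(local_n + 2, 1.0);     // u[0] and u[local_n+1] are ghost cells
  int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;  // PROC_NULL: no-op at the edges
  int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

  // Post receives for the ghosts and sends of our boundary cells.
  MPI_Request reqs[4];
  MPI_Irecv(&u[0],           1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(&u[local_n + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
  MPI_Isend(&u[1],           1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
  MPI_Isend(&u[local_n],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

  // ... update interior cells u[2..local_n-1] here, overlapping with the messages ...

  MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
  // Ghost cells are now valid: finish the boundary cells u[1] and u[local_n].

  MPI_Finalize();
}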

10. Tooling

Tool                     Use
perf / VTune / uProf     Per-instruction profiling, cache/branch counters
likwid                   CLI for x86 hardware counters; clean roofline numbers
Tracy                    In-app frame/event profiling (game-style); see Tracy Profiler
NVIDIA Nsight            GPU kernel profiling, roofline
Score-P / Vampir / TAU   MPI + OpenMP trace analysis at HPC scale
HPCToolkit               Sampling-based whole-app profiling
MAQAO                    Static + dynamic loop analysis

Rule: instrument before optimizing. Most "obvious" hot spots are wrong, and most easy wins are invisible until you measure.

References