
# Cache-Friendly Data Structures and Data-Oriented Design

Modern CPUs are bottlenecked on memory, not compute. A cache miss to DRAM costs ~200–300 cycles — enough time to do hundreds of arithmetic ops. The single most impactful performance lever for most code is to stop missing the cache.


## 1. The Memory Hierarchy

Approximate latencies on a modern x86 server (Skylake-X class):

| Level | Latency | Size |
|---|---|---|
| L1 cache | ~4 cycles (~1 ns) | 32 KB / core |
| L2 cache | ~12 cycles (~4 ns) | 256 KB–1 MB / core |
| L3 cache | ~40 cycles (~12 ns) | 8–64 MB shared |
| DRAM | ~200–300 cycles (~80 ns) | GBs |
| NVMe SSD | ~10 μs | TBs |
| Network (same DC) | ~100 μs | — |

Two implications:

  1. The price of an L1 hit vs an L3 miss is ~80×. Layout dwarfs algorithmic constants below ~1000 elements.
  2. Branch mispredicts are ~15 cycles, function calls add overhead — but a single L3 miss costs more than dozens of branches.
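To make the first implication concrete, here is a minimal microbenchmark sketch (an illustrative addition, not from the text above): sum the same array once in memory order and once through a shuffled permutation. Once the working set outgrows L3, the shuffled pass typically runs several times slower despite identical arithmetic.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
  constexpr size_t N = 1 << 26;              // 256 MB of ints — far beyond L3
  std::vector<int> data(N, 1);
  std::vector<uint32_t> order(N);
  std::iota(order.begin(), order.end(), 0u);
  std::shuffle(order.begin(), order.end(), std::mt19937{42});

  auto ms = [](auto&& body) {                // time a callable in milliseconds
    auto t0 = std::chrono::steady_clock::now();
    body();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
  };

  long long s1 = 0, s2 = 0;
  // Streaming pass: the hardware prefetcher hides DRAM latency.
  double seq = ms([&] { for (size_t i = 0; i < N; ++i) s1 += data[i]; });
  // Shuffled pass: same work, but nearly every access is a cache miss.
  double rnd = ms([&] { for (size_t i = 0; i < N; ++i) s2 += data[order[i]]; });
  std::printf("sequential %.1f ms, random %.1f ms (sums %lld %lld)\n", seq, rnd, s1, s2);
}
```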

## 2. Data-Oriented Design (DOD)

DOD is a counter-philosophy to OOP. Instead of "what does this object do," ask: "what is this transformation, and what's the layout that lets the CPU run through it fast?" Popularized in game engines (Mike Acton's *Data-Oriented Design and C++* talk is the foundational text).

The mental model:

> The CPU is a loop on contiguous arrays of bytes. Code is a way to describe what to do at each iteration. Make the data shape match the access shape.

## 3. AoS vs SoA

Two ways to store a million particles, each with position, velocity, color, mass:

Array of Structures (AoS):

```cpp
struct Vec3  { float x, y, z; };
struct Vec4  { float x, y, z, w; };

struct Particle {
  Vec3  pos, vel;
  Vec4  color;
  float mass;
};

std::vector<Particle> particles;   // every read pulls all fields into cache
```

Structure of Arrays (SoA):

```cpp
struct Particles {
  std::vector<Vec3>  positions;
  std::vector<Vec3>  velocities;
  std::vector<Vec4>  colors;
  std::vector<float> masses;
};

// Physics step touches only positions and velocities — color/mass stay out of cache.
void step(Particles& p, float dt) {
  for (size_t i = 0; i < p.positions.size(); ++i) {
    p.positions[i].x += p.velocities[i].x * dt;
    p.positions[i].y += p.velocities[i].y * dt;
    p.positions[i].z += p.velocities[i].z * dt;
  }
}
```

If you only need positions and velocities (e.g., physics step), SoA loads only those fields — typically 2–4× the throughput. AoS is friendlier when you always touch the whole record.

Modern compilers — or tools like ISPC, or `#pragma omp simd` — can vectorize SoA loops trivially; AoS often defeats vectorization because the field strides don't fit SIMD lanes.
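One nuance: `std::vector<Vec3>` still interleaves x/y/z in memory (stride 3 per component). For the cleanest SIMD codegen, flatten each component into its own array. A sketch follows — the `ParticlesSoA` name and the OpenMP pragma are illustrative choices, not from the example above:

```cpp
#include <cstddef>
#include <vector>

// Flat SoA: one array per component, so each SIMD lane reads consecutive floats.
struct ParticlesSoA {
  std::vector<float> px, py, pz;   // positions
  std::vector<float> vx, vy, vz;   // velocities
};

void step_flat(ParticlesSoA& p, float dt) {
  const size_t n = p.px.size();
  #pragma omp simd                 // optional hint (OpenMP 4.0+); unit stride vectorizes well anyway
  for (size_t i = 0; i < n; ++i) {
    p.px[i] += p.vx[i] * dt;
    p.py[i] += p.vy[i] * dt;
    p.pz[i] += p.vz[i] * dt;
  }
}
```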

## 4. Hot/Cold Splitting

Most types have a few "hot" fields (touched in tight loops) and many "cold" fields (touched rarely). Split them:

```cpp
// Bad — every cache line that holds a player's HP also drags in inventory + name + ...
struct Player {
  float                    hp;
  std::vector<int>         inventory;   // ~24 bytes
  std::string              name;        // ~32 bytes
  std::vector<std::string> quests;      // ~24 bytes
};

// Better — hot data is dense.
struct PlayerHot  { float hp; uint16_t cold_index; };       // 8 bytes (with padding)
struct PlayerCold {
  std::vector<int>         inv;
  std::string              name;
  std::vector<std::string> quests;
};

std::vector<PlayerHot>  hot;
std::vector<PlayerCold> cold;

// Tight loop: 8 cache lines cover 64 players' HP.
void apply_regen(std::vector<PlayerHot>& hot, float amount) {
  for (auto& p : hot) p.hp += amount;
}
```

A loop that updates HP touches only `hot[]` — 8 bytes per player instead of hundreds. The cold table is paid for only on the rare access.
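The rare cold-path read goes through `cold_index` into the parallel table — a minimal sketch assuming the `PlayerHot`/`PlayerCold` layout above (the `player_name` helper is hypothetical):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Only this rare call touches the big cold structs; hot loops never see them.
const std::string& player_name(const std::vector<PlayerHot>& hot,
                               const std::vector<PlayerCold>& cold,
                               size_t player) {
  return cold[hot[player].cold_index].name;
}
```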

## 5. Contiguous Containers

In descending order of cache-friendliness:

| Container | Layout | When to use |
|---|---|---|
| `std::vector` | One block | Default; almost always |
| `std::array` | One block on stack | Known fixed size |
| `std::deque` | Chunked blocks | Push/pop at both ends |
| `std::flat_map` (C++23) | Sorted vector | Small map, mostly reads |
| `std::list` | Pointer-chained nodes | Almost never (O(1) splice excepted) |
| `std::map` / `std::set` | Tree nodes, scattered | Avoid on hot paths — use a sorted vector |
| `std::unordered_map` | Buckets + chains | OK for big tables; chains hurt locality |

The lesson: default to `std::vector`. Other containers should justify themselves with a measured win.
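As a sketch of the "sorted vector instead of `std::map`" advice (the `flat_find` helper is hypothetical): keep the vector sorted by key once, then binary-search with `std::lower_bound` — one contiguous allocation instead of scattered tree nodes:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Read-mostly map as a sorted vector of pairs. Sort once after bulk insert,
// then look up with binary search over contiguous memory.
template <class K, class V>
const V* flat_find(const std::vector<std::pair<K, V>>& m, const K& key) {
  auto it = std::lower_bound(m.begin(), m.end(), key,
                             [](const auto& kv, const K& k) { return kv.first < k; });
  return (it != m.end() && it->first == key) ? &it->second : nullptr;
}

// Usage:
//   std::vector<std::pair<int, float>> m = {{1, 2.0f}, {5, 3.0f}};  // sorted by key
//   const float* v = flat_find(m, 5);                               // -> &3.0f
```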

## 6. Indices Beat Pointers

In a Node*-chained structure, the CPU pointer-chases through unpredictable memory. Indices into a contiguous backing array are predictable and pre-fetchable:

```cpp
// Pointer-chasing — slow. Each `next` may live anywhere in memory.
struct Node {
  Node* next;
  int   v;
};

// Indexed — fast. The links live in a packed array; the CPU can prefetch the stride.
// Bonus: 32-bit indices vs 64-bit pointers.
std::vector<int>     values;
std::vector<int32_t> next_index;   // -1 for end

int sum_list(const std::vector<int>& values,
             const std::vector<int32_t>& next_index,
             int32_t head) {
  int s = 0;
  for (int32_t i = head; i != -1; i = next_index[i]) s += values[i];
  return s;
}
```

Bonus: indices are stable across reallocation (relative offsets) and can be 32-bit even on 64-bit machines.
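A common companion trick, sketched under the same layout (the `IntPool` name is illustrative, not from the text): chain freed slots through `next_index` itself, so deletion is O(1) and there is no per-node allocation:

```cpp
#include <cstdint>
#include <vector>

// Slot pool: live slots hold list links; freed slots are chained into an
// intrusive free list inside the same next_index array.
struct IntPool {
  std::vector<int>     values;
  std::vector<int32_t> next_index;   // list link (-1 = end), or next free slot
  int32_t              free_head = -1;

  int32_t alloc(int v) {
    int32_t i;
    if (free_head != -1) {                 // reuse a freed slot
      i = free_head;
      free_head = next_index[i];
    } else {                               // grow the backing arrays
      i = static_cast<int32_t>(values.size());
      values.push_back(0);
      next_index.push_back(-1);
    }
    values[i] = v;
    next_index[i] = -1;                    // caller links it into a list
    return i;
  }

  void free(int32_t i) {                   // push the slot onto the free list
    next_index[i] = free_head;
    free_head = i;
  }
};
```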

## 7. Sentinels and Branch-Free Iteration

Branch mispredicts cost cycles. Sentinels remove "did we run off the end?" checks:

```cpp
// Branchy — two conditions per iteration.
size_t find_branchy(const int* a, size_t n, int target) {
  for (size_t i = 0; i < n; ++i) {
    if (a[i] == target) return i;
  }
  return n;
}

// Sentinel-based — one condition per iteration.
// Caller must reserve n+1 slots so a[n] is writable.
size_t find_sentinel(int* a, size_t n, int target) {
  a[n] = target;            // sentinel guarantees a hit
  size_t i = 0;
  while (a[i] != target) ++i;
  return i;                 // == n if not found
}
```

Modern compilers can sometimes do this for you; sometimes they can't. Profile.

## 8. Prefetching

When the next access is computable but far ahead, hint the CPU to start loading it:

```cpp
void process(int x);

void run(const int* data, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    if (i + 16 < n) {
      __builtin_prefetch(&data[i + 16], 0, 1);   // GCC/Clang
    }
    process(data[i]);
  }
}
```

`__builtin_prefetch` is the portable GCC/Clang builtin; `_mm_prefetch` (from `<xmmintrin.h>`) is the x86 intrinsic equivalent. Useful for index-driven structures (hash table lookups, BVH traversal, B-tree walks). Useless if you're bound on memory bandwidth rather than latency.
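A sketch of the batched hash-lookup pattern mentioned above, against a toy open-addressing table (all names hypothetical; assumes a power-of-two size, nonzero keys, and load factor < 1 so probing terminates): issue prefetches for the whole batch first, then probe while those loads are in flight:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Table {
  std::vector<uint64_t> keys;        // 0 marks an empty slot; size is a power of two
  size_t slot(uint64_t k) const {    // cheap multiplicative hash
    return (k * 0x9E3779B97F4A7C15ull) & (keys.size() - 1);
  }
};

void lookup_batch(const Table& t, const uint64_t* qs, bool* found, size_t n) {
  // Phase 1: start loading every probe's first cache line.
  for (size_t i = 0; i < n; ++i)
    __builtin_prefetch(&t.keys[t.slot(qs[i])], /*rw=*/0, /*locality=*/1);

  // Phase 2: probe; the earlier prefetches overlap these dependent loads.
  for (size_t i = 0; i < n; ++i) {
    size_t s = t.slot(qs[i]);
    while (t.keys[s] != 0 && t.keys[s] != qs[i])    // linear probing
      s = (s + 1) & (t.keys.size() - 1);
    found[i] = (t.keys[s] == qs[i]);
  }
}
```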

## 9. Measuring

Don't guess. Tools:

- `perf stat` (Linux): cycles, instructions, cache-references, cache-misses, branch-misses.
  `perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./bench`
- `perf record` + `perf report`: which lines are missing the cache.
- `likwid-perfctr`: cleaner CLI for x86 perf counters.
- VTune (Intel), uProf (AMD): GUI views of memory-hierarchy events.
- Cachegrind (Valgrind): simulated cache-miss rates, no hardware counters needed.

Headline numbers to watch:

- IPC (instructions per cycle): >2 is good; <1 usually means memory-bound.
- L1/L2/L3 miss rates: each missed level adds latency proportional to the next level's cost.
- Branch miss rate: >5% is a flag.

## 10. Quick Reference

- Default to `std::vector` of dense structs. Reach elsewhere with cause.
- SoA when hot loops touch only a subset of fields.
- Hot/cold split when a type has many fields touched at very different frequencies.
- Contiguous over linked; indices over pointers.
- Pack hot data into cache-line-sized structs (64 bytes typical).
- Avoid sharing cache lines between threads — see False Sharing & NUMA.
- Prefetch only when measured to help; the hardware prefetcher often already covers it.
- Measure, don't intuit. The first cache line you save almost always points to ten more.
