Modern CPUs are bottlenecked on memory, not compute. A cache miss to DRAM costs ~200–300 cycles — enough time to execute hundreds of arithmetic ops. The single most impactful performance lever for most code is to stop missing the cache.
- 1. The Memory Hierarchy
- 2. Data-Oriented Design (DOD)
- 3. AoS vs SoA
- 4. Hot/Cold Splitting
- 5. Contiguous Containers
- 6. Indices Beat Pointers
- 7. Sentinels and Branch-Free Iteration
- 8. Prefetching
- 9. Measuring
- 10. Quick Reference
Approximate latencies on a modern x86 server (Skylake-X class):
| Level | Latency | Size |
|---|---|---|
| L1 cache | ~4 cycles (~1 ns) | 32 KB / core |
| L2 cache | ~12 cycles (~4 ns) | 256 KB–1 MB / core |
| L3 cache | ~40 cycles (~12 ns) | 8–64 MB shared |
| DRAM | ~200–300 cycles (~80 ns) | GBs |
| NVMe SSD | ~10 μs | TBs |
| Network (same DC) | ~100 μs | — |
Two implications:
- The price of an L1 hit (~1 ns) vs an L3 miss that goes to DRAM (~80 ns) is roughly 80×. For small working sets (below ~1000 elements), memory layout routinely matters more than algorithmic constants.
- A branch mispredict is ~15 cycles and a function call adds a few more — but a single L3 miss costs more than dozens of either.
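These numbers are easy to feel directly: sum the same array once in address order and once in a random permutation. A minimal sketch using only the standard library (the helper names are mine; the exact sequential-vs-random ratio depends on how the array size compares to your caches):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Visit `data` in the order given by `order` and sum it.
// Identity order → the hardware prefetcher streams the array.
// Shuffled order → nearly every access can miss all the way to DRAM.
long long sum_in_order(const std::vector<int>& data,
                       const std::vector<uint32_t>& order) {
    long long s = 0;
    for (uint32_t idx : order) s += data[idx];
    return s;
}

std::vector<uint32_t> identity_order(size_t n) {
    std::vector<uint32_t> o(n);
    std::iota(o.begin(), o.end(), 0u);
    return o;
}

std::vector<uint32_t> shuffled_order(size_t n) {
    auto o = identity_order(n);
    std::shuffle(o.begin(), o.end(), std::mt19937{42}); // fixed seed, repeatable
    return o;
}

// Milliseconds to sum `data` in the given order. Same result either way;
// for arrays much larger than L3, the shuffled walk is typically several × slower.
double time_sum_ms(const std::vector<int>& data,
                   const std::vector<uint32_t>& order) {
    auto t0 = std::chrono::steady_clock::now();
    volatile long long sink = sum_in_order(data, order);
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Try it with a vector of ~16M ints (64 MB, far larger than any L3): identical sums, very different wall clock.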
DOD is a counter-philosophy to OOP. Instead of "what does this object do," ask: "what is this transformation, and what's the layout that lets the CPU run through it fast?" Popularized in game engines (Mike Acton's Data-Oriented Design and C++ talk is the foundational text).
The mental model:
The CPU is a loop on contiguous arrays of bytes. Code is a way to describe what to do at each iteration. Make the data shape match the access shape.
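The classic demonstration of "data shape matches access shape" is summing a row-major 2D grid along rows vs along columns — same arithmetic, very different memory traffic. A minimal sketch:

```cpp
#include <cstddef>
#include <vector>

// rows × cols grid stored row-major in one contiguous vector.
// The row-major walk touches consecutive bytes, so each 64-byte cache
// line serves 16 ints. The column-major walk strides `cols * 4` bytes,
// so with a wide grid every access lands on a fresh line.
long long sum_row_major(const std::vector<int>& g, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += g[r * cols + c];      // unit stride — matches the layout
    return s;
}

long long sum_col_major(const std::vector<int>& g, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += g[r * cols + c];      // stride of `cols` — fights the layout
    return s;
}
```

Both return the same sum; once the grid outgrows cache, the row-major walk is commonly several times faster.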
Two ways to store a million particles, each with position, velocity, color, mass:
Array of Structures (AoS):

```cpp
struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };
struct Particle {
    Vec3 pos, vel;
    Vec4 color;
    float mass;
};
std::vector<Particle> particles; // every read pulls all fields into cache
```

Structure of Arrays (SoA):
```cpp
struct Particles {
    std::vector<Vec3> positions;
    std::vector<Vec3> velocities;
    std::vector<Vec4> colors;
    std::vector<float> masses;
};

// Physics step touches only positions and velocities — color/mass stay out of cache.
void step(Particles& p, float dt) {
    for (size_t i = 0; i < p.positions.size(); ++i) {
        p.positions[i].x += p.velocities[i].x * dt;
        p.positions[i].y += p.velocities[i].y * dt;
        p.positions[i].z += p.velocities[i].z * dt;
    }
}
```

If you only need positions and velocities (e.g., a physics step), SoA loads only those fields — typically 2–4× the throughput. AoS is friendlier when you always touch the whole record.
SoA loops are also easy to vectorize — auto-vectorizers, OpenMP `#pragma omp simd`, and SIMD-first compilers like ISPC all handle unit-stride arrays well. AoS loops often defeat vectorization because the per-field stride doesn't map cleanly onto SIMD lanes.
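As an illustration, here is the position update for one axis written in pure-SoA form over raw arrays — a sketch, not taken from the text above; `__restrict` is the common GCC/Clang/MSVC spelling of the no-aliasing promise, and you should check your compiler's vectorization report to confirm SIMD codegen:

```cpp
#include <cstddef>

// Pure-SoA kernel: unit-stride loads and stores over independent arrays.
// Promising the compiler that the arrays don't alias removes the last
// barrier to emitting packed SIMD multiply-adds.
void integrate_x(float* __restrict pos_x, const float* __restrict vel_x,
                 float dt, size_t n) {
    for (size_t i = 0; i < n; ++i)
        pos_x[i] += vel_x[i] * dt;
}
```

The AoS version of the same loop strides by `sizeof(Particle)`, which auto-vectorizers frequently give up on or vectorize via expensive gather instructions.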
Most types have a few "hot" fields (touched in tight loops) and many "cold" fields (touched rarely). Split them:
```cpp
// Bad — every cache line that holds a player's HP also drags in inventory + name + ...
struct Player {
    float hp;
    std::vector<int> inventory;      // ~24 bytes
    std::string name;                // ~32 bytes
    std::vector<std::string> quests; // ~24 bytes
};

// Better — hot data is dense.
struct PlayerHot { float hp; uint16_t cold_index; }; // 8 bytes
struct PlayerCold {
    std::vector<int> inv;
    std::string name;
    std::vector<std::string> quests;
};
std::vector<PlayerHot> hot;
std::vector<PlayerCold> cold;

// Tight loop: 8 cache lines cover 64 players' HP.
void apply_regen(std::vector<PlayerHot>& hot, float amount) {
    for (auto& p : hot) p.hp += amount;
}
```

A loop that updates HP touches only hot[] — 8 bytes per player instead of hundreds. The cold table is paid for only on the rare access.
In descending order of cache-friendliness:

| Container | Layout | When to use |
|---|---|---|
| `std::vector` | One block | Default; almost always |
| `std::array` | One block on stack | Known fixed size |
| `std::deque` | Chunked blocks | Push/pop both ends |
| `std::flat_map` (C++23) | Sorted vector | Small map, mostly reads |
| `std::list` | Pointer-chained nodes | Almost never (O(1) splice excepted) |
| `std::map` / `std::set` | Tree nodes, scattered | Avoid for hot paths — use a sorted vector |
| `std::unordered_map` | Buckets + chains | OK for big tables; chains hurt locality |
The lesson: default to std::vector. Other containers should justify themselves with a measured win.
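As a concrete instance of "sorted vector over `std::map`": binary search in a contiguous array does the same O(log n) compares as the tree, but without per-node allocations, and the final probes land near data already in cache. A minimal sketch of the layout that C++23's `std::flat_map` standardizes (`flat_find` is my name for the helper):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Keys and values in one sorted vector of pairs — contiguous, allocation-free
// lookups, unlike the scattered nodes of std::map.
using KV = std::pair<int, int>;

// Returns a pointer to the value for `key`, or nullptr if absent.
const int* flat_find(const std::vector<KV>& m, int key) {
    auto it = std::lower_bound(m.begin(), m.end(), key,
        [](const KV& kv, int k) { return kv.first < k; });
    return (it != m.end() && it->first == key) ? &it->second : nullptr;
}
```

Build the vector once with `std::sort`; this wins when reads dominate writes, which is the common case for lookup tables.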
In a Node*-chained structure, the CPU pointer-chases through unpredictable memory. Indices into a contiguous backing array are predictable and pre-fetchable:
```cpp
// Pointer-chasing — slow. Each `next` may live anywhere in memory.
struct Node {
    Node* next;
    int v;
};

// Indexed — fast. The links live in a packed array; the CPU can prefetch the stride.
// Bonus: 32-bit indices vs 64-bit pointers.
std::vector<int> values;
std::vector<int32_t> next_index; // -1 for end

int sum_list(const std::vector<int>& values,
             const std::vector<int32_t>& next_index,
             int32_t head) {
    int s = 0;
    for (int32_t i = head; i != -1; i = next_index[i]) s += values[i];
    return s;
}
```

Bonus: indices stay valid when the backing array reallocates (they're offsets, not addresses) and can be 32-bit even on 64-bit machines.
Branch mispredicts cost cycles. Sentinels remove "did we run off the end?" checks:
```cpp
// Branchy — two conditions per iteration.
size_t find_branchy(const int* a, size_t n, int target) {
    for (size_t i = 0; i < n; ++i) {
        if (a[i] == target) return i;
    }
    return n;
}

// Sentinel-based — one condition per iteration.
// Caller must reserve n+1 slots so a[n] is writable.
size_t find_sentinel(int* a, size_t n, int target) {
    a[n] = target; // sentinel guarantees a hit
    size_t i = 0;
    while (a[i] != target) ++i;
    return i; // == n if not found
}
```

Modern compilers can sometimes do this for you; sometimes they can't. Profile.
When the next access is computable but far ahead, hint the CPU to start loading it:
```cpp
void process(int x);

void run(const int* data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n) {
            __builtin_prefetch(&data[i + 16], 0, 1); // GCC/Clang builtin
        }
        process(data[i]);
    }
}
```

`__builtin_prefetch` is the GCC/Clang builtin; `_mm_prefetch` (from `<xmmintrin.h>`) is the x86 intrinsic equivalent. Prefetching is useful for index-driven structures (hash table lookups, BVH traversal, B-tree walks) — and useless if you're already bound on memory bandwidth rather than latency.
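The hash-table case deserves a sketch, because the batched pattern isn't obvious: prefetch a whole batch of keys' home slots first, then probe them, so the DRAM latencies overlap instead of being paid serially. Everything here is a hypothetical layout of my own (open addressing in a power-of-two vector, 0 meaning empty, linear probing), and `__builtin_prefetch` is GCC/Clang-specific:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Cheap 64-bit hash (the splitmix64 finalizer).
uint64_t mix(uint64_t k) {
    k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
    return k ^ (k >> 33);
}

// Count how many of `keys` are present in an open-addressed table
// (power-of-two size, 0 = empty slot, linear probing).
int count_present(const std::vector<uint64_t>& table,
                  const std::vector<uint64_t>& keys) {
    const uint64_t mask = table.size() - 1;
    int found = 0;
    for (size_t i = 0; i < keys.size(); i += 8) {
        const size_t end = std::min(keys.size(), i + 8);
        for (size_t j = i; j < end; ++j)               // phase 1: hint the loads
            __builtin_prefetch(&table[mix(keys[j]) & mask], 0, 1);
        for (size_t j = i; j < end; ++j) {             // phase 2: probe for real
            uint64_t h = mix(keys[j]) & mask;
            while (table[h] != 0 && table[h] != keys[j]) h = (h + 1) & mask;
            found += (table[h] == keys[j]);
        }
    }
    return found;
}
```

The batch width (8 here) is a tuning knob: wide enough to hide the miss latency, narrow enough that the prefetched lines are still resident when phase 2 reaches them.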
Don't guess. Tools:
- `perf stat` (Linux): cycles, instructions, cache-references, cache-misses, branch-misses.
  `perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./bench`
- `perf record` + `perf report`: which lines are missing the cache.
- `likwid-perfctr`: cleaner CLI for x86 perf counters.
- VTune (Intel), uProf (AMD): GUI views of memory hierarchy events.
- Cachegrind (Valgrind): simulated cache miss rates, no hardware counters needed.
Headline numbers to watch:
- IPC (instructions per cycle). >2 is good, <1 usually means memory-bound.
- L1/L2/L3 miss rates. Each level miss adds latency proportional to next-level cost.
- Branch miss rate. >5% is a flag.
- Default to `std::vector` of dense structs. Reach elsewhere with cause.
- SoA when hot loops touch only a subset of fields.
- Hot/cold split when types have many fields touched at very different frequencies.
- Contiguous over linked, indices over pointers.
- Pack hot data into cache-line-sized structs (64 bytes typical).
- Avoid sharing cache lines between threads — see False Sharing & NUMA.
- Prefetch only when measured to help; the compiler is often already doing it.
- Measure, don't intuit. The first cache-line you save almost always points to ten more.
- Mike Acton's CppCon 2014 talk: Data-Oriented Design and C++ — required watching.
- What Every Programmer Should Know About Memory, Ulrich Drepper.
- False Sharing and NUMA
- Branch Prediction and SIMD
- Memory Management Strategies
- Computer Architecture: A Quantitative Approach, Hennessy & Patterson — the textbook.