
C++ - [Memory Access - 1] Layout


Posted by Rico's Nerd Cluster on May 27, 2023

Memory Access: The Hidden Bottleneck


Why Memory Access Matters

Even as CPU speeds continue to improve, the cost of fetching data from memory has become a limiting factor. Operating systems mitigate this by caching frequently used data in RAM, which is much faster than disk. However, we can't simply replace SSDs or HDDs with RAM: RAM is expensive, and it loses its contents when power is lost. RAM is what we usually call "memory".

To have fast memory access while still retaining disks, we rely on hardware-accelerated memory access, such as:

  • Fast, parallel RAM architectures
  • Advanced memory controller designs
  • CPU caches (L1, L2, L3)
  • Direct Memory Access (DMA) for device communication

The memory hierarchy is:

CPU Registers  →  L1 Cache  →  L2 Cache  →  L3 Cache  →  RAM  →  Disk
 (tiny, fast,                                                   (large,
  nanoseconds)                                                   slow)

Each level acts as a cache for the slower level to its right.

CPU Caches: Your Program’s Real Performance Friend

CPU caches sit inside the CPU and are even faster than RAM: their sizes typically range from 64 KB to 64 MB, and their latency is around 10 ns, compared to roughly 100 ns for RAM. Caches work by keeping frequently accessed data close to the processor:

  • L1: Smallest, fastest, per-core.
  • L2: Bigger, a bit slower.
  • L3: Shared across cores, larger still.
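
If you want to check the cache sizes on your own machine: on Linux with glibc, sysconf exposes them. A quick sketch; note that these _SC_ constants are glibc extensions and may report 0 where the kernel doesn't publish a value.

#include <unistd.h>
#include <cstdio>

int main() {
    // glibc-specific queries; they return 0 or -1 if the value is unknown
    long l1   = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3   = sysconf(_SC_LEVEL3_CACHE_SIZE);
    std::printf("L1d: %ld B (line: %ld B), L2: %ld B, L3: %ld B\n", l1, line, l2, l3);
}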

A cache line, typically 64 bytes, is the smallest unit of data transferred between the cache and memory. So reading arr[0] (with 4-byte integers) will likely bring in arr[0] through arr[15].
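
Here is a rough sketch of why this matters (the array size and the 16-int stride are my illustrative choices, not from any particular benchmark): a loop that touches only one int per cache line performs 1/16 of the reads, yet still pays for a full line fetch each time, so it often takes nearly as long as reading everything.

#include <chrono>
#include <cstdio>
#include <vector>

volatile long long sink; // keeps the sums observable so the loops aren't optimized away

int main() {
    constexpr size_t N = 1 << 24; // 16M ints (~64 MB), far larger than any cache
    std::vector<int> arr(N, 1);
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i) sum += arr[i];     // sequential: one line fetch feeds 16 ints
    auto t1 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; i += 16) sum += arr[i]; // strided: one line fetch per int used
    auto t2 = std::chrono::steady_clock::now();
    sink = sum;

    using us = std::chrono::microseconds;
    std::printf("sequential: %lld us\n", (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    std::printf("strided:    %lld us\n", (long long)std::chrono::duration_cast<us>(t2 - t1).count());
}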

Caches are organized into sets and associativity levels:

  • For example, in a 4-way set-associative cache, each set holds 4 cache lines.
  • Memory addresses are mapped to these sets — which leads to a few interesting behaviors.
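
To make the mapping concrete, here is a small sketch using a hypothetical geometry (32 KB cache, 4-way set-associative, 64-byte lines; these numbers are illustrative, not a claim about any particular CPU). Two addresses exactly num_sets × line_size bytes apart, 8 KB here, always land in the same set:

#include <cstdint>
#include <cstdio>

// Hypothetical geometry: 32 KB cache, 4-way, 64-byte lines -> 128 sets.
constexpr uint64_t kLineSize  = 64;
constexpr uint64_t kWays      = 4;
constexpr uint64_t kCacheSize = 32 * 1024;
constexpr uint64_t kNumSets   = kCacheSize / (kLineSize * kWays); // 128

uint64_t set_index(const void* p) {
    auto addr = reinterpret_cast<uint64_t>(p);
    return (addr / kLineSize) % kNumSets; // which set this address falls into
}

int main() {
    static int a[4096]; // 16 KB of ints
    // a[2048] is 8 KB past a[0]: exactly kNumSets * kLineSize bytes,
    // so both map to the same set. Five or more such lines would
    // compete for the set's 4 ways and evict each other (conflict misses).
    std::printf("set(a[0])    = %llu\n", (unsigned long long)set_index(&a[0]));
    std::printf("set(a[2048]) = %llu\n", (unsigned long long)set_index(&a[2048]));
}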

Cache Miss Types

A cache miss happens when the data the CPU needs is not in the cache, so it has to be fetched from slower memory (RAM or even disk).

  • Cold Miss: First-time access.

  • Conflict Miss: Two memory addresses map to the same cache set. If that set is full, one must be evicted — even if the rest of the cache is empty.

  • Capacity Miss: The working set is too large to fit in the cache.
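
Capacity misses are easy to see for yourself: sweep buffers of increasing size and watch the cost per element jump once a buffer outgrows each cache level. A rough sketch (the buffer sizes and sweep count are my choices; absolute numbers vary by machine):

#include <chrono>
#include <cstdio>
#include <vector>

volatile long long sink; // prevents the compiler from deleting the summing loops

// Repeatedly sweep a buffer of `bytes` bytes and report ns per element.
// Once the buffer outgrows a cache level, the time per element jumps:
// those are capacity misses.
double ns_per_element(size_t bytes) {
    std::vector<int> buf(bytes / sizeof(int), 1);
    const int sweeps = 32;
    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int s = 0; s < sweeps; ++s)
        for (int v : buf) sum += v;
    auto t1 = std::chrono::steady_clock::now();
    sink = sum;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / (double(sweeps) * buf.size());
}

int main() {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        std::printf("%6zu KB: %.2f ns/element\n", kb, ns_per_element(kb * 1024));
}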

False Sharing: An Invisible Source of Multithreading Slowdown

False sharing occurs when multiple threads access different variables that share a cache line. Even though each thread is logically isolated, the CPU sees contention due to hardware cache coherence protocols.

Example:

struct Counter {
    int a; // each thread increments its own Counter
};
std::vector<Counter> counters(num_threads);

If counters[0] and counters[1] sit on the same cache line, updates from different threads will repeatedly invalidate that line across cores, causing unnecessary slowdown. (The final result will still be correct; only performance suffers.) The fix is to align each counter to its own cache line:

struct alignas(64) SafeCounter {
    int a; // now each SafeCounter occupies its own 64-byte cache line
};

// Equivalent in C (GCC/Clang attribute syntax)
struct __attribute__((aligned(64))) SafeCounterC {
    int counter;
};

std::vector<SafeCounter> counters(num_threads);

The alignas(64) ensures that the start of each struct is aligned to a 64-byte boundary (the size of a typical cache line), so no two counters share a line. In a small test with 2 threads incrementing their counters, the version with alignas(64) took 3 ms, while without it the program took 12 ms.
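
For reference, here is a minimal sketch of that kind of two-thread benchmark (the names PackedCounter, PaddedCounter, and run_ms are mine, not from the original test, and exact timings will differ per machine; compile with -std=c++17 or later so the over-aligned vector elements are allocated correctly):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct PackedCounter { int a; };              // adjacent counters share a cache line
struct alignas(64) PaddedCounter { int a; };  // one counter per 64-byte line

template <typename Counter>
long long run_ms(int num_threads, int iters) {
    std::vector<Counter> counters(num_threads);
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&counters, t, iters] {
            // Writing through a volatile pointer forces every increment to
            // hit the cache line, so coherence traffic is actually exercised.
            volatile int* p = &counters[t].a;
            for (int i = 0; i < iters; ++i) *p = *p + 1;
        });
    for (auto& w : workers) w.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

int main() {
    const int threads = 2, iters = 10'000'000;
    std::printf("packed (shared line): %lld ms\n", run_ms<PackedCounter>(threads, iters));
    std::printf("padded (alignas(64)): %lld ms\n", run_ms<PaddedCounter>(threads, iters));
}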

TL;DR: Align and pad data structures to avoid false sharing!