Perf

Posted by Rico's Nerd Cluster on May 25, 2023

Reference: cnblogs post

Prerequisites: Permissions

By default, many distros set perf_event_paranoid=4, which blocks hardware counter access for unprivileged users. Check and fix:

cat /proc/sys/kernel/perf_event_paranoid   # 4 = very restricted, -1 = fully open

# Temporarily lower it (until next reboot):
sudo sysctl -w kernel.perf_event_paranoid=0

# Make it permanent:
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf
Value | Effect
-1 | No restrictions; all events allowed for all users
0 | Raw tracepoint / ftrace access disallowed for unprivileged users (recommended for dev machines)
1 | Additionally disallows system-wide (per-CPU) event access
2 | Additionally disallows kernel profiling (user-space measurements only)
3–4 | All unprivileged use blocked (distro patch; default on some distros)

Basic Workflow

  1. Compile with -g to embed debug symbols. This increases binary size but does not affect runtime performance.

  2. Record and convert the trace:

     perf record -F 99 -a -g -- MY_EXE && perf script -i perf.data > perf.unfold
    
  3. Visualize with one of:

    • FlameGraph

        git clone https://github.com/brendangregg/FlameGraph.git
        ./FlameGraph/stackcollapse-perf.pl perf.unfold > perf.folded
        ./FlameGraph/flamegraph.pl perf.folded > perf.svg
      
    • Speedscope (recommended) — drag perf.unfold onto the page.

Perf in a Docker Container

Running perf inside Docker has a few quirks, especially with custom kernels (e.g., System76 / Pop!_OS).

1. Missing kernel-matched perf binary

If you see:

WARNING: perf not found for kernel 6.9.3-76060903
  You may need to install the following packages for this specific kernel:
    linux-tools-6.9.3-76060903-generic

Custom kernel builds (System76, etc.) often have no matching linux-tools-<version> package in the default Ubuntu repos, so apt install linux-tools-$(uname -r) will fail inside the container.

Fix: Perf’s ABI is very stable (perf_event_open(2) is backward/forward compatible), so copy the host’s perf binary directly into the container:

cp /usr/lib/linux-tools-$(uname -r)/perf ./perf-copy

Do not copy /usr/bin/perf — it is a shell wrapper, not the real binary.

2. Required docker-compose settings

privileged: true      # or at least cap_add: [SYS_ADMIN, PERFMON]
pid: "host"
  • privileged / SYS_ADMIN / PERFMON are required for perf_event_open.
  • pid: "host" lets perf trace real host PIDs and resolve symbols correctly.

3. Lowering the paranoid level

If perf complains about permissions, run inside the container:

sudo sysctl -w kernel.perf_event_paranoid=0   # takes effect immediately
sudo sysctl --system   # re-applies /etc/sysctl.d/* if you persisted the setting

Profiling

Timing and Resource Metrics

Metric | Description | How to measure
Elapsed | Wall-clock time (includes sleep/wait) | std::chrono, /usr/bin/time, benchmark timers
CPU Time | User + system CPU time; can exceed elapsed when multithreaded | /usr/bin/time -p/-v, getrusage(RUSAGE_SELF), clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
Peak RSS | Peak physical memory resident (not virtual) | /usr/bin/time -v, /proc/<pid>/status (VmHWM/VmRSS), ps -o rss, smem

CPU Performance Counter Metrics

These require hardware perf counters to be enabled (may show “n/a” otherwise).

Metric | Description
Cycles | CPU clock cycles (hardware counter)
Instructions | Retired instructions (hardware counter)
IPC | Instructions per cycle = instructions / cycles (higher is better)
CPI | Cycles per instruction = cycles / instructions (lower is better)
Branch miss rate | Mispredicted branches / total branches
Cache miss rate | Last-level cache misses / cache references

Collect all at once with:

perf stat -r 5 \
  -e task-clock,cycles,instructions,branches,branch-misses,cache-references,cache-misses \
  -- ./your_benchmark_binary --your-flags

Hands-on Example

The program below exercises all the metrics above in one binary:

  • Sequential sum — cache-friendly, high IPC, low cache-miss rate baseline.
  • Random scatter — cache-unfriendly, drives up LLC miss rate.
  • Unpredictable branches — values are random so the branch predictor is ~50 % wrong.
  • Large allocation — 256 MB array → visible Peak RSS.
  • std::chrono timer — prints elapsed time from inside the program.
// perf_demo.cpp
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

static const size_t N = 64 * 1024 * 1024;   // 64 M elements = 256 MB (int)

// Cache-friendly sequential sum
long long sequential_sum(const std::vector<int>& v) {
    long long s = 0;
    for (size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Cache-unfriendly random scatter read
long long random_sum(const std::vector<int>& v,
                     const std::vector<size_t>& idx) {
    long long s = 0;
    for (size_t i : idx) s += v[i];
    return s;
}

// Unpredictable branches (random input → ~50 % misprediction)
long long branch_heavy(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v) {
        if (x & 1) s += x;     // odd/even depends on random data
        else        s -= x;
    }
    return s;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 1 << 20);

    std::cout << "Allocating " << N * sizeof(int) / (1024 * 1024) << " MB...\n";
    std::vector<int> data(N);
    for (auto& x : data) x = dist(rng);

    // Random index permutation for scatter access
    std::vector<size_t> idx(N);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);

    auto t0 = std::chrono::steady_clock::now();

    volatile long long r1 = sequential_sum(data);
    volatile long long r2 = random_sum(data, idx);
    volatile long long r3 = branch_heavy(data);

    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    std::cout << "Results (prevent DCE): " << r1 << " " << r2 << " " << r3 << "\n";
    std::cout << "Elapsed: " << ms << " ms\n";
}
  • sequential_sum gives high IPC (~2–4) and low cache-miss rate (<1 %).
  • random_sum drives LLC miss rate to 20–40 % and drops IPC below 1.
  • branch_heavy pushes branch-miss rate toward 50 %.

Compile and run:

g++ -O2 -g -o perf_demo perf_demo.cpp
./perf_demo

Timing + RSS (/usr/bin/time -v):

/usr/bin/time -v ./perf_demo 2>&1 | grep -E "wall clock|Maximum resident"

CPU counters (perf stat):

perf stat -e task-clock,cycles,instructions,branches,branch-misses,cache-references,cache-misses \
  -- ./perf_demo

Expected Output (actual run, hardware varies)

 Performance counter stats for './perf_demo':

    33,189,508,178      task-clock         #    0.998 CPUs utilized
    43,898,685,981      cycles             #    1.323 GHz
    67,699,556,400      instructions       #    1.54  insn per cycle
     9,385,632,421      branches           #  282.789 M/sec
       103,063,756      branch-misses      #    1.10% of all branches
       223,694,419      cache-references   #    6.740 M/sec
       165,054,219      cache-misses       #   73.79% of all cache refs

      33.271095641 seconds time elapsed
      32.689158000 seconds user
       0.504431000 seconds sys

Time breakdown

Field | Value | Meaning
Elapsed | 33.27 s | Wall-clock time
User | 32.69 s | Time in user code
Sys | 0.50 s | Time in kernel (allocations, syscalls)
task-clock | 33.19 s | CPU time charged to the process
CPUs utilized | 0.998 | ~1 core active — program is single-threaded

Cycles and effective frequency

  • 43.9 B cycles / 33.2 s ≈ 1.32 GHz effective frequency.
  • This is below typical turbo speeds (2–4 GHz). Possible causes: power/thermal throttling, container CPU quota, or frequent stalls causing the CPU to clock-gate.

IPC — Instructions Per Cycle

1.54 IPC — the CPU retired 1.54 instructions on average every cycle.

IPC | Interpretation
< 0.5 | Severely stalled (memory-bound or branch-heavy)
~1.0 | Moderate utilisation
1.5–2.5 | Good (this run falls here)
3–4 | Excellent (compute-bound, vectorised)

Cross-check: 1.32 GHz × 1.54 IPC ≈ 2.03 billion instructions/sec, which matches 67.7 B instructions / 33.3 s.

Branch misses

A branch misprediction occurs when the CPU’s branch predictor guesses the wrong direction for a conditional jump. Modern out-of-order CPUs speculatively execute instructions along the predicted path. When the prediction is wrong the pipeline must be flushed and re-executed from the correct target — wasting ~15–20 cycles per miss.

Common causes:

  • Data-dependent conditions on random input (e.g., if (x & 1) where x is random → ~50 % miss rate).
  • Indirect calls / virtual dispatch — the target address is computed at runtime.
  • Loop bounds that vary — the predictor can’t learn a fixed pattern.
  • Correlated branches deep in call chains.

This run: 103 M misses / 9.39 B branches = 1.10 % — normal range for a mix of predictable and unpredictable code.

Miss rate | Interpretation
< 1 % | Excellent
1–3 % | Normal
> 10 % | Problematic

Cache misses

  • 223 M cache-references, 165 M cache-misses → 73.79 % looks alarming, but perf cache-misses measures Last Level Cache (LLC / L3) misses, not L1/L2.
  • The denominator (cache-references) counts only the subset of accesses that reach L3, not all memory accesses, so the ratio can appear inflated.
  • In absolute terms: 165 M LLC misses / 33 s ≈ 5 M misses/sec — not unusually high.
  • The random_sum workload (random scatter over 256 MB) is the main driver; it defeats hardware prefetchers and forces most accesses all the way to DRAM.

Flame Graph (perf record)

perf record -F 99 -a -g -- ./perf_demo
perf script -i perf.data > perf.unfold
# drag perf.unfold to https://www.speedscope.app