# perf

> Reference: cnblogs post

## Prerequisites: Permissions
By default, many distros set `kernel.perf_event_paranoid=4`, which blocks hardware counter access for unprivileged users. Check and fix:
```bash
cat /proc/sys/kernel/perf_event_paranoid   # 4 = very restricted, -1 = fully open

# Temporarily lower it (until next reboot):
sudo sysctl -w kernel.perf_event_paranoid=0

# Make it permanent:
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf
```
| Value | Effect |
|---|---|
| -1 | All events allowed for all users |
| 0 | No raw/ftrace tracepoint access; CPU events still allowed (recommended for dev machines) |
| 1 | Additionally, no CPU event access for unprivileged users |
| ≥ 2 | Additionally, no kernel profiling |
| 4 | Heavily restricted (default on some distros) |
## Basic Workflow

1. Compile with `-g` to embed debug symbols. This increases binary size but does not affect runtime performance.
2. Record and convert the trace:

   ```bash
   perf record -F 99 -a -g -- MY_EXE && perf script -i perf.data &> perf.unfold
   ```

3. Visualize with one of:
   - FlameGraph:

     ```bash
     git clone https://github.com/brendangregg/FlameGraph.git
     ./FlameGraph/stackcollapse-perf.pl perf.unfold &> perf.folded
     ./FlameGraph/flamegraph.pl perf.folded > perf.svg
     ```

   - Speedscope (recommended): drag `perf.unfold` onto the page.
-
## Perf in a Docker Container
Running perf inside Docker has a few quirks, especially with custom kernels (e.g., System76 / Pop!_OS).
### 1. Missing kernel-matched perf binary

If you see:

```
WARNING: perf not found for kernel 6.9.3-76060903
You may need to install the following packages for this specific kernel:
  linux-tools-6.9.3-76060903-generic
```
Custom kernel builds (System76, etc.) often have no matching `linux-tools-<version>` package in the default Ubuntu repos, so `apt install linux-tools-$(uname -r)` will fail inside the container.

**Fix:** perf's ABI is very stable (`perf_event_open(2)` is backward/forward compatible), so copy the host's perf binary directly into the container:
```bash
cp /usr/lib/linux-tools-$(uname -r)/perf ./perf-copy
```

Do not copy `/usr/bin/perf`: it is a shell wrapper, not the real binary.
### 2. Required docker-compose settings

```yaml
privileged: true   # or at least cap_add: [SYS_ADMIN, PERFMON]
pid: "host"
```

`privileged`/`SYS_ADMIN`/`PERFMON` are required for `perf_event_open`. `pid: "host"` lets `perf` trace real host PIDs and resolve symbols correctly.
### 3. Lowering the paranoid level

If perf complains about permissions, run inside the container:

```bash
sudo sysctl -w kernel.perf_event_paranoid=0
sudo sysctl --system
```
## Profiling

### Timing and Resource Metrics
| Metric | Description | How to measure |
|---|---|---|
| Elapsed | Wall-clock time (includes sleep/wait) | std::chrono, /usr/bin/time, benchmark timers |
| CPU Time | User+system CPU time; can exceed elapsed when multithreaded | /usr/bin/time -p/-v, getrusage(RUSAGE_SELF), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) |
| Peak RSS | Peak physical memory resident (not virtual) | /usr/bin/time -v, /proc/<pid>/status (VmHWM/VmRSS), ps -o rss, smem |
### CPU Performance Counter Metrics
These require hardware perf counters to be enabled (may show “n/a” otherwise).
| Metric | Description |
|---|---|
| Cycles | CPU clock cycles (hardware counter) |
| Instructions | Retired instructions (hardware counter) |
| IPC | Instructions per cycle = instructions / cycles (higher is better) |
| CPI | Cycles per instruction = cycles / instructions (lower is better) |
| Branch miss rate | Wrong branch predictions / total branches |
| Cache miss rate | Last-level cache misses / references |
Collect all at once with:
```bash
perf stat -r 5 \
  -e task-clock,cycles,instructions,branches,branch-misses,cache-references,cache-misses \
  -- ./your_benchmark_binary --your-flags
```
## Hands-on Example
The program below exercises all the metrics above in one binary:
- Sequential sum — cache-friendly, high IPC, low cache-miss rate baseline.
- Random scatter — cache-unfriendly, drives up LLC miss rate.
- Unpredictable branches — values are random so the branch predictor is ~50 % wrong.
- Large allocation — 256 MB array → visible Peak RSS.
- `std::chrono` timer — prints elapsed time from inside the program.
```cpp
// perf_demo.cpp
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

static const size_t N = 64 * 1024 * 1024; // 64 M elements = 256 MB (int)

// Cache-friendly sequential sum
long long sequential_sum(const std::vector<int>& v) {
    long long s = 0;
    for (size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Cache-unfriendly random scatter read
long long random_sum(const std::vector<int>& v,
                     const std::vector<size_t>& idx) {
    long long s = 0;
    for (size_t i : idx) s += v[i];
    return s;
}

// Unpredictable branches (random input → ~50 % misprediction)
long long branch_heavy(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v) {
        if (x & 1) s += x; // odd/even depends on random data
        else s -= x;
    }
    return s;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 1 << 20);

    std::cout << "Allocating " << N * sizeof(int) / (1024 * 1024) << " MB...\n";
    std::vector<int> data(N);
    for (auto& x : data) x = dist(rng);

    // Random index permutation for scatter access
    std::vector<size_t> idx(N);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);

    auto t0 = std::chrono::steady_clock::now();
    volatile long long r1 = sequential_sum(data);
    volatile long long r2 = random_sum(data, idx);
    volatile long long r3 = branch_heavy(data);
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::cout << "Results (prevent DCE): " << r1 << " " << r2 << " " << r3 << "\n";
    std::cout << "Elapsed: " << ms << " ms\n";
}
```
- `sequential_sum` gives high IPC (~2–4) and a low cache-miss rate (<1 %).
- `random_sum` drives the LLC miss rate to 20–40 % and drops IPC below 1.
- `branch_heavy` pushes the branch-miss rate toward 50 %.
Compile and run:
```bash
g++ -O2 -g -o perf_demo perf_demo.cpp
./perf_demo
```
Timing + RSS (`/usr/bin/time -v`):

```bash
/usr/bin/time -v ./perf_demo 2>&1 | grep -E "wall clock|Maximum resident"
```
CPU counters (`perf stat`):

```bash
perf stat -e task-clock,cycles,instructions,branches,branch-misses,cache-references,cache-misses \
  -- ./perf_demo
```
### Expected Output (actual run, hardware varies)

```
Performance counter stats for './perf_demo':

    33,189,508,178      task-clock          #    0.998 CPUs utilized
    43,898,685,981      cycles              #    1.323 GHz
    67,699,556,400      instructions        #    1.54  insn per cycle
     9,385,632,421      branches            #  282.789 M/sec
       103,063,756      branch-misses       #    1.10% of all branches
       223,694,419      cache-references    #    6.740 M/sec
       165,054,219      cache-misses        #   73.79% of all cache refs

      33.271095641 seconds time elapsed

      32.689158000 seconds user
       0.504431000 seconds sys
```
### Time breakdown
| Field | Value | Meaning |
|---|---|---|
| Elapsed | 33.27 s | Wall-clock time |
| User | 32.69 s | Time in user code |
| Sys | 0.50 s | Time in kernel (allocations, syscalls) |
| task-clock | 33.19 s | CPU time charged to the process |
| CPUs utilized | 0.998 | ~1 core active — program is single-threaded |
### Cycles and effective frequency
- 43.9 B cycles / 33.2 s ≈ 1.32 GHz effective frequency.
- This is below typical turbo speeds (2–4 GHz). Possible causes: power/thermal throttling, container CPU quota, or frequent stalls causing the CPU to clock-gate.
### IPC — Instructions Per Cycle
1.54 IPC — the CPU retired 1.54 instructions on average every cycle.
| IPC | Interpretation |
|---|---|
| < 0.5 | Severely stalled (memory-bound or branch-heavy) |
| ~1.0 | Moderate utilisation |
| 1.5–2.5 | Good (this run falls here) |
| 3–4 | Excellent (compute-bound, vectorised) |
Cross-check: 1.32 GHz × 1.54 IPC ≈ 2.03 billion instructions/sec, which matches 67.7 B instructions / 33.3 s.
### Branch misses
A branch misprediction occurs when the CPU’s branch predictor guesses the wrong direction for a conditional jump. Modern out-of-order CPUs speculatively execute instructions along the predicted path. When the prediction is wrong the pipeline must be flushed and re-executed from the correct target — wasting ~15–20 cycles per miss.
Common causes:
- Data-dependent conditions on random input (e.g., `if (x & 1)` where `x` is random → ~50 % miss rate).
- Indirect calls / virtual dispatch — the target address is computed at runtime.
- Loop bounds that vary — the predictor can't learn a fixed pattern.
- Correlated branches deep in call chains.
This run: 103 M misses / 9.39 B branches = 1.10 % — normal range for a mix of predictable and unpredictable code.
| Miss rate | Interpretation |
|---|---|
| < 1 % | Excellent |
| 1–3 % | Normal |
| > 10 % | Problematic |
### Cache misses

- 223 M cache-references, 165 M cache-misses → 73.79 % looks alarming, but perf's `cache-misses` measures Last Level Cache (LLC / L3) misses, not L1/L2.
- The denominator (`cache-references`) counts only the subset of accesses that reach L3, not all memory accesses, so the ratio can appear inflated.
- In absolute terms: 165 M LLC misses / 33 s ≈ 5 M misses/sec — not unusually high.
- The `random_sum` workload (random scatter over 256 MB) is the main driver; it defeats hardware prefetchers and forces most accesses all the way to DRAM.
### Flame Graph (perf record)

```bash
perf record -F 99 -a -g -- ./perf_demo
perf script -i perf.data &> perf.unfold
# drag perf.unfold to https://www.speedscope.app
```