perf
Reference: cnblogs post
Basic Workflow
- Compile with -g to embed debug symbols. This increases binary size but does not affect runtime performance.
- Record and convert the trace:
  perf record -F 99 -a -g -- MY_EXE && perf script -i perf.data > perf.unfold
- Visualize with one of:
  - FlameGraph:

        git clone https://github.com/brendangregg/FlameGraph.git
        ./FlameGraph/stackcollapse-perf.pl perf.unfold > perf.folded
        ./FlameGraph/flamegraph.pl perf.folded > perf.svg

  - Speedscope (recommended): drag perf.unfold onto the page.
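Before opening a flame graph, it can help to sanity-check the folded file directly. Each line of perf.folded is a semicolon-separated stack followed by a sample count, so a short awk pass can rank the leaf frames, i.e. the functions that were actually on-CPU. This is a minimal sketch; `top_leaf_frames` is a made-up helper name, and it assumes the usual `stackcollapse-perf.pl` line format.

```shell
# top_leaf_frames FILE [N]: rank leaf frames in a folded-stack file by sample count.
# Assumes each line looks like "frame1;frame2;...;leaf COUNT".
top_leaf_frames() {
    awk 'NF {
        count = $NF                      # last whitespace-separated field is the sample count
        sub(/ [0-9]+$/, "", $0)          # drop the count, keep only the stack
        n = split($0, frames, ";")
        samples[frames[n]] += count      # credit the samples to the leaf frame
    } END {
        for (f in samples) printf "%d %s\n", samples[f], f
    }' "$1" | sort -k1,1nr -k2 | head -n "${2:-10}"
}
```

Running `top_leaf_frames perf.folded 20` would list the 20 hottest leaf frames, which is a quick way to confirm that symbols were resolved before spending time on visualization.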
Perf in a Docker Container
Running perf inside Docker has a few quirks, especially with custom kernels (e.g., System76 / Pop!_OS).
1. Missing kernel-matched perf binary
If you see:
WARNING: perf not found for kernel 6.9.3-76060903
You may need to install the following packages for this specific kernel:
linux-tools-6.9.3-76060903-generic
Custom kernel builds (System76, etc.) often have no matching linux-tools-<version> package in the default Ubuntu repos, so apt install linux-tools-$(uname -r) will fail inside the container.
Fix: Perf’s ABI is very stable (perf_event_open(2) is backward/forward compatible), so copy the host’s perf binary directly into the container:
cp /usr/lib/linux-tools-$(uname -r)/perf ./perf-copy
Do not copy /usr/bin/perf: it is a shell wrapper, not the real binary.
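One way to avoid copying the wrapper by mistake is to check for the ELF magic bytes before copying. The sketch below is illustrative only: `is_elf` and `find_real_perf` are hypothetical helpers, and the search paths are assumptions based on Ubuntu-style linux-tools packaging, so adjust them for other distributions.

```shell
# is_elf FILE: succeed if FILE starts with the 4-byte ELF magic (0x7f 'E' 'L' 'F').
is_elf() {
    [ -f "$1" ] || return 1
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}

# find_real_perf: print the first ELF perf binary found in the assumed locations.
# Unmatched globs stay literal and are rejected by the -f check in is_elf.
find_real_perf() {
    for candidate in \
        "/usr/lib/linux-tools-$(uname -r)/perf" \
        /usr/lib/linux-tools-*/perf \
        /usr/lib/linux-tools/*/perf
    do
        if is_elf "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

On the host, `cp "$(find_real_perf)" ./perf-copy` would then pick up a real binary, and `is_elf /usr/bin/perf` should fail on systems where that path is the wrapper script.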
2. Required docker-compose settings
privileged: true # or at least cap_add: [SYS_ADMIN, PERFMON]
pid: "host"
privileged / SYS_ADMIN / PERFMON are required for perf_event_open. pid: "host" lets perf trace real host PIDs and resolve symbols correctly.
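To confirm that the compose settings actually took effect, the effective capability mask in /proc/self/status can be decoded from inside the container. This is a rough sketch assuming a Linux /proc layout; `has_cap` is a hypothetical helper, and the bit numbers (21 for CAP_SYS_ADMIN, 38 for CAP_PERFMON) are taken from linux/capability.h.

```shell
# has_cap MASK BIT: succeed if bit BIT is set in the hex capability mask MASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bit numbers as defined in <linux/capability.h>.
CAP_SYS_ADMIN=21
CAP_PERFMON=38

# Effective capability mask of this process (stays empty if /proc is unavailable).
cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null)

if [ -n "$cap_eff" ]; then
    if has_cap "$cap_eff" "$CAP_SYS_ADMIN"; then echo "CAP_SYS_ADMIN: present"; else echo "CAP_SYS_ADMIN: missing"; fi
    if has_cap "$cap_eff" "$CAP_PERFMON"; then echo "CAP_PERFMON: present"; else echo "CAP_PERFMON: missing"; fi
fi
```

If both capabilities show as missing, the container was most likely started without `privileged` or the `cap_add` entries, and perf_event_open will keep failing regardless of the perf binary used.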
3. Lowering the paranoid level
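If perf complains about permissions, lower the paranoid level from inside the container. The container must be privileged for this to work, and because the setting is kernel-wide it also applies to the host: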
sudo sysctl -w kernel.perf_event_paranoid=0
sudo sysctl --system
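It is worth checking the current level before and after changing it. The sketch below reads the value and prints a short summary of what an unprivileged user may do at that level. The descriptions paraphrase the kernel's sysctl documentation, `describe_paranoid` is a made-up helper, and some distributions define extra levels that fall into the catch-all branch.

```shell
# describe_paranoid LEVEL: summarize what an unprivileged user may do at LEVEL.
describe_paranoid() {
    case "$1" in
        -1) echo "no restrictions" ;;
        0)  echo "CPU-wide and kernel profiling allowed; raw tracepoints restricted" ;;
        1)  echo "per-process user and kernel profiling allowed; CPU-wide events restricted" ;;
        2)  echo "user-space profiling of own processes only" ;;
        *)  echo "unknown or distribution-specific level: $1" ;;
    esac
}

# Read the live value; fall back to "unknown" where /proc is not available.
level=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo unknown)
echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
```

Since the recording command in the basic workflow uses `-a` (system-wide), a process without the capabilities above needs level 0 or lower, which is why 0 is the value set here.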
Profiling
Timing and Resource Metrics
| Metric | Description | How to measure |
|---|---|---|
| Elapsed | Wall-clock time (includes sleep/wait) | std::chrono, /usr/bin/time, benchmark timers |
| CPU Time | User+system CPU time; can exceed elapsed when multithreaded | /usr/bin/time -p/-v, getrusage(RUSAGE_SELF), clock_gettime(CLOCK_PROCESS_CPUTIME_ID) |
| Peak RSS | Peak resident physical memory (not virtual) | /usr/bin/time -v, /proc/<pid>/status (VmHWM for the peak, VmRSS for the current value), ps -o rss (current value only), smem |
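For the /proc route in the last row, the peak and current values sit on separate lines of the status file. A small sketch of pulling them out, assuming the Linux status format of `Field:<whitespace>value kB`; `mem_kb` is a hypothetical helper name.

```shell
# mem_kb FIELD STATUS_FILE: print FIELD in kB from a /proc/<pid>/status file.
# VmHWM is the peak resident set size ("high water mark"); VmRSS is the current one.
mem_kb() {
    awk -v key="$1:" '$1 == key {print $2}' "$2"
}

# Example: inspect the current shell, if /proc is available.
if [ -r "/proc/$$/status" ]; then
    echo "peak RSS:    $(mem_kb VmHWM "/proc/$$/status") kB"
    echo "current RSS: $(mem_kb VmRSS "/proc/$$/status") kB"
fi
```

Note that the status file disappears when the process exits, so for short-lived benchmarks `/usr/bin/time -v` is usually the more practical way to get the peak.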
CPU Performance Counter Metrics
These require hardware perf counters to be enabled (may show “n/a” otherwise).
| Metric | Description |
|---|---|
| Cycles | CPU clock cycles (hardware counter) |
| Instructions | Retired instructions (hardware counter) |
| IPC | Instructions per cycle = instructions / cycles (higher is better) |
| CPI | Cycles per instruction = cycles / instructions (lower is better) |
| Branch miss rate | Wrong branch predictions / total branches |
| Cache miss rate | Last-level cache misses / references |
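The derived rows in the table are plain ratios of the raw counters. As a worked illustration, the sketch below computes them from six counts passed on the command line; `derived_metrics` is a made-up helper, and it assumes every denominator is nonzero.

```shell
# derived_metrics CYCLES INSTRUCTIONS BRANCHES BRANCH_MISSES CACHE_REFS CACHE_MISSES
# Prints IPC, CPI, and the two miss rates. Assumes all denominators are nonzero.
derived_metrics() {
    awk -v cyc="$1" -v ins="$2" -v br="$3" -v brm="$4" -v cr="$5" -v cm="$6" 'BEGIN {
        printf "IPC %.2f\n", ins / cyc                       # instructions per cycle
        printf "CPI %.2f\n", cyc / ins                       # cycles per instruction
        printf "branch-miss-rate %.2f%%\n", 100 * brm / br   # mispredicted / total branches
        printf "cache-miss-rate %.2f%%\n", 100 * cm / cr     # LLC misses / references
    }'
}
```

For example, `derived_metrics 4000 8000 1000 25 400 100` reports an IPC of 2.00, a CPI of 0.50, a branch miss rate of 2.50% and a cache miss rate of 25.00%.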
Collect all at once with:
perf stat -r 5 \
-e task-clock,cycles,instructions,branches,branch-misses,cache-references,cache-misses \
-- ./your_benchmark_binary --your-flags