C++ Profiling and Speed Ups

gprof, perf, CMake Release and Debug BoilerPlate, CMake Settings

Posted by Rico's Nerd Cluster on May 24, 2023

Gprof

GNU gprof provides CPU execution times (not wall time, so sleep is not accounted for) of functions and their subfunctions in percentage. Gprof outputs data in a text-based format, which can be difficult to interpret. This is where gprof2dot comes in—it converts the profiling data into a visual call graph that makes it easier to understand function relationships and execution costs.

Gprof limitations: - It measures user time, but not kernel time

Source

Youtube Reference

  1. Compile your program with -pg: gcc -pg -o my_program my_program.c or g++ -pg -o my_program my_program.cpp. In CMake:
1
2
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -pg")
set(CMAKE_EXE_LINKER_FLAGS_DEBUG "${CMAKE_EXE_LINKER_FLAGS_DEBUG} -pg")
  • -pg is specifcally for GNU gprof. It
    • Inserts instrumentation code into the executable to monitor function calls and execution times.
    • Generates a profiling report (gmon.out) when the program finishes running.
  • Profiling does not work well with compiler optimizations (-O2 or -O3). The compiler may inline or reorder functions, making gprof reports unreliable.
  • Therefore, Debug mode is recommended for gprof:
1
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} -pg -O2")
  1. Install gprof2dot: pip install gprof2dot. If this doesn’t work, try:
1
git clone https://github.com/jrfonseca/gprof2dot.git
1
- One can use `python3 -m gprof2dot <ARGS>` if the module has been installed successfully
  1. Install dot (part of Graphviz):
1
2
sudo apt update
sudo apt install graphviz
  1. Run the program:
1
gprof -p -q my_program gmon.out > analysis.txt
  • -p: Only prints flat profile (function time usage)
  • -q: Only prints call graph
  • -b to suppress verbose explanations:
  1. Create a profile.png view of the graph:
1
python3 gprof2dot.py -s -w analysis.txt | dot -Tpng -o profile.png
  • -w means “wrap text in bounding box”
  • -s strips away args
  1. To check the times of a subtree:
    1. Check the full function signature

      1
      
       python3 gprof2dot.py  --list-functions='*add_scan*'   -w ~/file_exchange_port/Mumble-Robot/mumble_onboard/profile.txt
      
    2. Generate the sub-tree:

      1
      
       python3 gprof2dot.py -s --root='halo::IncrementalNDTLO::add_scan(std::shared_ptr<pcl::PointCloud<pcl::PointXYZI> >, bool)' -w ~/file_exchange_port/Mumble-Robot/mumble_onboard/profile.txt | dot -Tpng -o profile.png
      

perf

  1. Compile your binary with -g to get debug symbols. Debug symbols could increase the size of your binary, but wouldn’t cause slowdowns
  2. Second, run
1
perf record -F 99 -a -g -- MY_EXE && perf script -i perf.data &> perf.unfold                        
  1. Visualization:
    1. FlameGraph: git clone https://github.com/brendangregg/FlameGraph.git
      1
      2
      
       ./FlameGraph/stackcollapse-perf.pl perf.unfold &> perf.folded              
       ./FlameGraph/flamegraph.pl perf.folded > perf.svg
      
    2. Speedscope (my favorite)
      • Just drag perf.unfold onto the page!

Perf in a Docker Container

Running perf inside Docker has a few quirks, especially with custom kernels like those from System76 or Pop!_OS. Here’s a refined setup guide:

  1. If you see this error:
1
2
3
4
WARNING: perf not found for kernel 6.9.3-76060903

  You may need to install the following packages for this specific kernel:
    linux-tools-6.9.3-76060903-generic

💡 Why:

  • Some systems (e.g., System76) use custom kernel builds that don’t have matching linux-tools-<version> packages in the default Ubuntu repos.
  • Installing linux-tools-$(uname -r) inside the container won’t work if the package doesn’t exist in apt.

✅ Solution:

Perf’s ABI has been very stable. It is backward and forward ABI compatible via perf_event_open(2). You can use a perf binary from a different kernel version as long as the syscalls remain stable (which they do). Just make sure the binary matches your host architecture and brings any needed .so dependencies.

Use the perf binary from the host, where it was successfully installed:

1
cp /usr/lib/linux-tools-$(uname -r)/perf ./perf-copy
  • ⚠️ Do not copy /usr/bin/perf — it’s just a symlink that may not point to the real binary.
  1. Required docker-compose settings in docker-compose.yaml:
1
2
privileged: true      # or at least cap_add: [SYS_ADMIN, PERFMON]
pid: "host"
  • privileged or SYS_ADMIN/PERFMON are needed for perf_event_open
  • pid: "host" is critical so perf can trace real PIDs and resolve symbols correctly.
  1. If you see this warning:
1
Perf tool from other version of kernel still can be used (the syscalls in perf_event subsystem have good design and are compatible with older/newer tools). So, you can just find any perf binary (not the /usr/bin/perf script) anywhere, check its library depends with (ldd ..path_to_perf/perf) and copy perf inside Docker (and install libs).

Do this in the container:

1
2
sudo sysctl -w kernel.perf_event_paranoid=0
sudo sysctl --system

Optick

Optick failed to run on my linux machine. I could not get JSON file out, only the opt file out. Also, it needs a GUI to run it. The gui only be built on MSVC.

CMake Boiler Plate For Release and Debug

For a ROS 2 workspace, we generally want a structured CMakeLists.txt design that allows easy toggling between Debug and Release modes. Why? I worked on a KD tree implementation. The same implementation takes 0.18s for KD tree building with 20k 3D LiDAR Points under the Debug mode. Under the release mode? 3ms (60x speed up)!

In a two level structure my_ros2_workspace -> halo, We control optimization flags, profiling tools (gprof, gdb), and CPU-specific instructions from the top-level workspace CMakeLists.txt, while allowing package-specific settings in halo/CMakeLists.txt.

  • Workspace-Level (my_ros2_workspace/CMakeLists.txt) (top-level) sets the default build type and allows toggling between Debug and Release modes dynamically.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
cmake_minimum_required(VERSION 3.10)
project(my_ros2_workspace)

# Ensure that a build type is set
if(NOT CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type (Debug, Release, RelWithDebInfo)" FORCE)
endif()

# Define common flags for different build types
if(CMAKE_BUILD_TYPE MATCHES "Debug")
    message(STATUS "Building in Debug mode with gprof and gdb support")
    set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -pg -ggdb -O0")
    set(CMAKE_EXE_LINKER_FLAGS_DEBUG "${CMAKE_EXE_LINKER_FLAGS_DEBUG} -pg")
elseif(CMAKE_BUILD_TYPE MATCHES "Release")
    message(STATUS "Building in Release mode with optimizations")
    set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -march=native -O3")
    if(WITH_SSE)
        set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -msse4.2")
    endif()
    if(WITH_AVX)
        set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -mavx2")
    endif()
endif()

# Export build type setting to subdirectories
set(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS_${CMAKE_BUILD_TYPE}})

# Add your packages
add_subdirectory(halo)
  • Sub-Package Level halo/CMakeLists.txt: Each ROS 2 package (like halo) should inherit the build settings from the top-level CMakeLists.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
cmake_minimum_required(VERSION 3.10)
project(halo)
# Enable C++17 (or higher)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Ensure build type is consistent
if(NOT CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type (Debug, Release, RelWithDebInfo)" FORCE)
endif()
# Use the common flags from the workspace-level CMakeLists.txt
add_definitions(${CMAKE_CXX_FLAGS})

# Include directories
include_directories(
    ${CMAKE_SOURCE_DIR}/include
)
...

Some explanations:

  • CACHE: Stores the variable persistently across CMake runs.
  • STRING: Specifies that this is a string-type variable.
  • "Build type (Debug, Release, RelWithDebInfo)": A user-friendly message to describe the variable.
  • FORCE: Overwrites any previously set value of CMAKE_BUILD_TYPE in the cache. This ensures that CMAKE_BUILD_TYPE is set globally and is not overridden by user settings unless explicitly changed.

Understanding the WITH_SSE and WITH_AVX Flags:

  • These flags enable optional CPU-specific optimizations using SSE4.2 and AVX2 instruction sets. They are useful for performance-critical applications like computer vision, deep learning, or scientific computing.

CMake automatically sets certain compiler flags depending on the build type. The default values are:

  • Debug usually includes -O0 and -g.
  • Release usually includes -O3 (for GCC/Clang) and often -DNDEBUG.
  • RelWithDebInfo usually includes -O2 plus debug symbols.
  • MinSizeRel usually includes -Os (optimize for size).

If you only commented out your custom optimization lines, you did not override CMake’s built-in defaults for Release. By default, Release mode is still optimized (most often -O3).

To build with colcon:

  • colcon build --cmake-args -DCMAKE_BUILD_TYPE=Debug
  • colcon build --cmake-args -DCMAKE_BUILD_TYPE=Release
  • Enable WITH_SSE or WITH_AVX for Release: colcon build --cmake-args -DCMAKE_BUILD_TYPE=Release -DWITH_SSE=ON -DWITH_AVX=ON

  • colcon_build_source --executor-args -j4: --executor-args is used for things like number of threads -j4

To build with CMake:

  • cmake .. -DCMAKE_BUILD_TYPE=Debug
  • cmake .. -DCMAKE_BUILD_TYPE=Release (This applies -O3 for maximum optimization.)
  • cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo (This enables -O2 optimizations while keeping debugging symbols.)

Quirk: Seeing Segfault Or Inconsistent Values In Release Runs From The Debug Mode

This is almost certainly an undefined behavior. In Debug mode, compilers often zero-initialize more aggressively or add padding/checks, so you get a “lucky” consistent result (like pointers). In Release mode, the uninitialized pointers can contain garbage values. So:

  • Always initialize pointers (and all fields) in your structs/classes. Debug mode can mask uninitialized usage. Release mode typically reveals these bugs.

  • If a function has a return type but is not returning anything, in release mode, we see seg faults

Optional CMake Settings

  • set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${PROJECT_SOURCE_DIR}/lib): useful if we are building static lib: .a files. set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)

UNRECOMMENDED CMake Settings

  • set(CMAKE_CXX_FLAGS "-w") - suppress all warnings
  • set(CMAKE_CXX_FLAGS_RELEASE "-O2 -g -ggdb ${CMAKE_CXX_FLAGS}") This preserves debugging symbols while applying O2 optimizations. But instead, one can just use colcon build --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo