Note: This version targets Xe-family consumer and data center GPUs. If you were using
iaprof on the Intel® Tiber™ AI Cloud with a PVC device, see the tiber tag.
This tool collects profiles of Intel GPU performance based on hardware sampling and generates visualizations from the results: AI flame graphs and subsecond-offset heatmaps.
It combines EU stalls, CPU stacks, and GPU kernel information to link CPU code to GPU performance metrics. The resulting profile output can be consumed by external tools to generate visualizations such as:
- Flame Graphs
- FlameScope-style subsecond-offset heatmaps
The following Intel Xe-family hardware is supported on Linux:
- Intel® Arc™ B-series graphics cards (Battlemage)
- Intel® Core™ Ultra processors with Intel® Arc™ graphics (Lunar Lake)
- Other Intel® Xe2-based devices (untested)
You will need Linux 6.15 or later, which includes EU stall sampling support for the xe driver.
BTF type information is required for both
vmlinux and the xe driver. These are typically found at /sys/kernel/btf/vmlinux
and /sys/kernel/btf/xe once the driver is loaded. If /sys/kernel/btf/xe is
absent, your kernel may have been built without CONFIG_DEBUG_INFO_BTF_MODULES=y.
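The kernel and BTF requirements above can be checked before building. The following is a minimal sketch, not part of iaprof: version_ge is a hypothetical helper that compares only the major and minor numbers reported by uname -r, so a distribution kernel carrying backported patches may be misjudged.

```shell
# Preflight sketch (hypothetical helper, not shipped with iaprof).

# Succeeds if version $1 (e.g. "6.15.0-rc1") is at least $2 ("major.minor").
version_ge() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%[!0-9]*}   # drop ".0-rc1", "-generic", etc.
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%[!0-9]*}
    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -eq "$want_major" ] && [ "${have_minor:-0}" -ge "${want_minor:-0}" ]
}

kernel=$(uname -r)
if version_ge "$kernel" 6.15; then
    echo "kernel $kernel: OK"
else
    echo "kernel $kernel: older than 6.15"
fi

# The xe entry only appears once the xe driver is loaded.
for f in /sys/kernel/btf/vmlinux /sys/kernel/btf/xe; do
    if [ -e "$f" ]; then
        echo "$f: present"
    else
        echo "$f: missing"
    fi
done
```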
iaprof uses USDT probes in libze_intel_gpu
(the Intel GPU Level Zero runtime) to observe GPU kernel launches and collect kernel
debug information. Standard NEO releases do not yet include these probes; until they
are upstreamed, a patched build is required.
Patches against a supported NEO release are provided on the Releases page.
Note: Documentation for the specific NEO version and patch instructions is coming soon.
The profiled application and its dependencies — including the graphics stack — must be compiled with frame pointers enabled in order to collect reliable CPU stacks. Add these flags to C/C++ compile commands:
-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
Work is in progress to have frame pointer support integrated into official graphics stack packages. In the meantime, you will need to rebuild relevant libraries from source with the flags above.
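When rebuilding a library from source, one way to apply the flags is through the CFLAGS and CXXFLAGS environment variables, which many build systems honor. This is a sketch under assumptions: the mylib project in the commented-out commands is hypothetical, and some projects override these variables in their own build files, so check the resulting compile commands.

```shell
# Frame pointer flags from the section above.
FP_FLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer"

# Append to any flags already set so existing options are kept.
export CFLAGS="${CFLAGS:+$CFLAGS }$FP_FLAGS"
export CXXFLAGS="${CXXFLAGS:+$CXXFLAGS }$FP_FLAGS"

# Hypothetical CMake project in ./mylib. CMake picks up CFLAGS/CXXFLAGS
# only on the first configure of a build directory, so use a fresh one.
# cmake -S mylib -B mylib/build
# cmake --build mylib/build
```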
Install build dependencies:
sudo apt install libelf-dev clang llvm python3-mako cmake libzstd-dev
A Rust toolchain is also required. The recommended way to install one is via
rustup rather than the cargo package from apt, which is
often out of date:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Clone the repo and build:
git clone --recursive https://github.com/intel/iaprof
cd iaprof
make deps
./build.sh
Note: If the make deps step fails, ensure that user.name and user.email are
set in your git config.
The built binary is placed at build/iaprof.
Start the profiler (requires root):
sudo build/iaprof > profile.txt
Run your GPU workload. When it finishes, interrupt iaprof with Ctrl-C.
You can tune collection with the following options:
- --interval=MS: output interval in milliseconds (default: 10)
- --eu-stall-subsample=N: process one out of every N EU stall samples (default: 100)
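As an illustration, both options can be combined in one invocation. The values below are arbitrary examples rather than recommendations: a longer output interval and a larger subsample factor both reduce how much work iaprof does.

```shell
IAPROF=build/iaprof   # path from the build step
INTERVAL_MS=100       # emit output every 100 ms instead of every 10 ms
SUBSAMPLE=1000        # process 1 in 1000 EU stall samples instead of 1 in 100

if [ -x "$IAPROF" ]; then
    # Runs until interrupted with Ctrl-C, as in the basic invocation.
    sudo "$IAPROF" --interval="$INTERVAL_MS" --eu-stall-subsample="$SUBSAMPLE" > profile.txt
else
    echo "$IAPROF not found; run ./build.sh first" >&2
fi
```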
Note: This section is incomplete. More detailed per-framework guidance is coming once frame pointer support is available in official graphics stack packages.
Python added perf support (and frame pointer support through trampolines) in version 3.12. This is the minimum required version for profiling with this tool. The CPython interpreter itself must also be compiled with frame pointers enabled.
Set this environment variable before running a Python workload to enable perf support:
export PYTHONPERFSUPPORT=1
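To confirm that the interpreter on your PATH meets these requirements, a check along the following lines may help. It is a sketch: python_perf_status is a hypothetical helper name, and it relies on sys.is_stack_trampoline_active(), which exists in Python 3.12 and later. It does not verify that CPython itself was built with frame pointers.

```shell
# Prints "active" if python3 is 3.12+ and the perf trampoline is enabled,
# "inactive" otherwise, or "no-python3" if no interpreter is found.
python_perf_status() {
    if ! command -v python3 >/dev/null 2>&1; then
        echo "no-python3"
        return 0
    fi
    PYTHONPERFSUPPORT=1 python3 -c '
import sys
# The version test short-circuits, so older interpreters never reach the
# 3.12-only call.
ok = sys.version_info >= (3, 12) and sys.is_stack_trampoline_active()
print("active" if ok else "inactive")
'
}

python_perf_status
```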
PyTorch workloads typically use a mix of SYCL, oneDNN, and oneMKL kernels. Frame pointers must be enabled for the following components:
- PyTorch itself
- IPEX (Intel Extension for PyTorch)
- oneCCL and its PyTorch bindings (a dependency of IPEX)
- The SYCL runtime
- oneMKL
- oneDNN
Note: Guidance on which of these components may already ship with frame pointers on supported distributions is coming soon.
The iaprof output format is a tab-separated text stream that can be consumed by
external tools. ProVis is one such tool
that reads this format and generates flame graphs and subsecond-offset heatmaps. A
conversion script for producing standard
stackcollapse output (compatible with
flamegraph.pl and other tools) is also planned.
Note: The stackcollapse conversion script is not yet available.
The overhead of iaprof is low, but the current version is not designed for
continuous profiling. It profiles a single active workload and does not handle
multiple transient workloads as would be seen in a multi-tenant environment.
Ensure all code is compiled with frame pointers as described above. If your CPU
stack ends in one frame of libfoo.so, that library is likely missing frame pointers.
Stacks can be truncated if they exceed the kernel's collection limit. Raise it with:
sudo sysctl kernel.perf_event_max_stack=512
sudo sysctl kernel.perf_event_max_contexts_per_stack=64
If stacks are still truncated below that depth, the kernel may have stopped unwinding early at a non-resident stack page, because it cannot fault pages in while collecting the stack. This can occur under memory pressure or with NUMA balancing enabled.
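To see whether the stack limits above are in effect, the current values can be read back from /proc/sys/kernel. The sketch below uses a hypothetical check_limit helper and only reads values, so it does not need root. Note that the kernel may refuse to change perf_event_max_stack while perf events with call chains are in use, so it is safest to raise the limits before starting iaprof.

```shell
# Report whether a kernel sysctl meets a desired minimum.
# $1 = file name under /proc/sys/kernel, $2 = desired minimum value
check_limit() {
    path="/proc/sys/kernel/$1"
    if [ ! -r "$path" ]; then
        echo "$1: not available"
        return 0
    fi
    current=$(cat "$path")
    if [ "$current" -ge "$2" ]; then
        echo "$1: $current (ok)"
    else
        echo "$1: $current (below $2)"
    fi
}

check_limit perf_event_max_stack 512
check_limit perf_event_max_contexts_per_stack 64
```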
iaprof is designed to have low overhead, but some slowdown may still be
noticeable depending on the workload. If iaprof itself is consuming significant
CPU, the --eu-stall-subsample option can reduce its processing load.