Optimizing Rust programs with PGO and BOLT using cargo-pgo
Last year I was working on improving the profile-guided optimization (PGO) workflow used to build the
Rust compiler. While doing that, I realized that while PGO works fine for Rust, it is not as
straightforward to use or as discoverable as I would have liked. That led me to create cargo-pgo, a
Cargo subcommand that makes it easier to optimize Rust binaries with PGO (and BOLT, see below).
I actually implemented this tool over a year ago, and already posted about it on Reddit, but I figured
that it might be useful to also write a short blog post about it here to explain a little bit how
PGO/BOLT works for Rust and how cargo-pgo automates it.
What is PGO, anyway?
Profile-guided optimization (PGO) is a program optimization technique that allows a compiler to better optimize your code thanks to having a better idea of how your program will behave on real-world workloads. This is done using a recorded representation of program behavior, which is usually called a “profile”, hence the term profile-guided optimization. In a way, it is sort of like a JIT for compiled programs - instead of optimizing a program based on its runtime behavior while it is running, you run the program, record its behavior, and then re-compile it using this additional information.
The PGO workflow usually looks something like this:
- You compile an “instrumented” version of your program. The compiler will insert additional instrumentation instructions into it, which will record useful information when the program is executed.
- You execute the instrumented binary on some representative workload(s). This will generate a set of profiles on disk, which will contain information about your program behavior - things like how many times each function was called or how many times a conditional branch was taken.
- You compile your binary again, this time providing the gathered profiles to the compiler. It should then be able to optimize the code better, because it will have a better idea of your program runtime behavior.
PGO is a common technique in the C/C++ world, and it is also well-supported by Rust [1]. There is a
PGO guide in the official Rust compiler documentation, which describes the steps that you need to
perform to get it working. In short, you need to pass a special compiler flag to rustc when building
your crate, gather the profiles by running your program, use a separate LLVM tool to merge the
gathered profiles and then pass a different flag to rustc, which needs to point to the merged profile.
It’s not super complicated, but it’s also quite far from the typical frictionless experience of
running a single cargo <foo> command that does everything you need.
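The manual steps described above can be sketched as a small shell function. This is only a rough sketch under a few assumptions: the names pgo_manual, run, DRY_RUN and TARGET_TRIPLE, as well as the profile directory, are made up for illustration, the target triple defaults to x86_64 Linux, and llvm-profdata (matching the LLVM version used by your rustc) is expected to be in PATH.

```shell
#!/bin/sh
# Sketch of the manual PGO steps. `pgo_manual`, `run`, DRY_RUN, TARGET_TRIPLE
# and the profile directory are illustrative names, not part of rustc or Cargo.

# Print a command and then execute it, unless DRY_RUN=1 is set.
run() {
    printf '%s\n' "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Usage: pgo_manual <binary-name> [workload args...]
pgo_manual() (
    set -e
    bin="$1"
    shift
    triple="${TARGET_TRIPLE:-x86_64-unknown-linux-gnu}"
    profiles="$PWD/target/pgo-profiles-manual"

    # 1) Build an instrumented binary.
    run env RUSTFLAGS="-Cprofile-generate=$profiles" \
        cargo build --release --target "$triple"
    # 2) Run it on a representative workload; this writes .profraw files.
    run "./target/$triple/release/$bin" "$@"
    # 3) Merge the raw profiles into a single file.
    run llvm-profdata merge -o "$profiles/merged.profdata" "$profiles"
    # 4) Rebuild, this time pointing rustc to the merged profile.
    run env RUSTFLAGS="-Cprofile-use=$profiles/merged.profdata" \
        cargo build --release --target "$triple"
)
```

Calling e.g. DRY_RUN=1 pgo_manual myapp only prints the four commands, which is a cheap way to check them before running anything for real.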
Automating PGO
That is why I decided to create cargo-pgo, a Cargo subcommand that is designed to make it as easy as
possible to apply PGO to Rust crates. So, how does it work? First, you need to install it with the
following command:
$ cargo install cargo-pgo
After that, you can start using the various cargo pgo <...> commands.
How does this Cargo integration work?
Cargo has an ingenious feature that allows you to add custom subcommands to it transparently. If you
execute e.g. cargo foo bar, and Cargo doesn’t know the command foo, it will try to search for a
cargo-foo binary in PATH. If it finds it, it will delegate the executed command to the binary, and
basically invoke cargo-foo foo bar (the subcommand name is passed along as the first argument).
In this way, you can add custom third-party subcommands to Cargo quite easily.
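To make the lookup a bit more concrete, here is a simplified model of it in shell. The helper find_cargo_subcommand is hypothetical and only mimics the PATH search; the real logic lives inside Cargo itself (which also checks $CARGO_HOME/bin).

```shell
#!/bin/sh
# Simplified model of how Cargo locates an external subcommand.
# `find_cargo_subcommand` is a hypothetical helper, not something Cargo provides.

# Usage: find_cargo_subcommand <name>
# Prints the path of the first executable `cargo-<name>` found in PATH,
# or returns a non-zero status if there is none.
find_cargo_subcommand() (
    name="cargo-$1"
    IFS=:
    # Walk the PATH entries in order (IFS is only changed inside this subshell).
    for dir in $PATH; do
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            exit 0
        fi
    done
    exit 1
)
```

For example, find_cargo_subcommand pgo should print the location of the cargo-pgo binary once it has been installed, and fail otherwise.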
You may recall that the first step of the PGO workflow is to generate an instrumented binary. You can
do that using cargo pgo build, which does several things for you:
- It passes the --release flag to Cargo, just to make sure that you don’t forget. There’s not much point in PGO optimizing debug builds.
- It passes an explicit --target flag to Cargo, which avoids PGO instrumenting build scripts.
- It creates a directory for storing the PGO profiles under the target artifact directory. It will also automatically clear this directory to remove any stale profiles, unless you pass the --keep-profiles flag.
- And finally, it compiles your target with the -Cprofile-generate=<profile-dir> flag, which will cause rustc to enable PGO instrumentation.
Gathering profiles
After you have an instrumented binary, you should execute it on some realistic workloads to gather the profiles. You should gather enough profiles to provide proper context for the compiler, but it’s hard to say in general what the correct amount is. Usually I just let the program run for at least a minute or so.
Sometimes you might not have an easy way of running your code on a representative workload, and you
would like to gather profiles e.g. from tests or benchmarks. cargo pgo has your back! With
cargo pgo test or cargo pgo bench, you can generate profiles by running instrumented tests or
benchmarks, and then use these profiles to optimize a separate binary executable (it doesn’t make
much sense to optimize tests/benchmarks themselves with PGO).
More precise profiles
If you run cargo pgo build, you might notice that it suggests executing the instrumented binary with
an LLVM_PROFILE_FILE environment variable. What is this about?
By default, the instrumented binary will store all profiles into a single .profraw file. This is fine
for most use-cases. However, if your program creates multiple processes or if you execute the
instrumented program multiple times in parallel, some data in the profile might get lost or
overwritten. This happens because the file will be read and written in parallel by multiple
processes, effectively resulting in a race condition. This race condition is mostly harmless, but it
can result in less precise profiles.
To resolve this potential problem, you can run the instrumented binary with the environment variable
LLVM_PROFILE_FILE set to a path containing a special placeholder value %p, which expands to the
process ID. This will essentially cause the instrumented program to generate one .profraw file per
process [2]. For example:
$ LLVM_PROFILE_FILE=./target/pgo-profiles/%m_%p.profraw ./target/x86_64-unknown-linux-gnu/release/foo
Creating one file per process should result in more precise profiles and thus a better optimized
program. When I enabled this “trick” for the Rust compiler itself, it resulted in pretty nice ~1%
instruction count improvements across the board, although it’s hard to say whether this will
generalize to other programs.
It should also be noted that if you create a lot of processes, the disk usage of all these profile
files can get large pretty quickly! For rustc, a single profile takes tens of megabytes, while
creating a separate profile for each process consumes almost 60 GiB!
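If you want to keep an eye on the disk usage, a tiny helper like the following might be handy. It is just a sketch: profile_stats is a made-up name, not a cargo-pgo command, and the default directory is assumed to be the target/pgo-profiles directory used in the example above.

```shell
#!/bin/sh
# Report how many .profraw files a directory contains and how much disk space
# the directory occupies. `profile_stats` is an illustrative helper only.

# Usage: profile_stats [profile-dir]
profile_stats() (
    dir="${1:-target/pgo-profiles}"
    # Count the raw profile files (tr strips padding that some `wc` versions add).
    count=$(find "$dir" -type f -name '*.profraw' | wc -l | tr -d ' ')
    # Human-readable total size of the whole directory.
    size=$(du -sh "$dir" | cut -f1)
    printf '%s\n' "$count profile file(s), $size total in $dir"
)
```

Running profile_stats after a profiling session gives a quick idea of whether one file per process is still affordable for your workload.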
Final optimization step
Once you have gathered the PGO profiles, you can run cargo pgo optimize. It will merge all gathered
profiles using the llvm-profdata tool and then compile your target with the -Cprofile-use flag,
pointing it to the single merged profile file. It will also print helpful stats about the gathered
profiles (like their count and total size before and after merging).
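Put together, the whole cargo-pgo based workflow might look roughly like the sketch below. The function name pgo_with_cargo_pgo, the run helper and the DRY_RUN and TARGET_TRIPLE variables are illustrative assumptions, and the binary path assumes an x86_64 Linux target unless TARGET_TRIPLE says otherwise.

```shell
#!/bin/sh
# End-to-end sketch of the cargo-pgo workflow. `pgo_with_cargo_pgo`, `run`,
# DRY_RUN and TARGET_TRIPLE are illustrative names, not part of cargo-pgo.

# Print a command and then execute it, unless DRY_RUN=1 is set.
run() {
    printf '%s\n' "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Usage: pgo_with_cargo_pgo <binary-name> [workload args...]
pgo_with_cargo_pgo() (
    set -e
    bin="$1"
    shift
    triple="${TARGET_TRIPLE:-x86_64-unknown-linux-gnu}"

    # 1) Build the instrumented binary.
    run cargo pgo build
    # 2) Run the workload, writing one profile file per process.
    run env LLVM_PROFILE_FILE="$PWD/target/pgo-profiles/%m_%p.profraw" \
        "./target/$triple/release/$bin" "$@"
    # 3) Merge the profiles and build the optimized binary.
    run cargo pgo optimize
)
```

With DRY_RUN=1 set, the function only prints the three commands, so it can be inspected without having cargo-pgo installed.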
Running PGO on CI
If you want to apply PGO to binary artifacts that you then distribute to end users, you might want
to run PGO in a CI (continuous integration) workflow. If you install cargo-pgo in your CI script,
and you are able to run your instrumented binary on some (probably small) workload directly on the CI
machine, then this becomes quite straightforward. I created a simple example of a GitHub Actions
workflow that shows how this could be done.
Going beyond PGO
The (LLVM-based) PGO implementation offered by the Rust compiler is just one of many existing so-called feedback-directed optimization (FDO) tools, which leverage some sort of runtime profiles to better optimize programs. Another such tool is a post-link optimizer called BOLT. “Post-link” means that it takes a fully compiled and linked program binary as an input, and then uses profiles to optimize the binary, even without access to its source code. This differs from “classic” PGO, which optimizes the program during compilation, and thus has access to its source code. The main goal of BOLT is to better reorganize instructions within the binary, in particular to improve instruction cache utilization.
BOLT is a part of LLVM, and can provide additional performance improvements even on top of an already PGO-optimized binary. Last year, I enabled BOLT for the LLVM [3] used by the Rust compiler, which resulted in ~2-5% cycle improvements across the board.
Sadly, it might not be that easy to even get ahold of a precompiled version of BOLT. While it is
distributed through Ubuntu/Debian packages, they seem to be broken currently. Since LLVM 16, LLVM
GitHub releases contain a precompiled llvm-bolt binary [4], which allows you to get a working version
of BOLT relatively easily, but only if it is available for your architecture and platform. If it is
not, then you basically have to go and compile LLVM + BOLT yourself [5], which is quite annoying and
can also be difficult to do on CI.
Same as with PGO, BOLT uses a workflow where you first need to gather profiles of your program running on some workload, and then use these profiles to re-optimize your binary. BOLT can gather these profiles in two modes:
- Sampling: In this mode, you simply execute your binary under a profiler (perf), which gathers hardware counter data from its execution and uses this information to generate the required profiles.
- Instrumentation: This mode is similar to PGO instrumentation. BOLT modifies your binary to add additional instructions that will generate the profiles during runtime. The advantage of this mode is that it doesn’t require access to CPU/HW counters, which makes it usable in CI environments that do not allow this (such as GitHub Actions). I also think that instrumentation should be able to generate more precise profiles. The disadvantage is that you need an additional instrumentation step. The instrumented binary is also slower, so it will take more time to gather the same amount of profile data as with the sampling approach. But this might not be a big deal.
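For comparison, the sampling mode might look roughly like the sketch below. The perf, perf2bolt and llvm-bolt invocations follow the usage shown in the BOLT README, but the exact set of optimization flags can differ between BOLT versions, so treat them as an assumption. The names bolt_sampling, run and DRY_RUN are made up for illustration. The sketch also assumes that the binary was linked with relocations (-Wl,-q) and that the CPU supports branch sampling (LBR).

```shell
#!/bin/sh
# Sketch of the BOLT sampling workflow. `bolt_sampling`, `run` and DRY_RUN are
# illustrative names; the tool flags follow the BOLT README and may vary
# between BOLT versions.

# Print a command and then execute it, unless DRY_RUN=1 is set.
run() {
    printf '%s\n' "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Usage: bolt_sampling <path-to-binary> [workload args...]
bolt_sampling() (
    set -e
    bin="$1"
    shift

    # 1) Run the workload under perf and record branch samples.
    run perf record -e cycles:u -j any,u -o perf.data -- "$bin" "$@"
    # 2) Convert the perf data into the profile format used by BOLT.
    run perf2bolt -p perf.data -o perf.fdata "$bin"
    # 3) Optimize the binary using the converted profile.
    run llvm-bolt "$bin" -o "$bin.bolt" -data=perf.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -split-eh -dyno-stats
)
```

With DRY_RUN=1 the function only prints the three commands, which makes it possible to review them on a machine without perf or BOLT.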
Here is an example of how you could use BOLT manually using the instrumentation mode:
# Build your binary with linker relocations, so that BOLT can instrument it
$ RUSTFLAGS="-C link-args=-Wl,-q" cargo build --release
# Instrument the binary with BOLT
$ llvm-bolt ./target/release/<binary> -o instrumented -instrument
# Run the instrumented binary on some workload
$ ./instrumented <...>
# Merge the generated profiles, which are by default stored into /tmp/*.fdata
$ merge-fdata /tmp/*.fdata > merged.profdata
# Finally, optimize the binary with BOLT
$ llvm-bolt ./target/release/<binary> -o optimized -data merged.profdata <BOLT flags...>
As you can see, the process is quite involved. Using BOLT is actually trickier than using PGO,
because it is not integrated into the Cargo workflow, but instead it operates on the finished Rust
artifacts. You should thus make sure not to modify the original artifacts built by cargo, so that
you do not mess with its cache, and that you don’t instrument the same file with BOLT twice (it will
result in an error).
To make this easier, I also added support for BOLT to cargo-pgo. It uses a workflow that is quite
similar to the PGO one, and does the instrumentation, profile merging and optimization for you:
# Build a BOLT instrumented binary
$ cargo pgo bolt build
# Run the binary to gather profiles
$ ./target/.../<binary>-bolt-instrumented
# Optimize the binary with BOLT using the gathered profiles
$ cargo pgo bolt optimize
# Now you can use ./target/release/<binary>-bolt-optimized
The instrumented and optimized files are named with a suffix (-bolt-instrumented and -bolt-optimized)
to avoid messing with artifacts built by Cargo.
cargo-pgo is even able to combine both PGO and BOLT using the --with-pgo flag [6]:
# Build PGO instrumented binary
$ cargo pgo build
# Run binary to gather PGO profiles
$ ./target/.../<binary>
# Build BOLT instrumented binary using PGO profiles
$ cargo pgo bolt build --with-pgo
# Run binary to gather BOLT profiles
$ ./target/.../<binary>-bolt-instrumented
# Optimize a PGO-optimized binary with BOLT
$ cargo pgo bolt optimize --with-pgo
This combined PGO + BOLT workflow should provide the largest performance improvements [7], at the cost of increased build time - you need to recompile and run your program several times.
Conclusion
There’s probably more to say about both PGO and BOLT, but this post was mainly supposed to serve as a
short intro into how to use these techniques with Rust, and how to leverage cargo-pgo to make this
simpler, and I think that it has achieved that goal.
Let me know on Reddit or on the cargo-pgo issue tracker if you have any questions regarding the usage
of PGO/BOLT for Rust crates.
[1] Although I’m not sure how well known it is and how many people actually use it.
[2] The %m placeholder is “module” (I think?) and basically describes the signature of the instrumented binary.
[4] And also the merge-fdata tool, which is needed for merging BOLT profiles. It is basically a BOLT analogue to the llvm-profdata PGO tool.
[5] We do exactly this in Rust CI workflows.
[6] It would be nicer to do this in a more composable way, like cargo pgo bolt build -- cargo pgo pgo optimize or something like that, but alas. A possible future improvement :)
[7] Although it is not guaranteed that it will actually improve performance, of course.