andweorc
- a causal profiler
2021/12/17
Causal profiling, at least to my knowledge, was introduced by Curtsinger, Berger in their 2015 paper COZ: Finding Code that Counts with Causal Profiling. Traditionally profilers measure the CPU time taken by code and report on that. This works well when CPUs are relatively deterministic and programs aren't multi-threaded in themselves. On a modern machines all programs are parallel programs, in the sense of Amdahl's law applying. Even a totally serial program -- no threads -- is still running on a superscalar, out-of-order device, hence tricks around loop fission and what not. The "serialization" point of our programs have an outsized impact on total program performance in a way that is related to but not totally explained by CPU time. A very sleepy part of a program might drive its overall runtime while a busy part of the program lights up a traditional profiler. Think of a busy-loop in one thread waiting on an atomic bool to flip before causing the program to exit and a sleep followed by a bool flip in another thread. Causal profiling aims to not indicate where a program spent its CPU time but, instead, explain to what degree performance of the program would change if such and such line's performance were changed. Thataway we, the software engineers, learn where to focus our effort from the computer, rather than guessing if such and such change would materially improve total-program performance.
Does causal profiling work?
It does. You can experiment with the
coz
program yourself. This software
depends on libelfin which doesn't
understand more recent DWARF versions, but if you use an old-ish Ubuntu version
you should be alright. The paper linked above describes the experience of using
coz, as well, if you can't get it to function.
How does a coz-like causal profiler work?
How a coz
-like causal profiler works is really interesting, but, more
generally, how does a causal profiler work? The key insight to causal profiling
is that, while you can't speed up program sub-components you can slow them
down, and if you refrain from slowing down some sub-component while slowing all
the rest down you've "sped" that sub-component up. A causal profiler uses this
insight plus some mechanism to slow down the sub-components -- line, function,
module etc -- you're investigating, a table of "speedup" factors to apply, some
"progress points" to track during execution and some coordinator to keep track
of which bits have been "spedup" and their impact on the progress points, with
enough repeats of the same speedup to get statistically interesting results.
The way coz
achieves this is by managing experiments with a "global
profiler". This thing runs in its own OS thread, polls the progress points and
keeps a record of the speedup experiments already run and their effect on the
program. Through an LD_PRELOAD
override coz
injects overwritten POSIX and
pthread functions to
- make select functions participate in ongoing experiments (by sleeping for an experimentally determined time, ie, slowing down) and
- start a per-thread profiler complete with timer interrupt.
This per-thread profiler lives at the "top" of every thread, collecting Linux
perf samples -- instruction pointer, callchain -- and sleeping in coordination
with the global profiler, reporting back results to the global profiler. The
interrupted thread goes back to normal operation once the interrupt is
handled. Because of the use of Linux perf coz
can only understand whatever an
instruction pointer points to that can also be resolved into a sybmol, thus a
line. You can imagine an alternative implementation that fiddles with functions
at the compiler level to insert a delay period, or more invasive
application-level changes to get the same result, so long as you can preserve
the interrupt timer notion.
What is andweorc
?
The andweorc
project is my attempt to adapt the coz
idea into a workflow
that looks something more like what we have with
criterion. I focus on single-machine
performance concerns at Datadog and have been
working pretty single-mindedly on
Vector since I joined on. The Vector
team is excellent and we've made some serious strides in improving Vector's
throughput performance. For instance, we have statistically stable, reasonably
fast integrated performance monitoring and regression detection for each PR,
called soak
tests. Super
useful. They chuck out comments like
this
that inform us whether our PR changes have adjusted performance and with what
statistical certainty. Very handy, but it doesn't tell us why or what to do
about why. That's the job of a causal profiler.
For causal profiling to work and work well for something like the Vector project
it has to hook up well with our CI and to hook up with our CI it needs to be
cargo friendly. I have a hunch that by building a causal profiler that relaxes
some of the coz
constraints we can get a really useful tool for Vector and
Rust generally. Specifically I'm thinking:
- support of non-Rust languages is a non-goal
- CI / CLI mungable results are key
- automatic diffing in profile-points between versions is key
- requiring the user to modify their program is o-kay.
So far I've got to the point in the repo where I have all the hard bits and bobs proved -- save one, see below -- and roughly linked together. The code is still very, very rough and the dream of it is still mostly in my mind, I think, but the outline's clear in a way it wasn't, say, a year and change (and two abandoned repos) ago when I first started seriously thinking about this attempt.
Why can't I use andweorc
now?
Well, there's two important things missing in the implementation, three if you cound the cargo runner. The first is I don't have the "experiment" notion built, but the global-profiler exists and the tracking for that should be relatively easy to piece together. I've already proved out resolving callchains to symbols to my satisfaction, so what remains is setup. The really hard, missing piece is interruption. I need to set up a timer on an interval per thread, have that timer send a signal to the thread and then delay (potentially), collect samples and ship them up to the global-profiler. That's missing.
And! It turns out it's kinda hard to do that today in Rust. Relevant bugs against nix and libc:
- https://github.com/nix-rust/nix/issues/1424
- https://github.com/rust-lang/libc/issues/2576
Anyway, progress is halted on andweorc
while I poke at those. The eagle-eyed
reader of the codebase will note that I'm also using a
branch of perf-event to pull
samples from Linux perf, but that seems less rough to manage than the signal
thing.