andweorc - a causal profiler

2021/12/17

Causal profiling, at least to my knowledge, was introduced by Curtsinger and Berger in their 2015 paper COZ: Finding Code that Counts with Causal Profiling. Traditionally, profilers measure the CPU time taken by code and report on that. This works well when CPUs are relatively deterministic and programs aren't multi-threaded in themselves. On a modern machine all programs are parallel programs, in the sense that Amdahl's law applies. Even a totally serial program -- no threads -- is still running on a superscalar, out-of-order device, hence tricks around loop fission and what not. The "serialization" points of our programs have an outsized impact on total program performance in a way that is related to, but not totally explained by, CPU time. A very sleepy part of a program might drive its overall runtime while a busy part of the program lights up a traditional profiler. Think of a busy-loop in one thread waiting on an atomic bool to flip before causing the program to exit, and a sleep followed by a bool flip in another thread. Causal profiling aims not to indicate where a program spent its CPU time but, instead, to explain to what degree performance of the program would change if such and such line's performance were changed. Thataway we, the software engineers, learn where to focus our effort from the computer, rather than guessing whether such and such change would materially improve total-program performance.
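Concretely, something like this minimal sketch -- a toy of my own, not from the paper. Nearly all CPU time lands in the spinning thread, so a traditional profiler points there, but the program's wall-clock runtime is set entirely by the sleeping thread:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::time::Duration;

    static DONE: AtomicBool = AtomicBool::new(false);

    fn main() {
        let spinner = std::thread::spawn(|| {
            // Burns CPU the whole time: this is what lights up a
            // traditional, CPU-time-based profiler.
            while !DONE.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
        });

        let sleeper = std::thread::spawn(|| {
            // Barely registers in a CPU-time profile, yet this sleep is
            // what actually determines when the program exits.
            std::thread::sleep(Duration::from_secs(5));
            DONE.store(true, Ordering::Release);
        });

        spinner.join().unwrap();
        sleeper.join().unwrap();
    }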

Does causal profiling work?

It does. You can experiment with the coz program yourself. This software depends on libelfin, which doesn't understand more recent DWARF versions, but if you use an old-ish Ubuntu version you should be alright. The paper linked above describes the experience of using coz as well, if you can't get it to function.

How does a coz-like causal profiler work?

How a coz-like causal profiler works in particular is really interesting but, more generally, how does any causal profiler work? The key insight of causal profiling is that, while you can't speed up a program's sub-components, you can slow them down, and if you refrain from slowing down some sub-component while slowing all the rest down you've "sped" that sub-component up. A causal profiler uses this insight plus some mechanism to slow down the sub-components -- line, function, module, etc -- you're investigating, a table of "speedup" factors to apply, some "progress points" to track during execution and some coordinator to keep track of which bits have been "sped up" and their impact on the progress points, with enough repeats of the same speedup to get statistically interesting results.
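Here's a minimal sketch of those ingredients. Every name in it is my own invention for illustration -- this is neither coz's nor andweorc's actual API -- but it shows the shape: a progress point the workload bumps, a delay that everything except the selected component pays, and an experiment that records the resulting progress rate.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::{Duration, Instant};

    static PROGRESS: AtomicU64 = AtomicU64::new(0);

    /// A progress point: the program under test calls this each time it
    /// finishes a unit of work, say a request served or a batch flushed.
    pub fn progress_point() {
        PROGRESS.fetch_add(1, Ordering::Relaxed);
    }

    /// "Virtual speedup" of the selected sub-component: everything that is
    /// *not* selected pays a delay whenever it runs, which makes the selected
    /// component relatively faster without actually touching it.
    pub fn maybe_delay(component: u32, selected: u32, delay: Duration) {
        if component != selected {
            std::thread::sleep(delay);
        }
    }

    /// One experiment: with a (selected, delay) pair fixed, watch the
    /// progress points for a window and report the observed rate. Repeat
    /// across components and speedup factors until the results are
    /// statistically interesting.
    pub fn run_experiment(window: Duration) -> f64 {
        let before = PROGRESS.load(Ordering::Relaxed);
        let start = Instant::now();
        std::thread::sleep(window); // workload threads run, calling maybe_delay
        let after = PROGRESS.load(Ordering::Relaxed);
        (after - before) as f64 / start.elapsed().as_secs_f64()
    }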

The way coz achieves this is by managing experiments with a "global profiler". This thing runs in its own OS thread, polls the progress points and keeps a record of the speedup experiments already run and their effect on the program. Through an LD_PRELOAD override coz injects overridden POSIX and pthread functions (the interposition trick itself is sketched after the list) to

  1. make select functions participate in ongoing experiments (by sleeping for an experimentally determined time, i.e., slowing down) and
  2. start a per-thread profiler complete with timer interrupt.
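The interposition mechanism looks roughly like the sketch below -- assuming the libc crate and a cdylib crate type, and using sched_yield purely as a stand-in with a conveniently small signature, not necessarily one of the functions coz actually wraps:

    // Built as a `cdylib` and loaded via LD_PRELOAD, this definition shadows
    // the libc symbol of the same name.
    #[no_mangle]
    pub extern "C" fn sched_yield() -> libc::c_int {
        // A real causal profiler would consult the current experiment here
        // and sleep for the experimentally chosen delay before carrying on.

        // Forward to the real sched_yield via RTLD_NEXT.
        unsafe {
            let real = libc::dlsym(
                libc::RTLD_NEXT,
                b"sched_yield\0".as_ptr() as *const libc::c_char,
            );
            let real: extern "C" fn() -> libc::c_int = std::mem::transmute(real);
            real()
        }
    }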

This per-thread profiler lives at the "top" of every thread, collecting Linux perf samples -- instruction pointer, callchain -- and sleeping in coordination with, and reporting results back to, the global profiler. The interrupted thread goes back to normal operation once the interrupt is handled. Because of the use of Linux perf, coz can only understand whatever an instruction pointer points to that can also be resolved into a symbol, and thus a line. You can imagine an alternative implementation that fiddles with functions at the compiler level to insert a delay period, or more invasive application-level changes to get the same result, so long as you can preserve the interrupt timer notion.
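That resolution step might look something like the sketch below, assuming the backtrace crate for symbolization -- an assumption of mine, not necessarily what either coz or andweorc does. Hand it an instruction pointer from a sample and, if debug info is available, get back a symbol, file and line:

    use std::ffi::c_void;

    // Given an instruction pointer captured from a perf sample, ask the
    // symbolization machinery which symbol -- and, with debug info, which
    // file and line -- it falls in. An address that can't be resolved tells
    // the profiler nothing.
    pub fn resolve_ip(ip: usize) {
        backtrace::resolve(ip as *mut c_void, |symbol| {
            println!(
                "{} at {:?}:{:?}",
                symbol.name().map(|n| n.to_string()).unwrap_or_default(),
                symbol.filename(),
                symbol.lineno()
            );
        });
    }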

What is andweorc?

The andweorc project is my attempt to adapt the coz idea into a workflow that looks more like what we have with criterion. I focus on single-machine performance concerns at Datadog and have been working pretty single-mindedly on Vector since I joined on. The Vector team is excellent and we've made some serious strides in improving Vector's throughput performance. For instance, we have statistically stable, reasonably fast integrated performance monitoring and regression detection for each PR, called soak tests. Super useful. They chuck out comments like this that inform us whether our PR changes have adjusted performance and with what statistical certainty. Very handy, but it doesn't tell us why performance changed or what to do about it. That's the job of a causal profiler.

For causal profiling to work, and work well, for something like the Vector project it has to hook up well with our CI, and to hook up with our CI it needs to be cargo friendly. I have a hunch that by building a causal profiler that relaxes some of the coz constraints we can get a really useful tool for Vector and Rust generally. Specifically I'm thinking:

  • support of non-Rust languages is a non-goal
  • CI / CLI mungable results are key
  • automatic diffing in profile-points between versions is key
  • requiring the user to modify their program is o-kay.

So far I've got to the point in the repo where I have all the hard bits and bobs proved -- save one, see below -- and roughly linked together. The code is still very, very rough and the dream of it is still mostly in my mind, I think, but the outline's clear in a way it wasn't, say, a year and change (and two abandoned repos) ago when I first started seriously thinking about this attempt.

Why can't I use andweorc now?

Well, there are two important things missing in the implementation, three if you count the cargo runner. The first is that I don't have the "experiment" notion built, but the global profiler exists and the tracking for that should be relatively easy to piece together. I've already proved out resolving callchains to symbols to my satisfaction, so what remains is setup. The really hard, missing piece is interruption. I need to set up a timer on an interval per thread, have that timer send a signal to the thread and then (potentially) delay, collect samples and ship them up to the global profiler. That's missing.

And! It turns out it's kinda hard to do that today in Rust. Relevant bugs against nix and libc:

  • https://github.com/nix-rust/nix/issues/1424
  • https://github.com/rust-lang/libc/issues/2576

Anyway, progress is halted on andweorc while I poke at those. The eagle-eyed reader of the codebase will note that I'm also using a branch of perf-event to pull samples from Linux perf, but that seems less rough to manage than the signal thing.