Vector Leaks Memory
The vector project is written in Rust, a memory-safe systems language. Unless you're writing Rust code with unsafe blocks or liberal std::mem::forget statements, the compiler keeps track of allocations and where they've got to, ensuring that every allocation gets a free when it goes out of scope. Memory leaks in Rust, then, generally show up when unsafe code is involved, when atomic code is involved, or when someone keeps allocating into a collection without ever draining it. That last possibility is something Rust shares with garbage-collected languages.
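That last case needs no unsafe at all. A minimal sketch of the pattern (hypothetical, not vector code):

```rust
use std::collections::HashMap;

// A perfectly safe "leak": nothing here is unsafe, but the map is never
// drained, so it grows for as long as the program feeds it new keys.
// (Illustrative only -- this will happily eat RAM until it's killed.)
fn main() {
    let mut seen: HashMap<u64, Vec<u8>> = HashMap::new();
    for event_id in 0.. {
        seen.insert(event_id, vec![0u8; 256]);
    }
}
```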
Anyhow, on March 9 we got an error report from @karlmartink that vector was leaking memory with the Kubernetes logs source enabled. A little background on vector. Vector's a data movement tool, something that is intended to take data from a small army of "sources", normalize it into an internal schema, transform the data as configured by the end user and egress it into another small army of "sinks". Vector's got a handful of niches it fits in. A tool like vector allows the infra wing of an engineering group to vary the details of the observability stack without impacting application software, lets you pre-process observability data before it crosses the network boundary (either for PCI management or because network IO is costly in your environment) and lets you move your observability data around without being particularly greedy about system resources. We aim to make vector best in the field for throughput while using a minimal amount of space. But sometimes that goes sideways.
The bug report was interesting. The boon to end-users of vector's configurability ends up being a challenge for testing and triage. We offer a handful of build variants, mostly to do with which libc vector is linked against, and so the main thing we need to understand what vector's up to is the user's configuration, or something representative. Very helpfully Karl-Martin gave us multiple example configurations tested across the variants we publish. The major variation between the configs was which sink the logs terminated into, seen here. When terminating into the http sink -- meaning vector ships the data out to some http-speaking thing -- vector's memory use gradually climbs and keeps climbing until vector is killed for consuming too much memory. Classic leak, right? Now, when logs were sunk into the blackhole -- vector does nothing with the data, merely deallocates it -- vector's memory consumption was pretty well flat, so at least that was working.
Unfortunately it took me a few days to give the bug report proper attention since I had other work ongoing when the report came in. My thoughts immediately went to the http sink being buggy, so I asked Karl-Martin for telemetry relating to that sink and then spent the rest of the day reading the source code for the http sink, here. If you've done any HTTP work in Rust / Tokio with hyper the sink code should look fairly boring, which is a good thing, I'd argue. I read the code with an eye out for any state with a lifetime longer than a single request but didn't notice any. Hopeful that the requested experiments would turn up interesting information, I set the ticket back down and waited.
Three days later we got more data but the results ruined any hypothesis I had. There was clearly some bug, and it got worse if we flipped on vector's internal telemetry, but there was nothing to suggest that the http sink was at fault, either in the new data or in my reading of the source code. I had run vector through valgrind's dhat by this point with a config that aped the user's, but the results were murky. At this time vector still used pre-1.0 tokio -- another engineer on the team was busy making the upgrade to 1.x when this ticket came in -- and our version of tokio made many small, short-lived allocations that made it difficult to understand what I was seeing. There were three main areas of allocation according to dhat:
- tokio timers (we use a lot of timers and they used to imply allocation)
- smallish BTreeMap instances (the vector internal data model stores metadata in these)
- metrics-rs histograms
The last was a surprise to me and I discounted it as noise, especially because the paths into these allocations came through tokio. Vector relies on metrics-rs to supply our internal telemetry, anything that comes through the "internal metrics" source. Whether or not that source is enabled, those metrics are collected and combined with tracing information. We generate a non-trivial amount of telemetry for every event that comes into vector, a cost I had previously looked at reducing. I'd fixated on the tokio signal and resolved to pick the bug back up once the upgrade to tokio 1.x was completed, which finally happened on April 1. Once this upgrade was in vector master I re-ran my tests and found that the timer allocations were gone from the dhat output, as expected, but the overall problem remained.
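To give a rough sense of that per-event cost, instrumentation of roughly this shape runs for every event (a sketch: the metric names are invented here and the macro-style calls assume the 0.x-era metrics API vector used at the time):

```rust
use metrics::{counter, histogram};
use std::time::Instant;

// Hypothetical per-event instrumentation, for illustration only.
fn process_event(event_bytes: usize, start: Instant) {
    counter!("events_processed_total", 1);
    counter!("processed_bytes_total", event_bytes as u64);
    // Each call below pushes a sample into metrics-rs' histogram storage --
    // the storage this investigation ends up caring about.
    histogram!("event_processing_seconds", start.elapsed().as_secs_f64());
}
```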
A day later I had managed to put together a minimal example, but:
I can observe that vector does gradually increase its memory by a few bytes here and there which does seem roughly correlated to the number of requests out that vector makes, which is why I have the batch size set so low in my config. Unfortunately there's not a direct correlation and the accumulation process does take a while as your graphs show, so we do apologize for the slow progress here. We've just finished an upgrade to vector's tokio which, as @lukesteensen pointed out to me, should resolve some known sources of fragmentation. I'll be running vector with the above config and setup under valgrind; this profile should be clearer now but it will take some time to get results.
When I wrote this my test rig was already running and I fully expected to need 24 hours to get results. Not so! The work the tokio folks put in to reduce allocations in their library really made the issue pop. After a few hours I had a lead:
The upgrade to tokio 1.0 has really paid off. @jszwedko relevant to our recent work, if you pull this massif dump open you'll see it's recording vector at 1.5Gb of heap and the majority of that is in the tracing subsystem.
Massif output had suddenly become explicable after the upgrade and wow was it clear where memory was going:
1.5GiB spent in the metric_tracing_context histogram implementation. My assumption at this point was that metrics-rs had a bucketing histogram, an array of counters for points that fall within some range, and that we'd goofed our integration with metrics-rs. I had used their prometheus integration before without observing similar leak behavior and took a peek at how their exporter drained histograms. We called a function read_histogram that leaves the underlying storage of the histogram untouched where their prometheus exporter called read_histogram_with_clear, which drains the underlying storage. At this point I was still working with the understanding that metrics-rs used an array of counters internally, though there's a hint in these two functions that this understanding was not at all accurate. But when something's a black box it's hard to shake incorrect beliefs, even when in retrospect it's clear that the belief is wrong. I adjusted the way vector calls metrics-rs and got hardly anywhere.
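Concretely, the read-versus-drain distinction looks something like this toy sketch (not metrics-rs' actual types, just the mental model):

```rust
/// A toy illustration of the two read styles -- NOT the metrics-rs
/// implementation, just the idea.
struct ToyHistogram {
    samples: Vec<f64>,
}

impl ToyHistogram {
    /// Read-only: callers see a copy of the samples and the underlying
    /// storage stays put, free to keep growing.
    fn read(&self) -> Vec<f64> {
        self.samples.clone()
    }

    /// Read-and-drain: callers take the samples and the underlying
    /// storage is emptied, so it cannot grow without bound.
    fn read_and_clear(&mut self) -> Vec<f64> {
        std::mem::take(&mut self.samples)
    }
}
```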
It's here I realized I needed to understand how metrics-rs actually functioned. Now, metrics-rs has this notion of a Recorder, a kind of global thing that sits in your program and catches changes to telemetry. Vector at this point had its own, but it was a wrapper over the Handle concept from metrics_util. All we really did was keep track of a "cardinality counter", the total number of unique keys being tracked, and then defer entirely to metrics-rs' code. That said, I wasn't at all familiar with the metrics-rs code base -- even though I had made a PR to it -- and it was clearly time to fix that. Turns out, I was totally wrong about how the histogram works:
We're calling read_histogram here which leaves the underlying samples in place -- the metrics-rs histogram is a linked list of fixed-size blocks of samples that grows as you add more samples -- where metrics-rs' own exporters call read_histogram_with_clear, a function that clears up the internal space of the histogram. Experimentation shows this doesn't quite do the trick but it does help some. We're behind current metrics-rs and may be hitting a bug that upstream has fixed.
I had figured that the histogram in use was this one or something like it, but in fact if you trace the code paths through Handle a histogram is an Arc<AtomicBucket<f64>>! And that AtomicBucket? It's a pretty bog-standard atomic linked-list of Block<T> instances:
```rust
pub struct AtomicBucket<T> {
    tail: Atomic<Block<T>>,
}
```
where Atomic is from crossbeam-epoch. What's a Block<T>? It's this:
```rust
/// Discrete chunk of values with atomic read/write access.
struct Block<T> {
    // Write index.
    write: AtomicUsize,
    // Read bitmap.
    read: AtomicUsize,
    // The individual slots.
    slots: [UnsafeCell<T>; BLOCK_SIZE],
    // The "next" block to iterate, aka the block that came before this one.
    next: Atomic<Block<T>>,
}
```
Every node in the linked list has a fixed-size slots array that stores T instances, in this case always f64. As vector pushes data into a histogram, metrics-rs will silently allocate more underlying storage as needed. Reclamation happens in clear_with. Now, freeing memory in atomic code is a Hard Problem. You can only free the once, but part of the point of writing atomic code is to reduce coordination between threads. Embedded in the statement "you can only free the once" is an implicit coordination: something has to be the thing that frees, and it has to be sure that when it does so there is nothing left that will interact with the storage. The metrics-rs approach is to use epoch-based reclamation, explained by crossbeam-epoch like so:
An interesting problem concurrent collections deal with comes from the remove operation. Suppose that a thread removes an element from a lock-free map, while another thread is reading that same element at the same time. The first thread must wait until the second thread stops reading the element. Only then it is safe to destruct it.
Programming languages that come with garbage collectors solve this problem trivially. The garbage collector will destruct the removed element when no thread can hold a reference to it anymore.
This crate implements a basic memory reclamation mechanism, which is based on epochs. When an element gets removed from a concurrent collection, it is inserted into a pile of garbage and marked with the current epoch. Every time a thread accesses a collection, it checks the current epoch, attempts to increment it, and destructs some garbage that became so old that no thread can be referencing it anymore.
Epoch-based reclamation is cool. It's used in the Linux kernel and is high-performance as these things go. This reclamation technique is similar to "Quiescent State Based Reclamation", the primary differentiator being that in QSBR it's the responsibility of the application to declare a "quiescent period" -- a point when the thread(s) are done processing and it's safe to clean up their garbage -- whereas EBR has to detect quiescent periods without application input. There's an obvious benefit to EBR if you're implementing a collection that just happens to be full of atomic goodies. The downside of these quiescent approaches -- however that quiescence is detected -- is that they'll keep accumulating garbage until such a quiet time happens. If a quiet period never happens then you'll keep allocating. A leak!
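A minimal sketch of how this looks with crossbeam-epoch (illustrative only, not vector's code): a reader pins the current epoch, a writer swaps a value out, and the retired value is only destroyed once no pinned thread can still reach it.

```rust
use crossbeam_epoch::{self as epoch, Atomic, Owned};
use std::sync::atomic::Ordering;

fn main() {
    let slot = Atomic::new(1u64);

    // Pinning marks this thread as "active" in the current epoch.
    let guard = &epoch::pin();

    // Swap in a new value; the old allocation can't be freed immediately
    // because some pinned thread might still be reading it.
    let old = slot.swap(Owned::new(2u64), Ordering::AcqRel, guard);

    // Instead, defer destruction; crossbeam-epoch frees it once every thread
    // has moved past the epoch in which it was retired. If threads never go
    // quiet, this garbage just piles up -- the behavior described above.
    unsafe {
        guard.defer_destroy(old);
    }
    // (In real code you'd also retire whatever `slot` still holds on shutdown.)
}
```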
At this point I figured we'd found it. I opened up a PR to adjust our code to call read_histogram_with_clear and to get a playground to run more experiments in. We pinged @tobz with our findings so far and they confirmed we were using the correct read mechanism, as well as saying:
In practice, under benchmark scenarios, there are still "quiescent" periods. As soon as you read/clear the histogram, and those blocks are no longer being read, they become eligible for reclamation. If you only have one "task" doing the read/clear, then you have no concurrent readers, and thus no contention, and thus they're almost immediately reclaimed.
That's sort of what I expected. Quiescent approaches can be memory hungry but they're only rarely leakers. We were behind current metrics-rs and I figured we'd upgrade to confirm that the issue was present on that project's tip. It was. I did manage to put together a smaller project that demonstrated the issue and, more importantly, the conditions under which it disappeared. When you use metrics-rs without incorporating tracing metadata everything works fine, but as soon as you do, the histograms accumulate without end. Bummer.
I'm no Aaron Turon but I'm reasonably competent with atomic data structures, and I figured I'd need to spend the next few days understanding what the tracing integration does and then diagnosing the odd-ball behavior I was seeing. Maybe I'd add some loom tests to the code base to help figure it out. I was not looking forward to it. Atomic programming is hard and debugging these kinds of odd conditions is even harder. All the while I was debugging this, Karl-Martin was still seeing vector eat up RAM, so this wasn't an academic exercise I could toy with. Moreover, there was an idea growing at the edge of my conscious thought. So, I took my dog for a walk, had dinner with my wife and went to bed early.
When I woke up the idea had been birthed: vector doesn't need perfect sampling. My teammate Bruce had put the notion in my mind, pointing out that the sole consumers of this data were our sinks. I had seen that the sinks took the sample stream coming out of metrics-rs and processed it into buckets. It makes sense for metrics-rs to collect every sample -- it can't know how the data is going to be exported -- but vector absolutely knows how it will export the data. So, why not just collect the histogram data in the way we're going to export it? I felt reasonably confident that the problem was in the metrics-rs histogram and that if we stopped using that histogram the problem would go away.
I spent the rest of the day doing just that. While metrics-rs' loose coupling can make its use a little confusing the first time you encounter the library, it's astonishingly easy to swap bits and pieces out as needed. I replaced metrics-rs' Handle with a vector-specific Handle. The gauge and counter implementations are quite similar to upstream, but the vector histogram is a boxed fixed-size array of atomic integers. Even if the type ends up being a little wide in the end, it's still a guaranteed fixed size per histogram, no matter how un-quiescent vector is.
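Sketched from that description (bucket boundaries and names invented here, not copied from vector's source), the replacement histogram looks something like:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// A fixed-size bucketed histogram: one atomic counter per bucket plus a
/// total count. Recording a sample never allocates, so memory per histogram
/// is constant no matter how busy (or un-quiescent) the process is.
struct BucketedHistogram {
    bounds: Box<[f64]>,       // upper bound of each bucket, ascending
    counts: Box<[AtomicU64]>, // one counter per bucket, plus an overflow slot
    count: AtomicU64,
}

impl BucketedHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        let counts: Box<[AtomicU64]> =
            (0..bounds.len() + 1).map(|_| AtomicU64::new(0)).collect();
        BucketedHistogram {
            bounds: bounds.into_boxed_slice(),
            counts,
            count: AtomicU64::new(0),
        }
    }

    fn record(&self, value: f64) {
        // First bucket whose bound covers the value; the extra slot at the
        // end catches anything larger than the last bound.
        let idx = self
            .bounds
            .iter()
            .position(|b| value <= *b)
            .unwrap_or(self.bounds.len());
        self.counts[idx].fetch_add(1, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}
```

Exporting is then just a read of each counter in the shape the sink already wants, with nothing to reclaim.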
To my relief, it worked out. Vector went from allocating without bound to holding pretty constant at just over 6MiB, the bulk of that actually being pre-compiled regexes.
Our own custom Handle didn't make vector any faster, but it now sips at RAM. When I'd disabled the code paths that enabled tracing integration I'd commonly see vector use just shy of 200MiB in tests, so down to 6MiB is a substantial win for our end users.
As I write this I'm still waiting for results back from Karl-Martin but I have high hopes. At the very least one of the bugs contributing to that issue is solved.
Fun stuff.