Why does garbage collection not work as intended?

aochagavia · September 28, 2022, 2:58pm

Context

I am trying to understand how the BEAM’s GC works in a particular scenario I came across recently. I have read this article on garbage collection and also the official docs, but even then I am not sure I can explain the BEAM’s behaviour. I even found an article by someone who had a similar problem, and a solution, though without an explanation of why it works. So now I am here.

Here is a somewhat simplified description of the situation:

We spin up a GenServer, which in its init function retrieves and processes about 4MB of mostly binary data and stores it in an ETS table.
Once in a while, the GenServer retrieves and processes the data again (e.g. when there have been updates) and replaces the old data in the ETS table.
Other processes get the data directly from the ETS table, without interacting with the GenServer.

I observe the following (the numbers come from Phoenix LiveDashboard):

After spinning up the GenServer, its memory usage is 4.5MB.
After triggering reloads one after another, its memory usage grows to 10.7MB, 16.8MB, 19MB and 34.8MB respectively. It seems to stabilize at 34.8MB and might even shrink again to 15MB after some more reloads.
If we repeat the experiment above after starting the GenServer with the hibernate_after option, memory usage drops to 1.6KB every time the process is hibernated.
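For readers who want to reproduce the setup, here is a minimal sketch of the scenario described above. The module name, table name, and the synthetic 4MB payload are assumptions made for illustration and are not the original code; hibernate_after is the standard GenServer.start_link/3 option.

```elixir
defmodule DataCache do
  # Hypothetical module and table names, for illustration only.
  use GenServer

  @table :data_cache

  def start_link(opts \\ []) do
    # :hibernate_after (milliseconds) is a standard GenServer.start_link/3 option.
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :hibernate_after, 1_000))
  end

  def reload(server), do: GenServer.cast(server, :reload)

  # Readers go straight to ETS and never talk to the GenServer.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :error
    end
  end

  @impl true
  def init(:ok) do
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])
    load()
    {:ok, nil}
  end

  @impl true
  def handle_cast(:reload, state) do
    load()
    {:noreply, state}
  end

  # Stand-in for "retrieve and process about 4MB of mostly binary data".
  defp load do
    payload = :binary.copy(<<0>>, 4 * 1024 * 1024)
    :ets.insert(@table, {:payload, payload})
  end
end
```

Starting it with DataCache.start_link() and calling DataCache.reload(pid) a few times while watching the process in LiveDashboard or :observer should allow a comparison with and without hibernation.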

This seems to be the classic example of a memory leak due to binaries. Quoting a post in the forum:

The binary leak is most prominent with processes that have huge heaps - this can happen if for a normally “quiet” process you have one, infrequent operation that is extremely memory expensive. This operation will cause the heap to balloon, and later will keep the GCs rare in regular operation, since there’s still a lot of free memory left - causing the process to hold on to the binary references for longer than it should.

Question

Though I am able to accurately describe what is happening, and I can solve it using hibernation or separate tasks, I am not sure I understand why garbage collection is not working as intended. I have been unable to explain to myself what I see: is garbage collection being triggered at all? If so when? We are not storing references to anything in the GenServer state, so why is the memory not being collected right away? Could it be somehow possible that stuff ends up in the old heap? If so, why?

Also related: are there any tools I could use to answer these questions? For instance: a way to be notified of garbage collection runs, a way to observe the amount of objects in the old heap, etc.

Any help is appreciated!

cevado · September 28, 2022, 3:14pm

If I understood correctly, what is happening is that large binaries are not stored on the process heap, so the heap doesn’t fill up and the GC isn’t triggered. That ends up keeping the large binaries allocated for a longer period of time. From the BEAM book:

This means that binaries, which has a tendency to be large or even huge, can hang around for a long time after all references to the binary are dead. Note that since binaries are allocated globally, all references from all processes need to be dead, that is all processes that has seen a binary need to do a GC.
The Erlang Runtime System

aochagavia · September 28, 2022, 3:38pm

Thanks! That sounds like a logical suggestion, but I don’t know how to explain the fact that memory usage dropped at some point. I thought that could only happen after a garbage collection run. And then, if there was a garbage collection run, I don’t get why so much memory (15MB) was not reclaimed.

derek-zhou · September 28, 2022, 3:47pm

I don’t think this is the binary “leak” per se, because you are looking at the process memory, not the global binary heap. It is certainly related, though. The core issue is still that GC is not triggered frequently enough. An infrequent, memory-intensive operation in a long-lived process may not be the best use of available memory. You can:

hibernate like you did, or
spawn a process or use a short-lived task for the memory-intensive operation
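The second option could look roughly like the sketch below. The table name and the synthetic payload are assumptions for illustration; the table is created as :public so that the task, rather than the owning process, can write to it.

```elixir
# Hypothetical table; :public so that the short-lived task may write to it.
table = :ets.new(:task_loaded_cache, [:set, :public])

# Stand-in for the expensive "retrieve and process" step.
load = fn ->
  payload = :binary.copy(<<0>>, 4 * 1024 * 1024)
  :ets.insert(table, {:payload, payload})
  :ok
end

# The task's heap, and its references to any binaries, go away when it exits.
:ok = load |> Task.async() |> Task.await(30_000)

IO.puts("objects in table: #{:ets.info(table, :size)}")
```

Because the heavy work happens in a process that exits right away, the long-lived process never grows its own heap for it.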

cevado · September 28, 2022, 3:50pm

It’s difficult to answer without knowing exactly what happens in that GenServer, but I’m assuming that the GenServer is doing whatever it is supposed to do. When it does trigger the GC, it is going to keep only “recent data”, so it keeps references to a few versions of this large binary. Since everything you’re doing lives outside the heap (large binaries, the ETS table), it very rarely triggers the GC.

As for a solution, you can do as @derek-zhou suggested.

aochagavia · September 28, 2022, 4:11pm

Thanks again. I was assuming that the LiveDashboard view of the process would somehow include the binaries, but now that I think of it, that probably does not make much sense.

I might have time later to prepare a minimal reproducible example in code, but in the meantime, I would be very interested to know how to investigate the cause of this in the first place (the solution is clear, and in fact I am using hibernation right now). Since Erlang has been out there for so long, I assume there must be tooling to diagnose garbage collection issues. Or is looking at the code the only way to do it? I can imagine I will come across this kind of problem in the future.

sezaru · September 28, 2022, 5:17pm

I guess you are probably already aware of it, but if not, you can use :observer.start() to spawn Erlang’s observer and there you can see how much binary is being used globally, per process, per ets table, etc.

cevado · September 28, 2022, 5:23pm

Is this biting you in any way, or are you just striving for very low memory use?
Unless you’re running on embedded devices, I wouldn’t worry about that. If that’s the case, you can always change the max_heap_size for the entire Erlang node or for individual processes, and you can trigger garbage collection manually on a per-process basis too.

But again, I would only bother with that if it’s hurting you in some way; otherwise it isn’t worth the effort of tuning things just for the sake of low memory consumption.
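As a rough illustration of the manual collection mentioned above, the sketch below lets a throwaway process build and drop a large list, then forces a collection from the outside with :erlang.garbage_collect/1. The process and the list size are made up for the example.

```elixir
parent = self()

pid =
  spawn(fn ->
    # Build a large list on the process heap, then drop it.
    _garbage = Enum.to_list(1..200_000)
    send(parent, :ready)

    receive do
      :stop -> :ok
    end
  end)

receive do
  :ready -> :ok
end

{:memory, before_gc} = Process.info(pid, :memory)
# Forces a collection of the target process from the outside.
true = :erlang.garbage_collect(pid)
{:memory, after_gc} = Process.info(pid, :memory)

IO.puts("before: #{before_gc} bytes, after: #{after_gc} bytes")
send(pid, :stop)
```

The idle process keeps its grown heap until something collects it, which is the same effect the original post describes for the GenServer.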

garazdawi · September 28, 2022, 7:55pm

If you do a trace on the process for garbage collection events, you will get this information whenever a GC is run (see the documentation for the erlang module).

To start such a trace you can use the runtime_tools module :dbg:

:dbg.tracer()
:dbg.p(pid, [:garbage_collection])

You can also get some information from a running process by calling :erlang.process_info(pid, :garbage_collection_info).
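A self-contained variant of the same idea, sketched with :erlang.trace/3 so that the trace messages arrive in the calling process instead of going through :dbg. The traced process and the forced collection are only there to make the example produce an event; the keys picked out of the info lists are a selection, not the full set.

```elixir
pid =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

# The calling process becomes the tracer and receives the trace messages.
:erlang.trace(pid, true, [:garbage_collection])

# Force a collection so there is something to observe.
:erlang.garbage_collect(pid)

receive do
  {:trace, ^pid, tag, info} ->
    IO.inspect(tag, label: "gc event")
    IO.inspect(Keyword.take(info, [:heap_size, :old_heap_size, :bin_vheap_size]))
after
  1_000 -> IO.puts("no trace message received")
end

# Snapshot of the same kind of counters without tracing.
{:garbage_collection_info, gc_info} =
  :erlang.process_info(pid, :garbage_collection_info)

IO.inspect(Keyword.take(gc_info, [:heap_size, :old_heap_size, :bin_vheap_size, :bin_old_vheap_size]))
send(pid, :stop)
```

The old_heap_size value is one way to check whether data has been promoted to the old heap, which was one of the questions in the original post.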

ityonemo · September 28, 2022, 9:18pm

Are you parsing parts of the binary data and lifting binaries derived from it into the ETS table? This will result in the ETS table “holding on” to the binary data, because each of those smaller binaries is kept as a reference to the parent one, which then cannot be garbage collected. Then, finally, when you flush the ETS table and replace it, the parent process GCs the ETS reference, and that in turn allows the GC to finally let go of the initial “huge binary”.

In general, when stashing content into an ETS table it’s probably a good idea to copy the binary. If you’re parsing huge JSON documents, use the copy binary option instead of the reference binary (default) option.

You should probably also consider directly deleting the ETS table when you refresh, instead of relying on the GenServer to GC the reference, which may or may not happen when you expect.
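To make the sub-binary point concrete, here is a small sketch using :binary.referenced_byte_size/1 and :binary.copy/1. The 1MB binary and the table name are stand-ins chosen for the example.

```elixir
# A 1MB reference-counted binary standing in for the "huge" parent.
big = :binary.copy("a", 1_000_000)

# A small slice is a sub-binary that still points into the parent.
slice = binary_part(big, 0, 10)
IO.inspect(:binary.referenced_byte_size(slice), label: "bytes kept alive by slice")

# :binary.copy/1 produces an independent binary holding only those 10 bytes.
copy = :binary.copy(slice)
IO.inspect(:binary.referenced_byte_size(copy), label: "bytes kept alive by copy")

# Hypothetical table: store the copy, not the slice.
table = :ets.new(:copies_only, [:set, :public])
:ets.insert(table, {:key, copy})
```

Storing the copy means the table no longer pins the whole parent binary in memory.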

aochagavia · October 3, 2022, 9:13am

Thank you all for the great answers! My takeaways (also after reading other posts) are:

The process memory displayed by LiveDashboard does not account for reference counted binaries (except for the size of the ProcBins).
Garbage collection can take place before the end of a function, which increases the chance of stuff not being cleaned up properly, because objects that will later be thrown away are still being used.
As the heap size of a process grows, garbage collection will be triggered less frequently.
Using :hibernate is not as exceptional as it sounds (e.g. LiveView uses hibernate_after with a default of 15 seconds).
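The first takeaway can be checked with a sketch like the one below, which compares Process.info(pid, :memory) with the sizes reported by Process.info(pid, :binary) for a process holding a 1MB binary. The process and the binary size are assumptions made for the example.

```elixir
parent = self()

pid =
  spawn(fn ->
    big = :binary.copy("a", 1_000_000)
    send(parent, :ready)

    receive do
      # Using big here keeps the binary referenced while the process waits.
      :stop -> byte_size(big)
    end
  end)

receive do
  :ready -> :ok
end

{:memory, process_bytes} = Process.info(pid, :memory)
{:binary, refc_binaries} = Process.info(pid, :binary)

# Each entry describes one off-heap binary as {id, size, reference_count}.
off_heap_bytes =
  refc_binaries
  |> Enum.map(fn {_id, size, _refcount} -> size end)
  |> Enum.sum()

IO.puts("process memory: #{process_bytes} bytes, referenced binaries: #{off_heap_bytes} bytes")
send(pid, :stop)
```

The reported process memory stays small even though the process is keeping a megabyte of binary data alive.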