Group of GenServers uses a tremendous amount of process memory

Dear All,

Recently I came across a weird phenomenon. I had spawned a bunch of GenServers, each of which was periodically performing some computation. During the computation a large enumerable was created, but it wasn't saved anywhere, so I assumed that such an enumerable could easily be garbage collected.

This bag of processes was using a tremendous amount of memory, and the usage increased with every passing second. What is worse, the garbage collector didn't seem to be invoked even when memory usage was close to exceeding the amount available on my device. A few times I ended up with a SIGBUS signal being raised, or with the node's process being killed by the host OS.

I started digging into the problem and created a minimal program that reproduces the phenomenon - I attach its code below:

defmodule Server do
  use GenServer

  @length_of_enumerable 100_000 # length of the enumerable created in the computation
  @slacktime 1 # defines the break between computations [ms]
  @between_gc_time 30_000 # defines the break between garbage collection invocations [ms]

  @impl true
  def init(_opts) do
    computation = fn -> Enum.map(1..@length_of_enumerable, fn _ -> 1 end) end
    send(self(), :work)
    Process.send_after(self(), :clear, @between_gc_time)
    {:ok, %{computation: computation}}
  end

  @impl true
  def handle_info(:work, state) do
    state.computation.()
    Process.send_after(self(), :work, @slacktime)
    {:noreply, state}
  end

  @impl true
  def handle_info(:clear, state) do
    :erlang.garbage_collect()
    Process.send_after(self(), :clear, @between_gc_time)
    {:noreply, state}
  end
end
 
:observer.start()

how_many_processes = 1000

processes =
  Enum.map(1..how_many_processes, fn _n ->
    {:ok, pid} = GenServer.start_link(Server, nil)
    pid
  end)
 
Process.sleep(200_000)
Enum.each(processes, &Process.exit(&1, :kill))

That program allowed me to reproduce the behavior. Furthermore, I found out some other things:

  • when the number of GenServers is small enough or the enumerable is not that big, the memory usage seems to stabilize at some (more or less tremendous) level
  • the pace at which the memory usage was increasing was greater when the time between computation invocations was shorter

After invoking the garbage collector with :erlang.garbage_collect() on each of the GenServers, the memory was indeed freed.
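
For reference, a minimal sketch of how that collection can be triggered from outside the servers, assuming the processes list from the script above (:erlang.garbage_collect/1 accepts a pid and runs a collection in that process):

# force a garbage collection in every server process
Enum.each(processes, fn pid -> :erlang.garbage_collect(pid) end)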

Below I attach a plot showcasing the memory usage of the program (note that the total memory usage consists almost only of the memory used by processes).

The test was performed with 1000 GenServers, for an enumerable being a list of length 100_000, with each element being a small integer. For such a case I assume that the enumerable's size is:

length_of_list * size_of_element = length_of_list * (1 + size_of_small_integer) = 100_000 * (1 + 1) = 200_000 [words]

On my system the word size is 8 bytes, which would mean that the memory needed by each process should be around 200_000 * 8 = 1_600_000 bytes ≈ 1.6 MB.
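
This estimate can be double-checked directly - a minimal sketch, assuming a 64-bit system (:erts_debug.size/1 returns the size of a term in words):

list = Enum.map(1..100_000, fn _ -> 1 end)
words = :erts_debug.size(list)                   # 200_000 words for a flat list of small integers
bytes = words * :erlang.system_info(:wordsize)   # 1_600_000 bytes with an 8-byte word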

For 1000 processes, the amount of memory needed at once shouldn't exceed 1.6 GB.
However, the amount of used memory seems to stabilize at around 12.5 GB.
At around 40 s and 20 s on the timeline I forcefully invoked garbage collection, which shows that the amount of memory used can, in fact, be reduced. It's not reduced to the expected level, though.

I am aware that garbage collection won't occur every time the enumerable goes out of scope, but shouldn't it occur when the amount of memory used is so high that the node is using almost all of the memory available on the machine?
After a brief research I found out that this problem has already been discussed a few times, e.g. here - however, as far as I can tell, in those cases the problem was the total memory used by a node skyrocketing (as a result of a rise in the memory occupied by binaries that weren't freed). In my case, the problem is that there are processes that are each using an enormously large amount of memory.
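
For completeness, the per-process numbers and GC settings can be inspected directly - a minimal sketch, checking one of the servers started by the script above:

# pick one of the started servers and look at its heap and GC configuration
pid = hd(processes)
Process.info(pid, [:memory, :heap_size, :total_heap_size, :garbage_collection])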

I have got a few questions:

  • Has any of you ever experienced such behavior?
  • What might be the possible reason for it? All the explanations I have found relate to binaries, but in my case I am not operating on binaries, so the amount of memory allocated for binaries is constant (as shown on the screenshot).
  • What could be the solution to this problem? Personally, I feel slightly guilty every time I invoke the garbage collector on my own :frowning:

I am looking forward to your replies, and I wish you all the best,
Łukasz


Have you tried to hibernate your GenServers when they are done with their work? :slight_smile:

Also, how are you running the app? Inside Docker?

@kwando has a good point, however he did not mention why: when you hibernate a GenServer, it triggers a garbage collection each time.

I've had this problem before too, and it is related to how garbage collection works at runtime. I don't remember many details, but it has to do with the fact that big structures are written to a shared memory area and referenced through pointers instead of living in the process memory. I have used the hibernate solution on a production server and it works like a charm to this day.
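
For example, a minimal sketch of how hibernation could be requested in the Server module from the first post (returning :hibernate as the extra element of the reply tuple is a standard GenServer feature, and the :hibernate_after start option achieves a similar effect):

@impl true
def handle_info(:work, state) do
  state.computation.()
  Process.send_after(self(), :work, @slacktime)
  # hibernating shrinks the heap and forces a full garbage collection
  # before the process goes to sleep
  {:noreply, state, :hibernate}
end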

Afaik that's only true for large binaries, which are stored off-heap. That shouldn't apply to a list of integers.

I know about binaries too, but taking into consideration the fact that the structure is so big, I would expect the same rules to apply, as the memory allocated to the process itself is very small.

Looking at the screenshot, the green “Processes” line is just below the top line, which looks to be the “Binary” one. So it might indeed be relevant here.
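
For what it's worth, the same breakdown can also be checked without the :observer GUI - a minimal sketch using :erlang.memory/0:

# memory usage in bytes, broken down by category
mem = :erlang.memory()
IO.inspect(mem[:total], label: "total")
IO.inspect(mem[:processes], label: "processes")
IO.inspect(mem[:binary], label: "binary")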


Do you know of any material where the details of such behavior are mentioned? I remember reading about the solution either on the forum or on Stack Overflow; it would be interesting to understand why this happens.

The behaviour is referred to in a few places, but I'm not sure there's any real reasoning documented either. I'd expect it's done for performance reasons.
https://www.erlang.org/doc/efficiency_guide/binaryhandling.html#how-binaries-are-implemented
https://www.erlang.org/doc/efficiency_guide/advanced.html#memory
https://www.erlang.org/doc/apps/erts/garbagecollection


I assumed he would find that in the documentation; I just hinted at the keyword to look for :wink:

I am not well versed in the specifics of Erlang's garbage collection, but your problem reminded me of this talk:

She mentions a similar memory problem using Broadway. Even using the hibernate option wouldn’t fix it completely.

The workaround, if I recall correctly, was to have the memory-intensive task performed in a separate process, so that when it exits, all of its memory is reclaimed.


The workaround, if I recall correctly, was to have the memory-intensive task performed in a separate process, so that when it exits, all of its memory is reclaimed.

I've adopted this strategy after running into similar issues in the past.
Instead of relying on garbage collection within the GenServer, I'm spawning tasks using Task.Supervisor to perform the short-lived computations, whose garbage gets reclaimed once they are done, and keeping the GenServer for the high-level orchestration.

Also, there seems to be a known issue/limitation that might be related: Optimize garbage collection for processes with large heaps · Issue #5396 · erlang/otp · GitHub.
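
For illustration, a minimal sketch of that strategy applied to the Server module from the first post (MyApp.TaskSupervisor is a hypothetical name; the task supervisor would normally be started under the application's supervision tree):

# assumes a task supervisor started elsewhere, e.g.:
#   {:ok, _} = Task.Supervisor.start_link(name: MyApp.TaskSupervisor)

@impl true
def handle_info(:work, state) do
  # run the computation in a short-lived supervised task;
  # its whole heap is reclaimed as soon as the task process exits
  {:ok, _pid} =
    Task.Supervisor.start_child(MyApp.TaskSupervisor, fn -> state.computation.() end)

  Process.send_after(self(), :work, @slacktime)
  {:noreply, state}
end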


Hi, here is another thread with this topic:

It also contains some solutions to this problem.


Hello,
thank you very much for your attention and all the resources you have provided!

@kwando - I am running on the BEAM installed natively on my machine. After using :hibernate there is indeed an improvement (the plot below reflects a simulation performed under the same circumstances as the one described in my first post, the only difference being that this one uses :hibernate):


The amount of memory used at once was reduced to around 5.5 GB.

@LostKobrakai - the top line is the “Total” memory line. It's red, and it's blending with the green line indicating the “Processes” memory, so it looks as if it were orange - but the “Binary” memory is just a few kilobytes :smiley:

I have also heard about the solution of spawning a separate process to perform the computation - as far as I know, once such a Task process dies, its whole heap is reclaimed, and that is why the total amount of memory used by the system does not keep increasing.


What OS are you using? Have you tried a different OTP version?