Trying to understand Genserver memory leak

benbonnet · August 1, 2022, 3:32am

Getting started with otp; having a sample app which main purpose is to ingest data.

A genserver is subscribed to a given channel and receives load of data passed aroudn through pubsub. Its single purpose is tostore this data in the db (using ecto). It does not store any state.

  def init(_args) do
    Phoenix.PubSub.subscribe(PubSub, "CHANNEL_NAME")
    {:ok, []}
  end

  def handle_info({:matcher, data}, state) do
    upsert(data)
    {:noreply, state} # , :hibernate} added, solved memory leak, but ¯\_(⊙︿⊙)_/¯ 
  end

  defp upsert(klines) do
    Repo.insert_all(
      ModuleName,
      data, # array of struct
      on_conflict: :replace_all,
      conflict_target: [:key1, :key2, :key3]
     )
  end

Its memory invariably increase to some 100Mb, and never lowers. Googling around, found out about {:noreply, state, :hibernate}, and it solved my issue.

But no clue why this memory increase ended up to be permanent. Is it because of Ecto ? About the way the data is passed around (pubsub) ?
Hope the question is somehow relevant; would like to get a better understanding

regards

ityonemo · August 1, 2022, 4:38am

I believe hibernate triggers a GC event.

You can also trigger GC manually using :erlang.garbage_collect/0

benbonnet · August 1, 2022, 4:52am

It leads me be more accurate in my question; to try to understand when/how a genserver gets garbage collected

Or if is mandatory to manually manage a genserver’s memory

ityonemo · August 1, 2022, 4:57am

It’s not required. but sorry, I don’t have answers for you on when the GenServer gets gc’d. Are maybe your messages coming in more rapidly than you can dispatch them to the database? (Aka do you always have a message in the message queue?). This is just a guess but I think if all of the messages are dispatched, the process goes into a tail call that calls GC, but if there are messages in the queue, the process is only put on ice by the scheduler.

Could also be some weird things with binary references, which are handled differently and have different lifetimes.

Here is specific data how the Erlang GC works, but that is not just for GenServers. Erlang -- Erlang Garbage Collector. If you’re curious you can always check out the source code for the gen module: otp/gen.erl at master · erlang/otp · GitHub + gen_server module otp/gen_server.erl at master · erlang/otp · GitHub

benbonnet · August 1, 2022, 6:57am

thanks a lot for those links! and sorry did not meant to demand an answer, rather general advices

I should have mentionned, the genserver receives data every 15 minutes. It takes it something like 1 minutes to ingest the data; then it just hangs around not receiving anything, waiting for the next batch. Which leads me to think the GC was never triggered

trisolaran · August 1, 2022, 7:38am

Sounds like this could be a plausible explanation: Extremely high memory usage in GenServers - #23 by sasajuric

karlosmid · August 1, 2022, 10:35am

Hi! What helped in my team was to run function as a async Task:

Task.async(__MODULE__, : upsert, [klines])
    |> Task.await(:infinity)

In that case, when Task is done, gen server memory related to it is released immediately.

Heads up: this was in combination with when the gen server did repetitive tasks triggered by the timer.

hst337 · August 1, 2022, 10:49am

Well. it’s actually not a leakage, it’s just how the garbage collection works. I think you’re receiving big amount of structures (in the {:matcher, data}) structure, which is not collected right away

hst337 · August 1, 2022, 10:49am

That’s a bad solution, because it copies klines to the new process. The right solution is to trigger gc in the GenServer manually

karlosmid · August 1, 2022, 11:14am

@hst337 what happens with klines in gen server after call to Task.async? Those are not garbage collected? Based on our testing, gen server memory usage immediately dropped.

hst337 · August 1, 2022, 11:21am

First of all, unused data will always be garbage collected. The question here is at which point it’ll be garbage collected
Second, here klines get copied into Task process you’ve created. So it might trigger gc in GenServer. But this klines will be in the Task until it dies.

But this is a bad solution, because you don’t need a process here, you don’t need to copy this data. You just need to call the Repo.all and then trigger minor gc

benbonnet · August 2, 2022, 6:37am

I think you’re receiving big amount of structures (in the {:matcher, data} ) structure

indeed; trying to see how the whole thing behaves with large amounts

Is hibernate an “ok” solution ? it kinda solved it all; but feels like a shortcut for things i dont understand (yet)

Running such genserver to receive loads of data from different inputs could just be a plain bad idea ?

hst337 · August 2, 2022, 10:08am

hibernate is okay, but I don’t know why want to collect garbage straight away. I mean, gc will be called when there’s no space left, so you don’t have to worry about running out of memory (in this situation). And when garbage collection is initiated by VM, it will be more efficient in time, than calling GC manually every time

lud · August 2, 2022, 11:40am

The gen_server process can go into hibernation (see erlang:hibernate/3) if a callback function specifies ‘hibernate’ instead of a time-out value. This can be useful if the server is expected to be idle for a long time. However, use this feature with care, as hibernation implies at least two garbage collections (when hibernating and shortly after waking up) and is not something you want to do between each call to a busy server.

source.

What I would do is to return a short timeout value, but not zero, and manually call the garbage collector in handle_info/2.

benbonnet · August 2, 2022, 7:52pm

playing around on heroku, on purpose as am learning this whole otp thing and would like to see where thing gets “heavy”

the thing is i dont see the no space left doing anything, rather system errors (memory quota exceeded). i might see whats up on other kind of instances

lud · August 2, 2022, 10:18pm

You can list all processes in your system, then for each get the memory usage, and then print that sorted in your console. Maybe other processes are consuming too much memory.

derek-zhou · August 2, 2022, 11:01pm

Your genserver is receiving large amount of data, yet does very little “work” on them, so gc was triggered late; I think gc is triggered by the amount of “work” since last gc. Just curious, why do you need a genserver here? You can just upsert the data in the original process; ecto will take care of the atomicity.

benbonnet · August 3, 2022, 12:14am

@lud running within phoenix so i opened the dashboard route to see whats up; might explain why below, it is not the memory of a given process, but the amount of genservers running

@derek-zhou what would you mean by ‘work’ ? you bring another issue with the upsert.
this genserver receives data from numerous other running ones. Doing so in the original processes (the numerous other ones) just timed out the db super fast. Centralizing the upsert within this single genserver solved the db timeouts. This surprised me a lot, as the amount of data and queries is the very same; it just flows another way.
But you’re right, i could totally do the subscribe/upsert in the main process, not in a genserver

derek-zhou · August 3, 2022, 2:16am

My understanding of the GC is that it is triggered by every so many “reductions”, which is a BEAM defined metric of “work”. If running the upsert in all the originating processes jams the system, then serializing them on one process will cause serious backlog on its incoming message queue. You may have other problems.

dimitarvp · August 8, 2022, 1:07pm

Always super excited to start reading threads like these but almost all of them inevitably end with silence…