Memory leaks from GenServer processes

I have posted this on Stack Overflow as well.

I have run into some trouble lately: the memory used by my GenServer processes is very high, probably because of large binary leaks.

The problem is this: we receive large binaries through the GenServer and pass them on to the consumer, which then works with that data. These large binaries are never bound to a variable inside the GenServer itself, yet the GC never collects them.

I have tried hibernating the processes after handling the data, which partially worked: the memory used by the processes dropped a lot. But since the binaries were still not getting GC'd, the amount of memory used by them grew slowly but steadily, from 30 MB without hibernation to 200 MB with process hibernation, over about 25 minutes.

I have also tried setting :erlang.system_flag(:fullsweep_after, 0), which also helped, lowering the memory used by processes by around 20%.
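
For reference, this is roughly what both attempts look like (the module below is a simplified placeholder, not my actual code):

    defmodule MediaConsumer do
      use GenServer

      def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

      def init(:ok), do: {:ok, %{}}

      def handle_info({:chunk, binary}, state) do
        process(binary)
        # hibernating after each message forces a full GC of this process
        {:noreply, state, :hibernate}
      end

      defp process(_binary), do: :ok
    end

    # VM-wide default: processes spawned after this do a full sweep on every GC
    :erlang.system_flag(:fullsweep_after, 0)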

[screenshots: before and after]
I must say it drops to 60-70 MB used by processes from time to time.

Edit: Using :recon.bin_leak(15) frees a lot of memory. [screenshot: result of :recon.bin_leak(15)]

Anyhow, the memory usage is still high, and I am completely sure it can be fixed.

Here is a screenshot taken from Observer's Processes tab. As you can see, the GenServer processes are eating memory like the Cookie Monster.

I have researched this topic a lot, and tried all the suggestions and possible solutions given out there, yet I am still stuck in the same position.

Any help is welcome.

The code is in this GitHub repository.

Code of interest that is probably causing this, plus the applications tree. Three of the four processes there (<0.294.0>, <0.295.0>, <0.297.0>) are using 27 MB of memory.

Thank you in advance for reading.

I work on a server that processes a lot of video and audio in real time as well. One of the first things I did was write my own data-pipeline workflow logic, using only OTP. I see you are using gen_stage; after a quick glance I noticed this library buffers data, and I will quote the docs:

Defaults to 10_000 for :producer, :infinity for :producer_consumer

From what you described, it sounds like you are not consuming/processing data fast enough, and the buffer of the producer_consumer keeps growing.
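
If you do keep gen_stage, you can at least cap that buffer when returning from init/1; a rough sketch, where the stage name and the size are illustrative:

    def init(state) do
      # bound the internal buffer instead of relying on the :infinity default
      {:producer_consumer, state,
       buffer_size: 1_000,
       subscribe_to: [Producer]}
    end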

I would stay away from gen_stage, or at least make a very simple version of your server that does not use it, just for the sake of comparison.

Have you tried invoking the GC from the producer as well as the consumers?

Manually invoking the GC at an interval would not be a solution to the problem, only a temporary fix for the memory usage. The real solution would be fixing the root cause.
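
If you still want to try it, a periodic GC tick inside the GenServer would look roughly like this (the interval is illustrative):

    def init(state) do
      # poke ourselves once a minute and collect garbage on every tick
      :timer.send_interval(60_000, :gc)
      {:ok, state}
    end

    def handle_info(:gc, state) do
      :erlang.garbage_collect()
      {:noreply, state}
    end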

I will take that into account. Would it be a huge change to remove gen_stage?

So, as you say, plain OTP would probably process the data faster?

What @xlphs said is that you should test with a simple GenServer instead of GenStage. GenStage tries to buffer data to ensure that all stages always have something to do, and in this case it means the data stays in the buffer. If you stop buffering, the data will not stick around, and the GC will be able to clean it up properly.

You may be able to simulate this by simply changing the demand values to 1 to avoid buffering entirely. If this causes the desired change then the memory usage isn’t due to a leak, it’s just due to a buffer configured for more than you want.

Do you mean setting the demand option to 1 for GenStage?

https://hexdocs.pm/gen_stage/GenStage.html#c:init/1-options

This API has changed a little since I first used GenStage, but yeah, you'd lower max_demand to 1 on the consumer and the producer_consumer. The reason is that if you stick with the defaults, a stage demands and buffers a large batch of events (max_demand defaults to 1_000) before you actually see them in the consumer.

This option will look like:

 {:producer_consumer, number, subscribe_to: [{A, max_demand: 1}]}
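
A fuller init/1 might look like the sketch below, where A stands in for your producer. With max_demand: 1 the default min_demand already rounds down to 0, but spelling it out makes the no-buffering intent clear:

    def init(number) do
      {:producer_consumer, number,
       subscribe_to: [{A, max_demand: 1, min_demand: 0}]}
    end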

Very interesting, I will definitely have a look at it and post the results here later :slight_smile:

Did you end up solving your problem? Quite an interesting thread.

Honestly, not yet. I managed to lower the memory used by the module overall by reducing the number of middle stages, but the main problem was, and still is, the memory/binary leak from the GenServer processes.

The issue is that GenServer processes are consuming way too much memory, sadly.

It looks like there are a number of improvements that could be made so your stages run smoother.

For instance, in Coxir.Struct you are using ets as an in-memory backing store and have these encode/1 and decode/1 functions. They get used a lot and just turn a map into a list into a tuple and back … It looks like you are doing this so you can store the data in an ets table, which needs tuples … but … you could just do:

:ets.insert @table, {id, data_as_a_map}

Then you will have equivalent lookups:

[{_id, data}] = :ets.lookup @table, id

No moving data between different data structures constantly. Whether or not that is causing your memory usage (doubtful), it will certainly relieve some pressure on the GC and make your code faster.

In Coxir.Stage.Middle.handle/2 there are many places where you use for comprehensions but discard the results. for creates a list out of every single result … if you do not use those results, the for is building a list for no good reason. What you probably want instead is Enum.each/2, which just calls the function for each entry in the enumerable, IOW: for the side effects. List comprehensions are there to build lists, not to create side effects.

More plainly: for in Elixir is not equivalent to a for-loop in your typical imperative language. Enum.each/2 is closer to loops that do not change external variables, and Enum.reduce/3 is closer to those that do.
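
Concretely, using one of the calls from the repo as an illustration:

    # builds a list of :ok results that nobody ever reads
    for guild <- data.guilds, do: handle(:GUILD_UPDATE, guild)

    # runs the function purely for its side effects and returns :ok
    Enum.each(data.guilds, fn guild -> handle(:GUILD_UPDATE, guild) end)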

There is also interesting code in Consumer like:

handler
|> apply(:handle_event, [event, state])

where you could just do:

handler.handle_event(event, state)

It’s equivalent and probably easier to read, especially as you can then just do:

case handler.handle_event(event, state) do

which is, at least IMHO, a lot easier to read than the pipeline syntax used there :slight_smile: It contains all the information needed in one line (e.g. what the subject of the case is) rather than having to track back…

I also noticed you call String.to_atom/1 on externally supplied data … this is a great way to exhaust the atom table (you only get so many!) and DoS your app as an unhappy side effect :confused: You should really replace that approach.
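
A minimal sketch of a safer conversion, assuming the valid atoms already exist somewhere in your compiled code:

    defmodule SafeAtom do
      # Only convert strings whose atom already exists; anything else gets
      # :error instead of minting a brand-new atom from external input.
      def to_existing(string) when is_binary(string) do
        {:ok, String.to_existing_atom(string)}
      rescue
        ArgumentError -> :error
      end
    end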

Finally … as to where your binary “leaks” are coming from … I assume it is something like this:

  • you get a block of json as a binary from the external service
  • Jason parses that into smaller strings for you, but each of those smaller strings is just a pointer into the bigger one … and that prevents the GC from removing the big binary until every reference to the smaller strings is dropped
  • those smaller strings are put into an ets table, so they stay there until the entry is deleted, which means the big binaries also stay in memory until the entries are deleted from the ets table
  • since coxir updates those entries over time from many different json input binaries, each entry in the ets table(s) keeps multiple json binaries alive over its lifespan.

You can get around this by calling :binary.copy/1 on the specific sub-strings you are storing in the ets table to see if that is indeed the issue.
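
A minimal sketch of that experiment, where fields, @table, and id are placeholders for whatever you currently insert:

    # copy sub-binaries so the large parent JSON binary can be collected
    data =
      Map.new(fields, fn
        {key, value} when is_binary(value) -> {key, :binary.copy(value)}
        pair -> pair
      end)

    :ets.insert(@table, {id, data})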

I am at work right now so I can’t test all this, but I will definitely try every single suggestion.

Thank you very much for such a detailed answer :smiley:

No problem; it's really nice to see people taking on projects like yours with Elixir! Best of luck and happy hacking …

Would it be something like this + this?

EDIT: Ignore the Enum.each comments, I mistook it for map. My bad.

First commit:

  • Not sure why you use Poison. Jason is the newer library and is faster.
  • You still have not made your code use string map keys instead of atoms. You only get 1_048_576 atoms by default (here is the source) and they are never garbage-collected, so it's easy to execute a denial-of-service attack against Elixir code that blindly accepts all atoms thrown at it via external data. Admittedly this is not a major concern in the grand scheme of things (you will probably protect your API behind an authentication wall), but since you are trying to troubleshoot a memory consumption problem, I believe it's worth the effort to modify your code to never use atom JSON keys; see the sketch after this list. I personally also believe your producer/consumer queues hold on to strings for a long time and that's your real problem, but hey, why not try everything since you decided to tackle it?
  • Outside of those, I believe the commit mostly covers what @aseigo meant earlier.
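
On the atom-keys point: Jason keeps JSON object keys as strings unless you explicitly ask for atoms, so the safe decode is simply (the payload here is illustrative):

    payload = ~s({"id": "123", "name": "some guild"})

    # string keys by default: no atoms are created from external input
    %{"id" => id, "name" => name} = Jason.decode!(payload)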

Second commit: if you are not interested in the return value of Enum.each, I suggest the following instead.

Change all these:

Enum.each(data.guilds, fn(guild) -> handle(:GUILD_UPDATE, guild) end)

to these:

Stream.each(data.guilds, fn(guild) -> handle(:GUILD_UPDATE, guild) end)
|> Stream.run

Sure, it’s a little longer, but (1) it does not create a return value that gets passed along and modified between iterations only to eventually be discarded, and (2) it communicates the intent of the code better: you are not interested in the results of the loop, you only invoke it for the side effects.

Do this everywhere you do NOT assign the result of the Enum.each calls.

Enum.each is the right thing to use when you only care about side effects; it always returns :ok and doesn't accumulate a list.
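
You can see this in iex:

    iex> Enum.each([1, 2, 3], &IO.inspect/1)
    1
    2
    3
    :ok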

OMG, I have been reading it as map all the time! Editing post.