Process memory spikes until crash - any ideas what's causing it?

Background

We have an application where process memory keeps rising continuously, without end, until the machine crashes. We have no clue why this is happening, and we need ideas for possible causes so we can research them in more detail.

Research

Our first approach was to check the processes’ state. These are workers, so we naturally assumed that each process’s state was growing without bound. After checking several processes, we concluded that their state was not big enough to occupy 30 MB, nor did it grow without bound.

Then we shifted our attention to the number of processes. Perhaps we were creating processes without ever stopping them. No such thing either.

Then we moved our attention to garbage collection. It turns out that if we issued a major GC on all our worker processes, RAM usage would go down immediately. So we started issuing a periodic major GC on our GenServers via :erlang.garbage_collect(), but the problem somehow persists.
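For reference, the periodic GC is essentially a self-scheduled tick in each worker, roughly like this sketch (the :force_gc message name and the scheduling details are simplified, not our exact code):

def handle_info(:force_gc, state) do
  :erlang.garbage_collect()                          # major GC on this process
  Process.send_after(self(), :force_gc, :timer.seconds(15))
  {:noreply, state}
end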

Our tool of choice (observer_cli) shows that the issue is clearly in process memory, but with these options out of the way, I can’t think of anything else.

Brainstorming

Does anyone have any idea what could be causing process memory to go up without end? Any guides on memory leaks would be welcome; we focused our attention on “Erlang in Anger”, but to no avail thus far.

What have you actually tried to find the problematic processes? Have you tried listing all processes, checking their memory?
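For example, something along these lines from a remote shell gives you the top offenders (just a sketch):

Process.list()
|> Enum.map(fn pid -> {pid, Process.info(pid, :memory)} end)
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)          # skip processes that died meanwhile
|> Enum.sort_by(fn {_pid, {:memory, bytes}} -> -bytes end)   # biggest first
|> Enum.take(10)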

Could be related to binaries not being garbage collected. Sounds like the same problem as here: https://stackoverflow.com/questions/43613027/solving-large-binaries-leak/43685158

It’s not binary memory if observer shows it as process memory.

Been playing with this from wobserver: https://github.com/Logflare/logflare/blob/master/lib/logflare/system_metrics/wobserver/processes.ex

Check per process memory usage. Also mailbox length.

If it’s not obvious that it’s one process or a small group of processes, then track the process count over time. Maybe it’s a lot of smaller procs building up.
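Something like this gets both numbers (a sketch; replace suspect_pid with one of the processes you’re watching):

suspect_pid = self()                                       # placeholder for a worker pid
length(Process.list())                                     # track this over time
Process.info(suspect_pid, [:memory, :message_queue_len])   # per-process memory and mailbox length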

Is there a crash dump being produced when the system crashes? If so, you might find some clues if you load the dump with the Crashdump Viewer (http://erlang.org/doc/apps/observer/crashdump_ug.html), and if not, you should try enabling them.

I was able to pinpoint a bottleneck in my application just last week using this technique, when I noticed I had a task supervisor with many messages in its mailbox at the time of the crash.
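If you do find a dump, the viewer can be started straight from a shell on any machine with OTP installed (a sketch; the path below is just an example):

:crashdump_viewer.start()                               # prompts for the dump file
# or point it directly at a file:
:crashdump_viewer.start(~c"/path/to/erl_crash.dump")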

We know which processes are problematic. They belong to a group we call Workers. We identified this by watching their memory usage go up in observer_cli. Their memory grows continuously up to 30 MB each, and because we have thousands of Worker processes, the machine eventually runs out of memory and crashes.

Per process memory usage keeps rising without end, but the mailboxes are empty.

It’s not process count. We have eliminated that possibility as the number of processes in the system is stable.

How do I enable a crash dump if the machine crashes while out of memory?
This technique could prove useful indeed!

The application can still produce a crash dump if it OOMs; in fact, the first couple of crash dump “slogans” documented in the link I shared earlier are exactly for that situation.
Unless you have disabled crash dumps or your server does not have a writable file system mounted, they should get written to the current working directory. For example, I found the erl_crash.dump file sitting at the top of my release directory, right next to the bin dir. Maybe you already have one on one of your servers too. There are some env vars you can use to configure your crash dumps, documented here:
http://erlang.org/doc/man/erl.html#environment-variables
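For a quick check from a running node, these are the variables I’d look at first (all documented on that page):

for var <- ["ERL_CRASH_DUMP", "ERL_CRASH_DUMP_SECONDS", "ERL_CRASH_DUMP_BYTES"] do
  {var, System.get_env(var)}
end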

And what do these workers do?

These workers hold a State with a connection pid (which is actually the PID of a gun process) and a map of streams (when making requests with gun, each request is a Stream).

Following is the state of each worker:

defmodule State do
  @moduledoc """
  State of the Worker. Most importantly, it saves a map of stream refs to
  URLs, so we know which URLs were affected when a stream breaks.
  """
  defstruct conn_pid: nil,
            active_streams: %{}
end

Now, what could possibly go wrong?

The only thing that comes to mind is that the active_streams map may grow forever if a stream never receives the :fin packet necessary to tell our worker the request is over (HTTP protocol details).

But we have no reason to believe such is happening…
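If it were happening, the fix would be to drop the entry whenever gun reports the stream as finished or failed, roughly like this (a sketch only; the message shapes are taken from gun’s docs and can differ between gun versions):

def handle_info({:gun_data, conn_pid, stream_ref, :fin, _data},
                %State{conn_pid: conn_pid} = state) do
  # Normal end of the response body: forget the stream.
  {:noreply, %{state | active_streams: Map.delete(state.active_streams, stream_ref)}}
end

def handle_info({:gun_error, conn_pid, stream_ref, _reason},
                %State{conn_pid: conn_pid} = state) do
  # A failed stream never gets a :fin, so it has to be dropped here as well.
  {:noreply, %{state | active_streams: Map.delete(state.active_streams, stream_ref)}}
end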

Is there a way to know how much memory (in bytes) a Map (or any variable) is occupying?

You have no reason to believe the network can fail?

Touché!
That is a very good point we are overlooking. Although, IIRC, gun requests also time out. But I am not sure how the workers behave upon timeouts.

Nice catch!

:erts_debug.size/1
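It returns the size in words, so multiply by the word size to get bytes, e.g.:

:erts_debug.size(%{some: "map"}) * :erlang.system_info(:wordsize)   # bytes on this node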

Maybe there are even old copies of values lingering around that don’t get GC’d because a GC is never triggered?

Have you tried occasionally hibernating those processes to force a GC? This might have a negative impact on their latency, though.

Each worker calls :erlang.garbage_collect() every 15 seconds, though this doesn’t seem to be doing much. How is it that hibernating a process would help me? Could you elaborate?

If a manual GC helps, they’re likely not large; they’ve just been updated a bunch.

So, your best options are hibernate, or move that state to ETS.
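For the hibernate option, a GenServer can request it via its callback return value, for example (a sketch; the tick message name is just an example):

def handle_info(:force_gc, state) do
  # Returning :hibernate runs a full-sweep GC and keeps the heap compact
  # until the next message arrives.
  {:noreply, state, :hibernate}
end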

Sorry, I’ve missed the part that you are already doing a manual GC periodically.

Not sure, but that seems excessive and in general bad practice unless you really know what you’re doing (which I don’t, so I avoid it :slight_smile:).

But if you inspect the state of one of those 30 MB processes, you can at least see how big that map is and whether it’s really 30 MB of streams.

If it’s not really 30 MB of connections, then I think just moving that state to ETS will fix it. It should be a pretty quick change, so it’s a low-risk test versus futzing around with a bunch of GC settings/calls.

We really don’t know what we are doing. We’re just desperately trying to fix this memory issue. Anything goes at this point :smiley:

How so? If my connections map is small (say 700 KB, while the process’s memory occupies 30 MB), how is it that moving the map to an ETS table will help? Wouldn’t that just make the ETS memory usage rise without end instead of the process memory?

Can you elaborate?

First, I’ll say that garbage collection in the VM is a bit of a black box for me. If I suspect a GC issue, I just try to eliminate that possibility altogether so that, if it’s not a GC issue, I can move on. You do this by not keeping data in a long-running process’s state. You move it to ETS.

Process state is garbage collected. ETS is not. When you update data in ETS, there are no extra copies in memory. When you update state in the process, copies of that state build up on the heap. Garbage collection runs whenever it runs. If you have a lot of long-running processes, this can lead to memory issues.

For example, let’s say you’re updating state once a minute. That won’t build up very fast, but then something happens that causes you to add to that map 10 times a second. This goes on for 10 minutes. So the heap builds up (lots of copies of a map getting larger), but then the activity goes back to updating state once a minute. Now we have a large heap that won’t get garbage collected soon, because that state isn’t getting updated frequently enough anymore.

I’ve dealt with this specific example because I keep queues in state in a lot of long-running processes. They stay fairly empty as long as my workers are working, but I’ve tested the scenario where they get backed up, and this is exactly what happens: memory bloats far faster, because not only do I have data in my queues, I also have every iteration of that queue on the heap.
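A minimal sketch of what that move could look like for the streams map (the table name and options here are illustrative):

# Created once by a long-lived owner (the worker itself or its supervisor):
table = :ets.new(:active_streams, [:set, :public])

# Instead of Map.put/3 on active_streams when a request starts:
stream_ref = make_ref()                                 # stands in for a gun stream ref
:ets.insert(table, {stream_ref, "https://example.com/some/url"})

# Lookup when a stream breaks, to find the affected URL:
[{^stream_ref, url}] = :ets.lookup(table, stream_ref)
IO.inspect(url, label: "affected URL")

# Instead of Map.delete/2 when the stream finishes:
:ets.delete(table, stream_ref)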

Have you used observer to look at the actual state being kept by one of these processes?

Observer, or :sys.get_state(pid)
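For example (a sketch; the pid below is illustrative, use one of the 30 MB workers from observer_cli):

pid = :erlang.list_to_pid(~c"<0.123.0>")        # replace with a real worker pid
state = :sys.get_state(pid)
map_size(state.active_streams)                  # how many streams are actually tracked
Process.info(pid, [:memory, :total_heap_size])  # :memory is in bytes, :total_heap_size in words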