Great insight, guys — that second video Koko linked was very interesting. The next time this happens, I'll dive into the process while it's running to see what's going on. When I find the concrete culprit, I'll update this thread with what was causing it.
I had a similar problem a while back. The cause ended up being a combination of factors. A GenServer that was supposed to be transient was effectively permanent, thanks to a bad return tuple from a handler that should have shut it down — instead it crashed, which caused the supervisor to restart it. That, combined with very aggressive caching of data in its state, would end up creating hundreds of copies of a process that was meant to live for about 5 minutes, and the memory footprint would climb north of 30GB.
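A minimal sketch of how that failure mode can arise (module and message names are hypothetical, not from the original thread): under a `:transient` restart strategy, a crash is an abnormal exit and triggers a restart, so a handler that returns an invalid tuple instead of stopping cleanly keeps the process alive indefinitely.

```elixir
defmodule CacheWorker do
  use GenServer

  # Assumed to be started under a supervisor with restart: :transient,
  # so abnormal exits are restarted while :normal exits are not.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Schedule the intended ~5-minute shutdown.
    Process.send_after(self(), :shut_down, :timer.minutes(5))
    {:ok, %{cache: %{}, opts: opts}}
  end

  @impl true
  def handle_info(:shut_down, state) do
    # BUG: {:stop, state} is not a valid handle_info return value,
    # so the GenServer exits with {:bad_return_value, ...} — an
    # abnormal exit — and the :transient supervisor restarts what
    # was meant to be a short-lived process.
    {:stop, state}
  end
end
```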
After fixing the termination tuple and drastically trimming back the data that ended up in state, the server hasn't gone over 400MB more than once or twice, even under heavy load.
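The corrected handler, sketched with the same hypothetical names as above: stopping with reason `:normal` means a `:transient` supervisor will not restart the process, so it really goes away at the end of its intended lifetime.

```elixir
@impl true
def handle_info(:shut_down, state) do
  # Valid stop tuple: {:stop, reason, state}. A :normal exit is
  # not restarted under restart: :transient, so the short-lived
  # process terminates for good after ~5 minutes.
  {:stop, :normal, state}
end
```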
I had a rough idea of where to look and how to trigger the bug, which helped, but selective logging of memory stats, plus :observer and wobserver (the web version) along with recon, helped enormously in confirming both the problem and the fix.
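For anyone diagnosing something similar, recon makes it quick to confirm which processes dominate memory from a remote shell. The calls below are real recon/OTP APIs; the specific pid is illustrative only:

```elixir
# In an IEx shell attached to the running node:

# Top 10 processes by memory footprint.
:recon.proc_count(:memory, 10)

# Worst offenders over a time window: top 5 by memory
# growth, sampled over 1000 ms.
:recon.proc_window(:memory, 5, 1000)

# Drill into one suspect (pid values are illustrative).
pid = pid(0, 250, 0)
Process.info(pid, [:memory, :message_queue_len, :current_function])

# Processes holding the most refc binaries — a common hidden leak.
:recon.bin_leak(5)
```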