Our app keeps crashing with the slogan:

```
eheap_alloc: Cannot allocate <Number> bytes of memory (of type "heap_frag")
```
Normally, erl_crash.dump gives us enough information to trace the issue back, but this time it only provides a little information, like:

```
Current Process Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | RUNNING | ON_HEAP_MSGQ
Current Process Program counter: 0x00007fea1b178a70 ('Elixir.Ecto.Repo.Preloader':split_while/4 + 24)
Current Process CP: 0x0000000000000000 (invalid)
```
or
```
Current Process Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | RUNNING | ON_HEAP_MSGQ
Current Process Program counter: 0x00007f934b8155f8 ('Elixir.Enum':'-reduce/3-lists^foldl/2-0-'/3 + 24)
Current Process CP: 0x0000000000000000 (invalid)
```
Has anybody experienced this kind of issue, and how can we trace back to the originating process in this case?
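For context, the dump can also be opened in crashdump_viewer (the GUI that ships with OTP's observer application) to sort the dumped processes by memory and inspect their stacks and message queues; a quick sketch:

```elixir
# In an IEx shell on any machine with wx/observer available (it does not
# have to be the production node), start the viewer and load erl_crash.dump,
# then sort the Processes tab by memory:
:crashdump_viewer.start()

# Or, from the command line, use the cdv script that ships with Erlang/OTP:
#   cdv erl_crash.dump
```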
The most recent numbers are:
```
Slogan: eheap_alloc: Cannot allocate 24804345000 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24540759320 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24777923072 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24751515224 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24593363816 bytes of memory (of type "heap_frag").
```
I don't think this issue happens because our servers are running out of memory. We have 5 nodes, each with 16 GB of RAM, and normally each node only consumes about 4 GB. When the app server is about to crash, the node consumes all free memory within 1-2 minutes and goes down.
I will try to connect to the nodes through observer and post the result here ASAP
So figure out what is (trying to) use that memory… e.g. is a client uploading 30 GB files that are stored in memory, or are you processing "big" data and attempting Ecto inserts of GBs of data, etc.?
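If you can catch a node while memory is climbing, a quick way to see where it is going from a remote IEx shell (no observer GUI needed) is to rank processes by memory and message queue length; a rough sketch:

```elixir
# Top 10 processes by memory, with queue length and current function.
Process.list()
|> Enum.map(fn pid ->
  {pid, Process.info(pid, [:memory, :message_queue_len, :current_function, :registered_name])}
end)
# Process.info/2 returns nil for processes that died while we were iterating.
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> info[:memory] end, :desc)
|> Enum.take(10)
```

If the recon library happens to be in your deps, `:recon.proc_count(:memory, 10)` gives roughly the same answer with less ceremony.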
I read the book and ran some tests but was unable to figure anything out. The app does not crash frequently; it usually runs smoothly for days or weeks without any problem and then suddenly crashes.
Ah, that makes more sense. The crash is definitely related to runaway memory usage, so I would consider putting monitoring in place to report the VM's memory stats; that would be the first step to figure out what kind of memory is being used. Message queue length would be another important thing to track, because if a process gets overloaded or deadlocked and can't handle its messages, the queue will grow unbounded.
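Something like the following would do as a first pass; it is only a sketch, and the module name, interval, and log format are arbitrary choices:

```elixir
defmodule MemoryWatch do
  @moduledoc "Periodically logs VM memory breakdown and the longest message queue."
  use GenServer
  require Logger

  @interval :timer.seconds(30)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:report, state) do
    # :erlang.memory/0 breaks usage down into total/processes/binary/ets/atom/...
    mem = :erlang.memory()

    # Find the process with the longest message queue right now.
    {qlen, pid} =
      Process.list()
      |> Enum.map(fn pid ->
        case Process.info(pid, :message_queue_len) do
          {:message_queue_len, n} -> {n, pid}
          nil -> {0, pid}
        end
      end)
      |> Enum.max()

    Logger.info("VM memory: #{inspect(mem)} | longest message queue: #{qlen} (#{inspect(pid)})")

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :report, @interval)
end
```

Watching which :erlang.memory/0 category (processes, binary, ets, …) explodes right before the crash usually narrows things down a lot.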
Still though, the BEAM was trying to allocate 24 gigs, and it kind of seems like it is trying to allocate that 24 gigs all at once, which is… amazing… I don’t even know how that would be done. Something is massively running away allocating memory in the code… o.O
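One way to keep a single runaway process from taking the whole node down, and to get its identity logged instead of just a 24 GB allocation failure, is the max_heap_size process flag; a minimal sketch with an arbitrary limit:

```elixir
# Give the *calling* process a heap ceiling: once it grows past 100_000_000
# words (~800 MB on a 64-bit VM; the limit here is just an example value),
# the VM kills only that process, and error_logger reports which one it was.
Process.flag(:max_heap_size, %{size: 100_000_000, kill: true, error_logger: true})
```

A default for all newly spawned processes can also be set with the +hmax emulator flag, or per process via spawn_opt's :max_heap_size option.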