Our app keeps crashing with the slogan:

```
eheap_alloc: Cannot allocate <Number> bytes of memory (of type "heap_frag")
```
Normally, erl_crash.dump gives us enough information to trace the issue back, but this time it only provides a little information, like:

```
Current Process Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | RUNNING | ON_HEAP_MSGQ
Current Process Program counter: 0x00007fea1b178a70 ('Elixir.Ecto.Repo.Preloader':split_while/4 + 24)
Current Process CP: 0x0000000000000000 (invalid)
```
or
```
Current Process Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | RUNNING | ON_HEAP_MSGQ
Current Process Program counter: 0x00007f934b8155f8 ('Elixir.Enum':'-reduce/3-lists^foldl/2-0-'/3 + 24)
Current Process CP: 0x0000000000000000 (invalid)
```
Has anybody experienced this kind of issue, and how can we trace back to the originating process in this case?
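For context, the dump can also be opened in crashdump_viewer (the GUI that ships with OTP's observer application) to sort the dumped processes by memory and inspect their stacks and message queues; a quick sketch:

```elixir
# In an IEx shell on any machine with wx/observer available (it does not
# have to be the production node), start the viewer and load erl_crash.dump,
# then sort the Processes tab by memory:
:crashdump_viewer.start()

# Or, from the command line, use the cdv script that ships with Erlang/OTP:
#   cdv erl_crash.dump
```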
The most recent numbers are:
```
Slogan: eheap_alloc: Cannot allocate 24804345000 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24540759320 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24777923072 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24751515224 bytes of memory (of type "heap_frag").
Slogan: eheap_alloc: Cannot allocate 24593363816 bytes of memory (of type "heap_frag").
```
I don't think this issue happens because our servers are running out of memory. We have 5 nodes, each with 16 GB of RAM, and normally each node only consumes about 4 GB. When the app server is about to crash, the node consumes all free memory within 1-2 minutes and goes down.
I will try to connect to the nodes through observer and post the result here ASAP
So figure out what is (trying to) use that memory… e.g. is a client uploading 30 GB files that are stored in memory, or are you processing "big" data and attempting Ecto inserts of GBs of data, etc.?
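If you can catch a node while memory is climbing, a quick way to see where it is going from a remote IEx shell (no observer GUI needed) is to rank processes by memory and message queue length; a rough sketch:

```elixir
# Top 10 processes by memory, with queue length and current function.
Process.list()
|> Enum.map(fn pid ->
  {pid, Process.info(pid, [:memory, :message_queue_len, :current_function, :registered_name])}
end)
# Process.info/2 returns nil for processes that died while we were iterating.
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> info[:memory] end, :desc)
|> Enum.take(10)
```

If the recon library happens to be in your deps, `:recon.proc_count(:memory, 10)` gives roughly the same answer with less ceremony.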
I read the book and ran some tests but was unable to figure anything out. The app does not crash frequently; it usually runs smoothly for days or weeks without any problem and then suddenly crashes.
Ah, that makes more sense. The crash is definitely related to runaway memory usage, so I would consider putting monitoring in place to report the VM's memory stats; that would be the first step to figure out what kind of memory is being used. Message queue length would be another important thing to track, because if a process gets overloaded or deadlocked and can't handle its messages, the queue will grow unbounded.
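Something like the following would do as a first pass; it is only a sketch, and the module name, interval, and log format are arbitrary choices:

```elixir
defmodule MemoryWatch do
  @moduledoc "Periodically logs VM memory breakdown and the longest message queue."
  use GenServer
  require Logger

  @interval :timer.seconds(30)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:report, state) do
    # :erlang.memory/0 breaks usage down into total/processes/binary/ets/atom/...
    mem = :erlang.memory()

    # Find the process with the longest message queue right now.
    {qlen, pid} =
      Process.list()
      |> Enum.map(fn pid ->
        case Process.info(pid, :message_queue_len) do
          {:message_queue_len, n} -> {n, pid}
          nil -> {0, pid}
        end
      end)
      |> Enum.max()

    Logger.info("VM memory: #{inspect(mem)} | longest message queue: #{qlen} (#{inspect(pid)})")

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :report, @interval)
end
```

Watching which :erlang.memory/0 category (processes, binary, ets, …) explodes right before the crash usually narrows things down a lot.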
Still though, the BEAM was trying to allocate 24 gigs, and it kind of seems like it is trying to allocate that 24 gigs all at once, which is… amazing… I don’t even know how that would be done. Something is massively running away allocating memory in the code… o.O
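One way to keep a single runaway process from taking the whole node down, and to get its identity logged instead of just a 24 GB allocation failure, is the max_heap_size process flag; a minimal sketch with an arbitrary limit:

```elixir
# Give the *calling* process a heap ceiling: once it grows past 100_000_000
# words (~800 MB on a 64-bit VM; the limit here is just an example value),
# the VM kills only that process, and error_logger reports which one it was.
Process.flag(:max_heap_size, %{size: 100_000_000, kill: true, error_logger: true})
```

A default for all newly spawned processes can also be set with the +hmax emulator flag, or per process via spawn_opt's :max_heap_size option.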