Understanding the reason for VM crash (OOM) from a crash dump

Hi, we’re trying to understand why the VM crashed with an OOM error by analysing the crash dump. Here’s the output of erl_crashdump_analyzer.sh:

$ ./erl_crashdump_analyzer.sh prod-2crash-2.dump
analyzing prod-2crash-2.dump, generated on:  Wed Mar 16 07:10:32 2022

Slogan: eheap_alloc: Cannot allocate 762886488 bytes of memory (of type "old_heap").

Memory:
===
  processes: 694 Mb
  processes_used: 694 Mb
  system: 2230 Mb
  atom: 1 Mb
  atom_used: 1 Mb
  binary: 2164 Mb
  code: 0 Mb
  ets: 6 Mb
  ---
  total: 2924 Mb

Different message queue lengths (5 largest different):
===
 990 0

Error logger queue length:
===

File descriptors open:
===
  UDP:  0
  TCP:  109
  Files:  1
  ---
  Total:  110

Number of processes:
===
     990

Processes Heap+Stack memory sizes (words) used in the VM (5 largest different):
===
   1 66222786
   1 999631
   1 514838
   1 318187
   2 121536

Processes OldHeap memory sizes (words) used in the VM (5 largest different):
===
   1 999631
   3 833026
  13 514838
   1 318187
   4 196650

Process States when crashing (sum):
===
   1 CONNECTED
   1 CONNECTED|BINARY_IO
 104 CONNECTED|BINARY_IO|PORT_LOCK
   1 CONNECTED|BINARY_IO|SOFT_EOF|PORT_LOCK
   8 CONNECTED|PORT_LOCK
   1 CONNECTED|SOFT_EOF
   1 Current Process Garbing
   1 Current Process Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
   1 Garbing
   1 Internal ACT_PRIO_HIGH | USR_PRIO_HIGH | PRQ_PRIO_HIGH
   3 Internal ACT_PRIO_HIGH | USR_PRIO_HIGH | PRQ_PRIO_HIGH | OFF_HEAP_MSGQ
   3 Internal ACT_PRIO_LOW | USR_PRIO_LOW | PRQ_PRIO_LOW
   4 Internal ACT_PRIO_MAX | USR_PRIO_MAX | PRQ_PRIO_MAX
   1 Internal ACT_PRIO_MAX | USR_PRIO_MAX | PRQ_PRIO_MAX | OFF_HEAP_MSGQ
 973 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL
   1 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
   4 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | OFF_HEAP_MSGQ
 989 Waiting

The crash dump is over 1.6 GB in size, and the tool also spews out this error:

 ./erl_crashdump_analyzer.sh: line 30: let: m=Current Process Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)/(1024*1024): syntax error in expression (error token is "Process Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)/(1024*1024)")

Here are some relevant parts of the crash dump:

31 =dirty_cpu_scheduler:3
32 Scheduler Sleep Info Flags:
33 Scheduler Sleep Info Aux Work:
34 Current Process: <0.24269.0>
35 Current Process State: Garbing
36 Current Process Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
37 Current Process Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)
38 Current Process Limited Stack Trace:
39 0x00007f6a3e9bc490:SReturn addr 0x23B60130 ('Elixir.Jason.Decoder':parse/2 + 288)
40 0x00007f6a3e9bc4b8:SReturn addr 0x232685D0 ('Elixir.Tesla.Middleware.JSON':process/3 + 72)
41 0x00007f6a3e9bc4d0:SReturn addr 0x23267DF0 ('Elixir.Tesla.Middleware.JSON':decode/2 + 280)
42 0x00007f6a3e9bc4e8:SReturn addr 0x23C900D0 ('Elixir.Tesla.Middleware.Retry':retry/3 + 216)
...
47 0x00007f6a3e9bc578:SReturn addr 0x23E5AE98 ('Elixir.Oban.Queue.Executor':perform/1 + 296)
48 0x00007f6a3e9bc5a0:SReturn addr 0x23E597B0 ('Elixir.Oban.Queue.Executor':call/1 + 136)
49 0x00007f6a3e9bc5b0:SReturn addr 0x60B011E0 ('Elixir.Task.Supervised':invoke_mfa/2 + 128)
50 0x00007f6a3e9bc5f8:SReturn addr 0x60B01920 ('Elixir.Task.Supervised':reply/4 + 320)
51 0x00007f6a3e9bc610:SReturn addr 0x6C5CB950 (proc_lib:init_p_do_apply/3 + 64)
52 0x00007f6a3e9bc630:SReturn addr 0x9A0F88 (<terminate process normally>)
...
104 =dirty_io_run_queue
105 Run Queue Max Length: 0
106 Run Queue High Length: 0
107 Run Queue Normal Length: 0
108 Run Queue Low Length: 0
109 Run Queue Port Length: 0
110 Run Queue Flags: OUT_OF_WORK | HALFTIME_OUT_OF_WORK
111 =memory
112 total: 3066553160
113 processes: 728034440
114 processes_used: 728028416
115 system: 2338518720
116 atom: 1917297
117 atom_used: 1915015
118 binary: 2269481192
119 code: 42046517
120 ets: 6942480
...
24606 =proc:<0.24269.0>
24607 State: Garbing
24608 Spawned as: proc_lib:init_p/5
24609 Last scheduled in for: 'Elixir.Jason.Decoder':string/7
24610 Spawned by: <0.3530.0>
24611 Message queue length: 0
24612 Number of heap fragments: 6552
24613 Heap fragment data: 3373654
24614 Link list: [<0.3530.0>, {from,<0.3531.0>,#Ref<0.2180618153.564920321.79435>}, {from,<0.24270.0>,#Ref<0.2180618153.564920321.79437>}]
24615 Reductions: 1513280838
24616 Stack+heap: 66222786
24617 OldHeap: 0
24618 Heap unused: 3372710
24619 OldHeap unused: 0
24620 BinVHeap: 275173304
24621 OldBinVHeap: 0
24622 BinVHeap unused: 87907486
24623 OldBinVHeap unused: 28711198
24624 Memory: 556772760
24625 New heap start: 7F6A1F07F028
24626 New heap top: 7F6A3D001108
24627 Stack top: 7F6A3E9BC490
24628 Stack end: 7F6A3E9BC638
24629 Old heap start: 0
24630 Old heap top: 0
24631 Old heap end: 0
24632 Program counter: 0x00007f6b23b608b8 ('Elixir.Jason.Decoder':string/7 + 24)
24633 Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS

Things that we managed to figure out so far:

  • The VM went down while trying to allocate an extra ~760 MB of memory when it was already consuming ~3 GB,
  • The culprit is the process <0.24269.0>: an Oban job that was paging through the results of an external API and accumulating the JSON responses in memory (probably a mistake; see the sketch after this list); it looks like that particular endpoint had an unusual amount of data; after bringing the app back up, running the job would take the whole VM down again and again, and removing the job solved the problem,
  • The VM crashed when the offending process was doing garbage collection,
  • The system was not showing any signs of CPU or memory overload prior to going down,
  • It’s unlikely that the data returned by the endpoint would be huge, and definitely not in the ballpark of hundreds of megabytes,
  • We’re on Erlang 23.3.4.11 and Elixir 1.13.3,
  • We’re using the default Logger config (handle_otp_reports: true, handle_sasl_reports: false).
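
To illustrate the accumulation pattern mentioned above, here is a minimal sketch (fetch_page and handle_record are made-up stand-ins, not our actual code). Collecting every decoded page keeps the whole result set live on the worker's heap until the job finishes, whereas handling each page as it arrives lets earlier pages be garbage collected:

defmodule PagingSketch do
  # What the job effectively did: every decoded page ends up in acc,
  # so all pages are live on the heap at the same time.
  def accumulate_all(fetch_page, page \\ 1, acc \\ []) do
    case fetch_page.(page) do
      {:ok, data, next_page} -> accumulate_all(fetch_page, next_page, acc ++ data)
      :done -> acc
    end
  end

  # Safer shape: process each page and drop it before fetching the next one.
  def process_each(fetch_page, handle_record, page \\ 1) do
    case fetch_page.(page) do
      {:ok, data, next_page} ->
        Enum.each(data, handle_record)
        process_each(fetch_page, handle_record, next_page)

      :done ->
        :ok
    end
  end
end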

I’m unable to inspect the offending process in crashdump_viewer: the tool cannot handle it, reading the process heap times out and the UI disappears.

What really happened here?

My theory is that the offending process went over some per-process memory limit and crashed, and that the VM went down while logging the crash. Can this somehow be confirmed? I was not able to reproduce it.
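
One thing that can at least be checked, on a node started with the same VM arguments, is whether any per-process heap limit is configured at all (a quick sketch; a size of 0 means no limit, which is the default):

# Default max_heap_size settings for the whole node; size: 0 means no limit.
:erlang.system_info(:max_heap_size)

# The same setting for one particular process:
Process.info(self(), :max_heap_size)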

Is there a way to prevent this from happening again?

Which HTTP client are you using? You might want to look at Finch, as it reduces the amount of data copying; that can really help when dealing with large responses.
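
If you are on Tesla (the stack trace above suggests so), switching the adapter looks roughly like this; a sketch only, the MyApp.* names and versions are placeholders:

# mix.exs: {:tesla, "~> 1.4"}, {:finch, "~> 0.13"}

# In the application's supervision tree, start a shared Finch pool:
children = [
  {Finch, name: MyApp.Finch}
]

defmodule MyApp.ApiClient do
  use Tesla

  # Tesla's built-in Finch adapter; requests go through the shared pool.
  adapter Tesla.Adapter.Finch, name: MyApp.Finch

  plug Tesla.Middleware.BaseUrl, "https://api.example.com"
  plug Tesla.Middleware.JSON
  plug Tesla.Middleware.Retry
end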

Thanks, I’ll look into that! But the priority now is to pinpoint the actual reason for the crash and find measures to prevent this from happening again.

The 556 MB of memory on that single process does indeed make it look like the culprit, and it must have gotten all that data from somewhere. If the crash dump is 1.6 GB and this one process is already at over 500 MB, you could try some manual inspection of the dump, since roughly a third of the whole file belongs to this one process. Assuming it’s JSON, you might be quick in figuring out what kind of payload could have become this large.
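
For example, something along these lines will slice all of that process's sections (=proc, =proc_stack, =proc_heap, ...) out of the dump into a separate file for offline inspection; just a sketch, and the output file name is arbitrary:

"prod-2crash-2.dump"
|> File.stream!()
|> Stream.transform(false, fn line, in_section? ->
  cond do
    # A line starting with "=" begins a new dump section; keep it only if it
    # belongs to the offending pid.
    String.starts_with?(line, "=") ->
      keep = String.contains?(line, "<0.24269.0>")
      {if(keep, do: [line], else: []), keep}

    # Keep body lines only while we are inside one of that pid's sections.
    in_section? ->
      {[line], true}

    true ->
      {[], false}
  end
end)
|> Stream.into(File.stream!("proc_24269.dump"))
|> Stream.run()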

In terms of prevention, you could set a maximum heap size, either only for these worker processes using process_flag(max_heap_size, ...) or for all processes in the system with the +hmax emulator flag. E.g. set up a 300 MB per-process limit, if that is more than you ever expect a worker to need. That will ensure that individual “poison” data sets won’t kill your whole VM.
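
A minimal sketch of the per-process variant, to be run at the start of the worker (assuming a 64-bit VM, where max_heap_size is counted in 8-byte words):

# ~300 MB expressed in words: 300 * 1024 * 1024 / 8 = 39_321_600
max_heap_words = div(300 * 1024 * 1024, 8)

Process.flag(:max_heap_size, %{
  size: max_heap_words,
  kill: true,          # kill only this process when it exceeds the limit
  error_logger: true   # and log an error report when that happens
})

The system-wide equivalent would be +hmax 39321600 among the emulator flags, with +hmaxk and +hmaxel controlling the kill and error_logger behaviour.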