How do I debug this?

I have a staging application (thank goodness it’s not in production yet), and it started crashing my server because it was using too much memory. However, I’m not really sure why, since I haven’t made any changes in a long time.

In my logs it says:

Aug 15 06:58:17 PM  [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Aug 15 06:58:17 PM  [os_mon] memory supervisor port (memsup): Erlang has closed

Where should I be looking to figure out what is causing this?

Run :observer.start() in IEx (on the server) and go to the Memory tab. Do stuff with the app and check for increased memory usage.

1 Like
Erlang/OTP 24 [erts-12.3.2.2] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit]

Interactive Elixir (1.12.3) - press Ctrl+C to exit (type h() ENTER for help)

iex(name@server)1> :observer.start()
** (UndefinedFunctionError) function :observer.start/0 is undefined (module :observer is not available)
    :observer.start()

How do I enable this?
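For what it’s worth, observer is usually unavailable in a release because the :observer, :wx, and :runtime_tools applications aren’t included; here is a minimal sketch of adding them in mix.exs, assuming the host Erlang was built with wx support (MyApp is a placeholder, not a name from this thread):

# mix.exs (sketch): make the observer GUI available.
# Requires an Erlang install that was built with wx (GUI) support.
def application do
  [
    mod: {MyApp.Application, []},
    extra_applications: [:logger, :runtime_tools, :wx, :observer]
  ]
end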

To be clear, those os_mon messages do not mean your server is running out of memory. They just mean the Erlang tooling for measuring memory/CPU usage has terminated, which will always be logged when Erlang shuts down.

So, without further evidence, all we know is that Erlang is shutting down. Do your logs say something else? Do you have metrics that say something else?

If it is a Phoenix app, you can enable Phoenix.LiveDashboard, which may be easier to set up than observer: Phoenix.LiveDashboard — LiveDashboard v0.8.1

Open up the dashboard and you will be able to see if memory is growing, processes used, etc.
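For reference, the router wiring is small; a minimal sketch assuming a standard Phoenix project (the MyAppWeb names are placeholders):

# lib/my_app_web/router.ex (sketch)
import Phoenix.LiveDashboard.Router

scope "/" do
  pipe_through :browser

  # MyAppWeb.Telemetry is the telemetry supervisor the Phoenix generators create.
  live_dashboard "/dashboard", metrics: MyAppWeb.Telemetry
end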

6 Likes

OOM errors are hard to debug. :observer and LiveDashboard may help. However, when shit happens, it usually happens quickly enough that you don’t get the chance to observe clearly.

I can only offer a few high-memory pitfalls that I have seen (there’s a rough sketch of them after this list):

  • Do you have processes that do a lot of work and then idle for a long time? That can keep global binaries from being GC’ed soon enough. You can try to make those processes short-lived, or hibernate them.
  • Do you read and parse largish files? You may try using :raw mode to open the files and tuning the read_ahead size.
  • Do you make a lot of sub-strings and keep them around for a long time? A sub-binary will keep the original large binary from being GC’ed. You can try :binary.copy/1 on them.
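A rough sketch of those ideas (the module name, file path, and sizes are made up, not something from your app):

defmodule MemoryHints do
  use GenServer

  def init(state), do: {:ok, state}

  # Sub-binaries: a small slice of a large binary keeps the whole original
  # alive, because the slice is only a reference into it. Copying the slice
  # lets the original be garbage-collected.
  def first_line(path) do
    large = File.read!(path)                          # hypothetical large file
    [line | _] = String.split(large, "\n", parts: 2)  # sub-binary into `large`
    :binary.copy(line)                                # independent copy
  end

  # Large files: :raw bypasses the file-server process, and read_ahead
  # sets the buffer used for sequential reads.
  def open_raw(path) do
    File.open(path, [:raw, :binary, :read, {:read_ahead, 64 * 1024}])
  end

  # Hibernation: a process that works in bursts and then idles can shrink
  # its heap between messages by returning :hibernate from a callback.
  def handle_call({:heavy, fun}, _from, state) do
    {:reply, fun.(state), state, :hibernate}
  end
end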
5 Likes

So, without further evidence, all we know is that Erlang is shutting down. Do your logs say something else? Do you have metrics that say something else?

I am using Render, and I noticed the server going unhealthy, then dying, then restarting. When I looked at the logs, those were the error messages I saw before it restarted.

Open up the dashboard and you will be able to see if memory is growing, processes used, etc.

I have that installed, but the server had already been restarted by then, so my uptime was pretty short. I ended up upping the server RAM, and it seemed to be okay after that, although I don’t believe that is the right fix.

I’ll revert to the smaller RAM tomorrow when it’s not being used and then try checking for the things you mentioned again.

Yeah, it’s a bit tough. In my case, it happens when I hit an API endpoint, and I haven’t been able to find a culprit. It could be a long-idle process, but if my server is restarting, then that idle process would have died. Thanks for the helpful hints. I’ll try to look more closely.

Optimizing RAM usage is usually not worth the effort; this is a compromise that comes with GC languages.

RAM usage spikes are one thing and memory leaks are another, and judging by your description, you most probably have a spike.

For the record, what are the specs of your machine?

3 Likes

Not for nothing, but this is what I find APMs (like AppSignal, Scout, DataDog, NewRelic, etc.) great for. Doesn’t always give you what you need, but more often than not you can see what was happening when things went off the rails.

3 Likes

For the record, what are the specs of your machine?

512 MB, then I upgraded to 2 GB.

RAM usage spikes are one thing and memory leaks are another, and judging by your description, you most probably have a spike.

I just can’t imagine my small application spiking up to that point, so I figured I must have a bug. Although, 512 MB might be too small; what do you think?

It depends on how much of that space is left for the application, since the OS and other applications might use a part of it too.

Since the application runs after increasing RAM, just look at the profiler and check what is happening with the RAM.

1 Like

I actually have New Relic on this staging machine. I tried looking around, but didn’t notice anything strange. I was rushing to fix it, so I might have needed to look closer. I’m also not too familiar with New Relic; which page would you recommend that I look at?

Since the application runs after increasing RAM, just look at the profiler and check what is happening with the RAM.

Where is the profiler? Are you referring to the Processes tab in LiveDashboard?

For low-RAM VMs, you may want to try running in 32-bit mode to conserve memory. I have a blog post on this, albeit with fly.io:

3 Likes

Use whatever you can: observer, LiveDashboard, maybe even a system tool, as long as you can tell whether it is a spike or a permanent leak.
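If it helps, here is a quick stdlib-only check you can run in a remote IEx session to see where the memory is going (purely a sketch):

# Overall VM memory breakdown: total, processes, binary, ets, code, ...
:erlang.memory()

# The five largest processes by memory, standard library only.
Process.list()
|> Enum.map(&{&1, Process.info(&1, [:memory, :registered_name])})
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> -info[:memory] end)
|> Enum.take(5)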

1 Like

Got it! I think New Relic might help me out with this. I’ll go through some testing tomorrow to see what I find.

Honestly, I find the NR UI a baffling mess of confusion :slight_smile: I’m an AppSignal fanboy.

1 Like

Rollbar works really well too, and it’s quite cheap.

I like AppSignal as well btw.

1 Like

Yeah, I like Rollbar too, although when I last used it, it was just error reporting, not APM. I really like the “RSQL” or whatever they call it, their query language/tool.

Ah, sorry, I meant it only for errors, yeah. I’ve been investing in writing code for ingesting anything and everything into OpenObserve lately, though it doesn’t have the ready-made dashboards that f.ex. NewRelic has.

1 Like