How do I debug this?

I have a staging application (thank goodness it’s not in production yet), and it started crashing my server because it was using too much memory. However, I’m not really sure why, since I haven’t made any changes in a long time.

In my logs it says:

Aug 15 06:58:17 PM  [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Aug 15 06:58:17 PM  [os_mon] memory supervisor port (memsup): Erlang has closed

Where should I be looking to figure out what is causing this?

Run :observer.start() in IEx (on the server) and go to the Memory tab. Do stuff with the app and check for increased memory usage.

1 Like
Erlang/OTP 24 [erts-12.3.2.2] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [jit]

Interactive Elixir (1.12.3) - press Ctrl+C to exit (type h() ENTER for help)

iex(name@server)1> :observer.start()
** (UndefinedFunctionError) function :observer.start/0 is undefined (module :observer is not available)
    :observer.start()

How do I enable this?
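For what it’s worth, observer is usually unavailable in a release because the :observer, :wx, and :runtime_tools applications aren’t included; here is a minimal sketch of adding them in mix.exs, assuming the host Erlang was built with wx support (MyApp is a placeholder, not a name from this thread):

# mix.exs (sketch): make the observer GUI available.
# Requires an Erlang install that was built with wx (GUI) support.
def application do
  [
    mod: {MyApp.Application, []},
    extra_applications: [:logger, :runtime_tools, :wx, :observer]
  ]
end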

To be clear, those os_mon messages do not mean your server is running out of memory. They just mean the Erlang tooling for measuring memory/CPU usage has terminated, which will always be logged when Erlang shuts down.

So, without further evidence, all we know is that Erlang is shutting down. Do your logs say something else? Do you have metrics that say something else?

If it is a Phoenix app, you can enable Phoenix.LiveDashboard, which may be easier to set up than observer: Phoenix.LiveDashboard — LiveDashboard v0.8.1

Open up the dashboard and you will be able to see if memory is growing, processes used, etc.
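For reference, the router wiring is small; a minimal sketch assuming a standard Phoenix project (the MyAppWeb names are placeholders):

# lib/my_app_web/router.ex (sketch)
import Phoenix.LiveDashboard.Router

scope "/" do
  pipe_through :browser

  # MyAppWeb.Telemetry is the telemetry supervisor the Phoenix generators create.
  live_dashboard "/dashboard", metrics: MyAppWeb.Telemetry
end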

6 Likes

OOM errors are hard to debug. :observer and LiveDashboard may help. However, when shit happens, it usually happens quickly enough that you don’t get the chance to observe clearly.

I can only offer a few high-memory pitfalls that I have seen (there’s a rough sketch of them after this list):

  • Do you have processes that do a lot of work and then idle for a long time? That can keep global binaries from being GC’ed soon enough. You can try to make those processes short-lived, or hibernate them.
  • Do you read and parse largish files? You may try using :raw mode to open the files and tuning the read_ahead size.
  • Do you make a lot of sub-strings and keep them around for a long time? A sub-binary will keep the original large binary from being GC’ed. You can try :binary.copy/1 on them.
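A rough sketch of those ideas (the module name, file path, and sizes are made up, not something from your app):

defmodule MemoryHints do
  use GenServer

  def init(state), do: {:ok, state}

  # Sub-binaries: a small slice of a large binary keeps the whole original
  # alive, because the slice is only a reference into it. Copying the slice
  # lets the original be garbage-collected.
  def first_line(path) do
    large = File.read!(path)                          # hypothetical large file
    [line | _] = String.split(large, "\n", parts: 2)  # sub-binary into `large`
    :binary.copy(line)                                # independent copy
  end

  # Large files: :raw bypasses the file-server process, and read_ahead
  # sets the buffer used for sequential reads.
  def open_raw(path) do
    File.open(path, [:raw, :binary, :read, {:read_ahead, 64 * 1024}])
  end

  # Hibernation: a process that works in bursts and then idles can shrink
  # its heap between messages by returning :hibernate from a callback.
  def handle_call({:heavy, fun}, _from, state) do
    {:reply, fun.(state), state, :hibernate}
  end
end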
5 Likes

So, without further evidence, all we know is that Erlang is shutting down. Do your logs say something else? Do you have metrics that say something else?

I am using Render, and I noticed the server going unhealthy, then dying, then restarting. When I looked at the logs, those were the error messages I saw before it restarted.

Open up the dashboard and you will be able to see if memory is growing, processes used, etc.

I have that installed, but the server had already been restarted by then, so my uptime was pretty short. I ended up upping the server RAM, and it seemed to be okay after that, although I don’t believe that is the right fix.

I’ll revert to the smaller RAM tomorrow when it’s not being used and then try checking for the things you mentioned again.

Yeah, it’s a bit tough. In my case, it happens when I hit an API endpoint, and I haven’t been able to find a culprit. It could be a long-idle process, but if my server is restarting, then that idle process would have died. Thanks for the helpful hints. I’ll try to look more closely.

Optimizing RAM usage is usually not worth the effort; this is a compromise that comes with GC languages.

RAM usage spikes are one thing and memory leaks are another, and judging by your description, you most probably have a spike.

For the record, what are the specs of your machine?

3 Likes

Not for nothing, but this is what I find APMs (like AppSignal, Scout, DataDog, NewRelic, etc.) great for. Doesn’t always give you what you need, but more often than not you can see what was happening when things went off the rails.

3 Likes

For the record, what are the specs of your machine?

512 MB, then I upgraded to 2 GB.

RAM usage spikes are one thing and memory leaks are another, and judging by your description, you most probably have a spike.

I just can’t imagine my small application spiking up to that point, so I figured I must have a bug. Although, 512 MB might be too small; what do you think?

It depends on how much of that space is left for the application, since the OS and other applications might use a part of it too.

Since the application runs after increasing RAM, just look at the profiler and check what is happening with the RAM.

1 Like

I actually have New Relic on this staging machine. I tried looking around, but didn’t notice anything strange. I was rushing to fix it, so I might have needed to look closer. I’m also not too familiar with New Relic; which page would you recommend that I look at?

Since the application runs after increasing RAM, just look at the profiler and check what is happening with the RAM.

Where is the profiler? Are you referring to the Processes tab in LiveDashboard?

For low-RAM VMs, you may want to try running in 32-bit mode to conserve memory. I have a blog post on this, albeit with fly.io:

3 Likes

Use whatever you can: observer, LiveDashboard, maybe even a system tool, as long as you can tell whether it is a spike or a permanent leak.
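If it helps, here is a quick stdlib-only check you can run in a remote IEx session to see where the memory is going (purely a sketch):

# Overall VM memory breakdown: total, processes, binary, ets, code, ...
:erlang.memory()

# The five largest processes by memory, standard library only.
Process.list()
|> Enum.map(&{&1, Process.info(&1, [:memory, :registered_name])})
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> -info[:memory] end)
|> Enum.take(5)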

1 Like

Got it! I think New Relic might help me out with this. I’ll go through some testing tomorrow to see what I find.

Honestly, I find the NR UI a baffling mess of confusion :slight_smile: I’m an AppSignal fanboy.

1 Like

Rollbar works really well too, and it’s quite cheap.

I like AppSignal as well btw.

1 Like

Yeah, I like Rollbar too, although when I last used it, it was just error reporting, not APM. I really like the “RSQL” or whatever they call it, their query language/tool.

Ah, sorry, I meant it only for errors, yeah. I’ve been investing in writing code for ingesting anything and everything into OpenObserve lately, though it doesn’t have the ready-made dashboards that f.ex. NewRelic has.

1 Like