Troubleshooting Phoenix App in Production

I’ve had a pretty simple Phoenix app running in production on a DigitalOcean droplet for a few months, and it has just started crashing periodically. I haven’t been able to determine the cause from the production logs or the erl_crash.dump files. Everything seems fine (200s and 302s) until I see "error: run_erl[14863]: Erlang closed the connection." in the systemd journal, and then nginx starts returning 502s. No errors show up in erlang.log.x, and DigitalOcean’s resource graph isn’t showing abnormal memory usage.

I’ve got a hunch that it’s one of our users who is causing the crash and I’d like to see if I can get her to recreate it but want to have some better tools in place to figure out why it’s crashing. Any suggestions?

Elixir 1.7.4
Erlang 21.1.1
Ubuntu 18.04.2 x64
deployed with edeliver 1.6.0


I’d say look around for telemetry libraries in Elixir. Here’s one recent blog article on the topic: https://samuelmullen.com/articles/the-hows-whats-and-whys-of-elixir-telemetry/

Link to a YouTube talk through this forum: 21) ElixirConf EU 2019 - Telemetry ...and metrics for all - Arkadiusz Gil

Additionally, there has been a discussion and plans on centralising the telemetry solutions in Elixir space here: OpenCensus gains new integrations with Elixir libraries

Sorry I can’t be more helpful. I am rather new to the topic myself and only recently started gathering educational material. But I think telemetry / logging is going to be your best bet, unless somebody else has had that exact problem and chimes in with a solution.


First suspect would be running out of memory. I’ve rarely seen an Erlang system crash in production for another reason :)

Did it leave an erl_crash.dump? You can open those with observer, or look at the first few lines (it’s just a text file).
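To make "look at the first few lines" concrete, here is a minimal sketch in Python that pulls the header fields out of a crash dump. It assumes the usual layout: a "=erl_crash_dump:<version>" tag on the first line, a timestamp, then "Key: value" lines such as "Slogan:" (the reason the VM gave for going down) until the next "=section" starts. The function name read_crash_header and the max_lines cutoff are made up for illustration.

```python
def read_crash_header(path, max_lines=50):
    """Return the "Key: value" lines at the top of an erl_crash.dump file.

    Hypothetical helper: it assumes the header ends at the first line
    that starts with "=" after the opening tag line.
    """
    header = {}
    with open(path, "r", errors="replace") as dump:
        for index, raw in enumerate(dump):
            line = raw.rstrip("\n")
            if index == 0:
                # First line is the "=erl_crash_dump:<version>" tag.
                header["tag"] = line
                continue
            if line.startswith("=") or index >= max_lines:
                # A new "=section" (e.g. "=memory") ends the header.
                break
            key, sep, value = line.partition(": ")
            if sep:
                # Lines without ": " (such as the timestamp) are skipped.
                header[key] = value
    return header
```

Calling read_crash_header("erl_crash.dump").get("Slogan") should return the slogan text if the dump has one. A slogan that mentions a failed allocation would support the out-of-memory theory; anything else points somewhere different.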

Telemetry is definitely a great idea.

Edit: whoops, how did I end up in a 2-year-old thread? Oh well!