Detecting abnormal application termination

zaid · January 26, 2017, 6:46am

Hi,

I’m an Elixir beginner and was wondering what the best way is to detect an abnormal termination of an OTP application (basically, when the top level supervisor crashes).

I already have a supervision tree set up, and I can verify that it’s working as expected at the various levels by manually triggering errors (or disconnecting network cables).

The application runs on a headless Raspberry Pi, so I was hoping to set it up to send an email notification when the whole application goes down due to an error.

At the moment, I am sending notifications in the application’s start/stop callbacks, which is not ideal.

Thanks in advance.

NobbZ · January 26, 2017, 8:55am

This depends a lot on how you actually start your application.

If it runs as a service/daemon on your OS, the underlying service/daemon manager might be able to notify you on crashes or even restart your application automagically. You need to take a look at your system’s documentation to learn if and how that is possible.

Another option might be a tool like M/Monit. You may need to search for an alternative that runs on your system[1] yourself.

[1] A headless Raspberry Pi running Elixir can mean anything. Just to name a few: Raspbian, Windows IoT, Nerves, …

zaid · January 26, 2017, 8:48pm

Thank you for the suggestions, NobbZ. I’ve used Monit before to monitor Ruby/Rails apps, so I can definitely use that to start the application and restart it when it fails.

I’ll take a look at adding a monitor for the top supervisor and catch its DOWN message and see if I can get some useful error messages that way as well (a nice stacktrace would be ideal).
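A minimal sketch of what such a monitor could look like, assuming the top supervisor is registered under a name (for example `EvlDaemon.Supervisor`). The module name `TopSupervisorWatcher` and the `notify_fun` callback are hypothetical placeholders, not part of the application; the callback stands in for whatever sends the email.

```elixir
defmodule TopSupervisorWatcher do
  # Hypothetical module; neither the name nor the API comes from the application.
  use GenServer

  # Started with GenServer.start/2 (no link), so the watcher does not
  # receive an exit signal from whoever started it.
  def start(supervisor_name, notify_fun) when is_function(notify_fun, 1) do
    GenServer.start(__MODULE__, {supervisor_name, notify_fun})
  end

  def init({supervisor_name, notify_fun}) do
    case Process.whereis(supervisor_name) do
      nil ->
        # Nothing registered under that name, so there is nothing to monitor.
        {:stop, :supervisor_not_running}

      pid ->
        ref = Process.monitor(pid)
        {:ok, %{ref: ref, notify_fun: notify_fun}}
    end
  end

  # Delivered once, when the monitored process exits for any reason.
  def handle_info({:DOWN, ref, :process, _pid, reason}, %{ref: ref} = state) do
    # notify_fun is an assumed placeholder for the real notification code.
    state.notify_fun.(reason)
    {:stop, :normal, state}
  end

  def handle_info(_other, state), do: {:noreply, state}
end
```

It could be started with something like `TopSupervisorWatcher.start(EvlDaemon.Supervisor, fn reason -> IO.inspect(reason, label: "top supervisor down") end)`. Note that the `:DOWN` message carries only the exit reason, not a full crash report, and that the VM may already be shutting down by the time the callback runs, so the notification is not guaranteed to complete.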

I am using Raspbian on that Raspberry Pi 2 and running the latest versions of Erlang and Elixir on it.

NobbZ · January 27, 2017, 8:46am

Well, that’s just a supervisor supervising your supervisor. What will you do when that one crashes? Supervise it? I don’t think that supervising your “root” supervisor is a good idea. You need to monitor/supervise from an external viewpoint at some point. And even M/Monit might crash, which you need to handle as well. Back then I had a cronjob checking once an hour whether there was still a process. But who monitors cron? Well, in my case it was M/Monit…

I wish I had been able to break that circular dependency, but I didn’t know how back then, and since I dropped my server 4 years ago, I haven’t looked into it further.

zaid · January 27, 2017, 5:13pm

Yeah, you’re right. I tried monitoring the top supervisor in the hope of getting the same level of detail when it fails as the VM outputs to the console, but there isn’t anything useful.

Basically, what I was hoping to get is something similar to the following output:

17:21:21.421 [error] GenServer EvlDaemon.Connection terminating
** (stop) :connection_closed
Last message: {:tcp_closed, #Port<0.2447>}
State: %{pending_commands: %{"505" => {#PID<0.949.0>, #Reference<0.0.3.441>}}, socket: #Port<0.2447>}

=CRASH REPORT==== 26-Jan-2017::17:21:21 ===
crasher:
initial call: Elixir.EvlDaemon.Connection:init/1
pid: <0.948.0>
registered_name: 'Elixir.EvlDaemon.Connection'
exception exit: connection_closed
in function gen_server:terminate/7 (gen_server.erl, line 812)
ancestors: [<0.941.0>,'Elixir.EvlDaemon.Supervisor',<0.883.0>]
messages:
links: [<0.941.0>]
dictionary:
trap_exit: false
status: running
heap_size: 376
stack_size: 27
reductions: 611
neighbours:

But I’m probably going about this all wrong, and maybe all I need is Monit or something similar that will email me the output of my application (and then restart it) if it exits with an abnormal exit code.