Coming from a non-BEAM language, I’m used to configuring production services to send an alert message every time an exception occurs on the server. I found that with the BEAM’s fault tolerance, this approach no longer works because Elixir errors aren’t necessarily as bad: a process dies, its supervisor restarts it and the app keeps running fine. Getting an alert for each error raises a lot of false positives. Most of the time when we get an error alert, we check the app and it’s still behaving as it should. Yet it’s still interesting to get information about these errors, as it gives valuable feedback about how our app behaves. I’m thinking of turning off these alert messages while still logging errors, and instead setting up custom health checks based on the expected observable behavior of our app. I’m wondering if some of you have gone through a similar path when migrating to Elixir, and what your approach is to monitoring the health of your app.
In AppSignal you can adjust how and when you get notified about recurring errors. I guess similar methods exist in other tools as well. That way you don’t lose information, but you can adjust how the errors are treated.
Seconding @dimitarvp’s answer, but if you want to know how it is implemented at the “low level”, it is simple:
Just attach a handler to :logger and handle messages of level :error and higher. Since the logs are structured, you can easily detect which ones are caused by exceptions.
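A minimal sketch of that idea, not a definitive implementation: the module name ErrorAlertHandler, the handler id :error_alerts and the notify/2 helper are made up for illustration. It also assumes that crashes arrive with a :crash_reason entry in the event metadata, which is how Elixir's Logger documents them; depending on the Elixir version, a raw :logger handler may see crash reports in a different shape.

```elixir
defmodule ErrorAlertHandler do
  # :logger calls log/2 for every event that passes the handler's level.
  def log(%{level: level, meta: meta}, _config) do
    case Map.get(meta, :crash_reason) do
      # Assumption: exceptions and exits carry {reason, stacktrace} here.
      {reason, _stacktrace} -> notify(level, reason)
      # Plain error messages without a crash reason are ignored.
      _ -> :ok
    end
  end

  # Placeholder for a real alerting call (e-mail, chat, pager, ...).
  defp notify(level, reason) do
    IO.puts(:stderr, "[alert] #{level}: #{inspect(reason)}")
  end
end

# Only events of level :error and higher reach this handler.
:logger.add_handler(:error_alerts, ErrorAlertHandler, %{level: :error})
```

Because the handler receives the structured event rather than a formatted string, it can decide per event whether something is worth an alert.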
Hi guys, thank you for taking the time to reply! Now I realize that my question wasn’t very clear, probably because the problem isn’t very clear in my head. My question was more about policies than about tooling.
So we currently use FlexLogger to send messages with level info or higher to LogDNA (a log aggregation service). Then we configure LogDNA to send us alerts for messages with level error or higher. LogDNA already does some kind of error batching out of the box, similar to what @LostKobrakai suggested.
Still, there are cases when we get alerts because of a transient error (say, a network connectivity issue) from which the system recovers thanks to our supervision tree. As I’ve read in @sasajuric’s book, among other sources, with Elixir one shouldn’t try to obsessively catch all errors but instead just “let it crash” and design a supervision tree that enables error recovery. This contrasts with what I’m used to, for instance in Python, where exceptions are never acceptable. In other words, from my experience it seems that exceptions aren’t that important in Elixir, and at the moment I’m probably giving them too much visibility.
That’s why I’m wondering if, instead of getting notified of all errors, I should set up health checks that validate user-facing behavior and send alerts if any of these checks fail. Exceptions would then just get logged without triggering alerts. But then there’s the risk that, caught up in daily tasks, we would never look at these exceptions and would miss opportunities to anticipate issues that we could have spotted if we had seen them.
I’m sure there isn’t a general answer, since where to draw the line depends heavily on the context, but I was wondering if you’d be willing to share a bit about how you manage alerts on your production services.
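To make the health-check idea above concrete, here is a rough sketch under stated assumptions: the HealthCheck module, the check names and the check functions are all hypothetical placeholders, and a real setup would run this on a schedule and send an alert instead of printing.

```elixir
defmodule HealthCheck do
  # Runs each named check and returns the names of the ones that failed.
  # A check passes only if it returns :ok; raising, throwing or exiting
  # counts as a failure.
  def failing(checks) do
    for {name, check} <- checks, not passes?(check), do: name
  end

  defp passes?(check) do
    check.() == :ok
  rescue
    _ -> false
  catch
    _, _ -> false
  end
end

# Placeholder checks; real ones would exercise user-facing behavior.
checks = [
  homepage: fn -> :ok end,
  checkout: fn -> raise "timeout" end
]

case HealthCheck.failing(checks) do
  [] -> :ok
  failed -> IO.puts(:stderr, "health checks failing: #{inspect(failed)}")
end
```

With this split, exceptions are still logged as usual, while alerts fire only when an observable behavior is actually broken.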
Since Elixir 1.10 it is no longer needed. You can use :logger functions directly, and in Elixir 1.11 there will be Logger.delete_module_level/1 that will work (almost) exactly like the :level_config option from