How do you handle error logging with the "let it crash" philosophy?

Hello everyone,

How do you handle error logging for parts of your system that will trigger errors simply because you are following the Erlang “let it crash” philosophy?

For example, in my system I use HoneyBadger to log errors so I can identify them more easily and be notified when a new error occurs.

The issue for me is that my system connects to multiple third-party WebSockets, and the connections can be lost because the other side closes them after a few hours.

When that happens I receive a new error entry in HoneyBadger, but it is not an error that I really need to “fix”, since the WebSocket process will simply crash and the connection will be recreated.

So, how would you handle this? Would you handle it in the code so that a lost connection does not generate an error log (maybe a warn log instead), so the logs aren’t polluted with errors that don’t need fixing?

If so, wouldn’t that be the opposite of the “let it crash” philosophy, since I would, in the end, be handling that crash?
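For illustration, here’s roughly the kind of handling I mean, a minimal sketch assuming the WebSockex library (the module name and details are made up):

```elixir
defmodule MyApp.FeedClient do
  use WebSockex
  require Logger

  def start_link(url) do
    WebSockex.start_link(url, __MODULE__, %{})
  end

  def handle_frame({:text, _msg}, state) do
    # ... process the incoming message ...
    {:ok, state}
  end

  # The other side closed the socket: log at warn level and let
  # WebSockex reconnect instead of letting the process crash with
  # an error report.
  def handle_disconnect(%{reason: reason}, state) do
    Logger.warning("websocket disconnected: #{inspect(reason)}, reconnecting")
    {:reconnect, state}
  end
end
```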

Switch non-important logs to the warn level instead? I do this and filter them out in the log explorer service.
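For instance, tagging the warn with metadata makes it easy to filter downstream (the metadata key is made up):

```elixir
require Logger

# Warn instead of error, tagged so the log explorer can filter it out.
Logger.warning("upstream closed the websocket", tag: :expected_disconnect)
```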

I treat such events as non-errors. They shouldn’t pollute logs with errors or push stuff to the error reporting tool, otherwise there’s a tendency to start ignoring issues that are real errors.

I usually try to handle them gracefully, which gets easier the more often the “error” occurs. Alternatively, you can ignore the error. In Ruby, most error reporting SDKs have an option to ignore certain exceptions, since error handling is exception-based. The benefit of such an approach is that the app does not even attempt to push the errors to the error service. It seems like in Elixir you mostly need to resort to silencing them in the error service, which needlessly uses up the quota.
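You can also stop such reports before any logger-based reporting integration sees them. A minimal sketch using OTP’s :logger primary filters; the exit-reason shape here is an assumption and depends on how your WebSocket process actually exits:

```elixir
# Run once at application start. Drops crash reports whose reason
# matches an expected disconnect; everything else passes through.
:logger.add_primary_filter(
  :drop_expected_ws_disconnects,
  {fn %{meta: meta}, _arg ->
     case meta[:crash_reason] do
       # Hypothetical exit reason used by the WebSocket process.
       {{:remote_closed, _code}, _stacktrace} -> :stop
       _other -> :ignore
     end
   end, :no_arg}
)
```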

Whichever way you pick to “silence” the issue, it’s useful to consider putting a metric on top of the non-error event in question, because a big increase in the number of silenced errors is probably a symptom of a real issue. Funnily enough, pushing the errors to the error service and ignoring them there might be the easiest way to have an actual metric on how often the issue occurs :crazy_face:
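For example, a counter via :telemetry (event and metric names are made up):

```elixir
# Wherever the disconnect is observed:
:telemetry.execute([:my_app, :websocket, :disconnect], %{count: 1}, %{})

# A matching counter definition for a Telemetry.Metrics reporter:
Telemetry.Metrics.counter("my_app.websocket.disconnect.count")
```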

The way I think about the “let it crash” philosophy is that it frees you from convoluted error recovery scenarios. But that works under the assumption that the operation will be retried, either automatically (a retryable background job) or by the user (a web request), or that you don’t care. So I would consider these scenarios to follow the general idea of “let it crash”:

  • Calling Repo.get! in a retryable Oban job to get an entry from the DB which is expected to exist (see the sketch after this list),
  • Pattern matching on a specific return value of a function in a one-off script, e.g. {:ok, item} = foo(params),
  • Calling Repo.rollback midway through a transaction in a Phoenix action when something is off and some changes have already been applied.
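A minimal sketch of the first scenario (worker and helper names are made up): if the row is missing, Repo.get! raises, the job fails, and Oban retries it.

```elixir
defmodule MyApp.Workers.SyncItem do
  use Oban.Worker, queue: :default, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"item_id" => id}}) do
    # Raises Ecto.NoResultsError if the row is missing; no rescue here,
    # the job simply fails and Oban retries it later.
    item = MyApp.Repo.get!(MyApp.Item, id)
    MyApp.Sync.push(item)
    :ok
  end
end
```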

There’s a difference between “handling” a crash by trying to keep the crashing process alive and fixing the cause of the crash, and “handling” it by noting that it happened and possibly restarting the process.
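In supervision terms, the latter is just a restart policy. A minimal sketch (names illustrative):

```elixir
# The supervisor "handles" the crash only by noting it and restarting
# the child; the child's own code stays free of recovery logic.
children = [
  %{
    id: MyApp.FeedClient,
    start: {MyApp.FeedClient, :start_link, ["wss://example.com/feed"]},
    restart: :permanent
  }
]

Supervisor.start_link(children, strategy: :one_for_one, max_restarts: 10, max_seconds: 60)
```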
