Errors vs Bugs and Let it crash philosophy


#1

The "let it crash" philosophy is a little hard to grasp as a newcomer, since we are usually taught to write code defensively. But even after crossing that line and understanding how we should work with the guarantees the runtime gives us, it has been a terrible experience to work with external tools that don't or can't incorporate those concepts. I've worked with Honeybadger and am currently working with Bugsnag, and both had the same problem: there are some kinds of errors that I don't care happened, because my supervision tree will probably take care of them. I've abstracted this difference as errors versus bugs: bugs are the kind of error I really care about (my code raised on a request and 500ed it), and errors are the ones I don't (a worker for a background job library died due to an expected external resource error). I've discussed this a little with some of my co-workers, but I'd like to know how people in other contexts deal with it.

So, how do you deal with error reporting services, and how do you handle expected and unexpected errors?


#2

I normally do the same as you did: classify the errors on the reporting service. And I think this is a good thing, because even expected errors can become a pain in the butt sometimes.

For example: when you call external services that have query limit policies, at the start you might not care about the exceeded-query-limit exceptions you get, and maybe it will stay that way forever. But once you start getting too many of these exceptions, maybe it's time to pay for a plan with higher limits, or even start thinking about a caching strategy.
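
To make "too many of these exceptions" a bit more concrete, here is a rough sketch of a sliding-window counter that only flags an expected error once it crosses a threshold. It is plain Python used for illustration only; `ExpectedErrorMonitor`, `max_events` and `window_seconds` are made-up names, not part of Bugsnag, Honeybadger or any other library.

```python
import time
from collections import deque


class ExpectedErrorMonitor:
    """Hypothetical helper: counts an expected error inside a sliding window."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now=None):
        """Record one occurrence; return True once the rate is above the limit."""
        now = time.monotonic() if now is None else now
        self._timestamps.append(now)
        # Drop occurrences that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_events
```

With something like this, the rate-limit exception stays silent while it is rare, and you only get notified when `record/1` starts returning `True`, which is probably the moment to think about a bigger plan or a cache.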


#3

Giving you an example:

So, for Verk to retry a job, the job needs to raise. I have it all set up: how many times a job should retry and all that. So I have very little concern about it happening several times, because the background job structure covers that for me. What happens is, every time it happens I receive a report on Bugsnag of a very generic Erlang error saying that a gen_server terminated. The problem is, this is so generic that if I ignore this error on Bugsnag, it can hide every other error that I would like to monitor in my supervision tree. Either I get a flood of reports or I take the risk of not knowing when another gen_server dies. If my supervision tree is small, that's OK, I can try to report manually before that happens, but if the supervision tree gets bigger, I'm prone to a big failure.
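
What I would want from the tool is roughly this: group by the process and the underlying reason instead of by the generic wrapper, so I can silence one expected case without silencing every gen_server crash. A minimal sketch of the idea, in Python just for illustration; `CrashReport`, `EXPECTED` and the process/reason names are all hypothetical and do not reflect how Bugsnag actually groups errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashReport:
    wrapper: str  # generic outer error, e.g. "gen_server terminated"
    process: str  # hypothetical name of the process that died
    reason: str   # underlying exit reason


# Hypothetical allow-list of crashes the supervision tree is expected to absorb.
EXPECTED = {("JobWorker", "external_service_timeout")}


def grouping_key(report):
    """Group by process and reason, ignoring the generic wrapper."""
    return (report.process, report.reason)


def should_notify(report):
    """Notify for everything except the explicitly expected crashes."""
    return grouping_key(report) not in EXPECTED
```

Here a `JobWorker` dying from an external timeout is dropped, but any other gen_server termination, or the same worker dying for a different reason, still gets reported.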


#4

All right, I see now. How are these exceptions sent to Bugsnag? The only option I see is manually sending the exceptions and re-raising them in your Verk worker.
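
Something along these lines is what I mean by sending manually and re-raising. This is only a sketch of the pattern in Python, not Verk or Bugsnag code; `report_and_reraise`, `job` and `report` are hypothetical names, and `report` stands in for whatever call your reporting client exposes.

```python
def report_and_reraise(job, report):
    """Run job(); on failure, report the exception and then re-raise it.

    Re-raising keeps the original failure visible to the caller, so a
    retry mechanism that relies on the job raising still sees it.
    """
    try:
        return job()
    except Exception as exc:
        report(exc)  # send a specific, well-labelled report first
        raise        # then let the original exception propagate
```

The job still crashes, so retries keep working, but the report you get is the specific exception rather than the generic termination message.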


#5

My concern is not about whether the background job raises or not. I was using it just as an example of where I could get a flood of reports on a tool like Bugsnag. If I let the flood happen I can miss other errors; if I ignore the specific error, it can hide other real errors from the tool.
My point is, since those tools classify errors by the kind of raise they receive, it's troublesome to deal with them while working with a BEAM language. It seems to me that you need to choose between dealing with the flood and risking losing some errors. I wanted to know if anyone sees a path other than those two.


#6

This is an issue with Bugsnag or with the library that reports to Bugsnag. These reports can be improved, and we are currently working on that in the master branch of Rollbax, the library we use for Rollbar reporting. https://github.com/elixir-addicts/rollbax/


#7

Never heard of Rollbar, gonna take a look at Rollbar and Rollbax master, thanks :smiley:

In this specific situation I would point towards the library: since the tool just exposes a way to report those errors, what gets reported is the responsibility of the library. But the lack of a more robust way to handle the reported errors in the tool itself is kinda annoying.