The art of letting it crash, while managing state, and communicating failures?

Are there any really good resources that describe how to idiomatically “let it crash” while declaratively managing the state of the process inbox and communicating errors back to callers?

Here is an example resource for Scala, using Akka’s Actor library.

http://danielwestheide.com/blog/2013/03/20/the-neophytes-guide-to-scala-part-15-dealing-with-failure-in-actor-systems.html

I subscribe to the philosophy that happy path programming, along with a supervision model cleaning up after unhappy events, is a good thing.

But my question isn’t only about keeping the system running.

  • What is the preferred way to provide useful feedback to my callers in the event of an unhappy event?
  • Hopefully this way includes a method for centralizing error handling code away from the rest of the happy path code.
  • What is the preferred way to drop a message that causes a crash, while preserving the rest of my mailbox? I get that this may not be the preferred way to handle every crash. Ideally, I should be able to declare to the supervisor what errors trigger resume vs restart logic.

I see a lot of material on the merits of supervision and building distributed systems that heal. But not much information on the best ways to recover from, or communicate recoverable errors in stateful processes.

It seems like something is missing if “let it crash” means wiping my process state on every error and leaving my clients to timeout in every situation that isn’t happy path.

It also doesn’t seem very idiomatic to:

case result do
  {:ok, happy} -> #do happy
  {:error, "error1"} -> #handle error1
  {:error, "error2"} -> #handle error2
  ... # for each of my possible 2-6 errors, per function (many more if you deal with IO)
end

What is the elegant middle ground?

4 Likes

First of all there is a difference between:

  • Handling success
  • Handling an error that you can easily recover from
  • Handling an unexpected error

The ‘let it crash’ bit is about not handling unexpected errors. In, say, C++ and Java you have to handle errors both expected and unexpected, but in OTP you should not. Only handle expected errors where you think you can recover from them in a way that makes sense (like telling the user they forgot to enter their password).

3 Likes

But the user was expected to enter their password. It would seem like a dark art to determine between easily recoverable and unexpected. Nothing will recover this particular request and the given input wasn’t expected.

I think what you’re saying is that over time you start to bubble up a crop of common errors and these are the things you handle?

But then if your system lives long enough you end up with large amounts of error handling code explaining back to the caller the dozens of ways other callers have broken the system in the past. This error handling code then has to propagate throughout multiple processes that communicate with each other.

In this case, maybe it was the DB process that didn’t like the missing value. It crashed on a match exception. Do we fix the DB process to handle this crash and communicate an {:error, message} back to the calling process that the value is required? Should the DB process raise an exception instead? This calling process now crashes because it never expected this response, and so we add logic there, and on up through our call flow until our controller gets the response and translates it for user consumption.

And therein lies the root of my question. My current system has 50 different error codes denoting classifications of problems that are returned to the user. Sometimes it’s obvious how they should act to remediate the issue, sometimes they simply copy and paste to tech support. The point is that the user sees 50 errors, and there are probably 500 different ways the system behaves given what actually went wrong. In these cases, the system either attempts recovery internally or wraps and collapses the error into one of the 50 user-visible classifications.

I’d like this situation to get better in the new system. But I need to provide valuable information back to my user. I want a rich system of processes (with supervision), each of which could fail in a number of ways. I’d like to not have to sprinkle error handling code in each function that makes a call to a service that can fail, since that is pretty much all of them.

If you assume that every interesting call to a remote process can fail in a few interesting ways, is there a way to make this code look different from its Java/Scala counterparts? Happy path + 4 error handling blocks per handle_call is what I’m trying to avoid.

2 Likes

How to deal with failure on the code level is objective/context sensitive - so “elegance” typically deals with how you structure things around the possibility of failure.

What is the preferred way to provide useful feedback to my callers in the event of an unhappy event?

There are lots of ways a call can fail - for example look at GenServer.call/3 and :gen.do_call that it relies on - all of them ultimately lead to an exit/1. If the process doesn’t catch the exit the process will terminate. Typically the exit isn’t caught in order to recover but instead to “clean up” whatever has been “accomplished” up to that point. So for a GenServer a failed call typically results in a terminated process (in one form or another).
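As a minimal sketch of the point above: a failed GenServer.call/3 shows up in the caller as an exit, which the caller can catch if it wants to report something rather than die as well. The Flaky and Caller modules and the safe_call/2 helper are made-up names for illustration, not anything from the thread.

```elixir
defmodule Flaky do
  use GenServer

  def init(arg), do: {:ok, arg}

  # Crashes the server on purpose for :boom; answers :ping normally.
  def handle_call(:boom, _from, _state), do: raise("unexpected failure")
  def handle_call(:ping, _from, state), do: {:reply, :pong, state}
end

defmodule Caller do
  # Turns the exit produced by a failed call into an error tuple.
  def safe_call(server, request) do
    {:ok, GenServer.call(server, request)}
  catch
    :exit, reason -> {:error, {:call_failed, reason}}
  end
end

# GenServer.start/2 (no link), so the server crash does not kill the caller.
{:ok, pid} = GenServer.start(Flaky, nil)
{:ok, :pong} = Caller.safe_call(pid, :ping)
{:error, {:call_failed, _reason}} = Caller.safe_call(pid, :boom)
```

Note that the server is still gone after the second call; catching the exit only changes what the caller sees.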

What is the preferred way to drop a message that causes a crash, while preserving the rest of my mailbox?

The mailbox dies with the process and so does every message in it. So if every message is “precious” you minimize the risk of losing them by “outsourcing” the processing that could fail - i.e. take a message from the mailbox and spawn a new process to deal with that message - if it fails you only lose that message.
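A rough sketch of that "outsourcing" idea, assuming a hypothetical Dispatcher GenServer whose jobs are just integer divisions (so a job of 0 crashes): each message is handed to its own monitored process, and a crash there costs only that one message.

```elixir
defmodule Dispatcher do
  use GenServer

  def init(_arg), do: {:ok, %{pending: %{}, done: [], failed: []}}

  # Hand each message to a separate, monitored (not linked) process.
  def handle_cast({:job, n}, state) do
    dispatcher = self()
    {_pid, ref} = spawn_monitor(fn -> send(dispatcher, {:done, n, div(100, n)}) end)
    {:noreply, put_in(state.pending[ref], n)}
  end

  def handle_call(:report, _from, state) do
    {:reply, Map.take(state, [:done, :failed]), state}
  end

  def handle_info({:done, n, result}, state) do
    {:noreply, update_in(state.done, &[{n, result} | &1])}
  end

  # A worker ended; anything other than :normal means its message was lost.
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    {n, pending} = Map.pop(state.pending, ref)
    failed = if reason == :normal, do: state.failed, else: [n | state.failed]
    {:noreply, %{state | pending: pending, failed: failed}}
  end
end

{:ok, pid} = GenServer.start_link(Dispatcher, nil)
Enum.each([5, 0, 20], &GenServer.cast(pid, {:job, &1}))
Process.sleep(200) # crude wait for the workers, good enough for a demo
IO.inspect(GenServer.call(pid, :report))
```

The job for 0 ends up under :failed while the dispatcher and the rest of its mailbox carry on.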

Sometimes it’s obvious how they should act to remediate the issue, sometimes they simply copy and paste to tech support.

This sounds like a common “legacy” solution to the lack of centralized logging. Errors that are a result of user error should be expected errors and reported in such a way that the user can remedy them. Failures due to circumstances outside of the control of the user need to be logged in detail for future investigation close to the site of failure but should simply be reported as a system error (and a failed operation): “please try again later or contact tech support if the problem persists”.

5 Likes

Judging by this description, it seems to me that you first need to differentiate between business errors which can be fixed by users, and internal faults (bugs) which can only be fixed by developers.

The former case is usually handled by returning {:ok, result} | {:error, reason}, and combining this with with:

with {:ok, y} <- do_something_with(x),
     {:ok, z} <- do_something_with(y) do
  # ...
  {:ok, z}
end

And then in the top-level function, you’ll have something like:

case do_something(...) do
  {:ok, success} -> report_success(success)
  {:error, reason} -> report_error(reason)
end
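To make that concrete, here is a small self-contained sketch. The Signup module, its fields and the messages are all invented for the example; the point is only that the steps stay on the happy path and a single top-level case turns reasons into user-facing text.

```elixir
defmodule Signup do
  # Each step returns {:ok, value} or {:error, reason}. `with` stops at the
  # first clause that does not match and returns that value unchanged.
  def register(params) do
    with {:ok, email} <- fetch(params, :email),
         {:ok, password} <- fetch(params, :password),
         :ok <- check_length(password) do
      {:ok, %{email: email}}
    end
  end

  # The only place that translates reasons into messages for the user.
  def report(params) do
    case register(params) do
      {:ok, user} -> "Welcome, #{user.email}"
      {:error, {:missing, field}} -> "Please fill in #{field}"
      {:error, :password_too_short} -> "Password must be at least 8 characters"
    end
  end

  defp fetch(params, key) do
    case Map.fetch(params, key) do
      {:ok, value} -> {:ok, value}
      :error -> {:error, {:missing, key}}
    end
  end

  defp check_length(password) when byte_size(password) >= 8, do: :ok
  defp check_length(_password), do: {:error, :password_too_short}
end

IO.puts(Signup.report(%{email: "a@b.c", password: "longenough"}))
IO.puts(Signup.report(%{email: "a@b.c"}))
```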

On the other hand, an unexpected situation is something where you should let the process crash. For example, if you’re expecting some map to have the field "foo", and that key is missing, it’s a bug which users can’t fix themselves, so you can’t report anything meaningful to your users other than that the operation failed due to an internal error. The exception/crash will be logged, and some developer will need to analyze it and figure out what the problem is.

This begs the question: how does a process report an error if it crashes? The answer is: it doesn’t :) The idiomatic solution is to run a potentially failing operation in a separate process. The “master” process is responsible only for reporting the result back. This will also keep the mailbox alive. The incoming messages are kept in the master process, so crashes of job processes will not lead to the loss of those messages.

Depending on the use case, the actual implementation can vary to some extent, but the core idea is usually the same. If you don’t want the failure of A to affect the success of B, you should run A and B in separate processes.
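One possible shape of this, as a sketch only (the Master module and its run/2 function are hypothetical names): the master starts each job in a monitored process, keeps the caller’s `from` around, and replies either with the result or with a generic error if the job process dies.

```elixir
defmodule Master do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, %{})
  def run(server, fun), do: GenServer.call(server, {:run, fun})

  def init(pending), do: {:ok, pending}

  # Start the job in its own process and reply later, so the master
  # never runs the risky code itself.
  def handle_call({:run, fun}, from, pending) do
    master = self()
    {pid, _ref} = spawn_monitor(fn -> send(master, {:result, self(), fun.()}) end)
    {:noreply, Map.put(pending, pid, from)}
  end

  def handle_info({:result, pid, value}, pending) do
    {from, pending} = Map.pop(pending, pid)
    GenServer.reply(from, {:ok, value})
    {:noreply, pending}
  end

  def handle_info({:DOWN, _ref, :process, pid, _reason}, pending) do
    case Map.pop(pending, pid) do
      # The job already sent its result; nothing left to do.
      {nil, pending} ->
        {:noreply, pending}

      # The job died before answering; report a generic failure.
      {from, pending} ->
        GenServer.reply(from, {:error, :internal_error})
        {:noreply, pending}
    end
  end
end

{:ok, master} = Master.start_link()
{:ok, 4} = Master.run(master, fn -> 2 + 2 end)
{:error, :internal_error} = Master.run(master, fn -> raise "bug" end)
{:ok, :still_alive} = Master.run(master, fn -> :still_alive end)
```

The crash itself is still logged by the runtime; the caller just gets a timely answer instead of a timeout.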

6 Likes

I think this is highly dependent on what the system is designed for - I also think that “let it crash” is more in regards to distributed systems than to regular web app flows and requires a bit of explanation of a use case to be assessed.

For instance in a small project I’m working on, I have genservers holding state, they save it to the db as initial state when created, then they keep it updated after each action and through casts save it to the db (so it doesn’t block the genserver). The state is always served from the GenServer directly, because it’s set on its own initialisation. If it crashes for some reason, when restarted it always tries to fetch the state from the db record, and if none it simply creates a new one.
I don’t deal with any of Ecto’s possible errors, because I make some assumptions up to that point: no data is changed or actually written to the db that I don’t validate, and since I control all the entry points (a single one in this case), I can rely on the fact that Ecto will not crash arbitrarily unless due to a bug. The entry point performs some generic validation and then dispatches according to the request, where it’s further validated. Since the state that is used is never available to the user for direct change I can rely on it being stable.

Another example is another app I’m currently building, that relies on several os processes to control chrome instances, opened as ports, along with websockets to connect to the remote dev tools of each one of these chrome instances, and two genservers up till now to control the spawning of new chrome instances and keep track of sockets. I’m cleaning up the failing paths and tidying it up, and in the end I’ll just write a trap exit for the main genserver that holds all the accountable state and in case it breaks it will just dump it into a dets to be reused. But the thing is, this process should never crash unless there’s a power failure or something really “exceptional”, because it is important I have to make sure that all interactions with it are sanitised, so I’m moving the logic that can break towards the edges, where processes can fail and be restarted (if necessary fetching from the state of the genserver that is important).

They will have completely different strategies, since in one I have a DB and in the other one I don’t, in one I deal with the state that only pertains to a singular game, so it can be restarted without a problem, whereas in the other the genserver holds a whole bunch of linked information (OS PIDs, Process PIDs, Socket PIDs, urls, etc etc) for many different processes, and I can’t just “let it crash” so I’m trying to move stuff to the edges and making sure that all interactions are sane whenever another process tries to talk with this one.

Having said this, I would also like to see some more guides on handling error and exceptions with useful patterns.

Regarding your particular question and comments, I think you need to bubble the error up, but in the form of {:error, your_own_error_struct}, so that then you have only two possible outcomes for the case, either {:ok, something} or {:error, something}. I also think that you should separate through layers. For instance, a missing password on a submission shouldn’t even reach your function logic; it should be taken care of, perhaps through function pattern matching on the args, before it has any opportunity to cascade into the inner workings of your processes. If you really have 50 errors that can come out of any given flow, then perhaps write a module that works as an error interface and from which you can call a function like error_explanation(:some_error) to return specific error messages or data pertaining to that error.
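A small sketch of what that could look like. AppError, Login, the error codes and the messages are all hypothetical; only the shape ({:ok, _} | {:error, struct} plus one module that explains errors) comes from the paragraph above.

```elixir
defmodule AppError do
  # One struct for every expected failure: :code drives matching,
  # :detail carries whatever context the explanation needs.
  defstruct [:code, :detail]

  def new(code, detail \\ nil), do: %__MODULE__{code: code, detail: detail}

  # The single place that knows how to explain an error to a user.
  def error_explanation(%__MODULE__{code: :missing_password}), do: "Please enter your password."
  def error_explanation(%__MODULE__{code: :not_found, detail: id}), do: "No record with id #{id}."
  def error_explanation(%__MODULE__{}), do: "Something went wrong. Please try again later."
end

defmodule Login do
  # A missing or empty password is rejected by the function head,
  # before it can reach any inner logic.
  def submit(%{user: user, password: password}) when is_binary(password) and password != "" do
    {:ok, user}
  end

  def submit(_params), do: {:error, AppError.new(:missing_password)}
end

case Login.submit(%{user: "ann"}) do
  {:ok, user} -> IO.puts("Hello #{user}")
  {:error, error} -> IO.puts(AppError.error_explanation(error))
end
```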

But again, I think that “let it crash” is more towards system engineering, and moving “may-break” parts towards the edges, where a failure of a process doesn’t take down whatever is important to stay up, along with setting up relevant links, supervisors, and restart/traps/inits. When it comes to error translation for the user you really have no option other than to write something that translates that error, because I don’t think we’re anywhere near a language that can infer what it should output for a user. And mostly because it’s highly subjective: if you’re dealing with API calls, then the errors will be of a certain type and structured in a certain way (and then again different if you’re outputting json or xml, or plaintext); if you’re dealing with form submission, then the error will need a different structure to signal fields that are wrong, and so on.

2 Likes

I do this type of thing within my Akka system. But I don’t read many Elixir/Erlang blogs talking about how to set up proper Exception Hierarchies. I also don’t see many Elixir/Erlang examples promoting {:error, exception|exceptionstruct} as a return value.

Even with {:error, exception|exceptionstruct} I’m not sure how to make it idiomatic. The language supports try/rescue clauses. However, outside of a rescue block, I’m not sure how best to match on the exception type. I guess you’d need to match on the __struct__ field of the map. And even worse, you’d need to supply a separate hierarchy list somewhere or encode information within your exception type name. Instead of using a real defexception you could define multiple Module.Error1, Module.Error2 structs and use those. You wouldn’t be able to raise those structs, although raising exceptions seems pretty dangerous anyway since it blows up any async conversation you were having.
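For what it’s worth, an exception defined with defexception is an ordinary struct, so it can be matched by name in a function head or case clause, and the is_exception/1 guard (Elixir 1.11 and later) covers "any exception". A sketch with made-up names (PackageError, Teller):

```elixir
defmodule PackageError do
  # defexception defines a struct, so a value can be raised *or* returned.
  defexception [:message, :tracking_number]
end

defmodule Teller do
  def describe({:ok, package}), do: "Here is #{package}"

  # Match on the struct name directly; no need to read __struct__ by hand.
  def describe({:error, %PackageError{tracking_number: n}}), do: "No package for #{n}"

  # Any other exception struct: fall back to its message.
  def describe({:error, error}) when is_exception(error), do: Exception.message(error)
end

error = %PackageError{message: "not on the shelf", tracking_number: 42}
IO.puts(Teller.describe({:error, error}))
IO.puts(Teller.describe({:error, %ArgumentError{message: "bad input"}}))
```

There is no exception hierarchy here; grouping related errors would still need something explicit, such as a shared field or a list of modules.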

Given the friction of what I’m describing, I assume that there is a better way. That better way is what I’m trying to discover in this thread.

So far my take on Supervision is that it’s a good safety net for when your application wasn’t defensive enough.

I’m not trying to be intentionally blasphemous, but thus far I’m hearing:

  • The “let it crash” ethos only applies to headless systems. You don’t communicate errors back to a telephone handset. Restart and recover what you can and do better with the next call. At the very least the whole system hasn’t crashed.
  • It’s not idiomatic to trap exits frequently in application logic. Therefore:
  • Don’t use the exception mechanism within a process to communicate failure back to callers.
  • Wrap all logic that could raise an exception and convert that error to a message. Communicate this message back to those participating in the process “conversation”.
  • You wouldn’t want to raise exceptions within conversational processes anyway, since it kills the entire mailbox. It’s better to prevent crashing and return an error response message. Clean up or repair state as necessary and continue to process in flight messages.
  • You should consider encapsulating all communication to processes in spawned tasks that perform exit trapping, so that you can isolate and guard from crashes in other processes from being invisible to your current workflow. Instead have the task handle this detail for you, and provide an error message if it causes a downstream process to crash. You’re in effect trapping exits on every message pass, however this detail is hidden within your custom Task behavior.
  • Error messages are now equivalent to checked exceptions. You need to explicitly handle them, or explicitly ignore them.
  • Unfortunately, while you have multiple heads in order to keep your logic focused, you still need to include the code that handles or generates recurring classes of error response messages, in each head. If you’re lucky you can get away with a single catchall clause that wraps or passes through any message that isn’t the happy path message. If you’re unlucky you have error handling logic for common errors replicated in every head.
  • In order to aid in determining the appropriate error handling logic, or at least the proper user-level error response, you create many application specific Error structs, explicitly encoding additional type information with a custom field, if necessary.
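To make the "wrap and convert" bullet concrete, this is roughly the style being described, as a sketch with a hypothetical Inventory server: the exception is rescued inside handle_call and turned into an error reply, so the process, its state and its mailbox all survive.

```elixir
defmodule Inventory do
  use GenServer

  def init(stock), do: {:ok, stock}

  def handle_call({:take, item, qty}, _from, stock) do
    try do
      # Map.fetch!/2 raises KeyError for an unknown item;
      # the subtraction raises ArithmeticError for a non-numeric qty.
      remaining = Map.fetch!(stock, item) - qty
      {:reply, {:ok, remaining}, %{stock | item => remaining}}
    rescue
      # Convert the exception into a reply and keep the old state.
      e in [KeyError, ArithmeticError] ->
        {:reply, {:error, Exception.message(e)}, stock}
    end
  end
end

{:ok, pid} = GenServer.start_link(Inventory, %{stamp: 3})
{:error, _message} = GenServer.call(pid, {:take, :piano, 1})
{:ok, 2} = GenServer.call(pid, {:take, :stamp, 1})
```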

This sounds like the complete opposite of “let it crash”. Yet this is the only way I know to build responsive systems that focus on message passing. Part of the reason I’m here is to learn a new way.

If your processes are modeling people and you are having a conversation with those people, it isn’t ok for an errant combination of spoken words to give someone a stroke, killing them. If they can’t do what I’m asking, or don’t understand what I’m asking, or try to do something and fail, that information should be communicated back with reasonable detail as to what went wrong.

If I’m at the post office and ask Joe to get my package, he goes off but a piano falls on him. It’s ok if someone discovers the mess and tells me that Joe died then asks that I please repeat my request. But piano deaths are exceedingly rare. How is there an entire ethos built around recovering from piano deaths, when Joe coming back without my package and saying the following are much more common:

  • Sorry I couldn’t find your package
  • Sorry I see that your package is out for delivery and is no longer at this location
  • Sorry your package is too heavy for me to lift. You will need to come back later when Robert is here
  • Sorry your package is too heavy for anyone at this office to lift. Someone should have told you that before accepting the package. Please don’t come back.
  • Sorry I have your package here, but my scanner is broken. Since I can’t check your package out of my inventory, I can’t let you have it.
  • Sorry I have your package here, but there is a regulatory hold on it. Come back when the hold is lifted. Here is the hold number, put it into our regulatory hold system for more information.

Those are 6 pretty common things that could happen as a result of a single {:getpackage, trackingnumber} message. I stopped at 6 because I’d made my point. In large systems the ways in which things fail are numerous. It’s not ok for Joe to simply say {:error, "nopackageforyou"}; that doesn’t communicate enough information to me as a user, or to post office management attempting to identify areas for improvement.

Many of these same problems could also happen within the {:movepackage, trackingnumber, newstoragelocation} call. In that case someone else is the requester, possibly coming in through a completely different system or controller.

This problem only compounds itself if Joe is the storeroom person, but I talk to a teller. The teller is able to sell stamps, and provide other services in addition to dispatching my package request to Joe. But now the teller is faced with the need to handle these error messages from Joe, and can either pass them to me directly or sugar coat them. But the complexity of dealing with errors is still there. In fact it’s compounded by each party that needs to be aware of them.

Proper error handling is hard, or at least potentially complex and repetitive. In the post office scenario it isn’t acceptable to simply close the teller window, kick everyone out of line, and then reopen the window a short time later, if anything goes wrong. This is acceptable if Joe gets crushed and killed, but that ought to be exceedingly rare. And in that case I’m thankful that through the miracle of automation that Joe gets scraped up and a new Joe constructed in short order.

What I’m looking for now are common idioms and tips for managing this error complexity (presumably without crashing).

5 Likes

With does appear to be a convenient method for focusing only on happy path and bubbling up errors. Thank you for this tip.

https://hexdocs.pm/elixir/Kernel.SpecialForms.html#with/1

There’s a link for others following the thread. It would be nice if this was included within one of the Getting Started sections, not buried within a section on doctests.

I think the problem here is defining what an error is :) Lots of error cases are in fact not errors at all. It is normal process flow. Therefore they should be handled as normal code. Take your list for “getpackage”. They are not errors (perhaps except the scanner being broken) but just different values that can be returned.

You can define an API for it.

@spec getpackage(id :: integer()) ::
        {:ok, package}
        | :not_found
        | {:resource_problem, :temporary}
        | {:resource_problem, :permanent}
        | {:package_held, hold_number}
        | {:error, reason}
      when package: term(), hold_number: term(), reason: term()

“Let it crash” is more about how to recover from unexpected errors. Your errors are expected (and that is why you deal with them) and I don’t see them as errors but rather business logic you need to deal with. You could almost not send back the {:error, reason} tuple unless the caller can actually do something about it, but perhaps you want to do it for informational purposes. In this case I normally return a well defined atom, and the API has a function to format the error to a user string. This makes it much easier to handle in pattern matching as I might care about some but not all errors.

For example:

case MyAPI.call_function(args) do
  {:ok, value} -> print_it(value)
  {:error, :handle_me} -> recover_and_print(args)
  {:error, reason} -> print_error(MyAPI.format_error(reason))
end

defmodule MyAPI do
  # Turns a well-defined error atom into a string for the user.
  def format_error(:foo), do: "A foo error happened"
  def format_error(:bar), do: "A bar error happened"
end

If you start programming defensively you have to handle everything everywhere, which you don’t want to do. You just need to fulfill the API contract. Let it crash makes sure that anything unexpected is dealt with.

I also don’t agree that you need to be super-clear on why something fails to the end user, as long as the error is logged somewhere and the developer of the system can deal with it later. It is OK to tell the user that an error has occurred and that they need to try back later without specifying exactly what went wrong. If it is an error that is part of the design.

3 Likes

First let me say that I really appreciate this feedback. Reading my long posts and then coming back with refined examples is time consuming. I appreciate everyone’s efforts.

I come from a background where there’s a single success value and anything that isn’t that value is an error (exceptional case). So to me everything not {:ok, package} is an error. Checked exceptions are part of the API definition too. Moving from stack based to message passing doesn’t change this, IMO. That’s why I mentally classify all those other things as errors. Even if they’re well understood or common non-desired outcomes.

Checked exceptions have fallen out of favor. Instead the new hotness is to make everything an unchecked (runtime) exception. It’s then up to the developer to detect common error patterns and handle special error cases, or not. Sometimes the API documentation describes common runtime errors that it raises but this is no longer part of the compiled API contract.

Data processing pipelines also tend to be success biased. So a clear definition for success is important. I also want to reserve the ability to introduce new errors in the future without breaking older/existing clients. I think it’s common for code to:

case get_package(args) do
  {:ok, package} -> #expected path
  {:error, {:interesting_type, _message}} -> #well handled error
  {:error, _} -> #unhandled error
end

The number of well handled errors may expand over time as the system matures. The number of interesting error types may also evolve over time.

I try not to write

case get_package(args) do
  {:ok, package} -> #expected path
  {:error, {:interesting_type, _message}} -> #well handled error
  {:error, _} -> #unhandled error
  _ -> #not understood message
end

I do agree that there are many things (that I’m calling errors) that callers may not be able to do anything about. In that case there seems to be only a few options.

  • Forward the error as is
  • Turn this error into a new error and send the new error
  • Raise an exception and crash.

with seems to be a pretty nice construct for error forwarding. Simple forwarding may be totally appropriate in some parts of the app. As the message travels through the system something will be responsible for refining or handling these messages.

Oddly, one of the benefits of moving from checked to unchecked exception handling was that the typically unchecked errors are now being handled in a more idiomatic way. For example, dividing by zero is now trapped, turned into a message and handled idiomatically. Whereas in the past this would have caused a crash.

So the rule of thumb appears to be:

  • If your logic is part of a process backed behavior, never crash. Try/rescue everything and encapsulate all failures in a message.
  • If you’re writing a library that isn’t process based you can raise an exception, but an {:error, _} tuple is preferred.
  • Raising errors is more common when interfacing with single return value languages that throw exceptions. So NIFs wrapping C code, raising errors might be more common.

If you like the with construct, you might like exceptional.

This library helps you with exception handling, the monadic way.

3 Likes

This might help demystify how to handle state and crashes in Erlang-like languages (such as Elixir). Pay attention to state classification.

1 Like