Handling crashes in Phoenix

I am hoping someone can put me on the right track here.

We have to process a request and put it through 3 different network dependencies. Any one of them could fail at any time. If any of them fails, we have to “roll back” the network operations, i.e. make another set of network calls to undo what was originally done. Something like this:

with {:ok, result_1} <- network_call_1(),
     {:ok, result_2} <- network_call_2(),
     {:ok, result_3} <- network_call_3() do
  # All successful, do the rest
else
  error ->
    # figure out which network call to roll back
    error
end

So far this is fine if everything runs and the process does not crash. But how do we handle crashes? When the process crashes, we lose all information about which network calls succeeded or failed.

In this situation, what alternatives are available (OTP or vanilla hand-written)? The options that I am aware of with my limited Erlang/Elixir experience:

  1. After each network operation, send a message to a supervised GenServer process that keeps state in ETS. Presumably, the process does so little that it won’t crash.
  2. Handle each network operation in a separately spawned process, monitor that process, and handle the crash.
  3. Manage a state machine in some sort of RDBMS. Crashes are handled by having a supervised process go through the state machines that haven’t terminated, check whether the owning process is alive, and roll back if it is not.
  4. Try/catch around the network calls might be the simplest, but I am not sure it will handle all the cases in which the process might crash.
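
For reference, option 2 could look roughly like the sketch below. It is only an illustration: `MonitoredCall` and `run/2` are made-up names, and the 5 second default timeout is an arbitrary assumption. The work runs in a monitored (not linked) process, so a crash shows up as a `:DOWN` message in the caller instead of taking the caller down.

```elixir
defmodule MonitoredCall do
  # Hypothetical helper: runs `fun` in a separate, monitored process and
  # converts a crash or a timeout into an error tuple for the caller.
  def run(fun, timeout \\ 5_000) do
    caller = self()
    ref = make_ref()

    # spawn_monitor/1 does not link, so a crash in `fun` cannot kill the caller.
    {pid, mon} = spawn_monitor(fn -> send(caller, {ref, fun.()}) end)

    receive do
      {^ref, result} ->
        # Drop the monitor and flush the :DOWN message that a normal exit sends.
        Process.demonitor(mon, [:flush])
        {:ok, result}

      {:DOWN, ^mon, :process, ^pid, reason} ->
        {:error, {:crashed, reason}}
    after
      timeout ->
        Process.demonitor(mon, [:flush])
        Process.exit(pid, :kill)
        {:error, :timeout}
    end
  end
end
```

The caller survives the crash, so it still knows which earlier calls succeeded and can roll them back. It does not help if the calling process itself is killed.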

If you expect the network to fail, then you should not model it with exceptions. You should use exactly {:ok, _} | {:error, _}. But then you may say “what if I still get an exception?” Then you just let it blow up, because it is an exception: it is not supposed to happen. In such cases, you don’t even know why it failed, and it could be due to any special circumstance. If you are raising an exception, it should be because you don’t have any other option than raising an exception.

Once it blows up, Phoenix will render a 500 page and your monitoring systems should send you an e-mail. In the e-mail, you can decide to act on it. Maybe it is a less common scenario you forgot to handle but maybe it is just something you can’t handle at all.

With this in mind, you need to decide how you are going to make your network calls return {:ok, _} | {:error, _} instead of raising exceptions. Ideally the libraries you use will expose the tuple formats instead of raising, and then it becomes a no-brainer: just use with like you did above. But maybe they don’t, and then you may need to use a try/catch internally. In the worst-case scenario, you can use Task.async and use Task.yield/shutdown to assert the task exit reason.
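
A minimal sketch of the Task approach follows. One deliberate swap: it uses `Task.Supervisor.async_nolink/2` rather than `Task.async/1`, because `Task.async/1` links the task to the caller, so a crashing task would also take the caller down unless it traps exits. `SafeCall`, `run/3`, and the timeout value are hypothetical names and numbers chosen for illustration.

```elixir
defmodule SafeCall do
  # Hypothetical wrapper: runs `fun` in an unlinked task under the given
  # Task.Supervisor and turns a crash or timeout into an error tuple.
  def run(supervisor, fun, timeout \\ 5_000) do
    task = Task.Supervisor.async_nolink(supervisor, fun)

    # Task.yield/2 returns nil on timeout; Task.shutdown/1 then stops the
    # task and returns a reply only if one arrived in the meantime.
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> {:ok, result}
      {:exit, reason} -> {:error, {:exit, reason}}
      nil -> {:error, :timeout}
    end
  end
end
```

In an application the supervisor would normally live in the supervision tree; for a quick experiment, `{:ok, sup} = Task.Supervisor.start_link()` followed by `SafeCall.run(sup, fn -> raise "boom" end)` returns an `{:error, {:exit, _}}` tuple instead of crashing the caller.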


@josevalim - Thanks a lot.

Many times we are not able to remediate through monitoring because a logged-in user’s token is required for the network calls. So recovery from failure has to happen while the tokens are still available (at the time of failure).

If I understand correctly, you recommend I do something like this?

def network_op do
  try do
    result = do_op_and_process_result()
    {:ok, result}
  rescue
    e -> {:error, e} # callers do cleanup
  catch
    :exit, _ -> {:error, "exited"} # callers do cleanup
  end
end

That is what I’d do with exceptions.

Note though: if you have a with like in your first function, the calls return ok/error tuples as yours do, and you want to distinguish which call failed, just tag each one like:

with {1, {:ok, result_1}} <- {1, network_call_1()},
     {2, {:ok, result_2}} <- {2, network_call_2()},
     {3, {:ok, result_3}} <- {3, network_call_3()} do
  # All successful, do the rest
else
  {1, error_from_network_call_1} ->
    # handle error from network call 1
    error_from_network_call_1
  {2, error_from_network_call_2} ->
    # handle error from network call 2
    error_from_network_call_2
  {3, error_from_network_call_3} ->
    # handle error from network call 3
    error_from_network_call_3
end

Libraries like Vic’s excellent happy_path are basically an enhanced with with more features, including trivial tagging.

The try snippet is fine, but I would only add it if it proved necessary. Otherwise it will make it absurdly hard for developers to find bugs during development (as everything is silently handled).
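
To tie the tagging idea back to the original rollback question, here is one possible sketch, assuming every call returns {:ok, _} | {:error, _} and has a matching undo function. `Rollback`, `run/1`, and the `{name, do_fun, undo_fun}` step shape are hypothetical, not something from a library. Each successful step is remembered along with its undo function; on the first error, the completed steps are undone newest first.

```elixir
defmodule Rollback do
  # Hypothetical helper. `steps` is a list of {name, do_fun, undo_fun} where
  # do_fun.() returns {:ok, result} | {:error, reason} and undo_fun.(result)
  # reverses a step that succeeded.
  def run(steps) do
    steps
    |> Enum.reduce_while({:ok, []}, fn {name, do_fun, undo_fun}, {:ok, done} ->
      case do_fun.() do
        {:ok, result} ->
          {:cont, {:ok, [{name, result, undo_fun} | done]}}

        {:error, reason} ->
          # `done` is newest-first, so the undo calls run in reverse order.
          Enum.each(done, fn {_name, result, undo} -> undo.(result) end)
          {:halt, {:error, name, reason}}
      end
    end)
    |> case do
      {:ok, done} ->
        {:ok, done |> Enum.reverse() |> Enum.map(fn {name, result, _} -> {name, result} end)}

      error ->
        error
    end
  end
end
```

This only covers failures reported as error tuples. A crash of the calling process still loses the list of completed steps, which is where the process-based options discussed above come in.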