Handling crashes in Phoenix

satb · June 27, 2017, 1:45am

I am hoping someone can put me on the right track here.

We have to process a request and put it through 3 different network dependencies. Any one of them could fail at any time. If any of them fails, we have to “roll back” the network operations, i.e. make another set of network calls to undo what was originally done. Something like this:

with {:ok, result_1} <- network_call_1(),
     {:ok, result_2} <- network_call_2(),
     {:ok, result_3} <- network_call_3() do
  # All successful, do the rest
else
  error -> # figure out which network call to roll back
end

So far it's fine if everything runs and the process does not crash. But how do we handle crashes? When the process crashes, we lose all information about which network calls succeeded or failed.

In this situation, what alternatives are available (OTP or vanilla hand-written)? The options that I am aware of with my limited Erlang/Elixir experience:

- After each network operation, send a message to a supervised GenServer process that keeps state in ETS. Presumably, the process does so little that it won’t crash.
- Handle each network operation in a separately spawned process, monitor that process, and handle the crash.
- Manage a state machine in some sort of RDBMS. Crashes are handled by a supervised process that goes through the state machines that have not terminated, checks whether the owning process is still alive, and rolls back if it is not.
- Try/catch around the network calls might be the simplest, but I am not sure it will handle all the cases in which the process might crash.
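The first two options can be combined, as in the minimal sketch below. It rests on several assumptions: `Journal` is a hypothetical module, an Agent stands in for the GenServer-plus-ETS store, and the worker body only pretends to make network calls. The point is that the list of completed steps lives outside the worker, so it survives the worker's crash, and the monitor tells the caller that a rollback is needed.

```elixir
defmodule Journal do
  # Hypothetical journal process: remembers which steps completed, so the
  # information survives a crash of the worker that performed them.
  use Agent

  def start_link(_opts \\ []), do: Agent.start_link(fn -> [] end)
  def record(journal, step), do: Agent.update(journal, &[step | &1])
  # Most recent step first, which is also the order to roll back in.
  def completed(journal), do: Agent.get(journal, & &1)
end

{:ok, journal} = Journal.start_link()

# The worker is spawned with a monitor, not a link, so its crash reaches the
# caller as a :DOWN message instead of taking the caller down too.
{pid, ref} =
  spawn_monitor(fn ->
    Journal.record(journal, :call_1)
    Journal.record(journal, :call_2)
    # Simulated crash during the third (pretend) network call.
    exit(:network_call_3_blew_up)
  end)

outcome =
  receive do
    {:DOWN, ^ref, :process, ^pid, :normal} ->
      :ok

    {:DOWN, ^ref, :process, ^pid, reason} ->
      {:rollback, Journal.completed(journal), reason}
  after
    5_000 -> :timeout
  end

IO.inspect(outcome)
```

In a real application the journal would be started under a supervisor rather than linked to the calling process, and each recorded step would carry whatever data its rollback call needs.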

josevalim · June 27, 2017, 10:56am

If you expect the network to fail, then you should not model it with exceptions. You should use exactly {:ok, _} | {:error, _}. But then you may ask, “what if I still get an exception?” Then you just let it blow up, because it is an exception: it is not supposed to happen. In such cases, you don’t even know why it failed, and it could be due to any special circumstance. If you are raising an exception, it should be because you have no other option.

Once it blows up, Phoenix will render a 500 page and your monitoring systems should send you an e-mail. In the e-mail, you can decide to act on it. Maybe it is a less common scenario you forgot to handle but maybe it is just something you can’t handle at all.

With this in mind, you need to decide how you are going to make your network calls return {:ok, _} | {:error, _} instead of raising exceptions. Ideally the libraries you use will expose the tuple formats instead of raising, and then it becomes a no-brainer: just use with like you did above. If they don’t, you may need to use a try/catch internally. In the worst scenario, you can use Task.async and then Task.yield and Task.shutdown to check the task's exit reason.
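A minimal sketch of the task-based fallback follows. `SafeCall` and its `run/3` function are hypothetical names, and the sketch swaps `Task.async` for `Task.Supervisor.async_nolink`: a task started with `Task.async` is linked to the caller, so its crash would also take the caller down unless the caller traps exits. With the unlinked variant, `Task.yield/2` reports a crash as `{:exit, reason}`.

```elixir
defmodule SafeCall do
  # Hypothetical helper: runs `fun` in an unlinked, supervised task and
  # converts every outcome into {:ok, _} | {:error, _}.
  def run(supervisor, fun, timeout \\ 5_000) do
    task = Task.Supervisor.async_nolink(supervisor, fun)

    # Task.yield/2 returns nil on timeout; Task.shutdown/1 then stops the
    # task, returning a reply if one arrived in the meantime.
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, {:ok, _} = ok} -> ok
      {:ok, {:error, _} = error} -> error
      {:ok, other} -> {:error, {:unexpected_result, other}}
      {:exit, reason} -> {:error, {:crashed, reason}}
      nil -> {:error, :timeout}
    end
  end
end

# In an application the supervisor would live in the supervision tree.
{:ok, sup} = Task.Supervisor.start_link()

IO.inspect(SafeCall.run(sup, fn -> {:ok, :done} end))
IO.inspect(SafeCall.run(sup, fn -> exit(:boom) end))
IO.inspect(SafeCall.run(sup, fn -> Process.sleep(:infinity) end, 50))
```

Because every outcome is now a tuple, the result of `SafeCall.run/3` can go straight into a with clause. The trade-off is the one raised elsewhere in this thread: crashes that would otherwise surface loudly are turned into ordinary error values.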

satb · June 27, 2017, 2:20pm

@josevalim - Thanks a lot.

Many times we are not able to remediate through monitoring because a logged-in user token is required for the network calls. So recovery from failure has to happen while the tokens are still available (at the time of failure).

If I understand correctly, you recommend I do something like this?

def network_op do
  try do
    result = do_op_and_process_result()
    {:ok, result}
  rescue
    e -> {:error, e} # callers do cleanup
  catch
    :exit, _ -> {:error, "exited"} # callers do cleanup
  end
end

OvermindDL1 · June 27, 2017, 3:25pm

That is what I’d do with exceptions.

Note, though, that if you have a with like the one in your first snippet, the calls return ok/error tuples, and you want to tell which one failed, you can just tag each clause like this:

with {1, {:ok, result_1}} <- {1, network_call_1()},
     {2, {:ok, result_2}} <- {2, network_call_2()},
     {3, {:ok, result_3}} <- {3, network_call_3()} do
  # All successful, do the rest
else
  {1, error_from_network_call_1} -> # handle error from network call 1
  {2, error_from_network_call_2} -> # handle error from network call 2
  {3, error_from_network_call_3} -> # handle error from network call 3
end
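One way the tagging idea could be extended to the rollback itself is sketched below. Variables bound in earlier with clauses are not visible in the else block, so this sketch carries the earlier results inside the tagged tuple. The `Rollback` module, `call/2`, and `undo/2` are hypothetical stand-ins for the real network calls and their compensating calls.

```elixir
defmodule Rollback do
  # Hypothetical stand-in for a network call: step `n` fails when n == fail_at.
  defp call(n, fail_at) when n == fail_at, do: {:error, {:step_failed, n}}
  defp call(n, _fail_at), do: {:ok, {:result, n}}

  # Hypothetical compensating call; here it only records what it undid.
  defp undo(n, result), do: send(self(), {:undone, n, result})

  def run(fail_at \\ nil) do
    # Each tag tuple also carries the results of the earlier steps.
    with {:step1, {:ok, r1}} <- {:step1, call(1, fail_at)},
         {:step2, _, {:ok, r2}} <- {:step2, r1, call(2, fail_at)},
         {:step3, _, _, {:ok, r3}} <- {:step3, r1, r2, call(3, fail_at)} do
      {:ok, [r1, r2, r3]}
    else
      {:step1, error} ->
        # Nothing succeeded yet, so there is nothing to undo.
        error

      {:step2, r1, error} ->
        undo(1, r1)
        error

      {:step3, r1, r2, error} ->
        # Undo in reverse order of the original calls.
        undo(2, r2)
        undo(1, r1)
        error
    end
  end
end
```

Calling `Rollback.run(3)` returns `{:error, {:step_failed, 3}}` after undoing step 2 and then step 1, while `Rollback.run()` returns all three results and undoes nothing.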

Libraries like Vic’s excellent happy_path are basically an enhanced with with more features, including trivial tagging.

josevalim · June 27, 2017, 4:07pm

That snippet is fine, but I would only add it if it proved necessary. Otherwise it will make it absurdly hard for developers to find bugs, even during development (as everything is silently handled).