How to correctly handle GenServer.call exits because of a non-existing process

GenServer.call might fail when the process we are calling no longer exists. Are there any guidelines on how to handle such a situation? Let’s assume the following scenario:

defmodule RoomService do
  def list_rooms() do
    RoomRegistry
    |> Registry.select([{{:"$1", :_, :_}, [], [:"$1"]}])
    |> Enum.map(&Room.get_state(&1))
    |> Enum.reject(&(&1 == nil))
  end
end

defmodule Room do
  def get_state(room_id) do
    registry_room_id = registry_id(room_id)
    GenServer.call(registry_room_id, :get_state)
  end
end

With the above API, RoomService will crash when any of the calls to Room.get_state crashes, which is unwanted behavior.

RoomService cannot wrap Room.get_state in try/catch, as it is not aware of get_state’s internal implementation.

We could use try/catch inside get_state, like this:

  def get_state(room_id) do
    registry_room_id = registry_id(room_id)

    try do
      GenServer.call(registry_room_id, :get_state)
    catch
      :exit, {:noproc, {GenServer, :call, [^registry_room_id, :get_state, _timeout]}} ->
        Logger.warning(
          "Cannot get state of #{inspect(room_id)}, the room's process doesn't exist anymore"
        )

        nil
    end
  end

but this effectively means that every single GenServer.call should be wrapped in try/catch.

I feel like I am missing something :thinking:

2 Likes

That’s the idea. With a GenServer.call the caller is expected to need the response to the call. If a response is impossible to get it’s expected that the caller cannot/should not continue.

1 Like

How can the Room module know how it is going to be used by other modules? The Room module can’t know whether its caller should or shouldn’t continue when a call to the room crashes. This is caller-specific behavior, especially when you are writing a library which exposes a GenServer. You can’t know how users of this library are going to use your API. Some of them might want to crash along with the GenServer.call crash, others might not.

Catch the exit of GenServer.call and give the caller of the function enough context to make their own decisions. Room is still the module doing the GenServer.call, so it can decide how it deals with the fact that this might cause an exit.
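
For example (just a sketch, not the only way to shape it), get_state/1 could translate the exit into a tagged tuple and let the caller decide:

def get_state(room_id) do
  registry_room_id = registry_id(room_id)

  try do
    {:ok, GenServer.call(registry_room_id, :get_state)}
  catch
    :exit, {:noproc, {GenServer, :call, _}} ->
      # The room is gone; the caller decides whether that is fatal.
      {:error, :not_found}
  end
end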

So we end up with the solution where we wrap every single GenServer.call in a try/catch, don’t we? Sounds like something we should take care of in the standard library? Something like GenServer.call and GenServer.call!

1 Like

Can’t do that without breaking backwards compatibility and the fact that the Elixir API mirrors what :gen_server.call does.

1 Like

You can search GitHub for safe_call and you will find a bunch of results. There may be lots of other names for that; for instance, in the brod library it’s called safe_gen_call.

As a general rule of thumb, in a library that deals with processes, you may want to do that try/catch directly in library code if the process not being alive is frequent and expected. In all other cases, do not wrap the call, because when you are calling a process it should, in general, be there.

A common use case that comes to mind is dealing with transient processes (or processes that stop themselves after an idle timeout), because calling Process.alive? before doing a call has an inherent race condition. So you just do the call, and if it exits with :noproc you start the process and try again.
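
Roughly like this (a sketch; call_room/2 and start_room/1 are made-up names, with start_room/1 standing in for however the rooms get started, e.g. under a DynamicSupervisor):

def call_room(room_id, request) do
  try do
    GenServer.call(registry_id(room_id), request)
  catch
    :exit, {:noproc, _} ->
      # Checking Process.alive? beforehand would race; just call, and if
      # the room is gone, start it and call once more.
      # start_room/1 is a placeholder for however rooms get started.
      {:ok, _pid} = start_room(room_id)
      GenServer.call(registry_id(room_id), request)
  end
end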

So no, you should not default to wrapping everything; if the use case arises, it’s generally pretty obvious.

2 Likes

The example with the brod library is great, thanks!

As a general rule of thumb, in a library that deals with processes, you may want to do that try/catch directly in library code if the process not being alive is frequent and expected. In all other cases, do not wrap the call, because when you are calling a process it should, in general, be there.

I can imagine that you have some kind of a client, which is a process and might die when it disconnects from a remote server. In such a case, wrapping calls to the client into try/catch sounds good.

But shouldn’t we be prepared for every process to crash at any time, e.g. because there is a bug in the code or something unexpected happened? Aren’t those the reasons why we always try to spawn processes under a supervisor?

1 Like

Yes, we should expect processes to crash. How we react to a process being unavailable is the point of discussion though. The option chosen for GenServer.call, and (probably way) earlier for :gen_server.call, is to exit if the process is not available. I’d imagine the reasoning being that the calling process requires a response to the call, and if that cannot be retrieved, all code depending on the response likely shouldn’t run either. Choosing how to handle (such) errors is a tradeoff for sure, but I don’t think this one is entirely unreasonable.

The idea of supervisors and OTP’s fault tolerance is not that things always have fallbacks (I’d even say the contrary), but rather that failures only affect the users who run into a fault, not others. For this case this means the unlucky user trying to call a crashed process will crash as well, but any user not needing to interact with that missing process is fine. Once the process is back up, the system is healed and everything works again.

Another thing you can look into is the async alternative to doing a call (send_request, wait_response, …) introduced in OTP 23, which doesn’t fail if a process is not alive, but returns an error tuple when you try to receive the response. That one is not yet mirrored in Elixir though, but the Erlang API should work just fine.
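
For reference, the Erlang API is callable from Elixir directly; a minimal sketch, reusing registry_id/1 from the example above (the exact return shapes are documented for :gen_server.send_request/2 and :gen_server.wait_response/2):

request_id = :gen_server.send_request(registry_id(room_id), :get_state)

case :gen_server.wait_response(request_id, 5_000) do
  {:reply, state} ->
    {:ok, state}

  {:error, {:noproc, _server_ref}} ->
    # The room's process isn't alive: no exit, just an error tuple.
    {:error, :not_found}

  :timeout ->
    {:error, :timeout}
end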

4 Likes

It’s a bit like saying that if a process opens a file and the file doesn’t exist, all the code dependent on the file shouldn’t run. But it doesn’t mean that the process has to crash, thus we have File.open and File.open! :wink:

That might be a very reasonable thing to do.

I’m not arguing that having that option wouldn’t be useful - it very much would be. I don’t see a way to get there now though. The decision for :gen_server.call was made ages ago and GenServer.call mirrored it. I don’t think GenServer.call_but_not_exit is a great option either. Unless this discussion is aimed at Elixir 2.0, I’m not sure what this topic is meant to get to.

1 Like

I think this discussion should be had with the OTP team, because I doubt that Elixir will introduce changes to GenServer that are incompatible with the Erlang version.

Why does the process not exist?

Why does registry_id(room_id) still have a dead process id?

Of course a process can die at any point, and you can crash calling it, but perhaps you need the holy trinity of a Supervisor, DynamicSupervisor and Registry, managing and restarting the rooms?
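
Something along these lines (just a sketch; RoomSupervisor is a made-up name, and it assumes Room.start_link/1 forwards the :name option to GenServer.start_link/3):

# In the application's supervision tree (names are illustrative):
children = [
  {Registry, keys: :unique, name: RoomRegistry},
  {DynamicSupervisor, name: RoomSupervisor, strategy: :one_for_one}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Starting a room registered under its id; the DynamicSupervisor will
# restart it if it crashes (default restart: :permanent):
DynamicSupervisor.start_child(
  RoomSupervisor,
  {Room, name: {:via, Registry, {RoomRegistry, room_id}}}
)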

1 Like

Yeah, I appreciate you trying to find the reasoning behind the original decision. Just wanted to make it clear that the way it is designed just doesn’t work in some cases and you have to hack around it one way or another, probably by wrapping with try/catch.

It’s actually to be logically consistent by default across a range of possibilities such as:

  • If the called process does not exist, you get an exception on the caller.

  • If the called process exists and crashes as the message is in flight, you get an exception on the caller.

  • If the called process exists and the node hosting the process crashes as the message is in flight, you get an exception on the caller.

  • If the called process exists and the process crashes after the message has been added to the process message queue but not yet processed, you get an exception on the caller.

  • If the called process exists and the node hosting the process crashes after the message has been added to the process message queue but not yet processed, you get an exception on the caller.

  • If the called process exists and the message is delivered and the process begins processing the call and then crashes, you get an exception on the caller.

  • If the called process exists and the message is delivered and then the node crashes as the message is being processed, you get an exception on the caller.

  • If the called process exists and the message is delivered and the process does not handle that call, you get an exception on the caller.
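
Which is also why a “safe” wrapper ends up having to deal with a whole family of exit reasons, not just :noproc. Roughly (reusing the safe_call name mentioned earlier in the thread):

# one possible shape of a safe_call wrapper
def safe_call(server, request, timeout \\ 5_000) do
  try do
    {:ok, GenServer.call(server, request, timeout)}
  catch
    # the process never existed or is already gone
    :exit, {:noproc, _} -> {:error, :noproc}
    # no reply arrived within the timeout
    :exit, {:timeout, _} -> {:error, :timeout}
    # the node hosting the process went away
    :exit, {{:nodedown, down_node}, _} -> {:error, {:nodedown, down_node}}
    # the process crashed while handling the call, etc.
    :exit, {reason, _} -> {:error, reason}
  end
end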

Unless this discussion is aimed at Elixir 2.0, I’m not sure what this topic is meant to get to.

Yeah, yeah, I know :wink: I just couldn’t find any official guidelines, so I am trying to understand how I should design my API. I am also curious about your points of view :slight_smile: I know that call was designed on purpose to exit when a process no longer exists, so I’m just trying to get the reasoning behind it.

Also, moving this to the Erlang forum might be a good option, as that’s not an Elixir thing.

Why does the process not exist?

Why does registry_id(room_id) still have a dead process id?

We had a bug where a room could crash under some circumstances. After fixing the bug, a room is no longer expected to crash, so we could keep our code and assume that there won’t be further crashes and that GenServer.call always succeeds. But this situation made me think about how our API should be designed, and whether we should be prepared for a process to crash at any time or assume that it is not expected to crash.

The bigger context is that we execute a GET request to get all room states. It happened to us that one of the rooms crashed in the meantime and the whole request crashed, returning a 500 error code. If a user re-requests all room states they will get a correct response, so maybe that’s how the system should work and, as I said, we should keep our code without wrapping it in try/catch.

Another problem is that you don’t know how Room.get_state is implemented. In the case of HTTP requests it’s pretty easy, as Phoenix handles every single request in a separate process, but imagine you are calling this API on your own. You don’t know whether this function can crash or not. If you knew, you could wrap it in try/catch, spawn a separate Task for it (with Task.Supervisor.async_nolink, so the crash doesn’t propagate), or just call it directly :thinking:
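
The Task route would look roughly like this (a sketch; RoomTaskSupervisor is a made-up Task.Supervisor assumed to be running in the supervision tree):

# RoomTaskSupervisor: an assumed Task.Supervisor from the app's tree
task =
  Task.Supervisor.async_nolink(RoomTaskSupervisor, fn ->
    Room.get_state(room_id)
  end)

case Task.yield(task, 5_000) || Task.shutdown(task) do
  {:ok, state} ->
    {:ok, state}

  {:exit, reason} ->
    # the task crashed (e.g. the room's process was gone) without taking us down
    {:error, reason}

  nil ->
    {:error, :timeout}
end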

1 Like

I have been there, and it’s very bad. You will go against the Erlang philosophy of letting things crash sometimes, and will have to handle an infinite number of possible errors at every layer of the application, in every module, etc. Really bad.

The hard part is to accept that yeah, stuff goes wrong sometimes, and there is not much you can do about it. I see two cases: 1. the code will fail 50% of the time; in that case, isolate it in processes and/or use try/rescue/catch. 2. the code is not expected to fail and failures are exceptional; then do nothing, let something else retry (it can be the frontend, an Oban job, a Task…), just let your supervisor handle that and you can focus on implementing features.

4 Likes

Yes, any code can crash because of a missing process, a dead node or a code error such as a divide by zero. That’s ok: you either deal with it at design time because you expect it and know EXACTLY how to handle it, or you don’t really know about the possibility or how to handle something unexpected, so the right thing to do is to “let it crash” and defer to the policy decided in the layer(s) above to recover, e.g. through the supervision tree.

In Elixir our code can be highly concurrent; perhaps deep down your library code is using Task.async_stream to process a stream of items concurrently. Any error that occurs in a spawned task (e.g. due to bad data your function doesn’t handle) should crash the calling process, because you really don’t expect it to fail.
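
For instance (a toy example):

# Tasks started by async_stream are linked to the caller, so a crash in any
# of them (here a division by zero) brings the calling process down as well.
[1, 2, 0]
|> Task.async_stream(fn n -> 10 / n end)
|> Enum.to_list()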

If you are coming from a statically typed, exception-heavy programming language, you are unreasonably forced to declare all possible exceptions and code defensively and exhaustively, trying to handle things you can’t reasonably do much about, because you don’t have a process model like Elixir’s to recover and clean up state when the happy path is violated, such as when a process dies that you expected to be alive. In these other defensive paradigms your code has to do all the work to detect and clean up, so you write even more code to try and handle all eventualities because you MUST keep the global shared state consistent. Of course, incorrect programs with even more code than the necessary happy path can’t reasonably be expected to correctly handle all eventualities; more code means more bugs.

3 Likes

Exactly.

“Drugs are bad m-kay…”

But it really is a shift in thinking; it’s why Joe Armstrong’s thesis was so important. What seems like the right thing just leads to more harm than good.

How can we program systems which behave in a reasonable manner in the presence of software errors? This is the central question that I hope to answer in this thesis.
Large systems will probably always be delivered containing a number of errors in the software, nevertheless such systems are expected to behave in a reasonable manner.
To make a reliable system from faulty components places certain requirements on the system.

1 Like

Not to mention that handling errors manually is error-prone :joy:, so this is basically an infinite circle of hell.

1 Like