Trap exits to ensure linked process exits?

Hi,

I’m using Phoenix.Socket.Transport to accept websocket connections. For each websocket connection, I need to start a client process that connects to a different service. This process will remain up and connected for the duration of the websocket connection. When either of the two processes crashes, the other process needs to crash as well (there is no way to recover from this). When the websocket closes normally, the client process needs to be stopped.

When I run this:

defmodule Stack do
  use GenServer

  @impl true
  def init(stack) do
    {:ok, stack}
  end
end

spid = spawn(fn ->
  {:ok, pid} = GenServer.start_link(Stack, [:hello])
  IO.puts("gs: #{inspect pid}")
end)

# Process.alive?(genserver_pid) #=> true

the GenServer stays alive, even though the spawned process exits immediately. When I trap exits in the GenServer, the GenServer goes down with the spawned process, which is what I want:

defmodule Stack do
  use GenServer

  @impl true
  def init(stack) do
    Process.flag(:trap_exit, true)
    {:ok, stack}
  end
end

spid = spawn(fn ->
  {:ok, pid} = GenServer.start_link(Stack, [:hello])
  IO.puts("gs: #{inspect pid}")
end)

# Process.alive?(genserver_pid) #=> false

It seems to me this is a clean solution, because I can ensure from the client code that it will always go down, even if I forget to explicitly shut down the GenServer from a terminate callback. This also matches the docs:

If trap_exit is set to false, the process exits if it receives an exit signal other than normal and the exit signal is propagated to its linked processes.

https://erlang.org/~lukas/predefined-types/erts-12.1.2/doc/html/erlang.html#process_flag-2

However, Erlang Slack says I should use the terminate callback instead, because trapping exits “has some additional risks with it if the code doesn’t handle all cases correctly”. I’m not sure what those risks are.

Crashes

Furthermore, the GenServer page on the Elixir website says:

If you link two processes and one of them crashes, the other side will crash too (unless it is trapping exits).

However, if I make my spawn process crash:

defmodule Stack do
  use GenServer

  @impl true
  def init(stack) do
    Process.flag(:trap_exit, true)
    {:ok, stack}
  end
end

spid = spawn(fn ->
  {:ok, pid} = GenServer.start_link(Stack, [:hello])
  IO.puts("gs: #{inspect pid}")
  raise "bye"
end)

both the GenServer and the spawned process crash, even though the docs say “the other side will crash too, unless it is trapping exits”.

So my questions are:

  1. Is it appropriate to use start_link from the websocket process?
  2. Should I trap exits in the client connection, or should I stop the client from the websocket terminate callback?
  3. If I should not trap exits, why?
  4. Is the documentation page I linked above missing something about crashing and trapping exits?
1 Like

When you trap exits, exit signals arriving at a process are converted to {'EXIT', From, Reason} messages, which you can normally receive like any other message. However, your process is a GenServer, and GenServers apparently will crash automatically when they receive an EXIT message, no matter what: Trying to understand GenServer terminate/2 behavior when trapping exit - #2 by michalmuskala

So trapping exits doesn’t really work for GenServers, in the sense that you can’t trap exits and expect your process not to crash when the parent terminates. In fact, when you trap exits on your GenServer it always terminates when the parent terminates, as you observed in your experiments. I actually had no clue it worked like this, I just found out and was very surprised.

3 Likes

I’ve had a little different experience with this while developing an application that uses OTP heavily. In this app I have to do a lot of exit trapping and cleanup on exits. Bear in mind I am still very much learning while I work on it, but this is what I think I have learned so far.

GenServers can trap exits, and must handle those exit signals with handle_info({:EXIT, from, reason}, state) when the exit is sent by any process that is not the parent. This will happen when a linked process exits or if you do something like Process.exit(pid, reason). If you do catch the exit and you still want the GenServer to exit eventually, maybe after logging the message for example, you can return {:stop, reason, state}, which will immediately invoke terminate(reason, state). However, if the exit reason is :kill then the GenServer will immediately exit without invoking handle_info/2 or terminate/2.

When a Supervisor attempts a graceful shutdown, for example on a SIGTERM, :init.stop(), or System.stop(), the Supervisor will stop all of its children with reason :shutdown. Unlike the case of a linked process exiting or a Process.exit/2 call, terminate/2 is invoked immediately, without handle_info/2 ever being called, even when trapping exits. If the GenServer takes too long (> timeout) or the shutdown strategy is :brutal_kill, then the Supervisor will send a :kill signal and the GenServer will immediately exit.
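For reference, that grace period is configured per child. A minimal sketch (the module name and the 10-second value are illustrative, not from the thread's app):

```elixir
defmodule Worker do
  # :shutdown is the grace period the Supervisor gives terminate/2 before
  # escalating to :kill; :brutal_kill would skip terminate/2 entirely.
  use GenServer, shutdown: 10_000

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg) do
    # Trapping exits is what allows terminate/2 to run on a :shutdown exit.
    Process.flag(:trap_exit, true)
    {:ok, arg}
  end

  @impl true
  def terminate(_reason, _state) do
    # cleanup runs here during a graceful shutdown
    :ok
  end
end
```

Started under a Supervisor (e.g. `Supervisor.start_link([{Worker, :state}], strategy: :one_for_one)`), this child gets up to 10 seconds in terminate/2 before being killed.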

Again, this is what I think I’ve learned based on my experience working on this project :slightly_smiling_face:

To do my best to help with the question:

I think this is happening because your spawned process is exiting with reason :normal, which will not take a linked process down with it. Think of Task.start_link: if your task completes its job, it doesn’t kill your caller. This is why you need to be careful using start_link rather than starting your process under a Supervisor, because a linked process can get orphaned if you don’t handle it correctly.
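A standalone sketch of that difference (nothing here is from the thread’s app code): a linked process exiting with reason :normal leaves its link partner alone, while an abnormal exit takes the partner down when it is not trapping exits.

```elixir
parent = self()

spawn(fn ->
  spawn_link(fn -> :ok end)          # inner process exits with reason :normal
  Process.sleep(50)                  # still alive: :normal did not propagate
  send(parent, :survived_normal)
end)

mid = spawn(fn ->
  spawn_link(fn -> exit(:boom) end)  # abnormal exit propagates over the link
  Process.sleep(:infinity)
end)

Process.sleep(100)

receive do
  :survived_normal -> IO.puts("outer process survived the :normal exit")
end

IO.puts("mid alive after :boom? #{Process.alive?(mid)}")  # false
```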

When you’re trapping exits, you receive that :normal signal as a message in your inbox, handle_info({:EXIT, from, :normal}, state), and it crashes because you are not handling it. I think you want something like this:

defmodule Stack do
  use GenServer
  require Logger

  def start_link(stack), do: GenServer.start_link(__MODULE__, stack)
  
  @impl true
  def init(stack) do
    Process.flag(:trap_exit, true)
    {:ok, stack}
  end
  
  @impl true
  def handle_info({:EXIT, _, reason}, stack) do
    {:stop, reason, stack}
  end
  
  @impl true
  def terminate(reason, _) do
    # return value doesn't matter
    Logger.info("exiting with reason #{reason}")
  end
end

# from your websocket process
{:ok, pid} = Stack.start_link([:hello])

This will handle an exit signal of any kind in your Stack, optionally log the reason, and ensure that it still exits.

You could also not trap exits and instead set up two-way process monitoring, handle the :DOWN messages, and start your Stack processes under a Supervisor.

2 Likes

You could use a middleman GenServer, responsible for starting/stopping additional processes, together with a DynamicSupervisor.

It is not reliable to use terminate as a cleanup callback.

Trapping exits is not the only way to get information about a dying process; there is also monitor. The main difference is that you can have multiple monitors on the same process.

I have something similar: stopping a process when the websocket disconnects. I use a monitor and an additional server doing the cleanup job. I prefer to delegate to another process instead of overloading my server with cleanup functions.

2 Likes

Alright, thanks, that answers my last question :slight_smile:

I don’t think this example works; I don’t see the log line.

Then who will ensure that the new process is stopped when it needs to be?

I don’t think it needs to be reliable; it only needs to handle a normal exit, as all other exits are already handled by the fact that it is a link.

1 Like

The manager could stop the worker…

It means it might not be called when the system is really busy…

1 Like

You are right, I made a mistake :slightly_smiling_face: A :normal reason does not invoke handle_info but goes right to terminate. Try this instead

defmodule Stack do
  use GenServer
  
  require Logger

  def start_link(stack), do: GenServer.start_link(__MODULE__, stack)
  
  @impl true
  def init(stack) do
    Process.flag(:trap_exit, true)
    {:ok, stack}
  end
  
  @impl true
  def handle_info({:EXIT, _, reason}, stack) do
    {:stop, reason, stack}
  end
  
  @impl true
  def terminate(reason, _) do
    Logger.info("exiting with reason #{inspect(reason)}")
    :ok
  end
end

I’ll edit my original post

1 Like

I have the feeling that the behavior of trapping exits is a bit confusing and not entirely intuitive, which is why monitor is often preferred. Monitor always sends a :DOWN message that can be handled no matter how the process exits. It just might need to be bi-directional in your case.

defmodule Stack do
  use GenServer
  require Logger

  def start_link(pid), do: GenServer.start_link(__MODULE__, pid)
  
  @impl true
  def init(pid) do
    ref = Process.monitor(pid)
    {:ok, ref}
  end
  
  # the :DOWN ref and the state ref must be the same
  @impl true
  def handle_info({:DOWN, ref, _, _, reason}, ref) do
    Logger.info("exiting with reason #{reason}")
    {:stop, reason, ref}
  end
end

Test with spawn(fn -> Stack.start_link(self()) end)

You would have to monitor your Stack from the websocket process and have similar message handling

I have read this, but feel like I’m not quite sure what it means. Can you expand on why it’s not reliable? I have assumed that what makes it unreliable is that the behavior can be easily misunderstood and handled incorrectly, and also that it won’t work as expected if things don’t shut down gracefully. Is that an incorrect understanding? I am not challenging the assertion, just really want to learn better from someone with experience :slightly_smiling_face:

My app is a multiplayer game, and there are many server processes holding state about each game. I want games to continue (relatively) uninterrupted during deployments and topology changes, so I save the state in the database in terminate and retrieve it when the server starts. I have tested this simulating “normal” conditions (SIGTERM/System.stop() at shutdown) and it has worked, gracefully stopping as many as 100,000 servers on one node with 0% loss of handoff state. When will this break down? What would be an alternative? I suppose I could persist every change to the state in the database, or persist regularly on a timeout, so that I don’t have to rely on trapping exits. It seems like overkill in the number of transactions, but maybe more reliable? Curious to hear your thoughts

1 Like

Do you have any reference to documentation that mentions this?

1 Like

from GenServer — Elixir v1.13.4

tldr
Therefore it is not guaranteed that terminate/2 is called when a GenServer exits. For such reasons, we usually recommend important clean-up rules to happen in separated processes either by use of monitoring or by links themselves.
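A minimal sketch of that recommendation (the module name and API here are mine, not from the docs): a separate process monitors the servers and runs the cleanup itself. Because a monitor delivers a :DOWN message for every exit reason, including :kill, the cleanup runs even in cases where terminate/2 never would.

```elixir
defmodule Janitor do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  # Register a pid together with a cleanup function to run when it dies.
  def watch(pid, cleanup_fun), do: GenServer.call(__MODULE__, {:watch, pid, cleanup_fun})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:watch, pid, fun}, _from, state) do
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(state, ref, fun)}
  end

  # :DOWN fires no matter how the watched process exited.
  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    {fun, state} = Map.pop(state, ref)
    if fun, do: fun.()
    {:noreply, state}
  end
end
```

Usage: `Janitor.watch(server_pid, fn -> cleanup() end)` after starting each server; the cleanup closure then runs even if the server is brutally killed.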

1 Like

One book I really enjoyed is The little Elixir OTP guidebook. It explains how poolboy is made.

What I learned is You can use additional processes instead of doing everything in the same server.

I also learned the Manager/Supervisor/Workers construct.

2 Likes

If the GenServer receives an exit signal (that is not :normal) from any process when it is not trapping exits it will exit abruptly with the same reason and so not call terminate/2. Note that a process does NOT trap exits by default and an exit signal is sent when a linked process exits or its node is disconnected.

Therefore it is not guaranteed that terminate/2 is called when a GenServer exits.

I read this to mean “it is not guaranteed terminate will be called by default”, not "terminate may or may not be called, even if you are trapping exits". Is that wrong? I feel like the documentation on this has room to be more clear.

1 Like

It seems to me “Therefore” refers to the previous paragraph, which only makes mention of the trap_exit flag, not any other situations.

2 Likes

It is just a recommendation, You might decide not to follow…

1 Like

Me too, I like game server…

In my case, one server per game, but game can hold multiple players.

I monitor player exits, which might put the game in an idle state; when all players leave, I start a timer to kill the game. The timer helps avoid killing the game on the spot, which is also useful in dev to avoid losing state when live_reload is triggered.

Yes, or anywhere… a flat file could also store the game state with term_to_binary.
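The flat-file idea above can be sketched in a couple of lines (the path and the sample state are illustrative, not from either game):

```elixir
# Persist an arbitrary Elixir term to a file and load it back.
state = %{board: [[nil, :x], [:o, nil]], turn: :x}

path = Path.join(System.tmp_dir!(), "game.state")  # illustrative path
File.write!(path, :erlang.term_to_binary(state))

loaded = path |> File.read!() |> :erlang.binary_to_term()
# loaded == state
```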

I use ETS to persist server state and reload from it if needed, but a db is better for persistence.

I have done this for board games, but also more interactive 3d worlds with ThreeJS on the frontend.

1 Like

Yes, same here, sorry my wording was unclear. I should have said “there are many server processes, each one holding state about a game”. Face palm

Same, after 10s with no connected players the game shuts down. But in my case players can disconnect and come back to the same game later, so the game does not “die”. This is one of the reasons why I need state handoff. That case is a simple one, just save on exit and load on init. It’s guaranteeing handoff during topology changes that’s tricky…

I need the db for distribution mostly. Games and their state will expire after only three hours, so long-term persistence is not required. Stale games are pruned from the db regularly, and servers automatically shut down when they expire. This is more of a board game with a limited number of state changes, so I may just persist on each event and avoid the handoff complexity and the need for terminate cleanup…

Thanks for your insight, it’s given me much to think about!

1 Like

As much as I appreciate this chat about game state it doesn’t have much to do with my original question.

I’ll summarize my own conclusion, which I reached with help from the Erlang Slack. The upside of trapping exits is that terminate gets called when the process that started your GenServer exits; otherwise it would not be called. However, this also means that if you link any additional processes from the GenServer, you need to handle their exits as well.

The other option is monitoring. This means you can handle the parent process exiting in a more precise manner. However, it requires more lines of code.