Understanding process links

Hello,

I have an Elixir app where a process depends on another process to be alive and send messages back. This second process is mainly a database connection and may occasionally fail. I created an example app below, which mimics the structure. ModuleA is my worker process, ModuleB is the database connection:

defmodule Test.ModuleA do
  use GenServer

  def start_link(opts \\ []) do
    name = Keyword.get(opts, :name, __MODULE__)
    GenServer.start_link(__MODULE__, nil, name: name)
  end

  def init(_) do
    IO.puts("#{__MODULE__} starting")

    {:ok, nil}
  end
end

defmodule Test.ModuleB do
  use GenServer

  def start_link(opts \\ []) do
    name = Keyword.get(opts, :name, __MODULE__)
    GenServer.start_link(__MODULE__, nil, name: name)
  end

  def init(_) do
    IO.puts("#{__MODULE__} starting")

    Process.send_after(self(), :crash, 5000)

    {:ok, nil}
  end

  def handle_info(:crash, _state) do
    raise "boom"
  end
end

When I use the application’s main supervisor like this, everything works as expected: both processes are restarted when the DB connection (ModuleB) crashes:

def start(_type, _args) do
  children = [
    # Starts a worker by calling: Test.Worker.start_link(arg)
    # {Test.Worker, arg}

    Test.ModuleB,
    Test.ModuleA
  ]

  # See https://hexdocs.pm/elixir/Supervisor.html
  # for other strategies and supported options
  opts = [strategy: :rest_for_one, name: Test.Supervisor]
  Supervisor.start_link(children, opts)
end

But because ModuleB is not really part of my application but of an external package we use, I can’t really supervise it in my supervisor. Therefore my idea was to link ModuleA to ModuleB like so (I also changed my supervisor’s strategy back to :one_for_one):

defmodule Test.ModuleA do
  use GenServer

  def start_link(opts \\ []) do
    name = Keyword.get(opts, :name, __MODULE__)
    GenServer.start_link(__MODULE__, nil, name: name)
  end

  def init(_) do
    IO.puts("#{__MODULE__} starting")

    Test.ModuleB
    |> Process.whereis()
    |> Process.link()

    {:ok, nil}
  end
end

However, this results in a very weird problem where my app completely crashes after two simulated connection problems:

What’s going on? Did I misunderstand process links and/or the supervisor config?

How is ModuleB started in your second example? Do you have the code up in a repo somewhere?

Sure here you go: https://github.com/ream88/process-link-example/blob/main/lib/test/application.ex

It’s basically just started before ModuleA.

The supervisor you start in your application.ex uses the defaults of at most 3 restarts within 5 seconds before it crashes itself. You’re doing 4 restarts per 5 seconds (A and B are restarted, then 5 seconds later A and B are restarted again). With :rest_for_one, only the crashing ModuleB counts against that limit, not the other processes the supervisor restarts as a result of that crash.
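As a rough sketch of where that limit lives: the supervisor accepts :max_restarts and :max_seconds next to :strategy. The fragment below is meant for the start/2 callback in application.ex and reuses the module names from this thread; the value 10 is an arbitrary number picked for illustration, not a recommendation.

```elixir
def start(_type, _args) do
  children = [
    Test.ModuleB,
    Test.ModuleA
  ]

  opts = [
    strategy: :one_for_one,
    name: Test.Supervisor,
    # Defaults are max_restarts: 3 and max_seconds: 5.
    # The value 10 is an assumption chosen only for this example.
    max_restarts: 10,
    max_seconds: 5
  ]

  Supervisor.start_link(children, opts)
end
```

Raising the limit only hides the double counting; it does not change the fact that every crash of ModuleB costs two restarts under :one_for_one with the link in place.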

Also, unexpectedly to many people, the root process of an application (the one whose pid you return from c:Application.start/2) is never restarted. It is not a child of any supervisor, and the application stops as soon as the root pid crashes or stops.

Thanks @LostKobrakai, I thought about that, however I did not see it explicitly being mentioned in the docs (Supervisor — Elixir v1.15.2). I always thought these options are applied per child spec!

:rest_for_one would hardly ever survive a single crash if it counted the restarts of the “rest”. For :one_for_one both processes count, because A is restarted since it exited on its own (through the link), not because the supervisor stopped it.
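To see why A counts as a crash of its own, here is a minimal sketch with plain processes instead of GenServers. The names a and b are stand-ins for ModuleA and ModuleB and are not part of the example repo; the monitor plays the role of the supervisor observing the exit.

```elixir
parent = self()

# "b" stands in for the external connection process: it idles until told to crash.
b =
  spawn(fn ->
    receive do
      :crash -> exit(:boom)
    end
  end)

# "a" stands in for the worker: it links to "b", reports back, then idles.
a =
  spawn(fn ->
    Process.link(b)
    send(parent, :linked)
    Process.sleep(:infinity)
  end)

# Monitor "a" so we can observe why it goes down.
ref = Process.monitor(a)

# Wait until the link is in place before crashing "b".
receive do
  :linked -> :ok
end

send(b, :crash)

reason =
  receive do
    {:DOWN, ^ref, :process, ^a, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(reason, label: "a exited with")
```

Nobody asked a to stop: it exits abnormally with b's exit reason because of the link. A :one_for_one supervisor would therefore see two separate abnormal child exits and count two restarts.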