Name conflict when using dynamic supervisor

Hello,

I want to have a DynamicSupervisor where each child is responsible for running a
workflow to completion. The children are simple GenServers that check out a
state from a database, run a couple of operations, and can respond to a few
calls.

I use processes here because I want to funnel all operations on the same ID to
the same process, and have them handled with the latest state.

To achieve that I use simple name registration with Registry. The processes are
transient: when there is no more work to do, they exit after a short timeout.
When I receive a new event for one of those GenServers, I boot it so it can
restore its state, then send it the event. At that point it handles the event
but may also keep doing work for a while, and has to be restarted on error if
something goes wrong with that work, until it finishes. State is persisted after
the event is handled, and after each subsequent operation (additional work)
following that event, until the state is considered stable – work is done and
new events may be accepted – and then the process can exit.

So, transient GenServers.
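
To make that lifecycle concrete, here is a rough sketch of such a worker; fetch_state/1, handle_event/2, persist_state/1 and done?/1 are hypothetical placeholders for the real database and business logic, not code from the project:

defmodule Workflow.Worker do
  use GenServer, restart: :transient

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  @impl true
  def init(id) do
    # Check the last persisted state out of the database.
    {:ok, fetch_state(id)}
  end

  @impl true
  def handle_call({:event, event}, _from, state) do
    state = state |> handle_event(event) |> persist_state()

    if done?(state) do
      # Work is done and the state is stable: exit normally.
      {:stop, :normal, :ok, state}
    else
      # In the real worker the follow-up operations would be scheduled here
      # (e.g. with handle_continue), persisting after each one.
      {:reply, :ok, state}
    end
  end

  # Hypothetical placeholders.
  defp fetch_state(id), do: %{id: id, pending_work?: false}
  defp handle_event(state, _event), do: %{state | pending_work?: true}
  defp persist_state(state), do: state
  defp done?(state), do: not state.pending_work?
end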

Now, if there is a crash in that GenServer just at the moment a new event
arrives and we need to boot the GenServer to handle it, this is what happens:

    OLD child exits because of an error
    OLD child is removed from the Registry
    client wants to send an event to the child
    client starts a NEW child on the DynamicSupervisor
    NEW child is started and registered in the Registry
    DynamicSupervisor wants to restart the transient OLD child
    OLD child's start_link function returns already started
    DynamicSupervisor is angry and retries forever until it reaches max intensity, and crashes

Another solution would be to use a classic Supervisor, since I can provide IDs
for the children. But if I use transient children, the Supervisor will keep the
child specs around, accumulating useless data forever. I could then switch to
temporary children, but those would obviously not be restarted, so errors
happening after an external input (events) would leave the work unfinished.
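
For reference, this is roughly what the classic-Supervisor variant would look like; a sketch only, reusing the repro's Conflict.Child with child spec IDs I would pick myself:

# Start an empty classic Supervisor and add children keyed by ID.
{:ok, sup} = Supervisor.start_link([], strategy: :one_for_one)

spec =
  Supervisor.child_spec({Conflict.Child, :one},
    id: {Conflict.Child, :one},
    restart: :transient
  )

{:ok, _pid} = Supervisor.start_child(sup, spec)

# When the child later exits with reason :normal it is not restarted, but its
# spec stays in the supervisor (Supervisor.which_children/1 keeps listing it
# with pid :undefined) until Supervisor.delete_child(sup, {Conflict.Child, :one})
# is called explicitly.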

The best solution I can think of is to run my own monitor process that would
watch the temporary children and re-boot a child if its exit reason is not an
expected one. But that seems to duplicate work that OTP should already handle,
doesn't it?
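
A rough sketch of that idea, assuming the children are switched to :temporary and reusing the repro modules below; Conflict.Janitor and its watch/2 function are names I am making up here:

defmodule Conflict.Janitor do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Callers register the pid of a freshly started child.
  def watch(pid, id), do: GenServer.cast(__MODULE__, {:watch, pid, id})

  @impl true
  def init(nil), do: {:ok, %{}}

  @impl true
  def handle_cast({:watch, pid, id}, refs) do
    ref = Process.monitor(pid)
    {:noreply, Map.put(refs, ref, id)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, refs) do
    {id, refs} = Map.pop(refs, ref)

    unless reason in [:normal, :shutdown] do
      # Re-boot the child after a crash, which is exactly what a supervisor
      # would do for me (the new pid would then need to be watched again).
      Conflict.ChildSup.start_child({Conflict.Child, id})
    end

    {:noreply, refs}
  end
end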

What would be the best solution in my case? My goal is to provide a simple
ensure_child(id) function that returns the current pid if the worker process
for that ID is up, or starts the process if it is not running yet.
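
In other words, something along these lines (a sketch written against the repro modules below):

def ensure_child(id) do
  case Registry.lookup(Conflict.Reg, {Conflict.Child, id}) do
    [{pid, _value}] ->
      # Worker already running, reuse it.
      {:ok, pid}

    [] ->
      case Conflict.ChildSup.start_child({Conflict.Child, id}) do
        {:ok, pid} -> {:ok, pid}
        {:error, {:already_started, pid}} -> {:ok, pid}
      end
  end
end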

If you want to reproduce the problem, you would create a new application:

mix new conflict --sup
cd conflict
mkdir config
touch config/config.exs

And write those files:

# application.ex

defmodule Conflict.Application do
  # See https://hexdocs.pm/elixir/Application.html
  # for more information on OTP Applications
  @moduledoc false

  use Application

  @impl true
  def start(_type, _args) do
    children = [
      Conflict.ChildSup,
      {Registry, keys: :unique, name: Conflict.Reg}
    ]

    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: Conflict.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

defmodule Conflict.ChildSup do
  # Automatically defines child_spec/1
  use DynamicSupervisor

  def start_link(init_arg) do
    DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  def start_child(child_spec) do
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  @impl true
  def init(_init_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
end

defmodule Conflict.Child do
  use GenServer, restart: :transient

  def via(id) do
    {:via, Registry, {Conflict.Reg, {__MODULE__, id}}}
  end

  def start_link(id) do
    GenServer.start_link(__MODULE__, id, name: via(id))
  end

  @impl true
  def init(id) do
    IO.puts("child #{id} initialized")
    {:ok, id}
  end

  @impl true
  def handle_cast({:exec, f}, state) do
    # Run the given function inside the child process.
    f.()
    {:noreply, state}
  end
end

# config.exs

import Config

config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id]

config :logger,
  level: :debug,
  handle_otp_reports: true,
  handle_sasl_reports: true,
  # level: :warn,
  # compile_time_purge_matching: [[level_lower_than: :warn]],
  console: [format: "[$level] $levelpad$time $message\n", metadata: [:module]]

Then create a tt.exs file at the root of the project with that content:

defmodule Conflict.Test do
  def force_start(id) do
    case Conflict.ChildSup.start_child({Conflict.Child, id}) do
      {:ok, _pid} ->
        IO.puts("started #{id}")
        :ok

      {:error, {:already_started, _}} ->
        # Retry until the old name is gone and we win the registration.
        force_start(id)
    end
  end
end

:ok = Conflict.Test.force_start(:one)
{starter, ref} = spawn_monitor(fn -> Conflict.Test.force_start(:one) end)
GenServer.cast(Conflict.Child.via(:one), {:exec, fn -> exit(:byebye) end})

receive do
  {:DOWN, ^ref, :process, ^starter, reason} ->
    IO.puts("second starter down: #{inspect(reason)}")
after
  1000 ->
    IO.puts("it did not work, please retry")
    exit(:normal)
end

And run mix run tt.exs. The race condition does not always happen, so you may
have to retry.

As a temporary solution I convert the :already_started returns from
GenServer.start_link/3 into an :ignore result. But that means that in the
calling code that wants to send an event to a process, I do not know whether
that :ignore is a real one or just means the process was already started. So I
have to retry fetching the existing pid, even though I may get 5 :ignore
results in a row (if my retry-to-fetch-the-pid loop is capped at 5). Or I could
run a SELECT in the database to verify that the ID I am working with exists,
but it feels unsatisfying.
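
That workaround looks roughly like this in Conflict.Child.start_link/1:

def start_link(id) do
  case GenServer.start_link(__MODULE__, id, name: via(id)) do
    {:error, {:already_started, _pid}} ->
      # Keeps the DynamicSupervisor from retrying forever, at the cost of an
      # ambiguous :ignore for regular callers of start_child/1.
      :ignore

    other ->
      other
  end
end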

I would like to have the possibility to customize the DynamicSupervisor so it
considers already-started errors as fine, while still returning the error tuple
to the caller.