Hello,
I want to have a DynamicSupervisor where each child is responsible for running a
workflow to completion. The children are simple GenServers that check out a
state from a database, run a couple of operations, and are able to respond to a
couple of calls.
I use processes here because I want to bottleneck all operations on the same ID
to the same process, and have them handled with the latest state.
To achieve that I use simple name registration with Registry. Processes are
transient: when there is no more work to do, they exit after a short timeout.
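For context, by "exit after a short timeout" I mean the usual idle-timeout pattern: each callback returns a timeout, and when it fires with nothing left to do the server stops with reason :normal, which a :transient child may do without being restarted. A minimal sketch (the IdleWorker name and the 100 ms value are just for the example):

```elixir
defmodule IdleWorker do
  use GenServer, restart: :transient

  # Short value so the example is quick to observe; the real timeout is longer.
  @idle_timeout 100

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  @impl true
  def init(id) do
    # Arm the idle timer right away.
    {:ok, id, @idle_timeout}
  end

  @impl true
  def handle_call(:ping, _from, state) do
    # Every interaction re-arms the idle timer.
    {:reply, :pong, state, @idle_timeout}
  end

  @impl true
  def handle_info(:timeout, state) do
    # Nothing arrived for @idle_timeout ms: work is done, stop normally.
    {:stop, :normal, state}
  end
end
```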
When I receive a new event for one of those GenServers, I boot it so it can
restore its state, and send it the event. At this point it will handle the
event but may also continue doing work for a while, and it has to be restarted
on error if there is a problem doing that work, until it finishes. State is
persisted after the event is handled, and after each subsequent operation
(additional work) following that event, until the state is considered stable –
work is done and new events may be accepted – and then the process can exit.
So, transient GenServers.
Now, if there is a crash in that GenServer at the same moment a new event
arrives and we need to boot the GenServer to handle it, this is what happens:
1. OLD child exits because of an error
2. OLD child is removed from the Registry
3. client wants to send an event to the child
4. client starts a NEW child on the DynamicSupervisor
5. NEW child is started and registered in the Registry
6. DynamicSupervisor wants to restart the transient OLD child
7. OLD child's start_link function returns :already_started
8. DynamicSupervisor keeps retrying until it reaches max restart intensity, and crashes
Another solution would be to use a classic Supervisor, as I can provide IDs for
the children. But if I use transient children, the Supervisor will keep the
child specs forever, accumulating useless data. I could then switch to
temporary children, but they would not be restarted, obviously, and so errors
happening after an external input (events) would cause the work to never be
completed.
The best solution I can think of is to run my own monitor process that would
watch the temporary children and re-boot a child if its exit reason is not an
expected one. But that seems to duplicate work that OTP should already handle,
doesn't it?
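To be concrete, here is a rough sketch of that "own monitor" idea, reusing the Conflict.ChildSup and Conflict.Child names from the reproduction below (the Conflict.Rebooter module itself is hypothetical and not part of the reproduction):

```elixir
defmodule Conflict.Rebooter do
  use GenServer

  # Exit reasons that mean "work finished", so no re-boot is needed.
  @expected_exits [:normal, :shutdown]

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Register a freshly started worker pid for watching.
  def watch(pid, id), do: GenServer.cast(__MODULE__, {:watch, pid, id})

  @impl true
  def init(_opts), do: {:ok, %{}}

  @impl true
  def handle_cast({:watch, pid, id}, refs) do
    ref = Process.monitor(pid)
    {:noreply, Map.put(refs, ref, id)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, refs) do
    {id, refs} = Map.pop(refs, ref)

    if id != nil and reason not in @expected_exits do
      # Abnormal exit: boot the worker again and watch the new pid.
      case DynamicSupervisor.start_child(Conflict.ChildSup, {Conflict.Child, id}) do
        {:ok, pid} -> watch(pid, id)
        # Lost a race against another starter; watch whoever won.
        {:error, {:already_started, pid}} -> watch(pid, id)
      end
    end

    {:noreply, refs}
  end
end
```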
What would be the best solution in my case? My goal is to provide a simple
ensure_child(id) function that fetches the current pid if the worker process
for that ID is up, or starts it if it is not already started.
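For reference, the naive ensure_child/1 I would write looks like this, using the module names from the reproduction below. It absorbs a race against another caller via the :already_started clause, but it does not solve the supervisor restart race described above:

```elixir
defmodule Conflict.Ensure do
  # Sketch only: look the pid up in the Registry first, and only ask the
  # DynamicSupervisor to start the child when the lookup comes back empty.
  def ensure_child(id) do
    case Registry.lookup(Conflict.Reg, {Conflict.Child, id}) do
      [{pid, _value}] ->
        # Worker is up and registered: reuse it.
        {:ok, pid}

      [] ->
        case DynamicSupervisor.start_child(Conflict.ChildSup, {Conflict.Child, id}) do
          {:ok, pid} ->
            {:ok, pid}

          {:error, {:already_started, pid}} ->
            # Another caller won the race; the returned pid is the live one.
            {:ok, pid}

          other ->
            other
        end
    end
  end
end
```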
If you want to reproduce the problem, you would create a new application:
mix new conflict --sup
cd conflict
mkdir config
touch config/config.exs
And write those files:
# application.ex
defmodule Conflict.Application do
  # See https://hexdocs.pm/elixir/Application.html
  # for more information on OTP Applications
  @moduledoc false

  use Application

  @impl true
  def start(_type, _args) do
    children = [
      Conflict.ChildSup,
      {Registry, keys: :unique, name: Conflict.Reg}
    ]

    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: Conflict.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
defmodule Conflict.ChildSup do
  # Automatically defines child_spec/1
  use DynamicSupervisor

  def start_link(init_arg) do
    DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  def start_child(child_spec) do
    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  @impl true
  def init(_init_arg) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end
end
defmodule Conflict.Child do
  use GenServer, restart: :transient

  def via(id) do
    {:via, Registry, {Conflict.Reg, {__MODULE__, id}}}
  end

  def start_link(id) do
    GenServer.start_link(__MODULE__, id, name: via(id))
  end

  @impl true
  def init(id) do
    IO.puts("child #{id} initialized")
    {:ok, id}
  end

  @impl true
  def handle_cast({:exec, f}, state) do
    # handle_cast/2 cannot reply, so just run the function for its effects
    f.()
    {:noreply, state}
  end
end
# config.exs
import Config

config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id]

config :logger,
  level: :debug,
  handle_otp_reports: true,
  handle_sasl_reports: true,
  # level: :warn,
  # compile_time_purge_matching: [[level_lower_than: :warn]],
  console: [format: "[$level] $levelpad$time $message\n", metadata: [:module]]
Then create a tt.exs file at the root of the project with this content:
defmodule Conflict.Test do
  def force_start(id) do
    case Conflict.ChildSup.start_child({Conflict.Child, id}) do
      {:ok, _pid} ->
        IO.puts("started #{id}")
        :ok

      {:error, {:already_started, _}} ->
        force_start(id)
    end
  end
end

:ok = Conflict.Test.force_start(:one)

{starter, ref} = spawn_monitor(fn -> Conflict.Test.force_start(:one) end)

GenServer.cast(Conflict.Child.via(:one), {:exec, fn -> exit(:byebye) end})

receive do
  {:DOWN, ^ref, :process, ^starter, reason} ->
    IO.puts("second starter down: #{inspect(reason)}")
after
  1000 ->
    IO.puts("it did not work, please retry")
    exit(:normal)
end
Then run mix run tt.exs. The race condition does not always happen; you may
have to retry.