Horde.DynamicSupervisor.terminate_child is not working!

gavid · February 9, 2024, 1:15pm

Horde.DynamicSupervisor.terminate_child seems to be unable to find the children of the Horde.DynamicSupervisor, even when they are clearly defined.

Here is a minimal reproducible example:

init_arg = fn name ->
  [
    name: name,
    strategy: :one_for_one,
    distribution_strategy: Horde.UniformDistribution,
    max_restarts: 100_000,
    max_seconds: 1,
    members: :auto
  ]
end

{:ok, root_supervisor} = Horde.DynamicSupervisor.start_link(init_arg.(:root_supervisor))

for num <- 1..10 do
  child_supervisor = :"child_supervisor#{num}"

  {:ok, _child_supervisor_pid} =
    Horde.DynamicSupervisor.start_child(root_supervisor, %{
      id: child_supervisor,
      start: {Horde.DynamicSupervisor, :start_link, [init_arg.(child_supervisor)]}
    })
end

IO.puts("After creation...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")

for {id, pid, :worker, _startup_module} <-
      Horde.DynamicSupervisor.which_children(root_supervisor),
    String.starts_with?(Atom.to_string(id), "child_supervisor") do
  IO.puts("Trying to terminate worker with id: #{id} and process id: #{inspect(pid)}")
  :ok = Horde.DynamicSupervisor.terminate_child(root_supervisor, pid)
end

IO.puts("After termination...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")
:done

gavid · February 9, 2024, 1:17pm

@derekkraan
You have made an awesome library which we believe is capable of meeting our distributed Supervision needs.

Please can you help me understand this issue so that I can finish implementing it within our code base?

derekkraan · February 12, 2024, 7:40am

Hi @gavid ,

It took me a minute to figure out what was going on here. The short story is, you should use the name of your Horde.DynamicSupervisor, not its pid. The pid you get back is only suitable for use in the supervision tree, and it is not meant to be used in any of the other functions in Horde.DynamicSupervisor.

See below the modified version of your test script:

init_arg = fn name ->
  [
    name: name,
    strategy: :one_for_one,
    distribution_strategy: Horde.UniformDistribution,
    max_restarts: 100_000,
    max_seconds: 1,
    members: :auto
  ]
end

{:ok, _root_supervisor} = Horde.DynamicSupervisor.start_link(init_arg.(:root_supervisor))
root_supervisor = :root_supervisor

for num <- 1..10 do
  child_supervisor = :"child_supervisor#{num}"

  {:ok, _child_supervisor_pid} =
    Horde.DynamicSupervisor.start_child(root_supervisor, %{
      id: child_supervisor,
      start: {Horde.DynamicSupervisor, :start_link, [init_arg.(child_supervisor)]}
    })
end

IO.puts("After creation...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")

for {id, pid, :worker, _startup_module} <-
      Horde.DynamicSupervisor.which_children(root_supervisor),
    String.starts_with?(Atom.to_string(id), "child_supervisor") do
  IO.puts("Trying to terminate worker with id: #{id} and process id: #{inspect(pid)}")
  :ok = Horde.DynamicSupervisor.terminate_child(root_supervisor, pid)
end

IO.puts("After termination...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")
:done

gavid · February 12, 2024, 8:29am

Thanks for getting back to me Derek!

Okay. So this is different from the DynamicSupervisor API, which allows you to use pids for supervisors and children. E.g:

...(n0@127.0.0.1)3> {:ok, ordinary_dynamic_pid} = DynamicSupervisor.start_link([name: :ordinary_dynamic_supervisor])
{:ok,
 #PID<0.2900.0>}

iex(n0@127.0.0.1)4> {:ok, child_pid} = DynamicSupervisor.start_child(ordinary_dynamic_pid, %{id: nil, start: {Task, :start_link, [fn -
> Process.sleep(:infinity) end ] }})
{:ok,
 #PID<0.2919.0>}

iex(n0@127.0.0.1)5> DynamicSupervisor.terminate_child(ordinary_dynamic_pid, child_pid)
:ok

For Horde.DynamicSupervisor, what about the case when we don’t have :root_supervisor as a name on the local node?

In the real-life version of this, I register the :root_supervisor under my instantiation of the Horde.Registry, and then use its pid (retrieved from the registry as needed) to refer to it across the cluster.

gavid · February 12, 2024, 8:36am

If I run the first part of the code (Up to “After creation…”) on node 1, here is what I see:

iex(n0@127.0.0.1)17> iex(n0@127.0.0.1)17> IO.inspect(Horde.DynamicSupervisor.which_children(:root_supervisor), pretty: true)
[
  {:undefined, #PID<0.2971.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.2978.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.2985.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.2992.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.2999.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.3006.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.3013.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.3020.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.3027.0>, :worker, [Horde.DynamicSupervisor]},
  {:undefined, #PID<0.3034.0>, :worker, [Horde.DynamicSupervisor]}
]
...

And then I try to terminate a child using the supervisor name, it works:

iex(n0@127.0.0.1)18> Horde.DynamicSupervisor.terminate_child(:root_supervisor, pid(0, 3034, 0))
:ok

But, the other node can’t see the supervisor at all:

iex(n1@127.0.0.1)3> IO.inspect(Horde.DynamicSupervisor.which_children(:root_supervisor), pretty: true)
** (exit) exited in: GenServer.call(:root_supervisor, :which_children, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.14.3) lib/gen_server.ex:1027: GenServer.call/3
    iex:3: (file)
iex(n1@127.0.0.1)3> Horde.DynamicSupervisor.terminate_child(:root_supervisor, pid(0, 3027, 0))
** (exit) exited in: GenServer.call(:root_supervisor, {:terminate_child, #PID<0.3027.0>}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.14.3) lib/gen_server.ex:1027: GenServer.call/3
    iex:3: (file)
iex(n1@127.0.0.1)3>

Hence the desire to use pids and not names

derekkraan · February 12, 2024, 8:43am

What is your use case that you need dynamic dynamic supervisors?

derekkraan · February 12, 2024, 8:52am

Regardless, the reason we can’t make it work this way is because Horde.DynamicSupervisor has its own mini supervision tree. The top of that tree has to be returned from start_link/3, so that it can fit neatly into your application’s supervision tree. The process that is receiving messages when you call Horde.DynamicSupervisor.which_children/1 for example is a child of this supervisor.

I would like to find a way to make these both the same, but I currently don’t know how that would work or even if it is workable.

gavid · February 12, 2024, 9:09am

We have a meta-programming based IDE that allows human and AI agents to create application flows and have them compile to Elixir under the hood. Among the things that these flows can do is start different kinds of Supervisors.

The concept works well, and it has been used in production for a few years.

The challenge now is that we want to be able to support multiple nodes in a cluster with full location transparency.

gavid · February 15, 2024, 2:10pm

@derekkraan
FYI, found a work-around. I used :supervisor.delete_child (Erlang -- supervisor) followed by :supervisor.terminate_child (Erlang -- supervisor). This works, even on Horde.Supervsior. Both functions accept a pid for the first argument (i.e. to refer to the supervisor), and an id to refer to the child.

    supervisor_pid = MyApp.Cluster.HordeRegistry.lookup_pid(supervisor_id)

    Enum.map(Horde.DynamicSupervisor.which_children(supervisor_pid), fn
      # In the case of dead children, pid will be :undefined
      {worker_id, pid, :worker, [MyApp.Worker]} ->
        if worker_id in workers_to_terminate do
          Logger.info("TERMINATING MyApp.Worker process with id #{worker_id}")
          # dead children will show as :undefined
          if is_pid(pid) do
            :ok = :supervisor.terminate_child(supervisor_pid, worker_id)
          end
          :ok = :supervisor.delete_child(supervisor_pid, worker_id)
        end

      _ ->
        nil
    end)

gavid · February 15, 2024, 2:13pm

To edit the original script accordingly:

init_arg = fn name ->
  [
    name: name,
    strategy: :one_for_one,
    distribution_strategy: Horde.UniformDistribution,
    max_restarts: 100_000,
    max_seconds: 1,
    members: :auto
  ]
end

{:ok, root_supervisor} = Horde.DynamicSupervisor.start_link(init_arg.(:root_supervisor))

for num <- 1..10 do
  child_supervisor = :"child_supervisor#{num}"

  {:ok, _child_supervisor_pid} =
    Horde.DynamicSupervisor.start_child(root_supervisor, %{
      id: child_supervisor,
      start: {Horde.DynamicSupervisor, :start_link, [init_arg.(child_supervisor)]}
    })
end

IO.puts("After creation...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")

for {id, pid, :worker, _startup_module} <-
      Horde.DynamicSupervisor.which_children(root_supervisor),
    String.starts_with?(Atom.to_string(id), "child_supervisor") do
  IO.puts("Trying to terminate worker with id: #{id} and process id: #{inspect(pid)}")
  # NOTE: Switched from pid to id
  :ok = :supervisor.terminate_child(root_supervisor, id)
  :ok = :supervisor.delete_child(root_supervisor, id)
end

IO.puts("After termination...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")
:done

derekkraan · February 19, 2024, 10:31am

I do not think this is doing what you think it is doing. I believe you are starting your child processes as children of a regular (non-distributed) supervisor. If your node goes down, for example, then your processes will not be restarted on another node.

If you don’t need this behaviour, then I would suggest that you probably don’t need Horde, and can get away with using Elixir’s built-in supervisor and registry.

Hope this helps.

gavid · June 23, 2024, 6:00pm

@derekkraan

I have moved the discussion back here because I can’t point this to any specific issue in the library.

I thought that I had a solution above, but what I was actually doing was adding children directly to the Horde.DynamicSupervisor, rather than to the correct place in the tree.

To clarify, this is the full minimal reproducible example:

init_arg = fn name ->
  [
    name: name,
    strategy: :one_for_one,
    distribution_strategy: Horde.UniformDistribution,
    max_restarts: 100_000,
    max_seconds: 1,
    members: :auto
  ]
end

{:ok, root_supervisor} = Horde.DynamicSupervisor.start_link(init_arg.(:root_supervisor))

for num <- 1..10 do
  child_supervisor = :"child_supervisor#{num}"

  {:ok, _child_supervisor_pid} =
    Horde.DynamicSupervisor.start_child(root_supervisor, %{
      id: child_supervisor,
      start: {Horde.DynamicSupervisor, :start_link, [init_arg.(child_supervisor)]}
    })
end

IO.puts("After creation...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")

for {id, pid, :worker, _startup_module} <-
      Horde.DynamicSupervisor.which_children(root_supervisor),
    String.starts_with?(Atom.to_string(id), "child_supervisor") do
  IO.puts("Trying to terminate worker with id: #{id} and process id: #{inspect(pid)}")
  # NOTE: Switched from pid to id
  :ok = :supervisor.terminate_child(root_supervisor, id)
  :ok = :supervisor.delete_child(root_supervisor, id)
end

IO.puts("After termination...")
IO.inspect(Horde.DynamicSupervisor.which_children(root_supervisor), pretty: true)
IO.puts("-----------")
:done

Here is what After creation... printout looks like. This is not the way things should be because the workers should not exist at the part of the tree. The only exist because they were added incorrectly by using the supervisor pid and not the name.

After creation...
[
  {:child_supervisor10, #PID<0.329.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor9, #PID<0.322.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor8, #PID<0.315.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor7, #PID<0.308.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor6, #PID<0.301.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor5, #PID<0.294.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor4, #PID<0.287.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor3, #PID<0.280.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor2, #PID<0.273.0>, :worker, [Horde.DynamicSupervisor]},
  {:child_supervisor1, #PID<0.266.0>, :worker, [Horde.DynamicSupervisor]},
  {Horde.NodeListener, #PID<0.265.0>, :worker, [Horde.NodeListener]},
  {:root_supervisor_telemetry_poller, #PID<0.264.0>, :worker,
   [:telemetry_poller]},
  {Horde.SignalShutdown, #PID<0.263.0>, :worker, [GenServer]},
  {:"root_supervisor.ProcessesSupervisor", #PID<0.262.0>, :supervisor,
   [Horde.ProcessesSupervisor]},
  {Horde.DynamicSupervisorImpl, #PID<0.261.0>, :worker,
   [Horde.DynamicSupervisorImpl]},
  {:"root_supervisor.Crdt", #PID<0.260.0>, :worker, [DeltaCrdt]}
]

gavid · June 23, 2024, 6:02pm

Can you see any way in which it would be possible to have Horde.DynamicSupervisor children started at runtime as children of a parent Horde.DynamicSupervisor, and have each parent and child supervisor behave correctly as expected?

gavid · June 24, 2024, 8:51am

Derek Kraan answered here: added explanatory note to Horde.DynamicSupervisor by gmdorf · Pull Request #271 · derekkraan/horde · GitHub