Supervisor of dynamic children not restarting after crash

mbaeuerle · August 16, 2017, 1:09pm

We have a supervisor structure where one supervisor adds two children dynamically. They are added dynamically because the second child needs to know the pid of the first one.

Now we experience problems when the supervisor is crashing, because the supervisor is not restarted thereafter.
(If one of the children crashes, they are restarted totally fine.)

Edit: We are using Elixir 1.5.1

I could reproduce the problem in a small sample project:

Application

defmodule Sup.Application do
  use Application

  def start(_type, _args) do
    children = [
      {Sup.Supervisor, []},
      {Dyn.Sup.Supervisor, []}
    ]
    opts = [strategy: :one_for_one, name: SupShit.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Dyn.Sup.Supervisor

defmodule Dyn.Sup.Supervisor do
  use Supervisor

  def start_link(_) do
    {:ok, sup} = Supervisor.start_link(__MODULE__, [], name: __MODULE__)

    {:ok, worker1} = Supervisor.start_child(sup, worker(Worker, []))
    # In our real app there would be another worker which depens on the first one 
    # but is not neccessary to reproduce the problem
    #Supervisor.start_child(sup, worker(Worker2, [worker1]))
  end

  def init([]) do
    Supervisor.init([], strategy: :one_for_one)
  end
end

Worker

defmodule Worker do

  def start_link(_) do
    start_link()
  end

  def start_link() do
    GenServer.start_link(__MODULE__, [])
  end

  def init(_) do
    {:ok, nil} 
  end
end

Now when using :observer and killing the Dyn.Sup.Supervisor it is not restarted.

I then tried to reproduce this with a supervisor with static children but that was restarted after the crash as expected:

Sup.Supervisor

defmodule Sup.Supervisor do
  use Supervisor

  def start_link(_) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init([]) do
    children = [
      worker(Worker, [])
    ]
    supervise(children, strategy: :one_for_one)
  end
end

No my question is, is there a problem how we create the dynamic children? Or musn’t they be added in in start_link?

dom · August 16, 2017, 1:20pm

The return value of Dyn.Sup.Supervisor’s start_link has to be {:ok, sup}, not {:ok, worker1}.

mbaeuerle · August 16, 2017, 1:26pm

Oh man thanks ! Sometimes one just can’t see the easiest answer.

hubertlepicki · August 16, 2017, 1:30pm

on a side note, I do not see where you are adding the children dynamically. Do you have different code in your app or all you do is to start 2 workers from one supervisor, and the dynamic part is that the 2nd one need to know pid of the 1rst one?

If it is the case of having the need to reference first worker from second, maybe a name registration would do - instead of passing the PID around you would have to know registered name of worker1.

mbaeuerle · August 16, 2017, 1:49pm

Thanks, that’s also a good hint!

peerreynders · August 16, 2017, 3:03pm

The other potential issue is the nature of the dependency between worker2 and worker1. If the interaction between both is stateless then there is no problem with :one_for_one because there is no state in worker1 that worker2 depends on - it’s good enough to contact the replacement worker1 through the registered name.

If the dependency is stateful however there’s a problem because a replacement worker1 won’t have the necessary state from past interactions with worker2. Typically that is what :rest_for_one is there for - worker1 goes down so worker2 goes with it - however I suspect that the “dynamic” nature of the supervision conflicts with this approach.

One hack would be to store the (relevant part of) worker1 state in an ets table that is owned by it’s supervisor - that way the replacement worker1 can re-initialize from it when it starts up.

Sometimes it’s just necessary to refine the supervision structure:

S1
|
-----
|   |
W1 S2
    |
   W2

e.g. if dynamic workers under S2 are statefully dependent on W1 they are a lost cause when W1 crashes - so :rest_for_one on S1 will terminate S2 along with anything under it. Meanwhile S2 can focus entirely on managing the dynamic processes.