(Dynamic) Supervisor getting restarted by own parent only after getting killed for second time

For context, I’m implementing a basic poolboy clone using Elixir 1.6’s DynamicSupervisor and Registry. (I’ve written about my trials and tribulations on my blog if you’d like more context.)

Using the code here, let’s show an example of my issue in IEx (started with plain iex -S mix):

iex(1)> PoolToy.start_pool(name: :poolio, worker_spec: Doubler, size: 3)
:ok

This will result in the following supervision tree:

Here, we have:

Now, for my mystery: killing pid 171 in the Observer yields the following in the console:

07:39:49.704 [error] GenServer #PID<0.174.0> terminating
** (stop) killed
Last message: {:EXIT, #PID<0.171.0>, :killed}
State: %DynamicSupervisor{args: [], children: %{#PID<0.175.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.176.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.177.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}}, extra_arguments: [], max_children: :infinity, max_restarts: 3, max_seconds: 5, mod: PoolToy.WorkerSup, name: {#PID<0.174.0>, PoolToy.WorkerSup}, restarts: [], strategy: :one_for_one}
 
07:39:49.705 [error] GenServer :poolio terminating
** (stop) killed
Last message: {:EXIT, #PID<0.171.0>, :killed}
State: %PoolToy.PoolMan.State{monitors: :monitors_poolio, name: :poolio, pool_sup: #PID<0.171.0>, size: 3, worker_spec: Doubler, worker_sup: #PID<0.174.0>, workers: [#PID<0.175.0>, #PID<0.176.0>, #PID<0.177.0>]}

Of importance, is that the PoolsSup dynamic supervisor doesn’t start a new instance of PoolToy.PoolSup to replace the one that was killed.

Now for chapter 2 of the mystery: after restarting a new pool instance with PoolToy.start_pool(name: :poolio, worker_spec: Doubler, size: 3) (in the same IEx session), if I once again kill the PoolToy.PoolSup instance (which isn’t named: it will be displayed with only a pid in Observer) directly descending from the named PoolsSup process I get the following:

07:46:45.987 [error] GenServer :poolio terminating
** (stop) killed
Last message: {:EXIT, #PID<0.175.0>, :killed}
State: %PoolToy.PoolMan.State{monitors: :monitors_poolio, name: :poolio, pool_sup: #PID<0.175.0>, size: 3, worker_spec: Doubler, worker_sup: #PID<0.177.0>, workers: [#PID<0.178.0>, #PID<0.179.0>, #PID<0.181.0>]}
 
07:46:45.990 [error] GenServer #PID<0.177.0> terminating
** (stop) killed
Last message: {:EXIT, #PID<0.175.0>, :killed}
State: %DynamicSupervisor{args: [], children: %{#PID<0.178.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.179.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.181.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}}, extra_arguments: [], max_children: :infinity, max_restarts: 3, max_seconds: 5, mod: PoolToy.WorkerSup, name: {#PID<0.177.0>, PoolToy.WorkerSup}, restarts: [], strategy: :one_for_one}

Where pid 175 is the pool supervisor I killed and and pid 177 is the worker supervisor it had as one of its children.

But this time, a new pool supervisor (usually) gets started by the pools supervisor. (It appears that sometimes a new pool supervisor does NOT get started even the second time around.)

There’s clearly some fundamental understanding about supervision trees that I’m completely missing, as I have no idea what’s going on, besides the fact that it seems to be some sort of race condition. I don’t understand why this is happening, as the worker supervisor is :temporary, therefore only :poolio should get restarted by the parent supervisor (:poolio will then start a new worker supervisor instance). In addition, I’ve tried making the PoolSup one_for_one but that seems to make no difference.

So where’s my mistake? And of course, how can I fix it?

How are you verifying if the children are started or not? Only via observer? Can this be an observer bug that is not refreshing the tree?

My suggestion is to call DynamicSupervisor.which_children/1 in IEx to make sure what you see in the observer is actually what is in the supervisor.

I was indeed only checking via the observer, but it seems that it isn’t an observer bug: calling DynamicSupervisor.which_children(PoolToy.PoolsSup) returns [] when the child isn’t restarted.

1 Like

It is a race condition with named processes. I played with the demo and I noticed that, when there is no child after kill, the PID of PoolsSup changed. This means that the PoolsSup is actually crashing and being restarted. But why would PoolsSup crash? Likely because it reached its max restart strategy.

So I started the shell with iex --loger-sasl-reports true -S mix and killed PID again (this time <0.141.0> and I got this:

00:08:31.465 [error] Child :undefined of Supervisor PoolToy.PoolsSup terminated
** (exit) killed
Pid: #PID<0.141.0>
Start Call: PoolToy.PoolSup.start_link([name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

00:08:31.465 [error] Child PoolToy.PoolMan of Supervisor #PID<0.294.0> (PoolToy.PoolSup) failed to start
** (exit) already started: #PID<0.142.0>
Start Call: PoolToy.PoolMan.start_link([pool_sup: #PID<0.294.0>, name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: 5000
Type: :worker

00:08:31.465 [error] Child :undefined of Supervisor PoolToy.PoolsSup failed to start
** (exit) shutdown: failed to start child: PoolToy.PoolMan
    ** (EXIT) already started: #PID<0.142.0>
Start Call: PoolToy.PoolSup.start_link([name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

00:08:31.465 [error] Child PoolToy.PoolMan of Supervisor #PID<0.295.0> (PoolToy.PoolSup) failed to start
** (exit) already started: #PID<0.142.0>
Start Call: PoolToy.PoolMan.start_link([pool_sup: #PID<0.295.0>, name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: 5000
Type: :worker

00:08:31.465 [error] Child :undefined of Supervisor PoolToy.PoolsSup failed to start
** (exit) shutdown: failed to start child: PoolToy.PoolMan
    ** (EXIT) already started: #PID<0.142.0>
Start Call: PoolToy.PoolSup.start_link([name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

00:08:31.466 [error] Child PoolToy.PoolMan of Supervisor #PID<0.296.0> (PoolToy.PoolSup) failed to start
** (exit) already started: #PID<0.142.0>
Start Call: PoolToy.PoolMan.start_link([pool_sup: #PID<0.296.0>, name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: 5000
Type: :worker

00:08:31.466 [error] Child :undefined of Supervisor PoolToy.PoolsSup failed to start
** (exit) shutdown: failed to start child: PoolToy.PoolMan
    ** (EXIT) already started: #PID<0.142.0>
Start Call: PoolToy.PoolSup.start_link([name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

00:08:31.466 [error] Child :undefined of Supervisor PoolToy.PoolsSup caused shutdown
** (exit) :reached_max_restart_intensity
Start Call: PoolToy.PoolSup.start_link([name: :poolio, worker_spec: Doubler, size: 3])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

00:08:31.466 [error] Child PoolToy.PoolsSup of Supervisor #PID<0.134.0> (Supervisor.Default) terminated
** (exit) shutdown
Pid: #PID<0.137.0>
Start Call: PoolToy.PoolsSup.start_link([])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

00:08:31.466 [info]  Child PoolToy.PoolsSup of Supervisor #PID<0.134.0> (Supervisor.Default) started
Pid: #PID<0.297.0>
Start Call: PoolToy.PoolsSup.start_link([])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

00:08:31.470 [error] GenServer :poolio terminating
** (stop) killed
Last message: {:EXIT, #PID<0.141.0>, :killed}
State: %PoolToy.PoolMan.State{monitors: :monitors_poolio, name: :poolio, pool_sup: #PID<0.141.0>, size: 3, worker_spec: Doubler, worker_sup: #PID<0.144.0>, workers: [#PID<0.145.0>, #PID<0.146.0>, #PID<0.147.0>]}

00:08:31.470 [error] GenServer #PID<0.144.0> terminating
** (stop) killed
Last message: {:EXIT, #PID<0.141.0>, :killed}
State: %DynamicSupervisor{args: [], children: %{#PID<0.145.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.146.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.147.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}}, extra_arguments: [], max_children: :infinity, max_restarts: 3, max_seconds: 5, mod: PoolToy.WorkerSup, name: {#PID<0.144.0>, PoolToy.WorkerSup}, restarts: [], strategy: :one_for_one}

00:08:31.472 [error] Process :poolio (#PID<0.142.0>) terminating
** (exit) killed
    (stdlib) gen_server.erl:432: :gen_server.decode_msg/9
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Initial Call: PoolToy.PoolMan.init/1
Ancestors: [#PID<0.141.0>, PoolToy.PoolsSup, #PID<0.134.0>, #PID<0.133.0>]
Message Queue Length: 3
Messages: [{:EXIT, #PID<0.145.0>, :shutdown}, {:EXIT, #PID<0.146.0>, :shutdown}, {:EXIT, #PID<0.147.0>, :shutdown}]
Links: [#PID<0.144.0>]
Dictionary: []
Trapping Exits: true
Status: :running
Heap Size: 1598
Stack Size: 27
Reductions: 2425

00:08:31.472 [error] Process #PID<0.144.0> terminating
** (exit) killed
    (stdlib) gen_server.erl:432: :gen_server.decode_msg/9
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Initial Call: PoolToy.WorkerSup.init/1
Ancestors: [#PID<0.141.0>, PoolToy.PoolsSup, #PID<0.134.0>, #PID<0.133.0>]
Message Queue Length: 0
Messages: []
Links: [#PID<0.142.0>]
Dictionary: []
Trapping Exits: true
Status: :running
Heap Size: 2586
Stack Size: 27
Reductions: 4995

You can see that after 0.141.0 dies, it tries to start another process named poolio. However, because the currently running poolie process still exists, as it did not receive the linked exit failure from its parent, it fails with already started message. If we attempt to start 3 times before the original poolie process dies, then the supervisor crashes too.

Note this error only happens because the supervisor terminated before it was able to terminate its children. In practice, this error is unlikely to happen, since termination is guaranteed. Except if you kill the supervisor, as we have seen here.

3 Likes

Thank you so much for taking the time to look into this. From your reply, can I conclude that:

  • there is nothing inherently wrong with the app’s supervision tree
  • the race condition takes place because I killed a supervisor directly (which wouldn’t happen “in real life”)

Therefore, the main problem is that by killing the supervisor directly I was creating a situation (that wouldn’t happen in practice) where the supervisor was dead but its children were still (briefly) alive.

In real life, this situation wouldn’t have happened because the supervisor would usually die due to its children (e.g. max restart intensity hit). In this case, the supervisor would have properly terminated all of the children before restarting, preventing the problematic race condition.

Please correct me if my understanding of your reply is incorrect, and thanks again for shedding light on the problem, as well as bringing the logger-sasl-reports option to my attention.

1 Like

Everything is perfect, except for this sentence:

This would not happen in real-life because in real-life shut down of your application happens by sending a :shutdown message to the supervisor. Opposite to :kill, :shutdown is trappable, which means the supervisor can terminate its children properly before exiting.

In other words, the error you are seeing would only happen if you send an :kill signal to the supervisor, but nothing in a system will actually do this. I guess it would also happen if the Supervisor implementation has a bug, but that is even more unlikely.

5 Likes