For context, I’m implementing a basic poolboy clone using Elixir 1.6’s DynamicSupervisor and Registry. (I’ve written about my trials and tribulations on my blog if you’d like more context.)
Using the code here, let’s show an example of my issue in IEx (started with plain iex -S mix):
iex(1)> PoolToy.start_pool(name: :poolio, worker_spec: Doubler, size: 3)
:ok
This will result in the following supervision tree:
Here, we have:
- PoolsSup, a DynamicSupervisor (pool_toy_debug/lib/pool_toy/pools_sup.ex at master · davidsulc/pool_toy_debug · GitHub)
  - pid 171: a one_for_all supervisor (pool_toy_debug/lib/pool_toy/pool_sup.ex at master · davidsulc/pool_toy_debug · GitHub)
    - poolio, a GenServer (pool_toy_debug/lib/pool_toy/pool_man.ex at master · davidsulc/pool_toy_debug · GitHub)
    - pid 174, which is a DynamicSupervisor (pool_toy_debug/lib/pool_toy/worker_sup.ex at master · davidsulc/pool_toy_debug · GitHub) overseeing a bunch of GenServer workers (pool_toy_debug/lib/doubler.ex at master · davidsulc/pool_toy_debug · GitHub)
Now, for my mystery: killing pid 171 in the Observer yields the following in the console:
07:39:49.704 [error] GenServer #PID<0.174.0> terminating
** (stop) killed
Last message: {:EXIT, #PID<0.171.0>, :killed}
State: %DynamicSupervisor{args: [], children: %{#PID<0.175.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.176.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.177.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}}, extra_arguments: [], max_children: :infinity, max_restarts: 3, max_seconds: 5, mod: PoolToy.WorkerSup, name: {#PID<0.174.0>, PoolToy.WorkerSup}, restarts: [], strategy: :one_for_one}
07:39:49.705 [error] GenServer :poolio terminating
** (stop) killed
Last message: {:EXIT, #PID<0.171.0>, :killed}
State: %PoolToy.PoolMan.State{monitors: :monitors_poolio, name: :poolio, pool_sup: #PID<0.171.0>, size: 3, worker_spec: Doubler, worker_sup: #PID<0.174.0>, workers: [#PID<0.175.0>, #PID<0.176.0>, #PID<0.177.0>]}
Importantly, the PoolsSup dynamic supervisor doesn’t start a new instance of PoolToy.PoolSup to replace the one that was killed.
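For comparison, here is a minimal, self-contained sketch of the behaviour I expected, independent of the PoolToy code. Sketch.Child is a hypothetical stand-in (not part of my project) for a child that uses the default :permanent restart option; the 100 ms sleep is an arbitrary grace period I picked for illustration. It kills a child of a DynamicSupervisor with reason :kill (which matches the :killed reason in the logs above) and then lists the supervisor's children without going through Observer:

```elixir
# Hypothetical stand-in for a pool: a GenServer that does nothing.
defmodule Sketch.Child do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end

# An anonymous DynamicSupervisor with a single child.
# `use GenServer` gives Sketch.Child a child spec with restart: :permanent.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, child} = DynamicSupervisor.start_child(sup, Sketch.Child)

# Brutally kill the child and wait until it is really gone.
ref = Process.monitor(child)
Process.exit(child, :kill)

receive do
  {:DOWN, ^ref, :process, ^child, :killed} -> :ok
end

# Give the supervisor a moment to handle the exit, then list its children.
# With a :permanent child, I would expect one entry with a new pid here.
Process.sleep(100)
children = DynamicSupervisor.which_children(sup)
IO.inspect(children, label: "children after kill")
```

In this simplified setup the killed child has no named descendants of its own, which is one difference from my actual supervision tree.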
Now for chapter 2 of the mystery: after starting a new pool instance with PoolToy.start_pool(name: :poolio, worker_spec: Doubler, size: 3) (in the same IEx session), if I once again kill the PoolToy.PoolSup instance (which isn’t named: it is displayed with only a pid in Observer) directly descending from the named PoolsSup process, I get the following:
07:46:45.987 [error] GenServer :poolio terminating
** (stop) killed
Last message: {:EXIT, #PID<0.175.0>, :killed}
State: %PoolToy.PoolMan.State{monitors: :monitors_poolio, name: :poolio, pool_sup: #PID<0.175.0>, size: 3, worker_spec: Doubler, worker_sup: #PID<0.177.0>, workers: [#PID<0.178.0>, #PID<0.179.0>, #PID<0.181.0>]}
07:46:45.990 [error] GenServer #PID<0.177.0> terminating
** (stop) killed
Last message: {:EXIT, #PID<0.175.0>, :killed}
State: %DynamicSupervisor{args: [], children: %{#PID<0.178.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.179.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}, #PID<0.181.0> => {{Doubler, :start_link, :undefined}, :temporary, 5000, :worker, [Doubler]}}, extra_arguments: [], max_children: :infinity, max_restarts: 3, max_seconds: 5, mod: PoolToy.WorkerSup, name: {#PID<0.177.0>, PoolToy.WorkerSup}, restarts: [], strategy: :one_for_one}
Here, pid 175 is the pool supervisor I killed and pid 177 is the worker supervisor it had as one of its children.
But this time, a new pool supervisor (usually) gets started by the pools supervisor. (It appears that sometimes a new pool supervisor does NOT get started, even the second time around.)
There’s clearly something fundamental about supervision trees that I’m completely missing, as I have no idea what’s going on, besides the fact that it seems to be some sort of race condition. I don’t understand why this is happening: the worker supervisor is :temporary, so only :poolio should get restarted by the parent supervisor (:poolio will then start a new worker supervisor instance). In addition, I’ve tried making the PoolSup one_for_one, but that seems to make no difference.
So where’s my mistake? And of course, how can I fix it?