Killing process in supervision tree triggers `{:error, {:already_started, _}}` (updating "toy" code to 1.6)

david_ex · April 20, 2018, 7:31am

Hi,

In my efforts to level up my OTP skills, I’m currently working through https://www.manning.com/books/the-little-elixir-and-otp-guidebook

At one point in the book, a pool library gets implemented. The book uses an older Elixir version (e.g. using Supervisor.Spec functions) and I wanted to try and follow along and update the code to Elixir 1.6 (e.g. using a DynamicSupervisor).

The book’s code can be found at https://github.com/benjamintanweihao/pooly/tree/version-1/lib/pooly (note that this points to the “version-1” branch), and my conversion attempt is at https://github.com/davidsulc/pooly/tree/master/lib

I’m able to run my code, but if I open Observer and kill either the Pooly.Server or Pooly.WorkerSupervisor process, multiple {:error, {:already_started, _}} errors get triggered seemingly because Pooly.Server's start_link function gets called multiple times. I do not understand why that is the case.

How can I fix my code for the supervision tree to work as expected?

Note that at this stage in the book, the project is in an intermediary state (i.e. not the final implementation) so what I’m mainly looking for is the equivalent code to the book’s code linked above, but with Elixir 1.6 features/idioms (I’m also aware that certain options passed in to the supervisors are the same as the default values).

In the interest of future-proofing this thread, I’ll reproduce the relevant code below (as available in the 2nd github link above):

# lib/pooly.ex


defmodule Pooly do
  use Application

  def start(_type, _args) do
    pool_config = [
      worker_sup_mod: Pooly.WorkerSupervisor,
      worker_spec: Pooly.SampleWorker,
      size: 5
    ]
    start_pool(pool_config)
  end

  defdelegate start_pool(pool_config), to: Pooly.Supervisor, as: :start_link
end

# pooly/supervisor.ex

defmodule Pooly.Supervisor do
  use Supervisor

  def start_link(pool_config) do
    IO.puts("Supervisor start_link")
    Supervisor.start_link(__MODULE__, pool_config)
  end

  def init(pool_config) do
    IO.puts("Supervisor init")
    children = [
      {Pooly.Server, [self(), pool_config]}
    ]

    Supervisor.init(children, strategy: :one_for_all)
  end
end

# pooly/server.ex

defmodule Pooly.Server do
  use GenServer

  @name __MODULE__
  @pool_config_keys [:sup, :size, :worker_sup_mod, :worker_spec]

  defmodule State do
    defstruct [:sup, :size, :worker_sup_mod, :worker_sup, :worker_spec, :monitors, workers: []]
  end

  def start_link([_sup, _pool_config] = opts) do
    IO.puts "in Server start_link"
    GenServer.start_link(__MODULE__, opts, name: @name)
  end

  def init([sup, pool_config]) when is_pid(sup) do
    IO.puts "in Server init"
    monitors = :ets.new(:monitors, [:private])
    state = struct(State, pool_config |> sanitize())
    state = %{state | sup: sup, monitors: monitors}

    send(self(), :start_worker_supervisor)
    {:ok, state}
  end

  def handle_info(:start_worker_supervisor, %{sup: sup, worker_sup_mod: module, worker_spec: worker_spec, size: size} = state) do
    IO.puts "start worker supervisor request"
    {:ok, worker_sup} = Supervisor.start_child(sup, {module, [restart: :temporary]})
    workers = prepopulate(size, {worker_sup, worker_spec})
    {:noreply, %{state | worker_sup: worker_sup, workers: workers}}
  end

  def handle_info(msg, state) do
    IO.puts "Server received: #{msg}"
    {:noreply, state}
  end

  defp sanitize(config) do
    config
    |> Enum.filter(fn {k, _} -> Enum.member?(@pool_config_keys, k) end)
  end

  defp prepopulate(size, sup_args), do: prepopulate(size, sup_args, [])

  defp prepopulate(size, _sup_args, workers) when size < 1, do: workers

  defp prepopulate(size, sup_args, workers) do
    prepopulate(size - 1, sup_args, [new_worker(sup_args) | workers])
  end

  defp new_worker({sup, worker_spec}) do
    {:ok, worker} = DynamicSupervisor.start_child(sup, worker_spec)
    worker
  end
end

# pooly/worker_supervisor.ex

defmodule Pooly.WorkerSupervisor do
  use DynamicSupervisor

  @name __MODULE__
  @opts [
    strategy: :one_for_one,
    max_restarts: 5,
    max_seconds: 5
  ]

  def start_link(opts) do
    DynamicSupervisor.start_link(__MODULE__, opts, name: @name)
  end

  def init(opts) do
    DynamicSupervisor.init(@opts ++ opts)
  end
end

# pooly/sample_worker.ex

defmodule Pooly.SampleWorker do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, :ok, [])
  end

  def stop(pid) do
    GenServer.call(pid, :stop)
  end

  def init(args) do
    {:ok, args}
  end

  def handle_call(:stop, _from, state) do
    {:stop, :normal, :ok, state}
  end
end

kokolegorille · April 20, 2018, 1:05pm

There has been a lot of change since the release of the book.

First… now in Elixir, when You create a project with --sup it will create a different structure, mainly pooly.ex is replaced by application, which is the main entry point.

Second, the example are using simple_one_for_one and now the relevant method is to use dynamic supervisor.

Third, the example does not use Registry, it has been added after the book publication.

For your code… restarting server will break, because it tries to restart the dynamic supervisor…

{:ok, worker_sup} = Supervisor.start_child(sup, {module, [restart: :temporary]})

And this will fail, because it is already present.

If You kill the dynamic supervisor, nothing should happen to server, but it breaks all because the Pooly Supervisor is set like this.

Supervisor.init(children, strategy: :one_for_all)

If You kill dynamic supervisor, the supervisor will kill the server, and we know that it is not good. You should change to.

Supervisor.init(children, strategy: :one_for_one)

Doing so allows You to kill dynamic supervisor, without crashing the system. But, because it is normally started by server, it will start without config, so without workers

To fully adapt the example… there will be the need to link dynamic supervisor and workers to the server, to trap exit in the server, to catch exit in the server and to do something in response from it.

I remember the worker_supervisor was set to temporary in the book, so that the restart responsability is given to server, not supervisor. Remember the server is the brain, it stores sup, worker_sup, consumers, workers and keeps track of DOWN and EXIT signals.

To make it resiliant, You need to be able to restart server without bringing down the system, so You need to check if dynamic supervisor is alive before trying to start it. What I did is

worker_sup =
      case start_worker_sup(sup) do
        {:ok, worker_sup} ->
          worker_sup

        {:error, _reason} ->
          Process.whereis(WorkerSup)
      end

In your example, the code is breaking here.

{:ok, worker_sup} = Supervisor.start_child(sup, {module, [restart: :temporary]})

# with error

** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.161.0>}}

I had trouble to adapt this to 1.6 too, if You need more precision, feel free to ask.

BTW I see we are from the same country

david_ex · April 22, 2018, 9:18am

Based on your username, we’re probably both in “romandie” If you feel like some geek talk over a beer, feel free to ping me.

Thanks for the time you put into this answer: it’s been very helpful (including the edits!). I mainly realize that it would be better for me (from a learning point of view) to reimplement the functionality from scratch rather than trying to just update/convert the existing code.

I see now that using the DynamicSupervisor while keeping the book’s :one_for_all strategy might have been misguided.

I hadn’t given any thought to how Registry could be leveraged here. Could you maybe expand on that a bit?

It looks like Registry could be used to track the pid of the supervisor, worker_supervisor, etc. But what about process checkin/checkout? Should that continue using the implementation with ETS? Or is there a better way using Registry?

kokolegorille · April 22, 2018, 12:52pm

Once You get this example working, You can adapt it to many situation. I replaced all simple one for one/gproc with dynamic supervisor/registry. Registry is not so useful for poolboy because each worker is the same as the other. But when You need to locate any of the workers, then registry becomes handy. In my case I spawn workers holding game state. I need to route requests based on uuid, registry allows me to do that.

I do not use registry for any named processes, just the dynamic workers.

In my use-case, I don’t need to track down from consumers. Those consumers are websocket client, and they can disconnect… and reconnect. This would be a bad idea to stop worker when ws close. So I don’t store consumer pid, as poolboy does.

This poolboy is really nice because it is example of complex coordination between multiple elements. Supervisor, DynamicSupervisor, Master GenServer, Workers, Consumers, processes monitor and link, ets table… You need to keep this private ets table for poolboy as the way to manage checkin/checkout. This cannot be replaced by registry.

Unfortunately, code has been changing a lot and it is not relevant anymore. It would be nice to have an updated version. Also the example just describes the poolboy use-case.

J’habite au bout du lac… That would be nice to geek talk about Elixir. I tried once to introduce Elixir to a Ruby group, but I don’t meet Elixir coders a lot (well, not even one yet)

david_ex · April 23, 2018, 6:37am

Having spent more time with the code, I’ve identified some issues and I think I’ll be able to get an updated version working.

One major issue was that starting the worker supervisor Supervisor.start_child(sup, {module, [restart: :temporary]}) is naive: it doesn’t magically make the supervisor :temporary. Instead, I’d have to either specify child_spec/1 to handle the option and return a child spec or simply have use DynamicSupervisor, restart: :temporary in the WorkerSupervisor module.

Once the process is configured to be temporary, you can keep the :one_for_all restart strategy (although in practice it seems to make no difference in this specific case as the worker sup will never get restarted, and killing a temporary process doesn’t trigger sibling restart…). However, you need to link the Server process to the worker sup, and the Server must trap exits.

Another option is the one you proposed: try to start the worker sup and handle the case where it has already been started. I found http://erlang.org/pipermail/erlang-questions/2012-December/071326.html which discusses an essentially identical case.

I’ll try and get the full pool management working and post it somewhere.

Regarding “selling” Elixir/Erlang to other coders, I just came across https://ferd.ca/the-zen-of-erlang.html which could help explain languages features and philosophy. I’ve sent you an email to meet up.