Mysterious DynamicSupervisor child shutdown

Hi, I have a DynamicSupervisor spawning 20 child processes. They work fine for a few minutes, and then all of them terminate unexpectedly. The supervisor doesn’t restart the processes even though I have the restart option set to :permanent. I’ve trapped exits in the child processes, and the terminate callback is indeed being invoked with reason :shutdown. Nothing in my code should be shutting down the child processes, let alone all 20 of them simultaneously. All other GenServers are untouched, and the parent process keeps running.

What could be causing such weird behaviour, and how can I get the DynamicSupervisor to restart its children?

The application is database-heavy; the workers constantly write to the database, hundreds of inserts per second each. Perhaps this is somehow related? The child processes interact with Ecto directly, with no intermediary processes.

defmodule App.Worker do
  use GenServer, restart: :permanent
  require Logger

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  def start_new_worker() do
    {:ok, pid} =
      DynamicSupervisor.start_child(
        App.WorkerSupervisor,
        {__MODULE__, :no_args}
      )

    pid
  end

  # init/1 and the rest of the worker logic are omitted here
  @impl true
  def init(args), do: {:ok, args}

  @impl true
  def terminate(reason, _state) do
    Logger.info("worker terminating, reason: #{inspect(reason)}")
  end
end

application.ex:

defmodule App.Application do
  use Application

  def start(_type, _args) do
    children = [
      {DynamicSupervisor,
       name: App.WorkerSupervisor, strategy: :one_for_one}
    ]

    opts = [strategy: :one_for_one, name: App.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Thank you!

Can you please provide your dynamic supervisor code?
Why do you have start_new_worker inside the GenServer module (instead of the DynamicSupervisor)?
Where do you trap exits? (It’s not the terminate function that does that.)
It also looks a bit strange not to use a name (via Registry) - there’s a sketch of the last two points after the docs link below.

The docs linked below include an example of migrating from :simple_one_for_one to DynamicSupervisor.

BTW it looks like you are linking your workers together, and if one dies, they all die :slight_smile:

https://hexdocs.pm/elixir/DynamicSupervisor.html
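
A minimal sketch of those two points (trapping exits and naming workers via Registry). App.WorkerRegistry, via/1 and the id argument are made-up names for illustration, and the Registry would also have to be started somewhere in the supervision tree (e.g. {Registry, keys: :unique, name: App.WorkerRegistry}):

defmodule App.Worker do
  use GenServer, restart: :permanent

  # each worker is registered under a unique id in App.WorkerRegistry
  def start_link(id), do: GenServer.start_link(__MODULE__, id, name: via(id))

  defp via(id), do: {:via, Registry, {App.WorkerRegistry, id}}

  @impl true
  def init(id) do
    # exits are trapped here with Process.flag/2;
    # defining terminate/2 alone does not trap exits
    Process.flag(:trap_exit, true)
    {:ok, id}
  end
end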

Do any of the worker processes crash in normal operation? The behavior you’re describing sounds like what I’d expect if some workers were crashing and tripped the max_restarts circuit breaker in DynamicSupervisor - which terminates the supervisor and all of its children. In that situation, App.Supervisor will restart App.WorkerSupervisor but it won’t have any children.
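
For reference, the restart intensity is an option on the DynamicSupervisor itself and defaults to max_restarts: 3 within max_seconds: 5. A sketch of where it would be tuned in the application.ex above (the numbers are placeholders):

children = [
  {DynamicSupervisor,
   name: App.WorkerSupervisor,
   strategy: :one_for_one,
   # allow up to 100 restarts within a 5-second window before the
   # supervisor gives up and takes all of its children down with it
   max_restarts: 100,
   max_seconds: 5}
]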

Thank you, this did the trick! I’ve set max_restarts: 1_000_000_000, and the workers no longer all get shut down. No, none of them crash, but they do stop constantly via {:stop, :normal, state}. It seems the DynamicSupervisor didn’t like the constant normal shutdowns.

IIRC, if you’re shutting down normally you should use restart: :transient in your child spec instead of setting max_restarts really high. (You currently have restart: :permanent, which means even a :normal exit triggers a restart, and every one of those restarts counts toward the max_restarts limit; :permanent is usually for long-lived service processes and the like.)
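
A minimal sketch of that change, assuming nothing else about the worker:

defmodule App.Worker do
  # :transient children are only restarted after an abnormal exit; a worker
  # stopping with {:stop, :normal, state} is simply removed from the
  # supervisor and never counts toward the max_restarts limit
  use GenServer, restart: :transient

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  @impl true
  def init(args), do: {:ok, args}
end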

Also, unless you truly need to respond to messages, such temporary jobs living in their own processes should probably be Tasks supervised by a Task.Supervisor, not GenServers.
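
A rough sketch of that shape. App.TaskSupervisor and do_writes/0 are made-up names, and this assumes each unit of work is fire-and-forget:

# in application.ex, next to the existing children
children = [
  {Task.Supervisor, name: App.TaskSupervisor}
]

# starting one unit of work; the task does its inserts and then exits normally
Task.Supervisor.start_child(App.TaskSupervisor, fn ->
  App.Worker.do_writes()
end)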
