Task.Supervisor stops after one child fails all of its restarts

I have a Task.Supervisor configured in a Phoenix application; it is set to restart a failing child up to two times:

lib/nanoindie/application.ex

# children configuration on start
    children = [
      supervisor(Nanoindie.Repo, []),
      supervisor(NanoindieWeb.Endpoint, []),
      supervisor(Task.Supervisor, [[name: Nanoindie.TaskSupervisor, max_restarts: 2]]) # task supervisor
    ]
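
(As an aside: on Elixir 1.5+ the Supervisor.Spec helpers like supervisor/2 are deprecated. Assuming recent Ecto/Phoenix versions that define child_spec/1, a rough modern equivalent of the same configuration would be:)

    children = [
      Nanoindie.Repo,
      NanoindieWeb.Endpoint,
      # the keyword options are passed through to Task.Supervisor.start_link/1
      {Task.Supervisor, name: Nanoindie.TaskSupervisor, max_restarts: 2}
    ]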

I add children dynamically through Task.Supervisor.start_child; what I do here is retrieve some info from blogs via web crawling:

lib/nanoindie/jobs/blogs.exs

Enum.map(Repo.all(Blog), fn blog ->
  Task.Supervisor.start_child(Nanoindie.TaskSupervisor, fn ->
    # web crawling happens here
  end, restart: :transient)
end)

When I execute this and one of the children fails, it gets restarted normally, but once that child has failed all of its restarts, the rest of the processes no longer run.

If I use restart: :temporary instead of restart: :transient, it works fine: all processes run normally (just without restarts).

Is this the right behavior? I was reading through DynamicSupervisor's code and it seems that the dynamic supervisor stops if one of its children never restarts successfully. I might be wrong though; I'm still learning Elixir and OTP :slight_smile:
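
To make the failure concrete, here's a minimal sketch of what I'm seeing; it can be pasted into iex (the supervisor here is made up for the example, not my real one):

Process.flag(:trap_exit, true)
{:ok, sup} = Task.Supervisor.start_link(max_restarts: 2, max_seconds: 5)

# one transient task that always crashes, one that just sleeps
Task.Supervisor.start_child(sup, fn -> raise "boom" end, restart: :transient)
Task.Supervisor.start_child(sup, fn -> Process.sleep(:infinity) end, restart: :transient)

# once the crashing task exceeds max_restarts within max_seconds, the
# supervisor itself exits and takes the sleeping task down with it
receive do
  {:EXIT, ^sup, reason} -> IO.inspect(reason) # => :shutdown
end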

I am struggling with a similar issue.

I have a DynamicSupervisor and I’d like to use restart: :transient for the child specs, so the supervisor can restart them in case they fail.

However, I didn't expect the supervisor itself to be restarted when max_restarts for a child is reached. It seems to be the expected default behaviour, per the docs:

https://hexdocs.pm/elixir/Supervisor.html#module-exit-reasons-and-restarts

"Notice that supervisor that reached maximum restart intensity will exit with :shutdown reason."

My question is: is there a way to prevent the shutdown of the supervisor and just kill the child that has reached max_restarts? Hope someone can shed some light on the issue!

I recently saw Task.Supervisor suggested as a simple alternative to a Sidekiq-style job queue in mistakes-rails-developers-make-in-phoenix-pt-1-background-jobs.

Given that a single failing job will take out all other in-progress jobs after enough failures, I'm wondering if this is actually an anti-pattern?

The solution for me was to put a Supervisor in front of each job: we ended up with one DynamicSupervisor that starts Supervisors with restart: :temporary, and each of those Supervisors manages its GenServer job process with restart: :transient. A failing job exits and brings down its Supervisor, which is simply discarded by the DynamicSupervisor above, leaving all other running jobs and their supervisors intact.
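
A rough sketch of the shape, with hypothetical module names (MyApp.DynamicSupervisor is assumed to be started elsewhere in the tree):

defmodule MyApp.JobRunner do
  use GenServer, restart: :transient

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg) do
    send(self(), :run)
    {:ok, arg}
  end

  @impl true
  def handle_info(:run, arg) do
    # the actual work goes here; a crash triggers a :transient restart
    {:stop, :normal, arg}
  end
end

defmodule MyApp.JobSupervisor do
  # :temporary, so the DynamicSupervisor never tries to restart it
  use Supervisor, restart: :temporary

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg)

  @impl true
  def init(arg) do
    Supervisor.init([{MyApp.JobRunner, arg}], strategy: :one_for_one)
  end
end

# start one isolated job; if MyApp.JobRunner exhausts this supervisor's
# restart intensity, only this subtree exits with :shutdown
DynamicSupervisor.start_child(MyApp.DynamicSupervisor, {MyApp.JobSupervisor, :some_arg})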
