Supervisor dies with its child?

I have an application with the following setup:

  • There is a central supervisor which is initialized at startup with child specs for three permanent children. It is started with start_link/2 in the application’s start/2.
  • During the run of the application I start a child supervisor by calling start_child/2 on the central supervisor with a spec built via Supervisor.Spec.supervisor/3. This child supervisor is started as :transient.
  • The child supervisor starts its own set of children as :transient with a :one_for_all strategy.

The intent behind that is that each group of workers is tied together and is dismissed after its common task is done. The child supervisor is supposed to restart children on errors only but to let them go gracefully when their job is done. (A “normal” exit.)

The central supervisor has a name (alias MyMod.Supervisor) it was started with. When I query it with Supervisor.which_children(MyMod.Supervisor) it shows me the child supervisor is there.

Now:

  • Whenever I try to use Process.exit/2 to end the child supervisor, nothing happens.
  • When I use Supervisor.terminate_child/2 on the central supervisor with the assigned child ID, it returns :ok, but the next call to Supervisor.which_children(MyMod.Supervisor) tells me there is no such process (the central supervisor died).
  • When I use Supervisor.stop/2 on the child supervisor with its registered name and :normal, I afterwards cannot query the central supervisor either (again, it died).

It seems like no matter what I do, if the child exits, the central supervisor exits too. I’ve played around with the Supervisor.Spec I created it with; it made no difference.

Any ideas?


Can you provide a minimal example project that shows your problem? From your wording alone I can only roughly understand the problem you have; without seeing what you are really doing, one can’t tell what you are doing wrong…


Hmmm… I tried to do that, but my sample project seems to work:

[code]
defmodule Test do
  import Supervisor.Spec

  def start do
    # static children
    children = [
      worker(Worker, [0], restart: :permanent, id: :static1),
      worker(Worker, [1], restart: :permanent, id: :static2),
      worker(Worker, [2], restart: :permanent, id: :static3)
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: Central.Supervisor)

    # dynamically added child supervisor
    Supervisor.start_child(Central.Supervisor,
      supervisor(Worker.Supervisor, [3], restart: :transient, id: :super))

    test()
  end

  def test do
    IO.puts "Central: #{inspect Supervisor.which_children(Central.Supervisor)}"
    IO.puts "Worker: #{inspect Supervisor.which_children(Worker.Supervisor)}"

    # Supervisor.terminate_child(Central.Supervisor, :super)
    Worker.Supervisor.stop()
    IO.puts "Central: #{inspect Supervisor.which_children(Central.Supervisor)}"
  end
end

defmodule Worker.Supervisor do
  use Supervisor

  def start_link(id) do
    Supervisor.start_link(__MODULE__, [id], name: Worker.Supervisor)
  end

  def init([id]) do
    children = [
      worker(Worker, [id], restart: :transient, id: 0),
      worker(Worker, [id + 1], restart: :transient, id: 1),
      worker(Worker, [id + 2], restart: :transient, id: 2)
    ]

    supervise(children, strategy: :one_for_all)
  end

  def stop do
    Supervisor.stop(Worker.Supervisor)
  end
end

defmodule Worker do
  use GenServer

  def start_link(id) do
    GenServer.start_link(__MODULE__, [], name: {:global, {:worker, id}})
  end

  def init(state), do: {:ok, state}
end
[/code]


Now I have SASL reports for my app and the following transpires:

[code]
2016-04-30 11:59:51.496 [ERR/connector.ex:43] Received error when listening on TCP port, reason was :closed
2016-04-30 11:59:51.496 [ERR/SASL] GenServer {:connector, 1} terminating | ** (stop) :error
2016-04-30 11:59:51.498 [ERR/SASL] Process #PID<0.235.0> terminating
    ** (exit) :error
        (stdlib) gen_server.erl:826: :gen_server.terminate/7
        (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
    Initial Call: Connector.init/1
    Ancestors: [#PID<0.233.0>, MyMod.Supervisor, #PID<0.181.0>]
2016-04-30 11:59:51.498 [ERR/SASL] Child {:connection, 1} of Supervisor {:supervisor, 1} shutdown abnormally | ** (exit) :error | Pid: #PID<0.235.0> | Start Call: Connector.start_link(1, #Port<0.7095>)
2016-04-30 11:59:51.498 [INF/SASL] Application myapp exited: normal
[/code]

In my application startup I create a TCP listener port:

{ :ok, listenSocket } = :gen_tcp.listen(readUeTcpPort(), [:binary, active: true, packet: :raw, reuseaddr: true])

I then give the listenSocket to each worker child I start:

def addWorker(workerId, listenSocket) do
  Supervisor.start_child(MyMod.Supervisor,
    supervisor(Worker.Supervisor, [workerId, listenSocket],
      restart: :transient, id: {:worker, workerId}, shutdown: :infinity))
end

These workers run their init/1 and send themselves a GenServer.call that makes them wait on :gen_tcp.accept. The first worker is dynamically created during the run of my application’s start/2.

Whenever a new client is accepted, the worker immediately calls MyMod.addWorker to start the next process waiting on the listen port.
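
For context, here is a rough sketch of such an acceptor worker. The module name and the exact hand-off are assumptions (the post above mentions a GenServer.call; this sketch uses a plain deferred message to self instead), and MyMod.addWorker is the function shown earlier:

[code]
defmodule AcceptorWorker do
  # Hypothetical sketch of the acceptor pattern described above.
  use GenServer

  def start_link(worker_id, listen_socket) do
    GenServer.start_link(__MODULE__, {worker_id, listen_socket})
  end

  def init({worker_id, listen_socket}) do
    # Defer the blocking accept until init/1 has returned.
    send(self(), :accept)
    {:ok, %{id: worker_id, listen_socket: listen_socket}}
  end

  def handle_info(:accept, %{id: id, listen_socket: listen_socket} = state) do
    {:ok, client} = :gen_tcp.accept(listen_socket)
    # As soon as a client connects, start the next acceptor on the shared listen socket.
    MyMod.addWorker(id + 1, listen_socket)
    # Handling of the incoming {:tcp, socket, data} messages is omitted here.
    {:noreply, Map.put(state, :client, client)}
  end
end
[/code]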

Now, when I stop the supervisor of the first group of workers, something odd happens. The second worker suddenly drops out of the :gen_tcp.accept call it is sleeping in with an error. Apparently the listenSocket somehow gets closed, and this somehow leads to my whole application quietly shutting down with exit :normal.

Any ideas?


Quick tip: you should investigate whether this is related to the controlling process: http://erlang.org/doc/man/gen_tcp.html#controlling_process-2


Hello, José.

After rearranging everything under another supervisor I played around with :gen_tcp.controlling_process. After reading its docs another time, and this time also checking its return value, I managed to properly transfer control of the socket to the children and no longer get this weird exit behavior.
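
For anyone finding this thread later: the underlying rule is that a socket is closed when its controlling (owning) process exits, and :gen_tcp.controlling_process/2 reassigns that ownership. Its return value is easy to ignore by accident. A generic sketch, not the exact fix used here (names are placeholders):

[code]
# Open the listen socket, then hand ownership to a process that outlives the
# individual worker groups, so stopping a group's supervisor does not take the
# socket down with it. `tcp_port` and `long_lived_owner_pid` are placeholders.
{:ok, listen_socket} =
  :gen_tcp.listen(tcp_port, [:binary, active: true, packet: :raw, reuseaddr: true])

case :gen_tcp.controlling_process(listen_socket, long_lived_owner_pid) do
  :ok -> :ok
  {:error, reason} -> raise "socket ownership transfer failed: #{inspect(reason)}"
end
[/code]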

Thank you. :smile:


I have a similar problem.

I have a top-level dynamic supervisor, say TopDynSup, whose children are in turn supervisors. The sub-supervisors each supervise a fixed number of temporary workers (they should never be restarted, even if they die abnormally).

My problem is, I want the sub-supervisor to terminate normally once all of its children have terminated. Playing around with Supervisor in iex, I saw that a supervisor never exits, even when it has no children left. How can I make it die?


Have you ever found the solution to this?
I’d like my “nested” supervisor to behave this way too but cannot find any configuration option to force it.

Supervisors are not meant to die; they are intended to run indefinitely, monitoring their children.

I think the solution is to have your own GenServer do this instead of a supervisor:

  • Give this new process whatever info it needs to start the children or at least to track their PIDs.
  • Let it invoke Process.monitor on each PID.
  • Let it handle the DOWN messages and shrink the list accordingly.
  • Terminate with :stop when the list is empty.

In order to make handling DOWN messages easier, I defined a record for it:
require Record
Record.defrecord :downmsg, :"DOWN", [:monitorRef, :type, :pid, :status]

The function heads for handling the DOWN messages in GenServer then look like this:

  • Good case: def handle_info(downmsg(status: :normal, pid: pid), data) do
  • Other cases: def handle_info(downmsg(status: status, pid: pid), data) do

This gives you nice basic supervisor-like functionality. You can do the cleanup of the process itself in handle_info (or any other function) and either return a :noreply with the reduced list of PIDs as the state, or a :stop. A condensed sketch is below.
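
Something along these lines, for example (GroupMonitor and start_link/1 are made-up names; in practice the process would also start, or at least be handed, the worker PIDs):

[code]
defmodule GroupMonitor do
  # Minimal sketch of the approach above: monitor a list of PIDs and stop
  # normally once all of them are gone.
  use GenServer
  require Record

  Record.defrecord :downmsg, :"DOWN", [:monitorRef, :type, :pid, :status]

  def start_link(pids), do: GenServer.start_link(__MODULE__, pids)

  def init(pids) do
    Enum.each(pids, &Process.monitor/1)
    {:ok, pids}
  end

  # Fields not mentioned in the record pattern match anything, so this head
  # handles both normal and abnormal exits; split the heads as shown above if needed.
  def handle_info(downmsg(pid: pid), pids) do
    case List.delete(pids, pid) do
      [] -> {:stop, :normal, []}
      rest -> {:noreply, rest}
    end
  end
end
[/code]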


As @Oliver said, you want a GenServer instance to be the “brains” of your operation: when the supervisor no longer has any children, the GenServer can simply call https://hexdocs.pm/elixir/Supervisor.html#stop/3 with :normal as the reason to shut it down (note that you’ll probably want to configure the supervisor’s own restart option as :transient).

How does the GenServer know when the supervisor doesn’t have any children left? It can monitor each child and track their state with :DOWN messages as described in the post above (and explained step by step in my blog series here), or you can simply have something like

def handle_info(:check_supervisor, %{supervisor: sup} = state) do
  if Supervisor.count_children(sup).specs > 0 do
    Process.send_after(self(), :check_supervisor, 5_000)
    {:noreply, state}
  else
    Supervisor.stop(sup, :normal)
    {:noreply, Map.delete(state, :supervisor)}
  end
end

Interesting approaches.
I actually want to do something when all children exit, so this advice and these code samples are very useful to me.

I kind of want to have a constant number of workers running, so when some of the workers finish, I want to spawn new ones to replace them.

I suppose there are better ways to achieve this but for now I think I’ll try to go with your advice and create a GenServer that will orchestrate this.
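
If it helps, here is an illustrative sketch of such an orchestrator, keeping a fixed number of workers by starting a replacement whenever one goes down. PoolKeeper and start_worker are made-up names; start_worker stands for however your workers are really started (e.g. under a supervisor):

[code]
defmodule PoolKeeper do
  use GenServer

  # Example (hypothetical): PoolKeeper.start_link(5, fn -> Task.start(&do_work/0) end)
  def start_link(size, start_worker) when is_function(start_worker, 0) do
    GenServer.start_link(__MODULE__, {size, start_worker})
  end

  def init({size, start_worker}) do
    Enum.each(1..size, fn _ -> start_and_monitor(start_worker) end)
    {:ok, start_worker}
  end

  def handle_info({:DOWN, _ref, :process, _pid, _reason}, start_worker) do
    # A worker ended (normally or not): top the pool back up.
    start_and_monitor(start_worker)
    {:noreply, start_worker}
  end

  defp start_and_monitor(start_worker) do
    {:ok, pid} = start_worker.()
    Process.monitor(pid)
  end
end
[/code]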

Then I’d suggest reading my blog series: it should require minimal modifications for what you want to do. Of course, if what you really want is a constant pool of workers, you’ll probably want to look at something like poolboy instead.