Graceful shutdown of supervised GenServer

I’ve got an app that connects to a number of TCP sockets using Ranch.

The supervision tree looks like this:

|-------|
|  APP  |
|-------|
   |
   ▼
|---------------------|
| Manager(Supervisor) |
|---------------------|
  |        |        |                   
  ▼        ▼        ▼                    
|-------||-------||----------------|              
| Sup 1 || Sup 2 ||     Sup N      |               
|-------||-------||----------------|
                    |            |
                    ▼            ▼
             |------------|  |-------------|
             | TCP Reader |  | Other Child |
             |------------|  |-------------|

For each connection I start a GenServer in a supervisor… The GenServer init/1 calls connect/1.

  def connect(config) do
    case :ranch_tcp.connect(config.ip, config.port, []) do
      {:ok, socket} ->
        {:ok, socket}

      error = {:error, :econnrefused} ->
        Logger.error("Connection refused to #{inspect(config)}. Shutting down Reader")
        error

      error = {:error, _error} ->
        Logger.error(
          "CTC Socket failed to connect to #{inspect(config)}. Shutting down Reader"
        )

        error
    end
  end

What I want to do is let the Supervisor and its children die gracefully if :ranch_tcp.connect/3 returns {:error, :econnrefused}.

The GenServer init looks like this:

  @impl true
  def init(config) do
    case connect(config) do
      {:ok, socket} ->
        # Nice, start reading
        {:ok, %{config: config, socket: socket}}

      {:error, :econnrefused} ->
        # Too bad. Close self and Supervisor gracefully
        exit(:normal)

      {:error, error} ->
        # What the what! Try again X times
        {:stop, error}
    end
  end

Observed behaviour

If the GenServer init/1 does not return {:ok, state}, the error is propagated all the way up to the Application, and it quits.

Starting ctc socket app

12:25:25.694 [error] Connection refused to %Server{ctc_name: :down_tcp_server, ip: {127, 0, 0, 1}, port: 5555, recv_buffer: 0}. Shutting down CtcReader

12:25:25.697 [info]  Application ctc_socket exited: CtcSocket.Application.start(:normal, []) returned an error: shutdown: failed to start child: Supervision.Manager
    ** (EXIT) shutdown: failed to start child: :down_tcp_server_supervisor
        ** (EXIT) shutdown: failed to start child: :down_tcp_server_reader
            ** (EXIT) normal
** (Mix) Could not start application ctc_socket: CtcSocket.Application.start(:normal, []) returned an error: shutdown: failed to start child: Supervision.Manager
    ** (EXIT) shutdown: failed to start child: :down_tcp_server_supervisor
        ** (EXIT) shutdown: failed to start child: :down_tcp_server_reader
            ** (EXIT) normal

You’ll want to return either {:stop, :normal} or :ignore from init/1 so that the Supervisor ignores the child; otherwise it considers the child to be crashing. I think :ignore describes the intent better in this case. I think if you go with {:stop, :normal} you’ll also have to change the GenServer’s restart option to :transient with use GenServer, restart: :transient (meaning the Supervisor only restarts it when it exits with a reason other than :normal, :shutdown, or a {:shutdown, term} tuple).
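To make the :ignore option concrete, here is a minimal sketch. The module name IgnoreSketch.Reader and the config.connect function are made up for illustration; config.connect stands in for the connect/1 from the question so the example runs without Ranch or a real socket.

```elixir
defmodule IgnoreSketch.Reader do
  use GenServer

  def start_link(config), do: GenServer.start_link(__MODULE__, config)

  @impl true
  def init(config) do
    # `config.connect` is a hypothetical zero-arity function standing in
    # for the question's connect/1.
    case config.connect.() do
      {:ok, socket} ->
        {:ok, %{config: config, socket: socket}}

      {:error, :econnrefused} ->
        # The supervisor treats :ignore as a successful start with no process.
        :ignore

      {:error, reason} ->
        # Any other error still fails the start.
        {:stop, reason}
    end
  end
end

refused = %{connect: fn -> {:error, :econnrefused} end}

# The supervisor starts fine even though the reader declined to run.
{:ok, sup} =
  Supervisor.start_link([{IgnoreSketch.Reader, refused}], strategy: :rest_for_one)

# The child spec is kept, but there is no pid behind it.
[{IgnoreSketch.Reader, :undefined, :worker, _modules}] = Supervisor.which_children(sup)
```

Note that a regular Supervisor keeps the child spec of an ignored child, so it can still be started later with Supervisor.restart_child/2.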


You can return {:ok, state} from your init/1 function and then spawn a function that would call Supervisor.stop to stop the parent supervisor.

But that is hacky. I would have it done by the process that starts each Sup1, Sup2, SupN for each connection.
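For what it's worth, the "spawn a function that calls Supervisor.stop" idea could look roughly like the sketch below. The names StopSketch.Reader, StopSketch.Sup and config.connect are invented for the example, and it assumes the supervisor is registered under a name so the child can refer to it.

```elixir
defmodule StopSketch.Reader do
  use GenServer

  def start_link(config), do: GenServer.start_link(__MODULE__, config)

  @impl true
  def init(config) do
    # Defer connecting so init/1 always succeeds and the supervisor can start.
    {:ok, config, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, config) do
    # `config.connect` is a hypothetical stand-in for the real connect/1.
    # Any error other than :econnrefused raises here, i.e. "let it crash".
    case config.connect.() do
      {:ok, socket} ->
        {:noreply, Map.put(config, :socket, socket)}

      {:error, :econnrefused} ->
        # Stop the parent from a separate, unlinked process so this GenServer
        # is not blocked while the supervisor shuts its children down.
        supervisor = config.supervisor
        Task.start(fn -> Supervisor.stop(supervisor) end)
        {:noreply, config}
    end
  end
end

config = %{supervisor: StopSketch.Sup, connect: fn -> {:error, :econnrefused} end}

{:ok, sup} =
  Supervisor.start_link([{StopSketch.Reader, config}],
    strategy: :rest_for_one,
    name: StopSketch.Sup
  )
```

The supervisor starts normally and then shuts itself and its children down with reason :normal shortly afterwards.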

Thanks for the answer.

:ignore will keep the child spec around, so you have the possibility to call Supervisor.restart_child(). That’s not quite what I’m looking for.

What I can’t seem to understand is how to handle the children stopping in the Supervisor.
Should I rather trap exits in the GenServer and then send a message to the Supervisor telling it to stop itself and all its children? That seems to defeat the purpose of the strategy, which I set to :rest_for_one.

I’ve got the {:error, :econnrefused} case working so that it does not crash its supervisor, by using handle_info instead of trying to connect directly in init().

I read that having a GenServer monitor the shutdown of other GenServers is a common pattern.

But it seemed strange to me to have a GenServer to help my Supervisor supervising…

I ended up passing the Supervisor’s pid as an init arg to the GenServers, and if I get {:error, :econnrefused} when trying to connect to the TCP socket, I stop the Supervisor with Supervisor.stop/1:

  @impl true
  def handle_info(:connect, state = %{:config => config, :supervisor => supervisor}) do
    IO.puts("connecting")

    case connect(config) do
      {:ok, socket} ->
        Process.send_after(self(), :read_data, 1)
        new_state = Map.put(state, :socket, socket)
        {:noreply, new_state}

      {:error, :econnrefused} ->
        :ok = Supervisor.stop(supervisor)
        {:noreply, state}
    end
  end

On all other errors I let it crash!

Feel free to suggest better patterns for this.

Isn’t Supervisor.stop a synchronous call? Your call will wait for the supervisor to terminate, but the supervisor will not be able to terminate the child gracefully, since the child is blocked in a receive, waiting for that termination.

The supervisor will wait about 5 seconds and then kill the child.

That is why I suggested doing it from a process that is not the child.

Another solution, if ranch_tcp:connect does not link anything to the caller, would be to call it from the supervisor and then pass the socket to the child if successful, or cancel the supervisor initialization otherwise.

I checked with :observer.start() and it seems to work as expected. The blocking :ranch_tcp.recv() is not called if I get a connection refused.

But I’m willing to try it from a monitoring GenServer, to learn.

How would that be done?
Start the monitoring GenServer from the supervisor and then call spawn_monitor(GenServerToBeMonitored, :some_function, []) from the Supervisor?

I’m a bit hesitant to use spawn_monitor since the docs say:

Typically developers do not use the spawn functions, instead they use abstractions such as Task, GenServer and Agent, built on top of spawn, that spawns processes with more conveniences in terms of introspection and debugging.

It is Supervisor.stop that should be blocking. But as your GenServer is not trapping exits and you are calling it from handle_info, it is fine. I thought you were still in init.

What I was saying after is that you could call ranch_tcp.connect in the supervisor init function, and depending on the result call Supervisor.init or return :ignore : https://hexdocs.pm/elixir/Supervisor.html#c:init/1

Ok, thanks.

Calling ranch_tcp.connect in the Supervisor is an option. I would gain not having to pass the Supervisor pid, but would have to pass the TCP socket. So I guess I don’t win too much by making that change.

Also, I would avoid starting the GenServer and its (1) sibling process just to shut them down. But it’s fewer than 20 TCP connections, so the performance impact is negligible.

Again, thanks a lot for the feedback.

Well, what you gain is that on :econnrefused your supervisor’s init/1 callback just returns :ignore, so:

  • your GenServer child does not have to stop its supervisor. It does not even have to “know” the concept of a supervisor: cleaner code.
  • you can start it without a supervisor from your tests if needed, it will not try to call a supervisor that does not exist, it does not care about supervision.
  • it always receives a successfully connected socket, so it does not have to handle connection errors in its init/handle_info callbacks, just do the work with an actual socket.

Also, if the connect returns an error that is NOT :econnrefused, you can exit(...) from the supervisor and try again, just like you do now. But again, the child does not have to deal with it, it only receives connected sockets.
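A rough sketch of that shape, to make it concrete. The module names and the config.connect function are invented for the example (config.connect stands in for the real :ranch_tcp.connect call), and the "socket" here is just a placeholder atom. One thing to keep in mind with this layout: the supervisor process is the one that opened the socket, and a restarted reader gets handed the same socket again.

```elixir
defmodule InitSketch.Reader do
  use GenServer

  def start_link(socket), do: GenServer.start_link(__MODULE__, socket)

  # The reader only ever sees a connected socket, so there is no
  # connection error handling here.
  @impl true
  def init(socket), do: {:ok, %{socket: socket}}
end

defmodule InitSketch.ConnectionSupervisor do
  use Supervisor

  def start_link(config), do: Supervisor.start_link(__MODULE__, config)

  @impl true
  def init(config) do
    # `config.connect` is a hypothetical stand-in for the real connect call.
    case config.connect.() do
      {:ok, socket} ->
        Supervisor.init([{InitSketch.Reader, socket}], strategy: :rest_for_one)

      {:error, :econnrefused} ->
        # Nothing to supervise: tell the parent to carry on without us.
        :ignore

      {:error, reason} ->
        # Any other error: fail the start, as described above.
        exit(reason)
    end
  end
end

# With a refused connection no supervisor process is started at all.
:ignore =
  InitSketch.ConnectionSupervisor.start_link(%{connect: fn -> {:error, :econnrefused} end})
```

With a successful connect, start_link returns {:ok, pid} and the reader is started with the socket as its init argument.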

Is your manager supervisor a dynamic supervisor? It seems like you would probably want the manager of your TCP reader supervisors to be dynamic, and your TCP Reader and “other child” to be statically supervised. If that’s the case, then the repeated deaths of one level-3 supervisor won’t bring down your level-2 supervisor, and then your app.
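Something like the following is what I have in mind, as a sketch only. The DynSketch module names and the config.connect function are made up, and restart: :temporary on the per-connection supervisor is my assumption about what you would want (no automatic restarts of a connection that went away).

```elixir
defmodule DynSketch.Reader do
  use GenServer

  def start_link(config), do: GenServer.start_link(__MODULE__, config)

  @impl true
  def init(config) do
    # `config.connect` is a hypothetical stand-in for the real connect/1.
    case config.connect.() do
      {:ok, socket} -> {:ok, %{socket: socket}}
      {:error, reason} -> {:stop, reason}
    end
  end
end

defmodule DynSketch.ConnectionSupervisor do
  # Assumption: a connection supervisor that dies is not restarted.
  use Supervisor, restart: :temporary

  def start_link(config), do: Supervisor.start_link(__MODULE__, config)

  @impl true
  def init(config) do
    Supervisor.init([{DynSketch.Reader, config}], strategy: :rest_for_one)
  end
end

{:ok, manager} = DynamicSupervisor.start_link(strategy: :one_for_one)

hosts = [
  %{connect: fn -> {:ok, :fake_socket} end},
  %{connect: fn -> {:error, :econnrefused} end}
]

# A failed start comes back as an error tuple to the caller instead of
# aborting the manager's own startup.
results =
  Enum.map(hosts, fn host ->
    DynamicSupervisor.start_child(manager, {DynSketch.ConnectionSupervisor, host})
  end)
```

The host list can still be fixed and known up front; the only difference is that the connection supervisors are started one by one after the manager is already running.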

Also consider using Connection: https://hexdocs.pm/connection/Connection.html

is your manager supervisor a dynamic supervisor?

It is static since I know the TCP hosts and they won’t change dynamically.

probably would want your TCP reader supervisor to be dynamic.

I don’t see the need for that since it is a fixed set of TCP host/port I’m connecting to.

also consider using Connection

Seems like it’s doing exactly what I want. Shame it doesn’t come up when searching for “tcp” on hex.pm!
I’ll look into Connection.