Trying to understand how restarting child processes works

I’m making a test application that consumes server-sent events in order to understand supervisors and workers. The source code is here: https://github.com/andrewyang96/ElixirEventSourceTest

I’m having problems with two specific cases:

  1. If the server is not running and I run my Elixir program, it crashes with this error message:
** (Mix) Could not start application es_client: EsClient.start(:normal, []) returned an error: shutdown: failed to start child: EsClient.SSE
    ** (EXIT) an exception was raised:
        ** (MatchError) no match of right hand side value: {:error, {:shutdown, {:failed_to_start_child, EventsourceEx, {%HTTPoison.Error{id: nil, reason: :econnrefused}, [{HTTPoison, :request!, 5, [file: 'lib/httpoison.ex', line: 66]}, {EventsourceEx, :init, 1, [file: 'lib/eventsource_ex.ex', line: 18]}, {:gen_server, :init_it, 2, [file: 'gen_server.erl', line: 365]}, {:gen_server, :init_it, 6, [file: 'gen_server.erl', line: 333]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}}}}
            (es_client) lib/es_client.ex:31: EsClient.SSE.init/1
            (stdlib) supervisor.erl:294: :supervisor.init/1
            (stdlib) gen_server.erl:365: :gen_server.init_it/2
            (stdlib) gen_server.erl:333: :gen_server.init_it/6
            (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

I want my program to keep retrying the connection instead.

  2. If I stop the server when the program is running, I get this warning:
18:51:25.544 [error] GenServer #PID<0.176.0> terminating
** (stop) :connection_terminated
Last message: %HTTPoison.AsyncEnd{id: #Reference<0.1336705182.1334575110.79508>}
State: %{message: %EventsourceEx.Message{data: nil, dispatch_ts: nil, event: "message", id: nil}, parent: #PID<0.174.0>, prev_chunk: ""}

The program hangs after that even if I restart the server. I expected that the program would continue consuming events after the connection is restored.

Any help is appreciated. I’m pretty new to Elixir’s concurrency model and I’m not sure how to debug this myself.

Hey @andrewyang96

Part of your issue is that the process you’ve got isn’t a valid OTP process. You’re starting the process as a supervisor and doing use Supervisor, but the logic you’re doing is that of a worker. You’re also running your own receive loop instead of letting the genserver loop manage stuff.

I’d start by checking out some of the intro genserver guides and adapting your code to be a genserver in the supervisor tree. That’ll get you the restart semantics you expect. I’m also not entirely sure what’s up with starting the eventsource_ex project in the EsClient module. If it needs to be started by you, that’d go in your top level supervisor too.
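To make that concrete, here is a minimal sketch of a worker written as a GenServer and placed under a supervisor. The module names (EventWorker, EventSupervisor) and the {:event, payload} message shape are made up for illustration; they are not what eventsource_ex actually sends. The point is only that incoming messages arrive in handle_info/2 instead of a hand-written receive loop.

```elixir
defmodule EventWorker do
  use GenServer

  # Client API. The registered name is only here to keep the sketch short.
  def start_link(_arg), do: GenServer.start_link(__MODULE__, [], name: __MODULE__)
  def events, do: GenServer.call(__MODULE__, :events)

  @impl true
  def init(events), do: {:ok, events}

  # The GenServer loop delivers plain messages here; no manual receive needed.
  # {:event, payload} is a hypothetical message shape for this sketch.
  @impl true
  def handle_info({:event, payload}, events), do: {:noreply, [payload | events]}
  def handle_info(_other, events), do: {:noreply, events}

  @impl true
  def handle_call(:events, _from, events), do: {:reply, Enum.reverse(events), events}
end

defmodule EventSupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # The worker is a child of the supervisor, so it is restarted if it crashes.
    Supervisor.init([EventWorker], strategy: :one_for_one)
  end
end
```

If the worker crashes after it has started, the supervisor starts a fresh one with empty state, which is the restart behaviour the original question was after.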


Are there any example projects that you can point me to? It’s hard trying to juggle how Elixir handles applications, supervisors, and genservers.

This part of the error:

{:error, {:shutdown, {:failed_to_start_child, EventsourceEx, {%HTTPoison.Error{id: nil, reason: :econnrefused}, 
[{HTTPoison, :request!, 5, [file: 'lib/httpoison.ex', line: 66]},
 {EventsourceEx, :init, 1, [file: 'lib/eventsource_ex.ex', line: 18]}, 
 {:gen_server, :init_it, 2, [file: 'gen_server.erl', line: 365]}, 
 {:gen_server, :init_it, 6, [file: 'gen_server.erl', line: 333]}, 
 {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]} ...

points to this line:

eventsource_ex/lib/eventsource_ex.ex at master · cwc/eventsource_ex · GitHub

As far as I’m aware, trying to establish a connection during process initialization is not a recommended practice (see “As bad as anything else” in It’s About the Guarantees).

A supervisor will not retry if a child fails during initialization while the supervisor itself is starting; the supervisor simply fails to start. So that child process should only really try to (re)connect once it has gotten past initialization.
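One way to do that is to have init/1 return right away and let the process send itself a message to connect, retrying on a timer when the attempt fails. This is only a sketch under assumptions: RetryingClient, the :connect_fun option (a zero-arity function returning {:ok, conn} or {:error, reason}) and the :retry_ms option are hypothetical names, standing in for whatever actually opens the connection.

```elixir
defmodule RetryingClient do
  use GenServer

  # Client API
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
  def connected?(pid), do: GenServer.call(pid, :connected?)

  @impl true
  def init(opts) do
    # Hypothetical options: :connect_fun does the real connecting,
    # :retry_ms is the delay between attempts.
    state = %{
      connect_fun: Keyword.fetch!(opts, :connect_fun),
      retry_ms: Keyword.get(opts, :retry_ms, 1_000),
      conn: nil
    }

    # Do not connect here. Queue a message and return, so start_link
    # succeeds even when the server is down.
    send(self(), :connect)
    {:ok, state}
  end

  @impl true
  def handle_info(:connect, state) do
    case state.connect_fun.() do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn}}

      {:error, _reason} ->
        # Try again later instead of crashing.
        Process.send_after(self(), :connect, state.retry_ms)
        {:noreply, state}
    end
  end

  @impl true
  def handle_call(:connected?, _from, state) do
    {:reply, state.conn != nil, state}
  end
end
```

Under a supervisor this could be listed as a child like {RetryingClient, connect_fun: some_fun}. The same :connect message could also be re-sent when the process notices that the connection has dropped, which would cover the second case in the question.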

A demonstration script:

defmodule Worker do
  use GenServer

  def init([:fail]) do
    IO.puts("Worker: init fail")
    {:stop, :fail}
  end
  def init([:shutup]) do
    IO.puts("Worker: init shutup")
    Process.send_after(self(), :shutup, 200)
    {:ok, nil}
  end
  def init(args) do
    # trap exit for terminate/2 to get
    # called when supervisor terminates
    Process.flag(:trap_exit, true)
    IO.puts("Worker: init #{inspect args}")
    {:ok, nil}
  end

  def handle_info(:shutup, state) do
    {:stop, :bye, state}
  end

  def terminate(reason, _state) do
    IO.puts("Worker: Bye #{inspect reason} #{inspect self()}")
  end

  # client functions
  def start_link(args) do
    GenServer.start_link(Worker, args)
  end

end

defmodule Super do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(Super, arg, [name: Super])
  end

  def init(arg) do
    children = [
      {Worker, arg}
    ]

    Supervisor.init(children, [strategy: :one_for_one, max_restarts: 3, max_seconds: 5])
  end
end

defmodule Demo do

  def run(arg) do
    IO.puts("Demo: self() #{inspect self()}")
    Process.flag(:trap_exit, true)
    route_demo(Super.start_link(arg))
  end

  def route_demo({:ok, pid}) do
    IO.puts("Demo: Supervisor #{inspect pid}")

    ref = Process.monitor(pid)
    IO.puts("Demo: Monitor #{inspect ref}")

    timer_ref = Process.send_after(self(), :done, 2000)
    IO.puts("Demo: Timer #{inspect timer_ref}")
    receive do
      {:EXIT, _, _ } = msg ->
        IO.puts "Demo: EXIT msg #{inspect msg}"
        Process.demonitor(ref, [:flush])
      :done ->
        IO.puts "Demo: done - terminating supervisor"
        Process.exit(pid, :shutdown)
    end
    finish_demo()
  end
  def route_demo(result) do
    IO.puts("Demo: Supervisor start failed: #{inspect result}")
  end

  def finish_demo() do
    receive do
      :done ->
        IO.puts "Demo: done"
        finish_demo()
      {:EXIT, _, _ } = msg ->
        IO.puts "Demo: EXIT msg #{inspect msg}"
        finish_demo()
      {:DOWN, _, _ , _, _} = msg ->
        IO.puts "Demo: DOWN msg #{inspect msg}"
        finish_demo()
    after
      200 ->
        :ok
    end
  end

end

arg =
  case System.get_env("DEMO") do
    "FAIL" ->
      [:fail]
    "SHUTUP" ->
      [:shutup]
    _ ->
      []
  end

Demo.run(arg)

$ elixir demo.exs
Demo: self() #PID<0.73.0>
Worker: init []
Demo: Supervisor #PID<0.82.0>
Demo: Monitor #Reference<0.3362617656.312475653.145460>
Demo: Timer #Reference<0.3362617656.312475653.145465>
Demo: done - terminating supervisor
Worker: Bye :shutdown #PID<0.83.0>
Demo: EXIT msg {:EXIT, #PID<0.82.0>, :shutdown}
Demo: DOWN msg {:DOWN, #Reference<0.3362617656.312475653.145460>, :process, #PID<0.82.0>, :shutdown}
$ export DEMO=FAIL
$ elixir demo.exs
Demo: self() #PID<0.73.0>
Worker: init fail
Demo: Supervisor start failed: {:error, {:shutdown, {:failed_to_start_child, Worker, :fail}}}
$ export DEMO=SHUTUP
$ elixir demo.exs
Demo: self() #PID<0.73.0>
Worker: init shutup
Demo: Supervisor #PID<0.82.0>
Demo: Monitor #Reference<0.2504950352.3265003528.42022>
Demo: Timer #Reference<0.2504950352.3265003528.42027>
Worker: Bye :bye #PID<0.83.0>
Worker: init shutup

01:21:05.785 [error] GenServer #PID<0.83.0> terminating
** (stop) :bye
Last message: :shutup
State: nil
Worker: Bye :bye #PID<0.84.0>
Worker: init shutup

01:21:05.983 [error] GenServer #PID<0.84.0> terminating
** (stop) :bye
Last message: :shutup
State: nil
Worker: Bye :bye #PID<0.85.0>
Worker: init shutup

01:21:06.184 [error] GenServer #PID<0.85.0> terminating
** (stop) :bye
Last message: :shutup
State: nil
Worker: Bye :bye #PID<0.86.0>
Demo: EXIT msg {:EXIT, #PID<0.82.0>, :shutdown}

01:21:06.386 [error] GenServer #PID<0.86.0> terminating
** (stop) :bye
Last message: :shutup
State: nil
$ 
  • arg = [] The script successfully starts the supervisor, which starts Worker. After 2000 ms the script receives the :done message and shuts the supervisor down with Process.exit(pid, :shutdown). The script lingers to catch the :EXIT and :DOWN messages from the supervisor.
  • arg = [:fail] This argument will cause the Worker to fail during the initialization phase. The supervisor only attempts initialization of the child one time - and it fails. As a result the supervisor fails immediately.
  • arg = [:shutup] This argument lets the child get past the initialization phase but causes it to terminate 200 ms later. The supervisor restarts the child three times (max_restarts: 3 within max_seconds: 5, which are also the defaults). When the third restarted child terminates as well, the restart limit is exceeded and the supervisor itself shuts down.

Introduction to Mix goes through

Once you’ve followed along with that, you should have a base project that lets you experiment with the variations described in GenServer, Supervisor and Application.


That is a very thorough answer! I’ll second the recommendation to check out the https://elixir-lang.org/getting-started/mix-otp/introduction-to-mix.html guide; it’ll get you started on some of the basics there.


I will take a look at those. Thank you!