Supervision problem?

Hi,

I just found a very strange and mysterious error here. I initially thought it was me doing something the wrong way, but I managed to strip it down to a bare, out of the box Phoenix app. Here’s how to reproduce it:

mix phx.hew test --no-ecto
cd test/
iex -S mix

Then start Observer and kill the TestWeb.Endpoint process. What you will see is the whole application is exiting. The reason is that the top level supervisor cannot restart the Endpoint. Well, that’s not quite true, because if you trace the top level supervisor, it seems that the restart succeeded, still the supervisor thinks it failed:

 iex> :sys.trace Process.whereis(Test.Supervisor), true

Here’s the full output.

*DBG* 'Elixir.Test.Supervisor' got {'EXIT',<0.235.0>,killed}
*DBG* 'Elixir.Test.Supervisor' new state {state,
                                          {local,'Elixir.Test.Supervisor'},
                                          one_for_one,
                                          {['Elixir.TestWeb.Endpoint'],
                                           #{'Elixir.TestWeb.Endpoint' =>
                                              {child,
                                               {restarting,<0.235.0>},
                                               'Elixir.TestWeb.Endpoint',
                                               {'Elixir.TestWeb.Endpoint',
                                                start_link,[]},
                                               permanent,infinity,supervisor,
                                               ['Elixir.TestWeb.Endpoint']}}},
                                          undefined,3,5, 
                                          [-576460567],
                                          0,'Elixir.Supervisor.Default',
                                          {ok, 
                                           {#{intensity => 3,period => 5,
                                              strategy => one_for_one},
                                            [{'Elixir.TestWeb.Endpoint',
                                              {'Elixir.TestWeb.Endpoint',
                                               start_link,[]},
                                              permanent,infinity,supervisor,
                                              ['Elixir.TestWeb.Endpoint']}]}}}
*DBG* 'Elixir.Test.Supervisor' got {'EXIT',<0.336.0>,
                                    {shutdown,
                                     {failed_to_start_child,
                                      'Elixir.Phoenix.PubSub.PG2',
                                      {already_started,<0.237.0>}}}}
*DBG* 'Elixir.Test.Supervisor' new state {state,
                                          {local,'Elixir.Test.Supervisor'},
                                          one_for_one,
                                          {['Elixir.TestWeb.Endpoint'],
                                           #{'Elixir.TestWeb.Endpoint' =>
                                              {child,
                                               {restarting,<0.235.0>},
                                               'Elixir.TestWeb.Endpoint',
                                               {'Elixir.TestWeb.Endpoint',
                                                start_link,[]},
                                               permanent,infinity,supervisor,
                                               ['Elixir.TestWeb.Endpoint']}}},
                                          undefined,3,5, 
                                          [-576460567],
                                          0,'Elixir.Supervisor.Default',
                                          {ok, 
                                           {#{intensity => 3,period => 5,
                                              strategy => one_for_one},
                                            [{'Elixir.TestWeb.Endpoint',
                                              {'Elixir.TestWeb.Endpoint',
                                               start_link,[]},
                                              permanent,infinity,supervisor,
                                              ['Elixir.TestWeb.Endpoint']}]}}}
*DBG* 'Elixir.Test.Supervisor' got cast {try_again_restart,
                                         'Elixir.TestWeb.Endpoint'}
*DBG* 'Elixir.Test.Supervisor' new state {state,
                                          {local,'Elixir.Test.Supervisor'},
                                          one_for_one,
                                          {['Elixir.TestWeb.Endpoint'],
                                           #{'Elixir.TestWeb.Endpoint' =>
                                              {child,
                                               {restarting,<0.235.0>},
                                               'Elixir.TestWeb.Endpoint',
                                               {'Elixir.TestWeb.Endpoint',
                                                start_link,[]},
                                               permanent,infinity,supervisor,
                                               ['Elixir.TestWeb.Endpoint']}}},
                                          undefined,3,5, 
                                          [-576460567,-576460567],
                                          0,'Elixir.Supervisor.Default',
                                          {ok,
                                           {#{intensity => 3,period => 5,
                                              strategy => one_for_one},
                                            [{'Elixir.TestWeb.Endpoint',
                                              {'Elixir.TestWeb.Endpoint',
                                               start_link,[]},
                                              permanent,infinity,supervisor,
                                              ['Elixir.TestWeb.Endpoint']}]}}}
*DBG* 'Elixir.Test.Supervisor' got {'EXIT',<0.337.0>,
                                    {shutdown,
                                     {failed_to_start_child,
                                      'Elixir.Phoenix.PubSub.PG2',
                                      {already_started,<0.237.0>}}}}
*DBG* 'Elixir.Test.Supervisor' new state {state,
                                          {local,'Elixir.Test.Supervisor'},
                                          one_for_one,
                                          {['Elixir.TestWeb.Endpoint'],
                                           #{'Elixir.TestWeb.Endpoint' =>
                                              {child,
                                               {restarting,<0.235.0>},
                                               'Elixir.TestWeb.Endpoint',
                                               {'Elixir.TestWeb.Endpoint',
                                                start_link,[]},
                                               permanent,infinity,supervisor,
                                               ['Elixir.TestWeb.Endpoint']}}},
                                          undefined,3,5, 
                                          [-576460567,-576460567],
                                          0,'Elixir.Supervisor.Default',
                                          {ok,
                                           {#{intensity => 3,period => 5,
                                              strategy => one_for_one},
                                            [{'Elixir.TestWeb.Endpoint',
                                              {'Elixir.TestWeb.Endpoint',
                                               start_link,[]},
                                              permanent,infinity,supervisor,
                                              ['Elixir.TestWeb.Endpoint']}]}}}
*DBG* 'Elixir.Test.Supervisor' got cast {try_again_restart,
                                         'Elixir.TestWeb.Endpoint'}
*DBG* 'Elixir.Test.Supervisor' new state {state,
                                          {local,'Elixir.Test.Supervisor'},
                                          one_for_one,
                                          {['Elixir.TestWeb.Endpoint'],
                                           #{'Elixir.TestWeb.Endpoint' =>
                                              {child,
                                               {restarting,<0.235.0>},
                                               'Elixir.TestWeb.Endpoint',
                                               {'Elixir.TestWeb.Endpoint',
                                                start_link,[]},
                                               permanent,infinity,supervisor,
                                               ['Elixir.TestWeb.Endpoint']}}},
                                          undefined,3,5, 
                                          [-576460567,-576460567,-576460567],
                                          0,'Elixir.Supervisor.Default',
                                          {ok,
                                           {#{intensity => 3,period => 5,
                                              strategy => one_for_one},
                                            [{'Elixir.TestWeb.Endpoint',
                                              {'Elixir.TestWeb.Endpoint',
                                               start_link,[]},
                                              permanent,infinity,supervisor,
                                              ['Elixir.TestWeb.Endpoint']}]}}}
*DBG* 'Elixir.Test.Supervisor' got {'EXIT',<0.338.0>,
                                    {shutdown,
                                     {failed_to_start_child,
                                      'Elixir.Phoenix.PubSub.PG2',
                                      {already_started,<0.237.0>}}}}
*DBG* 'Elixir.Test.Supervisor' new state {state,
                                          {local,'Elixir.Test.Supervisor'},
                                          one_for_one,
                                          {['Elixir.TestWeb.Endpoint'],
                                           #{'Elixir.TestWeb.Endpoint' =>
                                              {child,
                                               {restarting,<0.235.0>},
                                               'Elixir.TestWeb.Endpoint',
                                               {'Elixir.TestWeb.Endpoint',
                                                start_link,[]},
                                               permanent,infinity,supervisor,
                                               ['Elixir.TestWeb.Endpoint']}}},
                                          undefined,3,5, 
                                          [-576460567,-576460567,-576460567],
                                          0,'Elixir.Supervisor.Default',
                                          {ok,
                                           {#{intensity => 3,period => 5,
                                              strategy => one_for_one},
                                            [{'Elixir.TestWeb.Endpoint',
                                              {'Elixir.TestWeb.Endpoint',
                                               start_link,[]},
                                              permanent,infinity,supervisor,
                                              ['Elixir.TestWeb.Endpoint']}]}}}
*DBG* 'Elixir.Test.Supervisor' got cast {try_again_restart,
                                         'Elixir.TestWeb.Endpoint'}
[info] Application test exited: shutdown

Seems like Phoenix.PubSub.PG2 is causing the problem. It cannot be started, because it is already running. What’s wrong here? I believe the app should have survived a process crash like the killing I do here. Or am I wong?

Phoenix.PubSub.PG2 is supervised by the Endpoint, and for some reason it just does not appear in observer under the Endpoint in the supervision tree.

The Phoenix version I use is 1.7.4.
Tried with OTP 21.2.1 and 21.2.2, Elixir 1.7.4 and also with 1.8.0-rc.1. The same happened.

2 Likes

I’m facing the same exact problem, did you solved?

This has just been asked on Stack Overflow too.

This is more of an issue of using Process.exit(pid, :kill) on a supervisor. :kill is a last resort message, which causes the supervisor to immediately terminate, without notifying its children to terminate properly. So the supervisor exits, the children now have to notice their parent is dead, clean up, and terminate.

Meanwhile, the application is restarting part of the tree at the same time, which ends up conflicting with the old one still shutting down, causing another failure. This eventually triggers max_restarts and the application shuts down.

Overall:

  1. Just use :kill as a last resort, when the process did not respond to any other exit signal. This especially applies to supervisors, as their only job is to ensure processes start and terminate accordingly, and sending a :kill voids that

  2. If you want to simulate stopping a supervisor, Supervisor.stop will at least go through the usual flow

  3. Run iex --logger-sasl-reports true -S mix phx.server to get precise logging from supervisors


One question that may arise from this is: could my supervisor fail in a way that triggers the same behaviour as the :kill signal? Supervisors trap exits, which means no other process (linked or not) could cause them to crash. Therefore, this can only happen if there is a bug in the supervisor. And if there is a bug in the supervisor, then indeed they can’t guarantee their fault tolerant properties anyway. That’s why supervisors rarely change, they have been strongly tested for decades and are an essential piece of your application.

11 Likes