Supervised process is being restarted during node shutdown

Hello,

Today I noticed some strange process behavior in the staging logs: it turns out that one of the supervised processes is being restarted during node shutdown (upon deployment).

The process is called Account and it is supervised by AccountSupervisor. A simplified definition looks like this:

defmodule Postman.Processor.AccountSupervisor do
  use Supervisor  # brings in supervise/2 and worker/3 (Supervisor.Spec)

  def init([account_name, max_demand]) do
    supervise(
      [
        worker(Postman.Processor.Stage.Account, [account_name, max_demand])
      ],
      strategy: :one_for_one
    )
  end
end
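The supervisor reports below show this supervisor registered through Postman.Registry under a "<list>-supervisor" key. A rough sketch of what the corresponding start_link might look like (the key format and registry usage here are assumptions on my part, not the actual Postman code):

def start_link(account_name, max_demand) do
  # Assumed registration scheme, matching the "list1-supervisor" key in the reports below.
  name = {:via, Registry, {Postman.Registry, "#{account_name}-supervisor"}}
  Supervisor.start_link(__MODULE__, [account_name, max_demand], name: name)
end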

I added a couple of IO.inspect calls and here is what I got:

17:37:35.457 [info] Running App.Endpoint with Cowboy using http://localhost:4000
AccountSupervisor/start_link, #PID<0.1989.0>: "starting list1"
AccountSupervisor/init, #PID<0.1992.0>: "starting list1"
Account/start_link, #PID<0.1992.0>: "starting list1"
Account/init, #PID<0.1993.0>: "starting list1"
^C
Shutting down..
Account/start_link, #PID<0.1992.0>: "starting list1"
Account/init, #PID<0.2005.0>: "starting list1"

=SUPERVISOR REPORT==== 9-Mar-2017::17:38:51 ===
     Supervisor: {via,'Elixir.Registry',
                      {'Elixir.Postman.Registry',<<"list1-supervisor">>}}
     Context:    child_terminated
     Reason:     shutdown
     Offender:   [{pid,<0.1993.0>},
                  {id,'Elixir.Postman.Processor.Stage.Account'},
                  {mfargs,
                      {'Elixir.Postman.Processor.Stage.Account',start_link,
                          [<<"list1">>,3]}},
                  {restart_type,permanent},
                  {shutdown,5000},
                  {child_type,worker}]


=CRASH REPORT==== 9-Mar-2017::17:38:51 ===
  crasher:
    initial call: Elixir.GenStage:init/1
    pid: <0.2005.0>
    registered_name: []
    exception exit: noproc
      in function  gen_server:init_it/6 (gen_server.erl, line 344)
    ancestors: [<0.1992.0>,'Elixir.Postman.Processor.AccountsSupervisor',
                  'Elixir.Postman.Supervisor',<0.1983.0>]
    messages: []
    links: [<0.1992.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 296
  neighbours:

=SUPERVISOR REPORT==== 9-Mar-2017::17:38:51 ===
     Supervisor: {via,'Elixir.Registry',
                      {'Elixir.Postman.Registry',<<"list1-supervisor">>}}
     Context:    start_error
     Reason:     noproc
     Offender:   [{pid,<0.1993.0>},
                  {id,'Elixir.Postman.Processor.Stage.Account'},
                  {mfargs,
                      {'Elixir.Postman.Processor.Stage.Account',start_link,
                          [<<"list1">>,3]}},
                  {restart_type,permanent},
                  {shutdown,5000},
                  {child_type,worker}]

ok

Can anyone please help me figure out what’s going on?


A quick guess: the node shutdown terminates the Account process in some way, so it is restarted by its supervisor. After that the node shutdown terminates the supervisor itself (perhaps because the system is stopping the application), and the supervisor first terminates the Account process before terminating itself.


I wonder what the root cause of that behavior is, though.

Here is the full process tree: http://i.imgur.com/lSZPoy8.png

Scheduler -> Account -> Dispatchers are GenStage stages, connected as producer -> producer_consumer -> consumer.
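For anyone following along, a minimal sketch of that kind of pipeline (module names and options are placeholders, not the actual Postman code): the middle stage subscribes to its producer inside init/1 via the subscribe_to option, which is exactly the step that can fail if the producer is already gone.

defmodule Demo.Scheduler do
  use GenStage

  def start_link, do: GenStage.start_link(__MODULE__, 0, name: __MODULE__)

  def init(counter), do: {:producer, counter}

  # Emit increasing integers as demo events.
  def handle_demand(demand, counter) do
    {:noreply, Enum.to_list(counter..(counter + demand - 1)), counter + demand}
  end
end

defmodule Demo.Account do
  use GenStage

  def start_link(max_demand) do
    GenStage.start_link(__MODULE__, max_demand, name: __MODULE__)
  end

  # producer_consumer sitting between the producer and the consumers;
  # the subscription to the producer is made here, during init/1.
  def init(max_demand) do
    {:producer_consumer, :ok, subscribe_to: [{Demo.Scheduler, max_demand: max_demand}]}
  end

  def handle_events(events, _from, state), do: {:noreply, events, state}
end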


I added a terminate callback to trace the Account shutdown, and in 2 out of 3 runs there was only one process crash:

Account:

def terminate(reason, _state) do
  IO.inspect(reason, label: "Account/terminate")
end

Backtrace:

=SUPERVISOR REPORT==== 10-Mar-2017::16:12:37 ===
     Supervisor: {via,'Elixir.Registry',
                      {'Elixir.Postman.Registry',<<"list1-supervisor">>}}
     Context:    shutdown_error
     Reason:     shutdown
     Offender:   [{pid,<0.2330.0>},
                  {id,'Elixir.Postman.Processor.Stage.Account'},
                  {mfargs,
                      {'Elixir.Postman.Processor.Stage.Account',start_link,
                          [<<"list1">>,3]}},
                  {restart_type,permanent},
                  {shutdown,5000},
                  {child_type,worker}]

ok
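One thing worth noting about tracing with terminate/2: it is only guaranteed to run on a supervisor-initiated :shutdown if the process traps exits (and the crash report above shows trap_exit: false). A minimal sketch of what that would look like in the stage, purely as an assumption about code not shown here:

def init(state) do
  # Without this, the :shutdown exit signal from the supervisor terminates
  # the process immediately and terminate/2 is not invoked.
  Process.flag(:trap_exit, true)
  {:producer_consumer, state}
end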

Update:

def terminate(_reason, _state) do
  Process.sleep(100)
end

prevents the issue from appearing, but I would still like to know the root cause.


Can you try putting Postman.Registry as the first child in its supervisor’s children list, rather than the last?

Supervisors shut down children in the reverse order of initialization. If Registry shuts down first, that will kill the linked processes (Account etc.), so their supervisor will try to restart them.
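For illustration, a sketch of that ordering in the top-level supervisor, using the Elixir 1.4-era child spec helpers (the exact children and registry options are assumptions based on the process tree, not the actual Postman code):

import Supervisor.Spec

children = [
  # Started first, therefore shut down last, so via-registered processes
  # still have a working Registry until they are gone themselves.
  supervisor(Registry, [:unique, Postman.Registry]),
  supervisor(Postman.Processor.AccountsSupervisor, [])
]

Supervisor.start_link(children, strategy: :one_for_one)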


When you stop the system using init:stop, it will explicitly stop all the applications that have been started, IIRC in the reverse of the order in which they were started. So what I meant was: when some other application is stopped, it causes an Account process to terminate, and that process is then restarted by its supervisor before the Account's own application is stopped and the supervisor terminates all of its children.

If you look at the original output you will see a process dies with reason :noproc; this can happen if a behaviour tries to contact a named process which no longer exists. The :shutdown reasons come from the supervisor terminating its children.

Again, I am guessing that the error comes from a process disappearing; when another process is then restarted, it cannot find that process and generates the :noproc error.
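To make that concrete: the crash report above shows the restarted Account exiting with :noproc from inside GenStage's init, which is what you would expect if it tries to subscribe to a producer that has already been shut down. A rough, self-contained way to reproduce the effect (placeholder names, assuming the default cancel: :permanent subscription mode):

defmodule Demo.Orphan do
  use GenStage

  # Subscribes to a producer that was never started / is already dead.
  def init(:ok) do
    {:consumer, :ok, subscribe_to: [Demo.MissingProducer]}
  end

  def handle_events(_events, _from, state), do: {:noreply, [], state}
end

# GenStage.start_link(Demo.Orphan, :ok)
# => fails with :noproc, just like the restarted Account in the report above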
