Killing a Registry process shuts down the entire application

Hello,

I am writing a fault-tolerant application in Elixir, so I kill every process to check how they handle restarting, etc. I start a Registry from the default supervisor, but when I try killing it, it shuts down the entire application with “Application worker exited: shutdown”. Do you know how to handle a simple, clean restart?

defmodule Toto.Application do
  @moduledoc """
  Entry point of the toto application
  """
  use Application
  require Logger
  use Supervisor

  def start(_type, _args) do
    Logger.info("Starting #{__MODULE__}")
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init(_) do
    Supervisor.init([
      {Registry, keys: :unique, name: Worker.ProcessRegistry}
    ], strategy: :one_for_one)
  end
end

Can you show how you’re killing it?

Process.exit(Process.whereis(Worker.ProcessRegistry), :kill)

That’s very interesting. :thinking: Did you discover what it is?

What does Process.exit(Process.whereis(Worker.ProcessRegistry), :normal) do?

No, I haven’t found the exact answer yet…

I’ve tested it: Process.exit(Process.whereis(Worker.ProcessRegistry), :normal) does not stop the process. It looks like Registry is a very special type of process; I was not expecting this behaviour at all.
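As a side note, that part is standard exit-signal behaviour on the BEAM rather than anything specific to Registry: an exit signal with reason :normal sent to another process never terminates it, while :kill cannot be trapped. A minimal sketch you can paste into iex (the sleeping process is just a stand-in):

pid = spawn(fn -> Process.sleep(:infinity) end)

# A :normal exit signal from another process is ignored when the target is
# not trapping exits (and only delivered as a message when it is).
Process.exit(pid, :normal)
Process.alive?(pid)
#=> true

# :kill is untrappable and terminates the target unconditionally.
Process.exit(pid, :kill)
Process.alive?(pid)
#=> false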

1 Like

Another question.

What do you expect killing the Registry to do?

Registries are collections of names when used with :via and (I think) a complex process dictionary otherwise.

It is also a supervisor. It creates a named ETS table, and the Registry module functions call into that ETS table. I imagine that if you have a number of processes interacting with this Registry, killing it might be a Really Bad™ thing. It also might not be a super common failure, since it shouldn’t be affected by your application code.
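To make the :via part concrete, here is a minimal sketch (MyWorker is a made-up module; Worker.ProcessRegistry is the registry from the original post):

defmodule MyWorker do
  use GenServer

  def start_link(id) do
    # Register the process under the Registry using a :via tuple.
    GenServer.start_link(__MODULE__, id, name: via(id))
  end

  def via(id), do: {:via, Registry, {Worker.ProcessRegistry, id}}

  @impl true
  def init(id), do: {:ok, id}
end

# The key-to-pid mapping lives in the Registry's named ETS table, and the
# Registry cleans the entry up when the worker dies.
{:ok, _pid} = MyWorker.start_link(:job_1)
Registry.lookup(Worker.ProcessRegistry, :job_1)
#=> [{pid_of_job_1, nil}]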

There might be a bug in Registry, but I am also thinking this might not be a place where faults happen.

3 Likes

I guess @Ciboulette’s point is that he was testing the fault tolerance of every part of his system, and it’s very odd that killing a Registry shuts down the whole application. He’s not doing that in his real system, only trying out possible failure points.

Faults can happen anywhere, and being fault tolerant is a basic property of the BEAM; that’s why I’m very intrigued by this.

So I have managed to reproduce the problem and trace it:

15:48:56:912029 (<0.132.0>) getting_unlinked <0.133.0>
15:48:56:912032 (<0.132.0>) << {'EXIT',<0.133.0>,killed}
15:48:56:912207 (<0.132.0>) spawn <0.552.0> as proc_lib:init_p('Elixir.Toto.Supervisor',[<0.131.0>],gen,init_it,[gen_server,<0.132.0>,<0.132.0>,
 {local,'Elixir.Worker.ProcessRegistry'},
 supervisor,
 {{local,'Elixir.Worker.ProcessRegistry'},
  'Elixir.Registry.Supervisor',
  {unique,'Elixir.Worker.ProcessRegistry',1,[],[{-1,{unique,1,nil,nil,[]}},{-2,{unique,1,nil}}]}},
 []])
15:48:56:912226 (<0.132.0>) link <0.552.0>
15:48:56:912233 (<0.132.0>) out {proc_lib,sync_wait,2}
15:48:56:912724 (<0.132.0>) in {proc_lib,sync_wait,2}
15:48:56:912731 (<0.132.0>) << {ack,<0.552.0>,
                                   {error,
                                       {shutdown,
                                           {failed_to_start_child,
                                               'Elixir.Worker.ProcessRegistry.PIDPartition0',
                                               {already_started,<0.134.0>}}}}}
15:48:56:912737 (<0.132.0>) getting_unlinked <0.552.0>
15:48:56:912738 (<0.132.0>) << {'EXIT',<0.552.0>,
                                   {shutdown,
                                       {failed_to_start_child,
                                           'Elixir.Worker.ProcessRegistry.PIDPartition0',
                                           {already_started,<0.134.0>}}}}
15:48:56:912762 (<0.132.0>) <0.132.0> ! {'$gen_cast',
                                            {try_again_restart,'Elixir.Worker.ProcessRegistry'}}
15:48:56:912769 (<0.132.0>) << {'$gen_cast',{try_again_restart,'Elixir.Worker.ProcessRegistry'}}
15:48:56:912788 (<0.132.0>) spawn <0.553.0> as proc_lib:init_p('Elixir.Toto.Supervisor',[<0.131.0>],gen,init_it,[gen_server,<0.132.0>,<0.132.0>,
 {local,'Elixir.Worker.ProcessRegistry'},
 supervisor,
 {{local,'Elixir.Worker.ProcessRegistry'},
  'Elixir.Registry.Supervisor',
  {unique,'Elixir.Worker.ProcessRegistry',1,[],[{-1,{unique,1,nil,nil,[]}},{-2,{unique,1,nil}}]}},
 []])
15:48:56:912793 (<0.132.0>) link <0.553.0>
15:48:56:912797 (<0.132.0>) out {proc_lib,sync_wait,2}
15:48:56:913611 (<0.132.0>) in {proc_lib,sync_wait,2}
15:48:56:913622 (<0.132.0>) << {ack,<0.553.0>,
                                   {error,
                                       {shutdown,
                                           {failed_to_start_child,
                                               'Elixir.Worker.ProcessRegistry.PIDPartition0',
                                               {already_started,<0.134.0>}}}}}
15:48:56:913632 (<0.132.0>) getting_unlinked <0.553.0>
15:48:56:913635 (<0.132.0>) << {'EXIT',<0.553.0>,
                                   {shutdown,
                                       {failed_to_start_child,
                                           'Elixir.Worker.ProcessRegistry.PIDPartition0',
                                           {already_started,<0.134.0>}}}}
15:48:56:913692 (<0.132.0>) <0.132.0> ! {'$gen_cast',
                                            {try_again_restart,'Elixir.Worker.ProcessRegistry'}}
15:48:56:913702 (<0.132.0>) << {'$gen_cast',{try_again_restart,'Elixir.Worker.ProcessRegistry'}}
15:48:56:913733 (<0.132.0>) spawn <0.554.0> as proc_lib:init_p('Elixir.Toto.Supervisor',[<0.131.0>],gen,init_it,[gen_server,<0.132.0>,<0.132.0>,
 {local,'Elixir.Worker.ProcessRegistry'},
 supervisor,
 {{local,'Elixir.Worker.ProcessRegistry'},
  'Elixir.Registry.Supervisor',
  {unique,'Elixir.Worker.ProcessRegistry',1,[],[{-1,{unique,1,nil,nil,[]}},{-2,{unique,1,nil}}]}},
 []])
15:48:56:913741 (<0.132.0>) link <0.554.0>
15:48:56:913748 (<0.132.0>) out {proc_lib,sync_wait,2}
15:48:56:914126 (<0.132.0>) in {proc_lib,sync_wait,2}
15:48:56:914195 (<0.132.0>) << {ack,<0.554.0>,
                                   {error,
                                       {shutdown,
                                           {failed_to_start_child,
                                               'Elixir.Worker.ProcessRegistry.PIDPartition0',
                                               {already_started,<0.134.0>}}}}}
15:48:56:914216 (<0.132.0>) getting_unlinked <0.554.0>
15:48:56:914218 (<0.132.0>) << {'EXIT',<0.554.0>,
                                   {shutdown,
                                       {failed_to_start_child,
                                           'Elixir.Worker.ProcessRegistry.PIDPartition0',
                                           {already_started,<0.134.0>}}}}
15:48:56:914274 (<0.132.0>) <0.132.0> ! {'$gen_cast',
                                            {try_again_restart,'Elixir.Worker.ProcessRegistry'}}
15:48:56:914288 (<0.132.0>) << {'$gen_cast',{try_again_restart,'Elixir.Worker.ProcessRegistry'}}
15:48:56:914329 (<0.132.0>) exit shutdown
15:48:56:914334 (<0.132.0>) unregister 'Elixir.Toto.Supervisor'
15:48:56:914337 (<0.132.0>) out_exited 0

I’m not very experienced with tracing, so any help would be awesome!

1 Like

I think that this blog post is relevant to this thread. You might want to consider the registry process as part of the “Error Kernel”.

Your trace makes it look like the registry’s children don’t die / aren’t unregistered from the name registry when the registry supervisor is killed. The named processes are still alive, which shows up as already_started.

At least that is the appearance it has to me.

So this sounds like a bug in Registry.

1 Like

Sorry to resurrect this thread, but this is the same issue I’m having right now, so I was wondering what a solution would be.

In my case I have multiple Registry processes for parts of my system, and they are children of supervisors with the :one_for_all strategy, so they will restart if another process crashes.

Looking at the logs, it seems to be an issue with the PIDPartition processes that Registry creates; it seems like they take some time to be killed? If I increase my max_restarts to 20, for example, it works correctly and restarts everything as expected. But this feels more like a workaround than a solution.

Testing it a little bit more, I think I’ve got it now: this only happens when I try to kill the supervisor directly. If I kill any of my workers (which will cause the Registry to be killed too, since I have :one_for_all), it works as expected.

I guess the reason is that supervisors are part of the error kernel, as @axelson pointed out.
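In case it helps, the shape I’m describing is roughly this (module names are made up, and max_restarts: 20 is the workaround value mentioned above):

defmodule MyApp.PipelineSupervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      {Registry, keys: :unique, name: MyApp.PipelineRegistry},
      MyApp.PipelineWorker
    ]

    # :one_for_all restarts the whole group when any child dies; the raised
    # max_restarts (the default is 3 restarts in 5 seconds) is the workaround
    # for the burst of exits seen when the Registry itself is killed.
    Supervisor.init(children, strategy: :one_for_all, max_restarts: 20, max_seconds: 5)
  end
end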

1 Like

The most likely reason this is happening is that when you kill the registry, it kills everything registered under it. This means a supervisor can potentially get tons of unexpected failures, which in turn means it will exceed its restart limit. This can be accounted for in multiple ways: increasing max_restarts, changing the strategy, etc.
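For the “increasing restarts” option, here is a minimal sketch against the init/1 from the original post (20 and 5 are arbitrary illustration values; the defaults are max_restarts: 3 and max_seconds: 5):

def init(_) do
  Supervisor.init(
    [
      {Registry, keys: :unique, name: Worker.ProcessRegistry}
    ],
    strategy: :one_for_one,
    # Give the supervisor enough headroom to absorb the flurry of failed
    # restart attempts before it gives up and shuts the application down.
    max_restarts: 20,
    max_seconds: 5
  )
end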

5 Likes