Unusual OTP behaviour: killing any but one of the processes results in a restart

Hi,
I created this application: Islands Engine 0.1.35.
(The link shows 7 highlighted processes in a supervision tree of 10 processes.)

I start it by running the command: iex --sname islands -S mix
Then I start the observer with: :observer.start

If I kill any but one of the 7 highlighted processes, it is immediately restarted.
The exception is the process Elixir.Islands.Engine.Sup, which sometimes restarts but most of the time does not!

I just cannot figure out what I am doing wrong. I tried changing :max_restarts to no avail.
I also managed to get a trace for both a restart and a failure. Not sure how to attach them here.

Thanks much!

N.B. I just pushed version 0.1.36 of my project and I placed the 2 trace files under the ./assets/ folder.

I just cloned your project and ran a trace log after killing Islands.Engine.Sup. It appears the problem is that Islands.Engine.Game.DynSup is not terminated. I will help you work on this further later tonight; I just have to get my family settled for the night first. I wanted to point out what the issue is, in case that helps you reach a solution without any further help :smile:

18:20:27:851499 (<0.198.0>) << {'EXIT',<0.610.0>,
                                  {shutdown,
                                      {failed_to_start_child,
                                          'Elixir.Islands.Engine.Game.DynSup',
                                          {already_started,<0.201.0>}}}}

I think you’re hitting an edge case discussed here: https://github.com/erlang/otp/pull/1287

A supervisor normally shuts down all its children synchronously when it exits. The exception is when the exit reason is :kill. This one can’t be trapped (think of it like kill -9 on Linux), so the supervisor has no chance to do cleanup.
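
To see the difference between :kill and other exit reasons in isolation, here is a minimal sketch. TrapDemo and run/1 are made-up names for illustration, and the sketch only shows exit trapping in a plain process; it does not reproduce how a supervisor itself reacts to the signal.

```elixir
defmodule TrapDemo do
  # Hypothetical demo module, not part of Islands Engine.
  # Sends an exit signal with `reason` to a process that traps exits and
  # reports whether the signal was trapped or the process simply died.
  def run(reason) do
    parent = self()

    pid =
      spawn(fn ->
        Process.flag(:trap_exit, true)
        send(parent, :ready)

        receive do
          {:EXIT, _from, trapped} -> send(parent, {:trapped, trapped})
        end
      end)

    # Wait until the child is trapping exits before signalling it.
    receive do
      :ready -> :ok
    end

    ref = Process.monitor(pid)
    Process.exit(pid, reason)

    receive do
      {:trapped, trapped} ->
        Process.demonitor(ref, [:flush])
        {:trapped, trapped}

      {:DOWN, ^ref, :process, ^pid, down_reason} ->
        {:died, down_reason}
    end
  end
end

IO.inspect(TrapDemo.run(:boom)) # {:trapped, :boom}
IO.inspect(TrapDemo.run(:kill)) # {:died, :killed}
```

With :boom the process gets a chance to run its own code after the signal arrives; with :kill it never does, which is why no cleanup can happen.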

The children still exit eventually, because they’re linked to Sup, but this happens asynchronously so it’s possible for Sup to get restarted (and try spawning its children) before the old DynSup has died. This fails since the DynSup name is still taken by the old process.
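
The "name is still taken" failure can be seen on its own with a small sketch like the one below. DemoDynSup is a hypothetical name used only here, and the old process is simply kept alive rather than caught in the middle of exiting, so this shows the error value and not the race itself.

```elixir
# DemoDynSup is a hypothetical name used only for this sketch.
{:ok, old} = DynamicSupervisor.start_link(strategy: :one_for_one, name: DemoDynSup)

# Starting a second supervisor under the same name fails while `old` is alive.
result = DynamicSupervisor.start_link(strategy: :one_for_one, name: DemoDynSup)

IO.inspect(result == {:error, {:already_started, old}}) # true
```

This is the same already_started reason that shows up inside failed_to_start_child in the trace above.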

You can confirm this by using a different exit reason when you stop Sup. Try for instance using :boom rather than :kill.


@dom, the discussion at that link was most interesting. I'm pretty sure I fixed the issue, like so:

>   @spec start_link(term) :: Supervisor.on_start()
>   def start_link(:ok),
>     do: DynamicSupervisor.start_link(DynSup, :ok, name: maybe_wait(DynSup))
> 
>   # Polls every 10 ms until the name is no longer registered, so that a
>   # restart does not fail with :already_started while the old process is dying.
>   @spec maybe_wait(atom) :: atom
>   defp maybe_wait(name) do
>     case Process.whereis(name) do
>       nil -> name
>       _pid ->
>         Process.sleep(10)
>         maybe_wait(name)
>     end
>   end
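
As a possible variation on the polling loop above, the wait could be done with a monitor and a timeout, so it cannot loop forever if the old process never exits. This is only a sketch: NameWait and await_free/2 are hypothetical names, the 1_000 ms default is arbitrary, and it assumes the name has been released by the time the :DOWN message arrives.

```elixir
defmodule NameWait do
  # Hypothetical helper, not part of Islands Engine.
  # Blocks until no process is registered under `name`, or until `timeout`
  # milliseconds have passed while waiting for the current holder to exit.
  @spec await_free(atom, non_neg_integer) :: :ok | {:error, :timeout}
  def await_free(name, timeout \\ 1_000) do
    case Process.whereis(name) do
      nil ->
        :ok

      pid ->
        ref = Process.monitor(pid)

        receive do
          {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
        after
          timeout ->
            Process.demonitor(ref, [:flush])
            {:error, :timeout}
        end
    end
  end
end

# Usage: register a throwaway process, kill it, then wait for the name.
pid = spawn(fn -> Process.sleep(:infinity) end)
Process.register(pid, :name_wait_demo)
Process.exit(pid, :kill)
IO.inspect(NameWait.await_free(:name_wait_demo)) # :ok
```

The timeout branch gives the caller a way to fail the start instead of hanging if the old process is stuck.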

Thank you so much, guys! :orange_heart::yellow_heart::green_heart:
