Gracefully shut down a child of a dynamic supervisor

Hi folks,
I want to gracefully shut down a child of a dynamic supervisor (the parent). The child is a continuously running state machine that uses the gen_statem behavior, and it traps exits to catch signals from its parent. The parent uses the following to send a signal to the child:
Process.exit(pid, :normal)
I expected the child to terminate; this was not the case and it continues to run code.

From reading the following links (link_1, link_2 and link_3), it seems that Process.exit(pid, :normal) does not/should not/cannot work and that I should use :kill as the reason. The documentation is confusing for my case. Any help is much appreciated.

The only way for an external process to stop an exit-trapping process (ungracefully) is to use the :kill reason. Otherwise the signal is converted to a message for the exit-trapping process to handle. It can decide whether or not to shut down based on that message.

Hello @LostKobrakai
Thank you for reaching out. You pointed to the crux of the problem. Why does the child ignore/drop the exit signal with :normal from its parent? Do you have any suggestions for debugging the issue? For my use case, it’s better for the child process to go down gracefully.

I have not been able to reach a meaningful conclusion after reading the docs.

How do you call Process.exit(child, :normal) from the parent, when the parent is a dynamic supervisor?
Can you provide the code snippet?

That depends on the implementation of the child. It can do whatever it wants when handling the exit message.

Hi @hst337
Thanks for reaching out. This is where I send the exit signal to the child from the dynamic supervisor. It’s part of a function which issues the terminate signal to the child process and removes artifacts related to the child:

def clean_up(gv_spec) do
  # cleanup
  case :pg.get_members(gv_spec.id) do
    [] ->
      Logger.info(%{msg: "there is no gv for given instance", id: gv_spec.id})

    [pid | _] ->
      Process.exit(pid, :normal)
  end

  # cleanup continued
end

I’m attaching a code snippet which displays how I’m trapping exits for the child, hope it helps:

@impl :gen_statem
def init(%{spec: gv_spec, config: config}) do
  # initial processing
  Process.flag(:trap_exit, true)
  # create gv_spec
  join_pg(gv_spec)
  # variable setting: event, data, state
  {:ok, state, data, event}
end

A terminate function is also present in the child:

@impl :gen_statem
def terminate(reason, %{spec: gv_spec} = state, data) do
  leave_pg(gv_spec)
  Logger.warn(%{msg: "terminating", reason: reason, id: gv_spec.id})
end

terminate is only called when the process actually shuts down. The exit message would be handled in a handle_event with event type :info (handle_info elsewhere).
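
To make that concrete, here is a minimal sketch of a gen_statem that traps exits and stops itself when the exit message arrives. The module name ExitAwareStatem, the :running state, and the :temporary restart setting are assumptions made for illustration, not the original poster's code. The point it shows is that the :info clause has to return a stop tuple, otherwise the process just keeps running.

```elixir
defmodule ExitAwareStatem do
  # Hypothetical module for illustration only.
  @behaviour :gen_statem

  # :temporary so the supervisor does not restart the child after it stops.
  def child_spec(opts) do
    %{id: __MODULE__, start: {__MODULE__, :start_link, [opts]}, restart: :temporary}
  end

  def start_link(opts), do: :gen_statem.start_link(__MODULE__, opts, [])

  @impl :gen_statem
  def callback_mode, do: :handle_event_function

  @impl :gen_statem
  def init(opts) do
    Process.flag(:trap_exit, true)
    {:ok, :running, opts}
  end

  # An exit signal from a non-parent process arrives as an :info event.
  # Returning a stop tuple is what actually ends the process;
  # terminate/3 runs afterwards.
  @impl :gen_statem
  def handle_event(:info, {:EXIT, _from, reason}, _state, _data) do
    {:stop, reason}
  end

  def handle_event(_type, _event, _state, _data), do: :keep_state_and_data

  @impl :gen_statem
  def terminate(reason, _state, _data) do
    IO.inspect(reason, label: :terminating)
    :ok
  end
end

# Usage: the script process is not the child's parent, so the exit
# signal is delivered to handle_event/4 as a message.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, pid} = DynamicSupervisor.start_child(sup, ExitAwareStatem)
ref = Process.monitor(pid)
Process.exit(pid, :normal)

down_reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end
```

If the clause returned :keep_state_and_data instead, the process would stay alive, which matches the behavior described in the question.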

Thanks for the feedback @LostKobrakai. My goal was to demonstrate that the terminate function has been implemented. As you pointed out, since the process does not terminate as expected, it doesn’t get called. To address your original point, a handle_event with event type :info is present in the code base.

Does it stop the process when it receives the exit message?

I can assure you that this code is not called from the dynamic supervisor process. Perhaps it is called from the dynamic supervisor module, but not from its process. There is actually no non-hacky way to call anything from a supervisor process (unless it is a supervisor written from scratch).


Regarding trap_exit:
Any OTP-compliant process that traps exits behaves this way:
If the exit signal comes from the parent process (the parent can be checked in Process.get()), it is handled as if the child were not trapping exits. If the child receives an exit signal from any non-parent process, it is delivered as a message to handle_info (or to handle_event or the state function in the case of gen_statem).

Mix.install [:gen_state_machine]

defmodule Server do
  use GenStateMachine, callback_mode: :state_functions

  def start_link(opts \\ []) do
    GenStateMachine.start_link(__MODULE__, opts)
  end

  def init(opts) do
    Process.flag(:trap_exit, true)
    {:ok, :state, opts}
  end

  # Exit signals from non-parent processes arrive here as :info events.
  def state(:info, message, _data) do
    IO.inspect(message, label: :received)
    :keep_state_and_data
  end

  def terminate(reason, :state, _data) do
    IO.inspect(reason, label: :terminating)
  end
end

DynamicSupervisor.start_link(name: Sup)
{:ok, child} = DynamicSupervisor.start_child(Sup, Server)

And in iex

iex(3)> Process.exit(child, :normal)
received: {:EXIT, #PID<0.112.0>, :normal}
true
iex(4)> Process.exit(child, :kill)
true
iex(5)> Process.info child
nil

As you can see, with trap_exit, normal exit results in just a message being sent. But if you call this

iex(3)> DynamicSupervisor.terminate_child Sup, child
terminating: :shutdown
:ok

It initiates the terminate callback
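
Applied to the clean_up function from earlier in the thread, this could look roughly like the sketch below. GvWorker is a hypothetical GenServer stand-in for the real gen_statem child (used only to keep the example short), and passing the supervisor as the `sup` argument is an assumption, since the thread does not show how the supervisor is named.

```elixir
defmodule GvWorker do
  # Hypothetical stand-in for the real child process.
  use GenServer, restart: :temporary

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  @impl true
  def init(id) do
    Process.flag(:trap_exit, true)
    :ok = :pg.join(id, self())
    {:ok, id}
  end

  @impl true
  def terminate(reason, id) do
    :pg.leave(id, self())
    IO.inspect(reason, label: :terminating)
  end
end

defmodule GvCleanup do
  # `sup` (an assumption) is the name or pid of the DynamicSupervisor
  # that owns the child.
  def clean_up(sup, id) do
    case :pg.get_members(id) do
      [] ->
        {:error, :not_found}

      [pid | _] ->
        # The supervisor sends the shutdown signal itself, so the child
        # sees it as coming from its parent and runs terminate/2.
        DynamicSupervisor.terminate_child(sup, pid)
    end
  end
end

# Usage
{:ok, _pg} = :pg.start_link()
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, pid} = DynamicSupervisor.start_child(sup, {GvWorker, :gv_1})
result = GvCleanup.clean_up(sup, :gv_1)
```

Because terminate_child/2 only returns once the child is down, any cleanup that follows it can assume the process is gone.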


P.S. don’t forget to mark the correct answer

Hello @LostKobrakai ,

Yep, I’ve placed the proper handle_event for managing exit messages and it should stop the process. I used this part of the documentation

Hi @hst337 ,
Thanks a lot for the thorough explanation.

Hello again @hst337
Thank you very much for your guidance. I’ve encountered an interesting case and it would be great if I could have your feedback.
While reading the logs, I noticed that a child was terminated 10 hours after terminate_child/2 was issued. The child in question is terminated from an external process other than the parent; it uses the gen_statem behavior and has two timeout events with a resolution of 10 minutes. In these timeout events, it opens an external file (and therefore uses an external resource).

Is termination postponed while the child is reading an external file when using terminate_child/2?
Or when termination is issued during a timeout? 10 hours is a huuuge gap and I don’t understand what could be the problem.

I edited the last paragraph.

No, the termination is not postponed in any case. You need to take into account that messages received before the termination request are processed before the terminate callback is called.

If you want to kill your child instantly, you should delete the child and kill it with the :kill reason. In that case, the terminate callback will not run.

Anyway, this is an XY problem, since you’re trying to do resource management by relying on the terminate callback.

You need to know 2 things

  1. The terminate callback is for optimistic cleanup. This means the callback is called only when the failure can be handled. Cases like hardware failure, out-of-memory errors and infinite loops are not covered by terminate callbacks. That’s why you should not rely on the terminate callback.

  2. The idiomatic way to do resource management is a resource pool or the observer pattern. The latter is much easier to implement: it is basically a separate process that hosts the resource and monitors the process using it. When the user of the resource dies for any reason or gets stuck in an infinite loop, the observing process will be able to close or clean up the resource.
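
A rough sketch of the observer idea from point 2, under stated assumptions: ResourceObserver, open_fun and close_fun are hypothetical names, and the "resource" is a placeholder atom rather than a real file handle. It only covers the case where the user process dies; detecting a stuck process would need something extra, such as a timeout.

```elixir
defmodule ResourceObserver do
  # Hypothetical module: owns a resource on behalf of `user_pid`
  # and releases it when that process goes down.
  use GenServer

  # Started without a link, so it survives the crash of the user process.
  def start(user_pid, open_fun, close_fun) do
    GenServer.start(__MODULE__, {user_pid, open_fun, close_fun})
  end

  @impl true
  def init({user_pid, open_fun, close_fun}) do
    ref = Process.monitor(user_pid)
    {:ok, %{ref: ref, resource: open_fun.(), close: close_fun}}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # The monitored process is gone, whatever the reason (even :kill),
    # so the resource is released here instead of in its terminate callback.
    state.close.(state.resource)
    {:stop, :normal, state}
  end
end

# Usage: the closing function just reports back to this script.
parent = self()
user = spawn(fn -> Process.sleep(:infinity) end)

{:ok, _observer} =
  ResourceObserver.start(
    user,
    fn -> :fake_handle end,
    fn res -> send(parent, {:closed, res}) end
  )

Process.exit(user, :kill)

closed =
  receive do
    {:closed, res} -> res
  after
    1_000 -> :timeout
  end
```

The design choice is that cleanup lives in a process that is still alive after the failure, so it runs even when the user process is killed without a chance to call terminate.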

Hi @hst337 ,
Thanks a lot for the feedback. I found the problem and it’s not related to terminate_child/2. Thank you very much for your insights.