GenServer - Parent pid gets killed

defmodule MyApp.Stack do

  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: opts[:stack_name])
  end

  @impl GenServer
  def init(opts) do
    {:ok, {do_track_stack_singleton(opts), opts}}
  end

  @impl GenServer
  def handle_info({:DOWN, _, :process, _pid, _reason}, {_pids, opts} = _state) do
    {:noreply, {do_track_stack_singleton(opts), opts}}
  end

  defp do_track_stack_singleton(stack_opts) do
    stack_module = stack_opts[:server_module]
    process_name = stack_opts[:process_name]

    pid =
      case GenServer.start_link(stack_module, stack_opts, name: {:global, process_name}) do
        {:ok, pid} ->
          pid

        {:error, {:already_started, pid}} ->
          pid
      end

    Process.monitor(pid)

    pid
  end
end

In the test file:

defmodule MyApp.StackTest do
  use ExUnit.Case, async: true

  import Assertions

  alias MyApp.Operation
  alias MyApp.Stack
  alias MyApp.Server

  @process_name :stack_cron

  @opts [
    intended_run_time_utc: ~T[08:00:00.000000],
    operation_module: Operation,
    process_name: @process_name,
    run_interval_milliseconds: 86_400_000,
    server_module: Server,
    timeout: :timer.seconds(2)
  ]

  defp start(_) do
    {:ok, stack_pid} =
      {Stack, @opts}
      |> start_supervised(restart: :temporary)

    child_pid = :global.whereis_name(@process_name)

    %{
      child_pid: child_pid,
      stack_pid: stack_pid
    }
  end

  describe "While running Stack" do
    setup [:start]

    test "if stack child exits Stack restarts it", %{
      child_pid: child_pid,
      stack_pid: stack_pid
    } do
      {pids, _opts} = :sys.get_state(stack_pid)

      Process.exit(child_pid, :kill)

      {other_pids, _opts} = :sys.get_state(stack_pid)

      assert pids == other_pids
    end
  end
end

This test seems to fail intermittently with the following error message:

 ** (exit) exited in: :sys.get_state(#PID<0.1124.0>)
         ** (EXIT) killed
     code: {other_pids, _opts} = :sys.get_state(stack_pid)

From what I have debugged: if I remove the line that exits child_pid and run this test around 30 times, I do not see this error. When I add the line back, the test fails once or twice in 30 runs. If I am not wrong, exiting child_pid somehow also kills the parent pid, but I cannot work out why it is intermittent. Or I could be entirely wrong about the child process killing the parent process. Any idea how to resolve this flaky test?

Okay, but what is the result you’re after?

As the test states, I would like Stack to restart the child after it is killed.

Your test has a race between the test process and the exiting child process. Normally the second :sys.get_state wins: Stack replies while it still holds the old child PID, because it has not yet processed the :DOWN message (if it had, the PID in Stack's state would have changed). Infrequently, the child process exits first and its exit signal causes the linked parent process to exit before it can reply.

Here’s a simplified version of your code that demonstrates this (parts inspired by “What happens when a linked process dies”):

defmodule ChildServer do
  use GenServer

  def init(_), do: {:ok, %{}}
end

defmodule ParentServer do
  use GenServer

  def start do
    GenServer.start(__MODULE__, {})
  end

  def init(_), do: {:ok, setup_process()}

  def trap_exits(server) do
    :ok = GenServer.call(server, :trap_exits)
  end

  def handle_call(:trap_exits, _, s) do
    Process.flag(:trap_exit, true)
    {:reply, :ok, s}
  end

  def handle_info({:DOWN, _, :process, _p, _r} = event, _s) do
    IO.inspect({self(), event}, label: :handle_info)
    {:noreply, setup_process()}
  end

  def handle_info(event, s) do
    IO.inspect({self(), event}, label: :handle_info)
    {:noreply, s}
  end

  def terminate(reason, _s) do
    IO.inspect({self(), reason}, label: :terminate)
  end

  defp setup_process() do
    {:ok, child_pid} = GenServer.start_link(ChildServer, [])

    Process.monitor(child_pid)

    child_pid
  end
end


{:ok, l1} = ParentServer.start()

child_pid = :sys.get_state(l1)

# see below about this line
# ParentServer.trap_exits(l1)

Process.exit(child_pid, :kill)

:sys.get_state(l1) |> IO.inspect()

:sys.get_state(l1) |> IO.inspect()

Running this produces:

#PID<0.102.0>
** (exit) exited in: :sys.get_state(#PID<0.101.0>)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (stdlib 3.15.2) sys.erl:338: :sys.send_system_msg/2
    (stdlib 3.15.2) sys.erl:139: :sys.get_state/1
    main.exs:59: (file)
    (elixir 1.12.2) lib/code.ex:1261: Code.require_file/2

so the first :sys.get_state succeeds and the second one fails with a message similar to what you’re seeing.

Uncommenting the ParentServer.trap_exits line above gives a different behavior:

#PID<0.102.0>
handle_info: {#PID<0.101.0>, {:EXIT, #PID<0.102.0>, :killed}}
handle_info: {#PID<0.101.0>,
 {:DOWN, #Reference<0.1131499688.1731198977.106810>, :process, #PID<0.102.0>,
  :killed}}
#PID<0.103.0>

I was about to ask the same thing; I was confused about whether that was what the OP intended.

Hi @al2o3cr , thanks for such a detailed explanation. I have some questions based on this:

  1. You mentioned that my test has a race between the test and the child process exiting. If :sys.get_state wins and the child process is still alive, the PID in Stack would not have changed?

  2. And when does the DOWN message happen? Is it when the child PID exits?

  3. Infrequently if the child process exits first, why does it also exit the linked parent process? Is that expected?

  4. Why are we trapping exits?

  5. And why does your code have :sys.get_state(l1) |> IO.inspect() twice?

:sys.get_state(l1) |> IO.inspect()

:sys.get_state(l1) |> IO.inspect()

The PID in Stack’s state is only updated when it receives a :DOWN message:

  def handle_info({:DOWN, _, :process, _pid, _reason}, {_pids, opts} = _state) do
    {:noreply, {do_track_stack_singleton(opts), opts}}
  end

so when the assertion in your original test passes:

    test "if stack child exits Stack restarts it", %{
      child_pid: child_pid,
      stack_pid: stack_pid
    } do
      {pids, _opts} = :sys.get_state(stack_pid)

      Process.exit(child_pid, :kill)

      {other_pids, _opts} = :sys.get_state(stack_pid)

      assert pids == other_pids
    end

that means stack_pid has not yet received the :DOWN message from child_pid.

This race happens because Process.exit signals the target process and returns immediately, but the signal is only handled when the BEAM next schedules the target process.
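
Because the exit signal is asynchronous, a test cannot assume the target is already dead when Process.exit returns. One common workaround is to poll until a condition holds. The following is a minimal sketch; the `Eventually` module and its `wait_until` function are hypothetical names introduced here for illustration and are not part of ExUnit or OTP.

```elixir
defmodule Eventually do
  # Hypothetical helper: calls `fun` repeatedly until it returns a truthy
  # value, sleeping `interval_ms` between attempts. Returns :ok on success
  # or :timeout once the attempts are used up.
  def wait_until(fun, attempts \\ 50, interval_ms \\ 10)

  def wait_until(_fun, 0, _interval_ms), do: :timeout

  def wait_until(fun, attempts, interval_ms) do
    if fun.() do
      :ok
    else
      Process.sleep(interval_ms)
      wait_until(fun, attempts - 1, interval_ms)
    end
  end
end

pid = spawn(fn -> Process.sleep(:infinity) end)

# Process.exit/2 only sends the signal; it returns true right away.
true = Process.exit(pid, :kill)

# Wait until the signal has actually been handled.
:ok = Eventually.wait_until(fn -> not Process.alive?(pid) end)
```

The same helper could wrap a comparison of :sys.get_state results when a test needs to wait for a GenServer's state to change.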


start_link creates a bidirectional link between two processes; when one exits the other gets an exit signal as well.
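
To see the link behavior in isolation, here is a small sketch without any GenServer involved. `LinkDemo` is a hypothetical module written for this illustration; it uses only spawn_monitor, spawn_link, and Process.flag from the standard library. It reports the reason the "parent" exits with after its linked child is killed.

```elixir
defmodule LinkDemo do
  # Starts an unlinked "parent" that links to a child, kills the child,
  # and returns the reason the parent exited with.
  def run(trap_exits?) do
    caller = self()

    {parent, parent_ref} =
      spawn_monitor(fn ->
        Process.flag(:trap_exit, trap_exits?)
        child = spawn_link(fn -> Process.sleep(:infinity) end)
        send(caller, {:child, child})

        # Only reached when trapping exits; otherwise the exit signal
        # from the link terminates this process while it waits here.
        receive do
          {:EXIT, ^child, reason} -> exit({:trapped, reason})
        end
      end)

    child =
      receive do
        {:child, pid} -> pid
      after
        1_000 -> raise "parent never reported its child"
      end

    Process.exit(child, :kill)

    receive do
      {:DOWN, ^parent_ref, :process, ^parent, reason} -> reason
    after
      1_000 -> :timeout
    end
  end
end

IO.inspect(LinkDemo.run(false)) # :killed
IO.inspect(LinkDemo.run(true))  # {:trapped, :killed}
```

Without trapping, the parent dies with the same :killed reason that shows up in the failing test's error message.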


The output of the “trapping exits” version of my demo is helpful here:

# the output of the first :sys.get_state
#PID<0.102.0>

# the trapped exit signal from the link
handle_info: {#PID<0.101.0>, {:EXIT, #PID<0.102.0>, :killed}}

# the :DOWN message from the monitor
handle_info: {#PID<0.101.0>,
 {:DOWN, #Reference<0.1131499688.1731198977.106810>, :process, #PID<0.102.0>,
  :killed}}

# the output of the second :sys.get_state
#PID<0.103.0>

When not trapping exits, the exit signal from the link to child_pid kills the parent server before the second :sys.get_state call.
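
If the goal is for the parent to survive and restart its child, one option is to trap exits in init rather than through a separate call. This is only a sketch under assumptions: `Sketch.Parent` and `Sketch.Child` are hypothetical stand-ins for Stack and its server module, and the :global name registration from the original code is left out.

```elixir
defmodule Sketch.Child do
  use GenServer

  @impl GenServer
  def init(arg), do: {:ok, arg}
end

defmodule Sketch.Parent do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(_opts) do
    # Exit signals from linked processes now arrive as
    # {:EXIT, pid, reason} messages instead of killing this process.
    Process.flag(:trap_exit, true)
    {:ok, start_child()}
  end

  @impl GenServer
  def handle_info({:DOWN, ref, :process, pid, _reason}, {pid, ref}) do
    # The monitored child is gone: start a replacement.
    {:noreply, start_child()}
  end

  def handle_info({:EXIT, _pid, _reason}, state) do
    # The restart is driven by :DOWN, so the trapped exit is ignored.
    {:noreply, state}
  end

  def handle_info(_other, state), do: {:noreply, state}

  defp start_child do
    {:ok, pid} = GenServer.start_link(Sketch.Child, [])
    {pid, Process.monitor(pid)}
  end
end

{:ok, parent} = Sketch.Parent.start_link()
{child, _ref} = :sys.get_state(parent)
Process.exit(child, :kill)
```

With this shape, a test would wait for the state to change and then assert that the new PID differs from the old one, rather than asserting that the two are equal. For real applications, a Supervisor is usually the more conventional way to restart a child.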


The joys of distributed systems! This is polling l1 to track state changes.

:sys.get_state can also be useful with GenServers as a “wait for completion” operation in tests. For instance:

GenServer.cast(some_pid, {:long_running_operation, "foo"})

# This will block until `some_pid` returns to the GenServer loop
:sys.get_state(some_pid)
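
A runnable sketch of that pattern, assuming a hypothetical `SlowCounter` server made up for this example. It relies on messages from one process to another being delivered in the order they were sent, so the cast is handled before the :sys.get_state request.

```elixir
defmodule SlowCounter do
  use GenServer

  def start_link(initial), do: GenServer.start_link(__MODULE__, initial)

  @impl GenServer
  def init(initial), do: {:ok, initial}

  @impl GenServer
  def handle_cast({:add_slowly, n}, count) do
    # Simulate a long-running operation.
    Process.sleep(50)
    {:noreply, count + n}
  end
end

{:ok, pid} = SlowCounter.start_link(0)

# Returns immediately; the work happens later inside the server.
GenServer.cast(pid, {:add_slowly, 5})

# Blocks until the cast above has been handled, then returns the state.
5 = :sys.get_state(pid)
```

This only synchronizes with messages sent by the calling process. It does not wait for messages from other processes, such as the :DOWN message in the original test.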