How to stop OTP processes started in ExUnit setup callback?

lessless · February 22, 2017, 3:54am

Hello,

The module under test depends on three OTP processes, so they’re started in the test setup callback:

  setup do
    accounts = TestAccounts.accounts()
    {:ok, scheduler}      = Enum.map(accounts, &Map.get(&1, :name)) |> Scheduler.start_link() # GenStage
    {:ok, acc_supervisor} = AccountsSupervisor.start_link() # Supervisor
    {:ok, provisor}       = Provisor.start_link() # GenServer

    {:ok, accounts: accounts}
  end

I thought they would be killed automatically after each test case completes, but it looks like that’s not the case: about once every 3-4 runs, a wild ** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.2161.0>}} error began to appear.

I managed to catch it both for Scheduler and for AccountsSupervisor.

The application supervision tree is:

workers  = [
  supervisor(Registry, [:unique, Postman.Registry]),
  supervisor(AccountsSupervisor, []),
  worker(Provisor, []),
  worker(Scheduler, [Enum.map(accounts, &Map.get(&1, :name))])
]

My first idea (confirmed by googling) was to stop those processes in an on_exit callback:

  setup do
    accounts = TestAccounts.accounts()
    {:ok, scheduler}      = Enum.map(accounts, &Map.get(&1, :name)) |> Scheduler.start_link() # GenStage
    {:ok, acc_supervisor} = AccountsSupervisor.start_link() # Supervisor
    {:ok, provisor}       = Provisor.start_link() # GenServer

    on_exit fn ->
      Supervisor.stop(acc_supervisor)
      GenServer.stop(provisor)
      GenStage.stop(scheduler)
    end

    {:ok, accounts: accounts}
  end

That led to a whole new bunch of other errors/complaints:

Supervisor.stop(acc_supervisor) produces

 ** (exit) exited in: :sys.terminate(#PID<0.572.0>, :normal, :infinity)
         ** (EXIT) shutdown

I think this is just a notification message, but I would really like to avoid capturing errors in every test where a Supervisor should be stopped.

GenServer.stop(provisor) produces

     ** (exit) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

GenStage.stop(scheduler) produces

 ** (exit) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

Here you can see some clear contradictions (race conditions): sometimes the processes are still running, sometimes they’re not.

Lastly, to work around that problem, I wrote a function that kills a process only if it’s alive:

  def kill_if_alive(pid) do
    case Process.alive?(pid) do
      true -> Process.exit(pid, :kill)
      _    -> :ok
    end
  end
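One detail worth keeping in mind with a helper like this: Process.exit/2 only sends the exit signal and returns immediately, so the target may still be alive when the helper returns. A minimal sketch of a variant that also waits for the process to go down is shown here; the module name TestUtilsSketch and the function kill_and_wait are hypothetical and not part of the original code base, and the Agent is only a stand-in for the real processes.

```elixir
defmodule TestUtilsSketch do
  # Hypothetical helper: sends :kill and then blocks until the process is down.
  # Monitoring an already-dead pid is safe; a :DOWN message arrives right away.
  def kill_and_wait(pid, timeout \\ 1_000) do
    ref = Process.monitor(pid)
    Process.exit(pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    after
      timeout -> {:error, :timeout}
    end
  end
end

# Usage with an unlinked Agent standing in for one of the real processes.
{:ok, pid} = Agent.start(fn -> 0 end)
:ok = TestUtilsSketch.kill_and_wait(pid)
false = Process.alive?(pid)
```

Because the monitor is set up before the signal is sent, the Process.alive?/1 check becomes unnecessary in this variant.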

After that, an even stranger race condition in one of the tests started to appear.

test "start all accounts", ctx do
  assert Supervisor.which_children(AccountsSupervisor) |> length() == 0
  assert Provisor.start_all_accounts(ctx.accounts)     |> length() == length(ctx.accounts)
  assert Supervisor.which_children(AccountsSupervisor) |> length() == length(ctx.accounts)
end

 Assertion with == failed
     code:  Supervisor.which_children(AccountsSupervisor) |> length() == length(ctx.accounts())
     left:  1
     right: 2
     stacktrace:
       test/processor/provisor_test.exs:26: (test)

Provisor.start_all_accounts spawns a bunch of supervisors under AccountsSupervisor, and thus they should be stopped together with AccountsSupervisor.

This situation is utterly confusing and I hope somebody can clarify what’s going on and how to properly stop those processes.

NobbZ · February 22, 2017, 6:31am

Since we happen to have race conditions in the setup and teardown code of the tests, have you already set the test module to run the tests one by one by using async: false?

lessless · February 22, 2017, 7:02am

Yep, I just added async: false to all test files. Also, because all processes register themselves within the local registry, I added it to the restart routine as well:

  setup do
    accounts         = TestAccounts.accounts()
    {:ok, registry}  = Registry.start_link(:unique, Postman.Registry)
    {:ok, scheduler} = Enum.map(accounts, &Map.get(&1, :name)) |> Scheduler.start_link()
    {:ok, accs_sup}  = AccountsSupervisor.start_link()
    {:ok, provisor}  = Provisor.start_link()

    on_exit fn ->
      TestUtils.kill_if_alive(accs_sup)
      TestUtils.kill_if_alive(provisor)
      TestUtils.kill_if_alive(scheduler)
      TestUtils.kill_if_alive(registry)
    end

    {:ok, accounts: accounts}
  end

and bam - Registry.start_link(:unique, Postman.Registry) throws all kinds of amazing errors:

** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.608.0>}}

14:01:37.012 [error] GenServer Postman.Registry.PIDPartition0 terminating
** (stop) killed
Last message: {:EXIT, #PID<0.566.0>, :killed}

(two different runs)

NobbZ · February 22, 2017, 8:31pm

Can you assemble a mini-project which shows the problem and make it accessible to the public via github or something similar?

josevalim · February 22, 2017, 10:16pm

There is no need for a mini-project. This is how ExUnit works.

@lessless the processes you start in setup are linked to the test process. This means that, when the test finishes, those processes will asynchronously terminate since the link between those processes and the test process is broken.

That’s why you have races: there is no guarantee those linked processes will terminate before the next test starts. Also, because on_exit runs after the test process exits, the linked processes may still be running or may have already died, which is why Supervisor.stop and friends sometimes fail and sometimes don’t.

Overall, it is the same race condition. The processes you spawn may or may not have exited by the time you run on_exit or the next test starts.
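The race described above can be reproduced outside ExUnit with a rough sketch like the following. It assumes a plain spawned process standing in for the test process and a named Agent standing in for the processes started in setup; the name :demo_agent and the exit reason :shutdown are chosen for illustration only.

```elixir
parent = self()

# Stand-in for the test process: starts a linked, named Agent, then exits.
spawn(fn ->
  {:ok, agent} = Agent.start_link(fn -> 0 end, name: :demo_agent)
  send(parent, {:started, agent})
  # A non-normal exit reason propagates over the link to the Agent.
  exit(:shutdown)
end)

agent =
  receive do
    {:started, pid} -> pid
  end

# At this point the Agent may or may not still be alive: that is the race.
# Monitoring works in both cases, so waiting for :DOWN removes the guesswork.
ref = Process.monitor(agent)

receive do
  {:DOWN, ^ref, :process, ^agent, reason} ->
    IO.puts("agent is down: #{inspect(reason)}")
after
  1_000 -> IO.puts("agent still alive after 1s")
end
```

The reported reason depends on timing: it reflects the propagated exit if the Agent was still alive when the monitor was created, and :noproc if it had already died.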

That said, all you need to guarantee is that those processes are DOWN in the on_exit callback, making sure you have a clean slate for the next test run. Since Process.monitor/1 won’t fail if you give it a dead process, it fits the bill perfectly. You should add this function to your codebase:

defp assert_down(pid) do
  ref = Process.monitor(pid)
  assert_receive {:DOWN, ^ref, _, _, _}
end

And call it for every named process to have a beautifully green test suite.

We are in the process of making this simpler for Elixir v1.5 by starting a supervisor per test and allowing you to start processes under the test supervisor. This means we can cleanly shut everything down at the end of the test without user intervention. Stay tuned.
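As a sketch of how that could look in a test module: the example below uses a named Agent as a hypothetical stand-in for Scheduler, AccountsSupervisor, and Provisor, and the module name and the name :example_named_agent are made up for illustration.

```elixir
ExUnit.start()

defmodule AssertDownExampleTest do
  use ExUnit.Case, async: false

  setup do
    # Hypothetical named process standing in for the real ones.
    {:ok, agent} = Agent.start_link(fn -> [] end, name: :example_named_agent)

    # on_exit runs after the test process has exited, so the linked Agent is
    # already terminating; we only wait until it is really gone.
    on_exit(fn -> assert_down(agent) end)

    {:ok, agent: agent}
  end

  test "first test sees a fresh process", %{agent: agent} do
    assert Agent.get(agent, & &1) == []
  end

  test "second test can register the same name again", %{agent: agent} do
    assert Process.whereis(:example_named_agent) == agent
  end

  defp assert_down(pid) do
    ref = Process.monitor(pid)
    assert_receive {:DOWN, ^ref, _, _, _}
  end
end
```

Without the on_exit line, the second setup could occasionally hit the {:error, {:already_started, pid}} match error, depending on timing.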

lessless · February 26, 2017, 4:42pm

Thank you @josevalim, you saved the day once again! I believe this should get on Elixir Radar, because there is a chance that this behavior wasn’t explained anywhere before.

lukyanov · June 10, 2020, 8:25am

Hello!

josevalim wrote: “We are in the process of making this simpler for Elixir v1.5 by starting a supervisor per test and allowing you to start processes under the test supervisor. This means we can cleanly shut everything down at the end of the test without user intervention. Stay tuned.”

@josevalim could you shed some light on this old subject? Has anything changed in Elixir since then regarding this?

As it looks to me, the issue is still valid for the latest (1.10) version. I mean the functions like assert_down in on_exit are still required to wait for all the processes to be stopped. Is this still true or am I missing something here?

Thank you.

LostKobrakai · June 10, 2020, 8:40am

There’s start_supervised, which makes the process managed for the lifecycle of the running test.
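A minimal sketch of that approach, assuming a named Agent as a stand-in for the processes from the original question (the module name, the child id, and the name :supervised_agent are hypothetical):

```elixir
ExUnit.start()

defmodule StartSupervisedExampleTest do
  use ExUnit.Case, async: false

  setup do
    # The process is started under the per-test supervisor instead of being
    # linked to the test process. ExUnit shuts it down when the test ends,
    # before the next test starts, so no on_exit cleanup is needed.
    agent =
      start_supervised!(%{
        id: :supervised_agent,
        start: {Agent, :start_link, [fn -> 0 end, [name: :supervised_agent]]}
      })

    {:ok, agent: agent}
  end

  test "the name is registered for this test", %{agent: agent} do
    assert Process.whereis(:supervised_agent) == agent
  end

  test "the next test gets a fresh process under the same name", %{agent: agent} do
    Agent.update(agent, &(&1 + 1))
    assert Agent.get(:supervised_agent, & &1) == 1
  end
end
```

For modules that define child_spec/1, the shorter tuple form such as start_supervised!({Agent, fn -> 0 end}) should also work.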