ExUnit fails tests randomly

Background

I have a some tests that create a worker and wait for said worker to open a connection. To test if the connection is up I pass along some stubs in the workers start_link. I check the the stubs were executed using counters.

Problem

All is fine but only some times. Sometimes my test suite passes, other times it fails. And this is random, which is really annoying.

Following is a test sample:

test "saves base_url to dumper and metrics if it cannot connect to it" do
      Process.flag :trap_exit, true

      dumper = :counters.new(1, [])
      metrics = :counters.new(1, [])

      args = %{
        deps: %{
          dumper: %{
            save_failed_request: fn _url ->
              :counters.add(dumper, 1, 1)
              {:ok, :saved}
            end
          },
          metrics: %{
            inc: fn _id ->
              :counters.add(metrics, 1, 1)
              {:ok, 1}
            end
          }
        }
      }

      {:ok, _pid} = Worker.start_link(args)

      dumper_calls = :counters.get(dumper, 1)
      dumper_expected_calls = 1

      metrics_calls = :counters.get(dumper, 1)
      metrics_expected_calls = 1

      assert dumper_calls === dumper_expected_calls
      assert metrics_calls === metrics_expected_calls
      assert_receive {:EXIT, _, :timeout}, 600
      Process.flag :trap_exit, false
    end

I guess I could extend the assert_receive to long times, but then my test suite would be painfully slow.
I assume this is happening because the Worker dies before sending messages or the test somehow finishes without waiting for my worker.

Is there is simple way to fix this so my tests pass consistently ?

1 Like

Unlikely.

As described “opening a connection” seems to inject an element of non-determinism into your test.

  • because the outcome may not always be successful
  • because the outcome won’t necessarily materialize in a reasonable amount of time.

So you have eliminate that non-deterministic element or if you can’t, defer to another form of testing (e.g. E2E).


Eradicating Non-Determinism in Tests
Erratic Tests

State Verification
Behavior Verification

Which particular assertion fails? Is it always the assert_receive?

I don’t actually connect to a real server, I have this part of the system stubbed. This is most definitely not the root cause of the issue. My tests are fully deterministic, unless they are smoke tests, in which case I actually want to try to connect to said servers (this is not a smoke test).

This is usually the part that fails:

09:27:57.533 [debug] Closing connection: %Clint.ConnPool.Worker.State{active_streams: [], conn_pid: nil, deps: %{dumper: %{save_failed_request: #Function<10.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>}, http: %{await_up: #Function<8.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>, close: #Function<9.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>, get: &Clint.IAsyncHttp.Gun.get/2, open: #Function<7.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>}, metrics: %{inc: #Function<11.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>}, registry: %{lookup: &Clint.ProcessRegistry.lookup/1, update_with_value: &Clint.ProcessRegistry.update_with_value/2, via_tuple: &Clint.ProcessRegistry.via_tuple/2}}, opts: %{base_url: 'google.com', mailbox_update_interval: 100, metrics_module: Clint.IMetrics.Default, name: :from_the_inside}, worker_id: 994}

09:27:57.533 [error] GenServer {Clint.ProcessRegistry, {Clint.ConnPool.Worker, {994, :from_the_inside}}, [estimated_mailbox_size: 0]} terminating
** (stop) time out
Last message: {:continue, :establish_conn}
State: %Clint.ConnPool.Worker.State{active_streams: [], conn_pid: nil, deps: %{dumper: %{save_failed_request: #Function<10.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>}, http: %{await_up: #Function<8.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>, close: #Function<9.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>, get: &Clint.IAsyncHttp.Gun.get/2, open: #Function<7.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>}, metrics: %{inc: #Function<11.93644000/1 in WorkerTest."test start_link/1 saves base_url to dumper and metrics if it cannot connect to it"/1>}, registry: %{lookup: &Clint.ProcessRegistry.lookup/1, update_with_value: &Clint.ProcessRegistry.update_with_value/2, via_tuple: &Clint.ProcessRegistry.via_tuple/2}}, opts: %{base_url: 'google.com', mailbox_update_interval: 100, metrics_module: Clint.IMetrics.Default, name: :from_the_inside}, worker_id: 994}


  1) test start_link/1 saves base_url to dumper and metrics if it cannot connect to it (WorkerTest)
     test/unit/worker_test.exs:119
     Assertion with === failed
     code:  assert dumper_calls === dumper_expected_calls
     left:  0
     right: 1
     stacktrace:
       test/unit/worker_test.exs:162: (test)

     The following output was logged:
     
     09:27:57.532 [debug] Starting worker 994
     


Finished in 0.1 seconds
33 tests, 1 failure, 32 excluded

Are the tasks you expect the worker to do done in init? Otherwise how can you be sure they’re already done once the tests asserts the counters?

The init call actually defers to a hande_continue where the real work is done.
In that handle_continue the process is supposed to connect to an external server. In this specific test, I make that connection fail by making the functions used return an error:

http: %{
  open: fn _base_url -> {:ok, self()} end,
  await_up: fn _conn -> {:error, :timeout} end,
  close: fn nil -> :ok end
}

Without a viable connection, the process stops initialization and returns an error:

with  {:ok, conn_pid} <- Logic.establish_connection(base_url, conn_deps) do
     #save the connection pid 
     new_state = %{state | conn_pid: conn_pid} 
      {:noreply, new_state}
    else
      {:error, reason} -> {:stop, reason, state}
    end

My objective is to make the test reliable. I believe the test sometimes fails because assert_receive waits a set amount of milliseconds, instead of actually waiting for the process being tested to terminate.

This means that if the process being tested terminates within the amount of time assert_receives waits, then everything is fine, but if for some reason it doens’t, then it fails the test.

I still don’t know of a way to fix this.

That’s your issue. start_link returns as soon as init is done. So your test asserts run concurrently to your handle_continue function. So your test might run the assertions before the process has done it’s tasks. handle_continue was added partly to make initialization of processes easier in a non-blocking fashion.

You need to find a way to let your test wait for the task to do it’s work before running the assertions. One way would be to send the test a message when work is done. Or you just wait for it to terminate and check the result only after that.

2 Likes

So, if I change:

      {:ok, _pid} = Worker.start_link(args)

      dumper_calls = :counters.get(dumper, 1)
      dumper_expected_calls = 1

      metrics_calls = :counters.get(dumper, 1)
      metrics_expected_calls = 1

      assert dumper_calls === dumper_expected_calls
      assert metrics_calls === metrics_expected_calls
      assert_receive {:EXIT, _, :timeout}, 600
      Process.flag :trap_exit, false

To:

      {:ok, _pid} = Worker.start_link(args)

      # Wait for SUT to finish 
      assert_receive {:EXIT, _, :timeout}, 600
      
      dumper_calls = :counters.get(dumper, 1)
      dumper_expected_calls = 1

      metrics_calls = :counters.get(dumper, 1)
      metrics_expected_calls = 1

      assert dumper_calls === dumper_expected_calls
      assert metrics_calls === metrics_expected_calls
      Process.flag :trap_exit, false

Everything should be consistent.

EDIT: Ya, it works like a charm. You killed it with your explanation, thanks a lot !