Trouble with intermittently failing tests

Even though I am using Phoenix, I do think my problem is more generic.

In my test, I create a record in the database. I then call a Phoenix endpoint with a key (the MAC address).
Inside Phoenix there is a plug that takes the MAC address and queries a registry to see whether a pid is already registered for it. If not, a new process is created, and that process registers itself.
If the record is not found in the database, the plug returns a 404.

Now when I run my tests in isolation they all work.
When I run my suite, they sometimes fail.
I tried adding async: true in my test file, but it does not help.

The flow is as follows:

Request -> FindHotspot (plug) -> Hotspot.find -> HotspotSupervisor.find_or_create (DynamicSupervisor)

defmodule Gwapi.FindHotspot do
  @moduledoc """
  Get the MAC address from the querystring and
  populate the connection with the found router
  or return a 404
  """

  import Plug.Conn
  alias Gwapi.Hotspot

  def init(_), do: []

  def call(conn, _) do
    case Hotspot.find(conn.params["mac"]) do
      {:ok, hotspot} ->
        assign(conn, :hotspot, hotspot)

      err ->
        IO.inspect(err)
        conn |> send_resp(404, "") |> halt()
    end
  end
end

I added the IO.inspect to see what is happening, and it returns

{:error,
 {{:shutdown, "owner #PID<0.552.0> exited with: shutdown"},
  {GenServer, :call,
   [
     #PID<0.553.0>,
     {:checkout, #Reference<0.3277896844.2305556481.205878>, true, :infinity},
     5000
   ]}}}

Phoenix and Ecto can work together with some databases to ensure that your database connections are isolated from each other while running async tests. But that isolation covers only the database. Any other global process state in your system can leak from one test into another. It sounds like the registry you’re using is global to your system and could be part of the problem; it’s a little hard to say for sure without more information. In general, I suspect some other process state in the system is why you’re seeing issues when running async tests.
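To make the sandbox part concrete, here is a minimal sketch of a test case template that hands the sandbox connection to a separate owner process and uses shared mode for non-async tests, so that processes started outside the test (such as one started under a DynamicSupervisor) can use the same connection. Gwapi.Repo and the module name Gwapi.SandboxCase are assumptions and do not come from the original post; Sandbox.start_owner!/2 and Sandbox.stop_owner/1 need a reasonably recent ecto_sql.

```elixir
defmodule Gwapi.SandboxCase do
  # Hypothetical case template; Gwapi.Repo is assumed to be the app's Ecto repo.
  use ExUnit.CaseTemplate

  alias Ecto.Adapters.SQL.Sandbox

  setup tags do
    # The connection is owned by a dedicated process instead of the test
    # process. For tests that are not async, shared mode lets any process
    # in the system use this connection.
    owner = Sandbox.start_owner!(Gwapi.Repo, shared: not tags[:async])

    # The owner is stopped explicitly once the test is done, rather than
    # implicitly when the test process exits.
    on_exit(fn -> Sandbox.stop_owner(owner) end)

    :ok
  end
end
```

A test module would then use Gwapi.SandboxCase with async: false. This only addresses database access; a global registry still carries state from one test to the next.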

Maybe this can help you: Intermittent test errors with Ecto sandbox

Have you tried async: false (instead of true) in your test file?

That was a typo, I certainly meant false, it doesn’t help.

The registry is indeed global, and by inspecting its pid throughout the different tests, I can see it remains the same. Initially I ran my tests with the same MAC address. When I hit the failures, I thought it was because the tests were leaking state, so I changed them to use individual MAC addresses, but the problem remains.

I read this and can see that the error is the same, but sadly I don’t see how to get rid of it.

I tried moving the Hotspot.find call into my test; this is essentially the call that searches the db, creates the process, and registers it. I then ran mix test several times, but after 7 tests it failed at that point in my test with the same error.

Looking at the other topic, I added an after block that specifically terminates all children of the Gwapi.HotspotSupervisor using DynamicSupervisor.terminate_child, but the problem remains.
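For reference, terminating every child of a DynamicSupervisor and confirming that each one is really gone could be sketched as below. SupervisorCleanup and terminate_all_children are hypothetical names used only for illustration; the supervisor calls themselves are standard Elixir.

```elixir
defmodule SupervisorCleanup do
  # Hypothetical helper, not part of the original application.
  # Terminates all children of a DynamicSupervisor and waits for each
  # child's :DOWN message before returning.
  def terminate_all_children(supervisor, timeout \\ 1_000) do
    for {_id, pid, _type, _modules} <- DynamicSupervisor.which_children(supervisor),
        is_pid(pid) do
      ref = Process.monitor(pid)

      # Returns :ok, or {:error, :not_found} if the child is already gone.
      DynamicSupervisor.terminate_child(supervisor, pid)

      receive do
        {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
      after
        timeout -> raise "child #{inspect(pid)} did not stop within #{timeout}ms"
      end
    end

    :ok
  end
end
```

One thing to keep in mind: as far as I know, ExUnit runs on_exit callbacks in a separate process after the test process has exited. If the test process owns the sandbox connection, cleanup done there may come too late, so calling something like SupervisorCleanup.terminate_all_children(Gwapi.HotspotSupervisor) at the end of the test body may behave differently from calling it in on_exit.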

Regarding the Hotspot.find call you moved into your test: was that the only place where you access the database from a process other than the test process?

Maybe you have the issue we occasionally have. See here for more information: Tests randomly failing when run in a VM

The gist is:
On a fast machine, the error occurs randomly even when the test waits for the workers to finish using the database.

You could try to run the tests of the minimal case I created. If this fails on your machine (running it several times), then you have the same problem. Unfortunately there is no solution yet.

No, looking back, the purpose of this endpoint is to update the hotspot, so yeah that accesses the database too.

I can get my tests to pass by adding a Process.sleep(50) after the creation of the record in the db in my test setup. I just ran them in a loop for about 5 minutes and did not get any failures. Remove the sleep and, bang, it fails.

Can you create a minimum failing test case that you could share?

I’ll try