I have run into this problem several times already when building Elixir/Phoenix/Ecto applications. I have a solution, but maybe it’s not great, or maybe I am missing something out there that could help me.
Basically, whenever we build a Phoenix system that has async bits in it, we start to see Ecto failures like this one:
11:48:33.322 [error] Postgrex.Protocol (#PID<0.1513.0>) disconnected: ** (DBConnection.ConnectionError) owner #PID<0.3620.0> exited
Client #PID<0.1614.0> is still using a connection from owner at location:
:prim_inet.recv0/3
(postgrex) lib/postgrex/protocol.ex:2834: Postgrex.Protocol.msg_recv/4
(postgrex) lib/postgrex/protocol.ex:2550: Postgrex.Protocol.recv_transaction/4
(postgrex) lib/postgrex/protocol.ex:1855: Postgrex.Protocol.rebind_execute/4
(ecto_sql) lib/ecto/adapters/sql/sandbox.ex:370: Ecto.Adapters.SQL.Sandbox.Connection.proxy/3
while running tests.
This doesn’t crash the tests, but the background processes (manually spawned processes, Tasks and GenServers) do fail, printing that or a similar error to the console.
Sometimes the error is different, for example that a record we created in the test setup no longer exists in the database.
The tests at this stage are already running with async: false and execute sequentially, one after another.
Reason for this happening
The reason is that the Ecto SQL sandbox (Ecto.Adapters.SQL.Sandbox) opens a transaction at the beginning of a test and rolls it back at the end of the test. At that point, background tasks/processes that were spawned may still be trying to complete work triggered by web requests, for example.
This can happen, for example, when sending e-mails from a simple background Task. A user registers and a confirmation e-mail is sent; the test checks that the flash message is displayed to the user and exits. Meanwhile, a background Task has been spawned to send the e-mail, and it attempts to access the database to fetch the User record. Since the transaction has already been rolled back, the User with the given ID is no longer in the database, or you see the error I posted above because the process that owned the connection has already exited.
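The race can be illustrated without Ecto at all. The following is only a sketch under stated assumptions: the module name `SandboxRaceSketch` is made up, and an Agent stands in for the sandboxed connection. The "test" stops the Agent while a background process still intends to read from it.

```elixir
defmodule SandboxRaceSketch do
  # Hypothetical stand-in: the Agent plays the role of the sandbox
  # transaction, holding a row that only exists until the "test" ends.
  def run do
    {:ok, db} = Agent.start(fn -> %{1 => "user@example.com"} end)
    parent = self()

    # Background work, e.g. sending a confirmation e-mail a bit later.
    spawn(fn ->
      Process.sleep(50)

      result =
        try do
          Agent.get(db, &Map.get(&1, 1))
        catch
          # Calling a process that has already stopped exits the caller.
          :exit, _reason -> :connection_gone
        end

      send(parent, {:background_result, result})
    end)

    # The "test" finishes right away, which mirrors the rollback.
    Agent.stop(db)

    receive do
      {:background_result, result} -> result
    after
      1_000 -> :timeout
    end
  end
end
```

Here the background process always loses the race because of the 50 ms sleep; in a real application the outcome depends on timing, which is why the failures look intermittent.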
The solution I am using
Since I have full control of the code base, I wrote a simple macro that I call when my background Tasks or GenServers start up. It registers the given process in a Registry that I use for tracking these processes in tests.
For example, in a GenServer I would have:
def init(state) do
  TestHelpers.register_gen_server(self())
  {:ok, state}
end
and in Tasks I would have a similar call: TestHelpers.register_task(self()).
The Registry keeps track of all of these, so it knows which processes are alive at any given time.
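A minimal sketch of what such a helper could look like, built on Elixir's standard `Registry`. This is not my actual code: the module name `TestHelpersSketch`, the registry name and the `alive/1` function are assumptions made for illustration, and it uses plain functions rather than a macro.

```elixir
defmodule TestHelpersSketch do
  # Hypothetical registry name, chosen for this sketch only.
  @registry __MODULE__.Registry

  # Start once, e.g. from test_helper.exs.
  def start_registry do
    Registry.start_link(keys: :duplicate, name: @registry)
  end

  def register_task(pid), do: register(:task, pid)
  def register_gen_server(pid), do: register(:gen_server, pid)

  # Returns the registered pids of the given kind that are still alive.
  def alive(kind) do
    for {pid, _value} <- Registry.lookup(@registry, kind),
        Process.alive?(pid),
        do: pid
  end

  defp register(kind, pid) do
    # Registry can only register the calling process, so insist on self().
    ^pid = self()
    {:ok, _owner} = Registry.register(@registry, kind, nil)
    :ok
  end
end
```

Registry removes entries when the registered process exits, though not necessarily immediately, which is why `alive/1` also checks `Process.alive?/1`.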
Then, I wrap my tests that are causing trouble in a function like this:
test "registers user in the system", %{session: session} do
  TestHelpers.wait_for_background_processes fn ->
    session |> visit("/") |> ...
  end
end
The wait_for_background_processes/1 function does three things:
- Executes the callback function containing the actual test
- Waits for all the registered Tasks to complete and their processes to stop being alive
- Waits for all the registered GenServers to empty their message queues and change status to ‘waiting’
We do the GenServer wait this way:
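# The timeout clause must come first; otherwise the general clause below
# always matches and the function never gives up.
defp wait_until_genserver_idle(_pid, timeout) when timeout <= 0 do
  {:error, :timeout}
end

defp wait_until_genserver_idle(pid, timeout) do
  info = Process.info(pid)

  # A dead process (nil) or an idle one with an empty mailbox counts as done.
  if info == nil || (info[:status] == :waiting && info[:message_queue_len] == 0) do
    :ok
  else
    :timer.sleep(10)
    wait_until_genserver_idle(pid, timeout - 10)
  end
end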
This works with GenServers and relatives (Supervisors, GenStage etc.) and Tasks, but also with normally spawned processes. I am fairly happy with the solution, but I wonder if it can be improved?
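The Task-waiting step is not shown above. One possible way to do it without polling is to monitor each pid and wait for its `:DOWN` message. This is a sketch only; the module and function names (`TaskWaitSketch`, `wait_for_exit/2`) are hypothetical and it assumes the list of pids has already been collected from the Registry.

```elixir
defmodule TaskWaitSketch do
  # Waits until every pid in `pids` has exited, or until `timeout`
  # milliseconds have passed in total.
  def wait_for_exit(pids, timeout \\ 5_000) do
    deadline = System.monotonic_time(:millisecond) + timeout
    refs = Enum.map(pids, &Process.monitor/1)
    await_down(refs, deadline)
  end

  defp await_down([], _deadline), do: :ok

  defp await_down([ref | rest], deadline) do
    remaining = max(deadline - System.monotonic_time(:millisecond), 0)

    receive do
      # Monitoring an already dead pid delivers :DOWN immediately.
      {:DOWN, ^ref, :process, _pid, _reason} -> await_down(rest, deadline)
    after
      remaining ->
        # Clean up the monitors that are still pending.
        Enum.each([ref | rest], &Process.demonitor(&1, [:flush]))
        {:error, :timeout}
    end
  end
end
```

Compared with a sleep-and-check loop, the monitor approach returns as soon as the last process exits, but it only covers processes that terminate; long-lived GenServers still need an idle check like the one above.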
The problems I have
- I particularly don’t like the need to add test-specific code to my Tasks and GenServers; however, this macro is a no-op when Mix.env() != :test.
- The other thing I dislike is that I have to wrap my code in a function. I would prefer to install an on_exit handler, but it seems I can’t easily do that, because by the time the on_exit handler is executed, the original test process is already dead and the transaction is already being rolled back by Ecto.Adapters.SQL.Sandbox.
I don’t think I’m the only one having this issue, so I wonder how you, Elixir people, deal with similar issues, or whether you have ideas on how to solve the two problems above?