How to debug these random test failures?

I have a fairly large test suite: about 1800 tests, mostly async.

Every once in a while, the same two tests will fail with the message:

** (EXIT from #PID<0.6343.0>) killed

Sometimes a test run finishes with 0 failures but still prints this error message:

11:14:49.862 [error] Postgrex.Protocol (#PID<0.1534.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.21840.0> exited

No other info. Googling this yields no results. My best guess is that the DB connection is closed somehow. My questions are:

  • How do I begin debugging this? Where do I look?
  • What can cause this?
  • How can I reproduce it consistently?

The tests themselves are not particularly long or interesting. One of them just inserts a database record yet still fails randomly.

Thank you for the help

Do any of the tests (not necessarily the ones that are failing) start long-running processes? One thing that can produce this sort of message is a process that’s linked to the test process but outlives it.
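One way to keep such a process from outliving its test is to start it under the test supervisor with ExUnit's start_supervised!/2, which guarantees it is stopped before the next test starts. A minimal sketch, assuming the process can be expressed as a child spec; the Task here is only a hypothetical stand-in for whatever your tests actually start:

```elixir
# Standalone sketch; run with `elixir supervised_sketch.exs`.
ExUnit.start()

defmodule LongRunningProcessTest do
  use ExUnit.Case, async: true

  test "long-running work is owned by the test supervisor" do
    test_pid = self()

    # Hypothetical long-running process. It is started under the test
    # supervisor rather than linked directly to the test process.
    pid =
      start_supervised!(
        {Task,
         fn ->
           send(test_pid, :started)
           Process.sleep(:infinity)
         end}
      )

    assert_receive :started
    assert Process.alive?(pid)
    # No manual cleanup: ExUnit stops the process once the test is done.
  end
end
```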

There are a decent number of LiveView tests that are fairly long-lived. On one random run, the disconnection error came with a longer message:

Client #PID<0.34370.0> is still using a connection from owner at location:

That stacktrace points to a LiveComponent that starts a Task, which appears to kick off some DB queries.

I’m thinking the long-lived tests are the cause.
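If that is the cause, one possible fix is to make the test wait for the background Task before it exits, so the Task is not still holding a connection when the test process goes away. A rough sketch, assuming the Task runs under a Task.Supervisor the test can reach; drain_tasks/2 is a hypothetical helper, and the Task body is a stand-in for the real queries:

```elixir
# Standalone sketch; run with `elixir drain_tasks_sketch.exs`.
ExUnit.start()

defmodule DrainTasksTest do
  use ExUnit.Case, async: true

  # Hypothetical helper: block until every task under `sup` has exited.
  defp drain_tasks(sup, timeout \\ 1_000) do
    for pid <- Task.Supervisor.children(sup) do
      ref = Process.monitor(pid)
      # A :DOWN message arrives even if the task has already finished.
      assert_receive {:DOWN, ^ref, :process, _pid, _reason}, timeout
    end

    :ok
  end

  test "waits for background tasks before the test exits" do
    sup = start_supervised!(Task.Supervisor)
    test_pid = self()

    # Stand-in for the Task the LiveComponent starts.
    {:ok, _pid} =
      Task.Supervisor.start_child(sup, fn ->
        Process.sleep(50)
        send(test_pid, :queries_done)
      end)

    assert drain_tasks(sup) == :ok
    assert_received :queries_done
  end
end
```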

To reproduce the failure, rerun the suite with the --seed option, passing the seed number printed by the failed test run.
If the failing tests have async: true, try setting it to false.

I don’t know how current this article is, but I’d try enabling SASL logging temporarily: Gaining Insight into an Elixir Application with SASL

That should give you some insight into which processes start and when. Could be useful.
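For reference, turning this on is usually a one-line Logger setting. A config fragment, assuming a standard Mix project with a config/test.exs file; check the Logger docs for your Elixir version before relying on it:

```elixir
# config/test.exs (assumed location)
import Config

# Forward SASL reports (supervisor and crash reports) to Logger.
# Intended as a temporary debugging aid; the output is verbose.
config :logger, handle_sasl_reports: true
```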

And yeah, use a fixed seed for a while, until you sort out this problem, e.g. mix test --seed 0. A seed of 0 disables randomization, so tests run in the order they are defined every time, which can help you pinpoint the problem.
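Put together, the fixed seed and the async: false suggestion look roughly like this. A standalone sketch only: SuspectTest is a hypothetical module name, and in a Mix project you would pass the seed on the command line rather than to ExUnit.start/1:

```elixir
# Standalone sketch; run with `elixir fixed_seed_sketch.exs`.
# In a Mix project the equivalent is: mix test --seed 0
ExUnit.start(seed: 0)

defmodule SuspectTest do
  # async: false stops this module from running concurrently with
  # other test modules, which removes one source of interference.
  use ExUnit.Case, async: false

  test "inserts a record" do
    # Placeholder assertion; the real test would insert a database record.
    assert 1 + 1 == 2
  end
end
```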