Intermittent DBConnection.ConnectionError in tests

When running my tests, I sometimes see an error that looks like

[error] Postgrex.Protocol (#PID<0.788.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.7569.0> exited

Notable details:

  • It doesn’t always happen
  • When it does happen, it’s not always in the same test or test file
  • It doesn’t actually cause any tests to fail; it just shows up in the middle of the test output, as shown above

I’ve been ignoring it up to this point because the test suite still passes and it hasn’t caused any development or deployment bottlenecks, but I’d like to figure out why it’s happening and what I might be able to do to stop it.

Version details, in case it matters:

  • Postgres v10.19 via the postgres:10-alpine Docker image
  • postgrex v0.16.1
  • ecto v3.7.1
  • ecto_sql v3.7.2
  • ecto_psql_extras v0.7.4

Any thoughts?

1 Like

The usual cause of that message is code that’s still waiting for the result of a Repo call when the test process shuts down.
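A minimal sketch of that situation in a non-async test, i.e. with the sandbox in shared mode (MyApp.Repo and MyApp.User are hypothetical names):

test "spawned work outlives the test" do
  spawn(fn ->
    # Still in flight when the test process exits, so the sandbox
    # connection is closed underneath it and the error above is logged.
    Process.sleep(100)
    MyApp.Repo.all(MyApp.User)
  end)

  # The test returns immediately, orphaning the spawned process.
  assert true
end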

3 Likes

Is there a particular strategy I can use to keep the test process alive until the Repo call completes? I’m not doing anything particularly unusual in any tests, though I think some of my on_exit teardown code might involve some database cleanup. When testing code that relies on DB triggers, I have to run the tests unsandboxed or the triggers won’t fire. So those particular tests get cleaned up manually.
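For context, the unsandboxed trigger tests look roughly like this (MyApp.Repo and MyApp.Widget are placeholder names, not my real modules):

setup do
  # Run outside the sandbox so the database triggers actually fire.
  Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, :auto)

  on_exit(fn ->
    # Nothing rolls back automatically in :auto mode, so clean up by
    # hand and restore :manual mode for the rest of the suite.
    MyApp.Repo.delete_all(MyApp.Widget)
    Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, :manual)
  end)

  :ok
end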

1 Like

I think we’re having the same issue and we haven’t nailed the root cause either. Things I’d look into:

Could it be a Task that gets spawned somewhere?

Did anyone figure out a fix for this? I’m having the same issue with my tests and since the errors appear to be fairly random, I have no idea how to even begin fixing them.

Still haven’t figured it out. Other priorities have prevented me from spending much time on it. I don’t think it’s a Task. It might be a GenServer that gets started by the test setup, but I do have code to stop it in the on_exit. So for now I’m still stumped.

Just to chime in: we seem to have the exact same intermittent errors.

We discovered they were MUCH worse when running the Elixir process locally but the database in Docker on M1 Macs. Our working theory is that the network connection in Docker for Mac is quite slow.

The error also happens a lot more, along with lots of intermittent failures, when running in GitHub CI, where the instances are fairly resource-constrained.

So I’m wondering if it’s something to do with running lots of DB-intensive work under some form of resource bottleneck, e.g. on GitHub Actions there are only 2 cores available for the database and the parallel tests combined.

2 Likes

Heya,

Not sure if you’re still having the issue…

The error means that a DB query is still live when the test is torn down.

In our case it was due to some of our LiveView tests clicking a submit button that caused a push_patch to happen in a live_component rather than in the parent LiveView.

For some reason LiveViewTest will wait for the patch before returning if it happens in the parent process, but not if it happens in a child component.

You can force it to wait by calling assert_patch, or any of the other functions that take the view object.
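A sketch of the workaround (the route and form selector are made up):

{:ok, view, _html} = live(conn, "/items")

view
|> form("#item-form", item: %{name: "widget"})
|> render_submit()

# Blocks until the child component's push_patch has been applied, so no
# query is still in flight when the test process exits.
assert_patch(view, "/items/new")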

Not sure if that helps you, but I thought I’d add it, as it was a little confusing.

I raised an issue here, as the behaviour feels inconsistent: LiveviewTest render_submit behaves differently if form is in parent liveview or child live_component · Issue #2579 · phoenixframework/phoenix_live_view · GitHub

3 Likes

Thanks for that info, @danturner. It turns out that all the tests exhibiting this behavior had one thing in common – they all started a GenServer (using start_supervised!) that was running a periodic query in the background. I added an on_exit handler to tell the GenServer to stop its polling behavior before the test continues with its teardown, and everything’s a-ok now. Thanks for the pointer!
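Roughly, the fix looks like this (PollingWorker and its :pause API are stand-ins for my real module):

setup do
  pid = start_supervised!(MyApp.PollingWorker)

  # Stop the periodic queries before the rest of the teardown runs, so
  # the worker isn't mid-query when the sandbox connection is closed.
  on_exit(fn -> GenServer.call(pid, :pause) end)

  %{worker: pid}
end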

2 Likes

Any tips on how to debug this? I’m working on a codebase with ~5000 tests that shows this warning intermittently, and I would love to find which line triggers it, in order to fix it…

If you find yourself in a similar situation, here is what has worked for me so far:

  1. Run mix test --trace to see which test file has the test that triggers this error
  2. Put a setup block on this test file with on_exit(fn -> :timer.sleep(1000) end), as shown in the sketch after this list. This will give some time between test runs. Since on_exit runs in another process, the test process will be able to finish normally.
  3. Run mix test test/app_name/problematic_test.exs --trace
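For step 2, the setup block is just:

setup do
  # Runs in a separate process after the test exits; the delay holds up
  # the next test so the error surfaces right below the guilty one.
  on_exit(fn -> :timer.sleep(1000) end)
  :ok
end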

This will show the DBConnection error right below the problematic test.
Also, since this is an intermittent error, it’s useful to run the file multiple times with something like:

$ for i in {1..10}; do mix test test/app_name/problematic_test.exs --trace; done
1 Like

Did you manage to pinpoint the cause that way?

You can also use a fixed seed, so tests always run in the same sequence.
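For example:

$ mix test --seed 12345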

1 Like

Yes! It was a Phoenix.Channel that received a message and ran a DB query after the test process finished.

2 Likes

And sorry to ask, but how did you fix it?

Did you keep the waiting, or did you find a better way (like listening for the DB query to finish, or something similar)?

Thanks for your insights :wink:

I found that some of my tests had GenServers running (started in the test setup using start_supervised!/2), and the GenServer was outliving the test’s sandbox repo connection. I had to make sure I had an on_exit handler that stopped the GenServer from continuing to try to use the database before the rest of the test teardown could happen.

3 Likes

I finally found that my DataCase modules were still using a previous approach, with the following syntax:

setup tags do
  :ok = Ecto.Adapters.SQL.Sandbox.checkout(MyModule.Repo)

  unless tags[:async] do
    Ecto.Adapters.SQL.Sandbox.mode(MyModule.Repo, {:shared, self()})
  end

  :ok
end

and I replaced it with the new syntax:

setup tags do
  MyModule.DataCase.setup_sandbox(tags)
  :ok
end

def setup_sandbox(tags) do
  pid = Ecto.Adapters.SQL.Sandbox.start_owner!(MyModule.Repo, shared: not tags[:async])
  on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
end

and it seems to work a lot better :smile:

2 Likes

The new approach uses the on_exit functionality of ExUnit, which allows “nested” on_exit callbacks to complete before stopping the sandbox for a test. The old approach would monitor the test process itself and stop the sandbox as soon as the test process stopped, which is before on_exit callbacks execute. That’s why the new API is to be preferred, even generally, I’d say.
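A sketch of the difference, assuming the setup_sandbox helper above and a non-async test (MyModule.AuditLog is a hypothetical schema): since on_exit callbacks run in reverse registration order, a cleanup registered after setup_sandbox executes before stop_owner, while the connection is still open.

setup tags do
  MyModule.DataCase.setup_sandbox(tags)

  # Registered after stop_owner, so it runs before it: the sandbox
  # connection is still alive while this cleanup executes.
  on_exit(fn -> MyModule.Repo.delete_all(MyModule.AuditLog) end)

  :ok
end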

2 Likes

Thanks for that.
Using this approach, I’m still seeing a lot of these disconnected errors:

Postgrex.Protocol (#PID<0.610.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2513.0> exited

Did you face something similar?