Strange DBConnection.ConnectionError with Ecto 3.0.6

axelson · January 31, 2019, 4:02am

I’m not sure how to interpret the following error. What is the relationship of PID 0.553.0 to 0.2195.0? It seems that 0.553.0 may be the test process in which case the error doesn’t make sense to me because (as I’m interpreting it) the client and the owner are the same process.

03:50:50.547 [error] Postgrex.Protocol (#PID<0.553.0>) disconnected: ** (DBConnection.ConnectionError) owner #PID<0.2195.0> checked in the connection

Client #PID<0.2195.0> is still using a connection from owner at location:

    :prim_inet.recv0/3
    (postgrex) lib/postgrex/protocol.ex:2701: Postgrex.Protocol.msg_recv/4
    (postgrex) lib/postgrex/protocol.ex:2454: Postgrex.Protocol.recv_close/3
    (postgrex) lib/postgrex/protocol.ex:1169: Postgrex.Protocol.close_parse_describe_flush/3
    (postgrex) lib/postgrex/protocol.ex:1587: Postgrex.Protocol.handle_prepare_execute/4
    (ecto_sql) lib/ecto/adapters/sql/sandbox.ex:375: Ecto.Adapters.SQL.Sandbox.Connection.proxy/3
    (db_connection) lib/db_connection/holder.ex:268: DBConnection.Holder.holder_apply/4
    (db_connection) lib/db_connection.ex:1189: DBConnection.run_execute/5
    (db_connection) lib/db_connection.ex:1276: DBConnection.run/6
    (db_connection) lib/db_connection.ex:550: DBConnection.execute/4
    (ecto_sql) lib/ecto/adapters/postgres/connection.ex:80: Ecto.Adapters.Postgres.Connection.execute/4
    (ecto_sql) lib/ecto/adapters/sql.ex:571: Ecto.Adapters.SQL.execute!/4
    (ecto_sql) lib/ecto/adapters/sql.ex:553: Ecto.Adapters.SQL.execute/5
    (ecto) lib/ecto/repo/queryable.ex:147: Ecto.Repo.Queryable.execute/4
    (ecto) lib/ecto/repo/queryable.ex:18: Ecto.Repo.Queryable.all/3
    test/hobnob_messages/queries/mentions_query_test.exs:101: HobnobMessages.Queries.MentionsQueryTest."test unread_in_threads_for_user/1 includes messages in threads when the user has no thread member"/1
    (ex_unit) lib/ex_unit/runner.ex:312: ExUnit.Runner.exec_test/1
    (stdlib) timer.erl:166: :timer.tc/1
    (ex_unit) lib/ex_unit/runner.ex:251: anonymous fn/4 in ExUnit.Runner.spawn_test/3

The connection itself was checked out by #PID<0.2195.0> at location:

    (ecto_sql) lib/ecto/adapters/postgres/connection.ex:80: Ecto.Adapters.Postgres.Connection.execute/4
    (ecto_sql) lib/ecto/adapters/sql.ex:571: Ecto.Adapters.SQL.execute!/4
    (ecto_sql) lib/ecto/adapters/sql.ex:553: Ecto.Adapters.SQL.execute/5
    (ecto) lib/ecto/repo/queryable.ex:147: Ecto.Repo.Queryable.execute/4
    (ecto) lib/ecto/repo/queryable.ex:18: Ecto.Repo.Queryable.all/3
    test/hobnob_messages/queries/mentions_query_test.exs:101: HobnobMessages.Queries.MentionsQueryTest."test unread_in_threads_for_user/1 includes messages in threads when the user has no thread member"/1
    (ex_unit) lib/ex_unit/runner.ex:312: ExUnit.Runner.exec_test/1
    (stdlib) timer.erl:166: :timer.tc/1
    (ex_unit) lib/ex_unit/runner.ex:251: anonymous fn/4 in ExUnit.Runner.spawn_test/3

The failure is intermittent but happens quite often on a full test run. I don’t get the same error on Ecto 2.2.11.

Environment:

ecto: 3.0.6
ecto_sql: 3.0.4
postgrex: 0.14.1
Elixir: 1.7.4
Erlang: 21.0.4
Mac OS X: 10.13.6

dimitarvp · February 1, 2019, 9:02pm

Blind shot in the dark but lately we had an intermittent error during tests (not the same as yours though) and we found that upgrading Erlang/OTP eliminated it. Can you try?

axelson · February 2, 2019, 2:00am

No luck, still getting the same error on Elixir 1.8.1 and Erlang 21.2.4

axelson · February 20, 2019, 1:39am

Does anyone have any ideas on the error?

The fact that these are all the same PIDs is what boggles me:

(DBConnection.ConnectionError) owner #PID<0.2195.0> checked in the connection

Client #PID<0.2195.0> is still using a connection from owner ...

The connection itself was checked out by #PID<0.2195.0>

josevalim · February 20, 2019, 6:18am

Yup, it looks like a DBConnection bug. What is your DBConnection version?

josevalim · February 20, 2019, 6:27am

The only scenario I can see it happening is if you have your tests as async but you are changing the mode of the sandbox. Or you have another process changing the sandbox mode while this test runs. Can you please get all of the places you call Ecto.Adapters.SQL.Sandbox and put the code snippets here?

axelson · February 20, 2019, 8:00pm

Yes!! That was exactly it. Thanks for the tip. I had a 1.5 year-old test that looked like this:

    test "it sends a notification to the user channel if they are a member that is not present",
         %{user: user, room: room} do
      # This test uses another process
      Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})
      
      # rest of the test
    end

I think my thinking was that I wanted just this test to use the shared mode of the sandbox and since every test runs in a separate process that Sandbox.mode would set the mode for that specific process (since you’re passing the pid in).

From the docs it seemed okay:

iex(3)> h Ecto.Adapters.SQL.Sandbox.mode

                              def mode(repo, mode)

Sets the mode for the repo pool.

The mode can be :auto, :manual or {:shared, <pid>}.

Although I guess I should have realized that the “repo pool” is a shared resource across multiple tests.

But for my understanding, this is what I think was happening:

Multiple tests are running concurrently (since we’re using async: true)
This test called Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})
When the DBConnection.Ownership.Manager GenServer received the mode change request it checked in all of the existing transaction checkouts, including for tests that were still running!
The tests that were still running tried to continue to use their connection, but the Manager GenServer had already checked them in resulting in the warning at the top of this thread.
The tests that were still running would then check out a new connection, and since this new connection was running in a new db transaction, any database entries inserted in the previous connection no longer exist which caused related test failures such as foreign key exceptions

I plan to create a doc PR to add some sort of warning to the Ecto.Adapters.SQL.Sandbox.mode function, unless this can be considered a true db_connection bug (I’m using db_connection 2.0.5). It would be nice if Ecto.Adapters.SQL.Sandbox.mode would eventually cause a warning if it forces the check-in of unrelated existing check-outs. Also this part of the original error message is not technically correct in this scenario:

(DBConnection.ConnectionError) owner #PID<0.2195.0> checked in the connection

Since #PID<0.2195.0> did not initiate the check-in (at least this is that is my current understanding). It looks like you’re in the process of adding an improvement for this case. That PR changes the error message to #PID<0.3100.0> checked in the connection owned by #PID<0.3096.0> which is a definite improvement since #PID<0.3100.0> was the process that was running the test that called Sandbox.mode.

Also due to the new $callers tracking in db_connection that test runs just fine in async mode without having to change the sandbox pool mode.

Thanks again for the help in tracking this down!

dimitarvp · February 21, 2019, 12:16am

It might be helpful if we have a function along the lines of Ecto.Adapters.SQL.Sandbox.moded_run function (akin to unboxed_run) that accepts a function and uses the specified mode only in it – if that’s at all doable of course. Since the mode is a shared resource I’m guessing my suggestion is unrealistic.

axelson · February 26, 2019, 2:21am

@josevalim should I file this as a bug on ecto_sql? Or is it an issue on db_connection?

josevalim · February 26, 2019, 11:51pm

Hi @axelson, there is no bug here. So the best would be a PR to improve the docs. We have already improved the error message on DBConnection master.

axelson · February 27, 2019, 6:59pm

Okay, I’ll prepare a docs PR

shijith.k · March 26, 2020, 12:00pm

I am using Task.async_stream function in one of my functions, which updates a list of datas in the DB. While running tests over this main function, it throws
Client #PID<0.773.0> is still using a connection from owner at location: this error. How can I fix this?

bnymn · April 9, 2020, 4:18am

In my case, I had the following line in both test_helper.exs and setup block.

Ecto.Adapters.SQL.Sandbox.mode(Blog.Repo, {:shared, self()})

Removing the line from test_helper.exs solved the problem.

SPrio-kt · June 25, 2020, 1:06pm

If you are writing test for a method where you have written a Task.async call, you can use :timer.sleep(<a minimum amount of time in milisecond>) would help. As it is not getting a result which is being done by another process, so wait a bit to let it complete.

dimitarvp · June 25, 2020, 2:52pm

Or the async Task can just send a message to the parent process. That makes it both more stable and predictable in the original code and more testable as well.

amos-kibet · August 29, 2023, 8:15am

Can someone give an example of sending a message to the parent process? I managed to use :timer.sleep/1 but it sounds like a code smell to me, especially because I have several such tests in one file.

bolducp · December 22, 2023, 8:49pm

@amos-kibet / @dimitarvp
Did you ever get an example of the more recommended approach here?

dimitarvp · December 22, 2023, 9:15pm

Show us your code, I’ll give you a recommendation.

bolducp · December 26, 2023, 1:11pm

Thanks so much, @dimitarvp!

So here’s roughly what the code looks like (just changed the function, module, and argument names) :

defp maybe_publish_subscription_updates(a_group, b_with_both_group) do
  # We aren't using the return values here, or handling any errors from the
  # graphql subscription publication calls, so we should just start the tasks
  # and not await them.
  {:ok, _pid} =
    Task.start(fn ->
      Calculations.maybe_publish_a_group_change(a_group)
    end)

  {:ok, _pid} =
    Task.start(fn ->
      Calculations.maybe_publish_b_group_change(b_group)
    end)

  :ok
end

And then I have a test that tests some code that calls that maybe_publish_subscription_updates/2 function. When I run that test, I get the error about Client #PID<0.2195.0> is still using a connection from owner at location... because, I understand, the test process has completed but the side effect function is still running.

I’ve fixed this by adding Process.sleep(50) calls to the tests for functions that call this side effect function. But that feels brittle and non-ideal. Do you know of a better way?

dimitarvp · December 26, 2023, 3:36pm

So what are these two functions doing? Are they using DB connections?