Cannot acquire DBConnection after 1000 tests run during CircleCI tests while in :shared mode

Nitrodist · August 6, 2020, 3:36pm

We have our elixir project setup to run a circleci job where it invokes mix test. We are seeing recently and in increasing frequency this type of an error:

     19:57:51.320 [error] Task #PID<0.3093.0> started from #PID<0.3098.0> terminating
     ** (DBConnection.ConnectionError) could not checkout the connection owned by #PID<0.3072.0>. When using the sandbox, connections are shared, so this may imply another process is using a connection. Reason: connection not available and request was dropped from queue after 129ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information
         (ecto_sql 3.4.4) lib/ecto/adapters/sql.ex:593: Ecto.Adapters.SQL.raise_sql_call_error/1
         (ecto_sql 3.4.4) lib/ecto/adapters/sql.ex:526: Ecto.Adapters.SQL.execute/5
         (ecto 3.4.4) lib/ecto/repo/queryable.ex:192: Ecto.Repo.Queryable.execute/4
         (ecto 3.4.4) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
         (ecto 3.4.4) lib/ecto/repo/preloader.ex:203: Ecto.Repo.Preloader.fetch_query/8
         (ecto 3.4.4) lib/ecto/repo/preloader.ex:120: Ecto.Repo.Preloader.preload_assoc/10
         (elixir 1.10.3) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
         (elixir 1.10.3) lib/task/supervised.ex:35: Task.Supervised.reply/5
         (stdlib 3.12.1) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
     Function: &:erlang.apply/2
         Args: [#Function<8.71330160/1 in Ecto.Repo.Preloader.maybe_pmap/4>, [{{:assoc, %Ecto.Association.Has{cardinality: :many, defaults: [], field: :action_shootouts, on_cast: nil, on_delete: :nothing, on_replace: :delete, ordered: false, owner: TheScoreData.Schemas.HockeyBoxScore, owner_key: :id, queryable: TheScoreData.Schemas.HockeyActionShootout, related: TheScoreData.Schemas.HockeyActionShootout, related_key: :box_score_id, relationship: :child, unique: true, where: []}, {0, :box_score_id}}, nil, nil, []}]] error_logger=%{tag: :error_msg} report_cb=&Task.Supervised.format_report/1

Here is our setup code in our tests configuring the mode:

  # test/test_helper.exs
  ExUnit.start(capture_log: true)
  Ecto.Adapters.SQL.Sandbox.mode(Op.Repo, :manual)

  # test/support/data_case.ex
  setup tags do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(Op.Repo)

    unless tags[:async] do
      Ecto.Adapters.SQL.Sandbox.mode(Op.Repo, {:shared, self()})
    end
  end

We have a code smell in a test we recently integrated that passes after almost exactly 20 seconds each time, so we suspect we may have side-effects from another test impacting this one. We’re investigating that part of it.

I believe we are using the above configuration is the same you get with a new phoenix app and that is we created this app from, and I had a quick glance in a new phoenix app to confirm as well.

Another question, why the query timeout set to around 50ms? Is that… low? If there is a 50ms network interruption due to jitter/queued packets etc. stuff out of the control of the machine running the elixir code… would this error happen?

Nitrodist · August 11, 2020, 8:39pm

For those out there that run into this, we increased the time from 50ms to 1000ms and this problem went away.

diff --git config/releases.exs config/releases.exs
index c2bb2657f..217cc973a 100644
--- config/releases.exs
+++ config/releases.exs
@@ -13,10 +13,11 @@ config :op, Op.Repo,
   username: System.fetch_env!("MYSQL_USER"),
   password: System.fetch_env!("MYSQL_PASSWORD"),
   port: String.to_integer(System.fetch_env!("MYSQL_PORT")),
   pool_size: String.to_integer(System.fetch_env!("MYSQL_POOL_SIZE")),
   timeout: String.to_integer(System.fetch_env!("MYSQL_QUERY_TIMEOUT")),
+  queue_target: 1000,
   charset: "utf8mb4",
   prepare: :unnamed,
   log: :debug

dimitarvp · August 11, 2020, 9:37pm

I actually even increased it to 5000 in one of my last commercial Elixir projects. Made the app really lag-proof and very rarely somebody actually had to wait for more than 1000-2000 ms.