Help with DBConnection.Connection error timeouts

landric · March 26, 2020, 7:02pm

Hello!

I’m currently running into repeated DBConnection.ConnectionErrors in Production, and I could use some help figuring out where to aim my troubleshooting. According to my Sentry error tracking, nearly all of the messages are…

(DBConnection.ConnectionError) connection not available and request was dropped from queue after 1001ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information

…however the “dropped from queue” time varies between instances.

In mix.exs I’ve got…

defmodule MyApp.Mixfile do

  # Lines omitted

  defp deps do
    [
      # Dependencies I think probably aren't relevant omitted
      {:absinthe_ecto, "~> 0.1.3"},
      {:ecto, "~> 3.1.1"},
      {:ecto_sql, "~> 3.1.0"},
      {:phoenix, "~> 1.4.0"},
      {:phoenix_ecto, "~> 4.0"},
      {:rihanna, "~>1.3.5"},
      {:timber_ecto, "~> 2.0"}
    ]
  end
end

In config/confix.exs I’ve got…

use Mix.Config

# Some config omitted

config :my_app,
  ecto_repos: [MyApp.Repo]

config :my_app, MyApp,
  migration_timestamps: [type: :utc_datetime],
  pool_size: 10

import_config "#{Mix.env}.exs"

…and in config/prod.exs I’ve got…

use Mix.Config

# Some config omitted

config :cardstock, MyApp.Repo,
  ssl: true,
  prepare: :unnamed

config :rihanna,
  debug: true,
  dispatcher_max_concurrency: 10,
  producer_postgres_connection: {Ecto, MyApp.Repo}

import_config "prod.secret.exs"

Here are some avenues I’ve already explored to no avail…

I realize I’m using an older version of ecto_sql (3.1.6 according to Mix.lock, whereas the most recent release is 3.4.1). In the changelogs for ecto_sql I don’t see any specific changes around timeouts / config.
From the DevOps side, my database is using DigitalOcean hosted Postgres, and I don’t suspect that’s my bottleneck. I’m currently showing ~37 active connections against a PgBouncer pool size of 70. (I have three libcluster-ed Elixir nodes). However, I’m definitely not an expert on DevOps, so I would love to hear if you think my provisioning is off.

My current hunch is that this is not a resource issue but a config issue, and that ultimately there will be one line of numbskullery that needs to get fixed.

I would really appreciate some wisdom here. I’m a solopreneur and my rubber duck is out of ideas.

Thanks!

benwilson512 · March 26, 2020, 10:25pm

A pool size of 10 for your production instances is pretty small, I’d probably double that. The error you’re seeing is relatively simple. You have a pool of 10 within your application, and a process had to wait over 1 second to even get a connection, which meant all 10 were busy for that full second. This tells me two things:

You probably have too few connections and
you may be holding on to connections too long, I’d check your db query metrics.