Postgrex connection timeout error in Phoenix app

Hi everyone,

I’m running into a database connection issue in my Phoenix app, and the error reported in Sentry doesn’t give me any useful stacktrace beyond this:

    Postgrex.Protocol (#PID<0.4692.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1503867.0> timed out because it queued and checked out the connection for longer than 30000ms

    #PID<0.1503867.0> was at location:

    :prim_inet.recv0/3
    (postgrex 0.21.1) lib/postgrex/protocol.ex:3261: Postgrex.Protocol.msg_recv/4
    (postgrex 0.21.1) lib/postgrex/protocol.ex:2263: Postgrex.Protocol.recv_bind/3
    (postgrex 0.21.1) lib/postgrex/protocol.ex:2155: Postgrex.Protocol.bind_execute/4
    (db_connection 2.8.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.8.0) lib/db_connection.ex:1555: DBConnection.run_execute/5
    (db_connection 2.8.0) lib/db_connection.ex:826: DBConnection.execute/4
    (ecto_sql 3.12.1) lib/ecto/adapters/postgres/connection.ex:139: Ecto.Adapters.Postgres.Connection.execute/4

Since this is all I see in Sentry, I don’t know how to trace back which query or part of the code triggered the issue.

Has anyone dealt with this? How can I get more context or a clearer stacktrace when this kind of timeout happens?

Are there any other errors? This one likely caused a crash somewhere else, which should have a different stacktrace and could point to the problematic query.

There are indeed other errors, but I’m not sure if they are related. However, their counts don’t match the number of occurrences of this one.

Suspects could be:

* `lib/ecto/adapters/sql.ex` in `Ecto.Adapters.SQL.raise_sql_call_error/1` at line `1096`

* `lib/ecto/adapters/sql.ex` in `Ecto.Adapters.SQL.execute/6` at line `994`

* `lib/ecto/repo/queryable.ex` in `Ecto.Repo.Queryable.execute/4` at line `232`

* `lib/oban/engines/basic.ex` in `Oban.Engines.Basic.prune_jobs/3` at line `175`

* `lib/oban/engine.ex` in `anonymous fn/4 in Oban.Engine.prune_jobs/3` at line `272`

* `lib/oban/engine.ex` in `anonymous fn/3 in Oban.Engine.with_span/4` at line `387`

* `/app/deps/telemetry/src/telemetry.erl` in `:telemetry.span/3` at line `324`

* `lib/oban/plugins/pruner.ex` in `anonymous fn/1 in Oban.Plugins.Pruner.check_leadership_and_delete_jobs/1` at line `150`

I would check your pool_size in your config - are you using the default of 10? Is that appropriate for what your app is doing?


We have pool_size: 30 for the Oban config; we have actually increased it, but we still get the same result:


    config :app, App.Repo,
      socket_options: maybe_ipv6,
      url: database_url,
      pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10"),
      queue_target: String.to_integer(System.get_env("QUEUE_TARGET") || "1000"),
      queue_interval: String.to_integer(System.get_env("QUEUE_TIMEOUT") || "10000"),
      timeout: String.to_integer(System.get_env("TIMEOUT") || "20000")

    config :app, App.RepoOban,
      socket_options: maybe_ipv6,
      url: database_url,
      pool_size: String.to_integer(System.get_env("OBAN_POOL_SIZE") || "30"),
      queue_target: String.to_integer(System.get_env("OBAN_QUEUE_TARGET") || "1000"),
      queue_interval: String.to_integer(System.get_env("OBAN_QUEUE_TIMEOUT") || "15000"),
      timeout: String.to_integer(System.get_env("OBAN_TIMEOUT") || "30000")

Perhaps the Oban pruner took more than 30s to prune jobs? If you export Oban metrics somewhere, you can check how long prune_jobs usually takes and whether there were any spikes and/or exceptions.
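
If those metrics aren’t exported anywhere yet, here is a minimal sketch of a handler that just logs every plugin run, assuming the [:oban, :plugin, :stop | :exception] events documented in Oban.Telemetry (the module name and handler id are made up; call App.ObanPluginLogger.attach() once during application start):

    defmodule App.ObanPluginLogger do
      require Logger

      def attach do
        :telemetry.attach_many(
          "oban-plugin-logger",
          [[:oban, :plugin, :stop], [:oban, :plugin, :exception]],
          &__MODULE__.handle_event/4,
          nil
        )
      end

      # Logs how long each plugin run took, e.g. Oban.Plugins.Pruner
      def handle_event([:oban, :plugin, :stop], %{duration: duration}, %{plugin: plugin}, _cfg) do
        Logger.info("#{inspect(plugin)} finished in #{to_ms(duration)}ms")
      end

      # Logs plugin runs that raised, with the reason from the event metadata
      def handle_event([:oban, :plugin, :exception], %{duration: duration}, meta, _cfg) do
        Logger.error("#{inspect(meta.plugin)} failed after #{to_ms(duration)}ms: #{inspect(meta[:reason])}")
      end

      defp to_ms(native), do: System.convert_time_unit(native, :native, :millisecond)
    end

Pruner runs show up with the plugin set to Oban.Plugins.Pruner, so any spike past the 30s mark should line up with these disconnects.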


The issue is entirely because of slow queries causing connection timeouts. For the pruner timeout, that may be from a very large backlog of jobs (if you just added pruning recently, or it’s not able to keep up). The pruner only deletes 10k jobs at a time every 1m though, so it’s minimal load on the database.
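
Those knobs are configurable on the plugin if they turn out to be too aggressive or too slow for your backlog. A rough sketch, assuming Oban is configured under config :app, Oban with App.RepoOban as its repo (the values only mirror the "10k per run, every minute" behaviour described above; check the Oban.Plugins.Pruner docs for your version before relying on them):

    config :app, Oban,
      repo: App.RepoOban,
      plugins: [
        # limit caps how many jobs get deleted per run, interval is in ms,
        # max_age (seconds) is how old a finished job must be before it is
        # eligible for pruning
        {Oban.Plugins.Pruner, limit: 10_000, interval: :timer.minutes(1), max_age: 60}
      ]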

You can use telemetry (Oban.Telemetry — Oban v2.20.1) to measure how long prune_jobs takes. However, a query timeout there indicates that something else is probably going on:

  1. Is the database busy elsewhere? Is it under-provisioned for RAM/CPU for your workload?
  2. Do you have a high number of cancelled/discarded jobs? If so, the v13 migration in Oban v2.20 would help with the pruning queries (a quick check and a migration sketch follow below).
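
For point 2, a quick way to gauge the backlog from iex, assuming Oban writes to App.RepoOban:

    import Ecto.Query

    # How many cancelled/discarded jobs are sitting in oban_jobs
    App.RepoOban.aggregate(
      from(j in Oban.Job, where: j.state in ["cancelled", "discarded"]),
      :count
    )

If that number is large, the v13 bump could look like this sketch (the migration module name is made up, and you would run it against whichever repo owns the oban_jobs table; the up/down pattern is the usual one from Oban’s upgrade guides):

    defmodule App.RepoOban.Migrations.UpdateObanJobsToV13 do
      use Ecto.Migration

      def up, do: Oban.Migrations.up(version: 13)
      def down, do: Oban.Migrations.down(version: 12)
    end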