How to investigate DB errors like "client timed out because it queued and checked out the connection for longer than..."

Hi,

We’re starting to collect many of these errors in our logs and I am trying to understand how we can investigate them further:

Postgrex.Protocol (#PID<0.1264.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1555.0> timed out because it queued and checked out the connection for longer than 15000ms

I have read this topic and I understand these are caused by slow queries.
Now I am wondering, can Postgrex/Ecto be configured to log the slow query together with the above error?

I know I can find slow queries in a different way (e.g. with Postgres’pg_stat_statements extension), but it would be extremely useful if I could get some indication of what DB operation caused the timeout detailed in the error log.

Is that error log about a DB connection pool process crashing due to a timeout? Will it cause the caller process to crash as well? Maybe there’s a way to trap the timeout and log it somehow?

Thanks for your help!

Yes and yes. But given this is a timeout for the pooled connection this means it’s not bound to a particular query. You can e.g. do Repo.checkout(fn -> Process.sleep(:infinity) end) and you’d get that error without any query involved.

1 Like

I see, thanks.

So there’s no way to know the exact operation that made a DB pool process timeout?

I now realize the full error message logs the location at which the caller process was when the pool process died:

Postgrex.Protocol (#PID<0.1209.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1628.0> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.1628.0> was at location:

    :prim_inet.recv0/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:3171: Postgrex.Protocol.msg_recv/4
    (postgrex 0.16.5) lib/postgrex/protocol.ex:2201: Postgrex.Protocol.recv_bind/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:2093: Postgrex.Protocol.bind_execute/4
    (db_connection 2.4.3) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.4.3) lib/db_connection.ex:1413: DBConnection.run_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:1508: DBConnection.run/6
    (db_connection 2.4.3) lib/db_connection.ex:701: DBConnection.execute/4

Unfortunately that still doesn’t tell me much about what operation caused the pool process to timeout.

IMO, you have to diagnose this problem from the slow query.

Edit: checkout the below explaination of @LostKobrakai.


If you guess that the root cause is slow queries. You can find the slow queries by using the tools provided by PostgreSQL (which you said above). Or, you can use Ecto directly. Here’s how:

Ecto supports metrics which are reported by :telemetry.
You can log slow queries like this - Output slow Ecto queries to logs · GitHub.

(The gist is picked randomly from the Internet…

1 Like

Not sure you understood my comments correctly. You can get results by looking for slow queries, but a pool timeout is not guaranteed to come from a slow query. It could just as much be from a lot of fast queries accumulating their time needed to complete – or to an extreme by not running any query at all, but just leaving the connection checked out for longer than the timeout.

The better way to diagnose this is to see if the stacktrace printed can be extended to get out of general code paths and into ones which are useful for debugging.

4 Likes

Thank you for the explanation, now I understand. :smiley:

Thanks @LostKobrakai it’s much clearer now.

I was doing some tests from iex and I noticed that the caller process doesn’t get killed when the pool timeout occurs (the iex process has the same PID before and after the timeout). Interesting enough, the variable result isn’t bound to anything when the query timeouts.

How is the caller process expected to handle DB timeouts? I was expecting it to die on the timeout or at least for the result variable to be {:error, timeout} or similar.

iex(32)> self()
#PID<0.1357.0>

iex(33)> result = MyApp.Repo.all(MyApp.MySchema, timeout: 1)

[error] Postgrex.Protocol (#PID<0.1120.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1357.0> timed out because it queued and checked out the connection for longer than 1ms

#PID<0.1357.0> was at location:

    :prim_inet.recv0/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:3171: Postgrex.Protocol.msg_recv/4
    (postgrex 0.16.5) lib/postgrex/protocol.ex:2893: Postgrex.Protocol.recv_close/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:1460: Postgrex.Protocol.close_parse_describe_flush/3
    (db_connection 2.4.3) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.4.3) lib/db_connection.ex:1398: DBConnection.prepare/4
    (db_connection 2.4.3) lib/db_connection.ex:1391: DBConnection.run_prepare/4
    (db_connection 2.4.3) lib/db_connection.ex:1403: DBConnection.run_prepare_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:1508: DBConnection.run/6
    (db_connection 2.4.3) lib/db_connection.ex:644: DBConnection.parsed_prepare_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:636: DBConnection.prepare_execute/4
    (ecto_sql 3.9.2) lib/ecto/adapters/postgres/connection.ex:70: Ecto.Adapters.Postgres.Connection.prepare_execute/5
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:836: Ecto.Adapters.SQL.execute!/5
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    (elixir 1.14.0) src/elixir.erl:298: anonymous fn/4 in :elixir.eval_external_handler/1
    (stdlib 4.0.1) erl_eval.erl:748: :erl_eval.do_apply/7
    (stdlib 4.0.1) erl_eval.erl:492: :erl_eval.expr/6
    (elixir 1.14.0) src/elixir.erl:288: :elixir.eval_forms/3
    (elixir 1.14.0) lib/module/parallel_checker.ex:100: Module.ParallelChecker.verify/1
    (iex 1.14.0) lib/iex/evaluator.ex:329: IEx.Evaluator.eval_and_inspect/3
    (iex 1.14.0) lib/iex/evaluator.ex:303: IEx.Evaluator.eval_and_inspect_parsed/3
    (iex 1.14.0) lib/iex/evaluator.ex:292: IEx.Evaluator.parse_eval_inspect/3
    (iex 1.14.0) lib/iex/evaluator.ex:187: IEx.Evaluator.loop/1
    (iex 1.14.0) lib/iex/evaluator.ex:32: IEx.Evaluator.init/4
    (stdlib 4.0.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

** (DBConnection.ConnectionError) tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:913: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    iex:33: (file)

iex(33)> self()                                                
#PID<0.1357.0>

iex(34)> result
warning: variable "result" does not exist and is being expanded to "result()", please use parentheses to remove the ambiguity or change the variable name

Thanks again for your help!

IEx has some special ways to deal with exceptions. Generally the caller crashes.

1 Like

Moving away from Elixir here, you could also enable logs on Postgres and monitor them as you play around with the suggestions above. I have some notes about this here. Please post your findings!

2 Likes

Ah thank you so much, that explains it! I was getting confused by this.