How to investigate DB errors like "client timed out because it queued and checked out the connection for longer than..."

fedme · March 2, 2023, 12:41pm

Hi,

We’re starting to collect many of these errors in our logs and I am trying to understand how we can investigate them further:

Postgrex.Protocol (#PID<0.1264.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1555.0> timed out because it queued and checked out the connection for longer than 15000ms

I have read this topic and I understand these are caused by slow queries.
Now I am wondering, can Postgrex/Ecto be configured to log the slow query together with the above error?

I know I can find slow queries in a different way (e.g. with Postgres’pg_stat_statements extension), but it would be extremely useful if I could get some indication of what DB operation caused the timeout detailed in the error log.

Is that error log about a DB connection pool process crashing due to a timeout? Will it cause the caller process to crash as well? Maybe there’s a way to trap the timeout and log it somehow?

Thanks for your help!

LostKobrakai · March 2, 2023, 12:43pm

Yes and yes. But given this is a timeout for the pooled connection this means it’s not bound to a particular query. You can e.g. do Repo.checkout(fn -> Process.sleep(:infinity) end) and you’d get that error without any query involved.

fedme · March 2, 2023, 12:49pm

I see, thanks.

So there’s no way to know the exact operation that made a DB pool process timeout?

fedme · March 2, 2023, 1:07pm

I now realize the full error message logs the location at which the caller process was when the pool process died:

Postgrex.Protocol (#PID<0.1209.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1628.0> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.1628.0> was at location:

    :prim_inet.recv0/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:3171: Postgrex.Protocol.msg_recv/4
    (postgrex 0.16.5) lib/postgrex/protocol.ex:2201: Postgrex.Protocol.recv_bind/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:2093: Postgrex.Protocol.bind_execute/4
    (db_connection 2.4.3) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.4.3) lib/db_connection.ex:1413: DBConnection.run_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:1508: DBConnection.run/6
    (db_connection 2.4.3) lib/db_connection.ex:701: DBConnection.execute/4

Unfortunately that still doesn’t tell me much about what operation caused the pool process to timeout.

c4710n · March 2, 2023, 1:19pm

~~IMO, you have to diagnose this problem from the slow query.~~

Edit: checkout the below explaination of @LostKobrakai.

If you guess that the root cause is slow queries. You can find the slow queries by using the tools provided by PostgreSQL (which you said above). Or, you can use Ecto directly. Here’s how:

Ecto supports metrics which are reported by :telemetry.
You can log slow queries like this - Output slow Ecto queries to logs · GitHub.

(The gist is picked randomly from the Internet…

LostKobrakai · March 2, 2023, 1:28pm

Not sure you understood my comments correctly. You can get results by looking for slow queries, but a pool timeout is not guaranteed to come from a slow query. It could just as much be from a lot of fast queries accumulating their time needed to complete – or to an extreme by not running any query at all, but just leaving the connection checked out for longer than the timeout.

The better way to diagnose this is to see if the stacktrace printed can be extended to get out of general code paths and into ones which are useful for debugging.

c4710n · March 2, 2023, 1:32pm

Thank you for the explanation, now I understand.

fedme · March 2, 2023, 2:31pm

Thanks @LostKobrakai it’s much clearer now.

I was doing some tests from iex and I noticed that the caller process doesn’t get killed when the pool timeout occurs (the iex process has the same PID before and after the timeout). Interesting enough, the variable result isn’t bound to anything when the query timeouts.

How is the caller process expected to handle DB timeouts? I was expecting it to die on the timeout or at least for the result variable to be {:error, timeout} or similar.

iex(32)> self()
#PID<0.1357.0>

iex(33)> result = MyApp.Repo.all(MyApp.MySchema, timeout: 1)

[error] Postgrex.Protocol (#PID<0.1120.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.1357.0> timed out because it queued and checked out the connection for longer than 1ms

#PID<0.1357.0> was at location:

    :prim_inet.recv0/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:3171: Postgrex.Protocol.msg_recv/4
    (postgrex 0.16.5) lib/postgrex/protocol.ex:2893: Postgrex.Protocol.recv_close/3
    (postgrex 0.16.5) lib/postgrex/protocol.ex:1460: Postgrex.Protocol.close_parse_describe_flush/3
    (db_connection 2.4.3) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.4.3) lib/db_connection.ex:1398: DBConnection.prepare/4
    (db_connection 2.4.3) lib/db_connection.ex:1391: DBConnection.run_prepare/4
    (db_connection 2.4.3) lib/db_connection.ex:1403: DBConnection.run_prepare_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:1508: DBConnection.run/6
    (db_connection 2.4.3) lib/db_connection.ex:644: DBConnection.parsed_prepare_execute/5
    (db_connection 2.4.3) lib/db_connection.ex:636: DBConnection.prepare_execute/4
    (ecto_sql 3.9.2) lib/ecto/adapters/postgres/connection.ex:70: Ecto.Adapters.Postgres.Connection.prepare_execute/5
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:836: Ecto.Adapters.SQL.execute!/5
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    (elixir 1.14.0) src/elixir.erl:298: anonymous fn/4 in :elixir.eval_external_handler/1
    (stdlib 4.0.1) erl_eval.erl:748: :erl_eval.do_apply/7
    (stdlib 4.0.1) erl_eval.erl:492: :erl_eval.expr/6
    (elixir 1.14.0) src/elixir.erl:288: :elixir.eval_forms/3
    (elixir 1.14.0) lib/module/parallel_checker.ex:100: Module.ParallelChecker.verify/1
    (iex 1.14.0) lib/iex/evaluator.ex:329: IEx.Evaluator.eval_and_inspect/3
    (iex 1.14.0) lib/iex/evaluator.ex:303: IEx.Evaluator.eval_and_inspect_parsed/3
    (iex 1.14.0) lib/iex/evaluator.ex:292: IEx.Evaluator.parse_eval_inspect/3
    (iex 1.14.0) lib/iex/evaluator.ex:187: IEx.Evaluator.loop/1
    (iex 1.14.0) lib/iex/evaluator.ex:32: IEx.Evaluator.init/4
    (stdlib 4.0.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

** (DBConnection.ConnectionError) tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:913: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (ecto 3.9.4) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    iex:33: (file)

iex(33)> self()                                                
#PID<0.1357.0>

iex(34)> result
warning: variable "result" does not exist and is being expanded to "result()", please use parentheses to remove the ambiguity or change the variable name

Thanks again for your help!

LostKobrakai · March 2, 2023, 3:08pm

IEx has some special ways to deal with exceptions. Generally the caller crashes.

pdgonzalez872 · March 2, 2023, 3:40pm

Moving away from Elixir here, you could also enable logs on Postgres and monitor them as you play around with the suggestions above. I have some notes about this here. Please post your findings!

fedme · March 2, 2023, 3:44pm

Ah thank you so much, that explains it! I was getting confused by this.

justincy · October 24, 2024, 6:31pm

I’m also interested in logging the associated query, if there is one. I thought about adding it to the Postgrex.Protocol DBConnection.ConnectionError error message that gets printed but I haven’t been able to figure out where that’s being generated.

The error which gets returned by ecto is

{:error,
 %DBConnection.ConnectionError{
   message: "connection is closed because of an error, disconnect or timeout",
   severity: :error,
   reason: :error
 }}

Unfortunately at that point we are no longer sure if it’s a timeout anymore (it’s just a generic connection error) which makes it less useful to just wrap all Repo methods, though I might still have to do that.

justincy · October 24, 2024, 7:28pm

I think I found where the log is generated: db_connection/lib/db_connection/connection.ex at 700cbf6a715ae813678de716e4409cc22c7e03e6 · elixir-ecto/db_connection · GitHub

justincy · October 25, 2024, 2:49pm

Thanks to the discussion in this thread, I ended up doing something like this:

  def handle_event(
        [:my_app, :repo :query],
        _measurements,
        %{result: {:error, error}, query: query},
        _config
      ) do
    # Log the query
  end

That will capture more than just timeouts. If that’s a problem then you can go through the additional work to match on error or add some other filtering so that you only get timeouts. In my experience testing locally I saw a few different errors associated with timeouts but in production so far we’ve only seen this one:

Postgrex.Protocol (#PID<0.18044.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.18926.0> timed out because it queued and checked out the connection for longer than 5ms