GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating ** (DBConnection.ConnectionError)

script · March 28, 2022, 4:48pm

Hi,
My production logs are plagued with this error, and it keeps occurring. Here is a complete stack trace.

[error] Postgrex.Protocol (#PID<0.6662.0>) disconnected: ** 
(DBConnection.ConnectionError) client #PID<0.13357.5> timed out because it queued and 
 checked out the connection for longer than 15000ms

#PID<0.13357.5> was at location:

:prim_inet.recv0/3
(postgrex 0.15.11) lib/postgrex/protocol.ex:2976: Postgrex.Protocol.msg_recv/4
(postgrex 0.15.11) lib/postgrex/protocol.ex:2022: Postgrex.Protocol.recv_bind/3
(postgrex 0.15.11) lib/postgrex/protocol.ex:1914: Postgrex.Protocol.bind_execute/4
(db_connection 2.4.0) lib/db_connection/holder.ex:325: DBConnection.Holder.holder_apply/4
(db_connection 2.4.0) lib/db_connection.ex:1314: DBConnection.run_execute/5
(db_connection 2.4.0) lib/db_connection.ex:631: DBConnection.execute/4
(ecto_sql 3.7.0) lib/ecto/adapters/postgres/connection.ex:80: Ecto.Adapters.Postgres.Connection.execute/4

[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Plugins.Stager}}} terminating
 ** (DBConnection.ConnectionError) tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
(ecto_sql 3.7.0) lib/ecto/adapters/sql.ex:756: Ecto.Adapters.SQL.raise_sql_call_error/1
(ecto_sql 3.7.0) lib/ecto/adapters/sql.ex:689: Ecto.Adapters.SQL.execute/5
(oban 2.9.2) lib/oban/plugins/stager.ex:100: anonymous fn/1 in Oban.Plugins.Stager.lock_and_stage/1
(ecto_sql 3.7.0) lib/ecto/adapters/sql.ex:1013: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
(db_connection 2.4.0) lib/db_connection.ex:1512: DBConnection.run_transaction/4
(oban 2.9.2) lib/oban/plugins/stager.ex:83: anonymous fn/2 in Oban.Plugins.Stager.handle_info/2
(telemetry 0.4.3) /home/ubuntu/wswd/web/deps/telemetry/src/telemetry.erl:272: :telemetry.span/3
(oban 2.9.2) lib/oban/plugins/stager.ex:82: Oban.Plugins.Stager.handle_info/2

Timeout issues are also happening for other endpoints where we are searching through the data.
Response time from server is also increased for other api’s.
Everything was working fine at one point, but now we have variable response time from the server for the same api, and average response time for each api is 10 to 15 seconds.
I understand the issue is related to database, database we are using is locally installed PostgreSQL on ec2 instance and size is 37GB. 30GB free memory is available. RAM is 4gb.
The only solution I can think of is using Amazon RDS instead of using local database because there is no single endpoint which is causing the issue so there is no point of increasing the timeout for the entire database, and also I think it’s a scaling issue.
Any suggestions would be much appreciated.
Thank you.

sorentwo · March 28, 2022, 6:57pm

You have some general DB performance problems, most likely a handful of slow queries taking too much of that tiny 4GB RAM pool. With 37GB of data, I imagine you have a very low cache hit ratio, which will tank your queries.

While you’re seeing this show up for Oban’s stager query, I doubt that’s the real culprit. That query takes < 1ms even on a modestly sized table.