(DBConnection.ConnectionError) ssl recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)

We are seeing this (DBConnection.ConnectionError) error in multiple places, but only in production.

2025-09-03T16:13:51.0122410Z Request: POST /user/employee
2025-09-03T16:13:51.0122511Z ** (exit) an exception was raised:
2025-09-03T16:13:51.0122564Z     ** (DBConnection.ConnectionError) ssl recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
2025-09-03T16:13:51.0122619Z         (ecto_sql 3.3.3) lib/ecto/adapters/sql.ex:612: Ecto.Adapters.SQL.raise_sql_call_error/1
2025-09-03T16:13:51.0122667Z         (ecto 3.3.4) lib/ecto/repo/schema.ex:655: Ecto.Repo.Schema.apply/4
2025-09-03T16:13:51.0122737Z         (ecto 3.3.4) lib/ecto/repo/schema.ex:263: anonymous fn/15 in Ecto.Repo.Schema.do_insert/4
2025-09-03T16:13:51.0122790Z         (my_app 4.0.0) lib/my_app/pension/workflow.ex:39: MyApp.Pension.Workflow.create_pension/2
2025-09-03T16:13:51.0122840Z         (ecto_sql 3.3.3) lib/ecto/adapters/sql.ex:886: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
2025-09-03T16:13:51.0122891Z         (db_connection 2.2.1) lib/db_connection.ex:1427: DBConnection.run_transaction/4
2025-09-03T16:13:51.0122966Z         (my_app 4.0.0) lib/my_app/controllers/employee_controller.ex:328: MyAppWeb.EmployeeController.create/3
2025-09-03T16:13:51.0123016Z         (my_app 4.0.0) lib/my_app/controllers/employee_controller.ex:1: MyAppWeb.EmployeeController.action/2

Below is the function MyApp.Pension.Workflow.create_pension/2:

  def create_pension(changeset, current_user) do
    # here we are inserting a new record into the pensions table
    case Repo.insert(changeset) do
      ...

Below is another function where the error repeats:


2025-09-03T15:16:43.7459272Z 15:16:43.744 [error] Task #PID<0.5423.30> started from #PID<0.5443.30> terminating
2025-09-03T15:16:43.7459738Z ** (DBConnection.ConnectionError) ssl recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
2025-09-03T15:16:43.7459826Z     (ecto_sql 3.3.3) lib/ecto/adapters/sql.ex:612: Ecto.Adapters.SQL.raise_sql_call_error/1
2025-09-03T15:16:43.7459892Z     (ecto_sql 3.3.3) lib/ecto/adapters/sql.ex:545: Ecto.Adapters.SQL.execute/5
2025-09-03T15:16:43.7459998Z     (my_app 4.0.0) lib/my_app_web/plugs/user_auth_plug.ex:209: MyAppWeb.UserAuthPlug.update_last_seen!/2
2025-09-03T15:16:43.7460059Z     (elixir 1.13.4) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
2025-09-03T15:16:43.7460121Z     (elixir 1.13.4) lib/task/supervised.ex:34: Task.Supervised.reply/4
2025-09-03T15:16:43.7460179Z     (stdlib 3.14.2.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
2025-09-03T15:16:43.7460262Z Function: #Function<0.37496757/0 in MyAppWeb.UserAuthPlug.put_current/3>
2025-09-03T15:16:43.7460334Z     Args: []

Below is the MyAppWeb.UserAuthPlug.update_last_seen!/2 function:

  def update_last_seen!(%{id: id}, repo) when not is_nil(id) do
    "users"
    |> from
    |> where(id: ^id)
    |> update(set: [last_seen_at: ^NaiveDateTime.utc_now()])
    |> repo.update_all([])

    :ok
  end

We tried increasing the database pool size from 36 to 50, but it didn't show any improvement. Can someone please help me here?

(Just offering my input until somebody competent comes along…)

Have you tried increasing the :queue_target in your repo? Maybe that will resolve the issue if the query is timing out.
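For reference, a sketch of where those options would go (the values are illustrative, not recommendations; :timeout is worth noting because its default of 15_000 ms is exactly the kind of 15-second ceiling that shows up in delays like yours):

```elixir
# config/prod.exs -- illustrative values, tune for your own workload
config :my_app, MyApp.Repo,
  pool_size: 50,
  # how long a checkout may wait in the queue before DBConnection
  # starts shedding load (defaults: 50 ms / 1000 ms)
  queue_target: 500,
  queue_interval: 2_000,
  # per-query timeout; the default is 15_000 ms
  timeout: 15_000
```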

Here is a list of hard-to-find options that may be relevant (these would go in your repo config):

https://hexdocs.pm/db_connection/DBConnection.html#start_link/2-options

There was a system with similar issues and, after further exploration, it became clear the root cause was that the connection to the database was indeed being lost. This was verified by starting a service, written in a different language, on the same host whose only job was to connect to the database and log any errors; it logged errors at roughly the same times. If increasing the queue target or pool size does not solve the issue, that's most likely the cause here too (especially since the error is "recv closed").

After having a look at the internals, I could see that most of the time only 1-5 connections are active in PostgreSQL and the remaining 45 connections are idle. The error is logged in only a few of the functions, one of them being update_last_seen!/2, mentioned above. If the connection to the database were being lost, the error should also show up in other database transactions, so I don't think the database connection is being lost. Let me know if I am missing anything here.
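One way to narrow this down is to measure where the time goes: queueing for a pool connection versus executing the query. Ecto emits a telemetry event for every query; here is a sketch of a handler (the event name assumes the default telemetry prefix for MyApp.Repo, and the thresholds are arbitrary):

```elixir
# Attach once at application start. Ecto's query event carries
# :queue_time (waiting for a connection) and :query_time (running the
# query), both in native time units.
:telemetry.attach(
  "log-slow-db-calls",
  [:my_app, :repo, :query],
  fn _event, measurements, metadata, _config ->
    to_ms = fn
      nil -> 0
      t -> System.convert_time_unit(t, :native, :millisecond)
    end

    queue_ms = to_ms.(measurements[:queue_time])
    query_ms = to_ms.(measurements[:query_time])

    # arbitrary thresholds: flag anything queueing > 100 ms or running > 5 s
    if queue_ms > 100 or query_ms > 5_000 do
      IO.puts("slow db call: queue=#{queue_ms}ms query=#{query_ms}ms source=#{inspect(metadata[:source])}")
    end
  end,
  nil
)
```

If queue_time stays low while query_time climbs toward 15 s, the pool is fine and the query itself is stuck (consistent with a lock), which would also explain why adding connections didn't help.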

During the investigation, in the database logs, I could see the query run by update_last_seen!/2 is in the waiting state, with the reason shown as Lock (at the database). Could this be the issue? Can this lock cause a 15000ms delay? Any suggestions on how to double-check this? (I am focusing on update_last_seen!/2 because 95% of the errors logged in the application trace back to that function.)

Have you tried the EXPLAIN SQL facility?

Based on what you said above, I agree with you and have removed my original comment.

At the same time, I wouldn’t expect an update all with an ID as above to cause such issues, unless perhaps you have other queries in your database locking the users table or particular rows?
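If a lock on the users row does turn out to be the cause, one mitigation (my sketch, not something proven here) is to give this particular write a much shorter timeout so it fails fast instead of holding a pool connection for the full default 15_000 ms:

```elixir
# Sketch: same update as the original, but with a per-query :timeout
# (2_000 ms is an arbitrary choice) so a lock wait errors out quickly
# instead of tying up a connection.
def update_last_seen!(%{id: id}, repo) when not is_nil(id) do
  "users"
  |> from()
  |> where(id: ^id)
  |> update(set: [last_seen_at: ^NaiveDateTime.utc_now()])
  |> repo.update_all([], timeout: 2_000)

  :ok
end
```

Since this is a best-effort "last seen" write, failing fast and dropping one update is arguably better than blocking a connection for 15 seconds.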

Yes, I have tried running the query using EXPLAIN. It is doing an index scan, with cost=0.28..4.30.

I could see the query from update_last_seen!/2 is in the waiting state at the database and the reason is LOCK. Can this lock last for 15000ms?

Any suggestions on how to identify which other query is locking the row, and how to resolve it?
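To see which session holds the lock, PostgreSQL 9.6+ can report blocking PIDs directly via pg_blocking_pids/1. A sketch you could run from an IEx session while the problem is happening (adjust the repo name to yours):

```elixir
# Joins pg_stat_activity against itself via pg_blocking_pids(pid) to show,
# for every blocked backend, which backend is blocking it and what that
# blocker is currently running.
{:ok, result} =
  MyApp.Repo.query("""
  SELECT blocked.pid                  AS blocked_pid,
         blocked.query                AS blocked_query,
         blocking.pid                 AS blocking_pid,
         blocking.query               AS blocking_query,
         blocking.state               AS blocking_state,
         now() - blocking.query_start AS blocking_for
  FROM pg_stat_activity AS blocked
  JOIN pg_stat_activity AS blocking
    ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
  """)

IO.inspect(result.rows, label: "lock chains")
```

It may also help to set log_lock_waits = on in postgresql.conf, which logs any lock wait longer than deadlock_timeout (1 s by default), so the blocker shows up in the database logs without having to catch it live. A classic culprit is a long-running or idle-in-transaction session that earlier updated the same users row and holds its row lock until it commits.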