(DBConnection.ConnectionError) tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)

Stack trace
DBConnection.ConnectionError: (DBConnection.ConnectionError) tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
(ecto_sql) lib/ecto/adapters/sql.ex:612: Ecto.Adapters.SQL.raise_sql_call_error/1
(ecto_sql) lib/ecto/adapters/sql.ex:545: Ecto.Adapters.SQL.execute/5
(ecto) lib/ecto/repo/queryable.ex:192: Ecto.Repo.Queryable.execute/4
(ecto) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
(diamond) lib/diamond/concepts/experiment/update_count_for_each_running_experiment.ex:21: Diamond.Concepts.Experiment.UpdateCountForEachRunningExperiment.fetch_count/1
(diamond) lib/diamond/concepts/experiment/update_count_for_each_running_experiment.ex:10: Diamond.Concepts.Experiment.UpdateCountForEachRunningExperiment.call/0
(quantum) lib/quantum/executor.ex:141: anonymous fn/2 in Quantum.Executor.run/4
(elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
(quantum) anonymous fn() in Quantum.Executor.run/4

config.exs

config :diamond, Diamond.Scheduler,
  jobs: [
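    # note: "*/60" in the minutes field only matches minute 0, i.e. this job runs hourly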
    {"*/60 * * * *", {Diamond.Concepts.Experiment.UpdateCountForEachRunningExperiment, :call, []}}
  ]

Also, sometimes I get:

15:00:15.060 [error] Postgrex.Protocol (#PID<0.2309.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.28048.4> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.28048.4> was at location:

    :prim_inet.recv0/3
    (postgrex) lib/postgrex/protocol.ex:2837: Postgrex.Protocol.msg_recv/4
    (postgrex) lib/postgrex/protocol.ex:1879: Postgrex.Protocol.recv_bind/3
    (postgrex) lib/postgrex/protocol.ex:1771: Postgrex.Protocol.bind_execute/4
    (db_connection) lib/db_connection/holder.ex:316: DBConnection.Holder.holder_apply/4
    (db_connection) lib/db_connection.ex:1255: DBConnection.run_execute/5
    (db_connection) lib/db_connection.ex:1342: DBConnection.run/6
    (db_connection) lib/db_connection.ex:539: DBConnection.parsed_prepare_execute/5

I have been getting this error in production since ages, and increasing the interval in the cron schedule didn't help.
Can anyone please help me here?

Thanks

This error usually happens when a query takes too long to execute on the database side. If you post the code of UpdateCountForEachRunningExperiment.fetch_count/1 and relevant schemas, folks will be able to help spot obvious performance traps.
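
In the meantime, you can raise the timeout for that one call so the job at least survives — a stopgap, not a fix, since it only gives the slow query more room:

# Default query timeout is 15_000 ms; this gives a single call more time.
# The underlying query still needs to get faster.
Repo.all(query, timeout: :timer.minutes(2))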


Here is the code for UpdateCountForEachRunningExperiment.fetch_count/1:

def fetch_version_wise_count(experiment_ids) do
  from(esl in Diamond.Models.ExperimentSplitLog,
    where: esl.experiment_id in ^experiment_ids,
    group_by: [esl.experiment_id, esl.rule_set_version_id],
    select: [count(), esl.rule_set_version_id, esl.experiment_id]
  )
  |> Repo.all()
end

Some questions in-line:

from(esl in Diamond.Models.ExperimentSplitLog,
  # How many rows in ^ this table?
  where: esl.experiment_id in ^experiment_ids,
  # Is there an index on ^ this column?
  group_by: [esl.experiment_id, esl.rule_set_version_id],
  #                      what about ^ this column?
  select: [count(), esl.rule_set_version_id, esl.experiment_id]
  # how many rows do you expect in the result?
)

This will help determine why the query is timing out.
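
If your ecto_sql is 3.4 or newer, you can also ask Postgres for the query plan directly from iex. A sketch, assuming experiment_ids is bound to a real list — note that analyze: true actually executes the query, so it can take a while:

import Ecto.Query

query =
  from(esl in Diamond.Models.ExperimentSplitLog,
    where: esl.experiment_id in ^experiment_ids,
    group_by: [esl.experiment_id, esl.rule_set_version_id],
    select: [count(), esl.rule_set_version_id, esl.experiment_id]
  )

# A "Seq Scan on experiment_split_log" in the output confirms a missing index.
Diamond.Repo
|> Ecto.Adapters.SQL.explain(:all, query, analyze: true, timeout: :timer.minutes(5))
|> IO.puts()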

How many rows in the experiment_split_log table?: 279,517,334

Is there an index on that column?: a foreign key constraint

How many rows do you expect in the result?: It depends on the condition.

Also, after the above query is executed, the query below runs:

def update_cached_count(count, experiment_id, version_id) do
  from(evm in @model,
    where: evm.experiment_id == ^experiment_id and evm.rule_set_version_id == ^version_id
  )
  |> Repo.update_all(set: [cached_count: count])
end

Thanks.

A foreign key constraint requires an index / primary key on the referenced column (id in the experiments table) but does NOT create or require an index on the referring column (experiment_id).

Sequential-scanning 279 million rows without an index? You’re Gonna Have A Bad Time.
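
You can check what Postgres actually has from iex — plain SQL through the repo, table name taken from the thread:

# Lists every index on the table; an index on experiment_id would show up here.
Diamond.Repo.query!(
  "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'experiment_split_log'"
)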


We are using a Postgres database, and the query to list indexes gave me this output.

Can you help me tell whether there is a proper index or not?

Thanks.

There is not. You’ll want an index covering experiment_split_log.experiment_id; otherwise the only way Postgres can find rows satisfying this query is by checking each of the 279 million…

from(esl in Diamond.Models.ExperimentSplitLog,
  where: esl.experiment_id in ^experiment_ids,
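
Something like the migration below would add it (a sketch — the module name is made up). On a table this size you will want concurrently: true so writes aren't blocked while the index builds, which in turn requires disabling the DDL transaction and the migration lock:

defmodule Diamond.Repo.Migrations.AddExperimentIdIndexToExperimentSplitLog do
  use Ecto.Migration

  # CREATE INDEX CONCURRENTLY cannot run inside a transaction.
  @disable_ddl_transaction true
  @disable_migration_lock true

  def change do
    create index(:experiment_split_log, [:experiment_id], concurrently: true)
  end
end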

Thanks for the help.