How do Ecto Repo.checkout timeouts work in detail?

Yegair · July 4, 2024, 7:19am

Hello everyone,

I am having trouble to fully understand how Repo.checkout works regarding DB connection timeouts. I did not find an explanation in the Docs (but maybe there is one, in that case I would be very thankful if someone could point me to it).

My code looks roughly like this:

# code that loads data into `very_large_table` ...

Repo.checkout(fn ->
    # use minimal number of CPU cores for index creation to not overwhelm the database
    Repo.query!("SET max_parallel_maintenance_workers = 0")

    Repo.query!(
      "CREATE INDEX very_large_index ON very_large_table (...)",
      [],
      timeout: :timer.minutes(5)
    )

    Repo.query!(
      "CREATE INDEX another_large_index ON very_large_table (...)",
      [],
      timeout: :timer.minutes(5)
    )

    # ... more heavy operations
end)

It runs as part of a job that loads a huge amount of data into a fresh table and then sets up indexes, constraints, etc.

I verified that the CREATE INDEX ... operations easily finish within the specified timeout, but I still get a DBConnection.ConnectionError tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated) from the Repo.checkout call.

I assume I have to change my code so the Repo.checkout call has a timeout that allows all the nested calls to finish.

Repo.checkout(
  fn ->
    # ... long running calls, see example above
  end,
  timeout: :timer.minutes(30)
)

But now I am not sure what exactly that does to my nested Repo.query!("...", [], timeout: ...) calls.

Do they still have their default timeout of whatever is specified in the config, unless a specific timeout is given?
Do they inherit the timeout from the Repo.checkout call?
Do they override the timeout from the Repo.checkout call if a specific timeout is given?

Any help or links to some docs that clarify the matter would be highly appreciated

ruslandoga · July 4, 2024, 8:39am

@Yegair

This is probably not very helpful, but I think all answers are “no”. Or “it depends” on the Ecto adapter. Here’re some examples for Postgrex which is used for Ecto.Adapters.Postgres repos.

Do they still have their default timeout of whatever is specified in the config, unless a specific timeout is given?

iex> Mix.install [:postgrex]

iex> defmodule Demo do
  def run do
    {:ok, pool} =
      Postgrex.start_link(database: "plausible_dev", username: "postgres", password: "postgres")

    try do
      DBConnection.run(
        pool,
        fn conn ->
          Postgrex.query!(conn, "select pg_sleep(30)", [], [])
        end,
        timeout: :timer.seconds(60)
      )
    after
      Process.exit(pool, :normal)
    end
  end
end

iex> Demo.run()
{:ok,
 %Postgrex.Result{
   command: :select,
   columns: ["pg_sleep"],
   rows: [[:void]],
   num_rows: 1,
   connection_id: 91,
   messages: []
 }}

A 30s query succeeds with no custom timeout provided. Default timeout in Postgrex is 15s.

Do they inherit the timeout from the Repo.checkout call?

Instead of inheriting, it seems more like individual timeout no longer applies to queries inside a checkout block.

defmodule Demo do
  def run do
    {:ok, pool} =
      Postgrex.start_link(database: "plausible_dev", username: "postgres", password: "postgres")

    try do
      DBConnection.run(
        pool,
        fn conn ->
          Postgrex.query!(conn, "select pg_sleep(30)", [], [])
          Postgrex.query!(conn, "select pg_sleep(30)", [], [])
          Postgrex.query!(conn, "select pg_sleep(30)", [], [])
        end,
        timeout: :timer.seconds(60)
      )
    after
      Process.exit(pool, :normal)
    end
  end
end

Demo.run

15:49:22.723 [error] Postgrex.Protocol (#PID<0.309.0>) disconnected:
  ** (DBConnection.ConnectionError) client #PID<0.215.0> timed out because it queued and
     checked out the connection for longer than 60000ms

#PID<0.215.0> was at location:

    :prim_inet.recv0/3
    (postgrex 0.18.0) lib/postgrex/protocol.ex:3287: Postgrex.Protocol.msg_recv/4
    (postgrex 0.18.0) lib/postgrex/protocol.ex:2281: Postgrex.Protocol.recv_bind/3
    (postgrex 0.18.0) lib/postgrex/protocol.ex:2136: Postgrex.Protocol.bind_execute_close/4
    (db_connection 2.6.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.6.0) lib/db_connection.ex:1512: DBConnection.run_execute/5
    (db_connection 2.6.0) lib/db_connection.ex:743: DBConnection.parsed_prepare_execute/5
    (db_connection 2.6.0) lib/db_connection.ex:735: DBConnection.prepare_execute/4

** (Postgrex.Error) ERROR 57014 (query_canceled) canceling statement due to user request
    (postgrex 0.18.0) lib/postgrex.ex:330: Postgrex.query!/4
    iex:27: anonymous fn/1 in Demo.run/0
    (db_connection 2.6.0) lib/db_connection.ex:930: DBConnection.run/3
    iex:23: Demo.run/0
    iex:18: (file)

Do they override the timeout from the Repo.checkout call if a specific timeout is given?

defmodule Demo do
  def run do
    {:ok, pool} =
      Postgrex.start_link(database: "plausible_dev", username: "postgres", password: "postgres")

    try do
      DBConnection.run(
        pool,
        fn conn ->
          Postgrex.query!(conn, "select pg_sleep(30)", [], timeout: :timer.seconds(60))
          Postgrex.query!(conn, "select pg_sleep(30)", [], timeout: :timer.seconds(60))
          Postgrex.query!(conn, "select pg_sleep(30)", [], timeout: :timer.seconds(60))
        end,
        timeout: :timer.seconds(10)
      )
    after
      Process.exit(pool, :normal)
    end
  end
end

Demo.run

15:49:51.534 [error] Postgrex.Protocol (#PID<0.318.0>) disconnected: 
  ** (DBConnection.ConnectionError) client #PID<0.215.0> timed out because it queued and
     checked out the connection for longer than 10000ms

#PID<0.215.0> was at location:

    :prim_inet.recv0/3
    (postgrex 0.18.0) lib/postgrex/protocol.ex:3287: Postgrex.Protocol.msg_recv/4
    (postgrex 0.18.0) lib/postgrex/protocol.ex:2281: Postgrex.Protocol.recv_bind/3
    (postgrex 0.18.0) lib/postgrex/protocol.ex:2136: Postgrex.Protocol.bind_execute_close/4
    (db_connection 2.6.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.6.0) lib/db_connection.ex:1512: DBConnection.run_execute/5
    (db_connection 2.6.0) lib/db_connection.ex:743: DBConnection.parsed_prepare_execute/5
    (db_connection 2.6.0) lib/db_connection.ex:735: DBConnection.prepare_execute/4

** (Postgrex.Error) ERROR 57014 (query_canceled) canceling statement due to user request
    (postgrex 0.18.0) lib/postgrex.ex:330: Postgrex.query!/4
    iex:27: anonymous fn/1 in Demo.run/0
    (db_connection 2.6.0) lib/db_connection.ex:930: DBConnection.run/3
    iex:24: Demo.run/0
    iex:19: (file)

Yegair · July 4, 2024, 9:13am

Thank you so much @ruslandoga, that really helps a lot understanding the problem.

My interpretation would be that the timeout applies to the checkout of the connection only, and so specifying it on a nested query has no effect at all since the connection has already been checked out using a specific timeout.

That makes a lot of sense considering the error message explicitly mentions ... checked out the connection for longer than ... .

I think it was the API design of query!("...", [], timeout: ...) that led me astray, because from how it looks in isolation one might assume the timeout there is related to the query rather than the connection checkout.