Database connection issues during deployment on Digital Ocean App Platform

Hey, recently I started having issues during deployments to Digital Ocean App Platform. I’m deploying an Elixir release packaged as a Docker image. During startup I often get a database connection error:

[myapp] [2023-03-26 09:31:39] 09:31:39.044 [error] Could not create schema migrations table. This error usually happens due to the following:
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39]   * The database does not exist
[myapp] [2023-03-26 09:31:39]   * The "schema_migrations" table, which Ecto uses for managing
[myapp] [2023-03-26 09:31:39]     migrations, was defined by another library
[myapp] [2023-03-26 09:31:39]   * There is a deadlock while migrating (such as using concurrent
[myapp] [2023-03-26 09:31:39]     indexes with a migration_lock)
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39] To fix the first issue, run "mix ecto.create".
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39] To address the second, you can run "mix ecto.drop" followed by
[myapp] [2023-03-26 09:31:39] "mix ecto.create". Alternatively you may configure Ecto to use
[myapp] [2023-03-26 09:31:39] another table and/or repository for managing migrations:
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39]     config :myapp, MyApp.Repo,
[myapp] [2023-03-26 09:31:39]       migration_source: "some_other_table_for_schema_migrations",
[myapp] [2023-03-26 09:31:39]       migration_repo: AnotherRepoForSchemaMigrations
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39] The full error report is shown below.
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39] ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2483ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39]   1. Ensuring your database is available and that you can connect to it
[myapp] [2023-03-26 09:31:39]   2. Tracking down slow queries and making sure they are running fast enough
[myapp] [2023-03-26 09:31:39]   3. Increasing the pool_size (although this increases resource consumption)
[myapp] [2023-03-26 09:31:39]   4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39] See DBConnection.start_link/2 for more information
[myapp] [2023-03-26 09:31:39] 
[myapp] [2023-03-26 09:31:39]     (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:913: Ecto.Adapters.SQL.raise_sql_call_error/1
[myapp] [2023-03-26 09:31:39]     (elixir 1.14.3) lib/enum.ex:1658: Enum."-map/2-lists^map/1-0-"/2
[myapp] [2023-03-26 09:31:39]     (ecto_sql 3.9.2) lib/ecto/adapters/sql.ex:1005: Ecto.Adapters.SQL.execute_ddl/4
[myapp] [2023-03-26 09:31:39]     (ecto_sql 3.9.2) lib/ecto/migrator.ex:677: Ecto.Migrator.verbose_schema_migration/3
[myapp] [2023-03-26 09:31:39]     (ecto_sql 3.9.2) lib/ecto/migrator.ex:491: Ecto.Migrator.lock_for_migrations/4
[myapp] [2023-03-26 09:31:39]     (ecto_sql 3.9.2) lib/ecto/migrator.ex:403: Ecto.Migrator.run/4
[myapp] [2023-03-26 09:31:39]     (ecto_sql 3.9.2) lib/ecto/migrator.ex:146: Ecto.Migrator.with_repo/3
[myapp] [2023-03-26 09:31:39]     nofile:1: (file)

My migration task seems unable to connect to the database.

  • The DB exists and is reachable via other means,
  • The DB server allows up to 22 connections,
  • The app’s pool size is set to 6,
  • I’m only running one instance,
  • I use zero-downtime deployments, so at most 12 connections (2 instances × pool size 6) should be in use during a deployment.

I’m only experiencing this during deployments, which makes them fail. Any idea what might be causing this?

Not sure if this is the same problem, but on DO I found I had to set the :maintenance_database to defaultdb (Ecto default is postgres) if I wanted to do create/migrate as part of a deploy.
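
That is, something like this in the repo config (a sketch based on the above; defaultdb is the default database name on DO’s managed Postgres):

config :myapp, MyApp.Repo,
  maintenance_database: "defaultdb"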

Thanks! I’ve updated the value for :maintenance_database. But it’s something else.

I’ve wrapped my migration function with try do ... catch ... end so that it doesn’t crash. This no longer stops the deployment process - the app happily boots and connects to the database :man_shrugging: :thinking: Not sure what is going on here; it only seems to work sometimes. Sounds like some sort of race condition.

Should I be manually starting the :postgrex application?

Sorry for a generic suggestion: have you tried the migration_lock option?
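
For example (a sketch; if your ecto_sql is recent enough, the Postgres adapter supports :pg_advisory_lock besides the default :table_lock, and false disables locking entirely):

config :myapp, MyApp.Repo,
  migration_lock: :pg_advisory_lock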

Late to the party, but if you have multiple instances on the app platform, are they perhaps trying to run migrations at the same time?

I only have one instance running, so the migrations are not attempted by multiple nodes.

For some reason, if I connect to the DB manually before running migrations, things seem to work. If I don’t, the migration task fails to connect. Any ideas why that might be? I’m guessing maybe different connection timeouts?

Just to recap:

  • Digital Ocean App Platform with hosted PostgreSQL over SSL
  • MIGRATION_PRECONNECT=true ./bin/myapp eval MyApp.Release.migrate works
  • MIGRATION_PRECONNECT=false ./bin/myapp eval MyApp.Release.migrate fails
defmodule MyApp.Release do
  @moduledoc """
  Release tasks.
  """

  require Logger

  @app :myapp

  def migration_preconnect do
    Application.ensure_all_started(:ssl)
    Application.ensure_all_started(:postgrex)

    database_url = System.get_env("DATABASE_URL")
    conn_opts = Ecto.Repo.Supervisor.parse_url(database_url)
    Logger.info("Connecting: #{inspect(Keyword.delete(conn_opts, :password))}")

    {:ok, pid} =
      Postgrex.start_link(
        conn_opts ++
          [
            ssl: true,
            ssl_opts: [
              verify: :verify_peer,
              # Decode the PEM-encoded CA cert from the env var into DER,
              # which is the format :cacerts expects.
              cacerts: [
                "DATABASE_CA_CERT"
                |> System.get_env()
                |> then(fn pem ->
                  [{_type, der, _info}] = :public_key.pem_decode(pem)
                  der
                end)
              ]
            ]
          ]
      )

    Logger.info("Querying...")

    %Postgrex.Result{rows: [[count]]} =
      Postgrex.query!(pid, "SELECT count(*) FROM accounts;", [])

    Logger.info("Found #{count} account(s).")
  end

  def migrate do
    try do
      if System.get_env("MIGRATION_PRECONNECT") == "true" do
        migration_preconnect()
      end

      load_app()

      for repo <- repos() do
        {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
      end
    catch
      kind, value ->
        Logger.warning("Migration failed: #{inspect(kind)}, #{inspect(value)}")
    end
  end

  def rollback(repo, version) do
    load_app()
    {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :down, to: version))
  end

  defp repos do
    Application.fetch_env!(@app, :ecto_repos)
  end

  defp load_app do
    Application.ensure_all_started(:ssl)
    Application.load(@app)
  end
end

I actually battled with this error yesterday, but on Heroku, where I got quite unhelpful error messages:

** EXIT (exited #pid<0.98.0>) shutdown

It turned out to be the changed default value for ssl’s verify option (changed in OTP 26). So what @stefanchrobot wrote will probably solve it for you. But if, like me, you don’t want to get the certificate injected into the platform yet, you can revert to the old behaviour (although it’s considered less safe):

# runtime.exs

config :my_app, MyApp.Repo,
  ssl: true,
  verify_ssl: true
# repo.ex
defmodule MyApp.Repo do
  # ...

  @impl Ecto.Repo
  def init(_type, config) do
    config =
      if config[:verify_ssl] do
        Keyword.put(config, :ssl_opts, ssl_options(config[:hostname]))
      else
        config
      end

    {:ok, config}
  end

  defp ssl_options(_server) do
    [
      # Since OTP 26, `verify` defaults to `:verify_peer`, which is the much
      # safer option. However, using it would require merging the AWS
      # certificate into our castore (it's not there by default), so we
      # fall back to `:verify_none` here.
      #
      # See https://hexdocs.pm/postgrex/Postgrex.html#start_link/1-ssl-client-authentication for more details
      verify: :verify_none,
      cacerts: :public_key.cacerts_get()
    ]
  end
end

Not sure if it’s the same issue though. Digital Ocean provides the cert via an ENV variable, so I’m always connecting with SSL.

Right, I just got so used to the error by working on it the whole day yesterday that I assumed it would be the same :sweat_smile:.

How are the certificates injected into SSL during a normal start of Postgres? Are you running OTP 26?

No worries. It might turn out that it does have something to do with SSL.

I’m on OTP 26, but the issue was there with OTP 25 too. Here’s more or less my repo config:

config :myapp, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  maintenance_database: System.get_env("MAINTENANCE_DATABASE"),
  pool_size: System.get_env("DATABASE_POOL_SIZE") |> String.to_integer(),
  log: System.get_env("LOG_LEVEL_ECTO") |> String.to_existing_atom(),
  ssl: true,
  ssl_opts: [
    verify: :verify_peer,
    cacerts: [
      "DATABASE_CA_CERT"
      |> System.get_env()
      |> then(fn pem ->
        [{_type, der, _info}] = :public_key.pem_decode(pem)
        der
      end)
    ]
  ]

As written above, the cert is provided in an ENV variable by Digital Ocean App Platform (PaaS), so it’s just a matter of passing it through.

@stefanchrobot Perhaps a very stupid question, but is the application already live and running, so it’s been verified that the ssl_opts are passed through properly from the repo configuration?

I’m assuming that your configuration lives inside the runtime configuration so that shouldn’t be a problem either…

Have you been able to look at the connection logs at the database to see if it’s trying to connect?

Yes, the application is up and running. The migration fails to connect to the DB, but the application boots right after that with no issues and functions properly.

I think I’m losing hope of resolving the issue. @ericmj sorry for tagging you, but I feel like I’m out of ideas.

Issue

The migrations fail to run even though the application properly connects to the database later on.

Setup

  • Digital Ocean App Platform app on basic plan, single instance
  • Digital Ocean Managed PostgreSQL 15 with SSL, no VPC
  • PostgreSQL provides 22 connections
  • DB pool size for the app (DATABASE_POOL_SIZE) is 6
  • Elixir 1.15.4, Erlang 26.0.2 on Alpine 3.18.2 (same issue happened with OTP 25)
  • Using Elixir releases
  • Dockerfile command: CMD ["sh", "-c", "/app/bin/migrate && /app/bin/server"]
  • The scripts are the usual exec ./myapp eval MyApp.Release.migrate and exec ./myapp start
  • config/runtime.exs:
    config :myapp, MyApp.Repo,
      url: System.get_env("DATABASE_URL"),
      maintenance_database: System.get_env("MAINTENANCE_DATABASE"),
      pool_size: System.get_env("DATABASE_POOL_SIZE") |> String.to_integer(),
      log: System.get_env("LOG_LEVEL_ECTO") |> String.to_existing_atom(),
      ssl: true,
      ssl_opts: [
        verify: :verify_peer,
        cacerts: [
          "DATABASE_CA_CERT"
          |> System.get_env()
          |> then(fn pem ->
            [{_type, der, _info}] = :public_key.pem_decode(pem)
            der
          end)
        ]
      ]
    
  • release.ex:
    defmodule MyApp.Release do
      require Logger
    
      @app :myapp
    
      def manual_connect do
        Application.ensure_all_started(:ssl)
        Application.ensure_all_started(:postgrex)
    
        database_url = System.get_env("DATABASE_URL")
    
        conn_opts =
          Ecto.Repo.Supervisor.parse_url(database_url) ++
            [
              ssl: true,
              ssl_opts: [
                verify: :verify_peer,
                cacerts: [
                  "DATABASE_CA_CERT"
                  |> System.get_env()
                  |> then(fn pem ->
                    [{_type, der, _info}] = :public_key.pem_decode(pem)
                    der
                  end)
                ]
              ]
            ]
    
        safe_conn_opts = Keyword.put(conn_opts, :password, "*****")
        Logger.info("connection options: #{inspect(safe_conn_opts)}")
    
        {:ok, pid} = Postgrex.start_link(conn_opts)
    
        Logger.info("Querying...")
    
        %Postgrex.Result{rows: [[count]]} =
          Postgrex.query!(pid, "SELECT count(*) FROM accounts;", [])
    
        Logger.info("Found #{count} account(s).")
      end
    
      def query_with_repo() do
        Application.ensure_all_started(:ssl)
    
        for repo <- repos() do
          {:ok, _, _} =
            Ecto.Migrator.with_repo(repo, fn repo ->
              result = Ecto.Adapters.SQL.query(repo, "SELECT COUNT(*) FROM accounts;")
              IO.inspect(result, label: "query result")
            end)
        end
      end
    
      def migrate(opts \\ []) do
        try do
          if opts[:manual_connect] do
            manual_connect()
          end
    
          load_app()
    
          for repo <- repos() do
            {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
          end
        catch
          kind, value ->
            Logger.warning("Migration failed: #{inspect(kind)}, #{inspect(value)}")
        end
      end
    
      def rollback(repo, version) do
        load_app()
        {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :down, to: version))
      end
    
      defp repos do
        Application.fetch_env!(@app, :ecto_repos)
      end
    
      defp load_app do
        Application.ensure_all_started(:ssl)
        Application.load(@app)
      end
    end
    

Expected behavior

The migrations run successfully.

Actual behavior

The migrations fail:

09:46:41.723 [error] Could not create schema migrations table. This error usually happens due to the following:

  * The database does not exist
  * The "schema_migrations" table, which Ecto uses for managing
    migrations, was defined by another library
  * There is a deadlock while migrating (such as using concurrent
    indexes with a migration_lock)

To fix the first issue, run "mix ecto.create" for the desired MIX_ENV.

To address the second, you can run "mix ecto.drop" followed by
"mix ecto.create", both for the desired MIX_ENV. Alternatively you may
configure Ecto to use another table and/or repository for managing
migrations:

    config :myapp, MyApp.Repo,
      migration_source: "some_other_table_for_schema_migrations",
      migration_repo: AnotherRepoForSchemaMigrations

The full error report is shown below.

09:46:42.108 [warning] Migration failed: :error, %DBConnection.ConnectionError{message: "connection not available and request was dropped from queue after 2402ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n  1. Ensuring your database is available and that you can connect to it\n  2. Tracking down slow queries and making sure they are running fast enough\n  3. Increasing the pool_size (although this increases resource consumption)\n  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n", severity: :error, reason: :queue_timeout}

Notes

  • This used to work, but started failing ~5 months ago and I have not been able to fix it since
  • The database exists, since the app boots and works properly
  • I can reproduce the issue by running the task manually in the app console:
    ~ $ ./bin/myapp eval 'MyApp.Release.migrate'
    <same error as above>
    
  • Running ./bin/myapp eval 'MyApp.Release.manual_connect' works 7 out of 8 times :grimacing: :thinking:
    ~ $ ./bin/myapp eval 'MyApp.Release.manual_connect'
    18:10:55.695 [info] connection options: [password: "*****", hostname: "app-9b676190-67dd-41ee-<truncated>", scheme: "postgresql", username: "db", database: "db", port: 25060, sslmode: "require", ssl: true, ssl_opts: [verify: :verify_peer, cacerts: [<<48, 130, 4, 65, 48, 130, 2, 169, 160, 3, 2, 1, 2, 2, 20, 127, 134, 181, 234, 214, 51, 106, 230, 141, 57, 143, 45, 82, 18, 133, 113, 172, 225, 180, 168, 48, 13, 6, ...>>]]]
    18:10:56.109 [info] Querying...
    18:10:58.999 [info] Found 2 account(s).
    
    ~ $ ./bin/myapp eval 'MyApp.Release.manual_connect'
    18:11:12.907 [info] connection options: [password: "*****", hostname: "app-9b676190-67dd-41ee-<truncated>", scheme: "postgresql", username: "db", database: "db", port: 25060, sslmode: "require", ssl: true, ssl_opts: [verify: :verify_peer, cacerts: [<<48, 130, 4, 65, 48, 130, 2, 169, 160, 3, 2, 1, 2, 2, 20, 127, 134, 181, 234, 214, 51, 106, 230, 141, 57, 143, 45, 82, 18, 133, 113, 172, 225, 180, 168, 48, 13, 6, ...>>]]]
    18:11:13.393 [info] Querying...
    ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2696ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
    
      1. Ensuring your database is available and that you can connect to it
      2. Tracking down slow queries and making sure they are running fast enough
      3. Increasing the pool_size (although this increases resource consumption)
      4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
    
    See DBConnection.start_link/2 for more information
    
        (postgrex 0.17.2) lib/postgrex.ex:375: Postgrex.query!/4
        (myapp 0.1.0) lib/myapp/release.ex:42: MyApp.Release.manual_connect/0
        nofile:1: (file)
        (stdlib 5.0.2) erl_eval.erl:750: :erl_eval.do_apply/7
        (elixir 1.15.4) lib/code.ex:543: Code.validated_eval_string/3
    
  • Note the initial time in the query above: is it connecting lazily?
  • Running ./bin/myapp eval 'MyApp.Release.migrate(manual_connect: true)' sometimes works, but not reliably
    ~ $ ./bin/myapp eval 'MyApp.Release.migrate(manual_connect: true)'
    18:51:50.298 [info] connection options: ...
    18:51:50.709 [info] Querying...
    18:51:53.494 [info] Found 2 account(s).
    18:51:55.402 [info] Migrations already up
    
  • Doesn’t seem to be migration-related, since running MyApp.Release.query_with_repo blows up almost always :grimacing:
    ~ $ ./bin/myapp eval 'MyApp.Release.query_with_repo'
    query result: {:error,
    %DBConnection.ConnectionError{
      message: "connection not available and request was dropped from queue after 2518ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n  1. Ensuring your database is available and that you can connect to it\n  2. Tracking down slow queries and making sure they are running fast enough\n  3. Increasing the pool_size (although this increases resource consumption)\n  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n",
      severity: :error,
      reason: :queue_timeout
    }}
    
  • Occasionally the migrations work :grimacing:

My guesses

  • Slow SSL handshake causing connection issues? Maybe the supervision tree in the app magically heals the issue when the app boots?
  • Weird timing issue somewhere in the release tasks?

Do you see these failed connections in the database logs?

I’ve spent more time debugging this issue. I’ve copied db_connection and postgrex from deps/ into local_deps/, made them path dependencies and then sprinkled them with logging.
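
Swapping the Hex packages for the local copies looks roughly like this in mix.exs (a sketch; override: true lets the path copies satisfy ecto_sql’s own requirements on these packages):

defp deps do
  [
    # ... other deps ...
    # Local, logging-instrumented copies of the two packages:
    {:db_connection, path: "local_deps/db_connection", override: true},
    {:postgrex, path: "local_deps/postgrex", override: true}
  ]
end

Here’s what I got.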

When the migrations fail, it looks like the connection checkout times out before postgrex connects to the database:

~ $ ./bin/myapp eval MyApp.Release.migrate
22:03:09.885 [info] DBConnection.ConnectionPool.start_link
22:03:09.898 [info] DBConnection.register_as_pool
22:03:10.001 [info] DBConnection.ConnectionPool.init target 50 interval 1000
22:03:10.001 [info] DBConnection.ConnectionPool.start_poll
22:03:10.001 [info] DBConnection.ConnectionPool.start_idle
22:03:10.109 [info] >> Postgrex.connect
22:03:10.110 [info] >> Postgrex.connect
22:03:10.594 [info] DBConnection.prepare_execute: %Postgrex.Query{ref: nil, name: "", statement: ["CREATE TABLE ", "IF NOT EXISTS ", [34, "schema_migrations", 34], 32, 40, [[[], [[34, "version", 34], 32, "bigint", [[], []]], ", "], [34, "inserted_at", 34], 32, ["timestamp", "(0)"], [[], []]], [", ", "PRIMARY KEY (", [[], 34, "version", 34], ")"], 41, []], param_oids: nil, param_formats: nil, param_types: nil, columns: nil, result_oids: nil, result_formats: nil, result_types: nil, types: nil, cache: :reference}
22:03:10.594 [info] DBConnection.parse
22:03:10.702 [info] DBConnection.run
22:03:10.702 [info] DBConnection.checkout
22:03:10.702 [info] DBConnection.ConnectionPool.checkout
22:03:10.805 [info] DBConnection.ConnectionPool.handle_info checkout busy
22:03:11.098 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:11.098 [info] DBConnection.ConnectionPool.start_poll
22:03:11.098 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:11.098 [info] DBConnection.ConnectionPool.drop_idle
22:03:11.098 [info] DBConnection.ConnectionPool.start_idle
22:03:12.093 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:12.093 [info] DBConnection.ConnectionPool.start_poll
22:03:12.093 [info] DBConnection.ConnectionPool.timeout
22:03:12.195 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:12.195 [info] DBConnection.ConnectionPool.drop_idle
22:03:12.195 [info] DBConnection.ConnectionPool.start_idle
22:03:13.003 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:13.003 [info] DBConnection.ConnectionPool.start_poll
22:03:13.003 [info] DBConnection.ConnectionPool.timeout
22:03:13.092 [info] DBConnection.ConnectionPool.drop_slow {-576460738507, 100}
22:03:13.092 [info] DBConnection.ConnectionPool.drop
22:03:13.101 [info] DBConnection.checkout: error %DBConnection.ConnectionError{message: "connection not available and request was dropped from queue after 2196ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n  1. Ensuring your database is available and that you can connect to it\n  2. Tracking down slow queries and making sure they are running fast enough\n  3. Increasing the pool_size (although this increases resource consumption)\n  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n", severity: :error, reason: :queue_timeout}
22:03:13.199 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:13.199 [info] DBConnection.ConnectionPool.drop_idle
22:03:13.199 [info] DBConnection.ConnectionPool.start_idle

In the happy scenario, the connection seems to be made on time:

~ $ ./bin/myapp eval MyApp.Release.migrate
21:38:12.709 [info] DBConnection.ConnectionPool.start_link
21:38:12.793 [info] DBConnection.register_as_pool
21:38:12.891 [info] DBConnection.ConnectionPool.init target 50 interval 1000
21:38:12.891 [info] DBConnection.ConnectionPool.start_poll
21:38:12.891 [info] DBConnection.ConnectionPool.start_idle
21:38:12.997 [info] >> Postgrex.connect
21:38:12.999 [info] >> Postgrex.connect
21:38:13.699 [info] DBConnection.prepare_execute: %Postgrex.Query{ref: nil, name: "", statement: ["CREATE TABLE ", "IF NOT EXISTS ", [34, "schema_migrations", 34], 32, 40, [[[], [[34, "version", 34], 32, "bigint", [[], []]], ", "], [34, "inserted_at", 34], 32, ["timestamp", "(0)"], [[], []]], [", ", "PRIMARY KEY (", [[], 34, "version", 34], ")"], 41, []], param_oids: nil, param_formats: nil, param_types: nil, columns: nil, result_oids: nil, result_formats: nil, result_types: nil, types: nil, cache: :reference}
21:38:13.699 [info] DBConnection.parse
21:38:14.001 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:14.001 [info] DBConnection.ConnectionPool.start_poll
21:38:14.001 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:14.001 [info] DBConnection.ConnectionPool.drop_idle
21:38:14.001 [info] DBConnection.ConnectionPool.start_idle
21:38:14.095 [info] DBConnection.run
21:38:14.095 [info] DBConnection.checkout
21:38:14.095 [info] DBConnection.ConnectionPool.checkout
21:38:14.202 [info] DBConnection.ConnectionPool.handle_info checkout busy
21:38:14.893 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:14.893 [info] DBConnection.ConnectionPool.start_poll
21:38:15.013 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:15.013 [info] DBConnection.ConnectionPool.drop_idle
21:38:15.013 [info] DBConnection.ConnectionPool.start_idle
21:38:15.890 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:15.890 [info] DBConnection.ConnectionPool.start_poll
21:38:15.890 [info] DBConnection.ConnectionPool.timeout
21:38:16.087 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:16.087 [info] DBConnection.ConnectionPool.drop_idle
21:38:16.087 [info] DBConnection.ConnectionPool.start_idle
21:38:16.095 [info] << Postgrex.connect
21:38:16.095 [info] << Postgrex.connect
21:38:16.197 [info] Postgrex connect: {:ok, %Postgrex.Protocol{sock: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.6>, :tls_connection, :undefined}, [#PID<0.179.0>, #PID<0.177.0>]}}, connection_id: 2191998, connection_key: -613615798, peer: {{209, 38, 218, 180}, 25060}, types: {Postgrex.DefaultTypes, #Reference<0.822817904.78249985.212280>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.822817904.78118913.212392>, queries: #Reference<0.822817904.78249985.212386>, postgres: :idle, transactions: :naive, buffer: :active_once, disconnect_on_error_codes: [], scram: nil}}
21:38:16.197 [info] Postgrex connect: {:ok, %Postgrex.Protocol{sock: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.5>, :tls_connection, :undefined}, [#PID<0.178.0>, #PID<0.176.0>]}}, connection_id: 2191999, connection_key: -574140242, peer: {{209, 38, 218, 180}, 25060}, types: {Postgrex.DefaultTypes, #Reference<0.822817904.78249985.212280>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.822817904.78118913.212393>, queries: #Reference<0.822817904.78249985.212382>, postgres: :idle, transactions: :naive, buffer: :active_once, disconnect_on_error_codes: [], scram: nil}}
21:38:16.197 [info] DBConnection.ConnectionPool.handle_info :ets-transfer
21:38:16.197 [info] DBConnection.ConnectionPool.handle_checkin
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue_fast
21:38:16.197 [info] DBConnection.ConnectionPool.go
21:38:16.197 [info] DBConnection.ConnectionPool.handle_info :ets-transfer
21:38:16.197 [info] DBConnection.ConnectionPool.handle_checkin
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue_fast
21:38:16.197 [info] DBConnection.checkout: OK
...
21:38:16.786 [info] Migrations already up

Note that it still takes over 3 seconds to establish a connection (that’s a lot, ain’t it?).

My thoughts on this:

  • It looks like the DB connection is established asynchronously. If connecting takes a long time, queries are dropped from the queue instead of blocking until a connection is established.
  • postgrex has a connect timeout - it’s 5 seconds by default, so it doesn’t come into play here.
  • Not sure why it takes so much time to connect to the DB. Maybe this is related to SSL and the handshake?
  • Not sure why the app boots properly. Maybe it’s because the migrations only have a pool of 2 connections, whereas the app is configured to use 6. Do the migrations just run out of connections because they are still connecting?
  • Bumping :queue_target and :queue_interval to some high values makes the problem go away reliably (see the snippet below).
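
For reference, this is what I mean (a sketch for config/runtime.exs; DBConnection’s defaults are queue_target: 50 and queue_interval: 1000 milliseconds, which matches the “init target 50 interval 1000” line in the logs above):

config :myapp, MyApp.Repo,
  queue_target: 5_000,
  queue_interval: 10_000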

Since I’ve spent significant time on this, I hope to turn it into some sort of improvement (updating docs, improving the error message, filing an issue in one of the packages, etc.), but I first need to understand what the right solution is. Bumping the queueing params? Addressing slow connection times? Something else?

Looking forward to your thoughts on this! @dimitarvp @benwilson512 @ericmj @josevalim and basically anyone who has time and insight on this.

I am definitely biased in favor of increasing the values of :queue_target and :queue_interval because I had similar (emphasis: definitely NOT identical to yours) problems in a few hobby projects where I relied on slow, public 3rd-party APIs. I was stupid enough to try and pre-populate a Finch HTTP pool with 20 connections, and until they all got initialized, the DB connections failed in a similar manner to yours.

In those cases bumping the above two parameters eliminated the problem (though I also fixed my apps to do less stupid things as well :003:).

I am not as well versed as the core library maintainers, but naively I’d say just keep those two parameters bumped and add telemetry. Though definitely make sure to analyze and monitor the impact on the rest of the app’s operation; IMO bumping those parameters might mask problems with lower-performing code elsewhere in your app. I think it’s worth the small risk though.
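
For the telemetry part, even a minimal handler that logs queue times might surface this early (a sketch; it assumes the default [:myapp, :repo, :query] event name and Ecto’s :queue_time measurement, which is reported in native time units):

:telemetry.attach(
  "log-slow-db-checkouts",
  [:myapp, :repo, :query],
  fn _event, measurements, _metadata, _config ->
    # :queue_time is only present when the query actually went through the pool
    case measurements do
      %{queue_time: native} ->
        ms = System.convert_time_unit(native, :native, :millisecond)
        if ms > 100, do: IO.puts("slow pool checkout: #{ms}ms in queue")

      _ ->
        :ok
    end
  end,
  nil
)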

@stefanchrobot really nice job debugging. One thing that may help here is instead of running your migrations as an eval task, put the migrator in your supervision tree via Ecto.Migrator — Ecto SQL v3.10.1. This way it isn’t competing with other activity in your application, and it will also block application start until the migrations can run.
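
From the linked docs, the child spec looks like this (assuming ecto_sql >= 3.10, which ships Ecto.Migrator as a supervisable child):

children = [
  MyApp.Repo,
  {Ecto.Migrator,
   repos: Application.fetch_env!(:myapp, :ecto_repos),
   skip: System.get_env("SKIP_MIGRATIONS") == "true"},
  # ... the rest of the tree ...
]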

@dimitarvp I’ll make the parameters configurable via ENV and bump them for now. That sounds like the best short-term solution. But I am quite worried that it might hide some errors. Luckily, this is not a mission-critical app.
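
Something along these lines, probably (a sketch for config/runtime.exs; the ENV variable names are made up and the fallbacks are DBConnection’s defaults):

config :myapp, MyApp.Repo,
  queue_target: String.to_integer(System.get_env("DB_QUEUE_TARGET", "50")),
  queue_interval: String.to_integer(System.get_env("DB_QUEUE_INTERVAL", "1000"))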

@benwilson512 Thanks! First time seeing this approach, but it looks really interesting.

Yeah, we’ve been doing this for a long time since it synergizes really well with the K8s liveness and readiness probes. Basically what we have is two endpoints: /alive and /ready. /alive unconditionally returns a 200, but /ready looks like this:

  def ready(conn, _params) do
    if Application.get_env(:myapp, :ready) do
      json(conn, %{ready: true})
    else
      send_resp(conn, 503, "")
    end
  end

Then our supervision tree looks like:

Phoenix Endpoint,          # `/alive` already returns 200 here, but `/ready` is still 503
DB Migrator,
Repo,
...Other Children,
MyApp.DeploymentNotifier   # a one-shot GenServer that sets the env value used in `/ready`

K8s has two different checks for pods: the first is whether the pod is alive at all, the second whether it is ready, and it only hooks up pods to the load balancer once they are /ready. This is K8s-specific, but the same paradigm is found in other deployment setups as well. Basically it ensures your app has time to initialize everything it needs before getting traffic, while still giving some indication that it is alive and starting to boot.
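
The notifier itself can be tiny. A minimal sketch (hypothetical implementation; the real one just needs to run after everything above it in the supervision tree has started):

defmodule MyApp.DeploymentNotifier do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # By the time this child starts, everything above it in the tree
    # (including the migrator) has booted, so flip the flag that the
    # `/ready` endpoint checks.
    Application.put_env(:myapp, :ready, true)

    # Nothing left to do, so don't keep a process around.
    :ignore
  end
end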

I am experiencing the same issue, also with a managed DigitalOcean Postgres instance.

I’m fairly new to Elixir. Has there been a fix for this? Where can I bump up :queue_target and :queue_interval? The values seem to be part of DBConnection and I can’t tell where in the Ecto.Repo config to change this (if that is where it needs to be specified).