Have Ecto wait for DB to spin up?

mbklein · April 20, 2022, 9:08pm

We’re considering using AWS Aurora Serverless PostgresSQL for our developer environment. I’ve already determined that it will meet our needs as far as Postgres compatibility, but I’m looking for help getting over the last inconvenience. Whenever the DB is idle for a certain amount of time, it spins down completely. It spins itself back up when the next connection attempt is made, but it takes ~20-30 seconds. What this means is that when I start my Elixir/Ecto/Postgrex application, I get a whole lot of DBConnection errors. I can then wait 20 seconds and restart and everything works fine.

One other thing that’s probably relevant is that the port is open and the cluster is listening even when it’s spun all the way down. That is, I can do the following:

See cluster capacity at 0
Run nc -v DB_HOST 5432 on my client, and immediately get Ncat: Connected to IP_ADDR:5432 in return
See cluster capacity spinning up to 2 just from that port connection

So I’m guessing that Postgrex isn’t failing to connect; it’s just failing to get a response in a timeframe that makes DBConnection happy.

I’ve set timeout, connect_timeout, and handshake_timeout to 60 seconds, but I’m still getting errors from GenServers and other processes that start up and immediately try to query the Repo, e.g.,

** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2918ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

I have no idea what appropriate queue_target and queue_interval settings would be for this particular case. I’d also love to be able to display some kind of “Waiting for database to spin up…” message.

Any thoughts on this would be greatly appreciated.

joaoevangelista · April 20, 2022, 10:34pm

Reading the docs, I got the idea that the queue_ options are for preventing overload, but found a interesting one: after_connect_timeout

The maximum time allowed to perform function specified by :after_connect option (default: 15_000 ).

There is also a configure option, that you can use to print messages and reconfigure before any attempt to connect:

A function to run before every connect attempt to dynamically configure the options, either a 1-arity fun, {module, function, args} with options prepended to args or nil where only returned options are passed to connect callback (default: nil )

Here is the full docs on the options for DBConnect: DBConnection — db_connection v2.6.0

mbklein · April 21, 2022, 5:10am

Ah, so if after_connect blocks, I could use it to wait for the connection to actually return something, with an after_connect_timeout of 60 seconds or so. Worth a try. Thank you!

elixirium · August 23, 2022, 11:57am

@mbklein I’ve run into similar challenges. How did you solve this problem?

mbklein · September 28, 2022, 9:20pm

I haven’t really solved it yet. I’ve got an acceptable (for our purposes) workaround, but it’s not great. In my application’s Ecto.Repo implementation, I added two functions:

  def wait_for_connection do
    Logger.info("Waiting for active database connection...")
    canary() |> WaitForIt.wait(timeout: 60_000, frequency: 1_000)
  end

  defp canary do
    case __MODULE__.query("SELECT 1") do
      {:ok, %{rows: [[1]]}} -> :ok
      _ -> :error
    end
  rescue
    _ in DBConnection.ConnectionError -> :error
  end

Then I updated my Application module to start the repo alone under a DynamicSupervisor and call its wait_for_connection() before starting any of the application’s other child processes. I still get a lot of junk in the log while it waits for the DB to spin up, but it works reasonably well.

Note: This solution requires the wait_for_it hex package.

joey_the_snake · September 29, 2022, 4:44am

What about waiting for Postgres to spin up before starting the application? If you’re using docker-compose have one of the containers talk to Postgres until it’s up and once it does trigger the application to start up.

mbklein · September 29, 2022, 5:51am

I’m not running that way. Everything is running on AWS. The database autoscales – it spins down when it’s idle for more than a particular period of time, and the only way to spin it back up is to attempt a connection.

I could probably have the process that starts my app wait for the connection instead of having ecto do it, but it would be great if the app can do it all on its own.

joey_the_snake · September 29, 2022, 6:29am

Ah I see. I’m still trying to figure out exactly what is happening in your case. The error you showed in your first post means that your process tried to get a connection from the pool but it waited too long and stopped trying:

** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2918ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

I’m trying to figure out what could be blocking the pool. When you got this error did you have any after_connect callbacks on your connections? Your connection is a GenServer and after_connect sends a cast message to the connection before returning to the caller. So it blocks the connection but not the app from continuing.