Thanks to Elixir’s process supervision, the death of database connection processes does not take down a Phoenix app. However, it can be useful to notice and react when database connectivity is lost or regained.
I’m currently determining this by polling, like so:
```elixir
# Can I run a very simple query?
case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT true", []) do
  {:ok, _} -> true
  _result -> false
end
```
Is there a better way than polling? I looked for telemetry events in ecto, ecto_sql, and postgrex, but I don’t see any for connectivity.
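For reference, Ecto does emit a per-query telemetry event, so one partial alternative is to watch query results for connection errors rather than polling. This is only a sketch: it assumes the default telemetry prefix for a repo named `MyApp.Repo` (i.e. `[:my_app, :repo, :query]`; adjust if you set `:telemetry_prefix`), and it can only notice lost connectivity when queries actually run, so it cannot by itself tell you when connectivity is *regained*.

```elixir
# Sketch: attach to Ecto's per-query telemetry event and log when a query
# fails with a DBConnection.ConnectionError. Event name is an assumption
# based on the default :telemetry_prefix for MyApp.Repo.
defmodule MyApp.ConnectivityWatcher do
  require Logger

  def attach do
    :telemetry.attach(
      "connectivity-watcher",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, _measurements, %{result: result}, _config) do
    case result do
      {:error, %DBConnection.ConnectionError{} = err} ->
        Logger.warning("Database connectivity problem: #{Exception.message(err)}")

      _other ->
        :ok
    end
  end
end
```

Because a quiet system emits no query events, some polling is still needed if you want to detect recovery promptly.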
I have similar code, and I use it to delay starting the endpoint until the database connections are fully operational. The check I perform makes sure I can execute 10 concurrent queries against both the primary and the replica database, to confirm the connections are warm and operational:
```elixir
defmodule DB.ReadyChecks do
  require Logger

  def all do
    [
      &check_primary/0,
      &check_replica/0
    ]
  end

  defp check_primary do
    concurrently(10, fn ->
      try do
        {:ok, _} = Ecto.Adapters.SQL.query(DB.Repo, "SELECT pg_sleep(0.1)")
        true
      rescue
        e ->
          Logger.error("#{__MODULE__}.check_primary check failed because of:")
          Logger.error("#{inspect(e)}")
          false
      end
    end)
  end

  defp check_replica do
    concurrently(10, fn ->
      try do
        {:ok, _} = Ecto.Adapters.SQL.query(DB.Repo.replica(), "SELECT pg_sleep(0.1)")
        true
      rescue
        e ->
          Logger.error("#{__MODULE__}.check_replica check failed because of:")
          Logger.error("#{inspect(e)}")
          false
      end
    end)
  end

  # Spawn exactly processes_count tasks and require them all to succeed.
  defp concurrently(processes_count, func) do
    1..processes_count
    |> Enum.map(fn _ -> Task.async(func) end)
    |> Enum.map(&Task.await/1)
    |> Enum.all?(& &1)
  end
end
```
The difference from your code is that my queries block and run in parallel, so the check fails if the pool has fewer than 10 connections available to check out at the same time.
But overall, error monitoring is probably what you want: when you see DBConnection errors (timeouts or similar), it means your database or pool is struggling.
The purpose of my connectivity monitoring is a Kubernetes liveness probe. If database connectivity is lost - because of a network blip, or because the database credentials have changed - I have a route which returns 503, signaling the K8s liveness probe to restart that container (and perhaps load new configuration when it does). A custom header on that route lets K8s specify how much database downtime is considered unacceptable, and a GenServer tracks the downtime so it can be compared against the header.
There is an initContainer in the same pod which uses the same credentials to run migrations before the main container starts, which should ensure the application container doesn’t start until it has database connectivity.
All of this is still in development, so I can’t say yet how well it works. In local development, though, if I point the application container at the database on the host machine (outside the cluster) via host.docker.internal and then stop the database, I do see the container get restarted repeatedly until I start the database again and it can reconnect.