Best way to monitor Ecto's database connectivity?

Thanks to Elixir’s process supervision, the death of database connection processes does not take down a Phoenix app. However, it can be useful to notice and react when database connectivity is lost or regained.

I’m currently polling to determine this, like so:

    # Can I run a very simple query?
    case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT true", []) do
      {:ok, _} -> true
      _result -> false
    end

Is there a better way than polling? I looked for telemetry events in ecto, ecto_sql, and postgrex, but I don’t see any for connectivity.

1 Like

I have similar code, and I use it to delay starting the endpoint until the database connections are fully operational. The check itself makes sure I can execute 10 concurrent queries, against both the primary and replica databases, to confirm the connections are warm and working:

defmodule DB.ReadyChecks do
  require Logger

  def all do
    [
      &check_primary/0,
      &check_replica/0
    ]
  end

  defp check_primary do
    concurrently(10, fn ->
      try do
        {:ok, _} = Ecto.Adapters.SQL.query(DB.Repo, "SELECT pg_sleep(0.1)")
        true
      rescue
        e ->
          Logger.error("#{__MODULE__}.check_primary check failed because of:")
          Logger.error("#{inspect(e)}")
          false
      end
    end)
  end

  defp check_replica do
    concurrently(10, fn ->
      try do
        {:ok, _} = Ecto.Adapters.SQL.query(DB.Repo.replica(), "SELECT pg_sleep(0.1)")
        true
      rescue
        e ->
          Logger.error("#{__MODULE__}.check_replica check failed because of:")
          Logger.error("#{inspect(e)}")
          false
      end
    end)
  end

  # Runs `func` in `processes_count` parallel tasks; succeeds only if every task does.
  defp concurrently(processes_count, func) do
    1..processes_count
    |> Enum.map(fn _ -> Task.async(fn -> func.() end) end)
    |> Enum.map(&Task.await/1)
    |> Enum.all?(& &1)
  end
end

The difference from your code is that my queries block and run in parallel, so the check fails if the pool has fewer than 10 connections available to check out at the same time.
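For reference, a sketch of how checks like these could gate endpoint startup, using a child placed before the endpoint in the supervision tree. The module name, retry count, and supervisor layout below are assumptions, not my actual setup:

    # Hypothetical child started after the Repo but before the Endpoint:
    #
    #   children = [
    #     DB.Repo,
    #     DB.WaitUntilReady,   # blocks in start_link until the checks pass
    #     MyAppWeb.Endpoint
    #   ]
    defmodule DB.WaitUntilReady do
      require Logger

      def child_spec(_opts) do
        %{id: __MODULE__, start: {__MODULE__, :start_link, []}, restart: :transient}
      end

      # Supervisors start children in order and synchronously, so running the
      # checks here delays the Endpoint until the database is ready.
      def start_link do
        wait(30)
        :ignore
      end

      defp wait(0), do: raise("database never became ready")

      defp wait(retries) do
        if Enum.all?(DB.ReadyChecks.all(), & &1.()) do
          :ok
        else
          Logger.info("Database not ready yet, retrying...")
          Process.sleep(1_000)
          wait(retries - 1)
        end
      end
    end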

1 Like

But overall, error monitoring is probably what you want to have: when you see DBConnection errors (timeouts or similar), it means your database or pool is struggling.

The purpose of my connectivity monitoring is a Kubernetes liveness probe. If db connectivity is lost - because of a network blip, or because the db credentials have changed - I have a route that returns 503, signaling the K8s liveness probe to restart the container (and perhaps load new configuration when it does). A custom header on that route lets K8s specify how much db downtime is considered unacceptable, and a GenServer tracks the downtime so we can compare it against the header.
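A rough sketch of what such a setup could look like; the module names, header name, and polling interval below are illustrative assumptions, not the actual implementation:

    defmodule MyApp.DBDowntime do
      @moduledoc "Hypothetical GenServer tracking how long DB connectivity has been lost."
      use GenServer

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

      # Seconds of continuous downtime; 0 if the last check succeeded.
      def downtime_seconds, do: GenServer.call(__MODULE__, :downtime)

      @impl true
      def init(_opts) do
        send(self(), :check)
        {:ok, %{down_since: nil}}
      end

      @impl true
      def handle_call(:downtime, _from, %{down_since: nil} = state), do: {:reply, 0, state}

      def handle_call(:downtime, _from, %{down_since: t} = state),
        do: {:reply, System.monotonic_time(:second) - t, state}

      @impl true
      def handle_info(:check, state) do
        Process.send_after(self(), :check, :timer.seconds(5))

        state =
          case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT true", []) do
            {:ok, _} -> %{state | down_since: nil}
            _ -> %{state | down_since: state.down_since || System.monotonic_time(:second)}
          end

        {:noreply, state}
      end
    end

    defmodule MyAppWeb.LivenessPlug do
      @moduledoc "Hypothetical liveness endpoint: 503 once downtime exceeds the header's threshold."
      @behaviour Plug
      import Plug.Conn

      @impl true
      def init(opts), do: opts

      @impl true
      def call(conn, _opts) do
        # Header name is made up; K8s sends how many seconds of downtime it tolerates.
        max_downtime =
          case get_req_header(conn, "x-max-db-downtime-seconds") do
            [value | _] -> String.to_integer(value)
            [] -> 0
          end

        if MyApp.DBDowntime.downtime_seconds() > max_downtime do
          send_resp(conn, 503, "database unavailable")
        else
          send_resp(conn, 200, "ok")
        end
      end
    end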

There is an initContainer in the same pod which uses the same credentials to run migrations before the main container starts, so that should ensure the application container doesn’t start until it has db connectivity.

All of this is still in development, so I can’t say how well it works yet. In local development, though, if I point the application container at the database on the host machine (outside the cluster) via host.docker.internal and then stop the database, I do see the container get restarted until I start the database again and it can reconnect.

@Ecto-Core-Team if it’s possible to add telemetry events for “db connectivity acquired” or “db connectivity lost”, I think that would be useful.

DBConnection just added this telemetry event; would it be interesting for you?

2 Likes

Hmmm. That is interesting, but I’m not sure how I would detect when things are OK again.
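One idea (a sketch, assuming the event in question is [:db_connection, :connection_error], and that a trivial query is still how I’d confirm recovery): use the telemetry event to flag “unhealthy” immediately, then poll only while unhealthy until a simple query succeeds again:

    defmodule MyApp.DBHealth do
      @moduledoc """
      Hypothetical sketch: mark the DB unhealthy as soon as a connection error is
      reported via telemetry, then poll until a trivial query succeeds again.
      Assumes the db_connection event name is [:db_connection, :connection_error].
      """
      use GenServer
      require Logger

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

      def healthy?, do: GenServer.call(__MODULE__, :healthy?)

      @impl true
      def init(_opts) do
        :telemetry.attach(
          "db-health-connection-errors",
          [:db_connection, :connection_error],
          &__MODULE__.handle_event/4,
          nil
        )

        {:ok, %{healthy?: true}}
      end

      # Telemetry handlers run in the process that emitted the event, so just
      # notify our server.
      def handle_event(_event, _measurements, _metadata, _config) do
        GenServer.cast(__MODULE__, :connection_error)
      end

      @impl true
      def handle_call(:healthy?, _from, state), do: {:reply, state.healthy?, state}

      @impl true
      def handle_cast(:connection_error, state) do
        # Schedule a single recheck loop only on the healthy -> unhealthy transition.
        if state.healthy?, do: Process.send_after(self(), :recheck, :timer.seconds(5))
        {:noreply, %{state | healthy?: false}}
      end

      @impl true
      def handle_info(:recheck, state) do
        # Poll only while unhealthy; stop once a trivial query works again.
        case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT true", []) do
          {:ok, _} ->
            Logger.info("Database connectivity regained")
            {:noreply, %{state | healthy?: true}}

          _ ->
            Process.send_after(self(), :recheck, :timer.seconds(5))
            {:noreply, state}
        end
      end
    end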