Idiomatic setup for Ecto replicas with runtime configuration

Okay, this one may get a bit in the weeds but here goes. First, while my application is a Phoenix app, this is more of an Ecto/Elixir setup question. Please let me know if this should go in the general category.

  • My application is multi-tenant.
  • We achieve multi-tenancy with Postgres schemas.
  • It runs on AWS ECS containers.
  • It uses AWS RDS Aurora Postgres as the database backend.

The requirements:

  • Multiple read-only replicas are required. This is a very high traffic system that’s heavy on the reads.
  • Configuration options must be passed at runtime; we use AWS Secrets Manager to inject sensitive data.
  • Different ECS clusters may connect to different RDS clusters.

So the problem I’m facing is: How do I build a system that can connect to different database clusters with different read replicas depending on the cluster configuration it’s running in?

What I have now is this bunch of bananas. It works, but it has one very specific limitation.

First, I’m using a runtime config value (in config/runtime.exs) to read in a comma-separated list of replica endpoints:

config :snw_bowman,
  pod_replicas: String.split(System.get_env("POD_REPLICAS") || "localhost", ",", trim: true)
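For illustration, the env var is just a comma-separated string of hostnames (the endpoint names below are hypothetical):

```elixir
# Hypothetical value injected by ECS / Secrets Manager:
# POD_REPLICAS=replica-1.internal,replica-2.internal
String.split("replica-1.internal,replica-2.internal", ",", trim: true)
#=> ["replica-1.internal", "replica-2.internal"]
```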

Then in my Repo file I have:

defmodule SnwBowman.Repo do
  @moduledoc false
  require Logger

  use Ecto.Repo,
    otp_app: :snw_bowman,
    adapter: Ecto.Adapters.Postgres

  @replicas [
    SnwBowman.Repo.ReadOnly1,
    SnwBowman.Repo.ReadOnly2
  ]

  def replica do
    Enum.random(@replicas)
  end

  @doc """
  Starts the read-only replicas. This function is called by the GenServer
  defined below, and should not be called directly. The GenServer is started
  under the main supervision tree.
  """
  def start_replicas(hosts) do
    conf = Keyword.put(config(), :read_only, true)

    for {repo, index} <- Enum.with_index(@replicas) do
      replica_conf =
        conf
        |> Keyword.put(:hostname, Enum.at(hosts, index))
        |> Keyword.put(:name, repo)

      case repo.start_link(replica_conf) do
        {:ok, _} -> :ok
        {:error, reason} -> Logger.error("Failed to start replica #{index}: #{inspect(reason)}")
      end
    end
  end

  for repo <- @replicas do
    defmodule repo do
      use Ecto.Repo,
        otp_app: :snw_bowman,
        adapter: Ecto.Adapters.Postgres
    end
  end
end

defmodule SnwBowman.Repo.Replicas do
  @moduledoc false
  @name :snw_bowman_repo_replicas

  use GenServer

  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: @name)
  end

  def init(hosts: hosts) do
    SnwBowman.Repo.start_replicas(hosts)
    {:ok, %{}}
  end
end

The key here being start_replicas/1 and the GenServer module at the bottom. Those are used in the application.ex file like this:

defmodule SnwBowman.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children =
      [
        ...
        # Start the Ecto repository
        SnwBowman.Repo,
        # Start the replicas
        {SnwBowman.Repo.Replicas, hosts: Application.get_env(:snw_bowman, :pod_replicas)}
        ...
      ]

    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: SnwBowman.Supervisor]
    Supervisor.start_link(children, opts)
  end

end

This allows me to sub in a different :hostname value for each replica, have them configured at run time, and start under the main supervision tree.
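For context, call sites follow the usual read-replica pattern from the Ecto guides: writes go through the primary repo, reads go through replica/0 (the Tenant schema below is hypothetical):

```elixir
# Writes always hit the primary repo:
SnwBowman.Repo.insert!(%Tenant{name: "acme"})

# Reads pick one of the replica repos at random via replica/0:
SnwBowman.Repo.replica().all(Tenant)
```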

The biggest problem is that it requires that ALL clusters have the same number of read replicas.

The second problem is that this just feels… odd. Like there should be some code smells, but I can’t see them. Is this idiomatic Elixir? Is there a better way to handle this?

Would love to hear some feedback and suggestions.

Thanks in advance,
~mike

I’d suggest looking at Ecto.Repo — Ecto v3.11.2 to deal with dynamic repos instead of dynamic module generation.

Thanks for the suggestion! My initial worry with dynamic repos is that the databases backing the replicas will always be read-only. So for each call I’d need to set then reset the dynamic repo on every call, right? Seems like it’d be messy.
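To make the set-then-reset worry concrete, a minimal sketch of what it would look like with dynamic repos, assuming the replica instances are started under known names (the helper module and names are hypothetical, but get_dynamic_repo/0 and put_dynamic_repo/1 are part of the Ecto.Repo API):

```elixir
defmodule SnwBowman.Repo.ReadOnly do
  @moduledoc """
  Hypothetical helper: run a query function against a named dynamic
  repo, then restore whatever repo the calling process used before.
  """
  def with_replica(repo_name, fun) do
    previous = SnwBowman.Repo.get_dynamic_repo()
    SnwBowman.Repo.put_dynamic_repo(repo_name)

    try do
      fun.()
    after
      # Restore the previous repo even if the query raises.
      SnwBowman.Repo.put_dynamic_repo(previous)
    end
  end
end

# Usage:
# SnwBowman.Repo.ReadOnly.with_replica(:replica_1, fn ->
#   SnwBowman.Repo.all(Tenant)
# end)
```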

In the end, what I came up with is this lovely bunch of coconuts:

defmodule SnwBowman.Repo.Replicas do
  @moduledoc false
  @name :snw_bowman_repo_replicas
  require Logger
  use GenServer

  def start_link(args) do
    GenServer.start_link(__MODULE__, args, name: @name)
  end

  def init(hosts: hosts) do
    replicas =
      define_replicas(hosts)
      |> Enum.with_index()
      |> Enum.map(fn {repo, index} ->
        start_replica!(repo, Enum.at(hosts, index))
      end)

    {:ok, %{replicas: replicas}}
  end

  @doc """
  Returns a list of initialized replicas.
  """
  def get_replicas() do
    GenServer.call(@name, :get_replicas)
  end

  def handle_call(:get_replicas, _from, %{replicas: replicas} = state) do
    {:reply, replicas, state}
  end

  defp start_replica!(replica, host) do
    conf =
      SnwBowman.Repo.config()
      |> Keyword.merge(
        read_only: true,
        hostname: host,
        name: replica
      )

    case replica.start_link(conf) do
      {:ok, _} ->
        replica

      {:error, reason} ->
        raise("Failed to start replica #{host}: #{inspect(reason)}")
    end
  end

  defp define_replicas(hosts) do
    for {_host, index} <- Enum.with_index(hosts) do
      defmodule Module.concat(["SnwBowman.Repo", "ReadOnly#{index + 1}"]) do
        use Ecto.Repo,
          otp_app: :snw_bowman,
          adapter: Ecto.Adapters.Postgres
      end
    end
    |> Enum.map(fn {:module, replica, _, _} -> replica end)
  end
end

It’s started the same way as the original code, and the main Repo module still has a replica/0 function.

Not sure if it’s the best solution, but it works.

You can use two base repos. One for the write repo and one for replicas.
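A minimal sketch of that idea, assuming the replica instances are started per host as dynamic repos (module and atom names beyond the Ecto API are hypothetical):

```elixir
defmodule SnwBowman.ReplicaRepo do
  # One read-only base repo; one instance per replica host is
  # started at boot as a dynamic repo.
  use Ecto.Repo,
    otp_app: :snw_bowman,
    adapter: Ecto.Adapters.Postgres,
    read_only: true
end

# At boot, add one child per host to the supervision tree:
# for {host, i} <- Enum.with_index(hosts) do
#   {SnwBowman.ReplicaRepo, name: :"replica_#{i}", hostname: host}
# end
```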

You need to select a repo either way, whether you generate modules for them or use dynamic repos (e.g. registered in a Registry). You can even hide some of that behind a module also adhering to the repo behaviour (see e.g. fly_postgres_elixir/lib/repo.ex at main · superfly/fly_postgres_elixir · GitHub)


Interesting… So with this sort of solution, the connections would be created on demand instead of at boot time. Wonder what kind of performance impact that might have. I’ll need to hook up some telemetry to test it out.

Depends on how you set things up. You could also start them in advance and call put_dynamic_repo wherever you query for things. put_dynamic_repo is just a write to the process dictionary, so very fast.
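To make the process-dictionary point concrete, a quick sketch (repo and schema names hypothetical):

```elixir
# put_dynamic_repo/1 only affects the calling process, so each
# request process can pick its own replica without coordination:
SnwBowman.ReplicaRepo.put_dynamic_repo(:replica_1)
SnwBowman.ReplicaRepo.all(Tenant)   # runs against :replica_1
```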