Load balancing DB connections with Ecto

Hi all :wave: Uku here from Plausible Analytics.

We are moving from a single ClickHouse database instance to a multi-primary clustered setup. In this setup, connection management and load balancing are important considerations. We want to be able to:

  • keep separate connection pools for all database instances
  • only route queries to healthy database instances
  • use a least-conn balancing strategy over round-robin / random if possible

This can be done by adding a TCP or HTTP-level load balancer in between the app and database.

However, I don’t see why we couldn’t do connection handling, health checking, and least-conn balancing within Elixir. The benefit would be a lower infrastructure & ops load for the team.

What’s the state of the art in load balancing between Ecto Repos?

I read through the docs on Ecto Replicas and Dynamic Repositories. What seems to be missing is:

  1. Check if Myapp.Repo.Replica1 is healthy
  2. Check the number of active connections on each replica to choose the one with least active connections for the next query (instead of Enum.random)
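
For reference, the pattern from the Replicas guide looks roughly like this (module names are illustrative); the selection step is a plain Enum.random with no health awareness:

```elixir
defmodule MyApp.Repo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres

  @replicas [MyApp.Repo.Replica1, MyApp.Repo.Replica2]

  # Pick any replica at random -- no health check, no load awareness.
  def replica, do: Enum.random(@replicas)

  # Define one read-only repo per replica.
  for repo <- @replicas do
    defmodule repo do
      use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres, read_only: true
    end
  end
end
```

Both checks above would have to slot into replica/0, which is where the guide currently offers nothing.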

Are there any facilities in Ecto for this?


A repo crashes if it cannot connect, so Process.whereis or monitoring the repo should work. Not sure what else you’d consider for your definition of “healthy” though.

There’s afaik no way to know from the outside whether a pool connection is active. The best Ecto exposes for judging utilization of a connection pool is the idle_time/queue_time measurements on query telemetry.
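
For example, a telemetry handler on the query event can at least surface queue_time as a proxy for pool saturation. A minimal sketch, assuming a repo named MyApp.Repo with the default event prefix:

```elixir
# Attach to Ecto's query telemetry event; the [:my_app, :repo]
# prefix assumes a repo named MyApp.Repo with default config.
:telemetry.attach(
  "log-queue-time",
  [:my_app, :repo, :query],
  fn _event, measurements, _metadata, _config ->
    # queue_time is how long the caller waited for a connection
    # from the pool, reported in native time units.
    case measurements do
      %{queue_time: queue_time} when is_integer(queue_time) ->
        ms = System.convert_time_unit(queue_time, :native, :millisecond)
        IO.puts("waited #{ms}ms for a pool connection")

      _ ->
        :ok
    end
  end,
  nil
)
```

A consistently rising queue_time across queries suggests the pool is saturated, even though the number of busy connections itself is not exposed.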

Thanks @LostKobrakai

A repo crashes if it cannot connect, so Process.whereis or monitoring the repo should work

I tested this and it doesn’t work as far as I can tell. The repo process is still alive after killing the database manually:

iex(5)> Process.whereis(Plausible.ClickhouseRepo)
#PID<0.880.0>
[error] Clickhousex.Protocol (#PID<0.890.0>) disconnected: ** (Clickhousex.Error) closed
[error] Clickhousex.Protocol (#PID<0.889.0>) disconnected: ** (Clickhousex.Error) closed
[error] ... many more disconnected errors
iex(6)> Process.whereis(Plausible.ClickhouseRepo)
#PID<0.880.0>
[error] Clickhousex.Protocol (#PID<0.888.0>) failed to connect: ** (ErlangError) Erlang error: :econnrefused
[error] ... the same failed-to-connect error repeats for every pool connection
iex(7)> Process.whereis(Plausible.ClickhouseRepo)
#PID<0.880.0>
iex(8)> Process.whereis(Plausible.ClickhouseRepo)
#PID<0.880.0>
iex(9)> Process.whereis(Plausible.ClickhouseRepo)
#PID<0.880.0>

As you can see, the PID is always the same, which tells me the repo process never crashes.

There’s afaik no way to know if a pool connection is active or not from the outside.

Could this be added to Ecto? Something like ‘Hey Repo, tell me how many connections are currently active’. The number of active connections in a pool seems like a useful metric to monitor in general.

I always thought the failed to connect: ** … logs meant the repo was crashing; seems I was wrong about that one. Another check could simply be attempting a query and seeing if it succeeds.
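
A minimal sketch of that query-based check, assuming a SQL-speaking repo that exposes query/3 (as Ecto.Adapters.SQL-based repos do); healthy?/1 is a hypothetical helper:

```elixir
defmodule MyApp.Health do
  # Hypothetical helper: treat a repo as healthy if a trivial
  # query round-trips within a short timeout.
  def healthy?(repo) do
    case repo.query("SELECT 1", [], timeout: 1_000) do
      {:ok, _result} -> true
      {:error, _reason} -> false
    end
  rescue
    # Some adapters raise (e.g. DBConnection.ConnectionError)
    # instead of returning an error tuple.
    _ -> false
  end
end
```

Run periodically, this catches the "repo alive but database gone" case that Process.whereis misses.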

Hey Repo, how many connections?

Interesting, do we not have something like that in the live_dashboard? I know there was a ton of cool stuff there regarding the db. If not, I can see how that may be useful.

On a different note: there is a connection_listeners concept in db_connection. Here is an example of how to implement a simple one: db_connection/db_connection.ex at master · elixir-ecto/db_connection · GitHub

Maybe this is an interim solution to check out.
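
A rough sketch of such a listener, assuming db_connection’s :connection_listeners option (which sends {:connected, pid} and {:disconnected, pid} messages to the listed processes); PoolWatcher is an illustrative name:

```elixir
defmodule PoolWatcher do
  use GenServer

  # Tracks which pool connections are currently established,
  # based on db_connection's listener messages.
  def start_link(_opts),
    do: GenServer.start_link(__MODULE__, MapSet.new(), name: __MODULE__)

  def connected_count, do: GenServer.call(__MODULE__, :count)

  @impl true
  def init(conns), do: {:ok, conns}

  @impl true
  def handle_call(:count, _from, conns),
    do: {:reply, MapSet.size(conns), conns}

  @impl true
  def handle_info({:connected, pid}, conns),
    do: {:noreply, MapSet.put(conns, pid)}

  def handle_info({:disconnected, pid}, conns),
    do: {:noreply, MapSet.delete(conns, pid)}
end
```

It would be wired up via repo config, e.g. connection_listeners: [PoolWatcher] alongside the pool options. Note this tracks established connections, not checked-out ones, so it answers the health question more than the least-conn one.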


It seems better suited to a dedicated load balancer, tbh.

If you check the status of each pool before choosing one, you increase latency by adding message round-trips before every query, eat into each pool’s message queue (making the pools themselves slower), and the status can easily change between your status request and the actual query execution anyway.

If you run your own load balancer process, it will need to handle checkin/checkout messages to track the state of each pool, and those messages will compete with the actual requests for connections. Because the balancer is a single process, it can get backed up quickly.

From my understanding there has been a lot of research into pooling strategies, and random, while not perfect, trumps most other strategies when you consider not just performance but performance relative to the complexity needed to implement it. So yeah, I think keeping it simple is not a bad approach.

Hey @ukutaht. I think the db_connection library is going to be more useful here. If you want to do your own load balancing, you’ll need lower-level access to the individual connections than Ecto provides. Ecto uses db_connection under the hood, so it’s a well-tested library.

What about dev load for the team? I agree with @LostKobrakai that random is probably good enough in most cases. Another challenge is that least-conn doesn’t tell us anything about query complexity and actual DB load.

If those are hard requirements, I’d keep a dynamic pool of repos as laid out in the linked article, with simple monitoring of whether queries can still be executed (think SELECT 1, though perhaps some DB internal could be used as a health metric/indicator). If a query turns out to be very slow, take that repo out of the pool of replicas.
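
One possible shape for that, as a sketch only (replica names, the probe interval, and the SELECT 1 probe are all assumptions):

```elixir
defmodule MyApp.ReplicaPool do
  use GenServer

  # Illustrative replica repos and probe interval.
  @replicas [MyApp.Repo.Replica1, MyApp.Repo.Replica2]
  @interval :timer.seconds(10)

  def start_link(_opts),
    do: GenServer.start_link(__MODULE__, @replicas, name: __MODULE__)

  # Pick a random replica among those that passed the last probe.
  def replica, do: Enum.random(GenServer.call(__MODULE__, :healthy))

  @impl true
  def init(replicas) do
    send(self(), :probe)
    {:ok, replicas}
  end

  @impl true
  def handle_call(:healthy, _from, healthy), do: {:reply, healthy, healthy}

  @impl true
  def handle_info(:probe, _healthy) do
    healthy =
      Enum.filter(@replicas, fn repo ->
        match?({:ok, _}, repo.query("SELECT 1", [], timeout: 1_000))
      end)

    Process.send_after(self(), :probe, @interval)
    # Fall back to all replicas rather than an empty list, so
    # Enum.random/1 in replica/0 never raises.
    {:noreply, if(healthy == [], do: @replicas, else: healthy)}
  end
end
```

Slow-query eviction could be layered on top by also watching query telemetry, but even this simple probe covers the "only route to healthy instances" requirement.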

This seems to me the lowest possible load for the team while sticking to the requirements and dropping one wish :wink:

I also think that a dedicated load balancer might fit better here than choosing random replicas with no periodic feedback about the DB servers’ health and current load.

Since Plausible Analytics uses ClickHouse, and the primary goal is scalability, I presume, have you considered moving to distributed table engines? Forgive me if that’s already the case; it’s been a while since I last followed the Plausible Analytics source code.

Altinity has an insightful blog post about distribution and data sharding schemes in ClickHouse.

Quoting from their post:

So if there is a single-node table that needs to be scaled out to multiple servers, the procedure is as follows:

  1. Create shard tables at new servers
  2. Re-distribute data if necessary (manually is better, though ClickHouse has some support for re-sharding as well)
  3. Define cluster configuration and place it on every ClickHouse node
  4. Create a Distributed table to access the data from multiple shards

Cluster extension is even easier, since it only requires a change in the config file to include the new server(s).

I also noticed that PostHog uses this mechanism to scale their main events tables for their enterprise setup - docs briefing here.

Letting ClickHouse’s own engines manage this overhead might be a choice worth considering.

P.S. I would like to take the chance to thank you and the rest of the Plausible team for the awesome work. I learned a lot reading your open source contributions on GitHub.