We are moving from a single ClickHouse database instance to a multi-primary clustered setup. In this setup, connection management and load balancing become important considerations. We want to be able to:
keep separate connection pools for all database instances
only route queries to healthy database instances
use a least-conn balancing strategy over round-robin / random, if possible
This can be done by adding a TCP or HTTP-level load balancer in between the app and database.
However, I don't see why we can't do connection handling, health checking, and least-conn balancing within Elixir. The benefit would be keeping the infrastructure and ops load lower for the team.
What's the state of the art in load balancing between Ecto Repos?
A repo crashes if it cannot connect, so Process.whereis or monitoring the repo should work. Not sure what else you'd consider for your definition of "healthy" though.
There's, afaik, no way to know from the outside whether a pool connection is active or not. The best Ecto exposes for judging utilization of a connection pool is the idle_time/queue_time measurements on query telemetry.
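Those measurements can be picked up with a telemetry handler. A minimal sketch, assuming a repo whose telemetry prefix is `[:my_app, :repo]` (the prefix depends on your repo's configuration):

```elixir
defmodule PoolStats do
  # Attach to Ecto's per-query telemetry event and surface pool-pressure
  # indicators. The event name [:my_app, :repo, :query] is an assumption;
  # it follows your repo's configured telemetry_prefix.
  def attach do
    :telemetry.attach(
      "pool-stats",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, _metadata, _config) do
    # queue_time: how long the caller waited to check out a connection.
    # idle_time: how long the connection sat unused before this query.
    # Both arrive in native time units; a steadily growing queue_time
    # suggests the pool is saturated.
    queue_ms = to_ms(measurements[:queue_time])
    idle_ms = to_ms(measurements[:idle_time])
    IO.puts("queue_time=#{queue_ms}ms idle_time=#{idle_ms}ms")
  end

  defp to_ms(nil), do: 0
  defp to_ms(native), do: System.convert_time_unit(native, :native, :millisecond)
end
```

Feeding these numbers into your metrics system gives you a rough utilization signal per repo, even though the pool's internal state isn't directly observable.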
As you can see, the PID is always the same, which tells me the process never crashes.
> There's afaik no way to know if a pool connection is active or not from the outside.
Could this be added to Ecto? Something like "Hey Repo, tell me how many connections are currently active". The number of active connections in a pool seems like a useful metric to monitor in general.
I always thought the failed to connect: ** … logs meant the repo was crashing; seems like I was wrong about that one. Another check could simply be attempting a query and seeing if that works.
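A query-based check could look like the following sketch, assuming an Ecto SQL repo (a ClickHouse adapter built on ecto_sql exposes the same query function):

```elixir
defmodule RepoHealth do
  # Returns true if the repo can execute a trivial query within the timeout.
  # The repo module and timeout are up to the caller; "SELECT 1" is just the
  # cheapest round-trip the database will accept.
  def healthy?(repo, timeout \\ 1_000) do
    case Ecto.Adapters.SQL.query(repo, "SELECT 1", [], timeout: timeout) do
      {:ok, _result} -> true
      {:error, _reason} -> false
    end
  rescue
    # query/4 raises (e.g. DBConnection.ConnectionError) when no connection
    # can be checked out at all; treat that as unhealthy too.
    _ -> false
  end
end
```

Run it periodically rather than per-request, so the health probe itself doesn't compete with real queries for pool connections.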
Interesting, do we not have something like that in the live_dashboard? I know there was a ton of cool stuff there regarding the DB. If not, I can see how that may be useful.
It seems better suited for a dedicated load balancer, tbh.
If you check the status of each pool before choosing one, you increase latency by increasing the number of messages passed before executing a query, you eat into each pool's message queue (making them slower), and the status can easily change between your status request and the query execution anyway.
If you have your own load balancer process, it will need to handle checkin/checkout messages to track the state of each pool, and those will compete with the actual requests for a connection. It can get backed up quickly because it's single-threaded.
From my understanding there has been a lot of research into pooling strategies, and random, while not perfect, trumps most other strategies when you consider not just performance, but performance relative to the complexity needed to implement it. So yeah, I think keeping it simple is not a bad approach.
Hey @ukutaht. I think the db_connection library on Hex is going to be more useful. If you want to do your own load balancing, you're going to need lower-level access to the individual connections than Ecto provides. Ecto uses db_connection under the hood, so it is a well-tested library.
What about dev load for the team? I agree with @LostKobrakai that random is probably good enough in most cases. Another challenge is that least-conn doesn't tell us anything about query complexity and actual DB load.
If those are the hard requirements, keep a dynamic pool of repos as laid out in the linked article, with simple monitoring of whether queries can still be executed (think SELECT 1, though perhaps there's some DB internal that can be used as a health metric/indication). If a query turns out to be very slow, take the repo out of the pool of replicas.
This seems to me the lowest possible load for the team while sticking to the requirements and dropping one wish.
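A minimal sketch of that idea, with a GenServer that health-checks a list of repos on an interval and lets callers pick a random healthy one. The repo module names and the interval are assumptions:

```elixir
defmodule RepoBalancer do
  use GenServer

  # Hypothetical repo modules, one per ClickHouse node.
  @repos [MyApp.Repo1, MyApp.Repo2, MyApp.Repo3]
  @interval :timer.seconds(5)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Callers fetch the cached healthy list and pick at random;
  # fall back to any repo if everything looks down.
  def pick do
    case GenServer.call(__MODULE__, :healthy) do
      [] -> Enum.random(@repos)
      healthy -> Enum.random(healthy)
    end
  end

  @impl true
  def init(:ok) do
    send(self(), :check)
    # Start optimistic: assume everything is healthy until proven otherwise.
    {:ok, @repos}
  end

  @impl true
  def handle_call(:healthy, _from, healthy), do: {:reply, healthy, healthy}

  @impl true
  def handle_info(:check, _state) do
    healthy =
      Enum.filter(@repos, fn repo ->
        match?({:ok, _}, Ecto.Adapters.SQL.query(repo, "SELECT 1", [], timeout: 1_000))
      end)

    Process.send_after(self(), :check, @interval)
    {:noreply, healthy}
  end
end
```

Because the health checks run inside the balancer on a timer, callers only pay one cheap GenServer call to read a cached list, which sidesteps the per-query status-check latency concern raised above.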
I also think a dedicated load balancer might fit better here than choosing random replicas with no periodic feedback about the DB servers' health / currently consumed load.
Since Plausible Analytics uses ClickHouse, and the primary goal is scalability (I presume), have you considered moving to distributed table engines? Forgive me if that's already the case; it's been a while since I last followed the Plausible Analytics source code.
The Altinity blog has an insightful post about distribution and data sharding schemes in ClickHouse.
Quoting from their post:
So if there is a single-node table that needs to be scaled out to multiple servers, the procedure is as follows:
Create shard tables at new servers
Re-distribute data if necessary (manually is better, though ClickHouse has some support for re-sharding as well)
Define cluster configuration and place it on every ClickHouse node
Create a Distributed table to access the data from multiple shards
Cluster extension is even easier, since it only requires a change in the config file to include the new server(s).
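The last two steps above can be sketched in ClickHouse DDL. Everything here is hypothetical (cluster name `my_cluster`, database `default`, table and column names invented for illustration):

```sql
-- Local shard table: each node stores its own slice of the data.
CREATE TABLE events_local ON CLUSTER my_cluster
(
    timestamp DateTime,
    user_id   UInt64,
    name      String
)
ENGINE = MergeTree()
ORDER BY (timestamp, user_id);

-- The Distributed table fans queries out to events_local on every shard
-- of my_cluster, using user_id as the sharding key for writes.
CREATE TABLE events ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, user_id);
```

The application then reads from and writes to `events`, and ClickHouse handles routing to the shards.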
I also noticed that PostHog uses this mechanism to scale their main events tables for their enterprise setup - docs briefing here.
Moving that overhead into ClickHouse's own engines to manage might be an option worth considering.
P.S. I would like to take the chance to thank you and the rest of the Plausible team for the awesome work. I learned a lot reading your open source contributions on GitHub.