We have an application hosted on AWS Fargate that uses AWS RDS for a managed database service. We have configured the RDS to use four reader replicas (in addition to the one writer) and we have two Fargate tasks with 40 DB connections apiece that query the readers. Our expectation is that Ecto queries and DB connections would be fairly evenly spread out across the four readers.
However, what we have found is that RDS does not evenly spread out the connections across the readers, but concentrates them on only one or two of the reader replicas. Usually we will see 40 connections on two of the readers and no connections on the other two, and there are times when we will see all 80 of our DB connections on a single reader while all three of the other readers sit idle.
After some discussion with AWS support we’ve found out that the load balancing algorithm for RDS does not play nicely with the
DBConnection pool. Specifically, we were told that if multiple connections occur at or around the same time, then all of those connections can be sent to the same reader endpoint. And, since
DBConnection acquires all of its connections immediately upon startup, this means that all of those connections will be sent to the same reader endpoint when the Fargate task first starts up.
It seems that, in general, the RDS load balancing algorithm assumes that a new connection will be opened up for each query, rather than having persistent connections all opened up at the same time. Given that we have no control over the load balancing algorithm, I’m wondering if there’s anything we can do in our application code or in the configuration of the
DBConnection pool to alleviate this problem?
Is there any way, for instance, to configure
DBConnection to introduce a small delay (maybe ~1s) between opening its connections at startup? Or, even better, a way to have
DBConnection open its connections lazily, such that new connections are opened only when there is a request for a connection but all existing connections are currently checked out? Or would this require a custom connection pool implementation?
Any other ideas about how to get around this problem?
Thanks in advance for any help you can provide!