Analysis of Connection Pool Bottlenecks in Elixir/PostgreSQL App on Heroku

Hello everyone,

I’ve been conducting load testing on a Phoenix/PostgreSQL application deployed on Heroku, using Locust as our testing framework. Our primary focus has been analyzing the throughput capacity of a specific transaction under high concurrent load.

During initial testing, the bottleneck manifested as:

** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 136ms. (...)

Initially, this appeared to be a straightforward connection pool exhaustion issue. My first approach was to increase the connection pool size, but this didn’t yield the expected capacity improvements. (While I considered adjusting queue_target and queue_interval parameters, this seemed to address the symptom rather than the root cause.)
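
For reference, the knobs in question live in the Repo config. A minimal sketch with illustrative values (the app/repo names and numbers here are placeholders, not our production settings):

```elixir
# config/runtime.exs: placeholder values, for illustration only
config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 10,        # DB connections this node keeps open
  queue_target: 50,     # ms a checkout may wait before the pool is considered congested (default 50)
  queue_interval: 1000  # ms window over which queue_target is evaluated (default 1000)
```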

Further investigation steps included:

  1. Query optimization (achieved reasonable improvements)
  2. Isolated database load testing (confirmed queries weren’t significantly degrading under stress)
  3. Connection pool monitoring via :telemetry (revealed increasing queue_time, while query_time and decode_time remained stable; see the sketch after this list)
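
The :telemetry monitoring was along these lines (a sketch; the event name assumes the default telemetry prefix for a repo module named MyApp.Repo, so adjust for your app):

```elixir
# Attach once at application start. The event name below assumes the default
# telemetry_prefix for a repo named MyApp.Repo (an assumption for this sketch).
:telemetry.attach(
  "log-ecto-timings",
  [:my_app, :repo, :query],
  fn _event, measurements, _metadata, _config ->
    to_ms = fn
      nil -> nil
      native -> System.convert_time_unit(native, :native, :millisecond)
    end

    IO.inspect(
      %{
        queue_ms: to_ms.(measurements[:queue_time]),
        query_ms: to_ms.(measurements[:query_time]),
        decode_ms: to_ms.(measurements[:decode_time])
      },
      label: "ecto query timings"
    )
  end,
  nil
)
```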

The breakthrough came when experimenting with different Heroku dyno configurations. Surprisingly, deploying an increasing number of small dynos with minimal connection pools (only 2 connections each) significantly improved throughput.

My working hypothesis is that the bottleneck stems from actual parallelism limitations in Heroku dynos due to the small number of cores, despite the Elixir VM’s high concurrency. This would explain why database queries were processed quickly, but Elixir processes couldn’t execute fast enough to process the results and release the connections.
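
One quick sanity check for this, from an attached IEx session (a sketch using standard VM introspection calls):

```elixir
# What the BEAM actually sees on the dyno:
System.schedulers_online()                          # schedulers currently online
:erlang.system_info(:logical_processors_available)  # logical CPUs available to the VM, or :unknown
```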

I’d appreciate your thoughts on:

  1. Does this analysis sound reasonable to you?
  2. Are there alternative explanations I should consider?
  3. Any suggestions for additional testing approaches?

(While exploring alternative infrastructure options is on our roadmap, I’m keen to fully understand the current setup’s behavior first.)

Thank you for your insights.


It would be useful if you posted the specs of the VMs (or dynos, as they call them) to understand more about how limited your resources are.

It’s also important to realize that you can have bottlenecks in 2 cases:

  1. Single-core processing is slow; this might be due to them using old hardware or cheaping out on the time slices allocated to their vCPUs;
  2. You have a lot more processes doing other work. The more processes you have running on the same scheduler, the less execution time each of them gets; in this case you can obviously increase the number of cores. You should be able to spin up some monitoring with this: scheduler — runtime_tools v2.1.1 (see the sketch below).
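
A sketch of how to sample that from an attached console, assuming :runtime_tools is available (it usually is):

```elixir
# Measure scheduler utilization over a 5-second window.
# :scheduler is part of the :runtime_tools application.
:scheduler.utilization(5)
# Returns a list with total, weighted, and per-scheduler utilization entries.
```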

Thank you for your response.

Well, Heroku documents their dynos’ resources here.
Basically, standard 1x dynos share CPU with other apps, so it’s hard to tell. But this other doc (targeted at Rails/Puma) mentions “it is best to assume only one process can execute at a time”.

So, yeah, I understand this reinforces the conclusion that CPU resources are the bottleneck.

But the point I’m really trying to confirm here is whether an ‘exhausted pool’ is really the symptom we should expect from a CPU bottleneck, because that was not obvious to me.

Like - it seems the I/O for the DB query responses was completing, but some CPU work was still needed for the processes to release the connections. Does that make sense?


If queue times increase and the other metrics do not, then it’s not really an exhausted pool. It’s a pool not getting to do enough work because noisy neighbors (the rest of the application) don’t leave enough room to actually work through the queue. In an overloaded state anything can topple over because of resource starvation. That doesn’t mean the thing toppling over is the problem; it just was the first to fall.


To add: OP can measure the run_queue and the overall CPU/RAM usage on the machine to validate this point.
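
For example (a sketch; these are point-in-time values, so sample them repeatedly while the load test runs):

```elixir
:erlang.statistics(:run_queue)                # total length of all normal run queues
:erlang.statistics(:total_run_queue_lengths)  # normal plus dirty-CPU run queue lengths
:erlang.memory(:total)                        # total memory allocated by the VM, in bytes
```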


Just tested again, twice.

The first time, using one 1x dyno (shared CPU), I got a peak of ~30% scheduler utilization across all 8 schedulers, but run_queue stayed at 0;

The second time, using one M dyno (dedicated CPU), I got a peak of 57% scheduler utilization across both schedulers, and also a run_queue peak of 51 on both schedulers.

Maybe run_queue isn’t measured properly on 1x dynos, due to the shared CPU? I noticed the exact same pattern in the historical data from last week, when I tested a lot: it only shows run_queue increases when I use dedicated-CPU dynos.

(Memory is still very low compared to what’s available. I do see some increase in ‘dyno load’ in Heroku, but it doesn’t seem huge.)



Thanks for spelunking. I have a question: you are benchmarking a particular transaction. Do you have enough connections in the pool to satisfy the average transaction time vs. the number of clients?

You said the query time is not increasing, but maybe it is high to begin with? How long does each transaction take, and how many clients vs. connections do you have?

Thank YOU for spelunking! (funny word :joy:)

So, under not-too-high pressure, the transaction takes at most 50ms, which would correspond to ~20 rps on a single connection. In the measurements I did with a single connection in the pool, it was able to handle up to ~85 rps before response times ‘exploded’. (Not sure why that’s higher than 20, though.)
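
For the back-of-the-envelope math, plus one possible explanation for the gap (assuming the connection is only checked out for part of the request, which I haven’t verified):

```elixir
# Ceiling if a connection were held for the full 50ms request:
1000 / 50   # => 20.0 requests/s per connection

# If the connection is only checked out for the query itself (say ~10ms of the
# 50ms, an assumed number, not measured), the per-connection ceiling is higher:
1000 / 10   # => 100.0 requests/s per connection
```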

For the following analysis, I’m interpreting the point where the response time ‘explodes’ (95th percentile goes above 200ms) as the ‘max throughput’.

As an example of what most measurements look like:

(I still want to measure the queue_length on the server at that point, but haven’t yet.)

I did a bunch of reasonably systematic tests comparing the impact of increasing the pool_size in a single dyno versus increasing the number of dynos (always with 2 connections per pool):

(These measurements could only be made using dedicated-CPU dynos. With shared-CPU ones the numbers are too erratic to reach any conclusion.)

I tried to keep the x-axis in a comparable proportion (since 1 extra dyno = 2 extra connections, plus the extra parallelism).

What I can conclude from it is that increasing the number of connections together with the number of dynos gives quite a linear response, as I would expect. Increasing connections within the same dyno, on the other hand, has almost no effect. It’s as if there were another bottleneck in each dyno, one related to CPU, as the info in the previous posts suggests.

(I’m still investigating because, of course, that CPU consumption may well be somewhere in the app, which would cause the pool to ‘complain first’, as was also said previously.)
