Best ways to limit concurrency to prevent overwhelming DB?

We just wrapped up some stress testing, and we’re seeing a future issue where we’ll eventually overwhelm our Postgres databases. We get tons of these at high concurrency:

** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 121ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information
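(For reference, those two knobs live in the repo config; a minimal sketch, assuming an Ecto repo MyApp.Repo under the :my_app OTP app, with illustrative values:)

```elixir
# config/runtime.exs — sketch with illustrative values. If checkouts keep
# taking longer than queue_target (ms) across a queue_interval (ms) window,
# the pool starts dropping requests with the error above.
import Config

config :my_app, MyApp.Repo,
  pool_size: 40,
  queue_target: 50,
  queue_interval: 1_000
```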

We’re write-heavy with concurrent POST requests flowing in from various external integrations.
We have some async network-bound operations that fan out from each of these write requests (~0-5 per write), which also perform DB writes themselves and thus tie up extra connections in our pool.

Here are some solutions we’ve looked at so far:

  • Tune the queries - we’ve done a fair amount of this already, and we’ll eventually run out of tuning we can do. These requests need to make ~20-30 queries per request.
  • Handle our busiest write requests in their own pool with poolboy. We tried this quickly and didn’t see improvement, but our test may have been flawed (see the sketch after this list for the shape we mean).
  • Put a queue in front of these requests & limit their rate. We’d rather avoid this, and are hoping for a lighter native solution.
  • Increase POOL_SIZE for our Postgres DBs (currently at 40/server with 2 servers). This did not seem to improve the issue, and just delays the problem anyway.
  • Get bigger databases, which also just delays the problem
  • Synchronize the async operations so each request process maintains a single Postgres pool connection. (this also just delays the problem)
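
For the dedicated-pool idea, a minimal sketch of the shape we mean: a second Ecto repo on the same database with its own pool, so the busiest writes can only exhaust their own connections. MyApp.WriteRepo, the URL source, and the pool size are placeholders, not our real config:

```elixir
# Sketch: a second repo against the same database, with its own pool,
# so heavy writes can't starve the primary repo's connections.
defmodule MyApp.WriteRepo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres
end

# config/runtime.exs
config :my_app, MyApp.WriteRepo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 15  # budget for heavy writes only; reads keep the main pool
```

The heavy write paths would then call MyApp.WriteRepo explicitly, and the new repo is supervised alongside the primary one.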

Any recommendations?


Please note that the connection pool *already* limits the concurrency to prevent overwhelming the DB :slight_smile:

If there are more requests than the app can handle at any layer (the DB in this case), request processing should be throttled (e.g. by returning a 429 response) to provide back-pressure. Otherwise, every solution is just “delaying the problem”.

If you must prefer availability (i.e. never return a failure and always accept the request, since the incoming rate is not under your control), then you need a buffer to hold the data until it is processed, which again just delays the problem.
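As a sketch of what that throttling could look like at the app layer (the module name, the limit, and the counter bootstrapping are all made up for illustration):

```elixir
# Sketch: reject writes with 429 once @max_in_flight are already running.
# The counter must be created once at boot, e.g. in Application.start/2:
#   :persistent_term.put(MyAppWeb.ShedLoad, :atomics.new(1, signed: true))
defmodule MyAppWeb.ShedLoad do
  import Plug.Conn

  @max_in_flight 100

  def init(opts), do: opts

  def call(conn, _opts) do
    counter = :persistent_term.get(MyAppWeb.ShedLoad)

    if :atomics.add_get(counter, 1, 1) > @max_in_flight do
      :atomics.sub(counter, 1, 1)
      conn |> send_resp(429, "Too Many Requests") |> halt()
    else
      # Decrement when the response goes out. A crashed request leaks a
      # slot here; a real version would monitor the process instead.
      register_before_send(conn, fn conn ->
        :atomics.sub(counter, 1, 1)
        conn
      end)
    end
  end
end
```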

Put a queue in front of these requests & limit their rate. We’d rather avoid this, and are hoping for a lighter native solution.

This is actually needed. You may check out API gateways, as they often provide a plugin that implements rate limiting over an HTTP API easily. In that case you may not need to implement this in your end applications at all.


Thanks, agreed & makes sense.
We are interested in native Elixir/OTP solutions versus external services, especially since this will only be an issue for specific request types with lots of DB writes.

It seems like connection pooling with poolboy, or perhaps a GenStage implementation with back-pressure, would be able to achieve the behavior we’re looking for. Memory becomes a concern, since some of the requests that would be buffered are large, but that’s something we can measure and tune.

Any thoughts on that?

Hi! You should check out Broadway.
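
A skeleton of what a pipeline could look like here; the producer module, batcher name, and numbers are placeholders (Broadway needs a real producer, e.g. an off-broadway adapter), but it shows how concurrency and batch size get capped in one place:

```elixir
defmodule MyApp.WritePipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      # whatever feeds you work (an off-broadway queue adapter, SQS,
      # RabbitMQ, ...); left abstract in this sketch
      producer: [module: {MyApp.SomeProducer, []}, concurrency: 1],
      processors: [default: [concurrency: 10]],
      batchers: [db: [concurrency: 2, batch_size: 100, batch_timeout: 200]]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    Message.put_batcher(message, :db)
  end

  @impl true
  def handle_batch(:db, messages, _batch_info, _context) do
    # one insert per batch instead of one per request
    MyApp.Repo.insert_all("events", Enum.map(messages, & &1.data))
    messages
  end
end
```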

I think this depends on the guarantees that you’re going to provide as part of your API. It does sound to me like you need a queue of some kind. If you’re going to use a queue, you’re going to want to bound that queue, otherwise you’ll continue to experience overload. So you’ll need a way to evict items from the queue, or to block new items from being added once a threshold has been reached. You’ll also need to determine what ordering semantics you want (FIFO, LIFO) and whether you want any sort of durability.

GenStage has a very limited set of problems where it’s the ideal solution. It’s unclear to me whether this is one of those situations. GenStage is a pull-based approach to batch work. The problem you’re describing relies on push semantics, which means you’re going to have to translate between API requests (push) being placed into a GenStage producer and consumers pulling them off. GenStage works well when it controls the entire pipeline, because it manages back-pressure. But it doesn’t sound like you can control the rate of requests hitting your API; in queueing theory terms we’d say this is an unbounded queue. GenStage may fit your problem, but your producer will still need to implement load shedding or be bounded in some way, and you’ll still need to decide your ordering semantics. Otherwise you continue to risk overload scenarios if your workers can’t consume operations faster than your API requests produce them.
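
For illustration only, a bounded producer along those lines might look like this sketch; enqueue/1 returns {:error, :overloaded} once the queue is full, so the API layer can shed load (the module name and limit are made up):

```elixir
defmodule WriteProducer do
  use GenStage

  @max_queued 1_000  # illustrative bound

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

  # Called from the request path; the caller decides what a rejection means
  # (e.g. reply 429).
  def enqueue(job), do: GenStage.call(__MODULE__, {:enqueue, job})

  @impl true
  def init(:ok), do: {:producer, {:queue.new(), 0, 0}}  # {queue, length, pending demand}

  @impl true
  def handle_call({:enqueue, _job}, _from, {_q, len, _d} = state) when len >= @max_queued do
    {:reply, {:error, :overloaded}, [], state}
  end

  def handle_call({:enqueue, job}, _from, {q, len, demand}) do
    {events, state} = dispatch(:queue.in(job, q), len + 1, demand)
    {:reply, :ok, events, state}
  end

  @impl true
  def handle_demand(incoming, {q, len, demand}) do
    {events, state} = dispatch(q, len, demand + incoming)
    {:noreply, events, state}
  end

  # Emit as many queued jobs as consumers have asked for.
  defp dispatch(q, len, demand) do
    take = min(len, demand)

    {taken, rest} =
      Enum.reduce(1..take//1, {[], q}, fn _, {acc, q} ->
        {{:value, job}, q} = :queue.out(q)
        {[job | acc], q}
      end)

    {Enum.reverse(taken), {rest, len - take, demand - take}}
  end
end
```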


I’ve seen a talk on maintaining a read/write cache layer in Elixir for database operations. It requires a lot of work, but there are certain benefits to doing that.

Back-pressure should be considered not only inside your service, but also between external actors and your service. For example, Broadway solves back-pressure within your service, but you need to do more to propagate back-pressure to the source (e.g. the HTTP clients sending the requests).

The naive way to control concurrency is to let a GenServer process handle each request and keep a pool of such processes. Pooling only the DB connections can instead leave an API request stalled in the middle of its work.
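
A sketch of that shape with poolboy, where the pool size itself becomes the concurrency limit; MyApp.WriteWorker, the pool name, and the numbers are placeholders:

```elixir
# Sketch: a fixed worker pool in front of the write path, so at most `size`
# writes run at once and extra callers wait in poolboy's own checkout queue.
pool_args = [
  name: {:local, :write_pool},
  worker_module: MyApp.WriteWorker,
  size: 20,
  max_overflow: 0
]

# in the supervision tree:
children = [:poolboy.child_spec(:write_pool, pool_args)]

# in the request path (blocks until a worker is free, or times out):
:poolboy.transaction(:write_pool, fn worker ->
  GenServer.call(worker, {:handle_write, params}, 15_000)
end)
```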


One thing to consider about queuing is not being able to ‘catch up’. A queue is just a sink for requests; if the bottleneck still exists, then your sink will fill up. Instead of putting a queue in front of these requests, it could be better to put one behind them. If possible, multiple readers can fetch batches from the queue and do larger inserts than a single request would. If it isn’t possible to batch inserts then it gets trickier.
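
To illustrate the batching side, a minimal sketch: request processes only append rows, and one process flushes them with a single insert_all. The repo, table, and limits are placeholders, and rows are assumed to be maps with identical keys:

```elixir
defmodule MyApp.Batcher do
  use GenServer

  @flush_every 200  # ms between timed flushes
  @max_batch 500    # flush early once this many rows accumulate

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Request processes call this instead of writing to the DB themselves.
  def add(row), do: GenServer.cast(__MODULE__, {:add, row})

  @impl true
  def init(:ok) do
    Process.send_after(self(), :flush, @flush_every)
    {:ok, {[], 0}}
  end

  @impl true
  def handle_cast({:add, row}, {rows, n}) when n + 1 >= @max_batch do
    flush([row | rows])
    {:noreply, {[], 0}}
  end

  def handle_cast({:add, row}, {rows, n}), do: {:noreply, {[row | rows], n + 1}}

  @impl true
  def handle_info(:flush, {rows, _n}) do
    flush(rows)
    Process.send_after(self(), :flush, @flush_every)
    {:noreply, {[], 0}}
  end

  defp flush([]), do: :ok
  defp flush(rows), do: MyApp.Repo.insert_all("events", Enum.reverse(rows))
end
```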


Hi

  1. Are these transactions mutually independent (i.e. later invocations do not require / depend on results returned from earlier invocations)?

  2. Are these transactions required to be synchronous (i.e. when the request returns HTTP 200 / HTTP 201, do you need to guarantee that the data has been written)?

  3. What is your desired performance profile (in terms of database… Transactions per Second)?

  4. What is your actual performance profile, and where are the bottlenecks?

  5. What is the round-trip latency between your Elixir hosts and your database?

Just a guess, but it might get to the point where you wake up one day and realise that the database should be the warehouse into which you commit results, and that you can use another kind of data store to handle intermediate representations of your data, or even keep them entirely in memory.

OR that you will write an expense report for a few grand, install a few SSDs and be done with it.

OR that you will change providers. I noticed that you had this thread open earlier… Phoenix API latency > 30s… and wanted to say that Heroku Postgres is seldom the best Postgres can offer… you can explore some other offerings… AWS RDS… Amazon Aurora with PostgreSQL compatibility… whatever is available on GCE… self-hosted solutions, etc, etc, etc.


Can you find the link for it?