How can you avoid overwhelming Ecto?

I’m getting the following error in my logs fairly often:

connection not available and request was dropped from queue after 476ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information

I understand that I can tweak queue_target and queue_interval; however, I’d like to better understand what (if anything) I should be trying to do before that. I have operations coming in from multiple locations that hit the database:

  • Web requests directly from clients (via phoenix/http)
  • Batch jobs from another server (via phoenix/http)
  • Batch jobs via oban

Should I be trying to provide backpressure somehow? Should I try to prioritize the client requests over the batch requests? Should I be checking the telemetry for the Ecto queue time in my code?

Basically, I’m trying to understand the best practices and how I should approach this problem, and I’m curious to hear any thoughts (including possibly telling me that I’m thinking about this all wrong).

3 Likes

Is it really just Ecto being overwhelmed, or is it the database itself? If the database can’t keep up, then doing more in Ecto won’t really help; you’d need to fix the queries. If the database is still not fully loaded, then adding more connections would probably help (in addition to fixing up some of the more costly queries).

Yeah, I think it is the database itself as well.

And I will be looking into fixing the more costly queries, but I also want to see if there’s a general pattern/approach that I should implement as well.

Are you running MacOS? I noticed these starting to pop up on my machine too, mostly when in iex and it’s sitting idle. It may have something to do with the latest update of MacOS (I’m on Mojave 10.14.6). The db connector in my IDE also has started dropping the PG connection about the same time as I started to see the errors in iex and that was shortly after the OS had an update.

Curious if anyone else is seeing this and if it is only MacOS or Linux too.

1 Like

This is in production running on heroku with https://github.com/HashNuke/heroku-buildpack-elixir/

OK…well that rules out my theory! I’d been seeing it on two different machines (both on 10.14.6), and the IDE connection dropping even on Rails projects, although Rails wasn’t throwing errors…just the DB explorer would lose the PG connection and I’d have to reconnect.

If it’s the database getting overwhelmed, then there’s not much you can do in Ecto to help short of just delaying the inevitable; you need to optimize the queries or space them out more. :slight_smile:

Woo, that sounds like major overloading?

1 Like

So if I wanted to slow down my batch processing, could I control the amount of slowdown based on how overloaded Ecto/the database is?

I’m thinking it would be possible to create a telemetry handler that records how long the queueing time is, and if it’s too long, the next batch job would be delayed to give the DB some time to “cool off”.
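Something like this rough sketch is what I have in mind (the module is made up and MyApp.Repo is a placeholder). Ecto emits a [:my_app, :repo, :query] telemetry event for MyApp.Repo by default (configurable via :telemetry_prefix), and its measurements include queue_time in native time units:

```elixir
defmodule MyApp.QueuePressure do
  @moduledoc """
  Rough sketch: record Ecto's queue_time via telemetry so batch jobs
  can check it and back off when the pool is under pressure.
  """
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> 0 end, name: __MODULE__)
  end

  def attach do
    :telemetry.attach(
      "queue-pressure-handler",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, _metadata, _config) do
    # queue_time is in native units and may be missing for some queries.
    case measurements[:queue_time] do
      nil ->
        :ok

      queue_time ->
        ms = System.convert_time_unit(queue_time, :native, :millisecond)
        Agent.update(__MODULE__, fn _ -> ms end)
    end
  end

  def last_queue_time_ms, do: Agent.get(__MODULE__, & &1)
end
```

A batch job could then call last_queue_time_ms/0 before starting its next chunk and sleep or reschedule if the pool looks busy.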

You could ask PG what its usage is, or maybe check the pool to see how many connections are checked out and halt for a period if the number is too high? Honestly, I’m more curious about what the queries are actually doing…

1 Like

How would you ask PG what its usage is? Or check how checked out Ecto’s pool is? I don’t see an API to get that info.

Yeah, I might start a separate thread about the queries themselves.

Have you checked the system catalogs?
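For example, something along these lines (a rough, untested sketch; MyApp.Repo is a placeholder) would ask Postgres how many connections are currently open against your database via the pg_stat_activity view:

```elixir
# Untested sketch: count the connections currently open against this database.
{:ok, %{rows: [[connection_count]]}} =
  Ecto.Adapters.SQL.query(
    MyApp.Repo,
    "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()",
    []
  )
```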

3 Likes

@axelson if this is on Heroku, run heroku pg:info DATABASE_URL from the command line and the output will show how many connections are being used. You can also see some statistics if you click on the PG addon in the Dashboard GUI for the app and go to the Diagnose tab; there you’ll see a limited set of query stats. There may be more you can do with the CLI under the heroku pg command if you check out their documentation. Hope that helps.

PS…FWIW, the issue I was seeing in iex above is unrelated, as I found out, and I don’t want to clutter your thread with details…it was a PICNIC (Problem In Chair, Not In Computer)! :wink:


For PG you’d just need to run a fragment select statement (or raw SQL) to query the system catalogs, as @hauleth said. :slight_smile:

As for Ecto’s pool, I’m unsure which pool it uses, but there is always a way to get the information ‘somehow’; otherwise it sounds like a bug. ^.^;

But still, if the DB is being overwhelmed, you should look more at the work being done instead; it can likely be optimized quite a bit.

1 Like

Should I be trying to provide backpressure somehow?

Yes.

Should I try to prioritize the client requests over the batch requests?

You should have a separate connection pool. You can also configure a higher queue target for it, as it probably doesn’t matter if these requests take longer.
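For example (a sketch with made-up names and values, not a recommendation), you could define a second repo over the same database with a separate pool and more lenient queue settings, and route the batch work through it:

```elixir
# A second repo over the same database, dedicated to batch work.
defmodule MyApp.BatchRepo do
  use Ecto.Repo,
    otp_app: :my_app,
    adapter: Ecto.Adapters.Postgres
end
```

```elixir
# In config (values are illustrative only): its own pool and a more
# forgiving queue_target, since batch requests can tolerate waiting.
config :my_app, MyApp.BatchRepo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 5,
  queue_target: 500,
  queue_interval: 5_000
```

You’d also start MyApp.BatchRepo in the application’s supervision tree next to MyApp.Repo.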

So if I wanted to slow down my batch processing, could I control the amount of slowdown based on how overloaded ecto/the database is?

That sounds like a lot of work. Depending on how you are doing this processing, you could have a worker pool, throttle the number of concurrent requests on the URL using your load balancer, or have a job queue.
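Since the batch jobs already go through Oban, one low-effort form of throttling (a sketch; the queue names and limits are made up) is to cap their concurrency in the Oban queue configuration:

```elixir
# Illustrative limits only: keep heavy batch jobs on a low-concurrency queue.
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [
    default: 10,
    batch: 2
  ]
```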

1 Like

Correct. You should avoid tweaking queue_target and queue_interval because increasing them mostly means your users have to wait longer.

You should look at the logs and generally track your database times. Ecto logs the following data:

  • query_time / db_time - how long your queries are taking. If you have slow queries, they will keep all connections busy, leading to the errors above. Your priority is to reduce those (adding indexes, denormalizing the data, etc.)

  • queue_time - how long you are waiting for the connection. If query times are already low but queue time is high, it is likely that you have a lot of traffic and you can’t serve it all in time. A good solution here is to increase the pool size or create separate pools based on usage, as mentioned by @hkrutzer
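For reference (a sketch; the numbers are illustrative, not recommendations), the pool size and queue settings are plain repo configuration:

```elixir
# Illustrative values only; keep pool_size within your database's
# connection limit (Heroku Postgres plans cap the number of connections).
config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 20,
  queue_target: 50,      # target wait time in ms before the queue is considered slow
  queue_interval: 1_000  # interval in ms over which queue_target is checked
```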

2 Likes

Ah, that is a good idea for the bulk requests, although it might be a little difficult to pass the repo to use through the rest of the system.

Thanks for the tips! I’ll take a look at those values (it looks like they’re exported via telemetry as well) and then choose an approach from there.