How can you avoid overwhelming Ecto?

I’m getting the following error in my logs fairly often:

connection not available and request was dropped from queue after 476ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information

I understand that I can tweak queue_target and queue_interval; however, I’d like to better understand what (if anything) I should be trying to do before that. I have operations coming in from multiple locations that hit the database:

  • Web requests directly from clients (via phoenix/http)
  • Batch jobs from another server (via phoenix/http)
  • Batch jobs via oban

Should I be trying to provide backpressure somehow? Should I try to prioritize the client requests over the batch requests? Should I be checking the telemetry for the Ecto queue time in my code?

Basically, I’m trying to understand the best practices and how I should approach this problem, and I’m curious to hear any thoughts (including possibly telling me that I’m thinking about this all wrong).

3 Likes

Is it really just Ecto being overwhelmed, or is it the database itself? If the database can’t keep up, then doing more in Ecto won’t really help; you’d need to fix the queries. If the database is still not fully loaded, then adding more connections would probably help (in addition to fixing up some of the more costly queries).

Yeah, I think it is the database itself as well.

And I will be looking into fixing the more costly queries, but I also want to see if there’s a general pattern/approach that I should implement as well.

Are you running MacOS? I noticed these starting to pop up on my machine too, mostly when in iex and it’s sitting idle. It may have something to do with the latest update of MacOS (I’m on Mojave 10.14.6). The db connector in my IDE also has started dropping the PG connection about the same time as I started to see the errors in iex and that was shortly after the OS had an update.

Curious if anyone else is seeing this and if it is only MacOS or Linux too.

1 Like

This is in production running on heroku with https://github.com/HashNuke/heroku-buildpack-elixir/

OK…well that rules out my theory! I’d been seeing it on two different machines (both on 10.14.6), and the IDE connection dropping even on Rails projects, although Rails wasn’t throwing errors…just the DB explorer would lose the PG connection and I’d have to reconnect.

If it’s the database getting overwhelmed, then there’s not much you can do in Ecto to help short of just delaying the inevitable; you need to optimize the queries or space them out more. :slight_smile:

Woo, that sounds like major overloading?

1 Like

So if I wanted to slow down my batch processing, could I control the amount of slowdown based on how overloaded Ecto/the database is?

I’m thinking it would be possible to create a telemetry handler that records how long the queueing time is, and if it’s too long, the next batch job would be delayed to give the DB some time to “cool off”.
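Something like this rough sketch is what I have in mind (the module is made up and MyApp.Repo is a placeholder). Ecto emits a [:my_app, :repo, :query] telemetry event for MyApp.Repo by default (configurable via :telemetry_prefix), and its measurements include queue_time in native time units:

```elixir
defmodule MyApp.QueuePressure do
  @moduledoc """
  Rough sketch: record Ecto's queue_time via telemetry so batch jobs
  can check it and back off when the pool is under pressure.
  """
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> 0 end, name: __MODULE__)
  end

  def attach do
    :telemetry.attach(
      "queue-pressure-handler",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, _metadata, _config) do
    # queue_time is in native units and may be missing for some queries.
    case measurements[:queue_time] do
      nil ->
        :ok

      queue_time ->
        ms = System.convert_time_unit(queue_time, :native, :millisecond)
        Agent.update(__MODULE__, fn _ -> ms end)
    end
  end

  def last_queue_time_ms, do: Agent.get(__MODULE__, & &1)
end
```

A batch job could then call last_queue_time_ms/0 before starting its next chunk and sleep or reschedule if the pool looks busy.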

You could ask PG what its usage is, or maybe check the pool to see how many connections are checked out and halt for a period if the number is too high? Honestly, I’m more curious about what the queries are actually doing…

1 Like

How would you ask PG what its usage is? Or check how checked out Ecto’s pool is? I don’t see an API to get that info.

Yeah, I might start a separate thread about the queries themselves.

Have you checked the system catalogs?
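For example, something along these lines (a rough, untested sketch; MyApp.Repo is a placeholder) would ask Postgres how many connections are currently open against your database via the pg_stat_activity view:

```elixir
# Untested sketch: count the connections currently open against this database.
{:ok, %{rows: [[connection_count]]}} =
  Ecto.Adapters.SQL.query(
    MyApp.Repo,
    "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()",
    []
  )
```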

3 Likes

@axelson if this is on Heroku, run heroku pg:info DATABASE_URL from the command line and the output will show how many connections are being used. You can also see some statistics if you click on the PG addon in the Dashboard GUI for the app and go to the Diagnose tab; there you’ll see a limited set of query stats. There may be more you can do with the CLI under the heroku pg command if you check out their documentation. Hope that helps.

PS…FWIW, the issue I was seeing in iex above is unrelated, as I found out, and I don’t want to clutter your thread with details…it was a PICNIC (Problem In Chair, Not In Computer)! :wink:


For PG you’d just need to run a fragment select statement (or raw SQL) to query the system catalogs, as @hauleth said. :slight_smile:

As for Ecto’s pool, I’m unsure which pool it uses, but there is always a way to get the information ‘somehow’; otherwise it sounds like a bug. ^.^;

But still, if the DB is being overwhelmed, you should look more at the work being done instead; it can likely be optimized quite a bit.

1 Like

Should I be trying to provide backpressure somehow?

Yes.

Should I try to prioritize the client requests over the batch requests?

You should have a separate connection pool. You can also configure a higher queue target for it, as it probably doesn’t matter if these requests take longer.
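For example (a sketch with made-up names and values, not a recommendation), you could define a second repo over the same database with a separate pool and more lenient queue settings, and route the batch work through it:

```elixir
# A second repo over the same database, dedicated to batch work.
defmodule MyApp.BatchRepo do
  use Ecto.Repo,
    otp_app: :my_app,
    adapter: Ecto.Adapters.Postgres
end
```

```elixir
# In config (values are illustrative only): its own pool and a more
# forgiving queue_target, since batch requests can tolerate waiting.
config :my_app, MyApp.BatchRepo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 5,
  queue_target: 500,
  queue_interval: 5_000
```

You’d also start MyApp.BatchRepo in the application’s supervision tree next to MyApp.Repo.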

So if I wanted to slow down my batch processing, could I control the amount of slowdown based on how overloaded ecto/the database is?

That sounds like a lot of work. Depending on how you are doing this processing, you could have a worker pool, throttle the number of concurrent requests on the URL using your load balancer, or have a job queue.
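Since the batch jobs already go through Oban, one low-effort form of throttling (a sketch; the queue names and limits are made up) is to cap their concurrency in the Oban queue configuration:

```elixir
# Illustrative limits only: keep heavy batch jobs on a low-concurrency queue.
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [
    default: 10,
    batch: 2
  ]
```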

1 Like

Correct. You should avoid tweaking queue_target and queue_interval because increasing them mostly means your users have to wait longer.

You should look at the logs and generally track your database times. Ecto logs the following data:

  • query_time / db_time - how long your queries are taking. If you have slow queries, they will keep all connections busy, leading to the errors above. Your priority is to reduce those (adding indexes, denormalizing the data, etc.)

  • queue_time - how long you are waiting for the connection. If query times are already low but queue time is high, it is likely that you have a lot of traffic and you can’t serve it all in time. A good solution here is to increase the pool size or create separate pools based on usage, as mentioned by @hkrutzer
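For reference (a sketch; the numbers are illustrative, not recommendations), the pool size and queue settings are plain repo configuration:

```elixir
# Illustrative values only; keep pool_size within your database's
# connection limit (Heroku Postgres plans cap the number of connections).
config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 20,
  queue_target: 50,      # target wait time in ms before the queue is considered slow
  queue_interval: 1_000  # interval in ms over which queue_target is checked
```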

2 Likes

Ah, that is a good idea for the bulk requests, although it might be a little difficult to pass the repo to use through the rest of the system.

Thanks for the tips! I’ll take a look at those values (it looks like they’re exported via telemetry as well) and then choose an approach from there.