Hi. At work we are running an application with a lot of web traffic and background jobs. Some time ago we split the Oban database from the regular application database.
There are four web nodes and a separate worker node. Oban jobs are only processed by the worker node.
The issue we are facing is that we frequently get the following error:
** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2999ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
1. Ensuring your database is available and that you can connect to it
2. Tracking down slow queries and making sure they are running fast enough
3. Increasing the pool_size (although this increases resource consumption)
4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
See DBConnection.start_link/2 for more information
According to AppSignal, this error always originates in lib/oban/stager.ex:86.
The thing is, our Oban repository is configured with pool_size: 4 and we have three queues. I seem to remember reading somewhere that Oban uses one connection per queue plus an extra connection to listen to PG notifications. In that case 4 connections should be enough (3 for the queues + 1 for pub/sub notifications), but the error we are experiencing makes me think otherwise.
Does anyone know how many connections Oban uses for internal operations?
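For reference, the relevant configuration looks roughly like this. The queue names, concurrency limits, and the environment variable are placeholders; only the pool_size: 4 and the three-queue count match our real setup.

import Config

# Oban uses its own repo, pointing at a dedicated database
config :my_app, MyApp.ObanRepo,
  url: System.get_env("OBAN_DATABASE_URL"),
  pool_size: 4

config :my_app, Oban,
  repo: MyApp.ObanRepo,
  # three queues, only processed on the worker node
  queues: [default: 10, mailers: 10, imports: 10],
  # default notifier; it holds one dedicated LISTEN/NOTIFY connection outside the pool
  notifier: Oban.Notifiers.Postgres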
EDIT: Thinking about this further, my understanding is that the stager also runs on the four web nodes, even though they don't run any queues. I see that the stager does nothing if the peer is not the leader, but it still connects to the database to perform a no-op transaction. I've also seen that the stager won't be initialized if we configure stage_interval: :infinity. Would this be an appropriate configuration for the web nodes?
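Concretely, this is the kind of web-node override I have in mind. It's only a sketch; it assumes the stage_interval option sits at the top level of the Oban config and that emptying the queue list is how we keep web nodes from processing jobs:

import Config

# Web nodes: no queues and no staging (sketch)
config :my_app, Oban,
  repo: MyApp.ObanRepo,
  queues: [],
  stage_interval: :infinity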
Which Oban version are you using? The recent v2.15 release has query improvements that significantly speed up job staging for larger tables.
Oban shares connections from the pool, plus one additional dedicated connection for the Postgres notifier. It doesn't matter how many queues or plugins you run; Oban doesn't take more connections.
You're welcome to disable staging on web nodes by setting the interval to :infinity. However, it seems like you have some slow queries monopolizing the connection pool if a no-op transaction can't grab a connection.
Which Oban version are you using? The recent v2.15 release has query improvements that significantly speed up job staging for larger tables.
We are currently using Oban 2.14.2. I've taken a look at the 2.15 release notes and it looks like the changes may be helpful. We'll update and see if the problem is resolved or mitigated.
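The upgrade itself should just be a dependency bump (assuming a standard mix.exs declaration) followed by mix deps.update oban:

# mix.exs
defp deps do
  [
    {:oban, "~> 2.15"}
    # ...other dependencies unchanged
  ]
end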
Oban shares connections from the pool, plus one additional dedicated connection for the Postgres notifier. It doesn't matter how many queues or plugins you run; Oban doesn't take more connections.
You're welcome to disable staging on web nodes by setting the interval to :infinity. However, it seems like you have some slow queries monopolizing the connection pool if a no-op transaction can't grab a connection.
Sorry, but I'm not sure that I understand this. The issue is happening in a repository that points to a different database and is only used by Oban. The application code uses a different repository, with a different connection pool, pointing to a different database. Is it possible that application operations using MyApp.Repo could affect MyApp.ObanRepo as well?
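For context, these are two completely independent Ecto repos with their own pools and their own databases, roughly like this (module bodies trimmed; the URLs and the web pool size are placeholders):

defmodule MyApp.Repo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres
end

defmodule MyApp.ObanRepo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres
end

# config/runtime.exs
config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  pool_size: 20

config :my_app, MyApp.ObanRepo,
  url: System.get_env("OBAN_DATABASE_URL"),
  pool_size: 4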
Right, I understood that. My point is that some slow queries are monopolizing connections, which implies that the pool size is too small or the database can't manage the load.
If you identify slower queries or bottlenecks, I'm eager to help optimize queries/connections.
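If it helps, one quick way to surface them is to attach to the Ecto telemetry events for the Oban repo and log anything slow. A rough sketch, assuming the default telemetry prefix for MyApp.ObanRepo and an arbitrary 100ms threshold:

defmodule MyApp.SlowQueryLogger do
  require Logger

  # Attach to the query events emitted by MyApp.ObanRepo (default prefix)
  def attach do
    :telemetry.attach(
      "oban-repo-slow-queries",
      [:my_app, :oban_repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, %{total_time: total}, metadata, _config) do
    ms = System.convert_time_unit(total, :native, :millisecond)

    if ms > 100 do
      Logger.warning("Slow Oban repo query (#{ms}ms): #{metadata.query}")
    end
  end
end

Call MyApp.SlowQueryLogger.attach/0 once on application start and the slow statements will show up in the logs.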
Oh, that makes sense! I'll update Oban to 2.15 first and then increase the pool size, as 4 seems too low. If the problem still persists I'll try to identify the most impactful queries to see if they can be optimised.
Thanks! 
Updating Oban improved the situation a bit, but we still had errors here and there. In the end we increased the pool size, as we found out that the stager errors happened at the same time as a heavyweight job was running (a job that itself inserted about 50k jobs).
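For completeness, inserting those ~50k jobs goes through the same ObanRepo pool that the stager uses, which is presumably why they collided. The insert is roughly this shape; the worker module and chunk size are made up, and the chunking is only there to keep individual transactions small:

# Sketch of a bulk insert of ~50k jobs through the Oban pool
1..50_000
|> Enum.map(&MyApp.SyncWorker.new(%{record_id: &1}))
|> Enum.chunk_every(1_000)
|> Enum.each(&Oban.insert_all/1)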