In the past we used a direct connection to our RDS instance and had no issues.
However, as we scaled up we started hitting connection limits, so we deployed a pgbouncer machine and redirected our Elixir services to go through it.
We followed the guidance in the Postgrex docs (Postgrex v0.19.3) and set the db config to prepare: :unnamed, which allowed us to enable transaction pooling in our pgbouncer.
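For reference, the relevant repo config now looks roughly like this (module names and numbers are illustrative, not our exact production values):

```elixir
# config/runtime.exs -- illustrative sketch, not our exact settings
import Config

config :my_app, MyApp.Repo,
  url: System.fetch_env!("DATABASE_URL"), # now points at the pgbouncer host
  prepare: :unnamed,                      # required for pgbouncer transaction pooling
  pool_size: 10,
  queue_target: 50,                       # DBConnection defaults; we have not tuned these
  queue_interval: 1_000
```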
Most of our services are just Phoenix servers with JSON or GraphQL APIs, and those remain unaffected. However, two smaller parts of the system still fail intermittently with the following error:
DBConnection.ConnectionError
connection not available and request was dropped from queue after 2411ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
1. Ensuring your database is available and that you can connect to it
2. Tracking down slow queries and making sure they are running fast enough
3. Increasing the pool_size (although this increases resource consumption)
4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
See DBConnection.start_link/2 for more information
or sometimes with this error
DBConnection.ConnectionError
tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
The latter happens more often. Either way, the only two parts of the system that are affected are:
- A custom GenServer that's part of the application supervision tree and is started at boot. This GenServer runs a loop; each iteration runs a db query to fetch pending jobs and then starts other GenServers that process those jobs. This GenServer is supposed to stay alive the whole time, but these errors cause it to crash and burn.
- We have several of these, each handling a different type of task. We fixed one of them by replacing the long-running GenServer with an Oban job that runs on a regular cron; Oban spawns entirely new processes when running jobs (rough sketches of both approaches are included after this list). We are wondering whether the uptime of a GenServer affects whether it can check out connections from Ecto's connection pool, whether we should just move to Oban entirely for such jobs, and why moving to pgbouncer would cause this issue in the first place.
- A Phoenix LiveView admin monitoring dashboard that fetches a lot of data to build tables and charts.
- This is definitely not long-running; it only runs when an admin opens the dashboard. When we originally built this dashboard we were using a direct connection to the db and had to add a bunch of indexes to make it work. However, the migration to pgbouncer broke the dashboard again, and we are wondering if we just need to add even more indexes for this one (a rough sketch of the dashboard's data loading is included below as well).
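To make the first case concrete, the polling GenServer is shaped roughly like this (module, schema, and supervisor names are placeholders, not our real code):

```elixir
defmodule MyApp.JobPoller do
  use GenServer
  import Ecto.Query

  @poll_interval :timer.seconds(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule_poll()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:poll, state) do
    # This Repo call is where DBConnection.ConnectionError is raised,
    # which crashes the whole poller.
    jobs = MyApp.Repo.all(from j in MyApp.Job, where: j.state == "pending")

    Enum.each(jobs, &MyApp.JobWorkerSupervisor.start_worker/1)

    schedule_poll()
    {:noreply, state}
  end

  defp schedule_poll, do: Process.send_after(self(), :poll, @poll_interval)
end
```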
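The Oban replacement that fixed one of these is essentially a cron-scheduled worker; again, the names and the cron expression below are illustrative:

```elixir
# config/config.exs
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  plugins: [
    {Oban.Plugins.Cron, crontab: [{"* * * * *", MyApp.Workers.FetchPendingJobs}]}
  ]

# lib/my_app/workers/fetch_pending_jobs.ex
defmodule MyApp.Workers.FetchPendingJobs do
  use Oban.Worker, queue: :default, max_attempts: 3
  import Ecto.Query

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # Same query the GenServer loop ran, but executed in a fresh process
    # that Oban spawns for each invocation.
    MyApp.Repo.all(from j in MyApp.Job, where: j.state == "pending")
    |> Enum.each(&MyApp.JobWorkerSupervisor.start_worker/1)

    :ok
  end
end
```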
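And for the dashboard, the LiveView loads a lot of data when it mounts, something like this (the report functions are hypothetical stand-ins for our real queries):

```elixir
defmodule MyAppWeb.AdminDashboardLive do
  use MyAppWeb, :live_view

  @impl true
  def mount(_params, _session, socket) do
    # Several heavy aggregate queries run back to back here; these are the
    # calls that now intermittently raise DBConnection.ConnectionError.
    socket =
      socket
      |> assign(:job_stats, MyApp.Reports.job_stats())
      |> assign(:daily_volumes, MyApp.Reports.daily_volumes())
      |> assign(:error_rates, MyApp.Reports.error_rates())

    {:ok, socket}
  end
end
```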
Do you have any ideas on how the above phenomena can be explained? We have solutions that seem to work, but we don't know why they work, or why things broke in the first place.