Ecto poolboy checkout timeouts

bobics · August 31, 2019, 7:56pm

I’m seeing an issue where all the poolboy database workers in our pool end up checked out after a modest amount of request load on our server. :poolboy.status/1 shows the state as full. Even after the request load drops, all the workers stay busy. Error log:

GenServer #PID<0.21289.1135> terminating
** (stop) exited in: :gen_server.call(App.Repo.Pool, {:checkout, #Reference<0.2565884396.2513436674.60058>, true}, 5000)
    ** (EXIT) time out

In our Postgres logs we immediately start seeing the following, with very few queries going through:

unexpected EOF on client connection with an open transaction

The issue seems similar to https://github.com/elixir-ecto/db_connection/issues/127 except we’re on OTP 21.0.5 where that bug is already fixed.

We don’t have a specific repro case, except that it seems to occur under some request load. The DB isn’t under any significant load. The issue typically shows up after anywhere from 30 minutes to a few hours of running, and the request load doesn’t vary significantly during that period as far as I can tell. What’s odd is that the application goes from a sub-50ms mean response time to unresponsive almost immediately when the issue occurs, and then never recovers until we restart.

We’re using Ecto 2.2.10 and db_connection 1.1.3.

Hoping someone here can provide some leads, thanks!

bobics · September 1, 2019, 6:33am

Did a bunch of debugging today. I think the underlying issue is some long-running transactions (which don’t get logged properly, so we weren’t seeing them in our Postgres logs). I’ll update this thread when I confirm.
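
For anyone chasing the same symptom, here is a minimal sketch of one way to look for long-running transactions from the app side. It is not a confirmed fix. The module name App.LongTransactions and the 30-second threshold are made up for illustration, and App.Repo is assumed from the pool name (App.Repo.Pool) in the error above. It only relies on the standard pg_stat_activity view and Ecto.Adapters.SQL.query!/3.

```elixir
# Hypothetical helper; the module name and threshold are illustrative assumptions.
defmodule App.LongTransactions do
  # Lists backends whose current transaction has been open for more than
  # 30 seconds, oldest first. "idle in transaction" rows are the usual suspects.
  @sql """
  SELECT pid,
         state,
         now() - xact_start AS xact_age,
         left(query, 80)    AS query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
    AND now() - xact_start > interval '30 seconds'
  ORDER BY xact_start
  """

  # Returns the raw query result (columns and rows) from the repo.
  # App.Repo is assumed to be the repo behind App.Repo.Pool.
  def list do
    Ecto.Adapters.SQL.query!(App.Repo, @sql, [])
  end
end
```

Calling App.LongTransactions.list() from a remote console should work while the pool still has free workers. Once the pool is already full, this call has to check out a connection too, so it would likely hit the same checkout timeout. In that case the same SQL can be run directly from psql instead.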

alexfilatov · February 10, 2022, 9:56am

Hey @bobics, any chance you solved this back in 2019? Thanks!