Ecto/Postgres/Poolboy time outs after a few queries

Since a few days ago I keep getting the following error on my dev machine when making requests to a phoenix app:

exited in: :gen_server.call(#PID<0.656.0>, {:checkout, #Reference<0.1406881335.1305214977.237415>, true, 15000}, 5000) ** (EXIT) time out

It seems like the app works for some time and then the error happens for every request. Sometimes restarting Postgres or the app fixes the problem again for some time, but sometimes not. I suspect it maybe just works again after something times out.

I’m not sure what changed since I’ve last worked on this project some weeks ago. I’ve tried downgrading Postgresql and Elixir and also completely removed Postgresql, including the data directory.

The Phoenix logs for the affected requests show several successful queries before the error happens. It feels like connections are not checked in to the pool again, but I’ve no idea if that’s really true or where to start debugging. The same app works on a different laptop without those issues, though.

I’m on Arch Linux with Postgres 10.4, Ecto 2.2.10 and postgres 0.13.5.

1 Like

it’s difficult to answer but what is happening IMHO is that Ecto can’t check out a connection from the pool for some reason as all connections are being used.

It is possible that your app is doing something in background, or you are (even in dev) making so many parallel requests, or doing some long-running operations that they are not getting freed in time for other requests to pick up.

Can you tell me what’s your pool size config option is in dev mode, and also try to possibly bump it up to something like 30 and see if that sorts things out?

1 Like

That’s what I thought, too. I’ve tried increasing pool_size and pool_overflow to 50, though and it didn’t help.

Also, from the logs it looks like maybe 6 or 7 queries are made before the timeout, so it shouldn’t be enough to exhaust the pool.

The only thing I know I’ve done since it worked was running some system upgrades. I’ve downgraded postgres again though.

1 Like

Update: Seems to only happen with Erlang 20.3. After downgrading to 20.2.2, everything works again. My other project on the same machine was using 20.2.2. When I upgrade Erlang there to 20.3 it’s broken, too.

I’d like to raise this as a bug somewhere, but I’ve no idea where. I don’t know if it’s an issue with Erlang, Ecto, Postgrex, Poolboy, …

4 Likes

That timeout would be on DBConnection. There’s a comment on a similar issue here:

I would also suggest rm -rf _build then recompiling, just to be sure there’s no weirdness.

1 Like

Oh that’s interesting. I am not using 20.3 yet but it seems like a good opportunity to update one of the projects and see if I can replicate your problem. I have got one that’s matching Ecto and postgrex versions to your set up, and is reasonably big so it will be a good test.

It might be a bug in poolboy or database_connection on this specific version of Erlang.

3 Likes

Just to give you an update, so far in development I saw no such issues :(. We are going to deploy to staging an update some time today and we’ll see if it’ll be caught there. It’s weird.

thanks. it’s very weird. we’re only seeing the issue on my machine running arch linux. on a macbook and our ubuntu servers everything is fine. maybe it only happens with very recent postgres versions.

Damn, I migrated from Arch to Ubuntu last year :(.

good for you in this case :wink:

Well I had issues setting up Docker projects and just got fed up with constantly fighting with Arch and I’m already 33 so you know, too old to fight against my dev machine like that all the time.

1 Like

Same error…distro: Antergos Gnome, Erlang 20.3.7 and Elixir 1.6.5

same error here, but only on a production server. Running Arch Linux on both dev and prod machines.
Erlang 20.3.8, Elixir 1.6.5. postgresq db has only four tables with a maximum of 10 rows.

The main question is, how to debug this in production to find the source of the error.

1 Like

This looks to be fixed by Erlang: https://github.com/elixir-ecto/db_connection/issues/127

5 Likes

@amagdas Thanks for updating this thread!

This was all fascinating reading:

A high-level summary is that erlang ports updated a shared global counter async_ref in a way that was not atomic. This race-condition has existed for quite a while but on pre GCC 8.1 this variable would only get read once in each thread. But GCC 8.1 (especially with some optimizations enabled) exposed the race-condition and caused prim_inet:recv0/3 to hang because it was waiting for the wrong reference (since it was waiting for a reference from a different thread). The bug was fixed by changing the async_ref to be unique per-port (since it doesn’t actually need to be shared between ports). This fix is included in Erlang/OTP 21.0.2.

If I got any of that wrong please feel free to correct it :slight_smile:

3 Likes