Ecto/Postgres/Poolboy time outs after a few queries

manukall · May 31, 2018, 12:59pm

Since a few days ago I keep getting the following error on my dev machine when making requests to a phoenix app:

exited in: :gen_server.call(#PID<0.656.0>, {:checkout, #Reference<0.1406881335.1305214977.237415>, true, 15000}, 5000) ** (EXIT) time out

It seems like the app works for some time and then the error happens for every request. Sometimes restarting Postgres or the app fixes the problem again for some time, but sometimes not. I suspect it maybe just works again after something times out.

I’m not sure what changed since I’ve last worked on this project some weeks ago. I’ve tried downgrading Postgresql and Elixir and also completely removed Postgresql, including the data directory.

The Phoenix logs for the affected requests show several successful queries before the error happens. It feels like connections are not checked in to the pool again, but I’ve no idea if that’s really true or where to start debugging. The same app works on a different laptop without those issues, though.

I’m on Arch Linux with Postgres 10.4, Ecto 2.2.10 and postgres 0.13.5.

hubertlepicki · May 31, 2018, 1:12pm

it’s difficult to answer but what is happening IMHO is that Ecto can’t check out a connection from the pool for some reason as all connections are being used.

It is possible that your app is doing something in background, or you are (even in dev) making so many parallel requests, or doing some long-running operations that they are not getting freed in time for other requests to pick up.

Can you tell me what’s your pool size config option is in dev mode, and also try to possibly bump it up to something like 30 and see if that sorts things out?

manukall · May 31, 2018, 1:17pm

That’s what I thought, too. I’ve tried increasing pool_size and pool_overflow to 50, though and it didn’t help.

Also, from the logs it looks like maybe 6 or 7 queries are made before the timeout, so it shouldn’t be enough to exhaust the pool.

The only thing I know I’ve done since it worked was running some system upgrades. I’ve downgraded postgres again though.

manukall · June 4, 2018, 2:14pm

Update: Seems to only happen with Erlang 20.3. After downgrading to 20.2.2, everything works again. My other project on the same machine was using 20.2.2. When I upgrade Erlang there to 20.3 it’s broken, too.

I’d like to raise this as a bug somewhere, but I’ve no idea where. I don’t know if it’s an issue with Erlang, Ecto, Postgrex, Poolboy, …

dom · June 4, 2018, 2:41pm

That timeout would be on DBConnection. There’s a comment on a similar issue here:

github.com/elixir-ecto/db_connection

Elixir.ErlangErrorDBConnection.Poolboy.checkout/3

opened 04:33PM - 28 Nov 17 UTC

closed 10:35PM - 07 Jan 18 UTC

samsondav

We've seen this error a couple of times in production. Stack trace: ``` Elix…ir.ErlangError: Erlang error: {:timeout, {:gen_server, :call, [#PID<0.2180.0>, {:checkout, #Reference<0.2496162780.1105199106.71425>, true, 15000}, 5000]}} File "lib/db_connection/poolboy.ex", line 112, in DBConnection.Poolboy.checkout/3 File "lib/db_connection.ex", line 920, in DBConnection.checkout/2 File "lib/db_connection.ex", line 742, in DBConnection.run/3 File "lib/db_connection.ex", line 636, in DBConnection.execute/4 File "lib/ecto/adapters/postgres/connection.ex", line 98, in Ecto.Adapters.Postgres.Connection.execute/4 File "lib/ecto/adapters/sql.ex", line 256, in Ecto.Adapters.SQL.sql_call/6 File "lib/ecto/adapters/sql.ex", line 436, in Ecto.Adapters.SQL.execute_or_reset/7 File "lib/ecto/repo/queryable.ex", line 133, in Ecto.Repo.Queryable.execute/5 ``` Any ideas why this might be happening and what we can do to resolve it?

I would also suggest rm -rf _build then recompiling, just to be sure there’s no weirdness.

hubertlepicki · June 4, 2018, 3:26pm

Oh that’s interesting. I am not using 20.3 yet but it seems like a good opportunity to update one of the projects and see if I can replicate your problem. I have got one that’s matching Ecto and postgrex versions to your set up, and is reasonably big so it will be a good test.

It might be a bug in poolboy or database_connection on this specific version of Erlang.

hubertlepicki · June 5, 2018, 9:50am

Just to give you an update, so far in development I saw no such issues :(. We are going to deploy to staging an update some time today and we’ll see if it’ll be caught there. It’s weird.

manukall · June 5, 2018, 9:52am

thanks. it’s very weird. we’re only seeing the issue on my machine running arch linux. on a macbook and our ubuntu servers everything is fine. maybe it only happens with very recent postgres versions.

hubertlepicki · June 5, 2018, 9:53am

Damn, I migrated from Arch to Ubuntu last year :(.

manukall · June 5, 2018, 9:56am

good for you in this case

hubertlepicki · June 5, 2018, 9:57am

Well I had issues setting up Docker projects and just got fed up with constantly fighting with Arch and I’m already 33 so you know, too old to fight against my dev machine like that all the time.

Green4ao · June 9, 2018, 10:53am

Same error…distro: Antergos Gnome, Erlang 20.3.7 and Elixir 1.6.5

miha · June 21, 2018, 9:31am

same error here, but only on a production server. Running Arch Linux on both dev and prod machines.
Erlang 20.3.8, Elixir 1.6.5. postgresq db has only four tables with a maximum of 10 rows.

The main question is, how to debug this in production to find the source of the error.

amagdas · July 10, 2018, 3:15pm

This looks to be fixed by Erlang: https://github.com/elixir-ecto/db_connection/issues/127

axelson · July 10, 2018, 8:28pm

@amagdas Thanks for updating this thread!

This was all fascinating reading:

Elixir bug report with initial investigation: https://github.com/elixir-ecto/db_connection/issues/127
Erlang bug report with c/assembly-level investigation: https://bugs.erlang.org/browse/ERL-654
The patch that fixes the issue: https://github.com/jhogberg/otp/commit/e27b98b353df6b4d02d0180fd6ffa86519b41917

A high-level summary is that erlang ports updated a shared global counter async_ref in a way that was not atomic. This race-condition has existed for quite a while but on pre GCC 8.1 this variable would only get read once in each thread. But GCC 8.1 (especially with some optimizations enabled) exposed the race-condition and caused prim_inet:recv0/3 to hang because it was waiting for the wrong reference (since it was waiting for a reference from a different thread). The bug was fixed by changing the async_ref to be unique per-port (since it doesn’t actually need to be shared between ports). This fix is included in Erlang/OTP 21.0.2.

If I got any of that wrong please feel free to correct it