DBConnection crashing with "tcp recv: closed"

dantswain · February 17, 2017, 2:19pm

I could use some help trying to track down where this might be coming from. My DB admin is telling me that he doesn’t see anything on his side and I’ve had the same code work on a different box. The main differences between the two boxes (one where it works and one where it crashes) are: a) using Elixir 1.3.2 on the box where it works, using Elixir 1.4.1 on the box where it crashes and b) the box where it crashes is newer and less busy.

I’m using some fairly recent commits of Ecto and Postgrex - I can supply the commit SHAs if that would help. db_connection is at 1.1.0.

This is the stack trace of the crasher:

=SUPERVISOR REPORT==== 17-Feb-2017::08:15:14 ===
     Supervisor: {local,loader_supervisor}
     Context:    child_terminated
     Reason:     {#{'__exception__' => true,
                    '__struct__' => 'Elixir.DBConnection.ConnectionError',
                    message => <<"tcp recv: closed">>},
                  [{'Elixir.Ecto.Adapters.Postgres.Connection',execute,4,
                       [{file,"lib/ecto/adapters/postgres/connection.ex"},
                        {line,108}]},
                   {'Elixir.Ecto.Adapters.SQL',sql_call,6,
                       [{file,"lib/ecto/adapters/sql.ex"},{line,243}]},
                   {'Elixir.Ecto.Adapters.SQL',execute_or_reset,7,
                       [{file,"lib/ecto/adapters/sql.ex"},{line,441}]},
                   {'Elixir.Ecto.Repo.Queryable',execute,5,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,130}]},
                   {'Elixir.Ecto.Repo.Queryable',all,4,
                       [{file,"lib/ecto/repo/queryable.ex"},{line,35}]},
                   {'Elixir.Ecto.Repo.Preloader',fetch_query,8,
                       [{file,"lib/ecto/repo/preloader.ex"},{line,171}]},
                   {'Elixir.Ecto.Repo.Preloader',preload_assoc,10,
                       [{file,"lib/ecto/repo/preloader.ex"},{line,111}]},
                   {'Elixir.Task.Supervised',do_apply,2,
                       [{file,"lib/task/supervised.ex"},{line,85}]}]}

josevalim · February 17, 2017, 2:30pm

Does it always fail or only from time to time? The error means the database closed the connection but we could have many different reasons for that.

dantswain · February 17, 2017, 2:33pm

It doesn’t fail every single time but does fail fairly regularly. It would take a lot of log digging to get concrete numbers but I’d say 1 out of every 10 or 20 queries?

josevalim · February 17, 2017, 2:36pm

Do you have anything else logged around the log above? The log above is the supervisor report for the failing task, we should also see errors logged directly from DBConnection.

dantswain · February 17, 2017, 2:45pm

The logs are pretty noisy right now because I’m debugging the application and it does a ton of work. The only other bit that seems relevant is this (business stuff redacted):

Function: &:erlang.apply/2
    Args: [#Function<10.33749089/2 in Ecto.Repo.Preloader.preload_each/4>, [{{:assoc, %Ecto.Association.Has{cardinality: :many, defaults: [], field: :field_name, on_cast: nil, on_delete: :nothing, on_replace: :r
aise, owner: OwnerModel, owner_key: :id, queryable: FieldModel, related: FieldModel, related_key: :owner_id, relationship: :child, unique: true}, {0, :owner_id}}
, nil, #Ecto.Query<from l in FieldModel, select: l.field>, []}, [caller: #PID<0.10015.2>]]]

It’s in the middle of a big batch of parallel preloads.

A little more digging and I do occasionally also see this:

2017-02-17 08:39:51.145 [error] Postgrex.Protocol (#PID<0.26991.2>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.28433.2> timed out because it checked out the connection for longer than 15000ms

Is there a config value I can set to increase that timeout? It’s weird that that would be happening just on one box. This could be a problem with our database or the client box configuration, I’m just not familiar enough with the way that DBConnection works to be able to give good feedback to our ops team to look into it.

dantswain · February 17, 2017, 3:38pm

Ah I think I got it!

I didn’t realize that the timeout option was compile time. In my previous experiments I had been cloning the repo and compiling it on the box and on the new box I’m deploying the release with chef. I usually compile against a stripped down prod.exs and then generate a sys.config with chef, and my sys.config had a longer timeout but it wasn’t taking effect because that wasn’t in place at compile time. That really threw me off. We will have to modify our deployment process. It’s a little bit suboptimal for us not to be able to pick up this kind of configuration change without a recompile OTOH these timeout values shouldn’t change often, so it’s not a huge deal.

It also threw me off a little because I could swear I’ve had queries time out before and the error messages were quite different. That could just be bad memory on my part.

Either way, thanks José!

josevalim · February 18, 2017, 8:29am

@dantswain we introduced init/2 in Ecto 2.1 for making runtime values clearer. However I believe the timeout configuration does not handle it, so I have opened an issue to make sure we support it.