(DBConnection.ConnectionError) tcp recv: closed. Server doesn't recover

We have a large query that times out against PostgreSQL, which results in (DBConnection.ConnectionError) tcp recv: closed.
However, after that the app doesn't recover and I have to force-restart it in order for it to be reachable again.

  exit: ** (exit) %DBConnection.ConnectionError{message: "tcp recv: closed"}
  File "lib/ecto/adapters/postgres/connection.ex", line 86, in Ecto.Adapters.Postgres.Connection.prepare_execute/5
  File "lib/ecto/adapters/sql.ex", line 243, in Ecto.Adapters.SQL.sql_call/6
  File "lib/ecto/adapters/sql.ex", line 431, in Ecto.Adapters.SQL.execute_and_cache/7
  File "lib/ecto/repo/queryable.ex", line 130, in Ecto.Repo.Queryable.execute/5
  File "lib/ecto/repo/queryable.ex", line 35, in Ecto.Repo.Queryable.all/4
  File "lib/ecto/repo/preloader.ex", line 171, in Ecto.Repo.Preloader.fetch_query/8
  File "lib/ecto/repo/preloader.ex", line 111, in Ecto.Repo.Preloader.preload_assoc/10
  File "lib/task/supervised.ex", line 85, in Task.Supervised.do_apply/2
  Module "erlang", in :erlang.apply/2

We are using the following setup:
{:phoenix, "~> 1.2.4"},
{:phoenix_ecto, "~> 3.2"},
…

On the server
Erlang/OTP 19 [erts-8.3.5] [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]
psql (PostgreSQL) 9.6.5
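
For context, the timeout mentioned above is Ecto's per-query timeout, which defaults to 15 000 ms and can be raised in the repo configuration or per call. A minimal sketch, assuming the repo is MyApp.Repo under the :my_app OTP app (both names are placeholders):

config :my_app, MyApp.Repo,
  adapter: Ecto.Adapters.Postgres,
  timeout: 60_000,      # per-query timeout in milliseconds (Ecto's default is 15_000)
  pool_timeout: 5_000   # how long to wait for a connection from the pool

The same option can also be passed to a single call, e.g. Repo.all(query, timeout: 60_000), though raising it only hides the problem if the query is genuinely too heavy.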

Is this the correct behaviour: killing the server completely?

It depends a bit on when and where you’re making the query. If it’s inside the init callback of a process, then yes, this is expected. Can you elaborate a bit more on the circumstances surrounding this query?

We are using ExAdmin and have around 20 records that have relationships with another table of 650,000 records. When trying to open that index page, it loads for 15-20 seconds, then returns a 502 Bad Gateway and the app dies. I know the query could definitely be improved. However, I’m more concerned about the application not getting back on its feet after it’s been knocked down.
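
As an aside, when the preload is the heavy part, Ecto can preload through a custom query so the associated rows can be scoped or capped. A rough sketch, assuming hypothetical Retailer and Participant schemas with a has_many :participants association:

import Ecto.Query

# Note: a limit in a preload query applies to the whole preload, not per parent row.
participants_query = from p in Participant, order_by: [desc: p.inserted_at], limit: 1_000

retailers =
  Retailer
  |> MyApp.Repo.all()
  |> MyApp.Repo.preload(participants: participants_query)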

Does your database become completely unresponsive? Do you run out of RAM on either the server or the DB instance? Query timeouts don’t normally cause application failures for me. Do you get just one of the errors you listed or a bunch of them?

So once I run it, it goes up to a maximum of about 1.91 GB. Then it jumps down… See attached.

After that it crashes and it goes down to around 22x MB.

Sorry, what’s “it”? The BEAM or Postgres?

Sorry, I meant the server’s used RAM when saying “it”. :slight_smile:

Here’s the log from PostgreSQL:

2017-11-22 18:03:52.053 UTC [26009] nike_raffle_user@nike_raffle_production LOG:  could not send data to client: Connection reset by peer
2017-11-22 18:03:52.053 UTC [26009] nike_raffle_user@nike_raffle_production STATEMENT:  SELECT p0."id", p0."retailer_id", p0."shoesize_id", p0."shoe_id", p0."first_name", p0."last_name", p0."email", p0."phone", p0."birthdate", p0."picked_up", p0."has_won", p0."is_bot", p0."notified", p0."inserted_at", p0."updated_at", p0."retailer_id" FROM "participants" AS p0 WHERE (p0."retailer_id" = ANY($1)) ORDER BY p0."retailer_id"
2017-11-22 18:03:52.056 UTC [26009] nike_raffle_user@nike_raffle_production FATAL:  connection to client lost
2017-11-22 18:03:52.056 UTC [26009] nike_raffle_user@nike_raffle_production STATEMENT:  SELECT p0."id", p0."retailer_id", p0."shoesize_id", p0."shoe_id", p0."first_name", p0."last_name", p0."email", p0."phone", p0."birthdate", p0."picked_up", p0."has_won", p0."is_bot", p0."notified", p0."inserted_at", p0."updated_at", p0."retailer_id" FROM "participants" AS p0 WHERE (p0."retailer_id" = ANY($1)) ORDER BY p0."retailer_id"

Looks like it’s the application itself that takes a huge amount of RAM. At around 70% of the server’s total RAM usage, it crashes.

Basically, I’m wondering if your OS is just killing it because it’s using up all the RAM.

Yeah, maybe it is. So you mean it’s an OS problem rather than the application itself?

Well, the OS is doing what it’s supposed to be doing. If a program is taking over all the RAM, it’s going to kill that program.

In a production environment you always want to have some form of supervision over the application itself. This can be something simple like systemd or init.d on the server, or something more complex like Kubernetes or ECS. In both cases it’s the job of those systems to restart the whole application should it go down for one reason or another.

In short, everything is doing the job it’s supposed to be doing. The problem here is a design problem: loading 650k DB records into RAM is not really what you want to do. The program tries its best, the OS stops it, and then there’s nothing in place to restart it.
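
For illustration, one way to avoid holding the whole result set in memory is to stream it inside a transaction and process it in batches. A minimal sketch, assuming a MyApp.Repo and a hypothetical Participant schema:

import Ecto.Query

retailer_ids = [1, 2, 3]   # placeholder ids

query = from p in Participant, where: p.retailer_id in ^retailer_ids

MyApp.Repo.transaction(fn ->
  query
  |> MyApp.Repo.stream(max_rows: 500)     # fetch 500 rows at a time from Postgres
  |> Stream.each(&handle_participant/1)   # handle_participant/1 is a placeholder
  |> Stream.run()
end, timeout: :infinity)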

I agree; however, I do have systemd configured, but maybe not in the correct way?

[Unit]
Description=Phoenix Application for ...
After=network.target

[Service]
Type=forking
User=root
Group=root
WorkingDirectory=/var/www/...
ExecStart=/var/www/... start
ExecStop=/var/www/... stop
Restart=on-failure
RestartSec=5
SyslogIdentifier=...
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

And does this work if you stop the application? Is it restarted automatically?
I think RemainAfterExit is not something you want to turn on, if I understand it correctly.
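
For reference, a sketch of the [Service] section with that change, keeping the paths abbreviated as in the original; with RemainAfterExit=yes set, systemd can keep considering the unit active after the process exits, so Restart=on-failure may never kick in:

[Service]
Type=forking
User=root
Group=root
WorkingDirectory=/var/www/...
ExecStart=/var/www/... start
ExecStop=/var/www/... stop
Restart=on-failure
RestartSec=5
# RemainAfterExit=yes removed, so the unit goes inactive/failed when the process exits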

One thing I came to realize after struggling to make systemd work correctly when re/starting apps was that environment variables have to be set in the systemd unit file (or somehow loaded into it), otherwise they won’t be available when the app is restarted.
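
For example, variables can be set directly in the unit or loaded from a file (the variable names and file path below are only placeholders):

[Service]
Environment=MIX_ENV=prod
Environment=PORT=4000
EnvironmentFile=/etc/default/myapp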

Also, if the application itself doesn’t fully crash but just errors out, systemd won’t do anything, and that has to be dealt with inside the app.