Postgrex Client Exited Help

chasers · March 18, 2019, 6:36pm

Any insight would be appreciated here…

I’m getting this message in my logs and can’t quite figure it out:

Postgrex.Protocol (#PID<___>) disconnected: ** (DBConnection.ConnectionError) client #PID<___> exited

I’ve been working on a centralized logging service https://logflare.app. It’s really easy to start using if you’re on Cloudflare because we have a Cloudflare app. Recently been getting some larger sites signing up.

This weekend we got another big one playing with it. CPU was high and ram started climbing. Looking at the logs I see that message.

At that point everything was on a $40 a month Digital Ocean box (which is awesome because we are handling a lot of requests, sometime sites are sending upwards of 1000 requests a second). No downtime at all and no bad response codes. Ecto seems to just be reissuing the query or something if it exits.

My first try at alleviating this was to move Postgres to it’s own instance. I did that … still seeing errors. Doesn’t seem to be Postgres struggling. Upped the Ecto pool count, and enabled pooling on Postgres. No difference there.

Then I started caching API keys in ETS. Which is also awesome because it was super easy. A little lighter load on Postgres but still I get those errors in the logs.

Not quite sure where to go from here. The way I see it I can:
a) Cache more stuff in ETS.
b) Upgrade the box.
c) Remove Nginx as maybe there is some contention there (currently limiting Nginx to 2 workers and it’s a 4 core box).

Although I’m still kind of just throwing darts here. Is something in my code killing the client? Is whatever making that call finishing before the query returns and causing the exit?

I need to figure out how to run observer remotely. Maybe that will shed some light on things. I’m assuming the ram issue is logger filling up with error messages.

Another fun side effect is that mix edeliver restart production just hangs. No idea why that is either.

Code is all up at: https://github.com/Logflare/logflare

engineeringdept · March 18, 2019, 8:16pm

You’re not running on Erlang/OTP 21.3 are you? I got bitten by this bug, when connecting to Postgres over SSL.

outlog · March 18, 2019, 8:46pm

yeah, this(similar error output)was also in an earlier OTP (20)… so make sure to bump to latest patch version - and avoid 21.3

chasers · March 18, 2019, 8:58pm

Running OTP 21 will by a different patch version

outlog · March 18, 2019, 9:04pm

what version are you on now?

chasers · March 18, 2019, 9:21pm

Did sudo apt-get upgrade esl-erlang=1:21.2.7-1 and still seeing this issue.

Upgrade looks like it went fine. Killed and restarted the app. Maybe recompile?

chasers · March 18, 2019, 9:23pm

And come to think of it … it was happening before when Postgres was on the same server and I wasn’t connecting over ssl.

outlog · March 18, 2019, 9:30pm

can you post a bit more of the error log eg. is an timeout?

would assume you need to do a new release to pick it up - also update your build server…

also mix.hex outdated has latest postgrex/ecto etc?

EDIT: also what OTP version did you have on it?

chasers · March 18, 2019, 10:01pm

Logs just look a lot like this:

21:37:30.862 [info] POST /api/logs
21:37:30.862 [info] Sent 403 in 228µs
21:37:30.862 [error] Postgrex.Protocol (#PID<0.1817.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.8124.20> exited
21:37:30.874 [info] Sent 200 in 42ms
21:37:30.877 [info] POST /api/logs
21:37:30.878 [info] Sent 200 in 27ms
21:37:30.878 [info] POST /api/logs
21:37:30.878 [info] Sent 403 in 158µs
21:37:30.878 [error] Postgrex.Protocol (#PID<0.1816.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.8480.20> exited
21:37:30.881 [info] POST /api/logs
21:37:30.885 [info] Sent 200 in 32ms
21:37:30.886 [info] Sent 200 in 29ms
21:37:30.886 [info] Sent 200 in 28ms
21:37:30.892 [info] POST /api/logs
21:37:30.892 [info] Sent 403 in 224µs
21:37:30.892 [error] Postgrex.Protocol (#PID<0.1802.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.32377.19> exited
21:37:30.894 [info] Sent 200 in 12ms
21:37:30.895 [info] Sent 200 in 17ms
21:37:30.897 [info] POST /api/logs
21:37:30.897 [info] Sent 403 in 288µs
21:37:30.898 [error] Postgrex.Protocol (#PID<0.1869.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.7904.20> exited

deps are:

defp deps do
    [
      {:phoenix, "~> 1.4.0"},
      {:phoenix_pubsub, "~> 1.0"},
      {:phoenix_ecto, "~> 4.0"},
      {:ecto_sql, "~> 3.0.5"},
      {:postgrex, ">= 0.0.0"},
      {:phoenix_html, "~> 2.10"},
      {:phoenix_live_reload, "~> 1.0", only: :dev},
      {:gettext, "~> 0.11"},
      {:plug_cowboy, "~> 2.0"},
      {:ueberauth_github, "~> 0.7"},
      {:plug, "~> 1.7"},
      {:jason, "~> 1.0"},
      {:distillery, "~> 2.0"},
      {:edeliver, ">= 1.6.0"},
      {:httpoison, "~> 1.4"},
      {:phoenix_oauth2_provider,
       github: "danschultzer/phoenix_oauth2_provider", branch: "phoenix-1-4"},
      {:poison, "~> 3.1"},
      {:ueberauth_google, "~> 0.8"}
    ]
  end

Build server is currently the same as prod.

outlog · March 18, 2019, 11:21pm

Those logs are missing key information… but I will go ahead and assume those are timeouts…

I assume “POST /api/logs” to be a flaming hot path and from a quick look you have 5-7 db queries going on that is way much for a hot path (starting with a plug assigning user which seems not needed)

increase the ecto :timeout

consider using Repo.checkout to bundle them into one, but perhaps better to use Cachex and avoid the db calls.

they look like somewhat stable repo.get/repo.all - so wrap those calls with Cachex - add appropriate cachex update/delete calls where needed elsewhere to bust/update the cache.

consider instrumentation like https://github.com/deadtrickster/prometheus-ecto or similar

chasers · March 18, 2019, 11:39pm

Yeah untangling those queries from there was definitely my next idea.

I thought about the timeout but none of them seem to be running long:

And from what I read elsewhere you’d normally see a timeout message with the error.

Anyways, much appreciated. I will rewrite the logs stuff with some caching and go from there.

outlog · March 19, 2019, 12:37am

are you running pgbouncer or pgpool?

also try and bump log level in prod to :debug for a short while and get a better error trace…

chasers · March 19, 2019, 1:07am

pgbouncer

sanswork · March 19, 2019, 1:27am

I’m seeing this exact same thing since switching to DO DBaaS. I haven’t had much time to investigate since as far as I can tell it isn’t actually impacting our service but happy to provide any additional data from another source that might help resolve this.

chasers · March 19, 2019, 4:03am

@sanswork yeah I was seeing it before moving to DO DBaaS … still seeing it there.

@outlog debug logs. Some duplicate queries I will take care of and cache but still nothing insightful.

03:57:00.467 [info] POST /api/logs
03:57:00.467 [info] Sent 403 in 134µs
03:57:00.467 [error] Postgrex.Protocol (#PID<0.1792.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.15429.0> exited
03:57:00.467 [debug] QUERY OK source="sources" db=0.9ms
SELECT s0."id", s0."name", s0."token", s0."public_token", s0."user_id", s0."overflow_source", s0."inserted_at", s0."updated_at" FROM "sources" AS s0 WHERE (s0."token" = $1) [<<250, 216, 124, 203, 137, 171, 76, 211, 131, 234, 81, 39, 84, 62, 128, 240>>]
03:57:00.472 [debug] QUERY OK source="sources" db=4.5ms
SELECT s0."id", s0."name", s0."token", s0."public_token", s0."user_id", s0."overflow_source", s0."inserted_at", s0."updated_at" FROM "sources" AS s0 WHERE (s0."token" = $1) [<<250, 216, 124, 203, 137, 171, 76, 211, 131, 234, 81, 39, 84, 62, 128, 240>>]
03:57:00.473 [debug] QUERY OK source="sources" db=1.1ms
SELECT s0."id", s0."name", s0."token", s0."public_token", s0."user_id", s0."overflow_source", s0."inserted_at", s0."updated_at" FROM "sources" AS s0 WHERE (s0."token" = $1) [<<250, 216, 124, 203, 137, 171, 76, 211, 131, 234, 81, 39, 84, 62, 128, 240>>]
03:57:00.475 [debug] QUERY OK source="rules" db=1.4ms
SELECT r0."id", r0."regex", r0."sink", r0."source_id" FROM "rules" AS r0 WHERE (r0."source_id" = $1) [142]
03:57:00.475 [info] Sent 200 in 10ms
03:57:00.496 [info] POST /api/logs
03:57:00.496 [info] Sent 403 in 180µs
03:57:00.496 [error] Postgrex.Protocol (#PID<0.1791.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.15435.0> exited
03:57:00.553 [info] POST /api/logs
03:57:00.556 [debug] QUERY OK source="sources" db=2.1ms
SELECT s0."id", s0."name", s0."token", s0."public_token", s0."user_id", s0."overflow_source", s0."inserted_at", s0."updated_at" FROM "sources" AS s0 WHERE (s0."token" = $1) [<<250, 216, 124, 203, 137, 171, 76, 211, 131, 234, 81, 39, 84, 62, 128, 240>>]

idk

chasers · March 19, 2019, 4:11am

Like maybe it’s supposed to happen if you halt() in a plug it just kills the client?

outlog · March 19, 2019, 8:04am

sorry my bad on the debug level… you need to enable :show_sensitive_data_on_connection_error to get better disconnect error messages, note the sensitive part… db_connection/lib/db_connection.ex at master · elixir-ecto/db_connection · GitHub

nope you are fine

the exact same error? or similar?

yeah, seems like a general issue running up against pgbouncer (at least the DO one)

yurko · March 19, 2019, 3:34pm

Weird that there is so little info in debug logs, but these might still be timeouts cause workers have to wait too long and increasing pool size would not help if your CPU can’t handle the load. To check it it’s true you can add timeout to your config, the default one is 15 seconds and you can set it to say 30:

config :app, App.Repo,
  ...
  pool_size: ..,
  timeout: 30000

The actual option is here https://hexdocs.pm/postgrex/Postgrex.html

If you have less such errors after changing the setting it would mean you actually short on CPU, but only sometimes. If it’s not about peak times but rather constant load is too high, the setting will do nothing but delay the errors. Anyway this might be a sign that you have to upgrade the droplet.

chasers · March 19, 2019, 8:12pm

Well rewrote it … no db calls anymore. Previously:

20:01:39.968 [info] Sent 200 in 6ms
20:01:39.973 [info] POST /api/logs
20:01:39.974 [info] POST /api/logs
20:01:39.976 [info] POST /api/logs
20:01:39.977 [info] POST /api/logs
20:01:39.979 [info] POST /api/logs
20:01:39.981 [info] Sent 200 in 7ms
20:01:39.982 [info] Sent 200 in 7ms
20:01:39.983 [info] Sent 200 in 5ms
20:01:39.984 [info] Sent 200 in 7ms

Using my ETS cache:

20:05:50.294 [info] Sent 200 in 251µs
20:05:50.327 [info] POST /api/logs
20:05:50.327 [info] Sent 200 in 430µs
20:05:50.337 [info] POST /api/logs
20:05:50.337 [info] Sent 200 in 483µs
20:05:50.342 [info] POST /api/logs
20:05:50.343 [info] Sent 200 in 358µs
20:05:50.344 [info] POST /api/logs
20:05:50.344 [info] Sent 200 in 374µs
20:05:50.357 [info] POST /api/logs

I cut off that user from POSTing to us temporarily via an NGINX rule. Will reactivate them again soon to see but since we’re not making any db calls I’m certain I won’t see that message again. It was definitely only when they were hitting us and we were sending a 403 because they deleted the source they were actively logging to.

Another question … so implementing this cache … I’m assuming this is what Phoenix contexts are for? I started this app in Phoenix 1.3 so wasn’t familiar with contexts but now it would be nice if all my db calls were behind one module interface so I only had one place to update the cache when needed.

chasers · March 19, 2019, 10:16pm

Holy hell okay if anyone cares this was the real fix for the original problem:

Basically I was using pipelineing in a plug module and was setting the response multiple times. Rewrote with plug builder so halt() is actually halting now. That was also one of the first things I wrote in Elixir / for this app so I’m sure that explains it haha.