Intermittent DB Timeout Errors from Postgrex on AWS RDS

My logs are showing intermittent timeout errors for queries that are normally really quick (milliseconds). They are just SELECT queries for ~10 records.

DBConnection.Connection: 20:52:25.688 [error] Postgrex.Protocol (#PID<0.2258.0>) 
disconnected: ** (DBConnection.ConnectionError) client #PID<0.2502.0> timed out because 
it checked out the connection for longer than 15000ms 

Has anyone experienced this? Some thoughts that come to mind:

  1. Connectivity of RDS
  2. Records are being locked by another process

Would appreciate any commentary here. I’m not sure how to go about debugging this. I read all the threads about passing extended timeouts to the DB, but that doesn’t seem necessary since I don’t have any Ecto queries that run over 1s.
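
For reference, the extended-timeout approach from those threads looks roughly like this (a sketch; App.Repo and the query are placeholders, and 30_000 is just an arbitrary value in milliseconds):

import Ecto.Query

# Per-query timeout override. The default is 15_000ms, which matches
# the 15000ms in the error above.
query = from r in "records", select: r.id, limit: 10
App.Repo.all(query, timeout: 30_000)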

2 Likes

You should probably turn on the query logger and EXPLAIN the query to start with. :slight_smile:

Do you mean, like, enable debug: true?

Heh, I mean on the PostgreSQL server itself. I’m pretty sure this will be standard database-side debugging rather than anything Elixir-specific (though that’s possible, I find it highly unlikely). :slight_smile:
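
That said, if you want a quick look at the plan without leaving iex, something like this works through the repo (a sketch; App.Repo and the SQL are placeholders for your own repo and query):

# Run EXPLAIN ANALYZE for the same statement the app issues and print the plan.
{:ok, result} =
  Ecto.Adapters.SQL.query(
    App.Repo,
    "EXPLAIN ANALYZE SELECT * FROM some_table WHERE some_column = $1 LIMIT 10",
    ["some_value"]
  )

Enum.each(result.rows, fn [line] -> IO.puts(line) end)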

1 Like

I have the same problem. A query that is not slow takes longer than the timeout setting.

I haven’t solved this problem yet and haven’t found any hints toward a solution.

1 Like

What hint did you find out?

No no. I meant I didn’t find any hint. Sorry.

In your config/xxxx.exs, do something like the following:

use Mix.Config

config :my_app, App.Repo,
  adapter: Ecto.Adapters.Postgres,
  username: "",
  password: "",
  database: "",
  hostname: "",
  port: 5432,
  # how long a caller may wait to check a connection out of the pool
  pool_timeout: :infinity,
  # how long a query may hold a checked-out connection
  timeout: :infinity

Try the :infinity

I’m sure that would do the trick, but it would just address the symptom. I’m trying to uncover the source of the problem. If I figure it out I’ll post it up here.

That problem also happened on one of my projects, but the connection there goes over an external network, so my guess was a slow TCP handshake.

In your case it may be the connection pool being maxed out.

That’s what I’m thinking.

We’re also seeing CPU spikes that set off alarms, which start additional instances. Since we’re not using PgBouncer, that can double or triple the number of connections to the same database. I suspect this is the cause.
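
One way to sanity-check this is to ask Postgres how many connections it is actually holding when the alarms fire (a sketch through the repo; App.Repo is a placeholder):

# Count open connections grouped by application name and client address.
# If (number of app instances) * pool_size approaches the instance's
# max_connections, the connection-exhaustion theory fits.
{:ok, result} =
  Ecto.Adapters.SQL.query(
    App.Repo,
    """
    SELECT application_name, client_addr, count(*)
    FROM pg_stat_activity
    GROUP BY application_name, client_addr
    ORDER BY count(*) DESC
    """,
    []
  )

IO.inspect(result.rows)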

In my case, using Phoenix on Heroku, setting the pool size differently in config/prod.exs and in the Heroku config currently resolves the timeout error.

When I set the pool size to the same value in prod.exs and on Heroku, like below, I hit the client timeout error every 5 minutes or less.

config/prod.exs 
config :myapp, MyApp.Repo,
  adapter: Ecto.Adapters.Postgres,
  url: System.get_env("DATABASE_URL"),
  pool_size: String.to_integer(System.get_env("POOL_SIZE") || "18"),
  ssl: true

Heroku config
$ heroku config:set POOL_SIZE=18

Then I followed the official Phoenix doc (https://hexdocs.pm/phoenix/heroku.html) and set the pool sizes differently.

config/prod.exs 
config :myapp, MyApp.Repo,
  adapter: Ecto.Adapters.Postgres,
  url: System.get_env("DATABASE_URL"),
  pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10"),
  ssl: true

Heroku config
$ heroku config:set POOL_SIZE=18

I don’t know why, but it works. :slight_smile:

I’m not sure, but I suppose that when the two pool sizes are the same, the pool can overflow, and that causes the timeout error because a process can’t check out a connection within the time limit.
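
If you’re on a recent Ecto (3.x), one way to confirm that checkouts really are queueing behind a full pool is to watch the queue_time measurement on the repo’s telemetry events. This is a sketch; the [:my_app, :repo, :query] event name assumes a repo module called MyApp.Repo, and 100ms is an arbitrary threshold:

# Warn whenever a query spent more than ~100ms waiting for a pool connection.
:telemetry.attach(
  "log-slow-checkouts",
  [:my_app, :repo, :query],
  fn _event, measurements, metadata, _config ->
    queue_time = measurements[:queue_time]

    if is_integer(queue_time) do
      ms = System.convert_time_unit(queue_time, :native, :millisecond)

      if ms > 100 do
        IO.puts("slow checkout: #{ms}ms waiting for a connection (source: #{inspect(metadata[:source])})")
      end
    end
  end,
  nil
)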

You’ll have to dig into Postgres to find an explanation here. Nothing you mention seems Elixir-related, but here’s some information that will hopefully be helpful to you.

You’ll need to use psql (or whatever your favorite tool for issuing SQL queries is), examine the logs that Postgres RDS makes available, and make sure your RDS instance is set up to capture data that’s useful for debugging problems like this.

Here’s what I wrote up at some point as general advice for Postgres RDS:

  • rds.log_retention_period - this is set to 4320 (3 days of logs) by default. If you want more than 3 days of logs then you’ll have to increase it. The value is specified in minutes.
  • log_statement and log_min_duration_statement - these are off by default. Set log_statement to "all" and log_min_duration_statement to a reasonable value for your database. Changing these does not require a restart and they take effect immediately. log_min_duration_statement is in milliseconds.
  • CloudWatch monitoring gives you a lot with RDS for tracking CPU/memory usage. Consider whether you need to enable detailed monitoring. It can be very helpful, but whether it’s worth enabling depends on how much stress your application’s database is under. The more stressed the database, the more likely it is that detailed monitoring will help pinpoint problems.
  • pg_stat_statements is supported on RDS and you should enable it. It is an incredibly useful tool for examining which queries hit your database, how often, and how long they run (see the example query after this list). Installing this extension requires you to shut down/restart your database.
  • Install the pg_repack utility on your RDS instance. It lets you reduce bloat in indexes and tables without locking the database. This can be done via the AWS Console for RDS. It does not require a restart.

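For the pg_stat_statements point above, once the extension is enabled you can pull the heaviest queries straight from iex (a sketch; App.Repo is a placeholder, and on newer Postgres versions the columns are total_exec_time/mean_exec_time rather than total_time/mean_time):

# Top 10 statements by total execution time.
{:ok, result} =
  Ecto.Adapters.SQL.query(
    App.Repo,
    """
    SELECT calls, total_time, mean_time, query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10
    """,
    []
  )

Enum.each(result.rows, &IO.inspect/1)
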
There are a couple of approaches to checking on locks with Postgres. Check the Postgres web site for queries. Here’s one to get you started:

SELECT bl.pid AS blocked_pid,
a.usename AS blocked_user,
ka.query AS blocking_statement,
now() - ka.query_start AS blocking_duration,
kl.pid AS blocking_pid,
ka.usename AS blocking_user,
a.query AS blocked_statement,
now() - a.query_start AS blocked_duration
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity a ON a.pid = bl.pid
JOIN pg_catalog.pg_locks kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.GRANTED;

Once you’ve gathered some information on your problem from the Postgres side, if you’re still puzzled, head over to the Postgres IRC channel on freenode. The folks there are friendly for the most part, incredibly knowledgeable about the database, and can offer assistance in tracking down your issue.

  1. Check the RDS console to see if there’s any unusual activity –
    even a scheduled snapshot can have an impact if there’s a
    minimal amount of memory

  2. Are you sure nothing else is accessing the DB? I had a client
    whose app ground to a halt when the Data Science folks were
    using Tableau for fancy visualizations (against the prod DB!)

HTH,

To close the loop on this: I’m almost 100% certain the timeout errors were caused by the DB connection pool being maxed out when my autoscaling parameters started additional instances of my application.

1 Like