Ecto: retry query on timeout

My situation: I have a table with millions of download-tracking entries that I am running queries on to generate report data. Query times vary a lot, and sometimes even a timeout of several minutes is not enough. When I then go to the database and run the identical query manually, it finishes in seconds.

Postgrex.Protocol (#PID<0.438.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.593.0> timed out because it checked out the connection for longer than 300000ms
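
For reference, that 300000ms is the five-minute query timeout I'm raising (whether per call or in the repo config); the failing call looks roughly like this, with the table and query simplified:

# Simplified sketch of the failing call: a 5-minute :timeout instead of the
# default 15_000 ms ("downloads" and the aggregation are placeholders).
import Ecto.Query

report_query =
  from(d in "downloads",
    group_by: d.country,
    select: {d.country, count(d.id)}
  )

MyApp.Repo.all(report_query, timeout: 300_000)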

My workaround idea is to rerun the query when it times out. But how do I do that? Isolate each query execution in Task.Supervisor.async_nolink and check whether the task finishes normally or is killed? Is there a more direct way?

Bonus question: What could be the reason for these erratic timeouts?

> My workaround idea is to rerun the query when it times out. But how do I do that? Isolate each query execution in Task.Supervisor.async_nolink and check whether the task finishes normally or is killed? Is there a more direct way?

I like this approach. Task.yield/shutdown were designed exactly for this. :+1:
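
A minimal sketch of that pattern, assuming a Task.Supervisor named MyApp.TaskSupervisor is already in your supervision tree (module name, timeout, and retry count are placeholders):

# Run the query in an unlinked task; wait up to the query timeout, then
# shut the task down and retry. A crashed task (e.g. the DBConnection
# timeout error) shows up as {:exit, reason} and is retried as well.
defmodule MyApp.ReportRunner do
  @query_timeout 300_000
  @max_attempts 3

  def run_with_retry(fun, attempt \\ 1) do
    task = Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fun)

    case Task.yield(task, @query_timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} ->
        {:ok, result}

      _timeout_or_crash when attempt < @max_attempts ->
        run_with_retry(fun, attempt + 1)

      _timeout_or_crash ->
        {:error, :max_attempts_exceeded}
    end
  end
end

# Usage:
# MyApp.ReportRunner.run_with_retry(fn ->
#   MyApp.Repo.all(report_query, timeout: 300_000)
# end)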

Whenever a query succeeds, we log some stuff. For the cases where the query succeeds but still takes long, how long are the decode/queue/query times? That should provide some guidance. If it is the query time that varies a lot, then I would guess database locks, maybe?
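
If you are not capturing those numbers anywhere, here is a rough sketch of a telemetry handler that logs them per query (assuming the repo is MyApp.Repo, so the default event name is [:my_app, :repo, :query]):

# Logs Ecto's per-query timings. Measurements arrive in native time units;
# keys like :decode_time may be absent, hence the nil handling.
defmodule MyApp.QueryTimeLogger do
  require Logger

  def attach do
    :telemetry.attach("query-time-logger", [:my_app, :repo, :query], &__MODULE__.handle_event/4, nil)
  end

  def handle_event(_event, measurements, metadata, _config) do
    to_ms = fn
      nil -> nil
      native -> System.convert_time_unit(native, :native, :millisecond)
    end

    Logger.info(
      "source=#{inspect(metadata[:source])} " <>
        "queue=#{to_ms.(measurements[:queue_time])}ms " <>
        "query=#{to_ms.(measurements[:query_time])}ms " <>
        "decode=#{to_ms.(measurements[:decode_time])}ms"
    )
  end
end

Call MyApp.QueryTimeLogger.attach() once at application start and compare the three timings for the slow runs.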


Cool, I’ll go that route then :slight_smile:

Yes, query time. Also, since it's a tracking table, it still gets lots of inserts while I generate the reports. I don't know enough about databases to tell whether those inserts affect the expensive SELECTs I'm running. Where would I start reading about database locks, or figuring out whether that's the core issue and how to solve it?

Each database is going to have a different implementation, so I would recommend getting an advanced book on your database of choice. It has been a while since I read one about Postgres specifically, so I don’t have one to recommend, but hopefully others can chime in!


If you’re using Postgres, look into turning on the pg_stat_statements extension (PostgreSQL docs, F.32: “pg_stat_statements — track statistics of SQL planning and execution”), which will let the database itself help you track long-running queries.
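
For example (a sketch; the *_exec_time column names assume Postgres 13 or newer, older versions use total_time/mean_time):

-- pg_stat_statements has to be preloaded, e.g. in postgresql.conf:
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by average execution time.
SELECT calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms,
       query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;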


If it's Postgres, then there are resources available to figure out what the issue is. There are some basic things you should know.

The pg_stat_statements suggestion by @benwilson512 is great. What I do with every Postgres DB I use in production is to ensure that extension is loaded, then on an hourly basis grab all the data from the view it exposes, write it to an archive database (with an hourly timestamp), and reset pg_stat_statements. This allows me to see the impact of a release and compare hours historically (number of queries, average execution time, etc.).
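
The hourly snapshot boils down to something like this (the archive table is made up, and here it lives in a separate schema rather than a separate database for brevity):

-- Snapshot the current stats with an hourly timestamp, then reset the counters.
INSERT INTO stats_archive.pg_stat_statements_hourly
SELECT date_trunc('hour', now()) AS captured_at, *
FROM pg_stat_statements;

SELECT pg_stat_statements_reset();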

pg_stat_activity shows your current connections and the last (or currently active) query executing on each of them. This query might be helpful:

SELECT s."state", s.client_addr, s.usename, now()-s.query_start as age, s.pid, s.query
FROM pg_stat_activity s
WHERE s.query != '<IDLE>'
AND s."state" != 'idle'
and s.pid != pg_backend_pid()
order by 4 desc;

pg_locks stores locking information. There are queries on the Postgres site for checking whether your query is blocked and what is blocking it. Here's a basic one to start with:

SELECT bl.pid                 AS blocked_pid,
       a.usename              AS blocked_user,
       ka.query               AS blocking_statement,
       now() - ka.query_start AS blocking_duration,
       kl.pid                 AS blocking_pid,
       ka.usename             AS blocking_user,
       a.query                AS blocked_statement,
       now() - a.query_start  AS blocked_duration
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity a ON a.pid = bl.pid
JOIN pg_catalog.pg_locks kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.granted;

If you are stuck or determine that you have some sort of Postgres issue, then I'd also suggest visiting the #postgres channel on Freenode IRC. There are some incredibly knowledgeable folks who hang out there, and they are helpful.


Hard to tell without more information.

  • When you rerun the query - I hope you do it with the same parameters :wink:
  • Do the long-running queries appear one after the other? Impatient users can trigger the query multiple times when the first one does not respond - on traditional hard drives this can go from unproblematic to unresponsive very fast
  • Check how many connections the DB server has and how many are active
  • Learn more about your DB (not fun, but necessary); as @josevalim mentioned, every DB is different
  • Think about simple replication; most reporting will impact the production side at some point (in my experience). A simple two-node scenario where node1 is the production node and node2 is the “readonly” source for reporting is not too hard to set up (see the sketch right after this list).
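
For the replication idea, the Elixir side of such a setup could look roughly like this (module, host, and config names are made up; node2 would be a streaming replica of node1):

# A second, read-only Ecto repo that points at the replica, so reporting
# queries never run against the production node. Add it to the application's
# supervision tree next to MyApp.Repo.
defmodule MyApp.ReplicaRepo do
  use Ecto.Repo,
    otp_app: :my_app,
    adapter: Ecto.Adapters.Postgres,
    read_only: true
end

# In config (placeholders):
#   config :my_app, MyApp.ReplicaRepo,
#     hostname: "node2.internal",
#     database: "my_app_prod",
#     pool_size: 5
#
# Reporting code then calls MyApp.ReplicaRepo.all(report_query, timeout: 300_000)
# while MyApp.Repo keeps handling the production writes.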

Most important thing: try to capture some app and DB server stats (CPU load, …) when you retry your query - maybe it's just a resource problem. Do you monitor your server? Any data about when the queries go slow?