Transactions sometimes time out for no apparent reason

What do I do when a transaction hangs on the final commit? Ecto wraps the update query (second to last in the trace) in a transaction, and both the BEGIN and the UPDATE statements seem to go through. What could cause a transaction like this to time out, particularly for over 60s on a table of under 1 million rows with 3 indices? This query usually finishes in 30ms.

The update statement is something as boring as "UPDATE "templates" SET "shared_count" = $1, "updated_at" = $2 WHERE "id" = $3".
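For context, a minimal sketch of how such an update is typically issued from Ecto, with an explicit per-call :timeout (MyApp.Repo, the Template schema, and bump_shared_count/1 are placeholder names, not the original poster's code; Ecto's default query timeout is 15_000 ms):

def bump_shared_count(id) do
  MyApp.Repo.transaction(
    fn ->
      template = MyApp.Repo.get!(Template, id)

      template
      |> Ecto.Changeset.change(shared_count: template.shared_count + 1)
      |> MyApp.Repo.update!()
    end,
    timeout: 60_000  # DBConnection/Ecto :timeout option in ms; default is 15_000
  )
end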

The resulting error is HTTP request error: tcp recv: closed, from a similar call (as an example):

(db_connection) lib/db_connection.ex:802: DBConnection.transaction/3
(ecto) lib/ecto/repo/schema.ex:125: Ecto.Repo.Schema.insert!/4

From the APM: https://www.dropbox.com/s/gqnrkoruywe7l37/Screenshot%202018-05-23%2021.24.59.png?dl=0

What could be the reason for this? The people running the database said there were no connectivity issues from their end.

There might be other transactions in the system with locks on the same rows, so the database needs to wait for them to finish and release those locks before committing your transaction.
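To illustrate with a hypothetical example (Template, MyApp.Repo, and the id are made up, not taken from the original post): if another process holds a row lock on the same template, the cheap UPDATE blocks until that transaction commits or rolls back.

import Ecto.Query

# Session A: a hypothetical long-running job takes a row lock and holds it.
MyApp.Repo.transaction(fn ->
  MyApp.Repo.one!(from t in Template, where: t.id == 42, lock: "FOR UPDATE")
  Process.sleep(120_000)  # any slow work here keeps the row locked
end, timeout: :infinity)

# Session B: the same cheap UPDATE now waits on Session A's row lock and hits
# the query timeout, even though the statement itself normally takes ~30ms.
MyApp.Repo.update_all(from(t in Template, where: t.id == 42), inc: [shared_count: 1])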


This is a replicated system with regular backups. I am trying to figure out what in the world could take longer than 60s to finish that would lock rows in a database during an INSERT operation. I do know there are indices that need to be updated, but there is nothing that the INSERT touches that could be locked.

I have the feeling that there is some operation that times out, pulling down the whole Postgrex connection with it in some way (throwing an exception somewhere), which THEN times out all the currently running INSERTs and UPDATEs with it. Is that possible? If so, how can I prevent the initial exception from killing the process?

Furthermore, I have found this error message in all cases of this happening recently: Postgrex.Protocol (#PID<0.2126.0>) failed to connect: ** (RuntimeError) awaited on another connection that failed to bootstrap types.

Actually, here is the list of errors as they happen:

May 24 16:07:49: 14:07:49.902 [error] Postgrex.Protocol (#PID<0.2126.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.6725.4> timed out because it checked out the connection for longer than 120000ms 
May 24 16:07:50: 14:07:50.803 [error] Postgrex.Protocol (#PID<0.2128.0>) timed out because it was handshaking for longer than 120000ms 
May 24 16:07:50: 14:07:50.810 [error] Postgrex.Protocol (#PID<0.7168.4>) failed to connect: ** (RuntimeError) awaited on another connection that failed to bootstrap types 

So this happens over and over again, until I restart the application. Then, it reconnects fine. What can I do to prevent this error loop?


Have you tried fiddling with the pool_size Ecto option?

I would still guess you have transactions blocking each other though.
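For reference, a sketch of where pool_size (and the per-query timeout) live, assuming a standard MyApp.Repo; the numbers are illustrative, not recommendations:

# config/prod.exs
config :my_app, MyApp.Repo,
  pool_size: 20,      # connections kept in the pool
  timeout: 60_000     # per-query timeout in ms (default 15_000)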

I have reduced the pool size to no effect. I’ll increase it a bit and see if it helps.

What I wonder about is how an INSERT can be blocked by a transaction. That seems so odd.

Well, VACUUM can block an INSERT.
You can use queries from this wiki to see what is blocking the transaction:
https://wiki.postgresql.org/wiki/Lock_Monitoring
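For a quick check from the application side, a sketch along the lines of the wiki's queries (assumes Postgres 9.6+ for pg_blocking_pids/1; adjust for older versions):

# Lists backends that are currently blocked, which PIDs block them, and their queries.
{:ok, result} =
  Ecto.Adapters.SQL.query(MyApp.Repo, """
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         state,
         query
    FROM pg_stat_activity
   WHERE cardinality(pg_blocking_pids(pid)) > 0
  """)

IO.inspect(result.rows, label: "blocked backends")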


The problem was resolved on the database side. The running Postgres instance was faulty, randomly timing out connections for minutes on end. Failing over to the secondary completely resolved the issue. It would be great if Postgres reported the actual reason for the disconnects (which in this case was transactions falsely deadlocking until the DB pretty much threw up), but as far as Ecto and Postgrex are concerned, there is no fault with the libraries.

Pool size and timeout also had no effect, as this was a database problem. Thanks everyone for your thoughts and ideas.
