I’ve spent more time debugging this issue. I’ve copied db_connection
and postgrex
from deps/
into local_deps/
, made them a path dependency and then sprinkled them with logging. Here’s what I got.
When the migrations fail, it looks like the connection checkout times-out before postgrex
connects to the database:
~ $ ./bin/myapp eval MyApp.Release.migrate
22:03:09.885 [info] DBConnection.ConnectionPool.start_link
22:03:09.898 [info] DBConnection.register_as_pool
22:03:10.001 [info] DBConnection.ConnectionPool.init target 50 interval 1000
22:03:10.001 [info] DBConnection.ConnectionPool.start_poll
22:03:10.001 [info] DBConnection.ConnectionPool.start_idle
22:03:10.109 [info] >> Postgrex.connect
22:03:10.110 [info] >> Postgrex.connect
22:03:10.594 [info] DBConnection.prepare_execute: %Postgrex.Query{ref: nil, name: "", statement: ["CREATE TABLE ", "IF NOT EXISTS ", [34, "schema_migrations", 34], 32, 40, [[[], [[34, "version", 34], 32, "bigint", [[], []]], ", "], [34, "inserted_at", 34], 32, ["timestamp", "(0)"], [[], []]], [", ", "PRIMARY KEY (", [[], 34, "version", 34], ")"], 41, []], param_oids: nil, param_formats: nil, param_types: nil, columns: nil, result_oids: nil, result_formats: nil, result_types: nil, types: nil, cache: :reference}
22:03:10.594 [info] DBConnection.parse
22:03:10.702 [info] DBConnection.run
22:03:10.702 [info] DBConnection.checkout
22:03:10.702 [info] DBConnection.ConnectionPool.checkout
22:03:10.805 [info] DBConnection.ConnectionPool.handle_info checkout busy
22:03:11.098 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:11.098 [info] DBConnection.ConnectionPool.start_poll
22:03:11.098 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:11.098 [info] DBConnection.ConnectionPool.drop_idle
22:03:11.098 [info] DBConnection.ConnectionPool.start_idle
22:03:12.093 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:12.093 [info] DBConnection.ConnectionPool.start_poll
22:03:12.093 [info] DBConnection.ConnectionPool.timeout
22:03:12.195 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:12.195 [info] DBConnection.ConnectionPool.drop_idle
22:03:12.195 [info] DBConnection.ConnectionPool.start_idle
22:03:13.003 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:13.003 [info] DBConnection.ConnectionPool.start_poll
22:03:13.003 [info] DBConnection.ConnectionPool.timeout
22:03:13.092 [info] DBConnection.ConnectionPool.drop_slow {-576460738507, 100}
22:03:13.092 [info] DBConnection.ConnectionPool.drop
22:03:13.101 [info] DBConnection.checkout: error %DBConnection.ConnectionError{message: "connection not available and request was dropped from queue after 2196ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n 1. Ensuring your database is available and that you can connect to it\n 2. Tracking down slow queries and making sure they are running fast enough\n 3. Increasing the pool_size (although this increases resource consumption)\n 4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n", severity: :error, reason: :queue_timeout}
22:03:13.199 [info] DBConnection.ConnectionPool.handle_info :timeout
22:03:13.199 [info] DBConnection.ConnectionPool.drop_idle
22:03:13.199 [info] DBConnection.ConnectionPool.start_idle
In the happy scenario, the connection seems to be made on time:
~ $ ./bin/myapp eval MyApp.Release.migrate
21:38:12.709 [info] DBConnection.ConnectionPool.start_link
21:38:12.793 [info] DBConnection.register_as_pool
21:38:12.891 [info] DBConnection.ConnectionPool.init target 50 interval 1000
21:38:12.891 [info] DBConnection.ConnectionPool.start_poll
21:38:12.891 [info] DBConnection.ConnectionPool.start_idle
21:38:12.997 [info] >> Postgrex.connect
21:38:12.999 [info] >> Postgrex.connect
21:38:13.699 [info] DBConnection.prepare_execute: %Postgrex.Query{ref: nil, name: "", statement: ["CREATE TABLE ", "IF NOT EXISTS ", [34, "schema_migrations", 34], 32, 40, [[[], [[34, "version", 34], 32, "bigint", [[], []]], ", "], [34, "inserted_at", 34], 32, ["timestamp", "(0)"], [[], []]], [", ", "PRIMARY KEY (", [[], 34, "version", 34], ")"], 41, []], param_oids: nil, param_formats: nil, param_types: nil, columns: nil, result_oids: nil, result_formats: nil, result_types: nil, types: nil, cache: :reference}
21:38:13.699 [info] DBConnection.parse
21:38:14.001 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:14.001 [info] DBConnection.ConnectionPool.start_poll
21:38:14.001 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:14.001 [info] DBConnection.ConnectionPool.drop_idle
21:38:14.001 [info] DBConnection.ConnectionPool.start_idle
21:38:14.095 [info] DBConnection.run
21:38:14.095 [info] DBConnection.checkout
21:38:14.095 [info] DBConnection.ConnectionPool.checkout
21:38:14.202 [info] DBConnection.ConnectionPool.handle_info checkout busy
21:38:14.893 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:14.893 [info] DBConnection.ConnectionPool.start_poll
21:38:15.013 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:15.013 [info] DBConnection.ConnectionPool.drop_idle
21:38:15.013 [info] DBConnection.ConnectionPool.start_idle
21:38:15.890 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:15.890 [info] DBConnection.ConnectionPool.start_poll
21:38:15.890 [info] DBConnection.ConnectionPool.timeout
21:38:16.087 [info] DBConnection.ConnectionPool.handle_info :timeout
21:38:16.087 [info] DBConnection.ConnectionPool.drop_idle
21:38:16.087 [info] DBConnection.ConnectionPool.start_idle
21:38:16.095 [info] << Postgrex.connect
21:38:16.095 [info] << Postgrex.connect
21:38:16.197 [info] Postgrex connect: {:ok, %Postgrex.Protocol{sock: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.6>, :tls_connection, :undefined}, [#PID<0.179.0>, #PID<0.177.0>]}}, connection_id: 2191998, connection_key: -613615798, peer: {{209, 38, 218, 180}, 25060}, types: {Postgrex.DefaultTypes, #Reference<0.822817904.78249985.212280>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.822817904.78118913.212392>, queries: #Reference<0.822817904.78249985.212386>, postgres: :idle, transactions: :naive, buffer: :active_once, disconnect_on_error_codes: [], scram: nil}}
21:38:16.197 [info] Postgrex connect: {:ok, %Postgrex.Protocol{sock: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.5>, :tls_connection, :undefined}, [#PID<0.178.0>, #PID<0.176.0>]}}, connection_id: 2191999, connection_key: -574140242, peer: {{209, 38, 218, 180}, 25060}, types: {Postgrex.DefaultTypes, #Reference<0.822817904.78249985.212280>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.822817904.78118913.212393>, queries: #Reference<0.822817904.78249985.212382>, postgres: :idle, transactions: :naive, buffer: :active_once, disconnect_on_error_codes: [], scram: nil}}
21:38:16.197 [info] DBConnection.ConnectionPool.handle_info :ets-transfer
21:38:16.197 [info] DBConnection.ConnectionPool.handle_checkin
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue_fast
21:38:16.197 [info] DBConnection.ConnectionPool.go
21:38:16.197 [info] DBConnection.ConnectionPool.handle_info :ets-transfer
21:38:16.197 [info] DBConnection.ConnectionPool.handle_checkin
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue
21:38:16.197 [info] DBConnection.ConnectionPool.dequeue_fast
21:38:16.197 [info] DBConnection.checkout: OK
...
21:38:16.786 [info] Migrations already up
Note that it still takes over 3 seconds to establish a connection (that’s a lot, ain’t it?).
My thoughts on this:
- It looks like the DB connection is established asynchronously. In case it takes a lot of time, the queries will be dropped instead of blocking until a connection is established.
postgrex
has a connect timeout - it’s 5 seconds by default, so it doesn’t come into play here.
- Not sure why it takes so much time to connect to the DB. Maybe this is related to SSL and the handshake?
- Not sure why the app boots properly. Maybe it’s because the migrations only have a pool of 2 connections, whereas the app is configured to use 6. Do the migrations just run out of connections because they are still connecting?
- Bumping
:queue_target
and :queue_interval
to some high values makes the problem go away reliably.
Since I’ve spent significant time on this, I hope to turn this into some sort of an improvement (updating docs, updating the error message, filing an issue in one of the packages, etc.), but I first need to understand what the right solution is. Bumping the queueing params? Addressing slow connection times? Something else?
Looking forward to your thoughts on this! @dimitarvp @benwilson512 @ericmj @josevalim and basically anyone who has time and insight on this.