During Supervisor startup, I’m seeing intermittent “connection not available...” errors from a child in the same child list that starts the Repo. The later child caches some data from the Repo. There should be no other possible requests, so I’m guessing the Repo/DBConnection sometimes takes longer to initialize and its init() returns before the connections are ready. I thought that when a child returns from start_link it was ready to go, but is that not the case with Ecto.SQL?
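For reference, roughly what my supervision tree looks like (MyApp.Repo and MyApp.Cache are placeholder names):

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      # Starts after the Repo and loads data from it during its own init;
      # this is where the intermittent "connection not available" shows up.
      MyApp.Cache
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```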
Of course, since this error happens during startup, it crashes the whole VM when it hits. If I try again once or twice, it then works and doesn’t occur again.
This is the typical way to solve the problem in an active pool, but mine is not active yet:
The way database connections are wrapped in the Repo makes it hard to figure out what’s going on, so I’m hoping someone can recommend a path of investigation. Maybe there’s a bug down in that layer, but given that it’s a timing issue, it seems unlikely I can find it without a better understanding.
Ecto.SQL does not block to set up the connection pool, because Ecto.SQL should start whether a connection is available or not. If Ecto.SQL startup failed whenever the database was down, it could cause reliability issues for many applications. Therefore the application depending on Ecto.SQL needs to decide how it handles these errors. Note that if Ecto.SQL cannot connect to the database you will see these same errors, because a connection only becomes available once it has completed the database handshake.
If the database is up and there are still no connections available, then the handshake may be taking a long time, which likely points to a bug somewhere. We could debug this by timing how long it takes for the first query to succeed; a rough probe for that is sketched below.
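For example, something along these lines (MyApp.Repo, the “SELECT 1” ping, and the 100ms poll interval are assumptions, not anything Ecto provides):

```elixir
defmodule MyApp.ConnProbe do
  require Logger

  # Retries a trivial query until it succeeds and logs how long that took.
  def time_first_query(repo \\ MyApp.Repo) do
    loop(repo, System.monotonic_time(:millisecond))
  end

  defp loop(repo, started) do
    if ping?(repo) do
      elapsed = System.monotonic_time(:millisecond) - started
      Logger.info("first query succeeded #{elapsed}ms after the probe started")
      elapsed
    else
      Process.sleep(100)
      loop(repo, started)
    end
  end

  # Repo.query/2 returns {:error, _} for most failures, but connection
  # problems can also be raised, so rescue DBConnection.ConnectionError too.
  defp ping?(repo) do
    match?({:ok, _}, repo.query("SELECT 1", []))
  rescue
    DBConnection.ConnectionError -> false
  end
end
```

You could run this from iex -S mix right after boot, or temporarily start it as a task right after the Repo in the supervision tree.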
It seems worthwhile to pin down what the issue is and where it lives. I looked at some of the init code but wasn’t able to make much progress. Can you suggest what I should try first?
If you still have this issue, you can try setting/reducing :handshake_timeout (if you’re using Postgrex or MyXQL) and see whether that produces timeout errors in the logs; a config sketch follows below. Postgrex can be slow to handshake on its initial connections because it needs to load type information, which can take a long time in pathological cases.
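A hedged sketch of what that might look like in the repo config (the app/repo names and the 5 second value are assumptions; the option is forwarded to DBConnection):

```elixir
import Config

# Tighten the handshake timeout so a slow handshake fails fast and is logged,
# instead of quietly delaying when connections become available.
config :my_app, MyApp.Repo,
  hostname: "localhost",
  database: "my_app_dev",
  pool_size: 10,
  handshake_timeout: 5_000
```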
Another option is to put an “await for ecto pool live” process in between Ecto and our application’s code in the supervision tree. This can make sense if our application cannot work without Ecto. This process would block in init, repeatedly trying a simple “ping” query until we get a successful result or we hit a timeout. For example, it could be a GenServer that runs this inside init/1, returns :ignore on success, and crashes on timeout (sketched below). If the application doesn’t have this limitation (hopefully it doesn’t), you could choose not to fail after the timeout and instead log the error and return :ignore. You may even want to try this just to debug/discover how long it takes for connections to become available.
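A minimal sketch of that idea, assuming the repo is MyApp.Repo, “SELECT 1” as the ping, and a 30 second deadline (all assumptions):

```elixir
defmodule MyApp.AwaitRepo do
  use GenServer
  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    deadline_ms = Keyword.get(opts, :timeout, 30_000)
    repo = Keyword.get(opts, :repo, MyApp.Repo)
    await(repo, System.monotonic_time(:millisecond), deadline_ms)
  end

  defp await(repo, started, deadline_ms) do
    cond do
      ping?(repo) ->
        elapsed = System.monotonic_time(:millisecond) - started
        Logger.info("repo ready after #{elapsed}ms")
        # :ignore tells the Supervisor we are done; no process is kept around.
        :ignore

      System.monotonic_time(:millisecond) - started > deadline_ms ->
        # Crash (and take startup down) if the repo never becomes available.
        # If the app can limp along without the database, log the error and
        # return :ignore here instead.
        raise "database not available after #{deadline_ms}ms"

      true ->
        Process.sleep(250)
        await(repo, started, deadline_ms)
    end
  end

  defp ping?(repo) do
    match?({:ok, _}, repo.query("SELECT 1", []))
  rescue
    DBConnection.ConnectionError -> false
  end
end
```

In the child list it would go right after MyApp.Repo and before anything that queries the database during its own init (such as the caching child); since init/1 returns :ignore on success, no extra process is left running afterwards.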