What causes long "idle" time in Ecto?

tl;dr: idle time is how long a connection sat ready in the connection pool, waiting to be checked out by a process. Therefore we can be happy to see it above 0.

idle time is recorded to show how busy (or not busy) the connection pool is. When the connection pool is overloaded, idle time will be as close to 0 as it can be, not counting message-passing overhead, because the connections are always in use by processes running transactions/queries. If a connection is not immediately available there is queue time: the time between when the request for a connection was sent and when a connection became available. Queue time adds latency for the process running the transaction/query, because it must wait for a connection before it can perform its work; the connection pool does not have enough capacity. db time is the time spent holding onto the connection, i.e. time spent actually using the database. Therefore the latency for the calling process is queue time plus db time.
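These three measurements can be observed directly via Ecto's telemetry events. Below is a minimal sketch, assuming an application named `:my_app` with a repo `MyApp.Repo` (both names are assumptions), whose default query event is `[:my_app, :repo, :query]`:

```elixir
# Sketch: log idle/queue/db time for every query. Measurements arrive in
# native time units, so convert before printing. A measurement may be nil
# when it does not apply (e.g. no queue time on a fresh checkout).
:telemetry.attach(
  "log-repo-timings",
  [:my_app, :repo, :query],
  fn _event, measurements, _metadata, _config ->
    to_ms = fn
      nil -> nil
      native -> System.convert_time_unit(native, :native, :millisecond)
    end

    # queue_time: waiting for a connection to become available
    # query_time: time spent holding the connection (db time)
    # idle_time:  how long the connection sat ready in the pool beforehand
    IO.inspect(%{
      idle_ms: to_ms.(measurements[:idle_time]),
      queue_ms: to_ms.(measurements[:queue_time]),
      db_ms: to_ms.(measurements[:query_time])
    })
  end,
  nil
)
```

Watching these numbers together makes the rest of this post concrete: latency for the caller tracks `queue_ms + db_ms`, while `idle_ms` shows spare pool capacity.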

When the connection pool has extra capacity, queue time should always be as close to 0 as it can be, because a connection is available the moment it is requested. Only once the transaction rate goes beyond what the connection pool can handle does queue time increase. However, by then it is a little late, because latency is already impacted once queue time rises above ~0. idle time lets us see how close we are to accruing queue time: the higher the idle time, the more spare capacity the connection pool has.

Perhaps a clearer way to think about it: if you see an idle time of 100ms, that means we could have run 100ms worth of db time on that connection since the last time it was used. So to be as efficient as possible with resources we would want queue and idle to both be near 0. However, because of the latency impact of queue time, we should be happy to sacrifice some idle time to keep queue time at 0 when possible.
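As a toy illustration of that trade-off (all numbers invented):

```elixir
# Invented numbers for a single checkout on one connection.
queue_ms = 0    # a connection was available immediately (good for latency)
db_ms    = 20   # time spent holding the connection, i.e. using the database
idle_ms  = 100  # how long that connection sat ready beforehand

# The caller's latency is queue time plus db time; the idle time is
# unused headroom we accept in exchange for keeping queue time at 0.
latency_ms = queue_ms + db_ms
IO.puts("latency: #{latency_ms}ms, headroom on this connection: #{idle_ms}ms")
```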

Of course we can push queue time toward 0 by increasing the pool size. However, if the pool is too large, db time will suffer as the database itself becomes the bottleneck instead of the connection pool.
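Pool size is set in the repo configuration. A sketch, with assumed app/module names and an illustrative value only:

```elixir
# config/runtime.exs (or config/config.exs) — illustrative value, not a
# recommendation. Tune against observed queue time and db time.
config :my_app, MyApp.Repo,
  # More connections reduce queue time, but an oversized pool can
  # overload the database and inflate db time instead.
  pool_size: 10
```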

If the application is not sending any queries for a prolonged interval and then sends a query, why is the idle time capped? This occurs because idle connections periodically send a ping message to the database to ensure connectivity, so that the connections are ready and live when needed and we don't need to attempt to connect when a query comes in. If we needed to reconnect, the database handshake time would add latency to the transaction/query being processed. The ping message resets the idle timer because an individual connection is unavailable from the pool during a ping: if it were left in the pool, the connection would be blocked waiting for a pong response and would incur a wait just like a handshake. Pings are staggered to prevent all connections becoming busy at the same time.
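The ping cadence comes from DBConnection's `:idle_interval` option (in milliseconds), which can be passed through the repo configuration. A sketch, again with assumed app/module names:

```elixir
# Illustrative only. :idle_interval controls how often idle connections
# are pinged (pings are spread across the interval); 1_000 ms is the
# DBConnection default, shown here explicitly.
config :my_app, MyApp.Repo,
  idle_interval: 1_000
```

This is why the idle time you observe is capped at roughly the idle interval, even after a long quiet period.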
