What causes long "idle" time in Ecto?

pejrich · May 29, 2020, 10:03am

I’ve noticed recently that the idle time for my queries is creeping up over 1k ms.

[debug] QUERY OK source="chats" db=0.8ms idle=1272.7ms
[debug] QUERY OK source="messages" db=0.8ms queue=0.7ms idle=1616.1ms
[debug] QUERY OK source="users" db=0.8ms idle=1185.4ms

I restarted my postgres, which brought it down, but within 3 get queries it was back up to over 1k ms.

[debug] QUERY OK source="users" db=0.6ms queue=0.1ms idle=204.0ms
[debug] QUERY OK source="chats" db=0.4ms idle=686.6ms
[debug] QUERY OK source="users" db=0.4ms idle=1605.8ms

It is slow for all schemas, so that makes me think it’s either an issue with my postgres, or maybe my app settings. What could be the cause of this?

idi527 · May 29, 2020, 10:31am

According to https://github.com/elixir-ecto/db_connection/blob/436101ab9678f7173d8284086536ee27ad8c05c5/lib/db_connection/log_entry.ex#L30

`:idle_time` - The amount of time the connection was idle before use

so the cause is probably rare requests.

pejrich · May 29, 2020, 10:47am

I’m a bit unclear about the metric, because whether I wait a while between requests (let’s say 10 seconds, while running them from the console), or fire a bunch off pretty quickly, the idle time still seems to hover between 800-2k ms. Maybe it’s not an issue, but I was just curious what could cause it to be so high (if it even is high, but 1-2s in DB time seems like a long time).

benwilson512 · May 29, 2020, 11:20am

I’m not entirely sure why the numbers don’t go higher, but it isn’t measuring time in the database. You should be able to confirm that because the requests themselves don’t take 2 seconds to return.

fishcakez · May 30, 2020, 9:20am

tl;dr the idle time is how long a connection was ready and enqueued in the connection pool waiting to be checked out by a process. Therefore we can be happy to see it above 0.

idle time is recorded to show how busy or not busy the connection pool is. During connection pool IO overload the idle time will be as close to 0 as it can be, not counting message passing overhead, because then the connections are always in use by processes running transactions/queries. If a connection is not immediately available there is queue time that is the time between when the request for a connection was sent and when a connection becomes available. This time has latency impact for processes running transaction/queries as they need to wait for a connection before they can perform a query because the connection pool does not have enough capacity. db time is the time spent holding onto to the connection, i.e. time spent actually using the database. Therefore the latency for the calling process is queue time plus db time.

When the connection pool has extra capacity the queue time should always be as close 0 as it can be because a connection is available when requested. Only once the transaction rate is beyond what the connection pool can handle does the queue time increase. However that is a little late because latency is already impacted once it goes above ~0. idle allows us to see how close to having queue time, the higher the idle time the more capacity we have in the connection pool.

Perhaps a clearer way to think about it is if you see an idle timeout of 100ms, that means we could have run 100ms worth of db time on that connection since the last time that connection was used. So to be as efficient as possible on resource we would want queue and idle to both be near 0. However because of the latency impact of queue, we should be happy to sacrifice some idle time to keep queue at 0 when possible.

Of course we can make queue time move to 0 by increasing the pool size. However if the pool is too large db time will suffer as the database becomes the bottleneck instead of the connection pool.

If the application is not sending any queries for a prologued interval and then sends a query why is the idle time capped? This occurs because idle connections periodically send a ping message to the database to try to ensure connectivity so that the connections are ready and live when needed, and we don’t need to attempt to connect when a query comes in. If we needed to reconnect then the database handshake time would add latency to the transaction/query being processed. The ping message resets the idle timer because an individual connection is unavailable from the pool during a ping. If it was left in the pool the connection would be blocked waiting for a pong response and incur a wait on the connection like just like a handshake. Pings are randomized to prevent all connection becoming busy at the same time.