Ecto latency problem on queries

I’m having some issues with Ecto when running our app in production: around 1/5 of the queries Ecto processes take roughly 10x as long as the quickest ones (fast: 5-10 ms, slow: 75-100 ms), even when it’s the exact same query running. The main place I see the spike in Ecto query times is on the first database call the server makes while working through all the code required to respond to a request. The first thing I usually do is check the client’s API key with a quick lookup; the key is a pretty small string and the column is indexed. Whether that API key lookup is fast or slow, all of the remaining DB queries are consistently fast, all under 10 ms.

Context on the environment: I’m running this on an AWS instance with an Aurora Postgres database. The server runs inside a Kubernetes cluster with two pods, each running one container with our service. In the AWS database dashboard I can see 20 active connections at almost all times unless we take something down, which makes sense because our pool size is set to 10 and we have two pods. Traffic is relatively low, around 100-200 requests per day. When I look at the Postgres logs for how long the database itself spends processing queries, every single query is under 1 ms, because I’m not doing anything too complex, basically just selecting on a column of a relatively small table of under 100 rows.
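For reference, the pool size comes from our Repo config. A sketch of what it looks like (app and module names here are placeholders, not our real ones); two pods x pool_size 10 accounts for the 20 connections in the AWS dashboard:

```elixir
# config/prod.exs — illustrative; app/module names are placeholders.
config :my_app, MyApp.Repo,
  pool_size: 10,
  # DBConnection queueing knobs, shown at their defaults for context:
  queue_target: 50,
  queue_interval: 1000
```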

My only thought is that something around the connection pool is not working properly, which is causing the random lag on some of the requests. I also have the Ecto.LogEntry record broken down into its sections and have found that the queue_time is almost instant (under 1 ms) and the decode_time is also almost instant (about 1 ms), but the query_time is 75 ms+. I am absolutely shocked that it is the query_time that takes so long, because on the actual DB the query is handled basically instantly (1 ms). My best guess is that Ecto adds some sort of random overhead to the first query run for any given network request. The reason I think this is that I only see the added latency on the API key select statement, which is the first thing I run inside a plug on my server before running the rest of the queries I need.
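To capture those timings per query in production, one approach (a sketch, assuming Ecto 2.x’s `loggers` Repo option, with names and the 50 ms threshold as placeholders) is a custom logger that flags any query whose query_time crosses a threshold:

```elixir
defmodule SlowQueryLogger do
  require Logger

  # Ecto reports times in native units; convert to milliseconds.
  def to_ms(nil), do: nil
  def to_ms(native), do: System.convert_time_unit(native, :native, :millisecond)

  # Hooked in via the Repo config (Ecto 2.x; names are illustrative):
  #   config :my_app, MyApp.Repo,
  #     loggers: [Ecto.LogEntry, {SlowQueryLogger, :log, []}]
  def log(entry) do
    query_ms = to_ms(entry.query_time)

    if query_ms && query_ms > 50 do
      Logger.warn(
        "slow query: query #{query_ms} ms, " <>
          "queue #{inspect(to_ms(entry.queue_time))} ms, " <>
          "decode #{inspect(to_ms(entry.decode_time))} ms"
      )
    end

    entry
  end
end
```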

Does anybody have any ideas on what could be causing this or have any ideas on a little experiment that I could write that could possibly help me debug this?

And you’re sure it’s not just the database cache itself warming up?


Do double check that your protocols are consolidated; there were a couple of minor versions of Distillery that weren’t including them properly. You should see:

iex(2)> Protocol.consolidated?(Enumerable)
true

I read that ‘From Elixir v1.2, protocol consolidation happens automatically for all projects’. Is this still something I need to do explicitly, or should it be happening automatically? Also, if you do believe this is the issue, do you mind explaining why you think it will help with the random yet repeatable latency?

It is not something you need to do manually. However, there are certain versions of Distillery with bugs that meant protocols weren’t consolidated properly when using releases. Other buggy build situations can also prevent protocol consolidation from happening successfully. The most recent version of Distillery does not have this bug, but it’s still worth checking in your specific circumstance.


Could you explain a little more how I would go about consolidating protocols? I see the snippet above, but I’m not sure which file to put it in or when I need to run it. Is this a runtime thing or a compile-time thing?

The snippet is something you should run in a remote console session. Alternatively, you can run it in your application’s start function and print the result to your logs.
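Something like this in start/2 would surface it in the logs on every boot (a sketch; the module names are placeholders for whatever your app actually uses):

```elixir
# In your application callback module (names are illustrative):
defmodule MyApp.Application do
  use Application
  require Logger

  def start(_type, _args) do
    # Log whether a representative protocol was consolidated at build time.
    Logger.info("Enumerable consolidated?: #{Protocol.consolidated?(Enumerable)}")

    children = []
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```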

EDIT: are you using releases by the way or mix phx.server?

We use mix phx.server as far as I know. What would using releases buy me?

Also I tried: Protocol.consolidated?(Enumerable)
Returned: true

Ah so that’s at least one issue here.

Problem 1 you’ve already sort of identified: it isn’t super easy to open a remote console session with mix phx.server, where you might run the snippet I posted. Releases make that easy.

More importantly though: mix lazily loads code. This means that the first time a module is called, the VM has to go fetch it from disk and load it. If your application is under load, that’s going to really bog things down for a bit while all the different modules get loaded. Releases, however, come bundled with a list of all relevant modules and load them all on boot.
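You can check which mode the VM is in from a console; :interactive means modules are loaded on demand, while :embedded (what releases typically use) means everything was preloaded at boot:

```elixir
# Under mix phx.server this typically returns :interactive;
# under a release built in embedded mode it returns :embedded.
:code.get_mode()
```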


It doesn’t just occur once, on the first request made after starting the server up. It happens about 10-15% of the time, even though all the payloads are exactly the same. Whether I ping it constantly or traffic just trickles through slowly, I see pretty similar results. So I wouldn’t think it has anything to do with the code being lazily loaded, or else all of the subsequent requests would be faster after the module has been pulled in.
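One small experiment that could separate pool/driver overhead from the query itself (a sketch; MyApp.Repo, MyApp.ApiKey, and the key value in the comment are placeholders): run the same lookup in a tight loop from a console and compare the wall-clock spread against what Ecto reports:

```elixir
defmodule QueryTimingExperiment do
  # Time one call to an arbitrary zero-arity function, in microseconds.
  def time_once(fun) do
    {micros, _result} = :timer.tc(fun)
    micros
  end

  # Run the same function n times and report min/max wall-clock time in ms.
  # Example (from a console; module and key names are placeholders):
  #   QueryTimingExperiment.run(fn ->
  #     MyApp.Repo.get_by(MyApp.ApiKey, key: "some-key")
  #   end)
  def run(fun, n \\ 100) do
    timings = for _ <- 1..n, do: time_once(fun)
    %{min_ms: Enum.min(timings) / 1000, max_ms: Enum.max(timings) / 1000}
  end
end
```

If the slow outliers show up even in this loop, with no HTTP layer involved, that points at the pool or the connection rather than anything plug- or request-specific.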