Phoenix Ecto realistic benchmarks

I’m having some trouble figuring out whether I have set myself too strict a standard for my production server. Currently I can handle 75% of requests in under 6ms, but my 90th percentile starts to show times closer to 75-100ms. The reason for these longer requests is something in the Ecto library taking a while to return: in Datadog I can see that the query_time field of the EctoLogEntry accounts for 70 of the 75ms. I wasn’t able to find many benchmarks online from other people’s experience that include Ecto queries as part of the benchmark.
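In case it helps, this is roughly how we get those timings into Datadog. A minimal sketch of an Ecto 2.x-style custom logger (MyApp.RepoLogger is a made-up name, and whether the entry times are in native units depends on the Ecto version):

config :my_app, MyApp.Repo,
  loggers: [{Ecto.LogEntry, :log, []}, {MyApp.RepoLogger, :log, []}]

defmodule MyApp.RepoLogger do
  require Logger

  # Ecto calls this with each Ecto.LogEntry struct; it must return the entry.
  def log(entry) do
    queue_ms = to_ms(entry.queue_time)
    query_ms = to_ms(entry.query_time)
    Logger.info("ecto queue_time=#{queue_ms}ms query_time=#{query_ms}ms")
    entry
  end

  defp to_ms(nil), do: 0
  defp to_ms(native), do: System.convert_time_unit(native, :native, :millisecond)
end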

I would like to get all of my requests under 20ms, but I just don’t know if this is realistic or not. Could anybody shed any light on what they would expect me to be able to achieve from a performance perspective, as far as latency per request goes?

Off the cuff I’d assume something is waiting for a database connection?

Sounds interesting, what is your pool size set to?

We have our pool size set to 10 and we are running two pods inside a Kubernetes cluster, so we see a total of 20 active connections to the Postgres DB inside Aurora on our AWS dashboard. From the way the Ecto log entry is defined, I would have thought the queue_time field of the EctoLogEntry is the time it takes to check out a connection from the pool. The part that is taking longer is the query_time field of the log entry. What is weird is that on the DB itself we can see that every single query runs in under 1ms, because we are doing just simple selects on small tables, so there is no reason for this latency. Do you think it could still be a connection problem if the queue_time field is under 1ms? Is there something about queue_time and connections to the DB that I’m not understanding properly?
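For context, the pool is just the standard Repo config; a minimal sketch, assuming the usual otp_app setup and a hypothetical MyApp.Repo:

config :my_app, MyApp.Repo,
  pool_size: 10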

Also, the latency only ever occurs randomly on the first Ecto call of any given network request. So if route A hits Ecto 3 times, the only call with the random slow latency is that first Ecto call, no matter what the query is. I’ve tried different routes with different first queries, and they are all randomly slow around the 90th percentile.

Since the slowdown is on query_time, it is definitely not a connection issue, as that would show up in queue_time.

It is important to keep in mind, though, that query_time includes all the time we hold the database connection, so if for some reason the current process is scheduled out so another process can run, then query_time will show those spikes.

So in theory, I wouldn’t worry about query_time if the response times were not seeing such variance. For example, imagine this code:

do_something()
do_a_query()
do_something_else()

Sometimes the scheduler will swap the process out during do_something, sometimes inside the query, and sometimes after the query. That’s why, from time to time, you see such variance.

However, in practice, your variance is quite high, which may mean the system is overloaded. The Erlang VM is capable of handling overload well, but this may be one of the symptoms. I would start measuring the run_queue and other values to see whether that’s the case or not.
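For example, a crude way to sample it from a remote iex session (just a sketch; :erlang.statistics(:run_queue) returns the number of processes waiting to be scheduled, summed across all schedulers):

for _ <- 1..10 do
  # a run queue that stays above zero means processes are waiting for schedulers
  IO.inspect(:erlang.statistics(:run_queue), label: "run_queue")
  Process.sleep(1_000)
end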

EDIT: Also please let us know your Ecto version and, if you have recently updated to Ecto 3, let us know if you saw this issue before or if it is something recent.

We were on Ecto 2.2 before and we just upgraded to 3.0 in hopes that it would fix our issues, but nothing changed. Also, I pulled the metrics for our CPU over time and we never go above 5% utilization. However, just because we don’t go above 5% doesn’t mean we don’t already have processes running on all the cores in parallel and things aren’t getting context switched out, right? Or am I misunderstanding CPU load? I haven’t had a chance to dig into the run_queue of the system yet, but even if we find something in there, what would I be able to do to prevent some sort of unwanted context switch? Is there somewhere in our project where I should try to manually raise the priority level of the process, in order to avoid being preempted while we go to the DB to make a query?
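For reference, what I had in mind was something like the following, just to show the mechanism; I realize raising priorities is generally discouraged:

# at the top of the code handling the request (e.g. the controller action)
Process.flag(:priority, :high)  # default is :normal; :high is rarely advisable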

Would it be possible to produce a reproducible example in a repo?

I can’t reproduce the random latency locally, but that might have to do with the fact that I have the server and DB both on the same machine. Does anyone have any benchmarks, or know of somewhere someone has done a good job documenting their latencies and response times in a production-like environment?

It could even be your database itself. For instance, if a certain query produces 0 results it can behave strangely even when indexed on a reasonable column; I’ve seen queries that usually take milliseconds time out in that situation. Adding composite indexes on the relevant fields helped and brought them back into the millisecond range. I’m not saying this is your case, but in mine I was sure it was PostgreSQL: I re-ran those queries in psql connected to the actual database and they took a long time, and explaining the query showed the planner thought it would be reasonably fast, but in practice that wasn’t what was happening.
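For reference, adding such an index from an Ecto migration looks roughly like this (the table and column names here are made up):

defmodule MyApp.Repo.Migrations.AddCompositeIndex do
  use Ecto.Migration

  def change do
    # composite index covering the columns the slow query filters on
    create index(:orders, [:status, :inserted_at])
  end
end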


I have already dug into the logs on our Postgres instance and every single query has run in under 1ms. We are really just doing selects on small tables for now, because this is a new service. The extra random time occurs somewhere between the database and the Ecto process that needs to return the data.

My only thought at this point is that when the Ecto call is made, the calling process blocks while it waits for the result, gets scheduled out because the scheduler is trying to be efficient, and then takes a while to resume and read the data from the Ecto call. This is all speculation, though, and even if it is the case it is not going to be trivial to fix. The main reason I’m trying to see if anyone else has benchmarks is so I can determine whether I should keep looking into this, or whether it is something other people have experienced.
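One way I’m planning to test that theory is to compare wall-clock time around a call against the query_time Ecto reports for it, since any gap would be time spent outside the database round trip. A rough sketch, assuming a hypothetical MyApp.Repo and MyApp.User schema:

{elapsed_us, _result} = :timer.tc(fn -> MyApp.Repo.all(MyApp.User) end)
# if elapsed_us / 1000 is much larger than the query_time Ecto logs for the
# same call, the extra time is spent outside the database round trip
IO.puts("wall time: #{elapsed_us / 1000}ms")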