Looking for help with poor Ecto query performance

My team at work is seeing some rather poor performance, especially on update queries, from an application we’ve built using Ecto and Postgrex. We’ve been able to reproduce this in a very small, sanitized demo application which I’ve posted here:

We’ve run the same operations in Benchee + Ecto and compared them to equivalent queries performed using pgbench, and the results are disturbingly far apart. For update queries in particular, I am seeing a difference of nearly 75x in request throughput when testing against Postgres on my 2015-era laptop. The discrepancy is on the order of 15x for create transactions.

We are seeing a 10-20x discrepancy between pgbench and Ecto when we run against actual provisioned database instances on AWS using similar methodology.

What I’m hoping for from the community:

  • If there’s something wrong with our test methodology, I’d love to hear it.

  • If there’s something we can tune to make our Ecto configuration more in line with pgbench reported results, I’d love to hear it.

  • Or is there actually a performance problem with Ecto? (I’d be less happy to hear that, but at least I’d know.)

Pull requests or comments on that repo welcome.


I think the important thing is that the Ecto example is doing far more than just inserting data:

  • Encoding data
  • Decoding data (since you are using numeric IDs, it needs to read the entry back after it was inserted to fetch the inserted ID)
  • Logging and Telemetry
  • Generating a random integer (you cannot be sure that the PRNG engine has the same performance, so this can greatly impact the measured IOPS)
  • Run-time protocol selection
  • Success checking (I am not sure whether pgbench doesn’t simply ignore failed inserts)
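Some of that overhead can be removed to see how much it matters. As a sketch (the repo module and the `items` table name here are placeholders, not from the demo repo): setting `log: false` in the repo config disables per-query logging, and `Repo.insert_all/3` bypasses changesets and does not read the inserted row back by default:

```elixir
# Insert without changesets and without RETURNING (insert_all/3 does
# not read the row back unless the :returning option is requested):
MyApp.Repo.insert_all("items", [%{name: "bench", value: 42}])
```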

Other than that, I see that you are running the benchmark and the DB on the same machine; what guarantees do you have that the two do not interfere with each other? The system scheduler may be an important factor there as well.

So in the end it isn’t a fair apples-to-apples comparison.

Hmmm. I’m having a hard time accepting this answer. I could explain away a 10-20% performance impact based on this, but not an order of magnitude or more.

As I understand it, pgbench is just another TCP-based client, like psql or Postgrex, so at least the encoding is present in both tests.

I did just now re-run the tests with the Mac Activity Monitor application watching CPU consumption during the create-records test.

In the pgbench test, it was pgbench at 75% CPU with the 10 postgresql processes running at ~50% each. (This is an 8-core machine.)

In the Ecto + Benchee test, it was beam.smp at ~500% with the 10 postgresql processes each running at 4-5% each.

FWIW I took the random number generation out of the create-items benchmark (replaced with a hard-coded constant). Both benchmarks improved somewhat, but there’s still a 10x discrepancy to be explained.

As a rhetorical question: Should a database abstraction layer impose an order-of-magnitude performance penalty? Had I known that it was on that order, let’s just say I’d have made some different technology recommendations.

One problem seems to be how benchee’s parallel option is used and how the result is reported. I think 1 ips = all parallel tasks being run. See https://github.com/bencheeorg/benchee/wiki/Parallel-Benchmarking for more information.

For example, if you set benchee’s parallel to 1 and pgbench’s clients and threads to 1, pgbench was only 5% faster. (If I changed the pgbench SQL to run in a transaction, they had the same throughput.)

This means the time spent on serialization, type checking, and random generation has little to do with the difference, since in the simple case the two are almost the same.

I don’t know how to properly measure this with benchee. A simplistic approach would be to multiply the ips by the parallel count, but I don’t know if that is the right approach either. In your case it changes the results so that pgbench is not so much faster (from 60% down to 10% faster, depending on settings).
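The simplistic correction described above could be sketched like this (the field access follows recent Benchee versions, and `do_insert/0` is a placeholder for the benchmarked function):

```elixir
parallel = 10

suite =
  Benchee.run(
    %{"insert" => fn -> do_insert() end},
    parallel: parallel,
    time: 10
  )

# Benchee's reported ips reflects a single process, so multiply by the
# parallel count for a rough total-throughput estimate:
for scenario <- suite.scenarios do
  total_ips = scenario.run_time_data.statistics.ips * parallel
  IO.puts("#{scenario.name}: ~#{Float.round(total_ips, 1)} ips total")
end
```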

Look into how pgbench does connection pooling. Perhaps it opens up a connection and then just holds that and runs the transactions sequentially on that? Ecto will have to check out a connection from the pool, execute the query, and return the connection. A pool is there to make sure the database doesn’t get overloaded in terms of the number of connections, so there is always going to be overhead. Especially if the number of concurrent requests is higher than the number of connections in the database.
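For reference, the pool behavior can be tuned on the repo. A sketch, assuming an app named `:my_app` and repo `MyApp.Repo` (placeholders); `queue_target` and `queue_interval` are DBConnection options controlling how long a checkout may queue:

```elixir
# config/config.exs
config :my_app, MyApp.Repo,
  pool_size: 10,        # match the pgbench client count
  queue_target: 50,     # ms a checkout may wait before the pool counts as overloaded
  queue_interval: 1000  # ms window over which queue_target is evaluated
```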


I wonder if you’re hitting ecto queue limits. You can check this by getting the telemetry from ecto. It’ll give you a breakdown of where you’re spending the most time (per query). We always have to crank the db connections for ecto quite a bit. A quick test would be to measure the run queue by calling :erlang.statistics(:run_queue). If you see it increasing through the course of your test you know you have a bottleneck somewhere.
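A sketch of both checks (the Telemetry event name assumes an OTP app named `:my_app`; adjust it to match your repo):

```elixir
# Per-query breakdown: queue_time is time spent waiting for a pooled
# connection, query_time is time spent in the database.
:telemetry.attach(
  "bench-query-times",
  [:my_app, :repo, :query],
  fn _event, measurements, _metadata, _config ->
    to_ms = &System.convert_time_unit(&1 || 0, :native, :millisecond)

    IO.inspect(
      %{queue_ms: to_ms.(measurements[:queue_time]),
        query_ms: to_ms.(measurements[:query_time])},
      label: "ecto query"
    )
  end,
  nil
)

# Sample the scheduler run queue during the benchmark; a steadily
# growing number indicates a bottleneck:
:erlang.statistics(:run_queue)
```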


Yeah I think you’re on to something here. Here’s another issue with how parallel works, from the benchee hexdocs:

This is terrible if you’re relying on benchee’s parallelism to maximize throughput, because instead of trying to always have 4 processes constantly inserting, a group of four will always be as slow as the slowest insert. This is to say, if you have 3 that take 0.5ms, and one that takes 1ms, there will be a full 0.5ms where only one process is inserting.

Also exactly correct. @scouten see the benchee wiki on parallel: https://github.com/bencheeorg/benchee/wiki/Parallel-Benchmarking#run-time-impact-when-running-in-parallel.

Notice how the ips degrade as parallelism increases. What benchee parallelism is about is generating load, and then measuring how a single function call performs in the context of that load. It is not measuring throughput.


I think the wording on the Benchee wiki could use some improvement, but I can show that the interpretation you’ve arrived at is incorrect.

Let’s benchmark Process.sleep/1: :wink:

delays = [1000, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

Benchee.run(
  %{
    "sleep" => fn ->
      delays
      |> Enum.random()
      |> Process.sleep()
    end
  },
  parallel: 10,
  time: 10
)

If Benchee were running tranches of iterations as you suggest, iterations per second would be very close to 1.0 because the one long delay I’ve included would prevent the other “parallel” iterations from continuing. That does not appear to be the case:

$ mix run bench/sleep.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.9.1
Erlang 22.0.7

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 10
inputs: none specified
Estimated total run time: 12 s

Benchmarking sleep...

Name            ips        average  deviation         median         99th %
sleep          8.31      120.36 ms   ±257.98%       11.00 ms     1001.22 ms

We’re running multiple transactions on a single connection in both cases. Is checking out a connection from the pool really that expensive? If so, why?

Postgres opens one postgres OS process per active connection and terminates that process when the connection is closed by the client.

As mentioned earlier in the thread, I’ve observed the Mac Activity Monitor while running both the Ecto + Benchee and pgbench tests. In both cases, I see exactly the requested number of connection processes open (corresponding to the pool size I’ve specified) and CPU load is evenly distributed among those processes.

That would seem to suggest a similar connection-pooling mechanism in both cases. Yes, it’s possible pgbench doesn’t have to do any pool check-out, but I’m having a really hard time explaining away orders of magnitude for that reason, and an even harder time explaining away the markedly worse behavior on updates that way.
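One way to take pool check-out out of the equation entirely would be to hold a single connection for the whole loop, e.g. with `Repo.checkout/2`. A sketch (`MyApp.Repo` and the `items` table are placeholders):

```elixir
# Every query inside the function runs on one checked-out connection,
# so there is no per-query pool round-trip:
MyApp.Repo.checkout(fn ->
  for i <- 1..1_000 do
    MyApp.Repo.query!("UPDATE items SET value = value + 1 WHERE id = $1", [i])
  end
end)
```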

The updates are easy to explain. With pgbench you’re just doing an update action. With ecto you’re doing a find and then an update.
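The difference shows up in how the two are written. A sketch (`Item` is a hypothetical schema and `id` a known primary key, not names from the demo repo):

```elixir
import Ecto.Query

# Typical Ecto flow: SELECT, then UPDATE (two round-trips):
item = MyApp.Repo.get!(Item, id)

item
|> Ecto.Changeset.change(value: item.value + 1)
|> MyApp.Repo.update!()

# A single UPDATE, closer to what pgbench issues:
from(i in Item, where: i.id == ^id)
|> MyApp.Repo.update_all(inc: [value: 1])
```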

Have a look at https://github.com/scouten/pghr/pull/5/files.

I also ran the update benchmark with hand-rolled SQL to match the pgbench approach. No meaningful difference.

Ok so you’re clearly hitting some other bottleneck. Did you look at the run queue or telemetry data as I described above?

I haven’t yet. Appreciate those suggestions.


To be clear, I was making two distinct points.

  1. Benchee ips count only a single process, not N processes where N is the parallelism count.
  2. Benchee parallelism may run in batches which would limit throughput.

#1 is the most important, and isn’t addressed by your example as far as I can tell. There is strong evidence for #1, since both your example and the example from the docs show essentially zero (or negative) speed-up from parallelism. It’s also trivially easy to prove:

Benchee.run(
  %{
    "sleep" => fn ->
      Process.sleep(1_000)
    end
  },
  parallel: 10,
  time: 10
)

warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 10
inputs: none specified
Estimated total run time: 12 s

Benchmarking sleep...

Name            ips        average  deviation         median         99th %
sleep          1.00         1.00 s     ±0.02%         1.00 s         1.00 s

If parallelism counted towards ips, then ips shouldn’t be 1.0. This is a serious methodological problem, which is great to have pinned down!

#2 could be false, I agree that the docs are unclear. I’m going to see if I can test this more directly.


So I tried your benchmarks on my computer, a MacBook Pro 2019 with Postgres.app.

If I look in the database after both test runs I get the following results:

ecto: 116378 items
pgbench: 190140 items (~1.63x)

Is this in line with your results?

edit: I set the benchee :warmup option to 0

This is critical, and I think it affirms point #1 I was making before. Benchee is running parallel copies of the benchmark, but only counting a single process for the ips. The total in the database gives a clearer picture.

When using mix, however, warmup is critical, since mix lazy-loads code. The first second or so will carry a significant performance penalty.

Yes, I could not figure out how to reset the database between the warmup and the test run… but that means it can only get better :slight_smile:


Ah, these are great points. I have some personal medical appointments (routine stuff, no big deal) that are going to require me to step away from the computer for the next few hours. Other members of my team are watching this thread and may try some of these suggestions and I may try once I’m back online. Looking forward to hearing what is found.

Thanks, especially, for the observation in #1: That … surprised me. (Side note: Do you have a different benchmarking tool that you prefer?)

I have been in the habit of using mix ecto.drop && mix ecto.create && mix ecto.migrate between each test run.

Doing a fixed number of inserts (~100) before the actual benchmark would probably work as a warmup, and then we could run the benchmark without warmup. Then just subtract that number from the total in the database.
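That idea might look like this (a sketch; `do_insert/0`, `Item`, and the repo name are placeholders):

```elixir
warmup_count = 100

# Warm up code paths (and mix's lazy loading) with a known number of inserts:
for _ <- 1..warmup_count, do: do_insert()

Benchee.run(%{"insert" => fn -> do_insert() end},
  parallel: 10,
  time: 10,
  warmup: 0
)

# Subtract the warm-up rows from the final count:
total = MyApp.Repo.aggregate(Item, :count) - warmup_count
IO.puts("rows inserted during measurement: #{total}")
```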


Filed an issue on the benchee repository about this. I agree that the ips is misleading in this case.