Looking for help with poor Ecto query performance

Iterations Per Second (IPS) is the measure of how many times the callback can be called in one second, per benchmarking scaffold.

Parallel means that multiple benchmarking scaffolds run simultaneously, but each one is still distinct.

If you want to test parallelism, set Benchee to no parallelism (1) and create the parallelism inside your test structure. This is pretty universal among benchmarkers in all languages.

Set up the connection in setup and benchmark parallel connections within the benchmark itself (an input is great for this, so you can test different concurrency levels). Do not have Benchee run in parallel. In general, never ever run Benchee in parallel except for memory testing.
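For example, a minimal sketch along those lines, with hypothetical names (MyApp.update_item/1 stands in for whatever is actually being measured): each input is a concurrency level, the measured function spawns that many tasks itself, and Benchee stays serial.

  Benchee.run(
    %{
      "parallel updates" => fn concurrency ->
        1..concurrency
        |> Enum.map(fn _ ->
          Task.async(fn -> MyApp.update_item(Enum.random(1..10_000)) end)
        end)
        |> Enum.each(&Task.await(&1, :infinity))
      end
    },
    inputs: %{"1 caller" => 1, "10 callers" => 10, "50 callers" => 50},
    time: 10
    # deliberately no :parallel option; the concurrency lives inside the function
  )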

Precisely. Benchee’s parallelism is for running tests faster (though less accurately); it is not for ‘testing’ parallelism.

I prefer to set up the load in the setup call; it gives much more control. ^.^

Read above in my post here.

Both correct.

You don’t. Just absolutely don’t run Benchee in parallel mode if you want accurate timing measurements; it’s there for other purposes.

Not misleading, it is reporting precisely what it should be reporting. If it reported anything else, like 10, it would be absolutely outright wrong.

This is the absolute right answer.

Given that Ecto is doing type conversions, pool handling, etc., this seems quite accurate.

6 Likes

Hi there - benchee creator here,

to be clear, this is absolutely intended behaviour on benchee’s side.

The parallel option is there to exercise more load on the system or to gather more data samples in the same time. Benchee is concerned with measuring and providing data about the time it takes individual function calls to execute. If, say, the function takes 1 second to execute, then an average of 1 second and an IPS of 1 is the correct behaviour, no matter how many copies run in parallel.

I’ll edit the parallel benchmarking wiki page to highlight this property even more.

Cheers,
Tobi

PS: I haven’t read the original thread (yet). I will, and if I have something of note to say I’ll add a new post.

7 Likes

Okay I do have some more input here :smiley:

@benwilson512 @scouten there is a second misconception here: :parallel doesn’t really batch in that sense.

With parallel: 4, 4 processes will be spawned that all execute the same function for the given time. When these finish/the time is up, 4 new processes will be spawned for the next scenario.

The 4 processes just dutifully call their function for the specified :time, each as often as it can; no waiting, nothing. A new scenario means there’s either another input or another function being benchmarked. So there’s no "waiting" effect to account for.

At its core, benchee uses the most naive parallel map for this:

  def map(collection, func) do
    collection
    |> Enum.map(fn element -> Task.async(fn -> func.(element) end) end)
    |> Enum.map(&Task.await(&1, :infinity))
  end

And later on they are simply flat_map’ed:

    run_times = Enum.flat_map(measurements, fn {run_times, _} -> run_times end)
    memory_usages = Enum.flat_map(measurements, fn {_, memory_usages} -> memory_usages end)

As for how we get to the total “throughput”: just multiplying the IPS by the number of parallel processes should be more or less okay. It will be slightly inaccurate due to different abort times, but with a high enough sample size it shouldn’t matter.
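For example, with made-up numbers just to show the arithmetic:

  # purely illustrative figures
  ips_per_process = 100
  parallel = 20
  total_throughput = ips_per_process * parallel
  # => roughly 2000 calls per second overall, give or take the differing abort times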

If you really wanted to see your parallel throughput, in my opinion you should provide the parallelism yourself, because that’s a crucial part performance-wise. Our naive implementation works for us, but as I said it’s rather naive; things like Task.async_stream or more advanced approaches are a better choice for all I know :slight_smile:
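For illustration only, a rough sketch (not benchee code) of a bounded-concurrency runner built on Task.async_stream; the work function is passed in, so none of the names here come from the original benchmark:

  defmodule ConcurrentRunner do
    # Runs work_fun over inputs with at most `concurrency` tasks at a time.
    def run(inputs, concurrency, work_fun) do
      inputs
      |> Task.async_stream(work_fun,
        max_concurrency: concurrency,
        timeout: :infinity,
        ordered: false
      )
      |> Stream.run()
    end
  end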

All that said, I’m now contemplating displaying a “throughput” when parallel is used in benchee, but I’m a bit torn on it. I’ll have to meditate on it; in the meantime I’ll try to make the docs even clearer on the existing option. It’s powerful and was very easy to add, but sadly it seems to have sometimes done more harm than good.

5 Likes

We are currently working on reproducing our results without benchee. Will report back as we know more, hopefully later today.

Looking at your results, if you take the parallel 20 example and multiply it by its IPS, you’re almost at the pg_bench result, and the remaining 10% or so can be accounted for by decoding/encoding etc., imo.

If you provide extended_statistics: true you’ll also get the sample size, i.e. the total number of times the function ran/the number of transactions.
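For reference, in newer benchee versions the option is passed to the console formatter, roughly like this (do_update/0 is just a placeholder):

  Benchee.run(
    %{"update" => fn -> do_update() end},
    formatters: [{Benchee.Formatters.Console, extended_statistics: true}]
  )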

Looking at your statistics from the one run that is shown completely, I can see that there is a HUGE difference between the median and the average, and the 99th percentile is even LOWER than the average. That means there is a huge outlier out there somewhere. Maybe a very big garbage collection run (don’t forget, you’re producing a lot of memory garbage) or a long wait for a new connection.

The median seems in line with pg_bench times. Pg_bench doesn’t have to deal with GC so that seems likely.

I’d recommend using the HTML formatter to see the outlier. You might have to lower the times though as it might break graphing that many samples.

1 Like

I think we may be on to something there. Working on breaking up the test runner so that we build up less garbage per-process.

Definitely seeing a big delta between 99th percentile and max.

To avoid garbage influencing your measurements you could use hooks with after_each: fn _ -> :erlang.garbage_collect() end. However, those collections take a long time, so the total number of samples/the amount of work done will go down by A LOT.

It’s a nice way to check whether it’s really GC, though (increase the time a bit so you still get at least some samples in :wink: ). To see how it affects the sample size I still recommend extended_statistics: true.
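Put together, such a run could look roughly like this sketch (do_update/0 is again a placeholder and the numbers are arbitrary):

  Benchee.run(
    %{"update" => fn -> do_update() end},
    # bump the time so you still get a reasonable number of samples
    time: 30,
    # force a full collection after every call so garbage from one call
    # cannot distort the next measurement
    after_each: fn _result -> :erlang.garbage_collect() end,
    formatters: [{Benchee.Formatters.Console, extended_statistics: true}]
  )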

2 Likes

Thanks, all, for the feedback on benchee’s intended design space. I’ve updated the benchmark repo (https://github.com/scouten/pghr). In this version, I’ve:

  • replaced benchee with my own (very quick-n-dirty) parallel benchmark runner,
  • upgraded Postgres to a more recent version (11.5 vs 9.6),
  • broken out a raw-SQL update from an Ecto-based update,
  • re-run all benchmarks and posted new results.

New findings: Create record results are now at least within the same order of magnitude as the pgbench results.

Unfortunately, update performance is still off by a large margin (roughly 15x). One concern that was raised was that the Ecto version needs to read the row and then update that row. To isolate that concern, I wrote a version of the benchmark that uses exactly the same SQL statement (via SQL.query!) as the one being run in pgbench. The results of that test are essentially the same as the pure Ecto version.
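For context, the raw-SQL variant boils down to something like the following; the statement and column names are illustrative rather than copied from the repo:

  # delta and id come from the surrounding benchmark loop
  Ecto.Adapters.SQL.query!(
    MyApp.Repo,
    "UPDATE items SET counter = counter + $1 WHERE id = $2",
    [delta, id]
  )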

Why? You can pass an update query to update, it shouldn’t need to read back the rows at all?

Drop changesets and use Repo.update_all/3; right now you are doing extra queries, processing, and validations.
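A rough sketch of that suggestion, with illustrative table and column names (the benchmark’s real schema may differ):

  import Ecto.Query

  # a single UPDATE statement, no read-back and no changeset validation
  from(i in "items", where: i.id == ^id)
  |> MyApp.Repo.update_all(inc: [counter: delta])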

Here you go

2 Likes

Thanks for the suggestion. I happened to write a version of that at the same time (see my PR #8, just merged), which I think differs from yours only in trivial syntax.

I see a very modest performance gain (maybe 6% faster) compared to my previous update benchmarks (either the Ecto changeset or the raw SQL), but it’s still way off the pgbench results. Numbers are added to the README.

I agree with you, and I was skeptical too, but I wanted to make sure that argument was well and truly dead. The changeset and raw-SQL approaches benchmarked at almost exactly the same performance. Using Repo.update_all/3 as suggested by @hauleth just now yielded a slight perf boost, but there’s still something larger to be found.

Could that be related to pgbench and Ecto using different transaction isolation levels? Or perhaps connection contention? Just spitballing, because update_all/3 issues a single SQL command, so it’s hard to imagine there is material Ecto computation behind it beyond transforming types to the binary wire protocol.

1 Like

Transaction isolation levels ring true as a theory. FWIW we are seeing a pretty substantial fall-off in performance as we increase the initial size of the items table. This happens only in the Ecto version of this test; it does not reproduce when using pgbench.

It feels like some kind of full-table lock or table scan is occurring when using Ecto that is not happening when using pgbench.

I’m looking for ways to tune the isolation level and haven’t found one yet.

Also agree that the amount of data transmitted between Ecto and PG should not be materially different between the create and update tests, so that doesn’t feel like a valid cause.

This can help set isolation level:

Repo.query!("set transaction isolation level repeatable read;")

This article might help.

2 Likes

Since my posts a couple of hours ago, I’ve done quick experiments with the following:

  • Reverted to Ecto 2.2 and Postgrex 0.13.
  • Removed the use of Enum.random from the update test.
  • Used an explicit transaction around the update operation.
  • Ran ANALYZE after seeding the database with the item records to be updated.

None of these had any appreciable impact on update performance.

As mentioned a few minutes ago, the size of the items table does have a dramatic impact on perf.

I tried this. (See https://github.com/scouten/pghr/pull/9/files.)

At 5000 rows in the initial table, it crashed repeatedly with the following message:

12:53:02.386 [error] Task #PID<0.241.0> started from #PID<0.92.0> terminating
** (MatchError) no match of right hand side value: {:error, %Postgrex.Error{connection_id: 19800, message: nil, postgres: %{code: :serialization_failure, file: "nodeModifyTable.c", line: "1257", message: "could not serialize access due to concurrent update", pg_code: "40001", routine: "ExecUpdate", severity: "ERROR", unknown: "ERROR"}, query: nil}}
    bench/update_item_sql.exs:37: anonymous fn/2 in :elixir_compiler_1.__FILE__/1
    (ecto_sql) lib/ecto/adapters/sql.ex:898: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection) lib/db_connection.ex:1415: DBConnection.run_transaction/4
    bench/update_item_sql.exs:34: anonymous fn/1 in :elixir_compiler_1.__FILE__/1
    (pghr) lib/parallel_bench.ex:29: ParallelBench.iterate/3
    (elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
    (elixir) lib/task/supervised.ex:35: Task.Supervised.reply/5
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

At 50,000 rows, it succeeded, but the performance was not markedly different from the version with that line commented out (both 19xx requests per second).

Update: I’ve posted new results to the repo based on some research by my team today.

There’s a very important difference in how Ecto is processing update requests compared to create requests, one that I cannot replicate with pgbench.

In the create benchmark, we see both pgbench and Ecto hold almost constant performance regardless of how many records are created. (I use the length of time the test is run as a proxy for table size.)

In the update benchmark, we see pgbench hold almost constant performance regardless of the number of records in the table.

But Ecto – running the exact same SQL query (via Ecto.Adapters.SQL.query) – sees performance fall off almost proportionally to the number of records in the table. I didn’t post those numbers, but the performance characteristics are fundamentally similar whether I use Ecto changesets or Repo.update_all/3.

So… what could Ecto (or Postgrex, or DBConnection?) be doing differently that’s specific to update queries? It almost feels like some kind of directive is being issued that causes Postgres to ignore the primary-key index.

2 Likes

Filed as an issue at Ecto: https://github.com/elixir-ecto/ecto_sql/issues/148.