Looking for help with poor Ecto query performance

As @josevalim patiently noted in the GitHub issue, Enum.random turns out to be the culprit here. I thought I had controlled for that in a previous experiment, but apparently I had not. Uff. Been a rough week. My apologies.

Is your repository now adjusted to show exact results from both experimental setups without the Enum.random call skewing the results?

Working on it as we speak …

Now that I have that laptop connected to power again, I’ve re-run the benchmarks and updated the repo with new numbers.

The O(n)-with-table-size issue is gone. There’s still an absolute difference between Ecto and pgbench, but there are leads we can follow on that as well.

Impressive progress. Please post the numbers when convenient; this saga is quite interesting and educational.

Thanks, and I will do so. Hopefully this makes life a little easier for the next person who struggles with perf.

I’ve updated the repo with the results of experiments run during the day. We’re up to about 73% of pgbench performance on the update benchmark.

I’ve run fprof against the latest test harness and it looks like we’re now spending about 87% of the time waiting for messages from Postgres, so there’s probably not much more to squeeze out of this. The only thing that stands out for me is about 3% of time spent on logging (even though the test harness is configured not to produce logs).
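
For anyone wanting to reproduce that kind of measurement, a minimal sketch of an :fprof run (MyBench.run/0 is a stand-in for whatever drives one pass of the harness, and the output filename is arbitrary):

```elixir
# Trace a single pass of the harness, build the profile from the trace,
# and write a readable analysis (with per-call totals) to a file.
:fprof.apply(&MyBench.run/0, [])
:fprof.profile()
:fprof.analyse(dest: ~c"fprof.analysis", totals: true)
```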

@scouten what was the reason you rolled your own little benchmarking library? Benchee has ways to call out to Enum.random that won’t affect the measured time (hooks). If benchee added a throughput display, would that be usable for you?
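
Roughly what I mean by hooks, as a minimal sketch (the job name, MyApp.insert_row/1, and the input are placeholders for illustration):

```elixir
Benchee.run(
  %{
    # The benchmarking function receives whatever before_each returned,
    # so generating the random value stays outside the measurement.
    "insert row" => fn random_id -> MyApp.insert_row(random_id) end
  },
  inputs: %{"id range" => 1_000_000},
  # before_each runs before every invocation, outside the measured time;
  # its return value becomes the argument to the benchmarking function.
  before_each: fn max -> :rand.uniform(max) end,
  time: 10
)
```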

You can remove log calls completely from an application using :compile_time_purge_matching; it’s described in the Logger documentation.

That should remove those 3%. Note that it requires the application to be recompiled to have an effect (mix deps.compile ecto --force to be sure).
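
For concreteness, something like this in config (the :info threshold here is just an example; adjust the matcher to taste):

```elixir
# config/config.exs
import Config

# Purge Logger macro calls below :info at compile time so they cost
# nothing at runtime. The deps containing the log calls must be
# recompiled for this to take effect.
config :logger,
  compile_time_purge_matching: [
    [level_lower_than: :info]
  ]
```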

@PragTob I think benchee and my tool have different design intents. As you put it earlier in this thread:

Benchee is concerned with measuring and providing data about the time it takes individual function calls to execute.

I’m asking a different question with my tool: What is the theoretical maximum write capacity of a system that uses Elixir, Ecto, and Postgres together when under load? I still call it a “benchmarking tool,” but I’m starting to think it might be more accurate to refer to it as a load-generation tool.

IMHO, the maximum write throughput of a Postgres system on any given hardware is what pgbench alone demonstrates. Right now, I’m showing that a fairly rudimentary Elixir + Ecto application sustains about 73% of that theoretical capacity on the same hardware (a 2015-era MacBook Pro). There’s hope that with logging disabled, we might squeeze out a few more percentage points. (If there are more leads beyond that, I’m all ears!)

Armed with this data, I can now say to others on my team that we’re paying x% for logging, x% for checking out connections on each request, x% for Elixir/Ecto itself, etc., and we can have a discussion about whether these are reasonable costs and whether Elixir/Ecto/Postgres is an appropriate tech stack for our intended use case given our expected (hoped for) traffic.

This is obviously a very simple example. Internally, our situation is more complex and we’re seeing other perf problems that are probably rolled up in that complexity. The original intent of this tool was to mimic our PG system and allow me to layer on complexity, looking for that place where we lose performance.

I think I’ve got a “good enough” feel for baseline Ecto/Postgres write performance to resume the next stage of the experiment (digging into the details of our internal issues).

Thanks, all, for your patience and help this week.

Results from running the tests on two Mac machines:

| CPU (cores/threads) | Test | IPS | Duration |
| --- | --- | --- | --- |
| Xeon-2150B, 10/20 | Elixir create | 27610 | 10s |
| Xeon-2150B, 10/20 | pgbench create | 32204 | 10s |
| Xeon-2150B, 10/20 | Elixir update | 41451 | 30s |
| Xeon-2150B, 10/20 | pgbench update | 135406 | 10s |
| i7-4870HQ, 4/8 | Elixir create | 17661 | 10s |
| i7-4870HQ, 4/8 | pgbench create | 29368 | 10s |
| i7-4870HQ, 4/8 | Elixir update | 23242 | 30s |
| i7-4870HQ, 4/8 | pgbench update | 47860 | 10s |

One thing that struck me as odd: out of the 4 test types, one (the Elixir update) runs for 30s while the rest run for 10s. Why is that?

Another thing I don’t like about the tests is the inconsistency of the inserted data coming from :rand.uniform(). I would create an ETS table or an in-memory list with 100k pre-calculated random values and would reuse that in all tests (all 4, one after another). Not sure how I’d code that yet.
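
One possible shape, as a minimal sketch (the table name, count, and value range are arbitrary):

```elixir
defmodule PrecomputedRandoms do
  @table :precomputed_randoms
  @count 100_000
  @max 1_000_000

  # Generate the values once, before any of the benchmarks run.
  def setup do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])

    for i <- 1..@count do
      :ets.insert(@table, {i, :rand.uniform(@max)})
    end

    :ok
  end

  # Fetch the i-th pre-generated value (1-based, wrapping around), so every
  # test sees exactly the same sequence of "random" data.
  def fetch(i) do
    [{_, value}] = :ets.lookup(@table, rem(i - 1, @count) + 1)
    value
  end
end
```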

Btw, thanks for putting the benchmarks up; they led to a batch of improvements to the Enum.random implementation in Elixir master!

@scouten this might be out of scope, but have you tried benchmarking the difference between Ecto and using postgrex directly with raw queries? It might be interesting to see what the overhead is there as well.
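
Something like this is what I mean by raw queries, as a minimal sketch (the connection options, table, and parameters are placeholders; Ecto would be adding changeset building, parameter casting, and connection pooling on top of this):

```elixir
# Talk to Postgres directly, bypassing Ecto's changeset and query layers.
{:ok, conn} = Postgrex.start_link(hostname: "localhost", database: "pghr_bench")

Postgrex.query!(
  conn,
  "UPDATE accounts SET balance = balance + $1 WHERE id = $2",
  [10, 42]
)
```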

I’ve considered that, but haven’t tried it yet.

Re: 30s vs 10s: I’m not sure; it’s probably an artifact of tweaking the methodology over time. I’m working up a new set of update numbers based on another tweak I came across this afternoon, and I’m standardizing on 30s.

I am seeing results akin to yours as I move to beefier hardware. My work laptop is newer and faster than the personal laptop I’ve been using for the results published so far. pgbench outperforms the Ecto benchmark by a wider margin on that machine than on my personal laptop. In preliminary results, we are seeing a similarly larger discrepancy when we move to provisioned hardware on Amazon (EC2 + RDS). (I hope to have more detail to share on these later today or tomorrow.)

One thing I’m also observing on the beefier laptop is that the Ecto benchmark benefits substantially from increased parallelism. Switching from 10 to 40 processes yielded something like a 50% improvement in reported throughput. I’ll be experimenting with that more over the next ~24 hours and will share here.

I’d also advise standardising the output of your scripts so they have the exact same format. IMO there’s not much point in stats like the throughput figure that excludes connection establishment. You’re not doing micro-benchmarks on Postgres itself, after all; you’re testing real throughput between two competing implementations, so connection establishment has to be included in all timing stats.

Another thing I’d definitely do is replace the ad-hoc Task.async / Task.await with the much simpler Task.async_stream. But then again, I’m not sure how that would help (or hamper) your measurements.

What would you be hoping to accomplish with that change? Your “but then…” doesn’t exactly make for a compelling argument. :smiley:

My recommendation is mostly about letting the Elixir stdlib manage the parallelisation. In my (limited) testing on previous projects it was clearly superior performance-wise, albeit not by huge margins, compared to hand-crafted Task.async + Task.await workflows.
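
For illustration, the two patterns side by side, as a minimal sketch (run_one_update, the iteration count, and the concurrency level are placeholders):

```elixir
# Stand-in for a single benchmark iteration.
run_one_update = fn i -> i end

# Hand-crafted fan-out / fan-in: spawn a task per item, then await them all.
1..10_000
|> Enum.map(fn i -> Task.async(fn -> run_one_update.(i) end) end)
|> Enum.map(&Task.await(&1, :infinity))

# Stdlib-managed parallelism with bounded concurrency.
1..10_000
|> Task.async_stream(run_one_update, max_concurrency: 40, timeout: :infinity)
|> Enum.to_list()
```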

Interesting findings here: https://github.com/scouten/pghr/pull/15

First, I’m surprised by the change in pgbench results. I changed to 30-second tests instead of 10-second tests for consistency (thank you @dimitarvp for catching that!), but both the 30s and 10s runs wound up with substantially more throughput than I had observed last week on the same hardware. The only thing I can think of is that I rebooted the laptop over the weekend, so perhaps some other process on the machine had been interfering with last week’s runs.

Second, I eliminated the string interpolation from the Elixir/Ecto benchmark and that seems to have “helped” by some percentage. This is the primary change in PR 15.

Third, I re-ran the benchmarks and did more experimentation with the number of parallel tasks. The Elixir/Ecto version benefits substantially from increasing the number of clients up to a point. In the example posted, 80 parallel tasks seemed to be the point where performance peaked. (Having that be an inflection point is actually good for us, given expected usage patterns.)

I’ll gladly rerun the tests on my iMac Pro and MBP 2015, but I’m not sure whether that would pollute this thread. Maybe I should post the revised results somewhere else; what do you reckon?