Benchmark results of Commanded (CQRS/ES framework) + EventStore (event storage): are they OK?

Hello!

At the company where I currently work, we are building a PoC project using CQRS/ES with @slashdotdash's EventStore + Commanded.

At some point we found that replaying all events (for example, when we add a new sub-domain that needs past events to build its state) takes a significant amount of time, even with a relatively small number of events.

To reproduce the situation I've put together a simple testing strategy using Conduit from the "Building Conduit" book. The code is on a separate branch (perf), and the README contains all the information, including my test results: https://github.com/vheathen/conduit-perfomance/tree/perf

I would be really grateful if someone could have a look at the testing strategy and the numbers and tell me whether they make sense.

When any aggregates are involved, performance is terribly low: around 15 events per second per aggregate (not per aggregate instance!). I'm quite sure I'm missing something and it can be improved a lot, but I currently don't understand how, so any thoughts on the topic are greatly welcome.

Thanks!

Have you made a baseline benchmark of the database itself? As in, just measure how fast vanilla Ecto insert statements are when talking directly to your DB. If that is just as slow, the bottleneck is your DB setup itself. Are you running Postgres directly on your machine? In a Docker environment? With a disk shared from host to the Docker VM (that's a known cause of very slow I/O)?
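
A baseline sketch could be as simple as this (assuming the PoC's repo is Conduit.Repo and that accounts_user_names only requires a username column; adjust the names to your schema):

alias Conduit.Repo

count = 10_000
rows = for n <- 1..count, do: %{username: "user#{n}"}

{micros, _result} =
  :timer.tc(fn ->
    # chunked to stay under Postgres' 65_535 bind-parameter limit
    rows
    |> Enum.chunk_every(5_000)
    |> Enum.each(&Repo.insert_all("accounts_user_names", &1))
  end)

IO.puts("#{count} rows in #{div(micros, 1_000)} ms (~#{round(count / (micros / 1_000_000))} rows/sec)")

If that bare rate is far above what the EventStore pipeline achieves, the database isn't your bottleneck.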

It is a native PostgreSQL 15.0 installation without any virtualization, and it's very fast, so the bottleneck isn't there. I've made a simple test, inserting 1_000_000 rows into the accounts_user_names table: ~1602 rows per second.

Actually the same can be seen from the direct projection timing (1 min 46 sec for 5000 inserts and 100_000 updates, i.e. roughly 990 writes per second), step 34.

I currently don't have much time left to dive deeper into this, but I ran both eprof and cprof, starting with

PORT=4000 MIX_ENV=prod mix profile.eprof --no-warmup -e "Conduit.B.create(250)"

which gave:


Profile results of #PID<0.568.0>
#                                                                                                  CALLS    %  TIME µS/CALL
Total                                                                                              12511 100. 31543    0.25

...(more)

Vex.Validator.Source.Atom.lookup/2                                                                 10000 0.61  1936    0.19
:lists.keyfind/3                                                                                   28506 0.62  1958    0.07
:elixir_aliases.to_partial/1                                                                       10500 0.65  2036    0.19
Module.concat/2                                                                                    10000 0.65  2040    0.20
String.Unicode.downcase/3                                                                          30750 0.69  2185    0.07
Enum."-reduce/3-lists^foldl/2-0-"/3                                                                22387 0.71  2250    0.10
Vex.result/4                                                                                        2750 0.87  2758    1.00
Code.ensure_loaded/1                                                                               20000 0.89  2810    0.14
:re.run/3                                                                                            500 0.90  2841    5.68
Macro.do_underscore/2                                                                              22500 0.93  2936    0.13
:elixir_aliases.concat/1                                                                           10500 0.97  3047    0.29
Vex.Validator.Source.lookup/2                                                                      10000 0.97  3050    0.31
:elixir_aliases.do_concat/1                                                                        21000 1.09  3447    0.16
Vex.Validator.Source.Atom.check/1                                                                  10000 1.17  3695    0.37
Vex.Validator.Source.Atom.validator_by_function/2                                                  10000 1.18  3717    0.37
:erlang.atom_to_binary/1                                                                           23519 1.20  3793    0.16
Enum.find_value_list/3                                                                             10000 1.24  3917    0.39
:code.ensure_loaded/1                                                                              20041 1.30  4090    0.20
Macro.camelize/1                                                                                   12500 1.38  4359    0.35
:ets.lookup/2                                                                                       6267 1.58  4993    0.80
:erlang.module_loaded/1                                                                            20041 1.76  5563    0.28
:erlang.atom_to_binary/2                                                                           23519 1.86  5857    0.25
:erts_internal.port_command/3                                                                        511 1.92  6069   11.88
Vex.Validator.Source.Atom.validator_by_structure/2                                                 10000 1.99  6271    0.63
:elixir_aliases.do_concat/2                                                                        21000 2.05  6460    0.31
:erlang.function_exported/3                                                                        20791 2.16  6799    0.33
:erlang.binary_to_atom/2                                                                           15500 2.25  7095    0.46
Macro.do_camelize/1                                                                                78500 3.70 11681    0.15
:erlang.demonitor/2                                                                                17294 3.81 12013    0.69
:erlang.monitor/2                                                                                  16544 4.72 14899    0.90
:code_server.call/1                                                                                15544 5.09 16058    1.03

and later

PORT=4000 MIX_ENV=prod mix profile.cprof --no-warmup -e "Conduit.B.create(250)"

which gave:
                                                                     CNT
Total                                                          341074646
:erl_prim_loader                                               308291218  <--
  :erl_prim_loader.archive_split/3                             247019256
  :erl_prim_loader.is_prefix/2                                  26522018
  :erl_prim_loader.debug/2                                       4538849
  :erl_prim_loader.deep_member/2                                 3844600
  :erl_prim_loader.reverse/1                                     2269424
  :erl_prim_loader.prim_get_file/2                               2269424
  :erl_prim_loader.name_split/2                                  2269424
  :erl_prim_loader.efile_get_file_from_port2/2                   2269424
  ...(more)
:prim_file                                                       9077696  <--
  :prim_file.read_file_nif/1                                     2269424
  :prim_file.read_file/1                                         2269424
  :prim_file.internal_name2native/1                              2269424
  :prim_file.encode_path/1                                       2269424
:erlang                                                          7470530  <--
  :erlang.++/2                                                   2343934
  :erlang.setelement/3                                           2272223
  :erlang.whereis/1                                              2124073
  :erlang.module_loaded/1                                          73631
  ...(more)
:lists                                                           2711297  <--
  :lists.reverse/2                                               2303084
  :lists.keyfind/3                                                221316
  ...(more)
:code_server                                                     2338054  <--
  :code_server.mod_to_bin/3                                      2113863
  :code_server.reply/2                                             22433
  ...(more)

Both give quite an interesting report that may help you investigate your findings further.
Maybe others on the forum can explain these results, because they seem strange to me: that many :erl_prim_loader and :code_server calls/reductions is unusual.

A quick last look at the results pointed me towards Vex: it has two pretty big performance bottlenecks filed as open issues on its GitHub, and both show up in your profile reports. I can hardly believe this alone would slow your results down that much, but I think it's really worth investigating further.

The first is Code.ensure_loaded/1, with a ~10x perf improvement: Replace Code.ensure_loaded/1 with faster counterpart by timbuchwaldt · Pull Request #52 · CargoSense/vex · GitHub
The second is the (absent) precompilation of the validators; that fix also gives a 5-6x perf improvement: https://github.com/CargoSense/vex/pull/63
Both are still open against the library, so unfortunately they are not that easy to use in your PoC repo.
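
For context, the gist of the first fix is roughly this kind of substitution (a sketch of the general technique only; has_validator?/1 is a made-up name, not the PR's actual diff):

# Before: Code.ensure_loaded?/1 asks the code server (and, in a release,
# the file system) whether the module can be loaded - on every call, even
# for validator modules that will never exist.
def has_validator?(module) do
  Code.ensure_loaded?(module) and function_exported?(module, :validate, 2)
end

# After: function_exported?/3 answers locally for already-loaded modules,
# so the code server is only consulted on a miss.
def has_validator?(module) do
  function_exported?(module, :validate, 2) or
    (Code.ensure_loaded?(module) and function_exported?(module, :validate, 2))
end

That would match the huge :erl_prim_loader and :code_server counts in both profiles above.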

OK, I couldn't resist giving it a quick five-minute try anyway: I removed Vex from your PoC, and I think Vex is your actual problem:

iex(1)> Conduit.B.create(5000)
...
Time to create 5000 users: 35 secs (my previous result with Vex was 778 secs)

This is on the same hardware as yours (Apple M1 with 32GB RAM), but my Postgres runs inside Docker (which is usually super slow for these runs).

Thanks a lot, that explains everything. I have disabled Vex and the results are orders of magnitude better.
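
For anyone else hitting this: in a Building Conduit-style app, disabling Vex can amount to dropping the Vex-backed validation middleware from the Commanded router (the middleware module names below are my guesses at the book's conventions):

defmodule Conduit.Router do
  use Commanded.Commands.Router

  # middleware Conduit.Support.Middleware.Validate   # Vex-backed; removed
  middleware Conduit.Support.Middleware.Uniqueness

  # dispatch declarations unchanged
end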

Some time ago I was able to get a barebones workflow of events / event handlers / process managers completing at 40 per second. There was some learning that had to happen, and there were probably two major speed-ups.

First, process managers (PMs) can be slow because they (a) save their state to the DB after each event, (b) process their list of commands serially, and (c) read the event stream once per PM (not per instance): that single process serially creates new instances (initing a GenServer is fast but not free), runs every interested?/1 check, and GenServer.casts the events on to the instances. At the end of the day they are designed for workflow correctness, not for speed. So we had to modify our UX to give the front-end user the :+1: message before the PM had finished its work. In our case this was fine, as the decision logic was largely settled by the first command that started the PM, and if it failed somewhere later in the PM we could reasonably send a 'sorry, we encountered a problem' email to the user.
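
For readers who haven't seen one, a minimal Commanded process-manager sketch (the Conduit.App application and the event/command structs are made up for illustration):

defmodule Conduit.Example.EmailWorkflow do
  use Commanded.ProcessManagers.ProcessManager,
    application: Conduit.App,
    name: "Conduit.Example.EmailWorkflow"

  # PM state is serialized and saved to the DB after each handled event (a).
  @derive Jason.Encoder
  defstruct [:user_uuid]

  # One process per PM reads the event stream and runs interested?/1 for
  # every event (c); matches are cast on to the per-instance processes.
  def interested?(%UserRegistered{user_uuid: uuid}), do: {:start, uuid}
  def interested?(%WelcomeEmailSent{user_uuid: uuid}), do: {:stop, uuid}
  def interested?(_event), do: false

  # Returned commands are dispatched serially (b).
  def handle(%__MODULE__{}, %UserRegistered{user_uuid: uuid}),
    do: %SendWelcomeEmail{user_uuid: uuid}

  def apply(%__MODULE__{} = pm, %UserRegistered{user_uuid: uuid}),
    do: %__MODULE__{pm | user_uuid: uuid}
end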

Second, splitting up the event handlers can make the system much faster. This happened in two ways. First, Commanded added concurrency for event handlers, which has not been added to Conduit. Second, we started to split up handlers/projectors so that there was an 'AggregateCreatedProjector' and an 'AggregateUpdatedProjector' instead of just an AggregateProjector. Doubling the workers did great, but (obvious in hindsight) we then had the problem that the create projection might land after an update projection, because the create projector was busy while the update projector was not. Once we grouped projections that depend on each other back into one projector, everything worked fine again, and we were still able to run two separate projectors/handlers for the two workflows going through one aggregate, with concurrency on top.
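
If I remember the Commanded API correctly, the concurrency option plus a partition_by/2 callback is what keeps dependent projections ordered; roughly like this (the event structs and the user_uuid partition key are assumptions):

defmodule Conduit.Example.UserProjector do
  use Commanded.Event.Handler,
    application: Conduit.App,
    name: "Conduit.Example.UserProjector",
    concurrency: 10

  # Pin all events of one aggregate instance to the same partition so a
  # create projection can never race its own updates.
  def partition_by(event, _metadata), do: event.user_uuid

  def handle(%UserRegistered{}, _metadata) do
    # insert the read-model row here
    :ok
  end

  def handle(%UserEmailChanged{}, _metadata) do
    # update the read-model row here
    :ok
  end
end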

We also ended up distributing it across a cluster with something like 12 CPUs in all. I don't recall running into a per-second limit per aggregate (not per instance) like the one you describe.

The CQRS form gives you several ways to shape your guarantees. Unfortunately, my experience has been that intuition about async / CQRS / Commanded is gained the hard way: you spend lots of time head-scratching at error messages.
