At the company where I currently work we are building a PoC project using CQRS/ES with @slashdotdash's EventStore and Commanded.
At some point we found that if we need to replay all events (for example, if we add a new sub-domain which requires past events to build its state), it takes a significant amount of time even with a relatively small number of events.
To reproduce the situation I’ve put together a simple testing strategy using Conduit from the “Building Conduit” book. The code is on a separate branch (perf), and the README contains all the information, including my test results: https://github.com/vheathen/conduit-perfomance/tree/perf
I would be really grateful if someone could have a look at the testing strategy and numbers and tell me whether they make sense.
Whenever aggregates are involved, performance is terribly low: around 15 events per second per aggregate (not per aggregate instance!). I’m quite sure I’m missing something and that it can be improved a lot, but I don’t currently understand how, so any thoughts on the topic are very welcome.
Have you made a baseline benchmark of the database itself? As in, just measure how fast vanilla Ecto insert statements are when talking directly to your DB. If that is equally slow, then the bottleneck is your DB setup itself. Are you running Postgres directly on your machine? In a Docker environment? With a disk shared from the host into the Docker VM (that’s a known cause of very poor performance)?
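For reference, a baseline like that can be sketched with plain `:timer.tc/1` around `Repo.insert_all/2`, bypassing Commanded entirely. This is a hypothetical sketch: the module name `InsertBench`, the `Conduit.Repo` module, and the column layout of `accounts_user_names` are assumptions taken from the Conduit project and should be adjusted to the actual schema.

```elixir
defmodule InsertBench do
  @moduledoc """
  Rough baseline: time N plain inserts through Ecto, with no
  Commanded/EventStore involvement, to isolate raw DB throughput.
  """

  def run(n \\ 10_000) do
    now = NaiveDateTime.utc_now() |> NaiveDateTime.truncate(:second)

    rows = for i <- 1..n, do: %{name: "user-#{i}", inserted_at: now, updated_at: now}

    {micros, _} =
      :timer.tc(fn ->
        # One row per statement, mirroring how a projector typically writes.
        Enum.each(rows, &Conduit.Repo.insert_all("accounts_user_names", [&1]))
      end)

    secs = micros / 1_000_000
    IO.puts("#{n} inserts in #{Float.round(secs, 2)}s (~#{round(n / secs)} rows/s)")
  end
end
```

Comparing this number against the event-replay throughput tells you how much overhead sits above the database itself.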
It is a native PostgreSQL 15.0 installation without any virtualization, and it’s very fast, so the bottleneck isn’t there. I’ve made a simple test: inserting 1_000_000 rows into the accounts_user_names table runs at ~1602 rows per second.
Actually the same can be seen from the timing of the direct projection speed (1 min 46 sec for 5000 inserts and 100_000 updates), step 34.
and they both give quite an interesting report which could help you investigate your findings further.
Maybe others on the forum can explain those results, because they seem strange to me. That many erl_prim_loader and code_server calls/reductions.
So a quick last look at the results pointed me towards Vex: it has two pretty big performance bottlenecks filed as open issues on its GitHub, and both show up in the two profile reports. I can hardly believe this alone would slow down your results that much, but I think it’s really worth investigating further.
Some time ago I was able to get a barebones workflow of events/event handlers/process managers completing at 40 per second. There was some learning that had to happen, and there were two major speed-ups.
First, process managers (PMs) can be slow because they (A) save their state to the DB after each event, (B) process lists of commands serially, and (C) read the event stream once per PM (not per instance): a single process serially creates new instances (initializing a GenServer is fast but not free), runs all the `interested?` callbacks, and `GenServer.cast`s data to the instances. At the end of the day they are designed for workflow correctness, not for speed. So we had to modify our UX to give the front-end user a response before the PM had finished its work. In our case this was fine, as the decision logic was largely made with the first command that started the PM, and if it failed somewhere in the PM we could reasonably send a ‘sorry, we encountered a problem’ email to the user.
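The per-event overhead described above is visible in the shape of a Commanded process manager itself. A minimal sketch, assuming a hypothetical money-transfer workflow (the application module `MyApp.App` and the event/command structs are placeholders, not from Conduit):

```elixir
defmodule TransferProcessManager do
  use Commanded.ProcessManagers.ProcessManager,
    application: MyApp.App,          # assumed application module
    name: "TransferProcessManager"

  defstruct [:transfer_uuid]

  # `interested?/1` runs for every event in the stream; it starts a new
  # instance or routes the event to an existing one by its UUID.
  def interested?(%MoneyTransferRequested{transfer_uuid: uuid}), do: {:start, uuid}
  def interested?(%MoneyWithdrawn{transfer_uuid: uuid}), do: {:continue, uuid}
  def interested?(_event), do: false

  # Each handled event dispatches follow-up commands serially, and the
  # PM's state is persisted after every event, which is where much of
  # the per-event cost comes from.
  def handle(%__MODULE__{}, %MoneyTransferRequested{} = evt) do
    %WithdrawMoney{
      account_number: evt.debit_account,
      transfer_uuid: evt.transfer_uuid,
      amount: evt.amount
    }
  end

  def apply(%__MODULE__{} = pm, %MoneyTransferRequested{transfer_uuid: uuid}) do
    %__MODULE__{pm | transfer_uuid: uuid}
  end
end
```

This is why treating the PM as an eventually-consistent background workflow, rather than something the user waits on, works out in practice.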
Second, splitting up the event handlers can make the system much faster. This happened in two ways. First, Commanded added concurrency for event handlers, and that has not been added to Conduit. Second, we started to split up handlers/projectors so that there was an ‘AggregateCreatedProjector’ and an ‘AggregateUpdatedProjector’ instead of just an AggregateProjector. Doubling the workers did great, but (obvious in hindsight) we hit the problem that a create projection might arrive after an update projection, because the create projector was busy while the update projector was not. Once we grouped projections that depended on each other within one projector, everything worked fine again, and we were still able to have two separate projectors/handlers for two workflows going through one aggregate, with concurrency on top.
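For anyone wanting to try this: Commanded exposes handler concurrency via the `concurrency` option on `Commanded.Event.Handler`, and the `partition_by/2` callback lets you keep events for the same aggregate on the same partition, which avoids exactly the create-after-update race described above. A minimal sketch (the application module `MyApp.App` and the `account_uuid` field are assumptions):

```elixir
defmodule AccountProjector do
  use Commanded.Event.Handler,
    application: MyApp.App,     # assumed application module
    name: "AccountProjector",
    concurrency: 4              # run four handler processes in parallel

  # Route all events for one aggregate to the same partition so they
  # are projected in order relative to each other, while different
  # aggregates still project concurrently.
  def partition_by(event, _metadata), do: event.account_uuid

  def handle(event, _metadata) do
    # project the event into the read model here
    :ok
  end
end
```

With partitioning by aggregate identity, you get the throughput of multiple workers without giving up per-aggregate ordering.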
We also ended up distributing it across a cluster with something like 12 CPUs in all. I don’t recall running into a per-second limit per aggregate (not per instance) like you describe.
The CQRS form gives you several ways to shape your guarantees. Unfortunately, my experience has been that gaining intuition about async/CQRS/Commanded is done the hard way: you spend lots of time scratching your head at error messages.