I want to build a system that executes some jobs periodically (about once per day). One of the requirements is to store all the results of those tasks as history, so they can be used later for analytics and KPIs.
My previous approach to this kind of problem was to use time-series databases (usually InfluxDB) and send measurements directly from the application. However, in this case it seems there will be a large amount of data, and I’m not sure whether those types of databases can handle that kind of load.
Some time ago I stumbled upon the CQRS/ES concept, and it seemed like a good alternative to measurements. Having no previous experience with these kinds of systems, I wondered if someone here who has worked with a library like Commanded could advise me whether it is a good idea to use this tool, or whether it would be smarter to just use a primitive approach, like storing the data manually with queries, as I don’t have requirements to use previous events for projections (at least for now).
In addition to this, I would like to know whether you prefer using Postgres to store events, or whether it is much better to use something like EventStoreDB.
That’s a great question.
There are several things that I think will help you decide.
Event Sourcing and an operations history for analytics are very different approaches, though they might seem similar. The Event Sourcing pattern works only when you store the command/event in a causally ordered chain before you actually apply this operation to the read or write model. This approach ensures that:
- Each event and command happened in exactly the same order as it is stored in the Event Store. This is extremely useful for any money-related problem, or for distributed transactions with very critical data.
- It is easy to roll back in time and replay events from an exact point to find a bug, inconsistency, security breach, or whatever.
On the other hand, analytics are not required to have causal or linear order. Most analytical data can be stored after the request was processed. Analytical data can be lost, since it is not critical: data loss just decreases the quality of the analytics. Most analytical data can be aggregated. Analytical storages do not need ACID and append-only properties (which are required for Event Stores), and they can be column-oriented, like ClickHouse.
Commanded linearizes all events and commands in a single process and then stores them in the Event Store. This is clearly a bottleneck for heavily loaded systems, but it is very useful for systems where a linear order of commands/events is required.
Given the two points above, you don’t need Event Sourcing and therefore you don’t need Commanded.
I’d recommend time-series or OLAP databases such as ClickHouse. The choice depends on the kinds of analytics you want to gather.
These are definitely strong arguments. It seems that I don’t need a CQRS system after all; at the least, I don’t care about the order of events.
The other thing that I liked about Commanded is the fact that you can store events in Postgres. The application I want to develop will be mainly focused on small self-hosted instances, and I would like to use only Postgres for data storage, to avoid complexity in infrastructure.
The other question: if Commanded is out of the picture, are there any patterns for storing historical event data in Elixir? The last time around I was using telemetry, but I have a feeling that there might be some better solutions out there.
Unfortunately, I don’t have any experience with analytics in Postgres, but if I were solving this task, I would take a look at how Commanded stores its events.
If events are written in huge amounts and read very rarely, in an export fashion (read all at once), I would:
- Batch data on the client as much as possible.
- Write each huge batch into one row, with PostgreSQL or client-side compression.
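A minimal Elixir sketch of those two steps, assuming Postgrex and Jason are available; the `job_batches(recorded_at, payload bytea)` table and the module name are made up for illustration:

```elixir
defmodule JobHistory do
  # Flush a batch of job-result maps as a single compressed row.
  # `conn` is a Postgrex connection; `results` is the accumulated batch.
  def flush(conn, results) when is_list(results) do
    payload =
      results
      |> Jason.encode!()   # serialize the whole batch at once
      |> :zlib.gzip()      # client-side compression before it hits the wire

    Postgrex.query!(
      conn,
      "INSERT INTO job_batches (recorded_at, payload) VALUES (now(), $1)",
      [payload]
    )
  end
end
```

Reading everything back for an export is then one `SELECT` plus `:zlib.gunzip/1` per row, which fits the write-many/read-rarely shape described above.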
However, it might also depend on the kind of analytics you’re gathering. If it answers questions like “how many times per second, on average/maximum/minimum”, I would take a look at the OLAP approach. If it answers questions like “what did the user do in our program, and when”, I would take a look at the time-series approach.
Plus, Postgres might require some tuning for write-heavy workloads, like WAL tuning and setting fillfactor to 100%.
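For reference, those knobs look roughly like this; the table name is hypothetical and the values are illustrative starting points, not recommendations:

```sql
-- Append-only table: keep pages fully packed. 100 is also the heap default,
-- so this mostly documents intent:
ALTER TABLE job_batches SET (fillfactor = 100);

-- WAL-related settings can be changed cluster-wide without editing
-- postgresql.conf by hand:
ALTER SYSTEM SET wal_compression = 'on';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
-- then: SELECT pg_reload_conf();
```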
Unless you need load balancing or any other multi-application-server scenario, you might as well go with DuckDB. It’s basically the equivalent of SQLite for OLAP needs.
That’s interesting! Does DuckDB have any adapter for Elixir/Erlang you can recommend?
Well, my quick hex.pm search says there are only a handful, but I’ve never tried them.
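For what it’s worth, one of those packages, `duckdbex`, looked roughly like this the last time I glanced at its README; treat the exact function names as an assumption and verify against its current docs:

```elixir
# Hypothetical duckdbex usage (API names may have changed):
{:ok, db}   = Duckdbex.open("analytics.duckdb")  # on-disk database file
{:ok, conn} = Duckdbex.connection(db)

{:ok, _} =
  Duckdbex.query(conn, "CREATE TABLE IF NOT EXISTS runs(day DATE, ok BIGINT)")

{:ok, res} = Duckdbex.query(conn, "SELECT count(*) FROM runs")
rows = Duckdbex.fetch_all(res)
```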
Definitely a very interesting tool. I don’t yet understand how to use it effectively, but I think it might be interesting to take a look in the future.
It’s honestly not as hard as people think. Just another DB, though the usage patterns are not what we are used to. Still, it took me an hour, and I was extremely grumpy and tired back then. I am sure you can do better than me, easily.
Looks like you are right. I started on the wrong foot by looking at a bad tutorial whose point was to show how it integrates with pandas (which I’ve never used). After watching a conference talk by one of the creators, it is much clearer how it works.
My only concern is whether such databases can support serving multiple clients (for example, showing analytics on a frontend that people visit), because if not, then you would need to make some abomination that stores the analytics results in a database like Postgres.
I forgot the details, but I think it’s more or less like SQLite: you can have a ton of readers but only one writer. This translates to still being able to write from multiple clients, BUT only one writer actually does its job at a time, which means the other ones wait… which in turn means that unless you have 1000+ writers at the same time, you’ll very rarely even notice.
Check out their home page; it has sections on “When to use DuckDB” and “When not to use DuckDB”. It seems they don’t encourage concurrent access, but I’ve still seen a fair amount of people on HN claiming it was never a problem for them.
I think this is a design choice of theirs, to avoid having to explicitly support high concurrency, since they would have to sacrifice time and functionality for that kind of optimisation. I might give it a try one of these days and see if any of the libraries work at all.
DuckDB and Clickhouse Local are very popular among data scientists, FYI.
But yeah, don’t be intimidated. Personally I’m always erring on the side of minimum machines and simplest setup so any advice I give should be evaluated within that context.
Oh by the way, here’s a link to clickhouse-local.
This may be way out of left field, but would TimescaleDB be an option? When you mention analytics, I’m reminded of the talk Dave Lucia gave at Code BEAM America. The docs also include a Livebook notebook to play around with it, which significantly reduces friction for someone like me: Intro to Timescale — Timescale v0.1.0. Though looking deeper, it is probably more of a preview state, and I doubt there are other more mature packages.
The question is already solved, but I want to clear one thing up.
CQRS != Event Sourcing
- CQRS is an architectural pattern where you separate the read and write models.
- Event Sourcing is a pattern where you use domain events as the single source of truth to make decisions instead of a “regular” database.
CQRS doesn’t require Event Sourcing, and strictly speaking Event Sourcing doesn’t need CQRS either, but ES almost always uses CQRS.
Yes, this is true; however, most literature uses them in combination, and the same goes for a lot of libraries, so from a beginner’s perspective it is not very clear how to distinguish them correctly.
Yeah, for Event Sourcing the logical choice is also to use CQRS. That’s the reason they’re often bundled together, which can be confusing.