Opinion on file & memory based event sourcing system

Storing events directly on disk is how both Axon Server (Java) and Event Store (C#) deal with persistence. The source code for both products is available to view on GitHub. For Event Store there’s a two-hour video by James Nugent (one of the core contributors) on the internals, and a page outlining the internal architecture in their official documentation.

Writing to disk is easy to start with but presents quite a few challenges which have already been solved by all modern DBMS (which is why my Elixir EventStore uses Postgres). Some considerations include flushing writes to disk to survive power loss; compaction of the event log and scavenging of disk space; backup/restore; handling data corruption; multi-node deployment; indexing and performance.
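
A minimal sketch of the first point, flushing to survive power loss (the path and event shape below are made up): append the record, then ask the OS to flush before treating the write as done.

{:ok, io} = :file.open("events.log", [:append, :raw, :binary])
:ok = :file.write(io, :erlang.term_to_binary(%{type: :account_opened}))
# Without the sync, an "acknowledged" write may still only live in OS buffers
# and can be lost on power failure.
:ok = :file.sync(io)
:ok = :file.close(io)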

Once deployed to production you also need to consider how devops can support and optimise the application for its data access patterns. Using an existing storage provider gives you access to tooling, expertise, and managed services so someone else can take care of hosting the database for you.

But I’d agree it would be an interesting problem to work on, even if you don’t need to use it at scale.

3 Likes

Commanded provides the Commanded.EventStore behaviour as a way of plugging in different event store adapters. If you wrote an event store persisting to disk you could use it with Commanded by adhering to the above behaviour (or writing an adapter/facade for compatibility). In my experience it’s the subscriptions that are the more complex part of building an event store.
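
To make the plug-in shape concrete, here’s a sketch. To be clear, this is not the real Commanded.EventStore behaviour (which defines more callbacks than this); it’s a deliberately tiny, made-up behaviour purely to illustrate the adapter idea.

defmodule MyApp.EventStore do
  @moduledoc "Made-up, minimal event store behaviour (illustration only)."
  @callback append(stream_id :: String.t(), events :: [map()]) :: :ok | {:error, term()}
  @callback read(stream_id :: String.t()) :: {:ok, [map()]} | {:error, term()}
end

defmodule MyApp.FileEventStore do
  @behaviour MyApp.EventStore

  # Naive: one file per stream, rewritten in full on every append.
  @impl true
  def append(stream_id, events) do
    File.mkdir_p!("streams")
    {:ok, existing} = read(stream_id)
    File.write(path(stream_id), :erlang.term_to_binary(existing ++ events))
  end

  @impl true
  def read(stream_id) do
    case File.read(path(stream_id)) do
      {:ok, bin} -> {:ok, :erlang.binary_to_term(bin)}
      {:error, :enoent} -> {:ok, []}
      error -> error
    end
  end

  defp path(stream_id), do: Path.join("streams", stream_id)
end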

1 Like

This would be pretty much my approach for the memory image too. You can always just query the aggregate that’s already sitting in a GenServer. Where my implementation would require CQRS is where some queries (over a multitude of aggregates, for example) would be difficult or at the very least slow. In that instance I’d create other aggregates that handled these queries and were optimised for them.

How does replaying events work in your system? Do you just replay all the events and create the aggregate rows as you go? I found that thinking about the problem from the perspective of replaying events rather than direct commands => events highlighted a lot of shortcomings in my design. The main one was that picking up events from various files was more complex/less reliable than just reading off a list of events from a DB table.
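
For what it’s worth, replay in my head looks like the sketch below: the same apply function handles both a full rebuild and new events, and the rebuilt state sits in a GenServer so it can be queried directly (the aggregate and event shapes are invented).

defmodule BankAccount do
  use GenServer

  defstruct balance: 0

  def start_link(events), do: GenServer.start_link(__MODULE__, events)
  def balance(pid), do: GenServer.call(pid, :balance)

  @impl true
  def init(events) do
    # Replay: fold the whole history through the same apply_event/2
    # that handles live events.
    {:ok, Enum.reduce(events, %__MODULE__{}, &apply_event/2)}
  end

  @impl true
  def handle_call(:balance, _from, state), do: {:reply, state.balance, state}

  defp apply_event(%{type: :deposited, amount: a}, acc), do: %{acc | balance: acc.balance + a}
  defp apply_event(%{type: :withdrawn, amount: a}, acc), do: %{acc | balance: acc.balance - a}
  defp apply_event(_unknown, acc), do: acc
end

# {:ok, pid} = BankAccount.start_link([%{type: :deposited, amount: 10}])
# BankAccount.balance(pid) #=> 10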

That’d be great! I’ve had a look through the source code before (and again this weekend) and it’s clear, but a lot of the “why” gets lost, which I think is where a lot of the value comes from and where docs are really helpful! (At least for me.)

That would give you the ability to replay “events” and rebuild the state of the application, but I think you’d be missing one of the biggest advantages of ES - a domain specific language for the changes in your system…

… and it’s that domain specific language that tends to make a lot of the code duplicated. There doesn’t need to be a conceptual difference between commands and events, but they are quite useful to ensure that events can always be applied (for the purpose of replays) and that commands can sometimes be rejected with helpful messages. I’ve found that whenever I try to drop the commands to simplify the system, I actually end up Command Sourcing rather than Event Sourcing. I map the intent (the command) rather than the outcome (the event) and rely on my application to consistently end up in the same state despite commands having the potential to be rejected. Removing the duplication by removing commands and sticking to just events tends to create a one-directional system in my experience, where a user’s actions are just sent into the ether and may or may not result in an event. A defined command provides a feedback mechanism.
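
A rough sketch of that split (names invented): the command handler is allowed to say no, which gives the caller feedback; the event applier never is, so replays can’t fail.

defmodule Account do
  # Commands may be rejected with a reason the caller can act on.
  def execute(%{balance: balance}, {:withdraw, amount}) when amount > balance,
    do: {:error, :insufficient_funds}

  def execute(_state, {:withdraw, amount}),
    do: {:ok, %{type: :withdrawn, amount: amount}}

  # Events are facts: applying them must always succeed, or replay breaks.
  def apply_event(state, %{type: :withdrawn, amount: amount}),
    do: %{state | balance: state.balance - amount}
end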

As for duplicate models between aggregates and projections, that’s the CQRS bit coming in. As Ben pointed out above, if you can get away without additional read models and just directly query the aggregate you can drop the additional complexity and duplication that projections bring, but you also drop the ability to have tailored read models.

I think a degree of duplication, or at least similarity, throughout a domain is always going to be required for this structure. That’s where the flexibility and granularity come from, and in my experiments the benefits start to diminish if this duplication is too ardently avoided.

This is something I haven’t struggled with yet, but I haven’t been running ES systems for long and went to great pains to ensure that any changes to my domain/models were additive, in that new events had extra information that old ones didn’t and that info was just inferred for the old events.

I hadn’t considered sqlite because my head was so firmly focussed on the BEAM VM. That could be a good alternative, albeit without some of the advantages (keeping data as terms).

I had read that page, but hadn’t seen the video. I’ll have a watch!

That’s pretty much my experience too. I had wanted to go with a very simple and “pure” implementation using just File and term_to_binary but it very quickly becomes apparent that…

…these problems are real and need addressing. I hadn’t considered them until quite far into my first version.

This I thought could be quite straightforward as it is effectively just like backing up any other files on your computer/server.

Again, I hadn’t considered this, but the “what-ifs” started creeping in as I was working on it.

I was going just for single node for the time being. I wanted to try this approach as a simpler, smaller scale alternative so multi-node wasn’t a concern.

I had hoped that IO performance wouldn’t be an issue as I could rely on eventual consistency and my in-memory data to provide all the speed I’d need. Of course, complete replays would be expensive and slow, but hopefully that would be a rare occurrence.

I’ve been working to that behaviour so far…

…but as you say, it’s been the subscriptions that’s the tricky part!

I’ll keep cracking on with it, but instead of building out my little side-project/business with the file and memory based ES, I’ll use Commanded with the aim to switch from one to the other if I can maintain a consistent API between them.

Thanks for all the feedback guys! If you have any more thoughts, please keep adding them. This is very much an ongoing thing.

3 Likes

Our team at Inflowmatix built our ES/CQRS into our platform using a combination of meta-programming, protocols and functional Elixir. We originally ran our platform’s domain store (behind a protocol to allow switching) on Mnesia for around 18 months in production. We used term_to_binary along with compression and made heavy use of snapshots of our Aggregate state to reduce Aggregate startup costs/times. Eventually though, as our most active Aggregates started to have a long history and frequent snapshots, the table fragmentation in Mnesia was getting heavy. Additionally, the table-level locking model for certain transactions caused frequent rollbacks and retries with lots of conflict resolution, which eventually converges, but under heavy load can cause upstream timeouts on GenServers, especially where a call was needed to assure the transaction completed.

Eventually, after many small repairs and refactors, we switched our store to PostgreSQL and were able to use our protocol to write a splitter module that wrote to both stores for a while (and read from PostgreSQL) until we had migrated the historic data in the background to PostgreSQL, then switched the splitter’s behaviour to use only PostgreSQL.

Just some notes really to perhaps bear in mind.

7 Likes

It’s really great to hear what a similar system was like in production!

I had considered using Mnesia, as it meets the criteria of being on the BEAM and would provide a lot of the extra control around saving to disk that I would otherwise have to write.

Transactions have become more of a requirement/issue the more I’ve looked into the concept. Guaranteeing that events are written to disk fully means that the entire event store effectively becomes sequential and blocking — unless I’m misunderstanding the fundamental requirements…

I’d very much like to make the file-based event store work, but it’s increasingly looking like PostgreSQL is a far more pragmatic choice.

Seeing as I’m looking at using files for persistence and GenServers for the memory image, perhaps all I’m looking for is a GenServer that can be recovered/restored/rebuilt from file. Something like PersistentGenServer, Mnemonix, or even CacheX? It wouldn’t be ES as commonly practised, more an actor model with some events stuffed into it. I’d have to do some thinking on whether this would suit my requirements.
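
The bare-bones version of what I mean (nothing to do with those libraries, just the idea, with invented names and paths): write the whole state term out on every change and reload it in init/1.

defmodule FileBackedServer do
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path)
  def put(pid, key, value), do: GenServer.call(pid, {:put, key, value})
  def get(pid, key), do: GenServer.call(pid, {:get, key})

  @impl true
  def init(path) do
    state =
      case File.read(path) do
        {:ok, bin} -> :erlang.binary_to_term(bin)
        {:error, :enoent} -> %{}
      end

    {:ok, {path, state}}
  end

  @impl true
  def handle_call({:put, key, value}, _from, {path, state}) do
    state = Map.put(state, key, value)
    # Persist the full state on every change; crude, but restartable.
    File.write!(path, :erlang.term_to_binary(state))
    {:reply, :ok, {path, state}}
  end

  def handle_call({:get, key}, _from, {path, state}),
    do: {:reply, Map.get(state, key), {path, state}}
end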

1 Like

I’d be interested to hear more about this, especially as I’m not far from you in Winchester. We should meet for a coffee to chat solutions. Did you consider using Commanded & EventStore before building your own solution?

The main reason, I think, was that we built it in 2015, before you started your work.

Sure, it would be good to meet up if you fancy coming over to the science park.

Mike

I’m still chipping away at this whenever I get a moment, and thought that I’d provide a quick update on where I’m at.

I tried using many different options for storing events, all using a shared interface so that I could quickly switch between them. I’ve found some wonderful little features of both Elixir and Erlang and some cool libraries to help me along the way — it really has been a brilliant learning experience!

So far I’ve tried:

  • Simple term_to_binary and File.open(x, [:binary, :append]) (see the sketch after this list)
  • DETS
  • Erlang’s disk_log
  • Erlang’s file:consult for reading events back and a custom io_lib:format function to write them
  • CubDB (using min_key and max_key for event number ranges was brilliant)
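
A sketch of the first option as I’d frame it now, with one added assumption on my part: a 4-byte length prefix per record, because bare term_to_binary blobs appended to a file give you no record boundaries to read back from.

defmodule EventLog do
  def append(path, event) do
    bin = :erlang.term_to_binary(event)
    # Length-prefix each record so read_all/1 can find the boundaries.
    File.write!(path, <<byte_size(bin)::32, bin::binary>>, [:append])
  end

  def read_all(path) do
    path |> File.read!() |> decode([])
  end

  defp decode(<<size::32, bin::binary-size(size), rest::binary>>, acc),
    do: decode(rest, [:erlang.binary_to_term(bin) | acc])

  defp decode(<<>>, acc), do: Enum.reverse(acc)
end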

All had positives and negatives, with disk_log probably coming out on top. It has a load of features that you really need for working with files already baked in and thoroughly tested. The deal-breaker though was that it’s really hard to get events back out by their event number, as disk_log really doesn’t work with keys; it literally just appends a given term to the end of the log.

Getting the events out required bringing all events into memory (ignoring disk_log’s wonderful chunking feature) and filtering out events older than those we were interested in. This isn’t the worst thing in the world, as I was planning on keeping logs partitioned by aggregate, meaning they are unlikely to ever become so long that parsing the whole log becomes an issue. It’s incredibly fast anyway: loading 100_000 average-sized events takes approximately 75ms.
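
As a rough illustration of that scan (assuming each logged term is a map with a :number field, which is my shape rather than anything disk_log requires):

defmodule DiskLogScan do
  # Walk the log in chunks and keep only events at or after from_number.
  def events_from(log, from_number) do
    log
    |> read_chunks(:start)
    |> Enum.filter(&(&1.number >= from_number))
  end

  defp read_chunks(log, cont) do
    case :disk_log.chunk(log, cont) do
      :eof -> []
      {next, terms} -> terms ++ read_chunks(log, next)
    end
  end
end

# {:ok, log} = :disk_log.open(name: :events, file: ~c"events.LOG")
# :ok = :disk_log.log(log, %{number: 1, type: :created})
# DiskLogScan.events_from(log, 1)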

My concern was that I could end up with a compromised method for reading and writing the literal heart of this system.

The next phase in this experiment was yet more learning… Martin Kleppmann’s “Designing Data-Intensive Applications”. I can honestly say that I learnt more reading this in a week than I did over the previous 9 months piecing together articles and documentation online. His closing conclusion of “turning the database inside out” is very in line with what I’m trying to accomplish here, although I definitely think he was talking at a much larger scale than I’m working at! :joy:

Importantly, the book highlighted many issues that I was unwittingly creating in other parts of my system. I was neglecting a total order by writing all of my events in partitions based on aggregates. It wasn’t a deal-breaker though, as I could accept partial order as a trade-off in this instance because causality isn’t an issue in my domain. However, these partitions did make my projections quite complex due to needing to track their event offsets (using consumer offsets rather than acks) for many, many, many streams. Not a blocker, but one complication that I decided, for now at least, to avoid by opting for a single unified log with total order.
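
With a single totally ordered log, that bookkeeping shrinks to one number per projection, something like this (the event shape and projection here are invented):

defmodule Projection do
  defstruct last_seen: 0, state: %{}

  # Already-seen events are skipped, so replays and redeliveries are harmless.
  def handle(%__MODULE__{last_seen: seen} = p, %{number: n}) when n <= seen, do: p

  def handle(p, %{number: n} = event) do
    %{p | last_seen: n, state: project(p.state, event)}
  end

  defp project(state, %{type: :user_registered, name: name}), do: Map.put(state, name, :registered)
  defp project(state, _event), do: state
end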

I’m using a very simple writer and index pattern from the book and at the moment it’s working very well. Sending a command to an aggregate, having it validated, events created, persisted and applied to the aggregate is taking on average around 150 microseconds. I still have to handle log rotation and snapshots, but these should be quite simple as I’m modelling my store as a series of GenStage stages (yet another thing I’ve learnt since starting this).
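
My reading of the writer + index idea, very roughly (a sketch of my interpretation rather than the book’s design; the framing and names are assumptions, and rebuilding the index from an existing file on restart is left out):

defmodule LogWriter do
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)
  def append(event), do: GenServer.call(__MODULE__, {:append, event})
  def read(number), do: GenServer.call(__MODULE__, {:read, number})

  @impl true
  def init(path) do
    {:ok, io} = :file.open(path, [:read, :write, :raw, :binary])
    {:ok, %{io: io, index: %{}, next: 1, offset: 0}}
  end

  @impl true
  def handle_call({:append, event}, _from, s) do
    bin = :erlang.term_to_binary(event)
    frame = <<byte_size(bin)::32, bin::binary>>
    :ok = :file.pwrite(s.io, s.offset, frame)
    :ok = :file.sync(s.io)

    # The index maps event number to byte offset so reads can seek directly.
    index = Map.put(s.index, s.next, s.offset)
    state = %{s | index: index, next: s.next + 1, offset: s.offset + byte_size(frame)}
    {:reply, {:ok, s.next}, state}
  end

  def handle_call({:read, number}, _from, s) do
    offset = Map.fetch!(s.index, number)
    {:ok, <<size::32>>} = :file.pread(s.io, offset, 4)
    {:ok, bin} = :file.pread(s.io, offset + 4, size)
    {:reply, {:ok, :erlang.binary_to_term(bin)}, s}
  end
end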

I’ve decided in many cases to opt for purposely naive approaches - whereas before they were just naive :wink: - in order to keep my system simple. My tests have shown that I’ve got a good amount of performance margin with which to make compromises such as the single log, which is a potential bottleneck but hugely simplifies the entire system. Equally, a single event stream and a local-only Registry make it a lot easier to work with and understand. I figure that it’s easier to make a system that already works well faster than it is to fix a fast system that’s yet to work.

I have had crises of confidence along the way (“Why aren’t you just using Commanded and PostgreSQL like a sensible person?”, “You’re totally out of your depth here!”, “Why even bother with event sourcing, CRUD could work here after all”) but overall I’m very happy with the progress I’ve made and the things I’ve learnt. I think I’m probably a little way off being able to build anything complex with it yet, but I’m enjoying the process.

6 Likes

Sounds like you’re having fun. Martin’s book is a fantastic reference for building these types of systems.

Once you have a working single-node system you can experiment with making it distributed for scalability and fault-tolerance. It looks like disk_log supports distribution, but it’s likely not useful here because it doesn’t guarantee writes go to all nodes:

A collection of open disk logs with the same name running on different nodes is said to be a distributed disk log if requests made to any of the logs are automatically made to the other logs as well.

It is not guaranteed that all log files of a distributed disk log contain the same log items. No attempt is made to synchronize the contents of the files.

Erlang -- disk_log

1 Like

It’s been so much fun, albeit frustrating and head-scratching at times. As you say, Martin’s book almost feels written with this type of system as the ultimate goal.

It’s so tempting to build on top of disk_log because it’s already there and does so much of what I want to achieve, but the hurdles I think are too great. The querying of events (or lack thereof) will become a real issue with a single event stream, and as you point out, the built in distribution isn’t all that helpful in this case.

Distribution and replication whilst maintaining consensus on event ordering is where I’ve spent a lot of time re-reading sections of the book. Unfortunately it looks like single leader replication of the log is the only really viable way to approach the problem, and therefore not hugely useful as I’ve still got a single node as a bottleneck.

The only way to distribute the log (either locally or across machines) would be to go the same route as Kafka and break my event streams into topics and partitions (as I was originally doing) which brings a lot of additional complexity with it, both in persisting events and making them available to the rest of the application.

As my use case is smaller, simpler applications I think that restricting myself to a single node (at least for the event store) is not unreasonable. I’m opening a can of worms and losing sight of the objective by trying to solve problems with distributed logs and total ordering, so for now I’m keeping my solution as simple as possible.

As both file- and database-backed logs run into the same restrictions around distribution, the only real question that remains is which is better in terms of scale and throughput on a single node. I suspect that’ll be a DB like Postgres, unless avoiding the encoding/decoding of terms starts to play a larger role in performance. We’ll see…

1 Like

If you make the event store single node then you could expose an API that other parts of your application can use. Treat it like a database that as many nodes as required can connect to. Event Store exposes a “native interface of AtomPub over HTTP” as an example.
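
On the BEAM the cheapest version of that might just be distributed Erlang rather than HTTP: clients on other nodes call the store node by name. The node and process names below are assumptions.

defmodule StoreClient do
  @store_node :"store@127.0.0.1"
  @store_name EventStore.Writer

  def append(stream_id, events) do
    GenServer.call({@store_name, @store_node}, {:append, stream_id, events})
  end

  def read(stream_id) do
    GenServer.call({@store_name, @store_node}, {:read, stream_id})
  end
end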

Thanks for the fantastic thread. A week ago I stumbled upon the event sourcing concept and I’m learning quite a bit thanks to libraries like Fable and EventStore. But obviously I probably still don’t get the nuances of where, when and how to use this pattern correctly.

What I don’t necessarily like with the approach is that my logic is somehow also bound to event history; if something changes in the way data is being stored I now need to either:

  • Keep the old and new data-structures as they were stored and just add the necessary application logic to handle multiple representations.
  • Migrate stored data and reducers as necessary to keep the system tidy.

The Fable example from @benwilson512 gave me a cue:

Would storing the event along with its transformation be too crazy an idea?
Let’s say these are the events I want to persist (this is just general Elixir, no framework or library attached):

%Event{
  id: "123-abc",
  type: "Inserted",
  handler: :erlang.term_to_binary(fn x -> x end),
  number: 1
}

%Event{
  id: "123-abc",
  type: "Incremented",
  handler: :erlang.term_to_binary(fn x -> x + 1 end),
  number: 2
}

We could then retrieve the desired events and apply the stored transformations to any data as a starting point. Taking it a little further:

%Event{
  id: "1a2-b3c",
  type: "Incremented",
  handlers: %{
    up: :erlang.term_to_binary(fn x -> x + 1 end),
    down: :erlang.term_to_binary(fn x -> x - 1 end)
  },
  number: 108
}

We could store handlers for both transforming data further or doing a rollback (the Sage - Sagas project docs gave me this idea). This would let us “advance” or “roll back” data in any state to another point in time.
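
Applying those stored handlers would presumably look something like the sketch below (written against the %Event{} shape above). One caveat worth knowing up front: a serialised anonymous function is tied to the exact version of the module that created it, so recompiling or upgrading that module can make old handlers fail to deserialise or call.

defmodule Replay do
  def advance(state, events) do
    Enum.reduce(events, state, fn event, acc ->
      handler = :erlang.binary_to_term(event.handlers.up)
      handler.(acc)
    end)
  end

  def rollback(state, events) do
    events
    |> Enum.reverse()
    |> Enum.reduce(state, fn event, acc ->
      handler = :erlang.binary_to_term(event.handlers.down)
      handler.(acc)
    end)
  end
end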

Aside from security or possible changes on how terms are stored… Would something like this be at odds with EventSourcing? Is there a library that already does this? I’m pretty sure there’s a lot I’m missing :anguished:

Hey, I’m glad that the thread has been useful. Event sourcing is a bit of a bear to get your head around initially, and for a “simple” pattern, there’s a lot of complexity hidden in implementing it well. I’ll confess that I’m still not there myself and will keep using Commanded whilst slowly tapping away at this idea.

You’re very right that storing events leaves you bound to your past mistakes or oversights. That’s pretty much the biggest cost of the pattern: if you screw up you either have to live with it, or spend a lot of time managing previous versions of your events or migrating them to the new structure. There are some good links floating around the forum (just search “eventsourcing”) that dive into this issue further.

For this reason, it’s often recommended to use ES in a well understood domain with a lot of time spent up front working out the data your events need. It’s not a tool to pull off the shelf when you’re still finding your way through the problem.

For what it’s worth I handle the event versioning issue by being quite overzealous in the data that each event captures. You can be quite objective with event data as it’s “something that happened”, so there’s only so many ways that you can interpret it. Make sure you store everything that may be relevant to the action and you should be OK. I’ve only had one instance in the last year where I needed a new version of an event, and it was because the original was missing a data point I didn’t think was required at the time. Fortunately it was easy to add in and provide my handlers with a fallback value where older events were missing it.
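
The fallback ends up looking like a small upcast step applied when events are loaded; the event type, field and default below are invented.

defmodule Upcast do
  # Old events that predate the new field get a default when loaded,
  # so handlers only ever see one shape.
  def upcast(%{type: :order_placed} = event), do: Map.put_new(event, :currency, "GBP")
  def upcast(event), do: event
end

# loaded_events |> Enum.map(&Upcast.upcast/1)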

One thing that often gets overlooked, and which tripped me up, was the boundaries of my aggregates. Perhaps it’s because I don’t have a DDD background, but where my aggregates sat in relation to my projectors caused me some grief early on.

As for storing the function within the event: personally I’m not sure it would work. It’s fine for mapping the behaviour of that event within the aggregate, but it doesn’t allow for the potentially dozens of ways the event will be used by any number of handlers. Not to mention you lose what is wonderful about ES: data separated from logic, and therefore the ability to rewrite the handlers and operate on old data as your requirements change. It’s a nice idea and I do think that “logic as data” has its place, I just wouldn’t use it here.

4 Likes

Yes, I’m still trying to understand the real meaning of the DDD lingo (boundaries, aggregates, projectors). I’m guessing that eventually a book (or experience) will have to do.

Yeah, my approach binds an event to a “single interpretation”, but I guess I have this irrational feeling that without my application logic a series of events by itself would be meaningless. Then again, the same could be said of any sufficiently generic abstraction.

Thank you so much for the well put thoughts :slightly_smiling_face: the picture is getting clearer bit by bit.

1 Like

I’m thinking that disabling the Linux write cache could help here:

Not all systems belong to the same “turn on write-back caching” recommendation group, as write-back caching carries a risk of data loss in events such as power failure. In the event of a power failure, data residing in the hard drive’s cache does not get a chance to be stored and is lost. This fact is especially important for database systems. In order to disable write-back caching, set write-caching to 0:

# hdparm -W0 /dev/sda

/dev/sda:
setting drive write-caching to 0 (off)
write-caching =  0 (off)

Reading the Kafka design decisions about going with a filesystem approach while taking advantage of all the low-level stuff that Linux has to offer may help you in making some good decisions.

1 Like

Thanks for the info and links! I’ve tabled this project whilst I work on other things — bills to pay and all that… — but the idea still keeps nagging at me. I had intended to look at the guts of systems like Kafka and EventStore properly when I revisit this.

Do you have any insight on how the Erlang VM might impact this approach? When looking at this before, it seemed that access to the low-level OS stuff is more limited from the VM than with other systems, but that may just be my unfamiliarity with its guts.

This approach of using files to persist the data needs to take into consideration that the Erlang library disk_log by default has a delay of 2 seconds or 64 KB before writing to the disk, as I say here:

This setting can be tuned, but care needs to be taken to find the right balance in terms of disk IO performance, otherwise you may create a bottleneck when writing to the disk.
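
If you do go down the disk_log route you can also force a flush after appends you can’t afford to lose, trading throughput for durability; and with plain files the buffering is an option you set yourself. A sketch (the log name and file names are assumptions):

{:ok, log} = :disk_log.open(name: :events, file: ~c"events.LOG")
:ok = :disk_log.log(log, %{number: 1, type: :created})
# Make sure the logged items are written out to disk now rather than
# sitting in a buffer.
:ok = :disk_log.sync(log)

# For the plain-File approach the equivalent buffering is explicit and tunable:
{:ok, io} = :file.open("events.bin", [:append, :raw, :binary, {:delayed_write, 64 * 1024, 2_000}])
:ok = :file.write(io, :erlang.term_to_binary(%{number: 2, type: :updated}))
# Closing (or :file.sync/1) flushes whatever is still buffered.
:ok = :file.close(io)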

I am not that familiar with it, but @dimitarvp may have something to say here.

From what I remember when reading the Kafka design decisions you don’t need to interact with the OS from the BEAM, you just need to tune the machine for disk IO as they recommend.

Did you get back to this at some point?

I hadn’t (I’ve actually changed careers and work as a photographer now), but funnily enough I started poking around my terminal again today to see what’s new in Elixir, as I’ve a few ideas I want to try (on my own time). Event sourcing is still one of those things, but I’ll need to dig into it again and see whether this approach has merit versus a more conventional ES setup using Postgres.

1 Like