Hobbes - a scalable, fault-tolerant transactional record store

Thanks! You and anyone else are still welcome to ask questions even if you don’t know much about this stuff btw. I don’t mind!

Congratulations on technically being the first person to get the pun!

I was originally going to write a clone of Calvin before realizing that it was a research system and not a great candidate for that. Thankfully I found FDB which taught me a lot. But I kept the name :slight_smile:

3 Likes

Thanks, I now see your point about the developer discipline needed to write code in this style. I suppose this is why Flow exists. And I’m sure the next iteration exists at the core of Antithesis.

Question for you, since you’ve been thinking deeply about this. Suppose BEAM does someday get a simulation mode, with a deterministic and stable scheduler. Would we require fundamental language changes in order to make proper (and ergonomic) use of it, or do you see a world where one or more of the current languages can adapt: Erlang, Elixir, Gleam? I guess I’m wondering what building blocks would we need, hypothetically, to remove some of that required developer discipline?

How deep does the rabbit hole go? :slightly_smiling_face:

1 Like

I mean, I shouldn’t oversell this: it really is not that hard. In the very beginning I had a couple of determinism bugs caused by enumerating maps. After I learned about that I don’t think I ever had a determinism bug again.
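To give a concrete (made-up) example of what such a bug looks like — `peers` and `send_heartbeat` are hypothetical names:

```elixir
# Map iteration order on the BEAM is unspecified, and can genuinely differ
# between runs when keys are things like pids or refs:
Enum.each(peers, fn {id, peer} -> send_heartbeat(id, peer) end)

# Sorting the entries first makes the enumeration order deterministic:
peers
|> Enum.sort()
|> Enum.each(fn {id, peer} -> send_heartbeat(id, peer) end)
```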

Also, if you have a rare determinism bug it may not matter very much. The purpose of the simulator is debugging; deterministic execution does not have to be a correctness condition of the database.

At the end of the day, you actually have to care about these things. Good testing takes effort. Really, anything good takes effort.

Like, I said property-based tests are better than unit tests, and then everyone shows up to defend unit tests. And, look, I am not here to tell you what to do, but if you actually care about correctness and want to get to zero bugs this attitude is just NGMI.

Fundamentally, what set FDB apart is that from the beginning they set a culture of caring deeply about correctness. They went to war with bugs. Flow, the simulator, and the database were all downstream of that, not upstream.

Today I think most of that culture has been erased. They still have the simulator, but the team maintaining the database is several generations removed from those pioneers.

If you want to see who’s really carrying this torch today, it’s Tigerbeetle. In particular, their work on handling storage faults far surpasses FDB, and if FDB’s culture was still alive today they would be working on that instead of replacing their storage engine with RocksDB (lol).

I would like to carry a bit of that torch too.

1 Like

To actually answer your question, I don’t think any language changes are needed. Simulation tests on the BEAM work today; I’m literally doing it right now.

If you made the BEAM scheduler deterministic and wrote simulated APIs for I/O and time you could do DST natively, which would have performance benefits. But this would probably be a lot of work.
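To sketch what I mean by a simulated API for time (module names are made up, and this is only one possible shape):

```elixir
defmodule Clock do
  @callback monotonic_now() :: integer()
end

defmodule Clock.Real do
  @behaviour Clock
  @impl true
  def monotonic_now, do: System.monotonic_time(:millisecond)
end

defmodule Clock.Sim do
  @behaviour Clock
  use Agent

  def start_link(initial \\ 0),
    do: Agent.start_link(fn -> initial end, name: __MODULE__)

  @impl true
  def monotonic_now, do: Agent.get(__MODULE__, & &1)

  # The simulator, not the wall clock, decides when time passes.
  def advance(ms), do: Agent.update(__MODULE__, &(&1 + ms))
end
```

Code under test asks its configured `Clock` implementation for the time, so a simulation run can advance time instantly and deterministically.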

The real benefit of a new VM would be to simtest the VM itself. The biggest problem with my approach is that I cannot catch bugs in Erlang itself. Of course Erlang is old and stable so this isn’t the end of the world, but I would love to do better.

If you look at Tigerbeetle they have no dependencies at all. They literally vendor parts of the Zig stdlib into their repo. That’s the dream: no untested surface area whatsoever.

But obviously “rewrite the entire BEAM” was one yak too many even for me, so Construct works for now!

When I saw this in the forum’s email digest, I was like wait what, I need to see this :smile:

Congrats! It will be pretty exciting to see how this project grows, good luck!

Actually, the reason I wanted to see this is that I need something like Uber’s LedgerStore, an immutable, scalable store for financial transactions… Will this project ever be suitable for that? (apologies if this is unrelated, I didn’t dive into the answers yet)

P.S. I just can’t seem to find anything for this in the Elixir ecosystem; I’m searching for a column-oriented DB (DuckDB?) that blends in, but defaulting to Postgres for now… If anyone has suggestions for what to use please let me know :folded_hands:

1 Like

This is an incredible amount of effort put into something that could become extremely valuable to the ecosystem. I just want to ask how you ended up with all the knowledge needed to build something like this. It’s difficult for me to imagine how to even begin understanding these concepts, at least from my position of a regular developer shovelling endless CRUD apps and integrating half-working REST APIs. It’s clearly a very different world.

3 Likes

I did not have the knowledge when I started! I acquired it by trying. Of course I had to read a lot and studying databases had become a hobby of mine before it turned into writing databases, but at the end of the day (and this is really true of everything) reading is not enough. There is a lot of stuff that only clicks once you sit down and try to implement it.

Maybe, maybe not. If you’ve followed my posts on this forum you will know that this project was born specifically out of me trying to build apps that are better than CRUD apps with half-working REST APIs and consistently failing to meet my own standards. Some of that is on me, but I prefer to think in terms of process errors and I’ve found that bad tooling leads to bad results.

Especially in this brave new world of AI slop I have become convinced that saas-with-forms is just not going to cut it anymore. I am trying very hard to do better.

Well, the explicit goal of this project is to make sure you don’t have to :slight_smile:

But really, if you’re interested in this stuff it might not be as hard as you think! Keep in mind that non-programmers would say the same things about the knowledge you take for granted. Don’t count yourself out.

8 Likes

I will give you the long answer to this question.

First of all, Hobbes is a low-level database; that is, a tool for building databases. The architecture is optimized specifically for a “distributed OLTP” model where the dataset is large but transactions are small. In these workloads, contention is rare and has to be guarded against mainly for correctness. This architecture is pretty optimal for app backends and so on.

However, there are some workloads that have high contention. These workloads do not scale (here’s a fun paper). As it turns out, financial transactions are quite close to the worst possible case here. It is common to route transactions through a small number of accounts which leads to extremely high contention (“hot keys”).

So for financial transactions specifically the best design is to write extremely tightly optimized code to process all transactions in batches as fast as possible on a single core. Elixir is essentially the worst language I can imagine for doing something like this.

Coincidentally, there is a new database designed specifically for financial transactions called Tigerbeetle which was the topic of discussion a few posts up. I mentioned them because I’ve spent a good amount of time in their codebase now (learning) and, frankly, they have the best engineering practices I have ever seen. If you need to store financial transactions Tigerbeetle is without question the way to go. You still need another database to store everything else, though, as they can only store transactions.

Hobbes is a low-level OLTP store and can be used as a tool to build essentially any kind of distributed database. You could use it to build a relational database, or as a metadata store for blob storage (like S3). You could even use it to store metadata for an analytics/columnar data warehouse; this is literally exactly what Snowflake did with FoundationDB (Hobbes’s architectural donor). You could keep track of Parquet files as blobs and query them with something like DuckDB. In fact, that pattern has become a whole thing lately.

But financial transactions are essentially the prototypical example of the worst possible case for a “horizontally scalable” database like Hobbes.

8 Likes

Thank you for such a thorough answer! You are the engineer the world needs!)) (thanks for your work and answers throughout the Elixir community man!)

It was so awkward to see that the post literally right before mine mentions Tigerbeetle, the very thing I was searching for :saluting_face:

Much appreciated for the paper and the technologies (and even the terms that were new to me) you mentioned; it got me interested and made me want to learn more about so many things!

I am a beginner to data warehouses and metadata stores, but I really want to learn because I am planning to work with lots of analytical (frequently written and aggregated) data, and this stuff just seems like a must-know right now. FoundationDB has caught my attention many times and I want to try to do some work with it, but what bothers me is that it needs a language binding and there is no official one for Elixir (I saw you answered A LOT in the EctoFDB thread, so it would be great to hear some advice from an expert). Perhaps there is an alternative that has better Elixir support?

I am now wondering: is Hobbes suitable for analytical workloads, or is it better used as a metadata store for a data warehouse? (TBH I don’t really understand what a metadata store is needed for or how Hobbes could be used as one :persevering_face:)

2 Likes

(garrison is a legend and he was born with all that knowledge, even Peaky Blinders named their pub after him, JK :P)

Writing my previous post led me to an answer for you, my friend: I think becoming a real engineer is tough because, besides the work we all have to do, it requires delving into difficult stuff to accumulate knowledge about various topics (as garrison said: by trying, literally like this).

Just dig into code and try implementing things (learning by practice is the best), and ask questions! Engineering takes some courage, ey) But if you want it and embrace the difficulty, it can be fun and rewarding after all :wink:

I am very glad to hear that!

This kinda bothered me too, and was one of the original reasons I wanted to try writing my own DB. A ridiculous reason on its own of course, but others came along.

FDB does not have a wire protocol like, say, Postgres. They opted instead for a “thick client” design where the client is responsible for some pretty important stuff like buffering writes, implementing a “read your writes” cache, caching and invalidating shard location metadata, and so on. A lot of this stuff is in the critical path for correctness, meaning if somebody implemented it wrong (which they likely would, because it’s hard) the client would corrupt your data.

FDB was designed using a very sophisticated simulation testing methodology (especially for the time - they pioneered this) and it was necessary to test the client along with the server. So the FDB client is a C library which you have to link against. They do have “official” bindings for a couple of languages but they just link the C library for the important stuff.

For Elixir, @jstimps maintains the BEAM FDB bindings (erlfdb). My understanding is that they were originally written for the CouchDB rewrite which never happened and they are fairly mature and stable. If you have questions about that you should direct them to him in the EctoFDB thread. He also knows much more about operating FDB than me (I just read the code).

4 Likes

As @garrison says, if you’re interested in using FoundationDB, you can use erlfdb in Elixir. It is API stable and used in serious production systems. Here’s an Elixir-focused tutorial

EctoFDB is more of a batteries-included system, with my opinions built in. I run it in production for a side project, and I believe it works quite well.

Please let me know if you have questions about either one in the EctoFDB thread or in GitHub issues!

BTW excited to see where Hobbes goes. The simulation testing is just as critical as @garrison claims, maybe even more so :slight_smile: . Having a strictly serializable MVCC system embedded in the BEAM certainly opens a lot of doors.

5 Likes

Imagine you’re trying to build something like S3. That is, a distributed filesystem or blob store. The purpose of such a system is to map a small identifier (a file name) to a large blob of contiguous bytes.

The way this is done, generally, is to have a system that associates large blobs (or chunks thereof) to opaque ids, and then have a system that associates file names to those ids. This allows you to quickly perform logical updates against the system without moving around huge blobs of data. It’s an indirection mechanism.
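A tiny sketch of that indirection, with plain Elixir maps standing in for the real KV store and all names made up:

```elixir
defmodule BlobIndex do
  # Two logical mappings live side by side in one KV store:
  #
  #   {:blob, blob_id}   => list of chunk locations on storage nodes
  #   {:file, file_name} => blob_id
  #
  # A rename only touches the small name->id entry; the blob never moves.
  def rename(kv, from, to) do
    blob_id = Map.fetch!(kv, {:file, from})

    kv
    |> Map.delete({:file, from})
    |> Map.put({:file, to}, blob_id)
  end
end
```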

Pretty much every object store you can think of follows this pattern, with a metadata service and a blob storage service. If you want to learn more I’d recommend going back to the classic Google File System (GFS) paper, which is a good early example. Another really good one is the Haystack paper from FB, which explains pretty much exactly how to store blobs on disk and why the metadata split is important. They talk a lot about storing metadata in memory, but keep in mind this was before SSDs were cheap.

Hobbes, like FDB, is architecturally amenable to being the metadata store. It’s designed to offer fast, small updates to a large dataset with extremely strong consistency guarantees. It is not designed to offer large updates; even a few megabytes in one transaction would be excessive. This is the correct side of the tradeoff to take for OLTP use-cases where you insert or update a few distant rows/indexes at a time.

This “blob tradeoff” goes right down to the hardware. Hobbes is intended for use on SSDs which can efficiently perform random I/O. But for blob storage you would generally want to use spinning disks because they’re cheaper and the I/O pattern is more sequential. This dichotomy is even more dramatic with SMR drives, which physically cannot perform random writes.

So a design kinda naturally falls out of this: use Hobbes to store metadata (filename mappings) and design a simple “blob storage” server that does nothing but write bytes to spinning disks. The blob storage system can dodge basically all hard distributed systems problems (consistency and so on) because Hobbes provides them by default.

I intend to write something like this soon, once Hobbes is ready.

4 Likes

Okay, so for an “analytics database” it’s a little more complicated.

First of all, you want to store analytics data in a columnar format. Hobbes is a pure binary/binary KV store underneath, so you can technically store data in whatever format you want. There have been attempts to store columnar data directly in FoundationDB, though I’m not sure if there are any production examples.

However, there are a few things to note:

  1. The access patterns for OLAP are generally to write data once, sequentially (updates are rare), and then perform large aggregations over the dataset, which kinda sounds an awful lot like the blob case if you think about it
  2. You have to design a columnar format for the data, which is a lot of work
  3. You have to write an advanced (probably) SQL analytics query engine, which is on the order of a few million lines of tightly optimized low-level code (certainly not Elixir)

So due to number 1, you might want to structure your analytics database like a blob store. That is, there is a metadata store tracking which blobs of columnar analytics data are “live” and which ones have been superseded or removed. Snowflake built a hundred-billion-dollar business out of doing this on top of FDB, storing data in (I believe) S3 and specializing in the query engine stuff.

Due to market forces, a number of standards for columnar storage developed. The most notable of these is probably Parquet. The idea here is that a customer can keep their data in Parquet files and then move their “data warehouse” to another provider, which I’m sure companies like Snowflake just loved.

More recently, standards for the query engines themselves have developed. Embedded query engines like Apache DataFusion started to pop up. An emerging winner here is DuckDB, which has distinguished itself with cute branding and a user-friendly standalone CLI. As it turns out, those things are actually very important. You have probably never even heard of DataFusion, have you?

And now, standards for how to maintain the metadata store mapping have also arrived. There are a lot of marketing buzzwords here (try “data lake” and “lakehouse”), but it’s just more standardization at play underneath. DuckDB now actually has its own standard “lakehouse thing” with DuckLake, which indeed uses a metadata store (they call it a “catalog store”) like Postgres to keep track of a bunch of Parquet files.

So, if you were so inclined, you could build an analytics database in Elixir, using Hobbes, like this:

  • Write parquet files to some sort of blob store (perhaps also built with Hobbes)
  • Keep track of which files are currently “live” using Hobbes as an index
  • Use DuckDB to perform fast analytical queries on the parquet files

And, again, you would get strong correctness, durability, and availability guarantees while solving zero hard distributed systems problems.
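Purely as a sketch of the shape of it (the `Hobbes.*` calls and the key layout are entirely hypothetical, since the API isn’t final):

```elixir
# Read the set of live Parquet files out of the metadata store.
live_files =
  Hobbes.transact(fn tx ->
    tx
    |> Hobbes.get_range({:live_parquet, :min}, {:live_parquet, :max})
    |> Enum.map(fn {_key, url} -> url end)
  end)

# DuckDB can query a list of Parquet files directly:
files = Enum.map_join(live_files, ", ", &"'#{&1}'")
sql = "SELECT user_id, sum(amount) FROM read_parquet([#{files}]) GROUP BY user_id"
```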

That’s the idea.

3 Likes

As the source of truth for the facts you want to persist for later analysis, I would recommend an append-only log, just like databases use internally; they call it a WAL (Write-Ahead Log).

From the append-only log you can then feed the facts into whatever tools and databases are better suited for each use case you have.

This approach doesn’t lock you into the specifics of any database or tool, leaving you with an immutable source of truth for the persisted facts (the append-only log) that you can use today, tomorrow, or a decade from now, with any tool or database you may need or want.

3 Likes

If you are building it in pure Elixir then you may need to account for delayed_write, which has a default of 2 seconds and may work against you or in your favour: file — kernel v10.4.1

delayed_write - The same as {delayed_write, Size, Delay} with reasonable default values for Size and Delay (roughly some 64 KB, 2 seconds).
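For example, a minimal sketch of the safer pattern (the payload is made up): open the file in :raw mode, without delayed_write, and sync explicitly after writing:

```elixir
payload = :erlang.term_to_binary({:event, "user_registered"})
entry = <<byte_size(payload)::32, payload::binary>>

{:ok, fd} = :file.open("events.log", [:append, :raw, :binary])
:ok = :file.write(fd, entry)
# Without the sync the entry can sit in kernel buffers and vanish on crash:
:ok = :file.sync(fd)
:ok = :file.close(fd)
```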

I discovered this while investigating why Mnesia lost all its data after I restarted my laptop. Later I also found an issue in an EventLog Elixir library that lost all records written to the file during the default two-second window when the file was closed: Example in README cannot read the last entry in the append only log · Issue #1 · pbudzik/eventlog · GitHub

I also talk about this delayed write issue in this forum post:

Side Note: You may want to read the Kafka design decisions about using a filesystem approach to take advantage of all the low-level stuff that Linux has to offer, which may help you with making some decisions on how you use the filesystem from the BEAM. Another interesting option you may want to consider is to disable the Linux write cache, as I mention in this post: Opinion on file & memory based event sourcing system - #21 by Exadra37 .

3 Likes

Oh, if only we lived in a world where kernel and filesystem developers didn’t hate us. Then perhaps things could be so easy :slight_smile:

The BEAM buffering your writes is only the beginning.

Unfortunately the kernel also takes it upon itself to buffer your writes, meaning you need to issue an fsync after writing and before returning a commit. Unless you’re writing a new file, in which case you also need to fsync the parent directory.

Unfortunately the disk can silently corrupt your data, so you need to write a checksum with the data. And you probably want that to be a cryptographic checksum if you’re writing user data (you know, the thing that databases do), otherwise you could be vulnerable to all sorts of weird collision attacks.
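As a minimal sketch of that framing in Elixir (SHA-256 is just one reasonable choice of cryptographic hash):

```elixir
defmodule Frame do
  # Prefix each record with its length and a SHA-256 digest of the payload.
  def encode(payload) when is_binary(payload) do
    digest = :crypto.hash(:sha256, payload)
    <<byte_size(payload)::32, digest::binary, payload::binary>>
  end

  # Recompute the digest on read; a mismatch means the bytes are corrupt.
  def decode(<<len::32, digest::binary-size(32), payload::binary-size(len)>>) do
    if :crypto.hash(:sha256, payload) == digest do
      {:ok, payload}
    else
      {:error, :corrupt}
    end
  end
end
```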

Unfortunately disks can also misdirect reads and writes to the wrong sectors, meaning that it’s possible for an entirely valid page (with a consistent checksum inside) to be written to or read from the wrong place on disk, silently corrupting the database. So you have to write the page and then keep the checksum as part of the address used to access it, turning the on-disk state into a hash-chained structure.

Unfortunately you still have to store the root hash somewhere, so you have to write several (nearly) identical copies of a superblock, each containing the root. You can then read from and write to those superblock copies like a quorum, which you have to do very carefully because those reads and writes can be misdirected.

Unfortunately even if you do all of the above fsync can still fail, and when it fails the OS page cache will lie to you and tell you that the page was actually written. There is no way to make this crash-safe, so you need to bypass the page cache entirely using O_DIRECT and then read back all of your writes directly from the disk to make sure they succeeded.

Unfortunately Erlang does not support O_DIRECT, so you need a NIF for this.

Unfortunately even if you do all of that your disk can still fail, so you need to perform writes against a set of replicas. There is no way to do that without a consensus algorithm, so you need to implement Paxos/Raft/Viewstamped Replication.

Unfortunately storage faults can propagate through consensus algorithms in a way that violates fault tolerance, so you have to be very careful to implement consensus properly with respect to disk faults. Which, as it turns out, nobody did back when that paper was published.

Hopefully the value proposition of this work is starting to become clear lol.

This is what I mean when I say that I want to solve these problems once and never again.

7 Likes

I guess your database should also email each transaction to the DPO for good measure :smiley: Just to be safe!

1 Like

I know you’re joking, but it really is storage hardware and APIs specifically that are so unreliable. Byzantine-fault-tolerant systems exist, but servers with ECC memory generally work pretty much all of the time.

Storage hardware, on the other hand, absolutely does exhibit these faults at scale, and even a bit of corruption in the wrong place can break the guarantees of the entire database. Correctness is hard.

Just have a look at what Matrix (the chat service) had to deal with recently when their Postgres indexes were silently corrupted. It’s nightmare fuel.

4 Likes

Lots of insights here that I wasn’t aware of. I only knew that the kernel also buffers writes and that you need fsync to get around it; everything else is new to me.

Thanks for the detailed explanation :slight_smile:

2 Likes