Hobbes - a scalable, fault-tolerant transactional record store

Thank you very much for the explanation (I re-read it like 20 times and learned a lot) and papers for me to study!

How large can the dataset be? I don’t quite understand what limits the amount of data that can be stored before query/write performance gets bad.

Extremely strong consistency guarantees sound amazing btw! I’m loving Hobbes more and more

If it’s not too much to ask: I understand that there’s also a network throughput consideration, which is why batching exists. But what should you do when a massive number of requests to write small pieces of data comes in from a lot of sources? Just use Kafka?

Will be looking forward to it! It will be interesting to see how it works :eyes:

Actually, the reason I want to study FoundationDB is that it’s really impressive how powerful it is. But it seems really advanced to me, so maybe it’s not the right time for me to dive into it…

I’ve worked with Postgres a lot, so I think TimescaleDB is the way to go for me. Btw, their Insights tool, which is built on TimescaleDB itself, is really impressive with ~3TB of daily data ingest :face_exhaling:

I hear about Parquet quite often; what I don’t get, though, is how much data you need before you actually need it, same as with FDB… (and I indeed hadn’t heard of DataFusion :sweat_smile:)

Oh wow… My jaw dropped when I read your post. I couldn’t believe things are that bad in the storage hardware world, but apparently it’s reality… And now the tables have turned and I’m taking Sorc96’s place, wanting to ask you his questions :sweat_smile:

I take my hat off to you. At first I didn’t understand, but now I’m eagerly looking forward to Hobbes :face_exhaling:

Your knowledge on the topic is commendable. But I do wonder how you can solve this in a comprehensive manner, once and for all. Barring writing a driver that uses disks in their raw form (block devices?), I’m not sure you can protect against the kernel and the existing drivers trying to be too smart and too helpful, and crippling proper database performance as a result, which as you alluded to is an old and known problem.

3 Likes

Lol first I think I need to clarify something. When I say I want to solve these problems once I mean for me. I don’t mean that I am, like, the savior of the database universe or something; that would be quite silly.

Actually, it’s the Tigerbeetle devs who have been pushing this boundary forward lately with respect to having a proper storage fault model, and I am very much “copying off of their homework” in my own implementation. But that’s how this stuff works: much of their approach was copied from ZFS (the filesystem), the designers of which also contributed enormously to this idea of systems that don’t incessantly lose data.

What I mean by “solve these problems once” is actually, for me, very specific. I am very interested in sovereignty as I have mentioned, and I want to be able to build systems that store data for some products that I’m working on. Here are a few of the things I want:

  • A scalable relational database with strong correctness guarantees and native support for multitenancy
  • A distributed blob store/filesystem that integrates with the above and maintains the same guarantees
  • A full text search engine that integrates with the above and maintains the same guarantees

This list is not random, it’s a specific list of things that I actually need for specific functionality in apps that I am actually designing. I want to be able to “own” these things so I am not forever at the mercy of megacorporations that hate us. There are many open source projects available that do some of these things, but none of them are exactly what I want.

The thing is, all of these things (and others) have the same base requirements: a way to consistently replicate persistent data across servers and be somewhat resilient to corruption. So I could write that code three or four times, or I could factor it out, write it once, and then build the other stuff as (mostly) stateless layers on top of it.

Hobbes is that code, factored out. Hobbes is a toolkit that will contain all of the parts of building “distributed database-shaped things” that are the same across them, so that I don’t have to re-implement them 10 times. This is actually a practical endeavor. I have clear and specific goals.

However, in sharing this work with the community of course I am trying to also present it in terms of how it might be useful to them, as that’s what people really want to hear (see the first few posts in this thread).

6 Likes

Well on Linux you kinda can. You just set O_DIRECT and bypass the page cache. That’s literally what it’s for. This is going to become a lot more common now that it is known to be effectively impossible to avoid corruption without that flag. Postgres prominently supports direct I/O now for example.

(Actually it’s funny, this all began with Postgres’s fsyncgate. After that they modified the database to crash on fsync failure, only for that paper to then be published showing that doesn’t work either!)

I am not knowledgeable or resourced enough to do all of the research needed for this on my own. But thankfully Tigerbeetle has pretty solid internals docs and readable code and they care a lot about evangelizing this approach (because it makes their database look good) so I can just learn from them.

If they say ECC memory and O_DIRECT are good enough then I believe them. Obviously no database is going to survive the Sun swallowing the Earth, but if I’m writing a storage engine from scratch (which I literally am right now, see the xks directory) then I may as well use the latest best practices.

BTW I will probably not do the nif thing for a while. I’m just trying to structure everything so that it will be an easy switch. I am not looking forward to dealing with native code because I am not a systems programmer. But I was not a database programmer either so shrug. I’ll make do.

5 Likes

I won’t know the answer to this for a while as Hobbes is not yet at a place where this can be tested. And even if it were, doing such tests would be pretty (financially) expensive. Scaling to huge numbers is not a priority for me in the short term. What matters to me is that Hobbes is architecturally capable of scaling so that it can grow with me.

With that said, FoundationDB could provide a rough upper bound. FDB is known to tap out around a few hundred terabytes of data (i.e. around a petabyte after replication). This is on the order of maybe a thousand servers. My understanding is that FDB taps out due to keepalives, though, which is a very fixable thing. So I suspect its users (i.e. Apple/Snowflake) simply have not found much use for getting past ~100TB scale in a single cluster.

Apple’s deployment for example is highly multitenant: it stores all CloudKit data, and is probably slowly consuming their (also enormous) Cassandra deployment. You want to have separate clusters for blast radius reasons, and you can move tenants between clusters, so really the cluster size limit is a tenant size limit. And 100TB is a very big tenant, so you can see why there is little need to get larger than that. Also, you want separate clusters so that you can shard geographically for low latency.

If you have a very big dataset you have to shard it across servers. A primitive approach here is to hash-shard data into explicit buckets, like DynamoDB with its 10GB shards controlled by a user-defined (developer-defined, that is) shard key. In that case you just send data to the right server(s) for its shard.

Unfortunately this means that you have to know the distribution of your dataset in advance, which is problematic. To solve this you can do range-based autosharding, where the shards are split/merged automatically and moved around as the distribution changes. This is a more advanced approach.

Spanner, CockroachDB, FoundationDB, and of course Hobbes are examples of the latter approach.
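
To make the two approaches concrete, here is a toy sketch of shard selection (module and function names are made up for illustration, and real systems wrap this in routing, rebalancing, and so on):

```elixir
defmodule Sharding do
  # Hash sharding: the shard is a pure function of the key, so you have to
  # know (or assume) the key distribution up front.
  def hash_shard(key, shard_count) do
    :erlang.phash2(key, shard_count)
  end

  # Range sharding: shards own contiguous key ranges, so the range map can be
  # split/merged and moved around as the data distribution changes.
  # ranges :: [{first_key_inclusive, last_key_exclusive, shard_id}]
  def range_shard(key, ranges) do
    Enum.find_value(ranges, fn {first, last, shard} ->
      if key >= first and key < last, do: shard
    end)
  end
end

# Sharding.hash_shard("user:42", 16)
# Sharding.range_shard("m", [{"a", "n", :shard_1}, {"n", "zzzz", :shard_2}])
```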

Maintaining atomic transactions across disparate shards gets tricky, though. Older systems just don’t, while newer systems like CockroachDB have atomic transactions but with questionable consistency guarantees. Spanner, FDB, and Hobbes offer very strong consistency guarantees.

The answer to your actual question is deeply technical and completely different among all of the systems I mentioned, but I can try to hand-wave a bit. For Hobbes we essentially take in a bunch of small transactions, group them together into a batch with a single version, perform concurrency control, and then split them up and send them to different servers for storage. This is, like, legit pretty complicated though.
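
A toy version of that “group small writes into one versioned batch” idea, with all the hard parts (concurrency control, durability, backpressure) stubbed out, might look like this. None of these names are real Hobbes modules:

```elixir
defmodule Batcher do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def submit(txn), do: GenServer.call(__MODULE__, {:submit, txn})

  @impl true
  def init(_opts) do
    Process.send_after(self(), :flush, 5)
    {:ok, %{version: 0, pending: []}}
  end

  @impl true
  def handle_call({:submit, txn}, from, state) do
    # Don't reply yet; the caller is acked once its batch is persisted.
    {:noreply, %{state | pending: [{from, txn} | state.pending]}}
  end

  @impl true
  def handle_info(:flush, %{pending: []} = state) do
    Process.send_after(self(), :flush, 5)
    {:noreply, state}
  end

  def handle_info(:flush, %{version: version, pending: pending}) do
    new_version = version + 1
    # One round trip persists the whole batch under a single version.
    :ok = persist_batch(new_version, Enum.map(pending, fn {_from, txn} -> txn end))
    Enum.each(pending, fn {from, _txn} -> GenServer.reply(from, {:ok, new_version}) end)
    Process.send_after(self(), :flush, 5)
    {:noreply, %{version: new_version, pending: []}}
  end

  # Stub standing in for concurrency control + replicated, durable storage.
  defp persist_batch(_version, _txns), do: :ok
end
```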

3 Likes

Hi @garrison , I was reading some of Hobbes (specifically the binary enc/dec in Keyset) and realized you may be interested in this erlfdb PR.

The implementation from the original erlfdb accumulated an offset int and used that to do ever-increasing match clauses. This turns out to be pretty slow. It makes a measurable difference when doing many erlfdb_tuple operations.

2 Likes

Hah, I thought the offset thing was pretty clever in that you can grab the binary slice without accumulating by character (which would be the naive approach). But it looks like :binary.match() is a nif and there’s no beating that. Thanks for the tip, I’ll keep that in mind.
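
For anyone following along, the contrast is roughly this (heavily simplified; real tuple encodings also have to deal with escaped null bytes, which this ignores, and this is not the actual erlfdb or Keyset code):

```elixir
defmodule TerminatedDecode do
  # Naive decode: walk the binary byte by byte, accumulating characters until
  # the 0x00 terminator.
  def decode_naive(bin), do: decode_naive(bin, <<>>)
  defp decode_naive(<<0, rest::binary>>, acc), do: {acc, rest}
  defp decode_naive(<<b, rest::binary>>, acc), do: decode_naive(rest, <<acc::binary, b>>)

  # Let :binary.match/2 (implemented in C) find the terminator, then slice the
  # element out in one step instead of rebuilding it byte by byte.
  def decode_match(bin) do
    {pos, 1} = :binary.match(bin, <<0>>)
    rest_size = byte_size(bin) - pos - 1
    {binary_part(bin, 0, pos), binary_part(bin, pos + 1, rest_size)}
  end
end
```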

If you’re reading that code I should note that I only implemented enough of Keyset to be useful internally. The first implementations of a lot of the internals were scaffolded out with hacky manual string encodings and that quickly became an actual problem so I needed some sane form of binary encoding.

I was going to make a point about how unfortunate it is that we can’t prefix the strings with their lengths but I see you already mentioned that in the PR! This is a case where I think the value encodings can make a much better tradeoff by including a header with types/offsets/lengths at the start of each value. That way you don’t have to decode the entire value (much larger than a key) in order to pull out a “column”. Obviously the values have no ordering requirement.
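
Very roughly, something like this (a totally made-up layout, not what Hobbes will actually use; a real version would also carry type tags):

```elixir
defmodule ValueCodec do
  # Header: entry count, then one {offset, length} pair per column, then the
  # raw column bytes. Any single column can be sliced out without decoding
  # the rest of the value.
  def encode(columns) when is_list(columns) do
    count = length(columns)
    header_size = 4 + count * 8

    {entries, body, _offset} =
      Enum.reduce(columns, {<<>>, <<>>, header_size}, fn col, {entries, body, offset} ->
        {<<entries::binary, offset::32, byte_size(col)::32>>,
         <<body::binary, col::binary>>, offset + byte_size(col)}
      end)

    <<count::32, entries::binary, body::binary>>
  end

  def fetch(<<count::32, _::binary>> = value, index) when index < count do
    skip = 4 + index * 8
    <<_::binary-size(skip), offset::32, len::32, _::binary>> = value
    binary_part(value, offset, len)
  end
end

# value = ValueCodec.encode(["alice", <<42::64>>, "alice@example.com"])
# ValueCodec.fetch(value, 2)  #=> "alice@example.com"
```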

2 Likes

I think it’s time for a little update.

Storage engine development has been going very well. There is, mostly, a storage engine now. It’s named XKS (short for ExKeyStore, which I think is a hilarious pun) and you can find the code in the lib/xks directory. It’s unfinished and obviously still messy, but most of the constituent parts are now there and it’s just a matter of gluing them together.

For those unaware, a “storage engine” is just the part of a database that writes data to disk. Generally a storage engine exposes some level of abstraction (like “key/value store”) and atomic commit functionality. Strangely, storage engines are often completely separate pieces of software from the databases that utilize them. MySQL has InnoDB, MongoDB has WiredTiger, and so on. Some databases do have their own for various reasons: Postgres’s is quite bespoke and quite old (and quite bad), Tigerbeetle’s is very deeply integrated with their data model (it literally only stores fixed-length data, which is unique).

For those of you who remember some of my earlier comments on the topic (before Hobbes was formally announced), you may remember that I originally intended to just use SQLite as a storage engine and dodge this particular yak for now. In fact, that’s exactly what FoundationDB did for over a decade. They have since switched over to RocksDB, for some reason.

So why did I change my mind? Well, firstly at the time I just didn’t know how to write a proper storage engine, but as Hobbes’s development dragged on I ended up studying enough to fix that. But also, I came to realize that there are some areas where deep integration with the storage engine can actually simplify the implementation of Hobbes’s distributed features and improve correctness. And if there are any two things I like, they’re being lazy and not writing bugs. So, uh, yolo.

On the correctness side, rolling the entire storage engine from scratch means Hobbes can maintain zero dependencies (and if there’s anything I hate, it’s dependencies) and perform fully integrated simulation testing of the entire database as a whole. Whereas if I used something like SQLite, I would probably be mocking it out during testing, which is Not Great. Also, XKS is designed to be very resilient to corruption via comically aggressive cryptographic checksumming of the entire file a la ZFS or Tigerbeetle; it’s a modern design.
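
To be concrete about what that checksumming looks like in spirit (this is not the actual XKS on-disk format, just the shape of the idea):

```elixir
defmodule ChecksummedBlock do
  # Every block carries a cryptographic digest; reads refuse to hand back
  # bytes that don't verify.
  @hash :sha256

  def seal(payload) when is_binary(payload) do
    digest = :crypto.hash(@hash, payload)
    <<byte_size(payload)::32, digest::binary, payload::binary>>
  end

  def open(<<size::32, digest::binary-size(32), payload::binary-size(size)>>) do
    if :crypto.hash(@hash, payload) == digest do
      {:ok, payload}
    else
      # Corrupt, torn, or misdirected write: surface it instead of silently
      # returning garbage.
      {:error, :checksum_mismatch}
    end
  end

  def open(_), do: {:error, :truncated}
end
```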

On the functionality side, FDB’s most unfortunate limitation is that read-only transactions can only be a few seconds long. Most of FDB’s limitations are Good Actually, including the limit on read+write transactions. Long transactions in an optimistic system (or really even a pessimistic system) are poison and to be avoided. But read-only transactions are different as they have no contention under an MVCC model. The reason FDB doesn’t support this is really a skill issue on the part of the SQLite btree: it doesn’t know how to store versioned data. Poor thing.

XKS uses an LSM tree design that pushes database versions directly down into the storage engine. LSMs are a natural fit for this because compaction provides an opportunity to garbage-collect old versions. And if you design this wrong you end up with Postgres’s VACUUM disaster, where, fun fact, they originally designed it so that old tuples would never be garbage collected at all and then apparently found out that that is a bad idea (lol).
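
Hand-waving the garbage collection a bit: during a compaction merge you know the oldest version any live reader could still request, so anything older than that (beyond the newest surviving version of each key) can be dropped. Illustration only, nothing like the real code:

```elixir
defmodule VersionGC do
  # Entries are {key, version, value}. Keep every version a live reader might
  # still see, plus the newest version at or below the oldest live snapshot.
  def compact(entries, oldest_live_version) do
    entries
    |> Enum.group_by(fn {key, _version, _value} -> key end)
    |> Enum.flat_map(fn {_key, versions} ->
      sorted = Enum.sort_by(versions, fn {_k, version, _v} -> -version end)

      {newer, older} =
        Enum.split_with(sorted, fn {_k, version, _v} -> version > oldest_live_version end)

      # Of the versions no reader can distinguish anymore, only the newest
      # one is still reachable; the rest are garbage.
      newer ++ Enum.take(older, 1)
    end)
  end
end
```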

The XKS design is quite unique and I have been unable to find another real-world example of an LSM that works this way. I’m sure the idea must exist in research somewhere, but in practice it seems like people just add another version to the keys “in userspace” (see CockroachDB/Pebble). Maybe Spanner’s storage engine does this but we’ll never know because they only gave us a one-paragraph description (WHY).

That’s all for now.

7 Likes

Very thankful for my fuzzers when I have to refactor garbage like this. And no, I do not want to talk about wtf I was thinking when I wrote the first pass. I think I was just trying to get through it.

It’s becoming increasingly clear to me how much code quality matters, and also how much of it is a pure expression of skill and experience. I did not understand how to shape that code properly until I had to write a similar version a few more times elsewhere in the engine, and then it slowly dawned on me. That first attempt was so messy I literally had to sit down and reverse engineer it despite having written it only a few weeks ago.

It’s also becoming increasingly clear to me that very little of the difference can be automatically linted or checked. I have suspected this for a while (and I don’t like linters or formatters very much), but this is a pretty clear example. I doubt anyone who clicks that link will even be able to understand why the new version is better; not out of ignorance, but because it’s enormously context-specific. I doubt a top LLM could make much sense of it either, at least not well enough to make useful recommendations in advance.

I wrote a lot of messy code over the past week trying to get persistence of the remaining data structures to work. Now that the engine can save itself to “disk” (or so it thinks), it’s definitely time to go over everything and clean up before I move on to the fun parts. Though I do find refactoring to be quite relaxing as long as I have good tests, which is of course the hard bit.

I’ve started to get very good at scaffolding out large projects. I suspect junior devs think this ability is black magic, but it’s very much a skill that you hone with practice. Eventually you learn how to break the thing up and build it one piece at a time.

I’ve seen people suggest that they find LLMs useful for starting new projects, which I find concerning. If you don’t practice you will never learn how to do it yourself. Maybe fine if models are writing all of the code, but what do you do when you want to build something outside of the training set?

I think I should consider posting more devlog-shaped things. Maybe in a separate thread, or perhaps on the blog. Unclear.

2 Likes

LLMs are a great way to overcome the blank canvas terror. Or just make a draft so you can curate it. I don’t view remembering the exact layout of a typical Elixir project as an important skill.

In commercial programming, generating code and then massaging the generated output happens often enough for LLMs to be an objective net positive.

Took a quick look. It mostly seems aimed at making your recursive, accumulator-using (and potentially tail-call-optimized) functions more… idiomatic? Easier to work with? Not quite sure, but yeah, it’s a touch obscure code indeed. But deconstructing data and making the recursive functions easier to read is already a big win IMO.

How did fuzzers help? Did they expose edge cases?

1 Like

This is the claim I was referring to, and I’m not disputing its truth. What I’m suggesting is that there may be some dangerous second-order effects to look out for, particularly for junior programmers who do not yet have the skill set to design things themselves.

Obviously this is a spectrum, and at the far end of one side you have the most basic imaginable boilerplate. The stuff you get from mix new. Nobody needs to write that; we don’t even need LLMs to write that (we use mix new). But even then, there is probably some value to doing it at least once, just so you know where everything is.

As you move further down the spectrum, you might get something like mix phx.new. I still think these generators have some value, but even there it starts to get iffy and I know there are some who don’t like them. I don’t really like them either, to be honest. I think if I was designing a new framework (which, uh, maybe I am) I would probably not go generator-heavy. But it’s a personal choice.

However, on the far end of the other side of the spectrum you have green field designs. Stuff that you don’t actually know how to write because you’ve never done it before. Like, in my case, an LSM tree storage engine.

I cannot mix new an LSM tree. I have to figure out where the compaction goes, and where the k-way merge goes, and how the multiversion manifest is structured, and how they all fit together. This is hard because unfortunately there is no LSM instruction manual (believe me, I looked).

It’s this sort of scaffolding that I was referring to, not basic boilerplate. You have to figure out how to build these things one piece at a time. Like, you want to write the compaction, but first you need something to compact. So you write the memtable. Then you need a place to compact to, so you write the block store. Now you can write the compaction. Now you have tables, so you need the manifest, so you write that. Now you can compact tables into tables! But you need k-way merge for that. And so on.
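
In code-shaped terms the skeleton ends up looking roughly like this, built in roughly that order (module names made up, bodies are stubs):

```elixir
defmodule ToyLsm.Memtable do
  # In-memory sorted writes; gives compaction something to compact.
  def new, do: :gb_trees.empty()
  def insert(memtable, key, value), do: :gb_trees.enter(key, value, memtable)
end

defmodule ToyLsm.BlockStore do
  # Somewhere to compact *to*: immutable on-disk tables.
  def write_table(_sorted_entries), do: {:ok, make_ref()}
end

defmodule ToyLsm.Manifest do
  # Which tables exist, at which level, covering which key ranges.
  def add_table(manifest, level, table_ref),
    do: Map.update(manifest, level, [table_ref], &[table_ref | &1])
end

defmodule ToyLsm.Merge do
  # Placeholder: a real k-way merge streams from table iterators instead of
  # concatenating and sorting everything in memory.
  def merge(sorted_lists), do: sorted_lists |> Enum.concat() |> Enum.sort()
end
```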

If you’re an experienced programmer you probably think nothing of this, but the juniors (and realistically probably some “seniors”) don’t know how to do this. They think it’s magic. That’s what mitchellh’s article is about.

LLMs are much smarter than mix new, and they can actually produce somewhat plausible output if you tell them “write me an LSM tree in Elixir”. However, in doing so you will rob yourself of learning how to architect such a thing yourself. The model will make all of the most important architectural decisions for you, and it probably has no idea what it’s doing because there are not many LSM trees in the training set. At least not real ones.

That last point is a bit funny: I have seen people ask a model to write them Paxos and have it reply with a version that has the actual implementation stubbed out. Hilarious!

But there are mistakes, deeper mistakes, that you won’t be able to spot unless you’re familiar with the problem space. And how do you become familiar, except by doing it yourself?

1 Like

It is more than a touch obscure. To be clear, this is the sort of code where if I saw it in another codebase I might not even waste my time trying to understand it. One thing I’ve learned from reading a lot of database code is that there are times when it’s easier to derive what something must be doing from first principles than to actually read the code. Reading code sucks.

In this case the salient point is that even I had no idea wtf the code was doing and I just wrote it, so it was uniquely garbage.

Anyway, I will try to explain it just for fun. That code builds the index blocks for an LSM table. Each table in an LSM is a two-level structure (like a degenerate, immutable btree) where the first level holds the sorted key/value pairs and the second level holds an index entry for every, say, 64 KiB of data. The key/value pairs are divided into subtables every 64 KiB (these are traditionally called “blocks” rather than “subtables” but in my case the term was overloaded) and each subtable gets an index entry.

The entries are accumulated such that they end up in a structure of [{block_i, [{key, pointer}]}]. The block_i here really is a block and is not a subtable but this is not worth explaining further. The point is, it’s a nested list. And we need a nested loop.

“Idiomatic” is a funny thing because I think the most idiomatic way to solve this would be to flatten the whole thing into one list. The problem is that, as @Schultzer will tell you, this will cause a huge number of completely wasted allocations. So we want to traverse the nested structure.

My first attempt at this was absolutely disgusting. There was, like, a function head that noticed when one level of the loop was finished and popped it out. Just really confusing to look at.

Informed by versions of this code I had to write elsewhere, I realized that this was much more readable (and indeed likely faster) when implemented as two sets of recursive functions, i.e. as the functional analog of a nested loop. In hindsight this is obvious. But that’s the funny thing about hindsight.
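
To give a flavor of the shape (this is not the actual XKS code, and the entry encoding here is invented):

```elixir
defmodule NestedIndex do
  # Two recursive loops, one per level of the nested
  # [{block_i, [{key, pointer}, ...]}, ...] structure, so the inner lists are
  # never flattened into a throwaway list first.
  def build(blocks), do: outer(blocks, <<>>)

  # Outer loop: one step per block.
  defp outer([], acc), do: acc
  defp outer([{block_i, entries} | rest], acc), do: outer(rest, inner(entries, block_i, acc))

  # Inner loop: one step per {key, pointer} entry within the current block.
  defp inner([], _block_i, acc), do: acc

  defp inner([{key, pointer} | rest], block_i, acc) do
    entry = <<block_i::32, pointer::32, byte_size(key)::16, key::binary>>
    inner(rest, block_i, <<acc::binary, entry::binary>>)
  end
end
```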

BTW, I am making no claims that this code is now “good” and I’m sure I’ll end up rewriting it again. I think it may not be garbage anymore, though.

I also gave the variables better names. The important takeaway if anything is that stuff like that really matters and is not bikeshedding. I was having trouble reverse-engineering my own code because the variable names were not clear enough. I even wrote a bug while refactoring because of this (and the tests quickly caught it). The minutiae really do make a big difference when it comes to not writing bugs.

WRT the fuzzers, they helped because they are tests. Any test suite would help, it’s just that fuzzers are the only viable path to test code like this. You have to generate too much data just to get the thing to work at all, let alone the edge cases. The LSM fuzzers are still very immature, but even the most basic fuzzer runs circles around unit tests.
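
The general pattern is model-based: generate random operation sequences, run them against both the engine and a trivially correct model, and check they agree. This is nothing like the actual LSM fuzzers; ToyEngine below is a stand-in (backed by :gb_trees) for whatever is really under test, and it assumes the stream_data package:

```elixir
# Stand-in for the real storage engine; swap in the actual API under test.
defmodule ToyEngine do
  def new, do: :gb_trees.empty()
  def put(engine, key, value), do: :gb_trees.enter(key, value, engine)

  def get(engine, key) do
    case :gb_trees.lookup(key, engine) do
      {:value, value} -> value
      :none -> nil
    end
  end
end

defmodule ToyEngineModelTest do
  use ExUnit.Case
  use ExUnitProperties

  property "engine agrees with a plain map" do
    check all ops <-
                list_of(
                  one_of([
                    tuple({constant(:put), binary(), binary()}),
                    tuple({constant(:get), binary()})
                  ])
                ) do
      Enum.reduce(ops, {ToyEngine.new(), %{}}, fn
        {:put, key, value}, {engine, model} ->
          {ToyEngine.put(engine, key, value), Map.put(model, key, value)}

        {:get, key}, {engine, model} ->
          assert ToyEngine.get(engine, key) == Map.get(model, key)
          {engine, model}
      end)
    end
  end
end
```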

I’m sure this code is still loaded with bugs as I haven’t fuzzed it very aggressively yet. It’s solidifying.

1 Like

I’ll say: idiomatic to whom? If you were to follow Elixir’s own anti-patterns you would get slow and unoptimized code, mostly because most of the anti-patterns are not derived from hard science. To me that is an anti-pattern in itself. As it turns out, optimized code is always smaller and more readable, because it keeps only the essential abstractions and allocations. But none of this matters if you do not benchmark your code.

What people forget is that all these allocations here and there add up, and the compiler cannot magically make your code faster.

1 Like

It’s strange that hanoidb hasn’t been mentioned yet! I remember playing with it a long time ago :slight_smile:

1 Like

That is the question I was hinting at. In this specific case I do think flattening the list and traversing it would probably make the code a bit easier to understand. I can’t say for sure if that difference is inherent or a result of pre-conceived notions of what such code should look like. Probably both.

But in this case it would be an unacceptable performance cost. I am not seriously benchmarking yet (I will be soon), but I am not blind and I know what the BEAM would be doing if I flattened that list. Those cons cells gotta come from somewhere.

I am sympathetic to the idea that this is often true, but definitely not always. An extreme counter-example would be branchless algorithms, which are faster in many cases (because they sidestep branch misprediction) but are kinda inherently less intuitive.

Maybe that exception proves the rule, but in general I think if optimized code was always better we’d be writing assembly. Clearly there must be a tradeoff somewhere.

And that tradeoff is inherent to this project: if I wanted a database that was as fast as possible I would be writing it in Zig or something. But there are other factors: accessibility is very important, and running directly in the BEAM enables a level of extensibility that, to be honest, I think could be revolutionary. That wasn’t really my goal (I just wanted to store data lol) but the value proposition has slowly revealed itself.

1 Like

I believe a better analogy is math: to a skilled mathematician, formulas transfer far more information than English ever could. But to someone unfamiliar with math, it’s just gibberish.

I don’t believe we should all write assembly; some minimal abstractions are great, and the best I know is Erlang. I could never have built what I’m building if all the hard problems hadn’t already been solved for me, so that I just had to glue it together.

2 Likes

To use my favorite example, Lamport somewhat begrudgingly wrote Paxos Made Simple in plain English despite believing that algorithms like Paxos are best expressed through specs/code. The end result, of course, is that the name of the paper is widely mocked; as it turns out, no matter how you specify Paxos it remains extremely confusing. And to make matters worse the translation into English introduced a bug into the spec!

Exactly! Unfortunately the intervening four decades have, in my view, led to a rather large “strong consistency”-shaped hole in Erlang/OTP.

This is hardly the designers’ fault: Erlang actually pre-dates the first solutions to distributed consensus by two years, let alone the two decades it took for those results to become common practice.

Still, the fact remains that OTP contains no useful primitives for building distributed systems with strong consistency guarantees. No consensus, no replicated storage, no autosharding, no storage engine.

There is an (entirely valid) argument to be made that these things are out of scope, but unfortunately Erlang does provide weak/mediocre implementations of all of these things: :global, :mnesia, :mnesia, and :dets respectively. I think this is important because it shows that the motivation is sound, but the implementations have not kept up with the times. We can do so much better now.

That is essentially what I’m trying to do here with Hobbes. Except it turns out that instead of having separate primitives for “consensus” and “replication” and “storage” it’s way better to have one abstraction layer that provides all of those things at once. FoundationDB can do anything Etcd can do, and it can do it in heels (scale out).

The other big thing is testing. One of the reasons distributed systems have gotten a bad rap in the popular consensus (haha) is that building them properly necessitates a newer testing strategy that is not yet well-known, and here of course I am speaking of (deterministic) simulation testing. Simtests are an absolute game-changer for building distributed systems. Seriously, they make the impossible possible.

I have now proven the viability of simtesting on the BEAM (it 100% works in practice on a real codebase), and the next step is to try to improve the performance. I am working on a new design (a replacement for Construct) which I believe will do just that.
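
For anyone who hasn’t run into simtesting before, the kernel of the idea is that every source of nondeterminism is drawn from a single seeded PRNG, so any failing run can be replayed exactly from its seed. This is not Construct or its replacement, just a toy; step/2 stands in for “deliver one event to the simulated cluster”:

```elixir
defmodule SimSketch do
  # Toy only. All "randomness" (which event happens next, whether to drop a
  # message, whether to crash a node) comes from the simulator's own seeded
  # PRNG, never the wall clock, so a failing seed replays the exact same run.
  def run(seed, steps) do
    rng = :rand.seed_s(:exsss, seed)
    loop(initial_cluster(), rng, steps)
  end

  defp loop(cluster, _rng, 0), do: cluster

  defp loop(cluster, rng, steps) do
    {roll, rng} = :rand.uniform_s(100, rng)

    event =
      cond do
        roll <= 5 -> :crash_random_node
        roll <= 15 -> :drop_next_message
        true -> :deliver_next_message
      end

    cluster = step(cluster, event)
    check_invariants!(cluster)
    loop(cluster, rng, steps - 1)
  end

  # Stubs standing in for the real simulated cluster.
  defp initial_cluster, do: %{}
  defp step(cluster, _event), do: cluster
  defp check_invariants!(_cluster), do: :ok
end

# Reproduce a failure bit-for-bit: SimSketch.run(failing_seed, 10_000)
```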

All combined, I hope that I can provide a similar value proposition to Erlang: solving the hard problems for you. Or at least for me, because I am not in the mood for having to do any of this again any time soon lol.

1 Like