garrison

Hobbes - a low-level distributed database for the Elixir programming language

Hobbes is a low-level distributed database for the Elixir programming language.

Hobbes provides a simple, safe, and scalable storage layer for Elixir applications. It’s designed to scale horizontally, replicate data across machines, and handle disk or machine failures automatically and without disruption.

Hobbes offers a transactional key/value API which can model a wide range of data structures and indexes. Transactions can span the entire keyspace of a cluster, even across machines, and provide the strongest possible consistency guarantees (strict serializability) by default.

Elixir apps and libraries can use Hobbes to build systems that achieve modern standards of consistency, durability, and availability. Building distributed systems to these standards is notoriously difficult. Hobbes is a tool designed to make it easy.

You can find the source code for Hobbes on Tangled:

garrison

I think it’s time for a little update.

Storage engine development has been going very well. There is, mostly, a storage engine now. It’s named XKS (short for ExKeyStore, which I think is a hilarious pun) and you can find the code in the lib/xks directory. It’s unfinished and obviously still messy, but most of the constituent parts are now there and it’s just a matter of gluing them together.

For those unaware, a “storage engine” is just the part of a database that writes data to disk. Generally a storage engine exposes some level of abstraction (like “key/value store”) and atomic commit functionality. Strangely, storage engines are often completely separate pieces of software from the databases that utilize them. MySQL has InnoDB, MongoDB has WiredTiger, and so on. Some databases do have their own for various reasons: Postgres’s is quite bespoke and quite old (and quite bad), while TigerBeetle’s is very deeply integrated with their data model (it literally only stores fixed-length data, which is unique).

For those of you who remember some of my earlier comments on the topic (before Hobbes was formally announced), you may remember that I originally intended to just use SQLite as a storage engine and dodge this particular yak for now. In fact, that’s exactly what FoundationDB did for over a decade. They have since switched over to RocksDB, for some reason.

So why did I change my mind? Well, firstly at the time I just didn’t know how to write a proper storage engine, but as Hobbes’s development dragged on I ended up studying enough to fix that. But also, I came to realize that there are some areas where deep integration with the storage engine can actually simplify the implementation of Hobbes’s distributed features and improve correctness. And if there are any two things I like, they’re being lazy and not writing bugs. So, uh, yolo.

On the correctness side, rolling the entire storage engine from scratch means Hobbes can maintain zero dependencies (and if there’s anything I hate, it’s dependencies) and perform fully integrated simulation testing of the entire database as a whole. Whereas if I used something like SQLite, I would probably be mocking it out during testing, which is Not Great. Also, XKS is designed to be very resilient to corruption via comically aggressive cryptographic checksumming of the entire file à la ZFS or TigerBeetle; it’s a modern design.
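To make the checksumming idea concrete, here is a minimal sketch of a block that carries a hash of its own payload and is verified on every read. This is not XKS’s actual on-disk format (which isn’t described in this post); the module name, the layout, and the choice of SHA-256 are all assumptions made for illustration.

```elixir
defmodule ChecksumSketch do
  # Hypothetical layout: a 32-byte SHA-256 digest followed by the payload.
  @hash_size 32

  # Prefix the payload with its digest before it is written to disk.
  def seal(payload) when is_binary(payload) do
    :crypto.hash(:sha256, payload) <> payload
  end

  # Recompute the digest on read; any mismatch is reported as corruption.
  def open(<<digest::binary-size(@hash_size), payload::binary>>) do
    if :crypto.hash(:sha256, payload) == digest do
      {:ok, payload}
    else
      {:error, :corrupt}
    end
  end

  # Anything shorter than a digest cannot be a valid block.
  def open(_too_short), do: {:error, :corrupt}
end
```

A single flipped bit anywhere in the sealed block, digest or payload, makes `ChecksumSketch.open/1` return `{:error, :corrupt}` instead of silently handing back bad data.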

On the functionality side, FDB’s most unfortunate limitation is that read-only transactions can only be a few seconds long. Most of FDB’s limitations are Good Actually, including the limit on read+write transactions. Long transactions in an optimistic system (or really even a pessimistic system) are poison and to be avoided. But read-only transactions are different as they have no contention under an MVCC model. The reason FDB doesn’t support this is really a skill issue on the part of the SQLite btree: it doesn’t know how to store versioned data. Poor thing.

XKS uses an LSM tree design that pushes database versions directly down into the storage engine. LSMs are a natural fit for this because compaction provides an opportunity to garbage-collect old versions. And if you design this wrong you end up with Postgres’s VACUUM disaster, where, fun fact, they originally designed it so that old tuples would never be garbage collected at all and then apparently found out that that is a bad idea (lol).
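Here is a rough sketch of how compaction can double as garbage collection for versions. It is not the XKS implementation: the module, the `{key, version, value}` entry shape, the `:tombstone` marker, and the `horizon` parameter (the oldest version any reader may still ask for) are assumptions, and a real LSM would stream-merge sorted runs from disk rather than concatenate in-memory lists.

```elixir
defmodule VersionedCompactionSketch do
  # Entries are {key, version, value}; a delete is stored as the value :tombstone.
  # `horizon` is the oldest version any reader may still ask for.

  # Merge several runs into one, sorted by key and then by version (newest first).
  def compact(runs, horizon) do
    runs
    |> Enum.concat()
    |> Enum.group_by(fn {key, _version, _value} -> key end)
    |> Enum.sort_by(fn {key, _entries} -> key end)
    |> Enum.flat_map(fn {_key, entries} -> gc_versions(entries, horizon) end)
  end

  # Keep every version newer than the horizon, plus the newest version at or
  # below it (a reader at `horizon` still needs that one). The rest is dropped.
  defp gc_versions(entries, horizon) do
    newest_first = Enum.sort_by(entries, fn {_k, version, _v} -> version end, :desc)
    {live, old} = Enum.split_while(newest_first, fn {_k, version, _v} -> version > horizon end)
    live ++ Enum.take(old, 1)
  end

  # Read `key` as of `read_version`. Versions are newest-first within a key,
  # so the first match is the newest version visible to this reader.
  def get(entries, key, read_version) do
    entries
    |> Enum.find(fn {k, version, _v} -> k == key and version <= read_version end)
    |> case do
      nil -> :not_found
      {_k, _version, :tombstone} -> :not_found
      {_k, _version, value} -> {:ok, value}
    end
  end
end
```

In this model a long-running read-only transaction only holds the horizon back: old versions stick around a little longer, but nothing blocks and nothing needs a separate vacuum pass.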

The XKS design is quite unusual, and I have been unable to find another real-world example of an LSM that works this way. I’m sure the idea must exist in research somewhere, but in practice it seems like people just add another version to the keys “in userspace” (see CockroachDB/Pebble). Maybe Spanner’s storage engine does this but we’ll never know because they only gave us a one-paragraph description (WHY).

That’s all for now.

garrison

I am indeed familiar with Khepri, and I think both it and Ra are important contributions in the direction of strong consistency on the BEAM. I have remarked before that it is quite strange there is no consensus primitive in OTP (e.g. a Paxos implementation), and Ra is literally that. BTW, Erlang is actually older than the first working consensus algorithms (Viewstamped Replication and Paxos).

Ra and Khepri are not, however, sufficient for my goals in particular.

MultiPaxos-style replicated log databases like Khepri are not meant to scale out and are designed to store a pretty small amount of data. Their tradeoffs also require them to store an unnecessary number of copies of the main dataset, which is fine for a small dataset but very bad at scale. Khepri also happens to be an in-memory database (the entire dataset is in RAM), which is not an architectural limitation but a tradeoff they’ve decided to take (which I’m sure is fine for their use-case).

Databases like this (see ZooKeeper, etcd, Consul) are generally used as control planes rather than used to store the main dataset. The problem with this approach is that you still have to build the entire distributed database that stores the main dataset yourself. Something like ZooKeeper is maybe 5% of an actual database.

Hobbes inherits from FoundationDB’s architecture. FDB is a reconfiguration system which is explicitly designed to store large datasets but provides a very open-ended data model. So FDB is maybe 80% of a database, but it solves nearly 100% of the “hard problems” of building a distributed database. Correctness is very hard, and FDB provides an abstraction which is correct and scales out of the box.

As I’ve mentioned in the past, I am interested in building tooling to replace things like Postgres, S3, and so on. I need an abstraction which can scale up to “real” datasets so that I don’t have to keep solving the same distributed problems over and over again. I want to solve them once, because they are very hard.

Hobbes is designed to provide strong consistency guarantees while storing several orders of magnitude more data than something like Khepri (and serving equivalently more traffic). Architecturally, the difference in complexity to meet that requirement is quite substantial, but that is what achieves my goals.

If you’re interested in the tradeoffs here, check out this excellent article which covers some of them.

garrison

Fuzzer caught a pretty good one today.

This code builds blocks for the on-disk Manifest Log and rotates the current block when it fills up. Unfortunately there was a typo: instead of prepending the new block to the blocks accumulator, it prepended the new block to the remaining entries.

I wonder if there is a name for this sort of bug, which is obviously wrong but survives due to sheer luck. It just so happens that, technically, a single entry is a valid block, so in the right circumstances this is not completely broken. The fuzzer was able to find the right circumstances, but only when enough entries were running through it to trigger the condition. Ironically, there was a comment right above it warning me to be careful to fuzz that line. Good call! lol

This is also a great example of why I’m trying to be a lot more careful about naming variables. If the other tail had been named entries_rest (a convention I have since adopted, but only after this code was written) such a typo would be much harder to miss. Naming things carefully and caring about code structure are actually very important!

The reason this was finally caught today (this code must be 1-2 months old) is that I finally added updates/deletes to the log, which introduced a dependency on order for the first time. There was another bug caught by that change as well, where the list of blocks should have been reversed. This is why I always name reversed lists with a _reversed suffix, but apparently I missed that one. Thankfully the fuzzer did not.
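The actual Manifest Log code isn’t shown in this post, so the following is only a hypothetical reconstruction of the pattern: a recursive builder that rotates a full block onto an accumulator. Every name here (the module, `max`, `block_reversed`, `blocks_reversed`) is made up for the sketch, but it uses the entries_rest and _reversed conventions described above so that both mistakes stand out.

```elixir
defmodule BlockBuilderSketch do
  # Split `entries` into blocks of at most `max` entries, preserving order.
  def build(entries, max) when max > 0 do
    do_build(entries, max, [], [])
  end

  # No entries left and no partial block: just restore block order.
  defp do_build([], _max, [], blocks_reversed), do: Enum.reverse(blocks_reversed)

  # No entries left: flush the partial block, then restore block order.
  # Forgetting the outer Enum.reverse/1 is the second bug described above.
  defp do_build([], _max, block_reversed, blocks_reversed) do
    Enum.reverse([Enum.reverse(block_reversed) | blocks_reversed])
  end

  # Current block is full: rotate it onto the blocks accumulator.
  defp do_build(entries_rest, max, block_reversed, blocks_reversed)
       when length(block_reversed) == max do
    block = Enum.reverse(block_reversed)
    # One hypothetical shape of the typo: `[block | entries_rest]` here.
    # It is still a list, so nothing crashes; the block just lands in the
    # wrong place.
    do_build(entries_rest, max, [], [block | blocks_reversed])
  end

  # Room left in the current block: take the next entry.
  defp do_build([entry | entries_rest], max, block_reversed, blocks_reversed) do
    do_build(entries_rest, max, [entry | block_reversed], blocks_reversed)
  end
end
```

With the suffixes in place, `[block | entries_rest]` reads as obviously wrong at a glance, whereas two similarly named tails would not.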

These bugs together took hours to track down, btw. Yeah.

Anyway, the storage engine is coming together. I’ve rewritten most of the code since my last update, and it’s no longer garbage. I also suffered a brief bout of iterator psychosis if you missed that.

I’ll post a proper update (maybe on the blog instead) once it’s time to move on to integration, which will be soon. I’m at the stage where I have to keep a TODO list just to remember what’s left. You know how it is.

P.S. As a bonus bug, it turns out update_counter applies the update to the default value when the key is missing, unlike Map.update/4 and others, which use the default directly. Oops.
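Assuming this refers to `:ets.update_counter/4` (the post doesn’t say which update_counter), the difference is easy to see side by side. The table name and key below are made up for the example.

```elixir
table = :ets.new(:counters_sketch, [:set])

# Key is missing: ETS inserts the default tuple {:hits, 0} and THEN applies
# the increment, so the stored counter ends up at 1.
ets_result = :ets.update_counter(table, :hits, 1, {:hits, 0})

# Key is missing: Map.update/4 stores the default as-is and does not call
# the update function, so the value stays 0.
map_result = Map.update(%{}, :hits, 0, fn count -> count + 1 end)

IO.inspect(ets_result) # 1
IO.inspect(map_result) # %{hits: 0}
```

So a default that is correct for Map.update/4 is off by one increment when passed to `:ets.update_counter/4`.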
