I’ve been building a distributed key-value store in Elixir/OTP called Bedrock. It implements FoundationDB’s architecture but with a twist - instead of running as separate database servers, the components run embedded within your application’s supervision tree.
It’s a key-value store with:
ACID transactions with strict serializability
MVCC with snapshot isolation (repeatable-read + read-your-writes)
Write-ahead logs for durability + storage servers for serving reads
Components that run as OTP processes inside your app
The architecture follows FoundationDB’s model closely - resolvers, log servers, storage servers, commit proxies, sequencers, etc. The design is modular, so you can swap out or experiment with individual components without reworking the whole system. The interesting part is exploring what happens when these components run embedded in your application rather than as external services. Local reads can potentially happen at memory speed while still maintaining these strong consistency guarantees.
Current state: The core transaction pipeline is working. You can spin it up in a Livebook for experimentation or run it across a full cluster for real distributed transactions. It’s designed to scale from in-memory test instances, to local development through to production clusters. Recovery works when components fail, though there are still rough edges and optimization opportunities.
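To give a feel for the shape of the thing, a transaction looks roughly like this (a sketch only – treat the module and function names here as illustrative rather than the final API; the docs have the real interface):

```elixir
# Illustrative sketch -- transact/1 and the exact key/value conventions
# are placeholders, not necessarily Bedrock's current API.
Bedrock.transact(fn tx ->
  :ok = Bedrock.put(tx, "user/42/name", "Ada")
  {:ok, name} = Bedrock.fetch(tx, "user/42/name")
  name
end)
```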
If anyone’s interested in distributed systems or has experience with FoundationDB, I’d love to get your thoughts on the approach. The modular design makes it pretty easy to experiment with different implementations if you want to try out ideas.
GitHub:
Happy to answer questions or hear what you’d like to use it for. PRs are always welcome!
Hey, someone else has been doing this?! Looks like beautiful work! I’ll have to spend some time looking through the code later.
So glad to see FDB’s architectural influence growing. The design is just so good, and it maps really well onto the BEAM; probably because the FDB guys were somewhat inspired by Erlang and the actor model.
Amazing work! Do you have a roadmap of things to work on? I would gladly contribute to it; this is something that has been on my mind for a good while.
One really important thing that is missing here, and in my opinion the main feature that makes FDB FDB (as well as TigerBeetle and Antithesis), is simulation testing. From a glance it seems that nothing of the sort is implemented (correct me if I’m wrong); is it something you have in mind? That system is pretty much the foundation of those databases, and in FDB’s case it was essentially the first thing implemented, before anything else.
This looks very cool – congrats on the milestone release!
I ran through your Livebook and unit tests, and read some code, and everything worked as promised. Excited to see where this project goes – do you have any recommended starting points for an interested potential contributor like myself?
What are your thoughts on a storage engine? Is it possible to give the same Elixir treatment to FDB’s redwood?
Last thought, as the Layer guy: The cool thing about Data Layers is that it should be totally reasonable to swap out FDB with an API-compatible backend like this one, with a little glue code.
Yep, you’re right. No simulation testing at this time… and frankly, nowhere near enough testing, period, unit or otherwise – it’s early days. I’ve been trying to think of ways to go about it, and I’ve not come up with anything I was happy with… I just don’t know enough about the internals of the BEAM to make it go. I’d be really interested to hear how you might go about it, if you have ideas!
Thanks for taking a look! When I read the FDB paper a few years ago, I thought exactly the same thing: “This would be so neatly expressed in Elixir/OTP.”
I made a #bedrock channel on the elixir-lang slack, if you (or anyone else!) wants to ask questions or nerd out.
This looks very cool – congrats on the milestone release!
Thanks! I’m an admirer of your ecto-foundationdb project… It kind of pushed me into building this, actually.
I ran through your Livebook and unit tests, and read some code, and everything worked as promised. Excited to see where this project goes – do you have any recommended starting points for an interested potential contributor like myself?
I’m glad to hear it all worked!
Absolutely there are places to contribute. I’m working on a list of things in the “issues” on GitHub, and I’m very open to more ideas. I have a bunch of write-ups in the “guides” section of the repo that break out how the various components interact and how processes (like recovery) currently work. I’m actively working to improve the docs to make the project more accessible.
Things that could be fun:
An ephemeral version of the log / storage servers (simple, backed by :ets) that could be used by end-users to run unit tests without needing to configure disk space, etc. (see the sketch after this list).
Storage recovery. There’s no mechanism to replace/clone a completely failed storage instance.
Data Distribution. There’s no mechanism to split or coalesce shards, and :dets (which underpins the example storage engine I have in there now) is limited to ~2GB, making the system as a whole limited to ~2GB.
Testing. It isn’t glamorous, but this is a db, and people are going to expect it to work. There’s a fair bit to do here. I plan to spend a bunch of time working on this to make sure that what’s already done works.
Configuration. I can foresee some mix tasks to allow easy access to change the parameters and configuration of a running instance. It’d be pretty dope if you could start out with an ephemeral single-node system and configure it into a multi-node cluster without stopping it.
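For that first :ets item, here’s a very rough sketch of what an ephemeral engine could look like – the module name, callbacks, and mutation format are hypothetical stand-ins, not the actual storage behaviour in the repo:

```elixir
defmodule EphemeralStorageSketch do
  @moduledoc "Hypothetical sketch of an :ets-backed ephemeral storage engine."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
  def fetch(pid, key, version), do: GenServer.call(pid, {:fetch, key, version})
  def apply_mutations(pid, muts, version), do: GenServer.call(pid, {:apply, muts, version})

  @impl true
  def init(_opts) do
    # An ordered_set keyed by {key, version} lets us find the newest
    # value at-or-below a given read version.
    {:ok, :ets.new(:ephemeral_storage, [:ordered_set, :private])}
  end

  @impl true
  def handle_call({:fetch, key, version}, _from, table) do
    # Largest {key, v} with v <= version, or :not_found.
    reply =
      case :ets.prev(table, {key, version + 1}) do
        {^key, _v} = found -> {:ok, :ets.lookup_element(table, found, 2)}
        _ -> {:error, :not_found}
      end

    {:reply, reply, table}
  end

  def handle_call({:apply, mutations, version}, _from, table) do
    for {:set, key, value} <- mutations, do: :ets.insert(table, {{key, version}, value})
    {:reply, :ok, table}
  end
end
```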
What are your thoughts on a storage engine? Is it possible to give the same Elixir treatment to FDB’s redwood?
The storage system is very pluggable. I’ve done some research into Redwood, and I think we can actually do better in some ways. The state of the art has advanced considerably in the db world since it was introduced. I’d like to try mixing some of the ideas behind Aerospike (in-memory indexes) and TiDB (blob-backed storage with LSM-tree compaction outside the critical path).
Another thought was that it might make sense to have multiple types of storage / log engines at play, at once. You could imagine a memory-only engine for high-speed reads. Another that persists data out on blob-stores like GCS for extreme durability. Another (like redwood) that would allow versioned reads going back further than 5s, to support snapshot reads that run much longer while remaining consistent.
Last thought, as the Layer guy: The cool thing about Data Layers is that it should be totally reasonable to swap out FDB with an API-compatible backend like this one, with a little glue code.
This was also my thinking. I could very much see adding higher-level Layers like SQL (perhaps based on ecto_foundationdb), job queues like Apple’s QuiCK, etc. It would be wonderful if we could have these things all work within the same transactions – a very exciting prospect.
Exactly! I’ve been working on something fairly similar for a similar amount of time and it looks like we’re probably going to end up going public on here within a few weeks of each other (I’m still prepping), which is pretty hilarious! I will definitely give the codebase a full read and give some thoughts if you like. I’m confident between myself, you, and @jstimps we can achieve Elixir FDB supremacy
If I might offer one early point of feedback: as someone who naturally clicks directly to the “docs/architecture” section of any repo, I’m finding the internals docs to be rather verbose (I assume LLMs are involved?). That style is okay for a README and such, but for the really technical docs I’d recommend seriously cutting down on the prose, as the signal-to-noise ratio can suffer.
Yeah, you’re not wrong. It’s just been me… and technical writing isn’t a super-power of mine. So, I’ve been relying on AI tooling to help, and it could definitely do with some tightening up. With the basics now in place, I plan to spend some time working on that.
I have been looking into this, it’s definitely not easy. A storage engine benefits from a low-level language like C/Zig/Rust for a number of reasons. Erlang also apparently lacks direct I/O support, which I found pretty surprising (even Node supposedly supports O_DIRECT, though I honestly wonder if anyone has ever used it). But maybe the mandatory aligned buffers are a problem, I don’t know. I’ve been meaning to make a thread about that.
There are a couple of pure BEAM storage engines, CubDB in Elixir (which is not really suitable) and leveled, the backing store for modern OpenRiak. leveled is an LSM written in Erlang.
For the record, Apple seems to be spending quite a bit of dev time trying to replace Redwood with RocksDB if you pay attention to the commits, though of course in typical FDB fashion they have said absolutely nothing about why. I think Redwood might have been a Snowflake project and it doesn’t seem like they contribute anymore. Apple is probably storing considerably more data in FDB than Snowflake so if I had to guess I’d say they want to cut down SSD costs with an LSM as Btrees can be faster but are more wasteful (SSDs are consumables at scale).
Redwood is a pretty recent storage engine. It might actually be the most recent Btree storage engine out there, at least of those operating at any appreciable scale.
Do you have more info about this? I don’t know much about TiDB.
I think LSM compaction is generally outside the critical path, though, which is why you get into so much trouble when it falls behind. There is actually one exception I’m aware of: TigerBeetle’s LSM deterministically interleaves compaction with writes, which is very clever.
Yeah, the closest you get in erlang is :raw, which just isn’t the same – you still have the OS buffer-cache in the mix.
Another big issue is the copy-on-send semantics of the BEAM itself, and the fact that it’s hard to share any type of data between processes (that isn’t a binary >= 64 bytes) without copying it… so an in-memory index structure would have to be designed so that one process can own the structure for a shard, find a binary page quickly, and hand that off to other processes to finish the search.
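A sketch of what I mean, leaning on the fact that large binaries live on a shared, reference-counted heap (the names here are made up for illustration):

```elixir
defmodule PageIndexSketch do
  # One process owns the index for a shard. Because binaries over 64
  # bytes are reference-counted on a shared heap, handing a page binary
  # to the caller sends a pointer, not a copy; the caller then finishes
  # the search within the page in parallel with other readers.
  use GenServer

  def start_link(pages), do: GenServer.start_link(__MODULE__, pages)
  def find_page(pid, key), do: GenServer.call(pid, {:find_page, key})

  @impl true
  def init(pages), do: {:ok, pages}

  @impl true
  def handle_call({:find_page, key}, _from, pages) do
    # `pages` maps each page's first key to its (large) page binary;
    # pick the greatest page whose first key is <= the lookup key.
    page_key =
      pages |> Map.keys() |> Enum.sort() |> Enum.take_while(&(&1 <= key)) |> List.last()

    {:reply, Map.get(pages, page_key), pages}
  end
end
```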
I’d noticed them quietly doing this, too. I saw one thread where part of the reasoning was the ability to just ship around files instead of always having to copy shards through the protocol. Nothing stops anyone from making a rocksdb-based storage server implementation on Bedrock. It’d be pretty easy.
LSM trees have their own design issues, though. Compaction is greedy for space: worst-case, you need 2GB free to compact two 1GB levels. Since compaction proceeds head-to-tail, and there’s no way to truncate a file from the head, you basically have to keep the old files until you’ve completely built the new one.
It is, and b-trees are fantastic. No argument. They’re just not the only game in town. There are LSM trees (as mentioned) and then there’s Aerospike which takes a very different approach and uses neither. WiscKey is another interesting take on optimizing for SSDs.
The BEAM definitely has some weaknesses that make direct implementation of things like b-trees problematic anyway (no good way to do in-place rewrites of the pages in memory is the big one)… but the BEAM has some strengths, too. The enforced immutability makes it possible to safely use some pretty extreme structure-sharing between versions of an in-memory index… and it also allows for some ridiculous levels of memory-safe parallelism. I think that with the shift to SSDs, the ability to issue 30+ parallel reads may be an interesting way to offset those weaknesses. The WiscKey paper has some really good numbers on this.
The other thing to think about is that reading from the disk (SSD) isn’t always the bottleneck anymore. Pushing things across the network is. So, if we can pull the storage closer to the processes that are using it, we could see gains that more than offset the difference.
As it is generally implemented, yeah. Systems like TiDB don’t do compaction on the storage nodes, though, they write new levels, and at the same time, push those to other machines for compaction. When newly compacted levels are available, they pull them back and just delete the intermediates. It’s a clever way to throw more iron at the problem and it works into their replication/persistence strategy.
Yeah, that’s another reason. The btrees (Redwood) actually take data movement as an opportunity to defragment the tree so they would lose that by physically copying shards. LSMs have a natural anti-entropy mechanism (the compaction) which makes that less of an issue. Another advantage is that you can easily push the MVCC down into the LSM and then GC old keys during compaction. You can do this with a btree as well but it’s a little less natural and there would be a lot more space-amp. In practice you can design btrees with log-structured nodes, though, so there’s a spectrum at play here.
Oh yes absolutely. WiscKey’s learnings were eventually incorporated into RocksDB. I’ve always loved that paper not only because it’s great research but also because it has a fantastic LSM tree overview in the introduction.
Btrees benefit hugely from parallel reads as well, of course. One of the big wins of Redwood iirc was keeping a much deeper queue than the old SQLite btree.
Strangely, I think network bandwidth has exceeded disk again in recent years (800Gb Ethernet). But latency still matters a lot in practice, of course, so it’s never that simple.
That is very interesting, thanks. I guess they’re doing the “disaggregated storage on S3” thing so it doesn’t really matter where the tables get compacted. Interesting implications for a multitenant system!
Nice! I am also building a scalable messaging protocol, and we are about 90% complete. The Elixir supervisor tree is fantastic, and OTP is a real game-changer. We built this on top of Elixir Cowboy due to its legacy support for low-level TCP transport.
This application is one of the most unique systems I have ever built. Elixir gives me the flexibility to combine multiple GenServers to manage child-level dynamic supervisors and a mother-level supervisor.
I will come share my experience one day on the forum. The application root folder is named after me and includes a special ID called eid_epoi_id. It can be clustered using the Horde dependency, which allows the use of a global registry. The global registry helps scale the application for load balancing and consistency.
Hi, thanks for the library, it’s really nice. Finally, some good and well-documented Elixir to read! I love how Finalization is written.
I have some comments:
It seems that you spawn a TransactionBuilder server on every transaction (and every operation has to be made in a transaction explicitly), and it is passed around like a pid. Wouldn’t it make more sense to just make it a structure instead of a gen_server? Every operation on the transaction builder is just a mutation of its internal state, which is accumulated until commit happens, so it could be a plain structure. One could argue that having a server makes it possible for multiple processes to work inside one transaction, but why would anyone do this? And even if they did, they would still need to do explicit synchronization on their side, since the put operation is a cast to the TransactionBuilder. You may say “batch insert operations processed in parallel”, but that’s also much easier to hand over to users so they just perform the batching themselves.
It seems that every call to put or fetch requires a call to the sequencer, which is a repo-wide singleton. Wouldn’t it make more sense to utilize an ets table in this case? I mean, leave the singleton, but expose the known_committed_version in ets. This would remove the unnecessary global ordering of every read operation.
Possible optimization: you could keep your Tx mutations sorted so that range operations become faster (you would only need to discover the upper bound in the list and then exit the recursive function, instead of traversing the whole list).
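In sketch form (assuming writes are kept as a list of {key, value} pairs sorted by key):

```elixir
defmodule TxRangeSketch do
  # With writes kept sorted by key, a range read stops at the first key
  # past the upper bound instead of walking the whole mutation list.
  def writes_in_range(sorted_writes, min_key, max_key) do
    sorted_writes
    |> Enum.drop_while(fn {key, _value} -> key < min_key end)
    |> Enum.take_while(fn {key, _value} -> key < max_key end)
  end
end
```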
It seems to me (but I am not quite sure) that it is possible to get into this situation: read “x = 1” in transaction 1, put “x = 2” in transaction 2, commit transaction 2, then range-read “x = 2” in transaction 1, because it looks like range reads don’t account for the Tx’s reads, since the tx_visible variable contains only Tx.writes. I see that Tx.get_range is not used anywhere, so I am sorry for commenting on WIP code.
It also seems that functional tests like “read after write” within a single transaction, from the database user’s perspective, are missing (or I missed them).
I don’t understand the logic behind the finalization of empty batches in CommitProxy. If we don’t have any transactions, why do we need to perform this unnecessary operation of pushing no information to log servers, etc.?
I can see that you use the reply_fn = fn x -> GenServer.reply(from, x) end pattern, but I think it would be better to just send the from along and call GenServer.reply where it’s needed. But that’s a nit.
It seems like links to docs in @moduledoc’s are incorrect.
I can see that you use :dets as the key-value storage, but dets comes with its own internal limitations; for example, it can close without the ability to recover on open (I’ve experienced it myself several times):
From the dets docs (stdlib v7.0.2): “Dets tables are not properly closed if the Erlang runtime system terminates abnormally.”
I can see that you use a lot of do_something_fn = Keyword.get(opts, :do_something_fn, &do_something/1) in the code, and you change these options only in testing. I’d suggest using a library like Repatch to patch these functions in tests, instead of slowing down your critical-path code with unnecessary branching that will never be used in production.
P.S. The code was not very easy to read. Imports make it difficult to navigate to the function implementation (even with LSP), and single-letter variables and the heavy use of t make it hard to understand what functions do when I get to them. I can see that you tend to split a single gen server’s code into multiple files for state, server, etc., and that’s new to me, since I am more used to the OTP and Elixir code style where everything related to a gen server resides in a single module and file (no matter how big it gets).
And I also find it interesting that you defined helpers for GenServer.cast, GenServer.call, and callback tuples like def noreply(state), do: {:noreply, state}. I’m interested to hear the story behind them.
That’s all for now, I will read some more code tomorrow. And yeah, I am preparing a small PR with fixes and NITs I found while reading (like Enum |> Enum optimizations and such).
I am not the author but I am very familiar with this architecture (because I have been building something similar) so I can answer a couple of these. Note also that this project is, of course, incomplete and there are several parts of it (from what I’ve seen reading through) that are still toy implementations. One neat thing about FDB’s unbundled architecture is that it’s possible to implement it as a toy and then flesh out the services piece by piece, which is what I did and what the OP is clearly doing as well - perfectly reasonable.
It would, but a pure functional API is very dangerous because if you accidentally “lose” any of the intermediate structs (even on reads) you could end up with very serious consistency violations. A struct in the process dictionary would be ideal for most users IMO.
The Sequencer is a cluster-wide singleton and has to be called over the network so ets would not be appropriate. All calls to it (for read or commit versions) are aggressively batched in FDB. If that’s not the case here it may simply be because the project is still WIP.
Also, I don’t think it’s possible to implement an atomic get+update with concurrent ets, which would be necessary here.
I don’t understand this one but I’m curious. Can you elaborate?
In FDB the empty batches are used to propagate metadata mutations between the commit proxies (via the resolvers). I don’t think this project has implemented much of the metadata/shard map system yet but the empty batches are needed to prevent the shard map cache from becoming too stale.
The empty batches also propagate new commit versions to the tlogs, which propagate them to the storage servers. This information has to keep flowing or the cluster would essentially stall out. The 5s transaction duration limit uses the versions as a proxy for elapsed time, so if the read versions stopped advancing all commits would fail because they would be too old.
It certainly could be done that way, but there are some drawbacks. The biggest is API ergonomics. Take an operation like get(key): unlike a Map, the get operations here have side-effects (the fastest storage servers are located and cached, etc.). If done in a purely functional style, the return from get would need to look like {result, tx}. Pipelining would be extremely clunky for operations like put, breaking the idea that Bedrock is like some really big, durable, transactional map.
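To make the contrast concrete (hypothetical names, same idea):

```elixir
# Purely functional style: the tx must be threaded through everything,
# and "losing" an intermediate tx (even on a read) is a silent bug.
{value, tx} = Tx.get(tx, "a")
tx = Tx.put(tx, "b", value)
{other, tx} = Tx.get(tx, "c")

# Process-backed style: reads like a plain Map, nothing to lose.
value = Bedrock.get(tx, "a")
:ok = Bedrock.put(tx, "b", value)
other = Bedrock.get(tx, "c")
```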
There’s an argument for using the process dictionary instead of a GenServer, for similar effect, but spinning up a new process is pretty cheap with the BEAM. I have some ideas that require managing timers and timeouts that pretty much require the use of another process… so, we’ll see what the future holds.
Here’s how it works. The first read gets a read version from the sequencer. All of the subsequent reads for this transaction (and all nested transactions) will use this same version. This is one of the ways that Bedrock fulfills its guarantee of repeatable reads and snapshot isolation: all reads for a key at a version will return the same value.
When a transaction is committed, if any reads were performed, the read version is used along with the keys that were read to check whether any transactions modified those keys in the intervening time – this is part of what the Resolver does. If the keys that were read have been modified, the transaction is aborted and restarted (and it will then pull a higher read_version and so will see the newly modified data).
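In sketch form, the resolver’s check amounts to something like this (simplified; recent_commits stands in for whatever structure actually tracks recently committed writes):

```elixir
defmodule ResolverSketch do
  # Abort if any key this transaction read was written by a transaction
  # that committed after our read version.
  def conflict?(read_keys, read_version, recent_commits) do
    Enum.any?(read_keys, fn key ->
      case Map.fetch(recent_commits, key) do
        {:ok, last_write_version} -> last_write_version > read_version
        :error -> false
      end
    end)
  end
end
```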
The sequencer is a cluster-wide singleton, but its job is extremely simple. FoundationDB, the pattern on which Bedrock is based, uses the notion of “GRV” proxies, read-version proxies that spread the load. Bedrock may introduce a similar concept that could make use of :ets tables as you describe on each node. For now, calling the Sequencer directly is simple and effective.
There are so many opportunities for optimization in front of us. Currently, the focus is on getting all of the basic operations right, increasing test coverage, getting the documentation right, and working towards getting the system into a place where people might actually dare to use it for real work.
If you see something you want to improve, though, we’re open to a PR!
The way transactions were being assembled in 0.1 left holes in the “reads” and “writes” that could slip through the Resolver if we tried to add range-reads into the mix. This only worked because 0.1 just didn’t do range operations. Take a look at develop, where 0.2 is coming together, though! I’ve recently (yesterday) overhauled the machinery for building transactions (and the transaction format as well) with an eye toward making range operations like range-reads and clears easier to implement, along with other fun possibilities.
Interestingly, in this system, the read-your-writes all happens entirely within the transaction builder. If in a transaction I read key “a”, and get “apple”… and within that transaction write “aardvark” to “a”, then any subsequent read (or range read!) should return the “aardvark” value for “a”. The rest of the system doesn’t see any of this. When a transaction is committed, passes resolution and is pushed to the logs… any subsequent read using that commit version would see the new value for “a”. If that transaction is rolled back, nothing needs to be done or communicated – we just throw the process away.
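In sketch form (read_from_storage/2 and storage_range/3 are hypothetical stand-ins; the real Tx module is more involved):

```elixir
defmodule TxSketch do
  # Read-your-writes, sketched: local writes shadow the storage snapshot.
  def get(tx, key) do
    case Map.fetch(tx.writes, key) do
      {:ok, value} -> {value, tx}
      :error -> read_from_storage(tx, key)
    end
  end

  def get_range(tx, min_key, max_key) do
    # Merge local writes over the storage snapshot at the read version;
    # the transaction's own writes win.
    storage = storage_range(tx, min_key, max_key)
    local = for {k, v} <- tx.writes, k >= min_key and k < max_key, into: %{}, do: {k, v}
    {Enum.sort(Map.merge(storage, local)), tx}
  end

  # Stand-ins for real reads against storage servers at tx's read version.
  defp read_from_storage(tx, _key), do: {nil, tx}
  defp storage_range(_tx, _min_key, _max_key), do: %{}
end
```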
This is kind of an interesting thing that FDB does, and I shamelessly copied it!
So they have their famous 5s window for transactions, right? This brings us to a question: How do all of the machines involved know when that window advances?
They all have different clocks, and there could be all kinds of skew… so you can’t use local wall-clock time – synchronizing time is a hassle or requires super-expensive equipment (like Spanner). So clock synchronization is out.
Here’s what they (and we) do (and I think it is super cool):
The sequencer issues versions in microsecond increments. This is monotonic time, and so will always be increasing. You can think of it as uptime for the cluster, because versions only advance when the cluster is in a running state. So, all of these servers don’t need to track their own time, they can just look at the most recent version to come along, and subtract ~5 million, and there’s your window. There’s a catch, though: This only works if transactions are happening. If the system is quiescent, nothing happens, nothing advances, the window doesn’t move.
Commit proxies will issue an empty transaction if no transactions have been processed within the last 1000ms. This advances the version, and slides the window ever-forward. As a bonus, it serves to “tire kick” the transaction system to ensure that the sequencer/resolvers/logs/ etc. are all reachable and functioning. Any error triggers a recovery.
You get a lovely three-for-one out of this: ensuring the parts are working, advancing the window, and synchronization.
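Sketched out (commit_empty_batch/1 and the state fields are stand-ins – this is the shape of the mechanism, not Bedrock’s actual code):

```elixir
defmodule CommitProxySketch do
  use GenServer

  @window_in_versions 5_000_000  # 5s at one version per microsecond

  # "5 seconds ago" is just the latest version minus 5 million.
  def oldest_readable_version(latest_version),
    do: latest_version - @window_in_versions

  @impl true
  def init(state) do
    Process.send_after(self(), :heartbeat, 1_000)
    {:ok, state}
  end

  @impl true
  def handle_info(:heartbeat, state) do
    # If nothing committed since the last tick, push an empty batch so
    # versions keep advancing (and the pipeline gets its tires kicked).
    state =
      if state.committed_since_last_tick? do
        %{state | committed_since_last_tick?: false}
      else
        commit_empty_batch(state)
      end

    Process.send_after(self(), :heartbeat, 1_000)
    {:noreply, state}
  end

  defp commit_empty_batch(state) do
    # Hypothetical: run an empty batch through sequencer -> resolvers ->
    # logs; any failure along the way triggers recovery.
    state
  end
end
```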
Indeed.
Yep. Broken all over the place. It’s a big project, and it’s just me, and it’s hard to keep all of the references up-to-date as things are in flux. As the design settles down, and the number of people contributing (hopefully) grows, I imagine that things like this will be sorted out.
Yep. The storage server that’s in there now, “Basalt,” is a place-holder, an example. I used :dets in it because it’s simple and it’s built into the BEAM. It could easily be replaced with some other k/v store (like rocksdb), but this works to let people try out the system and do so with minimal fuss, even in a livebook.
One interesting thing to point out about this model, though: there doesn’t have to be just one kind of storage server in operation at a time. I could easily see using a mix of them for different properties. Maybe a memory-only one that doesn’t guarantee durability, but offers extremely fast reads? Or another that’s good at absorbing writes quickly (built around an LSM, perhaps), or maybe another that packs transactions and sends them off to AWS/GCS for disaster recovery and can answer reads, but maybe slowly. There are a lot of interesting possibilities here!
I considered this, and might reconsider it again at some point. Ultimately, I found that these patching tools (of which there are a few) don’t seem to play well with asynchronous tests, as different tests might need to patch things in different ways, and yet others may want to use the actual implementation.
I haven’t settled on a good answer for this… and as you note, there are some small drawbacks to this approach. At this time, though, I think the focus needs to be on correctness. There will come a time when those extra few cycles matter, and I’m sure the code will change.
I’ll take that into consideration.
I tend to use a lot of pipelining - that’s the whole story. The macros ensure that the functions are defp, and the compiler will inline them. In the compiled code, there’s no difference between this and hard-coding the tuples, but the functions play nicely with pipelining and I don’t need to resort to then(&{:noreply, &1}).
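For example (append_event/2 and maybe_flush/1 are made-up stand-ins):

```elixir
defp noreply(state), do: {:noreply, state}

# Keeps the callback in pipeline form end to end, rather than ending
# the pipe with then(&{:noreply, &1}):
def handle_cast({:record, event}, state) do
  state
  |> append_event(event)
  |> maybe_flush()
  |> noreply()
end
```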
Yay! I’m really quite happy that this is generating some interest. FDB is an awesome system and the patterns and ideas really should be more widely used. It’s just good tech. Keep the questions and feedback coming!
Note that in addition to batching the Sequencer call the proxies also batch the generation liveness check to the tlogs which is needed to prevent stale reads after a recovery. Read version batching was previously done by the CommitProxy until it was taken over by the (newer) GRV proxies. The commit proxies also used to ask each other for the latest version rather than reporting it back to the Sequencer after commit (but before replying to clients), but they switched to the latter to reduce tail latency.
It looks like you encode transactions into a binary format, then pass them to the commit proxy, which batches them and then decodes them back in Finalization and streams the mutations. Why? It just looks like an unnecessary encoding/decoding step. There is also transaction encoding after sharding for the logs, and then transaction decoding in the log.
I can see that you implemented custom encoding for keys, and I thought I would see the reasoning behind this somewhere in the code, but I didn’t, and it looks like term_to_binary would do just fine. Am I missing something?
It looks like the documentation in the code is inconsistent: some very simple functions are extensively documented (for example, mutation_to_key_or_range) while others have no doc at all and use single-letter variables (for example, the Tx module and its child modules). What’s the reasoning behind that?
Yeah, I get it, 99% of these tools suck at async tests. That’s why I wrote Repatch, which does not suck in async and is built with async first in mind. I think I will rewrite some tests in Bedrock to prove my point, haha
But these are not macros that generate defp definitions; these are imports, and imports are just syntactic sugar for Module.function. They always compile into so-called remote calls, which are not inlined and introduce overhead (though very, very little).
I’ve read the Finalization and I will continue reading tomorrow or on Monday. Plus I made a short PR with simple changes and NITs
FDB is an ordered key/value store, so the key encodings are designed such that keys have useful ordering properties in their encoded form. See data modeling and the typecode reference in their internals docs.
Erlang’s term_to_binary does not provide these properties. There is an amusingly-named library to encode terms in term order. However, Erlang term order is also insufficient because tuple length is compared before contents, which is the wrong way around for prefix scanning of indexes. Though I think maybe you could use lists?
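A quick illustration of why the ordering matters:

```elixir
# Tuples compare by size before contents, so {"a"} sorts before *every*
# 2-tuple -- including {"z", ...} -- and never lands adjacent to its own
# extensions like {"a", "b"}:
{"b"} < {"a", "b"}    #=> true

# Lists compare element-wise, and a list sorts immediately before its
# own extensions -- exactly the property a prefix scan wants:
["a"] < ["a", "b"]    #=> true
["a", "b"] < ["b"]    #=> true
```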