Bedrock - a scalable, distributed key-value database with better-than-ACID guarantees

Well, yeah, I understand that, but you use dets as storage, and it is unordered. And dets can store data as Erlang terms, yet you still encode keys into binaries. You could use ets with tab2file, for example, and it would have ordering and you wouldn’t even have to encode keys in the first place.

I understand that dets is a temporary solution, but even if you choose any other storage, ordering of terms is a problem at the storage level. And when you only allow keys that can be encoded, it means your abstraction is leaking and a storage-level limitation is being exposed to the end user.

The better approach would be to:

  1. Not encode keys and values anywhere in the pipeline.
  2. Introduce ordering with comparators instead of binary comparison.
  3. Encode keys only once they are indexed and you want to write them to persistent storage.

One possible algorithm I can think of is a B+ tree index with an append-only file structure, similar to what CubDB does. It’s not the best possible algorithm, but it shows that sortability of keys-as-binaries is not a hard requirement; it’s rather a limitation of the approach you are planning to take in the future.
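Just to illustrate the "encode only at the persistence boundary" idea, here’s a rough sketch (module and function names are made up; :gb_trees just stands in for whatever ordered in-memory index you’d use, and it compares plain Erlang terms rather than taking a pluggable comparator):

    defmodule TermIndex do
      # Keys stay as plain Erlang terms in memory; they are only
      # serialized when the index is flushed to the append-only file.
      def new, do: :gb_trees.empty()

      def put(index, key, value), do: :gb_trees.enter(key, value, index)

      def get(index, key) do
        case :gb_trees.lookup(key, index) do
          {:value, value} -> {:ok, value}
          :none -> :error
        end
      end

      # Encoding happens only here, at the persistence boundary.
      def flush(index, io_device) do
        for {key, value} <- :gb_trees.to_list(index) do
          :ok = IO.binwrite(io_device, :erlang.term_to_binary({key, value}))
        end

        :ok
      end
    end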

I don’t quite get what you’re saying here. My best guess is that you’re talking about some algorithm which is not yet implemented and which lets you query data in the DB by matching on the leading part of a tuple (where the tuple pattern has no fixed size, like {^first, ^second | _}). Well, yeah, if someone wanted to do this, they would be better off using a list as the key. This problem is not new; ets has the same problem, and the solution there is also to just use lists.

First, keep in mind I am not the OP.

Yes, dets is not appropriate as a storage engine here because it is unordered. It cannot serve range scans (regardless of how you encode the keys). I assume, as you say, that it’s temporary. I actually wrote a longer reply about why dets cannot be used for this case not long ago on another thread.

In FoundationDB the abstraction is an ordered binary/binary key/value store. Exposing this abstraction was an explicit design decision which they made very carefully.

Try to think about what this would actually mean in practice. Clients would have to somehow provide a comparison function all the way through the system to the underlying storage engine.

What would happen if different clients wanted to order data in different ways? How would you shard the keyspace in that case?

Instead, FDB offers a binary keyspace which is lexicographically ordered (and sharded across many servers), and clients are required to encode their data so that it sorts in an order that is useful to them. The tuple layer encodings are provided as a simple way to do that.
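To give a rough idea of how such an encoding works for strings (this is a simplified illustration, not the full tuple layer spec, and TuplePack is a made-up name): each element gets a type prefix, embedded null bytes are escaped, and a null terminator ends the element, so byte-wise comparison of the packed keys matches element-wise comparison of the tuples.

    defmodule TuplePack do
      def pack(tuple) when is_tuple(tuple) do
        tuple
        |> Tuple.to_list()
        |> Enum.map(&encode_string/1)
        |> IO.iodata_to_binary()
      end

      defp encode_string(s) when is_binary(s) do
        # Escape embedded nulls so the terminator stays unambiguous.
        escaped = :binary.replace(s, <<0>>, <<0, 0xFF>>, [:global])
        <<0x02, escaped::binary, 0x00>>
      end
    end

With this, pack({"Valim"}) is a strict prefix of pack({"Valim", "Jose"}), so it sorts just before it, and both sort before pack({"ZZZ"}).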

FDB is a tool to build databases.

When building a database you want to implement indexes. Like if you had %User{id: 1, first_name: "Jose", last_name: "Valim"}, you may want to build an index {last_name, first_name, id} to look up users by their names.

So you can encode a key Valim/Jose/1 (where these values are actually encoded in a particular way, see those docs I linked) and store it in the database under a prefix. Say, my_index/Valim/Jose/1.

Now, say you want to look up all "Jose Valim"s in the database. You can perform a prefix query on the index, which would be a range scan from my_index/Valim/Jose/ to my_index/Valim/Jose0 (0 comes after / in ascii).

That would return keys for all "Jose Valim"s, from which you could then extract the ids.

Likewise, if you want to look up all Valims, you could prefix scan for my_index/Valim/. And so on.
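In code, computing that range boundary is something like the following (PrefixRange is a made-up helper, and I’m ignoring the edge case where the prefix ends in 0xFF):

    defmodule PrefixRange do
      # Given an index prefix, compute the [start, end) key range for a
      # prefix scan: the end key is the prefix with its last byte
      # incremented, so every key beginning with the prefix falls inside.
      def range(prefix) when is_binary(prefix) and byte_size(prefix) > 0 do
        head = binary_part(prefix, 0, byte_size(prefix) - 1)
        <<last>> = binary_part(prefix, byte_size(prefix) - 1, 1)
        {prefix, <<head::binary, last + 1>>}
      end
    end

    # PrefixRange.range("my_index/Valim/Jose/")
    # #=> {"my_index/Valim/Jose/", "my_index/Valim/Jose0"}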

What I was saying there, essentially, is that it’s desirable for tuples to sort based on the ordering of their elements rather than their length. That is, {"ZZZ"} should sort after {"Valim", "Jose"}. In Erlang term order it would sort before, which would break the last name prefix scan. You could be careful to always scan for {"ZZZ", ""} and in certain cases that would work, but it makes things more annoying.

I mentioned Erlang lists because I think they might sort this way (I only just realized this), but either way tuples definitely sort by length first, which is incorrect for this purpose.

Edit: After testing, lists do indeed sort correctly (unlike tuples), which is nice. That means if you disallowed tuples you could get useful behavior from the term order. Erlang also orders integers/floats weirdly, but with a schema that might not matter.

You do still have to actually encode them to binary in such a way that they sort in term order, though, which term_to_binary does not (your original question).
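To make that concrete, here’s roughly what the comparisons look like in iex:

    iex> {"ZZZ"} < {"Valim", "Jose"}    # tuples compare by size first
    true
    iex> ["Valim", "Jose"] < ["ZZZ"]    # lists compare element by element
    true
    iex> "aa" < "b"
    true
    iex> :erlang.term_to_binary("aa") < :erlang.term_to_binary("b")
    false

The last pair disagrees because term_to_binary puts a 4-byte length field before the string payload, so the encoded binaries sort by length before content.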

Always welcome!

As you point out, the transactions are encoded at the client in three basic sections: mutations that change the system, read conflict information and write conflict information. Not all of these sections are useful to all parts of the system. This is done at the client so that we can enforce the transaction size limitations as well as offer a potential entry point to other kinds of clients submitting transactions to the system.

An important thing to note is that resolution and logging can be sharded. For resolution, there can be more than one resolver, each covering a section of the keyspace. In that situation, the transaction’s read and write conflicts must be divided up by shard before being sent. As a default, I have temporarily split the system and user keyspaces into two shards in order to be able to test the mechanics of this splitting – sort of a worst case. In the near future, for most setups, a single resolver will be used for the entire keyspace, and the splitting can be optimized away in that case.

The same is true of the logging system, though sharding is more likely here. Each storage team, responsible for a range of keys, has a tag. Logs are assigned one or more tags. FoundationDB examines each mutation to see which shards it touches and assigns it one or more tags accordingly (a clear-range mutation could easily cover multiple tags, for example). For each log whose tag set intersects that mutation’s tags, the mutation is queued. Every log receives the transaction, whether or not any mutations end up affecting it; this is done for synchronization.
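In rough pseudo-Elixir, the routing step looks something like this (illustrative names only, not Bedrock’s actual API):

    defmodule TagRouting do
      # tags_for_mutation is assumed to be a function that maps a mutation
      # (a set or a clear-range) onto a MapSet of the tags of every shard
      # it touches; log_tags is the MapSet of tags assigned to that log.
      def route(mutations, logs, tags_for_mutation) do
        Map.new(logs, fn {log_id, log_tags} ->
          matching =
            Enum.filter(mutations, fn mutation ->
              not MapSet.disjoint?(tags_for_mutation.(mutation), log_tags)
            end)

          # Every log gets an entry for the transaction, even an empty
          # one, so the logs stay synchronized on versions.
          {log_id, matching}
        end)
      end
    end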

As you point out, there is a bit of unnecessary work going on here. My general approach is to work out the correctness of a thing, lock down the tests, and then use those tests and the correct (but perhaps slower) version to validate the improved/optimized version.

Yep. term_to_binary, while wonderful in many ways, does not guarantee that the resulting binaries will sort lexicographically in the same order as the original terms. For keys this is very important; for values, not so much. There are several places where encoding is done, and in each it’s for a different reason. term_to_binary is not always going to be appropriate for all of those needs.

Yep. You quite correctly point out that it’s not a polished / finished product. The team is very small, and we’re doing our best.


Unordered on-disk storage is fine, as long as you have an indexing mechanism above it where order can be established. This is how mnesia uses :dets under the hood – it uses ordered :ets tables to index over it. I’m using :dets in the example storage driver because it’s built into the BEAM, it’s reliable if used correctly and in the case of a severe failure, the data is easily recoverable as key/values using minimal tooling.

The ultimate goal is to have a proper shard-management layer (DataDistributor, fdb calls it) that would split/merge shards such that none of them get near the limitations of :dets.
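For illustration, the arrangement I mean looks roughly like this (made-up names, no error handling, and a simplistic rebuild-on-open):

    defmodule IndexedDets do
      # Values live in an unordered :dets table; keys are mirrored into an
      # :ordered_set :ets table so range scans have something ordered to walk.
      def open(name, path) do
        {:ok, dets} = :dets.open_file(name, file: String.to_charlist(path), type: :set)
        index = :ets.new(:index, [:ordered_set, :public])

        # Rebuild the in-memory index from whatever is already on disk.
        :dets.traverse(dets, fn {key, _value} ->
          :ets.insert(index, {key})
          :continue
        end)

        {dets, index}
      end

      def put({dets, index}, key, value) do
        :ok = :dets.insert(dets, {key, value})
        true = :ets.insert(index, {key})
        :ok
      end

      # Inclusive range scan driven by the ordered :ets index.
      def range({dets, index}, from, to) do
        spec = [
          {{:"$1"},
           [{:andalso, {:>=, :"$1", {:const, from}}, {:"=<", :"$1", {:const, to}}}],
           [:"$1"]}
        ]

        for key <- :ets.select(index, spec) do
          [{^key, value}] = :dets.lookup(dets, key)
          {key, value}
        end
      end
    end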

I don’t understand why that is. I mean, the data storage layer can just index things. And you could compare keys as plain terms when doing log routing, too.

For the record, if you’re arguing in favor of exposing a tuple/record API instead of a binary key/value API I don’t really disagree with you.

But under the hood if you want to re-use an existing storage engine (LMDB, RocksDB, WiredTiger, etc) they pretty much all expose a binary/binary lexicographically-ordered API so that’s what you’re going to be dealing with.

I think some of these embedded DBs allow you to pass a custom comparator but then you would have to hand-write a “term order” comparator and pass it down. And would that really be faster? I doubt it.

At least internally I think byte-order is a pretty good abstraction. It’s universally understood.

This will cause a lot of read amplification for range scans.

Nothing is free. LSMs don’t do range scans cheaply, either. Btrees do range scans well, but cost you on writes. Every way you go, there are gonna be trade-offs – and I’m not going to get too hung up on it. As long as the storage servers are easy to swap out, then a user can choose the one (or combination) that makes the most sense for their workload.


You’re right! I did this on another project, and forgot to carry that over to here. I’ll fix that!

Bedrock 0.2.0

Major Features:

  • KeySelector API: Added comprehensive key selector support across storage, gateway, and repo layers with range query capabilities
  • New Olivine Storage Engine: New B+ tree(ish) storage implementation with advanced indexing and persistence
  • Range Reads: Full range-read support with key selectors, conflict detection and read-your-writes within transactions
  • Enhanced Transaction Processing: Refactored commit proxy finalization with synchronous sequencer notification and exactly-once reply guarantees

Architecture Improvements:

  • Reworked transaction tree balancing and processing
  • Improved resolver capabilities with enhanced telemetry and tracing
  • Director recovery system with capability-based retry mechanisms
  • Shale log consistency fixes and transaction stream enhancements

Developer Experience:

  • Updated documentation with transaction format guides and architecture deep-dives
  • Subspaces and fast tuple packing/unpacking
  • Enhanced example bank livebook with range query demonstrations
  • Comprehensive test coverage for new features

This release represents a significant evolution of Bedrock’s transactional capabilities with improved performance, reliability, and developer ergonomics.



Oh nice, some more good code to read!

Well, now that you use your custom storage implementation, it’s not necessary to have a special encoding for keys, right? As far as I can see in the Olivine code, it uses :gb_trees for the index structure, with a custom find_page_for_key implementation that looks up keys using the default Erlang term ordering.

I am also experimenting with storage stuff and I am currently writing a very simple and minimal LSM implementation to try out as a Bedrock backend.

@type t :: t() rocks btw :laughing:


Please do share when ready, I’ve been strongly considering writing an Elixir LSM as well :slight_smile:

Another nice thing about using binaries is that, as I understand it, the BEAM will just keep slicing up the refcounted binary from the page load without any copying.

Imagine you have an LSM. If you have to decode the keys to binary-search the index/data blocks, all of a sudden you’re copying a bunch of memory and creating a bunch of terms that could have just been binary slices. This is not free.

The same would be true when binary searching a Btree page.

So even if you want term order I still think it would be better to use an encoding which sorts lexicographically in binary form.
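Concretely, the sort of thing I have in mind is below (a hypothetical block layout, just a sketch): the block keeps the raw page binary plus a tuple of {offset, length} slots for each key, and the search compares sub-binary slices of the page directly, so no key is decoded or copied along the way.

    defmodule BlockSearch do
      def find(page, slots, target) when is_binary(page) and is_tuple(slots) do
        search(page, slots, target, 1, tuple_size(slots))
      end

      defp search(_page, _slots, _target, lo, hi) when lo > hi, do: :not_found

      defp search(page, slots, target, lo, hi) do
        mid = div(lo + hi, 2)
        {offset, len} = elem(slots, mid - 1)
        key = binary_part(page, offset, len)   # sub-binary, no copy

        cond do
          key == target -> {:found, mid - 1}   # zero-based slot index
          key < target -> search(page, slots, target, mid + 1, hi)
          true -> search(page, slots, target, lo, mid - 1)
        end
      end
    end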

I think I will be taking this approach in my own work so if you have a dissenting argument here I would of course very much like to hear it!

Slices are great for temporary work, but the memory bloat can be more permanent. The slice retains a reference to the underlying binary, and significant amounts of memory can be accidentally / quietly retained. For example, you read in a 10mb chunk from somewhere and slice out one 10 byte value to keep. As long as there’s a reference to the 10 byte value, the whole 10mb sticks around, even if you’ve discarded all references to the 10mb chunk. You can sever this linkage with an explicit :binary.copy, though. :binary.referenced_byte_size can tell you more about this, too. There are a bunch of fun toys in the :binary module.
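A quick iex session shows the effect (sizes in bytes):

    iex> big = :crypto.strong_rand_bytes(10_000_000)
    iex> slice = binary_part(big, 0, 10)
    iex> :binary.referenced_byte_size(slice)
    10000000
    iex> :binary.referenced_byte_size(:binary.copy(slice))
    10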


This is useful info, thanks.

In the case of a storage engine you would only retain the page during a given query (only during the binary search, even) so I think what I said remains correct. I guess you would definitely want to copy out any binaries that you intend to return, though.

Is there any mechanism for the VM to recognize that it’s holding on to a huge binary for a small slice? I just assumed it would figure that out on its own tbh.

Also, tangential, but does anyone know how refcounted binaries interact with ets? Specifically, if you store a binary in an ets table is it copied?

(Nice work on the release, btw)

Just so. Things destined for the network or disk might not matter, but things handed back to other processes (with their own arbitrary lifetimes) will definitely cause you issues with unexpected retains. My understanding is that :binary.copy only actually does anything if the parameter is a slice – it won’t duplicate things unnecessarily, and so it’s okay to use it defensively “at the exits.”

Not that I’m aware of. We’re just expected to manage this stuff.

Binaries (>= 64 bytes) are never copied. Everything else will be copied if the :ets table is accessed from another process.

In general, anything other than a “large” binary is copied when crossing a process boundary, whether it’s an :ets read or a message send, etc. For :ets reads, the reader “pays the toll” and will recursively copy whatever is read into its own space; for message sends, the sender “pays the toll” of copying the message into the receiver’s space.

This is why it can be faster to pass large binaries around between processes instead of large structures (and why Bedrock now binary-encodes transactions early and passes those around).
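A quick-and-dirty way to see the difference for yourself (illustrative only; absolute numbers will vary wildly, the point is just that the refc binary is passed by reference while the list is deep-copied into the receiving process):

    defmodule CopyCost do
      # Spawns a receiver, sends it the payload, and waits for an ack.
      def time_send(payload) do
        parent = self()

        receiver =
          spawn(fn ->
            receive do
              {:payload, _payload} -> send(parent, :done)
            end
          end)

        {micros, :done} =
          :timer.tc(fn ->
            send(receiver, {:payload, payload})

            receive do
              :done -> :done
            end
          end)

        micros
      end
    end

    # CopyCost.time_send(:binary.copy(<<0>>, 10_000_000))  # large refc binary
    # CopyCost.time_send(List.duplicate(0, 10_000_000))    # large list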

Thank you! On deck next is a full slate of atomic operations, control over the read/write conflicts sets and other refinements necessary to support a fdb-style high-contention-allocator and directory layer. Stay tuned.


Mm, I have thought about doing this. Especially since it may be possible to encode things in such a way that the batch fragment destined for a particular storage server can essentially travel through the TLog (and WAL on disk) without ever being decoded.

In practice I think the FDB tagging design (randomly assigning tags when the storage servers’ tlogs overlap) makes this messy. Frankly this design is bad and I don’t know what they were thinking. I guess it just pre-dates the idea of copysets.

It would be much better to create TLog teams as copysets and assign storage tags at the team level. When I was reading through your code I think that might even be what you did, in which case congrats for dodging that mess!

It’s a one-way door, though. Once you go full copyset you can only move data at the team level, so you can never go back. Meaning small deployments have to use copysets too even when it’s less than ideal. I think I’m just going to commit to that design, though. Spilling the TLogs without team-level tags is just so messy and it’s keeping me up at night.

I considered this, but if you want the DB to work embedded then you have no control over whether it’s headed for the network or not because you don’t know if the read is local to the node. So copying it is!

In practice copying the returned keys/values out of the page is exactly what you would do in a “normal” (non-Erlang) implementation anyway so it’s no big deal.

  1. This is a premature optimization. Let’s do terms-as-keys and compare the performance. First make it work, then make it beautiful, then make it fast, and so on.
  2. This is a false dilemma. We can have zero-copy for binaries and simple tuples while having arbitrary terms-as-keys at the same time, because if we have string keys or tuple-of-strings keys, we can instantiate them with binary_to_term (or implement our own version of binary_to_term for strings, with a fallback to :erlang.binary_to_term) without copying as well.
  3. This is a broader point than just this discussion about keys. If I wanted the most efficient, well-tested and production-ready FoundationDB in Elixir, I’d just use vanilla FoundationDB over a NIF interface. So, if someone is ever going to use Bedrock, it will not be because of its performance, but because of its interoperability with Elixir, the ability to work with it the Elixir way, to introspect it in Elixir, and so on. This is just my personal view, and I will understand if you think otherwise about this project.

P.S. I keep saying “we” when proposing solutions, but I am not on your team, so it looks awkward (to me at least); that’s just my poor English, please don’t mind it.


The threshold for “premature” is IMO much higher for a storage engine. I have been debating (with myself) whether it’s even possible to write a useful storage engine in Elixir, and one of the biggest problems is a lack of control over memory allocation.

I’ve been leaning towards an LSM in part because its design necessitates copying, which seems more natural in a functional language, whereas a Btree would benefit greatly from in-place mutation of pages (in memory, not on disk). Still, there are challenges.

Being able to binary search the blocks without decoding (and therefore copying) the keys seems to me like it would be a meaningful advantage. But of course some microbenchmarks could show it matters less than I think!

I don’t see how this would work because they have to be comparable (bin_a < bin_b) in encoded form to avoid decoding. But maybe I am misunderstanding you?

This is entirely reasonable. I personally have a specific set of goals which preclude the use of FDB, but Bedrock (while similar in some ways) is not my project and likely has different goals. That’s not up to me to comment on.

What other functionality do you think would benefit from using terms down to the storage layer? Introspection/debugging is a good example.

This is normal in English; native speakers do this all the time. It doesn’t look awkward to me.

For example, say I read from the storage a string encoded with :erlang.term_to_binary. It will look like <<131, 109, size :: 32, string :: binary-size(size)>>. I can’t compare the encoded forms directly (because they would compare by size first), but I can use this pattern to extract the string and then compare it without any copying, since it will be just a reference to a sub-binary.

This way we leave the decision of whether to optimize keys for fast comparison to the user. If a user wants really fast comparisons, they can implement their own encoding and just write strings into the storage.
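Something like this, for instance (made-up names, just a sketch of the comparator idea above):

    defmodule KeyCompare do
      # Pattern-match the string payload out of the term_to_binary
      # encoding (a sub-binary, so no copy) and compare the payloads;
      # fall back to fully decoding anything else.
      def less_than?(
            <<131, 109, size_a::32, a::binary-size(size_a)>>,
            <<131, 109, size_b::32, b::binary-size(size_b)>>
          ) do
        a < b
      end

      def less_than?(encoded_a, encoded_b) do
        :erlang.binary_to_term(encoded_a) < :erlang.binary_to_term(encoded_b)
      end
    end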
