CubDB, a pure-Elixir embedded key-value database

That makes me think that with more complex keys we could do this:

With keys like {{App.Product, 1}, "some-product-id-1234"} you can select all products with min = {{App.Product, 0}, *} and max = {{App.Product, 2}, *} (where * can be anything, as its value is irrelevant to the comparison).

But that is pure implementation leakage, unless those 0, 1 and 2 have a meaning, like “root”, “current products”, “draft products” …

Yes, one can use min_key: {__MODULE__, :some_key} and max_key: {__MODULE__, :some_key, 0, 0} with exclusive ranges. Note that this is not perfect, as it would for example include a tuple like {__MODULE__, "xxx"} or {__MODULE__, :some_key, 0, -1}.

That said, for practical cases it might be specific enough, as long as you know there won’t be any 2- or 4-tuple keys of that kind. You can also use a filter function on top of min/max key to filter out anything that is not a 3-tuple.
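
To make the term ordering concrete, here is a quick iex-style sketch (MyApp.Store is just a placeholder for the __MODULE__ above); every comparison below evaluates to true:

```elixir
# Erlang/Elixir term ordering on tuple keys: tuples are compared by size
# first, then element by element; atoms sort before binaries.
{MyApp.Store, :some_key} < {MyApp.Store, :some_key, "any id"}    # 2-tuple < 3-tuple
{MyApp.Store, :some_key, "any id"} < {MyApp.Store, :some_key, 0, 0}  # 3-tuple < 4-tuple
{MyApp.Store, :some_key} < {MyApp.Store, "xxx"}                  # atom < binary
{MyApp.Store, "xxx"} < {MyApp.Store, :some_key, 0, 0}            # so this 2-tuple also falls in the range
{MyApp.Store, :some_key, 0, -1} < {MyApp.Store, :some_key, 0, 0} # and so does this 4-tuple
```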

If you have more constraints on the third element, one can do better than that by leveraging the term ordering.

What would be really useful would be a “minus infinity”. The nice thing about it is that it would be smaller than any term. That way, one could change the above to min_key: {__MODULE__, :some_key, minus_infinity}, max_key: {__MODULE__, :some_key, 0, minus_infinity}, and match only the tuples you describe. As far as I know, though, there is nothing like that in Erlang/Elixir.

1 Like

Well the problem is that it would also return {__MODULE__, :other_scope, 123} since tuples are compared by size first.

So, yes, I can then use a function filter, but what do you think about implementing a :key_filter in the select options? That would work just like min/max, without loading unnecessary data from the disk.

1 Like

Hi @lud

Such a key_filter would have to load the keys, but it could skip loading the values associated with them, so it might be a good idea indeed. Note, though, that min_key/max_key are more performant not so much because they avoid loading the data, but primarily because they avoid enumerating all entries and make efficient use of the BTree instead. Therefore, a key_filter would be more or less as performant as a normal filter, unless one is working with large data.

In summary, and back to the question of selecting tuple keys, the best way is:

  • Specify min_key/max_key, making use of the term order, to narrow the range of keys as accurately as possible. This is the most performant way, and is often sufficient.
  • If that leaves some unwanted keys in the range, use filter/take_while/drop_while functions to filter them out. Using only the pipe functions would enumerate all keys, which is not performant, but using them on top of min_key/max_key is a good approach, as sketched below.
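
Putting the two points together, a minimal sketch (db, MyApp.Store and :some_key are placeholders; depending on the CubDB version, the filter may instead need to be passed through the :pipe option of select rather than by piping the result into Enum):

```elixir
# Narrow the range with min_key/max_key, then keep only the 3-tuple keys
# discussed above. The filter also discards the 2-/4-tuple boundary keys,
# so the exact inclusivity of the range bounds does not matter here.
entries =
  db
  |> CubDB.select(
    min_key: {MyApp.Store, :some_key},
    max_key: {MyApp.Store, :some_key, 0, 0}
  )
  |> Enum.filter(fn {key, _value} -> match?({MyApp.Store, :some_key, _}, key) end)
```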

I might actually add a key_filter function, let me evaluate the trade off between a more complex API and the performance benefit.

2 Likes

I might actually add a key_filter function, let me evaluate the trade off between a more complex API and the performance benefit.

Nice, thank you.

Looking at the state of the process, I thought that all keys were kept in memory. But I only had a tiny number of keys.

As you discovered, keys are not kept in memory, apart from the root of the BTree :slight_smile: Your observation also gives me the chance to briefly explain why that’s the case.

The design goal of CubDB is to be a good embedded database that can run everywhere Elixir/Erlang can run. This includes, for example, embedded devices (which is the primary field where I personally use CubDB). Therefore, CubDB in its first production release attempts to be as efficient as possible in its use of resources without any form of caching, leaving the trade-off between memory usage and speed to the user. For many use cases, such as data logging and data stores on tiny embedded Linux devices, this is an advantage.

Future releases might introduce caching, but it will always be an optional layer. The same goes for compression: CubDB initially implemented compression, but it was removed when it turned out to be a trade-off better left to the user.

In sum, CubDB strives to implement a small, very reliable, usable, and performant core in pure Elixir with minimal overhead. As a future step, optimizations that introduce trade-offs might be added as options on top.

9 Likes

I’m looking for a place to store device config on an embedded device (Nerves).
My main concern is data corruption in case of a power loss (which will happen very often).
CubDB seems to be a good fit (before this I was considering SQLite, but it would be overkill).
Did you ever stress-test CubDB with frequent power cuts while writing?

Hi @Sebb,
that’s exactly the primary use case that I designed CubDB around. Power failures won’t corrupt data or break the atomicity of transactions. If you set auto_file_sync to false, you might lose some recently written transactions (though this still does not break the atomicity of the ones that are committed). If you set it to true (the default in 1.0.0), successful writes are always persistent.
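
For illustration, a minimal sketch of the two configurations (the data_dir paths are placeholders, and the exact start_link options may differ slightly between versions):

```elixir
# Durable configuration store: every successful write is synced to disk.
{:ok, config_db} =
  CubDB.start_link(data_dir: "/data/config_db", auto_file_sync: true)

# High-volume data logging: trade some durability for write throughput.
{:ok, log_db} =
  CubDB.start_link(data_dir: "/data/log_db", auto_file_sync: false)
```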

I have been using CubDB in production for persisting configuration and data logging on embedded devices for years now, and have never experienced a data corruption issue caused by a power failure. I do perform stress tests when releasing notable changes, but I am also looking at options for a more systematic approach (such as a Raspberry Pi with a system to randomly cut power and a test script reporting inconsistencies).

One warning from experience: supplying too low a voltage to an RPi can produce data corruption on the file system of the kind that neither CubDB nor SQLite can prevent, like random bit flips. Very long USB cables (over 1.5 m) with a poor-quality power supply used to power an RPi 4 are sometimes enough to cause such a voltage drop.

6 Likes

I don’t know your implementation, but the BEAM has a delay of around 2 seconds when writing to the disk: even when it reports that the data is persisted, it is not actually persisted yet.

You can replicate it with the script I provide in the above post, which uses File directly to create a file descriptor and then writes to the disk with IO.binwrite; you will confirm that persistence is only guaranteed after 2 seconds. This can be configured down to zero seconds, but that will affect performance.

1 Like

It is very unlikely to encounter data loss even with a naive implementation, because the power failure has to happen exactly while you are writing. And normally write times are short and power failures are rare.

I propose the following:

The device boots into an application running CubDB that writes lots of data to the disk. Before the writing starts, the app also sends a command to a relay to cut the power. The problem is that it takes some time from sending the command to the actual power failure, so we need a way to be sure we are writing at the moment the power is lost.
The relay will automatically re-enable power after some time, so the process starts again.

I have all the hardware at hand to do that, and I’m willing to do it.

1 Like

Hi @Sebb , what you describe looks exactly like what I wanted to implement for automated stress testing.

Let me also explain how atomicity and corruption protection work internally in CubDB. The CubDB data structure is an append-only B+tree saved in a single file. Every time an entry is written, the B+tree is not modified in place. Instead, exactly like with immutable data structures, the new leaf and the path to the new B+tree root are appended to the data file, referencing the parts of the tree that did not change. The last thing appended to the file is a transaction header, containing a pointer to the new B+tree root and some metadata such as the size. The header is always written at an offset that is a multiple of the page size.

Upon start, CubDB searches the data file backwards for the most recent “sane” header. If a header is corrupted, incomplete, or completely missing, CubDB simply skips backwards looking for a previous one.
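
For illustration, such a backward scan could look roughly like the sketch below (this is not CubDB’s actual code; the page size and the header check, including the magic byte, are made-up assumptions):

```elixir
defmodule HeaderScanSketch do
  @moduledoc "Illustrative only: how a backward scan for the last sane header could look."

  # Assumed page size; the real value is an implementation detail.
  @page_size 1024

  # `io` is a file opened in :raw, :binary mode; `file_size` is its current size.
  def find_latest_header(io, file_size) do
    last_offset = div(file_size, @page_size) * @page_size
    offsets = Range.new(last_offset, 0, -@page_size)

    Enum.find_value(offsets, :no_valid_header, fn offset ->
      case :file.pread(io, offset, @page_size) do
        {:ok, bytes} -> if valid_header?(bytes), do: {:ok, offset}, else: nil
        _ -> nil
      end
    end)
  end

  # Hypothetical integrity check: a made-up magic byte marks a header.
  defp valid_header?(<<0xCB, _rest::binary>>), do: true
  defp valid_header?(_), do: false
end
```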

Compaction operations take care of cleaning up entries that are no longer reachable, similarly to garbage collection in memory. Compaction runs concurrently: it creates a new compacted copy of the data file, cleans up the old one only when done, and can be stopped at any time without corruption.

The “immutable”, append-only nature of the data structure used by CubDB also makes it possible to achieve isolation and to read concurrently with writes: readers simply traverse the most recent B+tree as of the time the read started, since writers cannot mutate it in place.

Hi @Exadra37, thanks for your observation. CubDB does not use delayed writes, so when setting auto_file_sync: true, a file sync is programmatically triggered at the end of each transaction. As you noted, this strong durability guarantee comes at a performance cost. Setting auto_file_sync: false conversely trades durability for performance.
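
As a sketch of the general pattern (not CubDB’s internals; the path and payload are placeholders), an explicit sync after a write looks like this:

```elixir
# Append some bytes and force them (and file metadata) to stable storage
# before reporting success, instead of relying on the OS/VM to flush later.
{:ok, io} = :file.open("/data/example.db", [:raw, :binary, :append])
:ok = :file.write(io, <<"transaction bytes">>)
:ok = :file.sync(io)
:ok = :file.close(io)
```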

When persisting configuration, it makes sense to keep durability, as performance won’t be a problem with sporadic writes. When using CubDB for data logging, or for other high-volume write use cases, it might make sense to choose performance instead.

1 Like

I will set the hardware up and give it a try if you are willing to help implement the test-cases.

This assumes that you can’t corrupt data that is already written when writing new data. Are you sure that’s the case? I could imagine something like this:

D - data already written
F - “empty”
E - new data

(1)
--- flash page start ---
DDDDDDDDDDDDDDDDDD
DDDDDDDDDDDDDDDDDD
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
--- flash page end ---

Now we append to the file, so the flash controller reads the beginning of the page into RAM and erases the page (to prepare for writing), resulting in

(2)
--- flash page start ---
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
--- flash page end ---

and now appends the data

(3)
--- flash page start ---
DDDDDDDDDDDDDDDDDD
DDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFFFF
--- flash page end ---

If the power fails after (2) there is a problem.
I’ve no knowledge of how flash controllers really work, but you could build a controller this way.

@Sebb it’s also worth noting that the test suite includes some tests that truncate the end of the data file in random ways, simulating a partial write because of a power failure, and assert that the database can still be opened and traversed.
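
For illustration, a test in that spirit could look roughly like the sketch below (this is not the actual CubDB test suite; the *.cub file name pattern and the shape of the select result are assumptions):

```elixir
defmodule TruncationRecoverySketch do
  use ExUnit.Case

  test "entries committed before a truncated tail are still readable" do
    dir = Path.join(System.tmp_dir!(), "cubdb_truncation_sketch")
    File.rm_rf!(dir)
    File.mkdir_p!(dir)

    {:ok, db} = CubDB.start_link(data_dir: dir, auto_file_sync: true)
    for i <- 1..100, do: :ok = CubDB.put(db, {:entry, i}, i)
    :ok = GenServer.stop(db)

    # Chop a random number of bytes off the end of the data file to simulate
    # a partial last write caused by a power failure.
    [file | _] = Path.wildcard(Path.join(dir, "*.cub"))
    %File.Stat{size: size} = File.stat!(file)
    {:ok, io} = :file.open(file, [:read, :write, :raw, :binary])
    {:ok, _} = :file.position(io, size - :rand.uniform(min(size, 4096)))
    :ok = :file.truncate(io)
    :ok = :file.close(io)

    # The database still opens and can be traversed; whatever survived is intact.
    {:ok, db} = CubDB.start_link(data_dir: dir)
    entries = db |> CubDB.select() |> Enum.to_list()
    assert Enum.all?(entries, fn {{:entry, i}, v} -> v == i end)
  end
end
```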

As long as the corruption is confined to the last page, CubDB can recover. Since the file is append-only, only the last page can be affected by corruption.

Flash drives typically cannot erase less than a full block, while they can write a single page (which is smaller than a block), so it would not make sense for a disk controller to erase before append-writing.

The details of how appending to a file works at the low level are down to the specific disk controller. Some will append writes to the next page, leaving “holes” if the previous page is incomplete. Some might work differently, but they should still write a page atomically. In the end, at this level, the responsibility for ensuring the sanity of written pages resides with the disk controller, not with the application code.

For my use case it is OK to lose configuration done directly before the power failure.
But it is not OK to lose configuration that once worked.
Say I configure the device today, and it works. The config ends up in the first half of a flash page.
Tomorrow I do some more configuration, but power fails, and now my config from yesterday may be lost. That is not acceptable.

Sounds well thought out, though let me give you some words of warning! Some filesystems don’t cope well with that approach.

I’ve seen ext4 truncate files to zero bytes when they were being actively written to at the time of a power outage.

Do you have any insight why this happens?

Thank you all for the good points. At the end of the day, one can strive to recover from anything that is recoverable at the level of the library code, using the file system primitives appropriately :slight_smile: The responsibility for ensuring durability and avoiding corruption at the low level ultimately resides with the file system and disk controller.

If a particular disk controller or file system does dangerous stuff like truncating a file to zero bytes upon write or non-atomically updating a disk page, I am not sure how any approach could possibly ensure durability.

CubDB is designed around a simple principle that is logically understandable and easy to debug. My opinion is that more complicated approaches would introduce more possible failure modes while not solving those aforementioned degenerate cases.

The approach that CubDB follows is the same as CouchDB’s, so the latter’s experience can serve as an example. Note that earlier pre-1.0.0 versions of CubDB did get user reports of file corruption in corner cases, which were addressed: some users do have high-volume write use cases.

Stress testing as proposed by @Sebb would, in my opinion, be the best way to ensure that the approach is sound. Such a stress test would still depend on the particular file system and disk used, though.

6 Likes

Hi @lucaong,
Firstly, congrats on your project; CubDB looks good. However, I think there is an issue to be addressed.

My context is that I am advising a friend regarding the architecture of a Nerves-based project. The device will be accumulating data, and it will be very important that data is not lost and that it’s easy to perform backups. The data will not be relational and will be required to remain accessible for a long period (a few years) for legal reasons.

As this thread came up in the last few days, I considered CubDB for the persistence. However, after taking a closer look, there is one problem.

Currently CubDB uses :erlang.term_to_binary and :erlang.binary_to_term for serialization/deserialization.
Unfortunately, it is not guaranteed that data serialized in one OTP version can be correctly deserialized in another one.
RabbitMQ once encountered this situation.

How could this be addressed? My suggestion would be to expose a serializer behaviour in your library and allow alternatives. I’ve just created a GitHub issue for that.
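
For example, such a behaviour could look roughly like this (the module names and callbacks are hypothetical, not part of CubDB’s current API):

```elixir
defmodule CubDB.Serializer do
  # Hypothetical behaviour: any module implementing these callbacks could be
  # plugged in instead of the default term_to_binary/binary_to_term pair.
  @callback serialize(term()) :: binary()
  @callback deserialize(binary()) :: term()
end

defmodule DefaultSerializer do
  @behaviour CubDB.Serializer

  # Pin a minor version of the external term format for more predictable
  # cross-OTP behaviour.
  @impl true
  def serialize(term), do: :erlang.term_to_binary(term, minor_version: 2)

  @impl true
  def deserialize(binary), do: :erlang.binary_to_term(binary)
end
```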