CubDB, a pure-Elixir embedded key-value database

I will set the hardware up and give it a try if you are willing to help implement the test cases.

This assumes that you can’t corrupt data already written when writing new data. Are you sure that’s the case? I could imagine something like this:

D - data already written
F - “empty”
E - new data

(1)
--- flash page start ---
DDDDDDDDDDDDDDDDDD
DDDDDDDDDDDDDDDDDD
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
--- flash page end ---

Now we append to the file; the flash controller reads the beginning of the page into RAM and erases the page (preparing it for the write), resulting in

(2)
--- flash page start ---
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFF
--- flash page end ---

and then writes back the old data together with the new data:

(3)
--- flash page start ---
DDDDDDDDDDDDDDDDDD
DDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFFFF
--- flash page end ---

If the power fails after (2), the already-written data is lost and there is a problem.
I have no knowledge of how flash controllers really work, but one could build a controller this way.

@Sebb it’s also worth noting that the test suite includes some tests that truncate the end of the data file in random ways, simulating a partial write because of a power failure, and assert that the database can still be opened and traversed.

As long as the corruption is confined to the last page, CubDB can recover from it. Since the file is append-only, only the last page can be affected by a corruption.
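
For illustration, here is a minimal sketch of such a truncation test, using CubDB’s documented API (CubDB.start_link/1, CubDB.put/3, CubDB.get/2). The *.cub file extension and the single-data-file layout are assumptions about the on-disk format:

```elixir
defmodule TruncationRecoveryTest do
  use ExUnit.Case

  test "database survives a truncated tail" do
    dir = "tmp/truncation_test"
    File.rm_rf!(dir)

    # Write a batch of entries, then stop the database cleanly.
    {:ok, db} = CubDB.start_link(data_dir: dir)
    for i <- 1..100, do: :ok = CubDB.put(db, {:key, i}, i)
    GenServer.stop(db)

    # Simulate a partial append caused by a power failure: chop a
    # random number of bytes off the end of the data file.
    [file] = Path.wildcard(Path.join(dir, "*.cub"))
    data = File.read!(file)
    keep = byte_size(data) - :rand.uniform(64)
    File.write!(file, binary_part(data, 0, keep))

    # The database should still open, and entries committed before
    # the damaged tail should still be readable.
    {:ok, db} = CubDB.start_link(data_dir: dir)
    assert CubDB.get(db, {:key, 1}) == 1
  end
end
```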

Flash drives typically cannot erase less than a full block, while they can write a single page (smaller than a block), so it would not make sense for a disk controller to erase before an append write.

The details of how appending to a file works at the low level are down to the specific disk controller. Some will append writes to the next page, leaving “holes” if the previous page is incomplete. Some might work differently, but they should still write a page atomically. In the end, though, at this level the responsibility for ensuring the sanity of written pages resides with the disk controller, not with the application code.

For my use case it is OK to lose configuration done directly before a power failure.
But it is not OK to lose configuration that once worked.
Say I configure the device today, and it works. The config ends up in the first half of a flash page.
Tomorrow I do some more configuration, but the power fails, and now my config from yesterday may be lost. That is not acceptable.

Sounds well thought out, though let me give you a word of warning! Some filesystems don’t cope well with that approach.

I’ve seen ext4 truncate files to zero bytes when they were being actively written to at the time of a power outage.

Do you have any insight why this happens?

Thank you all for the good points. At the end of the day, one can strive to recover from anything that is recoverable at the level of the library code, using the file system primitives appropriately :slight_smile: The responsibility for ensuring durability and avoiding corruption at the low level ultimately resides with the file system and disk controller.

If a particular disk controller or file system does dangerous things like truncating a file to zero bytes upon write, or updating a disk page non-atomically, I am not sure how any approach could possibly ensure durability.

CubDB is designed around a simple principle that is easy to reason about and to debug. My opinion is that more complicated approaches would introduce more possible failure modes while not solving the aforementioned degenerate cases.

The approach that CubDB follows is the same as CouchDB’s, so the latter’s experience can serve as an example. Note that earlier pre-1.0.0 versions of CubDB did get user reports of file corruption in corner cases, which were addressed: some users do have high-volume write use cases.

Stress testing as proposed by @Sebb would, in my opinion, be the best way to ensure that the approach is sound. Such a stress test would still depend on the particular file system and disk used, though.

6 Likes

Hi @lucaong,
Firstly congrats on your project - CubDB looks good. However, I think there is an issue to be addressed.

My context is that I am advising a friend on the architecture of a Nerves-based project. The device will be accumulating data, and it is very important that data is not lost and that backups are easy to perform. The data will not be relational and will need to remain accessible for a long period (a few years) for legal reasons.

As this thread came up in the last few days, I considered CubDB for persistence. However, after taking a closer look, there is one problem.

Currently CubDB uses :erlang.term_to_binary and :erlang.binary_to_term for serialization/deserialization.
Unfortunately, it is not guaranteed that data serialized in one OTP version can be correctly deserialized in another one.
RabbitMQ once encountered this situation.

How could this be addressed? My suggestion would be to expose a serializer behaviour in your library and allow alternatives. I have just created a GitHub issue for that.
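
Something along these lines, for example (a sketch only; CubDB.Serializer and the callback names are hypothetical, not part of CubDB’s actual API):

```elixir
# A hypothetical serializer behaviour: none of these modules exist in
# CubDB today, this only sketches the suggestion from the issue.
defmodule CubDB.Serializer do
  @callback serialize(term()) :: {:ok, binary()} | {:error, term()}
  @callback deserialize(binary()) :: {:ok, term()} | {:error, term()}
end

# The default implementation would delegate to the Erlang term format.
defmodule CubDB.Serializer.ErlangTerm do
  @behaviour CubDB.Serializer

  @impl true
  def serialize(term), do: {:ok, :erlang.term_to_binary(term)}

  @impl true
  def deserialize(binary), do: {:ok, :erlang.binary_to_term(binary)}
end
```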

It is fine if the encoded value is then decoded using binary_to_term, but it is not fine if the binary is used as an identifier.

Would this problem concern CubDB after all?

You don’t need to always update an embedded system to the latest OTP version immediately.
It seems like there was a solution in the final version, right?
I don’t think OTP will release something that breaks term_to_binary forever.

A custom serializer could be done, but it would increase complexity, and you will not find anything that can compete with term_to_binary in terms of robustness, speed, and ease of use.

1 Like

Hi @mgibowski ,
thanks for the kind words! :slight_smile:

Your concern is definitely understandable, but term_to_binary should not break backward compatibility. The issue you linked was with an OTP release candidate, and it was fixed before the final release. If term_to_binary were to break backward compatibility, a lot of other things beyond CubDB would break.

Ultimately, the serialization format provided by term_to_binary is the most solid, lowest-cost, and probably most future-proof serialization strategy available in Elixir/Erlang. I honestly feel that a custom serialization mechanism would introduce more possibilities for bugs and trouble than leveraging the battle-tested default Erlang term serialization.

3 Likes

Just to clarify, my concern was based on this comment:

Note that you still have to do something about [y]our usage since it is not safe (you are assuming that the same Erlang term will be encoded to exactly the same external format in all versions of Erlang/OTP, which is nothing we can or want to guarantee).

But who knows, that was 4 years ago. Maybe now it is no longer considered “not safe”? Or maybe the possibility of breaking backward compatibility is so low that it’s worth taking the risk…

Thanks @mgibowski ,
I think though that you misunderstood the comment you quoted: it means that the exact binary representation of terms could change between OTP releases, not that future releases might be unable to deserialize a binary created in older releases.

As a matter of fact, in the issue you posted, previous OTP releases were already “prepared” to be able to deserialize the new format that was to be introduced in the later release.

The issue with RabbitMQ was that it relied on hashing a binary created with term_to_binary and using the hash as an identifier. This can break, because if the binary format changes across OTP releases (even without breaking binary_to_term), the hash would change too.

CubDB uses term_to_binary only for its intended purpose: to serialize a term so that it can later be deserialized using binary_to_term. This should be safe across OTP releases.
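
To make the distinction concrete (a small sketch; the map is just an arbitrary example term):

```elixir
term = %{user: "ada", roles: [:admin, :writer]}

# Safe, and the only thing CubDB does: encode, store, and later
# decode. Newer OTP releases can deserialize binaries produced by
# older ones.
binary = :erlang.term_to_binary(term)
^term = :erlang.binary_to_term(binary)

# Fragile, and what bit RabbitMQ: deriving an identifier from the
# encoded bytes. The same term may encode to different bytes in a
# different OTP release (without breaking binary_to_term), so the
# hash is not stable across upgrades.
unstable_id = :crypto.hash(:sha256, :erlang.term_to_binary(term))
```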

6 Likes

Thanks @lucaong for the clarification. Yes, I can see now that you are right. I didn’t pay enough attention to it. That’s awesome :slight_smile:

1 Like

Thank you for reporting your doubt @mgibowski !

Why? (I’m new to Elixir, so I’m asking out of ignorance.)

Because it doesn’t have strict type enforcement. You can declare a column as an integer, but sqlite3 will happily store text inside of it without question.
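
For example (a sketch using the exqlite library’s low-level Exqlite.Sqlite3 API, which is an assumption on my part; the same behavior is visible in the sqlite3 CLI):

```elixir
{:ok, conn} = Exqlite.Sqlite3.open(":memory:")
:ok = Exqlite.Sqlite3.execute(conn, "CREATE TABLE t (n INTEGER)")

# The column is declared INTEGER, yet sqlite3 happily stores text
# that cannot be coerced to a number.
:ok = Exqlite.Sqlite3.execute(conn, "INSERT INTO t VALUES ('not a number')")

{:ok, stmt} = Exqlite.Sqlite3.prepare(conn, "SELECT n FROM t")
{:row, ["not a number"]} = Exqlite.Sqlite3.step(conn, stmt)
```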

Understood; I don’t know that I’ve ever been bitten by that (I tend to have layers between the data and the DML, which helps with that), but as a lover of types, I get it.

Ah, don’t get me wrong. After I get back in shape, I am going back to my sqlite3 Elixir+Rust library and I’ll spend exorbitant amounts of time and effort to enforce strict typing long before each request hits sqlite3.

But I still find sqlite3’s design around types concerning.

I’d seriously pay something like $500 for a lifetime license for a 5-machine usage of an sqlite-like DB with strict typing – basically embedded PostgreSQL, I suppose.

3 Likes

Hi @lucaong

Thanks for writing CubDB. It looks interesting. What happens if you start_link to the same db twice? Does the library serialize access to multiple handles to avoid corruption?

Hi @macrocreation , thanks a lot for the kind words!

You should not start more than a single CubDB process on the same database, because writers would conflict. In fact, in most cases, if you attempt to do so, CubDB will detect it and give you an error.

You can safely call functions on a single CubDB process concurrently from other processes though. In particular, readers do not block each other and do not block writers (and read operations spawn separate reader processes, so multiple readers can run concurrently).
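
In practice, you would start a single named CubDB process (for example in your supervision tree) and have every caller use it. A small sketch, assuming the :name option, which CubDB accepts like other GenServer-based processes:

```elixir
# One CubDB process in the supervision tree, shared by all callers.
children = [
  {CubDB, [data_dir: "data/db", name: MyApp.DB]}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Concurrent reads from many processes: each read runs in its own
# reader process, so these tasks block neither one another nor a
# concurrent writer.
1..10
|> Enum.map(fn i -> Task.async(fn -> CubDB.get(MyApp.DB, {:key, i}) end) end)
|> Task.await_many()
```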

4 Likes