CubDB, a pure-Elixir embedded key-value database

It is fine if the encoded value is then decoded using binary_to_term, but is not fine if the binary is used as an identifier.

would this problem concern CubDB after all?

You don’t need to always update an embedded system to the latest version immediately.
Seems like there was a solution in the final version, right?
I don’t think OTP will ever release something that breaks term_to_binary.

A custom serializer could be done, but it would increase complexity, and you won’t find anything that can compete with term_to_binary in terms of robustness, speed, and ease of use.


Hi @mgibowski ,
thanks for the kind words! :slight_smile:

Your concern is definitely understandable, but term_to_binary should not break backward compatibility. The issue you linked was with an OTP release candidate, and it was fixed before the final release. If term_to_binary broke backward compatibility, a lot of other things beyond CubDB would break.

Ultimately, the serialization format provided by term_to_binary is the most solid, low-cost, and probably most future-proof serialization strategy available in Elixir/Erlang. I honestly feel that a custom serialization mechanism would introduce more possibilities for bugs and trouble than relying on the battle-tested default Erlang term serialization.


Just to clarify, my concern was based on this comment:

Note that you still have to do something about [y]our usage since it is not safe (you are assuming that the same Erlang term will be encoded to exactly the same external format in all versions of Erlang/OTP which is nothing we can or want to guarantee.

But who knows, that was 4 years ago. Maybe now it is no longer considered “not safe”? Or is the possibility of breaking backward compatibility so low that it’s worth taking the risk…

Thanks @mgibowski ,
I think, though, that you misunderstood the comment you quoted: it means that the exact binary representation of terms could change between OTP releases, not that future releases might be unable to deserialize a binary created by older releases.

As a matter of fact, in the issue you posted, previous OTP releases were already “prepared” to be able to deserialize the new format that was to be introduced in the later release.

The issue with RabbitMQ was that it hashed a binary created with term_to_binary and used the hash as an identifier. This can break: if the binary format changes across OTP releases (even without breaking binary_to_term), the hash changes too.

CubDB uses term_to_binary only for its intended purpose: to serialize a term so that it can later be deserialized using binary_to_term. This should be safe across OTP releases.
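To make the distinction concrete, here is a small sketch using only the Erlang builtins (the term and hash are illustrative): the round-trip through binary_to_term is guaranteed, but the exact bytes of the binary are not.

```elixir
term = %{user: "alice", id: 42}

# Safe: serialize, store, and later deserialize. binary_to_term is
# guaranteed to decode binaries produced by older OTP releases.
bin = :erlang.term_to_binary(term)
^term = :erlang.binary_to_term(bin)

# Unsafe: treating the bytes (or a hash of them) as a stable identifier,
# because the exact encoding may change between OTP releases even though
# decoding keeps working.
unstable_id = :crypto.hash(:sha256, bin)
```

The first pattern is what CubDB relies on; the second is what bit RabbitMQ.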


Thanks @lucaong for the clarification. Yes, I can see now that you are right. I didn’t pay enough attention to it. That’s awesome :slight_smile:


Thank you for reporting your doubt @mgibowski !

Why? (I’m new to Elixir, so asking out of ignorance.)

Because it doesn’t have strict type enforcement. You can declare a column as an integer, but sqlite3 will happily store text in it without complaint.

Understood; I don’t know that I’ve ever been bitten by that (I tend to have layers between the data and the DML, which helps), but as a lover of types, I get it.

Ah, don’t get me wrong. After I get back in shape, I am going back to my sqlite3 Elixir+Rust library and I’ll spend exorbitant amounts of time and effort to enforce strict typing long before each request hits sqlite3.

But I still find sqlite3’s design around types concerning.

I’d seriously pay something like $500 for a lifetime 5-machine license for an SQLite-like DB with strict typing – basically an embedded PostgreSQL, I suppose.


Hi @lucaong

Thanks for writing CubDB. Looks interesting. What happens if you start_link to the same db twice? Does the library serialize access across multiple handles to avoid corruption?

Hi @macrocreation , thanks a lot for the kind words!

You should not start more than a single CubDB process on the same database, because writers would conflict. In fact, in most cases, if you attempt to do so, CubDB will detect it and give you an error.

You can safely call functions on a single CubDB process concurrently from other processes though. In particular, readers do not block each other and do not block writers (and read operations spawn separate reader processes, so multiple readers can run concurrently).
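A minimal sketch of that usage pattern (the data directory and key are illustrative): a single CubDB process, called concurrently from many client processes.

```elixir
# One CubDB process manages the database...
{:ok, db} = CubDB.start_link(data_dir: "data/mydb")

CubDB.put(db, :greeting, "hello")

# ...while any number of client processes call it concurrently.
# Readers do not block each other, nor do they block writers.
tasks =
  for _ <- 1..10 do
    Task.async(fn -> CubDB.get(db, :greeting) end)
  end

results = Enum.map(tasks, &Task.await/1)
```

Each entry in `results` is `"hello"`; the serialization of writes happens inside the single CubDB process, so callers need no extra locking.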


hey @lucaong,

thanks a lot for the very nice lib. I am using it for a project right now.

After reading this how-to, I wonder if I can efficiently empty a collection without having to find all the existing keys first?

Say I have a devices collection, which potentially contains millions of entries. I am looking for something like:

```elixir
CubDB.delete_range(cubdb, min_key: {:devices, 1}, max_key: {:devices, 1000000})
```

What do you think?

Hi @tduccuong ,
Thanks for the kind words :slight_smile:

Having to efficiently remove a whole big collection is a possibly good reason to put the collection in its own CubDB database: it is then possible to use CubDB.clear to remove everything very efficiently (like a TRUNCATE TABLE in SQL).
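For instance (the data directory name is illustrative), with the devices collection in its own database:

```elixir
# A CubDB process dedicated to the devices collection
{:ok, devices_db} = CubDB.start_link(data_dir: "data/devices")

# Emptying the whole collection is then a single, efficient call,
# regardless of how many entries it contains:
CubDB.clear(devices_db)
```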

Otherwise, one can gather the keys to delete with a CubDB.select, then call CubDB.delete_multi passing all the keys (possibly in batches if the collection is really big). This still removes each entry individually though, much like a DELETE in SQL.
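A sketch of the select-then-delete approach, assuming the `{:devices, id}` keys and key range from the example above, and deleting in batches to bound memory use (the batch size is arbitrary):

```elixir
# Lazily stream the keys in the collection's range, then delete them
# in batches of 1000 so we never hold millions of keys in memory.
CubDB.select(cubdb, min_key: {:devices, 1}, max_key: {:devices, 1_000_000})
|> Stream.map(fn {key, _value} -> key end)
|> Stream.chunk_every(1000)
|> Enum.each(fn keys -> CubDB.delete_multi(cubdb, keys) end)
```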

Each CubDB database is a b-tree, much like each individual table or index in Postgres. The trade-off is that transactions only work on a single database, so you should decide whether to use a separate database per collection or a single one based on whether you need efficient “truncate” semantics, versus whether you need atomic transactions involving this collection together with other ones.

Sorry @macrocreation , I missed your message. EDIT: apparently I did not miss it after all; I replied, but did not see my reply…

For the sake of those who read this thread, the answer is that, as specified in the docs, at most a single CubDB process should be started at any time on a specific data directory. It is then possible to call the CubDB functions from several processes concurrently, but there should be only one CubDB process managing each specific data folder.

In fact, start_link will fail when called twice on the same data directory within the same VM (ideally it would fail even when called twice on the same data directory across VMs, but as far as I know that is much harder to do in a cross-platform way).

Thanks a lot for your quick reply @lucaong. Given the flexibility of managing one DB file per process, I think it makes sense to have each DB file act as a collection.

Btw, I wanted to ask if value indexing is on your feature roadmap? Something like finding a range of devices based on user_id. I can simulate it now at the application level with the existing CubDB primitives, e.g. by having another collection keyed by {user_id, device_id}, and updating that collection whenever I CRUD the devices collection. But it would be nice if that were done at the DB level.

Thanks again for your advice!

Hi @tduccuong ,
Yes, as you said, the recommended way to create a secondary index is to create a second collection in the same database, keyed by the index field, and using “primary keys” as values.

The main collection and its index can be updated together in atomic transactions (I am typing this from a phone, so I will definitely make some mistakes, but hopefully it gives you the gist of it):

```elixir
# This implements:
#  - a collection of users by ID, with keys like {:users, id} and %User{} structs as values
#  - a collection, implementing the secondary index, of users by name,
#    with keys like {:users_name_idx, name} and lists of user IDs as values

def insert_user(db, user) do
  CubDB.transaction(db, fn tx ->
    tx = CubDB.Tx.put(tx, {:users, user.id}, user)
    ids = CubDB.Tx.get(tx, {:users_name_idx, user.name}, [])
    tx = CubDB.Tx.put(tx, {:users_name_idx, user.name}, [user.id | ids])
    {:commit, tx, user}
  end)
end

def get_user_by_id(db, id) do
  CubDB.get(db, {:users, id})
end

def get_users_by_name(db, name) do
  CubDB.with_snapshot(db, fn snap ->
    # Default to [], so a name with no index entry returns an empty list
    ids = CubDB.Snapshot.get(snap, {:users_name_idx, name}, [])
    keys = Enum.map(ids, fn id -> {:users, id} end)
    CubDB.Snapshot.get_multi(snap, keys) |> Map.values()
  end)
end
```

At the moment there is no plan to implement secondary indexes as a first-class concept in CubDB. That’s because CubDB strives to be, first of all, a simple, minimal, and versatile building block for higher-level products.

That said, I would be lying if I said that I am not hoping to find the time to build a higher level library on top of it, providing facilities like “tables” and indices :slight_smile: but if you really depend on that level of abstraction, it’s probably better to use an SQL solution like SQLite.


By the way, for those interested in CubDB, the thread in #your-libraries-os-mentoring:libraries is the most up-to-date one, and where announcements happen:

There is also a #cubdb tag that can be used for questions or topics related to it.
