CubDB, a pure-Elixir embedded key-value database

lucaong · June 25, 2019, 4:37pm

Hello Elixir and Nerves community,
I have been working for a while on an open-source embedded key-value database for Elixir, which I named CubDB. I use it for several IoT projects I run using Nerves, where I need to store large-ish amounts of data locally on the device.

I am already using it in production, but before I release version 1.0.0 I would love some feedback from the Nerves (and Elixir) community.

You can find the CubDB repository here
And here the API documentation

A quick basic usage example:

{:ok, db} = CubDB.start_link("my/data/directory")

CubDB.put(db, :foo, "some value")
#=> :ok

CubDB.get(db, :foo)
#=> "some value"

CubDB.delete(db, :foo)
#=> :ok

CubDB.put(db, {:keys, "can", :be, 'anything'}, ["and", :values, 'too'])
#=> :ok

# Check out docs for advanced usage with select/3 and get_and_update_multi/4

I know that Elixir comes with ETS/DETS and Mnesia, but:

ETS is not persistent across reboots
DETS does not offer sorted collections, and is thus not ideal when one needs to select arbitrary ranges of keys, iterate in order, etc.
Mnesia is great, but on embedded projects I don’t need distribution
Sometimes I really just need a “persisted map”, sorted by key
It’s nice to be able to backup the whole DB by just copying one file

The use case I am primarily targeting is the one described in this blog post by the Nerves team: Using Ecto and Sqlite3 with Nerves - Embedded Elixir

CubDB is somewhat similar to SQLite in that it stores the data locally in a single file, but it is written in Elixir, is key-value and schema-less, and both keys and values can be any Elixir (or Erlang) terms, so no serialization/de-serialization is needed.

The data structure it uses is an append-only immutable B-tree, inspired by CouchDB: that guarantees robustness to data corruption (no in-place mutation), and enables features like concurrent read operations that do not block writes, and atomic transactions.
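
To make the append-only idea concrete, here is a minimal in-memory sketch. It is not CubDB's implementation: it uses a plain list instead of a B-tree on disk, and the module name AppendOnlySketch and all of its functions are hypothetical. It only illustrates that writes add new entries without touching old ones, that a reader holding an older snapshot keeps seeing consistent data, and that compaction drops superseded entries.

```elixir
defmodule AppendOnlySketch do
  # Hypothetical illustration, not CubDB's code: the "file" is a list of
  # entries, newest first. Existing entries are never modified.

  def new, do: []

  # A write prepends a new entry; older entries stay untouched.
  def put(log, key, value), do: [{:put, key, value} | log]

  # A delete is also just a new entry (a tombstone).
  def delete(log, key), do: [{:delete, key} | log]

  # A read returns the newest entry for the key, if any.
  def get(log, key) do
    Enum.find_value(log, fn
      {:put, ^key, value} -> {:ok, value}
      {:delete, ^key} -> :error
      _ -> nil
    end) || :error
  end

  # Compaction keeps only the newest live entry for each key.
  def compact(log) do
    {live, _seen} =
      Enum.reduce(log, {[], MapSet.new()}, fn entry, {acc, seen} ->
        key = elem(entry, 1)

        cond do
          MapSet.member?(seen, key) -> {acc, seen}
          match?({:put, _, _}, entry) -> {[entry | acc], MapSet.put(seen, key)}
          true -> {acc, MapSet.put(seen, key)}
        end
      end)

    Enum.reverse(live)
  end
end

snapshot = AppendOnlySketch.new() |> AppendOnlySketch.put(:a, 1)
log = snapshot |> AppendOnlySketch.put(:a, 2) |> AppendOnlySketch.delete(:a)

AppendOnlySketch.get(snapshot, :a)
#=> {:ok, 1}

AppendOnlySketch.get(log, :a)
#=> :error
```

The older snapshot is still readable after later writes, which is the property that lets readers proceed without blocking writers.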

It was already a lot of fun for me to develop it, but I would love to hear your constructive feedback.

What do you think about it? Do you have a use-case where this could be useful? Do you have feedback about the API?

Thanks in advance

fuelen · June 25, 2019, 6:25pm

Do you have feedback about the API?

I think using a function/1 or an mfa for querying data would be more flexible than a custom DSL that just redirects some functions to Stream

tristan · June 25, 2019, 6:46pm

Did you compare with pure Erlang k/v stores?

I think the most recent is https://github.com/martinsumner/leveled

But there is at least also https://github.com/basho/bitcask and https://github.com/refuge/cowdb – maybe others as well.

lucaong · June 25, 2019, 7:31pm

Nice, I didn’t know those projects!

At a glance, CubDB works very similarly to CowDB and bitcask in that they all use an append-only copy-on-write data structure. I will look into them and hopefully learn more about their strategy to deal with the tricky bits, like compaction.

The main difference would be, I guess, that CubDB is written in Elixir (not implying that it’s a huge advantage, as it’s trivial to use Erlang libraries from Elixir).

dimitarvp · June 25, 2019, 7:36pm

Nice. I’ve been looking for sqlite alternative.

Do you plan to introduce a query API?
Do you plan to make a strict data typed variant, more akin to Postgres?
Would you consider adding a command to it that compacts it (and thus destroys the history of changes, but oh well)?

lucaong · June 25, 2019, 7:42pm

Good point.

I initially wanted select/3 to just return a lazy Enumerable that users could use their own Stream functions on. The reason why I use that specific DSL in select/3 is that the lazy Enumerable references a specific point in the DB data file, and I need to know when no reader references it anymore, so a compaction operation can safely “garbage collect”. By “proxying” the stream operations, I know when they have completed, and can “check out” the reader.
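
The proxying idea can be sketched roughly as follows. This is not CubDB's code: PipeSketch, run/3 and the on_done callback are hypothetical names, and only filter, map and take are handled. The point is that the library, not the caller, runs the stream to completion, so it knows exactly when the reader is finished.

```elixir
defmodule PipeSketch do
  # Hypothetical helper, not CubDB's code: applies a keyword list of
  # pipeline steps to an enumerable of {key, value} entries, then calls
  # on_done once the whole pipeline has been consumed (or has raised).
  def run(entries, pipe, on_done \\ fn -> :ok end) do
    try do
      pipe
      |> Enum.reduce(entries, &apply_step/2)
      |> Enum.to_list()
    after
      # Here a real database could "check out" the reader.
      on_done.()
    end
  end

  defp apply_step({:filter, fun}, stream), do: Stream.filter(stream, fun)
  defp apply_step({:map, fun}, stream), do: Stream.map(stream, fun)
  defp apply_step({:take, n}, stream), do: Stream.take(stream, n)
end

entries = [{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}]

PipeSketch.run(entries,
  filter: fn {key, _value} -> rem(key, 2) == 0 end,
  map: fn {_key, value} -> value end
)
#=> ["b", "d"]
```

If a lazy Enumerable were handed to the caller instead, the library would have no reliable signal for when (or whether) it gets consumed.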

Maybe there’s a better way though. I could possibly just accept a single arbitrary reduce function (that can do all that map, filter, take, … can do), but I felt it’s easier to operate with separate pipeline functions.

lucaong · June 25, 2019, 7:51pm

Hi @dimitarvp , thanks for your kind words!

The compact/1 function is already available. Even more conveniently, you can opt-in to auto-compaction either at startup or later with set_auto_compact/2.

Regarding the “strict” version, I didn’t plan it yet, but why not? It should be reasonably easy to implement it as a layer on top of CubDB. I was kinda thinking of building an SQL layer as a separate library, mostly for fun, but have no idea when I will actually get to it.

About the query API, at the moment select/3 is the query workhorse. It supports efficiently selecting ranges of keys, filtering results, mapping, iterating, reducing, etc. That said, as @fuelen commented, it might be possible to improve its API to make it even simpler to use.

AndyL · June 25, 2019, 8:18pm

@lucaong - very nice!

With select/3 you can specify a key-range (min-key, max-key). It looks like a key can be any Erlang term. (is that right?) How would min/max ranges work with different key types (numbers, strings, lists, tuples, …) ??

Is there any way to use pattern-matching to select keys?

dimitarvp · June 25, 2019, 8:22pm

Let me give you some food for thought. Don’t take it as a wish list, I am just sharing.

There is currently a market for sqlite-like storage engines. Sqlite3 is an amazing little DB but it comes with quite a hefty load of legacy decisions – like a lack of proper timestamp type, lack of boolean type, lack of enums, accepting arbitrarily typed data in integer columns etc.

A lot of people out there use Sqlite3. It’s deployed on trillions of devices, literally. Yet it lacks some very common sense features like strict typing.

IMO the BEAM VM (and thus Erlang, Elixir, LFE, Alpaca etc.) is uniquely positioned. We basically don’t need Redis and Memcached due to ETS, DETS, Mnesia, Erlang’s persistent_term and several others. If we complement that with a self-sufficient single-file storage engine then the BEAM ecosystem becomes a de facto standard for a lot of development scenarios.

Again, don’t take this as a list of demands. It’s just my opinion that the BEAM ecosystem seriously needs a good Sqlite3-like experience.

lucaong · June 25, 2019, 8:35pm

Thanks @AndyL!

Yes, keys (and values) can be any Elixir or Erlang term. One neat thing about Erlang and Elixir is that order of arbitrary terms is well defined. Try for example :a > {1, "something"}

Even nicer, the ordering actually makes sense for tuples, because elements are compared in order. That can be used to good effect in CubDB. Imagine you want to store different “tables” in the same database. You could structure your keys as {:table_name, id} and, if you want to select only entries in the :users table, use select/3 with min_key: {:users, 0}, max_key: {{:usert, 0}, :excluded}. Because :usert is the lexical successor of :users, that would select all entries in the :users table, and no other entry.
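
The ordering can be checked in plain Elixir, without CubDB. The sketch below uses a hypothetical KeyRangeSketch module and an in-memory list of keys to mimic a range with an inclusive min_key and an exclusive max_key; it assumes non-negative ids, as in the {:users, 0} example.

```elixir
defmodule KeyRangeSketch do
  # Hypothetical helper: true when min_key <= key < max_key in term order.
  def in_range?(key, min_key, max_key), do: key >= min_key and key < max_key

  # Sorts keys in term order and keeps those inside the range.
  def select(keys, min_key, max_key) do
    keys
    |> Enum.sort()
    |> Enum.filter(&in_range?(&1, min_key, max_key))
  end
end

# Terms of different types have a defined order:
# numbers, then atoms, then tuples, then lists, then binaries.
Enum.sort(["abc", [1], {1, "something"}, :a, 1])
#=> [1, :a, {1, "something"}, [1], "abc"]

keys = [{:videos, 7}, {:users, 42}, {:posts, 1}, {:users, 1}, {:usert, 0}]

KeyRangeSketch.select(keys, {:users, 0}, {:usert, 0})
#=> [{:users, 1}, {:users, 42}]
```

Note that tuples are compared by size before their elements, so this trick relies on all keys of a "table" having the same tuple size.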

That, together with serialization of arbitrary terms, is a great thing that the Erlang VM offers and that CubDB leverages

AndyL · June 25, 2019, 8:50pm

OMG didn’t know that!

Any way to pattern match on keys? Or to fetch a [list] of keys?

lucaong · June 25, 2019, 9:30pm

Yes, typically you would select a range of keys with min/max_key, and use the pipe: option for any other filtering, mapping, pattern matching, etc.

Example, if you want to match keys that are maps containing a key :foo with number value:

CubDB.select(db, pipe: [
  filter: fn
    {%{foo: n}, _value} when is_number(n) ->
      true
    _ ->
      false
  end
])

When something can be expressed both with min_key/max_key and with filter, you should prefer the former, as it avoids loading unnecessary entries from disk in the first place. Of course, you can combine the two (and use more than just filter: map, take, etc. are available).
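
The difference between a key range and a filter can be sketched with an in-memory sorted tree from Erlang's :gb_trees module, used here only as a stand-in for the on-disk B-tree. RangeScanSketch is a hypothetical module, not part of CubDB. The range scan starts at min_key and stops at the first key past the range, whereas a filter would have to visit every entry.

```elixir
defmodule RangeScanSketch do
  # Collects entries with min_key <= key < max_key from a :gb_trees tree,
  # starting at min_key and stopping at the first key beyond the range.
  def range(tree, min_key, max_key) do
    iter = :gb_trees.iterator_from(min_key, tree)
    collect(:gb_trees.next(iter), max_key, [])
  end

  defp collect({key, value, iter}, max_key, acc) when key < max_key do
    collect(:gb_trees.next(iter), max_key, [{key, value} | acc])
  end

  # Either the iterator is exhausted (:none) or the key is past the range.
  defp collect(_none_or_past_range, _max_key, acc), do: Enum.reverse(acc)
end

tree =
  Enum.reduce(1..1000, :gb_trees.empty(), fn key, acc ->
    :gb_trees.insert(key, key * key, acc)
  end)

# Touches only the entries from key 10 up to the first key >= 13.
RangeScanSketch.range(tree, 10, 13)
#=> [{10, 100}, {11, 121}, {12, 144}]

# Same result, but every one of the 1000 entries is visited.
tree
|> :gb_trees.to_list()
|> Enum.filter(fn {key, _value} -> key >= 10 and key < 13 end)
#=> [{10, 100}, {11, 121}, {12, 144}]
```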

bryanhuntesl · June 25, 2019, 10:45pm

The BEAM already has a sqlite3 library - it’s used by ielixir for example - IElixir/mix.exs at 564d3796f12dc37d4d84d9e4d162af5e9f0a5e5e · pprzetacznik/IElixir · GitHub

dimitarvp · June 25, 2019, 10:47pm

Well, I meant embedded inside Elixir itself with zero native code. Sorry that I wasn’t clear.

bryanhuntesl · June 25, 2019, 10:48pm

No worries - it’s an important distinction

lucaong · June 27, 2019, 4:35pm

@tristan I had a more detailed look at the Erlang k/v stores you posted. As I understand, both leveled and bitcask are more focused on providing a k/v engine for higher-level databases.

CowDB is instead very similar to CubDB. I just started delving through the source, and would love to get in touch with the maintainers to exchange ideas. It looks really nice, and I like its approach to transactions.

OvermindDL1 · June 27, 2019, 4:39pm

Eh not really? I’ve been using leveldb for years standalone. It’s a great embeddable KV system with a decent amount of features for a KV system.

lucaong · June 27, 2019, 4:47pm

Yes, leveldb is definitely a good standalone k/v store. I was talking about https://github.com/martinsumner/leveled , which seems to focus on providing an alternative Riak backend. But again, I just found out about the project, so I might be mistaken.

tristan · June 27, 2019, 4:53pm

If you are on or join the Erlang Slack you should ping Benoit (handle benoitc).

elcritch · June 27, 2019, 5:33pm

Excellent project! I haven’t looked at the code (yet), but I also do IoT things and have struggled to find a KV store that works well for embedded. Currently I’m using erlang-rocksdb, as SQLite3 doesn’t offer compression support and writing SQL tables when you only ever need/want 1 instance of some config is annoying.

Personally, I caught the Datomic bug a while back. The idea of defining strict “columns” but then being able to mix and match them with any given entity is fascinating and could be handy in IoT cases where SQL tables have an impedance mismatch. Plus the Datalog-like API would seem to match up with Elixir pretty well.