CubDB - A pure-Elixir embedded database

lucaong · June 28, 2022, 10:38am

CubDB is an embedded database written in pure Elixir, designed for robustness and minimal use of resources. It strives to be as developer-friendly as possible.

Some of you, especially in the Nerves community, have already known and used CubDB for a while, and might have read the original thread about CubDB on the Nerves Forum.

Version v2.0.0 is now published on Hex, with big improvements and exciting new features, so I thought it was a good time for a proper library post here.

Since this is a long-ish post, here’s a table of contents:

Why CubDB?
How does it compare to ETS, DETS, Mnesia, SQLite, etc.?
What’s new in v2.0.0?
How does it look in code?

Why CubDB?

CubDB is an embedded database written in Elixir. It runs inside your application, as opposed to on a separate server, and saves its data in a local file. In this respect, it is similar to SQLite, but offers an idiomatic Elixir API.

It is NOT a replacement for Postgres in multi-instance web applications, nor a distributed database, but rather a solution for cases where a lightweight but robust local data store is needed.

Typical use cases are applications running on embedded devices (CubDB runs well on Nerves), desktop applications, or applications running locally. CubDB is often used to persistently store data and configuration, as a data logging or time series store, or to persist state of an application.

Some of the features of CubDB are:

Basic key/value access, and selection of sorted ranges of entries.
Both keys and values can be any Elixir (or Erlang) term.
ACID transactions to perform atomic changes.
Multi version concurrency control (MVCC), allowing concurrent reads that do not block nor are blocked by writes.
Unexpected shutdowns or crashes won’t corrupt the database, nor break atomicity of transactions.
Manual or automatic compaction to reclaim disk space.
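
As a rough illustration of the last point, here is a minimal sketch of switching between automatic and manual compaction. It assumes the :cubdb package (v2.x) is available as a dependency, and the data directory is just a placeholder:

```elixir
# Assumption: :cubdb v2.x is a dependency; "tmp/compaction_example" is a placeholder path
{:ok, db} = CubDB.start_link(data_dir: "tmp/compaction_example")

# Turn automatic compaction off, for example to decide yourself when it runs
:ok = CubDB.set_auto_compact(db, false)

CubDB.put(db, :counter, 1)
CubDB.put(db, :counter, 2) # overwrites the value; disk space is reclaimed by compaction

# Trigger a compaction manually. It returns an error tuple if a compaction
# is already running, so the result is not pattern-matched here.
CubDB.compact(db)

# Turn automatic compaction back on
:ok = CubDB.set_auto_compact(db, true)
```

Reads and writes keep working normally while a compaction is in progress.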

How does it compare to ETS, DETS, Mnesia, SQLite, etc.?

The FAQ section in the documentation has a chapter about this.

What’s new in v2.0.0?

Head to the CHANGELOG for more information, but in short:

Vastly improved concurrency
Improved CubDB.select function, which now returns lazy streams, allowing any custom composition of Stream and Enum functions
Atomic transactions with arbitrary operations with CubDB.transaction and the CubDB.Tx module
Zero-cost immutable snapshots with CubDB.with_snapshot and the CubDB.Snapshot module
CubDB.back_up for creating database backups

This major version comes with some backward incompatible changes, so refer to the upgrade guide on how to upgrade from v1.1.0 to v2.0.0. The data format is completely compatible across these major versions though, so you can upgrade and downgrade your code without needing to migrate data.

How does it look in code?

Start a CubDB database process by providing a directory to store its data:

{:ok, db} = CubDB.start_link(data_dir: "some/data/directory")

Basic key/value access

Key/value access works as you probably expect:

CubDB.put(db, :some_key, "some value")
#=> :ok

CubDB.get(db, :some_key)
#=> "some value"

CubDB.delete(db, :some_key)
#=> :ok

Both keys and values can be arbitrary Elixir (or Erlang) terms, such as scalars, tuples, maps, structs, and really anything else:

CubDB.put(db, {:users, 123}, %User{id: 123, name: "Andrea"})
#=> :ok

CubDB.get(db, {:users, 123})
#=> %User{id: 123, name: "Andrea"}

Selection of sorted ranges

Selection of sorted ranges is done with CubDB.select, and returns a lazy stream that can be passed to functions in Stream and Enum. Data is fetched lazily, only when the stream is iterated or otherwise run:

# Integer.is_even/1 is a macro, so the Integer module must be required first
require Integer

# Put several entries atomically
CubDB.put_multi(db, [a: 1, b: 2, c: 3, d: 4, e: 5, f: 6, g: 7, h: 8])

# Get the sum of even entries between :b and :g
CubDB.select(db, min_key: :b, max_key: :g) # select entries with keys from :b to :g, both included
|> Stream.map(fn {_key, value} -> value end) # discard the key and keep only the value
|> Stream.filter(fn value -> is_integer(value) && Integer.is_even(value) end) # filter only even integers
|> Enum.sum() # sum the values
#=> 12

Thanks to the fact that all Elixir terms have a well defined order, CubDB can be used to store and select multiple collections in the same database, akin to SQL tables.
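
A minimal sketch of that idea, assuming a hypothetical key layout of {collection_name, id} with positive integer ids (the collection names, values, and data directory are made up for illustration):

```elixir
# Assumption: :cubdb v2.x is a dependency; the directory is a fresh placeholder path
{:ok, db} = CubDB.start_link(data_dir: "tmp/collections_example")

# Two "collections" living side by side, told apart by the first tuple element
:ok =
  CubDB.put_multi(db, [
    {{:users, 1}, "Ada"},
    {{:users, 2}, "Grace"},
    {{:posts, 1}, "Hello"}
  ])

# Tuples of the same size compare element by element, and in Erlang term order
# any number sorts before any atom. So {:users, 0} and {:users, nil} bound
# every {:users, id} key with a positive integer id.
users =
  CubDB.select(db, min_key: {:users, 0}, max_key: {:users, nil})
  |> Enum.to_list()
#=> [{{:users, 1}, "Ada"}, {{:users, 2}, "Grace"}]
```

The {:posts, 1} entry is left out because :posts sorts before :users, which puts it below the lower bound of the range.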

Atomic transactions

Multiple operations can be performed atomically using the CubDB.transaction function and functions in the CubDB.Tx module:

# Swapping `:a` and `:b` atomically:
CubDB.transaction(db, fn tx ->
  a = CubDB.Tx.get(tx, :a)
  b = CubDB.Tx.get(tx, :b)

  tx = CubDB.Tx.put(tx, :a, b)
  tx = CubDB.Tx.put(tx, :b, a)

  {:commit, tx, :ok}
end)
#=> :ok

Alternatively, all the ..._multi functions perform their operations atomically.
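
For instance, a short sketch of the ..._multi functions, assuming a fresh placeholder data directory:

```elixir
# Assumption: :cubdb v2.x is a dependency; the directory is a fresh placeholder path
{:ok, db} = CubDB.start_link(data_dir: "tmp/multi_example")

# Write several entries in one atomic step
:ok = CubDB.put_multi(db, a: 1, b: 2, c: 3)

# Read several keys at once; the result is a map holding only the entries found
found = CubDB.get_multi(db, [:a, :b, :missing])
#=> %{a: 1, b: 2}

# Write :d and delete :a and :b in a single atomic step
:ok = CubDB.put_and_delete_multi(db, [d: 4], [:a, :b])

# Delete several keys in one atomic step
:ok = CubDB.delete_multi(db, [:c, :d])
```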

Zero-cost immutable snapshots

If you need to ensure consistency when reading multiple values, but do not need to perform any write, there is a better alternative to transactions that won’t block writes: zero-cost immutable snapshots. Using CubDB.with_snapshot one can perform several read/select operations isolated from concurrent writes, but without blocking them. Think about this like immutability in Elixir data structures, but in a database:

# the key of y depends on the value of x, so we ensure consistency by getting
# both entries from the same snapshot, isolating from the effects of concurrent
# writes
{x, y} = CubDB.with_snapshot(db, fn snap ->
  x = CubDB.Snapshot.get(snap, :x)
  y = CubDB.Snapshot.get(snap, x)

  {x, y}
end)

Head to the API documentation for more information.

I hope you enjoy CubDB as much as I do

zachallaun · June 28, 2022, 1:40pm

Great write-up and really useful-looking system! I’m working on a small manufacturing analytics/monitoring project for my own use and have been looking for tools that make it as easily deployable as possible — this looks like a great option!

One thing that comes to mind: how easy do you think it would be to create a kind of streaming backup option a la Litestream? This seems like it should be pretty doable.

One benefit of this would be that I could have an application deployed on an embedded device, backing up its data to S3. In this way, I could easily “pull” a CubDB database to my local dev machine in order to develop against real data. It’s timeseries data, so it would even be possible (given the non-blocking writes) to create a little process that copies one database to the dev db at the same time intervals at which the original data came in, in order to “replay” a particular window of time. There may be simpler options, but this is definitely giving me some fun ideas that I’m looking forward to playing with.

lucaong · June 28, 2022, 1:54pm

Hi @zachallaun , thanks for the kind words

As you say, for time series data probably the easiest solution is to periodically select time-ranges of data and send it over: the selection will happen on a zero-cost snapshot, and will be isolated from writes while not blocking them.
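
A rough sketch of such a time-range selection, assuming a hypothetical key layout of {:readings, unix_timestamp} (the key layout, timestamps, values, and data directory are all made up for illustration):

```elixir
# Assumption: :cubdb v2.x is a dependency; the directory is a fresh placeholder path
{:ok, db} = CubDB.start_link(data_dir: "tmp/timeseries_example")

# Hypothetical layout: one entry per reading, keyed by {:readings, unix_timestamp}
:ok =
  CubDB.put_multi(db, [
    {{:readings, 1_656_410_000}, 20.1},
    {{:readings, 1_656_410_060}, 20.4},
    {{:readings, 1_656_410_120}, 20.9}
  ])

# Select one time window. The upper bound is excluded, so that consecutive
# windows do not send the same entry twice.
batch =
  CubDB.select(db,
    min_key: {:readings, 1_656_410_000},
    max_key: {:readings, 1_656_410_120},
    max_key_inclusive: false
  )
  |> Enum.to_list()
#=> [{{:readings, 1_656_410_000}, 20.1}, {{:readings, 1_656_410_060}, 20.4}]

# The batch can then be serialized and shipped elsewhere, for example with
# :erlang.term_to_binary/1 followed by an upload
payload = :erlang.term_to_binary(batch)
```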

Alternatively, if you need to produce a full backup, the CubDB.back_up function introduced in v2.0.0 efficiently produces a complete backup of the database as of the moment it is called, again without blocking concurrent writes. Since CubDB saves its data in a single file, you can then send the file or the whole data directory over to S3.
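
A minimal sketch of a full backup. The paths are placeholders, and the comments about the target directory reflect my understanding of the docs rather than something stated in this thread:

```elixir
# Assumption: :cubdb v2.x is a dependency; both paths are placeholders
{:ok, db} = CubDB.start_link(data_dir: "tmp/backup_source")
:ok = CubDB.put(db, :some_key, "some value")

# Use a new target directory on each run: as far as I understand, back_up
# creates the directory itself and returns an error tuple if it cannot
backup_dir = "tmp/backup_#{System.system_time(:millisecond)}"

result =
  case CubDB.back_up(db, backup_dir) do
    :ok ->
      # The backup can be opened like any other CubDB database
      {:ok, copy} = CubDB.start_link(data_dir: backup_dir)
      {:ok, CubDB.get(copy, :some_key)}

    {:error, reason} ->
      {:error, reason}
  end
```

From there, uploading the contents of backup_dir to S3 or elsewhere is a plain file transfer.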

That said, I am actually already thinking about possible ways to implement streaming of changes. The implementation is still in “brainstorming phase” in my mind, but some of the building blocks are already there. The background compaction process, for example, already efficiently selects updates since a given snapshot, to “catch up” with concurrent writes. The same mechanism could be in principle packaged as a public API, even though there are a number of issues to figure out first.

In short, your specific case might have a simpler solution. More generally, streaming changes could be on the roadmap, even though I cannot yet say exactly how it will work.

lud · June 29, 2022, 8:42am

Extensive CubDB user here, v2 looks amazing. Thank you so much!

lucaong · June 29, 2022, 9:02am

@lud happy to hear! I am quite excited about v2 myself

Kabie · June 29, 2022, 5:42pm

Sounds like a great replacement for dets. Would definitely try.

Gilou06 · January 25, 2023, 6:18pm

Hello,
This is probably a beginner question, but here it is.
Once I have started the CubDB with the start_link (probably in application.ex)

how do I retrieve the db variable in the rest of my code? Do I need to wrap the start_link in a GenServer in order to store that db variable?
Thank you
Jean-yves

zachallaun · January 25, 2023, 6:43pm

Completely reasonable question! If you aren’t used to this pattern, it can be hard to know what to do here.

Pass a :name argument to start_link — this will be forwarded to the underlying GenServer and accessible throughout your application.
You probably don’t want to be calling start_link directly, but letting your application supervisor do so. So here’s the pattern you’re probably going to want to use:

children = [
  {CubDB, data_dir: "some/dir", name: :my_db}
]

Supervisor.start_link(children, strategy: :one_for_one)

You’d then use :my_db wherever you need to pass the db into the CubDB API.

Gilou06 · January 26, 2023, 10:13am

Thank you, it worked. And I now remember that I kind of knew that 2 years ago, but my brain garbage-collects too often…
Jean-yves

lucaong · January 26, 2023, 10:26am

Hi @Gilou06 ,
@zachallaun already answered your (very common) question correctly.

I just want to add that CubDB.start_link already starts a GenServer, which is why you can start it as a child of a Supervisor and/or give it a name.

If you give it a name, you can then use that name instead of the db variable:

{:ok, db} = CubDB.start_link(data_dir: "tmp/foo", name: :my_db)

# These two are now equivalent:
CubDB.get(db, "some-key")
CubDB.get(:my_db, "some-key")

As pointed out by @zachallaun , instead of manually calling start_link, you would probably add it to the list of children of your application supervisor. Then, the supervisor will call start_link for you, and you will still be able to refer to the CubDB process by name.

Using a name instead of the pid is usually better when running a process under supervision, because the supervisor might restart the process in case of a crash: then, the old pid won’t be valid anymore, but the name will still be valid (and refer to the new process).
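
Putting the pieces together, here is a minimal sketch of what this could look like in an application. MyApp.Application, MyApp.Supervisor, MyApp.Settings, and the {:settings, key} layout are hypothetical names used only for illustration:

```elixir
# Assumption: :cubdb v2.x is a dependency; module names and paths are placeholders
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # The supervisor calls CubDB.start_link with these options
      {CubDB, data_dir: "some/dir", name: :my_db}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

defmodule MyApp.Settings do
  # Any module can reach the database through its registered name, so no pid
  # has to be passed around, and the name survives a restart of the process
  def put(key, value), do: CubDB.put(:my_db, {:settings, key}, value)

  def get(key, default \\ nil), do: CubDB.get(:my_db, {:settings, key}, default)
end
```

In a Mix project, the application callback module is the one referenced under mod: in mix.exs, so it is started automatically and MyApp.Settings.put(:theme, "dark") works from anywhere afterwards.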

These things are not obvious when starting with Elixir, and take some time and some thinking to get used to. Unfortunately, we often take them for granted in docs, where we simply say {:ok, pid} = MyProcess.start_link(...) - which works well in the console, but is not the way processes are usually started in a real application - and assume the reader knows what to do. I will try to improve the CubDB docs in this respect.

zachallaun · January 26, 2023, 2:49pm

Maybe it’s worth updating the Usage section to demonstrate starting it with your application supervisor? And then a note that says something like: You can also start a DB directly using start_link(), which we’ll use for the examples below.

Gilou06 · January 27, 2023, 8:42am

Hi @lucaong,
Thanks for the details.
I had a remote knowledge of start_link but never bridged that to what happens in application.ex with the children.
I saw magic there, when in fact it is just the application of start_link on a list.

It all makes sense now.
I’m grateful.
Jean-yves

fabioticconi · December 30, 2024, 9:38am

This is an awesome library. I just have one question @lucaong - do you have any benchmarks of cubdb vs non-distributed mnesia? Especially with regard to concurrent writes. I think mnesia’s row-level locking offers an advantage here, but it probably isn’t possible with cubdb?

I’m not sure this is ever going to be an issue, but if there are hundreds of concurrent accesses it would be nice to have some idea of the performance loss compared to alternative implementations.