CubDB - A pure-Elixir embedded database

CubDB is an embedded database written in pure Elixir, designed for robustness and minimal use of resources. It strives to be as developer-friendly as possible.

Some of you, especially in the Nerves community, have known and used CubDB for a while, and might have read the original thread about CubDB on the Nerves Forum.

Version v2.0.0 is now published on Hex, with big improvements and exciting new features, so I thought it was a good time for a proper library post here.

Why CubDB?

CubDB is an embedded database written in Elixir. It runs inside your application, as opposed to on a separate server, and saves its data in a local file. In this respect, it is similar to SQLite, but offers an idiomatic Elixir API.

It is NOT a replacement for Postgres for multi-instance web applications, nor a distributed database, but rather a solution for cases when a lightweight but robust local data store is needed.

Typical use cases are applications running on embedded devices (CubDB runs well on Nerves), desktop applications, or applications running locally. CubDB is often used to persistently store data and configuration, as a data logging or time series store, or to persist application state.

Some of the features of CubDB are:

  • Basic key/value access, and selection of sorted ranges of entries.
  • Both keys and values can be any Elixir (or Erlang) term.
  • ACID transactions to perform atomic changes.
  • Multi-version concurrency control (MVCC), allowing concurrent reads that neither block nor are blocked by writes.
  • Unexpected shutdowns or crashes won’t corrupt the database, nor break atomicity of transactions.
  • Manual or automatic compaction to reclaim disk space.
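As a quick sketch of the last point (assuming the `auto_compact` start option and `CubDB.compact/1`, as documented in v2), compaction can be left to run automatically or triggered by hand:

```elixir
# Automatic compaction can be controlled with a start option:
{:ok, db} = CubDB.start_link(data_dir: "some/data/directory", auto_compact: true)

# Compaction can also be triggered manually to reclaim disk space:
CubDB.compact(db)
#=> :ok
```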

How does it compare to ETS, DETS, Mnesia, SQLite, etc.?

The FAQ section in the documentation has a chapter about this.

What’s new in v2.0.0?

Head to the CHANGELOG for more information, but in short:

This major version comes with some backward incompatible changes, so refer to the upgrade guide on how to upgrade from v1.1.0 to v2.0.0. The data format is completely compatible across these major versions though, so you can upgrade and downgrade your code without needing to migrate data.

How does it look in code?

Start a CubDB database process by providing a directory to store its data:

{:ok, db} = CubDB.start_link(data_dir: "some/data/directory")

Basic key/value access

Key/value access works as you probably expect:

CubDB.put(db, :some_key, "some value")
#=> :ok

CubDB.get(db, :some_key)
#=> "some value"

CubDB.delete(db, :some_key)
#=> :ok

Both keys and values can be arbitrary Elixir (or Erlang) terms, such as scalars, tuples, maps, structs, and really anything else:

CubDB.put(db, {:users, 123}, %User{id: 123, name: "Andrea"})
#=> :ok

CubDB.get(db, {:users, 123})
#=> %User{id: 123, name: "Andrea"}

Selection of sorted ranges

Selection of sorted ranges is done with CubDB.select, and returns a lazy stream that can be passed to functions in Stream and Enum. Data is fetched lazily, only when the stream is iterated or otherwise run:

# Put several entries atomically
CubDB.put_multi(db, [a: 1, b: 2, c: 3, d: 4, e: 5, f: 6, g: 7, h: 8])

# Get the sum of the even values of entries between :b and :g
require Integer

CubDB.select(db, min_key: :b, max_key: :g) # select entries from :b to :g
|> Stream.map(fn {_key, value} -> value end) # discard the key and keep only the value
|> Stream.filter(fn value -> is_integer(value) && Integer.is_even(value) end) # keep only even integers
|> Enum.sum() # sum the values
#=> 12

Because all Elixir terms have a well-defined order, CubDB can be used to store and select multiple collections in the same database, akin to SQL tables.
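For example, tuple keys can namespace collections: tuples compare element by element, so all keys sharing a prefix sort together, and a range selection retrieves one collection. This is a sketch; the `:users`/`:posts` prefixes and the range bounds are illustrative (integer ids sort before atoms in Erlang's term order, so `{:users, 0}` and `{:users, nil}` bracket all integer-id user keys):

```elixir
CubDB.put_multi(db, [
  {{:users, 1}, %{name: "Ada"}},
  {{:users, 2}, %{name: "Grace"}},
  {{:posts, 1}, %{title: "Hello"}}
])

# Select only the "users" collection, leaving "posts" untouched
CubDB.select(db, min_key: {:users, 0}, max_key: {:users, nil})
|> Enum.to_list()
#=> [{{:users, 1}, %{name: "Ada"}}, {{:users, 2}, %{name: "Grace"}}]
```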

Atomic transactions

Multiple operations can be performed atomically using the CubDB.transaction function and functions in the CubDB.Tx module:

# Swapping `:a` and `:b` atomically:
CubDB.transaction(db, fn tx ->
  a = CubDB.Tx.get(tx, :a)
  b = CubDB.Tx.get(tx, :b)

  tx = CubDB.Tx.put(tx, :a, b)
  tx = CubDB.Tx.put(tx, :b, a)

  {:commit, tx, :ok}
end)
#=> :ok

Alternatively, all the ..._multi functions perform their operations atomically.
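For instance, a hedged sketch using `CubDB.get_multi/2` and `CubDB.delete_multi/2` (both part of the documented API):

```elixir
# Read several entries atomically; returns a map of the entries found
CubDB.get_multi(db, [:c, :d])
#=> %{c: 3, d: 4}

# Delete several entries atomically
CubDB.delete_multi(db, [:c, :d])
#=> :ok
```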

Zero-cost immutable snapshots

If you need to ensure consistency when reading multiple values, but do not need to perform any write, there is a better alternative to transactions that won’t block writes: zero-cost immutable snapshots. Using CubDB.with_snapshot one can perform several read/select operations isolated from concurrent writes, but without blocking them. Think of it like immutability in Elixir data structures, but in a database:

# the key of y depends on the value of x, so we ensure consistency by getting
# both entries from the same snapshot, isolating from the effects of concurrent
# writes
{x, y} = CubDB.with_snapshot(db, fn snap ->
  x = CubDB.Snapshot.get(snap, :x)
  y = CubDB.Snapshot.get(snap, x)

  {x, y}
end)

Head to the API documentation for more information.

I hope you enjoy CubDB as much as I do :slight_smile:


Great write-up and really useful-looking system! I’m working on a small manufacturing analytics/monitoring project for my own use and have been looking for tools that make it as easily deployable as possible — this looks like a great option!

One thing that comes to mind: how easy do you think it would be to create a kind of streaming backup option a la Litestream? This seems like it should be pretty doable.

One benefit of this would be that I could have an application deployed on an embedded device, backing up its data to S3. In this way, I could easily “pull” a CubDB database to my local dev machine in order to develop against real data. It’s time series data, so it would even be possible (given the non-blocking writes) to create a little process that copies entries into the dev database at the same intervals at which the original data arrived, in order to “replay” a particular window of time. There may be simpler options, but this is definitely giving me some fun ideas that I’m looking forward to playing with. :slight_smile:


Hi @zachallaun, thanks for the kind words :slight_smile:

As you say, for time series data probably the easiest solution is to periodically select time ranges of data and send them over: the selection will happen on a zero-cost snapshot, and will be isolated from writes while not blocking them.
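A periodic export along those lines might look like the following sketch. The `{:reading, timestamp}` key scheme and the `upload/1` function are hypothetical, standing in for whatever key layout and S3 client the application uses:

```elixir
defmodule PeriodicExport do
  # Assumes entries are keyed as {:reading, timestamp}, so a time window
  # corresponds to a sorted key range
  def export(db, from_ts, to_ts) do
    db
    |> CubDB.select(min_key: {:reading, from_ts}, max_key: {:reading, to_ts})
    |> Enum.to_list()
    |> upload()
  end

  # Hypothetical upload step (e.g. an S3 client call)
  defp upload(_entries), do: :ok
end
```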

Alternatively, if you need to produce a full backup, the CubDB.back_up function introduced in v2.0.0 efficiently produces a complete backup of the database at the moment when it is called, again without blocking concurrent writes. Since CubDB saves its data in a single file, you can then send the file or the whole data directory over to S3.
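For reference, a backup call is a one-liner (assuming `CubDB.back_up/2` takes a target directory path, as in the v2 documentation):

```elixir
CubDB.back_up(db, "path/to/backup")
#=> :ok
```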

That said, I am actually already thinking about possible ways to implement streaming of changes. The implementation is still in “brainstorming phase” in my mind, but some of the building blocks are already there. The background compaction process, for example, already efficiently selects updates since a given snapshot, to “catch up” with concurrent writes. The same mechanism could be in principle packaged as a public API, even though there are a number of issues to figure out first.

In short, your specific case might have a simpler solution. More in general, streaming changes could be on the roadmap, even though I cannot yet say how exactly it will work.


Extensive CubDB user here, v2 looks amazing. Thank you so much!


@lud happy to hear! I am quite excited about v2 myself :slight_smile:

Sounds like a great replacement for DETS. Will definitely try it.
