CubDB, a pure-Elixir embedded key-value database

Thanks for posting your benchmark @skunkwerks, that’s really awesome!

Yes, I agree that Snappy compression could be a good way to introduce compression without the huge performance penalty observed with the standard Erlang term compression (I hope it might even improve performance with large values). I will experiment with it.

I also think that the current benchmark is hiding the true benefits of compression, as the value being written is essentially random bytes, so it won’t compress much.
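To make that concrete, here is a small illustrative sketch (not part of the benchmark itself; the sizes are arbitrary) comparing the standard Erlang term compression on a repetitive binary vs. random bytes:

```elixir
# Standard Erlang term compression (zlib under the hood) on two kinds
# of input: a highly repetitive binary and random bytes.
repetitive = String.duplicate("abc", 10_000)
random = :crypto.strong_rand_bytes(30_000)

for {label, value} <- [repetitive: repetitive, random: random] do
  plain = byte_size(:erlang.term_to_binary(value))
  compressed = byte_size(:erlang.term_to_binary(value, compressed: 6))
  IO.puts("#{label}: #{plain} -> #{compressed} bytes")
end
```

The repetitive binary shrinks dramatically, while the random one stays at roughly its original size plus some overhead, which is why a random-bytes benchmark understates the benefit of compression.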

@skunkwerks Do you have an explanation for the very similar performance of your big, fast machine compared to my medium-sized laptop? I would expect your storage to be much faster than my ‘normal’ SSDs.

seb5law wrote on July 24:

> @skunkwerks Do you have an explanation for the very similar performance
> of your big and fast machine compared to my medium sized laptop? I would
> expect your storage […]
The short answer is that the test isn’t fully I/O bound, and it’s using a single
path for both storage and CPU, so we are not really stretching any of the
components. If this were running 20 or more I/O schedulers each doing
real work, the difference would start to show much more.

It’s possible your laptop has an NVMe drive in it too, and it certainly has
2 or 4 PCIe lanes for I/O even if it’s an SSD.

For a single-threaded BEAM app writing to a single file, a large fast machine
will not show significantly different results than a smaller one: they’re both
PCIe devices writing to broadly similar storage, with broadly similar CPUs.

Now, if you run several thousand threads doing concurrent I/O, the difference
starts to show: the drive is better, the PCIe bus has significantly more
channels, and your laptop will hit the wall much earlier.

The other interesting thing is that long I/O runs cause NVMe drives to get
noticeably hotter. The CubDB tests are not enough to make the drive
throttle, but if you leave them running for a couple of hours, you can
expect to see a significant difference as your drive gives up and reduces
its bandwidth to avoid overheating.

This will also show up on shorter tests with increased spread and divergent
percentiles.

A while ago I benchmarked a single dd process on both my laptop and my server,
and sure enough there was very little difference. Where things change is with
multiple dd processes, as high-end NVMe drives can have thousands of parallel
IO queues underway without breaking a sweat.

Indeed, just reading off a 60 GiB swap partition on the NVMe drive, I get 2x
the throughput of the laptop, and it can sustain this all day, as opposed
to a minute or two for the laptop.

Raw notes are here: https://gist.github.com/dch/890ec336875663349a02cbe0b9b19171
I did plan to turn them into a blog post at some point, but they need a bit
more filler info. The above data is also just raw disk throughput; when using
a filesystem like ZFS on top, the picture is much more complicated.


Yes, what @skunkwerks wrote sounds 100% correct to me. Also, the current benchmarks perform operations serially, not concurrently. Benchmarks of bulk writes and concurrent operations might show a more visible difference. I will add some.

In general, the fact that CubDB is written in pure Elixir, with no dependencies, has advantages in convenience and in supporting any Elixir-capable target, but makes it harder to saturate resources.

In general, CubDB’s main goals are ease of use from Elixir/Erlang, and data integrity. I am happy to optimize performance as long as it does not negatively impact the other goals. If one cares primarily about raw performance, general-purpose C/C++ k/v stores like LMDB or LevelDB are probably a better choice. That said, there is likely still plenty of low-hanging fruit for improving the performance of CubDB.

Also consider that, when setting the strictest durability guarantees (auto_file_sync: true in CubDB, and equivalent options in SQLite, LMDB, LevelDB, etc.), all stores perform quite similarly, and much slower than the theoretical limits shown in popular benchmarks, far from saturating I/O anyway. In that (quite common) use case, ease of use is, in my opinion, more important than shaving off a few percentage points of performance.
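For concreteness, here is a minimal sketch of the two durability settings in CubDB (the directory names are made up):

```elixir
# Strict durability: fsync after every write. Slower, but every confirmed
# write survives a sudden power loss.
{:ok, safe_db} = CubDB.start_link("data/safe_db", auto_file_sync: true)

# Relaxed durability: writes are buffered by the OS. Faster, but a crash
# may lose the most recent writes.
{:ok, fast_db} = CubDB.start_link("data/fast_db", auto_file_sync: false)

# With relaxed settings, a sync can still be forced at chosen points:
:ok = CubDB.put(fast_db, :checkpoint, DateTime.utc_now())
:ok = CubDB.file_sync(fast_db)
```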

As a final note, thanks a lot, everyone, for the interest and for posting your insights and benchmarks! I am quite sure that, with your help, there are lots of opportunities to improve CubDB and make it one more nice entry in the Elixir toolbox :slight_smile:


Hi!

I have two small questions on how to best use CubDB:

  1. If you have two different kinds of data, does it make more sense to open two separate DB processes (pointing to different directories), or to use a single one and distinguish data by their IDs (applying the appropriate :filter when selecting)?
  2. Is it possible to (lazily) stream responses from a select?

Hi @Qqwy,
thanks a lot for your questions, as they give me an opportunity to mention a few possibilities with CubDB :slight_smile:

The best approach is usually to use the same database, and leverage the fact that keys can be any term. This way, you can have transactions across different kinds of data.

Suppose that I want to have two different kinds of data, users and messages for example. In a relational DB one would create two tables.

In CubDB, a great way to do that is to structure the keys as {:users, user_id} for users and {:messages, message_id} for messages:

user = %User{
  id: 123,
  name: "Margaret",
  employer: "NASA"
}
CubDB.put(db, {:users, user.id}, user)

message = %Message{
  id: 46,
  subject: "The Eagle has landed",
  text: "Tranquillity base here..."
}
CubDB.put(db, {:messages, message.id}, message)

Now, here is how you can query them, leveraging Erlang term ordering:

# Get a specific user with id = 123
CubDB.get(db, {:users, 123})
#=> %User{ id: 123, ... }

# Select all users (this works because tuples are compared element by
# element, and the atom nil sorts after any number, so {:users, nil}
# is greater than {:users, id} for any numeric id):
CubDB.select(db, [
  min_key: {:users, 0},
  max_key: {:users, nil}
])

# Select the first 30 messages with id greater than 10:
CubDB.select(db, [
  min_key: {:messages, 11},
  max_key: {:messages, nil},
  pipe: [take: 30]
])

That is also much faster than using :filter, because :min_key and :max_key avoid loading unnecessary entries from disk entirely. You can of course combine them with a filter, if you want to apply further restrictions that cannot be expressed by :min_key/:max_key (e.g. only taking messages matching a certain subject).

Yes :slight_smile: When you use select/3 with the :pipe option, the entries are in fact lazily streamed through the pipeline operations:

CubDB.select(db, [
  min_key: {:messages, 0},
  max_key: {:messages, nil},
  pipe: [
    filter: fn {_key, %Message{ subject: subject }} ->
      String.contains?(subject, ["error", "1202"])
    end,
    map: fn {_key, %Message{ subject: subject }} ->
      subject
    end
  ]
])

You might wonder why select/3 does not simply return a lazy stream. The reason is that, internally, CubDB has to keep track of all open readers and the data file that they reference. Upon a compaction operation, a new compacted data file is created, and the old one is only removed after no reader is referencing it anymore. This way, readers can safely operate concurrently with other readers, writers, and compactions, with none being blocked.

Using the :pipe option, select/3 takes care of the bookkeeping for you. If it gave direct access to the stream, you would have to manually “check out” the reader after you are done using it, also in case of exceptions, etc.

I hope this explains it well, and gives you some tools to model your case.


I want to take the time to thank you for your very insightful post!

I am now a couple of days down the road, and I have used CubDB in the ways you suggested: having one database containing multiple data structures so they can be used transactionally across different kinds of data, and using the min_key/max_key tips you suggested :smile:.

Also, kudos for the way you run streams over the selected records :+1:. The code looks very clean, and I am very happy that memory usage while running a select with a reducer is essentially constant (rather than reading in all records before doing something with them).


I have now come across a bit of an issue. I am using CubDB on my Nerves device, but it seems that not all data is properly kept track of, and some data is lost across reboots:

iex> CubDB.select(MyApp.DB, reduce: {0, fn _entry, acc -> acc + 1 end})
{:ok, 3543}
iex> Toolshed.Nerves.reboot()
# Wait for Nerves to restart and come back up
iex> CubDB.select(MyApp.DB, reduce: {0, fn _entry, acc -> acc + 1 end})
{:ok, 1871}

It seems like many records are lost. What is going on here?

The database server is started as follows, and is part of my supervision tree:

CubDB.start_link("/root/data/my_db", [auto_file_sync: true, auto_compact: true], [name: MyApp.DB])

(Or to be exact, the app contains a module which has the following child_spec definition to allow it to be added to the supervisor using just MyApp.DB:

defmodule MyApp.DB do
  def child_spec(_) do
    %{
      id: __MODULE__,
      start:
        {CubDB, :start_link,
         [
           Application.get_env(:my_app, :my_db_location, "data/my_db"),
           [auto_file_sync: true, auto_compact: true],
           [name: __MODULE__]
         ]}
    }
  end

  # ... some other helper functions that use CubDB.get/fetch/put under the hood.
end

)


Hi @Qqwy,
Thanks a lot for reporting this. It sounds like a serious issue, and one I haven’t encountered yet. It looks especially strange given that you are already using auto file sync.

Could you provide any more input, e.g. on how you fill up the db? If you manage to have a reproducible setup, I’ll pick it up from there and make sure to fix it before v1.0.

Also, does it make any difference if you call `CubDB.file_sync(db)` before restarting? This test would check whether there is a bug in the auto sync logic.

Thanks a lot

Also, is it possible that some write or delete operation completed after the last select? Writes and reads do not block each other, so a select sees an immutable snapshot of the db at the moment the operation started. If a write is performed concurrently, it won’t be visible to that select.

Sounds unlikely in your case, but I am trying to narrow down the possibilities.

Another useful test would be to disable the auto compaction, to see if that’s what’s causing the issue.

Thanks a lot for helping on this.

It might be that this behaviour happens specifically because it is running on an embedded device. To make this issue easier to investigate, one of the things I will try to find out is what size Nerves’ read/write partition has by default, and what happens if it fills up.

How much is the overhead of CubDB’s internal tree structure?

I will try out if disabling the auto compaction helps, and I’ll build and share a minimal example that has the same behaviour as my real application (whose source I unfortunately cannot share).

In this case, I am saving a new value, an Elixir struct with about eight keys containing some integers, strings, and floats, under the key {:telegram, "some_timestamp"} (where some_timestamp has the format "YYYYMMDDHHmmss").
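Incidentally, that key scheme combines nicely with the :min_key/:max_key advice from earlier in the thread, since binaries sort byte-wise and the "YYYYMMDDHHmmss" format therefore sorts chronologically. A hypothetical sketch (the timestamps are made up):

```elixir
# Select all telegrams recorded in January 2019: string timestamps in
# "YYYYMMDDHHmmss" format sort lexicographically, which matches
# chronological order, so a key range works directly.
CubDB.select(db, [
  min_key: {:telegram, "20190101000000"},
  max_key: {:telegram, "20190131235959"}
])
```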

Thank you very much for your responses :+1:.


If the entries are small, the biggest overhead is due to the fact that headers are written only at page boundaries, so each write takes a minimum of 1024 bytes before compaction (each atomic operation writes a header).
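A quick back-of-the-envelope sketch of what that overhead implies (the record count here is just an example figure, borrowed from the earlier session):

```elixir
# Assumption from the explanation above: each atomic write occupies at
# least one 1024-byte page before compaction.
writes = 3_543
min_bytes = writes * 1024

# roughly 3.5 MiB of file growth before compaction, even for tiny entries
IO.puts("at least #{min_bytes} bytes (#{div(min_bytes, 1024)} KiB)")
```

Batching several entries into a single atomic operation (e.g. put_multi, if your version provides it) amortizes that per-operation header cost.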

Embedded devices, and especially Nerves, are one of the primary targets of CubDB (and the initial motivation for the project), so I am especially interested in investigating possible bugs there.

I will soon set up a test device on an RPi to extensively exercise conditions like sudden loss of power.

I am still puzzled by the issue you are facing, as I have never encountered it in automated tests or on deployed Nerves devices. Any further insight you discover is very valuable to me.

Thanks a lot for dedicating time to this!


I have spent two hours today in an attempt to reproduce the issue, and have not been able to do so.

:thinking:

My current testing repository can be found here, if anyone else wants to give it a whirl.

For now I’ll continue working on my application. If the issue reappears, I will try to find out more. :slight_smile:


After extensive testing on a number of test Nerves devices, I was finally able to identify the issue that @Qqwy reported.

It was a bug in the way the most recent database file is chosen when a restart happens right after a compaction, but before the old file is cleaned up, so CubDB sees more than one database file. The wrong file was chosen, making the new records disappear.

The issue is solved with the latest release, v0.12.0, which is 100% backward compatible. Thanks a lot @Qqwy for reporting and helping. Version 1.0 is getting closer, thanks to valuable feedback from people in this forum :slight_smile:
