CubDB, a pure-Elixir embedded key-value database

Thanks for posting your benchmark @skunkwerks, that’s really awesome!

Yes, I agree that Snappy compression could be a good way to introduce compression without the huge performance penalty observed with the standard Erlang term compression (I hope it might even improve performance with large values). I will experiment with it.

I also think that the current benchmark is hiding the true benefits of compression, as the value being written is essentially random bytes, so it won’t compress much.
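To make that concrete, here is a small illustrative sketch (not part of the benchmark itself; the sizes are arbitrary) comparing the standard Erlang term compression on a repetitive binary vs. random bytes:

```elixir
# Standard Erlang term compression (zlib under the hood) on two kinds
# of input: a highly repetitive binary and random bytes.
repetitive = String.duplicate("abc", 10_000)
random = :crypto.strong_rand_bytes(30_000)

for {label, value} <- [repetitive: repetitive, random: random] do
  plain = byte_size(:erlang.term_to_binary(value))
  compressed = byte_size(:erlang.term_to_binary(value, compressed: 6))
  IO.puts("#{label}: #{plain} -> #{compressed} bytes")
end
```

The repetitive binary shrinks dramatically, while the random one stays at roughly its original size plus some overhead, which is why a random-bytes benchmark understates the benefit of compression.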

@skunkwerks Do you have an explanation for the very similar performance of your big, fast machine compared to my medium-sized laptop? I would expect your storage to be much faster than my ‘normal’ SSDs.

seb5law wrote on July 24:

> @skunkwerks Do you have an explanation for the very similar performance
> of your big and fast machine compared to my medium sized laptop? I would
> expect your storage […]
The short answer is that the test isn’t fully I/O bound, and it’s using a single
path for both storage and CPU, so we are not really stretching any of the
components. If this were running 20 or more I/O schedulers each doing
real work, the difference would start to show much more.

It’s possible your laptop has an NVMe drive in it too, and it certainly has
2 or 4 PCIe lanes for I/O even if it’s an SSD.

For a single-threaded BEAM app writing to a single file, a large fast machine
will not show significantly different results than a smaller one: they’re both
PCIe devices writing to broadly similar storage, with broadly similar CPUs.

Now, if you run several thousand threads doing concurrent I/O, the difference
starts to show: the drive is better, the PCIe bus has significantly more
channels, and your laptop will hit the wall much earlier.

The other interesting thing is that long I/O runs cause NVMe drives to get
noticeably hotter. The CubDB tests are not enough to make the drive
throttle, but if you leave them running for a couple of hours, you can
expect to see a significant difference as your drive gives up and reduces
its bandwidth to avoid overheating.

This will also show up on shorter tests with increased spread and divergent
percentiles.

A while ago I benchmarked a single dd process on both my laptop and my server,
and sure enough there was very little difference. Where things change is with
multiple dd processes, as high-end NVMe drives can have thousands of parallel
IO queues underway without breaking a sweat.

Indeed, just reading off a 60 GiB swap partition on the NVMe drive, I get 2x
the throughput of the laptop, and it can sustain this all day, as opposed
to a minute or two for the laptop.

Raw notes are here: https://gist.github.com/dch/890ec336875663349a02cbe0b9b19171
I did plan to turn them into a blog post at some point, but they need a bit
more filler info. The above data is also just raw disk throughput; when using
a filesystem like ZFS on top, the picture is much more complicated.


Yes, what @skunkwerks wrote sounds 100% correct to me. Also, the current benchmarks perform operations serially, not concurrently. Benchmarks of bulk writes and concurrent operations might show a more visible difference. I will add some.

In general, the fact that CubDB is written in pure Elixir, with no dependencies, has advantages in convenience and in supporting any Elixir-capable target, but makes it harder to saturate resources.

In general, CubDB’s main goals are ease of use from Elixir/Erlang, and data integrity. I am happy to optimize performance as long as it does not negatively impact the other goals. If one cares primarily about raw performance, general-purpose C/C++ k/v stores like LMDB or LevelDB are probably a better choice. That said, there is likely still plenty of low-hanging fruit for improving the performance of CubDB.

Also consider that, when setting the strictest durability guarantees (auto_file_sync: true in CubDB, and equivalent options in SQLite, LMDB, LevelDB, etc.), all stores perform quite similarly, and much slower than the theoretical limits shown in popular benchmarks, far from saturating I/O anyway. In that (quite common) use case, ease of use is, in my opinion, more important than shaving off a few percentage points of performance.
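For concreteness, here is a minimal sketch of the two durability settings in CubDB (the directory names are made up):

```elixir
# Strict durability: fsync after every write. Slower, but every confirmed
# write survives a sudden power loss.
{:ok, safe_db} = CubDB.start_link("data/safe_db", auto_file_sync: true)

# Relaxed durability: writes are buffered by the OS. Faster, but a crash
# may lose the most recent writes.
{:ok, fast_db} = CubDB.start_link("data/fast_db", auto_file_sync: false)

# With relaxed settings, a sync can still be forced at chosen points:
:ok = CubDB.put(fast_db, :checkpoint, DateTime.utc_now())
:ok = CubDB.file_sync(fast_db)
```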

As a final note, thanks a lot, everyone, for the interest and for posting your insights and benchmarks! I am quite sure that, with your help, there are lots of opportunities to improve CubDB and make it one more nice entry in the Elixir toolbox :slight_smile:


Hi!

I have two small questions on how to best use CubDB:

  1. If you have two different kinds of data, does it make more sense to open two separate DB processes (pointing to different directories), or to use a single one and distinguish data by their IDs (applying the appropriate :filter when selecting)?
  2. Is it possible to (lazily) stream responses from a select?

Hi @Qqwy,
thanks a lot for your questions, as they give me an opportunity to mention a few possibilities with CubDB :slight_smile:

The best approach is usually to use the same database, and leverage the fact that keys can be any term. This way, you can have transactions across different kinds of data.

Suppose that I want to have two different kinds of data, users and messages for example. In a relational DB one would create two tables.

In CubDB, a great way to do that is to structure the keys as {:users, user_id} for users and {:messages, message_id} for messages:

user = %User{
  id: 123,
  name: "Margaret",
  employer: "NASA"
}
CubDB.put(db, {:users, user.id}, user)

message = %Message{
  id: 46,
  subject: "The Eagle has landed",
  text: "Tranquillity base here..."
}
CubDB.put(db, {:messages, message.id}, message)

Now, here is how you can query them, leveraging Erlang term ordering:

# Get a specific user with id = 123
CubDB.get(db, {:users, 123})
#=> %User{ id: 123, ... }

# Select all users (this works because tuples are compared element by
# element, and the atom nil sorts after any number, so {:users, nil}
# is greater than {:users, id} for any numeric id):
CubDB.select(db, [
  min_key: {:users, 0},
  max_key: {:users, nil}
])

# Select the first 30 messages with id greater than 10:
CubDB.select(db, [
  min_key: {:messages, 11},
  max_key: {:messages, nil},
  pipe: [take: 30]
])

That is also much faster than using :filter, because :min_key and :max_key avoid loading unnecessary entries from disk entirely. You can of course combine them with a filter, if you want to apply further restrictions that cannot be expressed by :min_key/:max_key (e.g. only taking messages matching a certain subject).

Yes :slight_smile: When you use select/3 with the :pipe option, the entries are in fact lazily streamed through the pipeline operations:

CubDB.select(db, [
  min_key: {:messages, 0},
  max_key: {:messages, nil},
  pipe: [
    filter: fn {_key, %Message{ subject: subject }} ->
      String.contains?(subject, ["error", "1202"])
    end,
    map: fn {_key, %Message{ subject: subject }} ->
      subject
    end
  ]
])

You might wonder why select/3 does not simply return a lazy stream. The reason is that, internally, CubDB has to keep track of all open readers and the data file that they reference. Upon a compaction operation, a new compacted data file is created, and the old one is only removed after no reader is referencing it anymore. This way, readers can safely operate concurrently with other readers, writers, and compactions, with none being blocked.

Using the :pipe option, select/3 takes care of the bookkeeping for you. If it gave direct access to the stream, you would have to manually “check out” the reader after you are done using it, also in case of exceptions, etc.

I hope this explains it well, and gives you some tools to model your case.


I want to take the time to thank you for your very insightful post!

I am now a couple of days down the road, and I have used CubDB in the ways you suggested: having one database containing multiple data structures so they can be used transactionally across different kinds of data, and using the min_key/max_key tips you suggested :smile:.

Also, kudos for the way you run streams over the selected records :+1:. The code looks very clean, and I am very happy that memory usage while running a select with a reducer is essentially constant (rather than reading in all records before doing something with them).


I have now come across a bit of an issue. I am using CubDB on my Nerves device, but it seems that not all data is properly kept track of, and some data is lost across reboots:

iex> CubDB.select(MyApp.DB, reduce: {0, fn _entry, acc -> acc + 1 end})
{:ok, 3543}
iex> Toolshed.Nerves.reboot()
# Wait for Nerves to restart and come back up
iex> CubDB.select(MyApp.DB, reduce: {0, fn _entry, acc -> acc + 1 end})
{:ok, 1871}

It seems like many records are lost. What is going on here?

The database server is started as follows, and is part of my supervision tree:

CubDB.start_link("/root/data/my_db", [auto_file_sync: true, auto_compact: true], [name: MyApp.DB])

(Or to be exact, the app contains a module which has the following child_spec definition to allow it to be added to the supervisor using just MyApp.DB:

defmodule MyApp.DB do
  def child_spec(_) do
    %{
      id: __MODULE__,
      start:
        {CubDB, :start_link,
         [
           Application.get_env(:my_app, :my_db_location, "data/my_db"),
           [auto_file_sync: true, auto_compact: true],
           [name: __MODULE__]
         ]}
    }
  end

  # ... some other helper functions that use CubDB.get/fetch/put under the hood.
end

)


Hi @Qqwy,
Thanks a lot for reporting this. It sounds like a serious issue, and one I haven’t encountered yet. It looks especially strange given that you are already using auto file sync.

Could you provide any more input, e.g. on how you fill up the db? If you manage to have a reproducible setup, I’ll pick it up from there and make sure to fix it before v1.0.

Also, does it make any difference if you call `CubDB.file_sync(db)` before restarting? This test would check whether there is a bug in the auto sync logic.

Thanks a lot

Also, is it possible that some write or delete operation completed after the last select? Writes and reads do not block each other, so a select sees an immutable snapshot of the db at the moment the operation started. If a write is performed concurrently, it won’t be visible to that select.

Sounds unlikely in your case, but I am trying to narrow down the possibilities.

Another useful test would be to disable the auto compaction, to see if that’s what’s causing the issue.

Thanks a lot for helping on this.

It might be that this behaviour happens specifically because it is running on an embedded device. To make this issue easier to investigate, one of the things I will try to find out is what size Nerves’ read/write partition has by default, and what happens if it fills up.

How much is the overhead of CubDB’s internal tree structure?

I will try out if disabling the auto compaction helps, and I’ll build and share a minimal example that has the same behaviour as my real application (whose source I unfortunately cannot share).

In this case, I am saving a new value, an Elixir struct with about eight keys containing some integers, strings, and floats, under the key {:telegram, "some_timestamp"} (where some_timestamp has the format "YYYYMMDDHHmmss").
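Incidentally, that key scheme combines nicely with the :min_key/:max_key advice from earlier in the thread, since binaries sort byte-wise and the "YYYYMMDDHHmmss" format therefore sorts chronologically. A hypothetical sketch (the timestamps are made up):

```elixir
# Select all telegrams recorded in January 2019: string timestamps in
# "YYYYMMDDHHmmss" format sort lexicographically, which matches
# chronological order, so a key range works directly.
CubDB.select(db, [
  min_key: {:telegram, "20190101000000"},
  max_key: {:telegram, "20190131235959"}
])
```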

Thank you very much for your responses :+1:.


If the entries are small, the biggest overhead is due to the fact that headers are written only at page boundaries, so each write takes a minimum of 1024 bytes before compaction (each atomic operation writes a header).
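A quick back-of-the-envelope sketch of what that overhead implies (the record count here is just an example figure, borrowed from the earlier session):

```elixir
# Assumption from the explanation above: each atomic write occupies at
# least one 1024-byte page before compaction.
writes = 3_543
min_bytes = writes * 1024

# roughly 3.5 MiB of file growth before compaction, even for tiny entries
IO.puts("at least #{min_bytes} bytes (#{div(min_bytes, 1024)} KiB)")
```

Batching several entries into a single atomic operation (e.g. put_multi, if your version provides it) amortizes that per-operation header cost.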

Embedded devices, and especially Nerves, are one of the primary targets of CubDB (and the initial motivation for the project), so I am especially interested in investigating possible bugs there.

I will soon set up a test device on an RPi to extensively exercise conditions like sudden loss of power.

I am still puzzled by the issue you are facing, as I have never encountered it in automated tests or on deployed Nerves devices. Any further insight you discover is very valuable to me.

Thanks a lot for dedicating time to this!


I have spent two hours today in an attempt to reproduce the issue, and have not been able to do so.

:thinking:

My current testing repository can be found here, if anyone else wants to give it a whirl.

For now I’ll continue working on my application. If the issue reappears, I will try to find out more. :slight_smile:


After extensive testing on a number of test Nerves devices, I was finally able to identify the issue that @Qqwy reported.

It was a bug in the way the most recent database file is chosen when a restart happens right after a compaction, but before the old file is cleaned up, so CubDB sees more than one database file. The wrong file was chosen, making the new records disappear.

The issue is solved with the latest release, v0.12.0, which is 100% backward compatible. Thanks a lot @Qqwy for reporting and helping. Version 1.0 is getting closer, thanks to valuable feedback from people in this forum :slight_smile:
