Realtime status on CubDB compaction process?

Hi @lucaong,

first, thanks a lot for the great library. I am using CubDB as embedded DB for my project. The service runs on an SD-card based Pi node that continuously writes lots of log lines into CubDB. To protect the SD card, I create a RAM disk to host the CubDB file. Periodically (every 2 hours), I backup the RAM disk content to a persistent folder on SD. Next time the service starts, it copy the content of that folder to the RAM disk.

My question is whether this approach makes sense for CubDB? If it does, can I query CubDB to know if it is currently doing compaction, so I can wait till it is done to avoid backing up a half-done DB file?

Thanks a lot!

Hi @tduccuong , thanks for the kind words :slight_smile:

My question is whether this approach makes sense for CubDB?

I don’t see problems with your approach, but it could be simpler. If you use a RAM disk, you probably do not need complete durability in case of a restart, and you are ok loosing very recent logs (after the most recent disk copy). In that case, you could do without a RAM disk, and simply use the option auto_file_sync: false. This will avoid flushing buffers to disk for each transaction, and instead let the OS control the flush to disk. You can also explicitly call CubDb.file_sync(pid) periodically to force a flush. In all likelihood this would cause disk writes more frequently than every 2 hours though, so if that’s a problem you might still follow your approach.

You can check if there is an ongoing compaction by calling CubDB.compacting?(pid), or you can disable auto compaction completely with the auto_compact: false option, and force a compaction manually when you see fit, with CubDB.compact(pid). The latter option can be especially useful if you have “quiet times” when you can perform a large compaction, and “busy times” when you prefer not to add overhead. Delaying a compaction won’t affect read or write performance, and will simply cause the database file to be larger than it could. Compactions are also more efficient during quiet times, when the database does not need to catch up on writes that happen after compaction is started.

I hope this helps :slight_smile:

Thank you @lucaong very much for the detailed answer. Your suggestion is much better, since I am also concern about memory overflow when the DB gets too big.

Just so I understand you correctly, if i set auto_file_sync to false, and so long that I don’t explicitly call CubDb.file_sync(pid) then no flush to disk?

Never mind my last question, I got the answer after going through your code base. I think my approach now is:

  • auto_file_sync = false
  • Build a BTree in memory as buffer, that exposes same API as CubDB (put/2, put_multi/2) for client code
  • Periodically flush the BTree to disk via CubDB.put_multi/2

Do you think this helps protect the SD card better?

Thanks for your help!

Flash is consumable. The cells can only withstand so many write/erase cycles before failing.

Writes to the drive are done at some page size. Say 4kb, though I think it may be larger in hardware now. If you write less than that size and then flush to disk you will cause write amplification and waste cycles.

Flash drives have an erase block size which is larger than the page size (for complicated reasons). If you partially erase these blocks you may eventually force the drive to rewrite the remaining contents before erasing the block, which again causes write amp and wastes cycles.

The most efficient way to write to such a device is therefore in a log-structured format. You buffer data in memory and append it in pages which are at least as large as the (hardware) page size. You also want your erases to have strong temporal locality so that pages written to the same erase block (i.e. those which were written around the same time) are erased together.

Finally you also want to avoid filling the drive too much as wear leveling is harder when there is little remaining space and this could result in uneven wear which harms the drive (and/or harms write performance).

Fortunately your workload (storing literal log lines) is, as it turns out, log-structured!

Btrees and LSM trees are data structures which are designed to allow for random writes, that is writes which are out of sorted order. Your workload, logging, should have perfect order, meaning you don’t actually need a Btree/LSM.

However, both Btrees and LSMs have optimizations which cause them to fall back to near perfect log-structured behavior in the face of an append-only workload. For a Btree you can simply fill the pages in order and then flush them. There is still write-amp for the nodes up the tree but the more you buffer them the cheaper it gets.

Unfortunately CubDB as I understand it uses a compaction strategy similar to a copying GC where it copies all of the good pages out of the DB. This is… an unfortunate deficiency, here. Maintaining a freelist would be better.

But if you simply avoid flushing to disk for as long as possible you will build up most of the mutations in-memory (or in the OS page cache) and avoid most of the write-amp, which will protect the drive. There is little you can do about the compactions other than to avoid them as long as possible.

By writing the entire database out to the drive each time you are creating an enormous amount of write-amp. Remember that your workload is append-only, so it’s counterproductive to rewrite the existing pages since they never change.

And now that I’ve written that I will also point out that literally just writing your logs to a log file, rotating the log file regularly, and erasing old log files will also nicely minimize the wear on the flash :slight_smile:

The OS (and the Elixir Logger, I think) will buffer the pages for you. The filesystem and the device will try to put the pages in the right place. These things were (broadly speaking) designed to work this way!

A Btree (or LSM) is really useful when you have to deal with inserts in random order.

Thanks for the clarification @garrison, the reason I don’t use regular log file + rotation, is because I need to also frequently query the logs based on timestamp and certain metadata. CubDB provides good trade off for that, only the problem of disk wear. But also, at some point in time, the DB can grows larger than RAM.

So I guess the best trade off now that I can do is to buffer write for CubDB with relatively long flush to DB interval (via CubDB.put_multi/2).

Consider that CubDB is already structured as an append-only btree, so at the file level writes are always appended (even in case of an update or a deletion). Therefore, the write characteristics are very similar to a plain log file, but with the added ability to perform point and range queries efficiently.

What’s the expected write volume in terms of entries per unit of time? Depending on it, it might be worth using CubDB straight away, with the options discussed above.

They are similar except there is an additional page written for each node up the tree, i.e. 3x write amplification for a 3-deep tree. This wears an SSD out 3x faster than the simplest possible case (not writing any sort of index at all). Now correct me if I’m wrong but CubDB is a true COWtree and has no freelist so that also means a lot of space amplification (from all of the useless parent node pages).

And unfortunately because the GC copies the used pages out (again correct me if I’m wrong) there is a lot of write amplification from compaction if deletes are rare, but there are also a lot of free pages floating around in the file because of the overwritten parent nodes.

I assume CubDB left to its own devices (without flushing) will buffer those updates in-memory (or the OS page cache will) so that helps with the first problem, but not the second.

I wonder if a good compromise here would be to rotate the CubDB database every so often (like you would rotate a log file) and throw away the old one after some time. And then just query all living DBs in parallel.

Based on timestamp you could probably binary search the log files if you felt like it, but if you’re writing secondary indexes (metadata) then I see where you’re coming from here.