I’ve written a small embedded key-value database in Elixir called Goblin. It’s based on the Log-Structured Merge (LSM) tree architecture and is a pure Elixir implementation with zero external dependencies. The API syntax is heavily inspired by CubDB by @lucaong, but Goblin is optimized for write-heavy workloads.
Some features include:
Background processes for automatic flushing and compaction of Sorted String Tables (SST)
ACID transactions with conflict detection
Concurrent reads
Range queries (Goblin.select/2 returns a stream over the data; see the usage sketch after this list)
Write-ahead logging and database state tracking through a manifest
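To give a feel for the API, here is a rough usage sketch; apart from Goblin.select/2 and put_multi/2, the call names below are illustrative (CubDB-style) and may differ from the actual API:

```elixir
# Rough usage sketch; apart from select/2 and put_multi/2 the names here
# are illustrative (CubDB-style) and may not match the actual Goblin API.
{:ok, db} = Goblin.start_link(data_dir: "tmp/goblin_demo")

# Single and batched writes
:ok = Goblin.put(db, {:user, 1}, %{name: "Frodo"})
:ok = Goblin.put_multi(db, for(i <- 2..1_000, do: {{:user, i}, %{name: "user #{i}"}}))

# Point read
%{name: "Frodo"} = Goblin.get(db, {:user, 1})

# Range query: select/2 returns a stream over the data
db
|> Goblin.select(min_key: {:user, 1}, max_key: {:user, 100})
|> Enum.take(10)
```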
Some future work is planned, including compression of SSTs in deeper levels, transaction retries on conflicts, and SST verification via CRCs.
This is great, super interesting! Happy to see that CubDB is providing some inspiration, and that the ecosystem is growing! Looking forward to finding some time to delve into this and try it out. LSM trees definitely provide some interesting possibilities and optimizations.
I see you mention transactions with snapshot isolation. Could you provide a brief overview of the concurrency control impl? I think I found some of it in transaction.ex.
Keeping a lock per SST is a valid approach, but I am personally more fond of maintaining a global snapshot window and deleting SSTs outside of that window. I can’t remember which one RocksDB does, though.
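Roughly what I mean, as a minimal sketch with invented names: track the sequence numbers of all open snapshots, and only physically delete an SST made obsolete by compaction once no open snapshot predates that compaction:

```elixir
# Minimal sketch (invented names, not code from CubDB, RocksDB or Goblin).
# open_snapshot_seqs: seqnos of all currently open snapshots/readers.
# obsolete_at: seqno at which compaction superseded this SST.
defmodule SnapshotWindow do
  def can_delete_sst?(open_snapshot_seqs, obsolete_at) do
    # Deletable only when every open snapshot started after the compaction,
    # i.e. no reader can still need the old file. Empty set => deletable.
    Enum.all?(open_snapshot_seqs, fn seq -> seq >= obsolete_at end)
  end
end
```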
I know you’re probably at a stage where benchmarks aren’t really meaningful, but would you be willing to share roughly what numbers you’re seeing? I ask only because I’m literally implementing an Elixir LSM right now and I’m curious what to expect.
Anyway, please keep at it! The ecosystem needs more databases
Thanks @lucaong! Looking forward to hearing your feedback on it!
Thanks @garrison for the questions, and especially for mentioning the snapshot isolation; it made me realize that it was not working as intended in Goblin. It is now fixed. When a transaction starts, it gets a fallback read function that first checks the writer’s immutable memtables (a snapshot of the current memtable and of any memtables being flushed) for the key, and then checks the store if the key is not found there. Reads against the store are matched against a sequence number captured when the transaction started, so writes made after the start of the transaction are never returned. Upon commit, the read and write sets of the transaction are checked for potential conflicts.
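Roughly, the read path for a transaction now works like this (simplified sketch with illustrative names, not the exact Goblin code):

```elixir
# Simplified sketch of the snapshot read path described above;
# module and function names are illustrative, not Goblin's actual internals.
defmodule SnapshotRead do
  def get(key, snapshot) do
    # 1. Check the immutable memtables captured when the transaction started
    #    (current memtable plus any memtables being flushed).
    case find_in_memtables(snapshot.memtables, key) do
      {:ok, value} ->
        value

      :not_found ->
        # 2. Fall back to the store, only accepting entries whose sequence
        #    number is not newer than the transaction's start seqno.
        store_get(key, max_seq: snapshot.start_seq)
    end
  end

  defp find_in_memtables(memtables, key) do
    Enum.find_value(memtables, :not_found, fn memtable ->
      case Map.fetch(memtable, key) do
        {:ok, value} -> {:ok, value}
        :error -> nil
      end
    end)
  end

  # Placeholder for the store lookup bounded by a sequence number.
  defp store_get(_key, max_seq: _seq), do: :not_found
end
```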
Interesting about the global snapshot: are SSTs removed from it once they are signalled for deletion and no more reads against them are ongoing? If you have a reference, I will gladly read it.
Here is an early sample of a benchmark for writing via put_multi/2 (it is not complete yet though):
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.19.0
Erlang 28.1
JIT enabled: true
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: huge: 1M, large: 100K, medium: 10K, small: 1K
Estimated total run time: 28 s
Excluding outliers: false
Benchmarking put_multi/2 with input huge: 1M ...
Benchmarking put_multi/2 with input large: 100K ...
Benchmarking put_multi/2 with input medium: 10K ...
Benchmarking put_multi/2 with input small: 1K ...
Calculating statistics...
Formatting results...
##### With input huge: 1M #####
Name ips average deviation median 99th %
put_multi/2 0.59 1.70 s ±10.14% 1.67 s 1.88 s
##### With input large: 100K #####
Name ips average deviation median 99th %
put_multi/2 8.19 122.03 ms ±4.60% 119.36 ms 135.44 ms
##### With input medium: 10K #####
Name ips average deviation median 99th %
put_multi/2 117.30 8.52 ms ±4.09% 8.46 ms 9.71 ms
##### With input small: 1K #####
Name ips average deviation median 99th %
put_multi/2 1.23 K 814.17 μs ±6.01% 804.83 μs 956.99 μs
Hi, good library, I was working on something similar about 2 months ago: GitHub - hissssst/nanolsm: Tiny Elixir LSM KV implementation. I love how in recent months there have been something like 4 new embedded Elixir DB projects published. Maybe we should start a Discord server about it, hehe
I’ve read your code and I have these questions and notes:
What’s your plan for the library? Do you have any new planned features in mind?
What’s the point of the write lock on the SST? An SST is written once, by a single process, during compaction, and is never accessed before the compaction finishes
What’s the point of the read lock on the SST? Every access to an SST opens a new file descriptor, and SSTs are immutable (in the sense that files are only written once, and then deleted)
Do you plan to have a block cache?
It seems that the read path calls a singleton (per Goblin instance, not global) Goblin.Store GenServer. Why? I’d suggest using an ets table with the names of the opened files and their key ranges, as sketched at the end of this post
You use the :file module, which spawns a process with a new file descriptor on every read call and essentially introduces a copy (via message passing) on every read. You don’t share the descriptor between processes, so I’d suggest either using :prim_file instead to avoid the copying, or keeping opened :file descriptors and storing them in ets (from the ets point above)
I see you also implemented the memtable as a map which is stored in a singleton GenServer. Why? I’d suggest using ets here too
As far as I can see, if a process dies during a transaction, the Writer will keep copies of the old memtables in its state forever. This sounds like a bug
In the Writer, has_conflict checks are performed against state.transactions[pid].commit, which is always an empty list; it looks like these conflict checks must be performed against the current memtables in the Writer. Looks like a bug
It looks like the WAL is not writing to disk on every call, which can result in data loss. This design choice should be reflected in the docs for the DB.
The WAL periodically does a full flush of the state to disk, but I’d suggest just appending new entries for performance and cleaning the state once those entries have been appended to the log.
Right now the implementation has bottlenecks in reads and writes, since all reads and writes are GenServer calls to the singletons (Store and Writer), and it has problems with WAL durability and overall performance. I would be happy to share what I know about implementing parallel reads and writes and improving the WAL and performance
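For example, a rough sketch of the ets + raw-descriptor idea from the points above (all names here are made up, not code from Goblin or nanolsm):

```elixir
# Sketch only; table and function names are made up for illustration.
defmodule SstRegistry do
  @table :goblin_ssts

  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
  end

  # Called by the flusher/compactor when a new SST is installed.
  def register(path, min_key, max_key, level) do
    :ets.insert(@table, {path, min_key, max_key, level})
  end

  # Readers scan the table directly, no GenServer call on the read path.
  def candidates(key) do
    :ets.foldl(
      fn {path, min, max, level}, acc ->
        if key >= min and key <= max, do: [{level, path} | acc], else: acc
      end,
      [],
      @table
    )
  end

  # Open in :raw mode in the reading process, so reads go through the
  # prim_file driver without an intermediate file-server process.
  def pread(path, offset, size) do
    {:ok, fd} = :file.open(path, [:read, :raw, :binary])

    try do
      :file.pread(fd, offset, size)
    after
      :file.close(fd)
    end
  end
end
```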
My eyes always go right to the consistency guarantees because that’s the tricky part
So this is an optimistic concurrency control (OCC) approach. I see the code that checks the read/write sets, and it looks like it only searches the memtables. That means that if memtables have been flushed during the transaction, those writes would be missed. But if memtables cannot be flushed while a transaction is active over them, then I think this is correct.
Presumably the memtables contain writes from transactions older than the current transaction, so this will result in false positives, which is unfortunate. If you assign every txn a read/commit seqno and check that against the write seqnos, you should be able to check conflicts more accurately. That means adding seqnos to the writes in the memtables (which is generally how it’s done).
If you want to guarantee snapshot isolation you only need to check write-write conflicts, so you only need to check the write set. If you want to guarantee serializability you instead need to check the read set against the past writes. In neither case do you need to check both.
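In rough Elixir terms (illustrative sketch, not Goblin’s code), the seqno-based check looks something like this; the only difference between the two isolation levels is which set you validate:

```elixir
# Illustrative sketch of seqno-based OCC validation, not code from Goblin.
# recent_writes: {key, write_seq} entries committed since the transaction's
# start_seq (e.g. taken from the writer's memtables).
defmodule OccValidate do
  # Snapshot isolation: abort if any key in the txn's *write set* was
  # written by someone else after the txn started.
  def write_write_conflict?(write_set, recent_writes, start_seq) do
    Enum.any?(recent_writes, fn {key, write_seq} ->
      write_seq > start_seq and MapSet.member?(write_set, key)
    end)
  end

  # Serializability: the same check, but against the txn's *read set*.
  def read_write_conflict?(read_set, recent_writes, start_seq) do
    Enum.any?(recent_writes, fn {key, write_seq} ->
      write_seq > start_seq and MapSet.member?(read_set, key)
    end)
  end
end
```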
Unfortunately Erlang does not support direct I/O so the OS will be caching those blocks for you anyway no matter what you do. It would save you copying memory from the kernel into the refc binary but I wonder how much overhead that actually is. It would also save recomputing the checksum if you trust your RAM, and of course you kinda have to trust RAM anyway or you’re done for (so use ECC).
Tragically, without direct I/O it’s not possible to build a safe storage engine in current year, so I think a NIF will be unavoidable in my case. Not looking forward to that.
Personally I found that using maps for stuff like this upsets the generational GC (causing nasty latency spikes), and using :ets fixed that (presumably because it gets everything off the heap fast). So now I just abuse :ets for basically everything lol.
But yeah, in this case you want an ordered map anyway so Map is off the table.
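For example (a small sketch, nothing Goblin-specific): an :ordered_set table keeps keys in term order, which is exactly what a memtable needs for range scans:

```elixir
# Sketch: an :ordered_set ets table instead of a Map as the memtable.
# Keys are kept sorted in Erlang term order, so in-order range scans are cheap
# and the data lives off the process heap.
table = :ets.new(:memtable, [:ordered_set, :public])
:ets.insert(table, [{"apple", 1}, {"banana", 2}, {"cherry", 3}])

# Point lookup
[{"banana", 2}] = :ets.lookup(table, "banana")

# In-order traversal
"apple" = :ets.first(table)
"banana" = :ets.next(table, "apple")

# Bounded range scan in one call via a match spec (keys =< "banana")
[{"apple", 1}, {"banana", 2}] =
  :ets.select(table, [{{:"$1", :"$2"}, [{:"=<", :"$1", "banana"}], [{{:"$1", :"$2"}}]}])
```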
I had been meaning to ask you about this. This module is not documented (from what I saw) so I assume this is some BEAM dark magic?
If you’re talking about O_DIRECT, I agree, but such block caching is useless, since values and keys are not decoded in it and it still needs to be marshalled. I was thinking about a decoded-values cache with an index, or at least a binary-search tuple, not just a “block as bytes” cache
But yeah, prim_file has other issues too. I was talking with the Erlang core team about it on ErlangForum and they said they are planning to rewrite the fs modules, but they don’t have a deadline, estimate, or plan for it. The strangest part is that I asked if I could contribute and work on the implementation, and they declined, saying they are unwilling to accept contributions. That was strange
Yeah. Goblin does a very interesting trick where its implementation of transactions uses snapshots, and these snapshots cost nothing to create when you use maps as memtables, since they are immutable by default. Very nice trick, but it bottlenecks the performance since it requires a singleton
It’s not dark magic; enterprise databases in Erlang use it directly. It is a NIF which wraps libc calls like pread, fopen and such. But it comes with limitations which can break the VM. One of the limitations is that it must always be used by only a single process. It uses the NIF resource mechanism, which works in such a way that if a process dies (or is killed) while using the resource, the file descriptor gets closed. So if you share a prim_file descriptor (which just holds an integer libc file descriptor) and it gets closed in one process, another process will continue using it; when a new file descriptor is created somewhere, it can get the same number as the closed one, and that process will accidentally read from the new file, breaking a lot of internal VM assumptions and potentially crashing the VM.
But if you open a descriptor, never share it, and use it only in the process it was opened in, you’re all good
@Asd Thanks for your questions and bug finds! nanoLSM looks good and seems interesting as well; I will look into it more. I will try to answer all your questions.
This started as a fun project to learn about LSM databases, but the end goal is to have a generic database that can be used in real projects, whether on an IoT Nerves device or in a Mix project where it fits. It will not be as feature-complete as other production LSM databases, so block caches will probably not be implemented at first unless they prove really necessary for read performance.
You’re right about the SST files: the locks are completely unnecessary and can be removed altogether, which I am working on.
The writer and store module implementations will be changed to ets tables, good insight!
I use :file, which normally spawns a process for the file descriptor, but I open files in :raw mode, which gives a :prim_file descriptor instead, as you suggested, so I believe this amounts to the same thing.
That’s a legit bug that will be fixed: processes that start a transaction should be monitored so that the Writer can clean up if a process dies during its transaction.
The commit list for a transaction is updated when a transaction completes while another transaction is ongoing. I’ll double-check this to make sure it is still correct.
Yep, I should definitely add to the docs that the WAL does periodic syncs and that this can potentially cause data loss if it crashes before a sync, good catch!
Please do share how you would implement parallel reads and writes, sounds interesting!
@garrison Thanks for the insights, that clears up the transaction handling. With the read/commit number, do you mean that a sequence number is given to the transaction when it starts, and reads for that transaction can only read below this seqno? And write/write conflicts are then checked between the start seqno and the latest seqno in the Writer? An ets table in the Writer makes a lot of sense, clearing up some of the performance bottlenecks @Asd mentioned as well. I will fix the transaction handling to allow snapshot isolation and disregard serializability. Blocking flushes until there are no ongoing transactions seems like a natural implementation then.
Pretty much. But in your case a transaction would only start at the current seqno and so only the active memtable (or future active memtables) will receive new writes. Meaning you would only actually have to do multiversion reads against those memtables.
By holding a snapshot of the database itself (i.e. of the manifest) open you ensure that none of the SSTs relevant to active transactions are deleted early. That does mean long-running transactions would hold memtables open (I wonder how RocksDB deals with that), but long-running transactions are bad anyway.
A conflict occurs when a key is written in the period between the begin and commit timestamps of a transaction. For snapshot isolation, write-write conflicts are checked. As far as I am aware this is a historical accident, i.e. at the time they thought it was correct and then later found out it was not. There are nasty anomalies which can happen with snapshot isolation, most famously write skew. I am quite fond of the less-famous read-only anomaly, which is much harder to understand but also much more disturbing.
Perhaps surprisingly, with OCC the only difference between snapshot isolation and serializability is validating the read set instead of the write set. This causes more conflicts but has the benefit of actually being intuitively correct, unlike snapshot isolation and other weak consistency levels which have failure modes that are extremely difficult for the average dev to understand (or anyone, really).
Given that the implementations are almost identical I would strongly recommend that you default to serializability. Specifically strict serializability, which should be trivial since this is not a distributed database.
It would be much better to actually sync before returning from transaction commit(). You should always default to proper durability guarantees. If users want to turn them off that’s their own business, but it’s best not to hand footguns to unsuspecting devs.
Generally you want to bunch a number of transactions together before syncing the WAL. This optimization is known as “group commit”. After the batch is committed you then unblock the clients’ commit calls.
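A minimal sketch of the shape of it (illustrative, assuming a single process owns the WAL file and that entries arrive as iodata):

```elixir
# Illustrative group-commit sketch, not Goblin's actual WAL.
# Commit calls block until their batch has been appended and fsynced once.
defmodule GroupCommitWal do
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)

  # Blocks the caller until its entry is durable on disk.
  def commit(entry), do: GenServer.call(__MODULE__, {:commit, entry})

  @impl true
  def init(path) do
    {:ok, fd} = :file.open(path, [:append, :raw, :binary])
    {:ok, %{fd: fd, pending: []}}
  end

  @impl true
  def handle_call({:commit, entry}, from, state) do
    # Don't reply yet; collect the entry and sync the whole batch shortly.
    if state.pending == [], do: Process.send_after(self(), :sync, 1)
    {:noreply, %{state | pending: [{from, entry} | state.pending]}}
  end

  @impl true
  def handle_info(:sync, %{fd: fd, pending: pending} = state) do
    # Append all pending entries, fsync once, then unblock every caller.
    data = pending |> Enum.reverse() |> Enum.map(fn {_from, entry} -> entry end)
    :ok = :file.write(fd, data)
    :ok = :file.sync(fd)
    Enum.each(pending, fn {from, _entry} -> GenServer.reply(from, :ok) end)
    {:noreply, %{state | pending: []}}
  end
end
```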