I’ve written a small embedded key-value database in Elixir called Goblin. It’s based on the Log-Structured Merge (LSM) tree architecture and is a pure Elixir implementation with zero external dependencies. The API syntax is heavily inspired by CubDB by @lucaong, but Goblin is optimized for write-heavy workloads.
Some features include:
Background processes for automatic flushing and compaction of Sorted String Tables (SST)
ACID transactions with conflict detection
Concurrent reads
Range queries (Goblin.select/2 returns a stream over the data; see the usage sketch after this list)
Write-ahead logging and database state tracking through a manifest
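To give a feel for the API, here is a rough usage sketch; apart from Goblin.select/2 and put_multi/2, the call names below are illustrative (CubDB-style) and may differ from the actual API:

```elixir
# Rough usage sketch; apart from select/2 and put_multi/2 the names here
# are illustrative (CubDB-style) and may not match the actual Goblin API.
{:ok, db} = Goblin.start_link(data_dir: "tmp/goblin_demo")

# Single and batched writes
:ok = Goblin.put(db, {:user, 1}, %{name: "Frodo"})
:ok = Goblin.put_multi(db, for(i <- 2..1_000, do: {{:user, i}, %{name: "user #{i}"}}))

# Point read
%{name: "Frodo"} = Goblin.get(db, {:user, 1})

# Range query: select/2 returns a stream over the data
db
|> Goblin.select(min_key: {:user, 1}, max_key: {:user, 100})
|> Enum.take(10)
```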
Some future work is planned, including compression of SSTs in deeper levels, transaction retries on conflicts, and SST verification via CRCs.
This is great, super interesting! Happy to see that CubDB is providing some inspiration, and that the ecosystem is growing! Looking forward to finding some time to delve into this and try it out. LSM trees definitely provide some interesting possibilities and optimizations.
I see you mention transactions with snapshot isolation. Could you provide a brief overview of the concurrency control impl? I think I found some of it in transaction.ex.
Keeping a lock per SST is a valid approach, but I am personally more fond of maintaining a global snapshot window and deleting SSTs outside of that window. I can’t remember which one RocksDB does, though.
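Roughly what I mean, as a minimal sketch with invented names: track the sequence numbers of all open snapshots, and only physically delete an SST made obsolete by compaction once no open snapshot predates that compaction:

```elixir
# Minimal sketch (invented names, not code from CubDB, RocksDB or Goblin).
# open_snapshot_seqs: seqnos of all currently open snapshots/readers.
# obsolete_at: seqno at which compaction superseded this SST.
defmodule SnapshotWindow do
  def can_delete_sst?(open_snapshot_seqs, obsolete_at) do
    # Deletable only when every open snapshot started after the compaction,
    # i.e. no reader can still need the old file. Empty set => deletable.
    Enum.all?(open_snapshot_seqs, fn seq -> seq >= obsolete_at end)
  end
end
```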
I know you’re probably at a stage where benchmarks aren’t really meaningful, but would you be willing to share roughly what numbers you’re seeing? I ask only because I’m literally implementing an Elixir LSM right now and I’m curious what to expect.
Anyway, please keep at it! The ecosystem needs more databases
Thanks @lucaong! Looking forward to hearing your feedback on it!
Thanks @garrison for the questions, and especially for mentioning the snapshot isolation; it made me realize that it was not working as intended in Goblin. It is now fixed. When a transaction starts, it gets a fallback read function that first checks the writer’s immutable memtables (a snapshot of the current memtable and of any memtables being flushed) for the key, and then checks the store if the key is not found there. Reads against the store are matched against a sequence number captured when the transaction started, so writes made after the start of the transaction are never returned. Upon commit, the read and write sets of the transaction are checked for potential conflicts.
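Roughly, the read path for a transaction now works like this (simplified sketch with illustrative names, not the exact Goblin code):

```elixir
# Simplified sketch of the snapshot read path described above;
# module and function names are illustrative, not Goblin's actual internals.
defmodule SnapshotRead do
  def get(key, snapshot) do
    # 1. Check the immutable memtables captured when the transaction started
    #    (current memtable plus any memtables being flushed).
    case find_in_memtables(snapshot.memtables, key) do
      {:ok, value} ->
        value

      :not_found ->
        # 2. Fall back to the store, only accepting entries whose sequence
        #    number is not newer than the transaction's start seqno.
        store_get(key, max_seq: snapshot.start_seq)
    end
  end

  defp find_in_memtables(memtables, key) do
    Enum.find_value(memtables, :not_found, fn memtable ->
      case Map.fetch(memtable, key) do
        {:ok, value} -> {:ok, value}
        :error -> nil
      end
    end)
  end

  # Placeholder for the store lookup bounded by a sequence number.
  defp store_get(_key, max_seq: _seq), do: :not_found
end
```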
Interesting about the global snapshot: are SSTs removed from it once they are signalled for deletion and no more reads against them are ongoing? If you have a reference, I will gladly read it.
Here is an early sample of a benchmark for writing via put_multi/2 (it is not complete yet though):
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.19.0
Erlang 28.1
JIT enabled: true
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: huge: 1M, large: 100K, medium: 10K, small: 1K
Estimated total run time: 28 s
Excluding outliers: false
Benchmarking put_multi/2 with input huge: 1M ...
Benchmarking put_multi/2 with input large: 100K ...
Benchmarking put_multi/2 with input medium: 10K ...
Benchmarking put_multi/2 with input small: 1K ...
Calculating statistics...
Formatting results...
##### With input huge: 1M #####
Name ips average deviation median 99th %
put_multi/2 0.59 1.70 s ±10.14% 1.67 s 1.88 s
##### With input large: 100K #####
Name ips average deviation median 99th %
put_multi/2 8.19 122.03 ms ±4.60% 119.36 ms 135.44 ms
##### With input medium: 10K #####
Name ips average deviation median 99th %
put_multi/2 117.30 8.52 ms ±4.09% 8.46 ms 9.71 ms
##### With input small: 1K #####
Name ips average deviation median 99th %
put_multi/2 1.23 K 814.17 μs ±6.01% 804.83 μs 956.99 μs
Hi, good library, I was working on something similar about 2 months ago: GitHub - hissssst/nanolsm: Tiny Elixir LSM KV implementation. I love how in recent months there have been something like 4 new embedded Elixir DB projects published. Maybe we should start a Discord server about it, hehe
I’ve read your code and I have these questions and notes:
What’s your plan for the library? Do you have any new planned features in mind?
What’s the point of the write lock on the SST? An SST is written once, by a single process, during compaction, and is never accessed before the compaction finishes
What’s the point of the read lock on the SST? Every access to an SST opens a new file descriptor, and SSTs are immutable (in the sense that files are only written once, and then deleted)
Do you plan to have a block cache?
It seems that the read path calls a singleton (per Goblin instance, not global) Goblin.Store GenServer. Why? I’d suggest using an ets table with the names of the opened files and their key ranges, as sketched at the end of this post
You use the :file module, which spawns a process with a new file descriptor on every read call and essentially introduces a copy (via message passing) on every read. You don’t share the descriptor between processes, so I’d suggest either using :prim_file instead to avoid the copying, or keeping opened :file descriptors and storing them in ets (from the ets point above)
I see you also implemented the memtable as a map which is stored in a singleton GenServer. Why? I’d suggest using ets here too
As far as I can see, if a process dies during a transaction, the Writer will keep copies of the old memtables in its state forever. This sounds like a bug
In the Writer, has_conflict checks are performed against state.transactions[pid].commit, which is always an empty list; it looks like these conflict checks must be performed against the current memtables in the Writer. Looks like a bug
It looks like the WAL is not writing to disk on every call, which can result in data loss. This design choice should be reflected in the docs for the DB.
The WAL periodically does a full flush of the state to disk, but I’d suggest just appending new entries for performance and cleaning the state once those entries have been appended to the log.
Right now the implementation has bottlenecks in reads and writes, since all reads and writes are GenServer calls to the singletons (Store and Writer), and it has problems with WAL durability and overall performance. I would be happy to share what I know about implementing parallel reads and writes and improving the WAL and performance
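For example, a rough sketch of the ets + raw-descriptor idea from the points above (all names here are made up, not code from Goblin or nanolsm):

```elixir
# Sketch only; table and function names are made up for illustration.
defmodule SstRegistry do
  @table :goblin_ssts

  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
  end

  # Called by the flusher/compactor when a new SST is installed.
  def register(path, min_key, max_key, level) do
    :ets.insert(@table, {path, min_key, max_key, level})
  end

  # Readers scan the table directly, no GenServer call on the read path.
  def candidates(key) do
    :ets.foldl(
      fn {path, min, max, level}, acc ->
        if key >= min and key <= max, do: [{level, path} | acc], else: acc
      end,
      [],
      @table
    )
  end

  # Open in :raw mode in the reading process, so reads go through the
  # prim_file driver without an intermediate file-server process.
  def pread(path, offset, size) do
    {:ok, fd} = :file.open(path, [:read, :raw, :binary])

    try do
      :file.pread(fd, offset, size)
    after
      :file.close(fd)
    end
  end
end
```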
My eyes always go right to the consistency guarantees because that’s the tricky part
So this is an optimistic concurrency control (OCC) approach. I see the code that checks the read/write sets, and it looks like it only searches the memtables. That means that if memtables have been flushed during the transaction, those writes would be missed. But if memtables cannot be flushed while a transaction is active over them, then I think this is correct.
Presumably the memtables contain writes from transactions older than the current transaction, so this will result in false positives, which is unfortunate. If you assign every txn a read/commit seqno and check that against the write seqnos, you should be able to check conflicts more accurately. That means adding seqnos to the writes in the memtables (which is generally how it’s done).
If you want to guarantee snapshot isolation you only need to check write-write conflicts, so you only need to check the write set. If you want to guarantee serializability you instead need to check the read set against the past writes. In neither case do you need to check both.
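In rough Elixir terms (illustrative sketch, not Goblin’s code), the seqno-based check looks something like this; the only difference between the two isolation levels is which set you validate:

```elixir
# Illustrative sketch of seqno-based OCC validation, not code from Goblin.
# recent_writes: {key, write_seq} entries committed since the transaction's
# start_seq (e.g. taken from the writer's memtables).
defmodule OccValidate do
  # Snapshot isolation: abort if any key in the txn's *write set* was
  # written by someone else after the txn started.
  def write_write_conflict?(write_set, recent_writes, start_seq) do
    Enum.any?(recent_writes, fn {key, write_seq} ->
      write_seq > start_seq and MapSet.member?(write_set, key)
    end)
  end

  # Serializability: the same check, but against the txn's *read set*.
  def read_write_conflict?(read_set, recent_writes, start_seq) do
    Enum.any?(recent_writes, fn {key, write_seq} ->
      write_seq > start_seq and MapSet.member?(read_set, key)
    end)
  end
end
```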
Unfortunately Erlang does not support direct I/O so the OS will be caching those blocks for you anyway no matter what you do. It would save you copying memory from the kernel into the refc binary but I wonder how much overhead that actually is. It would also save recomputing the checksum if you trust your RAM, and of course you kinda have to trust RAM anyway or you’re done for (so use ECC).
Tragically, without direct I/O it’s not possible to build a safe storage engine in current year, so I think a NIF will be unavoidable in my case. Not looking forward to that.
Personally I found that using maps for stuff like this upsets the generational GC (causing nasty latency spikes), and using :ets fixed that (presumably because it gets everything off the heap fast). So now I just abuse :ets for basically everything lol.
But yeah, in this case you want an ordered map anyway so Map is off the table.
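For example (a small sketch, nothing Goblin-specific): an :ordered_set table keeps keys in term order, which is exactly what a memtable needs for range scans:

```elixir
# Sketch: an :ordered_set ets table instead of a Map as the memtable.
# Keys are kept sorted in Erlang term order, so in-order range scans are cheap
# and the data lives off the process heap.
table = :ets.new(:memtable, [:ordered_set, :public])
:ets.insert(table, [{"apple", 1}, {"banana", 2}, {"cherry", 3}])

# Point lookup
[{"banana", 2}] = :ets.lookup(table, "banana")

# In-order traversal
"apple" = :ets.first(table)
"banana" = :ets.next(table, "apple")

# Bounded range scan in one call via a match spec (keys =< "banana")
[{"apple", 1}, {"banana", 2}] =
  :ets.select(table, [{{:"$1", :"$2"}, [{:"=<", :"$1", "banana"}], [{{:"$1", :"$2"}}]}])
```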
I had been meaning to ask you about this. This module is not documented (from what I saw) so I assume this is some BEAM dark magic?
If you’re talking about O_DIRECT, I agree, but such block caching is useless, since values and keys are not decoded in it and it still needs to be marshalled. I was thinking about a decoded-values cache with an index, or at least a binary-search tuple, not just a “block as bytes” cache
But yeah, prim_file has other issues too. I was talking with the Erlang core team about it on ErlangForum and they said they are planning to rewrite the fs modules, but they don’t have a deadline, estimate, or plan for it. The strangest part is that I asked if I could contribute and work on the implementation, and they declined, saying they are unwilling to accept contributions. That was strange
Yeah. Goblin does a very interesting trick where its implementation of transactions uses snapshots, and these snapshots cost nothing to create when you use maps as memtables, since they are immutable by default. Very nice trick, but it bottlenecks the performance since it requires a singleton
It’s not dark magic; enterprise databases in Erlang use it directly. It is a NIF which wraps libc calls like pread, fopen and such. But it comes with limitations which can break the VM. One of the limitations is that it must always be used by only a single process. It uses the NIF resource mechanism, which works in such a way that if a process dies (or is killed) while using the resource, the file descriptor gets closed. So if you share a prim_file descriptor (which just holds an integer libc file descriptor) and it gets closed in one process, another process will continue using it; when a new file descriptor is created somewhere, it can get the same number as the closed one, and that process will accidentally read from the new file, breaking a lot of internal VM assumptions and potentially crashing the VM.
But if you open a descriptor, never share it, and use it only in the process it was opened in, you’re all good
@Asd Thanks for your questions and bug finds! nanoLSM looks good and seems interesting as well; I will look into it more. I will try to answer all your questions.
This started as a fun project to learn about LSM databases, but the end goal is to have a generic database that can be used in real projects, whether on an IoT Nerves device or in a Mix project where it fits. It will not be as feature-complete as other production LSM databases, so block caches will probably not be implemented at first unless they prove really necessary for read performance.
You’re right about the SST files: the locks are completely unnecessary and can be removed altogether, which I am working on.
The writer and store module implementations will be changed to ets tables, good insight!
I use :file, which normally spawns a process for the file descriptor, but I open files in :raw mode, which gives a :prim_file descriptor instead, as you suggested, so I believe this amounts to the same thing.
That’s a legit bug that will be fixed: processes that start a transaction should be monitored so that the Writer can clean up if a process dies during its transaction.
The commit list for a transaction is updated when a transaction completes while another transaction is ongoing. I’ll double-check this to make sure it is still correct.
Yep, I should definitely add to the docs that the WAL does periodic syncs and that this can potentially cause data loss if it crashes before a sync, good catch!
Please do share how you would implement parallel reads and writes, sounds interesting!
@garrison Thanks for the insights, that clears up the transaction handling. With the read/commit number, do you mean that a sequence number is given to the transaction when it starts, and reads for that transaction can only read below this seqno? And write/write conflicts are then checked between the start seqno and the latest seqno in the Writer? An ets table in the Writer makes a lot of sense, clearing up some of the performance bottlenecks @Asd mentioned as well. I will fix the transaction handling to allow snapshot isolation and disregard serializability. Blocking flushes until there are no ongoing transactions seems like a natural implementation then.
Pretty much. But in your case a transaction would only start at the current seqno and so only the active memtable (or future active memtables) will receive new writes. Meaning you would only actually have to do multiversion reads against those memtables.
By holding a snapshot of the database itself (i.e. of the manifest) open you ensure that none of the SSTs relevant to active transactions are deleted early. That does mean long-running transactions would hold memtables open (I wonder how RocksDB deals with that), but long-running transactions are bad anyway.
A conflict occurs when a key is written in the period between the begin and commit timestamps of a transaction. For snapshot isolation, write-write conflicts are checked. As far as I am aware this is a historical accident, i.e. at the time they thought it was correct and then later found out it was not. There are nasty anomalies which can happen with snapshot isolation, most famously write skew. I am quite fond of the less-famous read-only anomaly, which is much harder to understand but also much more disturbing.
Perhaps surprisingly, with OCC the only difference between snapshot isolation and serializability is validating the read set instead of the write set. This causes more conflicts but has the benefit of actually being intuitively correct, unlike snapshot isolation and other weak consistency levels which have failure modes that are extremely difficult for the average dev to understand (or anyone, really).
Given that the implementations are almost identical I would strongly recommend that you default to serializability. Specifically strict serializability, which should be trivial since this is not a distributed database.
It would be much better to actually sync before returning from transaction commit(). You should always default to proper durability guarantees. If users want to turn them off that’s their own business, but it’s best not to hand footguns to unsuspecting devs.
Generally you want to bunch a number of transactions together before syncing the WAL. This optimization is known as “group commit”. After the batch is committed you then unblock the clients’ commit calls.
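A minimal sketch of the shape of it (illustrative, assuming a single process owns the WAL file and that entries arrive as iodata):

```elixir
# Illustrative group-commit sketch, not Goblin's actual WAL.
# Commit calls block until their batch has been appended and fsynced once.
defmodule GroupCommitWal do
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)

  # Blocks the caller until its entry is durable on disk.
  def commit(entry), do: GenServer.call(__MODULE__, {:commit, entry})

  @impl true
  def init(path) do
    {:ok, fd} = :file.open(path, [:append, :raw, :binary])
    {:ok, %{fd: fd, pending: []}}
  end

  @impl true
  def handle_call({:commit, entry}, from, state) do
    # Don't reply yet; collect the entry and sync the whole batch shortly.
    if state.pending == [], do: Process.send_after(self(), :sync, 1)
    {:noreply, %{state | pending: [{from, entry} | state.pending]}}
  end

  @impl true
  def handle_info(:sync, %{fd: fd, pending: pending} = state) do
    # Append all pending entries, fsync once, then unblock every caller.
    data = pending |> Enum.reverse() |> Enum.map(fn {_from, entry} -> entry end)
    :ok = :file.write(fd, data)
    :ok = :file.sync(fd)
    Enum.each(pending, fn {from, _entry} -> GenServer.reply(from, :ok) end)
    {:noreply, %{state | pending: []}}
  end
end
```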