Why is table fragmentation not possible with local_content tables in mnesia?

mnesia disc copies have a size limit of 4Gb; a solution to this is to fragment the table. However, this feature is not available to local_content tables. Why is that?
And is there a solution other than implementing a feature that checks the size of the current table and creates a new one when it approaches the limit? That would in turn mean implementing yet another feature for querying across multiple tables whenever a record is needed.
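
For reference, this is roughly what fragmentation looks like on a regular table, which is exactly what I would like to do with a local_content table (table and record names here are made up purely for illustration):

```elixir
# Assumes mnesia is already started with a disc schema; the :events table
# and its fields are hypothetical, just to show the frag_properties option.
:mnesia.create_table(:events,
  attributes: [:id, :payload],
  disc_copies: [node()],
  frag_properties: [
    n_fragments: 8,       # split the table into 8 fragments
    node_pool: [node()],  # nodes the fragments may live on
    n_disc_copies: 1
  ]
)

# Access has to go through the mnesia_frag module, which hashes the
# key to pick the right fragment.
:mnesia.activity(
  :transaction,
  fn -> :mnesia.write({:events, 1, "hello"}) end,
  [],
  :mnesia_frag
)
```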

It seems like mnesia is starting to get a bit outdated because of its reliance on dets. On the other hand, I don’t want to use something like leveldb, as it’s not part of the ecosystem. What would you do?

Thank you.

I don’t know the answer to your question. I tried a bit of git blaming to see when that line in the docs was added and it goes back at least 16 years (probably older), so I wonder if anyone even remembers :slight_smile:

I think this is the understatement of the century lol

Probably use SQLite. It’s well-adopted here (Exqlite) and very stable.
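
For a taste, the low-level Exqlite.Sqlite3 API looks roughly like this (from memory, so the exact function arities may differ between Exqlite versions; double-check the docs):

```elixir
# Rough sketch of the low-level Exqlite API; names from memory,
# check the Exqlite docs for your version.
alias Exqlite.Sqlite3

{:ok, conn} = Sqlite3.open("kv.db")

:ok =
  Sqlite3.execute(conn, "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

{:ok, stmt} = Sqlite3.prepare(conn, "INSERT OR REPLACE INTO kv (k, v) VALUES (?1, ?2)")
:ok = Sqlite3.bind(conn, stmt, ["answer", "42"])
:done = Sqlite3.step(conn, stmt)
:ok = Sqlite3.release(conn, stmt)

:ok = Sqlite3.close(conn)
```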

For the record, I have been slowly progressing on an Elixir DB project but it goes without saying that it’s very tricky. I will be using SQLite as the underlying (disk) storage engine, at least for a while. Writing a storage engine is not easy.

There are a couple more BEAM-native K/V stores, though. There’s leveled, built as a pure-Erlang storage engine for Riak. It’s an LSM tree, so it’s no good for sequential reads.

There’s CubDB, which is an on-disk COW tree. It’s not a very advanced COW tree (no incremental GC), so it’s probably no good for “serious” usage, but it might get the job done.
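
Its API is pleasantly small, something like this (going off its README):

```elixir
# Minimal CubDB usage, roughly as shown in its README.
{:ok, db} = CubDB.start_link(data_dir: "my_data")

:ok = CubDB.put(db, {:user, 1}, %{name: "Alice"})
CubDB.get(db, {:user, 1})
#=> %{name: "Alice"}

:ok = CubDB.delete(db, {:user, 1})
```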

Finally there’s Khepri which is a raft-replicated in-memory (but also persisted) K/V store, which I think is intended as an Mnesia successor. I don’t know what sort of limits they have for table size, but you would be limited by your RAM at least.

1 Like

Any more details? What problem will it solve?

1 Like

Still not ready to talk about it yet, but the goal is high availability, durability (replication), horizontal scalability, and strict serializability (external consistency) by default with an ordered K/V model (ordered is very important), in pure Elixir (with SQLite for a while). I am now pretty confident I can actually meet those requirements, though performance remains an open question (things have been looking acceptable to me so far).

I then intend to layer a relational database on top of it so that I can finally escape Postgres and ascend into the BEAM, or something.

2 Likes

Thanks a lot for the much needed confirmation. It felt weird that no one was talking about the severe limitations of mnesia. I wish you the best of luck in your project. It is much needed.

I decided to use leveldb for on-disk storage (as I don’t need sequential reads) and handle the distribution manually. I will also use mnesia for in-RAM storage and config settings.
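
For anyone else going this route, the Basho eleveldb bindings boil down to something like this (from memory, so check the actual docs; keys and values must be binaries):

```elixir
# Rough sketch of the eleveldb NIF API (an Erlang library), called from Elixir.
{:ok, ref} = :eleveldb.open(~c"/var/data/mystore", create_if_missing: true)

:ok = :eleveldb.put(ref, "key-1", :erlang.term_to_binary(%{hello: "world"}), [])

case :eleveldb.get(ref, "key-1", []) do
  {:ok, bin} -> :erlang.binary_to_term(bin)
  :not_found -> nil
end
```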

FWIW, there’s also mnesia_eleveldb:

Doesn’t seem particularly active, but neither Mnesia nor LevelDB has much churn…

There was this recently posted on HN. I liked it and bookmarked it. Though currently I have no need for it, I will absolutely reach for it next time I need an MQ of sorts.

I appreciate the “from scratch” approach but that is one boastful readme lol.

It’s really not that hard to “dynamically shard” a KV store when you don’t care about consistency or durability. All of the complexity shows up when you need to provide a consistent view of the data while moving it.
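
To make that concrete: with no consistency requirements, the “sharding” part is just a hash (toy module, only for illustration):

```elixir
# Toy sharding: with no consistency guarantees, placement is just a hash of
# the key. All the real difficulty is in moving data between shards while
# still serving a consistent view, which this happily ignores.
defmodule ToyShards do
  def shard_for(key, n_shards), do: :erlang.phash2(key, n_shards)

  # "Resharding" is then just figuring out which keys land elsewhere under
  # the new shard count and copying them over, reads be damned.
  def keys_to_move(keys, old_n, new_n) do
    Enum.filter(keys, fn key -> shard_for(key, old_n) != shard_for(key, new_n) end)
  end
end
```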

And all that for 130,000 “ops” per second (no mention of the read/write split, which is interesting). If you’re working with any other platform I could see the value of a “minimal” caching store like this (there is really barely any code, which I appreciate). But on the BEAM I’m pretty sure you could beat that performance with like one ETS table, so idk.
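
For reference, the ETS baseline I have in mind is literally just this:

```elixir
# A single ETS table with concurrent reads and writes; in-memory only.
table = :ets.new(:kv, [:set, :public, read_concurrency: true, write_concurrency: true])

true = :ets.insert(table, {"key", "value"})
[{"key", "value"}] = :ets.lookup(table, "key")
```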

It’s a delicate balance in expression when you want to be both revolutionary and an experienced techie at the same time – and that guy didn’t nail it, that much is certain.

But the core of his argument is sound: everyone just immediately reaches for algorithms A and B when it comes to K/V stores, and algorithms C, D and E for distribution etc.

He’s right that uncritically reaching for stuff and just hoping you are the one who assembles a LEGO juuuust a smidgen better than the 561 guys before you is… not optimal. And not productive.

So I appreciate him for being a bit of a revolutionary. We’ll see though, some 6 months down the line, after the whole thing marinates a little bit.

Sure, but the BEAM ecosystem has the drawback of most stuff being ephemeral. As the BEAM ecosystem – and all others – mature and try to converge on a more stable, less “reinvent the wheel every time” approach, I believe persistence should become the default.

(And I realize you were talking about caches. I don’t mean those specifically; I mean K/V stores and message queues in general.)

1 Like

I mean, I totally agree with you. But I was reacting to the project you linked, which is an in-memory hash-table KV store with no transactions. Unless I’m missing something? The hash table in question is literally 130 lines. Which, like, in a vacuum is totally cool. But this is literally “reaching for algorithm A”, it’s the most standard hash table you can imagine!

The algorithm to move shards is literally the first example from the wiki page I linked. It’s not even the fancy incremental one, lol.

There is nothing to remove expired entries incrementally! They are GC’d “when the shard is resized”!

And again all of this is fine, I’m not trashing this dude’s project, I like it. But from the README you would think he built Spanner! I mean, hey, it worked for MongoDB, so I won’t count him out.

You are not missing anything, I simply mixed up expressing what I would prefer when I need K/V stores or MQ software with your criticism of nubmq’s README.

Apologies.

As for the rest of your remarks, I don’t disagree. I appreciate the guy for starting from scratch. If he ended up rediscovering old wisdom that’s also a very valid outcome.

1 Like

mnesia disc copies have a size limit of 4Gb

No, disc_copies tables are only limited by the amount of available RAM (on 64-bit systems, at least).

disc_only_copies tables are implemented using Dets tables, and thus limited to 2 GB. disc_copies tables used to use Dets tables as well, but that was changed in Erlang/OTP R7B-4.
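
In code, the difference is just which storage option you pick when creating the table (illustrative table names, assuming mnesia is running with a disc schema):

```elixir
# disc_copies: held fully in RAM (ets) and persisted to disk; limited by RAM.
:mnesia.create_table(:big_table,
  attributes: [:key, :value],
  disc_copies: [node()]
)

# disc_only_copies: backed by a dets file on disk, hence the 2 GB cap.
:mnesia.create_table(:capped_table,
  attributes: [:key, :value],
  disc_only_copies: [node()]
)
```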

1 Like

Do you know any details about the storage engine used for disc_copies? I (briefly) had a look at the docs and it sounds like it is literally just an append-only log. If there is no checkpointing, I can’t imagine this would be useful in production save for a small amount of data with a low write load.

What is the current backend used for on-disc storage?

In addition to the append-only log, Mnesia regularly “dumps” the log, writing the records into the data files. There are some details here: Mnesia System Information — mnesia v4.23.5
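
You can also trigger a dump yourself, and tune how often it happens via mnesia’s configuration parameters; roughly like this (the threshold values below are just examples, not recommendations):

```elixir
# Dump the transaction log into the table's disk files on demand.
:mnesia.dump_log()

# How often mnesia dumps on its own is controlled by application parameters,
# e.g. on the command line:
#   erl -mnesia dump_log_write_threshold 50000 -mnesia dump_log_time_threshold 180000
```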

1 Like

Thank you, that is very interesting.

I wonder if anyone knows in detail how the checkpointing (“dump”) is performed? These docs link to docs for disk_log, but that module makes no mention of checkpointing from what I can see.

In particular, I’m curious how the checkpoint is created “online”. One way would be to just dump the in-memory table, but that would mean blocking writes to get a consistent snapshot and the docs imply that the system continues to operate while checkpointing.

Another would be to merge the previous checkpoint with the log operations, but it is also implied that the entire table is stored in one flat file (i.e. not a tree or LSM structure) so you would have to read the whole thing into memory in order to apply the log operations, doubling memory usage (and consuming a lot of CPU to apply the writes, presumably).

In an LSM tree, for example, this can be done efficiently because the logs are actually sorted by key, so you can just do an on-disk merge sort; but here the log is an actual WAL.

One very clever trick: if the log operations are idempotent, you can checkpoint an inconsistent snapshot and then just reapply all log entries created since before the checkpointing started. Could that be it?
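
Purely to illustrate that trick (not a claim about how Mnesia actually does it): with blind, idempotent set/delete operations you can snapshot while writes continue, then converge by replaying the log from before the snapshot started:

```elixir
# Toy "fuzzy checkpoint": snapshot without blocking writers, then rely on
# idempotent set/delete log entries to repair any inconsistency.
defmodule FuzzyCheckpoint do
  # Writers keep going while we walk the table, so the snapshot may mix
  # old and new values. That's fine as long as we also keep the log.
  def snapshot(ets_table), do: :ets.tab2list(ets_table)

  # Replaying every log entry written since *before* the snapshot started
  # overwrites whatever mixture the snapshot captured, converging on the
  # correct final state (works because the ops don't read previous values).
  def recover(snapshot, log_since_before_snapshot) do
    base = Map.new(snapshot)

    Enum.reduce(log_since_before_snapshot, base, fn
      {:set, key, value}, acc -> Map.put(acc, key, value)
      {:delete, key}, acc -> Map.delete(acc, key)
    end)
  end
end
```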

If nobody else knows I’ll try to track down the code later.

I suspect you’ll get a better answer over at the Erlang Forums. I am for the most part blissfully ignorant about Mnesia internals!

1 Like