Ram, an in-memory distributed KV store

Are all of the databases in this benchmark strongly consistent? It feels a little like comparing apples and pears. For instance, BangDB has tunable consistency. My simple library requires 100% confirmation, and it also copies all of the data to every node, which of course comes with a cost.

Thank you for the suggestion. This is an experiment so I might invest some time to investigate, though I was really looking for something simple. ram currently replicates 100% of its data to all nodes, and requires a 100% replication consensus, so it’s about as simple as it gets.
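To make that concrete, here is a minimal sketch (not ram’s actual code) of the idea: every write is pushed to every connected node and only succeeds if every node acknowledges it. The module name, the kv_table table and the error shape are made up for the example, and it assumes a named, public ETS table already exists on each node.

-module(naive_full_replication).
-export([write/2]).

%% Push the write to every connected node and require an acknowledgement
%% from all of them before reporting success.
write(Key, Value) ->
    Nodes = [node() | nodes()],
    Results = erpc:multicall(Nodes, ets, insert, [kv_table, {Key, Value}]),
    case lists:all(fun({ok, true}) -> true; (_) -> false end, Results) of
        true -> ok;
        false -> {error, not_all_nodes_acknowledged}
    end.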

1 Like

So, I’ve tried using ra (GitHub - rabbitmq/ra): a Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

I’ve found some minor quirks (like inconsistencies when using ra:start() vs application:ensure_all_started(ra), or ra:overview() blowing up if ra is not started), but overall the main issue I can see with my main requirement of simplicity of use is that ra needs specific handling of cluster initialization & changes. Just like mnesia, in some way.

You can’t simply add / remove a node automatically: you need to specifically manage its addition to an existing cluster, and you can only do so one node at a time. You can’t simply “merge” nodes of different clusters, for instance, afaik. This requires some level of sys-op tooling. It’s not an issue per se of course, it’s just that I’m looking for something simple to use and deploy.
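To give an idea of what that explicit handling looks like, here is a rough Erlang-shell sketch of growing and shrinking a Ra cluster one node at a time. The cluster name, node names and the my_kv_machine module are made up for the example, and the exact arities differ a bit between ra versions, so double-check the ra docs:

%% on every node, start ra first (this also starts its default system)
ra:start().

%% start an initial cluster on a fixed, known set of nodes
ClusterName = <<"kv">>.
Machine = {module, my_kv_machine, #{}}.
Members = [{kv, 'node1@host'}, {kv, 'node2@host'}].
{ok, _Started, _NotStarted} = ra:start_cluster(default, ClusterName, Machine, Members).

%% later, to join a third node: tell an existing member about it first...
NewMember = {kv, 'node3@host'}.
{ok, _, _Leader} = ra:add_member({kv, 'node1@host'}, NewMember).
%% ...then start the server on the new node (one node at a time)
ok = ra:start_server(ClusterName, NewMember, Machine, Members).

%% removing a node is just as explicit
{ok, _, _} = ra:remove_member({kv, 'node1@host'}, NewMember).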

Also, I’ve tried reading about how to manage conflict resolution in case of net-splits, but I couldn’t find a specific place to look. Admittedly I haven’t spent a lot of time searching for this one in particular, so I may have just missed it.

My little experiment ram handles all of these things btw, but it probably has other issues that are instead perfectly handled by ra.

6 Likes

A small update on this.

My original (experimental) intent was to have a consistent distributed in-memory database, which would also be able to automatically manage dynamic clusters (addition / removal of nodes) and to recover from net splits.

I have built strongly eventually consistent systems that automatically configure themselves before (syn, for instance), however consistency is of course a whole different beast.

ram v0.2.0’s operations used global locks and transactions with a 2-phase commit. However, in the case of net-splits I’ve basically reached a point where I can ensure strong consistency… on each of the split sub-clusters, which basically means that the system as a whole is only eventually consistent.
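For context, this is roughly the shape that write path had (a sketch only, not ram’s actual v0.2.0 code): take a cluster-wide lock on the key, ask every node to prepare, then commit everywhere. The ram_store prepare/commit/abort helpers are hypothetical.

-module(ram_v020_sketch).
-export([write/2]).

%% Lock the key cluster-wide, then run a naive two-phase commit across all nodes.
write(Key, Value) ->
    Nodes = [node() | nodes()],
    global:trans({{ram_key, Key}, self()}, fun() ->
        %% phase 1: every node must acknowledge the prepare
        {Votes, BadNodes} = rpc:multicall(Nodes, ram_store, prepare, [Key, Value]),
        case BadNodes =:= [] andalso lists:all(fun(V) -> V =:= ok end, Votes) of
            true ->
                %% phase 2: apply the write everywhere
                rpc:multicall(Nodes, ram_store, commit, [Key, Value]),
                ok;
            false ->
                rpc:multicall(Nodes, ram_store, abort, [Key]),
                {error, not_committed}
        end
    end, Nodes).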

This is kind of a simple thing really: if a system is set to auto-configure itself, it doesn’t really know which of its nodes are there to stay and which are not (i.e. it cannot differentiate net-splits from node decommissioning), so it cannot define what consistency means. I’ve tried playing with some algorithms but I wasn’t really going anywhere useful.

So, I’ve followed @RudManusachi and @fabriziosestito’s suggestions and used Ra in the current ram implementation. This of course means that cluster definition (along with node additions and removals) needs to be explicitly set with sys-op tooling. So far, I’m happy with the results.

If you wish, you can check the new ram docs:
https://hexdocs.pm/ram
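Usage now looks roughly like this (from memory, so double-check the docs above for the exact function names; the node names are placeholders):

%% on one of the already-connected nodes, define the cluster explicitly
ram:start_cluster(['ram@node1', 'ram@node2', 'ram@node3']).

%% then, from any node
ram:put("my-key", <<"my-value">>).
ram:get("my-key").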

As usual, feedback and discussion are welcome. Thanks to everyone that took the time to be part of this thread.

12 Likes

Is it still “in-memory”? My understanding of Ra is that it persists data in a WAL file on disk.

It does persist the machine state, yes. But ATM it still is effectively in-memory, as I need to finalize a couple of things to fully take advantage of that.

@dom I confirm that ram is now not only in-memory but also persisted to disk :slight_smile:

2 Likes

Many thanks @ostinelli - this is fantastic!!

I can indeed see the persistence to disk - but I can’t figure out how to get the persisted data back the next time I start the cluster. I get back :undefined when I try to get the value of a key stored in a previous session.

I also tried saving 10000 random key-value pairs, as I thought it might have something to do with the release_cursor_count: 1000 setting, but still no luck.

Could you please let me know how to do this?

Thanks again

@groovyda That’s because you couldn’t!

Thanks to your feedback, you now can, by calling restart_server/0 on every node that rejoins. Maybe I should add a restart_cluster/1 helper as well?
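In other words, something along these lines on each restarted node (assuming the data files from the previous run are still on disk; the key is just an example):

%% rejoin using the Ra state already persisted on disk,
%% instead of re-creating the cluster from scratch
ram:restart_server().

%% data stored in the previous session should then be readable again
ram:get("my-key").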

Hope this helps,
r.

3 Likes

I’ve also tried using Ra to implement a KV store (in Elixir however, given I have no real experience with Erlang), and found the same issue. I initially thought it might have something to do with this getting thrown into the console logs:

[debug] wal: recovering 00000009.wal
[debug] wal: recovered 00000009.wal time taken 11ms
[debug] wal: opening new file 00000010.wal open mem tables: #Reference<0.242204176.4200202245.257691>
[debug] ra_log_segment_writer: error sending ra_log_event to: RA_KV1LP2NCHQSD3X. Error: No Pid

However, once I started to read into the Ra internals, this particular line definitely seemed to resonate:

When ra:start_server/4 or ra:start_cluster/3 are invoked, an UID is automatically generated by Ra.

This is used for interactions with the write-ahead log, segment and snapshot writer processes who use the ra_directory module to lookup the current pid() for a given ID.

This is mentioned again in the recovery section too. So, I tried to hard-code the UID that was generated, by updating the :ra.start_cluster and :ra.start_server calls to pass the full configuration map. Once I did, I confirmed that the state of the Ra server was brought up to date. In my excitement, I brought up a follower, logged the state machine apply/3 calls, and found that everything the leader had was replicated!
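Roughly, the idea is to pin the uid in the full server config so that, on restart, the WAL / segment / snapshot files can be matched back to the same server. A sketch of that config (shown Erlang-style, mirroring the Ra docs; the cluster name, uid value and my_kv_machine module are placeholders, and it assumes Ra’s default system):

UId = <<"ram_kv_node1">>.   %% fixed, instead of letting ra generate a new one
ServerId = {ram_kv, node()}.
Config = #{id => ServerId,
           uid => UId,
           cluster_name => <<"ram_kv">>,
           log_init_args => #{uid => UId},
           initial_members => [ServerId],
           machine => {module, my_kv_machine, #{}}}.
ok = ra:start_server(default, Config).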

I reckon ram here could do the same, but I don’t know the full implications of hardcoding this. It doesn’t sound great, but the WAL replication definitely survives node terminations.

1 Like

There is also a function called start_or_restart_cluster/4 which, however, fails randomly with a name_not_registered error.
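For reference, this is the call I mean, shown Erlang-style with placeholder cluster / machine names; it is meant as a one-shot “create the cluster if it’s new, otherwise restart it”:

Machine = {module, my_kv_machine, #{}}.
Members = [{ram_kv, N} || N <- [node() | nodes()]].
{ok, _Started, _NotStarted} =
    ra:start_or_restart_cluster(default, <<"ram_kv">>, Machine, Members).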

I’ve implemented it in a branch but tests fail randomly.

There are some inconsistencies between the docs and the specs in this library, but it being production-ready and used by the RabbitMQ team is the main appeal to me.

1 Like

@beepbeepbopbop that’s what they do in khepri here so you should be good. I’m gonna try this approach as well.

1 Like

Thanks @ostinelli. Yes please - restarting the cluster is the main use case I was looking for indeed.

I see - thanks, that makes sense.

So, in my understanding :ets is not used directly by ram, but ra itself seems to use :ets.

I couldn’t find a reference in the docs on how to configure the ETS table type. I’m asking because I would like to use the duplicate_bag type.
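To clarify what I’m after, with a plain ETS table of that type you can keep several (even identical) objects under the same key, for example:

T = ets:new(example, [duplicate_bag]).
ets:insert(T, {<<"k">>, 1}).
ets:insert(T, {<<"k">>, 1}).
ets:insert(T, {<<"k">>, 2}).
ets:lookup(T, <<"k">>).
%% returns all three objects stored under <<"k">>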

@ostinelli thanks for starting this project, it looks really cool. I was curious if you have ever used lbm_kv before? GitHub - lindenbaum/lbm_kv: A dynamically-distributed, highly-available, partition-tolerant, in-memory key-value store built with Mnesia.

It uses mnesia and handles a dynamic cluster, so nodes can join and leave.

I haven’t, but from a quick glance it still uses the unsplit mechanisms to merge nodes after a netsplit (code here), and in my experience this mechanism does not work well with clusters > 2 nodes.

Ulf still wanted to test it on more than 2 nodes, and if I’m not mistaken even lbm_kv only tests it on a two-node conflict.

I still like mnesia but I’ve had way too many issues with it; after all, it was born even before the CAP theorem or modern cloud solutions came to life. I also moved away from mnesia in my syn library, which in v3 only uses ETS.

1 Like

Is there a configuration option to only have the data in memory? I.e., I don’t want or need to persist it to disk.

I have checked the RA and RAM config options but nothing obvious springs out

thanks

Tom

This is the default for RAM. You don’t need any option, since RAM is in-memory by design

Ah thanks for the clarification!

Tom