Flushing Mnesia's stopped db nodes

Fake51 · June 24, 2021, 12:59pm

As the title says - is there any way of flushing the “stopped db nodes” for mnesia?

Context: we’re running a distributed app in k8s, and on deploy we just let k8s do its business. This means pods get taken down and new ones spun up. However, it’s possible that new pods coming up will have the same identifiers as earlier pods - and the longer things run, the more likely it is. Basically, this means that pod A could connect to the mnesia cluster, later get removed, and pod B - with the same name - could then try to connect as well.

In itself, this shouldn’t cause problems - but we copy tables/data to each pod connecting to the cluster, and we just had a pod crash on startup. Based on the logs, it failed to properly receive data, and later just couldn’t connect to mnesia at all.

So, looking for possible fixes for this - is it possible to scrub “stopped db nodes” from mnesia?
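For reference, a minimal sketch of one way this could be approached, not something tested against the setup above. It assumes the "stopped" nodes are those listed in `:mnesia.system_info(:db_nodes)` but missing from `:mnesia.system_info(:running_db_nodes)`, and that removing a node from the schema with `:mnesia.del_table_copy(:schema, node)` is acceptable (Mnesia has to be stopped on the node being removed). The `StoppedNodes` module and its function names are made up for illustration.

```elixir
defmodule StoppedNodes do
  # Hypothetical helper module, not part of Mnesia.

  # Pure helper: nodes known to the schema that are not currently running.
  def stopped(db_nodes, running_db_nodes), do: db_nodes -- running_db_nodes

  # Asks the local Mnesia instance for both lists (Mnesia must be started).
  def list do
    stopped(
      :mnesia.system_info(:db_nodes),
      :mnesia.system_info(:running_db_nodes)
    )
  end

  # Tries to drop every stopped node from the schema and returns the
  # per-node result, {:atomic, :ok} or {:aborted, reason}.
  def purge do
    for node <- list() do
      {node, :mnesia.del_table_copy(:schema, node)}
    end
  end
end
```

Calling `StoppedNodes.list()` first and inspecting the result before running `purge/0` is probably the safer order, since a node that is only briefly unreachable would also show up as stopped.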

ndrean · May 6, 2022, 12:33am

I stumbled on this post, probably too late. My focus is cleaning up the space taken by the data of obsolete nodes, and I have a naive implementation working for a local dev cluster in Gossip mode with libcluster. When I get a nodedown from Erlang or a :mnesia_down Mnesia system event, I simply erase the node's data folder in a handle_info. In local dev mode, I configured :mnesia, :dir to create a folder named after the node: ../mnesia_a@127.0.0.1 for the node a@...
With k8s, you make sure to have one surviving node, so any new node spinning up will get an up-to-date copy.

However, is this portable to a k8s solution, given that I need the path to the folder?
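A rough sketch of the idea described in this post, not the actual implementation: a GenServer subscribes to Erlang node events and Mnesia system events and removes the folder of a node that went down. The module name `NodeCleaner`, the `data_dir/2` helper and the `".."` base path are assumptions for illustration; the folder naming follows the ../mnesia_a@127.0.0.1 pattern mentioned above.

```elixir
defmodule NodeCleaner do
  use GenServer

  # Hypothetical helper: builds the folder path from the node name,
  # following the "../mnesia_<node>" naming used in local dev.
  def data_dir(node, base \\ ".."), do: Path.join(base, "mnesia_#{node}")

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Delivers {:nodeup, node} and {:nodedown, node} messages to this process.
    :ok = :net_kernel.monitor_nodes(true)
    # Delivers {:mnesia_system_event, event} messages; Mnesia must be running.
    {:ok, _node} = :mnesia.subscribe(:system)
    {:ok, %{}}
  end

  @impl true
  def handle_info({:nodedown, node}, state) do
    File.rm_rf(data_dir(node))
    {:noreply, state}
  end

  def handle_info({:mnesia_system_event, {:mnesia_down, node}}, state) do
    File.rm_rf(data_dir(node))
    {:noreply, state}
  end

  # Ignore :nodeup and any other Mnesia system events.
  def handle_info(_other, state), do: {:noreply, state}
end
```

Note that `File.rm_rf/1` only touches the filesystem of the node running this process, so the sketch relies on every node's folder being reachable from there, as it is on a single dev machine.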

Fake511 · May 6, 2022, 8:47am

As such I don’t see a problem with k8s - there’s nothing stopping you from configuring the paths just because you’re deploying in k8s.

I haven’t tried out your solution, so I don’t know how mnesia will behave if its data folders suddenly go missing - whether that’s perfectly fine, or whether there might be some state issues. That’s for you to test, I’d say.

My only concern would be things like netsplit - what happens if a node gets disconnected from half the network, and then later rejoins? How well will it handle a complete netsplit?

ndrean · May 6, 2022, 5:04pm

As you pointed out, an :init.stop is :ok, but a Node.disconnect of a particular node makes Mnesia misbehave. There are errors and inconsistencies between the existing data folders and what :mnesia.system_info returns. I get an accumulation of “mnesiaCore.xxx” files. Not every node is always recovered. And if I don’t trigger the folder removal and send a Node.disconnect from a node, I have to restart the sending node for it to be in sync again.
I might be wrong, as I was mainly concerned with the accumulation of dangling volumes when working with k8s.

ndrean · May 6, 2022, 11:53pm

Well, in case it’s of any interest: I put the File.rm_rf(..) cleanup in the terminate callback, which runs on :init.stop(). This works. I also had a bug; with that fixed, a call to Node.disconnect(other@node) now behaves normally: the supervisor reconnects. Still to be tested IRL.
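A minimal sketch of what such a terminate-based cleanup could look like, assuming the cleanup lives in a GenServer that is given the data folder path when started. The module name `MnesiaDirCleanup` is hypothetical. The process has to trap exits, otherwise terminate/2 is not invoked when the supervisor shuts it down during :init.stop().

```elixir
defmodule MnesiaDirCleanup do
  use GenServer

  # `dir` is the Mnesia data folder of this node (assumed to be passed in).
  def start_link(dir), do: GenServer.start_link(__MODULE__, dir)

  @impl true
  def init(dir) do
    # Without trapping exits, terminate/2 is skipped on supervisor shutdown.
    Process.flag(:trap_exit, true)
    {:ok, dir}
  end

  @impl true
  def terminate(_reason, dir) do
    # Remove this node's data folder on the way out.
    File.rm_rf(dir)
    :ok
  end
end
```

This only covers orderly shutdowns: if the VM is killed outright, terminate/2 never runs and the folder stays behind.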