How to reach a stable state after Mnesia nodes disconnect?

danyalmh · March 19, 2022, 3:05pm

Hi,

If a replicated Mnesia node disconnects from the master,
it shows a message similar to: (inconsistent)

My questions are:

Can nodes reconnect automatically?
How do I reach a stable state?
How have other distributed databases solved this problem?

jchrist · March 19, 2022, 8:16pm

Hi there,

I believe mnesiac implements automatic reconnections to mnesia nodes. From my understanding it will copy over data from other nodes when a node comes online.

For network inconsistencies, there’s unsplit, which does the following:

Unsplit starts a subscription on the ‘partitioned network’ event, and forces Mnesia to merge the “islands” that have been separated. It inserts itself into the schema merge transaction, claiming table locks on all affected tables. It then runs user-provided merge callbacks for each table, fetching data from one side, comparing the objects, and writing back the data that should be kept.
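To make the merge step a bit more concrete, here is a minimal sketch of the kind of decision a per-table merge callback has to make. It is written in Python purely for illustration and is not unsplit's API: the `merge_islands` function, the `(value, timestamp)` record shape, and the "newest timestamp wins" rule are all assumptions. A real callback would apply whatever rule fits your data.

```python
from typing import Dict, Tuple

# Hypothetical record shape: (value, timestamp). Not part of Mnesia or unsplit.
Record = Tuple[str, int]


def merge_islands(side_a: Dict[str, Record],
                  side_b: Dict[str, Record]) -> Dict[str, Record]:
    """Merge two diverged copies of a table, keeping the newer record per key."""
    merged = dict(side_a)
    for key, record_b in side_b.items():
        record_a = merged.get(key)
        # Take side B's record if side A lacks the key or holds an older one.
        # On equal timestamps, side A's record is kept.
        if record_a is None or record_b[1] > record_a[1]:
            merged[key] = record_b
    return merged


island_a = {"x": ("a1", 1), "y": ("a2", 5)}
island_b = {"x": ("b1", 3), "z": ("b3", 2)}
merged = merge_islands(island_a, island_b)
```

Note that "newest wins" silently discards the older write, so it only suits data where losing one side's update is acceptable.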

That said, I think the upcoming OTP 25 release should prevent some partition situations in OTP itself (see the “global will now by default prevent overlapping partitions due to network issues” point).

This is probably a bit much to go into in detail in a single post. The Dynamo paper is definitely a good read if this interests you, and Riak was also inspired by it. As for how Riak stays available when nodes disconnect:

Writes are routed by consistent hashing over the ring
Each data item is replicated to a configurable number of other ring partitions (therefore distributing it across nodes)
If a node goes down, another node temporarily takes over handling that node’s writes. When the node comes back online, the node that took over propagates the new data back to the original host, and everything continues as normal
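The three points above can be sketched as a toy in-memory ring. This is not Riak's implementation: the `Ring` class, its method names, the use of SHA-256, and placing one point per node on the ring are all assumptions made to keep the example short.

```python
import hashlib
from bisect import bisect_right
from typing import Dict, List, Set


class Ring:
    """Toy consistent-hash ring with replication and hinted handoff."""

    def __init__(self, nodes: List[str], replicas: int = 3) -> None:
        self.replicas = replicas
        self.points = sorted((self._hash(n), n) for n in nodes)
        self.hashes = [h for h, _ in self.points]
        self.down: Set[str] = set()
        self.store: Dict[str, Dict[str, str]] = {n: {} for n in nodes}
        # hints[stand_in][intended_owner] = {key: value}
        self.hints: Dict[str, Dict[str, Dict[str, str]]] = {n: {} for n in nodes}

    @staticmethod
    def _hash(text: str) -> int:
        return int(hashlib.sha256(text.encode()).hexdigest(), 16)

    def _walk(self, key: str) -> List[str]:
        """All nodes in ring order, starting clockwise from the key's position."""
        start = bisect_right(self.hashes, self._hash(key))
        count = len(self.points)
        return [self.points[(start + i) % count][1] for i in range(count)]

    def preference_list(self, key: str) -> List[str]:
        """The nodes that should hold a replica of this key."""
        return self._walk(key)[: self.replicas]

    def fail(self, node: str) -> None:
        self.down.add(node)

    def put(self, key: str, value: str) -> None:
        order = self._walk(key)
        owners = order[: self.replicas]
        stand_ins = [n for n in order[self.replicas:] if n not in self.down]
        for owner in owners:
            if owner not in self.down:
                self.store[owner][key] = value
            elif stand_ins:
                # Owner is unreachable: park the write on the next live node.
                stand_in = stand_ins.pop(0)
                self.hints[stand_in].setdefault(owner, {})[key] = value
            # With no live stand-in left, this replica is simply skipped.

    def recover(self, node: str) -> None:
        """Bring a node back and hand its parked writes over to it."""
        self.down.discard(node)
        for parked in self.hints.values():
            self.store[node].update(parked.pop(node, {}))


ring = Ring(["a", "b", "c", "d", "e"], replicas=3)
owners = ring.preference_list("user:42")
ring.fail(owners[0])
ring.put("user:42", "v1")   # the write still succeeds on the other nodes
ring.recover(owners[0])     # the parked write is handed back
```

The write issued while `owners[0]` is down lands on the two reachable owners plus one stand-in outside the preference list, and `recover` moves the parked copy back to its intended node.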

All of this works fine as long as the ring is available, which is achieved by having enough nodes online such that quorum on the ring is achieved.
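As a rough illustration of the quorum condition, the check below treats quorum as a strict majority of a key's replicas being reachable. That is an assumption for the sake of the example: `has_quorum` is a hypothetical helper, and Riak's actual read and write quorum values are configurable rather than fixed at a majority.

```python
def has_quorum(replicas_online: int, replicas_total: int) -> bool:
    """True when a strict majority of a key's replicas can respond."""
    return replicas_online > replicas_total // 2


# With 3 replicas, losing one still leaves a majority; losing two does not.
one_down = has_quorum(2, 3)
two_down = has_quorum(1, 3)
```

An even replica count needs more than half, so 2 of 4 is not a majority under this rule.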

al2o3cr · March 19, 2022, 10:06pm

Turns out a lot of them haven’t, at least not all the way. There’s a lot to learn from these analysis reports about how distributed systems can fail:

https://jepsen.io/analyses