How to reach a stable state after Mnesia nodes disconnect?

Hi,

If a replicated Mnesia node disconnects from the master,
Mnesia shows a message similar to: ( inconsistent )

My questions are:

  1. Can the nodes reconnect automatically?
  2. How do I get back to a stable state?
  3. How do other distributed DBs solve this problem?

Hi there,

I believe mnesiac implements automatic reconnection to Mnesia nodes. From my understanding, it copies data over from the other nodes when a node comes online.

For inconsistencies caused by network partitions, there’s unsplit, which does the following:

Unsplit starts a subscription on the ‘partitioned network’ event, and forces Mnesia to merge the “islands” that have been separated. It inserts itself into the schema merge transaction, claiming table locks on all affected tables. It then runs user-provided merge callbacks for each table, fetching data from one side, comparing the objects, and writing back the data that should be kept.
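To make the "comparing the objects, and writing back the data that should be kept" step more concrete, here is a rough, language-neutral sketch in Python. It is not unsplit's API. The `merge_islands` function, the record layout, and the "newest timestamp wins" rule are all assumptions made for illustration; a real merge callback would apply whatever rule fits the table.

```python
from typing import Dict, Tuple

# Each record is (value, modified_at). The timestamp field is an assumption
# of this sketch; Mnesia does not add one to your records for you.
Record = Tuple[str, int]


def merge_islands(side_a: Dict[str, Record], side_b: Dict[str, Record]) -> Dict[str, Record]:
    """Merge two diverged copies of a table, keeping the newer record per key."""
    merged = dict(side_a)
    for key, record in side_b.items():
        current = merged.get(key)
        # Take side_b's record when the key is new or its timestamp is later.
        # On a tie, side_a's record is kept.
        if current is None or record[1] > current[1]:
            merged[key] = record
    return merged


# Two "islands" that accepted different writes while partitioned.
island_a = {"user:1": ("alice@old.example", 10), "user:2": ("bob@example", 12)}
island_b = {"user:1": ("alice@new.example", 15), "user:3": ("carol@example", 11)}
merged = merge_islands(island_a, island_b)
```

Last-write-wins is only one possible policy and it silently drops the older write, so it is not a safe default for every table.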

That said, I think the upcoming OTP 25 release should prevent some partition scenarios on its own (see the "global will now by default prevent overlapping partitions due to network issues" point).

This is probably too much to go into in detail in a single post. The Dynamo paper is definitely a good read if this interests you, and Riak was also inspired by it. As for how Riak stays available when nodes disconnect:

  • Writes are routed by consistent hashing over the ring
  • Each data item is replicated to a configurable number of other ring partitions (and is therefore distributed across nodes)
  • If a node goes down, another node temporarily takes over handling that node’s writes. When the node comes back online, the node that took over hands the new data back to the original host, and everything continues as normal

All of this works as long as the ring is available, which requires enough nodes to be online to reach quorum on the ring.
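The three points above can be sketched roughly like this in Python. This is a toy model, not Riak's implementation: the `Ring` class, the SHA-1 ring positions, one ring position per node, and the node names are all assumptions for illustration, and quorum checks are left out.

```python
import hashlib
from bisect import bisect_right


def ring_position(key: str) -> int:
    """Map a string to a position on the ring using a stable hash."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.up = set(nodes)
        self.data = {n: {} for n in nodes}    # node -> {key: value}
        self.hints = {n: {} for n in nodes}   # holder -> {(owner, key): value}
        self.points = sorted((ring_position(n), n) for n in nodes)

    def walk(self, key):
        """All nodes in ring order, starting clockwise from the key's position."""
        start = bisect_right(self.points, (ring_position(key), ""))
        count = len(self.points)
        return [self.points[(start + i) % count][1] for i in range(count)]

    def put(self, key, value):
        order = self.walk(key)
        owners, fallbacks = order[:self.replicas], order[self.replicas:]
        spare = [n for n in fallbacks if n in self.up]
        for owner in owners:
            if owner in self.up:
                self.data[owner][key] = value
            elif spare:
                # A live fallback node keeps the write on behalf of the owner.
                self.hints[spare.pop(0)][(owner, key)] = value

    def recover(self, node):
        """Mark a node as up and replay the writes held on its behalf."""
        self.up.add(node)
        for held in self.hints.values():
            for owner, key in [k for k in held if k[0] == node]:
                self.data[node][key] = held.pop((owner, key))


ring = Ring(["n1", "n2", "n3", "n4", "n5"], replicas=3)
down = ring.walk("cart:42")[0]   # first owner of this key
ring.up.discard(down)            # simulate that node disconnecting
ring.put("cart:42", "3 items")   # write still succeeds on the other nodes
ring.recover(down)               # the held write is handed back
```

After `recover`, the node that was down holds the value again and the temporary copy on the fallback node is gone.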

Turns out a lot of other distributed databases haven’t solved it, at least not all the way. There’s a lot to learn from these analysis reports about how distributed systems can fail:

https://jepsen.io/analyses
