Help reviewing distributed Mnesia cache in Pow

danschultzer · August 12, 2019, 6:01pm

I’ve been working on making the Mnesia cache in Pow work out of the box in a cluster, and got a PR up at https://github.com/danschultzer/pow/pull/233.

This is a large enough PR, in an area I don’t have that much experience with, that I would like some feedback to figure out if it is the right way to handle clusters in Mnesia. The documentation for Mnesia clusters is lacking, and there are very few examples. Any code reviews are very welcome, as well as comments on this thread.

The changes to make the cluster work are here (GitHub diff): mnesia_cache.ex

And I added a distributed nodes test here (GitHub diff): mnesia_cache_test.exs

All nodes have disk copies. When a node connects to an existing cluster, it purges its disk data and then initiates replication. I think this makes sense since all Pow cache data is ephemeral, and I won’t have to deal with merging data. The worst-case scenario is that a session key is lost, so the user has to log in again, or a reset password token expires. An obvious caveat is if you also use Mnesia for other things on that node, since the purge would affect that data too.
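The purge-then-replicate flow described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the module name `JoinSketch`, the function `reset_and_join/3`, and the default timeout are made up for illustration, and only standard `:mnesia` calls are used.

```elixir
defmodule JoinSketch do
  @moduledoc "Hypothetical helper; not the code from the PR."

  # Wipes this node's Mnesia data and re-joins the cluster reachable through
  # `cluster_nodes`, pulling a fresh disc copy of `table`.
  def reset_and_join(cluster_nodes, table, timeout \\ 15_000) do
    with :stopped <- :mnesia.stop(),
         # Deletes the whole local schema, so every table on this node is lost.
         :ok <- :mnesia.delete_schema([node()]),
         :ok <- :mnesia.start(),
         # Require at least one cluster node to have been reached.
         {:ok, [_ | _]} <- :mnesia.change_config(:extra_db_nodes, cluster_nodes),
         {:atomic, :ok} <- :mnesia.change_table_copy_type(:schema, node(), :disc_copies),
         {:atomic, :ok} <- :mnesia.add_table_copy(table, node(), :disc_copies),
         :ok <- :mnesia.wait_for_tables([table], timeout) do
      :ok
    else
      error -> {:error, error}
    end
  end
end
```

It would be called as `JoinSketch.reset_and_join(Node.list(), :my_cache)`, where `:my_cache` is a placeholder table name. Because `:mnesia.delete_schema/1` removes the whole local schema, the caveat about other Mnesia data on the node applies in full.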

As keys can expire, I also let the nodes communicate with each other when a TTL is updated to ensure that a timer is set on the other nodes as well. This ensures that elements will expire even if the node that wrote the cache element goes down. I think I may change this so it’s just handled by a periodic flush instead.
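To illustrate the periodic flush alternative, here is a minimal sketch that keeps entries in the GenServer state instead of a Mnesia table so that it stays self-contained. The module name `PeriodicFlushSketch`, its API, and the 1-second default interval are assumptions for illustration, not Pow's implementation. The idea is that each entry carries its own expiry timestamp, so any process that runs the flush can expire it without per-key timers.

```elixir
defmodule PeriodicFlushSketch do
  @moduledoc "Hypothetical illustration of TTL expiry by periodic flush."
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  def put(pid, key, value, ttl_ms), do: GenServer.call(pid, {:put, key, value, ttl_ms})

  def get(pid, key), do: GenServer.call(pid, {:get, key})

  # Pure helper: keep only the entries whose expiry lies after `now` (ms).
  def drop_expired(entries, now) do
    for {key, {_value, expires_at} = entry} <- entries,
        expires_at > now,
        into: %{},
        do: {key, entry}
  end

  @impl true
  def init(opts) do
    interval = Keyword.get(opts, :flush_interval, 1_000)
    Process.send_after(self(), :flush, interval)
    {:ok, %{entries: %{}, interval: interval}}
  end

  @impl true
  def handle_call({:put, key, value, ttl_ms}, _from, state) do
    # Wall-clock time, so a stored timestamp would mean the same on every
    # node (assuming roughly synchronised clocks).
    entry = {value, now() + ttl_ms}
    {:reply, :ok, %{state | entries: Map.put(state.entries, key, entry)}}
  end

  def handle_call({:get, key}, _from, state) do
    reply =
      case Map.fetch(state.entries, key) do
        {:ok, {value, expires_at}} -> if expires_at > now(), do: value, else: :not_found
        :error -> :not_found
      end

    {:reply, reply, state}
  end

  @impl true
  def handle_info(:flush, state) do
    # Schedule the next flush, then drop everything that has expired.
    Process.send_after(self(), :flush, state.interval)
    {:noreply, %{state | entries: drop_expired(state.entries, now())}}
  end

  defp now, do: System.system_time(:millisecond)
end
```

With a replicated table, the same approach would store the expiry timestamp in the record itself, which removes the need for nodes to notify each other about TTL updates.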

I would appreciate comments on refactoring and on changes unrelated to the Mnesia logic, but the feedback I want most is whether the cluster logic is sound or whether there are potential pitfalls that should be taken care of. The init_mnesia/1 function is where that starts.

Also, I would be happy to hear from anyone who is testing this out in their distributed system!

The last thing I’m now looking at for this PR is split-brain recovery.

danschultzer · August 16, 2019, 2:00pm

I’ve refactored, and added a GenServer that can self-heal the cluster in case of a netsplit. Hopefully I’ve built a pretty solid solution here that will make auth instant in distributed systems (with no need to use JWT).

It’s unfortunate that there are so few libraries using Mnesia, especially for clusters.

tangui · August 16, 2019, 5:02pm

Hi, no review, sorry, but I’d love to have a Mnesia auto-clustering lib too, as I feel like this is done again and again in various projects (such as Redex for the most recent). Would be great for Asteroid too.

If I may ask, which libraries do you take inspiration from? Have you taken a look at Mnesiac? (Haven’t had time to assess it yet.)

I considered writing such a lib and found jc_cache. The code for netsplit recovery is here: https://github.com/jr0senblum/jc/blob/master/src/jc_netsplit.erl if it can be useful. And there’s unsplit for merging too.

danschultzer · August 16, 2019, 5:34pm

Yeah, I feel that this last month I’ve looked at nearly all libraries in Erlang/Elixir that do Mnesia clusters, including Mnesiac, jc_cache, and unsplit for netsplit handling. I’ve also looked at RabbitMQ, ejabberd, and a bunch of other libraries or repos I can’t remember.

What I gathered is that everyone does it differently, and most of it looked like it was just hacked together. So I turned to the Mnesia docs and found that :mnesia.set_master_nodes/2 was pretty much all I needed to heal the cluster, since I don’t care about the potential data loss (it’s only ephemeral data). I like that I can just leverage Mnesia itself instead of having to figure out custom handling for it.

I do recommend using unsplit for fine control though: https://github.com/danschultzer/pow/blob/dc01a6dff074e2fb8aaa1d61c61b00bd8904739c/lib/pow/store/backend/mnesia_cache/unsplit.ex#L17-L18

Also, I actually do think that Mnesia makes it super simple to set up clusters, and I don’t think a library is necessary. It’s just the docs that are lacking. More guides/tutorials/blog posts would help immensely. Also, the error messages are at times cryptic.

This is all you need to start/join a cluster:

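# Start Mnesia on this node, then connect it to the nodes already in the cluster
:mnesia.start()
:mnesia.change_config(:extra_db_nodes, Node.list())
# Persist the schema to disk, then replicate the table to this node
:mnesia.change_table_copy_type(:schema, node(), :disc_copies)
:mnesia.add_table_copy(@tab, node(), :disc_copies)
# Block until the table has loaded (timeout in milliseconds, or :infinity)
:mnesia.wait_for_tables([@tab], timeout)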

In Pow, I use :mnesia.set_master_nodes/2 before starting Mnesia to prevent any potential partition issues.
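As a rough illustration of healing with `:mnesia.set_master_nodes/2`, the sketch below subscribes to Mnesia system events and, on an `:inconsistent_database` event, reloads the local tables from the other node. It is a simplified two-node example and not Pow's Unsplit module: the module name `HealSketch`, the `winner/2` tie-breaker (the node name that sorts first keeps its data), and the 15-second timeout are all assumptions. It also assumes a disc-based schema, so that master nodes can be set while Mnesia is stopped.

```elixir
defmodule HealSketch do
  @moduledoc "Hypothetical two-node illustration; not Pow's Unsplit module."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Assumed tie-breaker: the node whose name sorts first keeps its data.
  def winner(node_a, node_b), do: Enum.min([node_a, node_b])

  @impl true
  def init(opts) do
    # Mnesia has to be running on this node before subscribing.
    {:ok, _node} = :mnesia.subscribe(:system)
    {:ok, %{tables: Keyword.fetch!(opts, :tables)}}
  end

  @impl true
  def handle_info({:mnesia_system_event, {:inconsistent_database, _context, remote}}, state) do
    keeper = winner(node(), remote)
    # Only the losing side discards its data and reloads.
    if keeper != node(), do: reload_from(keeper, state.tables)
    {:noreply, state}
  end

  # Ignore all other system events.
  def handle_info(_event, state), do: {:noreply, state}

  defp reload_from(keeper, tables) do
    :stopped = :mnesia.stop()
    # With Mnesia stopped, record which node's replica to trust on the next load.
    for table <- tables, do: :ok = :mnesia.set_master_nodes(table, [keeper])
    :ok = :mnesia.start()
    :ok = :mnesia.wait_for_tables(tables, 15_000)
    # Subscribe again (assumed necessary: the earlier subscription belonged
    # to the Mnesia instance that was stopped).
    {:ok, _node} = :mnesia.subscribe(:system)
  end
end
```

Any writes made on the losing node during the partition are dropped, which is acceptable only because the data is ephemeral. A cluster with more than two nodes would need to work out which nodes sit on each side of the split before choosing master nodes.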

Mnesia is pretty incredible

Here’s the custom self healing GenServer I built for Pow: https://github.com/danschultzer/pow/blob/master/lib/pow/store/backend/mnesia_cache/unsplit.ex