Migrating actors or processes from one node to another within a cluster

I am new to Elixir and looking to run my distributed application (a state machine) on an elastic cluster. I want to know:

(a) If it is possible to migrate a process from one node to another, without losing the state of the actor or its messages (those in inbox and the one that is being processed).

(b) Are there good resources to follow or actively maintained libraries that I can look up for this purpose?

Some reasons when I would want to migrate could be when there is a severe imbalance in the load across nodes in the cluster (say, a new node got added to the elastic cluster) or when the application gets a notification that a particular node is about to go down.

3 Likes

:wave:

You might want to check out erleans and/or horde.

6 Likes

In migrating there are 3 things you need to consider:

  1. process state
  2. process registered name aka how do other processes still contact it
  3. messages in queue

Process state
You can place the state into a struct or map and have the new SM start_link with this state this would solve this issue.

Registered name
Either use global or have an internal name/process resolver to ensure current processes can still access the migrated SM.

Message in queue
This is a bit trickier. Before migrating you will have to dequeue the messages into a list and provide it to the new SM.

This is a common trick to always dequeue messages into an ets table and process off the table. This trick is typically uses for high transaction processes or for message reordering.

6 Likes

I haven’t worked on erleans in a while (the company I developed it at is no more) but would like to again and to create an Elixir example or interface if it makes sense to offer some nicer abstractions when using in Elixir. And I suppose an ecto hookup for persistence would be good to do.

So it isn’t production ready but happy to help if anyone tries it out.

8 Likes

Thanks @idi527 (and @tristan !) - I will check out those libs

Thanks @tty. I will keep those in mind when evaluating solutions. On persisting messages: Is this the case only for inter-node migrations or is this pattern to be followed even when the process is restarted on the same node?

This technique can also be used to restart on the same node if it is important to retain those messages.

A comparable technique is to store state in Mnesia for node restart persistence (and distribution). When a process is started it checks Mnesia if there is a last state stored and spins up with that state. By relying on the distributed properties of Mnesia this process is now node agnostic.

I wouldn’t use mnesia for distribution unless you have a very controlled environment (i.e. not the cloud).

3 Likes

:wave:
could you explain a bit on why

?

Mnesia just was not built for such an environment where instances and the network are coming and going. I’m sure someone who has had the fun of attempting this would have better details than I can provide. The best I can do is link to Jesper’s post about it https://medium.com/@jlouis666/mnesia-and-cap-d2673a92850 which has some of the history behind Mnesia.

There are libraries to improve the situation, like Ulf’s unsplit which he discusses in this erlang-questions thread http://erlang.org/pipermail/erlang-questions/2010-February/049213.html

2 Likes

I use mnesia because it comes with Erlang/Elixir. Substituting DynamoDB, or a RDMS is also possible. Might need to do term_to_binary and store a blob, however still very doable.