Hi everybody, this is my first time here, so sorry if I make some mistakes.
I’m working on a small project for my thesis. While studying Elixir I found the Swarm library, and it suits what I need to do perfectly. The problem is that I’m having some trouble using it:
Process state handoff: I have a GenServer with a very simple struct, and I need to distribute this worker around my cluster. When the node the worker is running on goes down, Swarm restarts the process on a new node, but I can’t manage to resume the state. I’ve followed the worker example provided by the original author on GitHub, but it doesn’t seem to work. Here’s my code:
defmodule MyApp.Worker do
  use GenServer

  defstruct elements: %{}

  def start_link(opts \\ []) do
    # No `name:` option here: Swarm registers the name itself, and a
    # module-wide local name would prevent running more than one worker.
    GenServer.start_link(__MODULE__, opts)
  end

  def init([]), do: {:ok, %__MODULE__{}}
  def init(opts), do: {:ok, opts}

  # Called on the node going down; replying {:resume, state} hands the
  # state over to the replacement process on another node.
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # Swarm delivers the handed-off state as a cast on the new node. The
  # two pattern positions must use distinct names so the incoming state
  # replaces the current one instead of having to match it exactly.
  def handle_cast({:swarm, :end_handoff, handoff_state}, _state) do
    {:noreply, handoff_state}
  end

  # Called after a netsplit heals and a duplicate process hands over its state.
  def handle_cast({:swarm, :resolve_conflict, _delay}, state) do
    {:noreply, state}
  end

  def handle_info({:swarm, :die}, state) do
    {:stop, :shutdown, state}
  end
end
More details: I start my GenServer with Swarm.register_name(workername, workername, :start_link, [opts]), and I’ve tried killing the node both with a double Ctrl+C and via :observer.start. I really don’t know what the problem is, since I’ve also looked at Swarm’s tracker.ex file and saw that there’s a call with :begin_handoff, which I (think I) handle correctly in my GenServer. Indeed, if I make the same call manually from the shell, my worker handles it.
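Concretely, the registration I do looks roughly like this (the name term and opts here are placeholders):

```elixir
# Swarm picks a node, invokes MyApp.Worker.start_link(opts) there,
# and tracks the resulting pid under the given name term.
{:ok, pid} = Swarm.register_name({:myapp, :worker1}, MyApp.Worker, :start_link, [[]])
```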
Same worker on multiple nodes: if I have a cluster of 3 or more nodes and I want to run the same process on several nodes at the same time, how can I do that? After the first registration using the Swarm function mentioned above, Swarm warns me that a process is already registered under that name.
I’m not super well versed in Swarm, but I think the handoff only happens if the node that is going down is allowed to exit gracefully. It sounds like you were force-killing the node; if you force-kill it, it has no time to do the handoff. You can try :init.stop() as described in “Graceful shutdown on SIGTERM?” to see whether a graceful shutdown triggers the handoff.
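For example, from an IEx session attached to the node you want to drain:

```elixir
# Runs the VM's normal shutdown sequence (stopping applications in
# order), which gives Swarm's tracker time to perform the handoff;
# a double Ctrl+C or kill -9 skips all of that.
:init.stop()
```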
Thank you, I will look at that!
But now my doubt is: isn’t this a bit useless in a real-world scenario? I mean, a node could go down for many reasons without having the time to save its state (unless you implement some sort of periodic backup).
I’ve been running Elixir in prod for 3 years now, and it is exceedingly rare for a healthy node to explode. 99.99 percent of the time, a node shutdown is caused by a deploy. During a deploy you can give your old nodes some time to shut down gracefully and make use of Swarm’s handoff.
Elixir/Erlang is very resilient when it comes to crashes and unhandled errors. That being said, I am a firm believer in writing code that can rebuild its state in the worst-case scenario. So Swarm is a nice-to-have, but not the only line of defense when it comes to rebuilding whatever state we had stuffed into Swarm.
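As a sketch of that belt-and-braces approach (MyApp.Storage is a hypothetical persistence module, e.g. backed by a database or S3):

```elixir
defmodule MyApp.RebuildableWorker do
  use GenServer

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  # On (re)start, prefer persisted state; fall back to an empty state
  # if nothing was saved (the worst-case rebuild path).
  def init(id) do
    state =
      case MyApp.Storage.load(id) do
        {:ok, saved} -> saved
        :error -> %{id: id, elements: %{}}
      end

    {:ok, state}
  end

  # Persist after every mutation, so a hard-killed node loses at most
  # the write that was in flight.
  def handle_cast({:put, key, value}, state) do
    state = put_in(state, [:elements, key], value)
    :ok = MyApp.Storage.save(state.id, state)
    {:noreply, state}
  end
end
```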
Erlang is resilient, but we’re all at the mercy of the Linux OOM killer and of machines being recycled by a cloud provider. I think it is worth investing in a ‘phone home’ for workers that are unable to recover their state, if it isn’t being distributed via Mnesia or by other means.
Exactly. If you want to protect against losing state when you lose nodes, you need to either replicate it or simply store it in external storage. In Erleans (based on Microsoft Orleans), the latter is the default: https://github.com/erleans/erleans/. The state for a grain is saved to a storage provider and reloaded when the grain is reactivated.
Erleans also has stateless grains, which may be a solution to your second question about running the same process on multiple nodes, though I’m not positive.
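In plain Swarm, process groups might get you something similar: start a separately named worker per node, join them all to one group, and address the group as a whole. A rough sketch (the group and name terms here are placeholders):

```elixir
# Register each worker under its own unique name...
{:ok, pid} = Swarm.register_name({:worker, node()}, MyApp.Worker, :start_link, [[]])
# ...and add it to a shared group (groups need no prior registration).
:ok = Swarm.join(:my_workers, pid)

# Fire-and-forget to every member across the cluster:
Swarm.publish(:my_workers, :refresh)

# Or call every member and collect the replies:
replies = Swarm.multi_call(:my_workers, :get_elements)
```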