Hi everybody, this is my first time here, so sorry if I make some mistakes.
I’m working on a small project for my thesis. While studying Elixir I found the Swarm library, and it suits what I need to do perfectly. The problem is that I’m having some trouble using it:
Process state handoff: I have a GenServer with a very simple struct, and I need to distribute this worker around my cluster. When the node the worker is running on goes down, Swarm restarts the process on a new node, but I can’t manage to resume the state. I’ve followed the worker example provided by the original author on GitHub, but it doesn’t seem to work. Here’s my code:
defmodule MyApp.Worker do
  use GenServer

  defstruct elements: %{}

  def start_link(opts \\ []) do
    # No `name:` option here: Swarm registers the name itself, and a
    # module-wide local name would prevent running more than one worker.
    GenServer.start_link(__MODULE__, opts)
  end

  def init([]), do: {:ok, %__MODULE__{}}
  def init(opts), do: {:ok, opts}

  # Called on the node going down; replying {:resume, state} hands the
  # state over to the replacement process on another node.
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # Swarm delivers the handed-off state as a cast on the new node. The
  # two pattern positions must use distinct names so the incoming state
  # replaces the current one instead of having to match it exactly.
  def handle_cast({:swarm, :end_handoff, handoff_state}, _state) do
    {:noreply, handoff_state}
  end

  # Called after a netsplit heals and a duplicate process hands over its state.
  def handle_cast({:swarm, :resolve_conflict, _delay}, state) do
    {:noreply, state}
  end

  def handle_info({:swarm, :die}, state) do
    {:stop, :shutdown, state}
  end
end
More details: I start my GenServer with Swarm.register_name(workername, workername, :start_link, [opts]), and I’ve tried killing the node both with a double Ctrl+C and via :observer.start. I really don’t know what the problem is, since I’ve also looked at Swarm’s tracker.ex file and saw that there’s a call with :begin_handoff, which I (think I) handle correctly in my GenServer. Indeed, if I make the same call manually from the shell, my worker handles it.
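Concretely, the registration I do looks roughly like this (the name term and opts here are placeholders):

```elixir
# Swarm picks a node, invokes MyApp.Worker.start_link(opts) there,
# and tracks the resulting pid under the given name term.
{:ok, pid} = Swarm.register_name({:myapp, :worker1}, MyApp.Worker, :start_link, [[]])
```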
Same worker on multiple nodes: if I have a cluster of 3 or more nodes and I want to run the same process on several nodes at the same time, how can I do that? After the first registration using the Swarm function mentioned above, Swarm warns me that a process is already registered under that name.
I’m not super well versed in Swarm, but I think the handoff only happens if the node that is going down is allowed to exit gracefully. It sounds like you were force-killing the node; if you force-kill it, it has no time to do the handoff. You can try :init.stop() as described in “Graceful shutdown on SIGTERM?” to see whether a graceful shutdown triggers the handoff.
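For example, from an IEx session attached to the node you want to drain:

```elixir
# Runs the VM's normal shutdown sequence (stopping applications in
# order), which gives Swarm's tracker time to perform the handoff;
# a double Ctrl+C or kill -9 skips all of that.
:init.stop()
```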
Thank you, I will look at that!
But now my doubt is: isn’t this a bit useless in a real-world scenario? I mean, a node could go down for many reasons without having the time to save its state (unless you implement some sort of periodic backup).
I’ve been running Elixir in prod for 3 years now, and it is exceedingly rare for a healthy node to explode. 99.99 percent of the time, a node shutdown is caused by a deploy. During a deploy you can give your old nodes some time to shut down gracefully and make use of Swarm’s handoff.
Elixir/Erlang is very resilient when it comes to crashes and unhandled errors. That being said, I am a firm believer in writing code that can rebuild its state in the worst-case scenario. So Swarm is a nice-to-have, but not the only line of defense when it comes to rebuilding whatever state we had stuffed into Swarm.
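As a sketch of that belt-and-braces approach (MyApp.Storage is a hypothetical persistence module, e.g. backed by a database or S3):

```elixir
defmodule MyApp.RebuildableWorker do
  use GenServer

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  # On (re)start, prefer persisted state; fall back to an empty state
  # if nothing was saved (the worst-case rebuild path).
  def init(id) do
    state =
      case MyApp.Storage.load(id) do
        {:ok, saved} -> saved
        :error -> %{id: id, elements: %{}}
      end

    {:ok, state}
  end

  # Persist after every mutation, so a hard-killed node loses at most
  # the write that was in flight.
  def handle_cast({:put, key, value}, state) do
    state = put_in(state, [:elements, key], value)
    :ok = MyApp.Storage.save(state.id, state)
    {:noreply, state}
  end
end
```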
Erlang is resilient, but we’re all at the mercy of the Linux OOM killer and of machines being recycled by a cloud provider. I think it is worth investing in a ‘phone home’ for workers that are unable to recover their state, if it isn’t being distributed via Mnesia or by other means.
Exactly. If you want to protect against losing state when you lose nodes, you need to either replicate it or simply store it in external storage. In Erleans (based on Microsoft Orleans), the latter is the default: https://github.com/erleans/erleans/. The state for a grain is saved to a storage provider and reloaded when the grain is reactivated.
Erleans also has stateless grains, which may be a solution to your second question about running the same process on multiple nodes, though I’m not positive.
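In plain Swarm, process groups might get you something similar: start a separately named worker per node, join them all to one group, and address the group as a whole. A rough sketch (the group and name terms here are placeholders):

```elixir
# Register each worker under its own unique name...
{:ok, pid} = Swarm.register_name({:worker, node()}, MyApp.Worker, :start_link, [[]])
# ...and add it to a shared group (groups need no prior registration).
:ok = Swarm.join(:my_workers, pid)

# Fire-and-forget to every member across the cluster:
Swarm.publish(:my_workers, :refresh)

# Or call every member and collect the replies:
replies = Swarm.multi_call(:my_workers, :get_elements)
```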