GenServer isn't restarted on graceful shutdown

I have a very simple GenServer, which I start like this:

defmodule MyApp.MatchKiller do
  use GenServer

  def start_link() do
    GenServer.start_link(__MODULE__, [], name: :match_killer)
  end

  # init/1 callback required by GenServer
  def init(state), do: {:ok, state}
end

and in my application's start/2 callback I call

Swarm.register_name("match_killer", MyApp.MatchKiller, :start_link, [], 15_000)
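In context, the call sits roughly like this (the module name and supervision tree below are assumptions for illustration, not taken from the original post):

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # ...other children...
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    {:ok, pid} = Supervisor.start_link(children, opts)

    # ask Swarm to start MyApp.MatchKiller somewhere in the cluster and track it
    Swarm.register_name("match_killer", MyApp.MatchKiller, :start_link, [], 15_000)

    {:ok, pid}
  end
end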

So I start two iex sessions and the process is started and registered globally:

[debug] [swarm on a@AlessandrosiMac] [tracker:handle_call] registering "match_killer" as process started by Elixir.MyApp.MatchKiller.start_link/0 with args []
[debug] [swarm on a@AlessandrosiMac] [tracker:do_track] starting "match_killer" on a@AlessandrosiMac
[debug] [swarm on a@AlessandrosiMac] [tracker:do_track] started "match_killer" on a@AlessandrosiMac
[debug] [swarm on b@AlessandrosiMac] [tracker:sync_registry] local tracker is missing "match_killer", adding to registry
[info] [swarm on b@AlessandrosiMac] [tracker:awaiting_sync_ack] local synchronization with a@AlessandrosiMac complete!
[info] [swarm on b@AlessandrosiMac] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on b@AlessandrosiMac] [tracker:handle_call] registering "match_killer" as process started by Elixir.MyApp.MatchKiller.start_link/0 with args []
[debug] [swarm on b@AlessandrosiMac] [tracker:do_track] found "match_killer" already registered on a@AlessandrosiMac

Now, since the process is running on node a, if I Ctrl-C node a the process is restarted on node b:

[debug] [swarm on b@AlessandrosiMac] [tracker:handle_topology_change] topology change (nodedown for a@AlessandrosiMac)
[debug] [swarm on b@AlessandrosiMac] [tracker:handle_topology_change] restarting "match_killer" on b@AlessandrosiMac
[debug] [swarm on b@AlessandrosiMac] [tracker:do_track] starting "match_killer" on b@AlessandrosiMac
[debug] [swarm on b@AlessandrosiMac] [tracker:do_track] started "match_killer" on b@AlessandrosiMac

So far so good. The problem is that if I instead shut down gracefully by sending a SIGTERM, the process is just shut down (this is node a's output when I send SIGTERM to node b, which is running the process):

iex(a@AlessandrosiMac)10> [debug] [swarm on a@AlessandrosiMac] [tracker:handle_monitor] "match_killer" is down: :shutdown
[info] [swarm on a@AlessandrosiMac] [tracker:nodedown] nodedown b@AlessandrosiMac
[debug] [swarm on a@AlessandrosiMac] [tracker:handle_topology_change] topology change (nodedown for b@AlessandrosiMac)
[info] [swarm on a@AlessandrosiMac] [tracker:handle_topology_change] topology change complete

If I then restart node b (while a is still running), the process is restarted on b, probably because Swarm sees that it's not running anywhere.

The problem here is that when I scale down the nodes, the process isn't restarted elsewhere and its state isn't transferred. Am I missing something?
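For reference, the state transfer Swarm does perform on topology changes relies on the tracked process handling Swarm's handoff messages (described in the Swarm README); a minimal sketch with illustrative callback bodies inside MyApp.MatchKiller would be:

def handle_call({:swarm, :begin_handoff}, _from, state) do
  # hand the current state to Swarm so it can be resumed on the new node
  {:reply, {:resume, state}, state}
end

def handle_cast({:swarm, :end_handoff, handed_off_state}, _state) do
  # adopt the state handed off from the node that went away
  {:noreply, handed_off_state}
end

def handle_info({:swarm, :die}, state) do
  # Swarm asks this copy to stop (e.g. when the name must live elsewhere)
  {:stop, :shutdown, state}
end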


Hey @alex88,

I don't know if you still have this problem, but I recently hit exactly the same issue with Swarm and decided it needs a very simple feature borrowed from the regular Supervisor implementation: a restart flag that indicates how your process should live, cluster-wise. The restart parameter takes the same values as a regular Supervisor child spec (see hexdocs.pm; a plain child spec example follows the list):

  • :permanent means a handoff will always occur,
  • :transient means a handoff/restart will occur only if the termination reason is other than :normal, :shutdown, or {:shutdown, term},
  • :temporary means the process is never restarted when it terminates; it is only moved when its node goes down abruptly (generating a :DOWN event with reason :noconnection). This is the default, compatible with the original implementation.
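For comparison, a plain Supervisor child spec uses the same restart values (see hexdocs.pm/elixir/Supervisor):

%{
  id: MyApp.MatchKiller,
  start: {MyApp.MatchKiller, :start_link, []},
  restart: :transient  # :permanent | :transient | :temporary
}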

I've opened a PR and am waiting for approval (or not). It's here: https://github.com/bitwalker/swarm/pull/139


Hello @balena and @alex88,

I still face the same problem. I am running Swarm on different Kubernetes pods, and whenever the deployment is updated a SIGTERM is sent to the containers. When the GenServer shuts down from that SIGTERM, it is not started again on another node. Did you solve the problem?

Unfortunately not; I had to avoid distributed jobs and run a single pod with all the cron tasks.