I’m observing an odd behaviour when gracefully shutting down a Kubernetes pod. I’m handling SIGTERM
signals so that I can delay the application node’s shutdown and properly finish whatever work is in flight. I’m also using Swarm to distribute processes across my cluster and redistribute them when a node goes up/down. The odd behaviour I’m seeing is that when a node receives a SIGTERM,
a few seconds later it disconnects from the other nodes and reconnects right back, like a netsplit, effectively leaving the cluster only when the grace period expires.
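For context, this is roughly how the SIGTERM delay is wired up. This is a minimal sketch, not my exact code: since OTP 20 the VM routes OS signals through the `:erl_signal_server` gen_event, and swapping in a custom handler in place of the default `:erl_signal_handler` lets the application delay `init:stop` instead of stopping immediately (module name and grace period here are illustrative):

```elixir
defmodule MyApp.SignalHandler do
  @behaviour :gen_event

  @grace_period_ms 20_000

  def init(_args), do: {:ok, %{}}

  # The VM delivers SIGTERM as a :sigterm event to this handler.
  def handle_event(:sigterm, state) do
    IO.puts("SIGTERM received. Stopping in #{@grace_period_ms} ms")
    # Drain/hand off work here, then stop the VM after the grace period.
    Process.send_after(self(), :stop, @grace_period_ms)
    {:ok, state}
  end

  def handle_event(_event, state), do: {:ok, state}

  def handle_info(:stop, state) do
    IO.puts("Stopping due to earlier SIGTERM")
    :init.stop()
    {:ok, state}
  end

  def handle_call(_request, state), do: {:ok, :ok, state}
end

# Installed at application start, replacing the default handler:
#
#   :gen_event.swap_sup_handler(
#     :erl_signal_server,
#     {:erl_signal_handler, []},
#     {MyApp.SignalHandler, []}
#   )
```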
Logs demonstrating the behaviour:
2018-10-23T14:39:33.442Z node1@10.61.86.45 SIGTERM received. Stopping in 20000 ms
2018-10-23T14:39:38.112Z node2@10.61.79.22 [swarm on node2@10.61.79.22] [tracker:nodedown] nodedown node1@10.61.86.45
2018-10-23T14:39:38.092Z node1@10.61.86.45 [swarm on node1@10.61.86.45] [tracker:nodedown] nodedown node2@10.61.79.22
...
2018-10-23T14:39:38.098Z node1@10.61.86.45 [swarm on node1@10.61.86.45] [tracker:ensure_swarm_started_on_remote_node] nodeup node2@10.61.79.22
2018-10-23T14:39:38.120Z node2@10.61.79.22 [swarm on node2@10.61.79.22] [tracker:ensure_swarm_started_on_remote_node] nodeup node1@10.61.86.45
...
2018-10-23T14:39:53.443Z node1@10.61.86.45 Stopping due to earlier SIGTERM
2018-10-23T14:39:54.530Z node2@10.61.79.22 [swarm on node2@10.61.79.22] [tracker:nodedown] nodedown node1@10.61.86.45
Does anyone have a clue about what might be causing this behaviour?
P.S.: I’m using libcluster with the Kubernetes
strategy to handle the application clustering. I’m also using the following script to start a node (I’m not using releases):
#!/bin/bash
set -e

# Name the node after the pod IP so libcluster can connect the nodes, and
# exec so the VM replaces the shell and receives SIGTERM directly from Kubernetes.
exec elixir --name "$ERL_BASENAME@$POD_IP" --cookie "$ERL_COOKIE" -S mix phx.server
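The libcluster topology looks roughly like this (a sketch — the selector and basename values are illustrative assumptions, not my actual config):

```elixir
# config/config.exs
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        kubernetes_selector: "app=myapp",   # assumed pod label selector
        kubernetes_node_basename: "myapp"   # matches $ERL_BASENAME above
      ]
    ]
  ]
```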