Temporary netsplit on Kubernetes pod graceful shutdown

I’m observing an odd behaviour when gracefully shutting down a Kubernetes pod. I’m handling SIGTERM signals so that I can delay the application node’s shutdown and properly finish whatever work is in flight. I’m also using Swarm to distribute processes across my cluster and redistribute them when a node goes up/down. The odd behaviour is this: a few seconds after a node receives a SIGTERM, it disconnects from the other nodes and immediately reconnects, like a netsplit, and only effectively leaves the cluster once the grace period expires.
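For context, the SIGTERM delay is implemented roughly like this (a minimal sketch, not my exact code: `MyApp.ShutdownHandler` and the 20 s value are stand-ins, and `System.trap_signal/3` requires Elixir 1.12+; on older versions you’d wire up `:os.set_signal/2` with a custom handler instead):

```elixir
defmodule MyApp.ShutdownHandler do
  # Hypothetical module. Registers a SIGTERM trap that delays VM
  # shutdown so in-flight work can finish (Elixir >= 1.12).
  @grace_period_ms 20_000

  def install do
    System.trap_signal(:sigterm, fn ->
      IO.puts("SIGTERM received. Stopping in #{@grace_period_ms} ms")
      # Note: this blocks the signal-handling process for the whole
      # grace period; a production version would schedule the stop
      # asynchronously instead.
      Process.sleep(@grace_period_ms)
      System.stop()
      :ok
    end)
  end
end
```

The key point is that during those 20 seconds the BEAM node is still fully alive and connectable, which matters for the behaviour described below.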

Logs demonstrating the behaviour:

2018-10-23T14:39:33.442Z node1@10.61.86.45 SIGTERM received. Stopping in 20000 ms
2018-10-23T14:39:38.112Z node2@10.61.79.22 [swarm on node2@10.61.79.22] [tracker:nodedown] nodedown node1@10.61.86.45
2018-10-23T14:39:38.092Z node1@10.61.86.45 [swarm on node1@10.61.86.45] [tracker:nodedown] nodedown node2@10.61.79.22
...
2018-10-23T14:39:38.098Z node1@10.61.86.45 [swarm on node1@10.61.86.45] [tracker:ensure_swarm_started_on_remote_node] nodeup node2@10.61.79.22
2018-10-23T14:39:38.120Z node2@10.61.79.22 [swarm on node2@10.61.79.22] [tracker:ensure_swarm_started_on_remote_node] nodeup node1@10.61.86.45
...
2018-10-23T14:39:53.443Z node1@10.61.86.45 Stopping due to earlier SIGTERM
2018-10-23T14:39:54.530Z node2@10.61.79.22 [swarm on node2@10.61.79.22] [tracker:nodedown] nodedown node1@10.61.86.45

Does anyone have a clue what might be causing this behaviour?

P.S.: I’m using libcluster with the Kubernetes strategy to handle application clustering. I’m also using the following script to start a node (I’m not using releases):

#!/bin/bash

set -e

# Start the node with a fully-qualified name derived from the pod IP,
# replacing the shell process so signals reach the VM directly.
exec elixir --name "$ERL_BASENAME@$POD_IP" --cookie "$ERL_COOKIE" -S mix phx.server
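For reference, the libcluster topology is configured along these lines (a sketch with placeholder values; the selector and basename are illustrative, not my real ones — the basename must match `$ERL_BASENAME` from the start script):

```elixir
# config/config.exs -- illustrative values only
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        kubernetes_selector: "app=myapp",    # placeholder label selector
        kubernetes_node_basename: "myapp"    # must match $ERL_BASENAME
      ]
    ]
  ]
```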

I found out what the issue was, and I’m leaving it here in case someone runs into the same problem.

libcluster’s Kubernetes strategy polls an internal Kubernetes API to detect when a pod joins or leaves the cluster. When we request a cluster scale-down, the change takes immediate effect in the results returned by that API, so libcluster stops seeing the pod that is shutting down. Since we’re trapping the SIGTERM signal, the pod doesn’t terminate immediately, so after a few seconds libcluster disconnects it from the cluster. The problem is that the Erlang VM has a mechanism that tries to reconnect when a node disconnects, and the reconnect succeeds because the node is still running “normally”. libcluster won’t disconnect it again, since the pod was removed from its state; the node only leaves for good when it actually shuts down.

To fix this we used an altered version of the libcluster Kubernetes strategy that doesn’t disconnect removed nodes. We don’t see any problem with this, since the node eventually shuts down and gets disconnected anyway.
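If you’d rather avoid maintaining a forked strategy, libcluster topologies also accept a `:disconnect` MFA override, so the same effect can be approximated with a no-op disconnect (a sketch under assumptions: `MyApp.Cluster` is a hypothetical module, and you should verify your libcluster version supports the `:disconnect` topology option):

```elixir
defmodule MyApp.Cluster do
  # Hypothetical no-op used in place of :erlang.disconnect_node/1,
  # so libcluster never actively drops a node it has stopped seeing.
  def noop_disconnect(_node), do: true
end

# Then, in config/config.exs (placeholder values):
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      disconnect: {MyApp.Cluster, :noop_disconnect, []},
      config: [
        kubernetes_selector: "app=myapp",
        kubernetes_node_basename: "myapp"
      ]
    ]
  ]
```

As with our forked strategy, the node still leaves the cluster for good once it finishes its graceful shutdown and the distribution connection drops on its own.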
