Blog: Graceful shutdown on Kubernetes with signals & Erlang OTP 20

erlang
otp
docker

#1

Somebody was asking about Erlang OS signal handling on Slack; I’ve posted up my learnings:


#2

Lots of helpful data! I suppose I’m surprised that the shutdown procedure doesn’t handle load draining at the load balancer level. That is, regarding what you say about :init.stop():

That’s great, but it’s not good enough for us, since we want to let any current requests finish before shut-down.

On AWS ECS, I handle this by draining each container from the load balancer first; :init.stop is called only after the container is completely disconnected from the web.


#3

How do you check if the instance is completely disconnected?


#4

Pod shutdown and routing seem to be only ‘eventually consistent’ across the various levels of responsibility and abstraction in K8S, and the behaviour will probably vary depending on whether you are using a service mesh such as Istio. We found in practice that without this kind of mechanism we get ‘unhealthy back end’ type responses while the routing catches up with the deployment.

As it stands, K8S doesn’t know whether you’ve finished processing the requests already received, or are doing something that isn’t request-based, like a periodic task, which I guess is why there’s that healthy default 30s pause. If you can shut down quicker, you can save resources, for example by preventing K8S from scaling out an EC2 node during a deployment just so the old and new deployments can run at once.

Probably something that will improve as K8S and service-mesh implementations mature.


#5

Besides draining incoming connections and letting k8s wait until the app has served all current requests… is there a way for the app itself to shut down after serving all requests?

Looks like it requires 1) propagating the signal down to all supervisors (e.g. Phoenix) and 2) making them honor that.


#6

If I understand you correctly, that’s exactly what this code does: it calls init:stop() after the delay period is up, causing the Erlang VM to shut down gracefully.

All this does is delay the normal shutdown process to allow requests to finish and traffic to be re-routed.
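
For readers without the blog post open, here is a minimal sketch of the delay-then-stop idea in Elixir. It is not the code from the post. The module name `DelayedStop`, the 10-second default and the injectable `stop_fun` are assumptions made for illustration. It relies on the OTP 20 `erl_signal_server` event manager, whose stock handler `erl_signal_handler` calls `init:stop()` as soon as SIGTERM arrives.

```elixir
defmodule DelayedStop do
  # Hypothetical module: a gen_event handler that waits before calling :init.stop/0.
  @behaviour :gen_event

  # Assumed default; tune it to how long your routing takes to catch up.
  @default_delay_ms 10_000

  # Swap out the stock OTP handler so SIGTERM is delivered to this module instead.
  # Note: signals other than SIGTERM are ignored by this sketch.
  def install(delay_ms \\ @default_delay_ms) do
    :gen_event.swap_handler(
      :erl_signal_server,
      {:erl_signal_handler, []},
      {__MODULE__, %{delay_ms: delay_ms, stop_fun: &:init.stop/0}}
    )
  end

  @impl true
  # When installed via swap_handler, init/1 receives {args, old_handler_result}.
  def init({args, _old_handler_result}), do: init(args)
  def init(%{delay_ms: _, stop_fun: _} = args), do: {:ok, Map.put(args, :stopping, false)}

  @impl true
  # First SIGTERM: schedule the real stop after the delay.
  def handle_event(:sigterm, %{stopping: false} = state) do
    Process.send_after(self(), :delayed_stop, state.delay_ms)
    {:ok, %{state | stopping: true}}
  end

  # Repeated SIGTERMs and any other signal events leave the state untouched.
  def handle_event(_other, state), do: {:ok, state}

  @impl true
  def handle_call(_request, state), do: {:ok, :ok, state}

  @impl true
  # The delay is over: run the stop function (:init.stop/0 unless overridden).
  def handle_info(:delayed_stop, state) do
    state.stop_fun.()
    {:ok, state}
  end

  def handle_info(_msg, state), do: {:ok, state}
end
```

Calling `DelayedStop.install()` once during application start should be enough. Keep the delay below the K8S termination grace period, otherwise the pod is killed before `:init.stop/0` ever runs.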


#7

What I want is not just “delaying shutdown” but letting Phoenix shut down after serving all requests.

Also, I’m not sure delaying shutdown would work if the app keeps accepting new connections during the delay…
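
One common way to address the new-connections concern is to have the app report itself as not ready for the duration of the delay, so the router stops sending it traffic. The sketch below assumes a readiness endpoint exists and is polled by K8S or the load balancer. The `Draining` module and its function names are hypothetical, and wiring the status into a Phoenix route is left out.

```elixir
defmodule Draining do
  # Hypothetical helper: a process-wide flag that a readiness endpoint can consult.
  use Agent

  # Starts the flag in the "not draining" state.
  def start_link(_opts \\ []) do
    Agent.start_link(fn -> false end, name: __MODULE__)
  end

  # Call this when SIGTERM arrives, before the shutdown delay begins.
  def start_draining, do: Agent.update(__MODULE__, fn _ -> true end)

  def draining?, do: Agent.get(__MODULE__, & &1)

  # Status code for the readiness endpoint: 503 asks the router to stop sending traffic.
  def readiness_status, do: if(draining?(), do: 503, else: 200)
end
```

This only stops new traffic once the prober has noticed the 503, so the delay needs to cover at least one probe interval plus the time routing takes to update. It does not make Phoenix wait for in-flight requests by itself.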