Trying to delay Phoenix shutdown

I’m running a Phoenix install and need to delay shutdown for a couple of seconds to ensure a graceful shutdown. Basically, what I’d like to do is trap SIGTERM, have the server keep running for another couple of seconds, and only then stop responding.

My approach so far has been to add an exit handler to the application supervision tree - since it’s the last child, it gets shut down first. This works, in the sense that it is the first process to stop and it holds up the shutdown of the other children - so far, as expected.

However, the server still stops responding as soon as the SIGTERM is received - even though processes are still running. So I’m looking for any info/help in making sure the server doesn’t stop until my exit handler is done.

Edit:
Tried Application.prep_stop/1 as well, with the same behaviour - it is indeed called before the application supervisor terminates its children. However, the server stops accepting requests before prep_stop/1 is even called.
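
For reference, this is a minimal sketch of what I tried (MyApp is just a placeholder for the actual application module, and the sleep stands in for the real delay):

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [MyAppWeb.Endpoint]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end

  # Optional Application callback, invoked on shutdown before the
  # top-level supervisor starts terminating its children. The return
  # value is passed on to stop/1.
  @impl true
  def prep_stop(state) do
    Process.sleep(4_500)
    state
  end
end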

And now, looking at why this might be the case, I can see I have :inets in mix.exs under extra_applications … that wouldn’t be part of my application’s supervision tree, so I’m guessing it would receive the SIGTERM asynchronously. Could that be the reason?

Hmmm nope, doesn’t seem to be related. Still at a loss on this one

Just wanted to say this is very odd to me. I’m doing exit trapping to cleanly shut down WebSockets, ongoing requests, and pipelines, and everything keeps responding throughout the process (until the process shuts down).

Is it 100% the server not responding, or is there some type of proxy in front of it that’s aware of the shutdown state and is not sending traffic?

I just tried doing a clean new Phoenix install (1.5.7 according to mix.exs), and then set up a simple shell one-liner:

while true ; do curl localhost:4000 -s > /dev/null ; done

I’ve then added an exit handler that looks like this:

defmodule Testy.ExitHandler do
  use GenServer

  require Logger

  def start_link(_) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  @impl true
  def init(state) do
    # Trap exits so the supervisor's shutdown request ends up in terminate/2
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  @impl true
  def terminate(_reason, _state) do
    # Hold up the shutdown of the rest of the supervision tree for a while
    Logger.info("Sleeping")
    Process.sleep(4500)
    Logger.info("Done sleeping")
  end
end
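
I’ve wired it into the application supervision tree as the last child, roughly like this (a sketch - the other children are whatever phx.new generated):

defmodule Testy.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      TestyWeb.Telemetry,
      {Phoenix.PubSub, name: Testy.PubSub},
      TestyWeb.Endpoint,
      # Last child in the list, so the first to be terminated on shutdown.
      # The default :shutdown timeout for a worker child is 5_000 ms, which
      # the 4_500 ms sleep in terminate/2 fits inside.
      Testy.ExitHandler
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: Testy.Supervisor)
  end
end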

I run mix phx.server, and then I send a SIGTERM to it. I see no requests between the SIGTERM and the “Done sleeping” line in the log - nothing from the curl calls.

I might be misunderstanding how to wire all of it up. What does your solution look like?

And to clarify: this is running directly on the localhost, there’s nothing in between.

I’d love to check this out on GitHub. Do you have a link? I could very easily be the one misunderstanding how it works here!

Sure thing - uploaded it here: https://github.com/Fake51/phoenix-exit-handler


I took a look yesterday and today but couldn’t figure out the mechanism causing the connections to stop being accepted. I think it’s in Cowboy (they recently shipped an improvement to do request draining on shutdown) but I can’t figure it out yet. Going to dig in more tomorrow.

Thanks for having a look at it :slight_smile:

Seems like you’re equating “accepting new requests” with “responding”. If your server still accepts new requests after receiving SIGTERM - then when will it be “safe” to stop? :wink:

What you actually want is a graceful shutdown with connection draining, which stops accepting new requests and finishes the existing ones.

The behaviour you described - the server still running but not accepting more requests - actually seems to be the desired one.

Related:


When I tell it to stop. I have a race condition: a load balancer is notified in parallel about the container stopping. If the container stops before the load balancer is updated, I get dropped requests.

I’m aware of connection draining. That alone is not going to solve my problem, though it is part of the overall solution.

However, on the topic of connection draining - as far as I can tell, it’s part of Phoenix 1.5. But is it necessary to configure it?

No, not in my scenario. In the scenario I outlined it causes dropped requests - very undesirable.


Reading the Phoenix docs, I can see this with regard to connection draining:

https://hexdocs.pm/phoenix/Phoenix.Endpoint.html#module-endpoint-api - just above that there’s a section about adapter configuration. The default adapter is Phoenix.Endpoint.Cowboy2Adapter, and you can set a :drainer key. If not set, it defaults to [].

Looking at https://hexdocs.pm/plug_cowboy/Plug.Cowboy.Drainer.html, all options have defaults, possibly apart from :refs.

So it looks to me like draining should already be in place.
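
For what it’s worth, this is roughly how the drainer could be tuned explicitly if the defaults don’t fit (a sketch - the values are made up, and I’m going by the Plug.Cowboy.Drainer docs):

# config/prod.exs - endpoint configuration for the example app
config :testy, TestyWeb.Endpoint,
  drainer: [
    # how long to wait for open connections to finish, in ms
    shutdown: 10_000,
    # how often to check whether all connections are closed, in ms
    check_interval: 1_000
  ]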

Oh, now I understand the issue better. I’m wondering which LB you use and how your CD ends up updating in parallel without the LB knowing the container health. Could you provide more details? (I think we’re going to hit an X-Y problem.)

It seems a little bit similar to this (which might not be an issue anymore) in that there are some timing issues (though that case is not a race condition).

However, I don’t think “adding a few seconds before starting actual termination” inside an application is a good idea, since it requires updating all existing apps. What if you don’t have access to the source code? What if it’s a compiled docker container from another party? What if you have apps in multiple languages? See the feedback on this issue for example.

So then what’s the option? The best option would be fixing the deployment flow. If some steps depend on each other, you have to run them in the right order.

If that’s not feasible… then I’d probably make a small wrapper that executes your app process as a subprocess and sends it the delayed SIGTERM/SIGKILL, so that it can be reused for all app containers.

Another option is to make the LB work with two different health attributes - see why k8s has two health checks, liveness and readiness. In that case, you can turn readiness off so that k8s no longer routes new requests to the pod, if your ingress/LB supports it.


Naively I’d think that you would:

  1. Bring up a new container.
  2. After it’s confirmed to be up (via tools like k8s liveness / readiness probes), send SIGTERM to the Phoenix app inside the old container.
  3. Make your LB send new requests to the new container.
  4. In the meantime the old container should already be doing connection draining. Your deployment – k8s or anything else – should account for this and not just kill the container. It should wait for it, say, 30s or so.
  5. After all the old connections are drained – or a hard upper limit of the time to wait has been hit – the old container is gone, while the new one likely is already serving requests.

Sadly I was never able to master k8s well enough to build this exact setup, but I’ve seen people claim that they are doing it.

I can certainly try :slight_smile: So, it’s a distributed Phoenix application that runs on 3-5 pods in Google’s Kubernetes service. It’s exposed via an internal load balancer. Deploying the service is handled via rolling deploys - Kubernetes will bring new pods up, taking old ones down once their replacements are live.

Turns out one of the problems was a misconfiguration of the deployment - the readiness probes lacked a failureThreshold. Fingers crossed I can now tweak the lifecycle to ensure no dropped requests.


Yeah, this was my assumption as well and what I’m trying to do :slight_smile: The problem is the amount of tweaking you need to do to get things working.