Phoenix Tracker Timeouts when Kubernetes (GKE) Scales Down

I’m using Tracker (not Presence) to distribute some counters across nodes; think of a counter for a cluster-wide rate limiter. I’m not even handling any diffs with Tracker. I’m just using it to distribute each node’s rate, and when I need the cluster-wide rate I call Tracker.list(MyTracker, node_rates), which gives me the values for all the nodes, then I add them up and cache the result.
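Roughly, the summing side looks like this (a simplified sketch; the module name and the :rate metadata key are made up, but MyTracker and node_rates match what I described):

defmodule MyApp.ClusterRate do
  # Adds up whatever rate each node last published under the node_rates topic.
  def cluster_rate do
    MyTracker
    |> Phoenix.Tracker.list("node_rates")
    |> Enum.reduce(0, fn {_node, meta}, acc -> acc + Map.get(meta, :rate, 0) end)
  end
end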

Everything works great on dev and staging until I try to scale down the cluster. When nodes disconnect, I start getting Tracker timeouts on all Tracker calls. Mostly it seems to be update calls, but I call update more than anything else. It’s completely intermittent, happening about 50% of the time: a full scale-down from 6 to 3 nodes will work with zero errors half the time. Scaling up works flawlessly. The staging cluster is doing maybe 25 requests per second on average, and this also doesn’t happen on dev when nodes disconnect.

Anyone ever seen this kind of thing? Happy to provide more details…

Much appreciated!

1 Like

Hard to tell with the current info, but I’ve seen things like this when my application’s shutdown order wasn’t right.

When your cluster scales down, do you ever see “process doesn’t exist” errors, or is it just timeouts? Where does your Tracker process sit in your Application’s child list relative to your Endpoint?

One thing I had to do for a project was stop accepting new connections and drain all of the connected WebSockets on shutdown (so the server had zero traffic), and only then let the Tracker continue its shutdown. If I let it run its normal course, it would give me a lot of errors.

You can see the sort of convoluted system of drainers I use on https://github.com/pushex-project/pushex/blob/master/lib/push_ex/supervisor.ex#L14
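The key trick is that a supervisor terminates its children in the reverse of their start order, so a drainer listed after the Endpoint is shut down first and can hold things open while connections drain. A very rough, hypothetical sketch of the idea (names like MyApp.SocketDrainer and @drain_ms are made up, and a real drainer would also stop accepting new connections and disconnect sockets rather than just sleeping):

defmodule MyApp.SocketDrainer do
  use GenServer

  @drain_ms 10_000

  def child_spec(opts) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [opts]},
      # Give terminate/2 enough time to run instead of the default 5s shutdown.
      shutdown: @drain_ms + 1_000
    }
  end

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(opts) do
    # Trap exits so terminate/2 is invoked during supervisor shutdown.
    Process.flag(:trap_exit, true)
    {:ok, opts}
  end

  def terminate(_reason, _state) do
    # Pause here so connected WebSockets can drain while the Endpoint,
    # PubSub, and Tracker are all still running.
    Process.sleep(@drain_ms)
    :ok
  end
end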

3 Likes

Are you already using some form of graceful shutdown code, or are the nodes yanked hard from the cluster? I’ve read/seen some info suggesting that clusters have issues with the abrupt killing of cluster members, so I would try a form of graceful shutdown to see if that helps. That is: hook into the Kubernetes preStop lifecycle hook (as seen here: https://github.com/ElixirSeattle/tanx/blob/master/deploy/tanx-deployment.yml#L38).
Now you’ve got a very small window before the node gets killed/shut down in which you could unregister your node (maybe through https://hexdocs.pm/phoenix_pubsub/Phoenix.Tracker.html#graceful_permdown/1? Not sure).
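If the Tracker is registered under a name, that graceful_permdown call is a one-liner you could run inside that window (a sketch; MyTracker is a placeholder for your tracker's registered name):

# Broadcast a permanent departure to the other replicas so they drop this
# node's entries right away instead of waiting out the permdown period.
Phoenix.Tracker.graceful_permdown(MyTracker)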

Maybe these are some pointers that can help you out.

(extra link/info about termination handling: https://medium.com/@ellispritchard/graceful-shutdown-on-kubernetes-with-signals-erlang-otp-20-a22325e8ae98 )

4 Likes

Kubernetes will send a SIGTERM to pods before it removes them, right? (I’m not sure if that’s something people can disable.) You can hook into when a SIGTERM is received and then your application doesn’t need to care that it’s deployed with k8s.

Even if the node is unregistered, though, it will still be serving real WebSocket connections, because those are not killed until the actual web server shuts down (or you kill them yourself). I imagine that WebSockets living on a shutting-down node are the root cause of the issue.
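Catching SIGTERM on the BEAM (OTP 20+) is usually done by swapping your own gen_event handler in for the default one on :erl_signal_server. A hedged sketch, where MyApp.SigtermHandler is a placeholder for a handler module like the one shown further down this thread:

# Replace the default OTP signal handler so a SIGTERM runs your own shutdown
# logic instead of immediately stopping the node.
:gen_event.swap_sup_handler(
  :erl_signal_server,
  {:erl_signal_handler, []},
  {MyApp.SigtermHandler, []}
)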

1 Like

This is all great, and gives me some direction for sure!

It’s a timeout.

children = [
  {Cluster.Supervisor, [topologies, [name: Logflare.ClusterSupervisor]]},
  supervisor(Logflare.Repo, []),
  supervisor(Phoenix.PubSub.PG2, [
    [
      name: Logflare.PubSub,
      fastlane: Phoenix.Channel.Server
    ]
  ]),
  # Tracker is started before the Endpoint, so on shutdown (reverse start order)
  # the Endpoint is terminated first and the Tracker after it.
  worker(
    Logflare.Tracker,
    [
      [
        name: Logflare.Tracker,
        pubsub_server: Logflare.PubSub,
        broadcast_period: 1_000,
        down_period: 5_000,
        permdown_period: 30_000,
        pool_size: 1,
        log_level: :debug
      ]
    ]
  ),
  supervisor(LogflareTelemetry.Supervisor, []),
  Logflare.Users.Cache,
  Logflare.Sources.Cache,
  Logflare.Logs.RejectedLogEvents,
  {Task.Supervisor, name: Logflare.TaskSupervisor},
  supervisor(Logflare.Sources.Counters, []),
  supervisor(Logflare.Sources.RateCounters, []),
  supervisor(Logflare.Source.Supervisor, []),
  supervisor(Logflare.SystemMetricsSup, []),
  supervisor(LogflareWeb.Endpoint, [])
]

We’re catching the SIGTERM from Kubernetes.

Our SIGTERM handler… just delays the shutdown to let our Broadway pipelines drain after traffic stops. But I will issue a graceful Tracker shutdown there and see if that helps, and/or try manually killing all the WebSockets on that node.

defmodule Logflare.SigtermHandler do
  @moduledoc false
  @behaviour :gen_event
  require Logger

  @grace_period Application.get_env(:logflare, :sigterm_shutdown_grace_period_ms) ||
                  throw("Not configured")

  @impl true
  def init(_) do
    Logger.info("#{__MODULE__} is being initialized...")
    {:ok, %{}}
  end

  @impl true
  def handle_info(:proceed_with_sigterm, state) do
    Logger.warn("#{__MODULE__}: shutdown grace period reached, stopping the app...")
    :init.stop()
    {:ok, state}
  end

  @impl true
  def handle_event(:sigterm, state) do
    Logger.warn("#{__MODULE__}: SIGTERM received: waiting for #{@grace_period / 1_000} seconds")
    Process.send_after(self(), :proceed_with_sigterm, @grace_period)

    {:ok, state}
  end

  @impl true
  def handle_event(ev, state) do
    Logger.warn("#{__MODULE__}: has received a system signal: #{ev} and redirected it to :erl_signal_server")
    :gen_event.notify(:erl_signal_server, ev)
    # Return {:ok, state} so this handler stays installed after forwarding the event.
    {:ok, state}
  end

  @impl true
  def handle_call(msg, state) do
    Logger.warn("#{__MODULE__} has received an unexpected call: #{inspect(msg)}")
    {:ok, :ok, state}
  end
end

Does your k8s networking layer do anything like disconnect the pod from the rest of the cluster when it goes down?

1 Like

I’m not entirely sure. Let me see!

Are your Tracker updates periodic/infrequent? Tracker will be a very poor fit for rate limiting if every rate-limit bump invokes Tracker.update, since Tracker updates are replicated across the cluster. If you are only bumping the local count periodically, it may be a reasonable enough eventually consistent value, but constant calls to update will perform poorly, so I want to double-check. Also, check out our graceful_permdown/1 API, which will let other nodes know you are going away permanently.
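For anyone finding this later, a minimal sketch of that periodic-update pattern (the module name is made up; MyTracker and the node_rates topic follow the original post):

defmodule MyApp.NodeRateReporter do
  use GenServer

  @interval_ms 1_000
  @topic "node_rates"

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Hot path: a local cast per request, no Tracker call.
  def bump(count \\ 1), do: GenServer.cast(__MODULE__, {:bump, count})

  @impl true
  def init(_opts) do
    {:ok, _ref} = Phoenix.Tracker.track(MyTracker, self(), @topic, Node.self(), %{rate: 0})
    Process.send_after(self(), :report, @interval_ms)
    {:ok, 0}
  end

  @impl true
  def handle_cast({:bump, count}, acc), do: {:noreply, acc + count}

  @impl true
  def handle_info(:report, acc) do
    # Only one replicated Tracker.update per node per interval.
    Phoenix.Tracker.update(MyTracker, self(), @topic, Node.self(), %{rate: acc})
    Process.send_after(self(), :report, @interval_ms)
    {:noreply, 0}
  end
end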

3 Likes

Correct, Tracker updates are not called with every request. I have a rate-limit GenServer that updates Tracker once a second with what that node has seen recently.

It’s the standard k8s pod termination lifecycle, where the endpoint load balancers stop routing traffic to the pods coming down. We use a headless service with the libcluster DNS strategy.
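That setup usually has roughly this shape (a sketch with placeholder service/app names):

config :libcluster,
  topologies: [
    gke: [
      # Polls the headless service's DNS records and connects to the returned IPs.
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",
        application_name: "myapp",
        polling_interval: 5_000
      ]
    ]
  ]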

Do the k8s shutdown hook and handle it gracefully with a release script; SIGTERM on its own is a very blunt instrument.

Yeah, I actually think it’s not catching the SIGTERM correctly on staging. Will update with progress; I had to jump over to something else.

Hi @chasers, did you end up finding the source of the issue? I’ve been investigating a similar issue and noticed that giving the servers more time to shut down properly seems to solve it. I could only replicate this behavior when using 6 nodes.