Phoenix Socket draining

We’re attempting to make our deployment process a little less… disruptive, so we’d like to find a way of manually draining all open websockets.

Background

We’re running on AWS autoscaling groups with an Application Load Balancer. When we deploy, we register a new instance with the load balancer target group, then deregister the old one. Unfortunately, target group deregistration is ignorant of websockets, so it waits the full 300s deregistration delay for all open connections to complete (which of course the websockets do not do) and then forcefully closes all of the connections. This causes a bit of a stampede (yes, yes, this is partly a client issue; let’s just assume the clients are bad actors), and we’d much prefer to do this more gently.

Question

Is there any “official” way of traversing the open websockets and closing them down? Aside from spelunking through the supervision tree, that is.

Though I’m open to hearing about other options :slight_smile:.

1 Like

There is an example of how to close a channel (i.e. a websocket) server-side in the docs (Phoenix.Socket — Phoenix v1.6.6):

MyAppWeb.Endpoint.broadcast("users_socket:" <> user.id, "disconnect", %{})

But AFAICS it always needs those ids, which you could gather using Presence.track on a topic that keeps its updates (presence_diff) private.

On a shutdown signal you could iterate over these ids and broadcast the “disconnect” event above to your sockets.
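A rough sketch of that idea, assuming a MyAppWeb.Presence module that tracks every connected user under their id on a hypothetical “users” topic, and a socket whose id/1 returns "users_socket:" <> user.id (all of these names are placeholders):

def drain_sockets do
  # List everyone currently tracked on the "users" topic and send each of
  # their sockets the documented "disconnect" event shown above.
  "users"
  |> MyAppWeb.Presence.list()
  |> Map.keys()
  |> Enum.each(fn user_id ->
    MyAppWeb.Endpoint.broadcast("users_socket:" <> user_id, "disconnect", %{})
  end)
end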

In one of the projects we use SocketDrano — socket_drano v0.5.0, though personally I haven’t worked with it yet, so I can’t provide details.

That’s handy, thanks!

From the docs:

Important: This library currently leverages an undocumented internal function in Phoenix to achieve its magic of closing local sockets.

:blush:

But if we’re running in a cluster, Presence will return a global list, correct? Not just the locally connected sockets.

We could probably do some other form of local bookkeeping with, say, a Registry, but I was hoping to avoid that…

Presence.track can store key/values, so you could store the hostname along with the user id? Then on a quit signal do a lookup on the hostname and you have a list of connected sockets for that node. And you could use Phoenix.PubSub — phoenix_pubsub v2.0.0 to keep the broadcast local to the node.
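Continuing the earlier sketch (the module names, the “users” topic, and string user ids are all assumptions): track the node alongside each user, then on a quit signal disconnect only the users whose metadata points at the current node. MyAppWeb.Endpoint.local_broadcast/3 goes through Phoenix.PubSub.local_broadcast/3, so the “disconnect” never leaves this node.

# Tracking, e.g. in a channel join/3: remember which node this socket lives on.
{:ok, _} =
  MyAppWeb.Presence.track(self(), "users", socket.assigns.user_id, %{node: Node.self()})

# On a quit signal: disconnect only the users tracked on this node.
def drain_local_sockets do
  for {user_id, %{metas: metas}} <- MyAppWeb.Presence.list("users"),
      Enum.any?(metas, &(&1.node == Node.self())) do
    MyAppWeb.Endpoint.local_broadcast("users_socket:" <> user_id, "disconnect", %{})
  end
end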

1 Like

We’ve come up with a not-so-great workaround for now.

SocketDrano had a couple issues for us:

  • it relied on channel join telemetry events, so it would miss sockets that do not join a channel
  • it was oriented around draining on SIGTERM, but we rely on CodeDeploy lifecycle events to trigger the draining

So, we ended up “extending” Phoenix.Socket by creating our own Socket module that has overrideable callbacks:
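What follows is a rough reconstruction of that wrapper rather than the exact module from this post; the callback names and arities are the ones use Phoenix.Socket injects in Phoenix 1.6, so double-check them against your version.

defmodule MyAppWeb.Socket do
  # Wraps `use Phoenix.Socket` and marks the transport callbacks it injects as
  # overridable, so the socket module can override them and still call `super`.
  defmacro __using__(opts) do
    quote do
      use Phoenix.Socket, unquote(opts)

      defoverridable init: 1, handle_in: 2, handle_info: 2, terminate: 2
    end
  end
end

The user socket then does use MyAppWeb.Socket in place of use Phoenix.Socket.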

Then we create a custom init callback:

@impl Phoenix.Socket.Transport
def init(state) do
  # Run Phoenix's own init, then record the transport pid (when present) so we
  # can drain it later.
  super(state)
  |> tap(fn
    {:ok, {_state, %Phoenix.Socket{transport_pid: pid}}} when is_pid(pid) ->
      track_socket(pid)

    _ ->
      :ok
  end)
end

track_socket then does essentially what SocketDrano does (monitors the socket pids and disconnects them when told to).
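For completeness, one hypothetical shape for such a tracker (none of these names come from the post): a GenServer that monitors the tracked transport pids, forgets them when they exit, and on drain hands each remaining pid to a caller-supplied disconnect function.

defmodule MyAppWeb.SocketTracker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Called from the socket's init/1 (the track_socket/1 above).
  def track_socket(pid), do: GenServer.cast(__MODULE__, {:track, pid})

  # disconnect_fun receives each still-tracked pid; how it closes the socket
  # is up to the caller.
  def drain(disconnect_fun), do: GenServer.call(__MODULE__, {:drain, disconnect_fun})

  @impl true
  def init(_opts), do: {:ok, MapSet.new()}

  @impl true
  def handle_cast({:track, pid}, pids) do
    Process.monitor(pid)
    {:noreply, MapSet.put(pids, pid)}
  end

  @impl true
  def handle_call({:drain, disconnect_fun}, _from, pids) do
    Enum.each(pids, disconnect_fun)
    {:reply, :ok, pids}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, pids) do
    {:noreply, MapSet.delete(pids, pid)}
  end
end

How the disconnect function actually closes a given transport pid is the part SocketDrano’s docs flag as relying on Phoenix internals, so it’s deliberately left as a parameter here.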

I’ve opened a thread on the Phoenix Core mailing list asking about better solutions being added to Phoenix itself: https://groups.google.com/g/phoenix-core/c/1umhh2X1oAM/m/_sfxU8xIBgAJ

4 Likes

Just an off-topic question about connection draining: why are you using an ALB when it isn’t guaranteed to maintain persistent connections? NLB is the usual recommendation for persistent connections. Do you see any connection issues with the ALB, or are you just not at the scale where the ALB would scale out/in in the background and drop connections on you?

Sorry to ask, but I have no clue what ALB and NLB mean; I would guess it’s something related to load balancing.

1 Like

AWS Application Load Balancer (L7) and Network Load Balancer (L4)

1 Like