Phoenix Socket draining

We’re attempting to make our deployment process a little less… disruptive, so we’d like to find a way of manually draining all open websockets.

Background

We’re running on AWS autoscaling groups with an Application Load Balancer. When we deploy, we register a new instance with the load balancer target group, then deregister the old one. Unfortunately, target group deregistration is ignorant of websockets, so it waits the full 300s deregistration delay for all open connections to complete (which of course the websockets do not do) and then forcefully closes all of the connections. This causes a bit of a stampede (yes, yes, this is partly a client issue; let’s just assume the clients are bad actors), and we’d much prefer to do this more gently.

Question

Is there any “official” way of traversing the open websockets and closing them down? Aside from spelunking through the supervision tree, that is.

Though I’m open to hearing about other options :slight_smile:.

1 Like

There is an example of how to close a channel (i.e. a websocket) server-side in the docs (Phoenix.Socket — Phoenix v1.6.6):

MyAppWeb.Endpoint.broadcast("users_socket:" <> user.id, "disconnect", %{})

But AFAICS it always needs those ids, which you could gather using Presence.track on a topic that keeps its updates (presence_diff) private.

On a shutdown signal you could iterate over these ids and broadcast the “disconnect” event above to your sockets.
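A rough sketch of that idea, assuming a MyAppWeb.Presence module that tracks every connected user under their id on a hypothetical “users” topic, and a socket whose id/1 returns "users_socket:" <> user.id (all of these names are placeholders):

def drain_sockets do
  # List everyone currently tracked on the "users" topic and send each of
  # their sockets the documented "disconnect" event shown above.
  "users"
  |> MyAppWeb.Presence.list()
  |> Map.keys()
  |> Enum.each(fn user_id ->
    MyAppWeb.Endpoint.broadcast("users_socket:" <> user_id, "disconnect", %{})
  end)
end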

In one of the projects we use SocketDrano — socket_drano v0.5.0, though personally I haven’t worked with it yet, so I can’t provide details.

That’s handy, thanks!

From the docs:

Important: This library currently leverages an undocumented internal function in Phoenix to achieve its magic of closing local sockets.

:blush:

But if we’re running in a cluster, Presence will return a global list, correct? Not just the locally connected sockets.

We could probably do some other form of local bookkeeping with, say, a Registry, but I was hoping to avoid that…

Presence.track can store key/values, so you could store the hostname along with the user id? Then on a quit signal do a lookup on the hostname and you have a list of connected sockets for that node. And you could use Phoenix.PubSub — phoenix_pubsub v2.0.0 to keep the broadcast local to the node.
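Continuing the earlier sketch (the module names, the “users” topic, and string user ids are all assumptions): track the node alongside each user, then on a quit signal disconnect only the users whose metadata points at the current node. MyAppWeb.Endpoint.local_broadcast/3 goes through Phoenix.PubSub.local_broadcast/3, so the “disconnect” never leaves this node.

# Tracking, e.g. in a channel join/3: remember which node this socket lives on.
{:ok, _} =
  MyAppWeb.Presence.track(self(), "users", socket.assigns.user_id, %{node: Node.self()})

# On a quit signal: disconnect only the users tracked on this node.
def drain_local_sockets do
  for {user_id, %{metas: metas}} <- MyAppWeb.Presence.list("users"),
      Enum.any?(metas, &(&1.node == Node.self())) do
    MyAppWeb.Endpoint.local_broadcast("users_socket:" <> user_id, "disconnect", %{})
  end
end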

1 Like

We’ve come up with a not-so-great workaround for now.

SocketDrano had a couple issues for us:

  • it relied on channel join telemetry events, so it would miss sockets that do not join a channel
  • it was oriented around draining on SIGTERM, but we rely on CodeDeploy lifecycle events to trigger the draining

So, we ended up “extending” Phoenix.Socket by creating our own Socket module that has overrideable callbacks:
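What follows is a rough reconstruction of that wrapper rather than the exact module from this post; the callback names and arities are the ones use Phoenix.Socket injects in Phoenix 1.6, so double-check them against your version.

defmodule MyAppWeb.Socket do
  # Wraps `use Phoenix.Socket` and marks the transport callbacks it injects as
  # overridable, so the socket module can override them and still call `super`.
  defmacro __using__(opts) do
    quote do
      use Phoenix.Socket, unquote(opts)

      defoverridable init: 1, handle_in: 2, handle_info: 2, terminate: 2
    end
  end
end

The user socket then does use MyAppWeb.Socket in place of use Phoenix.Socket.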

Then we create a custom init callback:

@impl Phoenix.Socket.Transport
def init(state) do
  # Run Phoenix's own init, then record the transport pid (when present) so we
  # can drain it later.
  super(state)
  |> tap(fn
    {:ok, {_state, %Phoenix.Socket{transport_pid: pid}}} when is_pid(pid) ->
      track_socket(pid)

    _ ->
      :ok
  end)
end

track_socket then does essentially what SocketDrano does (monitors the socket pids and disconnects them when told to).
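For completeness, one hypothetical shape for such a tracker (none of these names come from the post): a GenServer that monitors the tracked transport pids, forgets them when they exit, and on drain hands each remaining pid to a caller-supplied disconnect function.

defmodule MyAppWeb.SocketTracker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Called from the socket's init/1 (the track_socket/1 above).
  def track_socket(pid), do: GenServer.cast(__MODULE__, {:track, pid})

  # disconnect_fun receives each still-tracked pid; how it closes the socket
  # is up to the caller.
  def drain(disconnect_fun), do: GenServer.call(__MODULE__, {:drain, disconnect_fun})

  @impl true
  def init(_opts), do: {:ok, MapSet.new()}

  @impl true
  def handle_cast({:track, pid}, pids) do
    Process.monitor(pid)
    {:noreply, MapSet.put(pids, pid)}
  end

  @impl true
  def handle_call({:drain, disconnect_fun}, _from, pids) do
    Enum.each(pids, disconnect_fun)
    {:reply, :ok, pids}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, pids) do
    {:noreply, MapSet.delete(pids, pid)}
  end
end

How the disconnect function actually closes a given transport pid is the part SocketDrano’s docs flag as relying on Phoenix internals, so it’s deliberately left as a parameter here.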

I’ve opened a thread on the Phoenix Core mailing list asking about better solutions being added to Phoenix itself: https://groups.google.com/g/phoenix-core/c/1umhh2X1oAM/m/_sfxU8xIBgAJ

4 Likes

Just an off-topic question about connection draining: why are you using an ALB when it isn’t guaranteed to maintain persistent connections? NLB is the usual recommendation for persistent connections. Do you see any connection issues with the ALB, or are you just not at the scale where the ALB would scale out/in in the background and drop connections on you?

Sorry to ask, but I have no clue what ALB and NLB mean; I would guess it’s something related to load balancing.

1 Like

AWS Application Load Balancer (L7) and Network Load Balancer (L4)

1 Like