Keeping track of WebSocket disconnects

Next step is to replace the socket monitor GenServer with a simple process spawn in init/1 that monitors and waits for the :DOWN message:

defmodule MyAppWeb.DriverSocket do
  use Phoenix.Socket
  defoverridable init: 1

  def init(state) do
    res = {:ok, {_, socket}} = super(state)
    monitor(socket)
    res
  end

  defp monitor(socket) do
    spawn fn ->
      Process.monitor(socket.transport_pid)
      receive do
        {:DOWN, _ref, :process, _monitored_pid, _reason} ->
          Logger.info("Driver #{socket.assigns.driver_id} disconnected")
      end
    end
  end
end

This is so simple in comparison! And for this particular use case it seems to suffice. Do you see any drawbacks?

The biggest downside of a monitor vs a link is if the spawned process dies, you’re going to lose your monitoring. However, that chance is fairly low and something you can probably live with if this is just for analysis.

If you’re doing business logic based on the socket being up, then you could end up in a messed up state. Phoenix tracker solves this by linking the channel process to a single process so that deaths are bidirectional (although the tracker process handles it via a down)

Yeah, in considering this I am taking into account that the spawned function is trivial, the template would be:

defp monitor(socket) do
  spawn fn ->
    Process.monitor(socket.transport_pid)
    receive do
      {:DOWN, _ref, :process, _monitored_pid, _reason} ->
        on_disconnect(socket)
    end
  end
end

defp on_disconnect(socket) do
  # Log disconnection.
  # Decrement gauge for users connected to this node in metrics.
end

I do not even match on the socket pid, because obviously that process cannot receive any other :DOWN message.

That function does not even use OTP, if the Process.monitor/1 call crashes, or the process leaks because the message is not received when the process managing the socket dies, there would be something really broken in the language implementation.

I’ve see that socket.transport_pid is equal to self() in the test suite. Therefore, you cannot test what happens when you disconnect by disconnecting!

My pragmatic compromise for this right now, subject to anyone suggesting a better idea, is to assume that the monitor code is exactly this simple:

defp monitor(socket) do
  spawn(fn ->
    Process.monitor(socket.transport_pid)
    receive do
      {:DOWN, _ref, :process, _monitored_pid, _reason} ->
        on_disconnect(socket)
    end
  end)
end

and testing on_disconnect/1.

I have been able to test socket disconnection by using a separate process. The idea is the following.

First, in the decorated init/1 you store the PID of the monitor in the socket assigns:

def init(state) do
  {:ok, {state, socket}} = super(state)
  monitor_pid = monitor(socket)
  {:ok, {state, assign(socket,  :monitor_pid, monitor_pid)}}
end

Then, in the test you execute the connect/2 call in a separate process, so killing it does not kill the actual test (because, as I explained above, the transport PID is the current process PID).

We need to block until the socket monitor has processed the :DOWN message and exited, to be able to test what should happen in a disconnection, so this needs some coordination:

test "we log driver disconnections" do
  parent = self()

  child = spawn(fn ->
    {:ok, socket} = connect(Subject, @params)
    send parent, {self(), socket, :connected}
    Process.sleep(:infinity)
  end)

  receive do
    {^child, socket, :connected} ->
      kill_socket_and_wait_for_its_monitor = fn ->
        ref = Process.monitor(socket.assigns.monitor_pid)
        Process.exit(socket.transport_pid, :kill)
        receive do
          {:DOWN, ^ref, :process, _monitored_pid, _reason} ->
            # Just block until the socket monitor has processed :DOWN and do
            # nothing.
            nil
        end
      end

      assert_log :info,
                 kill_socket_and_wait_for_its_monitor,
                 "Driver #{socket.assigns.driver_id} disconnected"
  end

There, assert_log/3 is a wrapper around capture_log/2, not relevant for the technique.

I find the test a bit ugly, but will do for 1:55 AM. Will probably polish it after some sleep.