Optimizing WebSocket Compression in Plug Cowboy: Reducing CPU Overhead on High-Traffic Socket Server

I’m facing a peculiar challenge with a socket server I’ve built using Elixir and Cowboy WebSocket (Cowboy WebSocket documentation). This server has been in production for a while, handling substantial traffic. It consumes messages from RabbitMQ, processes them, and publishes messages to clients based on their subscribed channels.

The issue arises with data-out costs. To tackle this, I enabled Cowboy’s built-in compression. The problem, however, is that messages are compressed separately for each client: if a message needs to be sent to 1000 clients, it gets compressed 1000 times, once per client process. This causes high CPU overhead and spiking latencies, especially during message bursts.

To address this, I’m considering an alternative:
Pre-compressing messages when they’re consumed from RabbitMQ and sending the pre-compressed messages directly to clients that support compression. For clients that don’t support compression, the original uncompressed message would be sent instead. The plan is to add relevant headers so that clients (mostly web browsers) can automatically decompress messages without requiring any changes on the frontend.
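
Roughly, the shape I have in mind is something like this (module and field names here are just illustrative):

defmodule Broadcaster do
  # compress once when a message is consumed from RabbitMQ, then pick the
  # right variant per client from a capability flag captured at handshake
  def fan_out(payload, clients) do
    compressed = :zlib.gzip(payload)

    for %{pid: pid, supports_gzip?: gzip?} <- clients do
      frame = if gzip?, do: {:binary, compressed}, else: {:text, payload}
      send(pid, {:push, frame})
    end
  end
end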

However, I’m unclear about how this approach interacts with WebSocket compression features like server_context_takeover, server_max_window_bits, etc. Since Cowboy optimizes compression by managing compression contexts across frames, how would this work when the messages are already pre-compressed?

Has anyone encountered a similar problem or implemented a solution for this? I assume this is a common challenge for socket servers serving public data.

Any insights, best practices, or ideas to optimize CPU and latency in this scenario would be greatly appreciated!

Edit: Go’s Gorilla WebSocket library has a feature called PreparedMessage that would solve my issue, but plugging that functionality into the Cowboy library is way beyond my skill. I can try to implement it when I have some free time.


This does not answer the question of how to serve shared compressed values, but have you tried changing the deflate options to adjust compression settings?
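
For reference, here’s roughly where those knobs live in a plain Cowboy handler (the handler name and option values are just examples, and deflate_opts needs a reasonably recent Cowboy 2.x):

defmodule MySocketHandler do
  @behaviour :cowboy_websocket

  def init(req, state) do
    opts = %{
      compress: true,
      # tuning knobs for permessage-deflate; values here are examples
      deflate_opts: %{
        level: 1,                              # cheapest CPU-wise
        server_context_takeover: :no_takeover, # no per-connection sliding window
        server_max_window_bits: 9              # smaller window: less memory, worse ratio
      }
    }

    {:cowboy_websocket, req, state, opts}
  end

  def websocket_handle(_frame, state), do: {[], state}

  def websocket_info({:push, frame}, state), do: {[frame], state}
end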

A more experimental option would be to create your own zlib context and emulate what Cowboy does, returning the raw {binary, iodata()} frame from the websocket handler directly, with the payload pulled by sequence number from a shared ETS cache.
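
Sketching that idea (module, table, and message names are made up):

defmodule PayloadCache do
  # compress each payload once with a private zlib context and let every
  # websocket process read it from ETS by sequence number
  def create_table do
    :ets.new(:payload_cache, [:named_table, :public, :set, read_concurrency: true])
  end

  # called once per message, e.g. from the RabbitMQ consumer
  def put(seq, payload) do
    z = :zlib.open()
    :ok = :zlib.deflateInit(z, 1)
    compressed = :zlib.deflate(z, payload, :finish)
    :zlib.deflateEnd(z)
    :zlib.close(z)
    :ets.insert(:payload_cache, {seq, IO.iodata_to_binary(compressed)})
  end

  def fetch(seq), do: :ets.lookup_element(:payload_cache, seq, 2)
end

# In the cowboy websocket handler, each client then returns the shared frame:
#
#   def websocket_info({:publish, seq}, state) do
#     {[{:binary, PayloadCache.fetch(seq)}], state}
#   end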


I did a POC for this; the compressed payload almost doubled after adjusting the compression level by just one step, which is fine with me. I have yet to do a proper benchmark.


Given your problem I think you’re on the right track to pre-compress the payload you want to send. I would not use the deflate options of the websocket itself, as that is always a 1:1 relation. I assume there’s a single point in your app where you broadcast a single RabbitMQ message to multiple (websocket) receivers; that would be the place in your code to do the one-time compression and send the result out to all the receivers as a binary frame.
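
As a sketch of that single broadcast point (assuming a Registry named MyApp.Subscribers keyed by channel; everything here is illustrative):

defmodule MyApp.Fanout do
  # hypothetical single broadcast point: the RabbitMQ consumer calls this
  # once per message; compression happens here, once, and every subscribed
  # websocket process just forwards the same binary frame
  def broadcast(channel, payload) do
    compressed = :zlib.gzip(payload)

    Registry.dispatch(MyApp.Subscribers, channel, fn entries ->
      for {pid, _value} <- entries do
        send(pid, {:push, {:binary, compressed}})
      end
    end)
  end
end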

I’ve built a lot of websocket implementations for high-traffic websocket servers, and a lot of them compressed their payload, but none of them used the built-in compression extension of websockets (RFC 7692); they just sent it out as a binary frame and specified in their documentation which form of compression they used. Another reason I would not opt for the built-in compression is that a client can always ignore it or signal that it doesn’t support compression (following the standard negotiation flow) and receive the uncompressed stream, which would undo what you want.

Another thing which can help a lot with burst traffic is to implement some form of buffering and sending data out in “ticks”. In my implementations I sometimes had to handle thousands of messages per second, so I would buffer everything within a window of 250 ms to 1 s and compress and send it out once per interval. I was not able to apply back pressure to my (external) upstream, so this helped a lot with high traffic spikes and kept the mailboxes of the involved actors/processes from filling up.
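
A minimal sketch of such a tick buffer (the module name, registry, and the 250 ms interval are just examples):

defmodule TickBuffer do
  use GenServer

  @tick_ms 250

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # producers (e.g. the RabbitMQ consumer) push raw messages here
  def push(msg), do: GenServer.cast(__MODULE__, {:push, msg})

  @impl true
  def init(_opts) do
    schedule_tick()
    {:ok, []}
  end

  @impl true
  def handle_cast({:push, msg}, buffer), do: {:noreply, [msg | buffer]}

  @impl true
  def handle_info(:tick, buffer) do
    if buffer != [] do
      # compress the whole batch once per tick, not once per message
      batch = buffer |> Enum.reverse() |> Enum.join("\n")
      compressed = :zlib.gzip(batch)

      Registry.dispatch(MyApp.Subscribers, "firehose", fn entries ->
        for {pid, _value} <- entries, do: send(pid, {:push, {:binary, compressed}})
      end)
    end

    schedule_tick()
    {:noreply, []}
  end

  defp schedule_tick, do: Process.send_after(self(), :tick, @tick_ms)
end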

HTH

(EDIT1: I just saw that there’s way more context about your issue on reddit; link to that post: https://www.reddit.com/r/elixir/comments/1hn91p6/need_help_in_optimizing_websocket_compression/ )

EDIT2: super interesting post about what the compression levels do in reality: Configuring & Optimizing WebSocket Compression - igvita.com


Thought more about this: you can get the compress-once behaviour out of the box with Phoenix Channels, which use something called fastlane within the serializers (see this forum post explained by LostKobrakai: How to scale Phoenix Pubsub event publishing? - #2 by LostKobrakai ). When you broadcast to a channel it serializes once and reuses that for each listener. You could write a custom serializer which compresses the payload with MessagePack or zstd (train a dictionary on a bunch of your data and use it on both sides, which compresses far better on small payloads). An example of a custom serializer looks like this (source: blog/binary_data_over_phoenix_sockets/web/transports/message_pack_serializer.ex at b164ae5e8fb4701ee40925aca9aef2297b80be95 · Kaizen-Gaming/blog · GitHub ):

defmodule BinaryDataOverPhoenixSockets.Transports.MessagePackSerializer do
  @moduledoc false

  @behaviour Phoenix.Transports.Serializer

  alias Phoenix.Socket.Reply
  alias Phoenix.Socket.Message
  alias Phoenix.Socket.Broadcast

  # only gzip data above 1K
  @gzip_threshold 1024

  def fastlane!(%Broadcast{} = msg) do
    {:socket_push, :binary, pack_data(%{
      topic: msg.topic,
      event: msg.event,
      payload: msg.payload
    })}
  end

  def encode!(%Reply{} = reply) do
    packed = pack_data(%{
      topic: reply.topic,
      event: "phx_reply",
      ref: reply.ref,
      payload: %{status: reply.status, response: reply.payload}
    })
    {:socket_push, :binary, packed}
  end

  def encode!(%Message{} = msg) do
    # We need to convert the Message struct into a plain map for MessagePack to work properly.
    # Alternatively we could have implemented the Enumerable behaviour. Pick your poison :)
    {:socket_push, :binary, pack_data(Map.from_struct(msg))}
  end

  # messages received from the clients are still in json format;
  # for our use case clients are mostly passive listeners and made no sense
  # to optimize incoming traffic
  def decode!(message, _opts) do
    message
    |> Poison.decode!()
    |> Phoenix.Socket.Message.from_map!()
  end

  defp pack_data(data) do
    msgpacked = MessagePack.pack!(data, enable_string: true)
    gzip_data(msgpacked, byte_size(msgpacked))
  end

  defp gzip_data(data, size) when size < @gzip_threshold, do: data
  defp gzip_data(data, _size), do: :zlib.gzip(data)
end
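
To wire a serializer like that in, it would look roughly like this with the Phoenix 1.2-era transport API the snippet targets (newer Phoenix versions configure serializers per protocol version on the socket declaration instead; all MyAppWeb names are placeholders):

defmodule MyAppWeb.UserSocket do
  use Phoenix.Socket

  # Phoenix 1.2-era transport configuration, matching the serializer above
  transport :websocket, Phoenix.Transports.WebSocket,
    serializer: BinaryDataOverPhoenixSockets.Transports.MessagePackSerializer

  channel "room:*", MyAppWeb.RoomChannel

  def connect(_params, socket), do: {:ok, socket}
  def id(_socket), do: nil
end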

Maybe a completely different direction, but it gives you more options :smiley:

Your issue seems closely related to what Discord went through and blogged about a few months ago. Take a look: How Discord Reduced Websocket Traffic by 40%
