Disconnected from Phoenix channel when sending multiple large messages in a small time frame

Hello,

First of all I’d like to say that at my company we are huge fans of the Elixir ecosystem and particularly of the Phoenix framework.

### Context

  • A Phoenix app called “the broker” exposing a channel.
  • A Phoenix app called “the client” connecting to the broker channel using the phoenix_gen_socket_client package (see phoenix_gen_socket_client on Hex)

### Environment
Broker app:

  • Elixir 1.5.1
  • Phoenix v1.3
  • FreeBSD 10.3 jail (amd64)

Client app:

  • Elixir 1.5.1
  • Phoenix v1.3
  • Mac OS X

### Problem
When sending multiple messages over the channel, using for example a loop or Enum.each, only a few messages manage to reach the broker channel's handle_in function. Then the client crashes with the following error:

```
21:45:58.027 [error] ** State machine #PID<0.471.0> terminating
** Last message in was {:send_frame, {:binary, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 188, 187, 231, 206, 235, 74, 146, 45, 248, 42, 131, 51, 63, 217, 183, 232, 93, 253, 163, 104, 69, 47, 209, 115, 106, 80, 160, 247, 222, 179, 113, 223, 125, 184, 247, 62, 231, 84, …>>}}
** When State == :connected
** Data == {:context,
  {:websocket_req, :wss, 'broker.dexyde.com', 443,
   '/socket/websocket?user_token=SFMyNTY.g3QAAAACZAAEZGF0YW4HAIClCS9hXAVkAAZzaWduZWRuBgC9YkVUXwE.ZRbhkBm4BxzGEXLynPpfXDuGTcuZQMdhgcJszmZd3zo',
   30000, :undefined, 1,
   {:sslsocket, {:gen_tcp, #Port<0.20422>, :tls_connection, :undefined}, #PID<0.476.0>},
   {:transport, :ssl, :ssl, :ssl_closed, :ssl_error, [mode: :binary, active: true, verify: :verify_none, packet: 0]},
   "12krm8DaxFlr/ZuaS8p6Uw==", 139669, 1, 2, :undefined, :undefined},
  {:transport, :ssl, :ssl, :ssl_closed, :ssl_error, [mode: :binary, active: true, verify: :verify_none, packet: 0]}, ,
  {:wss, 'broker.dexyde.com', 443,
   '/socket/websocket?user_token=SFMyNTY.g3QAAAACZAAEZGF0YW4HAIClCS9hXAVkAAZzaWduZWRuBgC9YkVUXwE.ZRbhkBm4BxzGEXLynPpfXDuGTcuZQMdhgcJszmZd3zo'},
  {Phoenix.Channels.GenSocketClient.Transport.WebSocketClient, %{keepalive: 30000, socket: #PID<0.469.0>}},
  <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 188, 187, 231, 206, 235, 90, 118, 45, 248, 42, 141, 211, 63, 105, 155, 57, 213, 63, 138, 81, 204, 18, 51, 187, 26, 5, 230, 156, 51, 141, 251, 238, 151, …>>,
  false, :undefined, 0}
** Reason for termination = ** {:error, :closed}
```

We are using binary messages with a custom serializer on the broker side. The observed size of the payload, before zipping it, is between 500K and 1M characters.

### Tests done
Adding a :timer.sleep(3000) between each push from the client (sketched below) lets the messages arrive in the broker's handle_in function, so it can reply to the client's handle_reply.
Reducing the size of the message payload seems to work too.
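For reference, here is roughly what that test looks like in the client's Phoenix.Channels.GenSocketClient callback module (the event name and the messages kept in state are illustrative, not our real code):

```elixir
# One callback from the client's GenSocketClient callback module: push
# the messages one by one, pausing between pushes.
def handle_joined(topic, _payload, transport, state) do
  Enum.each(state.messages, fn payload ->
    Phoenix.Channels.GenSocketClient.push(transport, topic, "large_msg", payload)
    # Without this pause, only a few messages get through before the
    # connection is closed.
    :timer.sleep(3000)
  end)

  {:ok, state}
end
```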

### Our conclusion… so far
It seems that when a client app sends multiple large messages over the channel in a small time frame, it gets disconnected from the broker channel.

### Questions

  • Has anybody already experienced something similar when sending many large messages in a small time frame?
  • Are there any known channel or socket limitations?

Any thoughts or remarks on this subject are most welcome.

Many thanks for your help.

Regards,

Julien


A few notes:

  • WebSockets run over TCP.
  • Phoenix, I think, uses a timeout of 60s or so to receive a heartbeat (or similar) message.
  • If it takes longer than that to send a single message, then Phoenix thinks the pipe is frozen or overwhelmed or something, so it kills it.

You should probably break up large messages into multiple smaller ones, or there might be a way to increase the timeout? I haven't checked that yet… ^.^;
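Something along these lines, for example (Phoenix 1.3 syntax; the module names, timeout value, and chunk size are only illustrative, not tested against your setup):

```elixir
# 1. Raise the transport timeout on the broker's socket; the :timeout
#    option of Phoenix.Transports.WebSocket defaults to 60_000 ms.
defmodule BrokerWeb.UserSocket do
  use Phoenix.Socket

  transport :websocket, Phoenix.Transports.WebSocket,
    timeout: 120_000

  def connect(_params, socket), do: {:ok, socket}
  def id(_socket), do: nil
end

# 2. Or split a big binary into smaller frames before pushing them.
defmodule Chunker do
  @chunk_size 64 * 1024

  def chunks(<<>>), do: []

  def chunks(<<chunk::binary-size(@chunk_size), rest::binary>>),
    do: [chunk | chunks(rest)]

  # the last chunk may be smaller than @chunk_size
  def chunks(rest), do: [rest]
end
```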


I responded to this on the GH issue. Here's a copy of my response:

It’s hard to say without a reproducible example. Can you somehow make a simple project which demonstrates the problem? If yes, we’d be happy to debug.

Without an example, and based on your description, my first hunch is that a netsplit occurs because of long deserialization time on the Phoenix side. At our company, we actually have a similar use case to yours. We occasionally send some large messages, and originally we used a custom JSON/gzip serializer. At some point we noticed various problems with very large messages, though I can’t recall if the symptom was exactly the same as yours.

Our solution boiled down to two things:

  • compressing/decompressing the payload outside of the serializer
  • serializing/deserializing messages using :erlang.term_to_binary and :erlang.binary_to_term (sketched below)
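To give an idea of the second point, the gist of our serializer looks roughly like this (simplified; it skips the actual Phoenix serializer behaviour callbacks, and the module name is made up):

```elixir
# The message envelope is (de)serialized with the fast
# term_to_binary/binary_to_term pair, while the payload inside remains a
# gzipped binary that only the receiving process inflates.
defmodule BinarySerializer do
  def encode(message), do: :erlang.term_to_binary(message)

  # :safe prevents the creation of new atoms/funs from untrusted input
  def decode(binary), do: :erlang.binary_to_term(binary, [:safe])
end

# Compression happens outside the serializer, before the push:
# gzipped_payload = :zlib.gzip(large_payload)
```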

The thing is that on the Phoenix side, messages are deserialized in the socket process, so long deserializations of large messages block the entire communication, which might provoke a netsplit.

With our new approach, compression/decompression is not done in the socket process, which vastly reduced the decoding time in the socket process.

Since the payload is compressed, we went a step further and switched the serializer to be based on :erlang.term_to_binary and :erlang.binary_to_term, which work pretty fast, even for larger payloads.

So on the Phoenix side, message processing now goes as follows:

  1. The serializer decodes the message (using :binary_to_term).
  2. The payload of the message is still encoded (gzipped), but the message is now forwarded to the channel process.
  3. The channel process can now unzip the message without blocking the socket process.
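On the channel side, that last step might look something like this (the event name and payload key are made up):

```elixir
defmodule BrokerWeb.DataChannel do
  use Phoenix.Channel

  def join(_topic, _payload, socket), do: {:ok, socket}

  def handle_in("large_msg", %{payload: gzipped}, socket) do
    # Only this channel process pays the decompression cost; the socket
    # process stays free to move other messages.
    body = :zlib.gunzip(gzipped)
    # ... handle the decompressed body ...
    {:reply, :ok, socket}
  end
end
```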

We actually went a step further, and uncompressed the message in a separate GenServer. We did this to keep the channel responsive to other messages.
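Roughly like this (Inflater is a made-up worker; the channel hands the gzipped payload off and later receives the result in handle_info/2):

```elixir
# A worker process that takes decompression off the channel process.
defmodule Inflater do
  use GenServer

  def start_link(_opts),
    do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Called from the channel; returns immediately, keeping it responsive.
  def inflate(gzipped, reply_to),
    do: GenServer.cast(__MODULE__, {:inflate, gzipped, reply_to})

  def init(state), do: {:ok, state}

  def handle_cast({:inflate, gzipped, reply_to}, state) do
    send(reply_to, {:inflated, :zlib.gunzip(gzipped)})
    {:noreply, state}
  end
end
```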

With those changes, we were able to significantly reduce processing time in both the socket process and the channel process. As a result, these processes were much more responsive.

I’m not sure if this would fix your problems, but you could give it a try.


Thanks a lot for your quick reply and your interesting insights.
We are going to test zipping/unzipping outside the socket process and put less logic in the channel's handle_in, to keep it focused on transport.