Phoenix Channel Disconnect with 1006 error

phoenix
troubleshooting
gigalixir
Tags: #<Tag:0x00007fbca8236880> #<Tag:0x00007fbca8236740> #<Tag:0x00007fbca8236600>

#1

Hello, community!

We are faced a problem, phoenix sockets are reconnected randomly from time to time.
We try to debug and find out that the socket reconnecting because of some channel error( which we join after socket was connected).
We debug error which comes from channel, and it is “undefined”. By the time those reconnection take place, there are no more connection to that channel, i mean no one except this client has joined to the channel.

Here are screenshots of our debugs:

here we may see how socket joins a room for the first time, and than after 9 minutes get an error of channel and went to reconnect whole socket and join room again:

Part 2.
we dig deeper and found out that phoenix close connection with code 1006, see screenshot below, but what does that means? it works fine for several minutes and than goes to disconnect

Please share your thoughts about how to dig into the problem, how to debug such kind of errors and log them


#2

1006 code means “Abnormal Closure”:

Used to indicate that a connection was closed abnormally (that is, with no close frame being sent) when a status code is expected.

Maybe check your backend logs for errors / dropped tcp connections. Can be anything really, a buggy browser can unexpectedly drop tcp connections after a few minutes, VMs on amazon sometimes drop long running tcp connections (heard about it from people who tried to use it with apns), so the cause can be both on the client and on the server …

Maybe https://stackoverflow.com/questions/19304157/getting-the-reason-why-websockets-closed-with-close-code-1006 would help


#3

We migrate to gigalixir.com not long ago, i wonder if it may cause an issue, as far as i know gigalixir creator here as well, @jesse?


#4

Try checking the logs to know for sure.


#5

well, yes, i am looking for them,
btw, in parallel we connect to digital ocean, same project, same sockets, same channel join. It go smooth for a 15 minutes already.


UPDATE

i have to admit, there is nothing wrong with code, it is still running on digital ocean with no drops (35 mins), unfortunately it is due to gigalixir server.


#6

I wonder if something is “timing out”. Do you know if the websocket connection is inactive during this time? Either way, it’s something we will investigate today. 10min is a suspiciously round number. Something is probably configured wrong somewhere.


#7

It could be happening at the load balancer level, but I’d say it’s still worth making your app robust so it reconnects cleanly. Some browsers are aggressive about putting tabs to sleep when they’re not visible so reconnections can be quite frequent.


#8

My current hypothesis, is that the disconnect is due to Google Cloud’s idle tcp connection disconnect after 10m.
https://cloud.google.com/compute/docs/troubleshooting/general-tips#communicatewithinternet

Note that idle TCP connections are disconnected after 10 minutes. If your instance initiates or accepts long-lived connections with an external host, you can adjust TCP keep-alive settings to prevent these timeouts from dropping connections. You can configure the keep-alive settings on the Compute Engine instance, your external client, or both, depending on the host that typically initiates the connection. Set the keep-alives to less than 600 seconds to ensure that connections are refreshed before the timeout occurs. The following examples sets the keep-alives to one minute (60 seconds). Note that applications running on Linux systems don’t enable keep-alive by default. Thus server or client need to explicitly set the SO_KEEPALIVE socket option when opening TCP connections (see also Linux TCP Keepalive HOWTO).


#9

i am not really sure what i have to do to increase tcp connection, if i try
this command
sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
gives me file not foud

btw, it is not strictly 10 minutes, disconnect may happen right after join was established, you may see on screenshot, it may happen after 2 minutes as well as after 6mins


#10

IIRC the Phoenix client also pings the server constantly to avoid idle timeouts.


#11

yes, every 30 seconds by default the client pings the server over the established WS connection.


#12

Thanks @josevalim and @chrismccord; that helped narrow it down. I’ve narrowed the problem down to the nginx ingress controller which resets websocket connections when its config is reloaded. Unfortunately, under Kubernetes, the config is reloaded quite often. I’m investigating a few options to fix or workaround this issue such as dynamic configuration.

@rurkss if the websocket is being reconnected automatically, is this causing an end-user problem for you right now? If so, I could simply bump the timeout from 10min up to something like 30min. Would that help?


#13

definitely 30 mins will be better than 10 mins or what we have right now, we also appreciate if you solve that problem


#14

Okay, I’ll bump it up to 60m. For what it’s worth, Heroku appears to have the same behavior (see screenshot).


#15

And for others reading this, to be clear the phoenix.js client automatically recovers from dropped websocket connections, which can happen for all kinds of reasons even if the server had no fixed window drop, so the client has you covered :slight_smile:


#16

fwiw: I ran a test against heroku, and stopped it after 8 hours with no reconnects… as such timeout behaviour would be critically undocumented from their side… so does not seem to be a problem with heroku.
(you have of course the 24h cycle - and I would suppose you might see disconnects if their router/load balancer is having problems and they are cycling those around…)

but yes, phoenix.js recovers, and your frontend should also be designed to handle reconnects in a graceful way at any point in time…


#17

@outlog can I ask how you did ur test so I can try and reproduce it?

For mine, I just went to phoenixchat.herokuapp.com opened the network tab and left it open. When I came back I saw a bunch of reconnects every 30-80m.


#18

I have an admin panel, that (channel) connects to a server on heroku, so just opened up that one, and disabled sleep on my laptop…
(it does run heartbeatIntervalMs of 5000, as it shares code with a react native app, where network conn is assumed more fragile eg more heartbeats needed)

will try and test the phoenixchat.herokuapp.com one, as well as increasing the admin panel heartbeatIntervalMs to see if any reconnects appear…

fwiw - I’m on solid wifi, and more than solid fiber… I suppose wifi/ISP issues should not be ruled out…


#19

Thanks @outlog. More data points: I just tried using a few command line apps like wscat and ws. I also tried from my dev machine and also a machine on Google Compute Engine. I’m still getting 1006 (abnormal closures).

Here is wscat on GCE

$ time ws ws://phoenixchat.herokuapp.com/ws
> {"topic":"rooms:lobby","event":"phx_join","payload":{},"ref":"2679"}
< {"topic":"rooms:lobby","ref":"2679","payload":{"status":"ok","response":{}},"event":"phx_reply"}
< {"topic":"rooms:lobby","ref":null,"payload":{"status":"connected"},"event":"join"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":null},"event":"user:entered"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"SYSTEM","body":"ping"},"event":"new:msg"}
> {"topic":"rooms:lobby","event":"new:msg","payload":{"user":"js","body":"hi"},"ref":"3997"}
< {"topic":"rooms:lobby","ref":"3997","payload":{"status":"ok","response":{"msg":"hi"}},"event":"phx_reply"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"js","body":"hi"},"event":"new:msg"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"SYSTEM","body":"ping"},"event":"new:msg"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"SYSTEM","body":"ping"},"event":"new:msg"}
...
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"SYSTEM","body":"ping"},"event":"new:msg"}
websocket: close 1006 (abnormal closure): unexpected EOF

real    98m58.921s
user    0m0.286s
sys     0m0.203s

And then with wscat on my local machine

$ time wscat -c ws://phoenixchat.herokuapp.com/ws                                                                                                                 
connected (press CTRL+C to quit)
> {"topic":"rooms:lobby","event":"phx_join","payload":{},"ref":"3707"}
< {"topic":"rooms:lobby","ref":"3707","payload":{"status":"ok","response":{}},"event":"phx_reply"}
< {"topic":"rooms:lobby","ref":null,"payload":{"status":"connected"},"event":"join"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":null},"event":"user:entered"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"js","body":"hi"},"event":"new:msg"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"SYSTEM","body":"ping"},"event":"new:msg"}
< {"topic":"rooms:lobby","ref":null,"payload":{"user":"SYSTEM","body":"ping"},"event":"new:msg"}
disconnected (code: 1006)

real    110m0.246s
user    0m0.540s
sys     0m0.089s

#20

So… never heard of wscat before, how do you run it:

╰─➤  time wscat ws://phoenixchat.herokuapp.com/ws                                                                127 ↵
module.js:478
    throw err;
    ^

Error: Cannot find module 'commander'
    at Function.Module._resolveFilename (module.js:476:15)
    at Function.Module._load (module.js:424:25)
    at Module.require (module.js:504:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/usr/lib/nodejs/ws/bin/wscat:13:15)
    at Module._compile (module.js:577:32)
    at Object.Module._extensions..js (module.js:586:10)
    at Module.load (module.js:494:32)
    at tryModuleLoad (module.js:453:12)
    at Function.Module._load (module.js:445:3)
wscat ws://phoenixchat.herokuapp.com/ws  0.06s user 0.01s system 99% cpu 0.072 total

Yes node-commander is installed as well (it came as a dependency for node-ws)…