Phoenix Socket Disconnect After Heartbeat

morinap · May 25, 2023, 9:34pm

I’ve been attempting to track down a problem with a Phoenix/LiveView app and a long-running client. The client keeps a LiveView open sometimes for days at a time, through deploys, etc. Generally, the socket connection mechanism works well; if the connection is dropped, it reconnects and resumes. In these cases, I see a socket response code of 1006, as I’d expect.

However, under some as-yet-undetermined conditions, the connection closes with a code of 1001. In these instances, I do not see anything abnormal in my server logs to indicate a forcibly closed connection, and no navigation has occurred on the client side. I’ve captured this happening in my websocket messages log, as seen here. It happened immediately after sending a heartbeat up the channel to the server.

This presents an additional problem because, as of LiveView 0.18.4 (see 9a9d7c84c18fc19d2deffd79c1b85346206d4085), code 1001 causes the socket to call unload, so my client never reconnects. This is noted as having been added to solve a Firefox navigation issue. It’s important to note that I have definitely duplicated this in Firefox and not yet in another browser, but since it happens inconsistently, that may just be due to timing (I’m attempting to catch an instance in Chrome). It is also worth noting that this happens with no interaction on the client side at the time.
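To make the behaviour described above concrete, here is a minimal sketch of a close-handling policy. It is not LiveView's source: `actionForClose`, `ClosePolicy`, and `unloadOnGoingAway` are hypothetical names, and every code other than 1001 is simplified to "reconnect" for illustration.

```typescript
// Hypothetical sketch, not taken from LiveView or Phoenix.
type CloseAction = "reconnect" | "unload";

interface ClosePolicy {
  // When true, mimic the behaviour described in this post for
  // LiveView 0.18.4: a 1001 close is treated as the page going away,
  // so the client unloads instead of reconnecting.
  unloadOnGoingAway: boolean;
}

const GOING_AWAY = 1001;

// Decide what the client does after the socket closes with `code`.
// Assumption: all other codes lead to a reconnect attempt.
function actionForClose(code: number, policy: ClosePolicy): CloseAction {
  if (code === GOING_AWAY && policy.unloadOnGoingAway) {
    return "unload";
  }
  return "reconnect";
}
```

With `unloadOnGoingAway` set to `true`, a 1006 still yields "reconnect" while a 1001 yields "unload", which matches the symptom of a long-running client that never comes back.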

Has anyone experienced this or does anyone have any insight into where I can look further on this?

morinap · May 26, 2023, 11:40am

Update: I have been able to duplicate the same problem in Chrome. I’ve also seen it occur when it was not a direct response to a heartbeat, so the heartbeat timing appears to be a bit of a red herring.

This is an environment built in Docker and hosted on Render, so it’s possible that either of those two pieces of the stack is causing this. It’s interesting that it works for a long time until it doesn’t, though, so it doesn’t feel like a configuration problem but rather some actual event triggering this close.

chrismccord · May 26, 2023, 1:18pm

It’s possible Render’s proxy sends 1001 in some scenarios, but I can’t really say what’s going on with the info we have:

(1001) Indicates an endpoint is being removed. Either the server or client will become unavailable.

It is my understanding that 1001 is a code we should not attempt to reconnect from because either the client or server is going away gracefully, so I’m not sure how LiveView/channels would be able to handle this if the server sends a 1001 but expects the client to treat it as an error (which has its own code, 1011).
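For reference, the close codes mentioned in this thread can be laid out as a small lookup. This is only a sketch: the names follow RFC 6455, while `describeCloseCode`, `CloseCodeInfo`, and the `sentInCloseFrame` field are hypothetical and introduced here for illustration. One detail that matters when reading logs is that 1006 is reserved and never travels in a close frame; it is reported locally when the connection is lost without one, whereas 1001 and 1011 can be carried in a close frame.

```typescript
// Hypothetical helper; names and shape are assumptions for illustration.
interface CloseCodeInfo {
  name: string;
  // Whether the code may appear in a close frame on the wire (RFC 6455).
  sentInCloseFrame: boolean;
}

// Describe the close codes discussed in this thread.
function describeCloseCode(code: number): CloseCodeInfo {
  switch (code) {
    case 1000:
      return { name: "Normal Closure", sentInCloseFrame: true };
    case 1001:
      return { name: "Going Away", sentInCloseFrame: true };
    case 1006:
      // Reserved: reported by the local endpoint when no close frame arrived.
      return { name: "Abnormal Closure", sentInCloseFrame: false };
    case 1011:
      return { name: "Internal Error", sentInCloseFrame: true };
    default:
      return { name: "Unlisted code " + code, sentInCloseFrame: false };
  }
}
```

For example, `describeCloseCode(1001).name` is "Going Away", the graceful case, while an unexpected server failure would be expected to surface as 1011 instead.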

morinap · May 26, 2023, 1:39pm

Thank you for the reply.

I guess that’s what I’m left trying to determine: is this something that Phoenix’s endpoint code is doing, or something Render and its proxy are doing? After duplicating in Chrome, I feel like I’ve ruled out the browser at least.

My next step in this regard is trying to duplicate this in a development environment where my browser is calling directly into a Phoenix endpoint with no proxies in between.
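When comparing a proxied and a direct setup, it may help to record the same fields for every close so the runs can be lined up afterwards. The following is a rough sketch under stated assumptions: `CloseInfo` mirrors the `code`, `reason`, and `wasClean` fields of the browser's `CloseEvent`, while `summarizeClose` and the heartbeat timestamp bookkeeping are hypothetical and not part of Phoenix.

```typescript
// Mirrors the fields of the browser's CloseEvent that matter here.
interface CloseInfo {
  code: number;
  reason: string;
  wasClean: boolean;
}

// Build one log line per close. Timestamps are in milliseconds and are
// supplied by the caller (for example from Date.now()).
function summarizeClose(
  ev: CloseInfo,
  lastHeartbeatAt: number,
  closedAt: number
): string {
  const sinceHeartbeatMs = closedAt - lastHeartbeatAt;
  return (
    `close code=${ev.code} clean=${ev.wasClean} ` +
    `reason="${ev.reason}" ms_since_heartbeat=${sinceHeartbeatMs}`
  );
}
```

A handler attached to the WebSocket `close` event could pass its event straight in, since `CloseEvent` carries these three fields. Comparing `clean` and `ms_since_heartbeat` across environments is one way to check whether the 1001 closes only appear behind the proxy.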

Yeah, I don’t really think the Phoenix or LiveView code is doing anything “wrong” here on the client end (although one note is that Phoenix without LiveView would attempt to reconnect here, if I’m reading the code correctly; it’s LV that’s calling unload()). I’m mostly trying to understand if this is something happening in the Phoenix server. I dove into the Cowboy source a little bit to try to understand that, but I haven’t come up with anything as of yet.

chrismccord · May 26, 2023, 2:17pm

We have since added unloading code to external form submits and anchor clicks, so this check on the close code for FF should be redundant now and I have removed it on main.

morinap · May 30, 2023, 3:20pm

Great, thank you for the update. I’m still trying to get to the bottom of why this is happening within our stack, but it’s good to know we can probably safely proceed/ignore those 1001 closed responses.