Recently, I have begun testing a project which relies heavily on long-lived LiveView connections. Everything works fantastically on my machine, and even in the cloud (hereafter ‘prod’) everything runs dang-near flawlessly. However, I’m hitting one weird issue whose cause I’m having trouble nailing down.
On the site, I have a simple chat interface which uses `temporary_assigns` to populate messages. Users are expected to hang around on the site for a while (20-60+ mins), so this approach seemed the most ‘correct’ in that it saves memory and all that. This all works as expected and the dev experience was great.
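For reference, here’s roughly what the setup looks like (simplified; the module and assign names are placeholders, not my exact code):

```elixir
defmodule MyAppWeb.ChatLive do
  use MyAppWeb, :live_view

  def mount(_params, _session, socket) do
    # :messages is reset to [] after each render, so the server
    # doesn't keep the full chat history in memory.
    {:ok, assign(socket, :messages, []), temporary_assigns: [messages: []]}
  end

  # New messages arrive via PubSub; the client appends them thanks to
  # phx-update="append" on the messages container in the template.
  def handle_info({:new_message, msg}, socket) do
    {:noreply, assign(socket, :messages, [msg])}
  end
end
```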
The issue: after testing on prod, it’s become evident that after an arbitrary amount of time, all LiveView sockets disconnect and then reconnect (almost instantly). For the most part this doesn’t impact the user experience, except that all of the chat messages are wiped out upon reconnect, and `Presence` alerts users of all the other users reconnecting.
Typically, all ~8 users cycle their connections together, but I’ve noticed instances where only a few would cycle and the others would follow several minutes later. (I could not observe a correlation between those that cycled and those that didn’t, or in the duration between the different group cycles.)
What I’ve done to investigate:
- Integrated AppSignal to investigate possible errors. No errors have been raised. It does show that the app has run on a few different GCP nodes.
- Monitored network traffic to ensure heartbeats etc. are being sent/handled and sockets aren’t dying from inactivity.
- Removed some extraneous code which I thought was filling memory - essentially a list of ‘pending messages’ which was not marked as a `temporary_assign`. This seemed to help, but I think it was a fluke.
- Explicitly printed a log in the LV’s `terminate/2` callback - which just indicates a normal shutdown of the process. Nothing of help there.
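The logging itself was nothing fancy, just something along these lines (sketch, not my exact code):

```elixir
# In the LiveView module - log the exit reason whenever the process dies.
# In prod this only ever reports a normal :shutdown, never a crash.
def terminate(reason, _socket) do
  Logger.info("chat LV terminating: #{inspect(reason)}")
  :ok
end
```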
- Adjusted the websocket timeouts in my app’s endpoint, thinking maybe the connections timed out after a fixed amount of time. This had no effect, and the duration between disconnects seemed arbitrary each time. (That is, I couldn’t find a correlation with e.g. the default socket timeouts.)
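Concretely, I bumped the `:timeout` option on the LiveView socket in `endpoint.ex` (the value here is just one of the things I tried, not a recommendation):

```elixir
# endpoint.ex - extend the websocket idle timeout (default is 60_000 ms)
socket "/live", Phoenix.LiveView.Socket,
  websocket: [connect_info: [session: @session_options], timeout: 120_000]
```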
- Adjusted the amount of RAM available through the host PaaS (Gigalixir), thinking it was an OOM issue. This didn’t do anything.
- Attempted to force a connection cycle locally by sending tons of chat messages to overwhelm the system. No cycle occurred even at 60 messages/sec (which is an exorbitant rate for my use case, anyway).
My server is hosted (via Gigalixir) on GCP. I’m guessing that these socket connections are getting passed around different nodes, cycling connections and losing state in the process.
My questions at this point are:
- Can GCP support long-lived socket connections for LV? Would I experience this same issue on AWS?
- Is this something that can be safeguarded against when using `temporary_assigns`, or otherwise? (This idea doesn’t sound right - even if I fixed the disappearing chat messages, `Presence` still registers connection changes and tells chat that users have disconnected.)
- Are LVs just not meant to live as long as I need them to?
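For what it’s worth, the one mitigation I can think of for the vanishing messages is re-fetching recent history whenever the socket (re)connects, something like this (`ChatHistory` is a hypothetical context module, not something I have today):

```elixir
def mount(_params, _session, socket) do
  messages =
    if connected?(socket) do
      # Hypothetical context - reload recent history on every (re)connect
      # so a dropped socket doesn't wipe the chat for that user.
      ChatHistory.recent_messages(limit: 50)
    else
      []
    end

  {:ok, assign(socket, :messages, messages), temporary_assigns: [messages: []]}
end
```

But that still doesn’t address the `Presence` join/leave noise, so it feels like treating the symptom rather than the cause.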
My next thought to try is to move my server over from GCP to AWS, but I’d like to avoid that whole process if I could. My gut tells me it’s a GCP thing, though.
Sorry for the big wall of text, but thank you to those who took the time to read it!