Recently, I have begun testing a project which relies heavily on long-lived LiveView connections. Everything works fantastically on my machine, and even in the cloud (hereafter ‘prod’) everything runs dang-near flawlessly. However, I’m hitting one weird issue whose cause I’m having trouble nailing down.
On the site, I have a simple chat interface which uses `temporary_assigns` to populate messages. Users are expected to hang around on the site for a while (20-60+ mins), so this approach seemed the most ‘correct’ in that it saves memory and all that. This all works as expected and the dev experience was great.
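For reference, here’s roughly what the setup looks like (simplified; the module and assign names are placeholders, not my exact code):

```elixir
defmodule MyAppWeb.ChatLive do
  use MyAppWeb, :live_view

  def mount(_params, _session, socket) do
    # :messages is reset to [] after each render, so the server
    # doesn't keep the full chat history in memory.
    {:ok, assign(socket, :messages, []), temporary_assigns: [messages: []]}
  end

  # New messages arrive via PubSub; the client appends them thanks to
  # phx-update="append" on the messages container in the template.
  def handle_info({:new_message, msg}, socket) do
    {:noreply, assign(socket, :messages, [msg])}
  end
end
```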
The issue: after testing on prod, it’s become evident that after an arbitrary amount of time, all LiveView sockets disconnect and then reconnect (almost instantly). For the most part this doesn’t impact the user experience, except that all of the chat messages are wiped out upon reconnect, and `Presence` alerts users of all the other users reconnecting.
Typically, all ~8 users cycle their connections together, but I’ve noticed instances where only a few would cycle and the others would follow several minutes later. (I could not observe a correlation between those that cycled and those that didn’t, or in the duration between the different group cycles.)
What I’ve done to investigate:
- Integrated AppSignal to investigate possible errors. No errors have been raised. It does show that the app has run on a few different GCP nodes.
- Monitored network traffic to ensure heartbeats etc. are being sent/handled and sockets aren’t dying from inactivity.
- Removed some extraneous code which I thought was filling memory - essentially a list of ‘pending messages’ which was not marked as a `temporary_assign`. This seemed to help, but I think it was a fluke.
- Explicitly printed a log in the LV’s `terminate/2` callback - which just indicates a normal shutdown of the process. Nothing of help there.
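The logging itself was nothing fancy, just something along these lines (sketch, not my exact code):

```elixir
# In the LiveView module - log the exit reason whenever the process dies.
# In prod this only ever reports a normal :shutdown, never a crash.
def terminate(reason, _socket) do
  Logger.info("chat LV terminating: #{inspect(reason)}")
  :ok
end
```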
- Adjusted the websocket timeouts in my app’s endpoint, thinking maybe the connections timed out after a fixed amount of time. This had no effect, and the duration between disconnects seemed arbitrary each time. (That is, I couldn’t find a correlation with e.g. the default socket timeouts.)
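Concretely, I bumped the `:timeout` option on the LiveView socket in `endpoint.ex` (the value here is just one of the things I tried, not a recommendation):

```elixir
# endpoint.ex - extend the websocket idle timeout (default is 60_000 ms)
socket "/live", Phoenix.LiveView.Socket,
  websocket: [connect_info: [session: @session_options], timeout: 120_000]
```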
- Adjusted the amount of RAM available through the host PaaS (Gigalixir), thinking it was an OOM issue. This didn’t do anything.
- Attempted to force a connection cycle locally by sending tons of chat messages to overwhelm the system. No cycle occurred even at 60 messages/sec (which is an exorbitant rate for my use case, anyway).
My server is hosted (via Gigalixir) on GCP. I’m guessing that these socket connections are getting passed around different nodes, cycling connections and losing state in the process.
My questions at this point are:
- Can GCP support long-lived socket connections for LV? Would I experience this same issue on AWS?
- Is this something that can be safeguarded against when using `temporary_assigns`, or otherwise? (This idea doesn’t sound right - even if I fixed the disappearing chat messages, `Presence` still registers connection changes and tells chat that users have disconnected.)
- Are LVs just not meant to live as long as I need them to?
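For what it’s worth, the one mitigation I can think of for the vanishing messages is re-fetching recent history whenever the socket (re)connects, something like this (`ChatHistory` is a hypothetical context module, not something I have today):

```elixir
def mount(_params, _session, socket) do
  messages =
    if connected?(socket) do
      # Hypothetical context - reload recent history on every (re)connect
      # so a dropped socket doesn't wipe the chat for that user.
      ChatHistory.recent_messages(limit: 50)
    else
      []
    end

  {:ok, assign(socket, :messages, messages), temporary_assigns: [messages: []]}
end
```

But that still doesn’t address the `Presence` join/leave noise, so it feels like treating the symptom rather than the cause.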
My next thought to try is to move my server over from GCP to AWS, but I’d like to avoid that whole process if I could. My gut tells me it’s a GCP thing, though.
Sorry for the big wall of text, but thank you to those who took the time to read it!