Releasing new versions of a Phoenix application without losing state/connectivity

I’ve been doing some basic Phoenix work and have started looking at channels/presence. One thing I can’t figure out (coming from a SaaS/external-services-heavy background) is how you keep clients happy (in terms of network connections and cached things like sessions).

So my use case is essentially having Phoenix running with active clients and then wanting to release a new version of the application. Restarting Phoenix will kill all the network connections (obviously), but is there a way to migrate or gracefully handle this? I think one of the big upsides of Phoenix is the ability to keep things in the same stack and same code base.

Just as a comparison: today I use memcached for storing sessions and Pusher for pub/sub, so even if the application itself is restarted the clients won’t notice anything; their sessions are persisted and their browsers never drop the connection. I could use memcached (or Redis) for persisting sessions in Phoenix too (but as I mentioned before, I really like the idea of keeping things together), so how are people generally handling application updates in production?

Hello and welcome,

You might be looking for hot code upgrades.

1 Like

Channels or LiveView have a reconnect feature. What exactly is your concern?

Seems like the app needs to handle disconnects/reconnects gracefully anyway, so this scenario would be covered “for free”. But maybe there’s a better way to handle this for zero-downtime deployments? I think some sort of process handover would be possible, but that requires node clustering, which is not always easy.

Yes, reconnect does of course work, but if you kill the application you kill the state. Imagine you have 100 participants and the application gets restarted: now everyone who reconnects will only see the participants who have reconnected so far, which is kind of jarring if the UI shows a list of people (much like a netsplit and rejoin on IRC, just much slower).
I was just curious whether there are established solutions/patterns for this, considering how many nice things we already get with Elixir/Erlang.
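To make it concrete, this is roughly the kind of setup I mean (module names are made up): the participant list lives in memory in the Presence tracker processes, so a restart of a single node starts it from empty and it only fills back up as each client rejoins and re-tracks itself.

```elixir
defmodule MyAppWeb.Presence do
  # Needs to be added to the application's supervision tree.
  use Phoenix.Presence,
    otp_app: :my_app,
    pubsub_server: MyApp.PubSub
end

defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel
  alias MyAppWeb.Presence

  def join("room:" <> _id, _params, socket) do
    send(self(), :after_join)
    {:ok, socket}
  end

  def handle_info(:after_join, socket) do
    # Tracking happens per (re)join, so after a restart the list only
    # grows back as clients reconnect and re-track themselves.
    {:ok, _} =
      Presence.track(socket, socket.assigns.user_id, %{
        online_at: System.system_time(:second)
      })

    push(socket, "presence_state", Presence.list(socket))
    {:noreply, socket}
  end
end
```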

For user sessions, Plug.Session seems to have an easy enough way to add Redis support via a custom store; the documentation even mentions it explicitly :slight_smile:
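In case it helps anyone, here’s a rough sketch of what such a custom store could look like. It’s not battle-tested, just the shape of the Plug.Session.Store behaviour, and it assumes a Redix connection running under a name I made up (:session_redis):

```elixir
defmodule MyApp.RedisSessionStore do
  @moduledoc "Sketch of a Redis-backed session store (assumes a Redix connection named :session_redis)."
  @behaviour Plug.Session.Store

  @impl true
  def init(opts), do: Keyword.get(opts, :ttl, 86_400)

  @impl true
  def get(_conn, sid, _ttl) do
    case Redix.command(:session_redis, ["GET", key(sid)]) do
      {:ok, data} when is_binary(data) -> {sid, :erlang.binary_to_term(data)}
      _ -> {nil, %{}}
    end
  end

  @impl true
  def put(conn, nil, data, ttl),
    do: put(conn, Base.url_encode64(:crypto.strong_rand_bytes(32), padding: false), data, ttl)

  def put(_conn, sid, data, ttl) do
    Redix.command!(:session_redis, [
      "SETEX",
      key(sid),
      Integer.to_string(ttl),
      :erlang.term_to_binary(data)
    ])

    sid
  end

  @impl true
  def delete(_conn, sid, _ttl) do
    Redix.command!(:session_redis, ["DEL", key(sid)])
    :ok
  end

  defp key(sid), do: "session:" <> sid
end
```

It would then be wired up with something like `plug Plug.Session, store: MyApp.RedisSessionStore, key: "_my_app_key", ttl: 86_400` in the endpoint.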

Thank you!

Every time I read about this it seems that there are a lot of gotchas and it’s seldom recommended for actual production use, especially not for something larger like a Phoenix codebase. I could be wrong though; it’s just the general feeling I’ve gotten while researching it.

Generally the answer to having no downtime is to never shut down completely. Running multiple instances with a rolling or blue/green deployment strategy will give you that. Given this works for any programming language, there’s a lot of experience held by people in the industry, not just in the BEAM community.

That one is hot code updates. You’re correct that this is usually not recommended, but not because it doesn’t work. The reason is that the previously mentioned solutions are often just good enough: you don’t need specifically BEAM-experienced people to handle them, and you don’t need the intimate knowledge of your runtime state that’s required to properly update it with hot code updates. Also, some updates may be too complex for a hot code update (say you overhauled the process tree structure), and then you suddenly need a fallback solution anyway.

So hot code updates are great; they’re just not necessarily an “it just works” type of solution.

The default for Plug.Session is to store its data in the session cookie on the client, so unless you change the app’s secret key nothing is lost.
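For reference, this is roughly what the stock endpoint setup looks like (key and salt values are placeholders): the whole session map is signed and shipped in the cookie, so a restart or redeploy doesn’t touch it.

```elixir
# In the Phoenix endpoint: the session lives client-side in a signed cookie.
# As long as secret_key_base and this salt stay the same, existing sessions
# keep working across restarts.
plug Plug.Session,
  store: :cookie,
  key: "_my_app_key",
  signing_salt: "CHANGE_ME"
```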

1 Like

Blue/green wouldn’t work with presence (or pub/sub where clients talk to each other); it would be the same “problem” as with reconnecting clients (just with a longer split duration). This is how I would normally do it, with the presence part separated from the application.
Thanks for the clarifications regarding hot code updates.

In general I’m not completely opposed to just killing the application, restarting, and having the users experience a small hiccup; it’s just a matter of curiosity, trying to understand if there are alternatives :slight_smile:

It depends on how “permanent” the connections need to be. There’s a difference between e.g. a casual chat and a video call. Also, blue/green doesn’t necessarily mean dropping clients outright. If what individual users do is a reasonably short-lived activity, you could also send new connections only to new nodes and drain the old ones over time.

1 Like

Our company was also looking into whether we could do this. We’re currently using TypeScript & NodeJS. After our research we decided it wasn’t worth it and went with a planned update window instead. We also show users notifications and a timer, starting 2 hours in advance, saying the server is going to have a short maintenance.

Also, you have to take into account that a Phoenix channel connection might drop anyway because someone’s internet connection goes down, especially on mobile connections. Just make sure your reconnection code is robust and I bet users won’t notice a small blip.

What we did at my previous company: nodes were in a cluster with libcluster. A deployment would spin up new nodes and they would be added to the cluster; once added, the old nodes would be shut down. And to keep the state, we had state hand-off with the help of Horde. Everything worked and no one ever noticed a deployment. Our LiveViews would just quickly reconnect to a new node and the state would already be there.
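Roughly, the moving parts look like this; a minimal sketch with made-up module names (MyApp.*), the Gossip clustering strategy, and a reasonably recent Horde for `members: :auto`:

```elixir
# config/runtime.exs: let libcluster form the cluster (Gossip works for
# simple setups; Kubernetes/DNS strategies exist for real deployments).
config :libcluster,
  topologies: [
    my_app: [strategy: Cluster.Strategy.Gossip]
  ]

# lib/my_app/application.ex
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Joins this node to the cluster
      {Cluster.Supervisor,
       [Application.get_env(:libcluster, :topologies), [name: MyApp.ClusterSupervisor]]},
      # Cluster-wide registry and supervisor; when a node leaves,
      # Horde restarts its processes on the remaining nodes
      {Horde.Registry, [name: MyApp.DistributedRegistry, keys: :unique, members: :auto]},
      {Horde.DynamicSupervisor,
       [name: MyApp.DistributedSupervisor, strategy: :one_for_one, members: :auto]},
      MyAppWeb.Endpoint
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

# Starting a (hypothetical) stateful process under the distributed supervisor;
# if its node is shut down, Horde restarts it on a surviving node.
Horde.DynamicSupervisor.start_child(
  MyApp.DistributedSupervisor,
  {MyApp.RoomServer, room_id: "42"}
)
```

Note that Horde restarts the process elsewhere but doesn’t copy its memory for you; the actual hand-off of the state is a separate step.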

5 Likes

That sounds awesome. A bit above my pay grade right now, but it sounds like it solves exactly what I want to solve, thanks!

I agree with @egze that Horde + libcluster is the way to go to preserve the server-side state that’s kept in memory between releases, and I do that in a couple of projects too. It’s great, but you shouldn’t absolutely rely on it happening, as this is the cloud and weird things happen in the cloud. So your system should start up properly even if the state wasn’t passed during a deployment for some weird reason.
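For the “don’t absolutely rely on it” part, the pattern looks roughly like the following. It’s a sketch with made-up names: MyApp.StateStore stands in for whatever outlives a single node (Redis, the database, another cluster-wide process), and the default state covers the case where the hand-off never happened.

```elixir
defmodule MyApp.RoomServer do
  @moduledoc "Sketch of a Horde-supervised process that hands its state off across deployments."
  use GenServer, restart: :transient

  def start_link(opts) do
    room_id = Keyword.fetch!(opts, :room_id)
    GenServer.start_link(__MODULE__, room_id, name: via(room_id))
  end

  defp via(room_id),
    do: {:via, Horde.Registry, {MyApp.DistributedRegistry, {:room, room_id}}}

  @impl true
  def init(room_id) do
    # Trap exits so terminate/2 runs on a graceful node shutdown
    Process.flag(:trap_exit, true)

    # Try to pick up state written by the previous incarnation on the old
    # node; fall back to a fresh state if the hand-off didn't happen.
    state = MyApp.StateStore.fetch(room_id) || %{room_id: room_id, participants: []}
    {:ok, state}
  end

  @impl true
  def terminate(_reason, state) do
    # Persist the in-memory state so the incarnation started on the new
    # node can restore it. Not guaranteed on a hard kill, hence the fallback.
    MyApp.StateStore.put(state.room_id, state)
    :ok
  end
end
```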

But this does not solve all problems. Most notably, if you are using Phoenix Channels and/or Phoenix LiveView, these connections still drop, and the state of a LiveView, for example, gets reset. I never got into solving the disconnections for websockets; I suspect it’s doable with some sort of websocket proxy, and maybe even Nginx can do this these days… But by default, on deployment, 100% of the clients drop/reset.

So this is the biggest problem if you have something like a registration form that doesn’t persist data anywhere and just keeps it in memory in the LiveView: it gets disconnected and starts from scratch after a deployment. By default, Horde + libcluster doesn’t solve this.

3 Likes