Production Phoenix LiveView on cloud-agnostic Kubernetes

That's why distribution adds latency. Node foo computes the first assigns and, assuming your state has some cache locality, starts some running processes (in production this matters a lot more than the LV community assumes). Then node bar gets the websocket request and has to decide whether to send messages to the state running on foo or to move/respawn those processes locally.
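To make that concrete, here's a minimal Elixir sketch with hypothetical names (`MyApp.SessionServer` and the `{:session, id}` global key are made up, not part of LiveView). It shows the choice bar faces when the session process already lives on foo: make cross-node calls to it, or respawn the state locally and pay the startup cost again.

```elixir
defmodule MyApp.SessionLocator do
  @moduledoc "Illustrative only; not part of LiveView."

  # Find the session process wherever it lives in the cluster, or start it here.
  def fetch_state(session_id) do
    case :global.whereis_name({:session, session_id}) do
      :undefined ->
        # No process anywhere in the cluster: pay the startup/assign cost on this node.
        {:ok, pid} = MyApp.SessionServer.start(session_id)
        GenServer.call(pid, :get_state)

      pid ->
        # Process exists (possibly on foo): reuse it, at the price of a cross-node hop.
        GenServer.call(pid, :get_state)
    end
  end
end
```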

Reconnects are another place latency gets added if requests aren't pinned to the same machine, because of assign recomputation and process startup time.

If the entirety of your state for a session is small, none of this matters. However, that's only the case in younger applications.

That's a footgun you built into your application then, though. None of that happens because of anything LiveView does between the static request and the later websocket connection.

Yes, there is a certain amount of time needed to gather all the assigns based on the session values. But that happens if you reconnect to the same node as well. There's no benefit to connecting to the same node over a different node, unless you build additional caches into your system. But at that point you need sticky sessions because of your caching setup, not because LV inherently requires them.
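For completeness, the usual way to keep that cost off the dead render is to defer the heavy assigns until the socket is connected. A minimal sketch, where `MyAppWeb.DashboardLive` and `load_dashboard/1` are made-up stand-ins for whatever expensive session-based work you do:

```elixir
defmodule MyAppWeb.DashboardLive do
  use Phoenix.LiveView

  # mount/3 runs twice: once for the static (dead) render over HTTP and once
  # when the websocket connects. connected?/1 lets us skip the heavy work the
  # first time, regardless of which node handles which request.
  def mount(_params, %{"user_id" => user_id}, socket) do
    socket =
      if connected?(socket) do
        assign(socket, :dashboard, load_dashboard(user_id))
      else
        assign(socket, :dashboard, :loading)
      end

    {:ok, socket}
  end

  def render(assigns) do
    ~H"""
    <div><%= inspect(@dashboard) %></div>
    """
  end

  # Hypothetical stand-in for "gathering all the assigns based on the session values".
  defp load_dashboard(_user_id), do: %{widgets: []}
end
```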

I think all of that, combined, makes pinning sessions to the same node best practice for all LiveView applications.

Most web sessions are very short, on the order of seconds; people bounce to the page and then away. That means a large percentage of your visitors will be going through the conversion from the static dead view to the websocket often. It's not just about long-running processes (though that's why I feel it so intensely in my current project); database and cache connections take time to set up. So hitting a new node without a warm TCP connection to the cache or DB shards you need will significantly affect your long tail. Tail latency matters a lot more than you might think.

Then add on that most web sessions happen on a phone. Phones have highly unpredictable latency that often looks like a disconnect. Developers testing with phones on the same network, or maybe on the same cell tower, don't see the effect this has.

There are a lot of assumptions in those statements. While I would agree that they're true for some use cases, they're by no means absolute truths.

That's why Ecto keeps a pool of database connections. Each of your nodes will have multiple warm TCP connections to your database.
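Concretely, that's just the repo's pool configuration. A typical snippet (app name, env var names, and values are illustrative), where each node keeps `pool_size` persistent connections open and reuses them across requests:

```elixir
# config/runtime.exs
import Config

config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  # Each node in the cluster keeps this many warm TCP connections to the database.
  pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10")
```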

I'd argue that if that's your target group, then I'm not sure LV is the technology for you. There's no benefit in a technology enabling interactivity if there's no time for any interaction before the user is gone.

Again, maybe not a prime spot for LV, but at that point anything but an offline-capable, service-worker-driven JS app will struggle. Maybe QUIC or WebTransport will eventually make this less of a problem, but where there's no connection, there simply is no connection.

Bounce around where? If your project consists only of LiveViews, you can navigate between them without dropping the initial websocket connection.
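For reference, that's what `navigate` gives you. A small fragment you'd drop into an existing LiveView, assuming Phoenix 1.7-style verified routes, a generated `MyAppWeb` module, and a hypothetical `/orders` LiveView in the same `live_session`:

```elixir
defmodule MyAppWeb.HomeLive do
  use MyAppWeb, :live_view

  # In a template the equivalent is <.link navigate={~p"/orders"}>Orders</.link>.
  def handle_event("go_to_orders", _params, socket) do
    # Navigating to another LiveView in the same live_session reuses the
    # existing websocket connection; only a full page load opens a new one.
    {:noreply, push_navigate(socket, to: ~p"/orders")}
  end
end
```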

Your friendly neighborhood database engineer (/wave) will likely run upgrades or maintenance that cause TCP sockets to go stale. Also, remember this gets worse as you add shards and read/write replicas; there's a limit on the number of connections we can keep open. The downstream microservice will have a new version deployed to Kubernetes, requiring new SSL connections to be negotiated. Etc. From experience, one of the largest sources of tail latency is this kind of forced reconnect over a stale connection.

After node foo does the first dead view, those connections will be set up and warm. We likely paid the cost for them in the first computation of the assigns. Now, if we go to bar, there's a lower chance that it already has all of the connections needed for exactly this user's database shard, this page's cache shards, and so on. This is the cache-locality effect of mapping requests to TCP connections.

Mobile traffic is above 50% of all view time, broadly measured. In some countries it's above 80%, since the phone is the one and only computer connected to the internet.

Yes, to some degree it's a hard goal: you need client-side support to ensure that reconnects are handled well, and a timeout/LRU mechanism to expire processes for disconnected sessions. We're not that far off in LiveView.

My ideal wish would be that the first render started a LV process with a session ID and left it running for a while; then, when the transition to the websocket happened, we'd reuse the likely-still-running process. If the process isn't there, redo the mount/render. The nice part is that reconnects can be handled the same way. LiveView processes that aren't actively connected can be treated as a cache resource, with LRU and lifetime timeouts as the tuning knobs. Production can trade off memory for latency, leaving extra processes running to save recomputes/re-renders down the road. My experience has been that being able to tune memory against latency is a pretty huge win.
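LiveView doesn't work this way today, so treat the following purely as a sketch of the idea, with every name made up: a per-session process that registers under the session ID, does the expensive work once, and shuts itself down after an idle timeout, so a websocket upgrade or reconnect inside that window reattaches instead of recomputing.

```elixir
defmodule MyApp.SessionCache do
  @moduledoc "Hypothetical sketch only; not how LiveView behaves today."
  use GenServer

  # Idle lifetime is the memory-vs-latency tuning knob discussed above.
  @idle_timeout :timer.minutes(5)

  # Requires a Registry started in the supervision tree:
  #   {Registry, keys: :unique, name: MyApp.SessionRegistry}
  def attach(session_id) do
    case GenServer.start(__MODULE__, session_id, name: via(session_id)) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
    end
  end

  def get_assigns(session_id) do
    {:ok, pid} = attach(session_id)
    GenServer.call(pid, :get_assigns)
  end

  defp via(session_id), do: {:via, Registry, {MyApp.SessionRegistry, session_id}}

  @impl true
  def init(session_id) do
    # The expensive "first render" work happens exactly once per session here.
    {:ok, %{session_id: session_id, assigns: compute_assigns(session_id)}, @idle_timeout}
  end

  @impl true
  def handle_call(:get_assigns, _from, state) do
    # Any attach/reconnect resets the idle clock.
    {:reply, state.assigns, state, @idle_timeout}
  end

  @impl true
  def handle_info(:timeout, state) do
    # Nobody reattached within @idle_timeout: give the memory back.
    {:stop, :normal, state}
  end

  defp compute_assigns(_session_id), do: %{}
end
```

A real version would supervise these processes and add proper LRU bookkeeping, but the idle timeout alone already gives you the memory/latency dial described above.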

Bounce out of your application to something/somewhere else. The web visit for even the top websites is measured in minutes. Most web properties will get attention on the order of seconds per day per user. Users these days will not spend many minutes before getting sidetracked, jumping to a different task, getting frustrated, etc. Yes, it's short, and it means many new connections are the norm, even for long-running reactive web properties.

Today I was working in the AWS console. I checked something, navigating through ~10 pages in 2 minutes, then closed the tab and went off to something else. That's not far outside the norm: 10 page loads, with the first being the one where LiveView has to render/recompute twice and where cache locality matters. So roughly 1 in 10 page views, 10%, is made faster or better by cache locality. That 10% is also the first experience, so time to first real interactivity matters a lot in users' perception.

That might have been true of the sad history of the internet as perverted by commercial interests, but in the future I'd want to help co-create, those sites would die out along with the dinosaurs that created them. Websites or pages that can deliver their full end-user value in an engagement lasting mere seconds should be hosted (if they must exist) on S3 via CloudFront, not in Phoenix with LiveView. I, for one, and I'm sure most of the people working with Phoenix and LiveView, would hope to create sites that add enough value to keep users constructively engaged for hours if not days at a time.

I’d regard that as a fundamental flaw in the design of the app - it doesn’t add sufficient value, period.

You seem to be ignoring that LV backends may, and often do, have native mobile application front ends running in parallel with the web application. One backend, multiple front ends.

But even without mobile traffic, just this forum alone counters your argument. You yourself, like many of us, have spent significant time on here debating this issue. Go cater to the old stateless web-based advertising canvas all you want, but don't expect us to use your product if you're not able and willing to adapt to a new way of doing things.

I'm not in love with the anger in this thread, so I'll hop out of here. I am not the stateless-services person you assume I am.

Remember, you hijacked a conversation I started, promoting your wares, by saying:

Yet you then proceeded to contradict everything we've been trying to tell you about Phoenix and LiveView. We're not in your target audience, so if you built your product for Phoenix developers, you've completely misread the need and opportunity. If you built it for other ecosystems, you've completely failed to take into account how Phoenix and LiveView have been redefining how apps are written.

Read the room, buddy.

We are built on Phoenix LiveView, so the technical routing discussion comes from building an application that runs on and with those things (and from history with some of the world's most sophisticated reactive web systems). It wasn't about selling to people using Phoenix; I'll clarify that next time. Thanks.