Scaling LiveView for low latency in multiple regions

te_chris · November 12, 2020, 11:22am

So we’re using LiveView in production and it’s mostly been fantastic. BUT, we’re UK based and serving the UK market - from GCP europe-west2, i.e. London. We’ve got devs in NZ currently, and they have said that the site is, as imagined, really slow from there.

What are peoples thoughts on scaling LiveView across multiple regions? It’s not an immediate problem for us, but it was a known since we decided to adopt it - thus far I think the productivity and quality of the UK experience has paid off. We’re currently using GKE, running stateless elixir pods, nothing shared or distributed between. This is working well for us and hasn’t given us any problems with the sockets. Given this, my first thought is just expand the Kube cluster into other regions and use node affinity and google’s global load balancer routing to sort it out…but I haven’t tried this or similar yet.

Has anyone had experience of this? It feels like the final hurdle for LiveView in some ways, so be good to know how people have handled it and whether it’s been smooth

joshtaylor · November 12, 2020, 11:46am

What backing store do you have? When you start to geolocate your nodes, you’re then changing the latency from User<->Server to User<->"Edge Server"<->Database Server (for example).

LostKobrakai · November 12, 2020, 11:58am

It this actually a liveview specific problem? I’d imagine you’d have the latency for every other tool as well. The only thing truely liveview related I can see being problematic in such a setting is the initial double request of http + websocket connection.

te_chris · November 12, 2020, 12:03pm

Good point, I had considered that but forgot to write it…so yeah, assume at least read replicas in regions too, I guess? Postgres is DB. We’re e-commerce so not write heavy and can eat the latency on puchases etc to a main master.

te_chris · November 12, 2020, 12:09pm

Maybe not LiveView specific, as yeah, any client-server application will suffer latency, but as LiveView mixes UI and data in what it returns from the server, the experience could be worse than an SPA calling just for data.

Another reason why I feel like it’s a bit special for Elixir, is due to BEAM and OTP being able to be distributed, but I don’t have much experience with that so am unsure if help or hindrance.

LostKobrakai · November 12, 2020, 12:17pm

Generally payload sizes should at least be comparable, but I can see a difference in e.g. loading a big chunk of items and paginating purely client side, vs. being easy on the initial request, but going back to the server for sorting/filtering and such.

50kudos · November 14, 2020, 2:30pm

I’m surprised this question didn’t come up earily and I even wondered do elixir devs actually serve app worldwide? Demo-ing liveview on localhost:4000 is not impressive. And I only hear stories about single region cluster.

Ok, real answer: I just use fly.io (alternative: appfleet.com), though they are not cheap, so I have my prod there, and staging on DO’s PaaS (the app platform) where db and app server sit together. The experience so far is that those edge servers are actually valuable, fast and is suitable for liveview. But the edge server <-> db distance is still tricky. So I read from edge server and response first, then (send self and handle_info) read,merge,write back that result to db. Mostly the former response and the one written to db is consistent. If it’s conflict later, just respond error follow former optimistic ui.

I also plan to add something like Cachex in prod for read (for optimistic ui), and still write back to db. My logic is not near db, it supposes to have legit result in order to write (to db), otherwise, response error or blow up, to maintain data consistency.

For worldwide liveview users with smooth UX/UI, I’d love to hear about other choices as well. I think most folks just target local users, on-prem, or latency expectation that is slightly better than read/write db for every request.

Qqwy · November 14, 2020, 4:36pm

Distribution does not magically solve the problem of a high-latency connection from someone far away; you’ll still need to use e.g. a load-balancer in front (or maybe region-specific DNS settings, or both?) to make sure people try to connect to the server closes to them.

Distribution does however allow people to e.g. be notified of each-other’s changes without having to do a database round-trip, which might be very valuable depending on what you are building.

For the initial page load, LiveView does support that you load it e.g. through a CDN.

And besides this: In general (of course it depends on your particular application) I think it is a good idea for a growing interactive application to decouple the database from your application logic as much as possible; in essence: do not make requests to the DB block usage of the UI, but perform them asynchronously and display the results whenever they become available (even if this takes a second or more). This ensures that the UI experience is smooth even when results take a while to be fetched.

josevalim · November 14, 2020, 6:18pm

Correct. This is not a LiveView specific problem and it is going to happen on whatever app running between EU<->NZ. The LiveView specifics to this discussion are:

The initial request is rendered twice: once with a regular HTTP requests and another one over WebSockets
All upcoming live_patch/live_redirect happen on the established connection. This improves UX because we don’t send the layout again nor the browser has to load it again (this is similar to what you get with SPA, Turbolinks, unpoly, etc)
LiveView automatically caches and reuses templates in the client - so it sends less data than other server-rendered HTML solutions (similar to what you get with SPA)
LiveView runs on WebSockets, which means we don’t need to parse headers, authenticate user on the DB and so on on every request which improves response times (you can get similar with SPA if you are running on WebSockets)

So besides the initial request, LiveView should be helping with the user experience, but the discussion is definitely more general. So it probably makes more sense to open the discussion beyond the context of LiveView.

EDIT: Oh, you can also call liveSocket.enableLatencySim(200) in your browser console to have LiveView simulate latency so you can see which part of your app is not providing proper UX under high latencies.

50kudos · November 14, 2020, 7:02pm

I wonder can we do something like this on mount to distinguish first connect from reconnect
(It’s more likely there is a reason this can’t be, but i didn’t know):

socket = 
  if get_connect_params(socket)["_mounts"] == 0, do: mark_no_render(socket), else: socket

josevalim · November 14, 2020, 7:05pm

If the goal is to optimize at this level, I think it is easier to skip the “disconnected render” and render something like “Loading…” (or nothing) instead. You can do so if your pages are private (i.e. they require login) or based on user agent, etc. Although I haven’t really seen a need for this in practice yet.

methyl · November 16, 2020, 8:09am

So besides the initial request, LiveView should be helping with the user experience, but the discussion is definitely more general.

@josevalim that’s of course if you are comparing server-side rendered templates served via standard HTTP requests. I think that’s valid, but another point of comparison is SPA, where latency doesn’t matter until you actually need to make a server round trip to save or load some data.

In LiveView you need roundtrips on each UI state change (if you live up to the promise of not writing JS code), that’s why latency is a bigger issue in LiveView comparing to standard React apps.

I’d love to hear what’s your angle on this, since that’s one of the biggest showstoppers for LiveView IMO.

LostKobrakai · November 16, 2020, 8:13am

Imo this is not really a promise LiveView tries to make, exactly for that reason. It’s always been marketed to solve (primarily) those problems, where the server needs to be involved anyways. As mentioned before there is overlap (e.g. in pagination), but generally one should use client side tooling for purely client side interactions. Nobody wants latency in opening dropdowns or simple accordions.

josevalim · November 16, 2020, 8:23am

Exactly what @LostKobrakai said. In fact, the docs even say you should not be using LiveView for UI-only state changes. To quote directly:

animations, menus, and general events that do not need the server in the first place are a bad fit for LiveView

CSS, support for hooks and even the recent callback for integration with Alpine.js are alternative mechanisms so you don’t funnel UI-only behaviour through the server.

methyl · November 16, 2020, 9:50am

That makes perfect sense to me, thanks the clarification.

I guess this misunderstanding comes from how LiveView is described in some of the sources, that it’s a replacement for JavaScript when it comes to building web apps:

As an application developer, you don’t need to write a single line of JavaScript to create these kinds of experiences

Phoenix LiveView leverages server-rendered HTML and Phoenix’s native WebSocket tooling so you can build fancy real-time features without all that complicated JavaScript. If you’re sick to death of writing JS (I had a bad day with Redux, don’t ask), then this is the library for you!

te_chris · November 27, 2020, 2:25pm

Interesting! Thanks for contributing, Jose.

Apologies for taking so long to reply, I got lost in a data engineering hole…

I think our problem is clearly that we’ve fallen too much in love with LiveView. We get away with it because we’re UK only, so latency is fine, but will need to refactor. We just really like the programming model.

For some context, we used to lean heavily on react, and now we’ve almost replaced all of our react with liveview, except for our main configurator (https://www.stitched.co.uk/product-builder?type=curtains for reference). We’re doing some new UX though, and the intention is that this process will be modified and the new version will be LiveView. We definitely drank the koolaid. I feel a blogpost coming on, if someone on my team (probably me) ever finds the time

As a programming model, it’s unlocked a lot of productivity for us, even if we’ve clearly abused it by building things like our nav in it. I appreciate the points raised in this discussion, and can see an audit in the not too distant future where we assess how we’re using LiveView and move a few things back to the client. If anything though, this ability for progressive enhancement just shows the strength of the programming model and is reassuring me that we made a good choice. Thanks for taking part, everyone, has thoroughly answered my questions!

entone · November 27, 2020, 9:55pm

I think your original idea is solid, you may want to look at CockroachDB as a possible backend replacement, line compatible with Postgres, but built around a geo-replicated KV store.

There are a few gotchas with the migration which this gist covers pretty well.

gist.github.com

https://gist.github.com/cohawk/df29c1c54abd858dd19d8327e862822a

README.md

Over the weekend I spun up a new Elixir :ecto, "2.2.7" project that I was able to get working with CockroachDB and this adapter fork - with some caveats I wanted to share to help others.


1.  [Only the `root` user can create databases](https://www.cockroachlabs.com/docs/stable/create-database.html#required-privileges)
This requires you configure the `Ecto.Adapters.Postgres` `username` as `root` or else the `mix ecto.create` command will always fail.  You can go back and change your configured `username` to something else after the database has been created, or create your database and user permissions using `cockroach sql` and skip the `mix ecto.create` command.

2. Configuring Ecto primary_key ID to be created by CockroachDB
By default when configuring your `Ecto.Schema` using `autogenerate: false` it appears either CockroachDB, Ecto or the Postrex adapter (I did not investigate this) uses the `BIGINT` `unique_rowid()` function as the default value for IDs
  ```elixir
  @primary_key {:id, :id, autogenerate: false, read_after_writes: true}

This file has been truncated. show original

The latest versions of Cockroach work fine with Ecto as well.

wtd423 · December 2, 2020, 6:49pm

With two servers, one in NZ and other in UK how would one deploy a new version of the application to both servers? Can Phoenix Presence also work between the two servers so all users can talk to each other?

When using the region DNS solution and it suddenly switches the server for a user do they stay logged in?

dch · December 3, 2020, 4:10pm

This starts to be a Very Tricky Thing To Solve… normally one would use anycast and ECMP BGP to provide the same external IP address from many different physical locations. In general the internet will do the right thing, even if a server goes down, so long as you have load balancers and your backend syncing between regions appropriately.

This is not a trivial setup, but once you have it working, its very very cool. anycast + eBGP + load balancers is how all the big companies build their cloud services anyway, so it’s very much worth spending the time to understand how this works, even if you don’t actually directly use it.

wanton7 · December 3, 2020, 5:00pm

Do you need zero downtime? Just take take all apps offline same time and update all regions and then put them back online? There is also Cosmos DB if you use Azure and you should be able to use it through its REST API from Elixir. https://docs.microsoft.com/en-us/azure/cosmos-db/distribute-data-globally

I’m not sure if it would be even possible so that you would have all regions with different subdomain like uk.app.com, nz.app.com. Your app’s domain would be app.com that would only have static site app that would check region and do websocket connections to correct subdomain. So from users point of view they would be connected to app.com but actual websocket connections would be connecting to uk.app.com or nz.app.com.