Clustered LiveView app web socket

Pardon the dumb question, but when the LiveView JavaScript establishes its web socket (apparently by upgrading the regular HTTP GET request of the /live endpoint to a bi-directional TCP socket), how does data sent to the server via the socket reach the process holding the session state (assigns, etc.)?

It all works well. I am just confused by concrete evidence of the page request being served from one pod while the LiveView socket connects to another pod in the cluster, i.e. two different BEAM instances on two different VMs.

There is no shared state on the server between those requests. The only shared state is the session (cookie/encoded in the markup).


More ignorance I suppose, but that reads like the session state isn’t maintained on the server after all. Instead the client actually “remembers” the state and sends it back to whatever server answers the call, which then reconstructs the session state from what was sent. After spending so much time debating how to control and optimise how much session memory to use server side, I’d feel a right arse if the above turns out to be true, meaning the server doesn’t keep session state at all.

Alternatively I am failing to distinguish accurately between the web socket mechanism and the session, in which case I am still in the dark about how the code handling the events pushed on the web socket ends up having access to the session data.

Or I am mistaken about assigns being part of session memory?

Thanks for helping out. It must seem trivial and/or like overthinking, but I have a lot riding on layers of software built on a possible misconception.

One thing to consider is that Elixir/Erlang/the BEAM natively supports clustering. Whether your multiple pods form a cluster from Elixir’s point of view depends on your configuration.

For the LiveView socket state, it is as you started your reasoning: the client request reaches your Kubernetes ingress and eventually a Bandit/Cowboy web server in one of the pods, the server returns the “disconnected” rendered HTML, the browser runs JS which connects the socket (it may land on any pod), the server upgrades the connection to a WebSocket, and the LiveView process and state live in the pod that served that second, “connected” request.

If you have the BEAM set up as a cluster, then processes on different pods can communicate, and for example exchange messages through PubSub.
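As a hedged sketch of what that cross-node messaging looks like (assuming a PubSub server named `MyApp.PubSub` started in the supervision tree, which is the Phoenix generator default; the topic and message are made up):

```elixir
# On pod A: any process (e.g. a LiveView in its mount) subscribes to a topic.
Phoenix.PubSub.subscribe(MyApp.PubSub, "orders:42")

# On pod B: a broadcast reaches subscribers on every connected node,
# because the PubSub adapter fans the message out across the cluster.
Phoenix.PubSub.broadcast(MyApp.PubSub, "orders:42", {:order_updated, 42})

# The subscriber on pod A then receives {:order_updated, 42} in handle_info/2,
# regardless of which node the broadcaster ran on.
```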


What he means is the LiveView is mounted and rendered twice, first as an HTML page and then again after the socket connects. Everything in your mount/2 callback will be called both times unless you guard the code with connected?/1. There is nothing shared between these renders, which is why you observe no issues with them running on different nodes.
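As a minimal sketch of that guard (module and query names are illustrative, not from this thread):

```elixir
# Guarding expensive work so it runs only on the second, "connected" mount.
# Orders.list_orders/0 stands in for any expensive query.
def mount(_params, _session, socket) do
  socket =
    if connected?(socket) do
      # WebSocket is up: do the real work
      assign(socket, :orders, Orders.list_orders())
    else
      # Dead render: cheap placeholder, no database hit
      assign(socket, :orders, [])
    end

  {:ok, socket}
end
```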

By session he’s referring to the thing actually called session in the docs/code, whereas you seem to be using the word “session” to refer to the assigns. The assigns are stored on the server as you would expect.

Indeed, and I believe LiveView takes advantage of distribution to route requests from the long polling fallback to the correct node seamlessly, which is a fantastic demonstration of what is so special about Elixir/Erlang :slight_smile:

I do load and configure libcluster (mode: dns, hence using a StatefulSet) in the stripped version too, but didn’t expect it to be playing any significant role in the LiveView magic. No PubSub in the stripped version.
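For reference, a DNS-based libcluster topology for a StatefulSet looks roughly like this (service and app names are placeholders; check the libcluster docs for the exact keys your version expects):

```elixir
# runtime.exs: poll the headless service's DNS records and connect
# to any pods that appear, forming the BEAM cluster automatically.
config :libcluster,
  topologies: [
    k8s_dns: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",
        application_name: "myapp",
        polling_interval: 5_000
      ]
    ]
  ]
```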

Yeah, I got that from many sources many times over, yet somehow I never saw a database query getting logged, or a debug statement I output to the console, in the part of the code that supposedly always runs twice for a LiveView page (or never noticed it, which is equally strange). I’ve been tracing, for unrelated reasons, quite closely through my code involved in setting up a LiveView session (the state, not the cookie or variable).

I get the rationale behind that, and how the documentation can be read to mean that, but for some mysterious reason I’m yet to come across something that confirms, for example, that the database query feeding the assigns is run a second time, so I’m having a rough time accepting that the two runs are identical.

There’s another oddity involved as well. By the code, the JS on the client appears to be sending the HTTP request it intends to upgrade to a web-socket to the /live route, which is indeed defined in endpoint.ex. The logs I have set up at the moment never show that path being accessed, but they might not be looking in the right places to see it.

It would explain a lot if the /live endpoint operates largely independently and takes all the detail it needs from the URL/cookie sent by the LV JS code, eventually resulting in the web socket connecting to the process where the state data for the session is kept, but that does not seem to be the explanation I’m getting here.

By your understanding and interpretation of the code and documentation, this double and identical execution of mount/2 would result in all the setup work, including the requisite database queries, being done twice (which I truly think I’d have picked up by now). But the code reveals that mount/2 has two heads: the first matches on a param called “token” being set, which strongly suggests that is the version that gets called when LiveView sets up the web socket, while the other head matches the regular code involving loading the assigns from the database etc.

Meaning that though mount/2 is most likely getting called twice, the two calls do entirely different things, share nothing (that isn’t contained in the token, URL and cookie values) and need not run on the same node.

I (naturally) lose track of the implied consequences of the token-matching mount/2 calling push_navigate with its (potentially updated) socket parameter, particularly if that contains an opportunity for the second process/pod/pid handling the web-socket setup to somehow locate the process/pod/pid the other mount/2 ran in and where the session has its state in memory. I’d still want and need to understand how that happens, but if that is the net result it would be a solid starting point.

At the moment the evidence I see supports the notion that somehow the process which responded to the original HTTP request (and ran the second version of mount/2) ends up reading and writing to the web-socket in the context of the state it keeps in memory relating to the session.

In principle I also favour explicit coding as opposed to magic things happening, but I’m not a stickler for it and can appreciate a well managed bit of magic doing wonders for me. It’s just that in this particular case I’m forced by circumstance to learn exactly how LiveView and its web-socket actually work as a starting point for matching the behaviour I see to the load balancer rules I am looking to assert control over.

It’s critically important because of the gap between bare metal Kubernetes and hosted Kubernetes, which are identical and compatible in most respects except for load balancing. Cloud providers typically offer their own load balancing that’s fully integrated into their Kubernetes products, while on bare metal you’re on your own with half-solutions like MetalLB, ingress controllers doubling as load balancers themselves, and a whole spectrum of CNIs getting involved as well.

My aim has always been and remains to retain portability between cloud hosted and bare-metal deployments of my application, to keep all my options open as I deal with varying growth rates. In that context my current mission is to come up with a load balancer solution for bare metal that mimics enough of the capabilities of the cloud providers’ load balancers for my needs. I’ve managed (largely) to do that using HAProxy external to the cluster, and now it’s down to defining the right rules for it that get the job done but for which there is also a clear mapping to the rules I can define in (each of) the cloud providers’ load balancers. I’ve not sought to, or managed to, make it “plug-compatible” with the cloud-based versions, firstly because it’s beyond my abilities to mess around with Kubernetes at the API level to automate LB-IPAM provisioning, and secondly because I’m given to the notion that everything should be made as simple as possible, and no simpler.

I was planning on writing down my current understanding of exactly what happens and asking for affirmation, refutation or correction of each individual step/statement. As soon as I am able to write those down without needing conditional statements and disclaimers, I’ll surely do that, but I’m not there yet.

Thank you for your efforts so far, and pretty please for what lies ahead. You guys are awesome.

If you’re not aware of it, it’s easy to miss duplicated logs when you have a lot of debug output :slight_smile: But the double render is definitely there! Start up your app locally and throw a dbg {"hello world", connected?(socket)} in your mount/3 and you’ll see it.

I think I’ve lost track of what you’re referring to here. mount/3 is a callback within a LiveView; the heads in question should be your code, no?

Double render I knew about all along, but the double database reads probably did get lost in the barrage of database reads I do anyway.

Right, you mentioned mount/2 but I was looking at mount/3, more specifically at this excerpt from what I recall to be a largely generated user_settings LiveView module, MyApp.UserSettingsLive.

  def mount(%{"token" => token}, _session, socket) do
    socket =
      case Accounts.update_user_email(socket.assigns.current_user, token) do
        :ok ->
          put_flash(socket, :info, "Email changed successfully.")

        :error ->
          put_flash(socket, :error, "Email change link is invalid or it has expired.")
      end

    {:ok, push_navigate(socket, to: ~p"/users/settings")}
  end

  def mount(_params, _session, socket) do
    user = socket.assigns.current_user
    email_changeset = Accounts.change_user_email(user)
    extras_changeset = Accounts.change_user_extras(user)
    password_changeset = Accounts.change_user_password(user)

    socket =
      socket
      |> assign(:current_password, nil)
      |> assign(:email_form_current_password, nil)
      |> assign(:extras_form_current_password, nil)
      |> assign(:current_email, user.email)
      |> assign(:email_form, to_form(email_changeset))
      |> assign(:extras_form, to_form(extras_changeset))
      |> assign(:password_form, to_form(password_changeset))
      |> assign(:trigger_submit, false)

    {:ok, socket}
  end

Despite the /2 vs /3 misnomer it seems we are talking about the same function after all, so I’ll assume we are.

Now, I’ve acknowledged from the start that I may well have missed the double database queries in the logs. Plus I’ve read about that life-cycle countless times online, as well as in the evolving Programming LiveView book I bought long ago. I have no specific issue with the double load at all, and I’ve now added some debug statements into both mount/3 heads but only managed to get the second one called, twice as expected. No idea when the first would match. I was thinking it would be during the web socket setup, which does set that parameter, but I wasn’t able to catch it in the act just yet. That would however be a third call to mount/3, if I still count correctly.

Still, I’ve not been able to close the gap in my understanding. If the HTTP request that gets upgraded can be served from another pod, cool. If during the upgrade the other pod is involved, cool, cool, but how? If the original process responding to the original request is abandoned/forgotten and the whole LiveView conversation carries on with the process on the pod that answered the second HTTP request, cool, but when and how are the resources the original process held, in anticipation of the web-socket landing back there, released and cleaned up? If the original process on the original pod doesn’t keep any resources but just terminates, could I see how it knows to do that?

It’s not in production yet, but in a pipelined version I am doing a lot more “prep” work per user at the start of a new session: expensive database operations I need for the initial render but would do well to avoid repeating straight away for the second render. To make smarter choices about that I would need a much better grasp of what my options are, why the default LiveView behaviour doesn’t do something along the same lines, and consequently what I’m getting myself into if I want the state loaded during the first render to become available to the second render.

I’m not there yet either; that’s still down the line. Right now I am still a little stuck finding correlation between what I expect to see based on the load balancing rules in HAProxy and the combined logs of the pods in a clustered deployment. The most confusing bit is that I’ve yet to see enough consistency in what gets served where, and from what point onwards, to be able to say whether it’s the pod reporting the websocket handshake or the pod serving the LiveView that ends up with all the web socket traffic from that point onwards.

For one, I don’t see (in the logs) two requests to the URL (in this case it would be /users/settings); I see only one. And I never see a request to the /live URL logged to the Phoenix console. I’ve assumed that the “[info] CONNECTED TO Phoenix.LiveView.Socket in 24µs” message was logged by the endpoint defined for /live in endpoint.ex, in lieu of the normal logging that prints out the route/path/url. That might be at the heart of my confusion, who knows. I’m just trying to make sense of what I’m seeing so I can confirm whether the rules and behaviour I am setting up are working as planned or not.

Remember that the original reason I went down this rabbit hole was that one of the pods disappearing while a web socket was open was picked up straight away by the client (which lost the connection) and, by the looks of it, by the server and its process management, but the software that served as load balancer at that point (and still does, actually, until I can roll out a fix) was the last to learn about the demise of the pod, insisted on sending traffic to it, and got forced into sending a bad gateway error to the client trying to reconnect. I’m assuming responsibility for this and not blaming it on a bug or bad behaviour or such. The whole clustered deployment was entirely my own initiative and I’d be surprised if this was the only mess I made in the process.

But I need to sort it out, which unfortunately means having to keep pestering people more clever and better informed than myself, because reading the raw code is not as definitive to me as it should be. I’m not really a programmer, nor a network engineer. I used to program in my youth and spent most of my career as a systems architect, but I’m really an inventor building a system to do something every single person I’ve mentioned it to concluded was impossible, so yeah, I’m in way over my head and not ashamed to admit it.
That’s why I need all the help I can get interpreting what the code actually does and how that matches up with what it sets out to do in support of what I intend to do.

Thanks again.

P.S. I finally managed to get the first mount/3 head to activate. Turns out the token referred to there is not the csrf_token I saw mentioned in the web socket setup call in the JS, but the actual user token generated by the auth system to authenticate the request that arrives from the URL sent to the user. I just happened to pick a test function that didn’t follow the usual pattern, which I suppose is to have only one head for the mount/3 function and no mention of the token put in the request by the JS. Sorry about that.

First off, yes, /2 in my original reply was a typo. Everything I wrote was referring to mount/3 (and everything you wrote as well, I think).

Anyway, I assume that is phx.gen.auth code. I don’t actually use Phoenix generators personally so I often forget what’s in them (even though I have read the templates).

The first argument to mount/3, as mentioned in the docs, is the params map. These params come from the URL, either /users/whatever/:token in the router or /users/whatever?token=foobar in the query params.

The purpose of the first mount/3 head there is to handle a particular function (updating the user’s email via a link with a token). After doing that it redirects to the actual settings page. Since it immediately redirects via push_navigate, that clause never renders an actual page.

This is a pretty bad example, pedagogically speaking, because most LiveViews don’t have an extra clause like that. The authors of this code were essentially trying to stuff some extra behavior into the LiveView module.

I’m not sure exactly where in the code to point to here (it’s probably many abstractions deep), but I would imagine it dies when it’s finished just like any other request which has been completed. It’s not needed for anything else. To turn your question around, why would you expect it to stay alive?

Ah, I see you pieced that together :slight_smile:

You can guard the queries with connected?/1, which means that the content will not be present in the dead render and will “pop in” once the live render goes off. This is fine for SPA type stuff but bad for SEO or other cases where you want non-JS clients to be able to read the page content. I believe the LiveView async assigns functionality supports a nice abstraction over this approach.
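A sketch of that async approach, assuming a recent LiveView version that provides `assign_async/3` (module and query names are illustrative):

```elixir
# The dead render ships immediately with a loading state; the query
# runs in a separate task and the result arrives once the socket is live.
def mount(_params, _session, socket) do
  {:ok,
   assign_async(socket, :orders, fn ->
     {:ok, %{orders: Orders.list_orders()}}
   end)}
end
```

In the template the result is typically consumed with the `<.async_result>` component, which can render loading and failure states.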

Alternatively you can cache the results of the queries in such a way that reading them again shortly afterwards is cheap. Any cache solution will do (a simple ets table would be my starting point, or maybe Cachex).
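A minimal read-through ETS cache along those lines (module name, TTL and key shape are all hypothetical):

```elixir
defmodule MyApp.SessionCache do
  @moduledoc "Tiny read-through cache so the second mount can reuse the first mount's query result (same node only)."
  @table :session_cache
  @ttl_ms 30_000

  # Create the table once at application start (e.g. in Application.start/2).
  def init do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
  end

  # Return a cached value if still fresh, otherwise run `fun`, store and return it.
  def fetch(key, fun) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, expires}] when expires > now ->
        value

      _ ->
        value = fun.()
        :ets.insert(@table, {key, value, now + @ttl_ms})
        value
    end
  end
end
```

Usage would be something like `MyApp.SessionCache.fetch({:prep, user.id}, fn -> expensive_prep(user) end)` in both mounts.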

If you’re caching locally on the node (which is much easier) then you probably want to configure the load balancer to be sticky so that the clients end up on the same node both times.
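In HAProxy terms, a hedged sketch of cookie-based stickiness (backend name, cookie name and addresses are placeholders):

```
backend phoenix_pods
    balance roundrobin
    # HAProxy inserts a cookie naming the chosen server, so the dead
    # render and the subsequent /live connection hit the same pod.
    cookie SRVID insert indirect nocache
    server pod1 10.0.0.11:4000 check cookie pod1
    server pod2 10.0.0.12:4000 check cookie pod2
```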

All fair and true once I get there, but I really haven’t pieced together the most crucial part, which is what exactly the sequence of requests, calls, messages and events is that results in the web-socket (I hear I should actually say Channel, as the abstraction over either a web-socket or a long-poll setup, depending on what the client and the network in between can handle) connecting the JS in the client and the process in the server pod, where the state is presented as the socket parameter to a variety of callback functions and event handlers that return a modified version of it. I recognise that as gen_server-style behaviour, so I’m not overly confused by state as a parameter, but if the fact that the app runs across multiple nodes is in play in any way, manually or automagically, I need to know about it as the basis for everything else I figure on doing.

It’s fine to say Elixir is designed to run on clusters and imply I shouldn’t overthink it and let the magic do its thing, but unfortunately that hasn’t worked out all that brilliantly for me of late, so now I am compelled to figure out exactly what enables it. There ultimately has to be a difference between a single-node BEAM and a cluster of BEAM nodes, and part of my concern is the number of people calling BEAM clustering, being fully meshed, way too chatty (meaning network intensive) for its own good and certainly too much for WAN links.

My intent is to grow beyond a single cluster over WAN links, so if I don’t know how the clustering gets involved, if at all, then I don’t stand a chance of keeping intra-cluster and inter-cluster comms separate, and that would be a costly mistake.

So, I have a proposal. Tell me if the following statements are 100% accurate and, if not, what adjustments they need to be accurate. Once we’ve agreed on what should be accurate, I shall do my best to confirm that the behaviour I see matches that, or bring you proof that it doesn’t.

  1. a LiveView page’s two renders call the same callbacks, but as part of two very different algorithms.
  2. the two renders build the same data in memory independently of each other, which means they can run on different nodes without any coordination or communication between the nodes.
  3. One of the processes dies as soon as the page content has been assembled and sent to the client, despite being part of a LiveView setup.
  4. The other process, aware of its role in the LiveView life cycle, keeps the socket open (perhaps without even sending the rendered data) and waits for the JS on the client to upgrade the socket to a web-socket or to engage long polling.
  5. Once the socket is established, the second version of the process listens for messages from the client on the socket and processes them using application callbacks, to which it passes the assigns and other socket attributes derived during the duplicate startup processing as the socket parameter, which serves as the state “variable”.
  6. When the network layer changes the state of the socket itself, either because the connection was reset or terminated from the client side, the socket handler either participates in an attempted reconnection or closes the socket as well and abandons the state it had kept in memory.
  7. When the JS code detects a change in the socket state, it attempts to reconnect, failing which it reports a problem and waits for the user to reload the page, which re-initialises everything from scratch in a brand new session.

It remains to be seen whether that is accurate or not, but if it is, then from where I’m sitting right now it appears it would have been a rather trivial exercise to avoid the double render and the associated double data loading and processing. Especially if the second copy doesn’t really send the initially rendered content but only the subsequent deltas calculated from it. If that turns out to be true I will go reread the rationale in the lifecycle descriptions another thousand times to figure out what I missed that made it a bad idea to do the hard work only once.

I wanted to note another important fact about the LV architecture: you don’t get double mount/3 and double database queries and everything else all the time.

That only happens at the initial request entering a “live session”. Navigation (push_navigate and push_patch) within a live session (that you declare in router.ex) goes over the WebSocket connection, only sends minimal HTML diffs and in those cases mount/3 is called only once (and connected? would always return true in those cases).
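For reference, a live session is declared in router.ex roughly like this (scope and module names are illustrative, in the style of phx.gen.auth):

```elixir
# Navigating between routes in the same live_session with
# push_navigate/push_patch reuses the existing WebSocket, so only the
# connected mount/3 runs, with no second dead render.
live_session :require_authenticated_user,
  on_mount: [{MyAppWeb.UserAuth, :ensure_authenticated}] do
  live "/users/settings", UserSettingsLive, :edit
  live "/orders", OrderLive.Index, :index
end
```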

You’re on the right track to piece everything together!

Just so we’re clear here, I am not a maintainer and I don’t have infinitely deep knowledge of LV internals. Maybe someone who is/has those things can comment, but it’s not me.

With that said, your understanding sounds mostly correct to me up to here:

A WebSocket is a TCP connection. TCP will attempt to retransmit lost packets to some degree, but if the connection fails I don’t think there is any further reconnect logic on the LiveView side.

I believe what would happen is the TCP connection would close, the LiveView process would die, and the LiveView client would attempt to reconnect and go through the entire mounting dance from the beginning. I’m not sure if the reconnect would result in two mounts or one, but my guess is it would be one (the socket connection) since you already have an HTML page loaded.

You should run some experiments and report back. I don’t think any of this is documented in full detail but it would be nice.

If the socket is down LV adds a CSS class to the page which can trigger display of a “reconnecting” message. What exactly is displayed is up to you, but there is a default in the generated layouts which is what you’re seeing.

I think the client just tries to reconnect. It’s not waiting for the user to do anything.

I know, that’s the trouble with forum discussions I suppose, and English. No distinction between you (singular, personally) and you (plural, collectively, or whomever the shoe fits). It’s too demanding if you @-address someone specific, and if you don’t then eyes tend to glaze over as soon as the conversation gets a bit tricky. Thanks for staying engaged even when it’s outside your comfort zone.

Agreed, which is a protocol that needs to be complied with, with consequences when you don’t. No harm, no foul. How to respond when the lower layers are done with a socket is an application layer question, so in this case what I meant was that there are two possibilities. Either the demise of the socket leads to the entire LiveView session having to be set up from scratch with another server, or the application layer repeats parts of what it does during setup to attempt to replace the broken socket with a new one. Seeing that the client side explicitly tries to reconnect, I stated the assumption that the server side participates in reconnecting the socket to the same process if possible, rather than tearing everything down every time and starting another process by default. I don’t know what the truth is, but I am hoping someone does and is willing to tell me. I don’t even have an opinion about what the better way would be: rescuing the state in memory might seem the more noble approach but may be the worst idea, because failing hard and early actually restores functionality the quickest and avoids a boatload of complicated rescue code that never gets fully battle tested, or ends up becoming part of “normal” without anyone noticing the failures it’s compensating for. You know, the age-old argument about defensive programming vs failing as early as any unexpected condition occurs so the higher-level code can try again with a clean state.

Thanks for the confirmation but that is how I had it too.

Yes, it tries to reconnect automatically, and often does when the session has been inactive or on a background tab for a long time, but that wasn’t the case I referred to. I was talking about the situation where the client-based reconnection times out, so fails. I’m not sure if it will keep trying periodically, how often or triggered by what, or whether, once an attempt to reconnect has failed (timeout or error response), it gives up, displays the error and never tries again unless the user reloads. That seemed to be the behaviour I saw for myself, and got user feedback about, when my current load balancing arrangement results in the bad gateway errors.

I’d like to:

  1. clarify that this wasn’t meant as criticism, and
  2. venture a guess as to why the double render approach won out.

My guess is that it had to do with getting the JS code required to play its part in upgrading the socket ready for action in time. To get a socket, the client first issues a regular GET request and then asks that it be upgraded to a socket. If that initial GET request were the standard request issued by the browser when you navigate to a URL, it would ultimately mean that the code that needs to ask for the upgrade can only do so once it has arrived at the browser, which on high-latency connections with big initial page loads could make it unrealistic to expect the code to activate in time to catch the browser’s initial request and make the upgrade request on it. It was likely far more predictable to let the original request run its course to completion as usual and then, once the JS is in place, for it to issue its own, second GET request which it can reliably expect to get upgraded if the network allows.

If that is more or less true, it would mean a second GET request will always be the safer option. But I am still curious about what would prevent using the platform’s capabilities to see to it that the process responding to the second HTTP GET request can build the state variables it requires to service the socket the LiveView way from the work done by the original request. I understand the double work is only in play for new page requests, and maybe my sample is still a bit skewed, but it still seems to me those “initial” page loads happen often enough to consider alternatives.

I am fairly confident the server does not hold on to any state after the socket closes due to disconnection. The client will receive a new server process when it reconnects. But there is always some chance I am wrong, at least in some edge case, and someone will come along and correct me. The nature of forums indeed :slight_smile:

As for the client, it will keep trying to reconnect for quite a while, maybe forever(?). Note that there is a bug in Firefox which causes an enormous backoff (like, a minute or two) to build up if the server is down for a few minutes. I’ve seen it discussed on here before.
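The retry schedule is in the client’s hands: the socket options accept a `reconnectAfterMs(tries)` callback, so an app can cap the backoff rather than live with the default schedule. A sketch (the cap value is arbitrary):

```javascript
// Custom backoff: grow quickly at first, but never wait more than 10s
// between reconnect attempts, to avoid a runaway backoff building up.
function reconnectAfterMs(tries) {
  const schedule = [100, 250, 500, 1000, 2000, 5000];
  return schedule[tries - 1] || 10000;
}

// Passed when constructing the socket, e.g.
// new LiveSocket("/live", Socket, {reconnectAfterMs, params: {...}})
```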

The reason the double render exists, from what I understand, is that LiveView was originally an addon feature to Phoenix, a framework which already had routes and pipelines and so on. As time has gone on LiveView has become much more powerful, acquired much of that pipeline functionality for itself, and has become somewhat of a flagship feature. The double render is, essentially, tech debt. Hopefully someone will put the work in eventually to pay it off, but it won’t be easy and someone has to actually do that work. It would probably even mean breaking changes.

In hindsight having two request pipelines essentially glued together doing double the work is quite absurd. If you were designing everything from scratch you probably wouldn’t do that. But things are rarely designed from scratch!

Can you confirm that it’s the server that initiates the upgrade to socket, or is that done from the client side as I’ve been assuming until now?

The WebSocket connection always initiates on the client.
Without JavaScript code in your app.js, there won’t be any WebSocket connection established.
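For context, the generated app.js wiring looks roughly like this (paths and option names as in a standard Phoenix project; treat it as a sketch, not your exact file):

```javascript
import {Socket} from "phoenix"
import {LiveSocket} from "phoenix_live_view"

// The CSRF token is read from the meta tag the layout renders.
const csrfToken = document
  .querySelector("meta[name='csrf-token']")
  .getAttribute("content")

// Nothing connects until this code runs in the browser; the client
// always initiates. Recent LiveView versions also accept a
// longPollFallbackMs option to fall back to long polling.
const liveSocket = new LiveSocket("/live", Socket, {
  params: {_csrf_token: csrfToken}
})

liveSocket.connect()
```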

The client requests that the connection be upgraded to a WebSocket, then the server needs to do its part in the protocol (which includes the option of refusing the upgrade).
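At the wire level the handshake (per RFC 6455) looks roughly like this. Note that, if I read the client internals correctly, the Phoenix JS appends the transport name to the path, so the request goes to /live/websocket rather than plain /live, which may be why the latter never shows up in route logs (the key/accept values below are the RFC’s example pair):

```
GET /live/websocket?_csrf_token=...&vsn=2.0.0 HTTP/1.1
Host: example.com
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```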

The above is agnostic to server technology, thus not specific to Elixir/Phoenix/LiveView.

Notes:

  • Unless you’re serving client traffic directly from your BEAM-based web server, there are typically reverse proxies between clients and your server. That’s the case with Kubernetes, the case if you proxy traffic through Cloudflare, and many other possible setups. In those cases, it’s worth knowing that what you may think of as one WS connection between client/browser and server/application is, in fact, 2 or more connections, as many as there are hops between client and server. I say that because when you think about reconnections, those hops are worth considering.
  • I think somewhere in this thread I saw the word “Channel”. WebSocket is just one of the options for how to make LiveView (built on top of Phoenix Channels) work. The system is made to be transport-agnostic; a Long Poll transport ships out of the box and can be used when, for any reason, the client is unable to establish a WS connection. The intention to connect/reconnect, and what transport mechanism to use, is always initiated from the client/JS.

Thanks, that’s how I understood it, and why I got confused when you stated the server upgrades the connection to a socket. My understanding now is that the upgrade is requested by the client, but the server has to agree and could decline.