LiveView client crashes with 502 Bad Gateway Error after server restart

When deploying an updated Phoenix release to AWS, existing LiveView clients attempt to reconnect for a short while, but then abruptly throw a 502 Bad Gateway error before the new release is running properly.

In the browser console, with debugging enabled, the following error flashes briefly before a 502 error page is loaded:

Is there a way to configure the client to keep polling for longer? Or some other strategy for doing rolling deploys better?


Hey @petrus-jvrensburg you’ll have to elaborate on how you are doing your deploys in AWS. In general, yes, you want to make sure the new releases are up and running and properly hooked up to the load balancer to receive traffic before winding down the old instances. How you do that, though, is generally going to depend on your deployment approach.


Thanks Ben,

I guess my deployment approach is a bit naive: I’m using a single server, and running through the following steps for each deploy:

  1. upload the new release (next to the current release)
  2. stop the current release
  3. start the new release
  4. clear out the old release
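The four steps above might look something like the following sketch. The paths, version numbers, and the `current` symlink are assumptions, and the stop/start commands are left as comments since they depend on the release; the window between steps 2 and 3 is exactly where a proxy would start returning 502s:

```shell
#!/bin/sh
# Sketch of the stop-then-start deploy described above. Paths, version
# numbers, and the my_app release name are hypothetical; adjust to your layout.
set -eu

APP_HOME="${APP_HOME:-$(mktemp -d)}"
OLD="$APP_HOME/releases/0.1.0"
NEW="$APP_HOME/releases/0.1.1"

# 1. upload the new release next to the current one
#    (e.g. rsync the unpacked tarball into "$NEW"; simulated here)
mkdir -p "$OLD" "$NEW"

# 2. stop the current release:
#    "$OLD/bin/my_app" stop

# 3. start the new release, repointing a "current" symlink first so
#    service files never hard-code the version:
ln -sfn "$NEW" "$APP_HOME/current"
#    "$APP_HOME/current/bin/my_app" daemon

# 4. clear out the old release
rm -rf "$OLD"

readlink "$APP_HOME/current"
```

Between `stop` and `daemon` the app is completely offline, which is what the rest of this thread is about.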

But I’m not sure that something like a Blue/Green deployment would fix the issue either: I can replicate the same 502 Bad Gateway error just by restarting the running release (my_app/bin/my_app restart).

Do you think it could be an issue with how the sockets are being terminated?

I would expect the LiveView to stay in place and keep trying to reconnect ad infinitum, but instead it reloads the page as soon as the server process terminates, and then just sits there displaying the error, even after the server is back up.


Ah gotcha. That is surprising; in my experience it retries indefinitely. If you open your JS console, do you see any JS errors?

Notably, restarting the app will still take it completely offline for a brief moment, which means your reverse proxy or load balancer or whatever will return a 502, even if only briefly.


Yes, before the page reloads to show the 502 Error, the console flashes:

```
destroyed: the child has been removed from the parent
join: encountered 0 consecutive reloads
```

I usually ran into this error when running nginx in a reverse proxy configuration. Sometimes it would work without, but generally you have to reload nginx after each new deploy.
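For context, a minimal nginx reverse-proxy fragment for a setup like this might look as follows; the upstream name and port are assumptions, and the `Upgrade`/`Connection` headers are what LiveView's WebSocket needs to pass through:

```nginx
# Hypothetical fragment; upstream name and port are assumptions.
upstream phoenix {
    server 127.0.0.1:4000;
}

server {
    listen 80;

    location / {
        proxy_pass http://phoenix;
        # Required for LiveView's WebSocket upgrade
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

After a deploy, the configuration can be reloaded with `nginx -s reload` without dropping the listener.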


LiveView will try reconnecting forever, but we treat a crashed join (crashed LV mount) in a special way. We consider the following case unrecoverable on the client:

  1. WebSocket connection is established
  2. LiveView mount repeatedly crashes with a 500 error

We consider this unrecoverable, so we failsafe refresh the page to take the user through the full HTTP flow, which will at least potentially show them a fail whale page. It sounds like your deployment setup allows the proxy to issue successful 101 WebSocket upgrades before the app is actually running? Then I guess it eats the WebSocket messages? It’s not clear, but LV is behaving properly in this case in my opinion, so hopefully that helps you narrow things down.
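The failsafe described above can be sketched as a small counter of consecutive failed joins; the class name, threshold, and callback are illustrative assumptions, not LiveView's actual internals:

```javascript
// Sketch of the client-side failsafe described above: after a few
// consecutive failed joins over an otherwise-working WebSocket, give up
// and fall back to a full page reload. Names and the threshold here are
// illustrative, not LiveView's actual internals.
class JoinFailsafe {
  constructor(maxConsecutiveReloads, reloadPage) {
    this.maxConsecutiveReloads = maxConsecutiveReloads
    this.reloadPage = reloadPage // e.g. () => window.location.reload()
    this.consecutiveFailures = 0
  }

  joinOk() {
    // A successful mount resets the counter
    this.consecutiveFailures = 0
  }

  joinCrashed() {
    this.consecutiveFailures++
    if (this.consecutiveFailures >= this.maxConsecutiveReloads) {
      // Deemed unrecoverable on the client: go through the full HTTP
      // flow, which is where a proxy's 502 page would be shown.
      this.reloadPage()
    }
  }
}
```

The point is that the hard refresh is deliberate: repeated mount crashes are treated as something only a full page load can fix.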


Yes, but shouldn’t it be possible to restart the running release without banishing all the connected LiveView clients to a 502 Bad Gateway page? Or at least only show that page after some reasonable timeout?

I was thinking that it might be a misconfiguration of the AWS Load Balancer that’s causing the clients to error-out, but I haven’t figured it out yet.

If a hard refresh of the page is triggering a 502 bad gateway, that means any visitor to your app during this phase will also be treated to a 502 bad gateway page, so this is necessarily something on your ops side to fix.

Your deploy has downtime. There is no way to avoid a 502 if you have a period of time with 0 active servers.


Okay thanks. So it is a feature, not a bug. I get that.

Wouldn’t it make sense to keep the client alive when there is a 502 Bad Gateway error specifically?

It differs from other ‘server errors’ in the sense that it usually isn’t issued by the server at all, but by the load balancer / reverse proxy when the server is unavailable. So even if the client keeps polling to retry indefinitely, it doesn’t flood the logs in the way that other server errors (which may be occurring on mount) would.

And the upside would be that you could restart a running release without all of the connected clients being violently disconnected and then shown a static error screen. They would only see an error screen if they tried to refresh the page at the exact moment that the release is being restarted, which is to be expected.

I would be happy to work on a PR for this if it makes sense.

There’s no guarantee the error ever goes away again. Your server could crash and fail to restart.

Yes, but that is not a problem in itself. The LiveView could hang around in a disconnected state, until the user refreshes or navigates away.

That is the behaviour that I’m seeing in dev, so it came as a surprise to me that in production a graceful shutdown of the server behaves differently.

What I am seeing when restarting my prod release on AWS behind a load balancer is that the graceful shutdown of the server process triggers a hard refresh on the client, at precisely the worst possible time. The server can’t respond, since it’s being restarted, so the load balancer issues a 502 and the client is stuck on that static error page, with no ability to reconnect when the server is back up.

So from a user’s perspective: they navigate to my website, they leave the tab open for hours or days, and if at any point I restart the server (or deploy an upgrade), the page they were looking at turns into a 502 error page, without them even touching it.

Wouldn’t it make sense to handle a graceful shutdown / disconnect differently from a generic server error?


I found a workaround that keeps the client in failsafe mode for longer while the server restarts:

There are two parameters that can be set when initialising the LiveSocket in assets/js/app.js to set a minimum and maximum bound on the time spent in failsafe mode, e.g.:

```javascript
let liveSocket = new LiveSocket("/live", Socket, {
  params: { _csrf_token: csrfToken },
  hooks: Hooks,
  reloadJitterMin: 15000,
  reloadJitterMax: 20000
})
```

If reloadJitterMin is larger than the time it takes for the server to come back up after going down (because of a restart or a new deploy), then the client re-establishes the connection, avoiding the 502 Bad Gateway error.

The default value for reloadJitterMin is currently 5000 and for reloadJitterMax it is 10000 (they are specified here).
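The effect of the two bounds can be illustrated with a small sketch; this is an assumption about the behaviour (a uniformly random delay between the bounds), not a reproduction of LiveView's exact code:

```javascript
// Illustrative sketch (not LiveView's exact implementation): pick a
// failsafe reload delay uniformly between the configured jitter bounds.
function reloadDelay(reloadJitterMin, reloadJitterMax) {
  return reloadJitterMin + Math.random() * (reloadJitterMax - reloadJitterMin)
}

// With the defaults mentioned above, the failsafe reload lands
// somewhere between 5 and 10 seconds after the client gives up;
// with the workaround's 15000/20000, between 15 and 20 seconds.
const delay = reloadDelay(15000, 20000)
```

Raising the minimum bound simply buys the restarting server more time before the client performs the hard refresh that runs into the 502.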