LiveView infinite reconnection on errors

sanderson1 · May 13, 2020, 2:53pm

Hello,

I’m wondering if anyone can help me with my problem.

I have the following LiveView code (this is just an example to show the problem):

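defmodule LiveViewErrorsWeb.PageLive do
  use LiveViewErrorsWeb, :live_view

  def mount(_, _, socket) do
    # Schedule a :foo message to this LiveView process in 10 seconds.
    Process.send_after(self(), :foo, 10_000)
    {:ok, socket}
  end

  def handle_info(:foo, socket) do
    # Deliberately crash the process to reproduce the problem.
    nil.foo()
    {:noreply, socket}
  end
end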

On mount, the LiveView schedules a message to itself for 10 seconds later. The handler for that message raises an error, causing the process to die. LiveView helpfully tries to reconnect, but since mount is then called again, the same error happens again 10 seconds later.

In our real life scenario, we have some old data in the database, which doesn’t have the required fields that newer data does. So when someone goes to view one of these old records, we load the data but the rendering fails because one or more fields are not present. Obviously we can just fix the bug, but the problem is that while we haven’t fixed all of the bugs, if the user leaves their browser window open, we get an infinite stream of error messages in the logs, from the LiveView failing and then reconnecting.

This pollutes the logs, and is also causing us issues with error tracking software where a single person leaving a LiveView tab open can fill up our quota of errors quite quickly.

I’ve looked into various ways around this, but I can’t seem to figure out a way to handle it nicely, whilst also preserving the useful feature that the LiveView will reconnect if there is a network error, or a redeployment on the server side, etc.

Does anyone know if this is possible? The only thing I can currently think of is to write a hook in JavaScript that would check if we’ve reconnected more than N times in the last T time period. If so, I could somehow stop it from reconnecting. However, it would be nicer if I could do this in Elixir in the mount function, so that I could then show an error message instead of running the error-producing code again.
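
A minimal sketch of the "more than N times in the last T time period" check, written as plain JavaScript with no LiveView dependency. The class name `ReconnectLimiter` and its parameters are hypothetical, introduced only for illustration; how it would be wired to the LiveView client is an assumption and is not shown as code.

```javascript
// Sliding-window counter (hypothetical helper, not part of LiveView).
// Reports when more than `maxAttempts` attempts have been recorded
// within the last `windowMs` milliseconds.
class ReconnectLimiter {
  constructor(maxAttempts, windowMs, now = () => Date.now()) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
    this.now = now; // injectable clock, which makes the class easy to test
    this.timestamps = [];
  }

  // Record one reconnect attempt.
  // Returns true once the limit for the current window is exceeded.
  record() {
    const t = this.now();
    this.timestamps.push(t);
    // Keep only the attempts that are still inside the window.
    const cutoff = t - this.windowMs;
    this.timestamps = this.timestamps.filter((ts) => ts > cutoff);
    return this.timestamps.length > this.maxAttempts;
  }
}
```

In a browser, `record()` could be called from whatever callback fires when the channel errors or rejoins, and a `true` result could trigger a disconnect plus an error message. Because this state lives in memory, it survives channel rejoins but is reset by a full page reload.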

Thanks for any help you can give!
Steve

chrismccord · May 13, 2020, 2:57pm

This is a feature. We recover on channel crash. In your particular case, you expect the crash to happen 100% of the time, but systems can fail for all kinds of reasons, so the client should not treat a channel exit as a permanent failure. We use exponential backoff on retries, but we will indeed keep retrying. You should fix the bug, and you should be able to filter errors on whatever tool you’re using.

chrismccord · May 13, 2020, 3:03pm

To expand, the true fix is handling this incomplete data in your app, but something as simple as wrapping your expected error conditions in the LV and redirecting would also fix it. Ideally you should already know in mount whether you have valid data, or post-populate the invalid data into valid data, but as a stopgap you can isolate the potentially erroneous calls. There’s nothing to be done on the LV side in this case.

sanderson1 · May 13, 2020, 3:13pm

Hey, thanks for the reply!

I get that it’s a feature and that it’s a good thing to do, but I feel like maybe you’re focussing too much on this specific issue. Ok in this case it’s due to some dodgy data. But in general there are going to be bugs in code that we can’t know of ahead of time. In a normal Phoenix app, for example, the user would see a 500 page, which is bad, but at least it’s clear that an error has happened. In the Live View world, the user ends up seeing a constantly refreshing page, with no indication that an error has actually happened (I know you can style the div to show there has been an error, but it doesn’t give them any indication of when to stop waiting). And it’s also not ideal that a bug you never knew existed is discovered by someone at 11pm, and by the morning your logs are completely full and we’ve used up all our error tracking credits.

I think in general my counter point is it’s impossible to know ahead of time what my “expected error conditions” are, or where my “potentially erroneous calls” are, otherwise I would have fixed them already, if that makes sense? It could be for example that the database has gone down, and I’d like to actually show the user a message rather than a constantly refreshing page. Ideally I’d like to be able to say “ok, we’ve retried this page 10 times in the last minute, and it’s failed every time, let’s just show the user an error message and ask them to come back later”.

I’ve noticed there is an internal field called join_ref in the LiveView process state, which seems to be the kind of thing I could potentially use for this kind of generic error handling, but haven’t been able to find a way to get access to it via the socket.

Thanks again for your help!

chrismccord · May 13, 2020, 3:43pm

The same goes for users of any application, though, whether they are clicking a stateless HTTP form submit or using a JS app that sends an AJAX call into the abyss. The LV approach is to only refresh the page on a failed websocket mount, and only up to a maximum of 10 times. In all other cases, such as a handle_info causing an exit, we only rejoin the channel without refreshing. So you may experience more error logs than a stateless HTTP app, but it’s not that different compared to a user refreshing the page to retry, or a user spamming a failed submit button that appears to not work in a SPA.

chrismccord · May 13, 2020, 3:50pm

By refreshing the page, the standard HTTP request flow takes place, and when your page 500s, your application error page will be shown. The only possible exception is if you have an error after connecting, in which case it can happen at any time, and it would be very difficult to come up with what constitutes the “we’ve retried”. In your case, it’s a send_after that is not based on user action, so how do you specify that as an “attempted action”? The mount was successful and everything was good until it wasn’t. There is nothing we can quantify safely to say that “10 channel crashes over the span of 5 minutes” means stop the app.

sanderson1 · May 13, 2020, 3:55pm

Hey thanks again for the reply!

I’m not sure I agree with that. If a JS app sends off an AJAX call, we can tell if the call returns a 500, or any other error code, and I could also program in a timeout so that if nothing happens within a certain amount of time, the user is notified of that. I agree it would be a bad user experience if you just click a button and nothing happens. But I’m trying to make a good user experience. Anyway, I fear this is going off topic a bit.

In response to your second message: I think I may have confused things a bit by saying “refresh the page”, sorry, that’s my sloppy terminology. In real life what we have is that on mount() we kick off a process to load some stuff that takes a long time to load, and put an is_loading key into the socket. Then the view updates to show the user that it’s loading, and at some point in the future, when the loading is done, the “loading process” sends a message back to the LiveView with the stuff that it loaded. The LiveView then updates the socket and tries to re-render. In this specific case, the render fails because the data doesn’t have a required key. So LiveView sees the process has died, and re-mounts the LiveView, which kicks off the loading again. Hopefully that makes sense, and sorry for the confusion.

I’m not suggesting that LiveView itself does anything to handle this by the way, it’s just in our circumstances we’d like to be able to stop retrying after a certain number of failures. But it sounds like that’s not possible at the moment.

Thanks again for all your help.

And it goes without saying thanks for all your work on Phoenix, we use it extensively and it’s awesome!

chrismccord · May 13, 2020, 3:56pm

I’ll close out by saying that this error log situation applies to any application at scale. A stateless HTTP API with a bug in its hot code paths will flood the logs. The best-behaving external clients might back off their requests, but they will continue to consume the API regardless, so I understand your logs may be noisy, but this is not inherently different from any buggy deploy at scale.

chrismccord · May 13, 2020, 4:04pm

Thanks!

A final thought for consideration: we’ve had the opposite reports about recovery, where folks deploy a bad change and are then elated to find that the LV recovers automatically, and fast enough that users are often not even impeded. But that’s the issue in this case. In your use case, a mount succeeds but a later crash constitutes a permanent failure; in other cases the LV may be entirely usable, and only some particular user interaction crashes it. So it’s impossible to automatically say what actually constitutes “Sorry, this page is broken. Try again later”, because it’s app specific. The folks who push a bug in one feature absolutely want recovery, so that users can continue to use the app’s other working features. See the issue?

sanderson1 · May 13, 2020, 4:12pm

Yeh I totally get what you’re saying there.

I guess my point is that it would be nice if I could decide for myself when to stop retrying, but I don’t have the information available to my code (except possibly in the JS).

Anyway thanks again for your help.

karolsluszniak · October 2, 2020, 11:16am

Hello,

A specific case of infinite reconnection is when the origin check fails. I’m wondering if this behavior is proper in that case. I understand that we should never let an arbitrary domain connect, for security reasons, but what if, for instance, a user just wants to use Google Translate to translate our website? Maybe we could abandon the LiveView connection attempt entirely and just leave the user with server-side-rendered content, but content that’s not refreshing every N seconds.

What do you think, @chrismccord?

And, since I didn’t have a chance yet - thanks for the amazing, incredible work on Phoenix and LV!

Schultzer · October 2, 2020, 2:05pm

There is also the case where a user unknowingly has programs installed that block WebSockets. This results in 426 errors and a failed handshake, which sends LiveView into an infinite loop, and AFAIK you wouldn’t be able to detect this from the server side.

karolsluszniak · October 2, 2020, 2:54pm

It would be great if we could provide a customized fallback experience for these cases. For instance, LV could add a class like phx-disconnected and offer a JS API for reconnecting. This would allow us to:

- inform the user about the issue, e.g. .phx-never-connected .never-connected-alert { display: block }
- give them an action button to act on it, e.g. <a onclick="window.liveSocket.reconnect()" ...>
- disable/hide UI elements that don’t make sense with a disconnected LV, e.g. .phx-never-connected a[phx-click] { opacity: 0.5; pointer-events: none }
- still allow them to read the page and use regular HTML/JS (live_redirect links could just work like regular links in this case)
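
A rough sketch of the bookkeeping such a fallback would need, assuming the client exposes "connection opened" and "connection failed" callbacks. The `ConnectionTracker` class is hypothetical, and `phx-never-connected` and `window.liveSocket.reconnect()` are this post's proposed names, not an existing LiveView API.

```javascript
// Tracks whether a connection has ever succeeded (hypothetical helper).
// `maxFailures` is the number of failed attempts to tolerate before
// treating the page as "never connected".
class ConnectionTracker {
  constructor(maxFailures) {
    this.maxFailures = maxFailures;
    this.everConnected = false;
    this.failures = 0;
  }

  // Call when the socket opens successfully.
  onOpen() {
    this.everConnected = true;
    this.failures = 0;
  }

  // Call when a connection attempt fails.
  // Failures after a successful connection are not counted here.
  onError() {
    if (!this.everConnected) this.failures += 1;
  }

  // True once enough attempts have failed without a single success.
  neverConnected() {
    return !this.everConnected && this.failures >= this.maxFailures;
  }
}
```

In a browser, the result of `neverConnected()` could drive something like `document.body.classList.toggle("phx-never-connected", tracker.neverConnected())`, which would make the CSS rules listed above take effect.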

From my perspective it seems that UX is a far greater concern than log pollution here.

karolsluszniak · October 7, 2020, 5:49pm

After giving it some thought, and in the process of considering fault tolerance options for an LV project approaching production, I’ve stumbled upon yet another case here.

It’s when the Phoenix server suffers downtime but we have a CDN in front of it, capable of serving cached HTML and assets. Here too it would be great to get out of the user’s way and offer an unobtrusive way to initiate the connection. This seems like a great backup plan for cases like server memory overload, which I consider likely to happen e.g. when a high-traffic live view gets deployed without temporary_assigns in place.

As it is, with hardcoded reloads, I think the only way to get there with this or either of the above cases would be to create a separate preliminary ws connection to see if the actual connection has a chance of success, but that would be inefficient, buggy and unreliable.

DaAnalyst · April 28, 2021, 4:13pm

Chris, by “exponential backoff on retries” here, do you mean that you intentionally keep lengthening the interval between LV’s reconnection attempts after each failure? That would explain why it takes “forever” to reconnect when I restart the Phoenix server a minute or so after I shut it down.

If that is so, is there a way for us to configure this “exponential backoff” to become some regular time interval instead?
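
To make the difference concrete, here is a small sketch contrasting a capped exponential backoff with a fixed interval. The delay values are purely illustrative and are not LiveView’s actual schedule; both function names are hypothetical. The Phoenix JS Socket documents `reconnectAfterMs` and `rejoinAfterMs` options that take a function of the attempt count, so functions of this shape may be usable there, but that should be checked against the version in use.

```javascript
// Capped exponential backoff: the delay doubles on every attempt
// (1s, 2s, 4s, ...) until it reaches `maxMs`. Values are illustrative.
function exponentialBackoff(tries, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** (tries - 1));
}

// Fixed interval: the same delay no matter how many attempts have failed.
function fixedInterval(_tries, intervalMs = 2000) {
  return intervalMs;
}
```

With exponential backoff, a server that has been down for a minute may not be retried again for a while, which matches the slow reconnect described above. A fixed interval reconnects sooner on average, but generates more traffic while the server is down.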

hauks96 · January 11, 2023, 1:20pm

Shouldn’t the error just be propagated to the respective error handlers if the mount fails intentionally? For example, I raise an exception when the page is reached and some required data does not exist in the database. I don’t really see any use case for reloading on a mount error in that case, as I want the error handled by custom error handling views.