Liveview - retry / reload loop on long queries - tracing/debugging

This is really kicking my ass …

I have a strange issue in my app where the liveview page is in a loop (retrying) ~ 10 secs. This is due to a long/expensive query …

For all the debugs / logging I have enabled, there isn’t a single error showing (crash or timeout etc.). I think this is in the murky area of the cowboy_plug and integration into phoenix. There are a couple of posts that I have found that talk about this (including the view that this behavior is a feature) :

Turning on the cowboy_plug telemetry showed me start and stop events, however, no timeouts. Can anyone point me in the right direction on how to log/trace ? I feel a bit helpless without the right tracing/logging …

BTW> I already have the phoenix dashboard, browser console, erlang observer, logging level to debug in phoenix, ecto and anything else I could find, telemetry events from ecto and cowboy, plenty of IO.inspects … but something is still missing …

Thanks in advance …

Yikes. Not to be a bummer, but you might have to do this the hard way. I also recently had an issue trying to debug latency on Liveview and could not find the problem. I ended up resorting to the tried and true print/IO.inspect/removing components until the issue was isolated.

In this case, you may have to do the inspects inside of cowboy or plug to figure out where it is.

Also, is it only a single page that is the problem? Could there be a way to potentially re-write the page so that it does the query on update instead of mount? Perhaps that could solve it? This means the page will have already connected the websocket and so it will avoid this particular issue.

The default timeout of joining the underlying channel is 10s. You can override this in the LiveSocket constructor, new LiveSocket(..., Socket, {timeout: 15000}) but you almost certainly don’t want to. With LiveView, we can run work asynchronously and report progress/not lock the UI so locking the UI for long queries doesn’t make a lot of sense. You also don’t want to start maxing time timeouts because overload scenarios usually start piling up and you want the client to fail early and retry later (with backoff as it does automatically). So I would leave the default timeout and perform the potentially latent work in a task that updates the LV when complete.

2 Likes

For posterity, if you liveSocket.enableDebug() on the client it will show something like:

[Log] phx-FwxFxJedsNgB-w8h error: unable to join -  – {reason: "timeout"} (app.js, line 12588)

In this scenario.

1 Like

I am suspecting this is because of payload size or time. When I cut the number of rows by 1/2 in my table, it reduces the time of the queries to < 10 secs. It then works. Obviously, this will be refactored/redesigned, BUT, my question is where is the error ? Is there anything I can enable in logging so I can see where this is happening … it feels like some timeout or size limit is being exceeded somewhere ?

Also, the exponential backoff doesn’t seem to be happening … it just keeps retrying with the same 10 sec cadence. It would be great to get some clue as to how to approach the debugging so, I can forensically understand the behavior …

Sorry … the timeout/retries appear to be a 15 sec cadence (no exponential delay) … I used my iphone lap timer to get some empirical data … that is the tooling I am reduced to using :wink:

Ok. I think I am getting my bearings … this is in app.js / (browser/client side behavior). So, phoenix (server side doesn’t really know what is going on). Hence nothing to log …

What I love about phoenix is that it just works ! And even newbies like me can build something with it.

I have no clue how to go about - “… and perform the potentially latent work in a task that updates the LV when complete” … but will learn/understand. Thanks for the pointer Chris.

For completeness of the thread so it may be useful for others - here is something that talks about how to design the interactions

2 Likes

So there are two main calls for liveview.

  1. mount
  2. update / handle_params

So the trick here is not to do any work in the mount, which will allow the client to connect without the timeout, but will show a page without the data populated. Post-mount the client will call back to the server on the websocket and run the update function.

https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.html#module-life-cycle

The easiest way is using Tasks and handle_info

https://hexdocs.pm/elixir/1.13/Task.html
https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.html#c:handle_info/2

  def handle_event("fetch-data", _params, socket) do
    self = self()

    Task.start_link(fn ->
      send(self, :working)
      # do long task here
      :timer.sleep(5000)
      send(self, {:task_done, "Here's the data"})
    end)

    {:noreply, socket}
  end

  def handle_info(:working, socket) do
    {:noreply, assign(socket, data: "Fetching data")}
  end

  def handle_info({:task_done, data}, socket) do
    {:noreply, assign(socket, data: data)}
  end
4 Likes

There’s no need to reinvent how results are sent back to the parent process. Using tasks in OTP processes is documented (not just for the happy path) and works without the Task API as well: Task — Elixir v1.13.4

2 Likes

TIL, thanks :slight_smile:

1 Like

I had gone down the path of using the update handler, however, opened up the rabbit hole of the “double” queries due to the mount called twice. There are a number of threads/discussions on this. There are suggestions on how to avoid calling queries twice using the if connected?(socket) do thing in the mount handle, however, your suggestion of doing this in handle_params (also called twice) with the ‘if connected?(socket) do’ check may work. I will test this.

I think at the end of the day, the pointer from Thomas/Chris is the way to go, but thank you for your suggestions.

There appear to be a number of ways to do this, however, as a learning reference I am using the live_dashboard code - https://github.com/phoenixframework/phoenix_live_dashboard although,
There is a level of design sophistication in this piece of code that is way way beyond my current skill level and understanding of elixir/phoenix.

connected?(socket) is a plain old function. You can call it wherever you need it (and you have access to socket).

Ok. I can confirm using the update handler doesn’t work … same timeout issue. This makes sense as any data fetch calls (synchronous) are still subject to the same client side timeout constraints (app.js - Livesocket) …

So, the only solution for long data loads is really async calls i.e. Tasks.

1 Like

How interesting. If that’s the case, you may need to also do async push from the server. I’m not sure if you’ll run into the same issue if you use handle_event.

You may need to use handle_info and push_event.

https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.html#push_event/3
https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.html#c:handle_info/2

Pulling it together … this is what I have so far that works nicely …

  • no double queries
  • avoids the timeout restrictions on the LiveView sockets (client/browser side)

  def mount(%{} = _params, _session, socket) do
    # do the minimal work here

    {:ok, socket}
  end

  def handle_params(payload, url, socket) do
    if connected?(socket) do  # avoids the double query due to mount() called twice in LiveView
      self = self() # Task.start_link is doing a fork so new pid, hence remember self (parent_pid)

      # This is a push to client/browser so, this isn't subject to the app.js connection timeout
      Task.start_link(fn ->
        send(self, :working)  # use this to update a spinner etc on web page ..
        data = fetch_data()   # expensive data fetch
        send(self, {:task_done, data})
        socket
      end)
    end
    {:noreply, socket}
  end


  @impl true
  def handle_info(:working, socket) do
    # For purposes of updating page spinner, data loading status etc.
    {:noreply, assign(socket, data_loading: true)}
  end

  def handle_info({:task_done, data}, socket) do
    socket =
      socket
      |> assign(data)
      |> assign(data_loading: false)

    {:noreply, socket}
  end

I am still trying to unpack LostKobrakai’s comment on not having to reinvent how data is passed back i.e. use GenServer/OTP tasks natively … what is confusing me is the layering.

Tasks is really cool and opened up pandora’s box for me. I can see how I can “parallelize” my client requests/queries now … there is a level of detail to this however, ie. I need to understand how “isolation” works for each client’s Tasks i.e. is it all within one big application pool of Tasks (crashes, concurrency control, rate limiting for a specific client etc.) or can it be subdivided/named/traceable. I think this is all taken care of / solved … really cool stuff !

Thank you all so much for the pointers/direction and the patience.

Yes, this is definitely a neat method @thomas.fortes . I might have to play around with this myself.

It is just that you don’t need to send info inside the task, it sends automatically, you can replace my example code with this:

  def handle_event("fetch-data", _params, socket) do
    send(self(), :working)

    Task.async(fn ->
      :timer.sleep(5000)
      "Here's the data"
    end)

    {:noreply, socket}
  end

  def handle_info(:working, socket) do
    {:noreply, assign(socket, data: "Fetching data")}
  end

  # this will be called if the task succeeds
  def handle_info({_ref, data}, socket) do
    {:noreply, assign(socket, data: data)}
  end

  # Here you handle the task failure
  def handle_info({:DOWN, _ref, _, _, _reason}, socket) do
    {:noreply, socket}
  end

If the task called using Task.async crashes it will bring the caller (the liveview) down with it, the doc has suggestions for how to deal with it, but it is a starting point :slight_smile:

Failure handling looks a bit tricky here … handle_info was called twice.

*** HANDLE INFO : task completed <= handle_info({_ref, data}, socket) do
*** HANDLE INFO : task failure [normal] <= handle_info({:DOWN, _ref, _, _, _reason}, socket)

Probably an easier way to interpret the callbacks (the docs are precise but tricky)


*** HANDLE INFO : task results              <= handle_info({_ref, data}, socket) 
*** HANDLE INFO : task ended [normal] <= handle_info({:DOWN, _ref, _, _, _reason}, socket)