More Telemetry Events for LiveView

The story behind this proposal

Recently I gave a talk at a meetup about improving the performance of Phoenix applications, and the example app was a LiveView app. One of the problems I created was an N+1 query, with the new update_many/1 callback on LiveComponents as the example solution. The demo showed the “discovery” phase of performance issues by looking at OpenTelemetry traces (I was using OpentelemetryLiveView, but now OpentelemetryPhoenix also has some LiveView spans in its new, unreleased version).

The problem I noticed was that the N+1 was only clear on full renders, where OpentelemetryPhoenix creates an HTTP server span around everything. When navigating through live patches, the problem wasn’t obvious because all you got was a bunch of “rogue” Ecto spans in an isolated trace.

What does this have to do with Phoenix LiveView itself? Well, these OTEL libraries build their OTEL spans on top of :telemetry spans. Looking at the Phoenix LiveView code base, I realised a few more :telemetry spans would be helpful, so this proposal lists a few ideas and where to put them.

More callback spans

I first want to start with more callback events:

  • [:phoenix, :live_view | :live_component, :update | :update_many]
    • To help find where a query is coming from and solve an N+1 problem, it might be good to wrap the calls to update / update_many (a rough sketch follows this list).
    • It should be easy, since the only place that calls them is Phoenix.LiveView.Utils.maybe_call_update!/3.
  • [:phoenix, :live_view | :live_component, :render]
    • A typical N+1 problem is caused by rendering many live components that each do a query in their update. Having the “wrapper” render span be a parent of the “update” spans would let us spot the N+1.
    • For LiveView, this would require wrapping both the Utils.to_rendered and Diff.render calls. We could wrap the whole render_diff function in Phoenix.LiveView.Channel, or at least the true branch of the if (when force == true or when something changed). That would encapsulate all the nested update/render spans.
    • For live components, rendering happens inside Diff.render, in the Diff.component_to_rendered function, so it should be easy to wrap that in a span as well.
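
To make the update_many case concrete, here is a minimal sketch of how the span could be emitted around the existing callback invocation. The event name, metadata keys and the surrounding helper are assumptions for illustration; the real code in Phoenix.LiveView.Utils is more involved.

```elixir
# Hypothetical sketch only; not the actual Phoenix LiveView implementation.
defp call_update_many(component, assigns_sockets) do
  metadata = %{component: component}

  :telemetry.span([:phoenix, :live_component, :update_many], metadata, fn ->
    # update_many/1 receives the list of {assigns, socket} tuples and must
    # return the updated sockets in the same order.
    sockets = component.update_many(assigns_sockets)
    {sockets, metadata}
  end)
end
```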

Maybe it is also worth adding handle_call, handle_cast and handle_info events for the cases where we forward those to the live view (socket.view.handle_call(...)), including the subsequent handle_result call.
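
For reference, consuming any of these proposed events would be standard :telemetry usage. The event names below are the ones suggested above; they are not emitted by LiveView today.

```elixir
# Assumes the proposed event names; they do not exist in LiveView yet.
events = [
  [:phoenix, :live_component, :update_many, :start],
  [:phoenix, :live_component, :update_many, :stop],
  [:phoenix, :live_view, :render, :start],
  [:phoenix, :live_view, :render, :stop]
]

:telemetry.attach_many(
  "my-live-view-instrumentation",
  events,
  fn event, measurements, metadata, _config ->
    # An OTEL bridge would start/end spans here; this just inspects the data.
    IO.inspect({event, measurements, Map.keys(metadata)})
  end,
  nil
)
```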

Channel Lifecycle Spans

This is where I don’t have a clear sense of whether it is a good idea, but it may be worth exploring: wrapping a “full cycle” on the channel process for each type of “message”. The challenge is that this could leak internals that should not be treated as public API. The benefit is that on the OpenTelemetry side we could treat these as wrapping “messaging” spans.

In general, the idea would be to wrap the outermost callbacks in Phoenix.LiveView.Channel to get a higher-level view of what’s happening.

Some examples:

  • [:phoenix, :live_view_channel, :mount]
    • This would be the “join” call where the live view is mounted. I’m not sure where to put it: whether it should encompass the whole handle_info call with the join data, or sit inside the mount or verified_mount functions. I guess it depends on how much data we want to include in the :start event? There are also questions about what we should send when we reply with an error, because the return value can be {:stop, :shutdown, :no_state}, which carries no reason for the stop (we only get a slight hint from the GenServer.reply call). Maybe we should add some data to the return value, but that would require changing the whole tree: many places return like that, so it might be a challenge to organize it in any sane way.
  • [:phoenix, :live_view_channel, :handle_event]
    • Wrapping the whole handle_info case for the %Message{topic: topic, event: "event"} message
  • [:phoenix, :live_view_channel, :live_patch]
    • Wrapping live patch calls. This can be challenging, though, since a live patch can come in via the %Message{topic: topic, event: "live_patch"} message but also happen internally in the mount_handle_params_result and handle_redirect functions.

It might also be worth having a span for each message the channel receives, from redirect messages to async results.
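
To illustrate the channel-level idea with the handle_event case above, here is a rough sketch of what the wrapping could look like. The clause shape is simplified and view_handle_event/2 is a hypothetical stand-in for the existing dispatching code in Phoenix.LiveView.Channel.

```elixir
# Hypothetical sketch; not the actual Phoenix.LiveView.Channel code.
alias Phoenix.Socket.Message

def handle_info(%Message{topic: topic, event: "event"} = msg, state) do
  metadata = %{topic: topic, payload: msg.payload}

  :telemetry.span([:phoenix, :live_view_channel, :handle_event], metadata, fn ->
    # view_handle_event/2 stands in for whatever currently handles the message
    # (calling the user's handle_event, diffing, pushing the reply).
    result = view_handle_event(msg, state)
    {result, metadata}
  end)
end
```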

Contribution

If this is desirable, I can open a PR on the phoenix_live_view repo with some of these. I think the extra telemetry events for the live view / live component callbacks are almost certainly worth it, while the others probably need more discussion.

Hi @bamorim, very nice thoughts here.

I think we should definitely consider instrumenting either [:phoenix, :live_view, :render] or [:phoenix, :live_view, :update_many] (invoked once per component type). It depends on how much granularity you want. A LiveComponent doesn’t really execute those actions - it does render but it is very fast, none of what LiveView is doing - so I wouldn’t include it.

About the channel events, I am not sure I understand. We already capture the user handle_event/handle_params/mount. I am not sure we do anything internally meaningful enough to be worth capturing more. I guess for mount we do a lot of work, but we don’t know what we are mounting until the expensive work is done, so the span would be vague. :frowning:

Thanks for the reply @josevalim.
Thinking about it, I agree that [:phoenix, :live_component, :render] will not bring much value since not much happens “internally”. My initial thought was to capture the inner updates caused by that render, but looking at the diff code it doesn’t seem to be that easy, and it seems the recursive render happens at the view level. For the live view render, though, it could be really useful to capture all these nested update/update_many events.

As for the update, the idea is to have one [:phoenix, :live_component, :update_many] span per component type, which can repeat if the same component renders again or if components end up being rendered at different depths of the tree, right? Otherwise it would be a little more complicated, or even impossible, to do.

I can open a PR for each of these events; I think discussing the “how” at the code level will be easier (I will start with the live view render since it seems to be the easiest to agree on).

As for the “channel” spans, my thoughts were a little confusing because I haven’t fully grasped what a solution would look like myself. That said, let me try to explain the problem and the train of thought:

For the first render, everything happens inside the same HTTP request, so we end up with a single “wrapping” span for everything that happens (the one emitted by Plug.Telemetry). This is really useful in the OpenTelemetry context because that Plug :telemetry span is converted to an OTEL span which becomes the root span of the trace, and that helps visualize the full “story” of the request.

On subsequent renders, caused by an event, a live patch, etc., everything happens through the WebSocket and there is no wrapping span to show that “full story”.

The idea would be to wrap all messages (handle_call / handle_info / handle_cast) that the channel GenServer receives that could cause changes (that is, that could end up calling a live view callback internally), while keeping the spans “meaningful enough”.

For example, if you have a component that does a live patch in handle_params that ends up changing its own params, or some “param-changing chain” like that, we would not have any “wrapping” span that lets us correlate all the child “handle_params” spans.

Again, this makes more sense in the OpenTelemetry context because that wrapping span would initialize the OpenTelemetry context with a new trace and a root span, allowing child spans to simply attach to that same context, linking everything together.
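
As an illustration of how this could work on the OTEL side (loosely modeled on what the existing bridges do, but not their actual code), a handler could start a span on the proposed :start event and end it on :stop. The event names and the process-dictionary bookkeeping below are assumptions; since telemetry handlers run in the channel process, every child span created by the callbacks would attach to this span.

```elixir
# Rough illustration only; not OpentelemetryPhoenix / OpentelemetryLiveView code.
defmodule MyApp.LiveViewChannelTracer do
  require OpenTelemetry.Tracer, as: Tracer

  def attach do
    :telemetry.attach_many(
      "live-view-channel-tracer",
      [
        [:phoenix, :live_view_channel, :handle_event, :start],
        [:phoenix, :live_view_channel, :handle_event, :stop]
      ],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([_, _, _, :start], _measurements, _metadata, _config) do
    # The handler runs in the channel process, so making this span current
    # means every span started by the user callbacks becomes its child.
    parent = Tracer.current_span_ctx()
    span = Tracer.start_span("live_view_channel.handle_event")
    Process.put(:channel_span, {parent, span})
    Tracer.set_current_span(span)
  end

  def handle_event([_, _, _, :stop], _measurements, _metadata, _config) do
    with {parent, span} <- Process.delete(:channel_span) do
      OpenTelemetry.Span.end_span(span)
      Tracer.set_current_span(parent)
    end
  end
end
```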

Did I manage to make it clearer, or just even more confusing? xD

Correct. We will wrap the existing code.

Perfect.


I am not sure we can have a span wrapping the whole process lifecycle (and even if we do, it could last dozens of minutes?).

The idea is not to wrap the whole process lifecycle, but each “cycle”, from receiving the message to updating the state to sending the diff back.

For example, we could have a mount event wrapping the whole (or a good part of) def handle_info({Phoenix.Channel, auth_payload, from, phx_socket}, ref).
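
Something along these lines (purely illustrative: do_join/4 is a made-up stand-in for the existing join handling, and the metadata is invented):

```elixir
# Hypothetical sketch; the real Phoenix.LiveView.Channel join handling is
# more involved than this.
def handle_info({Phoenix.Channel, auth_payload, from, phx_socket}, ref) do
  metadata = %{topic: phx_socket.topic}

  :telemetry.span([:phoenix, :live_view_channel, :mount], metadata, fn ->
    result = do_join(auth_payload, from, phx_socket, ref)
    {result, metadata}
  end)
end
```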

Yeah, mount would be the only case, but even then: