Hey everyone! New to the forum, not new to Elixir.
In any case, recently had a discussion with @benwilson512 on overhead during GQL requests, and then saw a related question in the Absinthe channel on Slack, so decided to open it up here for input/guidance.
Problem
We've been tracing our API's performance in Datadog and have seen some oddities that we can't quite diagnose. Pictures are worth 1000 words, so here are some flame graphs for one request that demonstrates the issue:
What's happening in the gaps after auth/context and after the actual resolution?
This is just one request, but many of our requests have this shape, with at least 50ms of the request spent on something unknown. In some cases we have 100ms or so unaccounted for, usually on the tail end of the request, post-resolution.
Environment
App is released via mix release to AWS (EC2 machine) and deployed behind an ELB (https).
One key thing to check on is what your scalars are doing. Scalars are serialized in the result phase, which happens entirely after the resolution phase. In the past, poorly optimized serialization functions have led to lengthy post-resolution times.
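For reference, here is a minimal sketch of a custom scalar whose serialize function runs during the result phase; the `:datetime` name, the module, and the ISO8601 formatting are illustrative assumptions, not taken from the schema in question:

```elixir
defmodule MyAppWeb.Schema.Scalars do
  use Absinthe.Schema.Notation

  # serialize/1 runs for every field of this type during the result phase,
  # after resolution has finished; expensive work here shows up as
  # otherwise-unexplained post-resolution time.
  scalar :datetime, description: "ISO8601 datetime" do
    serialize(&DateTime.to_iso8601/1)

    parse(fn
      %Absinthe.Blueprint.Input.String{value: value} ->
        case DateTime.from_iso8601(value) do
          {:ok, dt, _offset} -> {:ok, dt}
          _ -> :error
        end

      _ ->
        :error
    end)
  end
end
```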
To dig in further, some additional metrics would be handy. Can you provide an example query document? How large is the result JSON in bytes? How large is the schema (approximated by MyAppWeb.Schema.__absinthe_types__ |> map_size)?
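For anyone following along, these numbers can be checked quickly from IEx; `document` and `vars` below stand in for the real query string and variables, and Jason is assumed as the JSON encoder:

```elixir
# Approximate schema size (number of types defined in the schema)
MyAppWeb.Schema.__absinthe_types__() |> map_size()

# Approximate result size in bytes
{:ok, result} = Absinthe.run(document, MyAppWeb.Schema, variables: vars)
result |> Jason.encode!() |> byte_size()
```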
Here's a gist for a sample GQL doc that would generate the flame graphs in my original post. I've taken some fields out for IP purposes, but that's generally the shape. You can infer the types of most fields. There are aggregate types where it makes sense (user, for example). Also of note, we're not currently using any custom serialization.
The result of __absinthe_types__ |> map_size is 34, which doesn't seem to be that large to me, but perhaps that's incorrect?
Update
So I've shaved some time off because I wasn't .uniq-ing IDs in batch loaders, but there are still two sizable time gaps for which we're unable to account:
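For anyone hitting the same thing, the fix was along these lines. This is a sketch using Absinthe's batch helper; the module, schema, and Repo names are assumptions, not our actual code:

```elixir
defmodule MyAppWeb.Resolvers.Users do
  import Absinthe.Resolution.Helpers, only: [batch: 3]
  import Ecto.Query, only: [from: 2]

  def user(%{user_id: id}, _args, _resolution) do
    batch({__MODULE__, :users_by_id}, id, fn batch_results ->
      {:ok, Map.get(batch_results, id)}
    end)
  end

  # The aggregated IDs arrive as a list; dedupe them before querying so
  # repeated parent records don't inflate the IN clause and downstream work.
  def users_by_id(_, ids) do
    ids = Enum.uniq(ids)

    from(u in MyApp.User, where: u.id in ^ids)
    |> MyApp.Repo.all()
    |> Map.new(&{&1.id, &1})
  end
end
```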
Hi @derekbrown, I still need some sense of the size of the returned document. This will help diagnose time spent after the resolution phase constructing the actual JSON result.
I recommend using Telemetry to provide a few more data points. Specifically, the [:operation, :start] event would be very helpful: it is likely that Absinthe starts work immediately after the context is set, but in theory there could be other plugs or issues within Absinthe.Plug that are introducing delay. See https://hexdocs.pm/absinthe/telemetry.html#content for details.
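For reference, attaching a handler to those events looks roughly like this. The event names are the documented Absinthe 1.5+ telemetry events; the logging here is just a stand-in for pushing spans into whatever tracer you use:

```elixir
defmodule MyApp.AbsintheTelemetry do
  require Logger

  @events [
    [:absinthe, :execute, :operation, :start],
    [:absinthe, :execute, :operation, :stop],
    [:absinthe, :resolve, :field, :start],
    [:absinthe, :resolve, :field, :stop]
  ]

  def attach do
    :telemetry.attach_many("absinthe-tracing", @events, &__MODULE__.handle_event/4, nil)
  end

  # Stop events carry a :duration measurement in native time units.
  def handle_event([:absinthe, :execute, :operation, :stop], %{duration: duration}, _meta, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    Logger.info("operation took #{ms}ms")
  end

  def handle_event(event, measurements, _meta, _config) do
    # In practice this clause would start/finish spans in the tracer.
    Logger.debug("#{inspect(event)}: #{inspect(measurements)}")
  end
end
```

Calling `MyApp.AbsintheTelemetry.attach()` from the application's `start/2` callback is enough to begin receiving the events.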
The document is relatively small, so processing on the document should happen pretty quickly. I'm going to take that document and create a project with a similar schema to see if we can reproduce the observed performance issues.
Sorry about that! Missed the question in your first reply, @benwilson512. The return document size varies depending on the query variables, obviously, but the return document for the graphs above was 155kb. FWIW, we see the same "lag" regardless of the size of the query.
We have Telemetry events already set up; I can get that to you ASAP. Not a bad idea to add those events to the tracer as well. If I'm able to do that easily, I'll update this thread with new graphs that show the start & stop events on the operation.
As a side note, it will be super interesting to see this data alongside a PersistentTerm schema once we're able to release that.
Update: I've been able to add Absinthe Telemetry events to the tracer, per the reply above from @benwilson512. Below are some observations and pretty pictures. I've tried to keep the same query here from the original post, for an apples-to-apples comparison. Hope this all helps.
You'll notice there's a significant reduction in overall time. The query document remains the same, along with the response size, more or less (if anything, the response is likely bigger). Part of this is attributable to some logic/DB work we've been doing underneath the query, and some of it is directly attributable to the .uniq issue I mentioned earlier. Hope that helps anyone else looking to improve performance.
Overall shape of the query. The first flame of spans is the context loading, followed by the operation.start event, followed by the first resolution.start event. We'll zoom in with subsequent screenshots, but there's a significant gap between those two events, and another after the operation.stop event.
Context to Operation Start. There's a minuscule gap (~300µs) between context being added and the operation starting, but not significant enough for us to blink.
Operation Start. The span on the right of the screenshot is the first resolution.start event, which is followed by the fetch_messages query in the overall screenshot above. Would we expect a ~6ms delay between operation start and resolution start?
Operation Stop. There's a ~10ms gap after operation.stop that's unaccounted for. I'll dig into what could be happening there (perhaps other plugs?). I don't assume there's any cleanup or anything that Absinthe would be doing after the operation, is there?
Maybe a clue, or possibly a rabbit trail: it seems that the delay before context is added depends on whether it's a query or a mutation. The gap from trace start to context being added is much smaller for mutations.
The above is a rabbit trail. There doesn't seem to be anything deterministic causing the delay before Context gets called.
The time spent between the operation starting and the first resolver running is time spent parsing and validating the document. The time spent between the last resolver and operation stop is time spent turning the raw Elixir results from resolution into the nested string keyed map that will ultimately end up as JSON, and validating / serializing scalars. Time spent after the operation has stopped is gonna be mostly JSON encoding I think.
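If it's helpful to see what runs in those windows, the document pipeline can be inspected directly; this is a quick IEx sketch, and the exact phase list depends on the Absinthe version in use:

```elixir
# Everything before Absinthe.Phase.Document.Execution.Resolution is
# parsing/validation; Absinthe.Phase.Document.Result builds the
# string-keyed result map after resolution.
MyAppWeb.Schema
|> Absinthe.Pipeline.for_document()
|> Enum.map(fn
  {phase, _opts} -> phase
  phase -> phase
end)
```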
With respect to the comment you made in the github issue, the resolution phase is expected to be the longest because it runs the resolution functions. The sum of the time spent in the resolution functions will be less than the time spent in the resolution phase due to the overhead of walking the tree and its associated results.
Thanks, Ben! Makes sense. You've been super helpful and I'm appreciative.
And JSON encoding being a majority of the tail makes sense as well; I think that length of time is proportionate to our response object. I'm still trying to diagnose whether it's our nested query documents (sometimes 4-5 levels deep) that are causing the perf gap during validation, or whether it's something else.
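One quick way to check how much of the tail is pure encoding is to time it directly; `result` here is assumed to be an Absinthe result already in hand, with Jason as the encoder:

```elixir
# Microseconds spent purely on JSON encoding of the Absinthe result
{encode_us, json} = :timer.tc(fn -> Jason.encode!(result) end)
IO.puts("JSON encoding took #{encode_us / 1000}ms for #{byte_size(json)} bytes")
```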
With respect to this, the best bet would be to create something we can both iterate on that exhibits the behaviour you're running into. I should also ask about the server you're running this in.
I didn't, unfortunately. But I do think that a lot of it is resolution. We made a couple of changes to batch resolvers that seem to have helped, and there were a couple of uncaught N+1 queries on our side as well. There are still a lot of unexplained gaps, but the response times are at least manageable now.
The offer to investigate any runnable example stands! Do make sure to check out the persistent_term backend, Absinthe.Schema.PersistentTerm (absinthe v1.6.4), which we're promoting out of the Experimental namespace in the next release. It has proven to be both faster and easier to work with.
I'm seeing the original post was on Absinthe 1.5.0, and we definitely fixed an N^2 regression in our variable handling somewhere in a release between then and latest. Hopefully you've updated since!
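As I understand the docs, switching to the persistent_term backend is a one-line change in the schema module; the module name and the placeholder field below are illustrative:

```elixir
defmodule MyAppWeb.Schema do
  use Absinthe.Schema

  # Store the compiled schema in :persistent_term instead of generated
  # functions (per the Absinthe.Schema.PersistentTerm docs).
  @schema_provider Absinthe.Schema.PersistentTerm

  query do
    # Placeholder field so the sketch compiles; real fields go here.
    field :health, :string do
      resolve(fn _args, _resolution -> {:ok, "ok"} end)
    end
  end
end
```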
Yes! Forgot to mention that: we're using PersistentTerm as well (have been since it was in a PR last year), and it's definitely provided gains. Huge thanks to @benwilson512 and team for that.
FWIW, when I say that our response times are manageable, I may have been understating. Our overall times have been reduced to a third of what they were.
It's still a mystery to me exactly what's happening, but it's now a low-priority mystery for us.
Awesome. Do note that we've seen significant improvements from the JIT in OTP 24 as well, so once it's released we expect the overhead to drop a further ~60% from that alone.