Absinthe cannot seem to handle even 1000 concurrent subscriptions

aloukissas · July 7, 2023, 5:37am

What we’ve discovered with a simple load testing experiment is that with as few as 1000 concurrent clients subscribed to a query, calling Absinthe.Subscription.publish can take tens of seconds or even minutes, regardless of the resources we give the system (e.g. many GBs of RAM). Some details of our setup below:

Absinthe v1.7.1, with absinthe_graphql_ws
Phoenix 1.6.16, using Phoenix PubSub for subscriptions, set up pretty much exactly as the Absinthe guide for subscriptions prescribes
Nodes clustered with libcluster using the Kubernetes.DNS strategy (although the problem manifests even with a single node, clustering disabled)
The experiment is run with Artillery and its graphql-ws engine, which simply sets up each runner to subscribe to a query on our schema

We’ve been able to isolate the issue to the publish_mutation calls (both remote and local). In the same setup, with all websockets connected, we are able to subscribe and publish on topics with Phoenix.PubSub without problems.

Has anyone here successfully run Absinthe with subscriptions at any scale?

I’ve also built similar systems with Phoenix PubSub + Channels (much more low-level than Absinthe) that behave really well at much larger scale (100s of thousands of concurrent users), without any special configuration.

josevalim · July 7, 2023, 6:06am

What data are you storing in your Absinthe context for subscriptions? If it is slow, it is often because the context is really large, and then copying all that data in and out of ETS takes a long time. Try running it while storing the minimum amount necessary (user_id + org_id) and see what happens.
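One way to check how much a trimmed context would save is to compare term sizes directly. The following is only a sketch: the User struct, its fields, and the ContextSizeSketch module are hypothetical stand-ins, and :erts_debug.flat_size/1 gives an approximate heap size rather than an exact copy cost.

```elixir
defmodule ContextSizeSketch do
  # Hypothetical stand-in for an application's User struct.
  defmodule User do
    defstruct [:id, :org_id, :email, :name, :preferences, :memberships]
  end

  # Approximate heap size of a term in bytes. flat_size/1 reports heap words,
  # so multiply by the word size of the running VM.
  def size_in_bytes(term) do
    :erts_debug.flat_size(term) * :erlang.system_info(:wordsize)
  end

  # Keep only the identifiers the resolvers actually need.
  def minimal_context(%User{id: id, org_id: org_id}) do
    %{user_id: id, org_id: org_id}
  end
end

# Illustrative data only.
user =
  struct!(ContextSizeSketch.User,
    id: 1,
    org_id: 2,
    email: "someone@example.com",
    name: "Someone",
    preferences: %{theme: "dark", locale: "en"},
    memberships: Enum.to_list(1..50)
  )

full = %{current_user: user}
slim = ContextSizeSketch.minimal_context(user)

IO.inspect(ContextSizeSketch.size_in_bytes(full), label: "full context bytes")
IO.inspect(ContextSizeSketch.size_in_bytes(slim), label: "slim context bytes")
```

If the full context turns out to be many times larger than the slim one, that supports trying the minimal context first.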

josevalim · July 7, 2023, 6:13am

You can try further validating this by using observer or the LiveDashboard and listing all ETS tables alongside their sizes. Also, please check your dataloaders.
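If observer or the LiveDashboard is not handy, roughly the same listing can be produced from an IEx session on the running node. This is a minimal sketch using only the :ets module; the EtsSizes module name is made up for illustration.

```elixir
defmodule EtsSizes do
  # Returns the `limit` largest ETS tables on this node, biggest first.
  def largest(limit \\ 10) do
    word = :erlang.system_info(:wordsize)

    :ets.all()
    |> Enum.map(fn table ->
      # :ets.info/2 returns :undefined if the table was deleted in the meantime.
      {table, :ets.info(table, :name), :ets.info(table, :size), :ets.info(table, :memory)}
    end)
    |> Enum.filter(fn {_table, _name, rows, words} ->
      is_integer(rows) and is_integer(words)
    end)
    |> Enum.map(fn {table, name, rows, words} ->
      # :memory is reported in words, so convert to bytes.
      %{table: table, name: name, rows: rows, bytes: words * word}
    end)
    |> Enum.sort_by(& &1.bytes, :desc)
    |> Enum.take(limit)
  end
end

IO.inspect(EtsSizes.largest(5), label: "largest ETS tables")
```

Running it before and after the load test shows which tables grow with the number of subscribers.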

aloukissas · July 7, 2023, 6:45am

This is something I’ve not tested yet! I do store the entire User struct in the socket’s assigns (it’s not large but could be the issue).

aloukissas · July 7, 2023, 7:54am

We are using dataloader already. To reduce variables in the debugging, the resolver just resolves a single field to {:ok, nil} so it should be a noop. I am also exploring leveraging the context_id for dedup, as per Absinthe’s documentation.

benwilson512 · July 7, 2023, 8:15am

Is this code for this experiment available? publish_mutation is blocking, in that it does not return until all subscriber documents have been run and individually published. If you have 1000 subscribers and their docs take 10 milliseconds each to run (which would be quite fast) then that’s still 10 seconds worth of delay.

This is fundamentally very different from Phoenix PubSub. If you have 1000 subscribers to a topic, publishing to that topic just involves basically a send to every subscriber, there is not a large body of code to execute for each subscriber.

EDIT: As a tiny note, I’d make sure to use the Absinthe.Schema.PersistentTerm schema backend (see the absinthe v1.7.3 docs), as that has improved copying characteristics. I don’t think that’s going to be a game changer though.
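For readers who have not switched backends before, the change is small. The fragment below is a sketch that assumes the absinthe dependency is already in the project; MyAppWeb.Schema and the ping field are placeholder names, not anything from the thread.

```elixir
defmodule MyAppWeb.Schema do
  use Absinthe.Schema

  # Store the compiled schema in :persistent_term instead of a generated module.
  @schema_provider Absinthe.Schema.PersistentTerm

  query do
    # Placeholder field so the schema has a query root.
    field :ping, :string do
      resolve fn _, _ -> {:ok, "pong"} end
    end
  end
end

# In the application's supervision tree, before anything that uses the schema
# (endpoint, Absinthe.Subscription, and so on):
children = [
  {Absinthe.Schema, MyAppWeb.Schema}
]
```

With this backend the schema has to be started under a supervisor as shown, otherwise lookups will fail at runtime.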

aloukissas · July 7, 2023, 8:33am

Thanks for confirming this, this is what I understood by reading the code. In this case, would setting context_id to “global” help?

benwilson512 · July 7, 2023, 8:38am

Yes, setting context_id to a fixed value in your case would make a massive difference, as it transforms the work Absinthe has to do from 1000 × (1 doc exec + 1 publish) to 1 doc exec + (1000 × pubsub broadcast), where a pubsub broadcast here is basically a raw Phoenix PubSub call.
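The two cost shapes described here can be put into a tiny model to see how far apart they are. This is only a back-of-the-envelope sketch: the 10 ms document execution comes from the earlier post, while the 10 µs per send is an assumed figure chosen purely for illustration.

```elixir
defmodule PublishCostSketch do
  # Distinct context per subscriber: every document is executed and published.
  def per_subscriber_us(subscribers, doc_exec_us, publish_us) do
    subscribers * (doc_exec_us + publish_us)
  end

  # Shared context_id: one execution, then one cheap broadcast per subscriber.
  def shared_context_us(subscribers, doc_exec_us, broadcast_us) do
    doc_exec_us + subscribers * broadcast_us
  end
end

# 1000 subscribers, 10 ms (10_000 µs) per document, assumed 10 µs per send.
IO.inspect(PublishCostSketch.per_subscriber_us(1000, 10_000, 10), label: "per-subscriber µs")
IO.inspect(PublishCostSketch.shared_context_us(1000, 10_000, 10), label: "shared context µs")
```

Under these assumptions the first case costs about 10 seconds and the second about 20 milliseconds, which matches the "massive difference" described above.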

Smart use of context_id is definitely critical for scenarios where you have many subscribers to the same thing.
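For reference, a fixed context_id is returned from the subscription field's config function. The fragment below is a sketch assuming the absinthe dependency is present; the schema module, the status_changed field, and the "status" topic are hypothetical names.

```elixir
defmodule MyAppWeb.Schema do
  use Absinthe.Schema

  query do
    # Placeholder field so the schema has a query root.
    field :ping, :string do
      resolve fn _, _ -> {:ok, "pong"} end
    end
  end

  subscription do
    field :status_changed, :string do
      config fn _args, _resolution ->
        # Same topic and same context_id for everyone, so identical
        # documents can be executed once and the result shared.
        {:ok, topic: "status", context_id: "global"}
      end
    end
  end
end

# Publishing from application code (needs a running endpoint and pubsub):
# Absinthe.Subscription.publish(MyAppWeb.Endpoint, "online", status_changed: "status")
```

Sharing a context_id is only appropriate when the result does not depend on who is asking. If resolvers read the current user for authorization or filtering, a global value would hand every subscriber the same payload.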

aloukissas · July 7, 2023, 12:35pm

Hey @benwilson512 - the combination of a global context_id and the currently unreleased version of Absinthe (1.7.4), which includes this PR, seems to fix it. Without this fix, we were still seeing latency linear in the number of subscribers (the global context_id made no difference).

When can we expect 1.7.4 to be properly released? We’re pointing to the git commit in our deps for now.

benwilson512 · July 7, 2023, 3:59pm

Glad it worked for you! v1.7.4 has been published

aloukissas · July 7, 2023, 4:40pm

Thanks! By the way, I think it would be useful for future readers to explicitly spell out in the docs that subscriptions carry this O(n) performance penalty unless the global dedup flag is enabled. In our experiments, anything above a couple hundred subscribers made the system pretty unusable.

benwilson512 · July 7, 2023, 6:05pm

Sure. The Absinthe.Subscription docs (absinthe v1.7.6) have some notes, but they’re pretty outdated, and the language speaks of a “beta” release, which is obviously very old.

Again, it depends a lot on whether those subscribers are all subscribed to the same thing or not. You can easily have tens of thousands of subscribers if they’re spread out over thousands of topics. When you get concentrations of subscribers on a single topic, though, this can definitely be a challenge.