I’m having a hard time debugging an issue with Phoenix.PubSub on our production system. We run several Elixir instances, but one of them is seeing its PubSub server overloaded with messages. Sometimes the GenServer cannot handle incoming messages fast enough, so they pile up in its message queue. I’m checking message_queue_len at a 3-second interval.
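For concreteness, the kind of polling I mean looks roughly like this (a sketch; :my_pubsub is a placeholder, substitute whatever name your PubSub server is registered under):

```elixir
# Poll a named process's mailbox length every 3 seconds.
# :my_pubsub is a placeholder for the registered name of the
# PubSub server in your supervision tree.
defmodule MailboxPoller do
  def run(name \\ :my_pubsub, interval_ms \\ 3_000) do
    with pid when is_pid(pid) <- Process.whereis(name),
         {:message_queue_len, len} <- Process.info(pid, :message_queue_len) do
      IO.puts("#{inspect(name)} message_queue_len=#{len}")
    end

    Process.sleep(interval_ms)
    run(name, interval_ms)
  end
end
```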
Here’s what I know so far:
We are using Phoenix.PubSub with the PG2 adapter.
The affected instance is under a normal workload. Connected sockets are spread pretty evenly across all the instances.
While the message queue is piling up, we do see a slight increase in the incoming message rate.
AFAIK, the Phoenix.PubSub server does nothing but receive messages from remote processes, query ETS, and delegate to local socket processes. There are no heavy tasks involved.
So my questions are:
Is there any flaw in my measurement method? Since message_queue_len does not directly reflect message latency, my assumption could be incorrect.
What is a production-safe way to debug this issue?
Unfortunately we do not use Phoenix LiveDashboard at the moment.
Well, the GenServer is implemented by Phoenix, Phoenix.PubSub.PG2Server to be precise. There’s only one instance of this GenServer running on each Elixir node.
We also need to know whether other parts of the system are causing contention on the schedulers. Do you have any unbounded message queues in the app (such as processes subscribed to the incoming PubSub messages)? Any other large message queues that appear alongside the PG2 server’s inbox starting to grow? Once you get unbounded message-queue buildup, other processes can start falling behind, or calls can start timing out even though they are not the real cause, as the schedulers try to keep up. I’m not completely ruling out the PG2 server as your bottleneck, but we need to know a lot more before we can say what’s going on or place the blame there. Our synthetic benchmarks of PubSub have handled 500k msgs/sec on my MacBook. What kind of broadcast rate are you pushing through the cluster when you see this?
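A low-risk way to answer the “any other large mailboxes” question from a remote IEx shell is to scan Process.list/0 for big message queues (a sketch; the 100-message threshold is an arbitrary cutoff):

```elixir
# List processes with large mailboxes, biggest first.
# Process.info/2 returns nil for processes that die mid-scan,
# so those are skipped.
Process.list()
|> Enum.flat_map(fn pid ->
  case Process.info(pid, [:message_queue_len, :registered_name]) do
    nil -> []
    info -> [{pid, info[:message_queue_len], info[:registered_name]}]
  end
end)
|> Enum.filter(fn {_pid, len, _name} -> len > 100 end)
|> Enum.sort_by(fn {_pid, len, _name} -> len end, :desc)
```

If you can add the :recon dependency, `:recon.proc_count(:message_queue_len, 5)` gives you the top 5 mailboxes with production-safe defaults.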
Are you filtering messages, and thus not consuming all of the incoming ones?
If so, the mailbox will start to grow, and on each receive the mailbox is scanned from oldest to newest until a message matches, creating unnecessary work that hurts throughput.
To solve this you need a catch-all clause that consumes every message; otherwise the runtime will never remove unmatched messages from the mailbox, and at some point you can even run out of memory and bring the entire system down.
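To be precise, the selective-receive trap applies to hand-written receive blocks; a GenServer’s loop consumes every message, but an explicit catch-all handle_info clause still keeps unexpected messages from causing noise or crashes. A minimal sketch (the Subscriber module and the message shape are made up for illustration):

```elixir
# A subscriber GenServer with a catch-all handle_info clause,
# so any message shape it doesn't expect is consumed and
# logged instead of raising or relying on the default handler.
defmodule Subscriber do
  use GenServer
  require Logger

  def init(state), do: {:ok, state}

  # Expected message shape (hypothetical for this example).
  def handle_info({:broadcast, _msg}, state) do
    # ... push the message down the client socket ...
    {:noreply, state}
  end

  # Catch-all: consume and log anything unexpected.
  def handle_info(other, state) do
    Logger.warning("unexpected message: #{inspect(other)}")
    {:noreply, state}
  end
end
```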
Yep, the processes which subscribe to those events are under our control, but they do nothing beyond pushing messages down to clients through WebSockets. And since they only receive messages from the PG2 server without any blocking (the server just sends and forgets), I highly doubt they are the problem.
Currently I’m measuring the incoming message rate of the Phoenix.PubSub server with :sys.statistics. The rate is around 1,400-1,600 messages/s, and even when the message queue starts to pile up it barely increases. Most of them (I’d say 95%) are broadcast messages; the rest come from Phoenix.Presence.
Please note that I’m polling at a 3-second interval and the :sys module only measures time in seconds, so the result might be a bit off.
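For reference, the measurement can be taken roughly like this (a sketch; :my_pubsub is a placeholder name, and note that enabling :sys.statistics on a busy production process adds some overhead):

```elixir
# Enable stats, sample for 3 seconds, then disable them again.
# :my_pubsub is a placeholder for the PubSub server's name.
pid = Process.whereis(:my_pubsub)
:ok = :sys.statistics(pid, true)
Process.sleep(3_000)
{:ok, stats} = :sys.statistics(pid, :get)
:ok = :sys.statistics(pid, false)

# :messages_in counts messages received since stats were enabled.
rate = div(stats[:messages_in] * 1_000, 3_000)
IO.puts("~#{rate} messages/s over the sample window")
```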
Also, how does CPU look up to and during the backup? Any estimate of the total number of presences? You can become CPU-bound with a large number of users being tracked through Presence, which could be related.
Would it be possible to use PubSub 2.0, which uses pg instead of pg2?
From the Erlang docs for pg2: “The pg2 module is deprecated as of OTP 23 and scheduled for removal in OTP 24. You are advised to replace the usage of pg2 with pg. pg has a similar API, but with an implementation that is more scalable. See the documentation of pg for more information about differences.”
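For what it’s worth, moving to phoenix_pubsub 2.x is mostly a matter of updating the dependency and the child spec; a sketch of the 2.0-style setup (MyApp.PubSub is a placeholder name):

```elixir
# mix.exs: bump the dependency.
# {:phoenix_pubsub, "~> 2.0"}

# application.ex: start PubSub as a child of the app supervisor.
children = [
  {Phoenix.PubSub, name: MyApp.PubSub}
]

# Subscribing and broadcasting keep the same shape:
# Phoenix.PubSub.subscribe(MyApp.PubSub, "room:42")
# Phoenix.PubSub.broadcast(MyApp.PubSub, "room:42", {:new_msg, "hi"})
```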
I am not saying this is happening, but you shouldn’t rule out the possibility of I/O being a bottleneck (in fact, it quite often is). If these messages are being pushed to clients, there has to be an open socket connection, and those connections will eventually buffer up if you are sending data to clients faster than they can receive it.
I would need to check what behavior and safeguards channels have, if any, for the situation where clients are slow and the transmission rate is high, but in principle that could be your bottleneck. I am sure @chrismccord could shed some light on what happens in that situation.
It’s possible you are running up against the limits of your instance hardware for that number of concurrent presences. How partitioned are those 70k presences? I.e., do you have a few topics with tens of thousands of presences, or many topics with far fewer presences per topic?
Update on the issue at hand. The Erlang VM schedulers are quite busy when messages start to pile up; I checked via :erlang.statistics(:run_queue). The exact reason this happens is still unclear, though, since CPU utilization (and load average) stays consistently low.
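For anyone taking the same measurement: sampling run-queue depth alongside the mailbox length can look roughly like this. On OTP 20+, :total_run_queue_lengths is the cheaper (though only approximate) variant, while :run_queue locks all run queues to get an exact figure:

```elixir
# Sample scheduler run-queue depth.
# :total_run_queue_lengths is cheap but approximate;
# :erlang.statistics(:run_queue) is exact but more expensive.
total = :erlang.statistics(:total_run_queue_lengths)
schedulers = :erlang.system_info(:schedulers_online)
IO.puts("run queue: #{total} across #{schedulers} online schedulers")
```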