Phoenix.PubSub server message queue overload

I’m having a hard time debugging an issue related to Phoenix.PubSub on our production system. We have several Elixir instances, but one of them is experiencing message overload in the PubSub server. Sometimes the GenServer cannot handle incoming messages fast enough, so they pile up in its message queue. I’m checking the message_queue_len at a 3-second interval.
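Roughly what that check looks like (simplified sketch; I look up the PG2 server’s pid from the process list beforehand, so `pubsub_pid` is a placeholder):

```elixir
defmodule QueueProbe do
  # Prints the PubSub server's mailbox length every `interval_ms`.
  def run(pubsub_pid, interval_ms \\ 3_000) do
    case Process.info(pubsub_pid, :message_queue_len) do
      {:message_queue_len, len} ->
        IO.puts("#{DateTime.utc_now()} pubsub message_queue_len=#{len}")

      nil ->
        IO.puts("pubsub process is no longer alive")
    end

    Process.sleep(interval_ms)
    run(pubsub_pid, interval_ms)
  end
end
```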

Here’s what I know so far:

  • We are using Phoenix.PubSub with PG2 as the adapter.
  • The affected instance is under a normal workload. Connected sockets are spread fairly evenly across all instances.
  • During the times when the message queue piles up, we do see a slight increase in the incoming message rate.
  • AFAIK, the Phoenix.PubSub server does nothing but receive messages from remote processes, query ETS, and delegate to local socket processes. There are no heavy tasks involved.

So my question is:

  • Is there a flaw in my measurement method? Since message_queue_len does not directly reflect message latency, my assumption could be incorrect.
  • What is a production-safe way to debug this issue?

Thanks


If you are using Phoenix LiveDashboard, you can go there and see the list of processes along with their memory usage and message queue lengths. That could help you pinpoint the issue.

My hunch is that the message queue that is piling up is on your own GenServer, i.e. it’s a single process doing the work rather than distributing it to a pool of workers?

Unfortunately we do not use Phoenix LiveDashboard at the moment.
Well, the GenServer in question is implemented by Phoenix, Phoenix.PubSub.PG2Server to be precise. There’s only one instance of this GenServer running on each Elixir node.

Alright, but what is listening to these events? There’s some sort of process that receives and processes them that you control, right?

We also need to know whether other parts of the system are causing contention with the schedulers. Do you have any unbounded message queues in the app (such as those subscribed to the incoming pubsub messages)? Any other large message queues that appear alongside the pg2 server’s inbox starting to grow? Once you get unbounded message-queue buildup, other processes can start falling behind, or calls start timing out even though they are not the real cause, as the scheduler tries to keep up. I’m not completely ruling out the PG2 server being your bottleneck, but we need to know a lot more to say what’s going on or to place the blame there. Our synthetic benchmarks of PubSub have handled 500k msg/sec on my MacBook. What kind of broadcast rate are you pushing through the cluster when you see this?
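If you want a quick way to spot other large mailboxes when it happens, something like this pasted into a remote console will do (a rough snapshot, no extra dependencies assumed):

```elixir
# Top 10 processes by mailbox size, with their registered name (if any)
# and the function they are currently executing.
Process.list()
|> Enum.map(fn pid ->
  {pid, Process.info(pid, [:message_queue_len, :registered_name, :current_function])}
end)
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> -info[:message_queue_len] end)
|> Enum.take(10)
```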


Are you filtering messages, thus not consuming all the incoming ones?

If so, the mailbox will start to grow, and each receive will scan the mailbox from the oldest message to the newest until it finds one that matches, creating unnecessary work that will hurt throughput.

To solve this you need a catch-all clause that consumes the message; otherwise the runtime will not remove it from the mailbox, and at some point you can even run out of memory and bring the entire system down.
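To make that concrete, here is a minimal sketch (illustrative only, not your actual code; MyApp.PubSub and the topic are placeholders) of a subscriber with a catch-all clause:

```elixir
defmodule MySubscriber do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Subscribe the process to the topic it should consume.
    :ok = Phoenix.PubSub.subscribe(MyApp.PubSub, "events")
    {:ok, %{}}
  end

  @impl true
  def handle_info({:event, _payload}, state) do
    # ... handle the broadcast we actually care about ...
    {:noreply, state}
  end

  # Catch-all clause: consume anything else so it never lingers in the mailbox.
  def handle_info(_other, state), do: {:noreply, state}
end
```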

Thanks for your replies. Regarding your questions:

Do you have any unbounded message queues in the app (such as those subscribed to the incoming pubsub messages)?

AFAIK, there are no other unbounded message queues in our system, at least in the parts we wrote ourselves. I’m not sure whether that’s the case for our external libraries.

Any other large message queues that appear alongside the pg2 server’s inbox starting to grow?

I will look into this the next time it happens.

What kind of broadcast rate are you pushing through the cluster when you see this?

We just broadcast to a given topic with Phoenix.Endpoint.broadcast.
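Roughly like this (module, topic, and payload are just placeholders):

```elixir
# Fan out a domain event to everyone subscribed to the topic.
room_id = "123"
MyAppWeb.Endpoint.broadcast("room:" <> room_id, "new_message", %{body: "hello"})
```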

Yep, the processes which subscribe to those events are under our control, but they do nothing beyond pushing messages down to clients over WebSockets. And since they only receive messages from the PG2 server without any blocking (the server just sends and forgets), I highly doubt they are the problem.

I’m asking specifically whether you have any idea of the rate of broadcasts. Even an estimate would help gauge whether something beyond PubSub is more likely the cause.

Currently I’m measuring the incoming message rate of the Phoenix.PubSub server with :sys.statistics. The rate is around 1400-1600 messages/s, and even when the message queue starts to pile up it barely increases. Most of them (I’d say 95%) are broadcast messages; the rest are messages from Phoenix.Presence.

Please note that I’m polling at a 3-second interval and the :sys module only measures time in seconds, so the result might be a bit off.
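For reference, this is roughly what I’m doing (simplified; `pubsub_pid` is the PG2 server pid, which I look up beforehand):

```elixir
# Enable statistics once, then diff the messages_in counter over a window
# to get an approximate incoming-message rate.
:ok = :sys.statistics(pubsub_pid, true)

{:ok, stats} = :sys.statistics(pubsub_pid, :get)
messages_in = Keyword.fetch!(stats, :messages_in)

Process.sleep(3_000)

{:ok, stats2} = :sys.statistics(pubsub_pid, :get)
rate = (Keyword.fetch!(stats2, :messages_in) - messages_in) / 3
IO.puts("~#{round(rate)} messages/s into the PubSub server")
```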

Do you have access to the Erlang Observer? Maybe that can help you get some better measurements.

I’ve tried to set up Observer against a remote node in the past but failed for various reasons. Do you have any idea what to look at specifically?

It can give you more real-time information and also tell you something about the other processes, so it’s more of an overall view of the system to see what the cause could be.
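One approach that usually works (assuming the remote node’s name and cookie are reachable from your machine, possibly via an SSH tunnel; all names here are placeholders) is to start a hidden local node and connect from it:

```elixir
# From your workstation:
#
#   iex --name debug@127.0.0.1 --cookie my_secret_cookie --hidden
#
# then, inside that IEx session:
Node.connect(:"myapp@10.0.0.5")
:observer.start()
# ...and pick the remote node from Observer's "Nodes" menu.
```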

Also, how does CPU look leading up to and during the backup? Any estimate of the total number of presences? You can become CPU bound with a large number of users being tracked through Presence, which could be related.

Would it be possible to use PubSub 2.0, which uses pg instead of pg2?

The pg2 module is deprecated as of OTP 23 and scheduled for removal in OTP 24. You are advised to replace the usage of pg2 with pg. pg has a similar API, but with an implementation that is more scalable. See the documentation of pg for more information about differences. http://erlang.org/doc/man/pg2.html

If nothing else, just to rule it out…
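For what it’s worth, the upgrade is mostly a dependency bump plus starting the new PubSub in your supervision tree (MyApp.PubSub and the module names are placeholders):

```elixir
# mix.exs (illustrative version):
#   {:phoenix_pubsub, "~> 2.0"}

# application.ex — start the PubSub before the Endpoint:
children = [
  {Phoenix.PubSub, name: MyApp.PubSub},
  MyAppWeb.Endpoint
]
```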


The CPU usage stays pretty stable at 45-60% the whole time. The estimated total number of presences is around 70k.

I’m trying to measure the execution time of the broadcast function in this module, on the assumption that this is the main work the PubSub server does.
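Roughly like this, as a caller-side proxy (note that this times a local broadcast from the calling process rather than the server’s own message handling; MyApp.PubSub and the topic are placeholders):

```elixir
# Average time spent in a single broadcast call, in microseconds.
runs = 100

total_us =
  Enum.reduce(1..runs, 0, fn _, acc ->
    {us, _result} =
      :timer.tc(fn ->
        Phoenix.PubSub.broadcast(MyApp.PubSub, "bench:topic", {:bench, System.monotonic_time()})
      end)

    acc + us
  end)

IO.puts("average broadcast time: #{div(total_us, runs)} µs")
```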

We are using OTP 22, so pg is not available yet. But I don’t think we’re hitting the limits of the pg2 module in terms of scaling.


I am not saying this is happening, but you shouldn’t rule out the possibility of I/O being the bottleneck (in fact, it quite often is). If these messages are being pushed to clients, there has to be an open socket connection for each of them, and those connections will eventually buffer up if you are sending data faster than the clients can receive it.

I would need to check what behavior and safeguards, if any, channels have for the situation where clients are slow and the transmission rate is high, but in principle that could be your bottleneck. I am sure @chrismccord could shed some light on what happens in such a situation.
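One rough way to check for this (a sketch; it only looks at bytes queued on the ports themselves) is to list the ports with the largest output queues:

```elixir
# Ports (including TCP sockets) with the most pending output, in bytes.
Port.list()
|> Enum.map(fn port -> {port, :erlang.port_info(port, :queue_size)} end)
|> Enum.filter(fn {_port, info} -> match?({:queue_size, _}, info) end)
|> Enum.sort_by(fn {_port, {:queue_size, size}} -> -size end)
|> Enum.take(10)
```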


It’s possible you are running up against the limits of your instance hardware for that number of concurrent presences. How partitioned are those 70k presences? I.e., do you have a few topics with tens of thousands of presences each, or many topics with far fewer presences per topic?

An update on the issue: the Erlang VM schedulers are quite busy when messages start to pile up, which I checked via :erlang.statistics(:run_queue). The exact reason is still unclear, though, since CPU utilization (and load average) stays consistently low.
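This is roughly how I’m sampling it, in case it helps (run from a remote console; the :scheduler module needs OTP 21+):

```elixir
# Sum of all scheduler run-queue lengths right now (what I was polling):
:erlang.statistics(:total_run_queue_lengths)

# Per-scheduler utilization over a 5-second window; this shows how busy the
# schedulers really are even when OS-level CPU% and load average look modest.
:scheduler.utilization(5)
```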