Latency spikes and presence timeouts

Hi,

I’ve got an Elixir application running Phoenix Channels in production, and I’m monitoring it with AppSignal.

At times my app gets these huge latency spikes (usually running within milliseconds but increasing to at least 50 seconds) with the channel tracking Presence timing out. The presence join isn’t doing anything special, just the basic track that you can get from the example documentation but I suspect that there’s something else causing the issue.

I attached an Observer to my production app and saw that there were a LOT of processes for Elixir.Phoenix.Channel.Server:init/1 though I’m not sure why.
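For reference, a rough way to get the same per-function process count from a remote IEx shell instead of Observer. This is only a sketch: the module name ProcCount is made up, and it assumes the label Observer shows comes from the :"$initial_call" entry that proc_lib-started processes (GenServers, channel processes) keep in their process dictionary.

```elixir
defmodule ProcCount do
  # Returns the n most common initial calls among live processes,
  # as a list of {mfa, count} tuples sorted by count, highest first.
  def top(n \\ 10) do
    Process.list()
    |> Enum.map(&initial_call/1)
    |> Enum.reject(&is_nil/1)
    |> Enum.frequencies()
    |> Enum.sort_by(fn {_mfa, count} -> count end, :desc)
    |> Enum.take(n)
  end

  # Processes started through :proc_lib store their real entry point under
  # :"$initial_call" in the process dictionary; fall back to the raw
  # :initial_call for everything else.
  defp initial_call(pid) do
    case Process.info(pid, [:dictionary, :initial_call]) do
      nil ->
        # The process exited between Process.list/0 and Process.info/2.
        nil

      info ->
        case List.keyfind(info[:dictionary], :"$initial_call", 0) do
          {_key, mfa} -> mfa
          nil -> info[:initial_call]
        end
    end
  end
end

IO.inspect(ProcCount.top(5), label: "most common initial calls")
```

Since every joined channel runs as its own process, seeing roughly one Phoenix.Channel.Server entry per client per joined topic would be expected rather than a leak; a count far above that would be worth a closer look.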

I’m deployed on Fly.io and these instances are occurring even with low traffic (maybe 100 connections at most on one server). I’ve checked all the other metrics on the hosts and they’re more than within comfortable limits.

Any help that someone could give me to help track down what may be the issue would be appreciated.

P.S. I’m only new to Phoenix/Elixir (maybe about 3 months of learning)

Can you share more details about:

  • do all clients connect to a common channel like “users” or “online” ?
  • are you intercepting messages in the channel ?
  • high level overview what you are doing with channel - fetching data or performing calculations or forwarding requests to GenServer.
  • did you check the process stats like are you seeing some spike in memory?
  • how are msgQs are they full when timeout happens ?
  • latency spikes resolve themselves after sometime or they become progressively worse?
  • do all clients connect to a common channel like “users” or “online” ?

No, all channels connect to their own subtopic.

  • are you intercepting messages in the channel?

I am getting incoming messages, but mostly just broadcasting them to other connected clients.

  • high level overview what you are doing with channel - fetching data or performing calculations or forwarding requests to GenServer.

My application has 4 channels. They mostly just use broadcast_from to send messages to other clients connected on the same topic, and none of them intercept outgoing messages. The only thing of note is that one channel, on one event, spins up a process to forward messages to DynamoDB. I use Task.start_link to start that process.
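One thing to keep in mind with Task.start_link is that the task is linked to the process that starts it, so a crash in the task also takes down the caller unless the caller traps exits. A minimal sketch of the unlinked alternative using Task.Supervisor follows. The names MyApp.TaskSupervisor and persist are hypothetical, and persist only sends a message back to the caller as a stand-in for the real DynamoDB write. Whether this has anything to do with the latency spikes is an open question.

```elixir
# Hypothetical supervisor name; in a real app this would normally be a child
# in the application's supervision tree: {Task.Supervisor, name: MyApp.TaskSupervisor}
{:ok, _sup} = Task.Supervisor.start_link(name: MyApp.TaskSupervisor)

caller = self()

# Stand-in for the DynamoDB write (an assumption for illustration only).
persist = fn message -> send(caller, {:persisted, message}) end

# start_child/2 runs the function under the supervisor without linking it to
# the caller, so a failing write would not crash the calling process.
{:ok, _task_pid} =
  Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
    persist.(%{"body" => "hello"})
  end)

persisted =
  receive do
    {:persisted, message} -> message
  after
    1_000 -> :timeout
  end

IO.inspect(persisted, label: "persisted")
```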

  • did you check the process stats like are you seeing some spike in memory?

There isn’t a spike in memory

  • how are msgQs are they full when timeout happens ?

My message queues are empty.

  • latency spikes resolve themselves after sometime or they become progressively worse?

They mostly just spike. I had an incident today, though, where they remained high when I tried to put my database process (the one I mentioned before) under a DynamicSupervisor (though I did have some other changes as well).

I thought that maybe it could be related to an influx of users, but the connection count stays the same when these spikes happen (it doesn’t spike). That said, there are more users online when the issues do happen.

EDIT: I forgot to mention that I’m also using a process in each channel (except one) to check the authentication of the user with Process.send_after but I’m making sure to cancel the timer in the called function
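The send_after pattern described here can be sketched roughly as below. It uses a plain GenServer so that it runs on its own. In a channel the same handle_info clause would live in the channel module and the timer reference would sit in the socket assigns. The module name AuthCheck, the :check_auth message, and the 50 ms interval are all made up for illustration, and the actual authentication check is replaced by a message to a notify pid.

```elixir
defmodule AuthCheck do
  use GenServer

  # Deliberately short interval (in ms) so the sketch finishes quickly.
  @interval 50

  def start_link(notify), do: GenServer.start_link(__MODULE__, notify)

  @impl true
  def init(notify) do
    {:ok, %{notify: notify, timer: schedule()}}
  end

  @impl true
  def handle_info(:check_auth, state) do
    # Cancel the previous timer before re-arming, so only one is ever pending.
    # For a timer that has already fired this simply returns false.
    Process.cancel_timer(state.timer)

    # Placeholder for the real authentication check.
    send(state.notify, :auth_checked)

    {:noreply, %{state | timer: schedule()}}
  end

  defp schedule, do: Process.send_after(self(), :check_auth, @interval)
end
```

Because the check runs inside the same process that handles channel messages, any slow call made from that handle_info clause would delay everything else queued for that channel.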

EDIT: Double checked my queues and they aren’t empty but the highest peak was 30. There have been smaller peaks as well but these also occur at times when there isn’t any latency
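For anyone wanting to repeat the queue check from a remote shell, a small sketch along these lines lists the processes with the longest mailboxes at the moment it runs. QueueCheck is a made-up name, and it only gives a point-in-time snapshot, so it would have to be run while a spike is in progress.

```elixir
defmodule QueueCheck do
  # Returns up to n {pid, queue_length} tuples for processes whose mailbox is
  # not empty, longest queue first.
  def top(n \\ 5) do
    queues =
      for pid <- Process.list(),
          # Process.info/2 returns nil for a process that has already exited;
          # the pattern in the generator silently skips those.
          {:message_queue_len, len} <- [Process.info(pid, :message_queue_len)],
          len > 0 do
        {pid, len}
      end

    queues
    |> Enum.sort_by(fn {_pid, len} -> len end, :desc)
    |> Enum.take(n)
  end
end

IO.inspect(QueueCheck.top(), label: "longest message queues")
```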