Phoenix Presence mailbox full

sb8244 · July 13, 2018, 8:32pm

For the past 2 days, my Phoenix.Presence tracker has been tipping over. I caught it today with up to 420k message in the mailbox, and error_logger was doing a ton of reductions due to timeouts to the presence process. I wanted to give some overview of my Presence usage and see if there’s something obvious:

After my channel is joined, I kick off my presence to track the user

{:ok, _} = ApiChannelPresence.track_user(socket, tenant_id: tenant_id, user_guid: user_guid)

# ApiChannelPresence

  def track_user(%Phoenix.Socket{} = socket, opts = [tenant_id: _, user_guid: _]) do
    id = identifier(opts)

    track(socket.channel_pid, id, id, %{online_at: inspect(System.system_time(:seconds))})
  end

  defp identifier(tenant_id: tenant_id, user_guid: user_guid) do
    to_string(tenant_id) <> ":" <> to_string(user_guid)
  end

I then query the Presence (maybe 1rps per server):

  def user_connection_count(opts = [tenant_id: _, user_guid: _]) do
    id = identifier(opts)

    list(id)
    |> Map.get(id, %{})
    |> Map.get(:metas, [])
    |> length()
  end

This is the extent of my presence usage. There are about 20k sockets across 5 kubernetes pods.

Does anything stand out as setting me up for this random crashing in the day? I can’t see an increase of RPM or anything else before the crash starts. Today it happened after a deploy, which booted the sockets. Yesterday it happened without any changes.

OvermindDL1 · July 13, 2018, 8:37pm

How much memory does the server have, how much memory is the BEAM using, how many physical cores? Are all the sockets connected to the same presence tracker? Same topics?

@chrismccord This might be interesting?

sb8244 · July 13, 2018, 8:54pm

The server is using k8s request limits which makes answering core / memory a bit tricky. The underlying nodes have 4 cores, and we give “1 cpu” as a k8s limit. We have a 1.5 GB memory limit. The nodes are currently in a healthy state and the one I’m looking at has:

595.6406 MB allocated mem
376.9602 MB used mem
4 logical processors / schedulers (1 CPU limit)

I don’t have allocated stats from the issue earlier today, but I can see that the total mem was 1.37 GB, with 1.148GB as process mem. The presence process had 300MB of memory consumed with a mailbox size of 352000. It’s current function was “:lists.foldl”

All sockets are connected to the same presence tracker. The presence topic is tenant:user, which means that every user has their own topic. The channel topic is slightly different name, but still split by tenant:user.

michalmuskala · July 16, 2018, 11:20am

First, I think it’s weird that you have just one CPU but 4 schedulers - it will lead to degraded performance. You should probably run with just one scheduler.

Apart from that, I think using the code from this PR should help you. https://github.com/phoenixframework/phoenix_pubsub/pull/79 It’s hard to say something more without seeing the application, but the high message queue stems from the presence process being overloaded - too many schedulers probably also add some overhead and the performance is less than it could be.

benwilson512 · July 16, 2018, 3:13pm

@sb8244 What process calls the user_connection_count function? If it’s your sockets, then you actually have closer to 20_000 requests per second on the presence system, not 1rps.

sb8244 · July 16, 2018, 5:01pm

@michalmuskala I think it’s a bit more complex than that. “1 CPU” is a kubernetes term but doesn’t actually mean 1 core. It means that there is 1 unit of CPU allocated to the application. This unit can combine across cores. I’m actually trying to reduce scheduler count to 2, hard coded, to see if that helps at all. I need to research a bit more into this. That PR looks like it would help out.

@benwilson512 the process that calls user_connection_count func is a genstage when new events are created in the system (roughly 500 up to 8k rpm). Interestingly, the 8k rpm comes from a backup in another part of the system (ruby sidekiq worker from the app that calls the elixir API endpoint) and may be contributing to an overwhelming behavior in the number of function calls made. The sockets are not calling user_connection_count.

sb8244 · July 16, 2018, 5:37pm

Some more info. I just ran into this again after a fresh deploy. I had a Presence process hitting 120k messages in mailbox…the entire cluster stopped operating effectively. I did Process.exit(pid, :kill) on the presence server. It booted back up and the cluster came back to normal on all elixir nodes.

One bit of info I realized which I hadn’t included. I am using the following code to probe, every 5s, the number of connected sockets local to a node.

defmodule MyWeb.ApiSocketCounter do
  def connected_socket_count() do
    acc = fn {channel, _}, counter ->
      case channel do
        "api_socket:" <> _ -> counter + 1
        _ -> counter
      end
    end

    :ets.foldl(acc, 0, NotificationService.PubSub.Local0)
  end

  def connected_socket_count(_exometer_metric), do: connected_socket_count()
end

sb8244 · July 16, 2018, 6:37pm

I’m unsure of best forum courtesy with respect to new posts. However, I’ve been troubleshooting this all day and have more info. I have confirmed that the scheduler count should be set to 4, as the underlying cpu concept is for throttling and not for how the cores are distributed. It will still have access to all 4 cores.

This only happens to 1 Elixir node per deployment. All other nodes seem to operate fine.
The node becomes overrun with messages that look like:

  {:"$gen_call", {#PID<0.30538.0>, #Reference<0.1499947089.2763784196.118656>},
   {:track, #PID<0.30538.0>, "topic",
    "topic", %{online_at: "1531765300"}}},

The message growth is large. Over 30k messages per minute by eye measurement
I am attempting to recreate it to get more info about the messages before my server boots me for OOM

Edit: The current process mailbox looks like:

pid = pid("0.2225.0")
messages = :erlang.process_info(pid, :messages) |> elem(1)

Enum.group_by(messages, fn msg -> 
  case msg do
    :heartbeat -> :heartbeat 
    tup -> elem(tup, 0)
  end
end) |> Enum.map(fn {k, l} -> 
  {k, length(l)}
end)

["$gen_call": 110884, EXIT: 18273, heartbeat: 1, pub: 1203]

DavidAntaramian · July 18, 2018, 2:13am

I discussed this issue with @sb8244 in the Elixir Slack, and pointed out the call to inspect/1 in ApiChannelPresence.track_user/2. A benchmark on my machine showed that using inspect/1 like this can slow down execution severely. @michalmuskala mentioned that inspect/1 (and IO.inspect/2) should not be used in production code like this because it’s designed to be pretty not fast. Stephen actually doesn’t need a binary here, which is what inspect/1 would produce, but if one was needed, Michał suggested using Integer.to_string/1.

@sb8244 Did this end up helping you at all?

dom · July 18, 2018, 9:32am

Can you call Process.info(process, :current_stacktrace) a few times on the tracker while it’s busy? That should give us a better idea what it’s doing.

sb8244 · July 18, 2018, 5:52pm

Came to a bit of a standstill because my redeployments actually caused another issue due to the sheer volume of sockets asking for authorization credentials. I’m going to pick up diagnosing this again once I’m able to deploy freely. It might be a couple of days to resolve that issue and get back to it.

My solution for now has been a watchdog process which kills the presence channel if it goes over 5000 messages in mailbox. I may need to remove this if it gets in the way of testing.

@DavidAntaramian I’ll see if the error still occurs when I’m able to start re-deploying frequently. I made another change in addition to removing inspect, which is to give 1.5 CPU units rather than 1 per node in the system.

@dom I will run this the next time I’m experimenting and see the problem.

benwilson512 · July 18, 2018, 6:36pm

How many pods are you running? I’d definitely consider aiming for a smaller number of pods with large CPU values vs smaller CPU values and more pods.

sb8244 · November 18, 2018, 9:34pm

I’ve recently been working on a load test for an open source push server that I’m building. One of the component of the load test has been testing different channel distributions.

When I run a topic using a unique ID, the test is able to max out my agents (64k conns per agent) without trouble (3 agents maxed out).

When I run a topic with a more constrained topic spread, I run into the Presence_shard0 timeout like this thread mentions. The example I have setup is hitting the ceiling very quickly with 100 topics, within 20k connections (1500 connections/s). The mailbox has over 70k messages usually when this occurs.

Nothing new to report on solution yet, but I do have this generically reproducible and easily tweakable. Unfortunately I’m in the process of open sourcing it, but I’m game to try out new ideas or share the problem over screenshare.

The software is DigitalOcean servers 8GB mem / 4vCPUs. When running the unique spread, My CPU maxes out at 50%. When running the constrained spread, I get 80%+ CPU.

sb8244 · November 19, 2018, 12:50am

I ended up discovering that the time was being spent in broadcasting the presence_diff to channels locally. I had been intercepting presence_diff in the channel and doing nothing, and I changed that to overriding handle_diff/2 in my Presence module.

The Presence still tracks safely across the cluster and can respond to questions like “is anyone there?”, but doesn’t broadcast out to channels.

sb8244 · December 5, 2018, 5:35pm

Final reply here. After discussing with Chris, it is advisable to use Phoenix.Tracker directly when you don’t want all features of Phoenix.Presence. The docs have been updated to reflect this: https://github.com/phoenixframework/phoenix/pull/3198