Hello everyone,
This is my first post here, so please excuse me if I have accidentally missed any guidelines.
At my current company, we have developed and released an Elixir application that collects user events and conditionally responds to them in real time at scale. These events are piped through Phoenix sockets/channels and/or HTTP APIs. We run 2 node replicas that usually auto-scale to 6 during peak traffic. Each node can hold up to 30k active socket connections that belong to ~10k unique users.
Since user events arrive from multiple processes across the cluster and node stickiness is not guaranteed, I’m currently working on centralizing the event pipeline for each user by establishing one global process per user across all nodes, for as long as that user has at least one active socket connection.
    # On new socket connection
    user_ref = {:global, "manager_#{user_id}"}
    case GenServer.start(__MODULE__, state, name: user_ref, spawn_opt: [fullsweep_after: 10]) do
      {:ok, _pid} ->
        GenServer.call(user_ref, {:connection, socket, params}, @connection_timeout)
      {:error, {:already_started, _pid}} ->
        # Another connection already started this user's process; reuse it
        GenServer.call(user_ref, {:connection, socket, params}, @connection_timeout)
      {:error, reason} ->
        # reason can be any term, so inspect/1 it instead of interpolating it directly
        {:error, "Could not start user process with error: #{inspect(reason)}"}
    end
    # On user event
    case GenServer.whereis(user_ref) do
      nil ->
        {:error, "Received an event for a user that has no global process registered"}
      _pid ->
        GenServer.call(user_ref, {:event, context}, @event_timeout)
    end
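One caveat with the lookup above: the process can exit between whereis/1 and the call, in which case the caller exits instead of getting an error tuple. Below is a minimal sketch of an alternative that skips the lookup and catches the exit. The module name `UserManagerSketch`, the `%{events: []}` state, and the 5-second timeout are all placeholders for illustration, not our production code.

```elixir
defmodule UserManagerSketch do
  use GenServer

  # Hypothetical timeout standing in for @event_timeout.
  @event_timeout 5_000

  def via(user_id), do: {:global, "manager_#{user_id}"}

  # Start the per-user process, or reuse the one that is already registered.
  def ensure_started(user_id) do
    case GenServer.start(__MODULE__, %{events: []}, name: via(user_id)) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
      {:error, reason} -> {:error, reason}
    end
  end

  # Call directly; a missing process or a lost node shows up as an exit.
  def send_event(user_id, context) do
    GenServer.call(via(user_id), {:event, context}, @event_timeout)
  catch
    :exit, {:noproc, _} -> {:error, :no_user_process}
    :exit, {{:nodedown, _node}, _} -> {:error, :no_user_process}
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:event, context}, _from, state) do
    # Placeholder handling: just remember the event.
    {:reply, :ok, %{state | events: [context | state.events]}}
  end
end
```

Timeouts and other exit reasons are deliberately not caught here, so they still crash the caller as before.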
Additionally, each user may have background tasks running. These are also globally named, after their session IDs, and unregister themselves when their work is done.
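    # :global.register_name/2 takes the bare name; the {:global, name} tuple is only for GenServer-style lookups
    task_name = "manager_#{user_id}_#{session_id}"
    task_ref = {:global, task_name}
    Task.Supervisor.start_child(
      :user_tasks_supervisor,
      fn ->
        :global.register_name(task_name, self())
        # do work
        :global.unregister_name(task_name)
      end,
      shutdown: @shut_down_interval,
      restart: :transient
    )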
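Since :global.register_name/2 returns :yes or :no, the task could also check the result and make sure the name is released even if the work raises. A rough sketch of that idea follows; `UserTaskSketch`, `run_registered/2`, and the `:already_running` error are hypothetical names used only for illustration.

```elixir
defmodule UserTaskSketch do
  # Runs `work` under a globally registered name and skips it if the name is taken.
  def run_registered(task_name, work) when is_function(work, 0) do
    case :global.register_name(task_name, self()) do
      :yes ->
        try do
          {:ok, work.()}
        after
          # Release the name whether the work returned or raised.
          :global.unregister_name(task_name)
        end

      :no ->
        # Another process in the cluster already holds this name.
        {:error, :already_running}
    end
  end
end
```

Inside the supervised function this would be called as `UserTaskSketch.run_registered("manager_#{user_id}_#{session_id}", fn -> ... end)`. Returning `{:error, :already_running}` is a normal exit, so a `:transient` task is not restarted for it.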
For each unique user, only one global registration will actually execute. Subsequent attempts from additional connections for the same user only result in a whereis/1 lookup, which should be fast and reliable as long as registration itself is not a bottleneck.
I’m worried that global registers/unregisters might become a bottleneck, especially since we do not yet have a proper connection draining mechanism for new releases (code changes). All connections on terminating pods are disconnected at once and the new pods receive the reconnections in bulk, which is why we avoid releasing during peak traffic.
I have tried to find if someone has already benchmarked the global module and found conflicting results so far:
- This post shows that Erlang’s global module seems to suffer in terms of registrations per second, as soon as you scale beyond 1 node. Numbers don’t look good at all.
- This study [Figure.5][Figure.6] does conclude that throughput and latency are affected by global commands, but not significantly below 10 nodes.
I ran our end-to-end stress testing framework against my change and did not see any significant throughput bottlenecks compared to our production code on 2 nodes.
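To look at :global in isolation from the rest of the stack, something like the micro-benchmark below might be a starting point. It is only a sketch: `GlobalBenchSketch` and the `bench_` name prefix are made up for illustration, and it registers names sequentially from a single process, which is not the concurrent reconnect pattern described above. Run from a node that is connected to the rest of the cluster, the register and unregister phases should include the cross-node coordination cost.

```elixir
defmodule GlobalBenchSketch do
  # Registers `count` names for the calling process, looks them up,
  # unregisters them, and reports the time of each phase in microseconds.
  def run(count) do
    names = for i <- 1..count, do: "bench_#{i}"

    {register_us, _} =
      :timer.tc(fn ->
        Enum.each(names, fn name -> :yes = :global.register_name(name, self()) end)
      end)

    {lookup_us, _} =
      :timer.tc(fn ->
        Enum.each(names, fn name -> true = is_pid(:global.whereis_name(name)) end)
      end)

    {unregister_us, _} =
      :timer.tc(fn -> Enum.each(names, &:global.unregister_name/1) end)

    %{
      count: count,
      nodes: [node() | Node.list()],
      register_us: register_us,
      lookup_us: lookup_us,
      unregister_us: unregister_us
    }
  end
end
```

Calling `GlobalBenchSketch.run(10_000)` on 1, 2, and 6 connected nodes and comparing the three timings would at least show how each operation scales with cluster size.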
I was wondering if anyone has experience with Erlang’s global module for registering and looking up a large number of processes across a cluster, or could point me in a direction to verify this further.
Many thanks!