Load testing: Struggling to get more than 50K-60K connections to a Phoenix Channel using Tsung

I’ve been trying to load test Phoenix Channels for an app I’m working on for a client, using Tsung. I can’t get past 50K-60K connections at a time. I feel like I’m missing something simple, but I’ve been struggling with this for a while now.

EDIT: I also just tried it with a brand new phoenix project:
https://github.com/praveenperera/bare_channel

Here is my info:

Tsung setup:
Version: 1.6.0
5 worker nodes: DigitalOcean - 16GB 6vCPUs
1 controller node: DigitalOcean - 16GB 6vCPUs

Application server setup:
DigitalOcean - 8GB 4vCPUs also tried 16GB 6vCPUs

Elixir: 1.8.1
Phoenix: 1.4.6

I removed all code from the channel. Now on join it just sends back an :ok response with an empty payload.

What I’ve tried so far:

  1. Increasing limits on all the workers, the controller and the node running the phoenix application. I used this script to do so:
#!/bin/bash

#limits
sudo sysctl -w fs.file-max=12000500;
sudo sysctl -w fs.nr_open=20000500;
ulimit -n 20000001;
sudo sysctl -w net.ipv4.tcp_mem='10000000 10000000 10000000';
sudo sysctl -w net.ipv4.tcp_rmem='1024 4096 16384';
sudo sysctl -w net.ipv4.tcp_wmem='1024 4096 16384';
sudo sysctl -w net.core.rmem_max=16384;
sudo sysctl -w net.core.wmem_max=16384;

echo "fs.file-max = 1048576" >> /etc/sysctl.conf
echo "# limits
* soft     nproc          1048576
* hard     nproc          1048576
* soft     nofile         1048576
* hard     nofile         1048576
root soft     nproc          1048576
root hard     nproc          1048576
root soft     nofile         1048576
root hard     nofile         1048576
" >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
sysctl -p

touch /root/setup_log
echo "COMPLETE AT: $(date)" >> /root/setup_log
  2. Different versions of Tsung, different settings.
  3. Tried websocket instead of tcp in Tsung, but it wasn’t sending messages properly
  4. Switched over to using releases instead of doing mix phx.server
  5. Removed all application code, only replying with an :ok and empty payload
  6. Deploying a new version straight on the VPS, avoiding docker and k8s
  7. Turned logger level to warn
<?xml version="1.0"?>
<tsung loglevel="warning" version="1.0">
  <clients>
    <client host="tsung-worker-1" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-2" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-3" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-4" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-5" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
  </clients>
  <servers>
    <server host="134.209.166.67" port="9962" type="tcp"/>
  </servers>
  <load>
    <arrivalphase phase="1" duration="300" unit="second">
      <users maxnumber="2000000" arrivalrate="2000" unit="second"/>
    </arrivalphase>
  </load>
  <options>
    <option name="ports_range" min="1025" max="65535"/>
  </options>
  <sessions>
    <session name="websocket" probability="100" type="ts_websocket">
      <request>
        <websocket type="connect" path="/socket/v1/websocket?vsn=2.0.0"/>
      </request>
      <request>
        <websocket type="message">          ["1", "1", "payload:url!!!commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4","phx_join", {}]        </websocket>
      </request>
      <for var="i" from="1" to="1000" incr="1">
        <thinktime value="30"/>
      </for>
    </session>
  </sessions>
</tsung>

Results:

I am only hitting about 60% CPU usage and 50% memory usage. Does anyone know how many connections I should reasonably expect on a server this size?

Have you looked into tweaking the underlying HTTP server parameters at all? The http and https config options for Phoenix.Endpoint allow passing options to Plug.Cowboy. Options like max_connections under transport_options sound relevant…

side note, when doing Hadoop clusters some things are often recommended

  1. hard code the dns ip of all the nodes in cluster in the /etc/hosts of each machine (avoids lookups)
  2. make sure the nodes are topologically close to each other (few hops)
  3. make sure the connections are not shared, i.e. some kind of isolated network
  4. make sure the network connections between nodes are fast

did you try to run the cluster on a single machine?

thanks for sharing btw, very inspiring and interesting

Thanks for the suggestion. I had some time late at night today, so I tried changing it in the endpoint. I feel like this isn’t the right place to do the config; I will investigate tomorrow: https://github.com/praveenperera/bare_channel/commit/b6e537e41778ce130451f9438f485cf4ce64c365.

So far no effect.

I also noticed that the first plateau always seems to happen at 60 seconds. I changed the Tsung arrival rate to 1000/sec (down from 3000/sec), and the first plateau again happened at 60 seconds, but this time with only about 12,000 connected. It levels off around 17,500 after 150 seconds.

Really makes me wonder if the problem is with the tsung configuration and not phoenix itself.

You’re right, Endpoint doesn’t take transport_options, so that doesn’t do anything. It’s a config option for cowboy, so it’s passed as part of http.

Here are the docs for Endpoint

https://hexdocs.pm/phoenix/Phoenix.Endpoint.html

and here are the docs for Plug.Cowboy

https://hexdocs.pm/plug_cowboy/Plug.Cowboy.html

As you can see from the second doc page, the default limit to max_connections is set to 16_384.

Ranch docs say I could also set it to :infinity. Are there any weird side effects if I do so or choose a higher number than the default, e.g. 30_000?

:infinity sounds like an invitation for DDoS, but I doubt setting it to a higher number would cause any issues if your system can handle the load.

Good point. My use case is not related to websockets, but as I understand it, this setting applies to all requests. My project needs to handle several thousand mobile apps that send updates on a regular basis, so I am trying to prepare it as well as possible, and it seems like a good idea to increase this setting ahead of time.

Oops that was a dumb late night mistake, I fixed it this morning: https://github.com/praveenperera/bare_channel/commit/6152884c8c59e01849f1eef242f03a53891ab133

config :bare_channel, BareChannelWeb.Endpoint,
  ...
  http: [:inet6, port: 9962, transport_options: [max_connections: 1_048_576, num_acceptors: 1000]],
  ...
iex(1)> Application.get_env(:bare_channel, BareChannelWeb.Endpoint)
[
  render_errors: [view: BareChannelWeb.ErrorView, accepts: ["json"]],
  pubsub: [name: BareChannel.PubSub, adapter: Phoenix.PubSub.PG2],
  url: [host: "example.com", port: 80],
  check_origin: false,
  cache_static_manifest: "priv/static/cache_manifest.json",
  http: [
    :inet6,
    {:port, 9962},
    {:transport_options, [max_connections: 1048576, num_acceptors: 1000]}
  ],
  secret_key_base: "4+x4ETzHP43Kwt0HKS7X6md8kSKqvBLYxKLsp3QltmJwMkfFd3mRKl60Ay9KQgiPu8TGR23Nl3UG00ANM7ZxM3BWtugdiXHnifHmcysezKENcMHM73Jc1FOVapyE/iga3qyT1Q7PwH+YH"
]

But same result, there is a plateau at 60 seconds. I’m assuming this option has no effect on websockets?

First picture is with an arrival rate of 1000. Second one is with an arrival rate of 2000.

The 60-second plateau seems quite strange. What happens if you increase the arrival rate 10x? Also, you have 5 testing servers, and the connection count is around 10k * server_count.

Could you try to set up another completely separate set of testing servers and start stress testing at the same time on both sets of servers? This should eliminate Tsung from bottleneck suspects.

60 seconds is often FIN_WAIT, and if you only had one node running Tsung I would say “aha! you’re running into port exhaustion”, but you have 5 nodes, so unless all 5 Tsung nodes somehow come at your target server through the same gateway, in other words a single IP, that’s not it. But this 60 second thing seems too consistent to be coincidence, so I’d look into that, even if it’s not clear how it could be the problem.
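
For reference, the port-exhaustion ceiling can be worked out from the Tsung config earlier in the thread. A TCP connection is identified by (source IP, source port, destination IP, destination port), so one source IP can hold at most one connection per local port to a single server IP and port. A rough sketch, assuming each worker presents exactly one source IP (the `scan="true"` setting may pick up more) and that the `ports_range` of 1025 to 65535 is fully usable; the function names are made up for illustration:

```python
def ports_per_source_ip(port_min: int, port_max: int) -> int:
    """Number of local ports one source IP can use (inclusive range)."""
    return port_max - port_min + 1


def max_connections_to_one_endpoint(source_ips: int, port_min: int, port_max: int) -> int:
    """Upper bound on concurrent connections to a single server IP:port."""
    return source_ips * ports_per_source_ip(port_min, port_max)


# Values taken from the Tsung config in this thread.
PORT_MIN, PORT_MAX = 1025, 65535

# Assumption: one source IP per worker, or all workers behind one gateway IP.
single_ip_limit = max_connections_to_one_endpoint(1, PORT_MIN, PORT_MAX)
# Assumption: five workers, each with its own distinct source IP.
five_ip_limit = max_connections_to_one_endpoint(5, PORT_MIN, PORT_MAX)

print(single_ip_limit)  # 64511
print(five_ip_limit)    # 322555
```

So a single source IP tops out at about 64.5K connections, while five distinct source IPs would allow roughly 322K against the one server port.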

would think 60 seconds is the default timeout for Phoenix Channels (or rather the socket https://github.com/phoenixframework/phoenix/blob/master/lib/phoenix/transports/websocket.ex#L9) … since you are not sending heartbeats, the channels will close down - pretty much at the same rate as new ones are added - thus the flatline…
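
To make the timing concrete, here is a toy model of that idle timeout. It assumes the server drops a websocket once `timeout` seconds pass without receiving any data (60 seconds by default, per the Phoenix docs) and ignores network latency. The function name and the message schedules are illustrative, not taken from Tsung or Phoenix:

```python
def server_close_time(inbound_message_times, timeout_s=60.0):
    """Time (in seconds since connect) at which an idle-timeout server
    would drop the connection, given when the client sent messages."""
    if not inbound_message_times:
        raise ValueError("need at least one message time (e.g. the join)")
    return max(inbound_message_times) + timeout_s


# Session from the original config: join at about t=0, then only think time.
join_only = [0.0]

# Join at t=0, then a heartbeat every 10 s for ten iterations.
with_heartbeats = [0.0] + [10.0 * i for i in range(1, 11)]

print(server_close_time(join_only))        # 60.0
print(server_close_time(with_heartbeats))  # 160.0
```

Under this model every join-only session is dropped 60 seconds after it connects, which lines up with the plateau appearing at the 60 second mark. A session that keeps sending heartbeats stays up until 60 seconds after its last one.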

see https://gist.githubusercontent.com/Gazler/53b842764f778fe57757/raw/9509c3d980f13bbb739f4ae117dc84ef1d721076/phoenix.xml which might help (it sends the heartbeats…)- though I’m searching for a more recent config.

      <for var="i" from="1" to="10" incr="1">
        <thinktime value="10"/>
        <request>
          <websocket ack="no_ack" type="message">{"topic":"phoenix","event":"heartbeat","payload":{},"ref":"3"}</websocket>
        </request>
      </for>

alternatively:
set the phoenix channel timeout to infinity… eg
user_socket.ex:
transport(:websocket, Phoenix.Transports.WebSocket, timeout: :infinity)
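
One thing to watch when adapting the heartbeat request: the Tsung config in this thread connects with `vsn=2.0.0`, and its join message is already written as an array. As far as I know, under that protocol version Phoenix's V2 JSON serializer expects every message as a `[join_ref, ref, topic, event, payload]` array, whereas the heartbeat snippet above uses the older object form. A small sketch of building a V2 heartbeat frame; the helper names are hypothetical and the ref value is arbitrary:

```python
import json


def phoenix_v2_frame(join_ref, ref, topic, event, payload=None):
    """Serialize a message as [join_ref, ref, topic, event, payload]."""
    body = payload if payload is not None else {}
    return json.dumps([join_ref, ref, topic, event, body], separators=(",", ":"))


def heartbeat_frame(ref):
    """Heartbeats go to the reserved "phoenix" topic and carry no join_ref."""
    return phoenix_v2_frame(None, str(ref), "phoenix", "heartbeat")


print(heartbeat_frame(2))
# [null,"2","phoenix","heartbeat",{}]
```

The printed string is what would go in the body of the `<websocket type="message">` request inside the heartbeat loop.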

Oh wow, that was it! Thanks a lot. I had the heartbeat message in an earlier test configuration, but I was having the same problem. It might have been caused by something else at the time. But it seems to be working now! I had a feeling it was going to be something simple that I overlooked.

Thanks for everyone’s input and help!

The commit that fixed it: https://github.com/praveenperera/bare_channel/commit/00d56537653867a2143ce5d486d00064374ff9c0

NOTE: I am getting different errors and crashes around 140 secs now, but I can definitely work through them.

I will report back, I am curious to see what it takes to max out this box (8GB, 4CPU $40/month)

Update: By increasing the max number of processes I was able to max out the CPU and Memory
MIX_ENV=prod elixir --erl "+P 5000000" -S mix phx.server

Maxed out at 271,092 connected channels which is great. This is an empty channel with nothing being returned. But this will be my benchmark.

I was able to run this test at an arrival rate of 10,000

side note: currently trying to figure out how to increase max number of processes when using releases
edit: nevermind figured it out, just add +P 5000000 to the vm.args file generated by distillery
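
A back-of-the-envelope check on why the process limit matters here. The sketch assumes two Erlang processes per connected client (one for the socket/transport, one for the channel), which is my guess rather than a measured figure, and it assumes a default limit of 262,144, which I believe was the `+P` default on OTP releases of that era. The function names are made up:

```python
# Assumption: default +P on the OTP releases contemporary with Elixir 1.8.
DEFAULT_PROCESS_LIMIT = 262_144


def processes_needed(connections, per_connection=2, baseline=0):
    """Rough process count: per-connection processes plus a fixed baseline.
    per_connection=2 is an assumption (transport process + channel process)."""
    return connections * per_connection + baseline


def fits(connections, process_limit, per_connection=2, baseline=0):
    """True if the estimated process count stays within the VM limit."""
    return processes_needed(connections, per_connection, baseline) <= process_limit


peak = 271_092  # connected channels reported above

print(processes_needed(peak))             # 542184
print(fits(peak, DEFAULT_PROCESS_LIMIT))  # False
print(fits(peak, 5_000_000))              # True
```

Under those assumptions the reported peak would need more than twice the default limit, while `+P 5000000` leaves plenty of headroom.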

Would be great to test on Kubernetes to find out how it compares with bare-metal performance :)

That’s actually going to be my next test. I will post the results.

I ran two tests

  1. On VPS (no k8s) using distillery release: https://github.com/praveenperera/bare_channel/commit/ba93e01317b092cfd494b3685a54d7f883099aa6

  2. On K8s cluster using distillery release: https://github.com/praveenperera/bare_channel/commit/d540f6845f6a0d6fb85ea5835e6d18fdeb9400de

I feel like I’m probably missing something again because this is a huge drop. I think I’m going to do the rest of my testing on the VPS and come back to trying to improve performance on the K8s cluster.

More info on cluster:

  • Running on DigitalOcean managed K8s (DOKS)
  • 3 8GB/4vCPU worker nodes
  • Workload running on one node without any other workloads
  • Tsung connects directly to the node using a NodePort service

One thing I can’t grasp is what “users” means. I understand “open connections”, but how do they relate to users?

Oh, good question. I found the relevant part of Tsung’s manual.

  • users Number of simultaneous users (its session has started, but not yet finished).
  • connected number of users with an opened TCP/UDP connection (example: for HTTP, during a think time, the TCP connection can be closed by the server, and it won’t be reopened until the thinktime has expired). new in 1.2.2 .

Wild guess: if the server retains session data for users for a while after their previous connection closed, it still induces load on the server, and that would explain why there are more users than connections. I’ve never used Tsung, though, and haven’t looked too closely at how this experiment is configured.

Hey. A colleague just sent me this post. I’d be happy to look into it more, although it’s late here, so I’ve only given it a cursory glance so far.

I documented my tsung configuration for pushex at https://github.com/pushex-project/pushex/tree/master/examples/load-test. I was able to hit 100ks active connections on this and never tried to fully max it out. This configuration may be of help as you look into it. Happy to help dig in further if you check it out and determine the bottleneck isn’t tsung.

The max connections per worker was something like 60-90k, so you should be able to hit 300k active connections easily.

If you’re using additions on top of base Phoenix then they could be bottlenecks but will most likely appear in your CPU traces

Edit: I see the root issue is k8s vs VPS now. Making new reply to comment on that
