Load testing: Struggling to get more than 50K-60K connections to a Phoenix Channel using Tsung

praveenperera · May 22, 2019, 10:53pm

I’ve been trying to load test Phoenix channels for an app I’ve been working on for a client, and I’m using Tsung to do so. I’m struggling to get more than 50K-60K connections at a time. I feel like I’m missing something simple, but I’ve been struggling with this for a while now.

EDIT: I also just tried it with a brand new phoenix project:
GitHub - praveenperera/bare_channel

Here is my info:

Tsung setup:
Version: 1.6.0
5 worker nodes: DigitalOcean - 16GB 6vCPUs
1 controller node: DigitalOcean - 16GB 6vCPUs

Application server setup:
DigitalOcean - 8GB 4vCPUs also tried 16GB 6vCPUs

Elixir: 1.8.1
Phoenix: 1.4.6

I removed all code from the channel. Now on join it just sends back an :ok response with an empty payload.

What I’ve tried so far:

Increasing limits on all the workers, the controller and the node running the phoenix application. I used this script to do so:

#!/bin/bash

#limits
sudo sysctl -w fs.file-max=12000500;
sudo sysctl -w fs.nr_open=20000500;
ulimit -n 20000001;
sudo sysctl -w net.ipv4.tcp_mem='10000000 10000000 10000000';
sudo sysctl -w net.ipv4.tcp_rmem='1024 4096 16384';
sudo sysctl -w net.ipv4.tcp_wmem='1024 4096 16384';
sudo sysctl -w net.core.rmem_max=16384;
sudo sysctl -w net.core.wmem_max=16384;

echo "fs.file-max = 1048576" >> /etc/sysctl.conf
echo "# limits
* soft     nproc          1048576
* hard     nproc          1048576
* soft     nofile         1048576
* hard     nofile         1048576
root soft     nproc          1048576
root hard     nproc          1048576
root soft     nofile         1048576
root hard     nofile         1048576
" >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
sysctl -p

touch /root/setup_log
echo "COMPLETE AT: $(date)" >> /root/setup_log

Different versions of tsung, different settings.
Tried the websocket instead of tcp on tsung but it wasn’t sending messages properly
Switched over to using releases instead of doing mix phx.server
Removed all application code, only replying with an :ok and empty payload
Deploying a new version straight on the VPS, avoiding docker and k8s
Turned logger level to warn

<?xml version="1.0"?>
<tsung loglevel="warning" version="1.0">
  <clients>
    <client host="tsung-worker-1" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-2" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-3" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-4" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
    <client host="tsung-worker-5" use_controller_vm="false" maxusers="64000" cpu="6">
      <ip scan="true" value="eth0"/>
    </client>
  </clients>
  <servers>
    <server host="134.209.166.67" port="9962" type="tcp"/>
  </servers>
  <load>
    <arrivalphase phase="1" duration="300" unit="second">
      <users maxnumber="2000000" arrivalrate="2000" unit="second"/>
    </arrivalphase>
  </load>
  <options>
    <option name="ports_range" min="1025" max="65535"/>
  </options>
  <sessions>
    <session name="websocket" probability="100" type="ts_websocket">
      <request>
        <websocket type="connect" path="/socket/v1/websocket?vsn=2.0.0"/>
      </request>
      <request>
        <websocket type="message">          ["1", "1", "payload:url!!!commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4","phx_join", {}]        </websocket>
      </request>
      <for var="i" from="1" to="1000" incr="1">
        <thinktime value="30"/>
      </for>
    </session>
  </sessions>
</tsung>

Results:

I am only hitting about 60% CPU usage and 50% memory usage. Does anyone know how many connections I should reasonably expect on a server this size?

al2o3cr · May 23, 2019, 1:34am

Have you looked into tweaking the underlying HTTP server parameters at all? The http and https config options for Phoenix.Endpoint allow passing options to Plug.Cowboy. Options like max_connections under transport_options sound relevant…

niccolox · May 23, 2019, 3:53am

side note, when doing Hadoop clusters some things are often recommended

hard code the dns ip of all the nodes in cluster in the /etc/hosts of each machine (avoids lookups)
make sure the nodes are topologically close to each other (few hops)
make sure the connections are not shared, i.e. some kind of isolated network
make sure the network connections between nodes are fast

did you try to run the cluster on a single machine?

thanks for sharing btw, very inspiring and interesting

praveenperera · May 23, 2019, 6:07am

Thanks for the suggestion. I had some time late at night today, so I tried changing it in the endpoint. I feel like this isn’t the right place to do the config; I will investigate tomorrow: Set max_connections option on endpoint · praveenperera/bare_channel@b6e537e · GitHub.

So far no effect.

I did also notice that the first plateau always seems to happen at 60 seconds. I changed the Tsung arrival rate to 1000/sec (down from 3000/sec). And the first plateau again happened at 60 seconds but this time with only about ~12,000 connected. And it levels off around ~17,500 after 150 seconds.

Really makes me wonder if the problem is with the tsung configuration and not phoenix itself.

jola · May 23, 2019, 7:27am

You’re right, Endpoint doesn’t take transport_options, so that doesn’t do anything. It’s a config option for cowboy, so it’s passed as part of http.

Here are the docs for Endpoint

https://hexdocs.pm/phoenix/Phoenix.Endpoint.html

and here are the docs for Plug.Cowboy

https://hexdocs.pm/plug_cowboy/Plug.Cowboy.html

As you can see from the second doc page, the default limit to max_connections is set to 16_384.

Phillipp · May 23, 2019, 7:48am

Ranch docs say I could also set it to :infinity. Are there any weird side effects if I do so or choose a higher number than the default, e.g. 30_000?

LostKobrakai · May 23, 2019, 7:54am

:infinity sounds like an invitation for DDoS, but I doubt setting it to a higher number would cause any issues if your system can handle the load.

Phillipp · May 23, 2019, 7:57am

Good point. My use case is not related to websockets, but as I understand it, this setting applies to all requests. My project needs to handle several thousand mobile apps that send updates on a regular basis, so I am trying to prepare it as well as possible, and it seems like a good idea to increase this setting ahead of time.

praveenperera · May 23, 2019, 1:18pm

Oops that was a dumb late night mistake, I fixed it this morning: Set transport options in the correct place in endpoint options · praveenperera/bare_channel@6152884 · GitHub

config :bare_channel, BareChannelWeb.Endpoint,
  ...
  # transport_options is a key inside the :http keyword list, next to :port
  http: [:inet6, port: 9962, transport_options: [max_connections: 1_048_576, num_acceptors: 1000]],
  ...

iex(1)> Application.get_env(:bare_channel, BareChannelWeb.Endpoint)
[
  render_errors: [view: BareChannelWeb.ErrorView, accepts: ["json"]],
  pubsub: [name: BareChannel.PubSub, adapter: Phoenix.PubSub.PG2],
  url: [host: "example.com", port: 80],
  check_origin: false,
  cache_static_manifest: "priv/static/cache_manifest.json",
  http: [
    :inet6,
    {:port, 9962},
    {:transport_options, [max_connections: 1048576, num_acceptors: 1000]}
  ],
  secret_key_base: "4+x4ETzHP43Kwt0HKS7X6md8kSKqvBLYxKLsp3QltmJwMkfFd3mRKl60Ay9KQgiPu8TGR23Nl3UG00ANM7ZxM3BWtugdiXHnifHmcysezKENcMHM73Jc1FOVapyE/iga3qyT1Q7PwH+YH"
]

But same result, there is a plateau at 60 seconds. I’m assuming this option has no effect on websockets?

First picture is with an arrival rate of 1000. Second one is with an arrival rate of 2000.

achempion · May 23, 2019, 3:08pm

A 60-second plateau seems quite strange. What happens if you increase the arrival rate 10x? Also, you have 5 testing servers and the connection count is also around 10k * server_count.

Could you try to set up another completely separate set of testing servers and start stress testing at the same time on both sets of servers? This should eliminate Tsung from bottleneck suspects.

sribe · May 23, 2019, 4:04pm

60 seconds is often FIN_WAIT, and if you only had one node running Tsung I would say “aha! you’re running into port exhaustion”, but you have 5 nodes, so unless all 5 Tsung nodes somehow come at your target server through the same gateway, in other words a single IP, that’s not it. But this 60 second thing seems too consistent to be a coincidence, so I’d look into that, even if it’s not clear how it could be the problem.
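
To put numbers on the port exhaustion idea: a TCP connection is identified by source IP, source port, destination IP and destination port, so against a single server IP and port each source IP can hold at most one connection per source port. A rough sketch of that ceiling, using the `ports_range` from the Tsung config above and assuming one source IP per worker (the `PortBudget` module is a hypothetical helper, not part of Tsung or Phoenix):

```elixir
defmodule PortBudget do
  # Hypothetical helper for illustration only.

  # Number of usable source ports in an inclusive range.
  def ports_per_ip(min_port, max_port), do: max_port - min_port + 1

  # Upper bound on concurrent connections to one server IP:port,
  # assuming every worker connects from `ips_per_worker` distinct source IPs.
  def max_connections(workers, ips_per_worker, min_port, max_port) do
    workers * ips_per_worker * ports_per_ip(min_port, max_port)
  end
end

# ports_range min="1025" max="65535" from the Tsung config
IO.inspect(PortBudget.ports_per_ip(1025, 65_535))          # 64511
IO.inspect(PortBudget.max_connections(5, 1, 1025, 65_535)) # 322555
```

If all five workers really were hidden behind a single gateway IP, the ceiling would drop to the single-IP figure of 64,511, which is why that scenario is worth ruling out.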

outlog · May 23, 2019, 4:11pm

would think 60 seconds is the default timeout for Phoenix Channels (or rather the socket https://github.com/phoenixframework/phoenix/blob/master/lib/phoenix/transports/websocket.ex#L9) … since you are not sending heartbeats, the channels will close down - pretty much at the same rate as new ones are added - thus the flatline…

see https://gist.githubusercontent.com/Gazler/53b842764f778fe57757/raw/9509c3d980f13bbb739f4ae117dc84ef1d721076/phoenix.xml which might help (it sends the heartbeats…)- though I’m searching for a more recent config.

      <for var="i" from="1" to="10" incr="1">
        <thinktime value="10"/>
        <request>
          <websocket ack="no_ack" type="message">{"topic":"phoenix","event":"heartbeat","payload":{},"ref":"3"}</websocket>
        </request>
      </for>

alternatively:
set the phoenix channel timeout to infinity… eg
user_socket.ex:
transport(:websocket, Phoenix.Transports.WebSocket, timeout: :infinity)
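
On Phoenix 1.4 (the version used in this thread) the `transport/3` macro in user_socket.ex is deprecated, and transport options are passed to the `socket` declaration in the endpoint instead. A minimal sketch of the same timeout change in that style, assuming the socket is mounted at `/socket/v1` (inferred from the Tsung path above) and that the socket module is called `BareChannelWeb.UserSocket` (an assumed name):

```elixir
# lib/bare_channel_web/endpoint.ex (sketch; the UserSocket module name is an assumption)
defmodule BareChannelWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :bare_channel

  socket "/socket/v1", BareChannelWeb.UserSocket,
    # The default is 60_000 ms without incoming data; :infinity disables the idle timeout.
    websocket: [timeout: :infinity],
    longpoll: false
end
```

This is convenient for a load test where the client sends nothing after the join. Outside of a benchmark, keeping a finite timeout and having clients send heartbeats is probably the safer choice, since idle or dead connections then get cleaned up.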

praveenperera · May 24, 2019, 2:06am

Oh wow, that was it! Thanks a lot. I had the heartbeat message in an earlier test configuration but I was having the same problem. Might have been caused by something else at the time. But it seems to be working now! I had a feeling it was going to be something simple that I overlooked.

Thanks for everyone’s input and help!

The commit that fixed it: Set websocket timeout to infinity · praveenperera/bare_channel@00d5653 · GitHub

NOTE: I am getting different errors and crashes around 140 secs now, but I can definitely work through them.

I will report back, I am curious to see what it takes to max out this box (8GB, 4CPU $40/month)

praveenperera · May 24, 2019, 2:32pm

Update: By increasing the max number of processes I was able to max out the CPU and Memory
MIX_ENV=prod elixir --erl "+P 5000000" -S mix phx.server

Maxed out at 271,092 connected channels which is great. This is an empty channel with nothing being returned. But this will be my benchmark.

I was able to run this test at an arrival rate of 10,000

~~side note: currently trying to figure out how to increase max number of processes when using releases~~
edit: nevermind figured it out, just add +P 5000000 to the vm.args file generated by distillery
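
For anyone repeating this: the BEAM's default process limit is 262,144, and each websocket connection uses one transport process plus one process per joined channel, so a test like this one can hit the default limit well before CPU or memory run out. A small sketch for checking that the `+P` flag took effect, run from iex or a remote console attached to the node (the VM may pick a larger limit than the number requested):

```elixir
# Current process limit and live process count of the running VM.
limit = :erlang.system_info(:process_limit)
count = :erlang.system_info(:process_count)

IO.puts("processes: #{count} of #{limit} (#{Float.round(count / limit * 100, 1)}% used)")
```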

achempion · May 24, 2019, 3:47pm

Would be great to test on Kubernetes to find out how it affects performance compared to bare metal.

praveenperera · May 24, 2019, 5:20pm

That’s actually going to be my next test. I will post the results.

praveenperera · May 24, 2019, 9:33pm

I ran two tests

On VPS (no k8s) using distillery release: Add and configure distillery for releases · praveenperera/bare_channel@ba93e01 · GitHub

On K8s cluster using distillery release: Add dockerfile and release script · praveenperera/bare_channel@d540f68 · GitHub

Docker image: praveenperera/bare_channel:master

I feel like I’m probably missing something again because this is a huge drop. I think I’m going to do the rest of my testing on the VPS and come back to trying to improve performance on the K8s cluster.

More info on cluster:

Running on DigitalOcean managed K8s (DOKS)
3 8GB/4vCPU worker nodes
Workload running on one node without any other workloads
Tsung connects directly to the node using Nodeport service

achempion · May 24, 2019, 10:07pm

One thing I can’t grasp is what “users” means. I understand “open connections”, but how do they relate to users?

jmitchell · May 25, 2019, 1:06am

Oh, good question. I found the relevant part of Tsung’s manual.

users Number of simultaneous users (it’s session has started, but not yet finished).

connected number of users with an opened TCP/UDP connection (example: for HTTP, during a think time, the TCP connection can be closed by the server, and it won’t be reopened until the thinktime has expired). new in 1.2.2 .

Wild guess: if the server retains session data for users for a while after their previous connection closed, it still induces load on the server, which would explain why there are more users than connections. I’ve never used Tsung, though, and haven’t looked too closely at how this experiment is configured.

sb8244 · May 25, 2019, 3:51am

Hey. A colleague just sent me this post. I’d be happy to look into it more although it’s late here so just gave a cursory glance so far.

I documented my tsung configuration for pushex at https://github.com/pushex-project/pushex/tree/master/examples/load-test. I was able to hit 100ks active connections on this and never tried to fully max it out. This configuration may be of help as you look into it. Happy to help dig in further if you check it out and determine the bottleneck isn’t tsung.

The max connections per worker was something like 60-90k, so you should be able to hit 300k active connections easily.

If you’re using additions on top of base Phoenix then they could be bottlenecks but will most likely appear in your CPU traces

Edit: I see the root issue is k8s vs VPS now. Making new reply to comment on that