I’ve been trying to load test Phoenix channels for an app I’ve been working on for a client, using Tsung to do so. I’m struggling to get more than 50K-60K connections at a time. I feel like I’m missing something simple, but I’ve been stuck on this for a while now.
I am only hitting about 60% CPU usage and 50% memory usage. Does anyone know how many connections I should reasonably expect on a server this size?
Have you looked into tweaking the underlying HTTP server parameters at all? The http and https config options for Phoenix.Endpoint let you pass options through to Plug.Cowboy. Options like max_connections under transport_options sound relevant…
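Not sure it’s your bottleneck, but for reference this is roughly where those knobs live. A minimal sketch, assuming the default Plug.Cowboy/Ranch adapter; :my_app and MyAppWeb.Endpoint are placeholders for your app and endpoint module:

```elixir
# config/prod.exs — sketch only; :my_app and MyAppWeb.Endpoint are placeholders.
config :my_app, MyAppWeb.Endpoint,
  http: [
    port: 4000,
    transport_options: [
      # Ranch caps concurrent connections per listener; raise it well above
      # the number of sockets you expect the load test to open.
      max_connections: 300_000,
      # More acceptor processes help sustain a high connect rate.
      num_acceptors: 100
    ]
  ]
```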
I also noticed that the first plateau always seems to happen at 60 seconds. I changed the Tsung arrival rate to 1000/sec (down from 3000/sec), and the first plateau again happened at 60 seconds, but this time with only about ~12,000 connected. It then levels off around ~17,500 after 150 seconds.
It really makes me wonder if the problem is with the Tsung configuration and not Phoenix itself.
Good point. My use case isn’t related to websockets, but as I understand it, this setting applies to all requests. My project needs to handle several thousand mobile apps that send updates on a regular basis, so I’m trying to prepare for that as well as possible, and it seems like a good idea to increase this setting ahead of time.
The 60-second plateau seems quite strange. What happens if you increase the arrival rate 10x? Also, you have 5 testing servers, and the connection count happens to be around 10k * server_count.
Could you try setting up another, completely separate set of testing servers and start stress testing at the same time on both sets? That should eliminate Tsung from the list of bottleneck suspects.
60 seconds is often FIN_WAIT, and if you only had one node running Tsung I would say “aha! you’re running into port exhaustion” — but you have 5 nodes, so unless all 5 Tsung nodes somehow reach your target server through the same gateway, in other words from a single IP, that’s not it. Still, this 60-second thing seems too consistent to be a coincidence, so I’d look into it, even if it’s not clear how it could be the problem.
Oh wow, that was it! Thanks a lot. I had the heartbeat message in an earlier test configuration but was seeing the same problem then; it might have been caused by something else at the time. It seems to be working now! I had a feeling it was going to be something simple that I overlooked.
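For anyone who finds this later: if I understand the docs right, the 60-second plateau lines up with Phoenix’s websocket transport timeout. Channel clients are expected to push a periodic heartbeat event on the special "phoenix" topic; connections that go a full timeout window without any incoming data get closed, which is exactly what my Tsung sessions were hitting. A rough sketch of where that timeout lives, assuming a standard socket mount (MyAppWeb.UserSocket is a placeholder):

```elixir
# lib/my_app_web/endpoint.ex — sketch only; MyAppWeb.UserSocket is a placeholder.
# The websocket transport drops connections that receive no data (including
# client heartbeats) within :timeout milliseconds — 60_000 by default.
socket "/socket", MyAppWeb.UserSocket,
  websocket: [timeout: 60_000],
  longpoll: false
```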
Side note: I’m currently trying to figure out how to increase the max number of processes when using releases. Edit: never mind, figured it out — just add +P 5000000 to the vm.args file generated by Distillery.
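For reference, the line in question looks like this (the BEAM’s default process limit is far lower than that):

```
## vm.args (generated by Distillery)
## Raise the maximum number of Erlang processes.
+P 5000000
```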
I feel like I’m probably missing something again, because this is a huge drop. I think I’m going to do the rest of my testing on the VPS and come back to improving performance on the K8s cluster later.
More info on the cluster:
Running on DigitalOcean managed K8s (DOKS)
3 worker nodes, 8 GB RAM / 4 vCPUs each
Workload running on one node, with no other workloads on it
Oh, good question. I found the relevant part of Tsung’s manual:
users: Number of simultaneous users (their session has started, but not yet finished).
connected: Number of users with an open TCP/UDP connection (example: for HTTP, during a think time the TCP connection can be closed by the server, and it won’t be reopened until the think time has expired). New in 1.2.2.
Wild guess: if the server retains session data for users for a while after their previous connection closed, that still puts load on the server and would explain why there are more users than connections. I’ve never used Tsung, though, and haven’t looked too closely at how this experiment is configured.
Hey, a colleague just sent me this post. I’d be happy to look into it more, although it’s late here, so I’ve only given it a cursory glance so far.
I documented my Tsung configuration for pushex at https://github.com/pushex-project/pushex/tree/master/examples/load-test. I was able to hit 100k+ active connections with it and never tried to fully max it out. This configuration may be of help as you look into it. Happy to dig in further if you check it out and determine the bottleneck isn’t Tsung.
The max connections per Tsung worker was something like 60-90k, so you should be able to hit 300k active connections easily.
If you’re using additions on top of base Phoenix, they could be bottlenecks, but that will most likely show up in your CPU traces.
Edit: I see now that the root issue is K8s vs the VPS. I’ll make a new reply to comment on that.