So, useless benchmarks aside, it's possible to write a webserver in Elixir that serves 300k requests per second (perhaps more with optimizations). This is a simple hello-world benchmark: the client sends a GET and the server replies with a 200 and an empty body.
I did some playing around with Stargate (vans163/stargate, next branch) and rewrote it into Elixir today. The motivation was a talk I attended recently where the presenter benchmarked Phoenix/Cowboy against another stack and achieved 8k RPS on 2 cores. I said, WTH? This makes the technology look bad.
Benchmark was done on bare metal on a:
Dual Processor | Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
10 physical cores for Erlang (all of processor 1), 10 physical cores for benchmarking (all of processor 0)
A Unix domain socket was used to avoid benchmarking the NIC / full TCP/IP stack.
Elixir was started with iex --erl "+S 10 +sbt ts +sct L10-19C10-19P1N1" -S mix
Benchmark was done via autocannon
1 core per autocannon instance, each producing ~45k requests per second:
rm bench
taskset -c 0 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 1 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 2 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 3 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 4 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 5 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 6 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 7 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 8 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 9 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
grep "k requests in " bench | cut -c 1-3 | python -c"import sys; print(sum(map(int, sys.stdin)))"
grep "Latency" bench
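The grep/cut pipeline above assumes the request totals are always exactly three digits. A slightly more robust sketch in Python (the exact autocannon summary and Latency line formats vary by version, so treat the regexes and the sample lines as assumptions):

```python
import re

def summarize(lines):
    """Sum 'NNNk requests in' totals and average the latency figures
    found in concatenated autocannon output (like the `bench` file above)."""
    total_k = 0
    latencies_ms = []
    for line in lines:
        m = re.search(r"(\d+)k requests in", line)
        if m:
            total_k += int(m.group(1))
        m = re.search(r"Latency.*?([\d.]+)\s*ms", line)
        if m:
            latencies_ms.append(float(m.group(1)))
    avg = sum(latencies_ms) / len(latencies_ms) if latencies_ms else 0.0
    return total_k, avg

sample = [
    "330k requests in 11s, 42 MB read",
    "Latency (avg): 13.1 ms",
]
print(summarize(sample))  # (330, 13.1)
```

Run it with `python summarize.py < bench` style plumbing, or just pass `open("bench")` to `summarize`.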
Stargate achieved:
3299k requests total in 11 seconds
300k requests per second
13ms average latency
Cowboy achieved:
cowboy:start_clear(http, [{port, 0}, {ip, {local, <<"/tmp/star.sock">>}}], #{env => #{dispatch => Dispatch}}).
ERL_FLAGS="+S 10 +sbt ts +sct L10-19C10-19P1N1" make run
110k requests per second
30ms average latency
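Putting those numbers in per-core terms (simple arithmetic on the figures above, including the talk's 8k RPS on 2 cores):

```python
# (total RPS, cores used) from the results reported above
results = {
    "stargate": (300_000, 10),
    "cowboy": (110_000, 10),
    "phoenix_talk": (8_000, 2),  # the 8k RPS on 2 cores from the talk
}

for name, (rps, cores) in results.items():
    print(f"{name}: {rps // cores} RPS/core")
# stargate: 30000 RPS/core
# cowboy: 11000 RPS/core
# phoenix_talk: 4000 RPS/core
```

So per core, Stargate here is roughly 7.5x the number quoted in that talk.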
RAM usage was ~100 MB for both Stargate and Cowboy.
The stress test kept all cores at 100% utilisation; cranking up the request count only increased latency.
msacc output for Stargate:
iex(16)> :msacc.print()
Average thread real-time : 5008850 us
Accumulated system run-time : 49719193 us
Average scheduler run-time : 4971915 us
Thread aux check_io emulator gc other port sleep
Stats per thread:
scheduler( 1) 3.04% 1.16% 59.37% 5.47% 7.61% 22.61% 0.74%
scheduler( 2) 2.69% 1.15% 59.69% 5.47% 7.80% 22.48% 0.73%
scheduler( 3) 2.71% 1.15% 59.54% 5.49% 7.70% 22.69% 0.73%
scheduler( 4) 2.67% 1.18% 59.45% 5.48% 7.79% 22.68% 0.75%
scheduler( 5) 2.65% 1.19% 59.45% 5.46% 7.73% 22.72% 0.79%
scheduler( 6) 2.60% 1.21% 59.46% 5.45% 7.80% 22.72% 0.77%
scheduler( 7) 2.63% 1.16% 59.48% 5.48% 7.89% 22.62% 0.74%
scheduler( 8) 2.64% 1.15% 59.49% 5.45% 7.71% 22.84% 0.73%
scheduler( 9) 2.71% 1.15% 59.28% 5.49% 7.80% 22.84% 0.73%
scheduler(10) 2.60% 1.15% 59.48% 5.43% 7.84% 22.77% 0.73%
Stats per type:
scheduler 2.69% 1.17% 59.47% 5.47% 7.77% 22.70% 0.74%
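The "Stats per type" row is just the per-thread rows averaged. A quick sanity check on the emulator column above:

```python
# emulator-time percentages for schedulers 1-10, copied from the msacc table
emulator = [59.37, 59.69, 59.54, 59.45, 59.45,
            59.46, 59.48, 59.49, 59.28, 59.48]

avg = sum(emulator) / len(emulator)
print(round(avg, 2))  # 59.47, matching the per-type row
```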
What is the purpose of this benchmark? It shows that with OTP 21.2, Erlang/Elixir is on the level of Scala/NGINX when it comes to webserver throughput, and there should be no reason to keep making claims like "Oh Elixir, doesn't it cap out at 25k RPS? Sorry, it's not performant and we won't use it."
What is concerning, though, is ports using 22% of each scheduler just to recv/send on ~30k descriptors. I think there are more optimisations to be made.
Also, with some optimisations to Stargate, I am sure 600k+ RPS is achievable. The Stargate rewrite into Elixir was done in one day; it's already amazing that such performance can be had out of the box. I can't imagine how many coding days were spent on NGINX to reach 300k RPS.
NGINX was not benchmarked (mostly because I don't know how to configure it), but it claims 1.3M RPS on the same benchmark using 10 cores at 2.3 GHz.
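Normalising that claim per core-GHz makes the gap concrete (rough arithmetic on the figures above; it ignores architectural differences between the machines):

```python
# RPS per core-GHz: total RPS / (cores * clock in GHz)
nginx = 1_300_000 / (10 * 2.3)    # NGINX's claimed number
stargate = 300_000 / (10 * 2.5)   # this benchmark

print(round(nginx), round(stargate), round(nginx / stargate, 1))
# 56522 12000 4.7
```

So the claimed NGINX figure is roughly 4.7x this Stargate result per core-GHz.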
EDIT: 600k RPS was achieved by replying to the request without parsing it. But :msacc now shows:
Thread aux check_io emulator gc other port sleep
Stats per thread:
scheduler( 1) 0.02% 0.02% 0.05% 0.00% 4.04% 0.09% 95.78%
scheduler( 2) 4.42% 3.56% 16.28% 0.25% 36.30% 36.32% 2.87%
scheduler( 3) 4.41% 3.56% 16.25% 0.26% 36.46% 36.20% 2.87%
scheduler( 4) 4.41% 3.60% 16.27% 0.26% 36.57% 35.98% 2.91%
scheduler( 5) 4.38% 3.54% 16.40% 0.25% 36.39% 36.17% 2.86%
scheduler( 6) 4.37% 3.68% 16.21% 0.25% 36.92% 35.64% 2.93%
scheduler( 7) 4.45% 3.50% 16.32% 0.26% 37.00% 35.63% 2.83%
scheduler( 8) 4.24% 3.44% 16.30% 0.24% 36.23% 34.36% 5.18%
scheduler( 9) 1.68% 1.34% 5.74% 0.10% 18.02% 12.62% 60.49%
scheduler(10) 0.78% 0.60% 2.66% 0.04% 9.00% 5.78% 81.14%
What is "other"? And port time is kinda high. (The server was not fully saturated; the autocannon tool became the bottleneck.)
EDIT2: The same test as in the EDIT, but using {active, false}, produced 400k RPS, meaning that {active, true} is now more performant than {active, false}.