300k requests per second webserver in Elixir! OTP 21.2 - 10 cores

So, useless benchmarks aside, it's possible to write a webserver that can serve 300k requests per second (perhaps more with optimizations). This is a simple hello-world benchmark, where the client sends a GET and the server replies with a 200 and an empty body.
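For concreteness, each exchange on the wire is just the following (the exact response headers are my assumption; the post only specifies "200 + empty string"):

GET / HTTP/1.1
Host: test.com

HTTP/1.1 200 OK
content-length: 0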

I did some playing around with vans163/stargate (next branch: https://github.com/vans163/stargate/tree/next) and rewrote it in Elixir today. The motivation was that I recently attended a talk where the presenter spoke about benchmarking Phoenix/Cowboy vs , and achieved 8k RPS on 2 cores. I said, WTH? That makes the technology look bad.

Benchmark was done on bare metal on a:

Dual Processor | Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
10 physical cores to Erlang (all of proc 1), 10 physical cores to benchmarking (all of proc 0)
Using a unix domain socket to avoid benchmarking the NIC / full TCP/IP stack.

Elixir was started with iex --erl "+S 10 +sbt ts +sct L10-19C10-19P1N1" -S mix
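A quick breakdown of those emulator flags (my reading of the erl docs; double-check before copying):

# +S 10                  run 10 scheduler threads
# +sbt ts                bind scheduler threads to logical processors (thread spread)
# +sct L10-19C10-19P1N1  declare the CPU topology: logical CPUs 10-19 on
#                        cores 10-19 of processor 1, NUMA node 1
iex --erl "+S 10 +sbt ts +sct L10-19C10-19P1N1" -S mix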

Benchmark was done via autocannon

1 core per autocannon instance; each instance produces ~45k requests per second

rm bench
taskset -c 0 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 1 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 2 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 3 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 4 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 5 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 6 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 7 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 8 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &
taskset -c 9 autocannon -c 10 -d 10 -S /tmp/star.sock http://test.com 2>> bench &

grep "k requests in " bench | cut -c 1-3 | python -c"import sys; print(sum(map(int, sys.stdin)))"
grep "Latency" bench

Stargate achieved:

3299k requests total in 11 seconds
300k requests per second
13ms average latency

Cowboy achieved:

cowboy:start_clear(http, [{port, 0}, {ip, {local, <<"/tmp/star.sock">>}}],
                   #{env => #{dispatch => Dispatch}}).
ERL_FLAGS="+S 10 +sbt ts +sct L10-19C10-19P1N1" make run

110k requests per second
30ms average latency
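The handler behind that Dispatch is not shown above; a minimal hello-world equivalent driven from Elixir would look something like this (EmptyHandler is my stand-in name, not the code actually benchmarked):

defmodule EmptyHandler do
  # Minimal Cowboy 2.x handler: reply 200 with an empty body
  def init(req, state) do
    {:ok, :cowboy_req.reply(200, %{}, "", req), state}
  end
end

dispatch = :cowboy_router.compile([{:_, [{:_, EmptyHandler, []}]}])

{:ok, _} =
  :cowboy.start_clear(
    :http,
    [port: 0, ip: {:local, "/tmp/star.sock"}],
    %{env: %{dispatch: dispatch}}
  )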

RAM usage was ~100MB for both Stargate and Cowboy.

The stress test kept all cores at 100% utilisation; cranking up the requests only increased the latency.
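For anyone reproducing this: msacc ships with OTP and can be sampled under load like so (the 10-second window here is illustrative; only the print call appears below):

:msacc.start(10_000)  # reset counters, collect microstate accounting for 10s, then stop
:msacc.print()        # dump the per-thread breakdown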

msacc for Stargate
iex(16)> :msacc.print()   
Average thread real-time    :  5008850 us
Accumulated system run-time : 49719193 us
Average scheduler run-time  :  4971915 us

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
 scheduler( 1)    3.04%    1.16%   59.37%    5.47%    7.61%   22.61%    0.74%
 scheduler( 2)    2.69%    1.15%   59.69%    5.47%    7.80%   22.48%    0.73%
 scheduler( 3)    2.71%    1.15%   59.54%    5.49%    7.70%   22.69%    0.73%
 scheduler( 4)    2.67%    1.18%   59.45%    5.48%    7.79%   22.68%    0.75%
 scheduler( 5)    2.65%    1.19%   59.45%    5.46%    7.73%   22.72%    0.79%
 scheduler( 6)    2.60%    1.21%   59.46%    5.45%    7.80%   22.72%    0.77%
 scheduler( 7)    2.63%    1.16%   59.48%    5.48%    7.89%   22.62%    0.74%
 scheduler( 8)    2.64%    1.15%   59.49%    5.45%    7.71%   22.84%    0.73%
 scheduler( 9)    2.71%    1.15%   59.28%    5.49%    7.80%   22.84%    0.73%
 scheduler(10)    2.60%    1.15%   59.48%    5.43%    7.84%   22.77%    0.73%

Stats per type:
     scheduler    2.69%    1.17%   59.47%    5.47%    7.77%   22.70%    0.74%

What is the purpose of this benchmark? This is to show that with OTP21.2 Erlang / Elixir is on the level of Scala/NGINX when it comes to webserver throughput, and there should be no reason to keep making claims like “Oh Elixir, doesn’t it cap out at 25k RPS? Sorry its not performant and we wont use it.”

Concerning, though, is that ports use 22% of each scheduler just to recv/send on ~30k descriptors. I think there are more optimisations to be made.

Also, with some optimisations to Stargate, I am sure 600k+ RPS is achievable. The Stargate-to-Elixir rewrite was done in 1 day; it's already amazing that such great performance can be had out of the box. I can't imagine how many coding days were spent on NGINX to achieve 300k RPS.

NGINX was not benchmarked (mostly because I don't know how to configure it), but it claims 1.3m RPS on the same benchmark using 10 cores at 2.3GHz.

EDIT: 600k RPS was achieved by replying to the request without parsing it. But :msacc now shows:

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread: 
 scheduler( 1)    0.02%    0.02%    0.05%    0.00%    4.04%    0.09%   95.78%
 scheduler( 2)    4.42%    3.56%   16.28%    0.25%   36.30%   36.32%    2.87%
 scheduler( 3)    4.41%    3.56%   16.25%    0.26%   36.46%   36.20%    2.87%
 scheduler( 4)    4.41%    3.60%   16.27%    0.26%   36.57%   35.98%    2.91%
 scheduler( 5)    4.38%    3.54%   16.40%    0.25%   36.39%   36.17%    2.86%
 scheduler( 6)    4.37%    3.68%   16.21%    0.25%   36.92%   35.64%    2.93%
 scheduler( 7)    4.45%    3.50%   16.32%    0.26%   37.00%   35.63%    2.83%
 scheduler( 8)    4.24%    3.44%   16.30%    0.24%   36.23%   34.36%    5.18%
 scheduler( 9)    1.68%    1.34%    5.74%    0.10%   18.02%   12.62%   60.49%
 scheduler(10)    0.78%    0.60%    2.66%    0.04%    9.00%    5.78%   81.14%

What is other? And port is kinda high. (I was not able to fully saturate it; the autocannon tool became the bottleneck.)

EDIT2: The same test as in the EDIT but using {active, false} produced 400k RPS, meaning that {active, true} is now more performant than {active, false}.
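For context, a minimal sketch of the two socket modes being compared; this is not Stargate's actual acceptor loop, and the canned response is assumed:

defmodule BenchLoop do
  @response "HTTP/1.1 200 OK\r\ncontent-length: 0\r\n\r\n"

  # {active, true}: the VM pushes each incoming packet to the owning process as a message
  def loop_active(socket) do
    receive do
      {:tcp, ^socket, _data} ->
        :ok = :gen_tcp.send(socket, @response)
        loop_active(socket)

      {:tcp_closed, ^socket} ->
        :ok
    end
  end

  # {active, false}: the process pulls data with a blocking recv
  def loop_passive(socket) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, _data} ->
        :ok = :gen_tcp.send(socket, @response)
        loop_passive(socket)

      {:error, :closed} ->
        :ok
    end
  end
end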


This is awesome :+1:

the presenter spoke about benchmarking Phoenix/Cowboy

There is no Phoenix btw

It would be super great if you could contribute to TechEmpower/FrameworkBenchmarks (https://github.com/TechEmpower/FrameworkBenchmarks) in order to present Elixir better in the next round of the TechEmpower Framework Benchmarks.


I will try; it seems like a lot of work, though, to meet their guidelines.

EDIT3: (because I cannot edit the main post anymore)

Other in the msacc report was using a lot of CPU, and I did not quite understand why, nor did I want to recompile Erlang with extra microstate accounting. So I dropped inet_drv and wrote a simple C NIF to do the TCP networking, and the throughput doubled. The msacc report now looks like:

        Thread      aux check_io emulator       gc    other     port    sleep

 scheduler( 1)    0.68%    0.00%   89.85%    3.46%    6.01%    0.00%    0.00%
 scheduler( 2)    0.66%    0.01%   90.43%    3.40%    5.50%    0.00%    0.00%

I am using 2 schedulers because generating enough load for 10 physical cores now maxes out the benchmarking tool, and I am out of cores to assign to the benchmarker. Now 90% of the time is spent in the emulator and 6% in other; I am guessing the 6% other is the NIF socket calls?

The throughput was 250k RPS on 2 physical cores. If it all scales linearly, that is 5 × 250k = 1.25m RPS on 10 cores for the simple GET hello-world benchmark.

The NIF is a PoC (https://gist.github.com/vans163/d96fcc7c89d0cf25c819c5fb77769e81). Of course it is only useful when there is constant data on the socket; this PoC will break if idle connections keep getting polled. It does open the possibility, though, of using something like DPDK.
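The Elixir-facing shape of such a NIF is roughly this (module and function names are hypothetical; the real PoC is in the gist above):

defmodule StarNif do
  # Hypothetical wrapper for a TCP NIF; load the native library when the module loads
  @on_load :load
  def load, do: :erlang.load_nif('./star_nif', 0)

  # Stubs that raise until the native library is loaded
  def recv(_fd), do: :erlang.nif_error(:nif_not_loaded)
  def send(_fd, _iodata), do: :erlang.nif_error(:nif_not_loaded)
end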

So, a 1.25m RPS Elixir hello-world webserver? Some say it is a useless benchmark, but if you look at someone like Cloudflare, who is struggling with things like NGINX and simply salivating for a performant actor-model implementation, look no further than the BEAM (Elixir).


AFAIK, for NIFs, the time spent in the NIFs themselves is accounted for under the emulator stat if you don't have the VM compiled with extra microstate accounting.

Reading the docs: without extra microstate accounting, other includes managing timers and busy waiting, which are otherwise separate states.


I got 30% in other, 30% in port and 40% in emulator when I just replied with a static response without parsing (using :gen_tcp). I think writing a modern pollset driver based on a C NIF could lead to some interesting optimisations, if we assume some basic things: :gen_tcp.send could return {:partial_send, buffer} so the send buffer does not have to be maintained inside inet_drv, recv operations could return only binaries, recv'ing a particular number of bytes could be dropped, etc. These not-strictly-necessary features have deep hidden costs.
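To illustrate the shape of that proposal (an entirely hypothetical API; :gen_tcp.send returns nothing like this today, and queue_for_retry is a made-up helper):

# Hypothetical: send hands the unsent tail back instead of the driver buffering it
case :gen_tcp.send(socket, response) do
  :ok ->
    :ok

  {:partial_send, rest} ->
    # the caller owns the leftover bytes and decides when to retry
    queue_for_retry(socket, rest)
end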
