Plug POST performance

At work we have a lot of Scala code in our low-latency/high-throughput product, and a couple of us have been advocating Elixir as an alternative to the current codebase. We recently undertook a 24-hour hackathon and tried to replace a chunk of our code with Elixir, but immediately ran into performance issues. I’m hoping other people on the list might have some suggestions for me, because I’ve picked the brains of the IRC channel and been unable to improve things.

To begin with, I’ll set the stage for the testing we did. We wanted to start by understanding the raw performance of Elixir “out of the box”, so we whipped up a couple of codebases that accept POSTs and return 204 regardless of what happens.

We used this code:

Alongside a less complex piece of code:

We used wrk and wrk2 to run various POST tests against this code. We used the following script to feed wrk with data to POST:

For an example of the flags we used with wrk:

We did the testing on DigitalOcean machines as well as machines inside our own datacenter. We were unable to get beyond 22,000 QPS with any of our testing, no matter what we tried tweaking. On the same hardware we can swap in some Scala and do over 300,000 QPS easily.

Things that we tried:

  • Tweaking sysctls.
  • Running wrk from the same box to rule out the network.
  • GET instead of POST (45k QPS)
  • A whole bunch of changes to the code (see the git history for those)
  • Using elli instead of cowboy (much worse, ~2200 QPS).

What I’m hoping is that other people on this forum can grab the code and take a look for obvious problems, as well as potentially running wrk against it in whatever environments they have. I spent some time with eflame and Observer trying to figure out why things are slow, but I’m not familiar enough with BEAM/Elixir to make any real headway.

Any help would be really appreciated!

2 Likes

First thing I noticed is that you’re calling wrk on ‘/supply-partners/mopuba’ while your icepick code has the route defined as ‘/supply-partners/mopub’ (without the trailing ‘a’). This could give you a different code-handling path than you intended. At least something to try :slight_smile:

Have you seen this benchmark? https://github.com/mroth/phoenix-showdown
That benchmark already shows much higher request rates than you’re getting now; maybe try it first on the same hardware to see how it performs and use it as a baseline?

Another thing you could try is compiling with the HiPE (High Performance Erlang) compiler by setting ERL_COMPILER_OPTIONS="[native,{hipe, [o3]}]" when calling mix compile.

You could also try using one of the Erlang profiling tools to see which functions take most of your CPU cycles.
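
For example, a rough sketch with eprof, run from an attached IEx shell while wrk is hammering the server (the 5-second window is arbitrary):

# Profile all currently running processes for a few seconds of load, then
# print the total time spent per function. Profiling adds significant
# overhead, so expect throughput to drop while this runs.
:eprof.start_profiling(Process.list())
:timer.sleep(5_000)
:eprof.stop_profiling()
:eprof.analyze(:total)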

One last thing I noticed when testing your stuff out on a simple 11" MacBook Air is that I could double the req/sec when I changed the JSON payload from the one above to just ‘{}’. So it seems it could have something to do with string handling of the incoming JSON. On the MacBook Air I got 11k req/sec.

Good luck!

1 Like

Oh, we definitely tested with the right URL; sorry, that was just a bad paste of one of the many we tried (we were testing the fallback vs. the actually correct route at that time).

I’ll play around with the Phoenix code (we actually tried, at one point, a new Phoenix app modified to just return 204, with the same results) as well as HiPE to see what that does. I played around with eflame for profiling, but all it really showed me was that most of the time was spent in cowboy/ranch. I did a couple of graphs with the cowboy code "grep -v"-ed out and very little was left.

I’d be curious and surprised if you saw the same results using the “Ebid” code I had in the gist, as it shouldn’t really pay any attention to the body contents. From what I can see from reading the Plug code, that only kicks in if you are actually using one of the parsers, and that code doesn’t parse anything.

1 Like

Maybe I’m missing something, but the code in the repo seems to do more than that. If I simplify that top-level plug to just return 204:

defmodule Icepick.PlugRouter do
  import Plug.Conn

  def init(options) do
    options
  end

  def call(conn, _opts) do
    conn
    |> send_resp(204, [])
  end

  def start_link() do
    {:ok, _} = Plug.Adapters.Cowboy.http(
      __MODULE__,
      [],
      [port: 8000])
  end
end

I get the following on my dev machine:

$ wrk -c 100 -d 60 -t 2 -s wrk.lua  http://localhost:8000/supply-partners/mopub
Running 1m test @ http://localhost:8000/supply-partners/mopub
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    26.40ms   79.89ms 291.43ms   91.53%
    Req/Sec    22.38k     8.47k   46.78k    79.63%
  2510775 requests in 1.00m, 362.02MB read
Requests/sec:  41846.51

The latency here is still unusually large, but at least the throughput is way better (originally it was below 20k/sec).

Recently I demoed a simple Phoenix app which did some parameter decoding and JSON encoding (so a bit more work than this simple example), and I got a sub-millisecond 99th-percentile latency with a wrk test on this same machine, so I’m pretty certain there’s room for improvement here.

2 Likes

That’s effectively what we had in the second link in the top post:

Which we also had trouble getting good performance with. It was far below the ~300k we see with Scala. Definitely better throughput with that code in your testing, though. How powerful is your dev machine (I’m guessing just a laptop)?

I’m VERY interested in the fact you had sub-millisecond with Phoenix. One of the worst issues we saw in our testing was latency of over 9 seconds. The machine was 60% idle from a CPU perspective at the time and we were completely baffled by how that could happen.

1 Like

Yeah, it’s a 2013 MacBook Pro, quad-core 2.6GHz, 16GB RAM.

I expect Scala will usually perform better on such synthetic small benchmarks. I wonder, though, how it would look in a more real-life scenario, where these plugs do some real work, both CPU- and I/O-bound, and allocate some memory which needs to be GC-ed at some point.

Also a question: in your Elixir Plug bench, was CPU fully utilised under load?

If you start with a generated Phoenix app, there might be some issues. First, make sure to run the system as an OTP release (using exrm is best). Another important tip is to either increase the Logger level to :warn or decrease the Plug.Logger level to :debug. Otherwise, every request is logged, and there’s a huge number of requests under a load test. IIRC, Logger will start applying back-pressure when its queue gets big, and that might cause the bottleneck and the CPU under-utilisation during increased load.
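
For example, assuming the standard generated config layout, something like this in config/prod.exs drops the per-request log lines during a load test (just a sketch; adjust to your setup):

use Mix.Config

# Raise the Logger level so Plug.Logger's per-request :info lines are
# filtered out while load testing.
config :logger, level: :warn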

Another common mistake is running JSON requests through the browser pipeline (use the API pipeline). Finally, if you’re testing locally, don’t use too many threads, because wrk might interfere with the VM.
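
Roughly, in a generated router that means keeping JSON endpoints behind the :api pipeline (module, scope, and route names below are made up for illustration):

defmodule MyApp.Router do
  use Phoenix.Router

  # The lightweight pipeline JSON endpoints should go through.
  pipeline :api do
    plug :accepts, ["json"]
  end

  scope "/api", MyApp do
    pipe_through :api
    post "/bids", BidController, :create
  end
end

The :browser pipeline pulls in session fetching, CSRF protection, and similar plugs, which is extra per-request work you don’t want in a raw throughput test.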

Just to be clear, my test was very shallow. I had a simple endpoint with one browser request and one API request, and issued them in a 1:1 ratio with wrk for 30 seconds. That got me about 30k reqs/sec and sub-ms times at the 99th percentile.

4 Likes

We do 85,000 QPS in Scala with a full workload (parsing the JSON and making decisions based on it) and 300,000 QPS when we just return 204. That’s why I’m so frustrated: I can’t get Elixir anywhere near that even in the simple case, let alone our more complex case.

In no case was the CPU ever fully loaded; we were averaging about 60% idle, which is why I became convinced we must have hit some “bottleneck”, but we couldn’t prove it out.

We definitely didn’t change the logging level or use exrm, so I’ll start there and see where we can go. We’re definitely doing something silly, as other people report over 200k QPS for similar tests on Elixir.

Until you’re able to get to 100% CPU usage under load, you’re probably hitting a bottleneck somewhere. Here are some additional suggestions for discovering it:

  • Whether you’re using exrm or not, make sure to load test the code compiled with MIX_ENV=prod
  • If you’re load testing a simple 204 response, make sure to keep Plug.Parsers out. Using it might cause the body to be loaded (see Plug.Parsers.JSON), and your simple CPU-bound test might become I/O-bound.
  • Consider temporarily moving Plug/HTTP out of the picture. Write a simple program that spawns a couple of processes (at least as many as you have cores) and makes those processes CPU-bound. They could run an infinite loop which invokes :rand.uniform, or something similar; just make sure they’re not receiving messages, sleeping, or doing I/O-bound work (see the sketch after this list). Check if you’re getting 100% CPU usage. If not, try playing with erl options and see if it helps.
  • Start the observer, then load test the system for a while. If you have a bottleneck in the system, it might show up in the processes tab during the load test. Sort the view by reds and/or message queue, and try to see if some process is constantly on/near the top of the list with a large message queue. If yes, that’s a possible bottleneck.
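
A minimal sketch of that CPU-bound saturation test (module and function names here are made up):

defmodule CpuSaturation do
  # Spawn one busy-looping process per online scheduler, then watch
  # top/htop (or observer) to see whether every core reaches ~100%.
  def run do
    for _ <- 1..:erlang.system_info(:schedulers_online) do
      spawn(fn -> busy_loop() end)
    end
  end

  # Pure CPU work: no message passing, no sleeping, no I/O.
  defp busy_loop do
    _ = :rand.uniform()
    busy_loop()
  end
end

CpuSaturation.run()
:timer.sleep(60_000) # keep the VM alive for a minute while you watch CPU usage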

Only when you reach CPU usage of 100% can you consider comparing Scala vs Erlang/Elixir, and also think about possible optimizations in your code. Before that, some other bottleneck is likely making your system less efficient than it could be.

Good luck! :slight_smile:

4 Likes

Just a random thought, but did you compile it with a prod Mix environment? That can have a massive impact sometimes.

1 Like

Yes, we definitely compiled with prod; we didn’t see a huge difference between dev and prod because dev consolidates protocols these days.

1 Like

By not doing anything with the request body you’re just measuring how fast the language can open and close TCP acceptors. Try doing something like deserializing and serializing a JSON body for each request and benchmark again.
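
Something along these lines would exercise that path (a sketch assuming Poison, as used elsewhere in the thread, and bodies small enough for a single read_body/1 call; the module name is made up):

defmodule JsonEchoPlug do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    # Read the raw body and round-trip it through Poison so the benchmark
    # measures JSON handling rather than just the acceptor pool.
    {:ok, body, conn} = read_body(conn)

    response =
      case Poison.decode(body) do
        {:ok, decoded} -> Poison.encode!(decoded)
        {:error, _} -> "{}"
      end

    conn
    |> put_resp_content_type("application/json")
    |> send_resp(200, response)
  end

  # Start it the same way as the other plugs in this thread, e.g.:
  # Plug.Adapters.Cowboy.http(JsonEchoPlug, [], port: 8000)
end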

2 Likes

That’s exactly what we did during our testing: we parsed (with both Poison and jiffy), giving us performance below 20,000 QPS. Like I said, our Scala code is at 85,000, so obviously that would make Elixir an unsuitable replacement.

Maybe we’re only testing how fast we can open and close TCP acceptors, but in our workload we need that to be capable of at least 100,000 QPS, and from other people’s benchmarks (take WhatsApp, for example) doing this should definitely be possible.

1 Like

This reminds me: have you tried providing +K true to the Erlang VM (enables kernel poll, default is false), and/or increasing the number of async threads for I/O with +A n (default is 10)?

1 Like

We tried +K true and +A 20, then +A 100. All the above results were with +K true/+A 100.

1 Like

Interesting. I tried benchmarking a simple plug myself on my MacBook. The Plug code is listed below and is started with:

{:ok, _} = Plug.Adapters.Cowboy.http SimplePlug, []

I get similar results to yours: I cannot get more than 26,000 requests/sec. I tried setting +K true as well.

defmodule SimplePlug do
  import Plug.Conn

  def init(options) do
    options
  end

  def call(conn, _opts) do
    conn
    |> put_resp_content_type("text/plain")
    |> send_resp(200, "Hello")
  end
end

I tried a simple Java Spark example:

import static spark.Spark.*;

public class Main {
    public static void main(String[] args) {
        get("/", (req, res) -> "Hello");
    }
}

It gets about 35,000 Requests/sec. So it’s not that much ahead of Elixir.

Go, out of the box, blows them all away…

package main

import (
	"io"
	"net/http"
)

func handler(writer http.ResponseWriter, request *http.Request) {
	io.WriteString(writer, "Hello")
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}

This gets about 51,000 Requests/sec

2 Likes

Also did some runs looking at the processes under observer.

Using wrk -c 40 -d 180s -t 2 --latency http://localhost:4000/ to load up the plug, I don’t see any process’s queue growing longer than 3 messages. I suspect that on a MacBook with 4 cores, about 25,000 requests/sec is as good as it gets with Elixir/Plug.

Not that this is a bad figure. I did a very simple Python 3 async benchmark to see what it could do with the exact same wrk command. The Python code was:

import asyncio
from aiohttp import web

async def handle(req):
    return web.Response(body=b"Hello")

app = web.Application()
app.router.add_route('GET', '/', handle)

web.run_app(app)

Python 3 only managed about 2,300 requests/sec, so Elixir/Plug is about 10 times faster than Python 3 doing a similar task under the same load conditions.

1 Like

That’s fair, but what’s weird is that these are still much lower numbers than what has been achieved before. I wonder if there was some regression in cowboy? Gonna need to do some digging.

1 Like

I’ve tried benching SimplePlug on my MBP. I built an OTP release and ran the 3min test. Here’s what I got:

Running 3m test @ http://localhost:4000/
  2 threads and 40 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.05ms   72.63ms 804.03ms   88.79%
    Req/Sec    25.74k     8.24k   35.75k    86.90%
  Latency Distribution
     50%  705.00us
     75%    1.09ms
     90%  121.27ms
     99%  312.24ms
  8003062 requests in 3.00m, 1.41GB read
Requests/sec:  44441.19
Transfer/sec:      8.02MB

The latency leaves a lot to be desired, but the throughput is nice. I’ve also seen varying results across repeated runs. For example, this one had better (but still not good) latency, but lower throughput:

Running 3m test @ http://localhost:4000/
  2 threads and 40 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.71ms   44.46ms 558.37ms   94.47%
    Req/Sec    27.17k     6.56k   49.11k    90.48%
  Latency Distribution
     50%  644.00us
     75%    0.90ms
     90%    1.44ms
     99%  242.17ms
  7420593 requests in 3.00m, 1.31GB read
  Socket errors: connect 0, read 1, write 0, timeout 0
Requests/sec:  41206.18
Transfer/sec:      7.43MB

For some shorter tests, I was able to get better results:

Running 30s test @ http://localhost:4000/
  2 threads and 40 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.52ms    8.52ms 169.09ms   98.74%
    Req/Sec    28.17k     3.94k   33.88k    81.51%
  Latency Distribution
     50%  651.00us
     75%    0.87ms
     90%    1.18ms
     99%   17.46ms
  1638500 requests in 30.04s, 295.63MB read
Requests/sec:  54550.41
Transfer/sec:      9.84MB

I didn’t investigate this more deeply. One possible reason is that this is my dev box and there’s a bunch of stuff running on the system; it probably didn’t help that the load test was running on the same box as well. Also, since it’s OS X, I couldn’t play with the +sbt flags, which might make a difference.

Given the numbers I get locally on a devbox where a lot of other apps are running in the background, I’m fairly confident that better results can be observed on the dedicated server.

2 Likes

I spent some time last night “tinkering” with interesting results.

Cowboy 2.0, using a hacked-up copy of their examples/post_data, can do 106k/sec returning 204 without reading the body on the machines where I struggled to get over 25k with Elixir/Plug.

Elli could do 160k on the same hardware.

I tried to use one of the elli Plug adapters and got 2k QPS through Elixir. The weirdest part of my elli testing was that I was using rebar3 shell, which starts with +A 0 and +K false, and I still got 160k. I am sure that if I build a release and tune it properly I can drive that upwards of 250-300k.
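
For reference, a bare elli handler that 204s without touching the body looks roughly like this (a sketch in Elixir for consistency with the rest of the thread; the actual 160k test was plain Erlang under rebar3 shell, and the module name here is made up):

defmodule NoopHandler do
  @behaviour :elli_handler

  # Return 204 for every request and never read the body.
  def handle(_req, _args), do: {204, [], ""}

  # elli reports internal events (request_complete, errors, ...) here;
  # ignore them all for the benchmark.
  def handle_event(_event, _data, _args), do: :ok
end

{:ok, _pid} = :elli.start_link(callback: NoopHandler, port: 8000)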

Just figured I’d post a mini update, as I’m still (very slowly) trying to untangle this performance issue. My next plan is cowboy 1.1, which is what Plug is using, to see if I reach a bottleneck there.

Then I’ll add raw Erlang body reading into both cowboy and elli to see what that looks like.

3 Likes

This is definitely suspicious. Phoenix in dev is going to do code reloading and other things. If you are getting the same performance in both, there is definitely something else at play.

I would expect the default Phoenix stack in production to have around a 10% impact compared to a pure Plug stack. Also remember that logging has a huge impact on performance; make sure both apps are logging about the same amount of information.

If results are still suspicious, please let us know how you installed Erlang and which flags were used during the installation.

5 Likes