Evaluating Elixir / Phoenix for a web-scale, performance-critical application

Tags: performance, phoenix

#1

Hi all - cross-posting this from the elixir-talk mailing list. I could use some help. I am currently evaluating Elixir and Phoenix for a performance-critical application for a Fortune 500 company. This could be another great case study for Elixir and Phoenix if I can show that it can meet our needs. Initial performance testing looked phenomenal, but I am running into some performance concerns that will force me to abandon this tech stack entirely if I cannot make the case.

The setup: an out-of-the-box Phoenix app generated with mix phoenix.new. No Ecto. It returns a static JSON response. Basically a hello-world app.
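For reference, the controller action under test is essentially just this (the payload here is a simplified stand-in for my actual static response):

defmodule MarketApi.ProductController do
  use MarketApi.Web, :controller

  # Return a tiny static JSON payload so the benchmark measures
  # framework overhead rather than application logic.
  def index(conn, _params) do
    json(conn, %{products: []})
  end
end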

The hardware:

  • MacBook Pro, 16 GB RAM, 8 cores, 2.5 GHz, running Elixir/Phoenix natively and also inside a Docker container
  • Amazon EC2 t2.medium running the Elixir Docker image

The tests: ab, wrk, siege, artillery, and curl with a variety of configurations, up to 100 concurrent connections. Not super scientific, I know… but still.

No matter what I try, Phoenix logs impressive numbers to stdout - generally on the order of 150-300 microseconds per request. However, none of the load-testing tools agree. No matter the hardware or load-test configuration, I see around 20-40 ms response times. The goal for the services I am designing is 20ms and several thousand requests per second. The load tests that @chrismccord and others have published suggest I should be able to expect 3ms or less against localhost, but I’m not seeing anything close to that.

Would anyone be willing to work with me to look at some options here? I’d be incredibly grateful. Don’t make me go back to Java, please :slight_smile: Is what I’m asking even possible?


#2

I’ve read several times that for performance testing you should run the app in production mode.

MIX_ENV=prod mix compile.protocols
MIX_ENV=prod PORT=4001 elixir -pa _build/prod/consolidated -S mix phoenix.server

This is from the 0.7.2 docs so I’m not sure it’s all still needed, but might be worth a try.


#3

Thanks for the great suggestion @terakilobyte - I just tried running it in PROD mode:

MIX_ENV=prod mix compile
MIX_ENV=prod mix phoenix.digest
MIX_ENV=prod PORT=4001 mix phoenix.server

./wrk -t8 -c100 -d30S --timeout 2000 http://localhost:4001/api/products

Running 30s test @ http://localhost:4001/api/products
8 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.16ms 32.83ms 365.59ms 94.76%
Req/Sec 494.23 139.45 770.00 76.14%
116284 requests in 30.06s, 71.44MB read
Requests/sec: 3868.18
Transfer/sec: 2.38MB

Still not very good unfortunately :frowning:


#4

Interestingly, with only 10 concurrent connections:

Running 10s test @ http://localhost:4001/api/products
8 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.18ms 3.91ms 40.59ms 88.15%
Req/Sec 1.12k 249.24 1.78k 68.38%
89005 requests in 10.01s, 54.68MB read
Requests/sec: 8890.15
Transfer/sec: 5.46MB


#5

There was a similar thread recently. You may find some tips there.

Some low-hanging fruit to improve perf would be:

  1. Raise the log level in prod to :warn to suppress logging each request (see the config sketch below).
  2. If you’re testing a REST endpoint, make sure it goes through the :api pipeline, not the :browser one.
  3. Build an OTP release and bench against that.

Also, check your CPU usage while testing. If your CPUs are not all constantly near 100%, you may have a bottleneck somewhere.
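For point 1, it’s a one-line change in config/prod.exs (a minimal sketch; tune the level to taste):

# config/prod.exs
# :warn suppresses the per-request lines Phoenix logs at :info
config :logger, level: :warn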


#6

Check out this micro-benchmarking tool released by a member of the community: Benchee and BencheeCSV 0.1.0 release - easy and extensible (micro) benchmarking
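Usage is roughly like this (check the README for the exact call in the version you install; the job name and function below are just placeholders):

Benchee.run(%{
  "encode static product payload" => fn -> Poison.encode!(%{products: []}) end
})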


#7

To expand on Sasa’s point, the :browser pipeline by default generates CSRF tokens to be used in forms as a security measure. These can be fairly expensive to generate, at least compared to a hello_world JSON endpoint. Definitely make sure you aren’t doing that.
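Concretely, the pipelines a fresh Phoenix app generates look roughly like this; the CSRF work happens in :protect_from_forgery, which only lives in the :browser pipeline:

pipeline :browser do
  plug :accepts, ["html"]
  plug :fetch_session
  plug :fetch_flash
  plug :protect_from_forgery   # generates/verifies CSRF tokens - relatively costly
  plug :put_secure_browser_headers
end

pipeline :api do
  plug :accepts, ["json"]      # no session, no CSRF work
end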


#8

Thank you for your quick response on this @sasajuric. I looked around in the forums first but didn’t see the other thread. Lots of great information over there to try. Pretty sure I’m piping through :api - here’s my router:

defmodule MarketApi.Router do
  use MarketApi.Web, :router

  pipeline :api do
    plug :accepts, ["json"]
  end

  scope "/api", MarketApi do
    pipe_through :api
    resources "/products", ProductController, except: [:new, :edit]
  end
end

CPU is pretty well maxed out across all cores when I run the tests.

Thanks @benwilson512 - I’m super green to Phoenix, so I appreciate the tip. I think I am using :api. Is there anything I need to do besides pipe_through :api?

Quoting @thinkpadder1 (post #6): “Check out this micro-benchmarking tool released by a member of the community: Benchee and BencheeCSV 0.1.0 release - easy and extensible (micro) benchmarking”

I hadn’t seen this before - I’ll give it a whirl. Is it possible to use the tool to find out where Phoenix is spending its time?


#10

Try to make a release with exrm, and then start as a service:
rel/your_app/bin/your_app start.
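The steps are roughly (assuming exrm is already in your deps; the app name is a placeholder):

# add {:exrm, "~> 1.0"} to deps in mix.exs first (version is illustrative), then:
MIX_ENV=prod mix release
rel/your_app/bin/your_app start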


#11

Also, check out the Observer GUI tool that ships with Erlang/OTP.
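You can open it from an IEx session attached to the running app, e.g.:

# start the server inside IEx so you can poke at it while wrk is running
iex -S mix phoenix.server

# then, at the IEx prompt:
:observer.start()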


#12

Could you put your test app on GitHub or somewhere similar? It would help us find the cause.


#13

Hi all, just wanted to say thank you for all of your timely help on this. I have been working through the suggestions a bit at a time as time permits, and I will put up some code as soon as I can. So far, turning off all of the output to stdout and building an exrm release have provided some improvements - though there are still a few scenarios where running via MIX_ENV=prod mix outperforms even the exrm release.

I’ll update as soon as I have some more information. Thanks again!


#14

Any updates, Matt? I came to this thread from Sasa’s blog post, and I remember a Hacker News thread where benchmarks were run at Rackspace. In your original post you mentioned going back to Java, so I thought I’d link you to their tests: https://gist.github.com/omnibs/e5e72b31e6bd25caf39a


#15

A few things to consider:

  1. If you’re running tests from outside AWS, you might see extra latency from the network path to AWS.
  2. Try an AWS instance with SSD (instance-store) volumes instead of EBS volumes.

If you can share the sample code, I can take a look and make better suggestions.


#16

Thanks for this @sudostack - I have been concentrating on getting some benchmarks for Go, Java and Elixir since I last checked in. Elixir is next up and I hope to get some more data this week. Thanks for the link to the tests - it would be awesome to see some updated numbers. It’s impressive to see the throughput Phoenix was getting back then; I’ve been getting <20k RPS on my MacBook.

Quoting @subbu05 (post #15): “If you’re running tests from outside AWS you might see some delay because of the connection time to AWS. Try an AWS instance with SSD volumes instead of EBS volumes.”

I’m going to try hitting it from either the same box or one within the same VPC. Good advice on #2. We were spinning these up in Docker containers, which people don’t seem to do much with Elixir. I haven’t seen a good reason why not, but I am curious what overhead Docker introduces.
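One thing I plan to try on the Linux boxes is host networking, to rule out the docker-proxy/NAT layer (the image name below is just a placeholder):

# bypass the bridge network entirely so the container shares the host's stack
docker run --rm --net=host my_elixir_phoenix_image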


#17

I have been doing similar benchmarks for my application, which is a JSON API, and I have found that Elixir/Phoenix is not the fastest thing out there (nor does it claim to be). But combined with its balance of productivity and performance, in my opinion it beats Scala, Java, Go and the rest - though I am coming from Ruby/Rails, so its similarity to Ruby was important to me too. That said, I have worked with Scala and Java before and cannot see frameworks such as Play being anywhere near as easy to develop with, even if they will probably get more requests per second out of a single box. From what I have read, Elixir also scales in a more predictable way as you add hardware - though I have not got to that point yet.


#18

Phoenix is fast, but only as fast as your pipeline allows it to be. The point of Phoenix is to be near the fastest while offering exceptional reliability.

However, my main point for this post: *DO*NOT*TEST*FROM*WINDOWS*. We learned that the hard way at work. If the server or the testing client is on Windows (at least Windows 10), it introduces close to 200ms of latency on initial connections while it ‘fills the TCP buffer’. At least on Windows 10 we’ve tried everything to disable it - registry edits, setting the TCP connection with NONAGLE, and a hundred things in between…