Evaluating Elixir / Phoenix for a web-scale, performance-critical application

Hi all - cross-posting this from the elixir-talk mailing list. I could use some help. I am currently evaluating Elixir and Phoenix for a performance-critical application at a Fortune 500 company. This could be another great case study for Elixir and Phoenix if I can show that it meets our needs. Initial performance testing looked phenomenal, but I am running into some performance concerns that will force me to abandon this tech stack entirely if I cannot make the case.

The setup: an out-of-the-box Phoenix app generated with mix phoenix.new. No Ecto. It returns a static JSON response - basically a hello-world app.
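Roughly speaking, the action under test is nothing more elaborate than something like this (simplified sketch; module and action names are illustrative):

defmodule MarketApi.ProductController do
  use MarketApi.Web, :controller

  # Static payload - no Ecto, no database round-trip
  def index(conn, _params) do
    json(conn, %{products: []})
  end
end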

The hardware:

  • MacBook Pro, 16 GB RAM, 8 cores, 2.5 GHz, running Elixir/Phoenix natively and also in a Docker container
  • Amazon EC2 t2.medium running the Elixir Docker image

The tests: ab, wrk, siege, artillery and curl with a variety of configurations. Up to 100 concurrent connections. Not super scientific, I know... but

No matter what I try, Phoenix logs out impressive numbers to stdout - generally on the order of 150-300 microseconds. However, none of the load testing tooling agrees. No matter the hardware or load test configuration, I see around 20-40 ms response times. The goal for the services that I am designing is 20ms and several thousand requests per second. The load tests that @chrismccord and others have published suggest that I should be able to expect 3ms or less when running against localhost, but I'm not seeing anything close to that.

Would anyone be willing to work with me to look at some options here? I'd be incredibly grateful. Don't make me go back to Java, please :slight_smile: Is what I'm asking even possible?

10 Likes

I've read several times that for performance testing you should run the app in production mode.

MIX_ENV=prod mix compile.protocols
MIX_ENV=prod PORT=4001 elixir -pa _build/prod/consolidated -S mix phoenix.server

This is from the 0.7.2 docs so I'm not sure it's all still needed, but it might be worth a try.

4 Likes

Thanks for the great suggestion @terakilobyte - I just tried running it in PROD mode:

MIX_ENV=prod mix compile
MIX_ENV=prod mix phoenix.digest
MIX_ENV=prod PORT=4001 mix phoenix.server

./wrk -t8 -c100 -d30S --timeout 2000 http://localhost:4001/api/products

Running 30s test @ http://localhost:4001/api/products
8 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.16ms 32.83ms 365.59ms 94.76%
Req/Sec 494.23 139.45 770.00 76.14%
116284 requests in 30.06s, 71.44MB read
Requests/sec: 3868.18
Transfer/sec: 2.38MB

Still not very good unfortunately :frowning:

2 Likes

Interestingly, with only 10 concurrent connections:

Running 10s test @ http://localhost:4001/api/products
8 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.18ms 3.91ms 40.59ms 88.15%
Req/Sec 1.12k 249.24 1.78k 68.38%
89005 requests in 10.01s, 54.68MB read
Requests/sec: 8890.15
Transfer/sec: 5.46M

2 Likes

There was a similar thread recently. You may find some tips there.

Some low-hanging fruit to improve perf would be:

  1. Raise the log level in prod to :warn to suppress per-request logging (see the config sketch below).
  2. If you're testing a REST endpoint, make sure it goes through the :api pipeline and not the :browser one.
  3. Build an OTP release and benchmark against that.

Also, check your CPU usage while testing. If your CPUs are not all consistently near 100%, then you may have a bottleneck somewhere.
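For point 1, a minimal sketch of that setting (it goes in config/prod.exs and suppresses the info-level request logs that Plug and Phoenix emit):

config :logger, level: :warn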

9 Likes

Check out this micro-benchmark tool released by a member of the community: Benchee and BencheeCSV 0.1.0 release - easy and extensible (micro) benchmarking

4 Likes

To expand on Sasa's point, the :browser pipeline by default generates CSRF tokens to be used in forms as a security measure. These can be fairly expensive to generate, at least compared to a hello-world JSON endpoint. Definitely make sure you aren't doing that.
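For comparison, this is roughly what a freshly generated router of that era defines for the two pipelines - the CSRF work comes from protect_from_forgery in :browser (exact plugs may vary by Phoenix version):

pipeline :browser do
  plug :accepts, ["html"]
  plug :fetch_session
  plug :fetch_flash
  plug :protect_from_forgery
  plug :put_secure_browser_headers
end

pipeline :api do
  plug :accepts, ["json"]
end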

4 Likes

Thank you for your quick response on this @sasajuric. I looked around in the forums first but didn't see the other thread. Lots of great information over there to try. Pretty sure I'm piping through :api - here's my router:

defmodule MarketApi.Router do
  use MarketApi.Web, :router

  pipeline :api do
    plug :accepts, ["json"]
  end

  scope "/api", MarketApi do
    pipe_through :api
    resources "/products", ProductController, except: [:new, :edit]
  end
end

CPU is pretty well maxed out across all cores when I run the tests.

Thanks @benwilson512 - I'm super green to Phoenix, so I appreciate the tip. I think I am using :api. Is there anything I need to do besides pipe_through :api?

[quote="thinkpadder1, post:6, topic:832, full:true"]
Check out this micro-benchmark tool released by a member of the community: Benchee and BencheeCSV 0.1.0 release - easy and extensible (micro) benchmarking
[/quote]

I hadn't seen this before - I'll give it a whirl. Is it possible to have the tool show where Phoenix is spending all its time?
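For reference, the basic usage looks to be along these lines (in more recent Benchee releases at least; the job body here is just an illustrative function - it benchmarks plain functions rather than profiling a whole Phoenix request):

Benchee.run(%{
  "encode static product json" => fn -> Poison.encode!(%{products: []}) end
})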

2 Likes

Try making a release with exrm, and then start it as a service:
rel/your_app/bin/your_app start.
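If it helps, the rough sequence with exrm is along these lines (assuming an exrm dependency such as {:exrm, "~> 1.0"} in mix.exs; your_app stands in for the real OTP app name):

MIX_ENV=prod mix release
rel/your_app/bin/your_app start    # or `console` / `foreground` to stay attached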

2 Likes

Also, check out the Observer tool that ships with Erlang/OTP and is available from any Elixir node.
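A quick way to open it from an IEx session attached to the running node (requires an Erlang/OTP build with wx/GUI support):

iex> :observer.start()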

2 Likes

Could you put your test app on GitHub or somewhere similar? It would help in tracking down the cause.

2 Likes

Hi all, just wanted to say thank you for all of your timely help on this. I have been working through the suggestions as time permits, and I will put up some code as soon as I can. So far, turning off all of the output to stdout and building an exrm release have provided some improvements - though there are still a few scenarios where MIX_ENV=prod outperforms even the exrm release.

I'll update as soon as I have some more information. Thanks again!

1 Like

Any updates, Matt? I came to this thread from Sasa's blog post, and I remember a Hacker News thread with benchmarks run at Rackspace. In your original post you mentioned going back to Java, so I thought I'd link you to their tests: https://gist.github.com/omnibs/e5e72b31e6bd25caf39a

A few things to consider:

  1. If you're running tests from outside AWS, you might see added latency from the network path into AWS.
  2. Try an AWS instance with local SSD storage instead of EBS volumes.

If you can share the sample code, I can take a look and offer more specific suggestions.

Thanks for this @sudostack - I have been concentrating on getting some benchmarks for Go, Java and Elixir since last I checked in. Elixir is next up and I hope to get some more data this week. Thanks for the link to the tests - it would be awesome to see some updated numbers. It's impressive to see the throughput Phoenix was getting back then. I've been getting <20k RPS on my MacBook.

[quote="subbu05, post:15, topic:832"]
If you're running tests from outside AWS, you might see added latency from the network path into AWS. Try an AWS instance with local SSD storage instead of EBS volumes.
[/quote]

I'm going to try hitting it from either the same box or one within the same VPC. Good advice on #2. We were spinning these up on Docker instances, which people don't seem to do with Elixir. I haven't seen a good reason yet, but I am curious as to what overhead is introduced by Docker.

1 Like

I have been doing similar benchmarks for my application, which is a JSON API, and I have found that Elixir/Phoenix is not the fastest thing out there (nor does it claim to be). For the balance of productivity vs performance, though, in my opinion it beats Scala, Java, Go and the rest - bearing in mind that I am coming from Ruby/Rails, so its similarity to Ruby was important to me too. That said, I have worked with Scala and Java before and cannot see frameworks such as Play being anywhere near as easy to develop with, even if they will probably squeeze more requests per second out of a single box. From what I have read, Elixir also scales more predictably as you add hardware - though I have not gotten to that point yet.

Phoenix is fast, but only as fast as your pipeline allows it to be. The point of Phoenix is to be 'near' the fastest while offering exceptional reliability.

However, my main point for this post: *DO*NOT*TEST*FROM*WINDOWS*. We learned that the hard way at work. If the server or the testing client is on Windows, then (at least on Windows 10) it introduces nearly 200ms of latency on initial connections while it 'fills the TCP buffer'. At least on Windows 10 we've tried everything to disable it - registry edits, setting the TCP connection with NONAGLE and a hundred things in between...

4 Likes

I know this was looong ago, but I have to clarify this for myself: how is it possible that Phoenix, which is built on top of Plug, has (slightly) lower latency and considerably better consistency (lower σ)?

Given how old these benchmarks are, it's anyone's guess. I'd suggest they be redone before anyone worries about analyzing their results.