7.5 second average page load speed for 200 visitors hitting a Phoenix driven website - on a $5/month Digital Ocean Droplet


#21

You can always load less important things like view counts asynchronously, so the rest of the page is essentially static (Drab makes this super easy, for example).


#22

I would not keep coming back to a site if I have to wait 7-8 secs at every click.

For sure! If page loads ever took anywhere near that long, performance would be my #1 concern for the project.

Educational value for you is not worth it? This is basically production-level of work and is a priceless experience (unless you already did it 10 times before).

I worked as a Node dev in the US for a few years (at one unicorn and later a YC startup) and did some Rails contracting as well, so web dev isn’t entirely novel to me, but I’m still at a good place in the learning curve for Elixir and especially its ecosystem.

Mostly, it’s time. I only really have about one or maybe two hours a day I can spend on this stuff and am also recovering from pretty severe RSI that limits keyboard time. I certainly don’t want a broken site experience, but I also have to prioritize what’s useful for either my learning or that of my viewers. So each week it’s a question of “How should I allocate these 5-10 hours?”


#23

My point of view is that the educational value of doing this is bigger for the viewers than for the author, so it’s a very good thing that we have something to point beginners to when they hit their first performance issues.

Now, asking as a beginner myself, couldn’t you just use a GenServer to cache the episodes instead of using GenServer + ETS? Would that affect the cache performance?


#24

Technically you can easily use an Agent that stores a map with top 20-50 articles.
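
For illustration, a minimal sketch of such an Agent-backed cache might look like the following. The module name `TopArticlesCache` and the shape of the article maps are hypothetical, not code from the site being discussed.

```elixir
defmodule TopArticlesCache do
  # Hypothetical example: keeps a map of article id => article in one Agent process.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Store (or overwrite) an article under its id.
  def put(id, article) do
    Agent.update(__MODULE__, &Map.put(&1, id, article))
  end

  # Returns the cached article, or nil on a cache miss.
  def get(id) do
    Agent.get(__MODULE__, &Map.get(&1, id))
  end
end
```

Every `get/1` goes through the single Agent process, so reads are serialized. That is fine at low traffic and is the reason a single process can become a bottleneck under heavier load.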


#25

An Agent would quickly become a bottleneck in a system that’s already slow (or maybe not, 200 req/s might not be enough to overload a GenServer). But let’s assume that there is a bit more load (>10000 req/s). If these top articles don’t change often (say, no more than a few times per minute), then compiling them as a list into a module using a mochiglobal-style library such as https://github.com/lpgauth/foil or https://github.com/discordapp/fastglobal, or any other lib like that, would improve the system’s efficiency significantly.
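
To make the "compile it into a module" idea concrete, here is a rough sketch of the general technique using only the standard library. It is not the API of foil or fastglobal; the names `CompiledTopArticles` and `CompiledTopArticles.Holder` are made up for this example.

```elixir
defmodule CompiledTopArticles do
  # Hypothetical name for the module that gets (re)generated at runtime.
  @holder CompiledTopArticles.Holder

  # Recompile the holder module so that all/0 returns `articles` as a literal.
  # Recompiling is comparatively expensive, so this only pays off when
  # writes are rare and reads are frequent.
  def put(articles) when is_list(articles) do
    # Drop the previous version so the module can be redefined.
    # Note: between delete and create, all/0 briefly falls back to [].
    :code.purge(@holder)
    :code.delete(@holder)

    body =
      quote do
        def all, do: unquote(Macro.escape(articles))
      end

    Module.create(@holder, body, Macro.Env.location(__ENV__))
    :ok
  end

  # Reads are a plain function call: no message passing, no ETS lookup.
  def all do
    if function_exported?(@holder, :all, 0) do
      apply(@holder, :all, [])
    else
      []
    end
  end
end
```

The libraries linked above handle the edge cases (concurrent writers, code purging) more carefully, so treat this only as a picture of why reads become so cheap.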

As for

[…] you auto-increment view counts on every page view, and that it would be pointless to update the cache on update since it would get busted on every view.

I’d probably use ETS tables with their counters. If you start experiencing ETS lock contention, there are still faster ways to implement counters on the BEAM if it’s really necessary, such as https://github.com/andytill/oneup. The first step would be to use an ETS table per scheduler, though, before reaching for NIFs like andytill/oneup. There are also counters available in Rust, of course, which might seem safer to some people.

As for caching, I don’t think it’s necessary for rendering pages. Just keep the compiled markdown template as an iolist of HTML binaries and insert the data that changes before pushing it over the socket. That’s basically how the benchmarks I mentioned above were structured, and it was very fast …
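
A minimal sketch of the ETS counter idea, assuming a single named table (the per-scheduler variant is not shown). The module name `ViewCounter` and the table name `:view_counts` are hypothetical.

```elixir
defmodule ViewCounter do
  # Hypothetical table name; rows are {episode_id, count} tuples.
  @table :view_counts

  # Create the table once, e.g. from the process that should own it.
  def init do
    :ets.new(@table, [:named_table, :public, :set, write_concurrency: true])
    :ok
  end

  # Atomically add 1 to the count in tuple position 2.
  # The last argument is the row to insert first if the key is missing.
  def increment(episode_id) do
    :ets.update_counter(@table, episode_id, {2, 1}, {episode_id, 0})
  end

  # Read the current count without going through any process.
  def get(episode_id) do
    case :ets.lookup(@table, episode_id) do
      [{^episode_id, count}] -> count
      [] -> 0
    end
  end
end
```

The counts live only in memory here, so persisting them to the database (periodically, for instance) would be a separate step.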


#26

It’s funny to me, because I used to work on a site that served 14m visitors a month (I don’t remember the pageviews), and when we were trying to optimise it, we were told peak traffic was 70 requests per second.

So we tried to optimise for 100 requests per second, but it was taking too long, so in the end the managers decided to just throw another £10k per year at the database and call it a day. The pages loaded in 1-3 seconds and they were paying £35k a year in AWS costs.

And I’m looking at these numbers and thinking “we probably could’ve run that for $5 a month”.


#27

Compiling the Markdown template (or even its output) to an iolist would be ideal. Are there any tools for doing this that you’d recommend?


#28

Why not just save both the markdown and the compiled HTML to the database and show the HTML in your templates without any processing (since it’s already compiled)?


#29

A big concatenated string of HTML would definitely be slower than an iolist.


#30

Can you explain why?

What makes loading a snippet of pre-compiled HTML from a database slower than an iolist? I don’t even know what an iolist is.


#31

I don’t even know what an iolist is.

Have a look at these two blog posts:
Elixir and IO Lists, Part 1: Building Output Efficiently
Elixir and IO Lists, Part 2: IO Lists in Phoenix


#32

That 2nd post is really good, but it sounds like EEx will transform all of the static bits of a template into functions that get cached, while the dynamic bits will never be cached.

But in this example of displaying markdown, wouldn’t it always be pulled from a dynamic resource? It would be a text field from the database. Either markdown text that gets compiled to HTML at runtime (page view) or HTML that gets compiled from markdown on add / edit in the admin.

How would having the markdown be compiled into an iolist make a difference if in both cases it’s dynamic data?


#33

Well, in Ruby you need caching not because calculating a dynamic value costs a lot of time, but because concatenating all the small pieces into a single large string before sending it to the client takes a lot of time. Due to how IO lists work on the BEAM, this concatenation step can be skipped.

If calculating some value really is expensive, then on the BEAM you usually cache the result of the calculation, not the result of rendering the template containing that value.
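
A small sketch of the difference, with made-up markup and a hypothetical module name `IoListDemo`: both functions describe the same bytes, but only the first one copies them into a new binary on every call.

```elixir
defmodule IoListDemo do
  # Static fragments; as literals they are not rebuilt on each call.
  @header "<h1>Episode</h1>\n<p>Views: "
  @footer "</p>\n"

  # Concatenation: allocates a fresh binary containing all three parts.
  def render_binary(view_count) do
    @header <> Integer.to_string(view_count) <> @footer
  end

  # iodata: just a list pointing at the existing binaries, nothing is copied.
  def render_iodata(view_count) do
    [@header, Integer.to_string(view_count), @footer]
  end
end
```

Functions that write to a socket or file accept iodata directly, so the list never has to be flattened by application code. `IO.iodata_to_binary/1` is only needed when a single binary is really required, for example in a test.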


#34

I’ve searched through several GitHub repos and couldn’t find anything, so I decided to write a small example of what my suggestion could look like. These “pre-compilation” steps can be abstracted away into a module and used like html.eex templates in Phoenix.

Results of rendering this blob of markdown with a “dynamic” view count at the end:

Heading
=======

## Sub-heading
 
Paragraphs are separated
by a blank line.

Two spaces at the end of a line  
produces a line break.

Text attributes _italic_, 
**bold**, `monospace`.

Horizontal rule:

---

Bullet list:

  * apples
  * oranges
  * pears

Numbered list:

  1. wash
  2. rinse
  3. repeat

A [link](http://example.com).

![Image](Image_icon.png)

> Markdown uses email-style > characters for blockquoting.

Inline <abbr title="Hypertext Markup Language">HTML</abbr> is supported.

<%= view_count %>

on my laptop recorded with benchee:

Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
Number of Available Cores: 4
Available memory: 8 GB
Elixir 1.6.6
Erlang 21.0.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 μs
parallel: 1
inputs: none specified
Estimated total run time: 21 s


Benchmarking example.md as binary...
Benchmarking example.md as iodata...
Benchmarking example.md as precompiled iodata...


Name                                       ips        average  deviation         median         99th %
example.md as precompiled iodata     6286.59 K       0.159 μs   ±187.28%       0.140 μs        0.49 μs
example.md as binary                    2.43 K      410.78 μs    ±42.90%         377 μs      837.24 μs
example.md as iodata                    2.31 K      433.24 μs    ±36.81%         406 μs      798.84 μs

Comparison:
example.md as precompiled iodata     6286.59 K
example.md as binary                    2.43 K - 2582.42x slower
example.md as iodata                    2.31 K - 2723.61x slower

“example.md as binary” is what I think your current approach – judging by what I’ve read here – looks like.
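
The benchmark code itself isn’t reproduced in this post. As a rough sketch of the “precompiled iodata” idea under simplifying assumptions (the markdown has already been converted to HTML once, and the view count is marked by a literal placeholder), it could look something like this. `PrecompiledPage` and its functions are hypothetical names.

```elixir
defmodule PrecompiledPage do
  # Literal marker for the dynamic value, as in the sample document above.
  @placeholder "<%= view_count %>"

  # Done once, e.g. when an episode is saved: split the already-converted
  # HTML around the placeholder and keep the two static halves.
  # Raises a MatchError if the placeholder is missing.
  def precompile(html) when is_binary(html) do
    [head, tail] = String.split(html, @placeholder, parts: 2)
    {head, tail}
  end

  # Done on every request: no markdown parsing and no concatenation,
  # just a three-element iodata list around the fresh view count.
  def render({head, tail}, view_count) do
    [head, Integer.to_string(view_count), tail]
  end
end
```

The per-request work is building a tiny list, which is consistent with the precompiled variant being orders of magnitude faster than converting the markdown on each render.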


#35

Yep, that’s exactly what I meant: an entry-level optimization, good enough for a while because it’s quick to implement. I agree it’s not a sustainable long-term solution. :+1:

fastglobal would be my exact choice. If I had a blog with 2000 articles I’d still cache them all in there; that could not take more than 50 or so megabytes. Good enough to run 30-40 such blogs on an RPi Zero behind a good router!


#36

@nickjanetakis @NobbZ

I tried caching the compiled html of each episode page (both member and visitor versions of each) and got a modest improvement (about another 25%).

Then, based on @jakemorrison’s link, I made some OS and nginx setting changes to increase limits on file descriptors and workers and got a dramatic speedup!

My latest tests with loader are getting 450 req/s on the listing page and even better results for single episodes: at 250 req/s, response times are down to 417ms, and it survived 750 req/s, averaging 3255ms each.

This is actually a lot better than I’d expected it to get with such minimal changes. @idiot’s approach is probably what I’ll dig into (eventually) when there’s a need for more improvement.

Thanks for all of the ideas, everyone :grinning:


#37

Do you mean you saved the HTML to the DB and then directly loaded it?

New numbers look really good. 4 seconds down to 400ms for 250 req / s is a massive boost. Was this still with loader.io’s figures?

If you get bored, an interesting test would be to bring up DigitalOcean’s graphs and run your test just to see if you’re CPU, memory or disk I/O bound.


#38

Can’t wait to see the next episode :smile:


#39

Got the next performance episode edited and posted.

tl;dw: file handle limits and Nginx worker limits were both significant limiting factors, at least after increasing Phoenix’s max_keepalive option and caching the markdown conversions. After raising both, performance improved dramatically.

It still doesn’t look like CPU or memory is a constraint, but disk IO and bandwidth might be.


#40

Interesting. I just re-ran the wrk test on your episodes page and there was no difference in performance. If anything, it might have decreased, because now there are timeouts.

Before your latest changes:

nick@workstation:/e/tmp$ wrk -t8 -c200 -d30 https://alchemist.camp/episodes
Running 30s test @ https://alchemist.camp/episodes
  8 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.30s   240.47ms   1.88s    80.63%
    Req/Sec    20.58     13.87   100.00     74.83%
  4105 requests in 30.05s, 205.56MB read
Requests/sec:    136.61
Transfer/sec:      6.84MB

After your latest changes:

nick@workstation:/e/tmp$ wrk -t8 -c200 -d30 https://alchemist.camp/episodes
Running 30s test @ https://alchemist.camp/episodes
  8 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.41s   211.83ms   2.00s    51.85%
    Req/Sec    19.84     13.00    90.00     76.91%
  3755 requests in 30.08s, 196.78MB read
  Socket errors: connect 0, read 0, write 0, timeout 60
Requests/sec:    124.82
Transfer/sec:      6.54MB