Help optimizing or profiling performace of ex_aws & hackney

Shayon · May 18, 2020, 10:38pm

Elixir & Erlang versions: Elixir 1.10.2 (compiled with Erlang/OTP 22)
ExAws version: ex_aws 2.1.3
ExAwsDynamo version: ex_aws_dynamo 3.0.1
HTTP client version: hackney 1.15.2
Host: AWS EC2 Instance type c5.2xlarge (8 cores, 16 GB memory)

Hi all!
We are writing a simple phoenix application that provides a REST API to query a DynamoDB Table.

During early stages of load testing the API of this service we noticed high process memory usage and high system load on our host after just 500 req/s (each request makes one to two queries to the Dynamo Table). We also observed high 95th percentile of request time (700 ms). We followed up by looking in two places: function call profiling and top memory consuming processes.

Looking at function calls:
We profiled using :fprof to look deeper and found that the function with the highest call count by a factor of 5 (to the third highest call count) was ExAws.Auth.remove_dup_spaces and the function with the second highest call count by a factor of 2 (to the third highest call count) was Jason.Decoder.string being called by Jason.Decoder.parse .

While it is a bit odd that these functions are hot, it doesn’t seem to indicate anything as seriously wrong. So we moved on to look towards understanding the memory usage we were exploring.

Looking at memory:
We used :etop to understand what was consuming our memory. We sent a modest amount of traffic to the service (3 req/s) and observed the changes to the top processes sorting by memory usage. Below is a sample:

========================================================================================                                                                                                                                                                                                                                       
 'query@127.0.0.2'                                                         21:31:26                                                                                                                                                                                                                                            
 Load:  cpu         1               Memory:  total      136677    binary       1771                                                                                                                                                                                                                                            
        procs     329                        processes   93257    code        17606                                                                                                                                                                                                                                            
        runq        0                        atom          815    ets          1307                                                                                                                                                                                                                                            
Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function                                                                                                                                                                                                                                           
----------------------------------------------------------------------------------------                                                                                                                                                                                                                                       
<7480.11279.0> tls_connection:init/     '-'  389056 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11300.0> tls_connection:init/     '-'  387479 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11334.0> tls_connection:init/     '-'  383728 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11327.0> tls_connection:init/     '-'  382759 4916124       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11298.0> tls_connection:init/     '-'  382497 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11337.0> tls_connection:init/     '-'  382324 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11291.0> tls_connection:init/     '-'  382279 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11332.0> tls_connection:init/     '-'  382107 4916124       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11340.0> tls_connection:init/     '-'  381925 4916540       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11325.0> tls_connection:init/     '-'  381798 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11286.0> tls_connection:init/     '-'  380394 4916124       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11338.0> tls_connection:init/     '-'  379561 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11329.0> tls_connection:init/     '-'  378424 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11294.0> tls_connection:init/     '-'  377844 4916140       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.11295.0> tls_connection:init/     '-'  377508 4916244       0 gen_statem:loop_rece                                                                                                                                                                                                                                       
<7480.129.0>   ssl_manager              '-'   6233310783868       0 gen_server:loop/7                                                                                                                                                                                                                                          
<7480.311.0>   telemetry_poller_def     '-'   57002  231276       0 gen_server:loop/7                                                                                                                                                                                                                                          
<7480.11163.0> etop_txt:init/1          '-'   17030  690244       0 etop:update/1                                                                                                                                                                                                                                              
<7480.692.0>   hackney_pool:init/1      '-'   12603  143784       0 gen_server:loop/7                                                                                                                                                                                                                                          
<7480.258.0>   'Elixir.Logger'          '-'    5827  143740       0 gen_event:fetch_msg/                                                                                                                                                                                                                                       
========================================================================================

After sending traffic, we immediately saw several tls_connection:init related processes take up the top slots. Each taking about 5 mb of memory. ExAws does use Hackney so we surmise that these processes are hackney-related. By default ExAws uses the default hackney pool of size 10, but as you can see we see more than 10 implicated processes.

Is 5 MB per connection normal? Why do we see more than 10 tls_connection processes when the default pool size is 10?

Overall, I’d like to know if anyone has any other suggestions as to what we should look into for the purpose of optimizing or profiling our current code or whether or not what we are seeing is expected. We assumed that we would be able to get more out of a single host.

Moxide · May 20, 2020, 9:28pm

Hello,
Have you set the hackney_opts in ex_aws config, to make use of the default pool ?

      config :ex_aws, :hackney_opts,
        pool: :default,
        max_connections: 50

If it’s not a problem for you, you can also configure ex_aws_dynamo to use the http scheme instead of https.

Shayon · May 20, 2020, 11:11pm

I did try messing around with that. I saw no noticeable difference when explicitly configuring the default pool.

Also we don’t really think that http is an option here, but that is a good point. As it does seem TLS seems to be part of the problem here.

We’ve since tried switching ex_aws’s http_client to Mint (using Mojito to handle the pooling for us). Config looking like:

config :mojito,
  timeout: 2500,
  pool_opts: [
    size: 20,
    pools: 10
  ]

With Mint/Mojito we have had a bit more success, with the ability to go as high as ~900 requests a second before seriously tanking our response times. While we’ve done this we’ve noticed that the scheduler utilization chart in :observer shows that as we increase load the schedulers seems to all synchronize their wave patterns - we’re guessing this points to some resource contention. We’re trying out eflame to see if it shows us where this contention lies.

axelson · May 20, 2020, 11:29pm

If you’re looking for high performance http libraries, you may want to try out https://github.com/ninenines/gun as well. Although it’ll be a little more work since it’s an erlang library (but I think some people use a wrapper like https://github.com/petrohi/machine_gun)

cmkarlsson · May 21, 2020, 1:41am

Can the server side handle the load?

When using connection pools you must make sure that the server side can handle the number of requests on the number of connection it has got. If not you will start seeing larger process queues and the sort of 95% percentile bad latency.

With a pool of 20 connections. 500 req/s = 500 / 20 = 25 requests per second per connection.
Each request should not take longer than 40ms.

If they constantly run over 40ms then queues will grow and things will gradually become worse and worse until everything stops.

Check the length of the process queues in observer. If they keep increasing something is not able to keep up with the load and you have a process bottleneck. It could be the process pool being too small, or the logic in the http client pool or something else. I don’t think the memory of various processes matter too much.

When benchmarking start with a single connection (perhaps by using mint without involving a pool) and fire of sequential requests against the server. This will give you a baseline what 1 connection can handle and what its latency characteristics are. Then you can start calculating how many connections you need to be able to handle the throughput required.

In case the server cannot handle the load and you have an need for ongoing requests I suggest looking into using a dispcount based pool (https://github.com/benoitc/hackney_disp)

Moxide · May 21, 2020, 6:36am

We are using Dynamodb, and it’s capable to handle a way heavier load, given that you’re not exceeding the request limits. Limiting factors are the type of read (consistent ones consumes 2 read capacities) and size of the objects you’re fetching (the limits are given per chunk of 4kb).

Shayon · May 22, 2020, 12:07am

Thanks for all of your thoughts!

@axelson We had looked at :gun as an option, and are still interested in giving it a try to see if it can outperform mint. So thanks for the suggestion! I hadn’t seen MachineGun and I like that it will also bring in Poolboy so maybe I’ll use that when we give it a try.

@cmkarlsson You bring up a great point. We weren’t really thinking in terms of connections per pool and what that means for process queuing. I’ve built a graph in our metric dashboard to show this relation: Incoming Requests per sec divided by Pool Size multiplied by AWS Request Time. I can see that when this graph goes past 1000 we do indeed start suffering in our Process Run Queue graph. With this knowledge, we increased the pool size and put the system under load again. This time around we achieved a higher load before seeing spikes in the Process Run Queue, but the other graph did not show pool overflow either. This leads us to think that we are now running into another bottleneck.

Using :fprof again we found that Jason.Decode seemed to be implicated as a hot code path. So, now we’ve switched the ExAws json_codec and the phoenix json_library to use JiffyEx. After profiling under load again we see that we are back to our http_client library, Mint, being the hottest code path.

We’re up to handling 3k req/s on one machine while staying stable. Although we have increased our instance type size to c5.4xlarge (16 cores, 32 GB memory).

It still feels like we ought to get more throughput from this, but maybe not

@Moxide We’re running these load tests on two nodes, and AWS is reporting 2.97k capacity units consumed per minute. We haven’t seen any read throttles yet either.