- Elixir & Erlang versions:
Elixir 1.10.2 (compiled with Erlang/OTP 22)
- ExAws version:
ex_aws 2.1.3
- ExAwsDynamo version:
ex_aws_dynamo 3.0.1
- HTTP client version:
hackney 1.15.2
- Host: AWS EC2 Instance type c5.2xlarge (8 cores, 16 GB memory)
Hi all!
We are writing a simple Phoenix application that provides a REST API for querying a DynamoDB table.
During early load testing of the service's API we noticed high process memory usage and high system load on our host at just 500 req/s (each request makes one to two queries to the DynamoDB table). We also observed a high 95th-percentile request time of 700 ms. We followed up by looking in two places: function call profiling and the top memory-consuming processes.
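For context, the per-request Dynamo access is roughly of this shape (the "users" table, the :id key, and the module name are placeholders for illustration, not our real schema):

```elixir
defmodule MyApp.UserStore do
  # One of the one-to-two queries each request performs, roughly.
  def fetch(user_id) do
    ExAws.Dynamo.query("users",
      key_condition_expression: "id = :id",
      expression_attribute_values: [id: user_id],
      limit: 1
    )
    |> ExAws.request()
  end
end
```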
- Looking at function calls:
We profiled with :fprof to look deeper and found that the function with the highest call count, by a factor of 5 over the third-highest, was ExAws.Auth.remove_dup_spaces, and the function with the second-highest call count, by a factor of 2 over the third-highest, was Jason.Decoder.string, called from Jason.Decoder.parse.
While it is a bit odd that these functions are so hot, it doesn't seem to indicate anything seriously wrong on its own, so we moved on to trying to understand the memory usage (the rough :fprof invocation we used is sketched below).
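For reference, this is approximately how we drove :fprof; MyApp.UserStore.fetch/1 is the placeholder from above, standing in for the real request path:

```elixir
# Trace one representative call path, then profile and analyse the result.
# The applied function and its argument are placeholders for our query code.
:fprof.apply(&MyApp.UserStore.fetch/1, ["some-user-id"])
:fprof.profile()
:fprof.analyse(dest: 'fprof.analysis', sort: :own)
```

We then read through fprof.analysis by hand, sorted by own time, to find the call counts mentioned above.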
- Looking at memory:
We used :etop to understand what was consuming our memory, invoking it roughly as sketched below. We then sent a modest amount of traffic to the service (3 req/s) and observed the changes to the top processes, sorted by memory usage.
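This is approximately the invocation (the interval and line count are just what we happened to pick):

```elixir
# Start etop in text mode, sorting the process list by memory usage,
# refreshing every 5 seconds.
:etop.start(sort: :memory, interval: 5, lines: 20)
```

Below is a sample of the output: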
========================================================================================
'query@127.0.0.2' 21:31:26
Load: cpu 1 Memory: total 136677 binary 1771
procs 329 processes 93257 code 17606
runq 0 atom 815 ets 1307
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
----------------------------------------------------------------------------------------
<7480.11279.0> tls_connection:init/ '-' 389056 4916140 0 gen_statem:loop_rece
<7480.11300.0> tls_connection:init/ '-' 387479 4916140 0 gen_statem:loop_rece
<7480.11334.0> tls_connection:init/ '-' 383728 4916140 0 gen_statem:loop_rece
<7480.11327.0> tls_connection:init/ '-' 382759 4916124 0 gen_statem:loop_rece
<7480.11298.0> tls_connection:init/ '-' 382497 4916140 0 gen_statem:loop_rece
<7480.11337.0> tls_connection:init/ '-' 382324 4916140 0 gen_statem:loop_rece
<7480.11291.0> tls_connection:init/ '-' 382279 4916140 0 gen_statem:loop_rece
<7480.11332.0> tls_connection:init/ '-' 382107 4916124 0 gen_statem:loop_rece
<7480.11340.0> tls_connection:init/ '-' 381925 4916540 0 gen_statem:loop_rece
<7480.11325.0> tls_connection:init/ '-' 381798 4916140 0 gen_statem:loop_rece
<7480.11286.0> tls_connection:init/ '-' 380394 4916124 0 gen_statem:loop_rece
<7480.11338.0> tls_connection:init/ '-' 379561 4916140 0 gen_statem:loop_rece
<7480.11329.0> tls_connection:init/ '-' 378424 4916140 0 gen_statem:loop_rece
<7480.11294.0> tls_connection:init/ '-' 377844 4916140 0 gen_statem:loop_rece
<7480.11295.0> tls_connection:init/ '-' 377508 4916244 0 gen_statem:loop_rece
<7480.129.0> ssl_manager '-' 6233310783868 0 gen_server:loop/7
<7480.311.0> telemetry_poller_def '-' 57002 231276 0 gen_server:loop/7
<7480.11163.0> etop_txt:init/1 '-' 17030 690244 0 etop:update/1
<7480.692.0> hackney_pool:init/1 '-' 12603 143784 0 gen_server:loop/7
<7480.258.0> 'Elixir.Logger' '-' 5827 143740 0 gen_event:fetch_msg/
========================================================================================
After sending traffic, we immediately saw several tls_connection:init-related processes take up the top slots, each holding about 5 MB of memory. ExAws uses hackney, so we surmise that these processes are hackney-related. By default ExAws uses the default hackney pool, which has a size of 10, yet as you can see more than 10 processes are implicated.
Is 5 MB per connection normal? Why do we see more than 10 tls_connection processes when the default pool size is 10?
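For what it's worth, the change we are planning to experiment with is a dedicated, larger hackney pool. The :ex_aws_pool name and the sizes below are our own choices, and pointing ExAws at the pool via `config :ex_aws, :hackney_opts, pool: :ex_aws_pool` is just our reading of the ex_aws 2.1 hackney client, so treat this as a sketch rather than a known fix:

```elixir
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # A dedicated hackney pool, larger than hackney's default of 10
      # connections. Name and sizes are placeholders we chose, not values
      # recommended by hackney or ExAws.
      :hackney_pool.child_spec(:ex_aws_pool, timeout: 15_000, max_connections: 50),
      MyAppWeb.Endpoint
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

We have also been peeking at pool occupancy with :hackney_pool.get_stats(:default) to see whether the default pool is actually saturating under load.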
Overall, I'd like to know whether anyone has other suggestions on what we should look into to profile or optimize our current code, or whether what we are seeing is expected. We assumed we would be able to get more out of a single host.