Very weird performance issue (inet_gethost_native)?

I have been having gallbladder issues and haven't eaten for 10 days while I wait for my ultrasound, so if this is a little poorly explained, I apologize in advance. Fasting is doodoo for my brain power.

I have a Phoenix JSON API endpoint that's currently doing ~100,000 requests/minute across 10 nodes, so roughly ~150 RPS per node.

This is the lifecycle of a request:

  1. A request comes in from vendor A. In order to serve it, I need to hit 3 other endpoints in real time, gather the results, find the data I need, and then fulfill the request.
  2. So… in the controller for this request, I use Task.async/await to spawn 3 processes, each hitting a separate API endpoint with machine_gun. I have the machine_gun timeout set to 1000ms.
  3. In theory, a request from vendor A should never take longer than 1000-1200ms: the 1000ms timeout cutoff plus several ms of overhead for the rest of the code to run. Say I hit 3 endpoints, 2 finish with 200 OK in 50ms, and the 3rd times out at 1000ms; I take the 2 successful responses and ignore the third, do what I need to with the data I have, and return it to vendor A (see the sketch right after this list).
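A minimal sketch of that fan-out-and-gather pattern, assuming a hypothetical fetch/1 and @endpoints; Task.yield_many bounds the total wait, and stragglers are shut down and ignored:

tasks = Enum.map(@endpoints, fn endpoint ->
  Task.async(fn -> fetch(endpoint) end)
end)

results =
  tasks
  |> Task.yield_many(1_200)  # total budget: timeout cutoff plus overhead
  |> Enum.map(fn {task, res} -> res || Task.shutdown(task, :brutal_kill) end)
  |> Enum.flat_map(fn
    {:ok, result} -> [result]  # finished within budget
    _ -> []                    # timed out or crashed: ignored, as described above
  end)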

In my mind… this seems like it has a predictable upper limit as far as resources are concerned. Realistically, 150 RPS, each creating 3 outbound requests with a 1000ms timeout, is only 450 outgoing RPS; since each request waits at most 1 second, that's at most ~450 'locked' processes at any moment (rate × max duration), all released almost immediately.

However, this is not the case. I noticed several of my 10 instances rapidly running OOM and wreaking havoc on other requests. I was seeing massive timeouts on all 3 API endpoints, in huge chunks (thousands of requests timing out in batches, as if something else had failed rather than the API itself). Someone on our team ended up pausing 2 of the 3 endpoints, and magically the 3rd API endpoint, which had been constantly timing out, now worked at 80ms response time 100% of the time without error. This tells me something outside the actual API endpoint itself is causing the issue.

I have thousands of errors per minute in Rollbar saying Task.async/await is timing out, taking longer than 5000ms to reply, which doesn't even make sense to me because the timeout is set to 1000ms.
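For reference, Task.await/1 is just Task.await(task, 5_000), so the 5000ms in those errors is Elixir's default await timeout rather than anything from machine_gun:

# If the task is stuck for any reason (pool checkout, host lookup, slow
# response), the caller exits after the await timeout, HTTP timeout or not.
task = Task.async(fn -> Process.sleep(:infinity) end)
Task.await(task)           # exits the caller after 5_000ms, the default budget
# Task.await(task, 1_500)  # the timeout has to be passed explicitly per call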

Also, I connected to Observer at one point and saw this (the inet_gethost_native process). I'm getting help from someone at Gigalixir, and they said it's used for DNS lookups. Perhaps I don't know DNS as well as I thought, but I assumed all of that happened behind the scenes; that is, that the language itself wasn't responsible for resolving hostnames to IPs. Isn't that the job of a DNS server?
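For context: by default the BEAM does resolve hostnames itself, funneling lookups through a single registered process, inet_gethost_native, which calls the OS resolver via a port program, so under heavy connection churn that one process can back up. A quick check from a remote IEx shell on a hot node (standard OTP/Elixir calls; what counts as "backed up" is a judgment call):

# Which lookup methods are configured? :native means hostname resolution
# goes through the inet_gethost_native port process.
:inet_db.res_option(:lookup)

# Is that process backed up? A long message queue or large memory figure
# here points at host-lookup pressure rather than the APIs themselves.
:inet_gethost_native
|> Process.whereis()
|> Process.info([:message_queue_len, :memory])

If it is backed up, common mitigations are pointing ERL_INETRC at an inetrc file that changes the {lookup, ...} method away from :native, or caching resolved IPs in the app.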

Can anyone think of anything else that might be causing so many failures and wreaking havoc with memory in such a 'predictable' setup?

Here’s an example of the massive groups of blackout timeouts that would happen every minute:

Ah. Quite the opposite. I don’t have a solution, just some observations:

I looked at machine_gun, and the docs warn:

  • When using MachineGun in a long-lived process (for example, a GenServer), make sure to handle messages in the form of {ref, _} tuples, which may be produced by pool timeouts.
  • When using MachineGun with HTTP/2 and modern HTTP/1 servers, we recommend using lowercase header names, for example content-type.
  • MachineGun may time out when a request with an empty body contains a content-type header but no content-length header. See this issue for details.
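For the first warning, a minimal sketch of what that looks like in a long-lived GenServer (the module name and state here are illustrative):

defmodule ApiClient.Worker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}

  # When a pooled request times out, its reply can still arrive later as a
  # bare {ref, _} tuple in this process's mailbox; without a clause like
  # this, stray replies pile up in a long-lived process.
  @impl true
  def handle_info({ref, _late_reply}, state) when is_reference(ref) do
    {:noreply, state}
  end
end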

As suggested in this machine_gun issue, you can increase the timeout to, say, 30_000, and then look for any error messages from your 3 processes.

I also imagine people are going to want to see your config for the options shown in this example from the docs:

%MachineGun.Response{body: body, status_code: 200} =
  MachineGun.get!(
    "http://icanhazip.com",
    [{"accept", "text/plain"}],
    %{pool_timeout: 1000, request_timeout: 5000, pool_group: :default})

Options are included to show defaults and can be omitted. pool_timeout and request_timeout default to values specified in pool group configuration. If not specified in pool group configuration they default to the values in the example.

Good luck with your health problems.


Thank you for the good wishes 🙂 Ooh, great notes… maybe you're right about content-length; I do know some of those were being logged. I'll try to dig into the specifics of the response. Here's how I make each request and record stats:

# :timer.tc/1 returns {elapsed_microseconds, result}
{response_time, request} = :timer.tc(fn ->
  MachineGun.get(url,
    [{"accept", "*/*"}, {"user-agent", "Test/1.0"}],
    %{request_timeout: @request_timeout, pool_group: :app})
end)

# convert microseconds to milliseconds
stats = %{stats | response_time: Decimal.div(Decimal.new(response_time), Decimal.new("1000"))}

{stats, body} = case request do
  {:error, %MachineGun.Error{reason: :request_timeout}} ->
    {%{stats | timeout: 1}, nil}
  {:error, %MachineGun.Error{reason: {:closed, _}}} ->
    # nullify response_time
    {%{stats | connection_closed: 1, response_time: nil}, nil}
  {:error, %MachineGun.Error{reason: :closed}} ->
    # same as {:closed, _} - https://github.com/ninenines/gun/issues/180
    # nullify response_time
    {%{stats | connection_closed: 1, response_time: nil}, nil}
  {:error, %MachineGun.Error{reason: :shutdown}} ->
    # nullify response_time
    {%{stats | shutdown: 1, response_time: nil}, nil}
  {:error, %MachineGun.Error{reason: :pool_timeout}} ->
    # nullify response_time
    {%{stats | pool_timeout: 1, response_time: nil}, nil}
  {:error, %MachineGun.Error{reason: reason}} ->
    RedisLogger.error("[UNKNOWN HTTP ERROR] #{inspect(reason)} #{url}")
    # nullify response_time
    {%{stats | other_error: 1, response_time: nil}, nil}
  {:ok, %MachineGun.Response{status_code: status_code}} when status_code >= 400 ->
    {%{stats | bad_status_code: 1}, nil}
  {:ok, %MachineGun.Response{body: body}} when body in ["", nil] ->
    {%{stats | empty_response: 1}, nil}
  {:ok, resp} ->
    {stats, resp.body}
end

And here is where I do the requests. I do some other misc things like parsing XML/JSON, but nothing heavy as far as 'work' is concerned, just data transformation with Enum/Map/List, etc.:

  App.Config.enabled_apis()
  |> Enum.map(fn api -> Task.async(fn -> make_request(api, ip, ua) end) end)
  |> Enum.map(&Task.await(&1))  # no timeout argument, so Task.await defaults to 5_000ms
  |> find_winner()
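One thing to note in that pipeline: with no explicit timeout, each Task.await waits up to the default 5_000ms and exits the request process on timeout rather than quietly skipping the slow API. A variant that enforces the intended 1000-1200ms budget and drops stragglers, reusing the yield_many pattern from the sketch near the top (the 1_200 is illustrative):

  tasks =
    Enum.map(App.Config.enabled_apis(), fn api ->
      Task.async(fn -> make_request(api, ip, ua) end)
    end)

  tasks
  |> Task.yield_many(1_200)
  |> Enum.flat_map(fn
    {_task, {:ok, result}} ->
      [result]

    {task, _timed_out_or_crashed} ->
      # drop the straggler instead of exiting the caller
      Task.shutdown(task, :brutal_kill)
      []
  end)
  |> find_winner()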

Have you tried increasing the pool size? The default size of 4 would probably be too small for this kind of workload.


Yes, I'm using this config:

  pool_size: 300,         # Poolboy size [1]
  pool_max_overflow: 300, # Poolboy max_overflow [1]
  pool_timeout: 120_000,
  conn_opts: %{}          # Gun connection options [2]

Also, I'm logging pool timeouts and haven't gotten any yet, so I imagine I'd see those first.

But honestly, I should just set these to something like 1000, because these APIs should be able to handle that fine.
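That would be something like this in config.exs, assuming machine_gun's documented layout of one map per pool group (the :app key matching the pool_group: :app used in the request code above):

config :machine_gun,
  app: %{
    pool_size: 300,
    pool_max_overflow: 300,
    # fail fast on pool checkout instead of letting callers queue for 2 minutes
    pool_timeout: 1_000,
    conn_opts: %{}
  }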
