Many Concurrent Outbound HTTP Requests - How to Manage CPU Utilization

TestingTester · August 21, 2023, 10:16pm

Hello all,

I am working on a project using Phoenix that receives requests for data, makes many external API calls, and returns the received data after some formatting. For each incoming request I’d like to make ~30+ requests to an external service.

I am hoping to accomplish this by making my external API calls concurrently, but I am running into CPU bottlenecks. Currently, I am kicking off each request in a task, awaiting all of the tasks, and formatting + merging the received data. This greatly reduces my response time, but comes at the cost of CPU/scheduler utilization. I’m seeing CPU usage spike to 20-50% momentarily when trying to make 100+ outbound requests in a short period of time.

This is more pseudocode than reality, but here’s a snippet that demonstrates roughly how I am making my external requests & gathering the responses:

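Enum.map(some_enumerable_of_length_30, fn thing ->
  # One task per outbound request; all of them start immediately.
  Task.async(fn ->
    some_http_client.get(...)
    |> parse_response()
  end)
end)
# Await the tasks in order; each await uses the default 5-second timeout.
|> Enum.map(&Task.await/1)
|> merge_results()
|> format()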

I’ve tried a number of HTTP clients: hackney, mint, finch, req.
I’ve played around with the connection pool sizes, but I can’t seem to avoid high CPU usage.

Additional context: the data I’m receiving per external request is ~500B - ~700B and I’m making all of my requests to the same service.

My main question is: Is it really that intensive to make 100+, 500+, etc. HTTPS requests to an external service concurrently, or am I doing something horribly wrong in my Elixir application code?

lud · August 21, 2023, 11:51pm

Isn’t it a good thing to max out the CPU though?

You are enqueuing all the requests in one instant in your loop, so it’s expected that the system will run as much as possible concurrently. If I had enough memory I would like the CPU spike to be at 100%, to handle as much work as it can, as fast as it can.

If you want to smooth things down you can use Task.async_stream and tweak the :max_concurrency option to limit concurrent jobs, at the cost of higher response time.
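As a rough illustration of the Task.async_stream suggestion, the sketch below caps the number of in-flight calls with :max_concurrency. The module name BoundedFetch and the fake_get/1 function are hypothetical stand-ins (a sleep replaces the real HTTP call), and the limit of 10 is an arbitrary starting point, not a recommendation.

```elixir
defmodule BoundedFetch do
  # Hypothetical stand-in for an HTTP call; the sleep simulates network latency.
  def fake_get(id) do
    Process.sleep(20)
    {:ok, %{id: id}}
  end

  # Runs fake_get/1 for every id, with at most `max_concurrency` tasks at a time.
  def fetch_all(ids, max_concurrency \\ 10) do
    ids
    |> Task.async_stream(&fake_get/1,
      max_concurrency: max_concurrency,
      timeout: 5_000,
      # Kill a task that exceeds the timeout instead of crashing the caller.
      on_timeout: :kill_task
    )
    |> Enum.flat_map(fn
      {:ok, {:ok, body}} -> [body]
      # Timed-out or failed calls are dropped in this sketch.
      _ -> []
    end)
  end
end
```

With 30 inputs and a limit of 10, the calls run in roughly three waves, so total latency rises as the limit drops. That is the response-time cost mentioned above.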

Finch uses connection pools, so you could use that to limit the concurrent connections as well, but I am not sure how that translates to controlling concurrent requests. That should be available in Req as well, as it uses Finch under the hood.

al2o3cr · August 21, 2023, 11:58pm

Are the connections using HTTPS? That’s got a significant initial overhead to negotiate the initial SSL connection, regardless of how short the request + response are.

TestingTester · August 22, 2023, 12:02am

That is a great point. I am seeing the speedup that I expected and the tradeoff is CPU usage. However, my concern is that I’m seeing what I feel is abnormally high CPU usage compared to the number of requests I’m receiving (and therefore sending to the external API).

I’m seeing huge spikes from fewer than 10 incoming requests (10 * 30 = 300 outgoing requests), and I’d like to scale this to support a much higher number of incoming requests. Obviously, I could throw money at the problem and get a beefier CPU, but I’m trying to determine whether there is something I’m missing here. Ideally, I wouldn’t have to scale hardware so early in the testing phase.

TestingTester · August 22, 2023, 12:05am

Yes, I need to make the requests using HTTPS.

Also, the response size is rather small, but the requests can take 100-300ms (maybe even 500ms). Not sure if that would have any impact on CPU if a few of the concurrent requests take longer than others to complete.

dimitarvp · August 22, 2023, 12:23am

You are basically awaiting each task one by one instead of using Task.await_many. Even better, as @lud said: use Task.async_stream and set its max_concurrency to less than or equal to the maximum size of your Finch pool.

Not sure if that’s going to bring down the CPU load by a lot but it will help somewhat + you will be doing even better parallelization.
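For reference, a minimal sketch of the Task.await_many variant follows. Because every task is started before the first await, awaiting them one at a time does not serialize the requests; the practical difference is that Task.await_many/2 applies one timeout to the whole batch rather than a fresh timeout per task. AwaitManyDemo and slow_double/1 are hypothetical names, and the sleep stands in for a slow external call.

```elixir
defmodule AwaitManyDemo do
  # Hypothetical slow operation standing in for an outbound request.
  def slow_double(n) do
    Process.sleep(50)
    n * 2
  end

  # Starts one task per input, then waits for all of them under a single deadline.
  def run(inputs, timeout \\ 5_000) do
    inputs
    |> Enum.map(fn n -> Task.async(fn -> slow_double(n) end) end)
    |> Task.await_many(timeout)
  end
end
```

Results come back in the same order as the inputs, and the caller exits if the batch as a whole exceeds the timeout.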

derpycoder · August 22, 2023, 3:47am

I don’t know if connection pools can help in this regard, but give it a try.

Since all the calls to external APIs are HTTPS, it would be great to hold onto those connections for subsequent calls. (The decryption and encryption might be maxing out the CPU usage.)

See Finch:

https://hexdocs.pm/finch/Finch.html#module-usage

Here’s how Plausible Analytics uses it:

github.com/plausible/analytics/blob/master/lib/plausible/application.ex

defmodule Plausible.Application do
  @moduledoc false

  use Application

  require Logger

  def start(_type, _args) do
    children = [
      Plausible.Repo,
      Plausible.ClickhouseRepo,
      Plausible.IngestRepo,
      Plausible.AsyncInsertRepo,
      Plausible.ImportDeletionRepo,
      Plausible.Ingestion.Counters,
      {Finch, name: Plausible.Finch, pools: finch_pool_config()},
      {Phoenix.PubSub, name: Plausible.PubSub},
      Plausible.Session.Salts,
      Plausible.Event.WriteBuffer,
      Plausible.Session.WriteBuffer,

This file has been truncated.

Example:

defp finch_pool_config() do
    finch_pool_config = Application.fetch_env!(:derpy_coder, DerpyCoder.Finch)
    default_pool_config = finch_pool_config[:default_pool_config]

    %{
      :default => [
        size: default_pool_config.size,
        count: default_pool_config.count
      ],
      "https://derpycoder.site" => [
        protocol: :http2,
        count: 50,
        conn_opts: [
          transport_opts: [
            timeout: 15_000,
            verify: :verify_peer,
            cacertfile: mkcert_path("/rootCA.pem"),
            keyfile: mkcert_path("/rootCA-key.pem")
          ]
        ]
      ],
      "https://s3.derpycoder.site" => [
        protocol: :http2,
        count: 50,
        conn_opts: [
          transport_opts: [
            timeout: 15_000,
            verify: :verify_peer,
            cacertfile: mkcert_path("/rootCA.pem"),
            keyfile: mkcert_path("/rootCA-key.pem")
          ]
        ]
      ]
    }
end
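For the single-host case in the original question, a smaller sketch may be easier to reason about than the multi-host config above. It builds a Finch child spec with one pool and exposes the same number as a concurrency cap, following the earlier suggestion to keep max_concurrency at or below the pool size. The module name PoolSketch, the host https://api.example.com, and the size of 50 are all assumptions for illustration; :size and :count are Finch pool options.

```elixir
defmodule PoolSketch do
  # Hypothetical values; tune them against real measurements.
  @host "https://api.example.com"
  @pool_size 50

  # Child spec to place in an application supervisor's children list.
  def finch_child_spec do
    {Finch,
     name: PoolSketch.Finch,
     pools: %{
       # One pool of up to @pool_size reusable connections to a single host.
       @host => [size: @pool_size, count: 1]
     }}
  end

  # Cap for Task.async_stream's :max_concurrency, kept equal to the pool size
  # so tasks do not queue up waiting for a free connection.
  def max_concurrency, do: @pool_size
end
```

Reusing pooled connections means the TLS handshake is paid once per connection rather than once per request, which is the saving this post is pointing at.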

krasenyp · August 22, 2023, 5:27am

After you apply the very useful suggestions people before me gave, you can think of caching in-memory and through HTTP. First one can be accomplished with ETS and the second by leveraging Cache-Control and ETag response headers.
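The in-memory half of this suggestion could look something like the sketch below: a small ETS-backed cache with a per-entry time-to-live. TtlCache, its table name, and the fetch/3 signature are hypothetical, and the HTTP-level half (Cache-Control and ETag handling) is not shown.

```elixir
defmodule TtlCache do
  @table :ttl_cache_sketch

  # Creates the table. The calling process owns it, so in a real application
  # this would be called once from a long-lived process.
  def init do
    :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
    :ok
  end

  # Returns the cached value for `key`, or calls `fun` and caches its result
  # for `ttl_ms` milliseconds.
  def fetch(key, ttl_ms, fun) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, expires_at}] when expires_at > now ->
        value

      _ ->
        # Miss or expired entry. Concurrent misses may each call `fun`;
        # this sketch does not deduplicate them.
        value = fun.()
        :ets.insert(@table, {key, value, now + ttl_ms})
        value
    end
  end
end
```

A call such as TtlCache.fetch({:external, id}, 30_000, fn -> do_request(id) end) would then skip the network for repeat lookups within 30 seconds; do_request/1 here is a placeholder for the real client call.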

TestingTester · August 22, 2023, 1:31pm

Thanks for the suggestion. I do plan to implement caching, probably using Redis or ETS, to avoid repeated requests across the network.

mattbaker · August 23, 2023, 5:02pm

I’m willing to bet there’s something wrong with your application code (or instrumentation! I’ve made that mistake before). I work on a service that makes hundreds of downstream requests and we definitely don’t see any big CPU bumps, so something seems off.

Also, I don’t know what kind of resource allocation your application has available in terms of CPU or memory. Is it seriously constrained? Still, I wouldn’t expect a 20%-50% bump. Could it be the way you’re manipulating or merging the data? Though that seems pretty unlikely given the payload size.

In short, something fishy is going on! You might need to try doing some profiling. I’m guessing you can reproduce it locally? The Observer is a great tool as well if you’re not already using it.

There are a few things that you might think about, although they don’t explain the CPU bump:

- Make sure you’re passing SSL certs in as a file. Here’s an example in HTTPoison, but similar options exist in Finch/Mint/etc. If you’re making enough downstream requests it can affect memory usage.
- We switched from HTTPoison to Finch and found a big improvement in terms of reliability and speed. It looks like you’ve tried it already; I just wanted to endorse it.
- Your code example makes me think you could simplify it with Task.async_stream (although I would also consider Task.Supervisor.async_stream or Task.Supervisor.async_stream_nolink, depending on your needs). I’m pretty sure async_stream matches your use case exactly, but maybe I scanned your post too fast!
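To make the supervised variant concrete, here is a minimal sketch using Task.Supervisor.async_stream_nolink, where a crashing task is reported as an {:exit, reason} element instead of taking the caller down. NolinkDemo and work/1 are hypothetical, the :boom input simulates a failing request, and the supervisor is started inline only to keep the example self-contained; in an application it would normally live in the supervision tree.

```elixir
defmodule NolinkDemo do
  # Runs work/1 for each input under a task supervisor and converts the
  # stream's results into {:ok, value} / {:error, reason} tuples.
  def run(inputs) do
    # Started inline for the sketch; normally a named child of the app supervisor.
    {:ok, sup} = Task.Supervisor.start_link()

    sup
    |> Task.Supervisor.async_stream_nolink(inputs, &work/1,
      max_concurrency: 10,
      timeout: 5_000,
      on_timeout: :kill_task
    )
    |> Enum.map(fn
      {:ok, value} -> {:ok, value}
      # A crashed or timed-out task surfaces here instead of crashing the caller.
      {:exit, reason} -> {:error, reason}
    end)
  end

  # Simulated failure for one input; every other input succeeds.
  defp work(:boom), do: raise("simulated failure")
  defp work(n), do: n * 2
end
```

Whether the linked or unlinked form fits better depends on whether one failed downstream call should fail the whole incoming request.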