mgwidmann

IO Bottlenecked Processing

Hi All,

I am prototyping a system that uses Broadway to consume SQS messages and Flow to process each one. Each message is the S3 path of a large file: roughly 200 to 400 MB of gzip-compressed data that typically expands to 1 to 2 GB of text.

I’m using a few very large instances to process this data but am unable to get Elixir to use all of the CPU. Each worker uses its own hackney pool, so they should not be competing for connections. Here’s what I have configured for the default pool:

  follow_redirect: true,
  # Must be longer than long-poll setting for SQS (10 seconds)
  recv_timeout: 30_000,
  timeout: 300_000,
  checkout_timeout: 30_000,
  max_connections: 50

I am using the following Broadway configuration (though I have tried many variants of this):

Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producers: [
        default: [
          module:
            {BroadwaySQS.Producer,
             queue_name: "my-queue",
             max_number_of_messages: 10,
             wait_time_seconds: 4,
             visibility_timeout: 300,
             receive_interval: 10},
          stages: 5
        ]
      ],
      processors: [
        default: [
          max_demand: 1,
          stages: System.schedulers_online
        ]
      ],
      batchers: [
        default: [
          batch_size: 10,
          batch_timeout: 10_000,
          stages: 5
        ]
      ]
    )

I have the following in my rel/vm.args to try to open up any VM level IO blockage, without any luck.

## Enable kernel poll and a few async threads
##+K true
+A 1024
## For OTP21+, the +A flag is not used anymore,
## +SDio replace it to use dirty schedulers
+SDio 1024
# Allow up to 10 million processes
+P 10000000

## Increase number of concurrent ports/sockets
-env ERL_MAX_PORTS 4096

I seem to have more luck in some areas with presigned URLs than with ExAws, but neither performs where I’d like.

Here is a screenshot of CPU utilization for 3 instances. I’ve tried creating more processes and I’ve tried running on just a single instance, and I can never really get it above 60-75% CPU.

For reference, here is my presigned URL processing function which produced the above graph:
(NOTE: I’m only downloading to disk because with streaming to memory it performs worse (0-1% CPU) using the code I wrote in this PR, though I can find nothing wrong with it)

  @read_chunk_size 1024 * 1024 * 10
  def process("https://" <> _ = presigned_url, pool_name) do
    temp_filename = "/tmp/#{pool_name}.tmp"
    _ = File.rm(temp_filename) # Ok if it doesn't exist
    {microseconds, _} =
      :timer.tc(fn ->
        Download.from(presigned_url, path: temp_filename, http_opts: [hackney: [pool: pool_name]])
      end)

    _ = Logger.info("File #{temp_filename} downloaded in #{microseconds / 1_000_000} seconds")

    result =
      temp_filename
      |> File.stream!([], @read_chunk_size)
      |> file_processing_stream()
      |> reduce_results()
      |> Enum.into(%{})

    File.rm!(temp_filename)

    result
  end

Download times also rise significantly under load, from 3-5 seconds to double or triple that.

What’s left that I haven’t tried?

Most Liked Responses

outlog

I’m confused by all the G/gb/GB mb/MB etc. being thrown around. Maybe just stick to one unit, or make sure they are used correctly.

Just wanted to point out that storage speed might be a/the bottleneck as well. The m5.24xlarge has EBS-only storage, so since this instance has plenty of memory, maybe go straight to memory and not to disk/EBS.

(Even if you use an instance with a dedicated SSD, I doubt it can keep up with ingesting 25 Gbit/s straight to disk.)
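
To illustrate the "straight to memory" idea, here is a minimal sketch of inflating gzip data chunk by chunk with Erlang's built-in `:zlib`, so the compressed file never has to be written to disk. The module name `GunzipStream` is made up for this example, and it assumes the input is any enumerable of binary chunks from a single gzip stream (for instance the body chunks of a streaming HTTP response). It is not the code from the original post.

```elixir
defmodule GunzipStream do
  # Lazily inflates an enumerable of gzip-compressed binary chunks.
  # Hypothetical helper for illustration only.
  def stream(chunks) do
    Stream.transform(
      chunks,
      fn ->
        z = :zlib.open()
        # Window bits of 16 + 15 tell zlib to expect a gzip header.
        :ok = :zlib.inflateInit(z, 31)
        z
      end,
      # Each compressed chunk yields whatever output is ready so far.
      fn chunk, z -> {[IO.iodata_to_binary(:zlib.inflate(z, chunk))], z} end,
      # Release the zlib handle when the stream finishes or halts.
      fn z -> :zlib.close(z) end
    )
  end
end

# Usage with in-memory test data standing in for downloaded chunks.
original = String.duplicate("line of text\n", 10_000)

compressed_chunks =
  original
  |> :zlib.gzip()
  |> :binary.bin_to_list()
  |> Enum.chunk_every(16)
  |> Enum.map(&:erlang.list_to_binary/1)

inflated = compressed_chunks |> GunzipStream.stream() |> Enum.join()
IO.puts("Round trip ok: #{inflated == original}")
```

Because the stream is lazy, the inflated pieces can be piped directly into the parsing stage instead of being joined as in the usage lines.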

sribe

Do you mean 100MB/s? Because at 100mb/s it should take much longer than 2-4 seconds per file.

Do you mean 32MB/s? Because at 32mb/s, it should take much longer than 10-15 seconds per file.

I’m not trying to be argumentative above. It’s just that so far, unless I’ve missed something, you haven’t provided some pretty key numbers that would let us figure out what bandwidth you’re actually using. Your graphs & numbers don’t tell us how many processes you’re running, nor how many files you’re transferring in the times shown. It would be easiest to answer my question if we just had a graph of network I/O, but lacking that, if we had other numbers we could estimate.

I’m asking because so far there’s nothing in your posts that eliminates bandwidth as the limiting factor instead of CPU. It seems perfectly possible that you’re saturating network or disk I/O at less than 100% CPU usage. And speaking of disk, what are you using? Something that can sustain 2.5 GB/s?
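
The megabyte versus megabit distinction can be checked with a few lines of arithmetic. This is only a sketch: the 300 MB and 4 second figures are illustrative midpoints of the 200 to 400 MB file size and 3-5 second download time quoted in the first post, and the `Throughput` module is a made-up name.

```elixir
defmodule Throughput do
  # Megabytes per second for `megabytes` moved in `seconds`.
  def mb_per_s(megabytes, seconds), do: megabytes / seconds

  # Megabits per second; one byte is eight bits.
  def mbit_per_s(megabytes, seconds), do: megabytes * 8 / seconds

  # Seconds needed to move `megabytes` over a link of `link_mbit` megabits per second.
  def seconds_at(megabytes, link_mbit), do: megabytes * 8 / link_mbit
end

# A 300 MB file downloaded in 4 seconds:
IO.inspect(Throughput.mb_per_s(300, 4))      # 75.0 MB/s
IO.inspect(Throughput.mbit_per_s(300, 4))    # 600.0 Mbit/s
# The same file over a 100 Mbit/s link:
IO.inspect(Throughput.seconds_at(300, 100))  # 24.0 seconds
```

Multiplying the per-file rate by the number of concurrent downloads gives the aggregate figure to compare against the instance's network and disk limits.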

benwilson512

Author of Craft GraphQL APIs in Elixir with Absinthe

I’d definitely start by isolating the various parts of the system. Instead of reading messages from SQS, start with a pre-defined list and benchmark how long it takes to work through that.
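
As a rough sketch of that kind of isolation, the snippet below times a single stage over a fixed list using `:timer.tc/1`. The `StageTimer` module, the example paths, and the stub function are all hypothetical; the stub would be swapped for the real download-only or parse-only step.

```elixir
defmodule StageTimer do
  # Runs `fun` over every item of a fixed list, one at a time, and returns
  # {elapsed_seconds, results} so a single stage can be measured on its own.
  def run(items, fun) when is_list(items) and is_function(fun, 1) do
    {micros, results} = :timer.tc(fn -> Enum.map(items, fun) end)
    {micros / 1_000_000, results}
  end
end

# Hypothetical fixed input standing in for messages read from SQS.
paths = ["s3://example-bucket/a.gz", "s3://example-bucket/b.gz"]

# Stub stage; replace with the real download or processing function.
{seconds, _results} = StageTimer.run(paths, fn path -> byte_size(path) end)
IO.puts("Processed #{length(paths)} items in #{seconds} s")
```

Running the download stage and the parsing stage separately this way shows which one dominates before Broadway and SQS are added back in.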

As far as I can tell, you’re processing one file per CPU. This is going to give suboptimal download performance because S3 download speed per path isn’t particularly high. If you want to max that out, you want multiple concurrent downloaders per path, using something like ExAws.S3.
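
The general shape of "multiple concurrent downloaders per path" can be sketched without any S3 client: split the object into byte ranges and fetch them in parallel with `Task.async_stream/3`. The sketch assumes the object size is already known, and that `fetch` is a caller-supplied function (hypothetical here) that returns the bytes for an inclusive `{first, last}` range, for example via an HTTP `Range` request against a presigned URL. `RangedDownload` is a made-up name and this is not how ExAws implements it.

```elixir
defmodule RangedDownload do
  # Splits `size` bytes into inclusive {first, last} ranges of at most `chunk` bytes.
  def ranges(size, chunk) when size > 0 and chunk > 0 do
    for first <- :lists.seq(0, size - 1, chunk) do
      {first, min(first + chunk, size) - 1}
    end
  end

  # Fetches every range with up to `concurrency` tasks at once and joins the
  # parts in order. `fetch` is a hypothetical ({first, last} -> binary) function.
  def fetch_all(size, chunk, concurrency, fetch) do
    size
    |> ranges(chunk)
    |> Task.async_stream(fetch, max_concurrency: concurrency, timeout: 60_000)
    |> Enum.map(fn {:ok, part} -> part end)
    |> IO.iodata_to_binary()
  end
end

# Usage with an in-memory binary standing in for the remote object.
object = String.duplicate("abcdefghij", 100)
fetch = fn {first, last} -> binary_part(object, first, last - first + 1) end

copy = RangedDownload.fetch_all(byte_size(object), 64, 4, fetch)
IO.puts("Reassembled correctly: #{copy == object}")
```

`Task.async_stream/3` returns results in input order by default, which is why the parts can be concatenated directly.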
