Ludicrous speed CSV to AWS SQS dumping

Here’s the situation:

I’ve got a CSV file on S3 that contains about a million rows. Each row has about 30 columns of short string data. The file also has a header row.

I want to take this CSV file from S3, parse each row into key/value pairs, and put each row into SQS as its own message.

And I want to do this ludicrously fast. At the moment I have an implementation that:

  1. Downloads the CSV from S3 and stores it in a temp directory
  2. Streams that file using File.stream! and CSV.decode
  3. Chunks the rows into blocks of 10 messages using Stream.chunk_every
  4. Uses Task.async_stream to send each block of 10 messages to SQS using send_message_batch from ExAws

(I’d share the code, but I’m on my phone at the moment, sorry!)
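For readers following along, the four steps above might look roughly like the sketch below. This is not the original poster's code. To keep it self-contained, `send_batch/1` is a hypothetical stand-in for the real SQS call, the comma split is a naive replacement for a proper CSV library, and the module name, `max_concurrency: 50`, and the timeout are arbitrary assumptions.

```elixir
defmodule CsvToSqsSketch do
  # Hypothetical stand-in for the real SQS request (for example, a
  # send_message_batch operation from ExAws). It only reports how many
  # messages it was handed.
  def send_batch(batch), do: {:ok, length(batch)}

  # Naive comma split. A real pipeline would use a CSV library that
  # handles quoting and embedded commas.
  defp parse_line(line), do: line |> String.trim_trailing() |> String.split(",")

  # `lines` is any enumerable of lines, e.g. File.stream!(path).
  def run(lines, send_fun \\ &send_batch/1) do
    # The first line holds the column names.
    [header_line] = Enum.take(lines, 1)
    headers = parse_line(header_line)

    lines
    |> Stream.drop(1)
    # Turn each row into a map of header => value.
    |> Stream.map(fn line -> headers |> Enum.zip(parse_line(line)) |> Map.new() end)
    # Group rows into blocks of 10, one block per batch request.
    |> Stream.chunk_every(10)
    # Send blocks concurrently; 50 in flight at once is an arbitrary choice.
    |> Task.async_stream(send_fun, max_concurrency: 50, timeout: 30_000)
    # Add up the number of messages each batch reported as sent.
    |> Enum.reduce(0, fn {:ok, {:ok, n}}, sent -> sent + n end)
  end
end
```

Calling `CsvToSqsSketch.run(File.stream!(path))` on the downloaded file would return the number of rows handed to the sender. With a pipeline shaped like this, the knobs that matter most are `max_concurrency` and however long each batch request takes.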

With the steps above, the whole run from my local machine through to SQS takes 11 minutes. That includes the initial download of the file.

Is there a way that I could make this go even faster? Is there a technique or something else that I’m missing here that would make the whole process faster?

As is often the case, it depends on where the bottlenecks are. If the bottleneck is downloading the CSV file, you could start by streaming the CSV from S3 and chunking the data to SQS as it arrives, rather than downloading the whole file first.

If the constraint is adding to SQS, I would suggest looking at Flow as a way to get configurable concurrent producers and consumers. Some of the strategies depend on whether the ordering of the events pushed into SQS matters.

As a proportion of the total time, the download takes about a minute, maybe less. So I don’t particularly care about that part.

The order of events is not important. Each row in the CSV is eventually processed individually.
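Since ordering does not matter here, one low-effort experiment is to pass `ordered: false` to `Task.async_stream` and raise `max_concurrency`, so that a slow batch does not hold back results from faster ones. The sketch below only simulates this: `fake_send` is a hypothetical stand-in that sleeps instead of calling SQS, and the 20 ms latency and concurrency of 20 are made-up numbers.

```elixir
# Hypothetical stand-in for one SQS batch request: sleeps to simulate
# network latency, then reports how many messages it "sent".
fake_send = fn batch ->
  Process.sleep(20)
  length(batch)
end

# 200 fake messages grouped into blocks of 10.
batches = Enum.chunk_every(1..200, 10)

{micros, sent} =
  :timer.tc(fn ->
    batches
    # ordered: false emits results as tasks finish rather than in input order.
    |> Task.async_stream(fake_send, max_concurrency: 20, ordered: false)
    |> Enum.reduce(0, fn {:ok, n}, acc -> acc + n end)
  end)

IO.puts("sent #{sent} messages in #{div(micros, 1000)} ms")
```

Rerunning this with different `max_concurrency` values gives a rough feel for how much of the total time is spent waiting on requests rather than doing work. The real numbers will depend on the network and on any throttling from SQS.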

If you really need “ludicrous” speed when moving data from one AWS service to another, then tie yourself further to AWS by looking at their offerings. Note that Elixir may not be the best choice for this approach.

I assume that sending requests to SQS is the most time-consuming part in your case, so I’d try to decrease the number of requests by increasing the batch size.
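For context, the SQS batch send operation accepts at most 10 messages per request, so the blocks of 10 described earlier are already at that limit. A back-of-the-envelope calculation using the figures from this thread (about a million rows, 11 minutes in total, roughly 1 minute of that spent downloading) gives the current request rate. The variable names are illustrative only.

```elixir
rows = 1_000_000
batch_size = 10

# Number of batch requests needed to send every row.
requests = div(rows, batch_size)

# About 11 minutes in total minus about 1 minute of download time.
upload_seconds = 10 * 60

# Average number of batch requests completed per second.
rate = Float.round(requests / upload_seconds, 1)

IO.puts("#{requests} requests at about #{rate} requests per second")
```

That works out to 100,000 requests at roughly 167 requests per second, which suggests the remaining gains would have to come from more concurrent requests or lower latency to SQS rather than from larger batches.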