Ludicrous speed CSV to AWS SQS dumping

radar · November 12, 2021, 11:22pm

Here’s the situation:

I’ve got a CSV file on S3 which contains about a million rows. Each row has about 30 columns, consisting of short string data. The CSV file contains a header row as well.

I want to take this CSV file from S3, parse each row into key-value pairs (header name to cell value), and send each row as its own message to SQS.

And I want to do this ludicrously fast. At the moment I have an implementation that:

Downloads the CSV from S3 and stores it in a temp directory.
Streams that file using File.stream! and CSV.decode.
Chunks the rows into blocks of 10 messages using Stream.chunk_every.
Uses Task.async_stream to send each block of 10 messages to SQS using send_message_batch from ExAws.

(I’d share the code, but I’m on my phone at the moment, sorry!)
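For anyone following along, the four steps above could look roughly like the sketch below. It is not the original poster's code. To stay dependency-free it assumes a naive comma split instead of CSV.decode (so no quoted fields), takes any enumerable of lines in place of File.stream!, and uses a hypothetical `send_batch/1` stub where the real ExAws call would go. The module name `CsvToSqsSketch` is made up for illustration.

```elixir
defmodule CsvToSqsSketch do
  @batch_size 10

  # Hypothetical stand-in for the real SQS batch call.
  # It only reports how many messages it was handed.
  def send_batch(batch), do: {:ok, length(batch)}

  # `lines` is any enumerable of CSV lines with the header line first,
  # for example the result of File.stream!/1.
  def run(lines, send_fun \\ &send_batch/1) do
    [header_line] = Enum.take(lines, 1)
    headers = split_row(header_line)

    lines
    |> Stream.drop(1)
    # Turn each row into a map of header name => cell value.
    |> Stream.map(fn line -> headers |> Enum.zip(split_row(line)) |> Map.new() end)
    |> Stream.chunk_every(@batch_size)
    # One task per batch; concurrency defaults to the number of schedulers.
    |> Task.async_stream(send_fun)
    |> Enum.reduce(0, fn {:ok, {:ok, sent}}, acc -> acc + sent end)
  end

  # Naive split: assumes no quoted fields or embedded commas.
  defp split_row(line), do: line |> String.trim() |> String.split(",")
end

sent = CsvToSqsSketch.run(["id,name\n", "1,ann\n", "2,bob\n"])
IO.puts("sent #{sent} messages")
```

Passing the sender in as a function keeps the pipeline testable without touching AWS.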

Using the above steps I can get it to run in 11 minutes from my local machine, through to SQS. That includes the initial downloading of the file itself.

Is there a way that I could make this go even faster? Is there a technique or something else that I’m missing here that would make the whole process faster?

kip · November 12, 2021, 11:36pm

As is often the case, it depends on where the bottleneck(s) are. If the bottleneck is downloading the CSV file, then you could start by streaming the CSV from S3 and chunking the data to SQS as it arrives, rather than downloading the whole file first.

If the constraint is adding to SQS, I would suggest looking at Flow as a way to have configurable concurrent consumers/producers. Some of the strategies would depend on whether or not the ordering of events pushed into SQS matters.

radar · November 12, 2021, 11:54pm

As a proportion of the total time, the download is small: about a minute, maybe not even that. So I don’t particularly care about that part.

The order of events is not important. Each row in the CSV is eventually processed individually.
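Since ordering does not matter, one thing that may be worth trying before reaching for Flow is raising `max_concurrency` on Task.async_stream and passing `ordered: false`, so a slow batch does not hold up results from faster ones. The sketch below is only an illustration: `fake_send/1` is a hypothetical stub that sleeps to mimic network latency, and the concurrency value of 50 is an arbitrary assumption, not a recommendation.

```elixir
defmodule UnorderedSend do
  # Hypothetical stand-in for an SQS batch send.
  # Sleeps 5-20 ms to mimic a network round trip.
  def fake_send(batch) do
    Process.sleep(Enum.random(5..20))
    length(batch)
  end

  def run(messages, concurrency) do
    messages
    |> Stream.chunk_every(10)
    |> Task.async_stream(&fake_send/1,
      max_concurrency: concurrency,
      # Emit results as tasks finish instead of in input order.
      ordered: false,
      timeout: 30_000
    )
    |> Enum.reduce(0, fn {:ok, n}, acc -> acc + n end)
  end
end

{micros, sent} = :timer.tc(fn -> UnorderedSend.run(1..1_000, 50) end)
IO.puts("sent #{sent} messages in #{div(micros, 1000)} ms")
```

Because the work is I/O-bound, a concurrency well above the default (the number of schedulers) is often reasonable, but the right value depends on the HTTP client pool and on how SQS responds under load.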

gregvaughn · November 13, 2021, 12:48am

If you really need “ludicrous” speed in transferring data from one AWS service to another, then further tie yourself to AWS by looking at their offerings. Note, Elixir may not be the best choice for this approach.

perzanko · November 13, 2021, 12:50am

I assume that sending requests to SQS is the most time-consuming part in your case, so I’d try to decrease the number of requests by increasing the batch size.
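A quick back-of-the-envelope check on the numbers in this thread may help here. As far as I know, SQS's SendMessageBatch accepts at most 10 entries per request, so a batch of 10 is already at the ceiling and the request count cannot drop below the figure computed here. The 11-minute figure includes the download, so the rate is only approximate.

```elixir
rows = 1_000_000
# Assumed ceiling for entries in one SendMessageBatch request.
max_batch = 10

# Ceiling division: number of batch requests needed.
requests = div(rows + max_batch - 1, max_batch)

# The reported end-to-end run time of 11 minutes, in seconds.
elapsed_seconds = 11 * 60
rate = requests / elapsed_seconds

IO.puts("#{requests} requests, roughly #{Float.round(rate, 1)} requests per second")
```

If that limit holds, the remaining lever is how many of those requests are in flight at once rather than how large each one is.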