I’ve got a CSV file on S3 which contains about a million rows. Each row has about 30 columns of short string data. The CSV file has a header row as well.
I want to take this CSV file from S3, parse it into key value pairs, and put each row as a k/v pair as a message into SQS.
And I want to do this ludicrously fast. At the moment I have an implementation that:
Downloads the CSV from S3, stores it in a temp directory
Streams that file using File.stream! and CSV.decode
Chunks the rows into blocks of 10 messages using Stream.chunk_every
Uses Task.async_stream to send each block of 10 messages to SQS using send_message_batch from ExAws.
(I’d share the code, but I’m on my phone at the moment, sorry!)
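For reference, the steps above can be sketched roughly like this (reconstructed from the description, not the actual code; the bucket, key, and queue names are placeholders, and it assumes the ex_aws, ex_aws_s3, ex_aws_sqs, csv, and jason packages):

```elixir
# Sketch of the pipeline described above; names are placeholders.
local_path = Path.join(System.tmp_dir!(), "rows.csv")

# 1. Download the CSV from S3 into a temp directory
ExAws.S3.download_file("my-bucket", "rows.csv", local_path)
|> ExAws.request!()

# 2-4. Stream the file, decode rows to maps, batch by 10 (the SQS
# batch limit), and send the batches concurrently.
local_path
|> File.stream!()
|> CSV.decode!(headers: true)
|> Stream.chunk_every(10)
|> Task.async_stream(fn batch ->
  entries =
    batch
    |> Enum.with_index()
    |> Enum.map(fn {row, i} ->
      # batch entry ids only need to be unique within one request
      [id: "msg-#{i}", message_body: Jason.encode!(row)]
    end)

  ExAws.SQS.send_message_batch("my-queue-url", entries)
  |> ExAws.request!()
end, max_concurrency: System.schedulers_online() * 4)
|> Stream.run()
```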
Using the above steps I can get it to run in 11 minutes from my local machine, through to SQS. That includes the initial downloading of the file itself.
Is there a way that I could make this go even faster? Is there a technique or something else that I’m missing here that would make the whole process faster?
As is often the case, it depends on where the bottleneck(s) are. If the bottleneck is downloading the CSV file then you could start by streaming the CSV rather than downloading it and then chunking the data to SQS.
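With ex_aws_s3 that could look something like the sketch below: downloading to :memory yields a stream of chunks, but chunk boundaries won’t line up with row boundaries, so the chunks are re-split into lines before decoding (names are placeholders, and it assumes the file ends with a newline):

```elixir
# Sketch: parse while downloading instead of waiting for the whole file.
ExAws.S3.download_file("my-bucket", "rows.csv", :memory)
|> ExAws.stream!()
# chunks are arbitrary binaries; carry any partial trailing line forward
|> Stream.transform("", fn chunk, leftover ->
  [rest | complete] =
    (leftover <> chunk)
    |> String.split("\n")
    |> Enum.reverse()

  {Enum.reverse(complete), rest}
end)
|> CSV.decode!(headers: true)
|> Stream.chunk_every(10)
# ... then batch to SQS as before
```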
If the constraint is adding to SQS I would suggest looking at Flow as a way to get configurable concurrent producers/consumers. Some of the strategies would depend on whether the ordering of events pushed into SQS matters.
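A minimal Flow version might look like this sketch (stage counts and demand are tunables to experiment with; ordering is not preserved across stages, so this assumes event order doesn’t matter):

```elixir
# Sketch: concurrent SQS producers via Flow; names are placeholders.
local_path
|> File.stream!()
|> CSV.decode!(headers: true)
|> Stream.chunk_every(10)                 # pre-batch to the SQS limit of 10
|> Flow.from_enumerable(max_demand: 5)
|> Flow.map(fn batch ->
  entries =
    batch
    |> Enum.with_index()
    |> Enum.map(fn {row, i} -> [id: "msg-#{i}", message_body: Jason.encode!(row)] end)

  ExAws.SQS.send_message_batch("my-queue-url", entries)
  |> ExAws.request!()
end)
|> Flow.run()
```

Flow.from_enumerable defaults the number of stages to System.schedulers_online/0; passing stages: explicitly lets you scale the concurrent SQS producers up or down.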
If you really need “ludicrous” speed moving data from one AWS service to another, then further tie yourself to AWS by looking at their offerings. Note, Elixir may not be the best choice for this approach.