AWS - How to stream a CSV file

Hi,

I am trying to stream a file from AWS.
It is a very large file, so I thought it would be better to use Stream and, once that works, to chunk the data.
I am using NimbleCSV's parse_stream function.
I am getting back an empty list after running Enum.to_list.

This is my code so far:

alias NimbleCSV.RFC4180, as: CSV

ExAws.S3.download_file("testfiles", "file_upload.csv", :memory) 
|> ExAws.stream!() 
|> CSV.parse_stream 
|> Enum.to_list

I would like to do something like:

alias NimbleCSV.RFC4180, as: CSV

ExAws.S3.download_file("testfiles", "file_upload.csv", :memory) 
|> ExAws.stream!() 
|> CSV.parse_stream 
|> Stream.chunk_every(10_000)
|> Stream.map(&process_stream(&1))
|> Enum.to_list

Thanks! :smiley:

Have you checked that ExAws.stream!() gives you anything?

Afaik, streaming data from AWS gives you chunks of the same size in bytes, so a chunk can end in the middle of a row. NimbleCSV expects a stream of (individual) lines. So you need to convert between those two, which NimbleCSV has a helper for: to_line_stream.
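To illustrate the mismatch between byte chunks and lines, here is a minimal sketch in plain Elixir. The module LineChunks and its to_lines/1 function are hypothetical names made up for this example, and this is not NimbleCSV's implementation. It only shows the idea of carrying the unfinished tail of one chunk over to the next, and it assumes no quoted field contains a newline.

```elixir
defmodule LineChunks do
  # Hypothetical helper, for illustration only: re-chunks a stream of
  # arbitrary binaries into a stream of newline-terminated lines.
  # Assumption: "\n" never appears inside a quoted CSV field.
  def to_lines(chunks) do
    chunks
    # Append a marker so the last partial line can be flushed.
    |> Stream.concat([:eof])
    |> Stream.transform("", fn
      :eof, "" ->
        {[], ""}

      :eof, rest ->
        {[rest], ""}

      chunk, rest ->
        # Prepend the leftover from the previous chunk, then split on newlines.
        parts = String.split(rest <> chunk, "\n")
        # Everything but the last part is a complete line.
        {complete, [leftover]} = Enum.split(parts, -1)
        {Enum.map(complete, &(&1 <> "\n")), leftover}
    end)
  end
end

# Chunks that cut rows in the middle, as fixed-size byte chunks would.
chunks = ["name,ag", "e\nalice,3", "0\nbob,25\n"]

lines = chunks |> LineChunks.to_lines() |> Enum.to_list()
# lines is ["name,age\n", "alice,30\n", "bob,25\n"]
```

In real code, the library helper is the better choice, since a hand-rolled splitter like this one ignores CSV quoting.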


I just had a more detailed read through the ExAws.S3 docs. There is also an example there of converting the stream into a stream of lines (see the ExAws.S3 v2.3.3 docs), but the NimbleCSV helper (see the NimbleCSV v1.2.0 docs) would be the way to go.


They should be mostly the same code since the pull request "Update example for streaming lines" by fastjames (#170 in ex-aws/ex_aws_s3), but yeah, using the available API over copy-pasting the implementation is certainly simpler.


Thank you, adding CSV.to_line_stream before CSV.parse_stream worked!

Final code:

alias NimbleCSV.RFC4180, as: CSV

ExAws.S3.download_file("testfiles", "file_upload.csv", :memory) 
|> ExAws.stream!() 
|> CSV.to_line_stream()
|> CSV.parse_stream()
|> Stream.chunk_every(10_000)
|> Stream.map(&process_stream(&1))
|> Enum.to_list()
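
For anyone who wants to try the same pipeline without an S3 bucket, here is a small sketch that swaps the download for an in-memory list of chunks. The sample data, the batch size of 2, and the process_batch function (standing in for process_stream/1, which is not shown in this thread) are all assumptions made for the example. It uses Mix.install, so it needs Elixir 1.12 or later and network access to fetch nimble_csv.

```elixir
Mix.install([{:nimble_csv, "~> 1.2"}])

alias NimbleCSV.RFC4180, as: CSV

# Stand-in for the byte chunks that ExAws.stream!() would yield.
# Note that the chunk boundaries fall in the middle of rows.
chunks = ["id,name\n1,al", "ice\n2,bob\n3,ca", "rol\n"]

# Hypothetical replacement for process_stream/1: count the rows in a batch.
process_batch = fn rows -> length(rows) end

result =
  chunks
  # Turn arbitrary chunks into one binary per line.
  |> CSV.to_line_stream()
  # Parse each line into a list of fields; the header row is skipped by default.
  |> CSV.parse_stream()
  # Group parsed rows into batches (2 here, 10_000 in the code above).
  |> Stream.chunk_every(2)
  |> Stream.map(process_batch)
  |> Enum.to_list()

# result is [2, 1]: three data rows split into batches of two and one.
```

Without the to_line_stream step, the parser receives chunks that do not line up with rows, which is why the original pipeline did not return the expected data.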