Huge CSV processing

Hey guys,
I’ve got a huge CSV (around 10 GB) that needs to be processed hourly.
Do you have any suggestions for best practices in that scenario?
Any useful Mix packages?

Thanks in advance.
Daniel

This conference talk by José Valim sounds like it may be of interest to you: https://www.youtube.com/watch?v=XPlXNUXmcgE

It goes from processing data in an eager way (reading everything up front), to a lazy way (leveraging Streams), to concurrent processing. Depending on how complex your data and processing are, it sounds like Streams would be more than sufficient.
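As a rough sketch of the lazy approach (the file path and the per-row work are placeholders, and the naive comma split assumes no quoted fields):

```elixir
# Lazily stream the file line by line so only a small buffer is in memory at once.
"huge.csv"
|> File.stream!()
|> Stream.drop(1)                          # skip the header row
|> Stream.map(&String.trim_trailing/1)     # drop the trailing newline
|> Stream.map(&String.split(&1, ","))      # naive split; use a real CSV parser for quoted fields
|> Stream.each(&IO.inspect/1)              # replace with your per-row processing
|> Stream.run()
```

If a single process becomes the bottleneck, the same pipeline can be fanned out with Task.async_stream/3, which is the concurrent step the talk builds up to.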

If you provide more context, my opinion could change though :smiley:

You may be aware of it already, but Plataformatec created a library called NimbleCSV that I would look at. I’ve never used it on anything close to a 10 GB file, but it was pretty easy to use, and it supports streaming pretty well. I would think that as long as you handle the file in appropriately sized chunks you should be fine.
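A minimal sketch of streaming with NimbleCSV, using its bundled RFC 4180 parser (the path and column layout here are placeholders):

```elixir
alias NimbleCSV.RFC4180, as: CSV

# parse_stream/1 lazily parses rows and skips the header line by default.
"huge.csv"
|> File.stream!(read_ahead: 100_000)
|> CSV.parse_stream()
|> Stream.map(fn [id, name | _rest] ->     # assumed column layout
  %{id: String.to_integer(id), name: name}
end)
|> Stream.each(&IO.inspect/1)              # replace with your actual processing
|> Stream.run()
```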

As far as scheduling the task goes, the only app-level cron jobs I’ve done in Elixir used Quantum. I don’t remember it well enough to endorse it, but I also don’t remember hating it.
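For reference, an hourly Quantum job looks something like this (the module names are placeholders, and the scheduler module also has to be added to your application’s supervision tree):

```elixir
# In your application code: a scheduler backed by Quantum.
defmodule MyApp.Scheduler do
  use Quantum, otp_app: :my_app
end
```

```elixir
# In config/config.exs: run MyApp.CsvImporter.run/0 at the top of every hour.
config :my_app, MyApp.Scheduler,
  jobs: [
    {"0 * * * *", {MyApp.CsvImporter, :run, []}}
  ]
```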

I process large CSVs also… In addition to nimble_csv, when I need to do the reading/writing in Elixir, I also shell out from my Elixir app to xsv extensively for pre-processing… it’s written in Rust and super fast. Not sure what processing you need to do with the hourly files, but it could speed you along.
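Shelling out can be as simple as this (assuming xsv is on your PATH; the subcommand and column names are just examples):

```elixir
# Pre-filter the huge file down to a few columns with `xsv select`
# before parsing the result in Elixir.
{output, 0} =
  System.cmd("xsv", ["select", "id,name,amount", "huge.csv"],
    stderr_to_stdout: true
  )

# `output` is one big binary; for very large results, write to a file
# instead (xsv's -o flag) and stream that with File.stream!/1.
```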

Thank you very much @brettbeatty, @tme_317, and @BurntSushi.
You helped me a lot.
