Hey guys,
I’ve got a huge CSV (around 10 GB) that needs to be processed hourly.
Do you have any suggestions for best practices in that scenario?
Any useful mix packages?
Thanks in advance.
Daniel
This conference talk by José Valim sounds like it may be of interest to you: https://www.youtube.com/watch?v=XPlXNUXmcgE
It goes from processing data eagerly (reading everything up front), to lazily (leveraging Streams), to concurrently. Depending on how complex your data and processing are, Streams may well be more than sufficient.
If you provide more context, my opinion could change though.
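To make the eager-vs-lazy distinction concrete, here is a minimal sketch of the lazy approach: `File.stream!/1` reads the file line by line, so only one line is in memory at a time regardless of file size. The sample file, its layout, and the filter condition are all assumptions for illustration.

```elixir
# Write a tiny sample CSV so the example is self-contained.
path = Path.join(System.tmp_dir!(), "sample.csv")
File.write!(path, "id,status\n1,ok\n2,error\n3,ok\n")

# Lazy pipeline: nothing is read until Enum.count/1 forces it,
# and the file is never fully loaded into memory.
error_count =
  path
  |> File.stream!()
  |> Stream.drop(1)                                  # skip the header row
  |> Stream.filter(&String.contains?(&1, "error"))
  |> Enum.count()
```

The eager equivalent would be `File.read!/1` followed by `String.split/2`, which loads all 10 GB into memory at once; swapping `File.stream!/1` in is usually the only change needed.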
You may be aware of it already, but Plataformatec created a library called NimbleCSV that I would look at. I’ve never used it on anything close to a 10 GB file, but it was pretty easy to use, and it supports streaming pretty well. I would think that as long as you handled it in appropriately-sized chunks you should be fine.
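For reference, a rough sketch of how NimbleCSV’s streaming API composes with `File.stream!/1` (assumes `{:nimble_csv, "~> 1.2"}` in your deps; the parser name, file name, and column layout are placeholders):

```elixir
# Define a parser module at compile time; NimbleCSV generates the
# parsing code for the given separator/escape characters.
NimbleCSV.define(MyParser, separator: ",", escape: "\"")

"large.csv"
|> File.stream!()
|> MyParser.parse_stream()                 # skips the header row by default
|> Stream.map(fn [id, name | _rest] -> {id, name} end)
|> Stream.each(&IO.inspect/1)              # replace with your real row handler
|> Stream.run()
```

Because every step is a `Stream`, rows are parsed and handled one at a time, which is what keeps the memory footprint flat on a 10 GB file.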
As far as scheduling the task goes, the only app-level crons I’ve done in Elixir used Quantum. I don’t remember it well enough to endorse, but I also don’t remember hating it.
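For what a Quantum setup looks like, a minimal config fragment (app name, scheduler module, and job target are placeholders; assumes `{:quantum, "~> 3.0"}` in deps and a scheduler module with `use Quantum, otp_app: :my_app`):

```elixir
# config/config.exs
config :my_app, MyApp.Scheduler,
  jobs: [
    # Standard cron syntax: run the import at the top of every hour.
    {"0 * * * *", {MyApp.CsvImporter, :run, []}}
  ]
```

The scheduler is then added to your application’s supervision tree like any other child.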
I process large CSVs also. In addition to nimble_csv when I need to do the reading/writing in Elixir, I also shell out from my Elixir app to xsv extensively for pre-processing. It’s written in Rust and super fast. Not sure what processing you need to do with the hourly files, but it could speed you along.
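Shelling out to xsv from Elixir is just `System.cmd/2`. A hedged sketch (assumes the `xsv` binary is on `$PATH`; the file and column names are placeholders):

```elixir
# Use xsv to project only the columns we care about before
# handing the (much smaller) result to Elixir for parsing.
{output, 0} = System.cmd("xsv", ["select", "id,amount", "large.csv"])
```

For a 10 GB input you would more likely have xsv write its output to an intermediate file (`xsv select id,amount large.csv -o slim.csv`) and then stream that, rather than capturing gigabytes of stdout in a binary.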