I want to process an 8 MB file. Based on the articles, I am going to use the Flow module for parallel processing. Before processing the large CSV, I want to split the single file into multiple small files, so that it would be better for computation. I can split the file into multiple files with code along these lines:
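Roughly like this (a sketch of the kind of code I mean; the file name `data.csv` and the chunk size of 10_000 lines are just placeholders):

```elixir
# Split one CSV into several smaller files, chunk by chunk.
"data.csv"
|> File.stream!()                  # lazily yields the file line by line
|> Stream.chunk_every(10_000)      # group the lines into fixed-size chunks
|> Stream.with_index()             # number the chunks
|> Enum.each(fn {lines, index} ->
  # write each chunk out as its own file
  File.write!("data_part_#{index}.csv", lines)
end)
```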
Based on my understanding, Stream will use a single core for the processing, but Flow will use all cores by itself. Flow also works like a parallel process with the help of GenStage.
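For example, as I understand the difference (a toy workload, assuming the `flow` package is a dependency):

```elixir
# Stream: lazy, but the whole pipeline runs in the calling process (one core).
1..1_000
|> Stream.map(&(&1 * &1))
|> Enum.sum()

# Flow: the same pipeline, but Flow.from_enumerable/1 starts GenStage
# stages that run Flow.map/2 in parallel across the available cores.
1..1_000
|> Flow.from_enumerable()
|> Flow.map(&(&1 * &1))
|> Enum.sum()
```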
So consuming the input stream directly, rather than loading a file, splitting it into x parts, writing them back to disc and then reading them individually, could improve overall runtime.
Though “8MB” is not what I consider a “large file”.
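For example, a minimal sketch of consuming the stream directly (the file name and the naive comma split are assumptions; a real CSV should go through a proper parser such as NimbleCSV):

```elixir
# No intermediate files: feed the line stream straight into Flow.
"data.csv"
|> File.stream!()            # one line per element, read lazily
|> Flow.from_enumerable()    # GenStage stages consume the stream in parallel
|> Flow.map(fn line ->
  # naive parsing for illustration only
  line |> String.trim() |> String.split(",")
end)
|> Enum.to_list()
```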
If my file is only 8 MB, then splitting the single file into multiple files, writing them back to disc and then reading them individually is not needed.
You are suggesting that I pass the stream directly. Thanks for that.
If you have only a single file, you cannot do much. Though if an earlier stage of the process submitted 4 files to you rather than 1, then you would benefit from it.
To actually be able to split your CSV by entries, you need to at least scan it for newlines anyway, so as not to split within a line. So you already read in the full file anyway, and instead of writing the data back to disc you can just as well process it straight away.
I do not have figures for this, but even if you won’t be able to max out your cores that way, intuition says that the task will finish quicker if you do not write the split files back to disc.
Though, as I said earlier, if you have many files from the get-go, perhaps because something in your system writes out a file per hour and you consume them once a day, then you really could benefit from the fact that the data is already “multisource”.
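In that case each file can feed its own producer (a sketch; the directory layout is an assumption):

```elixir
# One lazy stream per hourly file...
streams =
  for file <- Path.wildcard("hourly/*.csv") do
    File.stream!(file)
  end

# ...and Flow.from_enumerables/1 consumes all of them in parallel,
# one GenStage producer per stream.
streams
|> Flow.from_enumerables()
|> Flow.map(&String.trim/1)
|> Enum.count()
```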