Split a single large file into multiple small files

I want to process an 8 MB file. Based on the articles I have read, I am going to use the Flow module for parallel processing. Before processing the large CSV, I want to split the single file into multiple small files, so that it would be better for computation. I can split the file into multiple files with the code below:

File.stream!(file_name, read_ahead: 100_000)
|> Stream.chunk_every(1000)
|> Stream.with_index()
|> Enum.each(fn {chunk, index} ->
  # write each chunk of 1000 lines into its own small file
  File.write!("#{file_name}.part#{index}", chunk)
end)

Is there a more efficient way to split the file?

Personally I recommend using this library:

It also supports streams.


Thanks for the reply.

Based on my understanding, a Stream will use a single core for processing, but Flow will use all cores by itself. Flow also works as a parallel process with the help of GenStage.

You can use a stream as input for the flow.

So consuming the input stream directly, rather than loading a file, splitting it into several smaller files, writing them back to disk, and then reading them individually, could improve overall runtime.
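For example, a minimal sketch of feeding the file stream straight into Flow (the file name and process_line/1 are placeholders, and it assumes one record per line):

# hypothetical example: consume the stream directly with Flow, no intermediate files
File.stream!("data.csv", read_ahead: 100_000)
|> Flow.from_enumerable()
|> Flow.map(&process_line/1)
|> Enum.to_list()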

Though “8MB” is not what I consider a “large file”.


Thanks for the reply.

If my file is only 8 MB, then splitting the single file into multiple files, writing them back to disk, and then reading them individually is not needed.

You are suggesting I pass the stream directly. Thanks for that.

How can I split it if I am receiving a really big file? I got this splitting idea from here: https://hexdocs.pm/flow/Flow.html#module-avoid-single-sources

If you have only a single file, you cannot do much. Though if an earlier stage of the process submitted 4 files rather than 1, then you would benefit from it.

To actually be able to split your CSV by entries, you need to at least scan it for newlines so you do not split within a line. So you are reading the full file anyway, and instead of writing the data back to disk you can just as well process it straight away.

I do not have figures for this, though even if you are not able to max out your cores that way, intuition says the task will finish quicker if you do not write split files back to disk.

Though, as I said earlier, if you have many files from the get-go, perhaps because something in your system writes out a file per hour and you consume them once a day, then you really could benefit from the fact that the data is already "multi-source".
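A rough sketch of that multi-source case, assuming the hourly files match a pattern like "data/*.csv" and process_line/1 is your own function:

# hypothetical example: each file becomes its own source, so Flow consumes them in parallel
Path.wildcard("data/*.csv")
|> Enum.map(&File.stream!(&1, read_ahead: 100_000))
|> Flow.from_enumerables()
|> Flow.map(&process_line/1)
|> Flow.run()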


Thanks for the discussion @NobbZ @Eiji