How to generate a big CSV file concurrently?

Hello

I need to generate a big CSV file that might have thousands of entries. How can I run concurrent processes to handle this task, with all of them writing to the resulting file at the same time?

Is there any advice? What is the recommendation here?

There is no such possibility, as in the end you will be serialised on the writing anyway (writes to a single file must be sequential). So if these “thousands of entries” do not fit in RAM, then your only option is to write them to separate files and then join those files. If everything fits in memory, then you can build iodata in memory and dump it into the file in one go. However, I think the speedup will be marginal either way, as this operation is probably IO bound, not CPU bound, so parallelisation will have little impact here.
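
For the “does not fit in RAM” case, a rough sketch of that split-then-join approach could look like this (the range, chunk size, part file names, and the rows_for_chunk function are all placeholders for your real data):

# Placeholder row generator — swap in your real row-building logic here.
rows_for_chunk = fn chunk ->
  Enum.map(chunk, fn i -> [Integer.to_string(i), "value #{i}"] end)
end

part_files =
  1..100_000
  |> Enum.chunk_every(10_000)
  |> Enum.with_index()
  |> Task.async_stream(
    fn {chunk, i} ->
      # Each task writes its own part file, so the tasks never contend
      # on a shared file handle.
      path = "part_#{i}.csv"
      File.write!(path, NimbleCSV.RFC4180.dump_to_iodata(rows_for_chunk.(chunk)))
      path
    end,
    timeout: :infinity
  )
  |> Enum.map(fn {:ok, path} -> path end)

# The join itself is still sequential: stream each part into the final file.
part_files
|> Stream.flat_map(&File.stream!/1)
|> Stream.into(File.stream!("output.csv"))
|> Stream.run()

Only the part generation runs in parallel; the final join is a single sequential pass over the part files, which is exactly the serialisation you cannot avoid.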

You can stream into a file without having it all in memory. “Thousands of entries” really isn’t that many, though.

# stream_of_data is any enumerable of rows, e.g. [["a", "b"], ["c", "d"]]
stream_of_data
|> NimbleCSV.RFC4180.dump_to_stream()
|> Stream.into(File.stream!("output.csv"))
|> Stream.run()

But that is without parallelism. If you do it sequentially, then of course you do not need to keep everything in memory at once.

Ah yes, good point. As you note, though, parallelism within the same file is unlikely to be all that helpful.

A general query: isn’t this problem similar to logging from multiple processes to a single log file? How does the Elixir Logger solve this problem?

It isn’t really a problem. A single process is perfectly capable of writing to a file as fast as the hard drive will allow, as long as it can get data fast enough. In Logger there is a single process that owns the file, and the other processes that need to write logs send their log lines to that process (short version). There are optimizations where ETS can be used to communicate or to manage back pressure, but that’s the rough idea.
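
A stripped-down sketch of that single-writer pattern (not Logger’s actual implementation, just the rough shape of it, with the module and function names made up for the example):

defmodule CsvWriter do
  use GenServer

  # One process owns the file; everyone else sends it rows.
  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)

  # Fire-and-forget: any process can cast a row to the writer.
  def write_row(row), do: GenServer.cast(__MODULE__, {:row, row})

  @impl true
  def init(path), do: {:ok, File.open!(path, [:write])}

  @impl true
  def handle_cast({:row, row}, file) do
    IO.binwrite(file, NimbleCSV.RFC4180.dump_to_iodata([row]))
    {:noreply, file}
  end
end

You call CsvWriter.start_link("output.csv") once, then CsvWriter.write_row(["a", "b"]) from as many processes as you like; the writes get serialised through the writer’s mailbox.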

Ah, that makes sense. Thanks!