I know that File.write performs two operations: it opens the file and then writes to it. However, considering that File.open followed by IO.write avoids one operation for each iteration, I want to know: when I’m building a lazy pipeline, is it recommended to use File.write for each iteration, or should I keep the file open and just write to it? Is there any significant impact between these two approaches?
Could someone clarify the pros and cons of these two approaches for me?
PS: I am focused on impacts for lazy pipelines like this:
Stream.map(…)
|> Stream.reduce(…)
|> Stream.with_index(…)
|> Stream.map(write_on_file)
I’ve used File.open and IO.write / IO.binwrite with a Stream like you, dozens of times, and never benchmarked it against having File.write at each step. I simply assumed the latter is a terrible idea and can never be faster. Main reason might not even be speed per se; it’s just that opening and closing a file can be risky under load and in more constrained machines (which, let’s be real, are most cloud offerings out there, even the mainstream ones; they regularly put your apps and databases on I/O-throttled machines) – it might actually fail with “out of memory” or “not enough file handles” etc., I’ve seen those even on machines with 64GB of RAM and 2TB SSD storage.
TL;DR do as little syscalls as at all possible. We can’t optimize adequately for that when we work with a higher-level language with Elixir but at least we can use the little tools we have, like “don’t open and close file 500 times in a loop”.
…And all of that changes if each step to write to the file is seconds or minutes apart, of course (i.e. you stream data from a slower remote node).
I’m not sure I understand the question. If you need to do multiple writes to a file in a loop then yes, File.open is made for it. If you have the complete content of the file to write already, then use File.write only once.
I’m going to take some time off in the future to do this benchmarks, but makes sense avoid open/close the file a hundred times in a loop. I expect that this have’nt so much impact , but is always good avoid if have’nt sure
In a world after the Spectre / Meltdown CPU vulnerability mitigations, syscalls are more expensive performance-wise. The less of them in your program, the better.