This looks great!
I’m puzzled about one sentence though:

“I was wrong assuming that it’s the process communication that puts the biggest overhead here.”
Just to be clear: there’s no process communication happening here. Everything happens in the same process, since you’re working with both files in raw mode (the default for file streams). For clarification, see “Processes and raw files” in the File module docs.
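To illustrate, here’s a minimal sketch of my own (not from the post, and “input.txt” is just a placeholder path): without :raw, File.open! returns the pid of an Erlang file-server process and every read or write is a message round trip to it; with :raw (which File.stream! uses by default), I/O runs directly in the calling process.

cooked = File.open!("input.txt", [:read])        # pid of a file-server process
raw    = File.open!("input.txt", [:read, :raw])  # direct I/O, no message passing

is_pid(cooked) # => true
is_pid(raw)    # => false (a :file_descriptor tuple)

File.close(cooked)
File.close(raw)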
Also, I couldn’t get over the fact that the streamed version was about 3x slower on my machine than the read version (1.8s vs 0.6s). I played with it a bit more and discovered that streaming bytes works much faster than streaming lines. That led me to the following solution, which shaved the streaming version down to ~1s:
def main([filename, "stream"]) do
  out_file = File.open!(filename <> ".out", [:write, :raw, :delayed_write])

  # Stream the input in 4k byte chunks; the accumulator carries the
  # unfinished trailing line from one chunk into the next.
  File.stream!(filename, [], 4000)
  |> Enum.reduce({"", out_file}, &handle_chunk/2)

  File.close(out_file)
end

defp handle_chunk(chunk, {unfinished_line, file}) do
  (unfinished_line <> chunk)
  |> String.split("\n")
  |> process_lines(file)
end

# A single remaining element is the (possibly empty) unfinished line;
# return it so it’s prepended to the next chunk.
defp process_lines([unfinished_line], file), do: {unfinished_line, file}

defp process_lines([line | rest], file) do
  if filter_line(line) do
    IO.binwrite(file, line <> "\n")
  end

  process_lines(rest, file)
end
Here, I’m taking chunks of 4k (larger chunks didn’t improve perf). When I read a chunk, I append it to the unfinished line from the previous chunk. Then I split on newlines and process everything except the last element. The last element is the unfinished line, which I prepend to the next chunk, and then repeat.
At this point, the streaming version is also faster than Ruby (which takes ~1.8s on my machine).
Note that I’m splitting the input by bytes, so I’m not sure whether this will work correctly with Unicode files.
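If it helps, a hypothetical way to check this (my own sketch, not part of the solution above) would be to generate a UTF-8 file with multi-byte characters, run the streamed version on it, and diff the output against what the read-based version produces:

# build a small UTF-8 input file with multi-byte characters
content = Enum.map_join(1..50_000, "", fn i -> "ünïcödé line #{i} αβγ\n" end)
File.write!("utf8_sample.txt", content)

main(["utf8_sample.txt", "stream"])
# then run the read-based variant on the same file (assuming it takes a
# similar "read" argument) and compare the two .out files, e.g. with diff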