EDIT: Feel free to read the discussion, but the issue was reading the Benchee results wrong. Benchee tracks the memory allocated throughout the benchmark, not the total amount of memory in use at any given time.
We were having a discussion at work about File.read vs File.stream. My assumption going in was that File.read would always be faster, but at the expense of memory, since the entire point of using File.stream is to process one line at a time.
We ran a Benchee test with a 2-line file and a 2,000-line file and ended up with some surprising results that I wanted to share here for comments. Why does File.stream use so much more memory if it's only processing one line at a time? That seems to defeat the purpose of using it.
Here are the two implementations (parsing buildpack.config files):
File.read
File.read!(file)
|> String.split("\n")
|> Enum.map(&String.trim/1)
|> Enum.reject(fn x -> x == "" end)
|> Enum.map(fn line ->
  [lang, version] = String.split(line, "=")
  {lang, version}
end)
|> Enum.into(%{})
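The File.stream! implementation itself wasn't posted, so here is a minimal sketch of what it presumably looks like, mirroring the Enum pipeline above with Stream stages (the module and function names are mine, not from the original code):

```elixir
defmodule BuildpackConfig do
  # Hypothetical stream-based equivalent of the File.read! version.
  # File.stream! yields one line at a time (including the trailing
  # newline, which String.trim removes).
  def parse_stream(file) do
    File.stream!(file)
    |> Stream.map(&String.trim/1)
    |> Stream.reject(fn line -> line == "" end)
    |> Stream.map(fn line ->
      [lang, version] = String.split(line, "=")
      {lang, version}
    end)
    |> Enum.into(%{})
  end
end
```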
We also tried a third variant using Stream.run instead of Enum.into. Here are the Benchee results.
##### With input 2000 line file #####
Name ips average deviation median 99th %
File.read! with Enum 371.14 2.69 ms ±22.50% 2.54 ms 5.39 ms
File.stream! with Stream.run 325.30 3.07 ms ±21.00% 2.94 ms 5.14 ms
File.stream! with Stream |> Enum.into 286.92 3.49 ms ±18.34% 3.34 ms 6.22 ms
Comparison:
File.read! with Enum 371.14
File.stream! with Stream.run 325.30 - 1.14x slower +0.38 ms
File.stream! with Stream |> Enum.into 286.92 - 1.29x slower +0.79 ms
Memory usage statistics:
Name Memory usage
File.read! with Enum 578.56 KB
File.stream! with Stream.run 1017.41 KB - 1.76x memory usage +438.84 KB
File.stream! with Stream |> Enum.into 1251.91 KB - 2.16x memory usage +673.34 KB
**All measurements for memory usage were the same**
##### With input 2 line file #####
Name ips average deviation median 99th %
File.read! with Enum 51.23 K 19.52 μs ±69.95% 15.61 μs 51.74 μs
File.stream! with Stream |> Enum.into 22.49 K 44.47 μs ±49.00% 36.04 μs 112.23 μs
File.stream! with Stream.run 21.34 K 46.85 μs ±60.72% 38.46 μs 123.17 μs
Comparison:
File.read! with Enum 51.23 K
File.stream! with Stream |> Enum.into 22.49 K - 2.28x slower +24.95 μs
File.stream! with Stream.run 21.34 K - 2.40x slower +27.33 μs
Memory usage statistics:
Name average deviation median 99th %
File.read! with Enum 1.08 KB ±0.00% 1.08 KB 1.08 KB
File.stream! with Stream |> Enum.into 3.26 KB ±0.00% 3.26 KB 3.26 KB
File.stream! with Stream.run 2.90 KB ±0.00% 2.90 KB 2.90 KB
Comparison:
File.read! with Enum 1.08 KB
File.stream! with Stream |> Enum.into 3.26 KB - 3.02x memory usage +2.18 KB
File.stream! with Stream.run 2.90 KB - 2.69x memory usage +1.82 KB
Oops, it's not the number of lines but the size in KB. When you pass read_ahead without a size, it assumes a default of 64 KB. See the Erlang file module documentation for more.
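For reference, the read-ahead buffer can also be sized explicitly when creating the stream (the 100_000 below is an arbitrary illustrative value, not from the original code):

```elixir
# :read_ahead alone buffers ~64 KB per read by default; passing an
# explicit size in bytes overrides that. The stream is lazy, so no
# file is opened until it is enumerated.
stream = File.stream!("buildpack.config", read_ahead: 100_000)
```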
I have a limited understanding of this, but my impression is that Stream is lazy and Enum is eager, and that in general lazy evaluation is less efficient unless there really are scenarios in which the evaluation does not need to occur. In your case, since every evaluation is ultimately completed, Stream might reasonably be expected to be less efficient. If some step leads to a branch point where evaluation is skipped for some items, you might then expect Stream to be more efficient than Enum, since Stream would also avoid evaluating all the previous steps for those items. I hope someone will correct me if I'm thinking about this the wrong way.
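The eager/lazy distinction above can be made concrete with an early exit. In this sketch (the counter is just for illustration), the Enum pipeline maps every element before taking five, while the Stream pipeline only maps the five elements actually consumed:

```elixir
# Count how many times the mapping function actually runs.
{:ok, counter} = Agent.start_link(fn -> 0 end)

double = fn x ->
  Agent.update(counter, &(&1 + 1))
  x * 2
end

# Eager: maps all 1_000 elements before taking 5.
1..1_000 |> Enum.map(double) |> Enum.take(5)
eager_calls = Agent.get(counter, & &1)

Agent.update(counter, fn _ -> 0 end)

# Lazy: only the 5 consumed elements are ever mapped.
1..1_000 |> Stream.map(double) |> Enum.take(5)
lazy_calls = Agent.get(counter, & &1)

IO.puts("eager: #{eager_calls} calls, lazy: #{lazy_calls} calls")
# eager: 1000 calls, lazy: 5 calls
```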
Right, the primary purpose of Stream isn't necessarily to be faster but to have constant memory usage. This makes it much safer on arbitrarily large files.
Just out of curiosity, we tried making the file a lot bigger (about 200 MB). The results are the same. The memory usage of the Stream implementation is 2x higher than just using File.read and Enum, which doesn't seem like it should be the case. At this point I feel like I have to be doing something wrong to end up with these results.
Here are the Benchee results for the bigger file though.
Name ips average deviation median 99th %
File.stream! with Stream.run 0.0647 15.45 s ±0.49% 15.43 s 15.56 s
File.stream! with Stream |> Enum.into 0.0593 16.87 s ±0.58% 16.89 s 16.97 s
File.read! with Enum 0.0506 19.76 s ±4.07% 19.54 s 20.96 s
Comparison:
File.stream! with Stream.run 0.0647
File.stream! with Stream |> Enum.into 0.0593 - 1.09x slower +1.42 s
File.read! with Enum 0.0506 - 1.28x slower +4.31 s
Memory usage statistics:
Name Memory usage
File.stream! with Stream.run 5.35 GB
File.stream! with Stream |> Enum.into 6.58 GB - 1.23x memory usage +1.23 GB
File.read! with Enum 3.04 GB - 0.57x memory usage -2.30 GB
Before I posted this, I decided to run it again and watch htop while the two were running. My laptop has 32 GB of RAM; while the File.read version was running, memory usage varied between 16-24%, whereas the Stream version never exceeded 0.9%. This was reflected in the overall memory usage on the system as well.
So after all of this, I think the issue may be either Benchee itself or my Benchee configuration, because it doesn't seem to be properly tracking the memory usage. Has anybody else ever run into that?
I think what happened here is that the Enum version just converts one large list into another, whereas in the Stream version the final collection was built up one item at a time. Assuming the final result has 2,000 items, the Stream version will allocate 2,000 intermediate collections, with 1, 2, 3, …, 2,000 items in each. This causes more memory-allocator thrashing.
What Benchee is measuring is not exactly what first jumped to mind when I saw the title “memory usage”; from the README:
This measurement is not the actual effect on the size of the BEAM VM size, but the total amount of memory that was allocated during the execution of a given scenario. This includes all memory that was garbage collected during the execution of that scenario.
This is useful (more allocations → more GC → slower), but very definitely not the thing Stream is intended to constrain, which is peak memory usage.
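The gap between cumulative allocation and resident memory can be demonstrated directly (this is an illustrative sketch, not Benchee's actual measurement mechanism; the module, function, and sizes are mine):

```elixir
defmodule MemDemo do
  # Allocates `n` throwaway lists of `size` elements each. The
  # *cumulative* allocation (roughly what Benchee reports) grows with
  # n, but the *resident* process memory (closer to what htop shows)
  # stays small, because each list becomes garbage immediately.
  def churn(n, size) do
    Enum.each(1..n, fn _ -> _ = List.duplicate(0, size) end)
    :erlang.garbage_collect()
    {:memory, bytes} = Process.info(self(), :memory)
    bytes
  end
end

resident = MemDemo.churn(1_000, 10_000)
IO.puts("resident process memory after churn: #{resident} bytes")
```

Here roughly 10 million list cells are allocated in total, yet the resident process memory after collection is a tiny fraction of that, which matches the htop observation above.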
In the test with the bigger file, when I was watching it with htop, the actual memory usage on my system was closer to 25% (of 32 GB) for the File.read version, while it remained steady at 0.9% (of 32 GB) for the Stream version.