Surprising behavior of File.stream vs File.read

EDIT: Feel free to read the discussion, but the issue was reading the Benchee results wrong. Benchee tracks the total memory allocated over the course of a benchmark, not the peak amount of memory in use at any given time.

We were having a discussion at work about File.read vs File.stream, and my assumption going in was that File.read would always be faster but at the expense of memory, since the entire goal of using File.stream is to process one line at a time.

We ran a 2-line file and a 2,000-line file through a Benchee test and ended up with some surprising results that I wanted to share here for comments. Why does File.stream use so much more memory if it’s only processing one line at a time? That seems to defeat the purpose of using it.

Here are the two implementations (parsing buildpack.config files):

File.read

      File.read!(file)
      |> String.split("\n")
      |> Enum.map(&String.trim/1)
      |> Enum.reject(fn x -> x == "" end)
      |> Enum.map(fn line ->
        [lang, version] = String.split(line, "=")
        {lang, version}
      end)
      |> Enum.into(%{})

File.stream!

      File.stream!(file)
      |> Stream.map(&String.trim/1)
      |> Stream.map(fn line ->
        [lang, version] = String.split(line, "=")
        {lang, version}
      end)
      |> Enum.into(%{})

We also tried a third variant with Stream.run instead of Enum.into. Here are the Benchee results.
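For reference, the Stream.run variant was presumably along these lines (a sketch; the thread doesn’t show its exact body, only that it ends in Stream.run/1, which forces the stream for its side effects and discards the results):

```elixir
File.stream!(file)
|> Stream.map(&String.trim/1)
|> Stream.map(fn line ->
  [lang, version] = String.split(line, "=")
  {lang, version}
end)
# Stream.run/1 returns :ok and accumulates nothing
|> Stream.run()
```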

##### With input 2000 line file #####
Name                                            ips        average  deviation         median         99th %
File.read! with Enum                         371.14        2.69 ms    ±22.50%        2.54 ms        5.39 ms
File.stream! with Stream.run                 325.30        3.07 ms    ±21.00%        2.94 ms        5.14 ms
File.stream! with Stream |> Enum.into        286.92        3.49 ms    ±18.34%        3.34 ms        6.22 ms

Comparison: 
File.read! with Enum                         371.14
File.stream! with Stream.run                 325.30 - 1.14x slower +0.38 ms
File.stream! with Stream |> Enum.into        286.92 - 1.29x slower +0.79 ms

Memory usage statistics:

Name                                     Memory usage
File.read! with Enum                        578.56 KB
File.stream! with Stream.run               1017.41 KB - 1.76x memory usage +438.84 KB
File.stream! with Stream |> Enum.into      1251.91 KB - 2.16x memory usage +673.34 KB

**All measurements for memory usage were the same**

##### With input 2 line file #####
Name                                            ips        average  deviation         median         99th %
File.read! with Enum                        51.23 K       19.52 μs    ±69.95%       15.61 μs       51.74 μs
File.stream! with Stream |> Enum.into       22.49 K       44.47 μs    ±49.00%       36.04 μs      112.23 μs
File.stream! with Stream.run                21.34 K       46.85 μs    ±60.72%       38.46 μs      123.17 μs

Comparison: 
File.read! with Enum                        51.23 K
File.stream! with Stream |> Enum.into       22.49 K - 2.28x slower +24.95 μs
File.stream! with Stream.run                21.34 K - 2.40x slower +27.33 μs

Memory usage statistics:

Name                                          average  deviation         median         99th %
File.read! with Enum                          1.08 KB     ±0.00%        1.08 KB        1.08 KB
File.stream! with Stream |> Enum.into         3.26 KB     ±0.00%        3.26 KB        3.26 KB
File.stream! with Stream.run                  2.90 KB     ±0.00%        2.90 KB        2.90 KB

Comparison: 
File.read! with Enum                          1.08 KB
File.stream! with Stream |> Enum.into         3.26 KB - 3.02x memory usage +2.18 KB
File.stream! with Stream.run                  2.90 KB - 2.69x memory usage +1.82 KB

3 Likes

Maybe I am wrong, but the File.stream! version is calling String.split and the Enum version isn’t.

You’re right, that was copied from another part of our conversation. Re-running now and I’ll update the post.

Updated. Definitely brought it a little closer.

This can be changed by declaring the number of lines with read_ahead:

File.stream!(file, read_ahead: _a_thousand_lines = 1000)

I believe it’s worth testing it too :slight_smile:

I just tried it again with the read_ahead argument on the Stream.run version and the results were identical fwiw.

The default File.stream! does seem to use :read_ahead:

      %File.Stream{
        line_or_bytes: :line,
        modes: [:raw, :read_ahead, :binary],
        path: "elixir_buildpack.config",
        raw: true
      }

Oops, it’s not a number of lines but a buffer size in bytes. When you pass read_ahead without a size, it assumes a default of 64 KB. See the Erlang :file documentation for more.
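So a read_ahead value is a byte count; for example (a sketch, reusing the filename from earlier in the thread):

```elixir
# :read_ahead takes a buffer size in bytes; 64 KB (65,536 bytes) is the
# default used when :read_ahead is passed without an explicit size
File.stream!("elixir_buildpack.config", read_ahead: 64 * 1024)
```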

2 Likes

I have a limited understanding of this concept, but my impression is that Stream is lazy and Enum is eager, and that in general lazy evaluation is less efficient unless there really are scenarios in which the evaluation does not need to occur. In your case, since every evaluation is ultimately completed, Stream might reasonably be expected to be less efficient. If you have a situation in which some step leads to a branch point where evaluation is skipped for some items in the iterator, you might then expect Stream to be more efficient than Enum, as Stream would also avoid evaluating all the previous steps for those items. I hope that someone will correct me if I’m thinking about this in the wrong way.

1 Like

Right, the primary purpose of stream isn’t necessarily to be faster but to have constant memory usage. This makes it much safer on arbitrarily large files.

3 Likes

Yea, that’s my main concern here. Why is the memory usage for stream so much higher here?

Even the example in the docs states:

Open up a file, replace all # by % and stream to another file without loading the whole file in memory:

On this code example:

File.stream!("/path/to/file")
|> Stream.map(&String.replace(&1, "#", "%"))
|> Stream.into(File.stream!("/path/to/other/file"))
|> Stream.run()

https://hexdocs.pm/elixir/Stream.html#run/1

Why would the memory usage be so much higher in the Stream test? That’s the opposite of my expectation here.

Just out of curiosity, we tried making the file a lot bigger (about 200 MB). The results are the same: the memory usage reported for the Stream implementation is 2x higher than for just using File.read and Enum, which doesn’t seem like it should be the case. At this point I feel like I have to be doing something wrong to end up with these results.

Here are the Benchee results for the bigger file though.

Name                                            ips        average  deviation         median         99th %
File.stream! with Stream.run                 0.0647        15.45 s     ±0.49%        15.43 s        15.56 s
File.stream! with Stream |> Enum.into        0.0593        16.87 s     ±0.58%        16.89 s        16.97 s
File.read! with Enum                         0.0506        19.76 s     ±4.07%        19.54 s        20.96 s

Comparison: 
File.stream! with Stream.run                 0.0647
File.stream! with Stream |> Enum.into        0.0593 - 1.09x slower +1.42 s
File.read! with Enum                         0.0506 - 1.28x slower +4.31 s

Memory usage statistics:

Name                                     Memory usage
File.stream! with Stream.run                  5.35 GB
File.stream! with Stream |> Enum.into         6.58 GB - 1.23x memory usage +1.23 GB
File.read! with Enum                          3.04 GB - 0.57x memory usage -2.30341 GB

Before I posted this I decided to run it again and watch htop while the two were running. My laptop has 32 GB of RAM. While the File.read version was running, memory usage varied between 16-24%; while the Stream version was running, it never exceeded 0.9%. This was reflected in the overall memory usage on the system as well.

So after all of this I think the issue may be either Benchee itself or my Benchee configuration, because it doesn’t seem to be properly tracking the memory usage. Anybody else ever run into that?
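For what it’s worth, memory measurement in Benchee is opt-in via the memory_time option; the configuration was presumably along these lines (a sketch, where read_version/1 and stream_version/1 are hypothetical wrappers around the pipelines shown earlier):

```elixir
Benchee.run(
  %{
    # read_version/1 and stream_version/1 are hypothetical stand-ins for the
    # File.read! and File.stream! implementations above
    "File.read! with Enum" => fn -> read_version("elixir_buildpack.config") end,
    "File.stream! with Stream |> Enum.into" => fn -> stream_version("elixir_buildpack.config") end
  },
  time: 5,
  # memory_time enables the "Memory usage statistics" section of the output
  memory_time: 2
)
```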

1 Like

According to Benchee documentation:

This measurement is not the actual effect on the size of the BEAM VM size, but the total amount of memory that was allocated during the execution of a given scenario. This includes all memory that was garbage collected during the execution of that scenario.

I think what happened here is the Enum version just converts from one large Enum to another large Enum, whereas in the Stream version the final Enum is built up one item at a time. Assuming the final result has 2000 items, the Stream version will allocate 2000 Enums, with 1, 2, 3, … 2000 items in each. This will cause more memory allocator thrashing.
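That intuition can be sketched abstractly (not the thread’s actual code): both pipelines below produce the same map, but the lazy one threads every element through the stream machinery individually, so it allocates more in total even though less is live at any instant.

```elixir
pairs = Enum.map(1..2000, &{Integer.to_string(&1), &1})

# Eager: the full list of pairs already exists; one pass converts it to a map
eager = Enum.into(pairs, %{})

# Lazy: each pair is pulled through the stream one at a time before insertion,
# creating more short-lived intermediate allocations along the way
lazy = pairs |> Stream.map(& &1) |> Enum.into(%{})

# Same result either way
true = eager == lazy
```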

4 Likes

That seems to address it based on what I was seeing with htop as well.

Can you show your complete file?

:thinking: What Benchee is measuring is not exactly what first jumped to mind when I saw the title “memory usage”; from the README:

This measurement is not the actual effect on the size of the BEAM VM size, but the total amount of memory that was allocated during the execution of a given scenario. This includes all memory that was garbage collected during the execution of that scenario.

This is useful (more allocations made → more GC → more slow) but very definitely not the thing that Stream is intended to constrain (peak memory usage).
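A toy illustration of that distinction (my own sketch, not Benchee’s mechanism): the loop below allocates roughly a thousand lists over its lifetime, so total allocation, which is what Benchee reports, is large, while peak live memory stays near one list because each becomes garbage immediately.

```elixir
# Total allocated over the run: ~1_000 lists of 1_000 elements each.
# Peak live memory at any instant: ~1 such list.
Enum.each(1..1_000, fn _ ->
  _ = Enum.to_list(1..1_000)
end)
```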

2 Likes

Yep, that was where it all went sideways.

In the test with the bigger file, when I was watching it with htop, the actual memory usage on my system was closer to 25% (of 32 GB) for the File.read version, while it remained steady at 0.9% (of 32 GB) for the Stream version.

Also @benwilson512

1 Like