Understanding Streams and Enum

venomnert · December 8, 2018, 12:36am

Hey guys,

I recently came across Stream, Enum, and lazy versus eager evaluation, and I have been trying to wrap my head around these concepts this week. I reached out on the Elixir Slack and got a better understanding, so I was hoping you could review my explanation of Stream and Enum and let me know whether I understood it correctly.

Let’s work with the following example.

Let’s say we have the following list of numbers [1, …, 1000]. We want to apply the following transformations to it: add 2, multiply by 5, and add 5.

With Enum, the implementation looks like this:
1..1000
|> Enum.map(fn(x) -> x + 2 end)
|> Enum.map(fn(x) -> x * 5 end)
|> Enum.map(fn(x) -> x + 5 end)

Here is what happens:

Enum.map(fn(x) -> x + 2 end) iterates over all 1000 elements and produces a new list [3, …, 1002]
Enum.map(fn(x) -> x * 5 end) iterates over all 1000 elements and produces a new list [15, …, 5010]
Enum.map(fn(x) -> x + 5 end) iterates over all 1000 elements and produces a new list [20, …, 5015]
So in total we iterate 3000 times and produce 3 new lists
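
The eager steps above can be checked with a small sketch, for example in iex. The names step1, step2 and step3 are only for illustration; each one holds the full intermediate list produced by that Enum.map call.

```elixir
# Each Enum.map call is eager: it walks its whole input and builds a new list.
step1 = Enum.map(1..1000, fn x -> x + 2 end)
step2 = Enum.map(step1, fn x -> x * 5 end)
step3 = Enum.map(step2, fn x -> x + 5 end)

IO.inspect(Enum.take(step1, 3)) # [3, 4, 5]
IO.inspect(Enum.take(step2, 3)) # [15, 20, 25]
IO.inspect(Enum.take(step3, 3)) # [20, 25, 30]
IO.inspect(List.last(step3))    # 5015
```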

With Stream, the implementation looks like this:
1..1000
|> Stream.map(fn(x) -> x + 2 end)
|> Stream.map(fn(x) -> x * 5 end)
|> Stream.map(fn(x) -> x + 5 end)
|> Enum.take(1000)

Here is what happens:

Stream.map(fn(x) -> x + 2 end) wraps the range 1..1000 and returns a stream
Stream.map(fn(x) -> x * 5 end) wraps the previous stream and returns a stream
Stream.map(fn(x) -> x + 5 end) wraps the previous stream and returns a stream
- So conceptually each element goes through one composed function: fn(x) -> ((x + 2) * 5) + 5 end
- For the first element of the list, x = 1
Enum.take(1000) runs the composed function over the elements and returns all 1000 results
So in total we iterate 1000 times and produce 1 new list
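
The lazy version can be checked the same way. This is only a sketch; the names stream and result are illustrative. Building the pipeline does no work on its own, and the single traversal happens when Enum.take/2 consumes it.

```elixir
stream =
  1..1000
  |> Stream.map(fn x -> x + 2 end)
  |> Stream.map(fn x -> x * 5 end)
  |> Stream.map(fn x -> x + 5 end)

# Nothing has been computed yet: `stream` describes the work, it is not a list.
IO.inspect(is_list(stream)) # false

# Enum.take/2 forces the stream and builds the only list in this pipeline.
result = Enum.take(stream, 1000)

IO.inspect(Enum.take(result, 3)) # [20, 25, 30]
IO.inspect(List.last(result))    # 5015
```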

benwilson512 · December 8, 2018, 1:25am

That is correct, yes. It might therefore seem that using a stream is always better, since you have fewer traversals. This isn’t always the case though: although you only go through the 1000 elements once, getting each value has more overhead. To get the first item, Enum.take has to tell the x + 5 stream “get me a value”. That stream then has to tell the x * 5 stream “get me a value”, which in turn calls the x + 2 stream, and so on.
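
One way to see the element-by-element flow is to record the order in which the mapping functions run. The sketch below is an illustration, not a benchmark: trace and record are hypothetical helpers, and an Agent is used only as a simple log. It runs the same two-step pipeline once with Enum.map/2 and once with Stream.map/2.

```elixir
# Hypothetical helper: runs a two-step pipeline with the given map function
# (Enum.map/2 or Stream.map/2) and returns the order in which the steps ran.
trace = fn map ->
  {:ok, log} = Agent.start_link(fn -> [] end)

  # Records {tag, value} in the log and passes the value through unchanged.
  record = fn tag, value ->
    Agent.update(log, fn entries -> [{tag, value} | entries] end)
    value
  end

  1..2
  |> map.(fn x -> record.(:add_2, x + 2) end)
  |> map.(fn x -> record.(:times_5, x * 5) end)
  |> Enum.to_list()

  entries = log |> Agent.get(fn entries -> entries end) |> Enum.reverse()
  Agent.stop(log)
  entries
end

# Eager: the first step finishes for every element before the second starts.
eager_order = trace.(&Enum.map/2)
IO.inspect(eager_order) # [add_2: 3, add_2: 4, times_5: 15, times_5: 20]

# Lazy: each element goes through both steps before the next one is touched.
lazy_order = trace.(&Stream.map/2)
IO.inspect(lazy_order) # [add_2: 3, times_5: 15, add_2: 4, times_5: 20]
```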

In the Enum case, each step is just a list traversal. In general I lean towards Enum unless I’m doing something where the laziness will help avoid work. Here’s a good example:

some_list
|> Stream.map(&expensive_function/1)
|> Enum.find(& &1.successful)

This will walk through the stream only as far as it needs to in order to find the first result that matches %{successful: true}. If I used Enum.map instead, it would have to run the expensive function for every element.
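
To make the saved work visible, here is a sketch that counts how often the function runs. The expensive function below is a hypothetical stand-in for expensive_function/1: it is not actually expensive, it just bumps a counter and reports success only for the input 3.

```elixir
{:ok, counter} = Agent.start_link(fn -> 0 end)

# Hypothetical stand-in for expensive_function/1: counts its own calls and
# is "successful" only for the input 3.
expensive = fn x ->
  Agent.update(counter, fn n -> n + 1 end)
  %{value: x, successful: x == 3}
end

# Lazy: the stream is abandoned as soon as Enum.find/2 gets a match.
found_lazy = 1..1000 |> Stream.map(expensive) |> Enum.find(& &1.successful)
lazy_calls = Agent.get(counter, fn n -> n end)

# Reset the counter and repeat with the eager version.
Agent.update(counter, fn _ -> 0 end)
found_eager = 1..1000 |> Enum.map(expensive) |> Enum.find(& &1.successful)
eager_calls = Agent.get(counter, fn n -> n end)

IO.inspect(found_lazy == found_eager) # true
IO.inspect({lazy_calls, eager_calls}) # {3, 1000}
```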

venomnert · December 8, 2018, 2:01am

Ahh that makes sense. And it seems as though Enum is usually faster based on what I have been finding.

josevalim · December 8, 2018, 7:58am

Yes. I like to say that streams are about using less memory at the cost of CPU. Streams will only be faster for quite large collections (such as infinite ones, which would never finish with Enum) or a high number of traversals.
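
As a small illustration of the infinite case, the sketch below uses Stream.iterate/2 to generate 1, 2, 3, … without end. The name first_even_squares is only illustrative. Because every step before Enum.take/2 is lazy, only as many elements are generated as are needed for three results.

```elixir
first_even_squares =
  Stream.iterate(1, fn n -> n + 1 end)         # infinite: 1, 2, 3, ...
  |> Stream.map(fn n -> n * n end)             # 1, 4, 9, 16, 25, 36, ...
  |> Stream.filter(fn n -> rem(n, 2) == 0 end) # keep only the even squares
  |> Enum.take(3)                              # stop after three results

IO.inspect(first_even_squares) # [4, 16, 36]
```

Swapping either Stream call for its Enum counterpart would try to materialize the infinite sequence and never return.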

sasajuric · December 8, 2018, 10:13am

For smaller inputs and/or just one transformation, Enum will be faster. For larger inputs with multiple transformations, using streams in the middle can sometimes be dramatically faster because we don’t generate large intermediate lists. More importantly, memory usage will be stable, which reduces the chance of blowing up production on some unexpectedly large input. That’s why I usually write a transformation pipeline in the style of:

input_enumerable
|> Stream.trans_1
|> Stream.trans_2
|> ...
|> Enum.last_trans

As always, there are some gotchas. If your code consumes an enumerable multiple times, it’s usually better to make sure that the input enumerable is not a stream, to avoid needless (and sometimes quite costly) duplicate computations.
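
A sketch of that gotcha, with a call counter standing in for the cost. The function slow_double is hypothetical, and an Agent is used only to count calls. Consuming the same stream twice runs the mapping function twice per element, while materializing a list first runs it once.

```elixir
{:ok, counter} = Agent.start_link(fn -> 0 end)

# Hypothetical "costly" function: counts its own calls.
slow_double = fn x ->
  Agent.update(counter, fn n -> n + 1 end)
  x * 2
end

# The stream is recomputed from scratch each time it is consumed.
stream = Stream.map(1..5, slow_double)
stream_sum = Enum.sum(stream)
stream_max = Enum.max(stream)
calls_with_stream = Agent.get(counter, fn n -> n end)

# Reset, then materialize once and reuse the resulting list.
Agent.update(counter, fn _ -> 0 end)
list = Enum.map(1..5, slow_double)
list_sum = Enum.sum(list)
list_max = Enum.max(list)
calls_with_list = Agent.get(counter, fn n -> n end)

IO.inspect({stream_sum, stream_max, calls_with_stream}) # {30, 10, 10}
IO.inspect({list_sum, list_max, calls_with_list})       # {30, 10, 5}
```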

dimitarvp · December 8, 2018, 5:14pm

There is no magic number that fits everyone, but I always delegate anything that processes 2000 items or more to streams, the concrete scenario allowing.

As others mentioned, this stabilizes your memory usage and vastly reduces the chance of your production code being killed off by watchdogs if its RAM usage spikes sharply. On most hosting providers, RAM is much more valuable, and more stringently monitored and controlled, than CPU or I/O operations.