How to use streams with char lists?

Background

I have the following code, which takes a string, converts it to a charlist and then maps over it.

defmodule RotationalCipher do
  @alphabet_size 26

  defguard is_lower?( char ) when char in ?a..?z
  defguard is_upper?( char ) when char in ?A..?Z

  def rotate(text, shift) do
    text
    |> String.to_charlist()
    |> Enum.map( &spin(&1, shift) )
    |> to_string()
  end

  # Adding @alphabet_size before taking rem/2 keeps the result
  # non-negative for shifts as low as -26.
  defp spin( char, shift ) when is_lower?( char ), do: ?a + rem( char - ?a + @alphabet_size + shift, @alphabet_size )
  defp spin( char, shift ) when is_upper?( char ), do: ?A + rem( char - ?A + @alphabet_size + shift, @alphabet_size )
  defp spin( char, _ ), do: char
end

Problem

Using Enum is all nice, but there is supposedly a performance benefit to using streams. According to Elixir in Action, I should only use Enum at the very end to force everything to come together; doing it before only makes things slower.

Now you will say “this is a simple app, no need to optimize.” But consider that we are ciphering the entire Bible or any of its variants. It’s quite a big book, and ciphering it via streams would offer a real benefit.

Question

So my question is:

  • Can this example be adapted to use Stream, with only one Enum call at the end?

You could use Stream instead of Enum in your example, like this

text
|> String.to_charlist()
|> Stream.map( &IO.inspect(&1) )  # I just changed this line to test in my console
|> Enum.to_list()
|> to_string()

I added an Enum.to_list near the end.

That will unfortunately not bring any benefit at all, since I am converting the string to a charlist, then to a stream, back to a charlist, and back to a string. I was hoping for an approach more like in the discussion below:

Where you stream the string from the start.
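One way to “stream from the start” is to unfold codepoints lazily off the binary itself, so no intermediate charlist is ever built. This is only a sketch; `CharStream` is a made-up module name for illustration:

```elixir
# Hypothetical sketch: lazily yield one codepoint at a time straight
# from the binary, using Stream.unfold/2.
defmodule CharStream do
  def codepoints(text) when is_binary(text) do
    Stream.unfold(text, fn
      <<>> -> nil
      <<c::utf8, rest::binary>> -> {c, rest}
    end)
  end
end

"abc"
|> CharStream.codepoints()
|> Stream.map(&(&1 + 1))
|> Enum.to_list()
|> to_string()
# => "bcd"
```

The only eager step is the final `Enum.to_list/1`; everything before it stays lazy.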

Maybe this could help for processing text in parallel
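For the parallel angle, something like `Task.async_stream/3` could cipher chunks concurrently. A sketch, where `String.upcase/1` stands in for whatever per-line transformation you actually apply:

```elixir
# Sketch: process lines concurrently; results come back in input
# order as {:ok, result} tuples.
["hello", "world"]
|> Task.async_stream(&String.upcase/1, ordered: true)
|> Enum.map(fn {:ok, line} -> line end)
# => ["HELLO", "WORLD"]
```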


Doesn’t that work?

StringIO.open(text)
|> elem(1)
|> IO.stream(:line)
|> Stream.map(&rotate(&1, shift))   # each element here is a whole line, not a char
# ... more of your processing here
|> Enum.join()

The last line directly combines all string pieces into a big string. But you can always use any Enum function instead of that one, depending on your desired end result.


Don’t stream just because you heard that it is faster. This is not true!

In many cases streams make it slower!

Please benchmark your use case, before blindly applying a stream.
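A dependency-free way to get a first rough number is `:timer.tc/1` (a package like Benchee would give far better statistics). This sketch times the same two-stage pipeline eagerly and lazily:

```elixir
# Rough micro-benchmark of the same pipeline, eager vs. lazy.
# :timer.tc/1 returns {microseconds, result}.
list = Enum.to_list(1..100_000)

{enum_us, sum} =
  :timer.tc(fn ->
    list |> Enum.map(&(&1 + 1)) |> Enum.filter(&(rem(&1, 2) == 0)) |> Enum.sum()
  end)

# The pinned ^sum asserts both variants compute the same result.
{stream_us, ^sum} =
  :timer.tc(fn ->
    list |> Stream.map(&(&1 + 1)) |> Stream.filter(&(rem(&1, 2) == 0)) |> Enum.sum()
  end)

IO.puts("Enum: #{enum_us}µs, Stream: #{stream_us}µs")
```

Which one wins depends on input size and stage count, which is exactly why measuring beats guessing.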

In a case where you iterate a single time, a stream will probably cost, as you have some overhead during dispatching.

If you have a certain number of stages, you might consider a stream, but should still not blindly use it.

In my opinion, streams especially show their strength when you have a lot of “shape shifting” stages, which flatten a given enum or drop elements in between. A stream helps a lot there by reducing the time spent building intermediate lists.
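For example, a sketch of such a pipeline: `flat_map` doubles the element count and `reject` drops some again, and with Stream no intermediate list is materialized between the stages:

```elixir
result =
  1..1_000
  |> Stream.flat_map(&[&1, &1 * 2])      # shape shift: 1_000 -> 2_000 elements
  |> Stream.reject(&(rem(&1, 3) == 0))   # shape shift: drop multiples of 3
  |> Enum.sum()                          # single eager pass at the very end
# result == 1_001_001
```

The `Enum.sum/1` at the end is the only point where the whole collection is walked.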

If you want to process a string char by char and have concerns about converting the string to a list, mapping over it, and converting it back again, you should probably use a recursive function that matches on the codepoints and builds a new string in the accumulator. That way you might get some additional optimisation powers from the BEAM, but beware composed characters!

def f(str, acc \\ <<>>)
def f(<<>>, acc), do: acc
def f(<<c::utf8, rest::binary>>, acc), do: f(rest, acc <> <<(c + 1)::utf8>>)

This is an interesting claim that goes literally against everything I have read thus far. Streams are lazy while Enum is eager, so if you have a pipe with multiple Enum functions you ought to optimize it by trying to use a Stream, because this way you only compute the data structure once (at the end, when you use Enum to pull everything together).

Obviously, if you only have one transformation, using Enum is the preferred way to go, as Enum code is usually simpler and easier for us pesky Humans to read and understand, and because using a Stream only to convert it right after into a list (or something else) would offer no benefit at all.

Perhaps I misunderstood your comment?
Could you provide an example where Streams are slower than Enums in a pipeline?

I thought similarly to you about a year ago. I learned the hard way during Advent of Code that streams have a cost.

I learned by benchmarking that they only give a benefit in the situations described above, and on very large input that would trigger many GC runs when processed using Enum.

A stream always involves additional state keeping, multiple layers of dynamic dispatch, etc. The situation might be totally different if we had a statically typed language that was able to dispatch most of the calls at compile time.

You can learn about my last year’s experience with Advent of Code by searching this forum.


But isn’t that usually what happens when you have several processing stages of incoming data? I haven’t benchmarked, but I definitely don’t want code that can easily receive 500k+ items to build huge lists and discard them at every stage, overloading the GC with the transformations done at each stage on each item.

I agree that for smaller workloads (like the OP’s, simply iterating over a string or a handful of lines) Streams can definitely be overkill! Don’t get me wrong.

What I am saying is: the moment you have no guarantees about how big the input data will get, you are better off playing it safe and utilizing Stream. If the code becomes a performance hotspot, that is fixable after the fact. For example, with two functions: one that requires a guarantee that the incoming data is no bigger than, say, 10k items and uses Enum, and another, generic one that always uses Stream.
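That split could be sketched like this (hypothetical names throughout; `transform/1` and the 10k limit are placeholders):

```elixir
defmodule Pipeline do
  @small_limit 10_000

  # Caller guarantees a small list: eager Enum keeps it simple.
  def process_small(items) when is_list(items) and length(items) <= @small_limit do
    items |> Enum.map(&transform/1) |> Enum.reject(&is_nil/1)
  end

  # Generic entry point: lazy, safe for arbitrarily large enumerables.
  # The caller forces the result with a single Enum call when needed.
  def process(enum) do
    enum |> Stream.map(&transform/1) |> Stream.reject(&is_nil/1)
  end

  defp transform(x), do: x * 2  # placeholder transformation
end
```

`Pipeline.process_small([1, 2, 3])` returns `[2, 4, 6]` eagerly, while `Pipeline.process(1..3) |> Enum.to_list()` forces the lazy variant to the same result.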