alvises

alvises

Streaming lines from an enum of chunks

Hi all!

Related to this article , what I want to do is to process a text file using Elixir Streams. Now, most of the time is useful to get a stream of lines rather than chunks of text. So I’ve tried different solutions to do what actually IO.stream(iodevice, :line) does:

File I’ve used in the example
To do the experiments below I’ve used this file: (125mb) https://poeticoding-data.sfo2.digitaloceanspaces.com/httpstream/numbers.txt which is a list of integers, one per line. (30M lines)

Goal
Opening the file without the :line option, the enum we get is a stream of chunks, something like

["1231\n212\n1000\n","100\n212\n111","23\n1000","2\n102\n"]
and I want to transform it to a list of lines using Elixir Streams.

["1231\n","212\n","1000\n","100\n","212\n","11123\n","10002\n",102\n"]

0. File.stream with :line option: performance I want to reach

File.stream! streams the line for me

File.stream!("numbers.txt")
# below is the same is every case
|> Stream.map(fn line-> 
	{num,_} = Integer.parse(line)
	num
end)
|> Enum.sum()

On my computer this takes 16 seconds with almost no memory impact. This is the time I want to reach.

1. Using Stream.transform and regex (slowest and strange memory spikes)

File.stream!("numbers.txt",[],2048) #chunks instead of lines
|> Stream.transform("",fn chunk, acc ->
	[last_line | lines] = 
		Regex.split(~r/(?<=\n)/, acc <> chunk)
		|> Enum.reverse()
	{Enum.reverse(lines),last_line}
end)
# below is the same is every case
|> Stream.map(fn line-> 
	{num,_} = Integer.parse(line)
	num
end)
|> Enum.sum()

The reason why I split and reverse extracting the last_element from the list, is because the last element could be part of the next chunk. Using the example at the beginning

"100\n212\n111","23\n1000"
The last line of the first chunk is part of the first line of the second chunk,
So what I do is to split the first chunk obtaining
["100\n","212\n"] and accumulating "111" which is then concatenated to the next chunk
"111" <> "23\n1000" obtaining ["11123\n",1000"].

This unfortunately is quite slow and what it makes it slow is the regular expression: 70 seconds.
The advantage is that \n is preserved like the behaviour of File.stream!

2. Going through the whole binary recursively

I just go through the whole binary creating a list of lines.

def next_line(chunk,current_line\\""), do: next_line(chunk,current_line,[])

def next_line(<<"\n"::utf8, rest::binary>>,current_line,lines) do
	next_line(rest,"",[current_line | lines])
end

def next_line(<<c::utf8, rest::binary>>,current_line,lines) do
	next_line(rest,<<current_line::binary, c::utf8>>,lines)
end

def next_line(<<>>,current_line,lines), do: {Enum.reverse(lines), current_line}

And using this function to emit the lines chunk by chunk

File.stream!("numbers.txt",[],2048) #chunks instead of lines
|> Stream.transform("",&next_line/2)
# below is the same is every case
|> Stream.map(fn line-> 
	{num,_} = Integer.parse(line)
	num
end)
|> Enum.sum()

This is much faster, but still not as fast as using File.stream! with the :line option: 21 seconds

3. Interesting case: String.split

Now an interesting case. If I don’t care about that the \n is preserved, I can use String.split(chunk,"\n")
So, we just need to change the regex from the first to String.split

File.stream!("numbers.txt",[],2048) #chunks instead of lines
|> Stream.transform("",fn chunk, acc ->
	[last_line | lines] = 
      acc <> chunk
      |> String.split("\n")
      |> Enum.reverse()
	{Enum.reverse(lines),last_line}
end)
# below is the same is every case
|> Stream.map(fn line-> 
	{num,_} = Integer.parse(line)
	num
end)
|> Enum.sum()

This runs at just 8 seconds.

Questions

Why the 3 is so much faster? And Why the recursive one (number 2) can’t reach run at the same time of the target example (number 0) ?

Is there a better way to do this?

Most Liked

alvises

alvises

Looking at the code it seems that pop does something similar to a double reverse:
List.pop_at
List.do_pop_at

It checks the length at the beginning (L710), which is a slow operation with long lists, then recursively goes back accumulating a reversed list which is reversed at the end, which makes sense since lists are linked lists and we can’t just traverse the list removing the last item.

To just extract the last item it seems that the “double reverse” is much faster:

list = Enum.to_list 1..10_000

Benchee.run(%{
	"double reverse" => fn -> 
		[_popped_element| rev_list] = Enum.reverse(list)
		_poped_list = Enum.reverse(rev_list)
	end,

	"List.pop_at(-1)" => fn ->
		{_popped_element, _popped_list} = List.pop_at(list, -1)
	end
},
time: 10, 
memory_time: 2
)
Name                      ips        average  deviation         median         99th %
double reverse        19.00 K       52.63 μs    ±50.42%          61 μs         139 μs
List.pop_at(-1)        7.39 K      135.35 μs    ±26.58%         128 μs         273 μs

Comparison:
double reverse        19.00 K
List.pop_at(-1)        7.39 K - 2.57x slower +82.72 μs
rhruiz

rhruiz

you probably can save time in the double reverse time by:

{last_line, lines} =
      acc <> chunk
      |> String.split("\n")
      |> List.pop_at(-1)
bismark

bismark

I believe #3 is faster because under the hood String.split(binary, "\n") uses :binary.split/3 and per the docs:

Although the majority of functions could be provided using bit-syntax, the functions in this library are highly optimized and are expected to either execute faster or consume less memory, or both, than a counterpart written in pure Erlang.

Where Next?

Popular in Questions Top

fireproofsocks
I’m working on defining a simple Ecto schema for a table (in PostGres), but I don’t see where I can define a column as NOT NULL. Conside...
New
aadeshere1
I have a another noob question about loop. Since elixir is immutable, while loop is not directly possible. total = 10 while total != 0 ...
New
jaysoifer
Is there a way to rollback a specific migration and only that one ("skipping" all the other ones)? Would mix ecto.rollback -v 2008090...
New
johnnyicon
Hi all, I've just started learning Elixir and Phoenix Framework, so please pardon my n00bness at this stage. I'm trying to use Postg...
New
shahryarjb
Hello, I have map which I want to convert it to string like this: the map: %{last_name: "tavakkoli", name: "shahryar"} the string I ne...
New
jononomo
I am trying to figure out how Mix knows whether the environment is test, dev, or prod -- where is this set? Thanks.
New
hariharasudhan94
lets say i have a sample like a = 20; b = 10; if (a &gt; b) do {:ok, "a"} end if (a &lt; b) do {:ok, b} end if (a == b) do {:ok, "eq...
New
rms.mrcs
Hi, I need to transform a list of numbers into a map where the keys are the indexes and the values are the original values of the list....
New
lucidguppy
I have a super simple question about elixir - how would I take a file like this foo bar baz and output a new file that enumerates th...
New
dotdotdotPaul
Okay, I'm having a heck of a time trying to figure out how to best handle the validation of belongs_to associations in Ecto. I'm sure I'...
New

Other popular topics Top

TunkShif
This post is an instruction guide to help you setup your Neovim for Elixir development from scratch. It includes general information on h...
274 41454 115
New
mcarvalho
What is the difference between System.get_env and Application.get_env? For example, what are best practices to use one versus another.
New
lessless
I believe there are people here who are dealing with CSV files import on the daily basis, and since Excel is a really popular tool there ...
New
belgoros
I’m not a pro in using Regex and can’t figure out why the following behaviour happens, especially if we take into account the difference ...
New
grych
Hi folks, Few months ago I have announced the proof-of-concept of the library to manipulate the browsers DOM objects directly from Elixi...
639 52238 488
New
saif
Hello everyone, Long time lurker first time poster here. I’ve recently begun working on Elixir full-time again! :raised_hands: It’s been...
New
nsuchy
Hi. I’ve noticed that Windows Powershell has it’s own IEX command and you cannot access Elixir’s IEX due to the conflict. This isn’t a cr...
New
sergio_101
I am VERY much an elixir newbie. I have taken one elixir course and one phoenix course on Udemy. During that course, I saw the instructor...
New
AstonJ
We’ve put together this wiki for Phoenix LiveView - please feel free to add any info you feel is worth including. What is Phoenix LiveV...
New
hariharasudhan94
I would like to know what is the best IDE for elixir development?
New

We're in Beta

About us Mission Statement