Standard Libraries vs Tail-Recursion

Hi, I have been working on some code that counts the words in a file. I mostly used standard library functions from Elixir, then ran some tests to see how it behaves. I tried a simple text with a few words, and then something absolutely unreal with 500k lines of text… The first thing I noticed is that memory usage is very high on the large file, and also that some memory stays in use even after the code finishes. I thought it's probably because I used Enum and Map instead of tail-call recursion. Does tail-call recursion solve this? Why does that memory stay in use? How would you improve the code?
Here is my code:

defmodule RepeatedWords do

  def word_count(file_path) do
    file_path
    |> File.stream!
    |> sanitize
    |> count
    |> order
  end

  defp sanitize(words) do
    words
    |> Stream.map(&String.split(&1, ~r/(*UTF)\W/, trim: true))
    |> Enum.to_list
    |> List.flatten
    |> Enum.map(&String.downcase(&1))
  end

  defp count(words) when is_list(words) do
    Enum.reduce(words, %{}, &count_acc/2)
  end

  defp count_acc(word, acc) do
    Map.update(acc, String.to_atom(word), 1, &(&1 + 1))
  end

  defp order(words_count) do
    words_count
    |> Map.to_list()
    |> Enum.sort_by(&(elem(&1, 1)), :desc)
  end

end

You probably want to :binary.copy/1 the individual chunks, to avoid keeping references to the containing binary forever. The earlier you copy, the earlier the outer references can be GC'd.
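To make that concrete, here is a minimal sketch (the line and variable names are made up for illustration). `String.split/3` can return sub-binaries that reference the original line, so copying each word lets the line itself be collected. This only matters for lines longer than 64 bytes, since smaller binaries live on the process heap and are copied anyway.

```elixir
# String.split/3 may return sub-binaries that keep the whole source
# line alive; :binary.copy/1 makes each word an independent binary so
# the original line can be collected as soon as it is no longer used.
line = "The quick brown fox jumps over the lazy dog"

words =
  line
  |> String.split(~r/(*UTF)\W/, trim: true)
  |> Enum.map(&:binary.copy/1)
```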

Also, words |> Stream.map/2 |> Enum.to_list can be written as words |> Enum.map/2. The Enum.to_list/1 call forces the stream anyway.

Never do this! Just use the string as the key.


String.to_atom(word)

Here is your answer: as @NobbZ says, avoid this. Atoms on the BEAM are never garbage-collected, and that is likely why your memory stays elevated even after the code finishes.


A general note about using Stream functions: the benefit of streaming is not holding the whole file in memory. Calling Enum.to_list on the stream totally eliminates that benefit.

For instance:

readFile
|> File.stream!
|> Stream.flat_map(&String.split(...))
|> Stream.map(&String.downcase/1)
|> Enum.frequencies()

At runtime, this will only ever hold one line's worth of memory, along with the map of words → counts.
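For example, Enum.frequencies/1 pulls elements from the (possibly lazy) enumerable one at a time and only accumulates the counts map:

```elixir
# Only the current element and the counts map are live at any point;
# the Stream stage is evaluated lazily as Enum.frequencies/1 pulls.
counts =
  ["The", "cat", "saw", "the", "cat"]
  |> Stream.map(&String.downcase/1)
  |> Enum.frequencies()
```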


This still has the references problem, as no explicit copy of the split substrings happens.


Looks like String.downcase reassembles the string from individual codepoints:

That should lose any reference to the original un-split binary, right?


Indeed that should do it.


Thanks all for the answers! I updated the code with some of your advice. It runs so efficiently now!

Here you can see it:

defmodule RepeatedWords do

  def word_count(file_path) do
    file_path
    |> File.stream!
    |> sanitize
    |> count
    |> order
  end

  defp sanitize(words) do
    words
    |> Stream.flat_map(&String.split(&1, ~r/(*UTF)\W/, trim: true))
    |> Stream.map(&String.downcase(&1))
  end

  defp count(words) do
    Enum.frequencies(words)
  end

  defp order(words_count) do
    words_count
    |> Enum.sort_by(&(elem(&1, 1)), :desc)
  end

end

I will still make more updates to this, but I'm fascinated by the memory reduction from these changes (it went from 3 GB of memory on the large file to never exceeding 25 MB, no matter what).
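For anyone who wants to reproduce numbers like these, one rough way (a sketch, not a precise profiler) is to compare :erlang.memory(:total) before and after the call; the `measure` helper below is a made-up name for illustration:

```elixir
# Rough before/after snapshot of total VM memory around a function
# call, reported in MiB. Not exact (other processes allocate too),
# but good enough to spot differences on the scale of GB vs MB.
measure = fn fun ->
  before = :erlang.memory(:total)
  result = fun.()
  used_mib = (:erlang.memory(:total) - before) / 1_048_576
  {result, Float.round(used_mib, 2)}
end
```

Usage would look something like `{counts, mib} = measure.(fn -> RepeatedWords.word_count("big.txt") end)`.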
