This is a fairly simple task. I would even call it a “beginner level” one.
To split it into separate parts, let’s look at the requirements:
This means that we need to read the files somehow, and if there are none, read stdin.
OK, so we will need to chunk the data by words. In general, a word is understood to be a whitespace-separated string, and I will treat words as such.
OK, so before we count the words we need to normalise the strings to lowercase (going by the example above) and get rid of the “punctuation” (I assume that by punctuation they mean non-alphanumeric characters; they do not say anything about digits, so I assume these also count as “words”).
So we need to use Stream and read lazily. This will greatly help us with point 1.
This is quite obvious.
So first we need to know whether there are any params; if so, we provide the content of the files they point to, and if not, we use stdin.
- We need to know the params, so we open the Elixir documentation and search for argv.
- We need to check what it returns (in C, argv contains the program name), so we create a simple test program, test.exs:
IO.inspect System.argv
We run it as elixir test.exs and as elixir test.exs foo, and we see that it contains only the arguments (no program name). Perfect.
- We know that we need to read files and standard input, and that we need to use Stream, so again we search the docs for File.stream (it seems that it will fail on a non-existent file; fortunately they didn’t say a word about error handling) and IO.stream. Nice, we have them both.
- Now we can create a function that gives us our stream of data:
def stream, do: build_stream(System.argv())
defp build_stream([]), do: IO.stream(:stdio, :line) # situation when we do not have any args
defp build_stream(files) when is_list(files),
  do: files |> Enum.map(&File.stream!/1) |> Stream.concat()
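To see what build_stream/1 produces for a list of files, here is a minimal sketch. It writes two throwaway files into the system temporary directory (the file names are made up for this example) and concatenates their line streams in the same way:

```elixir
# Hypothetical scratch files, created only for this illustration.
dir = System.tmp_dir!()
first = Path.join(dir, "trigram_demo_a.txt")
second = Path.join(dir, "trigram_demo_b.txt")

File.write!(first, "one\ntwo\n")
File.write!(second, "three\n")

# Same shape as build_stream/1: one lazy line stream per file,
# then all of them glued into a single stream.
lines =
  [first, second]
  |> Enum.map(&File.stream!/1)
  |> Stream.concat()
  |> Enum.to_list()

IO.inspect(lines)
# => ["one\n", "two\n", "three\n"]

# Clean up the scratch files.
File.rm!(first)
File.rm!(second)
```

Note that every line keeps its trailing newline, which matters for the normalisation step.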
Now we have reading covered. Step two isn’t grouping files, but normalising them:
- We need all data to be lowercase; however, searching the docs for that gives us nothing. But what about the other name, downcase? Bingo!
- Now we need to get rid of the pesky punctuation. As we agreed earlier, punctuation is everything that is not a letter, a digit, or a space.
So how can we “get rid” of it? The simplest solution: replace it with an empty string.
Our normalize/1 function:
def normalize(str) when is_binary(str) do
  str
  |> String.downcase()
  |> String.replace(~r/[^0-9a-z ]/, "")
end
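A quick way to convince ourselves what normalize/1 does is to feed it a few lines. The sketch below wraps the same body in a throwaway module (NormalizeDemo is a name made up for this example) so that it runs on its own:

```elixir
defmodule NormalizeDemo do
  # Same body as normalize/1 above, wrapped in a hypothetical module
  # so the snippet is self-contained.
  def normalize(str) when is_binary(str) do
    str
    |> String.downcase()
    |> String.replace(~r/[^0-9a-z ]/, "")
  end
end

IO.inspect(NormalizeDemo.normalize("Hello, World!\n"))
# => "hello world"

# The newline is not in the allowed set, so a blank line becomes "".
IO.inspect(NormalizeDemo.normalize("\n"))
# => ""

# Characters are deleted, not replaced by a space,
# so hyphenated words get fused together.
IO.inspect(NormalizeDemo.normalize("well-known fact"))
# => "wellknown fact"
```

The last case is a consequence of replacing with an empty string; whether a hyphenated word should count as one word or two is not something the requirements settle.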
Great, now we have each line of the input normalised via:
stream
|> Stream.map(&normalize/1)
It is also worth dropping empty strings:
stream
|> Stream.reject(& &1 == "")
Next, we need to split the lines into words. This is as simple as calling String.split on each line; however, it would be easier for us to have one uniform stream of words instead of a stream of per-line word lists. So instead of Stream.map we use Stream.flat_map. So we have:
stream
|> Stream.flat_map(&String.split/1)
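The difference between the two is easiest to see on a tiny input. This is only an illustration with made-up lines:

```elixir
lines = ["the quick brown", "fox jumps"]

# Stream.map keeps one list of words per line.
nested = lines |> Stream.map(&String.split/1) |> Enum.to_list()
IO.inspect(nested)
# => [["the", "quick", "brown"], ["fox", "jumps"]]

# Stream.flat_map flattens them into a single sequence of words.
flat = lines |> Stream.flat_map(&String.split/1) |> Enum.to_list()
IO.inspect(flat)
# => ["the", "quick", "brown", "fox", "jumps"]
```

With the flat version, a three-word sequence can start on one line and end on the next.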
Now we need to split them into chunks of three consecutive words via Stream.chunk_every/4 and then join each chunk back into a single, space-separated string:
stream
|> Stream.chunk_every(3, 1, :discard) # we aren't interested in last, non full, entries
|> Stream.map(&Enum.join(&1, " "))
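As a small illustration of the sliding window (the word list is made up), a count of 3 with a step of 1 moves forward one word at a time:

```elixir
words = ~w(the quick brown fox jumps)

trigrams =
  words
  # Window of 3 words, advancing by 1; incomplete trailing windows are dropped.
  |> Stream.chunk_every(3, 1, :discard)
  |> Stream.map(&Enum.join(&1, " "))
  |> Enum.to_list()

IO.inspect(trigrams)
# => ["the quick brown", "quick brown fox", "brown fox jumps"]
```

Five words give three full windows; an input with fewer than three words gives none.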
And now the main dish - counting, which in this situation is dumb easy:
stream
|> Enum.reduce(%{}, fn chunk, acc ->
  Map.update(acc, chunk, 1, & &1 + 1) # update element in map by 1; if there is none, add it with value 1
end)
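Here is the same reduction on a made-up list of chunks. On Elixir 1.10 or newer, Enum.frequencies/1 should give the same map, so it can serve as a cross-check or as a shorter alternative:

```elixir
chunks = ["a b c", "b c d", "a b c"]

counts =
  Enum.reduce(chunks, %{}, fn chunk, acc ->
    # First occurrence stores 1, later ones increment the stored value.
    Map.update(acc, chunk, 1, &(&1 + 1))
  end)

IO.inspect(counts)
# => %{"a b c" => 2, "b c d" => 1}

# Assumes Elixir 1.10+, where Enum.frequencies/1 is available.
IO.inspect(Enum.frequencies(chunks) == counts)
# => true
```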
Now we have a map of the form %{chunk => count} and we need to sort it by the number of occurrences:
map
|> Enum.sort_by(&elem(&1, 1), &>=/2)
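Enumerating a map yields {key, value} tuples, which is why elem(&1, 1) picks out the count. A small sketch with made-up counts:

```elixir
counts = %{"a b c" => 2, "b c d" => 5, "c d e" => 1}

# Sort the {chunk, count} tuples by count, largest first.
sorted = Enum.sort_by(counts, &elem(&1, 1), &>=/2)

IO.inspect(sorted)
# => [{"b c d", 5}, {"a b c", 2}, {"c d e", 1}]

# Assumes Elixir 1.10+: :desc can be passed instead of a comparison function.
sorted_desc = Enum.sort_by(counts, &elem(&1, 1), :desc)
IO.inspect(sorted_desc == sorted)
# => true
```

The example avoids ties on purpose: the relative order of chunks with equal counts depends on the order in which the map is enumerated.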
And display top 100:
list
|> Enum.take(100)
|> Enum.each(fn {chunk, count} ->
  IO.puts([Integer.to_string(count), " - ", chunk])
end)
The whole program looks like this:
defmodule Input do
  def stream, do: build_stream(System.argv())

  defp build_stream([]), do: IO.stream(:stdio, :line) # situation when we do not have any args

  defp build_stream(files) when is_list(files),
    do: files |> Enum.map(&File.stream!/1) |> Stream.concat()

  def normalize(str) when is_binary(str) do
    str
    |> String.downcase()
    |> String.replace(~r/[^0-9a-z ]/, "")
  end
end

Input.stream()
|> Stream.map(&Input.normalize/1)
|> Stream.reject(& &1 == "")
|> Stream.flat_map(&String.split/1)
|> Stream.chunk_every(3, 1, :discard) # we aren't interested in last, non full, entries
|> Stream.map(&Enum.join(&1, " "))
|> Enum.reduce(%{}, fn chunk, acc ->
  Map.update(acc, chunk, 1, & &1 + 1) # update element in map by 1; if there is none, add it with value 1
end)
|> Enum.sort_by(&elem(&1, 1), &>=/2)
|> Enum.take(100)
|> Enum.each(fn {chunk, count} ->
  IO.puts([Integer.to_string(count), " - ", chunk])
end)
You can check for yourself that it works.
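One way to try it without preparing any files is to run the same pipeline over an in-memory list of lines. The sketch below is only an illustration: the TrigramDemo module and its top/2 function are names invented here, and the function returns the sorted list instead of printing it:

```elixir
defmodule TrigramDemo do
  # Hypothetical helper: the same steps as the program above, but it accepts
  # any enumerable of lines and returns the sorted {chunk, count} tuples.
  def top(lines, limit \\ 100) do
    lines
    |> Stream.map(&normalize/1)
    |> Stream.reject(&(&1 == ""))
    |> Stream.flat_map(&String.split/1)
    |> Stream.chunk_every(3, 1, :discard)
    |> Stream.map(&Enum.join(&1, " "))
    |> Enum.reduce(%{}, fn chunk, acc -> Map.update(acc, chunk, 1, &(&1 + 1)) end)
    |> Enum.sort_by(&elem(&1, 1), &>=/2)
    |> Enum.take(limit)
  end

  # Lowercase the line and delete everything except a-z, 0-9 and spaces.
  defp normalize(str) do
    str
    |> String.downcase()
    |> String.replace(~r/[^0-9a-z ]/, "")
  end
end

result = TrigramDemo.top(["The cat sat.\n", "\n", "The cat sat!\n"])

IO.inspect(hd(result))
# => {"the cat sat", 2}
```

The input yields the words "the cat sat the cat sat", so "the cat sat" appears twice, while "cat sat the" and "sat the cat" appear once each.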