Today I ran a small exercise in which I compared each word in a set of text files against a list of dirty words. I downloaded around 20 books from http://www.gutenberg.org/browse/scores/top and used the Flow example from Flow's documentation (in the "Avoiding single sources" section), just replacing the reduce function with a new one.
I did the exercise in both Elixir and Java to see how much better it would be in Elixir. It turns out the Elixir version was 5x slower than the Java implementation.
This has me a bit worried that either Elixir isn't as performant as I thought, or that Elixir is only good for very simple things, or that I'm just doing it wrong.
Below is the Elixir code I was testing:
    def start3() do
      streams = for file <- File.ls!("test/resources") do
        File.stream!("test/resources/#{file}", read_ahead: 100_000)
      end

      streams
      |> Flow.from_enumerables()
      |> Flow.map(&String.replace(&1, "\n", ""))
      |> Flow.map(&String.downcase(&1))
      |> Flow.flat_map(&String.split(&1, " "))
      |> Flow.filter(&String.starts_with?(&1, @badwords))
      |> Flow.partition()
      |> Flow.reduce(fn -> %{} end, fn word, acc ->
        Map.update(acc, word, 1, & &1 + 1)
      end)
      |> Enum.to_list()
    end
Here is the Java code:
    public Map<String, Integer> test3() throws IOException {
        File dir = new File("/Users/jeramy/dev/elixir/veronica/test/resources");
        Stream<Path> filesStream = Files.list(dir.toPath());
        final Map<String, Integer> foundDirtyWords = new HashMap<>();
        filesStream.forEach(fileStream -> {
            try (Stream<String> stream = Files.lines(fileStream)) {
                stream.map(line -> line.replace("\n", ""))
                    .flatMap(line -> Arrays.stream(line.split(" ")))
                    .filter(word -> dirtyWords.stream().anyMatch(w -> word.toLowerCase().contains(w.toLowerCase())))
                    .forEach(word -> {
                        int val = foundDirtyWords.getOrDefault(word, 0);
                        foundDirtyWords.put(word, val + 1);
                    });
            } catch (IOException e) {
                // do nothing. this is just a test
            }
        });
        return foundDirtyWords;
    }
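One caveat about the comparison: the two versions above do not do quite the same work. The Elixir pipeline lowercases each line and keeps words that start with a bad word, while the Java version keeps the original casing and matches with contains. For a closer apples-to-apples measurement, a Java variant that mirrors the Elixir steps might look roughly like the sketch below. The class name `DirtyWordCount`, the helper `countMatches`, and the sample word list are made up for illustration, and the sketch assumes the bad-word list is already lowercase.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Stream;

public class DirtyWordCount {

    // Hypothetical helper (not part of the original test): counts lowercased
    // words that start with any entry in badWords, mirroring the
    // downcase -> split -> starts_with? -> count steps of the Elixir pipeline.
    // Assumes every entry in badWords is already lowercase.
    static Map<String, Integer> countMatches(Stream<String> lines, List<String> badWords) {
        Map<String, Integer> counts = new HashMap<>();
        lines.map(line -> line.toLowerCase(Locale.ROOT))
             .flatMap(line -> Arrays.stream(line.split(" ")))
             .filter(word -> badWords.stream().anyMatch(word::startsWith))
             .forEach(word -> counts.merge(word, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative stand-ins for the real word list and book contents.
        List<String> badWords = List.of("darn", "heck");
        Map<String, Integer> counts = countMatches(
            Stream.of("Darn it all", "what the heck darnit", "darn"), badWords);
        System.out.println(counts);
    }
}
```

In the real test the lines would come from `Files.lines(path)` for each book rather than an in-memory stream; the helper takes a `Stream<String>` only so the matching logic can be tried in isolation.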
Any thoughts or suggestions?