defmodule Foo do
  def run(file_name) do
    File.open!(file_name, [:read], fn f ->
      IO.stream(f, :line) |> Enum.each(&process_line/1)
    end)
  end

  defp process_line(line) do
    String.rstrip(line) |> String.split(",")
  end
end
[ file_name | _ ] = System.argv
Foo.run(file_name)
It's so much slower than this Ruby code:
def run(file_name)
  File.open file_name, "r" do |f|
    f.each_line { |l| process_line(l) }
  end
end

def process_line(line)
  line.chomp.split(",")
end
run(ARGV[0])
Elixir:
$ time elixir test.exs ../data_gen/posts.csv
real 0m24.496s
user 0m23.527s
sys 0m1.983s
Ruby:
$ time ruby test.rb ../data_gen/posts.csv
real 0m6.556s
user 0m6.444s
sys 0m0.100s
I suspect it’s because I’m not streaming the lines properly?
Running an Elixir script means compiling it and then calling into the compiled module. I do not think that, for such a short script, the compile time will make any significant difference.
I’m writing a distributed computation framework (basically an Apache Storm rip-off) in both Elixir and Ruby. I’m at the point now where I can benchmark and analyze post-computation statistics. As expected, Elixir beats the crap out of Ruby in all benchmarks… except distributed join. And it all hinges on the fact that the tuples I’m joining are read off the file system.
It’s hard to show off benchmarks of the framework when I’m completely limited by file I/O. I tried to parallelize it by breaking the file into 10 pieces and having 10 processes read it, one per piece. That helped a lot (on a machine with 16 CPUs), but the components of the topology that aren’t reading the file still spend most of their time idle, waiting for input.
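In rough Elixir, the split-and-read idea looks something like this (a sketch only: the ParallelRead name, the fixed chunk count, and the byte_size stand-in are illustrative, and real chunk boundaries would still have to be re-aligned to newlines):

defmodule ParallelRead do
  # Sketch: split the file into N byte ranges and read each range in its own process.
  def run(path, chunks \\ 10) do
    size = File.stat!(path).size
    chunk_size = div(size, chunks) + 1

    0..(chunks - 1)
    |> Enum.map(fn i ->
      Task.async(fn ->
        {:ok, f} = :file.open(path, [:read, :binary])

        data =
          case :file.pread(f, i * chunk_size, chunk_size) do
            {:ok, bytes} -> bytes
            :eof -> ""
          end

        :file.close(f)
        byte_size(data) # stand-in for the real per-chunk processing
      end)
    end)
    |> Enum.map(&Task.await(&1, :infinity))
    |> Enum.sum()
  end
end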
Anyway, I have a lot more to say about it (and more Elixir performance questions), but that’s for another post…
Measuring with time is a bit misleading, since you’re also measuring the time to start the VM. It probably won’t make a big difference in this case, but to get a more precise number, use :timer.tc.
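For instance, in the script above, something like this (a minimal sketch) would time just the work and skip VM startup:

{microseconds, _result} = :timer.tc(fn -> Foo.run(file_name) end)
IO.puts("Foo.run took #{microseconds / 1_000_000}s")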
Also, since chomp and split in Ruby are implemented in C, I’d expect some constant-factor difference in Ruby’s favour. Erlang was not designed for raw CPU speed, so CPU-bound processing will be slower. In my experience, that’s usually not a problem (i.e. the difference is not significant).
If you need to perform some intensive, long-running computation, then it’s worth considering doing it in something else. If that’s only a small part of your system, then you could integrate external code into Erlang. There are a couple of options for that, such as ports or NIFs, which would allow you to implement the performance-sensitive parts in e.g. Rust or C and invoke them from Elixir/Erlang.
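A minimal port sketch, just to show the shape of it (wc -l here is only a stand-in for whatever external program would do the heavy work):

port = Port.open({:spawn, "wc -l ../data_gen/posts.csv"}, [:binary, :exit_status])

receive do
  {^port, {:data, output}} -> IO.puts("external result: #{output}")
after
  5_000 -> IO.puts("timed out waiting for the port")
end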
If most of your code is CPU oriented, and you don’t need fault tolerance, high availability, stable latency, and fair scheduling, then perhaps Elixir/Erlang are not the best tools for the job.
Getting fast I/O stream processing in Elixir does require some non-obvious tweaks. Due to the way I/O works on the BEAM (there’s a special process that does I/O and passes results to your process as a message), you want the messages to be as long as possible to avoid overheads.
Doing it line by line is about the slowest way possible.
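For example, here is a sketch of the original Foo module with only the reading changed: the file is opened with a large read-ahead buffer, so the I/O server hands over fewer, bigger messages (the 64 KB size is a guess, not a tuned value):

defmodule FasterFoo do
  def run(file_name) do
    file_name
    |> File.stream!([read_ahead: 64 * 1024], :line)
    |> Enum.each(&process_line/1)
  end

  defp process_line(line) do
    String.rstrip(line) |> String.split(",")
  end
end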
The tricks I have learned are documented in
But in general, the bigger the chunks you read and process, the faster it goes. For a 4 MB CSV file I would just slurp the whole thing into memory and operate on the resulting binary.
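Roughly along these lines (a sketch; whether this beats streaming depends on the file size and the disk):

file_name
|> File.read!()
|> String.split("\n", trim: true)
|> Enum.each(fn line -> String.split(line, ",") end)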
Hmm, using separate processes to read separate sections of the file is one trick I have not tried. However, everything requires benchmarks on the hardware you’re actually going to run on. Latency is everything when it comes to schemes for concurrent I/O.
My guess is that the size of the file would have to be such that it gets to the flat part of the “just slurp the file” curve. On a MacBook with an SSD that’s a pretty big file, but on a Linux server pretending S3 is a filesystem, it might be quite small.