Help with performance (file io)

cjbottaro · June 7, 2016, 2:20am

Hello,

Why is this Elixir code:

defmodule Foo do

  def run(file_name) do
    File.open! file_name, [:read], fn f ->
      IO.stream(f, :line) |> Enum.each(&process_line/1)
    end
  end

  defp process_line(line) do
    String.rstrip(line) |> String.split(",")
  end

end

[ file_name | _ ] = System.argv

Foo.run(file_name)

So much slower than this Ruby code:

def run(file_name)
  File.open file_name, "r" do |f|
    f.each_line{ |l| process_line(l) }
  end
end

def process_line(line)
  line.chomp.split(",")
end

run(ARGV[0])

Elixir:

$ time elixir test.exs ../data_gen/posts.csv

real	0m24.496s
user	0m23.527s
sys	0m1.983s

Ruby:

$ time ruby test.rb ../data_gen/posts.csv

real	0m6.556s
user	0m6.444s
sys	0m0.100s

I suspect it’s because I’m not streaming the lines properly?

Thanks for the help,
– C

uranther · June 7, 2016, 2:34am

Try File.stream!/3 instead of File.open! and IO.stream

cjbottaro · June 7, 2016, 3:01am

Thanks! It’s twice as fast now:

$ time elixir test.exs ../data_gen/posts.csv

real	0m12.135s
user	0m11.734s
sys	0m0.807s

(using this code)

  def run(file_name) do
    File.stream!(file_name, [:read])
      |> Enum.each(&process_line/1)
  end

Still twice as slow as Ruby though…

Any other tips? Thanks again.

uranther · June 7, 2016, 3:10am

With String.rstrip(line) |> String.split(","), you are ripping through the line multiple times. Try using the trim: option instead:

  defp process_line(line) do
    line
    |> String.split(",", trim: true)
  end
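One thing worth checking before relying on this: as far as I know, trim: true in String.split only drops empty strings from the result. It does not strip the trailing newline, so it is not a drop-in replacement for rstrip. A small sketch of the difference, using a made-up sample line and assuming Elixir 1.3+ for String.trim_trailing:

```elixir
# Hypothetical sample line, as a line-based stream would hand it over.
line = "1,,hello\n"

# trim: true removes the empty field but keeps the newline on the last field.
with_trim = String.split(line, ",", trim: true)
# => ["1", "hello\n"]

# Stripping the newline first keeps the empty field and drops the newline.
stripped = line |> String.trim_trailing("\n") |> String.split(",")
# => ["1", "", "hello"]
```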

cjbottaro · June 7, 2016, 3:13am

That change made no significant difference… :sad face:

sotojuan · June 7, 2016, 3:52am

Wild/dumb guess but maybe because it has to spin up a VM to run?

cjbottaro · June 7, 2016, 3:58am

I assume that accounts for milliseconds, not many seconds…

NobbZ · June 7, 2016, 5:13am

I’m not sure how fast BEAM boots up, but have you tried profiling using fprof?

dom · June 7, 2016, 1:07pm

What version of Ruby is that? Not all of them will use UTF-8 by default.

You could also load the whole file into memory first, rather than stream. If I’m not mistaken that’s what the Ruby version does.

andre1sk · June 7, 2016, 1:33pm

build executable with escript?

NobbZ · June 7, 2016, 1:48pm

Running an Elixir script means compiling it and then calling into the compiled module. I do not think that, for such a short script, the compile time will make any significant difference.

andre1sk · June 7, 2016, 2:12pm

Yep you are right

andre1sk · June 7, 2016, 2:23pm

On a 4 MB CSV file, PHP is about 10x faster.

cjbottaro · June 7, 2016, 2:39pm

Oye, so that’s it, huh.

I’m writing a distributed computation framework (basically an Apache Storm rip off) in both Elixir and Ruby. I’m at the point now where I can benchmark and analyze post-computation statistics. As expected, Elixir beats the crap out of Ruby in all benchmarks… except distributed join. And it all hinges around the fact that the tuples I’m joining are read off the file system.

It’s hard to show off benchmarks of the framework when I’m completely limited by file io. I tried to parallelize it by breaking the file into 10 pieces and having 10 processes read it. And that helped a lot (on a machine with 16 cpus), but still, the components of the topology that aren’t reading the file spend most of their time idle waiting for input.

Anyway, I have a lot more to say about it (and more Elixir performance questions), but that’s for another post…

sasajuric · June 7, 2016, 2:54pm

Measuring with time is a bit misleading, since you’re also measuring the time to start the VM. It probably won’t make a big difference in this case, but to get a more precise number, use :timer.tc.
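A minimal sketch of timing from inside the VM with :timer.tc, which returns the elapsed time in microseconds together with the function's result. The Timing module and its seconds/1 function are hypothetical names used only for illustration:

```elixir
defmodule Timing do
  # Runs a zero-arity function and returns {elapsed_seconds, result}.
  # :timer.tc/1 reports microseconds, so convert to seconds here.
  def seconds(fun) when is_function(fun, 0) do
    {micros, result} = :timer.tc(fun)
    {micros / 1_000_000, result}
  end
end
```

In the script above this could be called as Timing.seconds(fn -> Foo.run(file_name) end), which leaves VM startup and script compilation out of the measurement.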

Also, since chomp and split in Ruby are implemented in C, I’d expect some constant factor difference in Ruby’s favour. Erlang was not designed for raw CPU speed, so CPU bound processing will be slower. In my experience, that’s usually not a problem (i.e. the difference is not significant).

If you need to perform some long, intensive computation, then it’s worth considering doing it in something else. If that’s only a small part of your system, then you could integrate external code into Erlang. There are a couple of options for that, such as ports or NIFs, which would allow you to implement the performance-sensitive parts in e.g. Rust or C and invoke them from Elixir/Erlang.

If most of your code is CPU oriented, and you don’t need fault tolerance, high availability, stable latency, and fair scheduling, then perhaps Elixir/Erlang are not the best tools for the job.

andre1sk · June 7, 2016, 2:59pm

If you use IO.binstream instead of File.stream!, it speeds things up about 2x.
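Roughly what that looks like, as a sketch rather than the exact code I ran. BinFoo and count_fields/1 are hypothetical names, the function counts fields only so that it returns something checkable, and String.trim_trailing assumes Elixir 1.3+:

```elixir
defmodule BinFoo do
  # Streams the file line by line as raw bytes (no Unicode conversion)
  # and returns the total number of comma-separated fields seen.
  def count_fields(file_name) do
    File.open!(file_name, [:read], fn f ->
      f
      |> IO.binstream(:line)
      |> Enum.reduce(0, fn line, acc -> acc + length(process_line(line)) end)
    end)
  end

  defp process_line(line) do
    line |> String.trim_trailing("\n") |> String.split(",")
  end
end
```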

cjbottaro · June 7, 2016, 3:55pm

Friend sent me this:

http://blog.alainodea.com/en/article/398/blazing-fast-concurrent-text-i-o-in-erlang

bbense · June 7, 2016, 5:05pm

Getting fast I/O stream processing in Elixir does require some non-obvious tweaks. Due to the way I/O works on the BEAM (there’s a special process that does I/O and passes results to your process as a message), you want the messages to be as long as possible to avoid overheads. Doing it line by line is about the slowest way possible.

The tricks I have learned are documented in

But in general, the bigger the chunks you read and process, the faster it goes. For a 4 MB CSV file I would just slurp the whole thing into memory and operate on the resulting binary.
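A minimal sketch of the slurp approach, assuming the file fits comfortably in memory and uses "\n" line endings. SlurpFoo is a hypothetical module name:

```elixir
defmodule SlurpFoo do
  # Reads the whole file in one call, then splits the binary in memory.
  # trim: true drops the empty string left after the final newline.
  def run(file_name) do
    file_name
    |> File.read!()
    |> String.split("\n", trim: true)
    |> Enum.map(&String.split(&1, ","))
  end
end
```

This trades memory for fewer, larger reads, so it only makes sense while the file is small relative to available RAM.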

bbense · June 7, 2016, 5:13pm

Hmm, using separate processes to read separate sections of the file is one trick I have not tried. However, everything requires benchmarks on the hardware you’re actually going to run on. Latency is everything when it comes to schemes for concurrent I/O.

My guess is that the size of the file would have to be such that it gets to the flat part of the “just slurp the file” curve. On a MacBook with an SSD that’s a pretty big file, but on a Linux server pretending S3 is a filesystem, it might be quite small.

cjbottaro · June 7, 2016, 5:35pm

Well, part of that trick is partitioning the file by hand ahead of time with head -n | tail -n…

Not sure how to do that automatically without (slowly) scanning the lines.
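One idea that might avoid the full scan, untested on the real data: split by byte offset instead of line number, then nudge each offset forward to the next line start. That costs one seek and one short read per partition rather than a pass over every line. A sketch under those assumptions; ChunkOffsets and ranges/2 are hypothetical names, and it assumes "\n"-terminated lines:

```elixir
defmodule ChunkOffsets do
  # Returns a list of {start, stop} byte ranges covering the file, where
  # every range starts at the beginning of a line. Empty ranges are dropped.
  def ranges(path, parts) when is_integer(parts) and parts > 0 do
    %File.Stat{size: size} = File.stat!(path)
    {:ok, fd} = :file.open(path, [:read, :binary, :raw])

    # Evenly spaced byte offsets, each moved forward to a line boundary.
    starts =
      for i <- 0..(parts - 1) do
        align(fd, div(size * i, parts))
      end

    :ok = :file.close(fd)

    # Lines longer than one slice can make neighbouring offsets collapse.
    starts = Enum.uniq(starts)

    starts
    |> Enum.zip(tl(starts) ++ [size])
    |> Enum.reject(fn {start, stop} -> start == stop end)
  end

  defp align(_fd, 0), do: 0

  defp align(fd, offset) do
    # Step back one byte so an offset that already sits on a line start
    # is kept: the read then consumes only the preceding newline.
    {:ok, _} = :file.position(fd, offset - 1)

    case :file.read_line(fd) do
      {:ok, _rest_of_line} ->
        {:ok, pos} = :file.position(fd, :cur)
        pos

      :eof ->
        offset
    end
  end
end
```

Each reader process could then take one {start, stop} pair and read only that slice, for example with :file.pread/3, without any partition files being written ahead of time.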