Help with performance (file I/O)

Correct me if I’m wrong, but if you aren’t using Mix then your protocols aren’t consolidated, so there’s a massive overhead to using the Enumerable protocol, which underlies all of your Stream / Enum stuff.
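
For anyone who wants to check that on their setup, there’s a built-in probe:

# typically true under `mix run` or in a release (protocols are consolidated
# at build time), and false in a plain `elixir script.exs` or iex session:
Protocol.consolidated?(Enumerable)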

1 Like

The problem with most of these benchmarks is that what they actually do with the data is relatively trivial. Since most other languages use more or less exactly the same I/O flow, these benchmarks are reasonable tests of the relative speed of the text processing in each language.

In the case of Elixir and the BEAM, the I/O subsystem is so different that its costs dominate the equation and you end up comparing apples to oranges. There are a lot of tricks you can do to make the BEAM more orange-y, but ultimately, Elixir and the BEAM are not the optimal solution for every problem.

The BEAM makes a lot of very complicated things quite easy, but there is no free lunch. The price you pay for easy concurrency is complexity at the boundaries.

2 Likes

This topic got me interested both as an exercise with basic parts of Elixir and because I love writing shell scripts, and I’m sure I’m not the only one considering Elixir a nice language not just for servers, but for tooling too. Pattern matching, easy packaging with escript, self-supervised daemons - the list goes on and on :slight_smile:

So I’ve also spent some time researching and optimizing, and as a result I wrote the following blog post:

Elixir vs Ruby: File I/O performance

There’s also accompanying source code available for everyone to play with.

Anyway, let me comment on some of your suggestions after trying them out:

@NobbZ: I do not think that for that short script the compile time will make any significant difference.
@andre1sk: build executable with escript?

Building for prod and/or using escript decreases execution time by no more than 0.1s-0.3s.

@andre1sk: if you use IO.binstream instead of File.stream it speeds things up x2

I couldn’t replicate that. In my case File.stream is always faster than File.open + IO.stream or IO.binstream. Could you check that out on my code?
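
For reference, the two flows I’m comparing look roughly like this (file name made up):

# line stream straight from the path:
"data.csv" |> File.stream!() |> Enum.count()

# explicit open, then a line stream over the device:
"data.csv" |> File.open!() |> IO.binstream(:line) |> Enum.count()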

@cjbottaro: Anyway, I have a lot more to say about it (and more Elixir performance questions), but that’s for another post… :slight_smile:

I’d really like to read more about your findings, especially about how you managed to overcome the file I/O performance bottleneck, as I guess you had to settle on some final solution for your distributed computation framework.

@bbense Getting fast I/O stream processing in Elixir does require some non-obvious tweaks. Due to the way I/O works on the BEAM (there’s a special process that does I/O and passes results to your process as a message), you want the messages to be as large as possible to avoid overhead.

This turned out to be very true in my benchmarks. The question is how to efficiently parse large files that need to be processed on a per-line basis and are too big to be read whole, but without streaming them line by line, which is so slow in Elixir.

I’ve implemented a simple concurrent solution where each process loaded a specific part of the file and did manual binary line splitting, joining the lines broken at part boundaries, but it was slower than a simple File.stream (as it only increased the number of messages).

@bbense Hmm, using separate processes to read separate sections of the file is one trick I have not tried.

I have (with :file.pread and Task.async). It didn’t turn out well :slight_smile:
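
For reference, the shape of what I tried looks roughly like this (module name and chunk size are made up, and the re-joining of lines broken at chunk boundaries is left out):

defmodule ParallelRead do
  @chunk 1_000_000  # 1 MB slices, purely illustrative

  # read the whole file as a list of chunks, each fetched by its own task;
  # assumes a non-empty file
  def read_chunks(path) do
    size = File.stat!(path).size
    chunk_count = div(size + @chunk - 1, @chunk)

    0..(chunk_count - 1)
    |> Enum.map(fn i ->
      Task.async(fn ->
        # raw devices are bound to the process that opened them,
        # so each task opens its own handle and preads its slice
        {:ok, io} = :file.open(path, [:read, :binary, :raw])
        result = :file.pread(io, i * @chunk, @chunk)
        :ok = :file.close(io)

        case result do
          {:ok, data} -> data
          :eof -> ""
        end
      end)
    end)
    |> Enum.map(&Task.await(&1, :infinity))
  end
end

All those task results still arrive as messages, which matches the “it only increased the number of messages” observation above.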

4 Likes

There are many interesting points in that post. However, I disagree that the task is mostly I/O bound. In fact, it seems to me the task is mostly CPU bound, spending most of its time in filter_line (in the “read” version). I fiddled with it a bit, and was able to reduce the read version to about 1 sec (from 1.8 sec) with the following approach:

# accumulate digit codepoints until the first comma, then decide
defp filter_line(line) do
  filter_line(line, <<>>)
end

# a digit: append it to the accumulator and keep scanning
defp filter_line(<<c::utf8, rest::binary>>, integer) when c in ?0..?9 do
  filter_line(rest, <<integer::binary, c>>)
end
# the first comma: convert the accumulated digits and do the rem math
defp filter_line(<<?,::utf8, _::binary>>, integer) do
  num = String.to_integer(integer)
  rem(num, 2) == 0 || rem(num, 5) == 0
end
# anything else: reset the accumulator and keep scanning
defp filter_line(<<_other::utf8, rest::binary>>, _integer) do
  filter_line(rest, <<>>)
end

Basically, I’m iterating one codepoint at a time, accumulating digit codepoints (0-9) until the first observed comma. Then I convert the accumulated string of digits to an integer, and do the rem math. By doing this I don’t need to split the entire line by comma (since I don’t care about the rest anyway), and I also don’t need an additional regex scan for the terminating integer.

I didn’t do much to analyze the correctness besides verifying that the output file size is the same. I’m also not saying this is the best (or optimal) approach. The point is, as I hinted earlier in this thread, that most of the action is taking place in a few CPU bound functions (assuming you perform I/O efficiently, which you do with File.read! and File.write!). If Ruby’s versions of those operations are implemented in C, then Ruby will probably be faster.

2 Likes

Thanks for the feedback, that’s exactly what I was going after when I wrote that article :slight_smile: I’m not that good with binary pattern matching yet, so it’s nice to see it put to good use.

You’re right that the read version spends lots of time running string and regex operations in the filter_line function. I actually left it unoptimized (and as close to the Ruby equivalent as possible) on purpose, in order to focus on the file I/O itself. You could also reduce the rem(num, 2) == 0 || rem(num, 5) == 0 to a check whether the last digit is one of 0 2 4 5 6 8 :slight_smile:

Instead, I’d like to treat that part as a constant and try making things faster by changing the I/O flow itself, especially in the streaming version, which could really use a performance boost and in which no optimizations in filter_line will help much. For serious use with serious (or arbitrary) file sizes, streaming may be the only way to go.

As to being CPU and not I/O bound, do you think that’s also true for the streaming version? If it’s the message sending/receiving between processes that kills that version’s performance, then I thought it was correct to say we’re I/O bound (as it’s the I/O layer that puts this unavoidable strain on the VM). On the other hand, message passing probably involves the CPU more than the hard disk or RAM, so from that perspective we’re CPU bound in that version too. What do you think?

2 Likes

Ah, this is a great trick! I just tried it, and it reduces the running time to 0.8 sec:

# keep only lines whose first column ends in 0, 2, 4, 5, 6 or 8,
# i.e. is divisible by 2 or 5; we check the digit right before the comma
defp filter_line(<<c::utf8, ?,::utf8, _::binary>>)
  when c in [?0, ?2, ?4, ?5, ?6, ?8],
  do: true
defp filter_line(<<_::utf8, ?,::utf8, _::binary>>),
  do: false
defp filter_line(<<_other::utf8, rest::binary>>),
  do: filter_line(rest)

I don’t see why it has to look as close to Ruby as possible. Isn’t the point of this exercise to do the work as fast as possible?

If you want to limit the memory usage, then yes. To make it work faster, you can use a read-ahead buffer. Similarly, you could use delayed write on the writing side. Here’s a version which takes less than 2 secs on my machine (with the optimized filter_line):

def main([ filename, "stream" ]) do
  # buffer reads ahead and batch writes instead of touching the device per line
  File.stream!(filename, read_ahead: 100_000)
  |> Stream.filter(&filter_line/1)
  |> Stream.into(File.stream!(filename <> ".out", [:delayed_write]))
  |> Stream.run
end

It’s not as efficient as the eager one, but much faster than the original. Perhaps it can be further optimized, but I didn’t spend any time investigating it.

Well, we have to read a lot of data, and write a lot of data, so we’re partially I/O bound in both versions. In the streaming one we’re doing I/O less efficiently (but reducing the memory usage), so that’s a trade-off.

I consider I/O bound to mean we’re doing I/O operations, i.e. talking to external devices. Message passing (unless to a different machine) is not such an operation, so it’s not I/O bound in my opinion. Regardless, message passing overhead might be significant if the processes are doing little work on each message.

In any case, I’d say it’s worth first optimizing the sequential algorithm (perhaps by manually recursing, and making the decision with less processing :slight_smile:) before considering splitting the work over multiple processes. Concurrency is not a remedy for a suboptimal sequential algorithm :slight_smile:

3 Likes

If you want to limit the memory usage, then yes. To make it work faster, you can use a read-ahead buffer. Similarly, you could use delayed write on the writing side.

Wow, :delayed_write indeed is a game changer here! I’ve somehow missed it in the docs. I had already tried :read_ahead and it didn’t affect performance by much. This still holds true when used in tandem with :delayed_write, although now the benefit of increasing :read_ahead seems to be a bit bigger.
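
For the record, both options also accept explicit sizes in case the defaults don’t fit (the numbers below are illustrative, not tuned):

# larger explicit buffers on both the reading and the writing side:
File.stream!("data.csv", [{:read_ahead, 256 * 1024}])
|> Stream.filter(&filter_line/1)
|> Stream.into(File.stream!("data.csv.out", [{:delayed_write, 512 * 1024, 2_000}]))
|> Stream.run()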

I don’t see why does it have to look as close to ruby. Isn’t the point of this exercise to do the work as fast as possible?

You’re actually right. I was thinking of keeping that part of the script out of the reflections on my blog post, as a kind of constant that doesn’t change between scripts. But it’s not fair to Elixir not to use its pattern matching and recursive powers, especially since I compare the total performance anyway.

Thanks again! I’ll update the blog post with those findings and optimizations, and with corrected statements about being I/O bound.

1 Like

Yeah, the point was to optimize the code algorithmically. Having established that most of the time is spent in the code which decides whether a line should be filtered or not, I tried to reduce the processing there. We don’t need to process the entire line if we’re deciding on the first column only. This had some savings in your CSV example, but could have saved a lot more for larger rows. Your trick with checking the last digit is also a great saving, since we don’t have to collect all the digits, nor convert to an integer.

So now, we’re better than Ruby, although it would be interesting to see how Ruby would perform with similar optimizations.

Moreover, :delayed_write puts the streaming version in the same area as Ruby (which I suspect buffers by default). So my non-conclusive conclusion would be that for this contrived microbenchmark we can expect similar performance in both languages (assuming both implementations are tuned), so I don’t see problems with Elixir I/O compared to Ruby (which your otherwise great post seems to imply).

3 Likes

Just wanted to give you (and everyone here) a heads up that a heavily rewritten article on Elixir file I/O is up:

Elixir vs Ruby: File I/O performance (updated)

I think it’s worth quoting here the key conclusion on files and Elixir:

In the case of Elixir you can get similar performance if you put streams to proper use (as shown above) or if you go for a read-all-at-once approach. You can also gain a serious performance edge over Ruby if you make use of pattern matching and recursion.

Therefore, it makes most sense to write such scripts from scratch, with a precise idea about how to put unique Elixir features to use. I can see some serious use cases here that could benefit from OTP, pattern matching and streaming, like supervised CSV import/export workers, Unix daemons or command line tools. Doing a blind conversion, like I did in this experiment, makes little sense and doesn’t yield a fair comparison.

Aside from the rewritten conclusions, it also features a much more logical layout of the whole optimization process. And gives credit where credit is due :slight_smile: I hope this time I’ve nailed it and it’ll serve as a proper reference for everyone who stumbles upon this problem.

8 Likes

Thank you for this write-up! It clearly explains the various ways to slice this problem and explores Elixir/OTP strengths and weaknesses, while also bringing in the Ruby perspective. :heavy_check_mark::bookmark:

2 Likes

This looks great!

I’m puzzled about one sentence though:

I was wrong assuming that it’s the process communication that puts the biggest overhead here.

Just to be clear: there’s no process communication happening here. Everything happens in the same process, since you’re working with both files in raw mode (the default for file streams). For clarification, see “Processes and raw files” in the File docs.
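
To illustrate the difference (file name made up; the exact shape of the raw return value varies across OTP versions):

# the default (non-raw) open spawns a separate IO server process:
{:ok, device} = File.open("data.csv")
is_pid(device)  #=> true

# raw mode skips that middleman; you get a bare descriptor, not a pid:
{:ok, raw} = File.open("data.csv", [:raw])
is_pid(raw)  #=> false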

Also, I couldn’t sleep over the fact that the streamed version was about 3x slower on my machine than the read version (1.8s vs 0.6s). I played with it a bit more and discovered that streaming bytes works much faster than streaming lines. That led me to the following solution, which shaved the streaming version down to ~1s:

def main([ filename, "stream" ]) do
  out_file = File.open!(filename <> ".out", [:write, :raw, :delayed_write])

  # stream raw 4k chunks instead of lines, carrying the unfinished
  # tail of each chunk over to the next one
  File.stream!(filename, [], 4000)
  |> Enum.reduce({"", out_file}, &handle_chunk/2)

  File.close(out_file)
end

defp handle_chunk(chunk, {unfinished_line, file}) do
  (unfinished_line <> chunk)
  |> String.split("\n")
  |> process_lines(file)
end

# the last element of the split is an unfinished line: return it as the
# carry-over for the next chunk (note: this assumes the input ends with
# a newline, otherwise the final line is dropped)
defp process_lines([unfinished_line], file), do: {unfinished_line, file}
defp process_lines([line | rest], file) do
  if filter_line(line) do
    IO.binwrite(file, line <> "\n")
  end
  process_lines(rest, file)
end

Here, I’m taking chunks of 4k (larger chunks didn’t improve perf). When I read a chunk, I append it to the unfinished line from the previous chunk. Then I split on newline and process all elements except the last one. The last element is the unfinished line, which I’ll prepend to the next chunk, and repeat.

At this point, streaming version is also faster than Ruby (which is ~ 1.8s on my machine).

Note that I’m splitting the input by bytes, so I’m not sure whether this will work correctly with Unicode files.

3 Likes

Just to be clear: there’s no process communication happening here. Everything happens in the same process, since you’re working with both files in raw mode (the default for file streams). For clarification, see “Processes and raw files” in the File docs.

This is interesting. Initially I also thought there’s no process involved, but I kept that statement from the original article, assuming that even if we don’t spawn an extra process (which happens with File.open in its default non-raw config), there’s still some low-level process within the Erlang VM itself that’s responsible for serving file operations in a non-blocking fashion.

But then again, even if that were true, I can’t really be sure it could even be classified as a “process”. It may be a thread instead, and use other means of delivering data to our process than what we’d call “sending messages”. That seems to be the case according to the top part of this doc. So ultimately you’re right :slight_smile: It was a leftover from the original article, where I had clearly mixed up interprocess message passing with a tight stream loop running in the same process. I’ve fixed it.

Also, I couldn’t sleep over the fact that the streamed version was about 3x slower on my machine than the read version (1.8s vs 0.6s). I played with it a bit more and discovered that streaming bytes works much faster than streaming lines.

This is amazing. I’ve tried taking it further, with this and then this, but failed to beat your version. I wonder if it could be made even faster, perhaps by adding some clever concurrency.

Note that I’m splitting the input by bytes, so I’m not sure whether this will work correctly with Unicode files.

As far as I understood the docs, String.split is UTF-8 aware. Also, you’re using utf8 qualifiers on every pattern match, so it looks to me like it should work with UTF-8 just fine. The IO.binwrite doc includes a warning about using it with devices in Unicode mode, but that’s not the case here. I’ve added some emoticons to the input CSV and they are still there in the output, but I’m not sure that’s ultimate proof of being UTF-8 compliant :slight_smile:
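
A toy check of the chunk-boundary case (purely illustrative): in UTF-8 every byte of a multi-byte sequence has its high bit set, so a raw "\n" byte can never appear inside one, and the unfinished-line carry-over reassembles any codepoint split across chunks:

# "é" is <<0xC3, 0xA9>> in UTF-8; simulate it being split across two chunks:
chunks = [<<0xC3>>, <<0xA9, ?\n>>]

# carry the unfinished tail forward, the same way handle_chunk/2 does:
{lines, _rest} =
  Enum.reduce(chunks, {[], ""}, fn chunk, {acc, rest} ->
    [unfinished | done] = Enum.reverse(String.split(rest <> chunk, "\n"))
    {acc ++ Enum.reverse(done), unfinished}
  end)

lines  #=> ["é"], the codepoint survives the chunk boundary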

1 Like

I actually tried some variants of your approaches first, and they failed :slight_smile:
I think this one is the fastest because String.split will just forward to :binary.split if the split pattern is a binary, and then the splitting is done in C code. But that’s just a guess as to why String.split works better here.
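
That forwarding is easy to observe on results at least (a toy comparison, not proof of the performance claim):

# String.split with a plain binary pattern gives exactly the same result
# as the BIF-backed :binary.split with the :global option:
String.split("1,2,3", ",")               #=> ["1", "2", "3"]
:binary.split("1,2,3", ",", [:global])   #=> ["1", "2", "3"]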

One thing that I wonder about, but didn’t verify, is what happens if a 2-byte codepoint ends up being split over two chunks. I suspect it might still work, but I didn’t really check it.

In any case, this was a nice exercise. Thanks for writing the post, which motivated me to do some playing on my own. I think together both of us learned a lot in the process :slight_smile:

1 Like

I’m not gonna say no, since I didn’t try it, but I’m not sure we can get a lot from concurrency here: there’s not much processing per element, so I think the cost of message passing would shadow any possible benefit of concurrency.

But to be honest, I’m fairly pleased with the results. On my machine I can read, filter, and write ~500k rows of CSV in about 1 sec, using about 64 KB of memory. That seems decent to me :slight_smile:

5 Likes

One area where concurrency is very relevant is when you need to process multiple files. We have to do something a lot like this at CargoSense as part of parsing the binary data files we get, and I’m able to do several hundred MB/sec of I/O on my MBP by processing 4-8 files concurrently. With streams, this only uses about 10 MB of memory.
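
Not our exact code, but the shape is roughly this (process_file/1 and the glob are hypothetical):

# run one streaming pipeline per file, a handful of files at a time;
# max_concurrency matches the 4-8 figure above
Path.wildcard("data/*.bin")
|> Task.async_stream(&process_file/1, max_concurrency: 8, timeout: :infinity)
|> Stream.run()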

3 Likes

I encountered a similar slowdown while reading a 1.5 GiB log file (which contained enormous NUL byte sequences due to corruption) line by line in Elixir 1.4.2 and Erlang 19.2 under Linux 3.16; it took 8 hours and 7 minutes to complete!

However, I was able to bring the execution time down to 90 seconds (thereby achieving a massive 320x speedup) by reading the file as a raw byte stream (instead of UTF-8) and with generous caching:

file = "path/to/very/large/file"
io = file |> File.open!(read_ahead: 128 * 1024) # 128 KiB cache
lines = io |> IO.binstream(:line)
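
The stream is lazy, so nothing is read until it’s consumed, e.g.:

# force the stream, for instance by counting the lines:
lines |> Enum.count()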

See my blog post for the details on the major contributing sources of this solution. Cheers!

6 Likes

This link appears to be broken. Does anyone know if that blog moved or is archived somewhere?