Understanding Elixir/Phoenix performance

josevalim · December 7, 2018, 10:06am

Doing file traversals is generally not going to be as efficient in Elixir/Erlang as in other languages. I will explain why.

When you call File.open/2 in Elixir, it doesn’t return a file handle. It returns a process (a lightweight thread of execution) that holds the file handle. And that handle is not even a direct file handle, as you would get in C, but an instance of a linked-in driver, which is a piece of code that runs isolated in the VM and in turn talks to the actual file handle.
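To make the "it returns a process" point concrete, here is a minimal sketch you can paste into iex. The temporary file name is just a placeholder picked for the demo; the only thing it checks is what kind of value File.open/2 hands back with and without the :raw mode.

```elixir
# Hypothetical scratch file, created only for this demo.
path = Path.join(System.tmp_dir!(), "file_open_demo.txt")
File.write!(path, "hello\n")

# Default mode: the "file" is a process, so we get a pid back.
{:ok, device} = File.open(path, [:read])
default_is_pid = is_pid(device)

# Raw mode: no intermediate process, we get a file descriptor term instead.
{:ok, raw} = File.open(path, [:read, :raw])
raw_is_pid = is_pid(raw)

IO.inspect(default_is_pid, label: "default open returns a pid")
IO.inspect(raw_is_pid, label: "raw open returns a pid")

File.close(device)
File.close(raw)
File.rm(path)
```

The first inspect should print true and the second false, which is the indirection described above in its smallest form.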

You may be wondering: why all of this indirection then?

The reason why File.open/2 returns a process is because we can then pass this process around nodes and do file writes across nodes. So for example, I can open up a file on node A, pass that reference to node B, and node B can read/write to that file as if it was in node B, but everything is actually happening in node A. So the reason why we do this is because we favor distribution over raw performance.
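As a rough single-node illustration of passing the file around, the sketch below opens a file in one process and writes to it from another. It uses Task only to get a second process; the file name is a placeholder. On a real cluster the same pid could be sent to a process on another node, which is not shown here because it needs connected nodes to run.

```elixir
# Hypothetical scratch file, created only for this demo.
path = Path.join(System.tmp_dir!(), "shared_device_demo.txt")

# The current process opens the file and owns the file process.
{:ok, device} = File.open(path, [:write])

# A different process writes through the same pid. It never opens the
# file itself; the write is a message sent to the file process.
task = Task.async(fn -> IO.write(device, "written by another process\n") end)
:ok = Task.await(task)

File.close(device)
contents = File.read!(path)
File.rm(path)

IO.inspect(contents)
```

Because every read and write is a message to the file process, the caller does not care where that process lives. That is what makes the cross-node case work, and it is also where the overhead comes from.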

What about the linked-in driver thing though? There are two reasons. First of all, let’s remember that those kinds of operations need to be implemented in C or another low-level language in order to make syscalls. And while Erlang provides interoperability with C code, in earlier versions it was not possible to do an I/O based operation from within the C code. If you did that, you could mess up the Erlang schedulers that are responsible for concurrency. The second reason is that, if you have C code and there is a bug in that C code, it can cause a segmentation fault and bring the whole system down, and we prefer to keep our systems running. That led the code to be put in those linked-in drivers.

Of course all of this adds overhead, but the reason we are fine with it is that, for our use cases, you are more likely to find yourself passing a file between nodes than traversing directories as fast as possible, so we focus on the former.

The situation has improved in the latest Erlang/OTP 21 release because the VM added the ability to run I/O blocking C code with something called dirty NIFs, so they recently removed the linked-in drivers for file operations and that improved performance. But still, most calls in the File module are going through processes and what not. You can actually bypass this process architecture, usually by invoking the :prim_file module or passing a [:raw] option to the File module operations, and that typically improves things.
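Here is a small sketch of both bypasses, assuming Erlang/OTP 21 or later. Keep in mind that :prim_file is an internal, undocumented module, so the exact functions it exposes may differ between releases; the file name is again a placeholder.

```elixir
# Hypothetical scratch file, created only for this demo.
path = Path.join(System.tmp_dir!(), "raw_read_demo.txt")
File.write!(path, "first line\nsecond line\n")

# Option 1: open with :raw, so no intermediate file process is started.
# A raw file is read with the :file functions directly.
{:ok, fd} = File.open(path, [:read, :raw])
{:ok, line} = :file.read_line(fd)
:ok = File.close(fd)

# Option 2: call :prim_file directly (internal module, may change).
# A charlist path is passed here to stay on the safe side.
{:ok, bin} = :prim_file.read_file(String.to_charlist(path))

File.rm(path)

IO.inspect(line, label: "raw read_line")
IO.inspect(bin, label: "prim_file read_file")
```

The trade-off is the one described above: a raw file can only be used by the process that opened it and on the node where it lives, so you give up the distribution features in exchange for less overhead.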

But in a nutshell that’s why it won’t be as fast: there are many cases where we prefer to focus on features such as distribution and fault tolerance rather than raw performance.

Btw, regarding CSV processing, did you try the nimble_csv library?