Recently, I had to move a non-thread-safe NIF into an external server and thought I’d share some interesting results.
I thought that this would surely be slower than a NIF. After all, we’re going over a network and doing some data serialization, versus NIFs being…well…native inline functions.
Punchline - the external server was faster for calls that normally take about 100 μs.
Name ips average deviation median 99th %
pool_func 6.35 K 157.56 μs ±9.26% 153.60 μs 205.64 μs
pool_e2e 6.17 K 162.15 μs ±10.32% 159.71 μs 227.91 μs
nif_func 4.30 K 232.58 μs ±3.67% 234.35 μs 259.72 μs
nif_e2e 4.18 K 239.32 μs ±4.50% 240.37 μs 270.51 μs
For shorter functions, the network overhead seems to dominate. There appears to be a baseline cost of about 15 μs per call (measured on an 11th Gen Intel(R) Core™ i5-1135G7 @ 2.40GHz).
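To get a feel for that baseline, here’s a rough sketch (plain OTP, no deps) that bounces a 512-byte payload off an in-process echo loop over a Unix domain socket with the same `packet: 2` framing. The socket path is arbitrary, and an in-VM echo loop won’t reproduce the exact numbers from a separate server process, but it bounds the transport-plus-framing cost:

```elixir
# Measure the raw round-trip cost over a Unix domain socket.
path = "/tmp/echo_bench.sock"
File.rm(path)

{:ok, lsock} =
  :gen_tcp.listen(0, [:binary, {:ifaddr, {:local, path}}, active: false, packet: 2])

# Echo server: read a length-prefixed frame, write it straight back.
spawn_link(fn ->
  {:ok, sock} = :gen_tcp.accept(lsock)

  echo = fn echo ->
    case :gen_tcp.recv(sock, 0) do
      {:ok, data} ->
        :ok = :gen_tcp.send(sock, data)
        echo.(echo)

      {:error, :closed} ->
        :ok
    end
  end

  echo.(echo)
end)

{:ok, csock} = :gen_tcp.connect({:local, path}, 0, [:binary, active: false, packet: 2])
payload = :binary.copy(<<0>>, 512)

{usec, :ok} =
  :timer.tc(fn ->
    Enum.each(1..1_000, fn _ ->
      :ok = :gen_tcp.send(csock, payload)
      {:ok, ^payload} = :gen_tcp.recv(csock, 0)
    end)
  end)

IO.puts("avg round trip: #{usec / 1_000} μs")
```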
Name ips average deviation median 99th %
nif_random 35.31 K 28.32 μs ±23.33% 29.75 μs 37.93 μs
pool_random 25.15 K 39.76 μs ±14.27% 39.60 μs 50.41 μs
The effect was even more pronounced on a slower ARM machine (Raspberry Pi equivalent):
Name ips average deviation median 99th %
pool_func 1.36 K 0.74 ms ±7.74% 0.73 ms 0.83 ms
pool_e2e 1.28 K 0.78 ms ±9.10% 0.78 ms 0.92 ms
nif_func 0.80 K 1.25 ms ±2.30% 1.26 ms 1.31 ms
nif_e2e 0.77 K 1.29 ms ±2.46% 1.30 ms 1.40 ms
Comparison:
pool_func 1.36 K
pool_e2e 1.28 K - 1.06x slower +0.0472 ms
nif_func 0.80 K - 1.70x slower +0.51 ms
nif_e2e 0.77 K - 1.76x slower +0.56 ms
The general design is as follows:
- NimblePool starts some workers that connect via gen_tcp with options [:binary, active: false, packet: 2], which caps packets at about 65 KB (2-byte length header), though the largest message I’m sending is 512 bytes.
- The connection is made over Unix domain sockets, since TCP sockets were much slower: the Unix socket had a median of 15 μs and a 99th percentile of 22 μs, while TCP was 40 μs and 285 μs.
- On accepting a connection, the server forks and enters a read loop which executes the function. The connection remains open for subsequent messages.
- Messages are sent to the server as JSON and results are returned as MessagePack. This shaves about 15 μs off for my data (a 512-byte array of floats).
- Pretty much everything on the server is statically allocated, including the MessagePack buffers.
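For those curious what the client side of this design looks like, here’s a minimal sketch of a NimblePool worker holding one persistent Unix-socket connection. Module names and the socket path are made up, and I’m assuming Jason for JSON encoding and Msgpax for MessagePack decoding; treat it as an outline, not the actual implementation:

```elixir
defmodule FuncPool.Worker do
  @moduledoc "Hypothetical NimblePool worker owning one Unix-socket connection."
  @behaviour NimblePool

  # Assumed path; the Zig server listens here.
  @socket_path "/tmp/func_server.sock"

  @impl NimblePool
  def init_worker(pool_state) do
    # packet: 2 adds a 2-byte length header, matching the server's framing.
    {:ok, sock} =
      :gen_tcp.connect({:local, @socket_path}, 0, [:binary, active: false, packet: 2])

    {:ok, sock, pool_state}
  end

  @impl NimblePool
  def handle_checkout(:checkout, _from, sock, pool_state) do
    # Hand the raw socket to the client for the duration of the call.
    {:ok, sock, sock, pool_state}
  end

  @impl NimblePool
  def handle_checkin(_client_state, _from, sock, pool_state) do
    # Connection stays open for the next caller.
    {:ok, sock, pool_state}
  end
end

defmodule FuncPool do
  @doc "Send args as JSON; receive a MessagePack-encoded reply."
  def call(args, timeout \\ 5_000) do
    NimblePool.checkout!(__MODULE__, :checkout, fn _from, sock ->
      :ok = :gen_tcp.send(sock, Jason.encode!(args))
      {:ok, reply} = :gen_tcp.recv(sock, 0, timeout)
      {Msgpax.unpack!(reply), sock}
    end)
  end
end
```

The pool itself would go under a supervisor as something like `{NimblePool, worker: {FuncPool.Worker, nil}, pool_size: 8, name: FuncPool}`.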
If there’s some interest, I can share the server-side code. It may be interesting for those doing machine learning, etc. It’s written in Zig (my first time using it), and it’s pretty easy to call existing C libraries from it.