Performance of external functions vs. NIFs

Recently, I had to move a non-thread-safe NIF into an external server and thought I’d share some interesting results.

I thought that this would for sure be slower than a NIF. After all, we’re going over a socket and have to do some data serialization, whereas NIFs are…well…native inline functions.

Punchline: the external server was faster for calls that normally take about 100 μs.

  Name                ips        average  deviation         median         99th %
  pool_func        6.35 K      157.56 μs     ±9.26%      153.60 μs      205.64 μs
  pool_e2e         6.17 K      162.15 μs    ±10.32%      159.71 μs      227.91 μs
  nif_func         4.30 K      232.58 μs     ±3.67%      234.35 μs      259.72 μs
  nif_e2e          4.18 K      239.32 μs     ±4.50%      240.37 μs      270.51 μs

For shorter functions, the transport overhead seems to dominate. The baseline cost per call seems to be about 15 μs (measured on an 11th Gen Intel(R) Core™ i5-1135G7 @ 2.40GHz).
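To give a feel for where a per-call baseline like that comes from, here is a rough sketch (my own illustration in Python, not the author’s benchmark) that times many tiny length-prefixed round trips over a local socketpair and averages them. The numbers it produces only approximate the post’s Unix-socket setup and will vary by machine.

```python
import socket
import struct
import threading
import time

def round_trip_us(iterations: int = 10_000) -> float:
    """Average microseconds per framed round trip over a local socketpair."""
    client, server = socket.socketpair()

    def echo_loop():
        # Echo each 2-byte-length-prefixed frame back until the client closes.
        while True:
            header = server.recv(2)
            if not header:
                return
            (length,) = struct.unpack(">H", header)
            body = server.recv(length)
            server.sendall(header + body)

    threading.Thread(target=echo_loop, daemon=True).start()

    # One small frame: 2-byte big-endian length prefix + 4-byte body.
    payload = struct.pack(">H", 4) + b"ping"
    start = time.perf_counter()
    for _ in range(iterations):
        client.sendall(payload)
        # A 6-byte frame arrives whole on a local socketpair, so a single
        # recv suffices here; real code should loop until all bytes arrive.
        client.recv(6)
    elapsed = time.perf_counter() - start
    client.close()
    return elapsed / iterations * 1e6
```

This measures only the transport and framing cost, with no serialization or real work, so it is a lower bound on the per-call overhead.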

    Name                  ips        average  deviation         median         99th %
    nif_random        35.31 K       28.32 μs    ±23.33%       29.75 μs       37.93 μs
    pool_random       25.15 K       39.76 μs    ±14.27%       39.60 μs       50.41 μs

The effect was even more pronounced on a slower ARM machine (roughly Raspberry Pi class):

    Name                ips        average  deviation         median         99th %
    pool_func        1.36 K        0.74 ms     ±7.74%        0.73 ms        0.83 ms
    pool_e2e         1.28 K        0.78 ms     ±9.10%        0.78 ms        0.92 ms
    nif_func         0.80 K        1.25 ms     ±2.30%        1.26 ms        1.31 ms
    nif_e2e          0.77 K        1.29 ms     ±2.46%        1.30 ms        1.40 ms

    Comparison:
    pool_func        1.36 K
    pool_e2e         1.28 K - 1.06x slower +0.0472 ms
    nif_func         0.80 K - 1.70x slower +0.51 ms
    nif_e2e          0.77 K - 1.76x slower +0.56 ms

The general design is as follows:

  • NimblePool starts a set of workers via gen_tcp with options [:binary, active: false, packet: 2], which gives a maximum packet size of 65,535 bytes; the largest message I’m sending is 512 bytes.
  • The connection is made over a Unix domain socket, as TCP was much slower: the Unix socket had a median of 15 μs and a 99th percentile of 22 μs, while TCP was 40 μs and 285 μs.
  • On accepting a connection, the server forks and enters a read loop that executes the function. The connection stays open for new messages.
  • Requests are sent to the server as JSON and results are returned as MessagePack, which shaves about 15 μs off for my data (a 512-byte array of floats).
  • Pretty much everything on the server is statically allocated, including the MessagePack buffers.
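The client side of the wire protocol described above can be sketched as follows. This is my own reconstruction in Python for illustration, not the author’s Elixir code: a socket carrying frames with a 2-byte big-endian length prefix (Erlang’s packet: 2 framing), JSON out, MessagePack bytes back.

```python
import json
import socket
import struct

def send_frame(sock: socket.socket, payload: bytes) -> None:
    # packet: 2 framing: 2-byte big-endian length prefix, so frames
    # are capped at 65,535 bytes.
    if len(payload) > 0xFFFF:
        raise ValueError("packet: 2 framing caps messages at 65,535 bytes")
    sock.sendall(struct.pack(">H", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def recv_frame(sock: socket.socket) -> bytes:
    (length,) = struct.unpack(">H", recv_exact(sock, 2))
    return recv_exact(sock, length)

def call(sock: socket.socket, request: dict) -> bytes:
    """Send a JSON-encoded request; return the raw reply bytes.

    In the post's setup the reply is MessagePack; decode it with a
    MessagePack library of your choice (e.g. msgpack.unpackb).
    """
    send_frame(sock, json.dumps(request).encode())
    return recv_frame(sock)
```

The connection would be opened once per pool worker (in the post, against a Unix domain socket path) and reused across calls, since the server keeps its read loop running on an open connection.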

If there’s some interest, I can share the server-side code. It may be interesting for those doing machine learning, etc. It’s written in Zig (my first time using it), and it’s pretty easy to call existing C libraries from it.
