Recently, I had to move a non-thread-safe NIF into an external server and thought I’d share some interesting results.
I thought that this would surely be slower than a NIF. After all, we’re going over a network and doing some data serialization, versus NIFs being…well…native inline functions.
Punchline - the external server was faster for calls that normally take about 100 μs.
Name ips average deviation median 99th %
pool_func 6.35 K 157.56 μs ±9.26% 153.60 μs 205.64 μs
pool_e2e 6.17 K 162.15 μs ±10.32% 159.71 μs 227.91 μs
nif_func 4.30 K 232.58 μs ±3.67% 234.35 μs 259.72 μs
nif_e2e 4.18 K 239.32 μs ±4.50% 240.37 μs 270.51 μs
For shorter functions, the network overhead seems to dominate. There appears to be a baseline cost of about 15 μs per call (measured on an 11th Gen Intel(R) Core™ i5-1135G7 @ 2.40GHz).
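To get a feel for that baseline, here’s a rough sketch (plain OTP, no deps) that bounces a 512-byte payload off an in-process echo loop over a Unix domain socket with the same `packet: 2` framing. The socket path is arbitrary, and an in-VM echo loop won’t reproduce the exact numbers from a separate server process, but it bounds the transport-plus-framing cost:

```elixir
# Measure the raw round-trip cost over a Unix domain socket.
path = "/tmp/echo_bench.sock"
File.rm(path)

{:ok, lsock} =
  :gen_tcp.listen(0, [:binary, {:ifaddr, {:local, path}}, active: false, packet: 2])

# Echo server: read a length-prefixed frame, write it straight back.
spawn_link(fn ->
  {:ok, sock} = :gen_tcp.accept(lsock)

  echo = fn echo ->
    case :gen_tcp.recv(sock, 0) do
      {:ok, data} ->
        :ok = :gen_tcp.send(sock, data)
        echo.(echo)

      {:error, :closed} ->
        :ok
    end
  end

  echo.(echo)
end)

{:ok, csock} = :gen_tcp.connect({:local, path}, 0, [:binary, active: false, packet: 2])
payload = :binary.copy(<<0>>, 512)

{usec, :ok} =
  :timer.tc(fn ->
    Enum.each(1..1_000, fn _ ->
      :ok = :gen_tcp.send(csock, payload)
      {:ok, ^payload} = :gen_tcp.recv(csock, 0)
    end)
  end)

IO.puts("avg round trip: #{usec / 1_000} μs")
```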
Name ips average deviation median 99th %
nif_random 35.31 K 28.32 μs ±23.33% 29.75 μs 37.93 μs
pool_random 25.15 K 39.76 μs ±14.27% 39.60 μs 50.41 μs
The effect was even more pronounced on a slower ARM machine (Raspberry Pi equivalent):
Name ips average deviation median 99th %
pool_func 1.36 K 0.74 ms ±7.74% 0.73 ms 0.83 ms
pool_e2e 1.28 K 0.78 ms ±9.10% 0.78 ms 0.92 ms
nif_func 0.80 K 1.25 ms ±2.30% 1.26 ms 1.31 ms
nif_e2e 0.77 K 1.29 ms ±2.46% 1.30 ms 1.40 ms
Comparison:
pool_func 1.36 K
pool_e2e 1.28 K - 1.06x slower +0.0472 ms
nif_func 0.80 K - 1.70x slower +0.51 ms
nif_e2e 0.77 K - 1.76x slower +0.56 ms
The general design is as follows:
- NimblePool starts some workers that connect via gen_tcp with options [:binary, active: false, packet: 2], which caps packets at about 65 KB (2-byte length header), though the largest message I’m sending is 512 bytes.
- The connection is made over Unix domain sockets, since TCP sockets were much slower: the Unix socket had a median of 15 μs and a 99th percentile of 22 μs, while TCP was 40 μs and 285 μs.
- On accepting a connection, the server forks and enters a read loop which executes the function. The connection remains open for subsequent messages.
- Messages are sent to the server as JSON and results are returned as MessagePack. This shaves about 15 μs off for my data (a 512-byte array of floats).
- Pretty much everything on the server is statically allocated, including the MessagePack buffers.
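For those curious what the client side of this design looks like, here’s a minimal sketch of a NimblePool worker holding one persistent Unix-socket connection. Module names and the socket path are made up, and I’m assuming Jason for JSON encoding and Msgpax for MessagePack decoding; treat it as an outline, not the actual implementation:

```elixir
defmodule FuncPool.Worker do
  @moduledoc "Hypothetical NimblePool worker owning one Unix-socket connection."
  @behaviour NimblePool

  # Assumed path; the Zig server listens here.
  @socket_path "/tmp/func_server.sock"

  @impl NimblePool
  def init_worker(pool_state) do
    # packet: 2 adds a 2-byte length header, matching the server's framing.
    {:ok, sock} =
      :gen_tcp.connect({:local, @socket_path}, 0, [:binary, active: false, packet: 2])

    {:ok, sock, pool_state}
  end

  @impl NimblePool
  def handle_checkout(:checkout, _from, sock, pool_state) do
    # Hand the raw socket to the client for the duration of the call.
    {:ok, sock, sock, pool_state}
  end

  @impl NimblePool
  def handle_checkin(_client_state, _from, sock, pool_state) do
    # Connection stays open for the next caller.
    {:ok, sock, pool_state}
  end
end

defmodule FuncPool do
  @doc "Send args as JSON; receive a MessagePack-encoded reply."
  def call(args, timeout \\ 5_000) do
    NimblePool.checkout!(__MODULE__, :checkout, fn _from, sock ->
      :ok = :gen_tcp.send(sock, Jason.encode!(args))
      {:ok, reply} = :gen_tcp.recv(sock, 0, timeout)
      {Msgpax.unpack!(reply), sock}
    end)
  end
end
```

The pool itself would go under a supervisor as something like `{NimblePool, worker: {FuncPool.Worker, nil}, pool_size: 8, name: FuncPool}`.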
If there’s some interest, I can share the server-side code. It may be interesting for those doing machine learning, etc. It’s written in Zig (my first time using it), and it’s pretty easy to call existing C libraries from it.