First call to NIF is slow, but the next one is 4x faster


I have a NIF function (let’s say FUNC) implemented in C++, which is called approx. 10-20 times a second. Its implementation is quite simple: it just reads a struct by an offset in a vector allocated previously in the same NIF and returns something like [{x, [%{y: y}]}], where x and y are integers.

There are other functions in the same NIF that I call before FUNC to update the vector.

So the pipeline is something like ReadNetworkPacket->Decode->CallOtherNifFunc->CallFUNC.

Every time I call FUNC I call it twice in a row, reading vectorA and vectorB. The problem is that the first call takes 8 µs and the second takes 2 µs.

I’ve double-checked the amount of work done is the same every time.

Are there any tools I can use to debug this?


The first time you call any function in an application run with mix, it’s slow because mix lazy-loads code. I would build a release and see if the problem persists.


It is slow again on every iteration. If I call it three times in a row, the first call is slow but the other two are fast. It looks like something happens between receiving a message from the network and calling the function. There are no allocations in my code.

As for building a release: thanks, I will try.


Can you share an example code?


I’ve implemented a very simple test just calling a C++ NIF function returning :ok.

The function is called twice every 0.1 s in an infinite loop (GenServer.handle_info). On my machine the first call on each iteration takes approx. 3.5 µs and the next call takes 0.4 µs.

Here is the code:


Can you post the code as text? Either paste it here or use a Gist (or another pastebin). I prefer not to fetch untrusted archives from the internet.


Here it is


I had a look at this, and could reproduce approximately the same numbers on my system with the gist you posted.

I think what you are observing here is the impact of cache eviction when the process is unscheduled (i.e. when you reach the reduction limit, or run out of things to do). My basis for this is:

  1. A tight loop, or a small (e.g. 1ms) send_after interval sees no real discrepancy in call latency, i.e. on subsequent loops, the first is as fast as the second, I’m assuming because the second call on the prior iteration was recent enough to keep the caches populated.
  2. A Process.sleep(100) between the first and second call sees the cost of the second call rise to the cost of the first.
  3. I ran another experiment where I launched two Tests.Server processes under a supervisor. The first didn’t take any timings but just called the nif in a tight loop, the second behaved as in your gist, making a pair of calls and then effectively sleeping for 100ms. By making them both run on the same scheduler (--erl "+S 1:1"), both calls in the second GenServer were fast. My assumption is that this is because the first genserver is keeping the relevant caches populated with what the second needs.

I suspect there’s just enough work done in the BEAM between your process yielding and getting rescheduled for either its code (jitted code / nif code) or its data (process stacks / heaps) or both to get evicted to some degree, and you’re observing the cost of bringing it back in to L1.


Yes, this is because of L1 and/or interrupts. I ran the same test on an isolated CPU core with kernel interrupts pinned to another core, and the issue went away.
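For anyone wanting to reproduce that setup on Linux, the usual recipe looks roughly like the sketch below. This is system configuration, not a verified script: core numbers, the IRQ affinity mask, and the exact boot flags all depend on the machine.

```shell
# Sketch: isolate core 3 for the BEAM (adjust core numbers to your machine).

# 1. Remove core 3 from the general scheduler via kernel boot parameters
#    (e.g. appended to GRUB_CMDLINE_LINUX in /etc/default/grub, then update-grub):
#    isolcpus=3 nohz_full=3

# 2. Steer hardware IRQs away from core 3 by restricting each IRQ's affinity
#    mask to the other cores (0x7 = cores 0-2 on a 4-core machine); some IRQs
#    cannot be moved, hence the "|| true".
for irq in /proc/irq/*/smp_affinity; do
    echo 7 | sudo tee "$irq" > /dev/null 2>&1 || true
done

# 3. Pin the running system to the isolated core with taskset:
taskset -c 3 elixir -S mix run --no-halt
```

With interrupts and other processes kept off the core, nothing evicts the NIF's working set between calls, which matches the result reported above.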