First call to NIF is slow, but the next one is 4x faster

rychale · February 11, 2022, 9:44pm

Hi,

I have a NIF function (let’s say FUNC) implemented with C++, which is called approx. 10-20 times a second. Its implementation is quite simple - it just reads a struct by an offset in vector allocated previously in the same NIF and returns something like [{x, [%{y: y}]}], where x and y are integers.

There are another functions in the same NIF I’m calling before the function FUNC updating the vector.

So the pipeline is something like ReadNetworkPacket->Decode->CallOtherNifFunc->CallFUNC.

Every time I call FUNC I call it two times in a row, reading vectorA and vectorB. The problem is that the first call takes 8us and the second is 2 us.

I’ve double-checked the amount of work done is the same every time.

Are there any tools I can use to debug this?

benwilson512 · February 11, 2022, 10:36pm

The first time you call any function in an application run with mix it’s slow because mix lazy loads code. I would build a release and see if the problem persists.

rychale · February 12, 2022, 9:01am

It is is slow every second call. If I call it three time in a row, then first call is slow, but the other two are fast. It looks like something happens between receiving a message from network and calling the function. There are no any allocations in my code.

As for building release thanks, I will try.

hauleth · February 12, 2022, 9:14am

Can you share an example code?

rychale · February 13, 2022, 2:26pm

I’ve implemented a very simple test just calling a C++ NIF function returning :ok.

The function being called twice every 0.1 sec in infinite loop (GenServer.handle_info). On my machine the first call on every iteration takes approx 3.5µ and the next call takes 0.4µ.

Here is the code: https://transfer.sh/Jjw2Wc/nif-test.tgz

hauleth · February 13, 2022, 4:48pm

Can you post a code as text. Either pastę it here or use Gist (or other pastebin). I prefer to not fetch untrusted archives from internet.

rychale · February 13, 2022, 6:45pm

Here it is

gist.github.com

https://gist.github.com/rychale/8531fbecd7cd3e8e04da977f8ab0e513

Makefile

CXX ?= g++

ERLANG_PATH = $(shell erl -eval 'io:format("~s", [lists:concat([code:root_dir(), "/erts-", erlang:system_info(version), "/include"])])' -s init stop -noshell)

OBJECTS    = $(patsubst c_libs/%.cpp,build/%.o,$(wildcard c_libs/*.cpp))

all: priv/native.so

dirs:
	mkdir -p build

This file has been truncated. show original

main.ex

defmodule Test.NativeProcessor.Lib do
  @moduledoc false
  @on_load :load_nifs

  def load_nifs do
    path = Application.app_dir(:test, "priv/native") |> String.to_charlist()

    :erlang.load_nif(path, 0)
  end

This file has been truncated. show original

mix.exs

defmodule Test.MixProject do
  use Mix.Project

  def project do
    [
      app: :test,
      version: "0.1.0",
      elixir: "~> 1.13",
      start_permanent: Mix.env() == :prod,
      deps: deps(),

This file has been truncated. show original

There are more than three files. show original

ausimian · February 14, 2022, 12:07pm

I had a look at this, and could repro approximately the same numbers on my system with the gist you posted.

I think what you are observing here is the impact of cache eviction when the process is unscheduled (i.e. when you reach the reduction limit, or run out of things to do). My basis for this is:

A tight loop, or a small (e.g. 1ms) send_after interval sees no real discrepancy in call latency, i.e. on subsequent loops, the first is as fast as the second, I’m assuming because the second call on the prior iteration was recent enough to keep the caches populated.
A Process.sleep(100) between the first and second call sees the cost of the second call rise to the cost of the first.
I ran another experiment where I launched two Tests.Server processes under a supervisor. The first didn’t take any timings but just called the nif in a tight loop, the second behaved as in your gist, making a pair of calls and then effectively sleeping for 100ms. By making them both run on the same scheduler (--erl "+S 1:1"), both calls in the second GenServer were fast. My assumption is that this is because the first genserver is keeping the relevant caches populated with what the second needs.

I suspect there’s just enough work done in the BEAM between your process yielding and getting rescheduled for either its code (jitted code / nif code) or its data (process stacks / heaps) or both to get evicted to some degree, and you’re observing the cost of bringing it back in to L1.

rychale · February 16, 2022, 8:31am

Yes, this is because of L1 and/or interrupts. I’ve run the same test on an isolated CPU core with kernel interrupts pinned to another core and the issue just solved.

Thanks