NIF returns incorrect results, reuses binaries?

munksgaard · June 21, 2023, 11:50am

I’m currently in the very early stages of implementing futlixir, a bridge that enables Elixir to call out to Futhark programs. It works by taking a Futhark library and generating a corresponding NIF, similar to how Rustler lets you use Rust programs.

Futlixir is still in a very early stage, but I can currently create arrays and call simple functions with those arrays. Here is an example of what the Elixir code looks like:

c("lib_map.ex") # Import Map.NIF
{:ok, cfg} = Map.NIF.futhark_context_config_new()
{:ok, ctx} = Map.NIF.futhark_context_new(cfg)

xs_binary = <<0, 1>>
{:ok, xs} = Map.NIF.futhark_new_u8_1d(ctx, xs_binary)
{:ok, ^xs_binary} = Map.NIF.futhark_u8_1d_to_binary(ctx, xs)
{:ok, ys} = Map.NIF.futhark_new_u8_1d(ctx, <<1, 4>>)
{:ok, zs} = Map.NIF.futhark_entry_add(ctx, xs, ys)
{:ok, <<1, 5>> = zs_binary} = Map.NIF.futhark_u8_1d_to_binary(ctx, zs)

xs_binary = <<1::integer-signed-64-little>>
{:ok, xs} = Map.NIF.futhark_new_i64_1d(ctx, xs_binary)
{:ok, ^xs_binary} = Map.NIF.futhark_i64_1d_to_binary(ctx, xs)
{:ok, ys} = Map.NIF.futhark_new_i64_1d(ctx, <<1279::integer-signed-64-little>>)
{:ok, zs} = Map.NIF.futhark_entry_add_i64(ctx, xs, ys)
{:ok, <<1280::integer-signed-64-little>> = zs_binary} = Map.NIF.futhark_i64_1d_to_binary(ctx, zs)

The program is very simple. First it creates two arrays of bytes (u8) and adds them together, yielding <<1, 5>>. Next it takes two arrays of i64 and adds them together, yielding <<1280::integer-signed-64-little>> (the array has size 1).

The problem arises because sometimes the call to Map.NIF.futhark_i64_1d_to_binary will return <<2, 0, 0, 0, 0, 0, 0, 0>>, or indeed whatever xs_binary is set to, instead of the correct result. For some reason, it seems like it’s not correctly allocating a new binary for zs_binary but reusing xs_binary? This only happens sometimes, and only if the preceding block of code is included as well (the one adding two u8 arrays).

The definition of futhark_i64_1d_to_binary looks like this:

static ERL_NIF_TERM futhark_i64_1d_to_binary_nif(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
  struct futhark_context **ctx;
  struct futhark_i64_1d **xs;

  ErlNifBinary binary;
  ERL_NIF_TERM ret;

  if(argc != 2) {
    return enif_make_badarg(env);
  }

  if(!enif_get_resource(env, argv[0], CONTEXT_TYPE, (void**) &ctx)) {
    return enif_make_badarg(env);
  }

  if(!enif_get_resource(env, argv[1], I64_1D, (void**) &xs)) {
    return enif_make_badarg(env);
  }

  const int64_t *shape = futhark_shape_i64_1d(*ctx, *xs);

  enif_alloc_binary(shape[0] * sizeof(int64_t), &binary);

  if (futhark_values_i64_1d(*ctx, *xs, (int64_t *)(binary.data)) != 0) return enif_make_badarg(env);
  futhark_context_sync(*ctx);

  ret = enif_make_binary(env, &binary);

  return enif_make_tuple2(env, atom_ok, ret);
}

Does anyone have a clue what might be wrong?

I’ve uploaded all the files needed to reproduce the issue here: lib_map.c · GitHub

To compile and run it, run the following commands:

gcc -Wall -shared -o lib_map_nif.so -fPIC lib_map_nif.c -lOpenCL -lm
iex --dot-iex test.exs

You’ll probably need to run it a handful of times for the error to trigger.

tj0 · June 21, 2023, 4:05pm

enif_alloc_binary could be re-using the memory of the previous answer. You may need to zero it / memset.

Otherwise, there are too many pointer references for me to reason about easily.

Edit: Looked closer, not sure if that’s it. But there’s something funny going on here. Maybe check the result of futhark_values_i64_1d with some type of assert? Set the binary value to 0 beforehand and check that it has changed before futhark_context_sync. Of course, it could be something completely different, but I think that’s a reasonable starting point.
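
For what it’s worth, here is a minimal sketch of that check in plain C. The helper names (fill_with_sentinel, buffer_was_written) and the 0xAA pattern are made up for illustration and are not part of erl_nif or Futhark. It uses a non-zero sentinel rather than 0, since 0 is a plausible byte in a legitimate result; that is a variation on the suggestion above, not the suggestion itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Arbitrary fill pattern; an assumption of this sketch. */
#define SENTINEL_BYTE 0xAA

/* Overwrite the whole buffer with the sentinel before the call under test. */
static void fill_with_sentinel(unsigned char *buf, size_t len) {
    assert(buf != NULL);
    memset(buf, SENTINEL_BYTE, len);
}

/* Return 1 if at least one byte no longer holds the sentinel, 0 otherwise.
   A result made up entirely of sentinel bytes would be reported as
   "not written", so this is a debugging aid rather than a proof. */
static int buffer_was_written(const unsigned char *buf, size_t len) {
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != SENTINEL_BYTE) {
            return 1;
        }
    }
    return 0;
}
```

In the NIF above, the idea would be to call fill_with_sentinel on binary.data right after enif_alloc_binary, and buffer_was_written on it after futhark_values_i64_1d returns.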

munksgaard · June 22, 2023, 5:49am

Thank you for your response. Yes, I realize that the example is a bit unwieldy. I’m going to try to minimize it a bit, hopefully later today.

I’ve tried manually setting the array values as well, with odd results. When I (hopefully) get back to this problem later today, I’ll also try to replicate those results and post them here.

Thanks!

jhogberg · June 22, 2023, 9:17am

Have you checked the return value of futhark_context_sync?

The precise semantics of the return value depends on the backend. For the sequential C backend, errors will always be available when the entry point returns, and futhark_context_sync() will always return zero. When using a GPU backend such as cuda or opencl, the entry point may still be running asynchronous operations when it returns, in which case the entry point may return zero successfully, even though execution has already (or will) fail. These problems will be reported when futhark_context_sync() is called. Therefore, be careful to check the return code of both the entry point itself, and futhark_context_sync().

munksgaard · September 22, 2023, 10:45am

Yes, it returns 0 even when the error occurs.

munksgaard · September 22, 2023, 12:00pm

Turns out that the _new calls in Futhark (e.g. futhark_new_i64_1d) are asynchronous, so I needed to insert an explicit synchronization point after creating arrays and before handing off control to the BEAM.
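
To illustrate the kind of hazard this implies, here is a small self-contained simulation in plain C. Everything prefixed with mock_ is a hypothetical stand-in written for this sketch, not Futhark’s API or implementation: the mock "new" call only records where the host data lives, and the copy happens when the mock sync runs. Whether a real GPU backend defers the copy in exactly this way is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical async context: remembers one pending host-to-device copy. */
struct mock_ctx {
    const int64_t *pending_src; /* host buffer still to be read */
    int64_t *pending_dst;       /* "device" destination */
    size_t pending_len;         /* number of elements */
};

struct mock_array {
    int64_t *data;
    size_t len;
};

/* Enqueue the copy and return immediately, like an asynchronous upload. */
static struct mock_array *mock_new_i64_1d(struct mock_ctx *ctx,
                                          const int64_t *host, size_t len) {
    assert(len > 0);
    struct mock_array *arr = malloc(sizeof *arr);
    if (arr == NULL) return NULL;
    arr->data = calloc(len, sizeof *arr->data);
    if (arr->data == NULL) { free(arr); return NULL; }
    arr->len = len;
    ctx->pending_src = host;
    ctx->pending_dst = arr->data;
    ctx->pending_len = len;
    return arr;
}

/* Perform the pending copy, if any. Returns 0 on success. */
static int mock_context_sync(struct mock_ctx *ctx) {
    if (ctx->pending_src != NULL) {
        memcpy(ctx->pending_dst, ctx->pending_src,
               ctx->pending_len * sizeof(int64_t));
        ctx->pending_src = NULL;
    }
    return 0;
}

static void mock_free(struct mock_array *arr) {
    free(arr->data);
    free(arr);
}

/* Create an array from a host buffer holding 1, then overwrite the buffer
   with 1279 (as if its memory were reused). Returns the value the array
   ends up with, or -1 on allocation or sync failure. */
static int64_t first_element_after_reuse(int sync_before_reuse) {
    struct mock_ctx ctx = { NULL, NULL, 0 };
    int64_t host[1] = { 1 };
    struct mock_array *arr = mock_new_i64_1d(&ctx, host, 1);
    if (arr == NULL) return -1;
    if (sync_before_reuse && mock_context_sync(&ctx) != 0) {
        mock_free(arr);
        return -1;
    }
    host[0] = 1279; /* the caller's memory now holds something else */
    if (mock_context_sync(&ctx) != 0) {
        mock_free(arr);
        return -1;
    }
    int64_t seen = arr->data[0];
    mock_free(arr);
    return seen;
}
```

With the early sync the array keeps the value it was created from; without it, the array picks up whatever the host buffer held when the copy finally ran. In the real NIF, the corresponding change would be to synchronize inside the function that wraps futhark_new_i64_1d, before it returns to the BEAM, and to check the return code of that synchronization.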