Erlinit: Erlang terminated due to signal 11

Hi,

I am currently working on an embedded Armv7 device that runs an image built from the nerves project.

In my firmware, I replaced the standard /sbin/init with Nerves’s erlinit.

My device was running fine for 9 whole days, and then it crashed. Luckily, I had the dmesg output from the time the erlang VM exited, and this was the message: “erlinit: Erlang terminated due to signal 11”.

From reading online, I think this is a segmentation fault.
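For reference, signal 11 on Linux is SIGSEGV. A supervising process normally learns which signal killed its child from the wait status. The sketch below shows that general POSIX mechanism; it is not erlinit's actual source, and the names crash_with_segv and terminating_signal are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body (hypothetical): raise SIGSEGV on purpose to stand in for a
 * process that segfaults. */
static void crash_with_segv(void)
{
    signal(SIGSEGV, SIG_DFL); /* make sure the default action (terminate) applies */
    raise(SIGSEGV);
}

/* Fork, run fn in the child, and return the number of the signal that
 * terminated it. Returns 0 if the child exited normally and -1 if fork or
 * waitpid failed. */
static int terminating_signal(void (*fn)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        fn();
        _exit(0); /* only reached if fn did not kill the child */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling terminating_signal(crash_with_segv) from a small main and printing the result should report SIGSEGV, which is the number that shows up in a message like the one above.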

I am looking for help on how to debug this, as it is very hard to reproduce, and no error messages appear to be printed when it happens… It just seems to happen out of nowhere. Also, the fact that it takes several days to occur makes it very impractical to debug by sitting there watching the device.

Any help or ideas on how to go about this, or what tools to use would be greatly appreciated.

OTP Version: 24.1.3
Elixir Version: 1.12.1-otp-24
Linux Kernel: 5.4

A shot in the dark, but for what it’s worth…
Feels like it may be due to some resource (storage, I/O buffers, etc.) filling up over time?
Have you checked log files or terminal output (stdout/stderr) from daemons that aren’t directed to /dev/null?

In this case, this probably won’t give you much more information, but erlinit can provide a shutdown report that’s saved to /data. Add this to your config.exs:

config :nerves, :erlinit, shutdown_report: "/data/last_shutdown.txt"

If you want to get a crash dump from Erlang, then environment variables can be used to control that. You might already have this or something like it in your Nerves system’s rootfs_overlay/etc/erlinit.config, so check that first:

# Enable crash dumps (set ERL_CRASH_DUMP_SECONDS=0 to disable)
-e ERL_CRASH_DUMP=/data/erl_crash.dump;ERL_CRASH_DUMP_SECONDS=5

I feel like the sad part is that this is going to end up being an issue with a NIF. Take a look at this blog post for working with core dumps with Nerves.

The only other idea I have is that if you have a suspicion on a possibly sketchy NIF, perhaps there’s a way of calling it a lot that can make the crash happen more quickly.
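One way to try that idea, assuming the suspect native code can be built and called outside the VM, is a small C harness that hammers the function in a child process and reports whether it died from a signal. This is only a sketch: suspect_call is a hypothetical stand-in that simulates a rare fault on its 5000th call, and stress_in_child is an illustrative name, not part of Nerves or erlinit.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the native routine under suspicion. It behaves
 * normally except on call number 5000, where it simulates a segfault. */
static void suspect_call(unsigned long i)
{
    if (i == 5000) {
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
    }
}

/* Call fn(1) .. fn(iterations) in a forked child so a crash cannot take
 * down the harness itself. Returns the terminating signal number, 0 if
 * every call completed, or -1 if fork or waitpid failed. */
static int stress_in_child(void (*fn)(unsigned long), unsigned long iterations)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        for (unsigned long i = 1; i <= iterations; i++)
            fn(i);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

With the simulated fault, stress_in_child(suspect_call, 1000) returns 0 and stress_in_child(suspect_call, 10000) returns SIGSEGV. For real code, the same loop can be run under a memory checker to catch corruption that does not crash immediately.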

@milangupta, possibly, although I do track memory and CPU usage, and even write it to a CSV file. There is no visible memory or CPU leak occurring; everything is consistent.

Yes, I’ve checked the log files, no errors other than the one in the title.

@fhunleth , I have played around with shutdown_report before. I effectively have the same thing working though with the --run-on-exit option, where I simply call a script that runs dmesg and outputs it to a file on the filesystem. That is how I found the error message in the title.

I have ERL_CRASH_DUMP set, but I don’t have ERL_CRASH_DUMP_SECONDS set. I will try that though.

I will also look into the NIFs, we have a few of them running as well.

To clarify, I meant checking the storage space where any log files are kept …

That said, @fhunleth’s recommendation is the way to go… analyzing it from a core file is likely the most reliable way to hunt this gremlin.

Oh okay, yes those were checked as well. We have circular rotation set up for any logging, so they only ever get to a certain size.

@fhunleth, I am not 100% sure this is a NIF problem. I tried writing my own faulty NIF that should reliably cause a segfault when called, but when I call it the Erlang VM crashes much more abruptly: I don’t get the “erlinit: Erlang terminated due to signal 11” message from erlinit, or any other message for that matter. The VM just dies instantly.

This is my nif_test.c file

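/* nif_test.c */
#include <erl_nif.h>
#include <stdlib.h> /* malloc */
#include <unistd.h>

static ERL_NIF_TERM hello1(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    /* Illegal memory access: foo2 is never initialized, so the write below
       goes through an indeterminate pointer (undefined behavior, usually a
       segfault). foo is allocated but deliberately left unused. */
    float *foo, *foo2;
    foo = (float*)malloc(1000);
    foo2[0] = 1.0;

    // normally not reached
    return enif_make_int64(env, 1);
}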

static ErlNifFunc nif_funcs[] =
{
    {"hello1", 0, hello1, ERL_NIF_DIRTY_JOB_IO_BOUND}
};

ERL_NIF_INIT(Elixir.NifTest,nif_funcs,NULL,NULL,NULL,NULL)

And this is the Elixir code:

defmodule NifTest do
  @moduledoc """
  Documentation for `NifTest`.
  """
  @on_load :load_nifs

  def load_nifs do
    Application.app_dir(:nif_test, "priv/nif_test")
    |> String.to_charlist()
    |> :erlang.load_nif(0)
  end

  def hello1() do
    raise "NIF hello1/0 not implemented!!"
  end
end

Calling NifTest.hello1() will cause the seg fault.