Accessing System-Level Logs for Debugging Spontaneous Shutdowns on Raspi Zero

I’m currently working on a project with Nerves (Nerves Livebook) on a Raspberry Pi Zero, and I’ve run into a recurring issue where the device spontaneously shuts down. To understand what’s happening and pinpoint the cause, I’d like to look at the system-level logs.

Could anyone guide me on how to access these logs within the Nerves environment? Specifically, I’m interested in any logs that could shed light on system events or errors that might precede these unexpected shutdowns.


Debugging spontaneous shutdowns can be hard since the key log message might not be logged.

There are a few options:

  1. If you’re using RingLogger, turn on persistence. RingLogger will write logs every so many minutes and then on graceful shutdown. Not sure if this will help given what you posted, but seems worth a try. In your config, do something like this:
config :logger, RingLogger,
  persist_path: "/data/ring_logger.log",
  persist_seconds: 300

  2. Use RamoopsLogger. This backend logs to a special area of DRAM that sometimes survives reboots. Add the dependency and add RamoopsLogger to your :logger :backends in config.exs.

  3. Enable the :console logger, connect to the UART, watch the messages and hope.

  4. Enable an Elixir or Erlang file logger like you would for a non-Nerves project. Configure it to log to a file in /data. There might be a sync option to force writes more frequently. I don’t personally use this option, but I know people coming from non-embedded backgrounds are sometimes really comfortable with this one.
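
For the RamoopsLogger and :console options above, the config might look roughly like this. Treat it as a sketch rather than a tested setup: the version requirement is an assumption, and newer Elixir releases may handle the `:backends` key differently, so check the RamoopsLogger README against your versions.

```elixir
# mix.exs, inside your deps list (the version requirement is an assumption):
#   {:ramoops_logger, "~> 0.3"}

# config/config.exs (or config/target.exs in a Nerves project)
import Config

# Keep RingLogger, add RamoopsLogger for logs that may survive a reboot,
# and mirror everything to the console so it shows up on the UART.
config :logger, backends: [RingLogger, RamoopsLogger, :console]
```

After an unexpected reboot, running `RamoopsLogger.dump()` at the IEx prompt should print whatever made it through, if anything did.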

On Nerves, log messages from the Linux kernel and from C programs using syslog are all routed through the Elixir logger. Hopefully these aren’t your issue.

Some of the hardest bugs, imho, to debug are the ones that kill heart. If this is your issue, then try running Nerves.Runtime.Heart.status to see if wdt_time_left or heartbeat_time_left count down to 0. When they hit 0, heart thinks that something is really wrong with the BEAM or something else on the device and triggers a reboot.
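
If you want to watch those counters without eyeballing the whole map, a small helper along these lines could flag the ones that are getting low. `HeartWatch` and its 5-second default threshold are hypothetical names and values made up for illustration; only the two map keys come from the heart status output.

```elixir
defmodule HeartWatch do
  @moduledoc """
  Hypothetical helper (not part of Nerves) that picks out the heart
  countdown fields that are close to expiring.
  """

  # The two fields that trigger a reboot when they reach 0.
  @countdown_keys [:wdt_time_left, :heartbeat_time_left]

  @doc """
  Returns a map of the countdown fields whose value is at or below
  `threshold` seconds. An empty map means nothing is close to 0.
  """
  def low_timers(status, threshold \\ 5) when is_map(status) do
    status
    |> Map.take(@countdown_keys)
    |> Enum.filter(fn {_key, seconds} -> seconds <= threshold end)
    |> Map.new()
  end
end
```

On the device you would feed it the map from Nerves.Runtime.Heart.status. As far as I know, recent nerves_runtime versions wrap that map in an `{:ok, status}` tuple, so unwrap it first if yours does, then call `HeartWatch.low_timers(status)` a few times and see whether anything shows up.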

Hope this helps.


Hello @fhunleth, thanks for your response, it is really helpful. I have tried RingLogger, but so far no error has been logged. I think the problem lies in heart. Here is my heart status after running the Pi Zero for 24 hours:

%{
  program_version: %Version{major: 2, minor: 3, patch: 0},
  wdt_timeout: 15,
  wdt_time_left: 11,
  wdt_pre_timeout: 0,
  wdt_pet_time_left: 3,
  wdt_options: [:settimeout, :magicclose, :keepaliveping],
  wdt_last_boot: :power_on,
  wdt_identity: "Broadcom BCM2835 Watchdog timer",
  wdt_firmware_version: 0,
  snooze_time_left: 0,
  program_name: "nerves_heart",
  init_handshake_timeout: 0,
  init_handshake_time_left: 0,
  init_handshake_happened: true,
  init_grace_time_left: 0,
  heartbeat_timeout: 30,
  heartbeat_time_left: 26
}

The wdt_pet_time_left value becomes very small. I think it will trigger a reboot, but I don’t know how to extend the timeout or otherwise fix this. Could you give me more hints? I have checked GitHub - nerves-project/nerves_heart: Erlang heartbeat support for Nerves

I think the first thing to try is to just disable heart altogether. To do this, edit your project’s rel/vm.args.eex and comment out the -heart line:

## Enable heartbeat monitoring of the Erlang runtime system
#-heart -env HEART_BEAT_TIMEOUT 30

I really hope that gives you time to look at the logs, inspect your device or do whatever else it takes to see why Erlang isn’t telling heart that it’s ok, which in turn means heart isn’t petting the hardware watchdog.

As for changing the hardware watchdog timer, it’s already set to the longest time. If you trace through bcm2835_wdt.c, the max time looks like it’s just under 16 seconds and, due to truncation, it shows up as 15 seconds when queried.

Best bet is to comment out heart like above. I don’t know how 30 seconds could be too short on the heart timeout, but you could also update that.

Hope you find the issue.

Thanks @fhunleth, I have figured out the problem. It is related to an issue in Kino. I have reported the issue:
