Any idea why calling Application.started_applications on a Nerves shell would block for around 2 minutes when the primary network interface is disconnected?
This Nerves CM4 device has a secondary 4G modem interface, eth1. When the device’s wired eth0 is disconnected, the Nerves ssh shell is still accessible over the secondary interface, but some functions of the Application module block! The same function call does not block if issued over a reverse websocket that these devices have configured for remote management. Those functions only block when called over the Nerves ssh shell. Reconnecting the primary eth0 interface does not clear the condition; it persists until the device is rebooted.
My initial guess is that it is trying to send the IO output to the wrong place.
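One way to poke at that guess, as a rough sketch only: look at the group leader of the shell process, since that is where IO requests from the shell are sent. The keys passed to Process.info/2 are just the ones that seemed useful here; nothing in this snippet is specific to Nerves.

```elixir
# The group leader is the process that receives this shell's IO requests.
gl = Process.group_leader()

# Process.info/2 returns nil if the process is dead, otherwise a keyword list.
# A growing message queue would hint that IO requests are piling up unanswered.
info = Process.info(gl, [:status, :message_queue_len, :current_function])
```

Comparing the result on the ssh shell with the result on the serial shell might show whether the two sessions differ in where their output goes.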
@lawik That’s an interesting theory. I’m not sure how it would block Application.started_applications, but let me dump what I know.
The ssh client’s IO server with OTP 28 only supports Unicode output. If you write latin1, the request is ignored and the ssh session hangs. For example, running IO.binwrite("hello") at the IEx prompt will trigger it. This is what was happening with RingLogger, and it was a bug there, since RingLogger’s data is Unicode. When that call was added, there wasn’t a String.replace_invalid/2, and using binwrite was a hack to avoid raising on invalid Unicode.
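For illustration, here is roughly what the sanitize-then-write approach looks like. This is a minimal sketch and not RingLogger's actual code; the log line and the "?" replacement are made up for the example, and String.replace_invalid/2 needs Elixir 1.16 or later.

```elixir
# Hypothetical log line containing a byte (0xFF) that is not valid UTF-8.
raw = <<"temp: 21", 0xFF, "C">>

# Replace the invalid bytes so the result is a valid UTF-8 string.
sanitized = String.replace_invalid(raw, "?")

# IO.puts sends a Unicode request, unlike IO.binwrite, which sends latin1.
IO.puts(sanitized)
```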
The call to Application.started_applications has a 5 second timeout on the call to the application controller, so it not returning for 2 minutes seems pretty impressive.
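To tell a slow application controller apart from a stuck shell, the call can be timed in a separate process. This is only a sketch: the 10 second wait and the result tuples are arbitrary choices for the example.

```elixir
# Time the call in a separate process and turn a timeout exit into a value,
# so a slow application controller cannot crash the calling shell.
task =
  Task.async(fn ->
    :timer.tc(fn ->
      try do
        {:ok, Application.started_applications(5_000)}
      catch
        :exit, reason -> {:exit, reason}
      end
    end)
  end)

# Wait longer than the 5 s call timeout, then give up and kill the task.
result =
  case Task.yield(task, 10_000) || Task.shutdown(task, :brutal_kill) do
    {:ok, {micros, {:ok, apps}}} -> {:returned, length(apps), div(micros, 1000)}
    {:ok, {_micros, {:exit, reason}}} -> {:call_exited, reason}
    {:exit, reason} -> {:task_exited, reason}
    nil -> :no_reply_within_10s
  end
```

If `result` comes back quickly with `{:returned, count, ms}` but the shell still hangs when printing the full list, that would point at the output path and not at the application controller.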
This is weird. Weird enough that given that it’s correlated to networking, I’d blame DNS as a joke, but I honestly don’t know. Very, very curious.
I checked that UTF-8 is selected, and the Nerves shell displays Unicode characters correctly.
The Nerves shell over the serial port has no issues. All function calls return promptly regardless of the state of the network interfaces.
Reconnecting to the Nerves shell after killing the blocked ssh connection from the Linux terminal works. The Nerves ssh daemon keeps working, yet the same functions keep blocking.
My guess is that started_applications blocks indefinitely and the 2m timeout is just the ssh connection giving up.
What Erlang and Elixir versions are you on? Which applications are loaded? Or, if that call also hangs, which deps do you have that might start applications? It seems like something is breaking in a fun way somewhere in the release.
One thing that’d be interesting is to remove all the deps and code that aren’t standard Nerves stuff, to bisect between a fundamental issue in Erlang/Elixir/Nerves/hardware and an issue in the project code.
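Something along these lines should collect that information in one go. It is a quick sketch, and the map keys are just names picked for the example.

```elixir
# Gather runtime versions plus the name and version of each loaded application.
info = %{
  elixir: System.version(),
  otp: System.otp_release(),
  loaded:
    Application.loaded_applications()
    |> Enum.map(fn {app, _description, vsn} -> {app, to_string(vsn)} end)
    |> Enum.sort()
}

# limit: :infinity keeps the application list from being truncated.
IO.inspect(info, limit: :infinity)
```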
For example, I believe NIFs can break some assumptions.
Depending on your setup you might run mix nerves.new and just reference your system to run a “clean” project.
I got another identical hardware set with the same firmware, and the condition won’t replicate there… This shifts the search space to misconfiguration, data corruption, or hardware failure. Thanks for your valuable input. Learned a lot.