How to make Nerves use all cores of an ARM processor with different core types?

Hi everyone,

Some time ago I received a Luckfox Lyra Ultra (https://www.luckfox.com/Luckfox-Lyra-Ultra) that was shipped to me by mistake. As it seemed like a handy format, and I had no other use for it, I decided it would be an interesting board for learning how to get Nerves running on a new target. It has been more interesting than I imagined, but it finally gives me iex and runs programs.

The Rockchip RK3506B processor on this one has three ARM Cortex-A7 cores and one ARM Cortex-M0 core. I got Nerves running on the three A7s but not on the M0. The M0 can be started and run on its own in parallel, but not as part of Nerves.

This is a random hobby project anyway, so it is not that important, but the learning process is also a warm-up for using the Rockchip RK3588 with Nerves. That one has four of each type of processor core, so a good solution for using all of them would be nice.

I’ve tried searching around, so sorry if this is a recurring question, but how is this best handled?

My understanding of M-cores is limited, but I understand them as being essentially an integrated microcontroller on the device. Nerves (with Linux) won’t go on there.

Now, if it has access to enough memory and someone puts in the effort to add support, you might be able to run Elixir on it via AtomVM, since that runs on microcontrollers.

My expectation is that Nerves would be able to program the M-cores and talk to whatever runs on there through some shared mechanism. But not that Nerves runs on those cores.

Yes, it slowly dawned on me after posting that the M0 core is essentially the same kind of core as used on the RP2040 (which has Cortex-M0+ cores). If a core is not capable enough, it makes sense that it is just ignored. It seems it was added more as a coprocessor for GPIO pins and so on. If I can find a half-justified use case I’ll play a bit with that and see how it goes.

But for processors whose different cores are all capable enough, I assume the logic is that Nerves will use all of them? And, for better or worse, match the workload to their relative capabilities? That would be quite impressive out of the box!

Yeah. It will do what the BEAM does. For heterogeneous variants like performance/efficiency cores you can try that stuff on an Apple Silicon Mac. Fundamentally, for Nerves we are talking about the BEAM on a Linux kernel. It’ll do things in the usual manner, which should be solid. I don’t think the BEAM does much to be extra smart about placement, which might mean we can’t get as much of the efficiency wins. Not sure.

I’m having a go at the Radxa Rock 5T, which has the 3588 with four A76 cores and four A55 cores. I asked Claude and got the reply that the A76 and A55 are binary compatible and that the scheduler handles core selection, which seems to make sense. Whether that selection is random, so that the faster cores might sit idle while the slow ones are busy, remains to be seen. The Rock 5T seems far more complex than the Luckfox Lyra Ultra, so I’m not sure I’ll get it going at all.

Well, this turned into a very long day. The 4*A76 and 4*A55 show up as 8 cores. I guess the faster cores will just finish and get more work faster.
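
To see which of the 8 CPUs belong to the fast cluster and which to the slow one, one option is to read the per-CPU capacity value that Linux exposes in sysfs on ARM. Below is a minimal sketch in Python, assuming the kernel provides `/sys/devices/system/cpu/cpuN/cpu_capacity` (this depends on the device tree carrying capacity information, so it may be missing on some builds). The function name `group_cores_by_capacity` is made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path


def group_cores_by_capacity(sysfs_root="/sys/devices/system/cpu"):
    """Return {capacity: [cpu indices]} read from each cpuN/cpu_capacity file.

    CPUs without a cpu_capacity file are skipped, so an empty dict means
    the kernel does not expose capacity information here.
    """
    groups = defaultdict(list)
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        match = re.fullmatch(r"cpu(\d+)", cpu_dir.name)
        capacity_file = cpu_dir / "cpu_capacity"
        if match is None or not capacity_file.is_file():
            continue
        capacity = int(capacity_file.read_text().strip())
        groups[capacity].append(int(match.group(1)))
    return {cap: sorted(cpus) for cap, cpus in groups.items()}


if __name__ == "__main__":
    # Highest capacity first: that group is the faster cluster.
    for capacity, cpus in sorted(group_cores_by_capacity().items(), reverse=True):
        print(f"capacity {capacity}: cpus {cpus}")
```

On a board with two core types this should print two groups, with the higher capacity marking the faster cluster. On the BEAM side, `System.schedulers_online/0` in iex reports how many schedulers are running, which is a quick way to confirm that all 8 cores are in use.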

It should be an interesting SBC. Ethernet, Wifi, sound, the GPU, VPU and NPU all report for duty. And so do the HDMI outs and the interesting HDMI in. Whether all work as expected is for testing and fixing another day.

Are these custom systems up anywhere for reference?

The Luckfox Lyra Ultra will likely remain a hodgepodge happenstance side project unless I find some use for it to do more. So while it boots, I wouldn’t say it is tested in any way.

The Radxa Rock 5T, which is very similar to the 5B and 5C I imagine, has two possible use cases for us so more work is going into that. I’m not familiar enough with the Nerves world to know if it complies with expected practice and features outside of our specific use cases, but once we have a more tested and confirmed working image here it should be fine to share for what it is.

Sounds plenty 🙂

If people want perfection they can perfect it themselves. Always good to have more systems for reference.

So this is a bit of a rabbit hole, except there be dragons in that hat! Since I’m just waiting for a rebuild, I’ll do a little update:

The Radxa 5 series have two workable approaches, but neither is perfect. They are imperfect in different ways though:

  1. The official Radxa setup forked off at about Linux kernel 6.1, and it relies on many proprietary blobs and drivers to make things like the NPU and VPU work. Reportedly it does so just fine. I made an image with that, but then ran into trouble when I wanted a browser kiosk solution with Mali GPU acceleration. I tried several approaches, but in the end concluded that I couldn’t make the Mali GPU driver work with the 6.1 kernel needed for the other stuff. For the 5T I also had to backport the Wifi driver. (For reference, if anyone else has a go at this version.)

  2. The other path, which initially at least was much smoother, was going with the mainline Linux kernel and the amazing work of Collabora and all their patches and improvements for the RK3588. Getting a kiosk up and running with Mali GPU acceleration was just fine. Thus the setup I’m working on now is based on their stable 6.18 Linux kernel for the RK3588. No proprietary, out-of-reach secret blobs, but on the other hand not (yet) full NPU and VPU functionality.

    For the NPU, though, Tomeu Vizoso has obviously put serious work into making an open source alternative. Drivers and libraries like Rocket, Mesa and Teflon make the NPU available without the Radxa proprietary blobs. The NPU has 3 cores, though, and currently the above will only use one. Thus I’ve put up a round robin worker pool where each worker gets its own video frame. 3 workers are busy at the NPU at any time while a 4th worker is busy loading up the data to use. That seems to work pretty well. XLA does not support the NPU yet, though, so a nice Elixir EXLA/Axon/Nx solution seemed out of reach. Thus it is now built around a Rust NIF.

    With this open source alternative, the VPU functionality is currently limited to full HD decoding in various codecs and to JPEG encoding only. For higher resolutions it seems some OpenCL work might be needed.

    The headphone sound has had me stumped for hours and demoralized. The chip is old and has been used on many other SBCs for years, so one would think that how to make it work would be well explored and documented. I finally found the 37-page user guide for that chip, and most pages seem to be registers to combine and set just right. If I happen to get sound I’ll play the lottery the same day for sure!

    The 3588 also has two ISPs handling up to 48 megapixels. That was a welcome surprise, although I still have no idea how to actually access or use them. Later is probably the right time for that.
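
The round robin worker pool mentioned above for the NPU could look roughly like the sketch below. It is written in Python rather than Elixir or Rust purely for illustration and is one interpretation of the description, not the actual implementation: one worker thread per NPU core, frames handed out in round-robin order, and the calling thread playing the role of the loader that prepares the next frame while the workers are busy. `fake_infer` and `run_pool` are hypothetical names; `fake_infer` stands in for the real NIF call.

```python
import queue
import threading

NPU_CORES = 3  # the post states that the NPU has 3 cores


def fake_infer(core_id, frame):
    """Hypothetical stand-in for the real call that runs one frame on one core."""
    return (core_id, frame)


def run_pool(frames, infer=fake_infer, cores=NPU_CORES):
    """Hand frames to per-core workers round-robin; return results in frame order."""
    # One single-slot inbox per core, so the loader stays at most one frame ahead.
    inboxes = [queue.Queue(maxsize=1) for _ in range(cores)]
    results = {}
    lock = threading.Lock()

    def worker(core_id):
        while True:
            item = inboxes[core_id].get()
            if item is None:  # sentinel: no more frames for this core
                return
            index, frame = item
            output = infer(core_id, frame)
            with lock:
                results[index] = output

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(cores)]
    for t in threads:
        t.start()

    # Loader role: frame i always goes to core i % cores.
    for index, frame in enumerate(frames):
        inboxes[index % cores].put((index, frame))
    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    return [results[i] for i in range(len(results))]
```

The single-slot inboxes are a deliberate choice in this sketch: they keep memory use bounded and make the loader wait when a core falls behind, at the cost of not rebalancing work if one core is slower than the others.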

Another update. We currently have a setup using RPi5 + Hailo 8 (26 TOPS). That works for our use, but is not ideal for various reasons outside of the AI side. The RK3588 is a beast with specialized hardware all over the place in comparison, and if I get the built-in NPU working properly then it is also a less expensive option. Our test case for that is based around Yolo v8, as we have that as a reference on the RPi5 setup and as our actual use case.

Thus I tried to get Yolo v8s running on the above-mentioned open source alternative. That hit a wall. Teflon, still being very young, is missing many of the required operation implementations. So I implemented all the ones needed for Yolo v8, including the activation via LUT, but with the notable exception of transpose, which defaulted to CPU fallback. I then achieved a frame rate of 0.2 fps. There seems to be a lot of copying and reshaping between operations, which took by far most of that time. It might be a Teflon issue, or it might be a me issue, but either way it is not something for me to chase now.

I really want to work from Elixir land, and had by now implemented most operations needed for Yolo v8 anyway, so I dropped Teflon as a middleman and aimed for addressing registers directly via the Rocket NPU driver. But I could not get proper multi-core, multi-surface, fused operations with interleaved output working via Rocket. I then used Rocket only for initialization and did the register poking directly from Rust NIFs outside of Rocket. This eventually got me to a working solution for multi-surface, fused operations and interleaved outputs, with a Yolo v8s frame rate of 3-4 fps. That is still slow, though, due to the use of many CPU fallbacks. The NPU seems to have alternative paths for some operations, and Rocket seems to have one of those paths hardcoded during initialization. That is to the detriment of the other path, and thus all of those operations end up going to the CPU instead, which spends lots of time copying and reshaping and so on. At this point I figured there might be more dragons ahead, so I stopped going in that direction.

So… By now I have lots of the internal workings of the NPU mapped out, so I should be able to make the user side of an implementation. And Rockchip does provide an open source NPU driver to connect to. That is attractive, as the same driver is used by many of their processors in different price and performance ranges, most of which seem Nerves friendly. Further, Ultralytics (Yolo) now supports Rockchip’s RKNN model format directly. That said, Rockchip does supply its own 6.10 kernel for a reason, and their driver will not run voluntarily on the 6.18 Collabora kernel. So my focus has shifted to what it takes to patch that kernel into running the official NPU driver, and then using it from an Elixir perspective.

So far I have the NPU initializing and powering up correctly, and simple Conv2D operations working, with two minor changes to the official driver and some patches to the Collabora kernel. Hopefully I will be able to use the official driver entirely unchanged later, but for now I just want something working at all. If I get there I will start a new thread, as the headline for this thread is kind of off topic by now. 🙂

Please do start a thread for your explorations of the RK3588. Especially the NPU bits, but the board overall too.

I have nothing to add except I am following along and have the Orange Pi 5 Plus on my shelf .. waiting.

Yes, that should be very similar.

I will do that once I am fairly sure that I got all these things correct. A purely practical question: in what format(s), and where, should any of these setups be shared?

For sharing so other people can use it?

It could be a Buildroot package or a Nerves system; it depends on which part it is.

For the forum threads themselves I like a board-specific project thread.

I see a few different options that all would make sense to share I guess:

  1. A clean boot image. That is tiny.
  2. The kiosk solution with a working Chrome and Wifi settings with persistence. Still smallish.
  3. A setup with hopefully working NPU and maybe other sharable and relevant bits and parts. Currently at about 500 MB+.

I would suggest a Nerves system that has as many hardware bits and bobs working as possible, but without things that currently require very invasive or large deps (systemd comes to mind).

Kiosk I’d recommend as a separate system. At least that’s been the way with the official ones on rpi4 and 5. Really annoying (heavy, slow) builds with the browser engines and all.

That makes sense. I will see what else I can get going within decent effort after the NPU.

Excited about the NPU. Then I can use it with ex_nvr. I already have an implementation for the Pi Hailo accelerator.