My plan is sadly more boring and basic: I just use Zigler to wrap the existing Rockchip C libraries that drive these hardware accelerators.
These libraries are already meant to work together as a hardware pipeline from Rockchip. Each pipe element is a hardware accelerator, and the handoffs between them are zero copy, using the processor's purpose-made DMA buffers. The RGA draws the YOLO rectangles directly onto the DMA buffer.
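The zero-copy handoff boils down to every stage touching the same buffer rather than receiving a copy. A toy Python sketch of that idea (the stage names and structure are mine for illustration, not Rockchip's API; the real pipeline passes DMA buffer handles between the C libraries):

```python
# Each "stage" works in place on the same buffer, so nothing is copied.
# In the real pipeline these are hardware blocks sharing a DMA buffer.

def isp_stage(frame):
    frame[0] = 1          # stand-in for ISP processing
    return frame

def npu_stage(frame):
    frame[1] = 2          # stand-in for NPU inference side effects
    return frame

def rga_stage(frame):
    frame[2] = 3          # stand-in for the RGA drawing boxes in place
    return frame

buf = bytearray(16)
out = rga_stage(npu_stage(isp_stage(buf)))
assert out is buf         # the identical buffer all the way through: zero copies
print("zero-copy handoff ok")
```

The point of the assertion is that every stage returns the very object it received; a copy anywhere would break the identity check.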
The only odd duck out is the initial use of the ISP before the NPU; that part is less pipe-integrated. The ISP takes a Bayer RGB input and outputs a normal picture frame, while the NPU wants an interleaved data structure. So there are two options to solve that: 1. Modify the input of the AI models to take the ISP output as direct input. No conversion needed in the pipeline then, but a big hassle and likely to cause the odd error. Or 2, which I have chosen: use the RGA accelerator to do the conversion. The RGA should have no issue keeping up with the max 60 fps flow rate (a limit of the 2x30 fps ISPs), but it does add a bit of latency.
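To make the layout distinction concrete, here is a minimal sketch of the repack that option 2 hands to the RGA, assuming the ISP output is planar RGB and the model wants interleaved pixels (an assumption for illustration; the real conversion runs in librga on DMA buffers, never in software like this):

```python
# Planar layout stores each channel as its own plane: RRRR.. GGGG.. BBBB..
# Interleaved layout stores pixels as RGBRGBRGB..; that's the repack sketched here.

def planar_to_interleaved(planes, w, h):
    """planes: [R, G, B] lists of length w*h -> one interleaved RGBRGB... list."""
    out = []
    for i in range(w * h):
        for plane in planes:
            out.append(plane[i])
    return out

# A 2x1 "image": R=[10, 11], G=[20, 21], B=[30, 31]
print(planar_to_interleaved([[10, 11], [20, 21], [30, 31]], 2, 1))
# -> [10, 20, 30, 11, 21, 31]
```

At 60 fps the RGA has a budget of roughly 16.7 ms per frame for this repack, which is why keeping up is no problem while a little latency is still added.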
When the NPU wants a new frame, it has to be resized to the model input size and stored as a copy in the NPU ring buffer; the RGA does that too. The NPU setup is 4 frame slots for 3 NPU cores, so one slot is always being loaded and a fresh frame is ready whenever a core needs it. (That architecture takes 3 times more memory, so maybe I'll add an option to use 3 cores on 1 frame. That apparently has less overall throughput though.)
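A toy simulation of the 4-slot/3-core round robin described above (the scheduling here is my own illustration, not the RKNN API): at every tick the three cores each hold one slot, and exactly one slot is free for loading the next frame.

```python
# 4 ring-buffer slots rotating over 3 NPU cores: the cores always hold
# three distinct slots, leaving exactly one free for the RGA to fill.

SLOTS, CORES = 4, 3

def step(t):
    """Slot indices held by the 3 cores at tick t, plus the slot being loaded."""
    held = [(t + c) % SLOTS for c in range(CORES)]
    loading = (t + CORES) % SLOTS
    return held, loading

for t in range(8):
    held, loading = step(t)
    assert loading not in held                   # loading slot is never in use by a core
    assert len(set(held + [loading])) == SLOTS   # all 4 slots accounted for
print("one slot is always free for loading")
```

This also shows where the 3x memory figure comes from: three full model-input frames are live in the cores at once, plus the one being staged.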
I’ve used Membrane some before, but I wouldn’t say I’m really familiar with it. It is more like keyboard pecking: observe and hope for the best, rinse and repeat until the tests are happy. Over time some more systematic wisdom might accumulate. I plan to pipe the video output into Membrane to stream it, but there is still work to be done implementing the above.
References are the way here for sure, and the libraries and their DMA buffers do zero copy along the whole hardware pipeline, with the exception of the side branch where the NPU grabs a resized copy. To get the main pipeline output out of the DMA buffer at the end, I currently also do a copy. I like to think there is a better solution to be found.
The NPU on the RK3588 is not supported by XLA, so by extension I figured likely not by EXLA or Nx either? That was my initial thinking anyway. And then I found that this 3588 has hardware accelerators sticking out everywhere, so I thought I’d try to make it possible to compose them, line them up, and use them from Elixir.
It makes sense to use what’s available and Zig seems a nice touch for wrapping it up well.
Yeah. I think different people and applications will want different things. Some will just want a processed but full-quality frame and work with that from Elixir, some want a video stream. Some want to involve the NPU at some stage but in what way, what model, what fps, I think there are devils all over those details.
I mean, put together what you need, and that’ll give anyone else stuff to build from.
Ah, there is some leftover reference from the larger kiosk build. The fix is likely just removal. I’ll try to check for any other leftovers too before fixing.
I did try to build that repo in isolation to check, but I guess the cunning computer grabbed that one from elsewhere.
You got hold of a Rock 5T or adapting to the Orange equivalent?
The pipeline composing will be a separate library, I thought. The current repo is thus fairly naked hardware bring-up, but with this library on top it would give Elixir access to all the hardware bits I find interesting.
Edit: I told Claude to go fetch the issue and check for more of similar nature. He disposed of that one and another similar line. So the error should be gone. Hopefully didn’t mess up anything else in the process.
I am currently just building to see if it will boot. I imagine there might be a bunch of tweaks needed for the Orange Pi, so I might grab billal’s system if it doesn’t work.
Separate library is probably the right move, keep the system lean and strictly a compile-time concern.
I could not get rid of all of Radxa’s U-Boot, and this is all on Radxa’s proprietary BSP 6.1 kernel. Well, maybe the Orange is so similar it just runs? Crossing fingers!
The library might require a few changes in the kernel to make the zero copy composing possible, so there will likely be some more patches at that point.
Speaking of patches, the BSP 6.1 patch in the repo is a leftover from the NPU register exploration. It should not be needed anymore; bad cleanup on my part. I removed that too.
Doing a full throughput test of the library now. Any issues that require kernel modifications should show up, but looking good so far.
The hardware of the Radxa 5T and the Orange can’t be very different? Wifi chipset and such? Any Orange custom kernel would have to provide the same functionality, so they might be very similar (or, with luck, copied from each other). Some diff digging might solve it.
There are many products using the RK3588, so it would be a lot more interesting if this wasn’t limited to Radxa products. I kind of like the FriendlyElec and Lion SBC variants for their better housing, not to mention the industrial versions, tablets, and information panels.
So the pipeline tests, as described above, get about 45 fps with YOLOv8s. Latency is 65 ms, whereof 58 ms is NPU. Using 3 NPU cores on each frame should bring down the latency, but it is just marginally less at 50 ms, and then with just 20 fps overall. Something is off… After checking, it turns out the model itself has to be ONNX-converted with a setting for using the cores in parallel. I’ll leave that for another day. (Did the check: no latency improvement. It is optimized for throughput, not latency, it seems, so round robin with 1 frame per core is the way to go.)
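The measured numbers can be sanity-checked with some back-of-envelope arithmetic (figures taken from the test run above; the observed 45 fps sits below the theoretical round-robin ceiling because of the rest of the pipeline):

```python
# Sanity check of the measured figures (58 ms NPU, 50 ms fused, from the post).

npu_latency_rr = 0.058              # seconds per frame per core, round-robin mode
ideal_rr_fps = 3 / npu_latency_rr   # three cores in parallel on different frames
print(round(ideal_rr_fps, 1))       # -> 51.7, the ceiling above the measured 45 fps

npu_latency_fused = 0.050           # seconds, all three cores on one frame
serial_fps = 1 / npu_latency_fused  # frames now go through one at a time
print(round(serial_fps, 1))         # -> 20.0, matching the measured 20 fps
```

So the fused mode trades only 8 ms of latency for less than half the throughput, which is why round robin with 1 frame per core wins.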
Orange Pi 5 Plus. Maybe we are overthinking it. The device tree is supposed to cater for hardware differences, and the 3588 and the various libraries for the accelerators are all the same. Thus, I figured it might be worth trying a simple mix: the Radxa 5T setup but with the Orange Pi 5 Plus device tree instead. (The 40-pin IO header might be different; worth double checking first.) So, here goes nothing: the Radxa 5T setup with an Orange Pi 5 Plus device tree instead: