Looking for feedback on my thesis project: Distributed BEAM Compute & Capability-Based Routing for Nerves


Hi everyone,

I’m working on my college thesis: Distributed BEAM Compute and Capability-Based Routing for the Nerves Platform. I’d
love feedback from people who’ve worked on distributed Elixir, Nerves, or P2P systems.


The Problem

Nerves devices are typically managed from a central server. I want to flip that — a network of embedded BEAM nodes that are fully autonomous, self-organising, and can route work to each other based on what they can do, not where they are.

The target environment is unreliable networks — construction sites, warehouses, field deployments — where devices come and go, may be behind NAT or CGNAT, and there’s no central infrastructure.


The Vision

The API should feel like native OTP. A node advertises what it has:

Network.start_advertising([cpu: 4, gpu: 2, storage: 1024])

Any other node can then route work to a capable peer transparently:

# Spawn a process on any node with a camera

Network.spawn([camera: true], fn -> capture_image() end)

# Run a task on any node with enough GPUs, await the result

result =
  Network.Task.async([gpu: 2], fn -> run_inference(data) end)
  |> Task.await()

# send/receive work normally across the mesh

pid =
  Network.spawn([storage: 512], fn ->
    receive do
      {:write, data} -> store(data)
    end
  end)

send(pid, {:write, payload})

The key goal: OTP works normally. Task, GenServer, send/receive, linking, monitoring — all transparent across the
mesh.


Functional Requirements

  • Nodes auto-discover each other (zero config)
  • Nodes advertise capabilities and can be found by capability query
  • NAT and CGNAT traversal
  • Sparse mesh — nodes gossip, not fully connected
  • Tasks are scheduled, monitored, and fault-tolerant
  • Standard OTP patterns work transparently across nodes

Current Architecture

After a lot of research, here’s where I’ve landed:

Discovery & Transport: libp2p (Rust via Port)

Elixir’s libp2p support is limited, so I’m running a Rust libp2p binary as an Erlang Port. It handles:

  • mDNS for local zero-config discovery
  • Kademlia DHT for discovery across subnets and NAT
  • DCUTR hole-punching and relay fallback for CGNAT
  • A JSON line protocol to Elixir over stdin/stdout
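
For the curious, the Elixir side of the bridge is just a thin GenServer around a Port. Here's a minimal sketch — the binary path, module names, and event fields are placeholders rather than the real protocol, and it assumes Jason for JSON decoding:

defmodule Network.Libp2pBridge do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Spawn the Rust binary as a Port; each output line is one JSON event.
    port =
      Port.open({:spawn_executable, "/usr/bin/libp2p_bridge"}, [
        :binary,
        {:line, 8192},
        :exit_status
      ])

    {:ok, %{port: port}}
  end

  @impl true
  def handle_info({port, {:data, {:eol, line}}}, %{port: port} = state) do
    case Jason.decode(line) do
      {:ok, %{"event" => "peer_discovered"} = event} ->
        send(Network.PeerManager, {:peer_discovered, event})

      _other ->
        :ok
    end

    {:noreply, state}
  end

  def handle_info({port, {:exit_status, status}}, %{port: port} = state) do
    # Let the supervisor restart the bridge if the Rust process dies.
    {:stop, {:bridge_exited, status}, state}
  end
end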

Overlay Network: Partisan

Standard Erlang distribution (EPMD, full mesh) doesn’t fit — it assumes reliable networks and full connectivity.
Partisan is a replacement distribution layer that supports:

  • Configurable topologies (sparse, peer-to-peer, client-server)
  • Works without EPMD
  • Gossip-based membership (HyParView) and plumtree broadcast

When libp2p discovers a peer, Partisan connects to it. From that point, all BEAM-level messaging goes through
Partisan.
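
To sketch that hand-off: Partisan's join API has changed across versions, so treat the node-spec shape, module name, and JSON field names below as assumptions, not a definitive implementation.

defmodule Network.PeerJoiner do
  # Invoked when the Rust bridge reports a peer it has discovered and dialled.
  def handle_discovered(%{"node" => name, "ip" => ip, "port" => port}) do
    {:ok, ip_tuple} = :inet.parse_address(String.to_charlist(ip))

    # Hand the peer to Partisan so BEAM-level messaging can flow over the overlay.
    # String.to_atom/1 is fine for a prototype, but don't do it with untrusted input.
    :partisan_peer_service.join(%{
      name: String.to_atom(name),
      listen_addrs: [%{ip: ip_tuple, port: port}]
    })
  end
end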

Capability Registry: Horde

Once nodes are connected via Partisan, capabilities need to be replicated across the cluster. Horde provides a
CRDT-based distributed registry that tolerates node churn well. Each node registers its capabilities into
Horde.Registry on startup; Network.resolve([camera: true]) queries it to find a matching node.
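
To make that concrete, here's a minimal sketch of the register/resolve pair. It assumes a Horde.Registry started with unique keys under the hypothetical name Network.Registry, so each entry embeds the node name and matching is done with Horde.Registry.select/2; all module and registry names here are placeholders.

defmodule Network.Capabilities do
  # Register this node's capabilities. The registering process owns the entries,
  # so call this from a long-lived process (e.g. a GenServer started at boot).
  def advertise(caps) when is_list(caps) do
    Enum.each(caps, fn {cap, amount} ->
      Horde.Registry.register(Network.Registry, {:capability, cap, node()}, amount)
    end)
  end

  # Find all nodes advertising a given capability with at least `min` units.
  def resolve(cap, min \\ 1) do
    Horde.Registry.select(Network.Registry, [
      {{{:capability, cap, :"$1"}, :_, :"$2"}, [{:>=, :"$2", min}], [:"$1"]}
    ])
  end
end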

Scheduling: Horde.DynamicSupervisor

Network.Task.async routes to a node via the registry and spawns under a Horde.DynamicSupervisor. If the node dies
mid-task, Horde can restart it elsewhere.
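
And a rough sketch of the routing path, reusing the hypothetical Network.Capabilities.resolve/2 above. For brevity this version runs the function under a plain distributed Task.Supervisor (assumed to be started on every node as Network.TaskSupervisor) rather than showing the Horde.DynamicSupervisor hand-off:

defmodule Network.Task do
  # Route to the first node advertising the requested capability, then run the
  # function there. Returns a %Task{} that the caller can Task.await/1.
  def async(requirements, fun) do
    [{cap, min} | _] = requirements

    case Network.Capabilities.resolve(cap, min) do
      [target | _] -> Task.Supervisor.async({Network.TaskSupervisor, target}, fun)
      [] -> {:error, :no_capable_node}
    end
  end
end

Since Task.Supervisor.async/2 returns a regular %Task{}, the Network.Task.async |> Task.await() example from the Vision section keeps working unchanged.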

The stack:

Application Layer
Network.spawn / Network.Task.async
↓
Capability Registry (Horde.Registry)
resolve([camera: true]) → node
↓
Overlay Transport (Partisan)
Node.spawn / :partisan_rpc
↓
Discovery & NAT (libp2p via Rust Port)
mDNS + KadDHT + DCUTR/Relay

Open Questions

  1. Horde + Partisan compatibility

Horde internally uses :pg and :erpc, which assume standard Erlang distribution. Partisan replaces the dist layer.
Has anyone successfully run Horde over Partisan, or is this a known incompatibility? Would I need to implement a
simpler CRDT registry directly over Partisan’s broadcast?

  2. KadDHT for capability registry vs. gossip

I’m currently leaning toward using Kademlia DHT (via the libp2p Rust bridge) only for bootstrap and peer discovery,
and then using Partisan’s plumtree broadcast for capability propagation once connected. Does this make sense, or is
there value in keeping capabilities in the DHT for nodes that aren’t yet connected?

  3. Rust Port vs. NIF

The Rust libp2p binary runs as an Erlang Port (stdin/stdout JSON). This is safe for embedded (a crashing NIF takes
down the BEAM), but adds latency and serialisation overhead. For a thesis prototype, Port seems right. Anyone have
experience with this trade-off on Nerves specifically?

  4. Task transparency

For send/receive to work across nodes, the remote PID needs to be routable back to the caller. With standard Erlang dist this is automatic. With Partisan — does forwarding remote PIDs work transparently, or does Partisan require
explicit addressing?


What I’m Not Doing (Yet)

  • No libcluster — it assumes reliable networks and standard dist
  • No central registry or broker
  • No custom transport (relying on libp2p for that)

Things I might consider later:

  • Running BEAM distribution over libp2p streams


The Rust bridge, Partisan config, and a basic Network.spawn stub are all in place. The registry and task routing are what I’m building next.

Would love feedback on the architecture, especially the Horde/Partisan question and whether the Rust Port approach is sensible for Nerves. Thanks!


6 Likes

I have zero experience with Nerves, unfortunately, but I just wanted to cheer this project on. It looks very interesting and important. Best of luck on your journey!

4 Likes

I can’t claim to have done many Nerves projects, but I’ve done a few. Maybe more importantly, I found the theme interesting enough to read all of it. So here are my immediate brain drops:

I think differently about this part. I would not want to know what hardware a node has. That doesn’t really tell me anything about the actual functionality of a node.

What is interesting to know is what a node can do. I do not want to know how a node does it, nor should I need to know that in order to use it. For the most independent, exchangeable and future-proof nodes, I think a node’s functionality should be the focus.

For instance, GenICam is a standard for using various cameras across very different hardware standards and capabilities. Each camera has an XML file describing what it is, what it can do, and how to make it do that. Your program can ask for that file and thus learn what the camera can do. That list is typically quite extensive, and that is just for cameras. So I think a key challenge will be to define a general, clearly specified format that works for all hardware functions. (Maybe USB plug-and-play, with all the various hardware involved and supported there, could be an inspiration.)
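
Just to illustrate what I mean, a tiny made-up example of a functionality-first descriptor (all names invented):

# Purely illustrative: describe what a node can do, not what silicon it has.
%{
  node: :"cam-07@mesh",
  functions: [
    %{name: :capture_image, params: %{formats: [:jpeg, :raw], max_resolution: {1920, 1080}}},
    %{name: :store_blob, params: %{max_bytes: 512 * 1024 * 1024}}
  ]
}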

I would probably have tried to use and/or implement libp2p inside Elixir and avoid using Rust. Just squinting at it from a distance, distributed networking tasks seem more at home in Elixir than in Rust?

I use Rust NIFs rather than ports. From the documentation of one of my projects (not Nerves, but using Rust NIFs with Rustler):

Crash Resilience — Two-Layer Protection

SearchTantivy prevents BEAM VM crashes through two complementary mechanisms:

Layer 1 — NIF Panic Catching: Every Rust NIF entry point is wrapped with std::panic::catch_unwind. If tantivy panics (assertion failure, index corruption, unexpected state), the panic is caught and converted to {:error, "NIF panic: ..."} instead of crashing the BEAM VM. This is transparent — you handle these like any other error.

Layer 2 — OTP Supervision: The SearchTantivy.Index GenServer is managed by a DynamicSupervisor. If a GenServer crashes (unexpected message, linked process death), it is automatically restarted. The :one_for_all top-level strategy ensures the Registry and DynamicSupervisor stay in sync.

Error handling pattern:

# NIF panics and normal errors are handled the same way
case SearchTantivy.search(index, query, limit: 10) do
  {:ok, results} -> results
  {:error, "NIF panic: " <> reason} -> Logger.error("Engine error: #{reason}"); []
  {:error, reason} -> Logger.error("Search failed: #{reason}"); []
end

What can go wrong and what happens:

Failure                       What Happens              Your Code Sees
Rust panic (assertion, OOB)   catch_unwind catches it   {:error, "NIF panic: ..."}
Invalid query syntax          tantivy returns error     {:error, "query parse error: ..."}
GenServer crash               Supervisor restarts it    Next call works (or {:error, :noproc} briefly)
Index corruption              tantivy returns error     {:error, "..."} on open/search

Those are my brain drops. Interesting project. :slight_smile:

Can I ask you why those are unreliable networks? Dynamic, yes; spotty internet connection, yes, but within the cluster on site, why do you think the network is intrinsically unreliable? Here I use a narrow sense of “unreliable”: connectivity issues from problems that compound with the scale of the network.

I took your suggestion: I’m broadcasting what modules I’m hosting instead of pure hardware now, to keep it simple. I still want to push back a little on NIFs. The p2p bridge/native part runs as a daemon, and Nerves sort of lets you supervise it already. “The Bridge”, as I’m calling it, is long-running, and that’s the part that worries me: it might hog resources or mess with scheduling. For the current scope, a Port seems to serve me fine. As much as the engineer inside me wants to do libp2p in Elixir, the student in me knows I won’t make it in time for my graduation, lol. But in an ideal world, libp2p runs in Elixir and everything is BEAM native.
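
Roughly what the module-based advertising looks like now — the exact shape is still in flux, so treat these names as placeholders:

# Advertise hosted modules rather than raw hardware numbers.
Network.start_advertising(modules: [Vision.Capture, Blob.Store])

# Route on functionality the node actually exposes.
Network.spawn([module: Vision.Capture], fn -> Vision.Capture.snapshot() end)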

I’m not sure what speeds you need so ports might be just fine, and if not then it is always possible to change.

From the sound of it (distributed, unreliable networks where connections come and go), I just squinted and thought: that sounds like something that needs a solution built from the ground up for distribution, fault tolerance, and keeping nodes alive and connected.

Very good point about finishing before graduation though! Now I need to take a better look at libp2p. If that works on microprocessors, or maybe with AtomVM, that would be all the more interesting.

It looks like you’re trying to achieve a mesh network on the BEAM with a capability cluster map. There are a lot of different approaches to these problems, so let’s split it into separate problems:

  1. Some nodes are behind NAT
  2. Node discovery
  3. Network may be unreliable, and the messages can be spoofed
  4. Erlang by default requires full-mesh topology, but you can’t provide that
  5. We need capability mapping

Well, let’s address them one by one.

  1. First one is

    Some nodes are behind NAT

    You want to be using DCUTR, which is hole-punching in libp2p, but this algorithm doesn’t work reliably. Most of the time it doesn’t work at all, because it merely hacks the NAT mapping tables in routers, trying to reuse their fallback for stale connections, where a matched pair of outgoing packets from A to B and from B to A can result in an A ↔ B mapping being added to the NAT.

    So, in the real world you will most likely need a relay to transfer all the traffic, like TURN. But if you have that kind of relay, your cluster is now centralized.

    Let’s imagine we try to punch the NAT and fall back to TURN.

  2. Alright, now nodes can access each other in theory, but they still need some kind of rendezvous where they can learn each other’s addresses. Usually, some discovery service is used. You say

    mDNS

    but it works only in local networks, and we are behind NAT

    Kademlia DHT

    that one I don’t quite understand. First of all, it is the name of an algorithm; there is no global service or network by that name. Whether you use an existing network which implements some DHT or implement your own, each node still needs to know some peers before it’s deployed. But let’s imagine the node knows its peers and can access a DHT, what then? It would have to store a mapping like "my thesis project node N" -> IP address, but that needs to be verifiable, so you need some kind of DHT with a cryptographic proof that a given node may write to a given key, otherwise a malicious node could just poison the DHT.

  3. But alright, say we solved all the problems above and found some IP addresses we can reach and that we think our nodes live at. Then we need some way to verify that these are real nodes, not malicious ones. And we don’t want a middleman (for example a TURN relay, or just some malicious router) listening to our unencrypted traffic and maybe spoofing some packets. So we need TLS, and maybe some certificate verification if we don’t want to hardcode the public keys of every node in our network. Erlang distribution by default is unencrypted and uses a cookie. So you need an approach like the epmdless project takes, where each node performs a TLS handshake.

  4. Next, we either need a full-mesh topology with the default Erlang distribution, or an approach like Partisan. As far as I know, Partisan/LASP gained almost no commercial users. And that’s not because Partisan is bad (it is actually very cool), but because the built-in Erlang distribution can be tuned to support non-full-mesh topologies, to work without EPMD, etc. I’d suggest looking into WhatsApp’s talk about their huge cluster; they give an overview of their approach, and there are some starting points in the Erlang documentation about tuning. TL;DR: “use hidden nodes, disable -connect_all, disable global, tune the heartbeat timer” (a rough sketch follows after this list). It will kinda reduce all the features Erlang has by default, but still, that’s the best there is. You don’t need routing inside the cluster, because you are working over the internet, and at step 2 of my post you must already have the ability for every device to connect to every other device.

  5. Now about capability routing. It all boils down to maintaining cluster-wide state, a map from capability to a list of nodes. So it is essentially a consensus problem. It needs to be consistent, because we don’t want our Tasks to be executed on wrong or dead nodes. But our network is also bad, and the nodes are Nerves devices that can just turn off. So we need to make a CAP decision here. You use CRDTs, which means eventually consistent (AP), which means sometimes your Network.spawn will pick the wrong node. I don’t know if that’s good or bad (one way of living with it is sketched after this list).
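
For point 4, a rough sketch of that tuning as it might appear in a release’s vm.args. The flag names come from the standard Erlang docs; the values are placeholders and the exact combination depends on your OTP version:

# Don't maintain a fully connected network when nodes connect (global stops keeping full connectivity too)
-connect_all false
# Join as a hidden node so peers don't propagate this node to the rest of the cluster
-hidden
# Relax the distribution heartbeat for flaky links (value in seconds)
-kernel net_ticktime 120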
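
And for point 5, a hedged sketch of one way to live with an eventually consistent registry: treat the resolved node as a hint and verify it is still reachable before spawning. This assumes the tuned built-in distribution from point 4 and the hypothetical Network.Capabilities.resolve/2 sketched earlier in the thread:

defmodule Network.Spawn do
  # Walk the candidate nodes returned by the (eventually consistent) registry
  # and spawn on the first one that is currently connected; stale entries are
  # simply skipped.
  def spawn_checked(requirements, fun) do
    [{cap, min} | _] = requirements

    Network.Capabilities.resolve(cap, min)
    |> Enum.find_value({:error, :no_capable_node}, fn candidate ->
      if candidate in Node.list(:connected) do
        {:ok, Node.spawn(candidate, fun)}
      end
    end)
  end
end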


If I were to implement something like this in a commercial project, I would use an existing mesh overlay network like Yggdrasil. It provides e2e encryption, it’s stable and battle-tested, it provides a full mesh, it works by peering with nodes outside NAT, and it provides a virtual network interface, so it just solves all the problems from 1 to 4.

Overall it is a good idea and I wish you success with your thesis project. I can provide some other consultations or we can hop on a call, just contact me on the forum via private messages.

This would be outside of what I usually do, so I figured it would be a good exercise for my Claude skills.

So, since you will be doing the Port implementation, here is an Elixir NIF wrapper for the Rust libp2p library. It might be handy for comparison, or for whatever else you find useful. I put a little OTP application layer on top while I was at it.

The wrapper architecture with some added info.

The Elixir libp2p Rust wrapper

1 Like