Mesh - Capability-based routing for Processes on the BEAM

Hello everyone,

We’d like to share Mesh, a new open-source Elixir library developed by the Eigr community.

Mesh was created as part of the work on Spawn, where we needed a reliable and scalable way to route messages to specific groups of processes across a cluster, with deterministic behavior and minimal coordination overhead.

During Spawn’s development, we evaluated existing solutions such as :pg, Horde, and ProcessHub. These are solid libraries and work very well for many use cases. However, for our specific requirements — especially around capability-based routing, deterministic ownership, and very large-scale clustering — none of them fit exactly what we needed. Rather than forcing a model that didn’t align with our constraints, we decided to design a small, focused abstraction that could serve as a foundation for Spawn and similar systems.

That abstraction became Mesh.

What Mesh is

Mesh is a library for managing virtual processes and routing messages to them based on capabilities, rather than direct PIDs or global names.

At a high level, Mesh provides:

  • Capability-based process registration

  • Deterministic routing using shards

  • Decoupling between callers and process location

  • A model where any node can invoke a virtual process, and the system resolves where it should run

Processes register themselves with one or more capabilities, and callers route messages by capability and key. Mesh handles shard ownership and node selection, allowing systems to scale without relying on global process registries.

Example usage

Registering a process with a capability:

Mesh.register_capabilities([:game, :chat])

Routing a message to a virtual process by capability and key:

{:ok, pid, response} = Mesh.call(%Mesh.Request{
  module: MyApp.GameActor,
  id: "player_123",
  payload: %{action: "move"},
  capability: :game
})

Mesh also exposes simple functions for understanding cluster state:

Mesh.nodes_for(:game)     #=> [:node1@host, :node2@host]
Mesh.all_capabilities()   #=> [:game, :chat, :payment]

These helpers return the list of nodes that support a given capability, and the list of all capabilities registered in the cluster.

Under the hood, Mesh computes a shard from the routing key, determines the owner node for that shard and capability, and delivers the message to the appropriate process — locally or remotely — without the caller needing to know where that process lives.
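
For illustration, the routing decision can be sketched roughly like this (a simplified sketch of the idea, not Mesh's actual internals; the shard count and the sorting of nodes are assumptions):

shard_count = 16                                   # assumed fixed number of shards
shard = :erlang.phash2("player_123", shard_count)  # deterministic shard from the routing key
nodes = Enum.sort(Mesh.nodes_for(:game))           # nodes that registered the capability
owner = Enum.at(nodes, rem(shard, length(nodes)))  # owner node for this shard and capability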

This makes Mesh particularly useful for systems that need:

  • Logical actors

  • Deterministic placement

  • Clear separation between routing logic and business logic

Project status

Mesh is new and under active development. The API is intentionally small, and we expect it to evolve as we continue integrating it into Spawn and gather feedback from real-world usage.

We’re sharing it early because we believe the underlying ideas may be useful beyond our own projects, especially for people building distributed systems on the BEAM who need more control over routing semantics.

Links

Documentation:
https://hexdocs.pm/mesh/Mesh.html

Hex package:

Spawn:

GitHub repository:

Feedback, questions, and contributions are very welcome.

— Eigr community

Seems cool.

Btw, this is not consistent hashing. It’s just regular hash sharding. There is no ring; it will catastrophically reshard if a node is added.

You should use rendezvous hashing anyway. That is, if you need to worry about resharding at all (I’m not sure you do).
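
For reference, rendezvous (highest-random-weight) hashing fits in a few lines: each key picks the node with the highest hash of {node, key}, so adding or removing a node only remaps the keys that node would win (a generic sketch, not tied to Mesh's API):

defmodule HRW do
  # Pick the node with the highest hash score for this key.
  def owner(key, nodes) when nodes != [] do
    Enum.max_by(nodes, fn node -> :erlang.phash2({node, key}) end)
  end
end

HRW.owner("player_123", [:"node1@host", :"node2@host"])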

Good catch — you’re right about the terminology.

Mesh currently uses deterministic hash-based sharding, not classic consistent hashing with a ring. Shards are fixed, and processes are mapped to shards via :erlang.phash2/2.

That said, this is a deliberate design choice aligned with our use case. Mesh routes messages to virtual processes, not persistent data partitions. Actors are ephemeral, lazily created, and can be safely recreated elsewhere. As a result, minimizing “resharding cost” is not a primary concern for us.

Nodes joining or leaving do not change shard assignment — only shard ownership. There is no data migration involved, and no state is coupled to shard placement.

Rendezvous hashing is a great tool and could be explored in the future, but for now the current approach gives us simpler reasoning, O(1) routing decisions, and predictable behavior, which fits well with Mesh and Spawn’s actor model.

Still, thanks for pointing this out — we should probably avoid calling it “consistent hashing” in the docs and be more precise about the terminology. We’re always open to evolving the routing strategy as new requirements emerge.

Did you just AI-generate the answer?

It is not predictable and it is not consistent. If you had shard = 2 and nodes were [:node0, :node2, :node3], you’d get :node3 returned, but then a :node1 joins and sets the same capability, and you’d get :node2 returned, which is kinda the opposite of consistent and predictable.
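
A standalone snippet that mirrors the modulo-over-sorted-nodes selection makes the jump visible:

pick = fn shard, nodes -> Enum.at(Enum.sort(nodes), rem(shard, length(nodes))) end

pick.(2, [:node0, :node2, :node3])          #=> :node3
pick.(2, [:node0, :node1, :node2, :node3])  #=> :node2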

No, I was just trying to be polite.

As I said before, perhaps the text implied that partitions should be stateful or stable, but that is not the use case. The state of a process is not coupled to the shard; the shard only serves as a routing mechanism. Obviously, this doesn’t work for all use cases. We don’t need consensus on the shard. Sorry if I wasn’t academic or precise enough in the text; we can improve that. By the way, considering this is version 0.1.1, contributions are welcome, and I’m sure that if there’s a use case or even a bug, a PR would be very well received.

Haha, it’s alright. It’s just that every response which starts with “Good catch, you’re absolutely right” reads as an LLM output. My friend even has a t-shirt with Claude logo and this phrase on it :grin:

If you don’t need to be consistent, why do you even shard in the first place? Pick random, round-robin with counters, etc. It would be much faster than hash computation and (more importantly) would not give a false feeling of consistent routing.
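
For example, counter-based round-robin fits in a few lines with :atomics (a sketch with illustrative names, not something Mesh provides):

defmodule RoundRobin do
  # Store a single atomics counter in persistent_term and bump it on every call.
  def init, do: :persistent_term.put(__MODULE__, :atomics.new(1, []))

  def pick(nodes) when nodes != [] do
    n = :atomics.add_get(:persistent_term.get(__MODULE__), 1, 1)
    Enum.at(nodes, rem(n, length(nodes)))
  end
end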

Haha, it’s alright. It’s just that every response which starts with “Good catch, you’re absolutely right” reads as an LLM output. My friend even has a t-shirt with Claude logo and this phrase on it :grin:

I understand. Although that wasn’t the case, I’ll learn more about the terms used by AI so I don’t repeat them anymore :wink:

It seems you assume that if there’s no strong consistency then sharding doesn’t make sense (I’m assuming you thought that). But that’s not the case; sharding isn’t just for consistency (it also gives predictability, local determinism, stability, and other things).

I didn’t want shared state or inconsistent decisions from each node; with random selection you would lose local determinism, but with shards all nodes arrive at the same decision as long as the cluster is stable.

Regarding performance, I’m very happy with the benchmarks we’ve done, and we’ll keep working on that over time. And of course, we can consider consistent or rendezvous hashing as you suggest, but for this version we’re okay with the current approach.

But we can build together, I’d be happy to see a PR.

This is a brilliant concept which reminds me of Kubernetes. Having it in Elixir is tight. In that regard: is there a possibility to provide a negative-capability (taint)?

Is ‘the system’ the mesh itself, or is there an orchestrating node?

And when a node is killed, are the processes restarted at other nodes?

I don’t think the term capability-based means what you think it means. A capability is an unforgeable token of authority. I don’t see such a thing in Mesh. These atoms that you call capabilities are more like roles, groups, or identifiers.

is there a possibility to provide a negative-capability (taint)?

It would seem more like a node affinity than a taint.

And when a node is killed, are the processes restarted at other nodes?

When there is a new call to the process, yes!

These atoms that you call capabilities are more like roles or groups, or identifiers.

Yes. They are tags, labels, whatever you want to call them, and they serve more as a group affinity for nodes, in the sense that you are addressing a process to a node only if it is part of that specified “group”.

Do you think random would be meaningfully faster than hashing? I kinda doubt it would make a substantial difference.

Either way the correct approach IMO would be to load-balance with the two random choices trick because randomness alone will randomly overload random nodes :slight_smile:
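
A minimal sketch of power-of-two-choices, assuming a hypothetical load probe over :erpc (none of these names come from Mesh):

defmodule TwoChoices do
  # Sample two distinct nodes and keep the less loaded one.
  def pick(nodes) when length(nodes) >= 2 do
    [a, b] = Enum.take_random(nodes, 2)
    if load(a) <= load(b), do: a, else: b
  end

  # Hypothetical load probe: total run-queue length on the remote node.
  defp load(node), do: :erpc.call(node, :erlang, :statistics, [:run_queue])
end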

I don’t know, I’d have to measure it; most likely random would be faster. But that’s not the point of Mesh. Mesh is basically service discovery with the added benefit of being able to activate and monitor the target (process). We’re not talking about a pool, nor endpoints (servers), nor consensus systems; in Mesh we’re only talking about affinity groups. Knowing just the logical process name, not the PID, and without a registry query or anything like that, you are able to send a request, and you are able to do this with “node affinity”: within a larger system (cluster) you can have subsystems and logically divide processes among them.

The reason random sampling wouldn’t be suitable here is that, in this use case, the cluster doesn’t change much, so you’ll always be communicating with the same target; and if the cluster does eventually change, it will likely be re-synchronized at some point.

We will still be working on name collisions, better handling of various issues, the possibility for users to implement their own strategies, and so on… This is just version 0.1.1 of the library, which we are sharing early precisely so we can gather valuable feedback like the kind you gave us.

Without any pretension, I’m sharing the result of one of our benchmarks:

Creating 40,000 actors (10000 per capability)
Time: 2.12s
Throughput: 18827.1 actors/s
Success: 40000/40000 | Failures: 0
Actor distribution per node:
bench@127.0.0.1: 10000 actors
balanced_node_1@127.0.0.1: 10000 actors
balanced_node_2@127.0.0.1: 10000 actors
balanced_node_3@127.0.0.1: 10000 actors

50,000 invocations on existing actors
Time: 3.23s
Throughput: 15484.83 req/s
Success: 50000/50000

Hash ring distribution for 10,000 samples
Expected distribution per node:
bench@127.0.0.1: 2500 actors (25.0%)
balanced_node_1@127.0.0.1: 2500 actors (25.0%)
balanced_node_2@127.0.0.1: 2500 actors (25.0%)
balanced_node_3@127.0.0.1: 2500 actors (25.0%)
Standard deviation: 0.0

FINAL CLUSTER STATE

bench@127.0.0.1:
Status: Online
Processes: 14237
Memory: 124.83 MB
Actors: 10000

balanced_node_1@127.0.0.1:
Status: Online
Processes: 14213
Memory: 101.74 MB
Actors: 10000

balanced_node_2@127.0.0.1:
Status: Online
Processes: 14213
Memory: 104.38 MB
Actors: 10000

balanced_node_3@127.0.0.1:
Status: Online
Processes: 14213
Memory: 103.04 MB
Actors: 10000

Total actors in cluster: 40000

Aloha – thanks for sharing this; it aligns with my own plans.

I opened issue #2 and PR #3: Fix shard lifecycle cleanup, monitor recovery, and capability isolation by nshkrdotcom · Pull Request #3 · eigr/mesh · GitHub.

Full disclosure: I set the investigation criteria and verified the results; Codex and Claude Code handled the investigation, failing tests, fixes, and the issue/PR/commit text.

Hello, thank you very much for your contribution. It’s available on Hex as 0.1.3.

Not meaningfully, but why do hashing by some key when there’s no requirement to route things to the same node in the first place?

With all respect, this is hash-based routing keyed on a value provided by the user. There is a huge chance that some nodes will never be selected if the user provides fewer distinct keys than there are nodes (for example). But even if not, there is no guarantee that the user provides values whose hashes are more or less evenly distributed.

For example, it routes based on request.id, which is a string set by the user. There is no uniqueness check on it, and it can be just hardcoded :slight_smile: (or left as nil by default).
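
A quick illustration of the skew: with hardcoded or reused ids, the whole workload collapses onto a handful of shards no matter how many nodes advertise the capability (the shard count of 16 here is just an assumption):

ids = ["lobby", "lobby", "global", "lobby"]   # ids reused or hardcoded by the caller
ids |> Enum.map(&:erlang.phash2(&1, 16)) |> Enum.uniq()
#=> at most two distinct shards for the entire workload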

If the user has defined incorrect inputs, they shouldn’t expect guarantees. What we can do is warn them that something is being done incorrectly. We’ll add this validation, and thank you again for the suggestion.

But the goal itself is for them to provide an identity; they do this because they know they’ve defined a process that represents something they expect to happen. It’s not the goal of this library to provide guarantees about the user’s business rules. The user should build that on top of this library, in their own codebase. But as long as they provide a valid identity, the request will be routed to the process they defined.

You can implement a custom strategy and implement this behavior if you wish; the most recent documentation explains how to do this.

I’ve read the code and here’s my review

TLDR

This is a poor library with a lot of bugs, misleading documentation and poor applicability.

What does it actually do?

It essentially does 2 things.

  • You can set some “capabilities” for the node (think of them as group names you can assign the current node to) and you can query them
    Mesh.register_capabilities([:game, :chat])
    
  • You can perform a request on the group, which will pick a node from the group, spawn a GenServer (or use an existing one) and do a GenServer.call with the specified payload
    {:ok, pid, response} = Mesh.call(%Mesh.Request{
      module: MyApp.GameActor,
      id: "player_123",
      payload: %{action: "move"},
      capability: :game
    })
    
    Each node in the group is going to maintain shard_count processes per capability, called “actors” in this library’s terms (I just call them shards here), and each request is routed to one of these shards. That means that if a node has two capabilities and the max shard count is 10, it is going to maintain 20 processes (all of them lazily initialized).

Problems

Setting capabilities

Capabilities are distributed data, and like all such data they are subject to CAP. The current solution doesn’t qualify for any of those letters and, putting CAP aside, provides very poor guarantees. If another node can’t start Mesh and set its capabilities in time, the current node won’t know about it. It uses replication on net_kernel.monitor_nodes events, which just tries an asynchronous rpc to fetch the other node’s capabilities and share the current node’s capabilities with it.

That means that any change to cluster configuration or groups is eventually consistent (at best), and it is quite possible that your request would get routed to an already-dead node. If so, this library provides no fallback and you’d just have a dead request.
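
As I read it, the exchange is roughly this pattern (a paraphrase with hypothetical names, not Mesh’s actual module):

defmodule CapSyncSketch do
  use GenServer

  def start_link(caps), do: GenServer.start_link(__MODULE__, caps, name: __MODULE__)

  def init(caps) do
    :net_kernel.monitor_nodes(true)
    {:ok, %{local: caps, remote: %{}}}
  end

  def handle_info({:nodeup, node}, state) do
    # Best-effort async push of our capabilities to the peer (pulling theirs is
    # symmetric). If the peer's process isn't up yet, the cast is simply lost.
    :rpc.cast(node, GenServer, :cast, [__MODULE__, {:peer_caps, node(), state.local}])
    {:noreply, state}
  end

  def handle_info({:nodedown, node}, state),
    do: {:noreply, %{state | remote: Map.delete(state.remote, node)}}

  def handle_info(_msg, state), do: {:noreply, state}

  def handle_cast({:peer_caps, node, caps}, state),
    do: {:noreply, put_in(state.remote[node], caps)}
end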

Picking a node

Alright, let’s forget about previous paragraph and imagine that every node has the exactly the same information about the groups (aka “capabilities”).

In pseudo-code it looks like this

shard_id = hash(request.id, shard_count)
nodes_in_group = Enum.sort(nodes_for_group(request.capability))
node_to_execute_in = Enum.at(nodes_in_group, rem(shard_id, length(nodes_in_group)))
:rpc.call(node_to_execute_in, fn ->
  shard_owner = via(shard_id)  # Yes, shard_id, the same as on the first line
  GenServer.call(shard_owner, {:call, request})
end)

# Where the GenServer.call executes this

pid = start_or_get_shard_process(request.module, shard_id) # Again, shard_id
response = GenServer.call(pid, request.payload)
{:reply, response, ...}

And it has many problems. The first is that routing to the node is not consistent. The second is that the shard within the node is selected twice, which is a bug.

Then, and it is the most beautiful bug here, shard_id is used both as the key for selecting a node in the group and for selecting the shard process on that node. That means that if I have two nodes and 20 shards on each node, only 10 shards on each would ever receive requests, which is very fun. In general, if I have N nodes, only shard_count / N of the shards will ever receive a request.
Because if I hit the first node in a sorted list of two nodes, that means shard_id is divisible by two, and shards whose shard_id is not divisible by two will never receive a request on that node.
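
Concretely, with two nodes in the sorted list, the node at index 0 only ever sees its even-numbered shards:

for shard_id <- 0..19, rem(shard_id, 2) == 0, do: shard_id
#=> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]   (the other 10 shard processes on that node are never used)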

And for problems which I couldn’t reflect in the pseudo code:

  • Shard processes are created on demand and routed by shard_id. That means if I send a request to a not present shard with Module1 as the implementation, it would start a shard as GenServer.start_link(Module1, ...), and if I then send a request which would end up in the same shard, but with Module2 passed as the implementation, it would still hit the shard created by the first request, with Module1
  • If one shard performs a request, which would eventually get routed to the local node, there would be a deadlock. There is no check to ensure that the call ends up in the different shard

Other problems

  • It creates an ActorTable process on each node which just does nothing.
  • Misuse of PartitionSupervisor
  • Mesh.Cluster.Membership schedules monitoring after 100ms, while it can just do it in handle_continue
  • This process also syncs shards on every nodeup and nodedown, but it only changes the local shards, so this whole process is unnecessary
  • If an “actor” dies, it would get restarted by the supervisor, but the ActorTable ets table is going to have the pid of the dead “actor”, resulting in every request hitting the dead process
  • It uses :global for locking, but only on the current node

Conclusion

Unclear use-case, bad implementation, misleading documentation.

The same functionality (but without a lot of bugs) can be achieved by using any other process group solution, including the built-in global_group and pg, and the third-party Horde, Swarm, ProcessHub(?), and syn.
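
For example, the capability/group part maps almost directly onto the built-in :pg (OTP 23+); a minimal sketch:

{:ok, _} = :pg.start_link()         # default :pg scope, one per node
:pg.join(:game, self())             # advertise the :game "capability" for this process
members = :pg.get_members(:game)    # all member pids across the cluster
node(Enum.random(members))          # any node currently advertising :game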

But given that shard processes are expected to be stateless and routing to be inconsistent, it all boils down to just two lines:

node = Enum.random(Node.list())
:erpc.call(node, fn -> ... end)

I’ve just reread my post, and it sounds a bit toxic. I want to make clear that even though this implementation is poor, it is a good start and the idea itself is not strange; it is a classic problem that has been solved several times by other authors in the Elixir community, and what you’re doing with Mesh is a good starting point to explore the problem space.

My personal recommendation would be to

  1. Learn more about distributed systems. I would love to provide some books and resources that I personally find useful, but unfortunately very few of them are in English :frowning:
  2. Take a look at existing implementations and learn about their tradeoffs. For example, the Horde library solves a similar problem, but it uses an eventually consistent delta-CRDT merkle tree structure to (eventually) maintain the same view of which processes belong to which groups (see the sketch after this list). Another good example is ProcessHub, which is, afaik, still in active development.
  3. Learn more about consistent hashing algorithms and their applicability. These algorithms are very interesting because there is an exotic mix of finite field algebra and some empirical observations of how distributed systems change their memberships
  4. Don’t be afraid to ask questions. You can tag me in any thread or use personal messages. I also provide mentorship and consulting services. But you can also just create topics on the forum. Discussions about distributed systems are especially welcome here
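
For item 2, a rough sketch of per-key process placement with Horde (assuming horde ~> 0.8 and a MyApp.GameActor GenServer whose start_link/1 accepts a :name option):

children = [
  {Horde.Registry, name: MyApp.DistRegistry, keys: :unique, members: :auto},
  {Horde.DynamicSupervisor, name: MyApp.DistSupervisor, strategy: :one_for_one, members: :auto}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Start (or reuse) the process that owns a logical key, wherever Horde places it:
name = {:via, Horde.Registry, {MyApp.DistRegistry, {:game, "player_123"}}}

pid =
  case Horde.DynamicSupervisor.start_child(MyApp.DistSupervisor, {MyApp.GameActor, name: name}) do
    {:ok, pid} -> pid
    {:error, {:already_started, pid}} -> pid
  end

GenServer.call(pid, %{action: "move"})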

I am looking forward to seeing the new version of Mesh which would use some interesting distributed algorithm. Happy coding :grinning_cat:
