Is inter-project supervision possible?

Hi everyone,

I’m relatively new to Elixir/Erlang, and I’m working on a demo project. The setup involves a master supervisor, a supervisor, and a worker, and I need to run each of them on separate devices. I’m unsure how these components will communicate when they’re on different machines.

To tackle this, I created separate projects for each component and connected them using Node.connect() and Phoenix.PubSub. However, I realized that this approach doesn’t handle supervision properly, right?
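Roughly, the wiring between the projects looks like this (simplified sketch; the node name, PubSub name and topic are just placeholders):

# On the "supervisor" device: connect to the worker node and listen on a topic.
# {Phoenix.PubSub, name: MyApp.PubSub} is started in each project's supervision tree.
true = Node.connect(:"worker@192.168.1.20")
:ok = Phoenix.PubSub.subscribe(MyApp.PubSub, "workers")

# On the "worker" device: broadcast status messages to all connected nodes.
Phoenix.PubSub.broadcast(MyApp.PubSub, "workers", {:status, Node.self(), :up})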

I feel like I might be lacking in the fundamentals of Elixir/Erlang, and I’d really appreciate some guidance on how to implement this correctly.

Thanks in advance! :smiley:

2 Likes

What is your intention with this supervision?

While process links can span nodes, supervision and supervision trees are usually local to a single node.

2 Likes

To have control over the child node and restart the child if it dies, to ensure fault tolerance… But I don’t know whether that is possible when the project is set up across multiple nodes / inter-project, i.e. different projects on different nodes.

I’m sorry if I’m wrong about the fundamentals themselves.

Why not keep it local? Supervision on the node?

What you need, I am not sure even Kubernetes has. It has some distributed orchestration capabilities, but I don’t think it can do OTP-like supervision.

It helps to zoom out and ask yourself why you need something that almost no technology offers (not stably and bulletproof, anyway).

This is of course possible; here is a crude example in Erlang:

-module(dsup).
-export([
    main/0
]).


-export([
    child_init/1,
    child_loop/1
]).

-behaviour(supervisor).
-export([
    init/1
]).


% Demo entry point
main() ->
    {ok, _, Node} = peer:start_link(),
    supervisor:start_link({local, ?MODULE}, ?MODULE, [Node]).

% Child process init
child_init(Node) ->
    Parent = self(),
    {ok, proc_lib:spawn_link(Node, ?MODULE, child_loop, [Parent])}.


% Child process main loop
child_loop(Parent) ->
    receive
        {echo, From, Message} ->
            From ! {echoed, Message},
            child_loop(Parent);
        {stop, From, Reason} ->
            From ! ok,
            exit(Reason)
    end.


% Supervisor init callback
init([Node]) ->
    Flags = #{},
    Children = [
        #{
            id => child,
            start => {?MODULE, child_init, [Node]}
        }
    ],
    {ok, {Flags, Children}}.

The only thing a supervisor cares about is that the child’s start function starts a new process, links it to the supervisor and returns {ok, Pid} (or {ok, Pid, Info}); the Pid can live on a remote node, as the example above shows.

Here is an example session:

> erl -sname node1
Erlang/OTP 27 [erts-15.1.2] [source] [64-bit] [smp:14:14] [ds:14:14:10] [async-threads:1] [jit] [dtrace]

Eshell V15.1.2 (press Ctrl+G to abort, type help(). for help)
(node1@Delta-23)1> c(dsup).
{ok,dsup}
(node1@Delta-23)2> dsup:main().
{ok,<0.101.0>}
(node1@Delta-23)3> supervisor:which_children(dsup).
[{child,<15315.91.0>,worker,[dsup]}]
(node1@Delta-23)4> ChildPid = pid(15315,91,0), ChildPid ! {echo, self(), "hello"}.
{echo,<0.90.0>,"hello"}
(node1@Delta-23)5> flush().
Shell got {echoed,"hello"}
ok
(node1@Delta-23)6> ChildPid ! {stop, self(), crashed}.
{stop,<0.90.0>,crashed}
=SUPERVISOR REPORT==== 19-Dec-2024::18:59:02.505399 ===
    supervisor: {local,dsup}
    errorContext: child_terminated
    reason: crashed
    offender: [{pid,<15315.91.0>},
               {id,child},
               {mfargs,{dsup,child_init,['peer-2242-11228@Delta-23']}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,5000},
               {child_type,worker}]

=CRASH REPORT==== 19-Dec-2024::18:59:02.505058 ===
  crasher:
    initial call: dsup:child_loop/1
    pid: <15315.91.0>
    registered_name: []
    exception exit: crashed
      in function  dsup:child_loop/1 (dsup.erl, line 36)
    ancestors: [dsup,<0.90.0>,<0.89.0>,<0.76.0>,<0.71.0>,<0.75.0>,<0.70.0>,
                  kernel_sup,<0.47.0>]
    message_queue_len: 2
    messages: [{exit,crashed},{exit,<0.90.0>,crashed}]
    links: [<0.101.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 376
    stack_size: 29
    reductions: 168
  neighbours:

(node1@Delta-23)7> supervisor:which_children(dsup).
[{child,<15315.92.0>,worker,[dsup]}]

As you can see from the last which_children/1 call, the supervisor restarted the process on the remote node (the child pid changed from <15315.91.0> to <15315.92.0>).
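Since you are working in Elixir, the same idea translates pretty directly. A rough, untested sketch (module and node names are just for illustration):

defmodule RemoteWorker do
  # The supervisor only cares that this returns {:ok, pid} with the pid linked
  # to it; the pid itself can live on another (connected) node.
  def start_link(node) do
    pid = Node.spawn_link(node, __MODULE__, :loop, [self()])
    {:ok, pid}
  end

  def loop(parent) do
    receive do
      {:echo, from, msg} ->
        send(from, {:echoed, msg})
        loop(parent)

      {:stop, reason} ->
        exit(reason)
    end
  end
end

# In the application on the "master" node:
children = [
  %{id: :remote_worker, start: {RemoteWorker, :start_link, [:"worker@host"]}}
]

Supervisor.start_link(children, strategy: :one_for_one)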

That being said, you will probably want to take a look at the distributed applications guide for your use case (Distributed Applications — Erlang System Documentation v27.2)

Hope this helps.

4 Likes

Cool, I am grateful that you showed this. I stand corrected.

I still would not want to use this in split-brain scenarios, but then again, I am not sure any other software stack would deal with those better than Erlang.

My sense is that a fault-tolerant distributed application is a different thing from a distributed fault-tolerance mechanism. The former seems a reasonable and achievable goal (with some limitations), but the latter, I feel, could never be better than “fragile” or “unreliable” across too many scenarios.

I can see having a specific, perhaps more robust, device take on the role of a controller node: commanding distributed processes to start or shut down (gracefully) as needed, or just monitoring status… but not in the sense of supervision trees, rather in the sense that the child processes can receive messages from the central node and send responses as needed.

For fault tolerance and supervision… I’d be inclined to keep all of that local to the node. To my knowledge, keeping it local sidesteps a lot of questions distributed supervision might leave you with. What happens if the communication (network) between the child and supervisor isn’t working? Does the child have to detect this and die (self-supervision)? If the supervisor on node A doesn’t see the child in time (again, maybe the network), does it try to restart the remote child, and if the remote child isn’t actually dead, is that OK?

I think if I really needed something like this, going beyond baseline OTP while not being very familiar with OTP, the first thing I’d do is look for a library (GitHub - derekkraan/horde: Horde is a distributed Supervisor and Registry backed by DeltaCrdt, maybe GitHub - bitwalker/libcluster: Automatic cluster formation/healing for Elixir applications, etc.) built by others who are better versed in the issues, or at least study them to get a handle on it.
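For example, sketching from Horde’s README (untested, and the supervisor/worker names here are made up), a cluster-wide dynamic supervisor looks roughly like this:

# Each node starts a Horde.DynamicSupervisor under the same name; with
# members: :auto they find each other across the connected cluster, and Horde
# decides which node actually runs each child (and redistributes children
# when a node disappears).
children = [
  {Horde.DynamicSupervisor,
   name: MyApp.DistributedSupervisor,
   strategy: :one_for_one,
   members: :auto}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Later, from any node in the cluster:
Horde.DynamicSupervisor.start_child(MyApp.DistributedSupervisor, MyApp.Worker)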

I need to disclaim something here: I don’t have much hands-on experience with this in the Elixir/Erlang context so my thinking could well be flawed… but having worked with these kinds of distributed things before I have some instincts for it.

2 Likes

Does the child have to detect this and die (self-supervision)?

That’s what linking does. Links are bidirectional, so if the link between the processes is severed, both processes receive an exit signal. What makes supervisors special is that they trap exits. In this scenario it means the child will terminate if the supervisor is no longer reachable. The supervisor will also receive an exit signal if the child is no longer reachable, but instead of terminating it will attempt to start a new child to replace the old one. If that fails too many times, the supervisor itself will fail.
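A tiny illustration of the mechanism, with plain processes rather than a real supervisor:

# The supervisor side of a link: trap exits so a dying (or unreachable) child
# shows up as an {:EXIT, ...} message instead of taking this process down too.
Process.flag(:trap_exit, true)

child = spawn_link(fn -> exit(:boom) end)

receive do
  {:EXIT, ^child, reason} ->
    # A real supervisor would now decide whether to start a replacement child.
    IO.puts("child exited with #{inspect(reason)}")
end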

This of course doesn’t have to span networks. It could be two separate nodes on the same machine, or separate machines connected directly with an Ethernet/serial cable; it really all depends on the topology of the system.

All in all, the most important thing is to define and understand the failure modes and possibilities of your system. Is there a network? Is the network a possible problem? Is hardware failure a possibility you want to handle? Can someone yank a cable?

Once understood, you can build the right mechanism to handle the failures. No system can handle every single possible failure.

2 Likes

Don’t do that.

If your supervisor starts a child on another node, the child still belongs to the supervisor’s node as far as its group leader is concerned. For instance, its IO.puts and other IO will go through the supervisor node. You will have to do a lot of extra work just to get a normally operating system.
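You can see that with a one-liner once the nodes are connected (the node name is just an example):

# Run on the supervisor/master node. The remote process inherits the caller's
# group leader, so this prints on the calling node's console even though the
# function executes on worker@host.
Node.spawn(:"worker@host", fn ->
  IO.puts("running on #{Node.self()}")
end)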

And if you turn off your worker device, the supervisor will quickly reach max restarts and crash your app, or maybe crash the master supervisor app.
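For context, the default restart intensity is small, so something like this sketch (with a hypothetical RemoteWorker start function that spawns on the other device) gives up almost immediately once that device is gone:

# Defaults shown explicitly: 3 restarts within 5 seconds. A child whose node
# has disappeared burns through that budget right away, the supervisor exits,
# and the crash propagates up the tree.
children = [
  %{id: :remote_worker, start: {RemoteWorker, :start_link, [:"worker@host"]}}
]

Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 3,
  max_seconds: 5
)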

I’m sure there are countless other problems we can find just by experimenting with this, plus the security problems.

If you need supervision, do it locally, and if you need to keep in sync with another device, connect the nodes and use :net_kernel.monitor_nodes to know when you are disconnected. Or sync over TCP, for instance.
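A minimal sketch of that, assuming a made-up NodeWatcher module in your supervision tree:

defmodule NodeWatcher do
  use GenServer

  def start_link(opts \\ []),
    do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Ask the kernel to send {:nodeup, node} / {:nodedown, node} messages to
    # this process whenever distribution connections come and go.
    :ok = :net_kernel.monitor_nodes(true)
    {:ok, %{}}
  end

  @impl true
  def handle_info({:nodeup, node}, state) do
    IO.puts("connected to #{node}")
    {:noreply, state}
  end

  def handle_info({:nodedown, node}, state) do
    # React here: mark the device as offline, buffer work, raise an alert, etc.
    IO.puts("lost connection to #{node}")
    {:noreply, state}
  end
end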

Anyway, this is just my recommendation; of course it’s fun to play with distribution in unconventional setups. But if it’s a serious project, I would strongly suggest finding another solution.

2 Likes

All great points about the technical capabilities of OTP supervisors, including some I hadn’t internalized yet. So we agree that running cross-node supervision trees is technically possible; your detailed earlier post also made that clear.

My larger point was that just because you can doesn’t mean you should, and knowing whether you should often requires a fair amount of research beyond “can it be made to work”. You’re right that understanding the failure modes is necessary, to which I’d add understanding the “failure domains” of your application: the boundaries within which failures can be sufficiently isolated to allow other application services to continue without interruption. But this leads me back to the original poster’s stated objective and architecture:

Having two devices, one watching the other for failure, is absolutely a thing and is appropriate in some scenarios… one could easily argue that the whole idea of supervisors is this concept abstracted into software within an application. Where I think the original poster errs is that they appear to be reaching for a tool that does fine-grained supervision of application processes, rather than one that protects against the more general “is that other device still responsive or not” kind of failure.

I don’t think you become more fault tolerant in those circumstances: you can’t achieve higher fault tolerance by more tightly coupling application processes across nodes, which necessarily introduces the new faulting scenarios that come with multiple devices and communication buses; @lud makes some excellent points about additional failure modes introduced by the proposed approach.

If you do want “device A watches/manages device B”, I still think you’d be better off making the application services running on each node as independent as possible rather than deepening the coupling: application processes supervised locally within the node, with a more fit-for-purpose service (your own or a library) that lets device A and device B communicate state, command restarts and so on, avoiding the process linking that comes with OTP supervisors.

I’m happy to be shown that I’m wrong about my assumptions here, but my prior experiences make me wary of this kind of coupling.

In either case, to actually achieve higher, and not lesser, fault tolerance, I still think some up-front research into the fundamentals pays off in time and effort. It’s reasonable for a newcomer to see supervisors and think maybe they can/should be used this way… but I think they’d then also need to ask themselves the kinds of questions I originally posted before committing too far down that path. Yes, some of those questions will have satisfactory answers, as you’ve pointed out, but others maybe not so much.

2 Likes