Coming up with a robust distributed system in Elixir

By using processes and supervision tree we can have a robust “let to fail” distributed system in Elixir. However i was thinking what if the physical server on which the processes fail? That way the supervisor will have no way(no server) of restarting the processes. Even if we were using a self recovery IAAS provider like AWS. The supervisor will somehow need to know about the new server that has been spun up and start its new process there.

Would it be simpler to achieve a robust distributed system by simply building stateless Elixir components, forget about supervision trees and let AWS handle the let it fail and self recovery bits?

Distributed means also to have more than one physical server.

You cannot achieve fault tolerance with one machine only.

Yes the architecture that i am discussing about here is a Elixir supervision tree with supervisor sitting on one physical server and its children sitting on individual physical servers.

vs

simple Elixir stateless components, no supervision tree. Use AWS elastic load balancer and EC2s to house those simple components to achieve the same effect of a distributed system

The goal of having multiple nodes is to have each of them being able to run the system even if one physical server goes down.

Why do You want to separate supervisors from their children?

Duplicate and get a good failover policy. You don’t need AWS elastic load balancer and EC2s to make it work with Elixir/Erlang.

You could also think Rails is fault tolerant because of AWS elastic load balancer and EC2s.

If you narrow down programming to a very specific area there may be very specific solutions that also work. This approach provides little for handling websockets, live gaming, real time financial analysis, or a whole host of other worlds where statelessness is a huge limitation.

Even if you’re doing a “stateless” webserver that does nothing more than handles web requests and pulls data from a DB, supervisors play a significant role properly starting, stopping, and managing the database connections.

This is not a recommended pattern.

1 Like

Hi, thank you for your input

Then what may i know what is the recommended pattern?

Have the supervisor and its child processes all sit together in one server?

Having supervisors and children separated is not fault tolerant, it is a distributed monolith application that breaks when one part breaks. If You want fault tolerance, You need to duplicate both hardware and software.

To learn more about distributed fault tolerance systems, here are 2 valuable documents:

Fault tolerance 101

Making reliable distributed systems in the presence of software errors

Both by Joe Armstrong himself.

2 Likes

Nice, thank you for the links I will take a look

There has been a lot of work in these areas. There is riak_core, syn, libcluster, swarm, …

With those tools in the toolbox, though, there is still the question of what are you distributing, what does “robust” mean, etc… is it ok to lose a job (e.g. a single http request in a web application) or do they need to be executed at least once (as in a transaction processor) … are the results ephemeral (as in a web response) or do they need to be persisted to disk … etc.

All of these approaches require thinking in different scopes: the individual task scope where processes are the atom of function; the local node scope where supervisor trees rule; the cluster scope where replication and durability can be found.

4 Likes