By using processes and supervision tree we can have a robust “let to fail” distributed system in Elixir. However i was thinking what if the physical server on which the processes fail? That way the supervisor will have no way(no server) of restarting the processes. Even if we were using a self recovery IAAS provider like AWS. The supervisor will somehow need to know about the new server that has been spun up and start its new process there.
Would it be simpler to achieve a robust distributed system by simply building stateless Elixir components, forget about supervision trees and let AWS handle the let it fail and self recovery bits?
Yes the architecture that i am discussing about here is a Elixir supervision tree with supervisor sitting on one physical server and its children sitting on individual physical servers.
vs
simple Elixir stateless components, no supervision tree. Use AWS elastic load balancer and EC2s to house those simple components to achieve the same effect of a distributed system
If you narrow down programming to a very specific area there may be very specific solutions that also work. This approach provides little for handling websockets, live gaming, real time financial analysis, or a whole host of other worlds where statelessness is a huge limitation.
Even if you’re doing a “stateless” webserver that does nothing more than handles web requests and pulls data from a DB, supervisors play a significant role properly starting, stopping, and managing the database connections.
Having supervisors and children separated is not fault tolerant, it is a distributed monolith application that breaks when one part breaks. If You want fault tolerance, You need to duplicate both hardware and software.
To learn more about distributed fault tolerance systems, here are 2 valuable documents:
With those tools in the toolbox, though, there is still the question of what are you distributing, what does “robust” mean, etc… is it ok to lose a job (e.g. a single http request in a web application) or do they need to be executed at least once (as in a transaction processor) … are the results ephemeral (as in a web response) or do they need to be persisted to disk … etc.
All of these approaches require thinking in different scopes: the individual task scope where processes are the atom of function; the local node scope where supervisor trees rule; the cluster scope where replication and durability can be found.