How is process supervision superior to other patterns?

supervisor

#1

Hi,

I’m coming from Ruby/Rails and trying to learn Elixir. I love it so far, but I’m having a hard time understanding how exactly Supervisors are useful. It’s always mentioned that process supervision makes sure that your process doesn’t crash the whole app. How should I think about it in Ruby terms? For example, when I’m doing a network call, I wrap it in a timeout block or some other begin/rescue block, so it won’t crash the app either. I can also retry the call if I want.

Or should I think about it like a background worker with a retry mechanism and then I can do more stuff concurrently?

Please explain what it allows me to do that Ruby doesn’t.

Thanks


#3

In “Ruby” terms this is more like a background job whose job is to watch other background jobs and make sure that, if they crash or fail or whatever, there is a contingency plan. That plan could be any of a number of things, such as killing off other background jobs as a result, or only retrying the one job in question.
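As a rough sketch of those two contingency plans in Elixir: the supervisor's `:strategy` option decides whether a crash restarts only the failed child (`:one_for_one`) or takes its siblings down and restarts them as well (`:one_for_all`). The module `StrategyDemo` and the names `:job_a` and `:job_b` are made up for this illustration, and the `Process.sleep/1` call is just a crude way to give the supervisor time to react.

```elixir
defmodule StrategyDemo do
  # Hypothetical helper: kills :job_a and reports whether :job_b was replaced too.
  def sibling_restarted?(strategy) do
    children = [
      %{id: :job_a, start: {Agent, :start_link, [fn -> 0 end, [name: :job_a]]}},
      %{id: :job_b, start: {Agent, :start_link, [fn -> 0 end, [name: :job_b]]}}
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    b_before = Process.whereis(:job_b)

    # Simulate a crash of the first job.
    Process.exit(Process.whereis(:job_a), :kill)
    # Crude wait so the supervisor has time to restart children.
    Process.sleep(100)

    b_after = Process.whereis(:job_b)
    Supervisor.stop(sup)

    # A different pid means :job_b was restarted along with :job_a.
    b_before != b_after
  end
end

IO.inspect(StrategyDemo.sibling_restarted?(:one_for_one)) # false: only :job_a is restarted
IO.inspect(StrategyDemo.sibling_restarted?(:one_for_all)) # true: both jobs are restarted
```

Which strategy fits depends on whether the jobs can keep working without each other.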


#5

OK, got it. And because in Elixir/Erlang processes are more lightweight and supervision is easy, that’s superior to Ruby, where you would need a heavier solution like RabbitMQ/Sidekiq/whatever to achieve a concurrent, fault-tolerant workflow, right?


#6

I mean, keep in mind this is a very broad generalization, but I think you are starting to get it.


#7

It is not something easily translated to Ruby because of the BEAM.

In the BEAM, you can have millions of processes running at the same time, and that does not translate to Ruby. Supervisors are there to ensure reliability by restarting (or not) failing processes… but if the restarts keep failing, the supervisor itself will give up and fail too.

A network call is just a function call. Usually you don’t code functions defensively in a try block.
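To make the "they will also fail" part concrete, here is a minimal sketch. It assumes a made-up `:flaky` child that crashes every time it starts, and it lowers the restart limit with `max_restarts` and `max_seconds` (the defaults are 3 restarts within 5 seconds). Once the limit is exceeded, the supervisor stops its children and exits itself.

```elixir
# Trap exits so this process gets a message instead of dying with the supervisor.
Process.flag(:trap_exit, true)

children = [
  # Hypothetical worker that starts fine and then crashes right away, every time.
  %{id: :flaky, start: {Task, :start_link, [fn -> raise "always fails" end]}}
]

# Allow at most 2 restarts within 5 seconds.
{:ok, sup} =
  Supervisor.start_link(children, strategy: :one_for_one, max_restarts: 2, max_seconds: 5)

outcome =
  receive do
    # The supervisor exceeded its restart limit and shut itself down.
    {:EXIT, ^sup, reason} -> {:supervisor_gave_up, reason}
  after
    2_000 -> :still_running
  end

IO.inspect(outcome)
```

In a real application that exit would reach the supervisor's own parent, which gets to decide what to do next, so failures escalate up the tree instead of looping forever.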

Think of it like this…

Your Ruby server can serve a lot of requests.

In Elixir, each request is handled by its own server process: 1_000 connections = 1_000 servers. Each is independent, with its own unshared memory, so one process can go bad without affecting the others. And when one does, supervisors are able to heal the failing process…

Yes, the syntax looks similar, but the mindset is SO different :slight_smile:
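A small sketch of that isolation, with no supervisor involved yet. `IsolationDemo` and its `handle/1` function are invented for the example; request 2 stands in for a request that hits a bug.

```elixir
defmodule IsolationDemo do
  # Hypothetical handler: request 2 hits a bug and raises.
  def handle(2), do: raise("boom")
  def handle(id), do: {:ok, id}

  def run do
    parent = self()

    for id <- 1..3 do
      # spawn/1 starts an unlinked process with its own memory.
      spawn(fn -> send(parent, {:reply, id, handle(id)}) end)
    end

    collect([])
  end

  # Gather replies until none arrive for 200 ms.
  defp collect(acc) do
    receive do
      {:reply, id, result} -> collect([{id, result} | acc])
    after
      200 -> Enum.sort(acc)
    end
  end
end

IO.inspect(IsolationDemo.run())
# Prints [{1, {:ok, 1}}, {3, {:ok, 3}}]; the crash of request 2 is only logged.
```

The process for request 2 dies and its error gets logged, but requests 1 and 3 never notice.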

PS: I also come from Ruby


#8

In Ruby (and other scripting languages) you’d normally use some OS level tool like monit or god to monitor your worker processes and restart them if one died. With Elixir (or any BEAM language) that sort of fault tolerance is built in and written in the same language as the core logic.

The original primary requirement when Erlang/BEAM was invented was fault tolerance. It’s baked into the design of the VM at a low level.


#9

This is a topic I explore in my “Idioms for building fault-tolerant and distributed applications” talk, so if you haven’t seen it yet, it can be a good starting point: https://www.youtube.com/watch?v=xhwnHovnq_0

The talk above in particular was given as a keynote at Euruko 2016, so I may draw more direct comparisons to Ruby in there. The talk is from 2016 but all of the concepts still apply.


#10

Consider a taxonomy where we divide all software errors into two categories - errors you can anticipate, and errors that you cannot.

In your example, you mention one of the errors you can anticipate: you have the forethought to realize that your network request might time out, and you add code to account for that anticipated error.

In many environments, especially those with lots of concurrency, distributed communication, etc., there can be bugs that are transient and unanticipated. For example, suppose your network request goes as expected (without a timeout) but the system at the other end is in a bad state and returning garbage. To make it more concrete, suppose it returns the cost for a box of apples and the number of apples available. Your code wants to calculate the cost per apple as “cost of box / number of apples”, but the remote system says you have a $2.00 box of apples that contains 0 apples. Your code tries to divide by zero and raises a divide-by-zero exception.

At this point your Ruby application has run into a situation it didn’t expect. It raises an exception that, unless something higher up rescues it, takes the whole application down :-(.

In Elixir that request might be running in its own process. This unanticipated, transient error would crash that process, but leave the rest of your app untouched. Better yet, if the process is supervised, then the supervisor can decide how to respond to the unexpected. It might retry the request. It might decide that a whole subsystem that was entirely dependent on knowing the unit price of an apple is now in a bad state and should be rebooted (killing and restarting a whole bunch of processes). Or it could just ignore the failed request and keep running.
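Here is one way the apple example could look as a supervised process. This is only a sketch: `PriceWorker` and `unit_price/2` are names I made up, prices are in cents, and the short sleep is just a crude wait for the restart.

```elixir
defmodule PriceWorker do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Price per apple in cents; the worker crashes when count is 0.
  def unit_price(cost_cents, count) do
    GenServer.call(__MODULE__, {:unit_price, cost_cents, count})
  end

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call({:unit_price, cost_cents, count}, _from, state) do
    # div/2 raises ArithmeticError on a zero divisor; nothing here rescues it.
    {:reply, div(cost_cents, count), state}
  end
end

{:ok, _sup} = Supervisor.start_link([PriceWorker], strategy: :one_for_one)
pid_before = Process.whereis(PriceWorker)

bad =
  try do
    PriceWorker.unit_price(200, 0)
  catch
    # The caller sees the worker's crash as an exit and can carry on.
    :exit, _reason -> :worker_crashed
  end

# Crude wait so the supervisor has time to start a fresh worker.
Process.sleep(100)
pid_after = Process.whereis(PriceWorker)
good = PriceWorker.unit_price(200, 4)

IO.inspect({bad, good, pid_before != pid_after})
# Prints {:worker_crashed, 50, true}
```

Note that the worker contains no error handling for the zero case at all. The bad response kills that one process, the supervisor starts a clean replacement, and the next request is served normally.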

While supervisors are capable of handling anticipated errors like your timeout error and retries, they really shine when you consider what they allow in the face of transient and unanticipated errors.


#11

Thanks everyone! It’s clearer now.