Supervisors - Example in online docs

I’m referring to the example at the top of this link.

http://elixir-lang.org/docs/stable/elixir/Supervisor.html#content

What would happen in this example if the GenServer stack were initialized with an empty list [] instead of [:hello]? How is the supervisor supposed to handle an unresolvable situation where you call :pop on an empty list, the stack crashes, and it is restarted with an empty list [] only to fail again? I guess what I’m asking is: what happens to the failed :pop request that started this?

In that example, if pop is called while the stack is empty, no handle_call clause matches and the stack process crashes. The process that asked to ‘pop’ never gets a reply: its GenServer.call exits because the server died mid-call, which will probably crash the caller too unless it handles the exit. The supervisor recreates the stack process right away. If it crashes too often too fast, the supervisor crashes itself so the failure can be handled higher up the chain, and that repeats until it finally stops crashing, gets handled, or the application finally crashes.
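To make that concrete, here is a minimal sketch of the scenario, assuming a recent Elixir with module-based child specs. The `Stack` module is modeled on the docs example but is not copied from it, and `Demo.await_restart/3` is a hypothetical helper added only so the script can wait for the restart.

```elixir
defmodule Stack do
  use GenServer

  def start_link(initial) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  @impl true
  def init(stack), do: {:ok, stack}

  # Only a non-empty list matches, so :pop on [] crashes the server.
  @impl true
  def handle_call(:pop, _from, [head | tail]), do: {:reply, head, tail}

  @impl true
  def handle_cast({:push, item}, stack), do: {:noreply, [item | stack]}
end

defmodule Demo do
  # Hypothetical helper: poll until the name is registered to a new pid.
  def await_restart(name, old_pid, tries \\ 50) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ when tries > 0 ->
        Process.sleep(20)
        await_restart(name, old_pid, tries - 1)

      _ ->
        nil
    end
  end
end

# Start the stack with [] instead of [:hello].
{:ok, _sup} = Supervisor.start_link([{Stack, []}], strategy: :one_for_one)
old_pid = Process.whereis(Stack)

# The failed request surfaces as an exit in the caller; catching it here
# keeps this script alive. An uncaught exit would crash the caller.
result =
  try do
    GenServer.call(Stack, :pop)
  catch
    :exit, reason -> {:caller_exit, reason}
  end

# The supervisor starts a fresh Stack (again with []) under the same name.
new_pid = Demo.await_restart(Stack, old_pid)

# The restarted process works normally as long as it is used correctly.
GenServer.cast(Stack, {:push, :hello})
popped = GenServer.call(Stack, :pop)
IO.inspect({result, old_pid, new_pid, popped})
```

The failed :pop is simply lost: the caller sees an exit instead of a reply, and the restarted stack knows nothing about the earlier request.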


OK, that makes sense. So Elixir can still fail (crash without recovery) if the functionality doesn’t exist to handle a situation? In this case an unmatched function call…

It would have to happen a lot and rapidly, but yep, that is the worst-case scenario.

That would also mean that you really need to run Dialyzer. ^.^

Dialyzer… I’m sure I’ll eventually get to that point.

You might find this article interesting, especially from the “Supervision trees” slide. Supervisors are very useful to recover from transient failures, but they can’t do much when you have a critical bug in a core feature; these need to be checked with tests.

For instance, if calling pop is part of your app’s initialization and you removed the initial [:hello], then yes, you’ll get crash loops, and eventually the supervisor and then the whole VM may go down. But if it’s so bad the app won’t even start, you should notice while developing or running automated tests.

On the other hand, if pop is being called on an empty stack due to a race condition (e.g. it’s a web app and the user rapidly clicked “pop” multiple times before the UI could update), then the supervisor will save the day and the user may not even notice. You’ll get a crash log with a dump of the process state, which makes it a lot easier to reproduce than when you have just a stack trace.


Thank you for these explanations and links. I’m at the point with Elixir and OTP where I’m trying to figure out the basic functionality of the common abstractions, and this really helps.

It does not crash without recovery.

If you called pop on the GenServer, then the GenServer would crash and so would the caller. The GenServer would be restarted. The caller might be too, depending on whether it’s supervised.

Quite possibly the only reason that the caller called pop at an invalid time is because it had invalid state. Now that’s cleared away and you’re good to go.

It’s also possible that it really is a bug in the code, and that it will continue to happen. If it only happens periodically, then life carries on.

If it happens a configurable number of times in a configurable number of seconds, however (the supervisor’s max_restarts and max_seconds options, 3 restarts in 5 seconds by default), then the supervisors themselves start crashing. This is intentional. Elixir’s error handling philosophy is that once a particular level thinks things are unrecoverably bad, it escalates the problem to a higher level that may be able to handle it.
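Here is a rough sketch of that escalation, using the `max_restarts` and `max_seconds` options of `Supervisor.start_link/2`, written out explicitly at their default values. `CrashyStack` and `CrashLoop` are made-up names for this illustration, and the script traps exits only so that it survives the supervisor going down.

```elixir
defmodule CrashyStack do
  use GenServer

  def start_link(initial) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  @impl true
  def init(stack), do: {:ok, stack}

  # No clause for the empty list, so every :pop on [] crashes the server.
  @impl true
  def handle_call(:pop, _from, [head | tail]), do: {:reply, head, tail}
end

defmodule CrashLoop do
  # Hypothetical helper: keep calling :pop until the supervisor gives up.
  def run(sup, attempts \\ 0) do
    if Process.alive?(sup) and attempts < 200 do
      try do
        GenServer.call(CrashyStack, :pop)
      catch
        # Covers both the crash itself and a call made mid-restart.
        :exit, _reason -> :ok
      end

      Process.sleep(10)
      run(sup, attempts + 1)
    else
      attempts
    end
  end
end

# Trap exits so the linked supervisor's shutdown does not kill this script.
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link([{CrashyStack, []}],
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

sup_ref = Process.monitor(sup)
CrashLoop.run(sup)

# More than 3 restarts within 5 seconds makes the supervisor exit.
reason =
  receive do
    {:DOWN, ^sup_ref, :process, ^sup, reason} -> reason
  after
    1_000 -> :still_running
  end

IO.inspect(reason)
```

In a real application the supervisor would usually sit under another supervisor, which then applies its own restart limits; that is the escalation described above.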

Eventually this can escalate all the way up to a server reboot if you have things configured that way.

My point here is that while bugs in the code are obviously bad and should be fixed, this isn’t a scenario that is gonna go around all the error handling that the BEAM provides.
