What would happen in this example if the GenServer stack were initialized with an empty list [] instead of [:hello]? How is the supervisor supposed to handle an unresolvable situation where you call :pop on an empty list, the stack crashes, and the supervisor restarts it with an empty list [] only for it to fail again? I guess what I’m asking is: what happens to the failed :pop request that started all this?
In that example, if pop were called while the stack was empty, the server process would crash. The process that asked to ‘pop’ would then exit as well, because GenServer.call exits when the server dies before replying, and that will probably crash the caller too unless it handles the exit. The supervisor will restart the stack process shortly afterwards. If it crashes too often, too quickly, the supervisor will crash itself so the problem is handled higher up the chain, and that repeats until the crashing stops, something handles it, or the application finally goes down.
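A minimal sketch of that sequence, assuming a Stack module shaped like the example in the GenServer documentation (the exact client functions here are an assumption). It starts the stack empty under a supervisor, shows the caller exiting when pop crashes the server, and then shows that the restarted process begins again from the initial argument rather than the crashed state:

```elixir
defmodule Stack do
  use GenServer

  # Client API (assumed to mirror the GenServer docs example)
  def start_link(initial) when is_list(initial),
    do: GenServer.start_link(__MODULE__, initial, name: __MODULE__)

  def push(element), do: GenServer.cast(__MODULE__, {:push, element})
  def pop, do: GenServer.call(__MODULE__, :pop)

  # Server callbacks
  @impl true
  def init(stack), do: {:ok, stack}

  # Only a non-empty list matches. Popping [] raises a FunctionClauseError
  # inside the server, which crashes the server process.
  @impl true
  def handle_call(:pop, _from, [head | tail]), do: {:reply, head, tail}

  @impl true
  def handle_cast({:push, element}, stack), do: {:noreply, [element | stack]}
end

# Start the stack empty, under a supervisor, instead of with [:hello].
{:ok, _sup} = Supervisor.start_link([{Stack, []}], strategy: :one_for_one)

# The caller exits too, because the server died before replying.
# The exit is caught here only so the demo can keep going.
pop_result =
  try do
    Stack.pop()
  catch
    :exit, reason -> {:caller_exited, reason}
  end

# Give the supervisor a moment to restart Stack. The fresh process starts
# from the same initial argument ([]), not from the crashed state.
Process.sleep(100)
Stack.push(:hello)
after_restart = Stack.pop()
```

The failed pop request itself is simply lost: its caller receives an exit instead of a reply, and whether that caller recovers depends on whether it catches the exit or is itself supervised.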
OK, that makes sense. So Elixir can still fail (crash without recovering) if no code exists to handle a situation? In this case, a call that matches no function clause…
You might find this article interesting, especially from the “Supervision trees” slide onward. Supervisors are very useful for recovering from transient failures, but they can’t do much when you have a critical bug in a core feature; those need to be caught with tests.
For instance, if calling pop is part of your app’s initialization and you removed the initial [:hello]… then yes, you’ll get crash loops, and eventually the supervisor and then the whole VM may go down. But if it’s so bad that the app won’t even start, you should notice while developing or running automated tests.
On the other hand, if pop is being called on an empty stack due to a race condition (e.g. it’s a web app and the user rapidly clicked “pop” multiple times before the UI could update), then the supervisor will save the day and the user may not even notice. You’ll get a crash log with a dump of the process state, which makes the bug a lot easier to reproduce than a bare stack trace would.
Thank you for these explanations and links. I’m at the point with Elixir and OTP where I’m trying to figure out the basic functionality of the common abstractions, and this really helps… Thank you.
If you called pop on the GenServer, then the GenServer would crash and so would the caller. The GenServer would be restarted. The caller might be too, depending on whether it’s supervised.
Quite possibly the caller only called pop at an invalid time because it had invalid state. Now that state has been cleared away by the restart, and you’re good to go.
Possibly, however, it really is a bug in the code, and the crash will keep happening. If it only happens occasionally, life carries on.
If, however, it happens more than a configurable number of times within a configurable number of seconds, the supervisors themselves start crashing. This is intentional. Elixir’s error-handling philosophy is that once a particular level decides things are unrecoverably bad, it escalates the problem to a higher level in case that level can handle it.
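Those two knobs are the :max_restarts and :max_seconds options to Supervisor.start_link/2 (defaults 3 and 5). The sketch below uses a hypothetical Worker module and a hypothetical RestartDemo helper, both made up for this demo, to kill the worker repeatedly and watch the supervisor give up once the limit is exceeded:

```elixir
defmodule Worker do
  # Hypothetical worker: it holds no state and exists only so it can be killed.
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, nil}
end

defmodule RestartDemo do
  # Hypothetical helper: waits until the supervisor has (re)started a Worker
  # other than `previous`, kills it, and returns the pid that was killed.
  def kill_worker(previous \\ nil) do
    case Process.whereis(Worker) do
      pid when is_pid(pid) and pid != previous ->
        ref = Process.monitor(pid)
        Process.exit(pid, :kill)

        receive do
          {:DOWN, ^ref, :process, ^pid, _reason} -> pid
        end

      _not_restarted_yet ->
        Process.sleep(10)
        kill_worker(previous)
    end
  end
end

# Trap exits so this process receives the supervisor's exit as a message
# instead of being taken down with it.
Process.flag(:trap_exit, true)

# Allow at most 2 restarts within any 5-second window.
{:ok, sup} =
  Supervisor.start_link([Worker], strategy: :one_for_one, max_restarts: 2, max_seconds: 5)

first = RestartDemo.kill_worker()       # restart 1: tolerated
second = RestartDemo.kill_worker(first) # restart 2: tolerated
RestartDemo.kill_worker(second)         # third crash: the supervisor gives up

sup_exit =
  receive do
    {:EXIT, ^sup, reason} -> reason
  after
    2_000 -> :supervisor_still_running
  end

IO.inspect(sup_exit, label: "supervisor exited with")
```

When the limit is exceeded the supervisor stops with reason :shutdown, which is what its own parent then sees and has to deal with; that is the escalation step described above.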
Eventually this can escalate all the way up to a server reboot if you have things configured that way.
My point here is that while bugs in the code are obviously bad and should be fixed, this isn’t a scenario that is gonna go around all the error handling that the BEAM provides.
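If an empty stack turns out to be a legitimate situation rather than a bug, one possible fix is to give it its own clause so the server replies instead of crashing. This is only a sketch and not something proposed in the thread: the {:ok, value} / {:error, :empty} reply shape is a hypothetical API change that every caller would then have to handle.

```elixir
defmodule Stack do
  use GenServer

  def start_link(initial) when is_list(initial),
    do: GenServer.start_link(__MODULE__, initial, name: __MODULE__)

  def push(element), do: GenServer.cast(__MODULE__, {:push, element})
  def pop, do: GenServer.call(__MODULE__, :pop)

  @impl true
  def init(stack), do: {:ok, stack}

  @impl true
  def handle_call(:pop, _from, [head | tail]), do: {:reply, {:ok, head}, tail}

  # Hypothetical extra clause: an empty stack is now an expected case,
  # so the caller gets an error tuple and the server keeps running.
  def handle_call(:pop, _from, []), do: {:reply, {:error, :empty}, []}

  @impl true
  def handle_cast({:push, element}, stack), do: {:noreply, [element | stack]}
end

{:ok, _pid} = Stack.start_link([])

empty_pop = Stack.pop()  # {:error, :empty}
Stack.push(:hello)
hello_pop = Stack.pop()  # {:ok, :hello}
```

Whether to do this or to keep crashing on an empty pop is a design decision: the crash is the right signal if an empty pop can only mean a bug elsewhere.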