Preventing cascading failure of library application taking whole node down

brendan · November 25, 2021, 10:42am

We have had some outages due to a non-critical part of our system failing and causing cascading failures up to the point of a node restart. In particular a library application crashed enough to cause a node restart.
I have read through “Strategies to avoid cascading supervisor crashes?” and https://ferd.ca/the-hitchhiker-s-guide-to-the-unexpected.html and I understand the strategies for solving cascading failures.

This question is specifically how to handle library applications/supervision trees that I’m not in control of (short of forking the library).
I have not had success in finding documentation on how library applications connect to the root supervisor and how that supervision strategy can be configured. Is there any way to do this? Or is this a limitation of Elixir/OTP?
Is the only way of solving this problem to solve it in the library?

I’m currently not mentioning the library because I don’t think it’s important to the question and it would be just one of many examples.

benwilson512 · November 25, 2021, 10:53am

The term “Library application” is generally used only for those libraries that do not spawn a supervision tree and therefore can’t crash in the way that you are talking about. Can you elaborate on the concrete issue you are facing?

LostKobrakai · November 25, 2021, 10:54am

There are two levels to deal with here: there are applications and there are processes.

Any of your dependencies, as well as your own application, are separate OTP applications. Applications can be started as :permanent | :transient | :temporary, which work similarly to the restart strategies for processes on supervisors.

Any application can optionally register an application callback module. If an application doesn’t do so, it’s commonly called a library application (but that’s not a strict definition).

If an application does register such a callback module its start/2 implementation is called when the application is started and it needs to return a pid for a “root process”. Most often this is a supervisor, but doesn’t need to be.
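
As a rough illustration of such a callback module (a minimal sketch, not taken from any particular library; `Library.App` is the placeholder name used later in this thread and `Library.RootSupervisor` is a made-up registered name):

```elixir
defmodule Library.App do
  # Hypothetical application callback module for a library.
  use Application

  @impl true
  def start(_type, _args) do
    # The library's own workers would be listed here.
    children = []

    # The pid returned here is the "root process" of the application.
    # If it terminates, the application is considered stopped.
    Supervisor.start_link(children,
      strategy: :one_for_one,
      name: Library.RootSupervisor
    )
  end
end
```

The module is wired up through the `:mod` key in the library's `mix.exs`, so users of the library never call `start/2` themselves.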

If the “root process” crashes then the application is stopped immediately (no restart attempted). If the application is a :permanent one then all other applications and the vm will shut down. If it is :transient, then the crash reason of the root process is consulted. If the reason is :normal then it’s handled like a :temporary application, otherwise it’s handled like a :permanent one. For :temporary applications the application will stop without affecting other applications or the vm.
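
The restart type is chosen by whoever starts the application. A small sketch of starting an application as :temporary at runtime follows. `MyApp.OptionalApp` is a hypothetical helper, and this only has an effect if the application has not already been started by something else, which at boot it normally has.

```elixir
defmodule MyApp.OptionalApp do
  # Hypothetical helper, not part of any library.
  require Logger

  # Starts `app` and its not-yet-started dependencies as :temporary, so
  # that a later crash of its root process does not stop other applications.
  def start_temporary(app) when is_atom(app) do
    case Application.ensure_all_started(app, :temporary) do
      {:ok, _started_apps} ->
        :ok

      {:error, {failed_app, reason}} ->
        Logger.warning("could not start #{inspect(failed_app)}: #{inspect(reason)}")
        {:error, {failed_app, reason}}
    end
  end
end
```

A :temporary application that stops stays stopped, so any code calling into it has to cope with its processes being gone.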

So the question becomes:

Is the crashing process in your application’s supervision tree → Handle it within your supervision tree with e.g. circuit breakers or other means.
Is the crashing process not in your application’s supervision tree → Either change the application’s restart type or make sure the process doesn’t crash the application it is part of. This probably involves forking if you cannot have a fix added upstream.

brendan · November 25, 2021, 12:33pm

It is common for a library to implement Application and then
specify it in mix.exs:

def application do
  [mod: {Library.App, []}]
end

When a library does this, it starts its own supervision tree directly under the root supervisor simply by the library user adding the library as a dependency, ie. the project/application using the library does not have to explicitly add the library to its own supervision tree.

brendan · November 25, 2021, 12:40pm

How can I modify a library’s application restart strategy when I don’t explicitly add it to my application’s supervision tree or to the list of applications in a release? The library application is implicitly started simply by having it as a dependency, so as far as I know there’s no way to modify how it is started?
If a library does this:

def application do
  [mod: {Library.App, []}]
end

Can I prevent it from being started directly under the root supervisor and instead add it as a subtree to my application’s supervision tree?

LostKobrakai · November 25, 2021, 1:05pm

First of all applications are not started under a supervisor. Applications are just started. Only processes are supervised by supervisors, not applications. You can kinda hack application monitoring into the beam (see e.g. shoehorn on nerves), but generally you want to make sure applications do not crash for reasons they should not crash for. So the first step should always be fixing the application in question.

While you can edit the restart policy for applications when building a release, I also wouldn’t consider this a fix. If the app is one of your dependencies then it’s likely used and needs to be available. If it’s not used then it should be removed or be made optional (likely again needing upstream changes).
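
For reference, a sketch of what editing the policy in a release could look like, using the `:applications` option of a Mix release. `:my_app` and `:some_library` are placeholder names, and whether the release builds and behaves as hoped may depend on how your other applications depend on the library.

```elixir
# mix.exs (sketch; :my_app and :some_library are placeholders)
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      deps: deps(),
      releases: [
        my_app: [
          # Per-application mode in the release boot script.
          # Supported modes are :permanent, :transient, :temporary,
          # :load and :none.
          applications: [some_library: :temporary]
        ]
      ]
    ]
  end

  def application do
    [mod: {MyApp.Application, []}, extra_applications: [:logger]]
  end

  defp deps do
    [{:some_library, "~> 1.0"}]
  end
end
```

This only changes what happens when the library's application stops. It does nothing about the reason it crashed, which is why it is a workaround rather than a fix.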

You could try to work around it by just loading the application, but not starting it. That would make the pure code available without the processes, but it’s questionable if that won’t cause errors elsewhere.
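
At the API level, loading without starting is what `Application.load/1` does. A minimal sketch, with `MyApp.CodeOnly` as a hypothetical helper module:

```elixir
defmodule MyApp.CodeOnly do
  # Hypothetical helper: loads an application's spec without starting
  # any of its processes.
  def load(app) when is_atom(app) do
    case Application.load(app) do
      :ok -> :ok
      # Already loaded (or already started) is fine for this purpose.
      {:error, {:already_loaded, ^app}} -> :ok
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Any function in the library that expects one of its own processes to be running will fail when called, which is the "errors elsewhere" risk mentioned above.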

brendan · November 25, 2021, 1:15pm

Since an application generally runs:

Supervisor.start_link(children, strategy: :one_for_one)

it must be started under some root OTP supervisor, otherwise it wouldn’t take the whole node/VM down when it crashes.
I guess I’ll try to make changes to the library.
It’s just weird to me how easy it is for one part of the system that you’re not even in control of to crash the whole system when Erlang/Beam is all about fault tolerance and isolation.
I think it’d make sense for a library user to be able to specify how and where the library’s supervision tree is started without having to make changes to the library. The library user really should be more in control.

LostKobrakai · November 25, 2021, 1:17pm

Within the beam there is no hierarchy of applications. It’s just a flat list of applications.

A (permanent) application stopping is generally not something that is meant to be caught. If an application crashes, it’s expected to be in a state that is not recoverable by restarting it from within the VM.

Basically, if an application stops because of a crash of the root process, you’re past the line of “we can fix this”.

brendan · November 25, 2021, 1:21pm

I guess the solution is to either fix the library or be more picky about the libraries that we use since it’s so easy for a library to be badly implemented and take down the entire VM

LostKobrakai · November 25, 2021, 1:26pm

brendan:

Since an application generally runs:
Supervisor.start_link(children, strategy: :one_for_one)
it must be started under some root OTP supervisor, otherwise it wouldn’t take the whole node/VM down when it crashes.

That’s not how it works. Yes, there are processes higher in the supervision tree than the root process started by the application, but those live in each application’s individual process tree. There is no root process that joins all the applications within a VM. The application lifecycle is controlled by the :application_controller in the :kernel application.

Generally you should be able to recover from a VM crash no matter what. “Fault tolerance and isolation” don’t mean “the VM will never crash”. Yes, you likely want to keep those to a minimum, but you cannot guarantee it will never happen.
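
You can poke at this from an IEx session. A small sketch using only standard calls:

```elixir
# The application controller is a single registered process in :kernel.
controller = Process.whereis(:application_controller)
IO.inspect(controller, label: "application_controller")

# Started applications come back as a flat list of
# {app, description, version} tuples, with no hierarchy between them.
for {app, _description, vsn} <- Application.started_applications() do
  IO.puts("#{app} #{vsn}")
end
```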

brendan · November 25, 2021, 2:01pm

Maybe I’m using the incorrect terminology, but something still monitors the root process of an application and decides to take the whole VM down when a single application crashes.
Can you recommend a resource where I could learn more about the Application lifecycle and application_controller? Is it sufficient to read Erlang -- application and Erlang -- Applications?

The problem isn’t that it doesn’t recover. It’s that a crash of a non-critical part of the system takes down the whole system (i.e. the critical part as well). So I want more control over the non-critical part and how it handles failures, which is basically what https://ferd.ca/the-hitchhiker-s-guide-to-the-unexpected.html talks about. It’s just that in this case I need to make changes to a library to make it happen.

LostKobrakai · November 25, 2021, 2:09pm

Yes, that’s the case, but it’s not a supervisor, so none of the properties a supervisor usually provides are present. That’s the reason why I stress the difference.

You’ve got two options: change anything non-critical to only run in the context of your own supervision tree (and therefore your control), or set the application to be :temporary and deal with all the side effects of the application potentially being stopped.
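
A sketch of the first option, assuming the non-critical processes can be started from your own tree. All module names here are hypothetical. The non-critical children sit under their own supervisor, which the top-level supervisor treats as :temporary, so exhausting its restart limit does not propagate upwards.

```elixir
defmodule MyApp.NonCritical do
  # Hypothetical supervisor that wraps the non-critical workers.
  use Supervisor

  def start_link(children) do
    Supervisor.start_link(__MODULE__, children, name: __MODULE__)
  end

  @impl true
  def init(children) do
    # Gives up (and terminates) after more than 3 restarts in 5 seconds.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # Placeholder: e.g. processes that call into the flaky library.
    non_critical_workers = []

    children = [
      # Critical children would be listed here as usual.
      # restart: :temporary means this subtree is never restarted by the
      # parent, so its failure cannot exhaust the parent's restart limit.
      Supervisor.child_spec({MyApp.NonCritical, non_critical_workers}, restart: :temporary)
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

The trade-off is that once the non-critical subtree gives up, it stays down until something restarts it explicitly.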

SirWerto · November 25, 2021, 6:16pm

I recommend @ferd’s book Learn You Some Erlang for understanding Erlang and OTP. It’s the book for me. If you are already familiar with the syntax, you can start at the chapter The Hitchhiker’s Guide to Concurrency.

stefanchrobot · November 25, 2021, 8:47pm

This is an interesting scenario. I’d say the library has a bug and if the crash can be classified as originating from within the library, I’d even say it’s a pretty severe one.

Not an expert on libraries, but it seems that the more mature ones address this by actually not starting any processes on their own. Oban is a great example:

Each Oban instance is a supervision tree and not an application. That means it won’t be started automatically and must be included in your application’s supervision tree.
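
The general shape of that pattern, sketched with hypothetical names (this is not Oban's actual code): the library ships a supervisor module and leaves `:mod` out of its `mix.exs`, and the user decides where it runs.

```elixir
defmodule Library.Supervisor do
  # Hypothetical library entry point: a supervision tree, not an application.
  use Supervisor

  def start_link(opts) do
    name = Keyword.get(opts, :name, __MODULE__)
    Supervisor.start_link(__MODULE__, opts, name: name)
  end

  @impl true
  def init(_opts) do
    # The library's own workers would be listed here.
    children = []
    Supervisor.init(children, strategy: :one_for_one)
  end
end

# In the user's application, the library becomes an ordinary child, so the
# user controls its position, restart type and restart limits:
children = [
  {Library.Supervisor, name: MyApp.Library}
]

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)
```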