Supervisors > workers max_restarts

Hey guys! =)

TIL that a supervisor gets shut down when its worker child doesn’t start up. :neutral_face:

Someone just shut me down for it too and said Erlang is supposed to be 99.9% up, and I didn’t have a good comeback. hahaha =p Because it’s right there in the documentation. =(

Notice that the supervisor that reaches maximum restart intensity will exit with :shutdown reason. 

I currently have a GenServer worker child that connects to an external source, and if the external source is unavailable it fails and dies, taking the application along with it. =(

It’s currently set up with an older Elixir version and I can’t use dynamic supervisors with it.

a) Even if I put the worker behind another supervisor that the application supervises, I think the app might still fail. =(

b) just extend the max_restarts and time frame (can’t predict when the external service would be up; see the sketch after this list)

c) handle it in the GenServer’s terminate (let it sleep before retrying???)
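For reference, option b would just mean loosening the restart intensity on the supervisor, roughly like the sketch below (older Supervisor.Spec style since I’m on an older Elixir version; MyApp.Worker and the numbers are placeholders, not recommendations):

```elixir
import Supervisor.Spec

children = [
  worker(MyApp.Worker, [])
]

# Tolerate up to 10 crashes within any 60-second window before the
# supervisor itself gives up and exits with :shutdown.
Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 10,
  max_seconds: 60
)
```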

Any suggestions?

I need to restore their faith in Elixir. hehe =p

Thanks! =)

Supervisor restarts are meant to handle unexpected input to your code, bugs, flaky network connections, or even the hard-to-detect heisenbug. All of those are likely to be resolved by simply restarting whatever processes died and moving on. An external resource that is permanently down is not something that can be fixed by restarting things on your end.

If you put a process connecting to such a service in the supervision tree, it’s kind of expected to be somewhat vital to your application, and with it being down, what should your application do besides simply not work? If you don’t want your application to die as well, you can look into other mechanisms like backoffs or circuit breakers, which is not what supervisors do. Supervisors are purposefully simple: they try to keep things running, and if that doesn’t work, they give up.
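To make the backoff idea a bit more concrete, here’s a rough sketch (MyApp.ExternalService and its connect/0 function are made up for illustration, and the backoff is a fixed 5 seconds rather than anything clever):

```elixir
defmodule MyApp.Connection do
  use GenServer

  @backoff_ms 5_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  def init(:ok) do
    # Don't connect in init/1: a failing connection there crashes the
    # child at start-up and quickly burns the supervisor's restart budget.
    send(self(), :connect)
    {:ok, %{conn: nil}}
  end

  def handle_info(:connect, state) do
    # MyApp.ExternalService.connect/0 is assumed to return
    # {:ok, conn} | {:error, reason}.
    case MyApp.ExternalService.connect() do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn}}

      {:error, _reason} ->
        # Back off and try again later instead of crashing.
        Process.send_after(self(), :connect, @backoff_ms)
        {:noreply, state}
    end
  end
end
```

The point is that a failed connection attempt becomes an ordinary return value rather than a crash, so the supervisor’s restart intensity never comes into play.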

3 Likes

Not that it would change anything, but dynamic supervisors are just the old supervisors with the :simple_one_for_one strategy, which has always been available. I don’t see how it could help in your case, though …
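For anyone who hasn’t used it, the old-style version looks roughly like this (MyApp.Worker is a placeholder):

```elixir
import Supervisor.Spec

# :simple_one_for_one takes a single child template...
children = [
  worker(MyApp.Worker, [])
]

{:ok, sup} = Supervisor.start_link(children, strategy: :simple_one_for_one)

# ...and children are started on demand, with the extra arguments
# appended to the template's argument list.
{:ok, _pid} = Supervisor.start_child(sup, [:some_arg])
```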

2 Likes

Thanks! I will look into those other mechanisms. =) I was trying to sell Elixir because the guy was a Go guy, and then my app died. =p

1 Like

Thanks! I saw someone use it when I was googling, so I thought it would have prevented the problem.

1 Like

I’m no Elixir guru, so take everything below with a healthy dose of skepticism and don’t take it at face value :slight_smile:

Elixir/Erlang are famous for their fault tolerance, but they’re not magic: if you design your project poorly it won’t continue running just because it happens to be on the BEAM.

In this case, it seems that the cause of failure was indeed poor application design. If you’re connecting to a remote service, you have to plan for “what happens when it’s unreachable or too slow?”. According to your description, it seems like the decision you made was “just let it crash”.

While that is oft repeated as part of Erlang’s design philosophy, it’s important to understand why people recommend that approach. When designing systems in OTP, it’s expected that restarting a service/process will bring it back into a known good, stable state. Basically “turning it off and on again”, which solves so many issues when technology goes bonkers.

In the case you describe, you can’t just “let it crash” because when the service restarts, the remote service you’re trying to connect to could still be unreachable. In other words, restarting your process didn’t (and couldn’t!) solve your issue at all.

Instead, it’s recommended to design with a so-called “error kernel”: a portion of the system that MUST ALWAYS be correct. If that error kernel gets corrupted, everything should get shut down.

In your design, you basically had the connection to the third-party system within your error kernel (in effect having your code say “this remote system will always be reachable”), and when that wasn’t the case, the system failed.

Since that’s apparently not what you want to happen, you should instead design your system so that connecting to the third party system isn’t part of the error kernel. In other words, your processes should continue operating even if unable to reach the remote service.

One way this could be achieved is by starting a task to connect to the remote service and shutting the task down if it doesn’t return before a certain cutoff period (i.e. a timeout). Your GenServer would then respond either with {:ok, response} if it got a response, or {:error, :service_unavailable} if the remote service couldn’t be reached in time (either because it’s down or because the response came too late). Then the client can decide what to do if the remote service is unreachable (try again later, etc.).
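A rough sketch of that idea (MyApp.RemoteClient.fetch/1 and the 5-second cutoff are invented for illustration):

```elixir
# Inside your GenServer. Note that Task.async/1 links the task to this
# process, so a crash inside the task would still bring the GenServer
# down unless you trap exits or use Task.Supervisor.async_nolink/2.
def handle_call({:fetch, query}, _from, state) do
  task = Task.async(fn -> MyApp.RemoteClient.fetch(query) end)

  case Task.yield(task, 5_000) || Task.shutdown(task, :brutal_kill) do
    {:ok, response} ->
      {:reply, {:ok, response}, state}

    _ ->
      # No answer within the cutoff; let the caller decide whether
      # to retry later.
      {:reply, {:error, :service_unavailable}, state}
  end
end
```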

Here is some more info on the idea of service guarantees and error kernels (note there are 4 links, I don’t know why some don’t display properly):

https://ferd.ca/it-s-about-the-guarantees.html

https://medium.com/@jlouis666/error-kernels-9ad991200abd

https://www.reactivedesignpatterns.com/patterns/error-kernel.html

Hope this helps!

7 Likes