I have an app that must never go down, under any circumstances. This app has a supervisor with a ton of workers. These workers are feeble and may die … a lot. Every time a worker dies, a metrics system notifies me, so I know when something is wrong.
The problem here is that I can’t find a way to prevent my Supervisor from restarting or just outright dying. I tried the following configuration:
The point of the max_restarts (and max_seconds) options is to break an endless restart cycle and move recovery to a higher level. Thus, in my opinion, an infinite-restarts option doesn’t make sense. You could approximate it by e.g. using some insanely large number for both options, but I’d advise against doing that.
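For reference, here is a minimal sketch of where these options live (module and worker names are made up for illustration). This supervisor tolerates at most 5 child restarts within any 10-second window before giving up and crashing itself, which is the escalation behaviour being described:

```elixir
defmodule MyApp.WorkerSupervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      {MyApp.Worker, []}
    ]

    Supervisor.init(children,
      strategy: :one_for_one,
      # Restart intensity: more than 5 restarts in 10 seconds
      # makes the supervisor itself terminate (defaults are 3 in 5).
      max_restarts: 5,
      max_seconds: 10
    )
  end
end
```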
There are some alternative approaches that could be considered, but first I’d like to learn more about what these workers do, and why they can restart so frequently.
Totally agree with @sasajuric here. The supervisor’s job is to bring up your child processes in a stable state. If the supervisor can’t achieve that steady state then you want to continue restarting at a higher level until you eventually shut down the application (because at that point something has gone terribly wrong).
I think in this case you may want to rethink your problem. We don’t really have enough information to say but if your workers really are that transient then I would probably move the workers into a dynamic supervisor and then monitor and start them from some other controlling process. But there are a lot of options here. The important thing is to think about the problem in terms of what guarantees you want your supervisor to provide.
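One hedged sketch of the "DynamicSupervisor plus a controlling process" shape (all names here are hypothetical): the workers are started as `:temporary` children, so their deaths never count against the supervisor’s restart intensity, and a separate manager process monitors them and applies its own restart policy:

```elixir
defmodule MyApp.ConnectionSupervisor do
  use DynamicSupervisor

  def start_link(arg) do
    DynamicSupervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg), do: DynamicSupervisor.init(strategy: :one_for_one)
end

defmodule MyApp.ConnectionManager do
  use GenServer

  def start_link(domains) do
    GenServer.start_link(__MODULE__, domains, name: __MODULE__)
  end

  @impl true
  def init(domains) do
    {:ok, Enum.reduce(domains, %{}, &start_worker/2)}
  end

  # A worker died: the manager (not the supervisor) decides what to do,
  # e.g. bump a metric and restart it, possibly after a delay.
  @impl true
  def handle_info({:DOWN, _ref, :process, pid, _reason}, state) do
    {domain, state} = Map.pop(state, pid)
    {:noreply, start_worker(domain, state)}
  end

  defp start_worker(domain, acc) do
    spec = %{
      id: domain,
      start: {MyApp.Worker, :start_link, [domain]},
      # :temporary means the DynamicSupervisor never restarts it itself
      restart: :temporary
    }

    {:ok, pid} = DynamicSupervisor.start_child(MyApp.ConnectionSupervisor, spec)
    Process.monitor(pid)
    Map.put(acc, pid, domain)
  end
end
```

The trade-off is that the manager now owns the restart policy, so it can back off, alert, or give up per worker, which a stock supervisor cannot do.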
These workers use gun to open HTTP2 connections to other servers. On startup, each worker opens a connection to a given domain and then keeps that connection open. This allows us to send an insane number of requests per second.
The issue here is that when the connection times out or dies, the worker is smart enough to try to reconnect. But what if it is impossible to reconnect? Maybe there is some corrupt state in the worker or maybe there is something else. In this scenario (and other similar scenarios) I allow the worker to die so the supervisor can create a new one with a fresh state (as advised by your book).
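For concreteness, the kind of worker being described might look roughly like this (names are illustrative, and gun’s message shapes differ between versions, so check the gun docs). Note that it connects in `init/1`, so it crashes at startup if the host is unreachable:

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(domain), do: GenServer.start_link(__MODULE__, domain)

  @impl true
  def init(domain) do
    # Open an HTTP/2 connection and keep it in the state.
    {:ok, conn} = :gun.open(to_charlist(domain), 443, %{protocols: [:http2]})
    {:ok, :http2} = :gun.await_up(conn)
    {:ok, %{domain: domain, conn: conn}}
  end

  @impl true
  def handle_info({:gun_down, conn, _proto, reason, _killed}, %{conn: conn} = state) do
    # gun retries on its own, but if we decide the state is beyond repair
    # we can crash and let the supervisor hand us a fresh start:
    {:stop, {:connection_lost, reason}, state}
  end
end
```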
But what happens when a worker kills itself over and over again? The Supervisor will try to restart it over and over until it decides this is a lost cause, and then dies as well.
Workers will fail. The systems they connect to may be unavailable for hours or days at a time. If my supervisor crashes, it starts a chain reaction that causes the app to crash. Thus my approach is to make the Supervisor never die, while still allowing workers to die so they can come back online with a clean state.
You’re fixating on the wrong part of that statement.
^ that’s the important part. You need to think about how to bring up your connection processes in a stable state and assume that they’ll eventually crash. If you aren’t bringing them up in a stable state to begin with then you’ve designed a fragile system.
If you have designed the system to handle the connections correctly and your supervisor still can’t bring up your system into a steady state then you absolutely want to surface that error. If this propagates far enough up then you’ll eventually want to shut down. If you’ve done your design well and you reach the point of shutting down then it means that something drastic has gone wrong outside of your control. To put it a different way, who cares if your service is up if it can’t do anything useful or always serves the wrong answers? It’s functionally equivalent to being shut down anyway.
Yes, but I’d rather have a useless service online that keeps pinging the team with error messages than have it crash completely. This is a design choice.
Yes, but that doesn’t mean I have to completely shut down the entire application. I’d rather have it online, checking for service availability every chance it gets and constantly posting error messages to some service, than have it quit and shut down.
As you describe it this is expected, not exceptional behaviour. Design needs to accommodate expected behaviour.
Sounds like working against live servers and watching for dead ones to come back online are distinct responsibilities that need to be handled separately (i.e. when a worker decides that the server won’t respond, it terminates normally, handing matters back to the dead-server watcher).
It sounds like you need to manage this with some form of backoff. You can’t code to prevent anything bad from ever happening, that’s just not possible. But if you have some idea of what can go wrong (and it sounds like you do), you should handle those expected errors. This might be in the form of try/catches, exponential backoff in trying to connect with gun etc. You would in this case put that logic in the worker, not in the Supervisor, and not rely on crashes to refresh the state in cases of known errors.
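A minimal sketch of what "backoff in the worker, not the Supervisor" might look like, assuming made-up module names and a stubbed-out connect function: the worker keeps a retry count in its state and schedules reconnect attempts with an exponentially growing delay, instead of crashing on every failed connect:

```elixir
defmodule MyApp.BackoffWorker do
  use GenServer

  @base_delay 1_000
  @max_delay 60_000

  def start_link(domain), do: GenServer.start_link(__MODULE__, domain)

  @impl true
  def init(domain) do
    # Never connect in init/1; schedule the first attempt instead.
    send(self(), :connect)
    {:ok, %{domain: domain, conn: nil, attempts: 0}}
  end

  @impl true
  def handle_info(:connect, state) do
    case try_connect(state.domain) do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn, attempts: 0}}

      {:error, _reason} ->
        # 1s, 2s, 4s, … capped at 60s between attempts.
        delay = min(@base_delay * Integer.pow(2, state.attempts), @max_delay)
        Process.send_after(self(), :connect, delay)
        {:noreply, %{state | attempts: state.attempts + 1}}
    end
  end

  defp try_connect(_domain) do
    # Open the gun connection here; return {:ok, conn} or {:error, reason}.
    {:error, :not_implemented}
  end
end
```

With this shape, a downstream outage of hours or days just means the worker sits in its retry loop, and the supervisor’s restart budget is never touched for this known failure mode.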
If that’s for some reason not possible, set the worker restart strategy to transient and let a separate process handle the restarts. This is not exactly the intended use case for, but kind of fits into, the new Registry.select in 1.9, where you would be able to register your workers and then query the Registry for their status, restarting as required. You’d then be able to keep track of workers that keep crashing, back off and notify/alert, while keeping the other workers alive. This can also be implemented using Process.monitor. Note that this makes your life harder, and you’re introducing a lot more risk of corrupted state.
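A hedged sketch of the Registry.select idea (Elixir ≥ 1.9; registry and module names here are hypothetical). Each worker registers itself, so registration disappears automatically when it dies, and a controlling process can ask the Registry who is still alive:

```elixir
defmodule MyApp.WorkerStatus do
  # Assumes the tree starts:
  #   {Registry, keys: :unique, name: MyApp.WorkerRegistry}
  @registry MyApp.WorkerRegistry

  # Each worker calls this from its own init/1; the entry is removed
  # automatically by the Registry when the worker process dies.
  def register(domain) do
    Registry.register(@registry, domain, :connected)
  end

  # List the domains that currently have a live, registered worker.
  # The match spec binds {key, pid, value} and returns the keys.
  def alive_domains do
    Registry.select(@registry, [{{:"$1", :"$2", :"$3"}, [], [:"$1"]}])
  end
end
```

A controller can then diff `alive_domains/0` against the expected set of domains and restart, back off, or alert for whatever is missing.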
When it comes to the unknown, there’s no way for you to keep the system working in a functioning condition. For those cases crashing is correct.
Alternatively, you can use my Parent library, which would allow you to have a sort of smart supervisor. The benefit of Parent is that you don’t need a separate process to deal with restarts: you get a sort of crossover between a GenServer and a Supervisor in a single process.
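A very rough sketch of that idea (check the library’s current docs — callback names and child-spec shapes have changed between Parent versions): one process acts as both a GenServer and the parent of its workers, so it can apply its own policy when a child dies instead of the stock restart-intensity rules:

```elixir
defmodule MyApp.ConnectionParent do
  use Parent.GenServer

  def start_link(domains), do: Parent.GenServer.start_link(__MODULE__, domains)

  @impl GenServer
  def init(domains) do
    Enum.each(domains, fn domain ->
      Parent.start_child(%{
        id: domain,
        start: {MyApp.Worker, :start_link, [domain]},
        restart: :temporary
      })
    end)

    {:ok, nil}
  end

  # Invoked when children terminate; we can bump a metric and then
  # restart them on our own schedule rather than crashing ourselves.
  @impl Parent.GenServer
  def handle_stopped_children(stopped_children, state) do
    Parent.return_children(stopped_children)
    {:noreply, state}
  end
end
```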
I’m still not convinced that this is the right course, but I have to go out now, so I might post more comments/questions later
This exactly. Supervision restart is the wrong thing to fall back on if your process can’t access a server for hours at a time. For one thing, if you have a large number of these processes, your supervisor becomes a substantial bottleneck for managing the connectivity of these processes. Instead each process should manage its own connection, disconnection, and reconnection life cycle. The supervisor will be there if something truly unexpected happens.
I feel there’s a misconception about what supervisors are supposed to do.
Supervisors try to keep an application in an available state by restarting crashing children, hoping that resetting the internal state of the failing process/(sub)system fixes things, or at least that any external factors causing the crash were temporary hiccups. If things don’t get better, your application as a whole is considered unavailable – not doing its job – and is therefore stopped and hopefully restarted from the OS level, so that maybe that level of restart can bring things back to a working state. If even that doesn’t help, there should (in theory) be no difference between the application running but continuously failing, and the application not running at all.
If a process/(sub)system (continually) failing is not considered critical enough to bring down the application as a whole then this needs to be handled by something outside the supervision tree, which can make different decisions about what to do in the event of failure.
Thank you everyone for the comments and opinions. They truly helped shape my understanding of how things are supposed to work in Elixir and which approach to take.
When I first entered the world of Elixir and Erlang, I was fascinated by the promise of being able to have an application running in production for 7 years in a row without a single day of downtime. My objective is to capitalize on that promise.
I guess that would make sense.
That’s advice I will surely follow, now that I have read everything!
I am not familiar with circuit breakers. Can you point me to some literature on the matter?
So, your recommendation of using the Parent library still holds? I like the idea that I can have a supervisor simply update a metric every time a child fails and that keeps trying to create children to remake the connections.
However, from what I have read, I understand that Supervisors should be used only when things go wrong, so in this case perhaps using Parent would go against the community’s advice.
I discussed this in my talk at ElixirConf EU. But if you’re dealing with connections to downstream services then you’re going to want to treat the connections as a state machine. They’ll come up in a disconnected state and attempt to connect over time. If they can connect then they move to the connected state, where they can start accepting traffic and sending requests. The connection processes can handle some number of known issues such as disconnects or 503s or something. But part of the benefit of treating your connections as a state machine is that allowing them to crash becomes a safer proposition: when the supervisor restarts a connection process, the fresh process starts out disconnected and expects nothing from the downstream service, so it’s much less likely to enter a crash/restart loop.
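The state-machine shape described above can be sketched as follows (a hypothetical minimal version; the real thing might use `:gen_statem`, and all names here are made up). The process always boots into `:disconnected` and exposes its current state for health checks:

```elixir
defmodule MyApp.Connection do
  use GenServer

  def start_link(domain), do: GenServer.start_link(__MODULE__, domain)

  # Other processes can ask which state this connection is in.
  def status(pid), do: GenServer.call(pid, :status)

  @impl true
  def init(domain) do
    # Boot succeeds regardless of whether the downstream service is up.
    send(self(), :connect)
    {:ok, %{domain: domain, conn: nil, status: :disconnected}}
  end

  @impl true
  def handle_call(:status, _from, state), do: {:reply, state.status, state}

  @impl true
  def handle_info(:connect, state) do
    case connect(state.domain) do
      {:ok, conn} ->
        # :disconnected -> :connected; traffic can now flow.
        {:noreply, %{state | conn: conn, status: :connected}}

      {:error, _reason} ->
        # Stay :disconnected and retry later.
        Process.send_after(self(), :connect, 5_000)
        {:noreply, %{state | status: :disconnected}}
    end
  end

  defp connect(_domain) do
    # Open the real connection (e.g. via gun) here.
    {:error, :not_implemented}
  end
end
```

Because `init/1` never touches the network, a crash-and-restart always lands the process back in a valid `:disconnected` state rather than crashing again in `init/1`.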
This is what I meant when I said you need to bring your processes up in a stable state. Processes that talk to the outside world should have no expectations that the downstream service is available when they start. Otherwise you’ll end up with a system that can’t recover from transient failures and probably can’t even boot reliably.
This also gives you the observability that you want. It’s generally not helpful to page for every crashed process because you won’t have enough context to create a meaningful page. But if your processes are attempting to move from a disconnected state into a connected state and aren’t able to then you can start to send more meaningful alerts to an operator or an on-call person. You can also regularly poll your connection processes and if some number of them aren’t up then you can take yourself out of the load balancer or provide more meaningful liveness and readiness checks to your other systems.
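A small sketch of that polling idea, assuming (hypothetically) that each connection process answers a `:status` call with `:connected` or `:disconnected`: a readiness check counts how many connections are up and compares against a threshold before declaring the node healthy:

```elixir
defmodule MyApp.Health do
  # ready?/2 returns true when at least `threshold` (a fraction)
  # of the given connection processes report :connected.
  def ready?(connection_pids, threshold \\ 0.8) do
    up = Enum.count(connection_pids, fn pid -> safe_status(pid) == :connected end)
    up >= threshold * length(connection_pids)
  end

  # A dead or hung process counts as :down rather than crashing the check.
  defp safe_status(pid) do
    GenServer.call(pid, :status, 1_000)
  catch
    :exit, _ -> :down
  end
end
```

A load balancer’s liveness/readiness probe can then call this instead of merely checking that the BEAM is running.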
Most services run in conjunction with other services. The meaningful metric is typically not whether one service is healthy. It’s whether the entire system is healthy. Allowing the system to degrade gracefully becomes a key factor for healthy systems.