You’re fixating on the wrong part of that statement.
^ that’s the important part. You need to think about how to bring up your connection processes in a stable state and assume that they’ll eventually crash. If you aren’t bringing them up in a stable state to begin with, then you’ve designed a fragile system.
If you have designed the system to handle the connections correctly and your supervisor still can’t bring your system up into a steady state, then you absolutely want to surface that error. Eventually, if this propagates far enough up, you’ll want to shut down. If you’ve done your design well and you reach the point of shutting down, then it means something drastic has gone wrong outside of your control. To put it a different way: who cares if your service is up if it can’t do anything useful or always serves the wrong answers? It’s functionally equivalent to being shut down anyway.
Don’t design around promises you can’t keep. This is central to the Erlang philosophy. Your server will go bad. The network will go bad. Design to handle these.
Yes, but I’d rather have a useless service online that keeps pinging the team with error messages than have it crash completely. This is a design choice.
Yes, but it doesn’t mean I have to completely shut down the entire application. I’d rather have it online, checking for service availability every chance it gets and constantly posting error messages to some service, than have it quit and shut down.
As you describe it, this is expected, not exceptional, behaviour. Design needs to accommodate expected behaviour.
Sounds like working against live servers and watching for dead ones to come back online are distinct responsibilities that need to be handled separately (i.e. when a worker decides that the server won’t respond, it terminates normally, handing matters back to the dead-server watcher).
It sounds like you need to manage this with some form of backoff. You can’t code to prevent anything bad from ever happening, that’s just not possible. But if you have some idea of what can go wrong (and it sounds like you do), you should handle those expected errors. This might be in the form of try/catches, exponential backoff in trying to connect with gun etc. You would in this case put that logic in the worker, not in the Supervisor, and not rely on crashes to refresh the state in cases of known errors.
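Something like this is what I have in mind (just a sketch; the call being retried and the delays are placeholders):

```elixir
defmodule MyApp.Retry do
  @moduledoc """
  Sketch: retry a known-flaky call with exponential backoff inside the
  worker, instead of letting the crash propagate to the Supervisor.
  """

  # `fun` is expected to return {:ok, result} or {:error, reason}.
  def with_backoff(fun, attempt \\ 0, max_attempts \\ 5) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      {:error, _reason} when attempt < max_attempts ->
        # Sleep 1s, 2s, 4s, 8s, ... before retrying the expected failure.
        Process.sleep(1_000 * round(:math.pow(2, attempt)))
        with_backoff(fun, attempt + 1, max_attempts)

      {:error, _reason} = error ->
        # Out of retries: return the error to the caller, or crash here
        # if at this point it really is unexpected.
        error
    end
  end
end
```

The worker would then call something like `MyApp.Retry.with_backoff(fn -> open_connection(host, port) end)`, where `open_connection/2` stands in for your gun call.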
If that’s for some reason not possible, set the worker’s restart strategy to :transient and let a separate process handle the restarts. This is not exactly the reason for, but kind of fits into, the use case for the new Registry.select in 1.9, where you would be able to register your workers and then query the Registry for the status of the workers, restarting as required. You’d then be able to keep track of workers that keep crashing, back off and notify/alert, but keep the other workers alive. This can also be implemented using Process.monitor. Note that this is making life harder for yourself, and you’re introducing a lot more risk of corrupted state.
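Roughly like this (a sketch assuming a Registry started with `keys: :unique` under the name `MyApp.WorkerRegistry`, a DynamicSupervisor `MyApp.WorkerSup`, and a `MyApp.Worker` module; all names are made up, and the match spec is the "return everything" one from the Registry.select/2 docs):

```elixir
defmodule MyApp.WorkerWatcher do
  @moduledoc """
  Sketch of a watcher that queries the Registry for live workers and
  restarts the ones that are missing, instead of relying on the
  Supervisor's restart strategy.
  """

  # Match spec that returns every {key, pid, value} entry in the Registry.
  @all_entries [{{:"$1", :"$2", :"$3"}, [], [{{:"$1", :"$2", :"$3"}}]}]

  def ensure_workers(expected_keys) do
    registered_keys =
      MyApp.WorkerRegistry
      |> Registry.select(@all_entries)
      |> Enum.map(fn {key, _pid, _value} -> key end)

    # Restart whatever should be running but isn't. This is also the place
    # to count repeated disappearances, back off and notify/alert, while
    # leaving the healthy workers alone.
    for key <- expected_keys -- registered_keys do
      DynamicSupervisor.start_child(MyApp.WorkerSup, {MyApp.Worker, key})
    end
  end
end
```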
When it comes to the unknown, there’s no way for you to keep the system in a functioning condition. For those cases, crashing is correct.
You can protect your workers with circuit breakers; one straightforward way to do this is to use this tiny application/library: https://github.com/jlouis/fuse
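Off the top of my head, usage looks roughly like this (treat it as a sketch and check the fuse README for the exact options; the fuse name, thresholds and the wrapped call are all made up):

```elixir
defmodule MyApp.ProtectedCall do
  @moduledoc """
  Sketch of guarding an outbound call with the fuse circuit breaker.
  """

  @fuse :downstream_api

  def install do
    # Blow the fuse after 5 failures within 10s; try again after 60s.
    :fuse.install(@fuse, {{:standard, 5, 10_000}, {:reset, 60_000}})
  end

  # `fun` should return {:ok, result} or {:error, reason}.
  def call(fun) do
    case :fuse.ask(@fuse, :sync) do
      :ok ->
        case fun.() do
          {:ok, _} = ok ->
            ok

          {:error, _} = error ->
            # Record the failure; enough melts will blow the fuse.
            :fuse.melt(@fuse)
            error
        end

      :blown ->
        # Circuit is open: fail fast instead of hammering a dead service.
        {:error, :circuit_open}
    end
  end
end
```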
Alternatively, you can use my Parent library, which would allow you to have a sort of smart supervisor. The benefit of Parent is that you don’t need a separate process to deal with restarts. You can have a sort of crossover between a GenServer and a Supervisor in a single process.
I’m still not convinced that this is the right course, but I have to go out now, so I might post more comments/questions later.
This exactly. Supervision restart is the wrong thing to fall back on if your process can’t access a server for hours at a time. For one thing, if you have a large number of these processes, your supervisor becomes a substantial bottleneck for managing the connectivity of these processes. Instead each process should manage its own connection, disconnection, and reconnection life cycle. The supervisor will be there if something truly unexpected happens.
I feel there’s a misconception about what supervisors are supposed to do.
Supervisors try to keep an application in an available state by restarting crashing children, hoping that resetting the internal state of the failing process/(sub)system fixes things, or at least that any external factors causing the crash were temporary hiccups. If things don’t get better, your application as a whole is considered unavailable (not doing its job), and therefore it’s stopped and hopefully restarted from the OS level, so maybe that level of restart can bring things back to working order. If even that doesn’t help, there should (in theory) be no difference between the application running but continuously failing, and the application not running at all.
If a process/(sub)system (continually) failing is not considered critical enough to bring down the application as a whole then this needs to be handled by something outside the supervision tree, which can make different decisions about what to do in the event of failure.
I’ve created quite a few applications where GenServers under a DynamicSupervisor either hold state and push it somewhere over HTTP, or get state from somewhere and hold it.
As others have already said, it’s better to catch any HTTP errors and use a retry strategy while keeping the GenServer alive.
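For example, something along these lines (just a sketch; `push/1` is a placeholder for whatever HTTP client call you use, and the retry interval is arbitrary):

```elixir
defmodule MyApp.Pusher do
  @moduledoc """
  Sketch of a GenServer that holds some state and pushes it somewhere over
  HTTP, retrying later on failure instead of crashing and losing the state.
  """
  use GenServer

  @retry_after 5_000

  def start_link(state), do: GenServer.start_link(__MODULE__, state)

  @impl true
  def init(state) do
    send(self(), :push)
    {:ok, state}
  end

  @impl true
  def handle_info(:push, state) do
    case push(state) do
      :ok ->
        # Work done; a :transient worker stopping with :normal won't be restarted.
        {:stop, :normal, state}

      {:error, _reason} ->
        # Expected HTTP failure: keep the state and try again in a while.
        Process.send_after(self(), :push, @retry_after)
        {:noreply, state}
    end
  end

  # Placeholder for the actual HTTP call.
  defp push(_state), do: {:error, :not_implemented}
end
```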
Thank you everyone for the comments and opinions. They truly helped me shape my understanding of how things are supposed to work in Elixir and which approach to take.
When I first entered the world of Elixir and Erlang, I was fascinated by the promise of being able to have an application running in production for 7 years in a row without a single day of downtime. My objective is to capitalize on that promise.
I guess that would make sense.
That’s advice I will surely follow, now that I have read everything!
I am not familiar with circuit breakers. Can you point me to some literature on the matter?
So, does your recommendation of using the Parent library still hold? I like the idea that I can have a supervisor simply update a metric every time a child fails and keep trying to create children to remake the connections.
However, from what I have read, I understand that Supervisors should be used only when things go wrong, so in this case perhaps using Parent would be against the community’s advice.
I discussed this in my talk at ElixirConf EU. But if you’re dealing with connections to downstream services, then you’re going to want to treat the connections as a state machine. They’ll come up in a disconnected state and attempt to connect over time. If they can connect, then they move to the connected state and can start accepting traffic and sending requests. The connection processes can handle some number of known issues such as disconnects or 503s. But part of the benefit of treating your connections as a state machine is that allowing them to crash becomes a safer proposition: when the supervisor restarts the connection process, it comes back up in the disconnected state without expecting the downstream service to be available, so it’s less likely to enter a crash/restart loop.
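A rough sketch of the shape I’m describing (module and helper names are made up; `open_connection/0` and `do_request/2` stand in for your gun or HTTP calls):

```elixir
defmodule MyApp.Connection do
  @moduledoc """
  Sketch of a connection process modelled as a state machine: it starts
  disconnected, keeps trying to connect, and only accepts work while
  connected.
  """
  use GenServer

  @retry_interval 2_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def request(payload), do: GenServer.call(__MODULE__, {:request, payload})

  @impl true
  def init(_opts) do
    # No expectation that the downstream service is up when we boot.
    send(self(), :try_connect)
    {:ok, %{status: :disconnected, conn: nil}}
  end

  @impl true
  def handle_info(:try_connect, state) do
    case open_connection() do
      {:ok, conn} ->
        {:noreply, %{state | status: :connected, conn: conn}}

      {:error, _reason} ->
        # Known issue: stay disconnected and try again later.
        Process.send_after(self(), :try_connect, @retry_interval)
        {:noreply, state}
    end
  end

  @impl true
  def handle_call({:request, _payload}, _from, %{status: :disconnected} = state) do
    # Refuse work rather than crash while the downstream service is away.
    {:reply, {:error, :disconnected}, state}
  end

  def handle_call({:request, payload}, _from, %{status: :connected} = state) do
    {:reply, do_request(state.conn, payload), state}
  end

  # Placeholders for e.g. gun calls.
  defp open_connection, do: {:error, :not_implemented}
  defp do_request(_conn, _payload), do: {:ok, :sent}
end
```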
This is what I meant when I said you need to bring your processes up in a stable state. Processes that talk to the outside world should have no expectation that the downstream service is available when they start. Otherwise you’ll end up with a system that can’t recover from transient failures and probably can’t even boot reliably.
This also gives you the observability that you want. It’s generally not helpful to page for every crashed process because you won’t have enough context to create a meaningful page. But if your processes are attempting to move from a disconnected state into a connected state and aren’t able to then you can start to send more meaningful alerts to an operator or an on-call person. You can also regularly poll your connection processes and if some number of them aren’t up then you can take yourself out of the load balancer or provide more meaningful liveness and readiness checks to your other systems.
Most services run in conjunction with other services. The meaningful metric is typically not whether one service is healthy; it’s whether the entire system is healthy. Allowing the system to degrade gracefully becomes a key factor for healthy systems.
It isn’t always practical to introduce a delay, but in cases where it is, I’ve used the following trick to allow a failing worker to “keep trying indefinitely” without hitting max restart intensity. In my worker, I use Process.send_after or :timer.sleep() to introduce a delay before executing the code that might fail. If the delay is greater than the max_seconds option you passed to Supervisor.start_link/2, then even if the worker fails repeatedly, it won’t fail frequently enough to exceed max restart intensity. It’s not elegant, but it is simple. Obviously this only suits certain cases, often it won’t be acceptable to introduce delays.
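In code, the trick looks roughly like this (a sketch; `do_risky_work/0` is a placeholder and the delay is arbitrary, it just needs to be longer than the supervisor’s :max_seconds window):

```elixir
defmodule MyApp.DelayedWorker do
  @moduledoc """
  Sketch of the delay trick: the worker waits before doing the risky work,
  so even if it crashes every time, crashes can't happen often enough to
  exceed the supervisor's max restart intensity.
  """
  use GenServer

  # 10 seconds; keep this above the :max_seconds given to the Supervisor.
  @delay 10_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    Process.send_after(self(), :work, @delay)
    {:ok, opts}
  end

  @impl true
  def handle_info(:work, state) do
    # If this raises, the worker crashes and gets restarted, but at most
    # once per @delay milliseconds.
    do_risky_work()
    {:noreply, state}
  end

  # Placeholder for the code that might fail.
  defp do_risky_work, do: :ok
end
```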