Yeah, the software industry in Germany is really frustrating and so far behind other countries.
Good idea. I guess this forum would be an excellent place for that. I’ll check if anyone has opened that thread already. (I might do this later)
Arguments like OTP and fault tolerance don't help when the CTO states that this is handled by Kubernetes already.
I haven’t worked with Kubernetes a lot - but wouldn’t it just handle fault tolerance at the scale of an entire app crashing? Kubernetes + golang (without external golang packages) wouldn’t handle fault tolerance at the “agent” level, which is where the niceties of OTP and its standards come into play.
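To make that distinction concrete, here is a minimal, hypothetical sketch (the module and registered names are made up) of agent-level fault tolerance: one worker crashes and its supervisor restarts it while its sibling keeps running - something a container-level restart can't do without taking down every agent in the app.

```elixir
# Hypothetical sketch: two "agents" supervised inside one app.
# Kubernetes could only restart the whole container; a supervisor
# restarts a single crashed agent and leaves the rest untouched.
defmodule Demo.AgentWorker do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name, name: name)

  @impl true
  def init(name), do: {:ok, name}

  @impl true
  def handle_cast(:crash, name) do
    # Simulate an agent-level fault; only this process dies.
    exit({:boom, name})
  end
end

children = [
  %{id: :agent_a, start: {Demo.AgentWorker, :start_link, [:agent_a]}},
  %{id: :agent_b, start: {Demo.AgentWorker, :start_link, [:agent_b]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

GenServer.cast(:agent_a, :crash)
Process.sleep(100)

# :agent_a has been restarted by its supervisor; :agent_b never noticed.
IO.puts(Process.alive?(Process.whereis(:agent_a)))
```

The app as a whole never goes down, so Kubernetes never sees the fault at all.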
I also had a hard time convincing companies to do Elixir (as a freelancer in Denmark).
It got to the point that my customers would require me to buy them ice cream every time I mentioned Elixir.
I ended up starting my own company, becoming CTO, and choosing Elixir as the main language. Now we are 10 people (3 developers) and I managed to hire the first Elixir developer besides myself.
Perhaps not a solution for everybody, since it requires the right place, time and people. But if the right place, time and people are there, go for it!
Yep, asking for an explanation of what ideas there actually are outside the current libs that offer value gets interpreted as claiming any such ideas are worthless. You’ve been crystal clear that you (we) are open to hearing about them.
While this is true, the trend is for the apps in K8s to be smaller and smaller. As in it’s not uncommon to spec 1 CPU & 256MB for Java apps. Of course not everything fits that model, but… K8s pods look more & more like OTP supervisors with a handful of child processes.
If somebody is only familiar with mainstream technology they would never expect the notion of fault resilience (rather than bare bones micro-level fault handling) to be part of a programming language.
Most high level programming languages are designed “language first” to perform computations. I suspect Erlang and the BEAM were developed in lockstep for computation and coordination which is why runtime concepts like process links and monitors are part of the core.
This forms part of the supervision strategy of a system and in some situations is put in place not by the developer, who focuses only on what particular workers have to do, but by the architect, who has an overall view and understanding of the system and how the different components interact with each other.
There also seems to be a drive to commoditize developer skills and designing for fault resilience seems to require a more advanced skill set. So the responsibility is pushed to the next level up, outside of the container. At this level of granularity the container becomes the “component”, k8s adopts the “let it crash” philosophy, creating design pressures to have the “component” no larger than what you can afford to lose.
At least that is my current perception …
What is easier to scale, Go with Docker and Kubernetes or Erlang / Elixir + OTP?
This seems to be the crux of this divide we’re lately witnessing (K8s vs. other deployment strategies). Many CTOs and programmers in general want to treat fault tolerance as an implementation detail that can be offloaded onto the Operations / Sysadmins team. I certainly witnessed such efforts several times already.
I mean OK, they are free to try, but I have the feeling that the IT field at large will yet again waste humongous amounts of time, spend a lot of money, generate a huge legacy codebase somebody has to maintain, burn out and seek alternatives… and then somebody will notice that Erlang had 99% of what they wanted all along.
Maybe it is miscommunication. People hear “restart the process” and view it as a higher layer, similar to k8s. But restarting a crashed process isn’t new - Erlang itself even comes with heart - yet no one questions what the point of supervision is if you have heart.
I think this goes beyond simply explaining the performance/cost benefits of not churning k8s pods. It may need an explanation that breaks down how it helps structure your program, and how you would no more want to be without it than without try/catch.
I usually turn this around by saying supervisors work as if you had k8s inside your code. For example, when you have 20 database connections in your app, it is not something that k8s can manage, restart and automate, but supervisors provide exactly that, so you can apply the same principles on the “small” and the “large”.
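As a rough illustration of that “k8s inside your code” framing (all module and process names here are hypothetical), a supervisor over a pool of 20 connection holders restarts exactly the one that dies:

```elixir
# Hypothetical sketch: a pool of 20 connection holders under one
# supervisor. If one "connection" dies, only that one is restarted;
# the other 19 keep serving - the same restart/automate principle
# k8s applies to pods, applied in the small.
defmodule Demo.Conn do
  use GenServer

  def start_link(i), do: GenServer.start_link(__MODULE__, i, name: :"conn_#{i}")

  @impl true
  def init(i), do: {:ok, i}
end

children =
  for i <- 1..20 do
    %{id: {:conn, i}, start: {Demo.Conn, :start_link, [i]}}
  end

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Kill one "connection"; the supervisor replaces it automatically.
Process.exit(Process.whereis(:conn_7), :kill)
Process.sleep(100)

IO.inspect(Supervisor.count_children(sup).active)
```

The pool is back to 20 active children without anything outside the VM being involved.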
Of course there are many gaps in this explanation but it can be a starting point for someone who heard very little about Erlang/Elixir.
I haven’t moved away from Elixir but I have moved away from dev into a more Ops/Arch/Managerial level… largely because I’m very passionate about protecting the development process from the business side of a company. As a developer, you can’t easily do that because there’s a lot more communication involved.
But there’s also the reality that after diving headfirst into Elixir I find myself having to avert my eyes to a lot of other languages. I find it’s much simpler for me to operate at a higher level and discuss the needs with devs of any language to solve in the way that makes the most sense to them…than it is for me to look at the code and constantly think about how much simpler this would be with Elixir.
This is where things like load-balancers - which are external to your application by definition - pose interesting questions for us.
K8s, PCF, or another high-level orchestrator can stop sending a node traffic when it displays signs of issues - latency spikes, CPU spikes, a drop-off in request-processing rate, 5xx response errors - and then create a new node.
When an Elixir app is having problems, it can either continue to accept connections that it can’t currently handle in the hope of restoring normal service, or it can start to reject them, at which point some other component is going to have to handle this anyway.
The external orchestrator stops the bleeding as soon as a problem is detected. The internal handling of a supervisor doesn’t, as it attempts to recover from the problem. External orchestrators have a smaller range of errors that they can detect than internal supervisor hierarchies, and a smaller range of responses. Higher error sensitivity is good, and a small, deterministic range of responses is also probably good.
Combining the two is, probably, the worst of all situations; your app accepts traffic it can’t currently handle as it tries to recover, delaying the point where the external orchestrator gives up and kills it.
This is why I don’t use Elixir beyond personal projects, even though I’m very fond of it.
Combining the two is the best of all worlds. Elixir and Erlang give you tools to provide signals to external load balancers and do much smarter load balancing and routing of traffic. You don’t stop using a load balancer just because you’re using Elixir.
At high scales you’re essentially running in degraded mode all the time whether you know it or not. The goal is to make all of your applications gracefully degrade. Every language and runtime has to contend with this. Elixir and erlang provide a useful set of tools to build that graceful degradation into your application which is a really powerful concept. But it doesn’t preclude or obfuscate the need to also provide graceful degradation at the system level (like in your load balancer example). It complements it.
A supervisor won’t retry things forever. Continuous errors will cascade and cause the whole node to go down permanently, which is when the orchestrator will detect it. The advantage of the supervisor is that if you have a small hiccup, you can quickly heal without going through an external system loop.
Furthermore, you usually write your supervision tree so the essential services are started first and the rest of the application won’t run unless those services are available. For instance, if you are detecting failures through a health check endpoint, it is straightforward to either disable the health check endpoint if any other service is down, or have the health check report that the system is non-functional.
So if your system is accepting requests when you know it can’t handle them, you have all of the control to stop accepting said requests. The goal of a supervisor is to help you think about failures and set up reasonable strategies. In the worst case scenario, if your strategy is to never handle failures at that level, you might as well set max_restarts: 0.
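A small sketch of that worst-case configuration (the worker module here is made up): with max_restarts: 0 the supervisor makes no recovery attempt, so the first crash escalates upward - which is exactly the behaviour an external-orchestrator-only strategy expects.

```elixir
# Hypothetical sketch: max_restarts: 0 opts out of in-process recovery.
# The first crash exceeds the restart budget, the supervisor shuts
# itself down, and the failure escalates to whatever sits above it
# (ultimately the node, where an external orchestrator can react).
defmodule Demo.EscalateWorker do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_), do: {:ok, nil}
end

# Trap exits so this script survives the supervisor's shutdown.
Process.flag(:trap_exit, true)

{:ok, sup} =
  Supervisor.start_link(
    [%{id: :w, start: {Demo.EscalateWorker, :start_link, [nil]}}],
    strategy: :one_for_one,
    max_restarts: 0
  )

ref = Process.monitor(sup)
Process.exit(Process.whereis(Demo.EscalateWorker), :kill)

# The supervisor gives up on the very first crash instead of restarting.
receive do
  {:DOWN, ^ref, :process, ^sup, reason} ->
    IO.puts("supervisor gave up: #{inspect(reason)}")
after
  1_000 -> IO.puts("unexpected: supervisor survived")
end
```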
Sometimes the better course of action is just to listen: “hey, I am sorry that you feel this way. I have not had the same experience. In any case, if you want to talk or suggest ways we can improve, we will be glad to hear”. That’s it.
I completely understand your perspective. It is frustrating to be called something you don’t agree with, but at the same time, this is a thread for people to express their opinions on why they moved away from Elixir. Of course, if you disagree with something, you can provide counter examples, but this thread should mostly be an exercise on listening for most of us. Otherwise, the next time a similar thread pops up, nobody will say anything, and then we won’t learn anything either.
There are countless threads that say wonderful things about Elixir; we will be fine with a thread that brings some of the negative points to light so we can work on them.
Agreed. Apologies if what was said looked non-friendly.
I completely agree with your comments about systems at scale operating in some manner of perpetually degraded mode; continuous partial failure is a fact of life.
But those failures are, by their nature, likely to be surprising. When an application/node is having problems, you can’t rely on it to solve those problems, and you can’t rely on it to reliably inform other systems of those problems. The only way I know of to do that is by external observation of the behavior of a node. Then it doesn’t matter if a node thinks it is healthy or thinks it has issues, has managed to communicate that state to an external system or not… if it isn’t observably behaving as expected, you can take action - in this example, take it out of a load-balancer rotation.
In a homogeneous Beam environment, I might want to try to overcome that, but I’ve never worked in a homogeneous environment at scale. I can externalize that control with PCF or K8s/Istio and my problems largely go away, no matter what the implementation technology is. At scale, being aggressive about killing nodes gives me better liveness. If 1 out of N nodes starts to look odd, immediately taking it out of rotation, creating a new one and terminating that node when any connections have returned or timed out, gives me a better return than watching and hoping that it will get better.
All this does presuppose ephemerality of nodes, something that has become increasingly true, but is not universally so and different trade-offs need to be made. But, in applications where nodes are truly ephemeral, I get no benefits from nursing sick nodes back to health.
A few thoughts:
At the time when Beam was invented, there were large machines with applications composed of many subsystems. Supervision of those subsystems within the application combined with very detailed work to stop failures propagating across processes in the VM was a work of insight and genius.
Now, in the world of microservices and, for the first time ever, shrinking node-sizes we are composing those subsystems at the node per process level rather than as internal modules and processes. To get reliability, we need a supervision mechanism just like the one the Beam has had for all these years. That’s the external orchestrator, and letting individual subsystems/processes/nodes crash is the epitome of the Erlang model.
Now, the Beam has supervisor hierarchies, and you could suggest that we think of an external orchestrator as one level, and the internal supervisor hierarchy as simply another finer grained level. And that is worth thinking about. I would however suggest that, just as the Beam doesn’t really try to save processes - it lets them crash and creates a new one, that containerized microservices are the process level abstraction in this case and that we should let them crash and create new ones.
This isn’t a black and white issue, but as container management orchestrators get better and better, I can’t make a convincing case for why I’d try to save a single node/process.
Except that I can typically have my nodes degrade in a much more robust way than Istio can. Also, the problems don’t “go away” because you use Istio or k8s or whatever. You just have different problems now. But that’s beside the point.
More importantly, there’s nothing about running Elixir that keeps you from running ephemeral nodes. We run ephemeral nodes at work. You’re drawing a distinction here that I truly don’t understand. Do your other apps and services not try to reconnect when they drop a database connection? Furthermore, if the database goes down, do you start killing your service nodes when they return enough 500s because they can’t talk to the database? Or do you stop sending them traffic because they start returning a degraded status in your liveness checks? All of these patterns work equally well with Elixir. The distinction you’re drawing isn’t real.
I’m not in any way suggesting that Elixir can’t operate in this way. I’m merely suggesting that some of the obvious benefits of supervision models cease to be such obvious benefits. And without the obvious benefits, it becomes a much harder sell.
Let me ask one question then: how do you expect your app to behave when you drop connections to the database because of a very quick netsplit of a few seconds?
The supervisor model in Elixir was never about the fault tolerance at the load balancer or between nodes (it could be done like that but you would need to write a lot of infrastructure to get this working, i.e. it is not there out of the box). It is about bringing a reasonable model to reason about errors inside the programming language.
And the behaviour Elixir provides here can be achieved in all programming languages. I am pretty sure Java, Ruby, Python, Haskell, etc can all reconnect to the database when the database disconnects, the case for Elixir is that achieving the same behaviour is done more reliably and with simpler idioms.
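To illustrate what “simpler idioms” can mean here (the FakeDB module below is a stand-in, not a real driver): the connection holder just crashes on disconnect, and reconnection falls out of the supervisor’s restart policy instead of hand-written retry loops.

```elixir
# Hypothetical sketch: reconnect-by-restart instead of bespoke retry code.
defmodule FakeDB do
  # Pretend driver: "connects" by returning an opaque handle.
  def connect, do: {:ok, make_ref()}
end

defmodule Demo.DBHolder do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_) do
    # If connect fails, init fails and the supervisor retries per its
    # restart policy - no reconnect logic lives in the worker itself.
    {:ok, conn} = FakeDB.connect()
    {:ok, conn}
  end

  @impl true
  def handle_info({:disconnected, _reason}, _conn) do
    # Crash on disconnect; supervision handles reconnecting.
    exit(:disconnected)
  end
end

{:ok, _sup} = Supervisor.start_link([Demo.DBHolder], strategy: :one_for_one)

# Simulate the netsplit: the driver notices the dropped connection.
send(Process.whereis(Demo.DBHolder), {:disconnected, :tcp_closed})
Process.sleep(100)

IO.puts(if Process.alive?(Process.whereis(Demo.DBHolder)), do: "reconnected", else: "down")
```

Any language can express this reconnect loop; the point is that here it is the default shape of the code rather than something bolted on.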
So I agree with @keathley. Those two should not really be compared and if someone was expecting the value for Erlang/Elixir to come from the fault tolerance at such a high level, it is no surprise their expectations are not being met. But preferably those expectations should not have been set in the first place.