What is easier to scale, Go with Docker and Kubernetes or Erlang / Elixir + OTP?

When presenting Elixir to some devs, I have already gotten replies along these lines, and the last one I remember said that AWS auto-scaling already gives them the distribution and fault tolerance I advertised in the BEAM.

So, as you said, devs that do not know the BEAM will not see the value in it, since the current cloud offerings already give them the “distributed” and “fault-tolerant”/“highly available” bits they need for their applications to be resilient and highly available.

6 Likes

I can understand your skepticism, and I would not venture into a discussion of communication practices between pods, as I don’t know enough about them yet.

But something I do know is that the growth of IoT is such that a large share of the workload will have to be delegated outside the CSPs’ infrastructures, onto edge devices. And once again, the industry actors that are not willing to go through this paradigm shift will not be able to sustain future requirements.

And in the context of billions of small user devices, inter-process communication opens the path towards M2M collaboration, which is why I would be very interested in knowing what, specifically, you would consider an architectural flaw in that case.

Best,

Igor

1 Like

(1) Distributed Erlang was primarily intended for physically colocated hardware that was networked in a “secure” manner (e.g. isolated from outside interference; think each node running on a separate board communicating through a common backplane).

(2) The following is a mere rationalization of this experience.

  • The cumulative code deployed to a single node aggregates into a single system. That system could simply be a sub-system of a larger whole, but anything running on a single Erlang runtime acts as a single autonomous unit, no matter how many processes exist on the inside.

  • For the purposes of fault tolerance (provided (1) is satisfied), distributed Erlang (i.e. process communication between nodes) makes sense because the systems are identical, so there is no issue with implementation coupling. The nature of the communication traffic is closer to intra-system than inter-system communication.

  • If two systems operating on two separate nodes implement different capabilities, then more mainstream communication protocols should be preferred. Using distributed Erlang leads to implementation coupling, because now all the collaborating systems either have to be implemented as BEAM systems or at least have to communicate through proxy technologies such as C Nodes, Jinterface, or Pyrlang (a sketch of what that coupling looks like follows below). Also, more mainstream protocols may make it easier to use common network monitoring (or perhaps even governance) tools.
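As a concrete illustration of that coupling: on the BEAM, messaging a process on another node uses the very same primitives as local messaging, which is exactly what makes it feel intra-system. A minimal Elixir sketch (the node names and registered process names are invented for the example):

```elixir
# On node :"a@host", assuming node :"b@host" runs the same release
# and registers a process locally under :telemetry_collector.
true = Node.connect(:"b@host")

# The same `send` primitive as for a local pid; the message format
# is an implementation detail both nodes have to agree on.
send({:telemetry_collector, :"b@host"}, {:sample, node(), :erlang.monotonic_time()})

# Or a synchronous call into a GenServer running on the remote node:
GenServer.call({MyApp.Collector, :"b@host"}, :flush)
```

Any non-BEAM system that wants to take part in this exchange has to speak the Erlang distribution protocol, hence the proxy technologies mentioned above.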

Another consideration is the protocol design between systems. Ideally messages between systems should have a coarser granularity and occur much less frequently than messages inside the system. From that point of view it makes sense to move inter-system communication to an altogether different network protocol rather than using inter-process messaging. In the IoT space MQTT seems to be the preferred protocol (Visualizing Home Automation with GRiSP).
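For comparison, the MQTT route is similarly compact from Elixir. A minimal sketch, assuming the Tortoise MQTT client and a broker on localhost (client id, topics, and payload are invented for the example):

```elixir
# Open a connection to the broker under Tortoise's supervisor.
{:ok, _pid} =
  Tortoise.Supervisor.start_child(
    client_id: "sensor_17",
    handler: {Tortoise.Handler.Logger, []},
    server: {Tortoise.Transport.Tcp, host: 'localhost', port: 1883},
    subscriptions: [{"commands/sensor_17", 0}]
  )

# A coarse-grained, infrequent inter-system message; QoS 0 is fire-and-forget.
Tortoise.publish("sensor_17", "readings/sensor_17", "23.4", qos: 0)
```

The subscriber on the other side can be written in any language with an MQTT client, which is precisely the decoupling argued for above.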

I would imagine that in the IoT space Erlang/Elixir isn’t that impacted by the Docker/K8s prejudice. However, other languages and runtimes are available in that domain. The issue is that often tools that optimize for a low barrier to entry and fast first-iteration delivery and deployment are adopted over tools that strive for all-around (long-term) quality. I’ve already mentioned PHP, and I think Python for Machine Learning fits there as well. I suspect that Python and JavaScript are popular with the “easy to get started” IoT crowd, despite the technical shortcomings that may impose on the resulting solutions.

4 Likes

It is one of the biggest reasons why I try to avoid all the Go I can: underwhelming error handling, with a total lack of detectability of failed or hung goroutines.

They might not be missed when you compare with other languages that don’t even have such features, but it’s such a glaring omission for an Erlang old-timer that it’s like (if I can be allowed the flippant comparison) looking at a car with no brakes or seatbelts while the driver tells you “if you’re just careful and time manual engine compression right, you don’t actually need any of that”.

I understand what you mean by “not missed because you don’t know about them”, but frankly it still boggles my mind that nobody actually pines for them in these communities. The debugging stories I’ve heard and seen are just maddening.

17 Likes

Totally true statement: ask my neighbor about a truck ride through the mountains with no brakes, haha!

2 Likes

You don’t have to use BEAM clustering. It’s perfectly fine to treat an Elixir/Erlang application the same way you would treat a Go / Rails / Java application: it runs its own instance on each virtual machine in the cloud, and you find some other way to cluster them together. In the simplest case you just have a shared database and a proxy in front of your machines. If you need the nodes to communicate with each other, you add something like RabbitMQ.
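To make that concrete: a minimal sketch of cross-instance messaging over RabbitMQ with the amqp Elixir library, instead of BEAM clustering (queue name and payload are invented; the connection defaults to a broker on localhost):

```elixir
# Producer instance: hand work to a shared queue rather than
# sending BEAM messages to a clustered node.
{:ok, conn} = AMQP.Connection.open()
{:ok, chan} = AMQP.Channel.open(conn)
{:ok, _info} = AMQP.Queue.declare(chan, "work")
:ok = AMQP.Basic.publish(chan, "", "work", "process this")

# Consumer instance (any other VM behind the proxy):
case AMQP.Basic.get(chan, "work", no_ack: true) do
  {:ok, payload, _meta} -> IO.puts("got: #{payload}")
  {:empty, _meta} -> IO.puts("queue empty")
end
```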

You would do deployments/scaling the same way in Go as in Erlang/Elixir in that case; it’s precisely the same level of complexity.

BEAM clustering is there in case you have a good use case for it, but if you are already looking at deploying to the cloud you most likely will not do it / need it.

11 Likes

Yesterday I found this talk, which is quite related to my question.

First it shows the strengths of Go,
then the shortcomings of Go and why Erlang is better,
and in the end it shows that if we add Kubernetes to the mix, we can make Go as robust as Erlang.

But the question is still about the ease of deployment and scalability.

2 Likes

Not only. As others have mentioned in different threads, you cannot simply reboot a BEAM node. The BEAM VM is designed to run and orchestrate a large number of smaller tasks. Killing a BEAM VM node the way Kubernetes loves to do is akin to rebooting your PC because you cannot close one program. Brutally inefficient.

Not really. And certainly no different than killing a Go or Java node.

I really don’t get this presentation. Restarting crashed nodes and running more than one instance is not new with Kubernetes; this is how systems have been run forever. Kubernetes provides a lot of great, unique stuff, but running multiple instances of your app and restarting them when they crash is not one of them.

No one would suggest not handling errors in your Go code because you can have k8s restart it when it crashes, or not handling exceptions in Java because k8s can restart the node. Erlang’s supervision and isolation help make it so you can depend even less on k8s handling a failed pod. Resources aren’t infinite and scheduling pods isn’t free, so no one would want to rely on k8s to handle every failure in a program.
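To ground that: the failure handling k8s gives you per pod, OTP gives you per process, and a node can host millions of those. A minimal Elixir sketch (module names invented for the example):

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}

  @impl true
  def handle_cast(:boom, _state) do
    # This crash kills only this process (and its in-flight work);
    # the rest of the node keeps serving traffic.
    raise "simulated failure"
  end
end

# The supervisor restarts just the crashed child; no pod gets rescheduled.
{:ok, _sup} = Supervisor.start_link([MyApp.Worker], strategy: :one_for_one)

GenServer.cast(MyApp.Worker, :boom)
Process.sleep(100)
GenServer.whereis(MyApp.Worker) #=> a fresh pid, restarted under the same name
```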

6 Likes

If you are going to let a process go down and rely on k8s to restart it on crashes for ‘reliability’, you are going to see a lot of 503s, because your nodes do not serve only one request at a time. If the process bounces, all in-flight requests fail.

8 Likes

No, because your load balancer will re-try.

Not that I recommend depending on the LB to shim a thin appearance of reliability on top of an unreliable service :wink:

No sir. If the request is delivered to the application, and the application never sends a response or sends a 500, it cannot be “retried” safely (the request may have been processed already, or it may be the cause of the failure), and I’m not aware of an HTTP load balancer that does this.

8 Likes

Uhm, I’ve seen plenty of load balancers, in production, at big companies running name-brand services, which do exactly this. It’s an API-design issue to make sure that multiple tries do not do bad things, because it is always possible for the client to not know whether a request succeeded or not. Basic distributed systems principle: you can have at-least-once or at-most-once, but never exactly-once. (Of course you can combine at-least-once with unique ids where needed, so that multiple submissions get skipped…)

And yes, if the request is the cause of the failure and you keep re-trying, hilarity can ensue, which is why you need a config param to limit the number of retries…
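The unique-id trick is simple enough to sketch. A hypothetical Elixir version with an ETS table as the dedupe store (in a real system the store would have to be shared across instances, and the lookup-then-insert would need to be atomic):

```elixir
defmodule Idempotent do
  # Remember the result of each request under its client-supplied
  # idempotency key, so a retried submission is skipped, not re-run.
  def setup do
    :ets.new(:idempotency_keys, [:named_table, :public, :set])
  end

  def run(key, fun) do
    case :ets.lookup(:idempotency_keys, key) do
      [{^key, cached_result}] ->
        cached_result

      [] ->
        result = fun.()
        :ets.insert(:idempotency_keys, {key, result})
        result
    end
  end
end

Idempotent.setup()
Idempotent.run("order-7f3a", fn -> IO.puts("charging card"); :charged end)
# A retry with the same key returns :charged without charging again:
Idempotent.run("order-7f3a", fn -> IO.puts("charging card"); :charged end)
```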

1 Like

You’d also need a config parameter for the amount of memory the load balancer will use to buffer each request, so that it can be resent later if required. It might need quite a few gigabytes! If you can find documentation of a load balancer that exposes params like that I’d be thrilled to see it. AWS doesn’t do retries at all. Some may do retries on GET requests, as those are supposed to be idempotent and usually aren’t large. I was talking about PUT/POST/DELETE.

nginx retries upstream requests by default. In fact, I believe that disabling this by default for PUT & POST was a feature added in the 2016-ish timeframe; before then it retried them by default as well, which could result in surprises for folks who hadn’t taken care to make their API idempotent.
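For reference, in current nginx this is governed by `proxy_next_upstream`: non-idempotent methods (POST, LOCK, PATCH) are no longer retried unless you opt back in. A sketch (the upstream name is invented):

```nginx
location / {
    proxy_pass http://app_backend;
    # Retry the next upstream on connection errors and timeouts.
    # `non_idempotent` opts POST/LOCK/PATCH back into retries,
    # which is only safe if the API was designed to be idempotent.
    proxy_next_upstream error timeout non_idempotent;
    proxy_next_upstream_tries 3;
}
```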

Envoy, HAProxy, and Traefik support retries; I don’t know what their default behaviors are.

I think you’re describing OTP as a supervisor. If so, OTP can’t replace K8s, because the supervisor only handles processes on top of the app server.

What “fault-tolerant” means in Elixir is process restarts; DB instances, Redis, and different server instances are beyond the supervisor’s reach.

On the other hand, K8s orchestrates containers. That is to say, one can think of K8s as a container supervisor.

And dockerizing Elixir is advantageous too. Updating code is so easy when you use Docker.

As for hot code reloading in Elixir, I really wonder whether anyone actually uses this feature instead of Docker.

Scaling

Vertical scaling isn’t a big deal in most cases.
But horizontal scaling? I don’t think it’s smooth in Elixir. (I haven’t tried it yet, so please share your experiences.)

I think there are some pain points, especially multiple nodes with WebSockets. If you’re using ETS as a cache, via con_cache or Cachex, access and control will be another headache.

Though I wish I could use ETS more often, no one would doubt that a Redis container is much more manageable when it comes to horizontal scaling.
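To illustrate the pain point: an ETS-backed cache is node-local, so with two instances behind a load balancer each node only sees its own writes. A minimal sketch, assuming Cachex v3’s API (cache name and key invented for the example):

```elixir
# Node A caches a session...
{:ok, _pid} = Cachex.start_link(:sessions)
{:ok, true} = Cachex.put(:sessions, "user:42", %{name: "Ana"})
Cachex.get(:sessions, "user:42")
#=> {:ok, %{name: "Ana"}}

# ...while node B, behind the same load balancer, has its own empty copy:
# Cachex.get(:sessions, "user:42")
#=> {:ok, nil}
```

With Redis, both nodes would hit the same external store, which is the manageability point above.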

Conclusion

Whether you choose Go or Elixir, Docker and k8s are more than an option if you’re building a big project. But I can say that Elixir just quietly works when running on a standalone server.

nohogu.com, one of my personal projects, is running on a basic DigitalOcean server with Postgres and Ubuntu 18.04. Only S3 is leveraged externally. (Keep in mind the code isn’t optimized; the site is still in beta.) It’s much faster than the previous Rails server, with no compromises.

5 Likes

You can stop wondering now. My company does, and there are conference talks representing more companies too. We tend to be a quieter group than those who need help setting up Docker, so we’re easy to miss.

13 Likes

Gigalixir supports hot reloads inside Docker containers that are managed by k8s. They don’t ship a new image to deploy application updates.

2 Likes

Could you please write about your hot code reloading experiences somewhere?

3 Likes