I already responded on Twitter, but for the sake of completeness I’ll redo it here. From what I can tell, Erlang uses its own DNS resolver, inet_res, which is written in plain Erlang, so this issue should not surface in Erlang.
More generally, if the runtime uses a potentially long-running syscall, the thread pool can be exhausted. My feeling is that the OTP team tries to avoid these scenarios, and that such issues would already surface in practice, but I don’t have any data to back this claim.
Yeah, I am also interested. Though he also said that there is a choice between two implementations, and that the first one can actually scale much better than a normal pool of OS threads doing synchronous I/O.
Just to make one thing clear first. I am not an expert when it comes to DNS, nor really how it works in Erlang. I know some things, but not all. This is how it works to the best of my knowledge.
Erlang provides both a UDP-based implementation and a “native” implementation. You can configure which one is used here: Erlang -- Inet Configuration.
The UDP-based one is written completely in Erlang with the pros and cons of that. You can see the implementation here. This can be faster than the native one or not depending on what you are doing. There is work ongoing to make it better.
The “native” implementation is a port program called inet_gethost. You can see the implementation here. This program starts a pool of threads that call gethostbyname and communicate with Erlang via stdin/stdout. inet_gethost was written long before NIFs became a thing, which is why it is not a NIF.
Both of these have been around for about 20 years with tweaks along the way, but the main point is the same. Sometimes you want the OS name resolution and sometimes you don’t. It depends on what it is that you are doing.
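To illustrate the general shape of a "pool of threads calling the OS resolver", here is a minimal sketch in Python. It is not how inet_gethost is implemented: the real thing is a separate OS process talking to the VM over stdin/stdout, while this sketch keeps everything in one process. The function name resolve_all and its resolver and workers parameters are made up for the example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_all(hostnames, resolver=socket.gethostbyname, workers=4):
    """Resolve hostnames on a fixed-size pool of worker threads.

    Each worker makes a blocking resolver call, so at most `workers`
    lookups are in flight at once. Returns a dict mapping every
    hostname to its address, or to None if the lookup failed.
    """
    names = list(hostnames)

    def safe_resolve(name):
        try:
            return resolver(name)
        except OSError:  # socket.gaierror and socket.herror subclass OSError
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs names with results.
        return dict(zip(names, pool.map(safe_resolve, names)))
```

The pool size is the important knob: if every worker is stuck waiting on an unresponsive DNS server, further lookups queue up behind them, but the callers' own threads are not the ones blocked in the resolver.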
I think I now have a better understanding, but I am still left with a question, which was the origin of all that discussion on Twitter and the reason for this post:
So, is Erlang capable of causing OS thread starvation, be it with DNS queries or any other blocking OS syscall?
I think inet_res is there for speed, not for robustness. If your DNS server is not responding, you are screwed in multiple ways and OS thread starvation may not be very high on the list.
I am not really worried about DNS (it is just the example in the Tweet); I just wanted to know if it was possible for the BEAM to starve the OS of threads, and it seems that it is possible, as mentioned in the previous post by @garazdawi.
If a scheduler thread runs blocking code, it will block. Therefore any potentially long-running synchronous syscall could lead to thread exhaustion.
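To make the exhaustion argument concrete, here is a minimal sketch in Python. It is not Erlang and says nothing about BEAM internals. It only shows the general effect on a fixed-size pool: once every worker is stuck in a blocking call, even a trivial task has to wait. The name quick_task_wait and the default pool size and sleep duration are made up for the example, and time.sleep stands in for a slow synchronous syscall.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def quick_task_wait(pool_size=2, block_seconds=0.2):
    """Return how long a trivial task waits behind blocking calls.

    Every worker in the pool is first occupied with a blocking call
    (time.sleep here, standing in for a slow syscall). A task that
    does almost no work is then submitted, and we measure the delay
    between submitting it and the moment it actually starts running.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Occupy every worker with a blocking call.
        for _ in range(pool_size):
            pool.submit(time.sleep, block_seconds)

        submitted = time.monotonic()
        # The task just reads the clock, but it cannot start until
        # one of the blocked workers becomes free again.
        started = pool.submit(time.monotonic).result()

    return started - submitted
```

With the defaults, the measured wait should be close to block_seconds even though the task itself is instantaneous. Making the pool larger only moves the limit; it does not remove it.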
However, a benefit of the Erlang runtime over most others is that you can only block a scheduler if the BIF you’re calling is blocking, whereas in other runtimes you can do it with your own custom logic. IIRC, blocking a Go scheduler was as simple as for {}, and I suppose that in Node something like while(true); should do the job.
Consequently, the Erlang approach has an interesting potential: the runtime layer could completely prevent blocking and thread exhaustion. I don’t know how many potentially long-running blocking syscalls are currently used. It would be interesting to know that and see if there are possibilities to eliminate them or provide alternative solutions.
Good luck with that when DNS does not work. If your sshd_config has UseDNS, you are screwed; if your sshd_config has UsePAM and your PAM setup looks up names, you are also screwed. Hell, if your shell’s prompt has \h in it (VERY common), it will do hostname -f for every new shell you spawn.
The DNS service server is what he wants to connect to via a remote shell, whereas what you are referring to is DNS problems on the machine trying to connect to the remote shell, so it’s not the same thing. In other words, the remote DNS service server can be inoperative, but as long as your laptop has working DNS you can fire up the remote shell.
The point of the (now huge) Twitter thread to me seemed to have come from “is it possible to starve a thread pool comprised of raw OS threads” and slow/unresponsive DNS was given as an example.
The answer is always “yes, it can”. Handing the keys to the kingdom to most programmers nowadays is a no-go because they have no clue there are actual physical limitations there. Do try and spawn 50,000 threads on your machine. Unless you have a $25,000+ workstation, you’ll start seeing your machine lag at the 3000th or even the 2000th mark.
As others have remarked both here and in the Twitter thread, there’s a LOT that can be done. But the original poster seemed to do his very best to be not impressed (I pointed that out to him at the end of my participation). And the discussion got perverted to “but there ARE ways for all languages / runtimes to alleviate the problem!” which is IMO a discussion stopper.
Of course there are ways. There are ways to not litter parks yet people do it anyway. There are ways to have wooden furniture without destroying the Amazon forest but it’s destroyed anyway. Etc.
Same goes for languages/runtimes; he made a few remarks that modern languages are starting to learn from Erlang, to which I simply responded “but I need the results today”. I won’t care much if PHP and Ruby and Python are finally green-thread-enabled 20 years down the line. This makes for inspirational history books but in the meantime all of us have to work with something.
So the discussion started off well but it failed to stick to the main point: “which languages/runtimes do it better TODAY?” As I also remarked in the Twitter thread, a theoretical claim like “every language can be as good as Erlang” does not make for an interesting or productive discussion.
@garazdawi Thanks for the links. I learned valuable things from them.
It seems to me that this is far from the correct way of doing it in Erlang, so maybe an Erlang developer from this forum can open a pull request to fix it?
Maybe @garazdawi can shed some light on this or point us to someone who can?