Tweet: Thread pool starvation affects everyone, with an example of a DNS service

I already responded on Twitter, but for the sake of completeness I’ll redo it here. From what I can tell, Erlang uses its own DNS resolver, inet_res, which is written in plain Erlang, so this issue should not surface in Erlang.

More generally, if the runtime uses a potentially long-running syscall, the thread pool can be exhausted. My feeling is that the OTP team tries to avoid these scenarios, and that such issues would have already surfaced in practice, but I don’t have any data to back this claim.

So, @sasajuric invited @garazdawi to the conversation and it seems that now we have the answer:

Today I woke up and, looking at my previous post, I am not sure if I totally understand the tweet I linked above.

Some questions for @garazdawi:

Do you mean that Erlang can make blocking OS calls that cause the OS to spin up threads, thus making it possible for the BEAM to starve the OS of threads?

Do you mean a DNS implementation? Can you link to both? If not, can you expand on this?

Yeah, I am also interested. Though he also said that there is a choice between two implementations, and that the first one can actually scale much better than a normal pool of OS threads doing synchronous I/O.

There is inet_res, which @sasajuric posted earlier. A DNS lookup is just a UDP client, so I assume it is not too much trouble to do in Erlang.

Just to make one thing clear first: I am not an expert when it comes to DNS, nor really in how it works in Erlang. I know some things, but not all. This is how it works to the best of my knowledge.

The tweet I was referring to was https://twitter.com/davidfowl/status/1365232349584052227?s=20

Erlang provides both a UDP-based implementation and a “native” implementation. You can configure which is used here: Erlang -- Inet Configuration.
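
As a rough illustration of that configuration, here is a minimal sketch of an inet configuration file. The file name erl_inetrc and its location are assumptions made for this example; the lookup methods native, file and dns are the ones described in the Inet Configuration guide.

```erlang
%% erl_inetrc (hypothetical file name, chosen for this example)
%%
%% Resolve names through the OS resolver, i.e. the "native" method:
{lookup, [native]}.

%% Alternative (replace the line above with this one): consult the
%% hosts table first, then the resolver written in Erlang:
%% {lookup, [file, dns]}.
```

The file would then be passed to the VM at startup, for example with erl -kernel inetrc '"./erl_inetrc"'.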

The UDP-based one is written completely in Erlang with the pros and cons of that. You can see the implementation here. This can be faster than the native one or not depending on what you are doing. There is work ongoing to make it better.

The “native” implementation is a port program called inet_gethost. You can see the implementation here. This program starts a pool of threads that call gethostbyname and communicate with Erlang via stdin/stdout. inet_gethost was written long before NIFs became a thing, which is why it is not a NIF.

Both of these have been around for about 20 years with tweaks along the way, but the main point is the same. Sometimes you want the OS name resolution and sometimes you don’t. It depends on what it is that you are doing.
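
To show what the two paths look like from calling code, here is a minimal sketch, not a recommendation. The module name resolver_demo and the atoms erlang_dns and configured are made up for this example. inet_res:lookup/5 always goes through the resolver written in Erlang, while inet:gethostbyname/1 follows whichever lookup methods are configured (typically native unless changed).

```erlang
-module(resolver_demo).
-export([lookup/2]).

%% Gives us the #hostent{} record returned by inet:gethostbyname/1.
-include_lib("kernel/include/inet.hrl").

%% Resolve Host to a list of IPv4 addresses.
lookup(Host, erlang_dns) ->
    %% Resolver written in Erlang: sends the DNS query from inside the VM.
    %% Returns [] on failure or after the 2000 ms timeout.
    inet_res:lookup(Host, in, a, [], 2000);
lookup(Host, configured) ->
    %% Uses the configured lookup methods (e.g. native, file, dns).
    case inet:gethostbyname(Host) of
        {ok, #hostent{h_addr_list = Addrs}} -> Addrs;
        {error, _Reason} -> []
    end.
```

Calling resolver_demo:lookup("example.org", erlang_dns) and resolver_demo:lookup("example.org", configured) side by side is one way to compare the two on a given machine.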

Thanks a lot for the more detailed explanation.

I think that now I have a better understanding, but I am still left with a question, the one that was the origin of that whole discussion on Twitter and the reason for this post:

So, is Erlang capable of causing OS thread starvation, be it with DNS queries or any other blocking OS syscall?

Yes, it can definitely happen. I do not think it is as likely as in many other systems, nor is the effect as bad.

I think inet_res is there for speed, not for robustness. If your DNS server is not responding, you are screwed in multiple ways and OS thread starvation may not be very high on the list.

I am not really worried about DNS (it is just the example in the tweet); instead, I just wanted to know if it was possible for the BEAM to starve the OS of threads, and it seems that it is possible, as mentioned in the previous post by @garazdawi.

But as mentioned, Erlang can handle this kind of situation more gracefully, and I think that is important to point out.

Yes, I have already done exactly that in the Twitter thread:

If a scheduler thread runs blocking code, it will block. Therefore any potentially long-running synchronous syscall could lead to thread exhaustion.

However, a benefit of the Erlang runtime over most others is that you can only block a scheduler if the BIF you’re calling is blocking, whereas in other runtimes you can do it with your own custom logic. IIRC, blocking a Go scheduler was as simple as for {}, and I suppose that in Node something like while(true); should do the job :slight_smile:
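
For contrast, here is a minimal sketch of the same kind of busy loop written in plain Erlang. The module name preempt_demo and the choice of two spinners per online scheduler are assumptions made for this example. Because the VM preempts processes after a budget of reductions, the spinning processes should not keep an unrelated process from running.

```erlang
-module(preempt_demo).
-export([run/0]).

%% Busy loop in pure Erlang, the rough equivalent of `for {}`.
spin() -> spin().

run() ->
    %% Start more spinning processes than there are schedulers online.
    N = 2 * erlang:system_info(schedulers_online),
    Spinners = [spawn(fun spin/0) || _ <- lists:seq(1, N)],
    %% A fresh process still gets scheduled and can send us a message.
    Self = self(),
    spawn(fun() -> Self ! pong end),
    Result = receive
                 pong -> responsive
             after 5000 ->
                 starved
             end,
    %% Clean up the spinners before returning.
    lists:foreach(fun(P) -> exit(P, kill) end, Spinners),
    Result.
```

On a stock VM I would expect preempt_demo:run() to return responsive. A blocking BIF or a misbehaving NIF is a different story, which is exactly the distinction made above.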

Consequently, the Erlang approach has an interesting potential: the runtime layer could completely prevent blocking and thread exhaustion. I don’t know how many potentially long-running blocking syscalls are currently used. It would be interesting to know that and see if there are possibilities to eliminate them or provide alternative solutions.

I think it’s always good to keep the system responsive. At the very least this would allow us to fire up a remote shell and debug the problem.

Good luck with that when DNS does not work. If your sshd_config has UseDNS you are screwed; if your sshd_config has UsePAM, and your PAM setup looks up names, you are also screwed. Hell, if your shell’s prompt has \h in it (VERY common), it will do hostname -f for every new shell spawned.

The DNS service server is what he wants to connect to via a remote shell, and what you are referring to is DNS problems on the machine trying to connect to the remote shell, so it’s not the same. In other words, the remote DNS service server can be inoperative, but as long as your laptop has working DNS you can fire up the remote shell.

The point of the (now huge) Twitter thread to me seemed to have come from “is it possible to starve a thread pool comprised of raw OS threads” and slow/unresponsive DNS was given as an example.

The answer is always “yes, it can”. Handing the keys to the kingdom to most programmers nowadays is a no-go because they have no clue there are actual physical limitations there. Do try and spawn 50,000 threads on your machine. Unless you have a $25,000+ workstation you’ll start seeing your machine lag at the 3000th or even 2000th mark.
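
For comparison, here is a minimal sketch of the same experiment with Erlang processes instead of OS threads. The module name spawn_demo is made up for this example, and it assumes the VM's process limit has not been lowered below the requested count.

```erlang
-module(spawn_demo).
-export([run/1]).

%% Spawn N idle processes, count how many are alive, then stop them.
run(N) ->
    Pids = [spawn(fun() -> receive stop -> ok end end)
            || _ <- lists:seq(1, N)],
    Alive = length([P || P <- Pids, is_process_alive(P)]),
    %% Tell every process to exit so nothing is left behind.
    lists:foreach(fun(P) -> P ! stop end, Pids),
    Alive.
```

spawn_demo:run(50000) returns the number of processes that were alive at the time of the check, which gives a rough feel for how cheap they are compared to OS threads.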

As others have remarked both here and in the Twitter thread, there’s a LOT that can be done. But the original poster seemed to do his very best to be not impressed (I pointed that out to him at the end of my participation). And the discussion got perverted to “but there ARE ways for all languages / runtimes to alleviate the problem!” which is IMO a discussion stopper.

Of course there are ways. There are ways to not litter parks yet people do it anyway. There are ways to have wooden furniture without destroying the Amazon forest but it’s destroyed anyway. Etc.

Same goes for languages/runtimes; he made a few remarks that modern languages are starting to learn from Erlang to which I simply responded “but I need the results today”. I won’t care much if PHP and Ruby and Python are finally green-thread-enabled 20 years down the line. This makes for inspirational history books but in the meantime all of us have to work with something.

So the discussion started off well but it failed to stick to the main point: “which languages/runtimes do it better TODAY?” And as I also remarked in the Twitter thread, a theoretical construct like “every language can be as good as Erlang” does not make for an interesting or productive discussion.

@garazdawi Thanks for the links. I learned valuable things from them.

The origin that David Fowler never disclosed is an Uber meltdown:

https://eng.uber.com/denial-by-dns/

that was revealed in this tweet by Maayan Hanin:

https://twitter.com/MA_Hanin/status/1365785527140704256?s=20

To which David Fowler replied with:

https://twitter.com/davidfowl/status/1365786108173291520?s=20

and in this test repo Uber considers Erlang unsafe:

Screenshot from 2021-02-28 11-15-11

If you are too lazy to go to the repo, here is the README:

And if you go and see the Erlang implementation:


-module(main).

-export([main/1]).

main(_) ->
    ok = inets:start(),
    {ok, _} = httpc:request("http://localhost:8080"),

    lists:foreach(
      fun(_) ->
              httpc:request("http://example.org")
      end,
      lists:seq(1, 25)
     ),

    timer:sleep(1000),

    httpc:request("http://localhost:8080").

It seems to me that this is far from the correct way of doing it in Erlang, so maybe some Erlang developer from this forum can open a pull request to fix it?

Maybe @garazdawi can shed some light on this for us or point us to someone who can?

Well, I wasn’t involved back then. I got mentioned by you at the point I specified above.

This was only last night, not at the beginning of the discussion.

You are mentioned in the tweet reply, but the thread is so huge and has so many ramifications that it is now easy to miss tweets in it :wink:

Screenshot from 2021-02-28 11-31-05

Maybe because he knew about Uber considering Erlang unsafe, he was thinking of us as fanboys, but giving us the benefit of the doubt?