Today I saw this tweet:
https://twitter.com/davidfowl/status/1362962967789072386
While the thread starts in the context of .NET:
Thread pool starvation affects everyone. Learnt about a service that went down because the dns server died and dns queries were hanging (there was no timeout). That in itself is a problem but it causes issues because dns resolution is synchronous… #dotnet
Dns resolution is historically blocking on every OS. See getaddrinfo. Modern operating systems have an asynchronous version of this API. Windows has GetAddrInfoExA/W https://docs.microsoft.com/en-us/windows/win32/api/ws2tcpip/nf-ws2tcpip-getaddrinfoexa and Linux has https://linux.die.net/man/3/getaddri
DNS resolution on Windows is fully async in .NET if the operating system supports it. In .NET 6, we support it on Linux as well. Both of these were external contributions!
Before .NET 6 and in most other platforms I’ve looked at, dns resolution is still blocking. This means that even if you use the async APIs, it’ll kick off a synchronous operation and block a thread pool thread.
He then moves on to compare how it is done in Node.js:
This led me to look at what nodejs does here. Would it have run into the same problem? Nodejs uses libuv, which is a library that tries to abstract various OS operations (file, socket, tty, etc) into a single unified API. But what does it do for dns resolution?
Turns out, for the APIs that aren’t truly asynchronous everywhere (file IO, DNS) nodejs uses a threadpool that runs work off the event loop. This means you aren’t blocking the event loop but non-event loop threads are being blocked with dns queries.
What happens when these threads are starved? Well node does this thing where it only uses 1/2 of the threads for this “slow IO”, the other work just queues up behind this blocking work. This results in a similar effect to thread pool starvation as the queue length just grows…
This would manifest as callbacks not being called and lots of work being scheduled (assuming lots of new DNS requests keep showing up) but never completing. I’m not sure what this is called in the node world, if anyone knows, please let me know.
Now he goes on to compare how it’s done in Go:
Let’s see what the other languages do. Go blocks a goroutine and waits on a channel. It’ll eventually spin up a new OS thread if the goroutine blocks too long. Strangely, this logic uses cgo.
Given the pre-emptive nature of the BEAM scheduler and the possibility of having millions of BEAM processes running at the same time, I have the impression that this would not be such a big issue, but now I am curious to understand what the implications would be for the BEAM when running a DNS service application.
So, what is your take on a DNS service running on the BEAM? Could it lead to OS thread starvation?