Limiting maximum number of concurrent outgoing TCP/HTTP connections?

So I’m building a background service that periodically (once a minute) will reach out to a number of machines via HTTP or plain TCP. Each interaction should only take 100-200ms.

I’ve put each machine into its own GenServer and register those in a Registry so I can reuse them and ensure that a) I don’t overwhelm a remote machine by accident, and b) requests for each machine are serialized.
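Roughly this shape, for reference (module names are placeholders and the actual HTTP/TCP call is elided):

```elixir
defmodule MachineWorker do
  use GenServer

  # One process per machine, addressed through a unique-keyed Registry
  # (started elsewhere as: Registry.start_link(keys: :unique, name: MachineRegistry)).
  def start_link(machine_id) do
    GenServer.start_link(__MODULE__, machine_id, name: via(machine_id))
  end

  defp via(machine_id), do: {:via, Registry, {MachineRegistry, machine_id}}

  # The GenServer mailbox serializes these calls, so each machine sees
  # at most one in-flight request from us at a time.
  def request(machine_id, payload) do
    GenServer.call(via(machine_id), {:request, payload})
  end

  @impl true
  def init(machine_id), do: {:ok, machine_id}

  @impl true
  def handle_call({:request, payload}, _from, machine_id) do
    # the actual HTTP or :gen_tcp interaction happens here
    {:reply, do_request(machine_id, payload), machine_id}
  end

  defp do_request(_machine_id, _payload), do: :ok
end
```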

However, as the number of machines will scale to around 1000 (generous upper limit), I wonder if I will face the dreaded “out of file descriptors” error.

Since I’m being careful not to keep file descriptors open, I can always just bump the file descriptor limit - but I wonder if there is any other hidden Erlang/Elixir limit I will hit before that.

I know there’s a port limit, but that is high (65,536 by default) and this VM doesn’t do anything else, so for plain TCP connections using :gen_tcp I should be fine.
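A quick sanity check from iex, using stock APIs:

```elixir
:erlang.system_info(:port_limit)  # 65536 by default; raise it with the +Q emulator flag
length(:erlang.ports())           # ports currently open (each :gen_tcp socket is a port)
```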

For HTTP, I’m using straight HTTPoison with the default options, which, as far as I can tell, doesn’t do any connection pooling. Now, I’m not sure whether I want to use HTTP keep-alive, since I don’t trust the remote HTTP servers to do the right thing WRT keep-alive, and I don’t have control over them. I’ve seen weird bugs in the past.
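Concretely, something like this (URL is a stand-in); if I read hackney’s options right, passing pool: false just makes the no-pooling behaviour explicit:

```elixir
# One-off request with hackney's pool disabled (pool: false), so each
# call opens its own socket and closes it when done.
{:ok, %HTTPoison.Response{status_code: 200, body: body}} =
  HTTPoison.get("http://example.com/status", [], hackney: [pool: false])
```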

Any other thing I should be aware of?

The number of ‘outgoing’ ports the system supports is significantly higher than 65k; you generally should not need to worry about outgoing connections.

However, to limit concurrency you can either have a single GenServer that just holds, say, a map of active connections and waits for messages about them, or use a connection pool like poolboy. :slight_smile:
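For the poolboy route, a minimal sketch (MyApp.HttpWorker is a placeholder for your own worker module):

```elixir
# The pool size caps how many workers (and thus connections) run at once.
children = [
  :poolboy.child_spec(:http_pool,
    name: {:local, :http_pool},
    worker_module: MyApp.HttpWorker,
    size: 50,
    max_overflow: 10
  )
]

Supervisor.start_link(children, strategy: :one_for_one)

# Callers borrow a worker; total concurrency is bounded by size + max_overflow.
:poolboy.transaction(:http_pool, fn worker ->
  GenServer.call(worker, {:request, "http://example.com"})
end)
```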

The short answer is that no, there is no other limit than the ones you already know about.

The longer answer is that you may be uncomfortably close to some weird performance degradation that I haven’t yet figured out. Or not. I’d tell you the exact conditions that cause it if I had figured that out :frowning:

Have you run into this yourself? I’m not sure if you’re mirroring my sentiment (things might go wrong in unexpected ways!) or offering advice :slight_smile:

If you are using HTTPoison, which uses :hackney under the hood, make sure that you set max_connections to a high number; the default is 50 (config :hackney, max_connections: 1000). Open sockets count towards the nofile ulimit, and we ran into the :emfile issue while opening a lot of HTTP connections. Also make sure that you read the response body so that the socket is freed up properly; if you make a request and don’t read the response body, it tends to create issues.
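To make both points concrete (the config comment shows the cap change; the rest shows hackney driven directly, where the socket only goes back to the pool once the body has been read; the URL is a stand-in):

```elixir
# In config/config.exs - raise hackney's default pool cap (default is 50):
#   config :hackney, max_connections: 1000

# With hackney used directly, skipping :hackney.body/1 leaks the socket
# until it times out.
url = "http://example.com/status"
{:ok, _status, _headers, client_ref} = :hackney.get(url, [], "", [])
{:ok, _body} = :hackney.body(client_ref)
```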


Offering advice based on experience. Benchmark carefully, and benchmark at twice the load you expect, to see if it falls apart.


I believe I am having a similar issue, but I don’t see this being reported on HTTPoison or hackney. I still need more time to debug this, but do you still face a similar issue with the latest HTTPoison or hackney?

My use case is that I pattern match on the HTTPoison.Response like so:

```elixir
with {:ok, %HTTPoison.Response{status_code: 200, body: body}} <- HTTPoison.get(url) do
  ...
else
  {:ok, %HTTPoison.Response{status_code: code}} when code in [foobar] -> ...
  _ -> ...
end
```

And I am getting a lot of weird :emfile errors as well.

Are you making many concurrent requests? Also, are you on OS X? There is a default limit of 256 open file descriptors on OS X, and each new connection in hackney, I believe, opens a new socket. I bumped into that problem recently.

Nope I am on Amazon Linux 2 (which is quite similar to CentOS 7).

An emfile error points to an incorrectly set ulimit (you can check your current ulimit by running ulimit -n, but make sure you run it as the user your app runs as) or to HTTP connections that are not being pooled properly. Even with a default ulimit of 1024, you’ll hit the limit if you open up 1K HTTP connections without a connection pool. So, to debug this, I would do the following (a snippet covering the first two steps follows the list):

  1. Figure out if you are using connection pooling in hackney
  2. Find out what the cause of the emfiles is: is it too many open http connections, too many open files?
  3. Use proper connection pooling and benchmark
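For steps 1 and 2, these checks from an attached iex session use only stock APIs, if I remember the hackney one right (ulimit is a shell builtin, hence the sh -c):

```elixir
System.cmd("sh", ["-c", "ulimit -n"])  # the nofile limit as the BEAM's own user sees it
length(:erlang.ports())                # open ports; every TCP socket is one
:hackney_pool.get_stats(:default)      # pool usage, if the default hackney pool is in use
```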

I am indeed using connection pooling in hackney and I did set my ulimit to 500K beforehand, so those are out of the way. Actually, my observation is that when I ran lsof I can see that my “main” beam.smp is holding, say, 800 TCP connections (which is well below the limit), but I also have some other Erlang-associated entries such as 1_schedul, 2_schedul, ... and 1_dirty_c, 2_dirty_c, ..., each holding an identical number of TCP connections as my main beam.smp (I know they are associated with my main beam.smp because, if my beam.smp PID is, say, 4469, the ID shown for 1_schedul would be 4469 plus some other number), and all of these add up to huge numbers.

I am not sure if you can set the ulimit to 500K; the maximum number of outgoing connections cannot be more than 64K. A sure way of checking the ulimit under the right user and environment is by running System.cmd("sh", ["-c", "ulimit -n"]) from within your app (ulimit is a shell builtin, so it has to go through a shell) and checking that value. In the past, I’ve had misconfigurations because the app was running as user foo whereas the ulimit was being updated for user bar. As far as I know, a single app would have just one OS process, which the BEAM runs with two threads per core (a normal scheduler and a dirty scheduler), so I’m not sure why you are seeing multiple processes; if those 1_schedul entries are the scheduler threads of that same beam.smp process, they share its file descriptor table, so lsof would be listing the same sockets once per thread rather than genuinely separate connections.


At least on Linux the file descriptor limit can be significantly higher than 65k (I think it’s a tagged 32-bit handle, with something like 28 usable bits?).

Could it be because I am using Flow to parallelise my HTTPoison requests?
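i.e. roughly this shape (an illustrative sketch, not my exact code; URLs are stand-ins):

```elixir
# Each Flow stage is its own process, so up to `stages` HTTP requests
# can be in flight at once (the default stage count is
# System.schedulers_online()).
urls = ["http://example.com/a", "http://example.com/b"]

urls
|> Flow.from_enumerable(stages: 8, max_demand: 10)
|> Flow.map(fn url -> HTTPoison.get(url) end)
|> Flow.run()
```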

Right, but I think he was implicitly referring to the max of 64k ports per IP. So yeah, there’s a possible limit of 64k, unless the last gateway before your server has multiple IPs it can use to connect, and the smarts to spread connections over them; I know this is a real thing, but have no idea how common it is in practice. Or your server is directly exposed on the internet with connections coming from IPs all over the place, which I expect is quite rare for Phoenix apps.