Odd slowdowns with concurrent HTTPS requests / HTTP Client Concurrency

I have a system that is making a lot of HTTPS requests. I’m using HTTPoison (which has Hackney under the hood). All is well when I run things real slow, like 500 requests a minute.

However, when I ramp up my worker pools to push more transactions through and increase concurrency, I see a big jump in turnaround time from the pre-request logic until I get a response. I’m not aware of a way to get the actual request/response timing from the client, so I just record the time right before the request and right after - that goes from 4-5 seconds on average to the high 30s or more, with occasional spikes of 150-250 seconds. What I’m unsure of is whether the VM is being bottlenecked in some way or whether it’s something at the OS level (Ubuntu 14.04).
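
For reference, this is roughly how I’m capturing the timing today (a minimal sketch; the URL is a placeholder and the surrounding worker logic is omitted):

```elixir
# Wrap the request in :timer.tc/1, which returns the elapsed time in microseconds.
{elapsed_us, _response} = :timer.tc(fn -> HTTPoison.get("https://example.com/api") end)
IO.puts("request took #{div(elapsed_us, 1000)} ms")
```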

What I also notice is that when I run into a chunk of slow requests, it seems to ‘cascade’ and affect the handling/response time of subsequent requests for a while. I could understand it if I were seeing CPU or memory spikes, but I’m not. The VM run queues look good as well when I watch in Observer.

My code makes the requests from a worker pool to limit concurrency. I also know hackney pools connections on its own. I’ve used both the default hackney pool and my own hackney pool with a large enough capacity, but I’m tempted to disable hackney’s connection pooling altogether. I’ve also added some sleep logic to keep too many requests from starting simultaneously, but that hasn’t helped a whole lot either.
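
For reference, the custom pool setup looks roughly like this (a sketch; the pool name, timeout, and size here are just examples):

```elixir
# Start a dedicated hackney pool with a generous capacity
# (:hackney_pool.child_spec/2 is the equivalent for a supervision tree).
:hackney_pool.start_pool(:request_pool, timeout: 15_000, max_connections: 500)

# Each request then targets that pool explicitly.
HTTPoison.get("https://example.com/endpoint", [], hackney: [pool: :request_pool])
```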

If anyone could give me some ideas for monitoring these connections better or figuring out the long delays, I’m all ears.

Things I have done so far:

  • increased ulimits on Ubuntu
  • increased the +Q and +A settings in the Erlang vm.args (a quick runtime check for these is sketched after this list)
  • watched the tls_connection:init processes in Observer (their memory usage tends to jump when I hit these slowness spikes)
  • split connections across multiple hackney pools
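
A quick way to confirm those flags took effect is from iex on the running node (a sketch; the values simply reflect whatever the node was started with):

```elixir
:erlang.system_info(:port_limit)        # maximum number of ports, set by +Q
:erlang.system_info(:thread_pool_size)  # async thread pool size, set by +A
:erlang.system_info(:port_count)        # ports (including TCP sockets) open right now
```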

Server Specs:

  • 8 cores
  • 16 GB RAM

I’m hardly moving the CPU at all, and RAM usage on the VM side is normally below 1 GB.

Are you certain it’s not the remote server? That’s the first thing to check. Have you tried the same thing with Python or something similar as a test to verify?

I am working on a test with Ruby/Mechanize as we speak. But no, I am not sure the remote host isn’t the issue here.

Having said that, is there a good way to really inspect the connections? I just want to be sure my VM settings are as good as they can be. Or should it be assumed that this should all just work?
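
The best I’ve found so far is counting the node’s open TCP sockets from iex (a rough sketch, assuming gen_tcp/ssl sockets show up as ports named 'tcp_inet' on this OTP version):

```elixir
# Count ports backed by the tcp_inet driver, i.e. open TCP sockets.
:erlang.ports()
|> Enum.map(&:erlang.port_info(&1, :name))
|> Enum.count(&match?({:name, 'tcp_inet'}, &1))
```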

If you are running Ubuntu in a virtual machine, then the VM could definitely add some overhead depending on how its network is configured. But still, I’d say replicate the case in Python/Ruby/whatever and see if you get the same performance characteristics. It would definitely rule things out one way or the other. Curious as to your results. :slight_smile:

I’ve tried a few things and am seeing better performance, with much smoother and more consistent response cycles. The big changes, for my scenario, were:

  • Disabling hackney’s pooling ([{:hackney, [pool: false]}]), since my request workers are already managed by a pool of their own
  • Drastically lowering the number of request workers in my pool. At one point I had over 1,000 workers that could concurrently send out requests. When I knocked that down to 100 after removing the pooling, I saw a much more consistent HTTP request cycle.

I also made some SSL-specific changes to match some old code I had in Ruby with Mechanize. I don’t know whether this had any impact, but it was worth a shot and it hasn’t hurt my performance so far (the combined options are sketched after the list below):

  • Modified the SSL versions allowed - opts = [{:ssl, [versions: [:tlsv1, :"tlsv1.2", :"tlsv1.1"]]} | opts]
  • Allowed SSL certificate verification to be skipped - opts = [{:hackney, [insecure: true]} | opts]
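
Pulled together, the option list ends up looking roughly like this before it’s handed to HTTPoison (a sketch of my setup; the URL is a placeholder and the worker pool that calls this is omitted):

```elixir
opts = []
# Pin the allowed TLS versions, mirroring the old Ruby/Mechanize setup.
opts = [{:ssl, [versions: [:tlsv1, :"tlsv1.2", :"tlsv1.1"]]} | opts]
# Skip certificate verification and bypass hackney's pooling.
opts = [{:hackney, [insecure: true, pool: false]} | opts]

HTTPoison.get("https://example.com/endpoint", [], opts)
```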

Having said all of that, I’m not suggesting anyone follow what I’ve done. Ignoring SSL verification is probably not the best idea, and I’m guessing hackney has pooling for a good reason. But bypassing it does seem to have helped in my scenario. If I were not managing a request pool myself, I’d say it’s critical to leave hackney’s pooling on, or you’d likely overwhelm the system.

Here are the before and after CPU loads. The CPU load was always very low, but what’s obvious is that it’s now much lower and more consistent, even though I have a much higher request throughput.

Before Changes

After Changes

Hmm, did you test it with Ruby anyway, to verify there are no system issues? You should be able to handle a lot more than 100 concurrent connections (I was testing my server with 60k concurrent connections)…

I can do a whole lot more than 100 concurrent connections, but I’m hitting my data throughput requirements now with only 100 connections. Sorry, I should have mentioned that.

There were no system issues with the Ruby test yesterday, other than the fact it just runs a lot slower and ‘concurrency’ is less than ideal.

At the end of the day I was able to take a process that ran across a dozen or more servers using Ruby and make it function on a single server. I’m happy with the outcome.

Awesome! My main holdback on my current work server is that it’s hosted on Windows behind IIS (that will change here shortly! Whoo!), which adds a surprisingly significant overhead compared to hitting the app’s port directly (why would anyone use IIS on Windows over nginx on Linux?!? Wtf…).

As someone that spent 1999-2011 in the Microsoft/IIS/.NET world, I understand your pain. At least you’re in an environment that is working with Elixir, though. Good luck and thanks for the help!

Yeah, they use IIS here for a lot of things just because some of the industry-specific software requires it, but they’re in the process of converting basically everything they can to Linux starting about now, so I’m looking forward to it. ^.^

I had a similar experience and the solution was to force hackney to use the default pool, which I believe allowed my app to reuse connections and not have to redo the handshake for each request. I documented it here: http://coderstocks.blogspot.com/2016/01/sqs-throughput-over-https-with-elixir.html
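
In short, it comes down to making every request go through the default pool (a minimal sketch; the URL is a placeholder):

```elixir
# Reuse keep-alive connections (and their completed TLS handshakes) via hackney's default pool.
HTTPoison.get("https://example.com/endpoint", [], hackney: [pool: :default])

# Handy for watching the pool while load-testing
# (assuming your hackney version provides :hackney_pool.get_stats/1).
:hackney_pool.get_stats(:default)
```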

Hopefully that is helpful for you.
