Difficult debugging problem

There’s a bit here about changing the port range: https://ma.ttias.be/linux-increase-ip_local_port_range-tcp-port-range/

Curious if you tried the net.ipv4.tcp_tw_reuse setting I mentioned above?

Now that I’ve increased the pool it has raised the bar, so I have to wait again until I see the new level at which it breaks :S

When I find the high water mark I will try that

I haven’t tried that one yet, but I will when I find the new ‘max’ level

Ahhhh hah! I figured there was another pool in there. How are you configuring it now? For the number of CPUs you’ll have it may be important to use a very large pool. How many domains are you hitting btw?

Each API has its own domain, so 2-3… although it would be amazing to set more. I notice that if I enable more than 2 APIs it dies exponentially more quickly, so I have just had it set to 2 for now. If I could set 5 or more that would be insanely awesome, but I’ll take what I can get.

Here’s how I set it in application.ex

:hackney_pool.start_pool(:api, [timeout: 1000, max_connections: 200])
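
A minimal sketch of where that call can sit, assuming it runs in the application's start/2 before anything makes outbound requests (module names here are illustrative, not from the actual app):

def start(_type, _args) do
  # start the named hackney pool before the endpoint so it already exists
  # when the first outbound request is made
  :ok = :hackney_pool.start_pool(:api, [timeout: 1000, max_connections: 200])

  children = [MyAppWeb.Endpoint]
  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end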

I would also like to ask what the ‘napkin math’ is for setting up pools…

Like, how do I find the best values (ideally without constantly testing, which is a bit of a pain; perhaps what math do you use to calculate good ‘starter’ values?), taking into consideration things like:

  • total outgoing requests / second
  • total hosts
  • timeout value set for each host
  • pool size (hackney max_connections)
  • “time we keep connections alive in the pool” (hackney timeout)

I’m not sure I understand: what does it mean to “enable” an API?

My back-of-the-napkin math would be to leave the timeout alone and set the connections to something like 800. The reasoning: if 50 is the default for ~2-4 cores, then at 64 cores you want at least 800.

Practically speaking though the ideal value is going to need to be found empirically. I’d test 400, 800, 1200, 1600 and see how it goes.
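
One rough way to pick a starting point before benchmarking is Little's law: the number of connections in use to a host at steady state is roughly the outgoing request rate to that host times the average response time, so the pool needs about that many connections plus some headroom. A sketch with made-up numbers:

# outgoing requests per second to one host (illustrative)
rps = 2_000
# average upstream response time in seconds (illustrative)
avg_latency = 0.3
# steady-state connections per Little's law, plus ~50% headroom for bursts
max_connections = ceil(rps * avg_latency * 1.5)
# => 900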

Oh, the outgoing requests are data APIs. For each incoming request I’m basically aggregating the data from all of the simultaneous outgoing requests.

So 1 outgoing request = 1 API = 1 hostname. I can enable/disable them; currently it’s set to 2.

I can’t believe I got duped by that stupid default pool setting in Tesla. Thinking there was no pool, I assumed I was magically getting ‘more connections’ by multiplying the app, but really I was just multiplying max_connections.

I wonder what the upper limits are for max_connections. Like, if I put it at 5000, what would actually limit that (at the OS level, etc.)? CPU? ulimit? What else? ip_local_port_range?
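
One partial answer you can get from inside the VM is the file-descriptor ceiling it detected at boot (which reflects ulimit -n); the per-destination-IP limit from ip_local_port_range lives at the OS level. A sketch:

# max file descriptors the runtime's I/O polling reported (roughly ulimit -n);
# each open socket consumes one
:erlang.system_info(:check_io)[:max_fds]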

Really your best bet is to just benchmark. Test at some of the levels I suggested and see if you see a pattern, then do larger jumps to see how far that pattern goes.

Ok thanks, I’ll do that tomorrow.

Quick question regarding something you said above. You said to give kubedns an extra CPU or two. Do I do that just by subtracting 2 CPUs from whatever my server is? So if it’s 64, do I set limits and requests to 62000m, or should those be 64000m and then I set kubedns elsewhere?

Also, I do remember testing pool settings (and other HTTP libraries) for almost 2 full days. I remember with buoy, when I went past 300-500 connections in the pool they wouldn’t even connect… but perhaps that was because on Gigalixir everything is 8 CPUs max, so it would never have been able to run on one machine.

This was the first time testing pool settings on a 64-CPU server. That would have been one of the first things I tested in the new environment, except I got sidetracked for a day or 2 because of the hidden pool.

Either way, I’ll try to see how high I can now go on one machine with the pool size before things stop connecting at all.

Second follow-up question. You said I would have to ‘trace a function’. What’s the best tool to actually do that? I see something like 10 different ‘prof’ tools: cprof, eprof, fprof, etc. I have no idea which one would be best for this use case (I would probably trace the very top of the execution ‘tree’, with the Task.async and Task.await, and trace all 3 function calls inside the task). I’d imagine it would trace the execution time of every inner/nested function, and maybe extra info too?
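
For that use case, :eprof (time and call counts per function) is a reasonable first pick, with fprof if you need the full nested call tree. A minimal sketch; MyApp.Aggregator.fetch_all/1 is a hypothetical stand-in for the top of your execution tree:

:eprof.start()
# profile one end-to-end call; substitute your real entry point and arguments
:eprof.profile(fn -> MyApp.Aggregator.fetch_all(%{}) end)
# prints time and call counts for every function hit during the call
:eprof.analyze()
:eprof.stop()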

Hackney exposes very detailed metrics: https://github.com/benoitc/hackney#metrics. In particular, hackney.POOLNAME.queue_counter can be very useful in determining whether timeouts are due to pool size.
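
Enabling those is a config switch plus a metrics backend as a dependency; a sketch, with the caveat that the exact mod_metrics values accepted depend on your hackney version (the README lists folsom, exometer, a dummy backend, or a custom module):

# config/config.exs
# metrics are off by default; this assumes the folsom app is in your deps
config :hackney, mod_metrics: :folsom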

Wow this is amazing, thank you

You can remove the pool and see what you get. You will certainly be slower, but at least you will find out the actual number of sockets possible. You will also figure out what gives out first: CPU or network bandwidth.

Using a pool in this case is probably a good thing for you, as it provides a pushback (backpressure) mechanism, something I do not think you built yourself. (A better way to do it is to place that at the gates, not in the back room with the pool.)

I tried without it and it was crazy low. Are you saying that’s the exact amount it’s limited to by an external resource?

It could be. What is “crazy low” (I mean quantitatively)?

Hi,

Sorry, a bit late here. I worked on the ad server at Yieldbot and it had no issues handling 10,000 QPS on 6 instances (or was it 9?), using IIRC 8-CPU servers with each instance using about 1 GB of RAM. This should be doable. The servers were CPU bound, not RAM or network bound.

We were running our servers in Docker containers managed by Mesos and Singularity, so it’s a bit different from your deployment. One thing we did run into is that the Erlang VM thought it had the resources of the underlying machine rather than what Docker had allocated to it. If you are running multiple instances on a single server, they may all be creating schedulers for 64 CPUs rather than the number you allocated for them, so you’ll have way more schedulers than you need.

Instead, run several smaller servers (8 or 16 CPUs) and only a single instance of the application on each server. Set up haproxy or similar to handle routing to the instances. We used Baragon to update the haproxy config when instances were started. Because we were using Mesos, we had to make sure only a single instance of the server came up on each slave.

Also, new instances have to ramp up their load. If you hammer them all at once, the schedulers get overloaded and the instance will crash. IIRC there is an haproxy setting to manage ramping up a new upstream server.
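
A quick way to check whether an instance is in that situation, as a sketch from an attached IEx/remote console:

# how many schedulers the VM actually started; on a shared 64-core box this
# can report 64 even when the container was only allocated a few CPUs
:erlang.system_info(:schedulers_online)
# it can be capped at boot with the +S emulator flag, e.g. "+S 8:8" in vm.args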

Not sure how Kubernetes works, but in Docker/Mesos the CPU allocation is a suggestion; if the process uses too much, Mesos kills the process.

Also, IIRC, don’t use bridge networking; use host networking.

Definitely set up pools with hackney. The defaults are not good for this kind of load.

I should mention that Phoenix templates are awesome. We didn’t cache ads; we generated them on the fly. The server rendered HTML, CSS, and JS (along with a lot of checks and other processing) on every request and had a ~12ms response time.

I think that was most of the gotchas we ran into. And I hope I’m remembering everything correctly.

Oh, the only other thing was that we had to open two ports: one for the server and one for epmd, even though epmd was never going to connect to another instance. haproxy only routed to the server port.

Wow this is super advice. A few questions…

  1. Were these outgoing requests (10K/s), or incoming?

  2. What HTTP library did you use?

  3. Did you have any wisdom for choosing the pool settings? My main confusion right now is how to calculate the max pool size per instance… like, the CPU chart doesn’t give a clear, non-fuzzy view of how the pool size correlates to CPU usage. It’s hard as well because traffic is always changing, so I can’t exactly tell whether increased CPU is from the bigger pool or just from higher traffic. I was hoping there’s some other way to detect the max pool size.

Thanks again, amazing info

You’re welcome. :slight_smile:

  1. Incoming requests. We did make a request to a server on the local network on each incoming request, so both. (As well as cache requests to ScyllaDB.)

  2. We used HTTPoison and Tesla. Both were configured to use hackney. (IIRC hackney handled https connections better, but that may be history more than current status.)

  3. The pools had 256 for max_connections. But we had a pool for each outgoing service. Don’t forget to specify the pool in the options passed to get/post or you’ll get the default pool.
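
For example (a hedged sketch; exact option shapes vary a little by client):

url = "https://api.example.com/data"  # illustrative

# hackney directly: the pool goes in the request options
:hackney.request(:get, url, [], "", pool: :api)

# HTTPoison forwards hackney options under the :hackney key
HTTPoison.get(url, [], hackney: [pool: :api])

# Tesla's hackney adapter passes its opts through to hackney
defmodule ApiClient do
  use Tesla
  adapter Tesla.Adapter.Hackney, pool: :api
end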

We found that the CPU limit came more from the schedulers being busy than from the requests themselves. We knew it was time to scale when the message queues started to get behind.
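
A sketch of one way to spot that: rank processes by mailbox size (:recon.proc_count(:message_queue_len, 10) does the same thing more conveniently if the recon dependency is available; that dependency is an assumption, the plain version below is not):

# top 5 processes by message queue length
Process.list()
|> Enum.map(fn pid -> {pid, Process.info(pid, :message_queue_len)} end)
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, {:message_queue_len, len}} -> -len end)
|> Enum.take(5)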

Above I mentioned ScyllaDB. We changed our caching from Redis to ScyllaDB, and performance was much better than with Redis.

One other thing I’ll throw out there is that we wrote our Phoenix controllers to do most everything in plugs. A controller would look like:

defmodule ServerWeb.FooController do
  use ServerWeb, :controller
  require Logger
  alias Plug.Conn

  plug :put_layout, false
  plug PlugOne
  plug PlugTwo
  plug PlugThree
  plug :plug_four
  plug :plug_five
  plug PlugSix
  plug :plug_seven
  plug PlugEight

  def index(%Conn{assigns: %{param: param}} = conn, _params) do
    conn
    |> put_resp_header("cache-control", "no-cache")
    |> render("index.html", param: param)
  end
end

All the business logic happened in plugs. We could test them in isolation and develop them in small, isolated pieces. Some controllers had as many as 26 plugs.
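
As a rough illustration (module and assign names are made up, not the actual Yieldbot code), each module plug is just init/1 plus call/2, which is what makes the pieces easy to test in isolation:

defmodule ServerWeb.Plugs.PlugOne do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    # one small piece of business logic: validate input and stash the result
    # in assigns for the plugs (and controller) that run after it
    case fetch_query_params(conn).query_params do
      %{"param" => value} -> assign(conn, :param, value)
      _ -> conn |> send_resp(400, "missing param") |> halt()
    end
  end
end

Each one can then be exercised on its own with a test conn, without going through the whole controller.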

I hope I am remembering all this correctly. Yieldbot shutdown the middle of December so all of this is from memory.

ScyllaDB looks amazing, very interesting, thank you. So basically, when you see MsgQ back up, that means the schedulers are the holdup and not the actual server CPU/resources?

Very interesting… this is very helpful, for my project but also for general understanding.