HTTP2 error using gun

erlang
http
http2
gun

#1

Background

I have an application that uses gun to make HTTP requests to a server while keeping the connection up.

Problem

The problem here is that I am getting the following general stream errors:

2019-03-13 14:55:55.258 [error] GENERAL ERROR: {:stop, {:goaway, 0, :enhance_your_calm, "too_many_pings"}, :"Client is going away."}
2019-03-13 14:55:57.282 [error] GENERAL ERROR: {:stop, {:goaway, 5, :enhance_your_calm, "too_many_pings"}, :"Client is going away."}
2019-03-13 14:55:58.924 [error] GENERAL ERROR: {:stop, {:goaway, 3, :enhance_your_calm, "too_many_pings"}, :"Client is going away."}

GOAWAY

So, according to the HTTP/2 spec (RFC 7540):

The GOAWAY frame (type=0x7) is used to initiate shutdown of a connection or to signal serious error conditions. GOAWAY allows an endpoint to gracefully stop accepting new streams while still finishing processing of previously established streams. This enables administrative actions, like server maintenance.

So, I am guessing the server receiving my requests isn’t too happy and is telling my client to slow down.

My confusion

This is confusing to me for a couple of reasons, the main one being that I am not hitting a single machine: I am going through a cloud load balancer with a cluster behind it. In theory, this balancer would distribute my load across all the servers and things like this wouldn’t happen.

I also don’t understand whether gun simply closes the connection and reopens it, or whether any data was lost in the process.
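From what I can tell from gun’s documentation, the owner process can observe exactly this: gun sends `:gun_up` and `:gun_down` messages when the connection is established or lost. A minimal sketch of what I would watch for (message shapes assumed from the gun 1.x docs; `state`, `conn`, and the log text are placeholders):

```elixir
# Sketch: observing gun reconnection from the connection owner process.
# Per gun's docs, `killed` streams were in flight when the connection
# dropped (their data may be lost), while `unprocessed` streams were
# never sent and can be retried safely.
def handle_info({:gun_up, conn_pid, protocol}, state) do
  # Connection (re)established; safe to open new streams again.
  Logger.info("gun connection up (#{inspect(protocol)})")
  {:noreply, %{state | conn: conn_pid}}
end

def handle_info({:gun_down, _conn_pid, _protocol, reason, killed, unprocessed}, state) do
  Logger.warn(
    "gun connection down: #{inspect(reason)}; " <>
      "#{length(killed)} streams killed, #{length(unprocessed)} unprocessed"
  )

  {:noreply, state}
end
```

So gun does reconnect on its own, but whether data was lost depends on which streams show up in the `killed` list.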

Can someone help me understand the causes and consequences of this error?


#2

Disclosure: Here I’m just as clueless as the proverbial rubber duck.

But in the interest of making more information available:

  • The error suggests to me that a PING frame is being answered with GOAWAY about every two seconds.
  • Found a critical (possibly misinformed) opinion about the practice of HTTP/2 ping frames.
  • The gun documentation suggests that the default HTTP/2 ping timeout is 5000ms.

From the HTTP/2 spec, section 7 (Error Codes):

ENHANCE_YOUR_CALM (0xb):
The endpoint detected that its peer is exhibiting a behavior that might be generating excessive load.

Do the errors change when you extend your configured timeout for gun? (Not expecting it to, but it’s good to rule things out.)
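For reference, extending it would look roughly like this (a sketch based on gun’s documented `http2_opts`; the host, port, and 30-second value are just placeholders to illustrate):

```elixir
# Sketch: raising gun's HTTP/2 keepalive (ping) interval from the
# documented default of 5000 ms to 30 s, so pings are sent far less often.
{:ok, conn} =
  :gun.open('example.com', 443, %{
    protocols: [:http2],
    http2_opts: %{keepalive: 30_000}
  })
```

If the GOAWAYs really are triggered by ping frequency, spacing the pings out should at least change how often the errors appear.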


#3

What cloud balancer is the service using? Perhaps the cloud balancer is doing some rate limiting?


#4

Your help is always welcome, no matter what animal format it may come in :smiley:

I read the whole thing and I must say that I found some of his arguments convincing. However, I am not as well versed in HTTP/2 as others are, so I am not convinced that three negative points about a whole protocol make it utter garbage.

All I can say is I see where he comes from, and I thank you for linking me to it.

Indeed! I am using the default timeout :smiley:

I have not tried this, but bearing in mind that we have an automatically scaled cluster behind said load balancer, this shouldn’t even be a problem to begin with, right?

We are using Google Cloud Platform.
It is not unusual for us to get 502 errors from the load balancer (an issue impossible to solve with Google), so it may well be that Google’s services aren’t as resilient as we need.


#5

Without knowing more about your circumstances, I can’t point to a concrete solution, but my hackles always go up when people speak in absolutes - and calling a problem “impossible” to solve on a major IaaS provider doesn’t sound likely.

For this specific concern, better use of connection draining, load-balancer health checks, and possibly some more graceful shutdown behavior in your own application should be able to eliminate close to 100% of end-user 502s that stem from per-server interruptions (like rolling deployments) rather than true application defects.

This is part and parcel of zero-downtime deployments, which has been a desirable practice for quite some time, so we probably would’ve heard by now if it was truly unattainable on GCP. Their load-balancing is, in many ways, a bit more sophisticated than what AWS offers, for example.


#6

We have had 502 errors from Google for as long as we can remember. We know our cluster has enough machines, and we know they are not returning errors nor are they overloaded (we have metrics to check). We know these requests (the 502s) never make it to our machines in the first place. I am not Sherlock Holmes, but after lengthy discussions about this topic we literally have no other explanation.

As for major IaaS providers, I worked for GCP for some time, so I know for a fact (since I know how it used to work) that this is totally possible. As a client I used Azure for quite some time (definitely worse than GCP) and we had trouble every two weeks. It was a nightmare.

So to me, with the experience and knowledge I have, it sounds not just likely, but normal.

Google manages their own load balancers, which send traffic worldwide to all the regions. We don’t have direct control over them (only some limited control). This is the charm of the offer - you don’t need to worry too much about balancers; Google does.


It is worth mentioning that after a long stress test, we concluded these errors were around 0.0015% of all requests. This is well within the terms of service Google provides, IIRC, so there is little that can be done. We are leaving this matter be, as 0.0015% is not a high enough error rate to justify investing more time and effort.