We have a service with a Plug interface. It sends requests to a vendor API and later receives callbacks whose payloads answer those requests.
A couple weeks ago, the vendor’s error page started reporting that it was getting “connection closed” when hitting our callback endpoint about 1% of the time. This was odd to us because nothing had changed. We hadn’t made a commit on this service in over a year and it’s running on the same EC2 instance it always has.
The fix was to put an Nginx proxy server in front of the Elixir service.
While it’s nice to have a fix, why it happened is still a mystery. What would have been a good way to debug this? Is it more likely a Plug or Cowboy issue? My best guess is that the vendor changed or upgraded their HTTP client and there’s an incompatibility with Cowboy. Their support did not suggest this as a possibility when we talked to them, though. They felt there was something wrong with our application, although that conversation took place before the proxy fix. Any other thoughts?
That sounds more like network problems than an application incompatibility. Is there some more specific error code that they are getting? Or just an HTTP timeout?
You can log the whole conn and get details about their HTTP request.
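As a rough sketch of that suggestion, a small plug placed first in the pipeline could log the parts of the conn that identify the vendor's client. The module name `MyApp.Plugs.LogRequest` is hypothetical, and the selection of fields is an assumption about what is useful here. Request headers can carry credentials, so they may need filtering before this goes anywhere near production logs.

```elixir
defmodule MyApp.Plugs.LogRequest do
  # Hypothetical module name; put this first in the pipeline so it runs
  # before any other plug can halt or fail.
  @behaviour Plug
  require Logger

  def init(opts), do: opts

  def call(conn, _opts) do
    # The anonymous function is only evaluated if the :info level is enabled.
    Logger.info(fn ->
      details = %{
        method: conn.method,
        path: conn.request_path,
        remote_ip: conn.remote_ip,
        # Includes user-agent and connection headers, which help spot
        # a changed HTTP client on the vendor side.
        req_headers: conn.req_headers
      }

      "callback request: " <> inspect(details, limit: :infinity)
    end)

    # Return the conn unchanged so the rest of the pipeline is unaffected.
    conn
  end
end
```

It would be wired in with `plug MyApp.Plugs.LogRequest` at the top of the router or pipeline. It only sees requests that Cowboy actually hands to Plug, so a connection dropped earlier leaves no trace here.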
Using Nginx adds a “buffer” as it will queue requests for a while waiting for the back end to respond. That can end up being a problem on high volume applications, and things get more reliable when we remove Nginx. https://www.cogini.com/blog/serving-your-phoenix-app-with-nginx/
Cowboy is generally a pretty compatible HTTP stack, though sometimes you need to set protocol options to handle things that Nginx was handling for you. For example, in a Phoenix endpoint the Cowboy protocol options go under the :http key:
config :foo, FooWeb.Endpoint,
  http: [protocol_options: [max_keepalive: 5_000_000]]
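For a plain Plug service without Phoenix, the same kind of tuning can be passed through the `Plug.Cowboy` child spec. The following is only a sketch: `MyApp.Router`, the port, and every timeout value are assumptions for illustration, not recommendations. The option names `idle_timeout`, `request_timeout`, and `max_keepalive` are Cowboy 2 HTTP protocol options.

```elixir
# In the application's supervision tree (application.ex).
children = [
  {Plug.Cowboy,
   scheme: :http,
   # MyApp.Router is a hypothetical plug module.
   plug: MyApp.Router,
   options: [
     port: 4000,
     protocol_options: [
       # How long an idle connection is kept open, in milliseconds
       # (Cowboy 2 documents a default of 60_000).
       idle_timeout: 120_000,
       # How long Cowboy waits for the request line and headers, in
       # milliseconds (documented default: 5_000).
       request_timeout: 10_000,
       # Maximum number of requests served on one keep-alive connection.
       max_keepalive: 5_000_000
     ]
   ]}
]
```

If the vendor's client pools keep-alive connections and reuses one just after Cowboy has closed it, the client would see a closed connection, which is one possible (unconfirmed) explanation for an intermittent ":closed" error.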
Thanks for your help.
It would have to be a network problem that Nginx handles and Cowboy doesn’t, though, at least with their default configs.
The vendor’s error column just says “:closed” for the relevant requests. They normally report a specific error code if there is one.
I have log statements at the beginning of this very simple plug, but they don’t get hit on these requests. I guess I would have to clone Cowboy as a local dependency to dig deeper. Or maybe it’s only a matter of tuning Cowboy’s config to adapt to current network conditions.
My guess would be the SSL/TLS version or parameters. One of the machines in their pool may be requiring something stronger than what Cowboy is configured for, or vice versa.
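If Cowboy terminates TLS itself (an assumption, since the thread does not say where TLS ends), the accepted protocol versions can be pinned explicitly. `Plug.Cowboy` hands options it does not recognise to the underlying transport, so Erlang `:ssl` options such as `versions` can sit next to the certificate settings. The module name, port, and file paths below are placeholders.

```elixir
# In the application's supervision tree (application.ex).
children = [
  {Plug.Cowboy,
   scheme: :https,
   # MyApp.Router is a hypothetical plug module.
   plug: MyApp.Router,
   options: [
     port: 4443,
     # Placeholder paths; point these at the real certificate and key.
     certfile: "priv/cert/server.pem",
     keyfile: "priv/cert/server_key.pem",
     # Erlang :ssl option listing the TLS versions the server will accept.
     # TLS 1.3 needs a reasonably recent OTP release.
     versions: [:"tlsv1.2", :"tlsv1.3"]
   ]}
]
```

Comparing this list with what the vendor's clients offer would show whether a version mismatch is plausible. A failed handshake happens before any plug runs, which would fit log statements in the plug never firing.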