I’ve been seeing periods of increased latency and request timeouts when throughput on my servers is higher. I managed to get a look in Observer at a server that was having issues and noticed many processes like this:
| Name | Reds | Memory | MsgQ | Current Function |
|---|---|---|---|---|
| ssl_gen_statem:init/1 | 417788 | 7266120 | 0 | gen_statem:loop_receive/3 |
When I caught it there were about 16 of these processes, all with roughly the same values for reductions and memory.
Essentially I see multiple attempted requests to DynamoDB hit a 20 second timeout and error out. During this period there is a spike in response time for DynamoDB requests and the ssl_gen_statem processes jump to the top of Observer. Other calls made to DynamoDB also slow down; for instance a put that usually takes 200ms goes to 30 seconds.
Eventually the node is disconnected from the cluster.
Also DynamoDB is my only upstream service and the process state for the ssl_gen_statem process shows DynamoDB.
It’s always possible it could be something else but my app is very minimal, only running Phoenix channels to pass messages to clients and then using DynamoDB to save and retrieve messages.
Edit: There’s also a spike in Erlang process memory usage, which together with the ssl_gen_statem memory makes me think that’s a likely cause. My scheduler utilization remains the same.
It looks like you’re in a different situation but the high throughput + SSL looked familiar.
That said, we were chasing OOMs, not latency. Things were performing fine. I wouldn’t assume an increase in memory usage necessarily leads to timeout issues.
Have you watched for any processes with long message queues?
I did see that and it seemed similar, though I would have to configure these options via the library, and the documentation doesn’t really make it clear how that’s done in ExAws.
I have looked for long queues and there don’t seem to be any (all processes show zero).
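For what it’s worth, this is roughly how I checked, run from a remote IEx shell on the affected node (the cutoff of 10 is arbitrary):

```elixir
# Top 10 processes by message queue length; everything comes back as 0 for me.
Process.list()
|> Enum.map(&{&1, Process.info(&1, :message_queue_len)})
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.map(fn {pid, {:message_queue_len, len}} -> {pid, len} end)
|> Enum.sort_by(fn {_pid, len} -> len end, :desc)
|> Enum.take(10)
```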
Sorry, that was a misunderstanding on my part. I meant more on the hackney side of things; I don’t have much of an idea of what options would help with this problem (hackney documents re-using a connection, but that isn’t something you set through options). I could use another client library, but I’m not actually sure which one: the only one I know of is HTTPoison, and that just uses hackney underneath, so I don’t know that it would behave any differently here.
I’ve done a little more research into the libraries and there are more than I realised, though I’m still a little hesitant to swap it out and just hope for the best. It might be the best shot I have though.
I’ve recently started using Req and it’s a much nicer API than HTTPoison, which I used previously.
However, it seems I have the same issue you do here. Did you ever resolve it?
In my case I’m doing a whole bunch of very simple GET requests, and these :ssl_gen_statem.init/1 processes just end up hanging around forever, eating up all the memory. Killing them manually doesn’t break anything, but it does free up a lot of memory.
No idea what to do. The requests are being run inside a task that does indeed finish.
Changing my HTTP client library seemed to help. I ended up switching to Finch, which leverages NimblePool and Mint. Since I switched I haven’t had the timeouts on ExAws, except those related to DynamoDB. The ssl_gen_statem processes also consume far less memory, and response times improved as well. Since my requests were mainly going to the same host, using Finch made a lot of sense.
I’m not sure if the library was the exact issue but I haven’t really had issues since I swapped.
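For reference, the adapter is roughly this shape. This is a sketch rather than my exact code; it assumes a Finch pool named MyApp.Finch is already started in the supervision tree, and the module name is just for illustration:

```elixir
defmodule MyApp.ExAwsFinchClient do
  @moduledoc "Sketch of an ExAws.Request.HttpClient implementation backed by Finch."
  @behaviour ExAws.Request.HttpClient

  @impl true
  def request(method, url, body, headers, _http_opts) do
    req = Finch.build(method, url, headers, body)

    case Finch.request(req, MyApp.Finch) do
      {:ok, %Finch.Response{status: status, headers: resp_headers, body: resp_body}} ->
        {:ok, %{status_code: status, headers: resp_headers, body: resp_body}}

      {:error, reason} ->
        {:error, %{reason: reason}}
    end
  end
end
```

Then you point ExAws at it in config:

```elixir
# config/config.exs
config :ex_aws, http_client: MyApp.ExAwsFinchClient
```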
Req uses Finch under the hood, so I guess that wouldn’t help me. I did try HTTPoison for this task and it seemed to use the same amount of memory, but it cleared it much faster.
The problem also seems to be connected to bad URLs where requests have SSL problems.
Just now got around to reading your article, thank you for it.
One thing that’s left unclear to me: what exactly did you do to make the CA store use ETS? The article says “provide it as a file and not as a list of DER blobs”, but the exact “how” isn’t obvious to me.
I believe it’s what Hackney was doing by default (I’m going off old pull request notes), so essentially the default behavior puts you in the more convenient but less optimal situation.
A reasonable default honestly! But we make hundreds of thousands of HTTP requests a minute and that’s when the scale started to be an issue.
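To make the distinction concrete, here are the two shapes using the standard :ssl option names. The certificate sources are just examples, assuming you have :certifi (which hackney ships with) or castore available:

```elixir
# Convenient but less optimal: the CA store is handed over as a list of
# DER-encoded binaries, so every TLS connection process ends up carrying
# its own copy in its state.
ssl_opts_der = [
  verify: :verify_peer,
  cacerts: :certifi.cacerts()
]

# Cheaper at scale: hand :ssl a path instead, so it can decode the file once
# and cache the store in ETS, shared across connections.
ssl_opts_file = [
  verify: :verify_peer,
  cacertfile: CAStore.file_path()
]
```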
I have the same problem as you: :ssl spawns a lot of processes but doesn’t terminate them. The init calls are :ssl_gen_statem.init/1, :supervisor.tls_dyn_connection_sup/1, and :tls_sender.init/1, and the processes for these calls appear in a 1:1:1 ratio.
I haven’t found the reason yet, so I wrote a task to monitor the process count and restart :ssl.
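Roughly what that task looks like (the module name and threshold are my own choices, and restarting :ssl drops every open TLS connection, so it’s only a stopgap):

```elixir
defmodule SslWatchdog do
  @moduledoc "Periodically counts :ssl connection processes and restarts :ssl past a threshold."
  use GenServer

  @check_interval :timer.minutes(1)
  @max_ssl_processes 5_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule_check()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:check, state) do
    if ssl_process_count() > @max_ssl_processes do
      # Drops all open TLS connections; acceptable for me as a stopgap.
      :ssl.stop()
      :ssl.start()
    end

    schedule_check()
    {:noreply, state}
  end

  # Counts processes whose proc_lib initial call is :ssl_gen_statem.init/1,
  # which is how these connections show up for me in Observer.
  defp ssl_process_count do
    Enum.count(Process.list(), fn pid ->
      case Process.info(pid, :dictionary) do
        {:dictionary, dict} ->
          List.keyfind(dict, :"$initial_call", 0) ==
            {:"$initial_call", {:ssl_gen_statem, :init, 1}}

        nil ->
          false
      end
    end)
  end

  defp schedule_check, do: Process.send_after(self(), :check, @check_interval)
end
```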
We recently went down this path in NervesHub. There is a lot of replicated state in SSL, especially with CA certificates. We ended up adding the hibernate_after: 15_000 SSL option to force the SSL processes to garbage collect quickly. It seems to have drastically reduced memory usage.
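If anyone wants to try the same thing on the client side with a Finch/Mint stack, hibernate_after is a standard :ssl option and can be threaded through the pool’s transport options, something like this (the pool name, endpoint, and size here are placeholders):

```elixir
# In the application supervision tree; conn_opts are passed to Mint, and
# transport_opts are handed to :ssl for HTTPS connections.
{Finch,
 name: MyApp.Finch,
 pools: %{
   "https://dynamodb.us-east-1.amazonaws.com" => [
     size: 50,
     conn_opts: [transport_opts: [hibernate_after: 15_000]]
   ]
 }}
```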