ssl_gen_statem:init/1 processes using a lot of memory during periods of higher throughput

Hi,

I’ve been having periods of increased latency and request timeouts when my servers are handling higher throughput. I managed to get a look in Observer at a server having issues and noticed many processes like this:

Name                   Reds    Memory   MsgQ  Current Function
ssl_gen_statem:init/1  417788  7266120  0     gen_statem:loop_receive/3

When I caught it there were about 16 of these processes, all with roughly the same values for reductions and memory.

Looking deeper into the process, this happens when making a call to DynamoDB via the ExAws library, which uses hackney (https://github.com/ex-aws/ex_aws).

Is there a way to reduce the memory usage of these processes? I’m asking on this forum first before attempting to raise any sort of issue on the repository.

What did you find indicating your latency is connected to these ssl_gen_statem processes?

Essentially I see multiple attempted requests to DynamoDB which hit a 20 second timeout and error out. During this period there is a spike in response time for DynamoDB requests and the ssl_gen_statem processes jump to the top of Observer. Other calls made to DynamoDB also slow down; for instance, a put that is usually 200 ms goes to 30 seconds.

Eventually the node is disconnected from the cluster.

Also DynamoDB is my only upstream service and the process state for the ssl_gen_statem process shows DynamoDB.

It’s always possible it could be something else but my app is very minimal, only running Phoenix channels to pass messages to clients and then using DynamoDB to save and retrieve messages.

Edit: There’s also a spike in Erlang process memory usage, which, combined with the ssl_gen_statem memory, makes me think it’s a likely cause. My scheduler utilization remains the same.


It’s probably a long shot, but we did see issues with high throughput and the way we were supplying SSL certs: “Hunting Memory Spikes in the Erlang BEAM” (New Relic blog).

It looks like you’re in a different situation but the high throughput + SSL looked familiar.

That said, we were chasing OOMs, not latency. Things were performing fine. I wouldn’t assume an increase in memory usage necessarily leads to timeout issues.

Have you watched for any processes with long message queues?
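If it helps, here’s roughly how to check from a remote IEx shell (just a sketch using the standard Process functions):

# List the 10 processes with the longest message queues.
Process.list()
|> Enum.map(fn pid -> {pid, Process.info(pid, :message_queue_len)} end)
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, {:message_queue_len, len}} -> len end, :desc)
|> Enum.take(10)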

I did see that and it seemed similar, though I would have to configure these options via the library and the documentation doesn’t really make it clear how that is done in ExAws.

I have looked for long queues and there don’t seem to be any (all processes show zero).

https://hexdocs.pm/ex_aws/ExAws.Request.Hackney.html

config :ex_aws, :hackney_opts,
  [
    # keyword options go here
  ]
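For example (these option names come from hackney’s documentation; the values are only illustrative):

config :ex_aws, :hackney_opts,
  [
    recv_timeout: 30_000,   # how long to wait for the response (hackney's :recv_timeout)
    connect_timeout: 5_000  # how long to wait to establish a connection (hackney's :connect_timeout)
  ]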

You can also replace which HTTP client is used entirely.

Sorry, that was a misunderstanding on my part. I meant more on the hackney side of things; I don’t have much of an idea of what options would be helpful for this problem (hackney documents re-using a connection, but that isn’t an options-based thing). I could use another client library, but I’m not sure which one it would be. The only one I know of is HTTPoison, and that just uses hackney, so I don’t know what it would change in this case.


I’ve done a little more research into the libraries and there are more than I realised, though I’m still a little hesitant to swap it out and just hope for the best. It might be the best shot I have, though.

I’ve recently started using Req and it’s a much nicer API than HTTPoison, which I used previously.

However, it seems I have the same issue you describe here. Did you ever resolve it?

In my case I’m doing a whole bunch of very simple GET requests, and these :ssl_gen_statem.init/1 processes just end up hanging around forever, eating up all the memory. Killing them manually doesn’t break anything, but it does free up a lot of memory.

No idea what to do. The requests are being run inside a task that does indeed finish.

Anyway, where did you end up?


Changing my HTTP client library seemed to help. I ended up switching to Finch, which leverages NimblePool and Mint. Since I switched I haven’t had the timeouts on ExAws, except those related to DynamoDB. The ssl_gen_statem processes consume far less memory, and I also saw an improvement in response times. Since my requests were mainly going to the same host, using Finch made a lot of sense.
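In case it helps anyone, hooking ExAws up to Finch only needs a small module implementing the ExAws.Request.HttpClient behaviour. A sketch (MyApp.ExAwsFinchClient and MyApp.Finch are placeholder names, and it assumes a Finch pool is already started in your supervision tree):

# Sketch of an ExAws.Request.HttpClient implementation backed by Finch.
defmodule MyApp.ExAwsFinchClient do
  @behaviour ExAws.Request.HttpClient

  @impl true
  def request(method, url, req_body, headers, _http_opts) do
    req = Finch.build(method, url, headers, req_body)

    case Finch.request(req, MyApp.Finch) do
      {:ok, %Finch.Response{status: status, headers: resp_headers, body: body}} ->
        {:ok, %{status_code: status, headers: resp_headers, body: body}}

      {:error, reason} ->
        {:error, %{reason: reason}}
    end
  end
end

Then point ExAws at it in config:

config :ex_aws, http_client: MyApp.ExAwsFinchClient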

I’m not sure if the library was the exact issue but I haven’t really had issues since I swapped.


Req uses Finch under the hood, so I guess that wouldn’t help me. I did try HTTPoison for this task and it seemed to use the same amount of memory, but it cleared it much faster.

And the problem seems to be connected to bad URLs where requests have SSL problems.

Just now got around to reading your article, thank you for it.

One thing that’s left unclear: what exactly did you do to make the CA store use ETS? The article says “provide it as a file and not as a list of DER blobs”, but it doesn’t spell out exactly how. :thinking:

Ha, a pretty critical thing missing from that article! I wonder if it got stripped before it made it to the public blog.

There’s a “cacertfile” option that can be passed to the Erlang SSL module. AFAIK most client libraries have some way to pass options down.

This looks similar to what I remember doing:
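Something along these lines for Finch/Mint (the pool name and path are placeholders); the key point is passing cacertfile as a path rather than cacerts as a list of decoded DER certs:

# In your application's supervision tree:
{Finch,
 name: MyApp.Finch,
 pools: %{
   default: [
     conn_opts: [
       transport_opts: [
         # handed straight to :ssl; with a file, the CA store is cached and
         # referenced rather than each connection holding its own list of DER blobs
         cacertfile: "/etc/ssl/certs/ca-certificates.crt"
       ]
     ]
   ]
 }}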

We moved to Finch, but it’s probably the same structure either way.


Very nice, thank you!

And what is the bad way of doing it that triggered the super high memory usage?

I believe it’s what Hackney was doing by default (I’m going off old pull request notes :joy:), so essentially the default behavior puts you in the more convenient but less optimal situation.

A reasonable default honestly! But we make hundreds of thousands of HTTP requests a minute and that’s when the scale started to be an issue.


I have the same problem as yours: :ssl spawns a lot of processes but doesn’t terminate them. The init calls are :ssl_gen_statem.init/1, :supervisor.tls_dyn_connection_sup/1, and :tls_sender.init/1, and the counts of these processes are 1:1:1.

I haven’t found the reason yet, so I wrote a task to monitor the process count and restart :ssl.
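Roughly, something like this (a sketch; SslWatchdog is a placeholder name, the threshold is arbitrary, and restarting :ssl is a blunt instrument since it drops every live TLS connection):

# Sketch of a periodic check: count TLS connection processes and bounce :ssl over a threshold.
defmodule SslWatchdog do
  use Task, restart: :permanent

  @threshold 10_000
  @interval :timer.minutes(1)

  def start_link(_opts), do: Task.start_link(&loop/0)

  defp loop do
    count =
      Process.list()
      |> Enum.count(fn pid ->
        case Process.info(pid, :dictionary) do
          {:dictionary, dict} ->
            List.keyfind(dict, :"$initial_call", 0) ==
              {:"$initial_call", {:ssl_gen_statem, :init, 1}}

          nil ->
            false
        end
      end)

    if count > @threshold do
      # drops all live TLS connections; they are re-established on demand
      :ssl.stop()
      :ssl.start()
    end

    Process.sleep(@interval)
    loop()
  end
end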

We recently went down this path in NervesHub. There is a lot of replicated state in SSL, especially with CA certificates. We ended up adding the hibernate_after: 15_000 SSL option to force the SSL processes to garbage collect quickly. It seems to have drastically reduced memory usage.
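If you want to try the same thing from the client side, hibernate_after is a plain :ssl option, so for Finch/Mint it can go in the same transport_opts shown earlier (pool name is a placeholder; 15_000 is milliseconds):

{Finch,
 name: MyApp.Finch,
 pools: %{
   default: [
     conn_opts: [
       transport_opts: [
         # :ssl hibernates (and fully GCs) the connection process after 15 s of inactivity
         hibernate_after: 15_000
       ]
     ]
   ]
 }}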


I configured Finch’s max idle time and the process leaking was gone!
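For anyone else landing here: assuming a reasonably recent Finch, the idle-time knob is along these lines (MyApp.Finch is a placeholder and 60_000 ms is just an example value):

{Finch,
 name: MyApp.Finch,
 pools: %{
   default: [
     # close connections (and their ssl_gen_statem processes) after 60 s idle
     conn_max_idle_time: 60_000
   ]
 }}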