I’ve been having periods of increased latency and request timeouts when my servers are under higher throughput. I managed to look at Observer on a server that was having issues and noticed many processes like this:
When I caught it there were about 16 of these processes, all with roughly the same values for reductions and memory.
Looking deeper into the process, this happens when making a call to DynamoDB (made via the ex_aws library, which uses hackney: ex-aws/ex_aws on GitHub).
Is there a way to reduce the usage of this process? I’m asking on this forum first before attempting to raise any sort of issue on the repository.
What did you find indicating your latency is connected to these ssl_gen_statem processes?
Essentially I see multiple attempted requests to DynamoDB which hit a 20 second timeout and error out. During this period there is a spike in response time for DynamoDB requests and the ssl_gen_statem processes jump to the top of Observer. Other calls made to DynamoDB also slow down; for instance, a put that usually takes 200ms goes to 30 seconds.
Eventually the node is disconnected from the cluster.
Also DynamoDB is my only upstream service and the process state for the ssl_gen_statem process shows DynamoDB.
It’s always possible it could be something else but my app is very minimal, only running Phoenix channels to pass messages to clients and then using DynamoDB to save and retrieve messages.
Edit: There’s also a spike in Erlang process memory usage, which together with the ssl_gen_statem memory makes me think it’s a likely cause. My scheduler utilization remains the same.
It’s probably a long shot, but we did see issues with high throughput and the way we were supplying ssl certs: Hunting Memory Spikes in the Erlang BEAM | New Relic
It looks like you’re in a different situation but the high throughput + SSL looked familiar.
That said, we were chasing OOMs, not latency. Things were performing fine. I wouldn’t assume an increase in memory usage necessarily leads to timeout issues.
Have you watched for any processes with long message queues?
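One way to check for long message queues from a remote IEx shell on the affected node is sketched below. The module name `QueueCheck` is hypothetical and only for illustration; the calls themselves (`Process.list/0`, `Process.info/2`) are standard library functions.

```elixir
defmodule QueueCheck do
  # Hypothetical helper: returns the `limit` processes with the longest
  # message queues as {pid, info} tuples, longest queue first.
  def top(limit \\ 10) do
    Process.list()
    |> Enum.flat_map(fn pid ->
      # Process.info/2 returns nil if the process exited in the meantime.
      case Process.info(pid, [:message_queue_len, :registered_name, :current_function]) do
        nil -> []
        info -> [{pid, info}]
      end
    end)
    |> Enum.sort_by(fn {_pid, info} -> info[:message_queue_len] end, :desc)
    |> Enum.take(limit)
  end
end
```

Calling `QueueCheck.top(5)` while the latency spike is happening gives a snapshot; a single process with a steadily growing queue would point at a bottleneck, while all zeros suggests the delay is elsewhere.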
I did see that and it seemed similar, though I would have to configure these options via the library and the documentation doesn’t really make clear how that is done in Ex_Aws.
I have looked for long queues and there doesn’t seem to be any (all processes show zero)
config :ex_aws, :hackney_opts,
  # keyword options go here, for example:
  recv_timeout: 30_000
You can also replace which HTTP client is used entirely.
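As a rough sketch of what swapping the client could look like: ex_aws lets you point `:http_client` at a module implementing the `ExAws.Request.HttpClient` behaviour. The example below uses Finch purely as an illustration; that choice is an assumption and not something suggested in this thread. The module name `MyApp.ExAwsFinchClient` and the pool name `MyApp.Finch` are hypothetical, and the 30 second fallback timeout is arbitrary.

```elixir
# Assumes {Finch, name: MyApp.Finch} is started in the application's
# supervision tree and that config/config.exs contains:
#
#   config :ex_aws, http_client: MyApp.ExAwsFinchClient

defmodule MyApp.ExAwsFinchClient do
  @behaviour ExAws.Request.HttpClient

  @impl true
  def request(method, url, body, headers, http_opts) do
    # Reuse ex_aws's :recv_timeout option if the caller passed one
    # (assumes http_opts is a keyword list).
    timeout = Keyword.get(http_opts, :recv_timeout, 30_000)

    method
    |> Finch.build(url, headers, body)
    |> Finch.request(MyApp.Finch, receive_timeout: timeout)
    |> case do
      {:ok, %Finch.Response{status: status, headers: resp_headers, body: resp_body}} ->
        # ex_aws expects a map with :status_code, :headers and :body.
        {:ok, %{status_code: status, headers: resp_headers, body: resp_body}}

      {:error, reason} ->
        {:error, %{reason: reason}}
    end
  end
end
```

This would not by itself explain the latency; it only makes it possible to compare behaviour under load with a different connection pool and TLS path.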
Sorry, that was a misunderstanding on my part. I meant more on the hackney side of things: I don’t have much of an idea of which options would be helpful for this problem (hackney documents re-using a connection, but that isn’t an options-based thing). I could use another client library, but I’m not actually sure what that would be. The only one I know of is HTTPoison, but that just uses hackney, so I don’t know what difference it would make in this case.
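On the hackney side, one thing that can be tried through options is a dedicated connection pool, so that DynamoDB requests reuse pooled connections with limits you control. The fragment below is a sketch under assumptions: the pool name `:dynamo_pool` is hypothetical, and the timeout and connection numbers are placeholders rather than recommendations.

```elixir
# In the application's supervisor (e.g. lib/my_app/application.ex),
# add a named hackney pool to the children list:
children = [
  :hackney_pool.child_spec(:dynamo_pool, timeout: 15_000, max_connections: 100)
]

# In config/config.exs, tell ex_aws to pass that pool to hackney:
config :ex_aws, :hackney_opts,
  pool: :dynamo_pool,
  recv_timeout: 30_000
```

Whether this helps depends on whether requests are actually waiting on connection checkout or on new TLS handshakes, which the observations so far do not confirm.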
I’ve done a little more research into the libraries and there are more than I realised, though I’m still a little hesitant to swap it out and just hope for the best. It might be the best shot I have, though.