Hi everyone,
The title of this issue is long, because the problem is a bit complicated.
I would appreciate guidance on what I can investigate further in the BEAM.
I have a set of tasks deployed in AWS ECS, running as independent nodes for one service.
We use AWS SSM (see "Working with SSM Agent" in the AWS Systems Manager documentation) to connect to one of the desired nodes (example: aws ssm start-session --target ${CLUSTER}${FARGATE_TASK_ID}${FARGATE_CONTAINER_ID})
The next step, from that node, is to connect to the Elixir console of the service via iex --remsh service-name --sname debug --cookie COOKIE
We found that if the SSM session expires, the debug node remains connected and drives the CPU of the service to 100%.
To recover, a manual or programmatic disconnect_node call is required, after which the CPU drops back.
My questions are:
- How can I investigate what makes the CPU usage grow?
- How can I detect this case from a monitoring solution, so that the node can be disconnected automatically?
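As a possible starting point for both questions, here is a minimal sketch meant to be pasted into a shell on the service node. The module name RemshDebug, its function names, and the "debug@" prefix (derived from --sname debug) are assumptions for illustration, not part of any library. It samples per-process reduction counts twice to see which processes are doing work, and disconnects connected nodes whose name starts with the debug prefix.

```elixir
defmodule RemshDebug do
  # Hypothetical helper: returns the n processes whose reduction count
  # grew the most during the sampling interval, as {pid, delta} tuples.
  def top_by_reductions(n \\ 5, interval_ms \\ 1_000) do
    before = snapshot()
    Process.sleep(interval_ms)
    later = snapshot()

    later
    |> Enum.map(fn {pid, reds} -> {pid, reds - Map.get(before, pid, 0)} end)
    |> Enum.sort_by(fn {_pid, delta} -> delta end, :desc)
    |> Enum.take(n)
  end

  # Hypothetical helper: disconnects every connected node (visible or hidden)
  # whose name starts with the given prefix. The "debug@" default is an
  # assumption based on `--sname debug`.
  def disconnect_matching(prefix \\ "debug@") do
    for node <- Node.list(:connected),
        String.starts_with?(Atom.to_string(node), prefix) do
      {node, Node.disconnect(node)}
    end
  end

  # Map of pid => reductions; processes that exited in the meantime
  # return nil from Process.info/2 and are skipped by the pattern.
  defp snapshot do
    for pid <- Process.list(),
        {:reductions, reds} <- [Process.info(pid, :reductions)],
        into: %{},
        do: {pid, reds}
  end
end
```

Calling RemshDebug.top_by_reductions() while the CPU is high, and RemshDebug.disconnect_matching() from a periodic job, would be one way to use it. Deciding when a debug node counts as stale is left open here.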
Thank you
If you are inside the server running the production release, then you can connect to the node with ./bin/app-name remote:
$ ls
bin erts-11.1.6 lib releases
$ ls bin
rumbl rumbl.bat
$ ./bin/rumbl remote
Erlang/OTP 23 [erts-11.1.6] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:1]
Interactive Elixir (1.11.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(rumbl@1d2fb4f5d614)1>
Maybe by using the production release's iex session directly you will not face the issue you mention. I have no idea whether it will solve it, but I think it will not hurt to try.
Thank you for your input. Unfortunately, due to regulations in my case, we don't connect to the service node directly; we connect to a debug node, from which we can run
iex --remsh service-name --sname debug --cookie COOKIE
I did a bit more research, trying to identify the hanging connections at the TCP level, with the idea that I could find what is causing the CPU increase by checking the process status of the hanging PIDs, but I had no luck.
The node stats also don't show much in terms of increased reductions or run queues, which leaves me clueless.
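When reductions and run queues look flat, scheduler utilization may still be worth checking, since it measures how long schedulers are busy rather than how much work is counted in reductions. Below is a minimal sketch using :erlang.statistics(:scheduler_wall_time). The module name SchedUtil and the sampling interval are assumptions for illustration.

```elixir
defmodule SchedUtil do
  # Hypothetical helper: returns {scheduler_id, utilization} tuples, where
  # utilization is the fraction of wall time the scheduler was active
  # during the sampling interval.
  def sample(interval_ms \\ 1_000) do
    # The flag must be enabled first; otherwise statistics/1 returns :undefined.
    :erlang.system_flag(:scheduler_wall_time, true)

    first = Enum.sort(:erlang.statistics(:scheduler_wall_time))
    Process.sleep(interval_ms)
    second = Enum.sort(:erlang.statistics(:scheduler_wall_time))

    # Both samples are sorted by scheduler id, so zipping pairs them up.
    for {{id, active1, total1}, {id, active2, total2}} <- Enum.zip(first, second) do
      {id, (active2 - active1) / max(total2 - total1, 1)}
    end
  end
end
```

Running SchedUtil.sample() once while the CPU is at 100% and once after disconnecting the debug node would give two sets of numbers to compare. The :msacc module from runtime_tools (:msacc.start(1000) followed by :msacc.print()) is another option for seeing what kind of work the scheduler threads are spending time on.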
I will close the topic on my side for now. Thank you again.