Expired ssm-agent session with a debug node connected via "iex --remsh" does not automatically disconnect the node, leaving nodes() state inconsistent; bonus: CPU on the service node instantly spikes to 100%

Hi everyone,

The title of this issue is long because the problem is a bit complicated.
I would appreciate your support in figuring out what I can investigate further in the BEAM.

I have a system of Tasks deployed in ECS (AWS), with independent nodes for one service.
We are using AWS SSM (Working with SSM Agent - AWS Systems Manager) to connect to one of the desired nodes (example: aws ssm start-session --target ${CLUSTER}${FARGATE_TASK_ID}${FARGATE_CONTAINER_ID})

The next step, from that node, is to connect to the Elixir console of the service via iex --remsh service-name --sname debug --cookie COOKIE

We identified that if the SSM session expires, the debug node remains connected and drives the CPU of the service to 100%.

To recover, a manual (or code-driven) disconnect_node action is required, which makes the CPU drop back to normal.
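Concretely, the manual recovery looks roughly like this from a shell attached to the service node (the node name below is hypothetical; check Node.list() for the real one):

```elixir
# List connected nodes, then force-disconnect the stale debug node.
# The node name here is hypothetical.
Node.list()
# e.g. [:"debug@ip-10-0-1-23"]

Node.disconnect(:"debug@ip-10-0-1-23")
# true on success, false if the node was not connected,
# :ignored if the local node is not alive

Node.list()
# the debug node should no longer be listed
```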

My questions are:

  • How can I investigate what makes the CPU grow?
  • How can I detect this case via a monitoring solution so that the node is disconnected automatically?
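One possible direction for the second question (a sketch only, not something from the original setup; the "debug" name prefix, the check interval, and the maximum session age are all assumptions):

```elixir
# Hypothetical watchdog: periodically scan the connected nodes and
# disconnect any node whose short name starts with "debug" once it
# has been connected longer than @max_age.
defmodule DebugNodeWatchdog do
  use GenServer

  @interval :timer.minutes(1)   # assumed check interval
  @max_age :timer.minutes(30)   # assumed maximum debug session length

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state) do
    schedule()
    {:ok, state}
  end

  @impl true
  def handle_info(:check, seen) do
    now = System.monotonic_time(:millisecond)

    debug_nodes =
      Enum.filter(Node.list(), fn node ->
        node |> Atom.to_string() |> String.starts_with?("debug")
      end)

    # Keep timestamps only for nodes that are still connected,
    # and record a first-seen time for newly appeared debug nodes.
    seen = Map.take(seen, debug_nodes)
    seen = Enum.reduce(debug_nodes, seen, &Map.put_new(&2, &1, now))

    # Disconnect debug nodes that have overstayed.
    for {node, first_seen} <- seen, now - first_seen > @max_age do
      Node.disconnect(node)
    end

    schedule()
    {:noreply, seen}
  end

  defp schedule, do: Process.send_after(self(), :check, @interval)
end
```

Starting this under the application's supervision tree would make the cleanup automatic; detecting the SSM expiry itself from the BEAM side remains an open question.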

Thank you


If you are inside the server running the production release, then you can connect to the node with ./bin/app-name remote:

$ ls 
bin          erts-11.1.6  lib          releases

$ ls bin
rumbl      rumbl.bat

$ ./bin/rumbl remote
Erlang/OTP 23 [erts-11.1.6] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:1]

Interactive Elixir (1.11.3) - press Ctrl+C to exit (type h() ENTER for help)

Maybe by using the production release iex session directly you will not face the issue you mention. I have no idea whether it will solve it or not, but I think it will not hurt to try :slight_smile:

Thank you for your input. Unfortunately, due to regulations in my case we don’t connect to the exact service node; we connect to a debug node from which we have the chance to run
iex --remsh service-name --sname debug --cookie COOKIE

I did a bit more research to try to identify the hanging connections at the TCP level, with the idea that I could pinpoint what is causing the CPU increase by checking the process status of the hanging PIDs, but I was not very lucky.
The node stats likewise don’t show much in terms of increased reductions or run queues, which leaves me clueless.
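For completeness, this is the kind of per-process sampling I mean (standard Process functions only; the top-5 cutoff is arbitrary):

```elixir
# Snapshot the 5 processes with the most accumulated reductions.
top =
  Process.list()
  |> Enum.map(fn pid ->
    # Process.info/2 returns nil if the process exited in the meantime.
    case Process.info(pid, [:reductions, :registered_name, :current_function]) do
      nil -> nil
      info -> {pid, info}
    end
  end)
  |> Enum.reject(&is_nil/1)
  |> Enum.sort_by(fn {_pid, info} -> -info[:reductions] end)
  |> Enum.take(5)
```

In the stuck state nothing stood out here, which is what makes the CPU spike so puzzling.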

I will close the topic on my side for now. Thank you again.
