The step after copying the file is to stop the app using `systemctl stop my_app`.
As seen in the unit file, we’re using the release’s `daemon` command to start the app and `stop` to stop it.
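For context, here is a minimal sketch of the kind of unit file being described. The paths, user, and `Type=` are assumptions for illustration; the actual unit in question may differ:

```ini
# /etc/systemd/system/my_app.service  (hypothetical sketch)
[Unit]
Description=my_app Elixir release
After=network.target

[Service]
Type=forking
User=deploy
# "daemon" forks and detaches; "stop" RPCs into the running node
ExecStart=/data/my_app/app/bin/my_app daemon
ExecStop=/data/my_app/app/bin/my_app stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
```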
So the issue I’m having is that, randomly, the stop command sometimes hangs and deployment can’t continue. Deployment is done using Capistrano (I don’t know if that matters, but there’s no harm in adding more info).
When this stop process gets “frozen”, I ssh into the box and use `ps aux | grep my_app` to see what’s still hanging, and all I see is this:

`deploy 27269 12.3 1.5 5536800 252496 ? Sl Mar10 1095:07 /data/my_app/app/erts-10.6.4/bin/beam.smp -K true -A 64 -- -root /data/my_app/app -progname erl -- -home /data/my_app/app -- -noshell -s elixir start_cli -mode embedded -setcookie ********** -name node@<real ip> -config /data/my_app/shared/my_app-0.52.5-20220310212140-6041.runtime -boot /data/my_app/app/releases/0.52.5/start -boot_var RELEASE_LIB /data/my_app/app/lib -kernel inet_dist_listen_min 49000 -kernel inet_dist_listen_max 49100 -- -- -extra --no-halt`

which, if I understand correctly, is just the app process itself, with no other processes left.
To get past this issue I just kill the process, and then every other step completes correctly. Being a noob in Elixir, with a limited understanding of the Erlang VM, I understand that a release is a package where all executables (even the BEAM) are bundled into the compressed file, but now I’m not sure what to search for in the logs or how to debug this behaviour, as I can’t see anything obvious in the logs.
My question is: is there anything obvious that I’m missing or should be looking for, or does this require more in-depth debugging / a version upgrade?
Hey, sorry for the late response. I’ll do more tests today, but what I can say so far: when this happens, using the `:observer` functionality I can see all schedulers flat-line, though not at zero. I’ll try to dig up more info today.
So, yeah, everything still works: the node is reachable, endpoints are available and all. My stop command sits there waiting, `systemctl status my_app` shows `Active: deactivating (stop-sigterm) since Mon 2022-03-28 19:51:07 UTC; 16min ago`, and `ps` still shows the process running.
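That `deactivating (stop-sigterm)` state means systemd has run `ExecStop`, sent SIGTERM, and is waiting for the process to exit before escalating. A rough shell sketch of that escalation logic (this is an illustration of the behaviour, not systemd’s actual code; the timeout plays the role of `TimeoutStopSec`):

```shell
#!/bin/sh
# Sketch of what systemd does while a unit sits in
# "deactivating (stop-sigterm)": send SIGTERM, wait up to a
# timeout, then escalate to SIGKILL.
stop_with_timeout() {
  pid=$1
  timeout=$2
  kill -TERM "$pid" 2>/dev/null
  i=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$i" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null   # the SIGKILL escalation
      return 1
    fi
    sleep 1
    i=$((i + 1))
  done
  return 0
}

# Demo: a process that ignores SIGTERM, like a BEAM stuck in shutdown.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
victim=$!
if stop_with_timeout "$victim" 2; then
  echo "stopped gracefully"
else
  echo "escalated to SIGKILL"
fi
```

If the unit stays in this state for 16+ minutes, it suggests either a very long (or infinite) stop timeout on the unit, or that systemd is tracking a different main PID than the one that needs killing.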
Adding more information: while in this state, I ssh’d into the node, went to `my_app/app/bin/`, and executed `./my_app stop`, and now I’m getting `--rpc-eval : RPC failed with reason :nodedown`. I’m thinking that somewhere in the shutdown process, communication is broken, and now I can’t gracefully stop the node.
Do you happen to have somewhere I can read up on that? I’d like to learn how systemd manages epmd; I was assuming that systemd would just monitor the spawned process that kickstarts epmd and then the app.
New findings tell me that when `my_app/bin/my_app stop` is called, epmd is terminated but not the beam process that is running my app. I think it does have to do with my systemd configuration; I may end up having to upgrade the application to the newest Erlang/Elixir.
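If the systemd configuration is indeed the culprit, one thing worth checking (an assumption, not a confirmed diagnosis for this case) is the unit’s kill behaviour. With the default `KillMode=control-group`, systemd will eventually SIGKILL everything left in the unit’s cgroup after `TimeoutStopSec`, whereas `KillMode=process` or an overly long timeout can leave the beam process behind after epmd is gone. A fragment like this makes the escalation explicit:

```ini
[Service]
# After ExecStop (my_app stop) returns or hangs, send SIGTERM to every
# process in the unit's control group, wait 30s, then SIGKILL the rest.
KillMode=control-group
TimeoutStopSec=30
SendSIGKILL=yes
```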
I am facing a similar problem with my app. I KNOW it’s all my fault, but I’m not sure what tools are available to help debug it. The main problem seems to be that the logger dies fairly early in the shutdown sequence (I think?), so it’s hard to log the shutdown of the app.
In my case I’m absolutely sure it’s because of bad OTP structures. I had 3 main problems (now mostly eliminated, but only through learning…):
Starting processes outside of OTP supervision trees, especially ones that aren’t listening for shutdown of the whole app. Often this means doing things like calling `start_link` on some other component in a GenServer `init` call. Seems innocent enough, and then you store the sub-process pid. Lots of docs even suggest doing it…
Rolling your own restarting of dead processes, e.g. see the above: you `start_link` something and then build your own monitoring to restart it if it dies. Could be something like spawning an external process, etc. In my case I had the BEAM trying to kill my process and another process madly restarting it. The clue is that CPU is very high on the machine at this point (as the restart spins in a loop).
Spawning external OS processes. Lots can go wrong here. In my case I needed sudo to start the process, but this left me unable to terminate it, as my chosen wrapper didn’t sudo the kill call. I solved this by granting a process the capability to SIGKILL non-owned processes.
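The first anti-pattern above can be sketched in a few lines of Elixir, alongside the supervised fix. The module names here are made up for illustration, and `MyApp.Worker` is assumed to be an ordinary GenServer:

```elixir
# Anti-pattern: linking a child from init/1 and keeping its pid yourself.
defmodule MyApp.Bad do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  def init(_arg) do
    # Worker is now linked to this process but invisible to any
    # supervisor, so the app's ordered shutdown can't reach it cleanly.
    {:ok, pid} = MyApp.Worker.start_link([])
    {:ok, %{worker: pid}}
  end
end

# Fix: declare the child in a supervision tree and let OTP own its
# lifecycle, including restarts and shutdown of the whole app.
defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  def init(_arg) do
    children = [
      MyApp.Worker
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

With the supervised version, the restart logic from the second anti-pattern also disappears: the supervisor’s restart strategy replaces the home-grown monitoring loop.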
I tried a few ideas to instrument shutdown and see what was going wrong, but didn’t find this easy. The observer died before I could see the stuck processes. Intuition was enough to find the issue, but sure, build correct OTP systems and you don’t have this problem…
I just wanted to keep the hope alive here. For now, I’ve had the chance to update my stack (sort of): I updated Elixir to 1.13.3 and Erlang to 22.214.171.124 (which could help, but not really fix the issue). When I get approval to update the systemd unit file and add the libraries needed to change the type of service, I’ll let you guys know if the issue is that.