Release started with daemon randomly can't be stopped

Hi guys, I inherited a Phoenix app and have been learning Elixir as I go, so please bear with me as I explain my current situation.

When doing a deployment, one of the primary steps is to stop the running app so the newly released one can start. What I know so far / current status:

  • We’re using releases that get generated inside a Docker container as a tar.gz file
  • Erlang 22.2.7, Elixir 1.10.2, Phoenix 1.4.12
  • The deployment process copies the tar.gz file via scp
  • A systemd unit file manages the app as a service, with the following config:
[Unit]
Description=MyApp
After=network.target

[Service]
Type=forking
User=deploy
Group=deploy
WorkingDirectory=/data/my_app
ExecStart=/data/my_app/app/bin/my_app daemon
ExecStop=/data/my_app/app/bin/my_app stop
TimeoutSec=infinity
Restart=on-failure
RestartSec=5
Environment=HOME=/data/my_app/app
Environment=LANG=en_US.UTF-8
Environment=PIDFILE=/data/my_app/shared/my_app.pid
Environment=RELEASE_CONFIG_DIR=/data/my_app/shared/config/
Environment=RELEASE_TMP=/data/my_app/shared
Environment=RELEASE_COOKIE=<cookie>
Environment=RELEASE_NAME=<hostname>
Environment=RELEASE_NODE=<hostname>@<real ip address>
Environment=RUN_ERL_LOG_GENERATIONS=10
Environment=RUN_ERL_LOG_MAXSIZE=536870912
LimitNOFILE=1048576
LimitNPROC=1048576
SyslogIdentifier=pipeline
RemainAfterExit=no

[Install]
WantedBy=multi-user.target

  • The step after copying the file is to stop the service using systemctl stop my_app
  • As seen in the unit file, we’re using daemon to start the app and stop to stop it.

So the issue I’m having is that, at random, the stop command sometimes hangs and the deployment can’t continue. Deployment is done using Capistrano (I don’t know if that matters, but there’s no harm in adding more info).

When this stop process gets “frozen”, I ssh into the box and use ps aux | grep my_app to filter what’s still hanging and all I see is this:
deploy 27269 12.3 1.5 5536800 252496 ? Sl Mar10 1095:07 /data/my_app/app/erts-10.6.4/bin/beam.smp -K true -A 64 -- -root /data/my_app/app -progname erl -- -home /data/my_app/app -- -noshell -s elixir start_cli -mode embedded -setcookie ********** -name node@<real ip> -config /data/my_app/shared/my_app-0.52.5-20220310212140-6041.runtime -boot /data/my_app/app/releases/0.52.5/start -boot_var RELEASE_LIB /data/my_app/app/lib -kernel inet_dist_listen_min 49000 -kernel inet_dist_listen_max 49100 -- -- -extra --no-halt

which, if I understand correctly, is just the app process running and no other process.

To get past this issue I just kill this process, and every other step then completes correctly. I’m a noob in Elixir with a limited understanding of the Erlang VM. I understand that the release is a package where all executables (even the BEAM) are bundled into the compressed file, but I’m not sure what to search for in the logs or how to debug this behaviour, as I can’t see anything obvious in the logs.

My question is: is there anything obvious that I’m missing or should be looking for? Or does this require more in-depth debugging or a version upgrade?

Any pointers would be greatly appreciated :slight_smile:

Not an expert here, but can you try to check if the app is still working after the stop command is issued? Are the endpoints reachable (if it’s a web server)? Is the app still producing logs?

Hey, sorry for the late response. I’ll do more tests today, but what I can say so far is that when this happens, :observer shows all schedulers going flat-line, though not to zero. I’ll try to dig up more info today.

So, yeah, everything still works: the app is reachable, endpoints are available and all. My stop command is still there waiting, and systemctl status my_app shows Active: deactivating (stop-sigterm) since Mon 2022-03-28 19:51:07 UTC; 16min ago. The ps command still shows the process running.
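
A note on the deactivating (stop-sigterm) state: the unit file above sets TimeoutSec=infinity, which applies to stopping as well as starting, so systemd waits indefinitely instead of escalating to SIGKILL. Below is a minimal sketch of a drop-in that bounds only the stop timeout. The drop-in path, the my_app.service unit name, and the 60-second value are assumptions, not part of the original setup. This would only unblock the deployment; it does not explain why the VM fails to shut down.

```ini
# /etc/systemd/system/my_app.service.d/stop-timeout.conf
# (hypothetical path and unit name)
[Service]
# Upper bound (illustrative value) for each stage of stopping:
# the ExecStop= command, then the wait after SIGTERM.
# Once the last wait expires, systemd sends SIGKILL to the remaining processes.
TimeoutStopSec=60
```

A systemctl daemon-reload is needed afterwards for the drop-in to take effect.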

Adding more information: while in this state, I ssh’d into the node, went to my_app/app/bin/ and executed ./my_app stop, and now I’m getting --rpc-eval : RPC failed with reason :nodedown. I’m thinking that somewhere in the process of shutting down, communication gets broken, and now I can’t gracefully stop the process.

More information on this: when this happens, epmd is gone but the app process is still there.

Why do you use Type=forking when it is meant only for legacy tools that for whatever reason insist on running in the background? For a “modern” approach you should use Type=simple or Type=notify.
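
For reference, a minimal sketch of what the Type=simple variant could look like for the unit posted above, assuming the rest of the [Service] section stays as it is. It swaps the release's daemon command for start, which keeps the VM in the foreground so systemd tracks it directly as the main process. Type=notify is not shown, since it additionally needs the application to report readiness to systemd.

```ini
[Service]
# Run the release in the foreground; `start` is the non-daemonizing
# counterpart of `daemon`, so no fork happens behind systemd's back.
Type=simple
ExecStart=/data/my_app/app/bin/my_app start
# Kept from the original unit. It could arguably be dropped: on stop,
# systemd sends SIGTERM to the main process, and the Erlang VM
# (OTP 20 and later) shuts down in response by default.
ExecStop=/data/my_app/app/bin/my_app stop
```

With start, the VM's output goes to stdout, which systemd captures in the journal; the RUN_ERL_LOG_* variables only affect run_erl, which is used in daemon mode.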

Don’t know, it was there from a previous team member’s work. I’ll look into that, thanks for pointing it out :smiley:

epmd being down was my first thought when I saw :nodedown. Is it starting properly on other boxes?

(Just trying to guess-help here, maybe the Type thing is the key here.)

In a proper systemd-based deployment the application should be started without EPMD, and EPMD should be managed by systemd independently from the application itself.
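
As a rough illustration of that idea, and not a tested setup: EPMD gets its own unit, and the application's VM is told not to spawn one. The unit name epmd.service, the use of the deploy user, and the epmd path (taken from the ERTS directory visible in the ps output earlier in the thread) are all assumptions.

```ini
# /etc/systemd/system/epmd.service (hypothetical unit)
[Unit]
Description=Erlang Port Mapper Daemon
After=network.target

[Service]
Type=simple
User=deploy
Group=deploy
# No -daemon flag, so epmd stays in the foreground under systemd's supervision.
# Path assumed from the bundled ERTS directory in the ps output.
ExecStart=/data/my_app/app/erts-10.6.4/bin/epmd
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Additions to the application's own unit (fragment):
[Unit]
Wants=epmd.service
After=epmd.service

[Service]
# Tell the VM not to start its own epmd; it expects one to be running already.
Environment="ERL_FLAGS=-start_epmd false"
```

The point of the split is that EPMD's lifetime no longer depends on the application's start and stop sequence, so the RPC behind bin/my_app stop should still be able to look the node up while the application is shutting down.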

Do you happen to have somewhere I can read up on that? I’d like to learn how systemd would manage EPMD. I was assuming that systemd would just monitor the spawned process, which would kickstart EPMD and then the app.

I need to write the subsequent parts, as it has been some time since this one.

Ok, I’m trying to test if this is the issue, but now I’m having trouble installing systemd because rebar3 fails while trying to compile enough. Is there a limitation on versions?

Open an issue and I will look into it.

New findings tell me that when my_app/bin/my_app stop is called, epmd is terminated but not the beam process that is running my app. I think it does have to do with my systemd configuration; I may end up having to upgrade the application to the newest Erlang/Elixir.

I am facing a similar problem with my app. I KNOW it’s all my fault, but I’m not sure what tools are available to help debug it. The main problem seems to be that the logger dies fairly early in the shutdown process (I think?), so it’s hard to log the shutdown of the app.

In my case I’m absolutely sure it’s because of bad OTP structures. I have 3 main problems (now mostly eliminated, but only through learning…):

  1. Starting processes outside of OTP supervision trees, especially ones which aren’t listening for shutdown of the whole app. Often this means doing stuff like calling start_link of some other component in a GenServer init() callback. It seems innocent enough, and then you store the sub-process pid. Lots of docs even suggest doing it…

  2. Trying to use home-grown restarting of dead processes, e.g. as above: start_link something and then build your own monitoring to restart it if it dies. It could be something like spawning an external process, etc. In my case I had the BEAM trying to kill my process and the other process restarting it madly. The clue is that CPU usage on the machine is very high at this point (as the restart spins in a loop).

  3. Spawning external processes. Lots can go wrong here. In my case I needed sudo to start the process, but this left me unable to terminate it, as my chosen wrapper didn’t sudo the kill call. I solved this by granting caps to a process so it can SIGKILL processes it doesn’t own.

I tried a few ideas to instrument shutdown to see what was going wrong but didn’t find this easy. The observer died before I could see the stuck processes. Intuition was enough to find the issue, but sure, build correct OTP systems and you don’t have a problem…

I just wanted to keep the hope alive here. For now, I had the chance to update my stack (sort of): I updated Elixir to 1.13.3 and Erlang to 22.3.4.25 (which could help but won’t really fix the issue). When I get approval to update the systemd unit file and add libraries to change the type of service, I’ll let you guys know if that is the issue.
