PidFile - create and manage a PID file from the BEAM process

I created a new library (well, really I pulled a couple of files out of my big project); it manages an operating-system PID file for the BEAM.

The reason you might want this is to make a ‘proper’ systemd unit file or something, or just to have an easy way to identify the PIDs of your multiple BEAM processes (I have a lot of BEAM instances running, for example, so this is useful to figure out which is which).

Its Hex URL: https://hex.pm/packages/pid_file

Its README.md:

PidFile

Manages a simple OS PID file for this BEAM system.

In other words, it just makes a file whose sole content is the operating-system PID of the running BEAM process.

It also auto-cleans stale PID files on load and clears the PID file on a ‘proper’ shutdown; even after an unclean shutdown, the stale file will still be cleaned up on the next start.
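The mechanics are tiny. Here is a minimal sketch of the idea (my own illustration, not the library's actual code; `PidFileSketch` is a made-up name):

```elixir
# Hypothetical sketch of the idea, not the library's actual implementation.
defmodule PidFileSketch do
  # Write the current BEAM OS PID to `path`, replacing any stale file first.
  def write(path) do
    pid = :os.getpid() |> to_string()
    File.rm(path)          # remove a stale file from a previous run, if any
    File.write!(path, pid) # file now contains just the OS PID
  end

  # Remove the file again on a clean shutdown.
  def clear(path), do: File.rm(path)
end
```

The real worker additionally hooks into the supervision tree so `clear/1` runs on a proper shutdown.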

Hex: https://hex.pm/packages/pid_file

Installation

    {:pid_file, "~> 0.1.0"},

Setup

Global Config

Add one of these to your config for it to be managed globally, replacing the values as necessary:

config :pid_file, file: "./my_app.pid"
config :pid_file, file: {:SYSTEM, "PIDFILE"}
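The `{:SYSTEM, "PIDFILE"}` form follows the common convention of reading the value from an environment variable at runtime. A hypothetical resolver for it (my own sketch, not necessarily the library's code) would be:

```elixir
# Sketch of the usual {:SYSTEM, var} convention (an assumption, not the
# library's verified code): plain strings pass through unchanged, tuples
# are read from the environment when the node boots.
defmodule PidFilePathSketch do
  def resolve({:SYSTEM, var}) when is_binary(var), do: System.get_env(var)
  def resolve(path) when is_binary(path), do: path
end
```

So `resolve({:SYSTEM, "PIDFILE"})` returns whatever `$PIDFILE` is set to in the environment the BEAM was started from.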

Locally Managed

Add the worker to your supervision tree:

worker(PidFile.Worker, [[file: "/run/my_app.pid"]])

Also, if it helps, I use this overlay to generate (via Distillery) a systemd service file using this PID library:

[Unit]
Description=<%= description %>
After=network.target

[Service]<% full_dir = Path.absname(output_dir)%>
Type=forking
User=<%= deploy_user %>
Group=<%= deploy_group %>
WorkingDirectory=<%= full_dir %>
ExecStart=<%= full_dir %>/bin/<%= p_name %> start
ExecReload=<%= full_dir %>/bin/<%= p_name %> reload_config
ExecStop=<%= full_dir %>/bin/<%= p_name %> stop
PIDFile=<%= full_dir %>/<%= p_name %>.prod.pid
Restart=always
RestartSec=5
Environment=PORT=3000
Environment=LANG=en_US.UTF-8
SyslogIdentifier=<%= p_name %>

[Install]
WantedBy=multi-user.target

I’ve been using it in production ever since I migrated the server from Windows 2008 to Red Hat, so a couple of weeks now without issue. This follows the proper systemd model, using a PID file with proper restarts (I tested it by manually telling the server to stop, killing it, and kill -9’ing it; it comes back up in 5 seconds every time).


Thanks for this! I’ve been trying to get systemd to restart my app if it crashes. I used your :pid_file package and it seems to be managing the pidfile correctly, and I added the path to the pidfile to my systemd unit file. I can start and stop the app via systemd, and the app starts automatically when the server starts, but if I kill the app it doesn’t restart. Wondering if you have any suggestions on this. I start my app using systemd, then check the pidfile to get the pid (in this case it was 1645), then run kill 1645; the app stops and is not restarted. I checked the systemd logs using journalctl -u my_app.service and got the following output. Any thoughts? Thanks again!

Aug 22 11:01:25 ip-172-31-4-169 my_app[1682]: Starting up
Aug 22 11:01:26 ip-172-31-4-169 systemd[1]: my_app.service: Supervising process 1645 which is not our child. We'll most likely not notice when it exits.
Aug 22 11:01:26 ip-172-31-4-169 systemd[1]: Started my_app.
Aug 22 11:02:03 ip-172-31-4-169 run_erl[1644]: Erlang closed the connection.

I noticed run_erl[1644] in the last line of the log messages; note that 1644 is the app’s pid minus 1. I also noticed the message “Supervising process 1645 which is not our child. We’ll most likely not notice when it exits”. I think this may be a clue. Is run_erl a process that “wraps” the app process? Maybe systemd interacts with run_erl rather than our app directly and therefore doesn’t notice if the app exits.

Hmm, I just kill -9’d beam.smp here and it started back up after 5 seconds. Though I use the systemd service config file I posted above.

Can you post the systemd service config file that you are using? I bet run_erl is pre-forking or something; as you can see, I’m calling the release scripts directly.

Thanks for this. Yeah, I’m not too sure what the issue is. My server is Ubuntu, so I couldn’t use the exact command you used to kill it, but I tried pkill -9 beam.smp and got the same results: systemd reported “Erlang closed the connection” and didn’t appear to attempt a restart. This is my conf file. Maybe I messed something up.

[Unit]
Description=My App
After=network.target

[Service]
Type=forking
User=my_app_deployer
Group=my_app_deployer
WorkingDirectory=/home/my_app_deployer/my_app/staging/my_app
ExecStart=/home/my_app_deployer/my_app/staging/my_app/bin/my_app start
ExecStop=/home/my_app_deployer/my_app/staging/my_app/bin/my_app stop
PIDFile=/home/my_app_deployer/my_app/staging/my_app/my_app.pid
Restart=always
RestartSec=5
EnvironmentFile=/home/my_app_deployer/.my_app_env
SyslogIdentifier=my_app
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Thanks again!

Well, first of all, that RemainAfterExit=yes should not be there. It means ‘pretend everything is still working even when the application closes’. Why do you have that there?! o.O


Nice. Thanks for your help @OvermindDL1. That was my problem. Removed that line and it’s working :grin:

I had it in there because I saw in the Distillery docs (https://hexdocs.pm/distillery/use-with-systemd.html#content) that “It’s important that you have RemainAfterExit=yes set, or you will get an error trying to start the service.” I didn’t give any thought to the effect that line would have. And contrary to the note in the docs, I haven’t had trouble starting the service since removing that line.

Thanks again!

Distillery does it wrong, I don’t know why they don’t do it right… ^.^;
Using Distillery’s instructions your service will not get restarted, which seems to defeat the point to me? ^.^

You are welcome! Nice to see others are using this too. :slight_smile:

@OvermindDL1: Just looked at your code - it’s simple and useful.
I don’t understand why you are calling:

:os.getpid() # charlist
|> to_string()
|> String.to_integer()
|> to_string()

i.e. update_pid/1 calls get_pid/0, which returns an integer (from a string), and then you change it back to a string. I could understand if it were used in another function, but it’s not, so I’m a little curious why you did it.

:os.getpid() # charlist
|> to_string() # Convert charlist to string
|> String.to_integer() # convert string to integer to make sure it really is an integer or it throws, I needed this in some cases but don't recall where
|> to_string() # Convert it back to a string for later writing out

So yeah, I needed to ensure it was an integer at one point but don’t recall where, this code is old… >.>
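For what it’s worth, the effect is easy to demonstrate: the middle conversion raises on anything non-numeric, so a bad value can never silently reach the PID file (the `validate` helper below is just for illustration, not part of the library):

```elixir
# String.to_integer/1 raises ArgumentError on non-numeric input, which is
# what turns the integer round-trip into a validation step.
validate = fn s ->
  try do
    s |> String.to_integer() |> to_string()
  rescue
    ArgumentError -> :invalid
  end
end

validate.("1645")  # => "1645"
validate.("oops")  # => :invalid
```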

Probably it is worth mentioning in this thread that it is common to supervise the VM via the built-in heart mechanism: http://erlang.org/doc/man/heart.html

It works by starting a separate process called heart that not only checks if the beam process is still alive but also sends heartbeat messages to it (hence the name).

Of course, if you use heart, other restart mechanisms should not be in place. Also, if you want to manually kill the VM, you need to kill the heart process first, then the VM. A better option to stop the VM is to call init:stop() from a remote node, as described here, for example: http://erlang.org/pipermail/erlang-questions/2010-August/052957.html
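A sketch of that remote init:stop() approach from a throwaway Elixir node (the node name and cookie here are made up, and this assumes distribution and epmd are already available):

```elixir
# Hypothetical node and cookie names: start a throwaway distributed node,
# then ask the target VM to shut down cleanly, which lets heart exit too.
Node.start(:"stopper@127.0.0.1")
Node.set_cookie(:my_secret_cookie)
:rpc.call(:"my_app@127.0.0.1", :init, :stop, [])
```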

The main advantage of the heart method is that it is OS independent.


True, but systemd here at work communicates with and logs to a central logging system off-server, and it is useful to see all the times it goes down, starts eating too much CPU, etc. :slight_smile:

That is what my systemd service script does. ^.^


Since there is a systemd unit posted in this thread, maybe someone can help me as well.

I have this service unit:

[Unit]
Description=Evercam Media
After=network.target

[Service]
User=root
Group=root
WorkingDirectory=/
ExecStart=/opt/evercam_media/bin/evercam_media start
ExecStop=/opt/evercam_media/bin/evercam_media stop

#Restart=always

[Install]
WantedBy=multi-user.target

And it always gives me this error:

Aug 23 08:28:30 Ubuntu-1604-xenial-64-minimal systemd[1]: Started Evercam Media.
Aug 23 08:28:31 Ubuntu-1604-xenial-64-minimal evercam_media[30064]: Starting up
Aug 23 08:28:31 Ubuntu-1604-xenial-64-minimal evercam_media[30009]: Node 'evercam_media@Ubuntu-1604-xenial-64-minimal' not responding to pings.
Aug 23 08:28:31 Ubuntu-1604-xenial-64-minimal systemd[1]: evercam_server.service: Control process exited, code=exited status=1
Aug 23 08:28:32 Ubuntu-1604-xenial-64-minimal systemd[1]: evercam_server.service: Unit entered failed state.
Aug 23 08:28:32 Ubuntu-1604-xenial-64-minimal systemd[1]: evercam_server.service: Failed with result 'exit-code'.

Can you point out the issue in this?

That indicates to me that the evercam_media start command is failing. Have you tried launching it manually? Does it need to be started after something else besides After=network.target, like the USB system or something?

EDIT: Oh wait, this is a BEAM VM?

Since you do not have a Type specified, it defaults to simple as I recall, meaning systemd expects the program to persist, not fork. You need to use Type=forking as in my example. You really should copy and use my example in that case.

Actually, I converted the upstart job to a systemd unit, so that’s why I added that After=network.target thing…

Yes, it starts manually…

Okay, I am going to try with Type=forking and with your example.


Okay, your snippet worked, in this form (plus a LimitNOFILE=1000000:1000000 line carried over from the old upstart limit nofile setting):

[Unit]
Description=Evercam Media
After=network.target

[Service]
Type=forking
User=root
Group=root
WorkingDirectory=/opt/evercam_media
ExecStart=/opt/evercam_media/bin/evercam_media start
ExecStop=/opt/evercam_media/bin/evercam_media stop
Restart=always
RestartSec=5
Environment=LANG=en_US.UTF-8
SyslogIdentifier=evercam_media

[Install]
WantedBy=multi-user.target

I have a few concerns which I want to share with you, in case you see anything wrong here or can help in any way…

We were on Ubuntu 14.04 and were using an upstart job, which was like this:

description  "evercam_media"
start on filesystem or runlevel [2345]
stop on runlevel [!2345]
limit nofile 1000000 1000000

respawn
chdir /
setuid {{user_name}}
setgid {{user_name}}

env HOME=/home/{{user_name}}
env LANG=en_US.UTF-8
env LANGUAGE=en_US:en
env LS_ALL=en_US.UTF-8
env ERL_MAX_PORTS=10240
env ERL_MAX_ETS_TABLES=5000
{% for key, value in env_vars.items() %}
env {{key.upper()}}={{value}}
{% endfor %}

exec watch -n1 '/usr/local/bin/run_evercam_media.sh'

post-stop exec sudo pkill beam

The former developer who created the upstart job used watch with an interval of 1 second, just to run the script every second, and in the sh script he was doing something like:

if ! (ps aux | grep evercam_media/bin/evercam_media | grep -v grep > /dev/null); then
   /opt/evercam_media/bin/evercam_media start
fi

For example, if evercam_media had stopped for some reason, it would be started again, with the check running every second.

What if the above service unit fails for some reason? Will it restart itself? I have seen the Restart=on-failure thing as well; do you think it will help?

Also, a big question: ENV variables. I don’t see anywhere that you mentioned ENV variables?

That looks horrifying! That is quite an extra load on the server for no reason… He’s spawning an entire BEAM VM instance every second just to ping it?! o.O

If you use Restart=always as in my script, then it will always restart, including on failure. The only way it will not be restarted is if you systemctl stop ... it, and even then it will be restarted on the next reboot unless you systemctl disable ... it too. ^.^

I had two in my script above:

Environment=PORT=3000
Environment=LANG=en_US.UTF-8

You can of course define your own, or you can import an external file too (see the systemd service file spec). :slight_smile:
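For reference, importing variables from an external file looks like this in the [Service] section (the path here is just an example; each line in the referenced file is KEY=value, and the unit posted earlier in the thread does exactly this with its EnvironmentFile= line):

```ini
[Service]
# Variables may be set inline or loaded from a file; the path is an example.
Environment=LANG=en_US.UTF-8
EnvironmentFile=/etc/my_app/env
```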


That looks horrifying! That is quite an extra load on the server for no reason… He’s spawning an entire BEAM VM instance every second just to ping it?! o.O

Nope, he is not doing that; a new BEAM VM instance only gets started if the evercam_media process is not found by the sh script.

Okay thanks for your help. :slight_smile:

Ah, missed that; I thought it was pinging with the VM. ^.^;

I can start the application without a problem, but when I try to launch it with systemd it keeps failing and restarting for some reason.
Here are the logs:

Jun 08 10:00:15 union-staging systemd[1]: Starting Union Servers...
-- Subject: Unit union.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit union.service has begun starting up.
Jun 08 10:00:16 union-staging union[8741]: ==> Generated sys.config in /test/union/var
Jun 08 10:00:19 union-staging union[8998]: ==> Generated sys.config in /test/union/var
Jun 08 10:00:19 union-staging union[9530]: Starting up
Jun 08 10:00:20 union-staging systemd[1]: union.service: control process exited, code=exited status=1
Jun 08 10:00:20 union-staging union[8998]: Node union@127.0.0.1 is not running!
Jun 08 10:00:20 union-staging systemd[1]: Unit union.service entered failed state.
Jun 08 10:00:20 union-staging systemd[1]: union.service failed.
Jun 08 10:00:25 union-staging systemd[1]: union.service holdoff time over, scheduling restart.
Jun 08 10:00:25 union-staging systemd[1]: Started Union Servers.
-- Subject: Unit union.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit union.service has finished starting up.
-- The start-up result is done.

Here is my systemd config:

[Unit]
Description=Union Servers
After=network-online.target
[Service]
Type=forked
User=root
Group=root
WorkingDirectory=/test/union
ExecStart=/test/union/bin/union start
ExecReload=/test/union/bin/union reload_config
ExecStop=/test/union/bin/union stop
PIDFile=/test/union/union.prod.pid
Restart=always
RestartSec=5
Environment=PORT=4000
Environment=LANG=en_US.UTF-8
SyslogIdentifier=union
[Install]
WantedBy=multi-user.target

The system variable I use is:

PIDFILE="./union.prod.pid"

Here is my mix file:

[
      {:phoenix, "~> 1.3.0"},
      {:phoenix_pubsub, "~> 1.0"},
      {:phoenix_ecto, "~> 3.2"},
      {:postgrex, ">= 0.0.0"},
      {:phoenix_html, "~> 2.10"},
      {:gettext, "~> 0.13.1"},
      {:cowboy, "~> 1.0"},
      {:ex_admin, github: "sublimecoder/ex_admin"},
      {:timex, "~> 3.2", override: true},
      {:coherence, "~> 0.5"},
      {:cloudex, "~> 1.0"},
      {:number, "~> 0.5.4"},
      {:turbolinks, "~> 0.3.2"},
      {:ecto_enum, "~> 1.1"},
      {:jason, "~> 1.0"},
      {:absinthe, "~> 1.4.2"},
      {:absinthe_plug, "~> 1.4.0"},

      # Instrumentation
      {:observer_cli, "~> 1.3.1"},
      {:exprof, "~> 0.2.1"},
      {:eflame, "~> 1.0"},
      {:sentry, "~> 6.2.0"},
      {:prometheus_ex, "~> 1.0"},
      {:prometheus_ecto, "~> 1.0"},
      {:prometheus_phoenix, "~> 1.2"},
      {:prometheus_plugs, "~> 1.0"},
      {:prometheus_process_collector, "~> 1.1"},

      # dev dependencies
      {:phoenix_live_reload, "~> 1.0", only: :dev},
      {:ex_doc, "~> 0.18", only: :dev, runtime: false},
      {:credo, "~> 0.8", only: [:dev, :test], runtime: false},
      {:dialyxir, "~> 0.5.0", only: [:dev], runtime: false},

      # release
      {:edeliver, "~> 1.5.0"},
      {:distillery, "~> 1.5", runtime: false},
      {:pid_file, "~> 0.1.0"},
      {:conform, "~> 2.2"}
    ]

I’m not really sure what I’m doing wrong.