Phx.server re-compiling in prod leads to downtime on systemctl restart

  • Erlang/OTP 20
  • Elixir 1.5.1
  • Phoenix 1.3.0
  • Ubuntu 16.04.3 LTS (GNU/Linux 4.10.0-32-generic x86_64)
  • systemd 229

Hi everyone, I run a Phoenix app in production as a systemd service, and I’m getting crashes on shutdown along with forced compilation (even when none should be necessary) on startup. This is the unit file:

[Unit]
Description=My Core WebServer

[Service]

Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=core-{{ user }}

Environment=LANG=en_US.UTF-8
EnvironmentFile=-/home/{{ user }}/.env
User={{ user }}

WorkingDirectory=/home/{{ user }}/repo
ExecStart=/usr/bin/env elixir --sname core-{{ user }}@localhost -S mix phx.server
ExecStop=/usr/bin/env elixir --sname {{ user }}-shutdown@localhost ./scripts/shutdown core-{{ user }}@localhost

[Install]
WantedBy=multi-user.target

We build the app for production with a custom script:

#!/usr/bin/env bash

# 1. Install hex and rebar
# 2. NPM install
#    bundle assets with webpack
# 3. Get all mix deps
#    mix compile app
# 4. Digest all assets

# Load environment variables
export $(cat ~/.env | xargs) > /dev/null

cd ~/repo

function log() {
  while read line; do echo "[$1] $line"; done
}

echo "Installing latest hex and rebar..."
mix do local.hex --force, local.rebar --force 2>&1 | log mix

( echo "Installing Mix Dependencies"
  mix deps.get --only prod 2>&1 | log mix
  echo "Compiling app"
  MIX_ENV=prod MIX_DEBUG=1 mix compile
)&

( echo "Installing NPM Dependencies"
  NODE_ENV=development npm install 2>&1 | log npm
  echo "Bundling Javascript Assets"
  ./node_modules/.bin/webpack 2>&1 | log webpack
)&

wait

echo "Digesting app"
mix phx.digest

cd - > /dev/null

Recently, I noticed 500s on our staging server after running sudo systemctl restart core-staging.service and realized two things:

  1. Our shutdown script (in which we call :rpc.call(node, :init, :stop, [0])) fails after successfully stopping the application, leaving the unit in a “failed state”.
  2. The app then starts up immediately, but the web server takes anywhere from 90 to 150 seconds to come online. (Note the timestamp on Cowboy below.) This is suspiciously close to how long our mix compile takes, and we come back online instantly if we add --no-compile to the phx.server call.

Sep 22 19:52:41 staging systemd[1]: Stopping OpenFn Core WebServer...
Sep 22 19:52:43 staging core-staging[3613]: Shutting down core-staging@localhost with custom shutdown script.
Sep 22 19:52:43 staging core-staging[3613]: Sending stop message...
Sep 22 19:52:43 staging core-staging[3613]: "Shutdown successful."
Sep 22 19:52:43 staging core-staging[3413]: erl_child_setup closed
Sep 22 19:52:43 staging core-staging[3413]: #015
Sep 22 19:52:44 staging core-staging[3413]: Crash dump is being written to: erl_crash.dump...done
Sep 22 19:52:44 staging systemd[1]: core-staging.service: Main process exited, code=exited, status=1/FAILURE
Sep 22 19:52:44 staging systemd[1]: Stopped My Core WebServer.
Sep 22 19:52:44 staging systemd[1]: core-staging.service: Unit entered failed state.
Sep 22 19:52:44 staging systemd[1]: core-staging.service: Failed with result 'exit-code'.
Sep 22 19:52:44 staging systemd[1]: Started My Core WebServer.
Sep 22 19:52:46 staging core-staging[3666]: warning: variable "deps" does not exist and is being expanded to "deps()", please use parentheses to remove the ambiguity or change the variable name
Sep 22 19:52:46 staging core-staging[3666]:   /home/staging/repo/deps/mailgun/mix.exs:8
Sep 22 19:54:16 staging core-staging[3666]: 19:54:16.490 [info] Running OpenFn.Endpoint with Cowboy using http://0.0.0.0:4000
Sep 22 19:54:17 staging core-staging[3666]: 19:54:17.047 [info] Starting IntervalJobsServer
Sep 22 19:54:17 staging core-staging[3666]: 19:54:17.048 [info] Starting IntervalServer

This is our shutdown script, for the record:

#! /usr/bin/env elixir

# Shutdown script
# ---------------
#
# Expects the name of the node to shut down as its only parameter.
#
# It needs to be executed by a BEAM instance with its sname set,
# e.g. `elixir --sname staging-shutdown@localhost ./scripts/shutdown core-staging@localhost`

node =
  System.argv
  |> List.first
  |> String.to_atom

if Node.connect(node) do
  IO.puts "Shutting down #{node}"

  case :rpc.call(node, Application, :stop, [:exq]) do
    {:error, {:not_started, :exq}} -> IO.puts "Exq not running."
    :ok -> IO.puts "Exq stopped successfully"
  end

  IO.puts "Sending stop message..."
  :ok = :rpc.call(node, :init, :stop, [0])

else
  IO.puts("Could not connect to #{node}.")
  System.halt(1)
end

Does anyone have any experience with this? Or, if it’s too specific to get into, does anyone know:

  1. What is the proper way to shut down a running Elixir app using systemd?
  2. How does mix phx.server determine when it is necessary to run compile before starting up?

Thank you!

Taylor

It just checks the modification times (mtimes) of the source and BEAM files. If you have compiled manually and it still recompiles on startup, that is probably because of either a time offset or a difference between the MIX_ENV you compile in and the one you run in. Mix compiles for each environment separately.
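To check both suspects on the server, a rough diagnostic along these lines might help. It is only a sketch and not Mix's actual staleness algorithm: the function names `built_envs` and `stale_sources` are made up for illustration, and the app name `my_app` in the usage comment is a placeholder. The only thing taken as given is that Mix keeps per-environment builds under `_build/<env>`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Mix. They only approximate the
# "source newer than build output" idea described above.

# Print the name of every environment that has a build directory
# under the given build root (normally "_build").
built_envs() {
  local build_root="$1" dir
  for dir in "$build_root"/*/; do
    # Skip the unexpanded glob when the directory is empty or missing.
    [ -d "$dir" ] && basename "$dir"
  done
  return 0
}

# Print every .ex/.exs file under src_dir whose mtime is newer than the
# reference build artifact. Returns 2 when the artifact does not exist,
# which usually means nothing was compiled for that MIX_ENV.
stale_sources() {
  local src_dir="$1" reference="$2"
  if [ ! -e "$reference" ]; then
    echo "no build artifact at $reference (different MIX_ENV?)"
    return 2
  fi
  find "$src_dir" -type f \( -name '*.ex' -o -name '*.exs' \) -newer "$reference"
}

# Example usage from the project root (my_app is a placeholder):
#   built_envs _build
#   stale_sources lib _build/prod/lib/my_app/ebin/Elixir.MyApp.beam
```

If `built_envs` lists only `prod` while the service runs with some other MIX_ENV (or none, which defaults to `dev`), that alone would explain a full compile on every start. If `stale_sources` prints files you have not touched since the build, clock skew or a checkout that rewrote mtimes is the more likely cause.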

Instead of using mix on a production or staging server, though, I really think you should move to Distillery as a deployment tool. Somewhere on this forum there is also an explanation of how to deal with Distillery releases and systemd.

Thanks for the swift reply @NobbZ. I’d love to get to the bottom of this particular issue (and if anyone knows what’s going on, please chime in!), but your point is well taken, and we’d already flagged that we need to start using a more robust deployment system.

Is Distillery the industry standard right now, and is there a particular guide you’d recommend? There seem to be a number of fairly recent “this is the way to do it” posts:

  1. https://hexdocs.pm/distillery/walkthrough.html#adding-distillery-to-your-project
  2. https://hackernoon.com/state-of-the-art-in-deploying-elixir-phoenix-applications-fe72a4563cd8
  3. https://medium.com/@zek/deploy-early-and-often-deploying-phoenix-with-edeliver-and-distillery-part-two-f361ef36aa10

Thanks so much.

Taylor

Yes, and its own docs are the place to start; more systemctl-related material has even been added to them (see my PIDFile library, which makes systemctl even easier and should hopefully be built into Distillery soonish). :slight_smile:
