Socket activation with systemd? (to reduce downtime on restart)

I’ve got a number of apps (e.g. Phoenix, but also pure Plug + Cowboy) where (1) hot reloading cannot easily be used and (2) the app runs on a single server.

The restart period can be a bit slow (and some apps are behind nginx too), which leads to a window during which requests are refused (bad gateway, etc.).

I’m looking at ways to reduce that downtime.

Has anyone managed to use systemd “socket activation” together with either Phoenix or even Cowboy/Ranch directly?

For some context, here is how you can achieve that with Go:

or even with Python:
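In a nutshell, the pattern those integrations rely on can be sketched like this (an illustrative sketch, not the code behind the links above; `listen_fds` and `inherited_or_new_listener` are names I made up):

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # systemd passes inherited sockets starting at fd 3

def listen_fds():
    """File descriptors handed over by systemd, or [] when not activated."""
    # LISTEN_PID guards against accidentally inheriting the variables
    # from a parent process.
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []
    count = int(os.environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))

def inherited_or_new_listener(port):
    """Adopt the systemd socket if present, otherwise bind one ourselves."""
    fds = listen_fds()
    if fds:
        # The descriptor is already bound and listening; just wrap it.
        return socket.socket(fileno=fds[0])
    # Fallback for running outside systemd: bind and listen as usual.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", port))
    sock.listen(128)
    return sock
```

The key point for the Erlang side is that the inherited descriptor arrives already bound and listening, which is exactly what the fd option has to cope with.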

It could significantly improve the deploy workflow of single-server setups.

If you have any information, let me know!

Thanks!

– Thibaut

4 Likes

:wave:

It’s possible to pass the fd option to :gen_tcp.listen: http://erlang.org/doc/man/gen_tcp.html#type-listen_option. So something like

# not sure about this ...
config :app, App.Endpoint,
  http: [transport_options: %{socket_opts: [fd: SYSTEMD_FIRST_SOCKET_FD]}]

might work.

So working from bottom up,

  • :gen_tcp's fd option would probably make it work
  • ranch accepts socket options via :socket_opts
  • plug uses :transport_options to configure that part of :ranch
  • phoenix uses :http option to configure that part of plug
6 Likes

This looks like a solid start; I will experiment. Many thanks!

2 Likes

Very interested in hearing if you get it working. I tried once briefly and failed to get it to work.

I tried with {fd, 3} as an option to gen_tcp:listen.

I’ll have to dig it back up and try again to see where it went wrong. I remember it was a hassle just to get the systemd service and socket units working well enough to test the Erlang code, so that is likely where the problem was, rather than in Erlang.

I know @garazdawi said he was interested as well.

2 Likes

I will experiment in the coming week, on a client setup (Ansible + Elixir) for which I need to do maintenance anyway. Will report back with my findings!

3 Likes

I can’t help with the socket activation stuff, however I have been using a docker solution for zero downtime deploys, and it has been working great for me.

In a nutshell, it uses docker swarm which will start a new instance of the app, wait for it to become ready, move traffic over, and finally remove the old one. It’s surprisingly simple and works great on single server setups.

I can write up more details if people are interested.

2 Likes

While I won’t be able to use that technique on the projects I’m managing at the moment, I’d say it’s interesting to share anyway (if that’s not too much work on your side!).

1 Like

An update on this: so far I’m unable to start the app, but here are some notes in case someone else also tackles this (I will get back to it later).

I’ve created a file under /etc/systemd/system/my_app.socket, with content:

[Unit]
Description=My App Socket
PartOf=my_app.service

[Socket]
# this is the Phoenix port
ListenStream=4000
BindIPv6Only=both

[Install]
WantedBy=sockets.target

The documentation for ListenStream and the other socket options is at http://manpages.ubuntu.com/manpages/bionic/man5/systemd.socket.5.html.

After running systemctl daemon-reload, I can see an active unit for the socket with sudo systemctl status my_app.socket.

I’ve also modified my app service file under /etc/systemd/system/my_app.service, with (extract, cannot share in full):

[Unit]
# added my_app.socket
After=syslog.target network.target postgresql.service my_app.socket
Requires=my_app.socket

[Service]
Type=simple
User={{ item.user }}
# SNIP

Through trial and error, I’ve updated the prod.exs configuration used to build my releases (via Distillery) with:

http: [:inet6, port: port, transport_options: [socket_opts: [fd: 3]]],

Ultimately though, my logs show the following error:

[error] Failed to start Ranch listener MyAppWeb.Endpoint.HTTP in
:ranch_tcp:listen([{:cacerts, :...}, {:key, :...}, {:cert, :...}, {:fd, 3}, :inet6,
{:port, 4000}]) for reason :einval (invalid argument)

This is already progress, because I had a lot of other errors earlier, but at this point I must dive deeper into Ranch to figure out what is happening.

1 Like

I’d suggest starting with just :gen_tcp and not involving Ranch or anything else.

1 Like

Getting the same error {:error, :einval} with :gen_tcp.listen(5000, fd: 3) and the following systemd units:

[Unit]
Description=My App Socket
PartOf=my_app.service

[Socket]
ListenStream=5000
BindIPv6Only=both

[Install]
WantedBy=sockets.target

[Unit]
Description=My app daemon
After=syslog.target network.target my_app.socket
Requires=my_app.socket

[Service]
Type=simple
User=ubuntu
Group=ubuntu
Restart=on-failure
WorkingDirectory=/opt/socket_activation_demo
ExecStart=/opt/socket_activation_demo/bin/socket_activation_demo foreground

[Install]
WantedBy=multi-user.target

Just in case, here are the env vars set for the service:

iex(socket_activation_demo@127.0.0.1)15> System.get_env |> Enum.each(&IO.inspect/1)
{"HOME", "/home/ubuntu"}
{"TERM", "xterm"}
{"ERTS_VSN", "10.3.1"}
{"ERTS_DIR", "/opt/socket_activation_demo/erts-10.3.1"}
{"POST_CONFIGURE_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/post_configure.d"}
{"SRC_SYS_CONFIG_PATH", "/opt/socket_activation_demo/releases/0.1.0/sys.config"}
{"LANG", "C.UTF-8"}
{"SHLVL", "0"}
{"NAME_ARG", "-name socket_activation_demo@127.0.0.1"}
{"DISTILLERY_VSN", "2.1.1"}
{"PRE_UPGRADE_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/pre_upgrade.d"}
{"RELEASES_DIR", "/opt/socket_activation_demo/releases"}
{"PATH", "/opt/socket_activation_demo/erts-10.3.1/bin:/opt/socket_activation_demo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"}
{"CONSOLIDATED_DIR", "/opt/socket_activation_demo/lib/socket_activation_demo-0.1.0/consolidated"}
{"REL_DIR", "/opt/socket_activation_demo/releases/0.1.0"}
{"RELEASE_CONFIG_DIR", "/opt/socket_activation_demo"}
{"POST_UPGRADE_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/post_upgrade.d"}
{"REL_NAME", "socket_activation_demo"}
{"NAME", "socket_activation_demo@127.0.0.1"}
{"ROOTDIR", "/opt/socket_activation_demo"}
{"DEST_VMARGS_PATH", "/opt/socket_activation_demo/var/vm.args"}
{"RELEASE_ROOT_DIR", "/opt/socket_activation_demo"}
{"OLDPWD", "/opt"}
{"RELEASE_MUTABLE_DIR", "/opt/socket_activation_demo/var"}
{"PRE_CONFIGURE_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/pre_configure.d"}
{"ERL_OPTS", ""}
{"NAME_TYPE", "-name"}
{"DEST_SYS_CONFIG_PATH", "/opt/socket_activation_demo/var/sys.config"}
{"LISTEN_FDNAMES", "my_app.socket"}
{"SRC_VMARGS_PATH", "/opt/socket_activation_demo/releases/0.1.0/vm.args"}
{"ESCRIPT_NAME", "/opt/socket_activation_demo/releases/0.1.0/socket_activation_demo.sh"}
{"SHELL", "/bin/bash"}
{"LD_LIBRARY_PATH", "/opt/socket_activation_demo/erts-10.3.1/lib:"}
{"PRE_START_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/pre_start.d"}
{"BINDIR", "/opt/socket_activation_demo/erts-10.3.1/bin"}
{"DISTILLERY_TASK", "foreground"}
{"LISTEN_PID", "5248"}
{"POST_STOP_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/post_stop.d"}
{"PWD", "/opt/socket_activation_demo"}
{"ERTS_LIB_DIR", "/opt/socket_activation_demo/lib"}
{"SYS_CONFIG_PATH", "/opt/socket_activation_demo/var/sys.config"}
{"PROGNAME", "opt/socket_activation_demo/releases/0.1.0/socket_activation_demo.sh"}
{"LISTEN_FDS", "1"}
{"VMARGS_PATH", "/opt/socket_activation_demo/var/vm.args"}
{"PRE_STOP_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/pre_stop.d"}
{"REL_VSN", "0.1.0"}
{"EMU", "beam"}
{"POST_START_HOOKS", "/opt/socket_activation_demo/releases/0.1.0/hooks/post_start.d"}
{"LOGNAME", "ubuntu"}
{"USER", "ubuntu"}
{"START_ERL_DATA", "/opt/socket_activation_demo/var/start_erl.data"}
{"JOURNAL_STREAM", "9:98738"}
{"RUN_ERL_ENV", ""}
{"INVOCATION_ID", "3f625ea198164b3094464e146ff67c0b"}
:ok

The relevant ones are probably:

{"LISTEN_FDS", "1"}
{"LISTEN_FDNAMES", "my_app.socket"}

I can connect to the socket, so it does work …

iex(socket_activation_demo@127.0.0.1)12> :gen_tcp.connect({0,0,0,0}, 5000, [:binary])
{:ok, #Port<0.55>}

But I cannot listen on it :frowning:

2 Likes

Yeah, I think that is what I got when I tried as well; I never had the time to debug further.

1 Like

Can you run two instances at once? Bring up the new instance on a different port, switch the nginx proxy config to send traffic there, then bring down the old one…

And for true zero downtime, maybe use Envoy instead of nginx, since Envoy does some serious hot-restart magic where it will hand off existing connections to the new instance :wink:

There are plenty of solutions that bypass the problem I’m tackling (like the one you’re suggesting), but in this case I specifically want to try to make the systemd socket-activation approach work.

I’m doing this because I think nginx + systemd will remain quite a common setup for entry-level deployments.

Thanks for the suggestion, though; it’s still helpful (and feel free to share your Envoy setup in a separate thread, by the way, it could definitely be interesting!).

4 Likes

I do the same (Docker Swarm built-in rolling deploy on a single node) for a Go & Python app and it works really well.

I have checked it out, and even written and published a library for integrating Erlang applications with systemd:

And using that library I have created an example Plug application:

Unfortunately this does not implement socket activation yet, because gen_tcp and gen_udp require unbound file descriptors, and with socket activation you will (obviously) receive a file descriptor that is already bound. However, there is some hope in the form of the socket module and the ongoing move from inet to it. Once socket gets a way to use existing file descriptors, and there is a way to “create” a gen_{tcp,udp,sctp} socket out of such an existing one, socket activation will become possible and I will try to extend the example to use it as well.
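For contrast, here is a small Python demonstration of what a socket-activated process is expected to do: adopt a descriptor that is already bound and listening, without calling bind or listen again (os.dup() stands in for the fd 3 that systemd would actually pass; the function name is mine):

```python
import os
import socket

def adopt_and_serve():
    """Accept a connection on an adopted, already-listening descriptor."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    # os.dup() plays the role of the descriptor systemd would hand us at fd 3.
    adopted = socket.socket(fileno=os.dup(server.fileno()))
    server.close()  # the adopted duplicate keeps the listening socket alive

    # No bind()/listen() on `adopted`: it can accept connections directly.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    conn, _addr = adopted.accept()
    conn.sendall(b"ok")
    data = client.recv(2)
    for s in (conn, client, adopted):
        s.close()
    return data
```

Adopting the bound descriptor is precisely the step that gen_tcp cannot do today, and what the socket module may eventually enable.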

Right now the available features in the library are:

  • Watchdog
  • Notification socket
  • Journal logging
1 Like

OK, I have found a slightly hacky solution, which I have described in my post about the systemd library:

3 Likes