I currently have a couple of background workers running on a Kubernetes cluster. My application is wrapped in a Docker container and uses Distillery to create a release.
I deployed a big revamp to one of my workers yesterday. Although the application started, it struggled to connect to a dependent service, and a supervisor ended up in a crash loop. So even though the application is running, it is not in a healthy state.
I’m wondering what might be an effective (low overhead) way of determining if an application is healthy from outside of the application? This is simple with apps that serve traffic, but I’m unsure of the best way with a background worker.
My initial ideas:
Create a custom command using Distillery which sends a message to the app and returns its status.
Touch a file when all supervisors have successfully started and use stat to determine when it’s ready (this doesn’t necessarily cover ongoing healthy state).
I can’t just let it crash as I’m doing a rolling deploy and don’t want to replace healthy services with a broken one.
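The second idea above (touching a file once the supervisors are up) could be sketched roughly as follows. This is a minimal sketch, not something from the thread: the module name `ReadinessFile` and the marker path are hypothetical. It assumes the process is listed last among the supervisor's children, so its `init/1` only runs after the earlier children have returned from their own start functions.

```elixir
defmodule ReadinessFile do
  # Hypothetical helper: creates a marker file on start and removes it on
  # shutdown, so an external check can test for the file's existence.
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path)

  @impl true
  def init(path) do
    # Trap exits so terminate/2 runs when the supervisor shuts this process down.
    Process.flag(:trap_exit, true)
    :ok = File.touch(path)
    {:ok, path}
  end

  @impl true
  def terminate(_reason, path) do
    # Best-effort cleanup; the file may already be gone.
    File.rm(path)
    :ok
  end
end
```

With the `Supervisor.Spec` style used in this thread, it would be added as the last child, for example `worker(ReadinessFile, ["/tmp/scheduler_ready"])` (the path is an assumption). As noted above, this only signals that startup completed; it says nothing about whether the worker stays healthy afterwards.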
defmodule Scheduler.Application do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    children = [
      worker(Scheduler, []),
      worker(Scheduler.Brokers.BulkCheckBroker, [])
    ]

    opts = [strategy: :one_for_one, name: Scheduler.Supervisor]
    {:ok, pid} = Supervisor.start_link(children, opts)

    # Fail startup unless a broker connection can be obtained.
    :ok = Scheduler.Health.status()
    {:ok, pid}
  end
end
defmodule Scheduler.Health do
  # Returns :ok when a broker connection is available, {:error, term} otherwise.
  def status do
    case bulk_check_broker_status() do
      %AMQP.Connection{} -> :ok
      any -> {:error, any}
    end
  end

  defp bulk_check_broker_status do
    ConduitAMQP.with_conn(Scheduler.Brokers.BulkCheckBroker, fn conn -> conn end)
  end
end
The application now fails if a connection cannot be obtained during startup. Is this ok? Am I doing something naive by having this health status check in the application start block?
I’m wondering what might be an effective (low overhead) way of determining if an application is healthy from outside of the application?
You might try to prevent it from entering an unhealthy state by using the :permanent start type for your application (I think). That would stop the release from starting, or from continuing to run, if one of the applications fails to start or run properly.
It’s a connection error. For example, if the RabbitMQ instance is down or I don’t have correct credentials. The library I’m using will simply keep retrying the connection indefinitely.
I just wanted to ensure the connections in my supervisors were all kosher before allowing the application to enter a ‘started’ state.
If that’s an absolute necessity, you can wrap the startup of that connection in some GenServer’s init or something like that, so that it blocks the supervisor from starting other children until it’s ready. You’d also need to tweak the supervisor’s startup timeout, if there is one.
Or instead of wrapping the connection, you can do something like:
defmodule GenServerWhichWaitsForConnectionInInit do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  def init(opts) do
    # wait for connection to be setup; the supervisor is blocked until init returns
    {:ok, opts}
  end
end
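To make the "wait in init" idea a bit more concrete, here is a minimal sketch that polls a check function a bounded number of times and refuses to start if the check never passes. The module name `ConnectionGate` and the options `:check`, `:attempts` and `:interval_ms` are hypothetical; the check is assumed to return `:ok` when the connection is ready and anything else otherwise, as `Scheduler.Health.status/0` above does.

```elixir
defmodule ConnectionGate do
  # Hypothetical gate process: init/1 blocks until the supplied check passes,
  # which in turn blocks the supervisor from starting later children.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    check = Keyword.fetch!(opts, :check)
    attempts = Keyword.get(opts, :attempts, 10)
    interval = Keyword.get(opts, :interval_ms, 500)

    case wait_until_ready(check, attempts, interval) do
      :ok -> {:ok, %{}}
      # Returning {:stop, reason} makes start_link return {:error, reason}.
      {:error, last_result} -> {:stop, {:not_ready, last_result}}
    end
  end

  # Calls the check up to `attempts_left` times, sleeping between failures.
  defp wait_until_ready(check, attempts_left, interval) do
    case check.() do
      :ok ->
        :ok

      other when attempts_left <= 1 ->
        {:error, other}

      _other ->
        Process.sleep(interval)
        wait_until_ready(check, attempts_left - 1, interval)
    end
  end
end
```

Placed after the broker child, for example as `{ConnectionGate, check: &Scheduler.Health.status/0}`, it would keep the supervisor from finishing startup until the check passes, and make startup fail once the attempts are exhausted. Bounding the attempts is a design choice in this sketch, so a rolling deploy sees a failed start instead of a worker that waits forever.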
At the moment I’ve configured a simple server and a _health endpoint which I think does what I want. I then use that to check I can connect to dependent services, etc.
The library I’m using (conduit) already handles reconnects. It’s the readiness I was concerned with due to my deployment environment. I’ve managed to figure it out using Plug and a _health endpoint.
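For anyone finding this later, a `_health` endpoint along those lines might look roughly like the sketch below. It is not the code I deployed: the module name `HealthPlug` and the `:check` option are hypothetical, and the check function is assumed to return `:ok` or `{:error, reason}` like `Scheduler.Health.status/0` earlier in the thread.

```elixir
# Only needed when running this as a standalone script; in a Mix project,
# add :plug to your deps instead. The version requirement is an assumption.
Mix.install([{:plug, "~> 1.14"}])

defmodule HealthPlug do
  # Hypothetical plug: answers GET /_health based on a supplied check function.
  @behaviour Plug
  import Plug.Conn

  @impl true
  def init(opts), do: Keyword.fetch!(opts, :check)

  @impl true
  def call(%Plug.Conn{method: "GET", request_path: "/_health"} = conn, check) do
    case check.() do
      :ok -> send_resp(conn, 200, "ok")
      # 503 tells the caller the service is up but not ready.
      {:error, _reason} -> send_resp(conn, 503, "unhealthy")
    end
  end

  def call(conn, _check), do: send_resp(conn, 404, "not found")
end
```

Assuming the plug_cowboy package is used to serve it, a child spec such as `{Plug.Cowboy, scheme: :http, plug: {HealthPlug, check: &Scheduler.Health.status/0}, options: [port: 4001]}` could go in the supervision tree (the port is arbitrary). A Kubernetes readiness probe can then issue an HTTP GET against `/_health` and hold back the rolling deploy while it returns a non-2xx status.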