Determining when a background worker is healthy?

I currently have a couple of background workers running on a Kubernetes cluster. My application is wrapped in a Docker container and uses Distillery to create a release.

I deployed a big revamp to one of my workers yesterday, and although the application started, it struggled to connect to a dependent service and a supervisor ended up in a crash loop. So even though the application is running, it is not in a healthy state.

I’m wondering what might be an effective (low-overhead) way of determining whether an application is healthy from outside of the application. This is simple for apps that serve traffic, but I’m unsure of the best approach for a background worker.

My initial ideas:

  1. create a custom command using Distillery which sends a message to the app and returns its status
  2. touch a file when all supervisors have successfully started and use stat to determine when it’s ready (doesn’t necessarily cover ongoing health; see the sketch below)
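
A minimal sketch of idea 2, assuming a made-up module and path, with an external check (e.g. a Kubernetes exec readiness probe) running something like stat /tmp/scheduler_ready:

defmodule Scheduler.ReadinessFile do
  # Hypothetical helper: drop a sentinel file once the supervision tree
  # is up so an external probe can stat it.
  @path "/tmp/scheduler_ready"

  def mark_ready, do: File.touch!(@path)

  def clear, do: File.rm(@path)
end

mark_ready/0 would be called right after Supervisor.start_link/2 returns successfully in the application’s start/2 callback.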

I can’t just let it crash as I’m doing a rolling deploy and don’t want to replace healthy services with a broken one.

Would appreciate any pointers.

Ok, I’ve ended up with this:

defmodule Scheduler.Application do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    children = [
      worker(Scheduler, []),
      worker(Scheduler.Brokers.BulkCheckBroker, [])
    ]

    opts = [strategy: :one_for_one, name: Scheduler.Supervisor]
    {:ok, pid} = Supervisor.start_link(children, opts)

    # Fail startup if the broker connection can't be obtained; the match
    # raises unless the health check returns :ok.
    :ok = Scheduler.Health.status()

    {:ok, pid}
  end
end

defmodule Scheduler.Health do
  def status do
    case bulk_check_broker_status() do
      # A live AMQP connection struct means the broker is reachable.
      %AMQP.Connection{} -> :ok
      any -> {:error, any}
    end
  end

  defp bulk_check_broker_status do
    # Check out a connection from the broker's pool and return it as-is.
    ConduitAMQP.with_conn(Scheduler.Brokers.BulkCheckBroker, fn conn -> conn end)
  end
end

The application now fails if a connection cannot be obtained during startup. Is this OK? Am I doing something naive by having this health-status check in the application’s start callback?

With what error does it fail?

I’m wondering what might be an effective (low overhead) way of determining if an application is healthy from outside of the application?

You might try preventing it from entering an unhealthy state by using the :permanent start type for your application (I think). That would prevent the release from starting / continuing to run if one of the applications fails to start / run properly.

http://erlang.org/doc/design_principles/applications.html#application-start-types

  • If a permanent application terminates, all other applications and the runtime system are also terminated.

That’s what I usually use.
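
In a Distillery release config (rel/config.exs) that could look roughly like this, assuming the app is named :scheduler:

release :scheduler do
  set version: current_version(:scheduler)
  # :permanent: if this application terminates, the whole runtime system
  # is taken down rather than limping along in a broken state.
  set applications: [
    :runtime_tools,
    scheduler: :permanent
  ]
end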

It’s a connection error. For example, if the RabbitMQ instance is down or I don’t have correct credentials. The library I’m using will simply keep retrying the connection indefinitely.

I just wanted to ensure the connections in my supervisors were all kosher before allowing the application to enter a ‘started’ state.

I just wanted to ensure the connections in my supervisors were all kosher before allowing the application to enter a ‘started’ state.

If that’s an absolute necessity, you can wrap the startup of that connection in some GenServer’s init or something like that, so that it blocks the supervisor from starting the other children until it’s ready. You’d need to tweak the supervisor’s startup timeout, if there is one, as well.

Or instead of wrapping the connection, you can do something like:

children = [
  # ... connection
  GenServerWhichWaitsForConnectionInInit,
  # ... rest
]
defmodule GenServerWhichWaitsForConnectionInInit do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  def init(opts) do
    # Block here until the connection is established; the supervisor
    # won't start the remaining children until init/1 returns.
    wait_for_connection()
    {:ok, opts}
  end

  defp wait_for_connection do
    # e.g. poll the connection process until it reports being up,
    # sleeping briefly between attempts
    :ok
  end
end

But both of these approaches are hacks.

Thanks for the feedback.

I think I may move in the direction of shipping the application with a simple server and putting the health checks behind that.

GenServer’s handle_continue callback might help partially?
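
Something along these lines, as a rough sketch (module and helper names are made up; handle_continue/2 needs Elixir 1.7 / OTP 21 or newer):

defmodule Scheduler.ConnectionWatcher do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def init(opts) do
    # Return immediately so the supervisor isn't blocked, then finish
    # the connection setup in handle_continue/2.
    {:ok, %{opts: opts, connected?: false}, {:continue, :connect}}
  end

  def handle_continue(:connect, state) do
    # Hypothetical helper that establishes (or waits for) the broker connection.
    :ok = connect_to_broker(state.opts)
    {:noreply, %{state | connected?: true}}
  end

  defp connect_to_broker(_opts) do
    # placeholder for the real connection logic
    :ok
  end
end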

Will have a look, cheers.

At the moment I’ve configured a simple server with a _health endpoint, which I think does what I want. I then use that to check that I can connect to dependent services, etc.

Maybe look at https://hex.pm/packages/gen_rmq; it probably already does what you need in terms of reconnecting.

Thanks.

The library I’m using (conduit) already handles reconnects. It’s the readiness I was concerned with, given my deployment environment. I’ve managed to figure it out using Plug and a _health endpoint.
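
A minimal sketch of how that kind of Plug-based _health endpoint could look, reusing the Scheduler.Health module from above (the exact route handling and status codes here are illustrative):

defmodule Scheduler.HealthRouter do
  use Plug.Router

  plug :match
  plug :dispatch

  get "/_health" do
    # Report 200 only when the broker connection check passes, so the
    # Kubernetes readiness probe can gate the rollout on it.
    case Scheduler.Health.status() do
      :ok -> send_resp(conn, 200, "ok")
      {:error, _reason} -> send_resp(conn, 503, "unavailable")
    end
  end

  match _ do
    send_resp(conn, 404, "not found")
  end
end

The router is then started under the application’s supervisor with a Cowboy adapter (e.g. Plug.Cowboy), and the readiness probe points at /_health.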