Downtime of Elixir app

I have an Elixir production app deployed with Coolify on a big Hetzner server.
Most of the time the app works fine, but almost daily (1-2 times a day) it becomes unresponsive for a few minutes; most of the time the downtime is around 2 minutes.
Server CPU and memory usage never go above 25-30%, so it’s not a lack of resources.
As you can see in this Grafana snapshot, there are some gaps where Prometheus is not able to reach the app to collect metrics.

I tried watching for server restarts and reading the Coolify logs, but I could not find any clue. Have you guys had a similar issue? How would you debug this further?
I would rather not switch to a cluster deployment and complicate things.

You sure there was a downtime?

To be honest, it looks like there are simply no datapoints on the graph for that time.
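One way to tell the two apart, as a rough sketch: Prometheus keeps a synthetic `up` series for every scrape target, with value 1 for a successful scrape and 0 for a failed one. If `up` drops to 0 during the gap, Prometheus was running and could not reach the app. If `up` itself has no samples for that period, Prometheus (or the whole host) was not scraping at all. The function below works on the raw samples of one `up` series, for example from an instant query on a range selector such as `up[6h]`. The function name, the 15-second scrape interval, and the sample data are assumptions for illustration only.

```python
def classify_up_samples(samples, scrape_interval=15.0):
    """Split one `up` series into failed scrapes and spans with no samples.

    samples:         list of [unix_timestamp, "0" or "1"] pairs, the shape
                     Prometheus uses for the values of a matrix result.
    scrape_interval: seconds between scrapes (assumed; use your own setting).
    """
    failed = []   # timestamps where Prometheus scraped and the target failed
    missing = []  # (start, end) spans where no sample was recorded at all
    prev_ts = None
    for ts, value in samples:
        ts = float(ts)
        # More than 1.5 intervals without a sample means a scrape was skipped.
        if prev_ts is not None and ts - prev_ts > 1.5 * scrape_interval:
            missing.append((prev_ts, ts))
        if float(value) == 0.0:
            failed.append(ts)
        prev_ts = ts
    return failed, missing


# Made-up samples: two failed scrapes, then a 90-second hole in the data.
sample = [[0, "1"], [15, "1"], [30, "0"], [45, "0"], [60, "1"], [150, "1"]]
failed, missing = classify_up_samples(sample, scrape_interval=15.0)
print("failed scrapes at:", failed)
print("no samples between:", missing)
```

Failed scrapes point at the app, the proxy, or the network between them; missing samples point at Prometheus or the machine it runs on.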

There are no datapoints because the server is not responding.
The app is not accessible during this time either; all the endpoints are down.

For more details: the app is deployed as a Docker image with Traefik in front for the domain and HTTPS certificate.

I am excluding the Erlang VM as the culprit, so the only things I can think of are:

  • server restarts
  • networking issue
  • Traefik proxy not being able to forward the requests

What else can it be?

Network switch daily reboot?

If you SSH into the host machine, or even the Docker container, does that connection persist or go down? Can you run a trace during that time? Are other services in the same datacenter also affected?
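To narrow down which layer fails, a small probe run from a machine outside the server could check each layer on a timer and log the result. This is only a sketch using the Python standard library: the host name, URL, ports, and the wording of the diagnoses are placeholders I made up, and the three checks (SSH port, proxy port, HTTP request through the proxy) are my assumption about what is worth separating here.

```python
import http.client
import socket
import time
import urllib.error
import urllib.request


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url, timeout=3.0):
    """Return True if the URL answers with any status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500  # 4xx still proves the app answered
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def diagnose(ssh_up, proxy_up, app_up):
    """Turn the three check results into a rough guess at the failing layer."""
    if not ssh_up and not proxy_up:
        return "host or network unreachable"
    if not proxy_up:
        return "host reachable, proxy port closed"
    if not app_up:
        return "proxy reachable, app not answering through it"
    return "ok"


def watch(host, url, interval=10.0, rounds=6):
    """Probe every `interval` seconds and print one timestamped line per round."""
    for _ in range(rounds):
        result = diagnose(
            tcp_reachable(host, 22),   # SSH port (assumed to be 22)
            tcp_reachable(host, 443),  # proxy port (assumed to be 443)
            http_ok(url),
        )
        print(time.strftime("%Y-%m-%d %H:%M:%S"), result, flush=True)
        time.sleep(interval)
```

Calling something like `watch("app.example.com", "https://app.example.com/", rounds=8640)` (placeholder names) would cover about a day at 10-second intervals. If every check fails at once, the host or its network is the suspect; if only the last one fails, look at Traefik or the container.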

I guess by big server you mean a dedicated server, so I’d also check whether you are running any other services (like email) and check their logs.

Perhaps also commission a server and duplicate/run your app without Docker, Traefik, Coolify etc (I’ve no experience of the latter two myself) just to check whether one of them might be the culprit?

I had something like that a couple of years back; it turned out to be backups freezing my instance.
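A cheap way to check for that kind of freeze, sketched under some assumptions: run a tiny heartbeat on the server that records the wall-clock time once a second, then look for pauses between consecutive ticks. Whether a frozen instance shows up as a jump in wall-clock time depends on how the hypervisor and time sync handle the clock, so treat a clean result as inconclusive. The function names and the 5-second threshold are made up for this example.

```python
import time


def record_ticks(count, interval=1.0):
    """Sleep `interval` seconds `count` times, returning the wall-clock time of each tick."""
    ticks = []
    for _ in range(count):
        ticks.append(time.time())
        time.sleep(interval)
    return ticks


def find_stalls(tick_times, threshold=5.0):
    """Return (start, end, length) for every pause between ticks longer than `threshold` seconds."""
    stalls = []
    for earlier, later in zip(tick_times, tick_times[1:]):
        gap = later - earlier
        if gap > threshold:
            stalls.append((earlier, later, gap))
    return stalls


# Synthetic ticks: one per second, with a 123-second freeze in the middle.
example = [0.0, 1.0, 2.0, 125.0, 126.0]
stalls = find_stalls(example, threshold=5.0)
print(stalls)
```

On the real server you would feed it the output of `record_ticks` (or timestamps written to a file) and compare any stall times against the gaps in Grafana and the backup schedule.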
