I have an Elixir production app deployed with Coolify on a big Hetzner server.
Most of the time the app works fine, but almost daily (1-2 times a day) it becomes unresponsive for a short period, usually around 2 minutes.
The server’s CPU and memory usage never go above 25-30%, so it’s not a lack of resources.
As you can see in this Grafana snapshot, there are gaps where Prometheus is not able to reach the app to collect metrics.
I watched for server restarts and read the Coolify logs, but I could not find any clues. Has anyone had a similar issue? How would you debug this further?
I would rather not switch to a cluster deployment and complicate things.
If you SSH into the host machine, or even into the Docker container, does that connection persist during the downtime or does it drop too? Can you run a trace during that window? Are other services in the same datacenter also affected?
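One way to answer those questions without being at the keyboard when it happens is to leave a small probe running on a different machine and compare its log against the gaps in Grafana. The sketch below is only an illustration in Python: the function names are made up for this example, and the IP address and ports in the usage note are placeholders you would swap for your own. It only checks whether a TCP connection can be opened, so it tells you which layer stops answering, not why.

```python
import socket
import time
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_once(targets: Dict[str, Tuple[str, int]], timeout: float = 3.0) -> Dict[str, bool]:
    """Check every named target once and return name -> reachable."""
    return {name: tcp_reachable(host, port, timeout) for name, (host, port) in targets.items()}


def format_line(results: Dict[str, bool], now: Optional[datetime] = None) -> str:
    """Render one timestamped log line, e.g. '<UTC time> ssh=up app=DOWN'."""
    now = now or datetime.now(timezone.utc)
    status = " ".join(f"{name}={'up' if ok else 'DOWN'}" for name, ok in results.items())
    return f"{now.isoformat(timespec='seconds')} {status}"


def run(targets: Dict[str, Tuple[str, int]], interval: float = 5.0, rounds: Optional[int] = None) -> None:
    """Probe all targets every `interval` seconds; loop forever if `rounds` is None."""
    done = 0
    while rounds is None or done < rounds:
        print(format_line(probe_once(targets)), flush=True)
        done += 1
        time.sleep(interval)
```

For example, `run({"ssh": ("203.0.113.10", 22), "proxy": ("203.0.113.10", 443)})` would log both checks every 5 seconds (the address is a documentation placeholder, and 443 assumes the reverse proxy listens there). If `ssh` stays up while `proxy` goes DOWN, the host and network are probably fine and the problem sits in the proxy or container layer; if both drop together, look at the host or the network first.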
I guess by “big server” you mean a dedicated server, so I’d also check whether you are running any other services on it (like email) and look at their logs.
Perhaps also commission a second server and run a copy of your app on it without Docker, Traefik, Coolify, etc. (I’ve no experience of the latter two myself), just to check whether one of them might be the culprit?