Phoenix server randomly stops serving requests with no error reported

I’m running two Phoenix applications. Both are deployed in the same manner (as far as I can tell), with different purposes and correspondingly different packages installed. One of them, my personal website/blog, periodically stops serving requests - this shows up as a “Bad Gateway” error from Traefik. The issue occurs at random - sometimes after an hour, sometimes after a few hours, sometimes after a few days.

To troubleshoot the issue I’ve used the Erlang observer and dug through the stats - I don’t see anything obvious, and there are no crash dumps. The application is deployed with Docker and mix releases, but the same issue occurs when using mix releases run locally (so it’s definitely not a memory issue) and on the server as a systemd service. I’ve also tried different Erlang/Elixir version combinations (following guidelines).

There are no crash dumps, no out of the ordinary logs, it just stops serving requests.

I feel like I might be missing something obvious, or that there’s some other debugging technique I could use to find the issue. Perhaps there’s a package version incompatibility, but I’m not using any out of the ordinary packages. Here’s the repo for reference: https://github.com/BenSchZA/phoenix-blog . The other application has been running stable for a few months: https://github.com/BenSchZA/kawapi-phoenix

I appreciate any guidance and help! Let me know if there are any other details I can provide.

Off the top of my head, you should check your load balancer.


Indeed, I was thinking the same as @dimitarvp. If you cannot see any error, and you know that the app is actually running, try to debug what is in front of the app first. A few possible hypotheses to test:

  • If the problem is upstream of the app, the app itself should be working. Can you validate it? Does the app respond to requests sent locally from the same machine?
    • If the app responds to requests sent from the same machine, the problem must be upstream. Are the instances restarted or scaled dynamically? Is it possible that an instance gets restarted or moved and the load balancer in front does not discover the new one?
    • If the app does not respond even to local requests, are all the children of your supervision tree running? If a child crashes and gets in a restart loop for some reason, it will reach the max restarts and won’t be restarted anymore. In this case though, you should see traces of it in the logs.

Unfortunately the issue happens when not behind a load balancer too.

What’s the easiest way to check if the children in the supervision tree are all running? I’ve checked in the observer and it all seems okay. As I said, there are no unusual logs or crash dumps.
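
For reference, a quick way to check this from an attached `iex` is `Supervisor.which_children/1` and `Supervisor.count_children/1`. A runnable sketch (it starts a throwaway supervisor so the snippet is self-contained; in the real app you would call these on your registered supervisor, e.g. `App.Supervisor`):

```elixir
# Throwaway supervisor so this runs in a bare `iex`; in the real app you
# would call which_children/count_children on App.Supervisor instead.
children = [%{id: :worker, start: {Agent, :start_link, [fn -> :ok end]}}]
{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

# One tuple per child: {id, pid_or_status, type, modules}. A child that
# is not running shows :undefined (or :restarting) instead of a pid.
for {id, pid, type, _mods} <- Supervisor.which_children(sup) do
  IO.puts("#{inspect(id)} (#{type}): #{inspect(pid)}")
end

# If :active is lower than :specs, a child has died and was not restarted.
counts = Supervisor.count_children(sup)
```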

There’s no autoscaling or restarting - it’s a pretty basic deployment via Traefik. But the issue doesn’t occur with my other application, and it does occur locally with no load balancer.

There’s nothing in front of the application - the issue also happens when it’s running directly from a mix release. Doing a restart fixes the issue.

I also updated the supervision strategy in an attempt to fix the issue: opts = [strategy: :one_for_all, name: App.Supervisor, max_restarts: 10, max_seconds: 30]

I don’t have Elixir running in production yet, but is it possible you are running out of memory? Maybe some sort of memory leak or you are caching too much. Have you checked memory usage?


Okay… don’t know why I didn’t try this before. Going to a path other than the URL root works. So clearly there’s some issue with a specific module loaded on the root endpoint. Appreciate all the suggestions! Still don’t know what the issue is, but I should be able to track it down now.


One of the things I do on the root endpoint is make a series of calls to the GitHub API. I also use a package, {:memoize, "~> 1.3"}, to memoize and rate-limit the calls so I don’t hit the GitHub API rate limit. I’m not hitting the rate limit - I checked that today and a while ago.

It could still be related to the calls. You make several sequential calls to GitHub; when the cache gets invalidated, if the endpoint is slow, your request will take a long time. If there is a proxy in front of your app, it will likely have a timeout, after which it considers the downstream service down and returns a 502.


Thanks @lucaong, I’ll dig into that deeper.

There’s no proxy or anything in front of my app when run locally. I’ve just removed that module and will leave it running locally for a while and see what happens, to confirm that it is the GitHub calls/caching (I’m sure it will be the case).

The cache is invalidated every 5 minutes. I make about 10 calls to the GitHub API: https://github.com/BenSchZA/phoenix-blog/blob/master/lib/app/Github/api.ex


Yeah, the fact that restarting fixes the issue seems to point to that indeed. Just a curiosity: when you run it behind Traefik you get a 502 Bad Gateway. What kind of error do you get when you hit the app directly?

When I hit the app directly there’s no error - it just hangs, and no network calls apart from the initial GET request to root / (in the browser/frontend at least). Maybe I’ll add some logs to the module making the GitHub API calls too.


I checked the code for memoize, and to a novice Elixir programmer that code looked scary. (I do C# programming for a living, so I’m not a novice programmer in general.) The memoize code looks like it could easily have some kind of problem - maybe it doesn’t, but it could. I would switch to a simple GenServer-based cache and see if your problems go away.
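
A minimal sketch of such a GenServer-based cache, assuming the module name and API are invented here and `fetch_fun` would wrap the Tentacat calls. The slow fetch runs in an unlinked Task on a timer, so reads never block on GitHub and a failed refresh just keeps serving the previous value:

```elixir
defmodule GithubCache do
  @moduledoc false
  use GenServer

  @refresh_ms :timer.minutes(5)

  # `fetch_fun` is a zero-arity function wrapping the slow API calls.
  def start_link(fetch_fun) do
    GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)
  end

  # Fast read: returns the last successfully fetched value.
  def get(), do: GenServer.call(__MODULE__, :get)

  @impl true
  def init(fetch_fun) do
    send(self(), :refresh)
    {:ok, %{fetch: fetch_fun, value: nil}}
  end

  @impl true
  def handle_call(:get, _from, state), do: {:reply, state.value, state}

  @impl true
  def handle_info(:refresh, state) do
    parent = self()

    # Do the slow fetch in an unlinked task; if it crashes, no message
    # arrives and we simply keep the previous value.
    Task.start(fn -> send(parent, {:refreshed, state.fetch.()}) end)

    Process.send_after(self(), :refresh, @refresh_ms)
    {:noreply, state}
  end

  def handle_info({:refreshed, value}, state) do
    {:noreply, %{state | value: value}}
  end
end
```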


After more debugging, it’s definitely either the GitHub API calls using tentacat or memoize causing the issue.

My question is then: where is the best place to make these API calls, both in terms of performance (rendering the view to the user without waiting for the API call to complete) and in terms of being non-blocking when there is an issue?

I see two options: passing the result of the API call from the controller as an assign (<%= for repo <- @github_repos do %>) vs calling it from the template via a view function (<%= for repo <- github_repos() do %>). My assumption was that if it’s called from a view function, the template would be rendered first and the call made afterwards - which would be non-blocking? (I’d probably need LiveView for that; both approaches seem to be blocking.)
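
Neither place makes the call non-blocking by itself - the render waits either way. One pattern that at least bounds the wait is to run the call in a Task and fall back to a default when it is slow (a sketch; the module and function names are illustrative):

```elixir
defmodule BoundedFetch do
  @moduledoc false

  # Run `fun` in a Task, wait at most `timeout` ms, otherwise shut the
  # task down and return `fallback` so the page still renders.
  def fetch_with_fallback(fun, fallback, timeout \\ 2_000) do
    task = Task.async(fun)

    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> result
      _ -> fallback
    end
  end
end
```

In a controller this could be called before render, e.g. `github_repos = BoundedFetch.fetch_with_fallback(&fetch_github_repos/0, [])` and then passed as an assign (`fetch_github_repos/0` being a hypothetical wrapper around the GitHub calls).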

Thanks @wanton7 - I’ll check out a GenServer implementation. The memoize implementation seemed cleaner at the time :)

If you need more speed than a GenServer, you can save the cached value to ETS and read it from outside the GenServer, as a “dirty” read. Store the creation time (when the cached value was created) alongside the value. When you read the value, check whether it is older than the cache timeout you have defined, and only then go to the GenServer to reload the cache. But I think this is only necessary when you have a performance bottleneck.
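
A sketch of that ETS variant (module and table names are invented; the owning process, e.g. a GenServer, would call `init/0` once):

```elixir
defmodule DirtyCache do
  @moduledoc false
  @table :dirty_cache
  @ttl_ms :timer.minutes(5)

  # The owning process creates the named, public table once.
  def init do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
    :ok
  end

  # Store the value together with its creation time.
  def put(key, value) do
    :ets.insert(@table, {key, value, System.monotonic_time(:millisecond)})
    :ok
  end

  # "Dirty" read from any process, no GenServer round-trip. Returns
  # :stale when the entry is older than the TTL, telling the caller to
  # ask the GenServer to reload.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value, inserted_at}] ->
        if System.monotonic_time(:millisecond) - inserted_at < @ttl_ms do
          {:ok, value}
        else
          :stale
        end

      [] ->
        :miss
    end
  end
end
```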


One thing to improve is that your current code enumerates all repos sequentially and, for each of them, creates a GitHub API client and performs a request. Ideally you should reuse the same client, and try to get all data in one request if possible, or parallelize the requests otherwise. GitHub has a GraphQL API: maybe you can use it to fetch all repos in one request.
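
The parallel option could be sketched with `Task.async_stream/3`. The per-repo fetch is simulated here so the snippet is runnable; in the real code it would be the Tentacat request, reusing a single client:

```elixir
# Simulated per-repo fetch; in the real code this would be a Tentacat
# call reusing one client built outside the loop.
fetch_repo = fn name ->
  Process.sleep(100)
  {name, :fetched}
end

names = ["pier", "molecule-alpha", "kawapi-phoenix"]

# All fetches run concurrently (bounded by max_concurrency), so the
# total latency is roughly one call, not the sum of all of them.
results =
  names
  |> Task.async_stream(fetch_repo, max_concurrency: 5, timeout: 10_000)
  |> Enum.map(fn {:ok, result} -> result end)
```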

I cannot comment on the memoize library, but definitely try to remove it or replace it with something else and see if that has an impact.


Thanks @lucaong - good feedback. I started by trying out the GraphQL API, that should be an improvement:

query {
  repo1: repository(owner: "BenSchZA", name: "pier") {
    name
    description
    primaryLanguage {
      name
    }
    licenseInfo {
      name
    }
    stargazers {
      totalCount
    }
  },
  repo2: repository(owner: "BenSchZA", name: "molecule-alpha") {
    name
    description
    primaryLanguage {
      name
    }
    licenseInfo {
      name
    }
    stargazers {
      totalCount
    }
  }
}

Here’s an implementation of caching using cachex, I’m going to see if that solves the issue:

  def avatar_url() do
    Cachex.fetch(:github, "avatar_url", fn ->
      IO.puts("Refreshing cache")

      [handle: handle, username: _username, access_token: access_token] =
        Application.fetch_env!(:app, :github)

      client = Tentacat.Client.new(%{access_token: access_token})

      case Tentacat.Users.find(client, handle) do
        {200, data, _response} -> {:commit, data["avatar_url"]}
        _ -> {:ignore, nil}
      end
    end)
    |> case do
      {:error, _} ->
        nil

      {_, nil} ->
        {:error, :not_found}

      {:commit, result} ->
        # Set the TTL only after the value has actually been written:
        # calling Cachex.expire/3 inside the fallback runs before the key
        # exists, so it has no effect and the entry never expires.
        Cachex.expire(:github, "avatar_url", :timer.minutes(5))
        result

      {success, result} when success in [:ok, :loaded] ->
        result
    end
  end

This happened on a side project once before, where it was affecting one of my Phoenix apps but not the others. It turned out to be a problem with Docker networking and the fact that I was running on a $5/mo DigitalOcean host.

I solved it by deleting my droplet and re-creating it, then updating to the latest Docker version before booting up my apps. I haven’t seen it occur since, and I’ve never seen this happen in Kubernetes or more hardened deployment environments. So I chalked it up to being on a cheap DigitalOcean box, with Docker not really being made for that sort of environment.
