Phoenix server randomly stops serving requests with no error reported

I’m running two Phoenix applications. Both are deployed in the same manner (as far as I can tell), but they serve different purposes and have different packages installed. One of them, my personal website/blog, periodically stops serving requests - this shows up as a “Bad Gateway” error from Traefik. The issue occurs at random - sometimes after an hour, sometimes after a few hours, or after a few days.

To troubleshoot the issue I’ve used the observer and dug through the stats - I don’t see anything obvious, and there are no crash dumps. The application is deployed with Docker and mix releases, but the same issue occurs when a mix release is run locally (definitely not a memory issue) and when it runs on the server as a systemd service. I’ve also tried different Erlang/Elixir version combinations (following the guidelines).

There are no crash dumps and no out-of-the-ordinary logs - it just stops serving requests.

I feel like I might be missing something obvious, or that there’s some other debugging technique I could use to find the issue. Perhaps there’s a package version incompatibility, but I’m not using any out-of-the-ordinary packages. Here’s the repo for reference: https://github.com/BenSchZA/phoenix-blog. The other application has been running stably for a few months: https://github.com/BenSchZA/kawapi-phoenix

I appreciate any guidance and help! Let me know if there are any other details I can provide.

Off the top of my head, you should check your load balancer.

Indeed, I was thinking the same as @dimitarvp. If you cannot see any error, and you know that the app is actually running, try to debug what is in front of the app first. A few possible hypotheses to test:

  • If the problem is upstream of the app, the app itself should be working. Can you validate it? Does the app respond to requests sent locally from the same machine?
    • If the app responds to requests sent from the same machine, the problem must be upstream. Are the instances restarted or scaled dynamically? Is it possible that an instance gets restarted or moved and the load balancer in front does not discover the new one?
    • If the app does not respond even to local requests, are all the children of your supervision tree running? If a child crashes and gets in a restart loop for some reason, it will reach the max restarts and won’t be restarted anymore. In this case though, you should see traces of it in the logs. (A quick way to check the children is sketched right after this list.)
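
For reference, a minimal way to do that check from a console attached to the running node, assuming the top-level supervisor is registered as App.Supervisor:

# Attach to the running node (e.g. via the release's remote console), then:
Supervisor.count_children(App.Supervisor)
# => %{active: 3, specs: 3, supervisors: 1, workers: 2}   (example shape - numbers will differ)

Supervisor.which_children(App.Supervisor)
# => [{AppWeb.Endpoint, #PID<0.456.0>, :supervisor, [AppWeb.Endpoint]}, ...]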

Unfortunately the issue happens when not behind a load balancer too.

What’s the easiest way to check if the children in the supervision tree are all running? I’ve checked in the observer and it all seems okay. As I said, no unusual logs or crash dumps.

There’s no autoscaling or restarting - it’s a pretty basic deployment via Traefik. The issue doesn’t occur with my other application, and it does occur locally with no load balancer.

There’s nothing in front of the application - the issue also happens when it’s running directly from a mix release. Doing a restart fixes the issue.

I also updated the supervision strategy in an attempt to fix the issue: opts = [strategy: :one_for_all, name: App.Supervisor, max_restarts: 10, max_seconds: 30]
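
For context, those opts are what get passed to Supervisor.start_link/2 in application.ex, roughly like this (children abbreviated):

def start(_type, _args) do
  children = [
    AppWeb.Endpoint
    # ...other children
  ]

  opts = [strategy: :one_for_all, name: App.Supervisor, max_restarts: 10, max_seconds: 30]
  Supervisor.start_link(children, opts)
end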

I don’t have Elixir running in production yet, but is it possible you are running out of memory? Maybe there’s some sort of memory leak, or you are caching too much. Have you checked memory usage?

Okay… don’t know why I didn’t try this before. Going to a path other than the URL root works. So clearly there’s some issue with a specific module loaded on the root endpoint. Appreciate all the suggestions! Still don’t know what the issue is, but I should be able to track it down now.

One of the things I do on the root endpoint is make a series of calls to the GitHub API. I also use a package, {:memoize, "~> 1.3"}, to memoize and rate-limit the calls so I don’t hit the GitHub API rate limit. I’m not hitting the rate limit - I checked that today and a while ago.

It could still be related to those calls. You make several sequential calls to GitHub, so when the cache gets invalidated and the GitHub endpoint is slow, your request will take a long time. If there is a proxy in front of your app, it will likely have a timeout, after which it considers the downstream service down and returns a 502.

Thanks @lucaong, I’ll dig into that deeper.

There’s no proxy or anything in front of my app when it’s run locally. I’ve just removed that module and will leave the app running locally for a while to see what happens, to confirm that it’s the GitHub calls/caching (I’m fairly sure that will be the case).

The cache is invalidated every 5 minutes. I make about 10 calls to the GitHub API: https://github.com/BenSchZA/phoenix-blog/blob/master/lib/app/Github/api.ex

Yeah, the fact that restarting fixes the issue does seem to point to that. Just out of curiosity: when you run it behind Traefik, you get a 502 Bad Gateway. What kind of error do you get when you hit the app directly?

When I hit the app directly there’s no error - it just hangs, with no network calls apart from the initial GET request to the root / (in the browser/frontend at least). Maybe I’ll add some logs to the module making the GitHub API calls too.
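
Something like this is what I have in mind for the logging - wrapping each Tentacat call with a timing log (just a sketch around the existing call):

require Logger

{time_us, result} = :timer.tc(fn -> Tentacat.Users.find(client, handle) end)
Logger.info("GitHub Users.find returned in #{div(time_us, 1000)}ms")
# ...then use `result` as before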

I checked the code for memoize, and to a novice Elixir programmer that code looks scary. I do C# programming for a living, so I’m not a novice programmer in general, and that code looks like it could easily have some kind of problem. Maybe it doesn’t, but it could. I would switch to a simple GenServer-based cache and see if your problems go away.
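
A minimal sketch of the kind of GenServer-based cache I mean (module and function names are just for illustration):

defmodule App.GithubCache do
  use GenServer

  @ttl :timer.minutes(5)

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  # Returns the cached value for `key`, recomputing it with `refresh_fun`
  # when it is missing or older than @ttl.
  def get(key, refresh_fun) do
    GenServer.call(__MODULE__, {:get, key, refresh_fun})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key, refresh_fun}, _from, state) do
    now = System.monotonic_time(:millisecond)

    case Map.get(state, key) do
      {value, cached_at} when now - cached_at < @ttl ->
        # Fresh enough - serve the cached value.
        {:reply, value, state}

      _ ->
        # Missing or stale - recompute and store it with a new timestamp.
        value = refresh_fun.()
        {:reply, value, Map.put(state, key, {value, now})}
    end
  end
end

Your view code would then call something like App.GithubCache.get("repos", fn -> fetch_from_github() end), where fetch_from_github/0 stands in for the existing tentacat calls.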

After more debugging, the issue is definitely caused either by the GitHub API calls made with tentacat or by memoize.

My question then is: where is the best place to make these API calls, both in terms of performance (rendering the view to the user without waiting for the API call to complete) and in terms of being non-blocking when there is an issue?

The options are passing the result of the API call from the controller as an assign (<%= for repo <- @github_repos do %>) versus calling it in the template via a view function (<%= for repo <- github_repos() do %>). My assumption was that if it’s called from a view function, the template would be rendered first and the call made afterwards - which would be non-blocking? (In practice both seem to block; I’d probably need LiveView for this.)
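
If I go the LiveView route, the pattern I have in mind is roughly this - render immediately with an empty assign and load the data in handle_info (App.Github.API.repos/0 is just a stand-in for whatever makes the GitHub calls):

def mount(_params, _session, socket) do
  # Kick off the slow fetch only once the socket is connected,
  # so the first render doesn't wait on the GitHub calls.
  if connected?(socket), do: send(self(), :load_repos)
  {:ok, assign(socket, github_repos: [])}
end

def handle_info(:load_repos, socket) do
  {:noreply, assign(socket, github_repos: App.Github.API.repos())}
end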

Thanks @wanton7 - I’ll check out a GenServer implementation. The memoize implementation seemed cleaner at the time :slight_smile:

If you need more speed than a GenServer, I think you can save the cached value to ETS and read it from outside the GenServer, as a “dirty” read. You can store the creation time (when the cached value was created) alongside the value. Then when you read the value, you check that timestamp: if it’s older than the cache timeout you’ve defined, you go to the GenServer to reload the cache. But I think this is only necessary when you have a performance bottleneck.
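
A rough sketch of that read path (it assumes the GenServer created the table with :ets.new(:github_cache, [:named_table, :public, read_concurrency: true]) and handles a {:refresh, key} call that reloads and re-inserts the entry; the names are just examples):

def lookup(key) do
  case :ets.lookup(:github_cache, key) do
    [{^key, value, cached_at}] ->
      # Entry exists - use it if it's younger than the cache timeout, otherwise reload.
      if System.monotonic_time(:millisecond) - cached_at < :timer.minutes(5) do
        value
      else
        GenServer.call(App.GithubCache, {:refresh, key})
      end

    [] ->
      # No entry yet - let the GenServer populate it.
      GenServer.call(App.GithubCache, {:refresh, key})
  end
end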

One thing to improve is that your current code enumerates all repos sequentially and, for each of them, creates a GitHub API client and performs a request. Ideally you should reuse the same client, and try to get all data in one request if possible, or parallelize the requests otherwise. GitHub has a GraphQL API: maybe you can use it to fetch all repos in one request.
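
For example, something along these lines reuses one client and fans the per-repo requests out concurrently (assuming a per-repo call such as Tentacat.Repositories.repo_get/3):

# One shared client, requests made in parallel rather than one after another.
client = Tentacat.Client.new(%{access_token: access_token})

repos =
  ["pier", "molecule-alpha"]
  |> Task.async_stream(
    fn name -> Tentacat.Repositories.repo_get(client, "BenSchZA", name) end,
    max_concurrency: 5,
    timeout: 10_000
  )
  |> Enum.map(fn {:ok, result} -> result end)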

I cannot comment on the memoize library, but definitely try to remove it or replace it with something else and see if that has an impact.

Thanks @lucaong - good feedback. I started by trying out the GraphQL API, which should be an improvement:

query {
  repo1: repository(owner: "BenSchZA", name: "pier") {
    name
    description
    primaryLanguage {
      name
    }
    licenseInfo {
      name
    }
    stargazers {
      totalCount
    }
  },
  repo2: repository(owner: "BenSchZA", name: "molecule-alpha") {
    name
    description
    primaryLanguage {
      name
    }
    licenseInfo {
      name
    }
    stargazers {
      totalCount
    }
  }
}
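
Sending it boils down to a single POST to GitHub’s GraphQL endpoint - here’s a rough sketch using HTTPoison and Jason, with query holding the string above:

headers = [
  {"Authorization", "bearer #{access_token}"},
  {"Content-Type", "application/json"}
]

{:ok, %HTTPoison.Response{status_code: 200, body: body}} =
  HTTPoison.post("https://api.github.com/graphql", Jason.encode!(%{query: query}), headers)

# Each aliased repository comes back under "data", e.g. data["repo1"]["stargazers"]["totalCount"]
%{"data" => data} = Jason.decode!(body)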

Here’s an implementation of caching using cachex; I’m going to see if that solves the issue:

  def avatar_url() do
    # Look up "avatar_url" in the :github cache; on a miss, Cachex runs the fallback below.
    Cachex.fetch(:github, "avatar_url", fn() ->
      IO.puts "Refreshing cache"
      # Intended to put a 5-minute TTL on the cached key
      Cachex.expire(:github, "avatar_url", :timer.minutes(5))
      [handle: handle, username: _username, access_token: access_token] = Application.fetch_env!(:app, :github)
      client = Tentacat.Client.new(%{access_token: access_token})
      case Tentacat.Users.find(client, handle) do
        {200, data, _response} ->
          {:commit, data["avatar_url"]}
        _ ->
          {:ignore, nil}
      end
    end)
    # Normalise the Cachex return tuple into either the value or an error
    |> case do
      {:error, _} -> nil
      {_, nil} -> {:error, :not_found}
      {success, result} when success in [:ok, :loaded, :commit] -> result
    end
  end

This happened on a side project of mine once before, where it was affecting one of my Phoenix apps but not the others. It turned out to be a problem with the Docker networking and the fact that I was running on a $5/mo DigitalOcean host.

I solved it by deleting my droplet and re-creating it, then updating to the latest Docker version before booting up my apps. I haven’t seen it occur since then, and I’ve never seen this happen on Kubernetes or in more hardened deployment environments. So I chalked it up to being on a cheap DigitalOcean box, with Docker not really being made for that sort of environment.

Thanks for the suggestion @mmmries, and all the other guidance. The issue ended up being with either the memoize library or the way I configured it. I’m now using cachex and the problem is resolved, although I don’t think the cache is expiring properly - so if anyone can review my code above, that would be great :slight_smile:

What do you mean by “the cache not expiring properly”? You are giving it a 5-minute TTL, and if the user changes avatars 10 times during those 5 minutes then yes, your code will always return the stale avatar URL. Five minutes into the future, if and only if there’s another request for the avatar URL, only then will your code fetch the new avatar URL.

I presume this is the exact behaviour you want?

That’s the behaviour I want, but even after waiting for the expiry period, the cache doesn’t refresh. I’ve tested this locally with debug logs too. For context, I’m using it to fetch GitHub repo details as well.