Understanding slow requests with telemetry event timeline

Occasionally my app goes into some sort of meltdown. Requests suddenly get jammed and Heroku times out everything coming in at the 30 second mark.

I’ve struggled to locate a cause, but this AppSignal report looks like it holds the answer (screenshot of the request’s event timeline):

This is a basic POST request and would usually be completed in around 1 second.

As can be seen, there is a 20 second wait until the first DB request is made (add current user to conn).

Any help or suggestions would be great.

AppSignal will only show you what’s happening within the request itself. If things are getting delayed, that likely means the node is being blocked by something else (or multiple things) in the meantime.

Thanks.

I’ve isolated it to something within my endpoint, because the very first plug in my router emits a log entry 20-odd seconds after the Plug.Logger log.

Here’s what lies between those two plugs:

  plug(Plug.Parsers,
    parsers: [:urlencoded, :multipart, :json],
    pass: ["*/*"],
    json_decoder: Poison
  )

  plug(Plug.MethodOverride)
  plug(Plug.Head)

  plug(Plug.Session,
    store: :cookie,
    key: redact,
    signing_salt: redact
  )

  plug(Ev2Web.Router)

Given these are all standard/official plugs, do you think there is any chance the delay is actually down to one of them?

Once a dyno is in a clogged state, requests consistently follow the pattern from the screenshot above.

You’re still under the assumption that the code of the slow request is to blame, which I’d challenge. I would much rather expect something in your system acting up and slowing down web requests as a side effect.

Hmm yes that’s what I’m thinking as those plugs are all standard.

Strangely, I’ve found that restarting dynos doesn’t necessarily resolve it in the short term.

I am at a bit of a loss as to how to find out what is clogging the system. What I have available:

  • I have live dashboard enabled
  • Appsignal and logging for metrics

If you have the live dashboard enabled, I’d monitor CPU usage in those instances. If the web requests are indeed being starved, you should see continuously very high CPU usage. Granted, the dashboard itself might be affected as well if it’s indeed a system-wide problem. Another thing you can do is look at the process list and watch for processes with very high or quickly growing reduction counts.
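
Something like this from a remote IEx shell attached to the node will give you a quick snapshot (just a rough sketch; the fields and the top-10 cutoff are arbitrary):

# Snapshot the 10 processes with the highest cumulative reduction counts.
# Reductions only ever grow, so take two snapshots a few seconds apart and
# compare them to see which counts are growing fastest.
Process.list()
|> Enum.map(&{&1, Process.info(&1, [:registered_name, :current_function, :message_queue_len, :reductions])})
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> info[:reductions] end, :desc)
|> Enum.take(10)
|> Enum.each(&IO.inspect/1)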

Then I’d check in with Heroku as well. It might not even be your system causing the problems, but another system running on the same hardware.

Great.

Thanks, I’ll take a look at those.

Processes: (screenshot)

VM CPU: (screenshot)

We are not currently in a clogged state, and the dynos have been running for around an hour since the last restart.

Does any of that look concerning?

Next time this happens I want to be sure I can start to at least rule things out.

What should I track in order to measure the node becoming clogged?

total_run_queue_lengths.cpu and total_run_queue_lengths.io?

And on the DB front I am tracking queue length average and 95th percentile.

Does this sound sufficient?
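
Concretely, I’m imagining something close to the stock Phoenix telemetry module (a sketch; the module name, poller period and exact metric list are my own assumptions):

defmodule Ev2Web.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      # Poller for any app-specific periodic measurements; the :telemetry_poller
      # application's default poller already emits the [:vm, ...] measurements.
      {:telemetry_poller, measurements: [], period: 10_000}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  def metrics do
    [
      # Scheduler run queue lengths; sustained growth suggests the node is starved
      summary("vm.total_run_queue_lengths.total"),
      summary("vm.total_run_queue_lengths.cpu"),
      summary("vm.total_run_queue_lengths.io"),
      summary("vm.memory.total", unit: {:byte, :kilobyte})
    ]
  end
end

LiveDashboard can then chart these via its metrics: option.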

We’re seeing the same thing, or at least something similar.
We serve thousands of a certain POST request daily in 200-500ms each, and then one will take 30s.
There are others that take longer (1-2s), but those seem to correspond with payload size whereas the 30s requests do not.

We’ve instrumented all of our plug calls for AppSignal, and it’s consistently happening at Plug.Parsers.
This is not the first plug in our endpoint.
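
(Roughly along these lines, though simplified: the module below is a generic, dependency-free timing wrapper rather than our actual AppSignal instrumentation, and the module name is made up.)

defmodule MyApp.TimedPlug do
  # Wraps another plug and logs how long its call/2 takes.
  @behaviour Plug
  require Logger

  @impl true
  def init(opts) do
    plug = Keyword.fetch!(opts, :plug)
    {plug, plug.init(Keyword.get(opts, :opts, []))}
  end

  @impl true
  def call(conn, {plug, plug_opts}) do
    started_at = System.monotonic_time(:millisecond)
    conn = plug.call(conn, plug_opts)
    Logger.info("#{inspect(plug)} took #{System.monotonic_time(:millisecond) - started_at}ms")
    conn
  end
end

Used in the endpoint as, for example, plug(MyApp.TimedPlug, plug: Plug.Parsers, opts: [parsers: [:json], json_decoder: Jason]).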

Our incoming payloads contain a base64 image string. We wondered if there might be an issue with parsing it (with Jason), but again, most make it through just fine (including some that are much larger than those in the 30s requests). Just mentioning this in case our requests have this in common.
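
(One way to sanity-check the decode cost on its own, using a synthetic payload roughly shaped like ours; just a sketch with made-up sizes:)

# Time Jason.decode/1 on a JSON body containing a large base64 "image" field.
payload = Jason.encode!(%{"image" => Base.encode64(:crypto.strong_rand_bytes(500_000))})
{micros, {:ok, _decoded}} = :timer.tc(fn -> Jason.decode(payload) end)
IO.puts("decoded #{byte_size(payload)} bytes in #{micros / 1000} ms")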

We’re also on Heroku. Metrics don’t seem to indicate any issues with memory or dyno load before or after those long requests. And when the dyno survives the long response, it goes on to serve more of the same type of request (and others) right away with no trouble.

Any feedback @jmurphyweb or others?

We’re also experiencing sporadic slow requests (1-2s, sometimes up to 30s) affecting a POST endpoint with a JSON body. Instrumenting the request pipeline via AppSignal points to a slowdown at Plug.Parsers, which incurs a long wait until the request is properly handled.

The payload is invariably small (<1KB), containing one field holding a long base45-encoded string (600 bytes) and a small object. Our json_library is Jason.

This endpoint is called 500 times per minute at peak times, exhibiting minimal variance and short response times (<100ms) except for these outliers.

Could it be the same issue?
Our clients usually access this endpoint via mobile network, which might explain occasional slowdowns - but up to 30 seconds for a small-ish JSON payload?

Our endpoint (sans Plug.Parsers instrumentation):

defmodule Front.Endpoint do
  use Phoenix.Endpoint, otp_app: :front
  use Appsignal.Phoenix

  alias Front.Config

  plug(Front.Plugs.Web.CheckHealth)

  plug(Plug.Static,
    at: "/",
    from: :front,
    gzip: false,
    cache_control_for_etags: "public, max-age=31536000",
    only: ~w(css fonts images js docs favicon.ico robots.txt)
  )

  if code_reloading? do
    plug(Phoenix.CodeReloader)
  end

  plug(Plug.RequestId)
  plug(Plug.Parsers, Config.parser_options())
  plug(Plug.MethodOverride)
  plug(Plug.Head)
  plug(Plug.Session, Config.session_options(Mix.env()))
  plug(Corsica, Config.corsica_options(Mix.env()))
  plug(Front.Router)
end

We are pretty sure that our problem was caused by the logger going into sync mode.

I wasn’t responsible for the fix, so I don’t have a deep understanding of the problem, but essentially we were using a third-party logging backend which sent logs over HTTP. The hypothesis was that the logging API occasionally had problems connecting, so the logger could not process application logs quickly enough, which supposedly caused it to move into sync mode.

Again, I have limited understanding of the above, so please verify the Logger behaviour for yourself. I just thought I’d try to articulate my understanding in case it helps someone facing the same issue.
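
For anyone who wants to check or tune this behaviour themselves, these are the relevant settings for the classic Elixir Logger (the values shown are the documented defaults; please verify them against your own Elixir version):

# config/config.exs
config :logger,
  # switch from async to sync logging once this many messages are queued
  sync_threshold: 20,
  # start discarding messages once the queue grows beyond this
  discard_threshold: 500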
