How to optimise P99 GC pauses on BEAM

Really loving Elixir and the BEAM; the performance is amazing. Even the P99, which shoots up to 150ms, is more than acceptable coming from the Ruby world. But I would like to understand and improve it, so any help with tuning, or links on how to improve it, would be highly appreciated.
Thanks :slight_smile:

Do you know it’s GC or could it be something else? Do you have a database backing it that could cause the delays?

I’m asking because BEAM has per-process GC and most request processes shouldn’t encounter GC at all because they are so short lived. At least that’s my understanding.
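
If you want to rule GC out empirically, you can peek at the GC counters of any long-lived process from a remote shell. A rough sketch (the process name here is made up):

```elixir
# Pick a long-lived process you suspect; MyApp.Worker is a placeholder name.
pid = Process.whereis(MyApp.Worker)

# Per-process GC settings and counters (fullsweep_after, minor_gcs, ...).
Process.info(pid, :garbage_collection)

# Node-wide totals since the VM started: {number_of_gcs, words_reclaimed, 0}.
:erlang.statistics(:garbage_collection)
```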

7 Likes

You definitely have a 150ms P99 to figure out but, as @Nicd notes, GC is unlikely to be the cause since there is no global stop-the-world garbage collection on the BEAM. Have you been able to identify any of these long requests in the logs? Are there any commonalities between them? What is this measured from? The load balancer or the application metrics?

6 Likes

I’m not a betting man, but I’d put money on the fact that what you’re seeing is not in any way related to GC (unless you’re using advanced features of OTP 22 like persistent_term incorrectly).

I’d encourage you to actually profile your app with something like xprof: https://github.com/Appliscale/xprof or instrument it with telemetry: https://github.com/beam-telemetry/telemetry. That will actually show you where the time is spent in those very long requests.
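
For the telemetry route, a rough sketch of what that could look like with Plug.Telemetry (the event prefix and the 100ms threshold are assumptions, not from your app):

```elixir
# In the router: plug Plug.Telemetry, event_prefix: [:my_app, :plug]

defmodule MyApp.SlowRequestLogger do
  require Logger

  def attach do
    :telemetry.attach(
      "slow-request-logger",
      [:my_app, :plug, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, %{duration: duration}, %{conn: conn}, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)

    if ms > 100 do
      Logger.warn("slow request: #{conn.method} #{conn.request_path} took #{ms}ms")
    end
  end
end
```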

Also, I’d say that it’s the database. It’s almost always the database :wink:

4 Likes

I don’t have a DB; it only uses ETS for storing data. The ETS table is also used as a counter, so if anything there could be multiple writes to the same ETS key, i.e. incrementing the counter. But the counter is updated in a separate process, so the request is not blocked, even for the external call.
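
The counter update is roughly this pattern (a simplified sketch, table and key names changed):

```elixir
# Created once at startup; write_concurrency helps with concurrent increments.
:ets.new(:my_counters, [:named_table, :public, :set, write_concurrency: true])
:ets.insert(:my_counters, {:requests, 0})

# Atomic increment of the second tuple element, callable from any process.
:ets.update_counter(:my_counters, :requests, {2, 1})

# Read the current value back.
:ets.lookup(:my_counters, :requests)
#=> [requests: 1]
```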

1 Like

It’s measured from the application. I will add some verbose logging to track it further, thanks.

I’ve only got ETS, no database. I will try out the profilers to see if they help, thanks.

From New Relic, it’s the POST endpoint that is slow at times.

The same can be seen in the application logs.

It could be the Jason encoding that is slow at times,

or the SSL connection handled by Plug.Cowboy.

But then again, all endpoints use the same SSL and JSON encoding. On the surface it looks to be something else affecting the P99.

With few exceptions, it looks like the spikes are happening on the hour, every hour. Sometimes they’re small (~10ms), and sometimes they’re bigger (> 150 ms), but there’s almost always a spike. Is there a job that runs hourly in your app, on something that uses your app, or on anything else your app relies on? What does your resource consumption look like before, during, and after the spikes? Does your traffic spike hourly?

1 Like

If it is something like that (hourly), are you running in AWS or similar? Sometimes those instances have internal cron jobs that run and cause spikes, and then you have to fight with AWS to get them to admit they are running something that is causing the issue :slight_smile:

3 Likes

I noticed it spikes every 46th minute, and only between 12:00 AM and 4:00 PM. I have a task that runs every 15 minutes and takes a snapshot of a very small ETS table (only five keys, with values that are just counters), so that is very unlikely to be the issue: it would have happened every 15 minutes and would affect all calls rather than just one endpoint. The resource consumption looks quite stable compared to the spikes. There are traffic spikes, but not so in sync that they happen on the same minute every hour. I suspect it is either the websocket connection to the Slack API or some host process that causes this.
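
For context, the snapshot task is roughly this shape (simplified, names changed); dumping a five-key table is microseconds of work:

```elixir
defmodule MyApp.SnapshotTask do
  use GenServer

  @interval :timer.minutes(15)

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(nil) do
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:snapshot, state) do
    # Dump the tiny counter table; the real task persists the snapshot.
    :my_counters |> :ets.tab2list() |> IO.inspect(label: "counter snapshot")
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :snapshot, @interval)
end
```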

I am running in the cloud (Hetzner). I am inclined to think this might be something at the host level that causes this bottleneck. I will continue to dig further.

Judging by the name of the endpoint, does this POST request by any chance send a message to Slack synchronously? In that case the P99 is the response time of the Slack API :wink:

The Slack API call to send the message is done in a separate process; the endpoint returns its response independently, so the two aren’t linked.
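
It is roughly this fire-and-forget pattern (simplified, module names changed), so the request process never waits on Slack:

```elixir
defmodule MyApp.Notifier do
  # Assumes {Task.Supervisor, name: MyApp.TaskSupervisor} is in the supervision tree.
  def notify_slack_async(message) do
    Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
      # Placeholder for the helper that does the actual HTTP call to the Slack API.
      MyApp.Slack.post_message(message)
    end)
  end
end
```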

I switched my cloud provider to verify whether it was provider activity causing it, but it seems it got worse, rising from 150ms to 175ms. It seems this is related to latency when connecting to an external service, and the culprits could range from New Relic and Sentry to Slack itself.

So I managed to drill down further, and it seems it has nothing to do with any external service. @AppSignal was instrumental in digging through it. This happens when a burst of requests is received, and it seems that either scheduling of the work takes time or response rendering takes its toll.

Could this be a Plug performance issue, or a Jason.encode! issue under load?

APM from AppSignal
It’s visible that the call that kicks off the external calls returns in 33µs, even for the request that got held up for 349ms.

[AppSignal screenshots]

BEAM stats from NewRelic

Any optimizations for Plug, Jason, or the BEAM that could help? I am running this service on a Scaleway DEV1-S, which is 2 vCPUs and 2 GB RAM, in a container.

Do you know if the body is larger for those slow requests compared to the others?

Yes, at times it is, compared to the others. For the request that took 349ms, the body size was as below:

[screenshots: the 349ms request and its body size]

And for a request that took 919µs, the body was as below:

[screenshots: the 919µs request and its body size]

It’s a comparatively small body, under 100KB. Could the pattern matching be slow for this body size? And what would the approach be to improve it?

Maybe you could capture those slower, bigger responses before they are JSON encoded and run a benchmark against Jason.encode!? If ahsandar/ultronex is the code you are using to test this, the cowboy you seem to be using is 2.6.3; maybe upgrade to the latest cowboy 2.7.0 and see if that fixes things?
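
Something like this Benchee sketch would show whether body size alone explains it (the payload files are stand-ins for responses you capture):

```elixir
# deps: {:benchee, "~> 1.0", only: :dev}

# Stand-ins for a captured slow (big) payload and a typical small one.
big = "slow_payload.json" |> File.read!() |> Jason.decode!()
small = "fast_payload.json" |> File.read!() |> Jason.decode!()

Benchee.run(%{
  "encode big body" => fn -> Jason.encode!(big) end,
  "encode small body" => fn -> Jason.encode!(small) end
})
```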

1 Like

:wave:

Maybe you could try Jason.encode_to_iodata! as well.

From the Jason docs (jason v1.4.1):

This function should be preferred to encode/2, if the generated JSON will be handed over to one of the IO functions or sent over the socket. The Erlang runtime is able to leverage vectorised writes and avoid allocating a continuous buffer for the whole resulting string, lowering memory use and increasing performance.
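
In a plain Plug handler that would look roughly like this sketch (send_resp/3 accepts iodata directly):

```elixir
defmodule MyApp.JSON do
  import Plug.Conn

  def send_json(conn, status, data) do
    conn
    |> put_resp_content_type("application/json")
    |> send_resp(status, Jason.encode_to_iodata!(data))
  end
end
```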

4 Likes