App in production crashing with DBConnection.ConnectionError

Hello.
I don’t know the best place to ask this; I’m hoping to find any help with a problem I’ve been facing in production for about 3 days.

I have a niche social network that, on average, has 4 thousand users online and makes around 30 thousand requests per second to the server. The application runs smoothly, averaging just 0.02s per request and around 10GB of memory. But something happens when the site reaches 5.5k users online: everything drops very quickly (in less than 10s), the application goes from 10GB to 60GB in a few seconds, the Elixir Logger starts dropping logs, and the server loses all connections to the database and starts trying to establish new ones, but they all time out… so the app becomes unresponsive until Docker is restarted (returning to normal speed and staying that way for about 20 minutes).

Error that appears in the log thousands of times:


DBConnection.ConnectionError

tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)

My infrastructure:

The application runs on Docker (generated by Phoenix 1.7), deployed with AWS ECS on a c6i.24xlarge (96 vCPU/189GB), with an Aurora r6.4xlarge database (16 CPU/128GB).

The database uses around 35% of CPU, and stays that way even when the application crashes. The application itself uses only 10% CPU and 10GB of RAM.

P.S.: All online users connect via WebSocket, in addition to making normal requests.

Elixir 1.15.7
Phoenix 1.7.10 with all deps updated
React front-end

Things I’ve tried

  • db connection pool: from 100 to 2400
  • Direct connection to Aurora, and through RDS Proxy as well
  • I have tested with an instance with 50 gigabits of network performance
  • soft/hard “ulimit” from default value to 130k
  • queue_interval/target from 50ms to 2s

Because of the WebSockets, I haven’t been able to test horizontal scaling yet.

Some screenshots

App on “normal” mode:

When the app is not responding (see how the memory increased):

Average request time; you can see it stops responding very quickly:

Aurora database shows that the app is not reading the query results:

So what I need is some direction. It’s strange that it happens right after passing the 5500-users-online mark; before that, the app doesn’t show any problems or slowdowns.

I also don’t know if scaling horizontally would solve it, but there is no metric (memory/CPU) that indicates the problem is about to occur, so there is nothing to trigger scaling on.

I would bet on some network limit, port exhaustion, or a process limit being reached. But I can’t find anything in the logs other than the database connection error.

1 Like

Was a crash_dump generated?

1 Like

No. I said it “crashed”, but in fact the server just starts to time out all requests. I’ll check next time whether the Docker container generates any files on disk.

I would start by adding more metrics around the application. Having such a large number of connections is likely to lead to worse performance, as they won’t ever be effectively used and the runtime now has to spend more time managing them.

If you suspect this is related to the database, you should log the metrics emitted by Ecto, such as query time, checkout time, idle time and so on.

I would also log the memory metrics from the VM and, most importantly, the Erlang VM run queues, which tell you how much CPU/IO work it has pending. You can get those by running Phoenix LiveDashboard, but ideally you want to push them to an external tool too.
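For illustration, a minimal sketch of both, assuming the default [:d4, :repo, :query] event prefix derived from the repo name:

:telemetry.attach(
  "repo-pool-health",
  [:d4, :repo, :query],
  fn _event, meas, _meta, _config ->
    to_ms = &System.convert_time_unit(&1 || 0, :native, :millisecond)
    # queue_time is how long the caller waited for a pool connection; if it
    # climbs before query_time does, the pool is the bottleneck, not the DB.
    IO.puts(
      "query=#{to_ms.(meas[:query_time])}ms " <>
        "queue=#{to_ms.(meas[:queue_time])}ms idle=#{to_ms.(meas[:idle_time])}ms"
    )
  end,
  nil
)

# Total run queue length across schedulers: how much CPU/IO work is pending.
:erlang.statistics(:total_run_queue_lengths)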

6 Likes

Are you by chance using Bandit? I was experiencing a similar symptom, although w/ no resource issues that I noticed, just sudden timeouts that felt like port exhaustion.

If so, there’s a PR @mtrudel put out for thousand_island (Refine acceptor behaviour on abnormal conditions by mtrudel · Pull Request #103 · mtrudel/thousand_island · GitHub) that resolves an issue where the kernel might kill a connection, but the process responsible for it was not restarting since it looked “normal” to it. I was similarly seeing it occur during peaks, but it was mostly just coincidence since that’s when the most activity and aborted connections were occurring. I also saw large numbers of DB timeouts right after, but it was a red herring.

5 Likes

@felix-starman It was Cowboy; then yesterday I moved to Bandit: same result.

@josevalim how many DB connections do you recommend? 100, or even fewer?

more about my app:

  • GraphQL to communicate with the front-end
  • GraphQL subscriptions (over WebSocket), used only for chat
  • Nebulex for in-memory caching
  • Phoenix.Tracker to track online users (all 5k users in the same topic)
  • A GenServer that starts a Postgres LISTEN service

application.ex

def start(_type, _args) do
  # import Supervisor.Spec
  :logger.add_handler(:sentry_handler, Sentry.LoggerHandler, %{})

  children = [
    D4.Cache,
    D4.Repo,
    # {DNSCluster, query: Application.get_env(:d4, :dns_cluster_query) || :ignore},
    {Phoenix.PubSub, name: D4.PubSub},
    {D4Web.Presence, [name: D4Web.Presence, pubsub_server: D4.PubSub]},
    D4Web.Endpoint,
    {Absinthe.Subscription, D4Web.Endpoint},
    {Task.Supervisor, name: D4.TaskSupervisor},
    D4.PGListener,
    D4Web.Telemetry,
    D4Web.CacheClient,
    D4Web.RateLimit
  ]

  # See https://hexdocs.pm/elixir/Supervisor.html
  # for other strategies and supported options
  opts = [strategy: :one_for_one, name: D4.Supervisor]
  Supervisor.start_link(children, opts)
end

PG listener (I don’t think there is any issue related to that):

# TODO: test this somehow
defmodule D4.PGListener do
  @moduledoc """
    Listen for notifications from postgres
  """

  use GenServer
  require Logger

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def init(opts) do
    with {:ok, pid} <- Postgrex.Notifications.start_link(D4.Repo.config()),
         {:ok, _} <- Postgrex.Notifications.listen(pid, "user_updated") do
      {:ok, opts}
    else
      error -> {:stop, error}
    end
  end

  def handle_info({:notification, _pid, _ref, "user_updated", payload}, state) do
    case Jason.decode(payload) do
      {:ok, data} ->
        D4.Helpers.broadcast_user_update(data["id"], data)
        D4.Guardian.delete_from_cache(%D4.User{id: data["id"]})
        :telemetry.execute([:d4, :pglistener, :ok], %{}, %{event: :user_updated, user_id: data["id"]})

      {:error, err} ->
        Logger.error("PG_LISTENER #{Exception.format_banner(:error, err)}")
        :telemetry.execute([:d4, :pglistener, :error], %{}, %{event: :user_updated, error: err})
    end

    {:noreply, state}
  end
end

deps version and deps tree:

absinthe 1.7.6
  absinthe_phoenix 2.0.2
  absinthe_plug 1.5.8
  bamboo 2.3.0
  bandit 1.1.2
  bcrypt_elixir 3.1.0
  bunt 1.0.0
  castore 1.0.5
  certifi 2.12.0
  combine 0.10.0
  comeonin 5.4.0
  cors_plug 3.0.3
  cowboy 2.10.0
  cowboy_telemetry 0.4.0
  cowlib 2.12.1
  credo 1.7.2
  dataloader 2.0.0
  db_connection 2.6.0
  decimal 2.1.1
  ecto 3.11.1
  ecto_psql_extras 0.7.15
  ecto_sql 3.11.1
  elixir_make 0.7.7
  eqrcode 0.1.10
  ex_aws 2.5.0
  ex_aws_s3 2.5.2
  expo 0.5.1
  feeder 2.3.2
  feeder_ex 1.1.0
  file_system 1.0.0
  gettext 0.24.0
  guardian 2.3.2
  hackney 1.20.1
  hpax 0.1.2
  httpoison 2.2.1
  idna 6.1.1
  jason 1.4.1
  jose 1.11.6
  meck 0.9.2
  metrics 1.0.1
  mime 2.0.5
  mimerl 1.2.0
  mock 0.3.8
  nebulex 2.5.2
  nimble_options 1.1.0
  nimble_parsec 1.4.0
  parse_trans 3.4.1
  phoenix 1.7.10
  phoenix_ecto 4.4.3
  phoenix_html 3.3.3
  phoenix_live_dashboard 0.8.3
  phoenix_live_view 0.20.2
  phoenix_pubsub 2.1.3
  phoenix_template 1.0.4
  plug 1.15.2
  plug_cowboy 2.6.1
  plug_crypto 2.0.0
  postgrex 0.17.4
  ranch 1.8.0
  remote_ip 1.1.0
  sentry 10.1.0
  shards 1.1.0
  ssl_verify_fun 1.1.7
  table_rex 4.0.0
  telemetry 1.2.1
  telemetry_metrics 0.6.1
  telemetry_poller 1.0.0
  thousand_island 1.2.0
  timex 3.7.11
  tzdata 1.1.1
  ua_inspector 2.2.0
  unicode_util_compat 0.7.0
  uuid 1.1.8
  websock 0.5.3
  websock_adapter 0.5.5
  xxhash 0.3.1
  yamerl 0.10.0

d4
├── absinthe ~> 1.7 (Hex package)
│   ├── dataloader ~> 1.0.0 or ~> 2.0 (Hex package)
│   ├── decimal ~> 1.0 or ~> 2.0 (Hex package)
│   ├── nimble_parsec ~> 1.2.2 or ~> 1.3 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── absinthe_phoenix ~> 2.0 (Hex package)
│   ├── absinthe ~> 1.5 (Hex package)
│   ├── absinthe_plug ~> 1.5 (Hex package)
│   ├── decimal ~> 1.0 or ~> 2.0 (Hex package)
│   ├── phoenix ~> 1.5 (Hex package)
│   ├── phoenix_html ~> 2.13 or ~> 3.0 (Hex package)
│   └── phoenix_pubsub ~> 2.0 (Hex package)
├── absinthe_plug ~> 1.5 (Hex package)
│   ├── absinthe ~> 1.5 (Hex package)
│   └── plug ~> 1.4 (Hex package)
│       ├── mime ~> 1.0 or ~> 2.0 (Hex package)
│       ├── plug_crypto ~> 1.1.1 or ~> 1.2 or ~> 2.0 (Hex package)
│       └── telemetry ~> 0.4.3 or ~> 1.0 (Hex package)
├── bamboo ~> 2.0 (Hex package)
│   ├── hackney >= 1.15.2 (Hex package)
│   │   ├── certifi ~> 2.12.0 (Hex package)
│   │   ├── idna ~> 6.1.0 (Hex package)
│   │   │   └── unicode_util_compat ~> 0.7.0 (Hex package)
│   │   ├── metrics ~> 1.0.0 (Hex package)
│   │   ├── mimerl ~> 1.1 (Hex package)
│   │   ├── parse_trans 3.4.1 (Hex package)
│   │   ├── ssl_verify_fun ~> 1.1.0 (Hex package)
│   │   └── unicode_util_compat ~> 0.7.0 (Hex package)
│   ├── jason ~> 1.0 (Hex package)
│   ├── mime ~> 1.4 or ~> 2.0 (Hex package)
│   └── plug ~> 1.0 (Hex package)
├── bandit ~> 1.0 (Hex package)
│   ├── hpax ~> 0.1.1 (Hex package)
│   ├── plug ~> 1.14 (Hex package)
│   ├── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   ├── thousand_island ~> 1.0 (Hex package)
│   │   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   └── websock ~> 0.5 (Hex package)
├── bcrypt_elixir ~> 3.0 (Hex package)
│   ├── comeonin ~> 5.3 (Hex package)
│   └── elixir_make ~> 0.6 (Hex package)
│       └── castore ~> 0.1 or ~> 1.0 (Hex package)
├── comeonin ~> 5.0 (Hex package)
├── cors_plug ~> 3.0 (Hex package)
│   └── plug ~> 1.13 (Hex package)
├── cowboy ~> 2.0 (Hex package)
│   ├── cowlib 2.12.1 (Hex package)
│   └── ranch 1.8.0 (Hex package)
├── credo ~> 1.3 (Hex package)
│   ├── bunt ~> 0.2.1 or ~> 1.0 (Hex package)
│   ├── file_system ~> 0.2 or ~> 1.0 (Hex package)
│   └── jason ~> 1.0 (Hex package)
├── dataloader ~> 2.0 (Hex package)
│   ├── ecto >= 3.4.3 and < 4.0.0 (Hex package)
│   │   ├── decimal ~> 2.0 (Hex package)
│   │   ├── jason ~> 1.0 (Hex package)
│   │   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   └── telemetry ~> 1.0 (Hex package)
├── ecto_psql_extras ~> 0.2 (Hex package)
│   ├── ecto_sql ~> 3.7 (Hex package)
│   ├── postgrex ~> 0.16.0 or ~> 0.17.0 (Hex package)
│   └── table_rex ~> 3.1.1 or ~> 4.0.0 (Hex package)
├── ecto_sql ~> 3.0 (Hex package)
│   ├── db_connection ~> 2.4.1 or ~> 2.5 (Hex package)
│   │   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   ├── ecto ~> 3.11.0 (Hex package)
│   ├── postgrex ~> 0.16.0 or ~> 0.17.0 or ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4.0 or ~> 1.0 (Hex package)
├── eqrcode ~> 0.1.10 (Hex package)
├── ex_aws ~> 2.0 (Hex package)
│   ├── hackney ~> 1.16 (Hex package)
│   ├── jason ~> 1.1 (Hex package)
│   ├── mime ~> 1.2 or ~> 2.0 (Hex package)
│   └── telemetry ~> 0.4.3 or ~> 1.0 (Hex package)
├── ex_aws_s3 ~> 2.0 (Hex package)
│   └── ex_aws ~> 2.0 (Hex package)
├── feeder_ex ~> 1.0 (Hex package)
│   └── feeder ~> 2.2 (Hex package)
├── gettext ~> 0.11 (Hex package)
│   └── expo ~> 0.5.1 (Hex package)
├── guardian ~> 2.0 (Hex package)
│   ├── jose ~> 1.8 (Hex package)
│   └── plug ~> 1.3.3 or ~> 1.4 (Hex package)
├── httpoison ~> 2.0 (Hex package)
│   └── hackney ~> 1.17 (Hex package)
├── jason ~> 1.0 (Hex package)
│   └── decimal ~> 1.0 or ~> 2.0 (Hex package)
├── jose ~> 1.11 (Hex package)
├── mock ~> 0.3.0 (Hex package)
│   └── meck ~> 0.9.2 (Hex package)
├── nebulex ~> 2.2 (Hex package)
│   ├── shards ~> 1.1 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── phoenix ~> 1.7.0 (Hex package)
│   ├── castore >= 0.0.0 (Hex package)
│   ├── jason ~> 1.0 (Hex package)
│   ├── phoenix_pubsub ~> 2.1 (Hex package)
│   ├── phoenix_template ~> 1.0 (Hex package)
│   │   └── phoenix_html ~> 2.14.2 or ~> 3.0 or ~> 4.0 (Hex package)
│   ├── plug ~> 1.14 (Hex package)
│   ├── plug_cowboy ~> 2.6 (Hex package)
│   ├── plug_crypto ~> 1.2 or ~> 2.0 (Hex package)
│   ├── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   └── websock_adapter ~> 0.5.3 (Hex package)
│       ├── bandit >= 0.6.0 (Hex package)
│       ├── plug ~> 1.14 (Hex package)
│       ├── plug_cowboy ~> 2.6 (Hex package)
│       └── websock ~> 0.5 (Hex package)
├── phoenix_ecto ~> 4.0 (Hex package)
│   ├── ecto ~> 3.5 (Hex package)
│   ├── phoenix_html ~> 2.14.2 or ~> 3.0 or ~> 4.0 (Hex package)
│   └── plug ~> 1.9 (Hex package)
├── phoenix_html ~> 3.3 (Hex package)
│   └── plug ~> 1.5 (Hex package)
├── phoenix_live_dashboard ~> 0.8.2 (Hex package)
│   ├── ecto ~> 3.6.2 or ~> 3.7 (Hex package)
│   ├── ecto_psql_extras ~> 0.7 (Hex package)
│   ├── mime ~> 1.6 or ~> 2.0 (Hex package)
│   ├── phoenix_live_view ~> 0.19 or ~> 1.0 (Hex package)
│   │   ├── jason ~> 1.0 (Hex package)
│   │   ├── phoenix ~> 1.6.15 or ~> 1.7.0 (Hex package)
│   │   ├── phoenix_html ~> 3.3 or ~> 4.0 (Hex package)
│   │   ├── phoenix_template ~> 1.0 (Hex package)
│   │   ├── plug ~> 1.15 (Hex package)
│   │   └── telemetry ~> 0.4.2 or ~> 1.0 (Hex package)
│   └── telemetry_metrics ~> 0.6 or ~> 1.0 (Hex package)
├── phoenix_pubsub ~> 2.0 (Hex package)
├── plug_cowboy ~> 2.1 (Hex package)
│   ├── cowboy ~> 2.7 (Hex package)
│   ├── cowboy_telemetry ~> 0.3 (Hex package)
│   │   ├── cowboy ~> 2.7 (Hex package)
│   │   └── telemetry ~> 1.0 (Hex package)
│   └── plug ~> 1.14 (Hex package)
├── postgrex >= 0.0.0 (Hex package)
│   ├── db_connection ~> 2.1 (Hex package)
│   ├── decimal ~> 1.5 or ~> 2.0 (Hex package)
│   └── jason ~> 1.0 (Hex package)
├── remote_ip ~> 1.1.0 (Hex package)
│   ├── combine ~> 0.10 (Hex package)
│   └── plug ~> 1.14 (Hex package)
├── sentry ~> 10.0 (Hex package)
│   ├── hackney ~> 1.8 (Hex package)
│   ├── jason ~> 1.1 (Hex package)
│   ├── nimble_options ~> 1.0 (Hex package)
│   ├── plug ~> 1.6 (Hex package)
│   └── plug_cowboy ~> 2.3 (Hex package)
├── shards ~> 1.0 (Hex package)
├── telemetry ~> 1.2.0 (Hex package)
├── telemetry_metrics ~> 0.5 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── telemetry_poller ~> 1.0 (Hex package)
│   └── telemetry ~> 1.0 (Hex package)
├── timex ~> 3.0 (Hex package)
│   ├── combine ~> 0.10 (Hex package)
│   ├── gettext ~> 0.20 (Hex package)
│   └── tzdata ~> 1.1 (Hex package)
│       └── hackney ~> 1.17 (Hex package)
├── ua_inspector ~> 2.0 (Hex package)
│   ├── hackney ~> 1.0 (Hex package)
│   └── yamerl ~> 0.7 (Hex package)
├── uuid ~> 1.0 (Hex package)
└── xxhash ~> 0.3.1 (Hex package)

Key configs from runtime/prod.exs (private data was changed):

config :d4, D4.Repo,
  hostname: System.get_env("PG_HOSTNAME"),
  database: System.get_env("PG_DATABASE"),
  username: System.get_env("PG_USERNAME"),
  password: System.get_env("PG_PASSWORD"),
  prepare: :unnamed, # added to support RDS Proxy
  pool_size: String.to_integer(System.get_env("PG_POOL_SIZE", "196")),
  timeout: String.to_integer(System.get_env("PG_TIMEOUT", "4000")),
  queue_target: String.to_integer(System.get_env("PG_QTARGET", "1000")),
  queue_interval: String.to_integer(System.get_env("PG_QINTERVAL", "5000"))

config :d4, D4Web.Endpoint,
  secret_key_base: System.get_env("PHOENIX_KEY"),
  http: [ip: {0, 0, 0, 0}, port: String.to_integer(System.get_env("PORT") || "4000")],
  url: [host: hostname, scheme: "https", port: 443],
  debug_errors: false,
  code_reloader: false,
  check_origin: false,
  force_ssl: false,
  server: true,
  root: ".",
  version: Mix.Project.config()[:version]
  
config :d4, D4Web.Mailer,
  adapter: Bamboo.SendGridAdapter,
  api_key: {:system, "SENDGRID_API_KEY"}  
  
config :d4, D4.Cache,
  backend: :shards,
  allocated_memory: 16_000_000_000,
  gc_interval: :timer.hours(6),
  gc_cleanup_min_timeout: :timer.minutes(10),
  gc_cleanup_max_timeout: :timer.minutes(60)  
  
config :d4, D4Services.Geo.LocationiqApi,
  token: System.get_env("LOCATION_IQ_TOKEN"),
  url: "https://us1.locationiq.com/v1"

config :d4, D4Services.Geo.GoogleMapsApi,
  token: System.get_env("GOOGLE_MAPS_TOKEN"),
  url: ""  

config :sentry,
  dsn: System.get_env("SENTRY_DSN"),
  environment_name: :prod,
  enable_source_code_context: true,
  root_source_code_path: [File.cwd!()]  
  
config :logger,
  discard_threshold: 1000,
  backends: [:console],
  level: :info,
  handle_otp_reports: true,
  utc_log: true,  
  compile_time_purge_matching: [[level_lower_than: :info]]

config :logger, :console,
  format: {D4Web.LoggerJsonFormat, :format},
  metadata: [:telemetry, :cid, :sid, :ip, :aws_id, :request_id, :label]  
  
  

When I am faced with these kinds of high-load issues I rely on recon and remote nodes. I’d suggest trying to come up with a load test that simulates this error (maybe 1/4 of the max users on 1/4 of the CPUs?), then connecting to your Phoenix app using a remote node and poking around. Check scheduler usage across your CPUs, check processes with high memory/reductions, etc.

Of course this can be done through automated metrics gathering, by sweeping the system periodically and measuring/logging the “heaviest” processes, but sometimes the proactive approach is easier to work with: you can see process state, the current stacktrace, etc.

Checking out Microstate Accounting and Lock Profiling can also help if nothing really stands out with scheduler usage or processes. Also try giving Erlang in Anger a read.
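For reference, the kind of probes recon gives you from a remote shell (a sketch; it assumes :recon is added as a dependency of the release):

:recon.scheduler_usage(1000)               # per-scheduler utilization, sampled over 1s
:recon.proc_count(:memory, 10)             # top 10 processes by memory
:recon.proc_count(:reductions, 10)         # top 10 by work performed
:recon.proc_count(:message_queue_len, 10)  # mailboxes piling up, a ballooning-RAM suspect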

2 Likes

@josevalim here is my telemetry logic. I log queries that take over 700ms (and I see about 10 per minute).

1 - I’m not sure if this is the correct way to log things.
2 - I’ve already tried disabling logging completely, with no effect.

IMPORTANT: earlier I said 35 thousand requests per second, but it is per minute.

#
# TODO: send authentication errors to Sentry?
#
defmodule D4Web.Telemetry do
  require Logger

  use Supervisor
  import Telemetry.Metrics

  @slow System.convert_time_unit(700, :millisecond, :native)

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  def init(_arg) do
    children = [
      {:telemetry_poller, measurements: measurements(), period: 30_000}
      # {Telemetry.Metrics.ConsoleReporter, metrics: metrics()}
    ]

    setup()
    Supervisor.init(children, strategy: :one_for_one)
  end

  def metrics() do
    [
      last_value("vm.memory.total", unit: {:byte, :megabyte}),
      last_value("vm.total_run_queue_lengths.total"),
      last_value("vm.total_run_queue_lengths.cpu"),
      last_value("vm.total_run_queue_lengths.io"),
      last_value("d4.cache.size.value", tags: [:cache, :node]),
      last_value("d4.presence.size.value")
    ]
  end

  def measurements() do
    [
      {D4.Cache, :dispatch_cache_size, []},
      {D4Web.Presence, :dispatch_online_user, []}
    ]
  end

  def setup do
    events = [
      #
      # phoenix
      # https://hexdocs.pm/phoenix/Phoenix.Logger.html
      #
      # [:plug_adapter, :call, :start],
      # [:plug_adapter, :call, :stop],
      # [:plug_adapter, :call, :exception],
      # [:phoenix, :endpoint, :start],
      # [:phoenix, :endpoint, :stop],
      # [:phoenix, :router_dispatch, :start],
      [:phoenix, :router_dispatch, :stop],
      [:phoenix, :router_dispatch, :exception],
      # [:phoenix, :error_rendered],
      # [:phoenix, :socket_connected],
      # [:phoenix, :channel_joined],
      # [:phoenix, :channel_handled_in],

      #
      # absinthe graphql
      #
      # [:absinthe, :execute, :operation, :start], # when the operation starts
      # [:absinthe, :execute, :operation, :stop], # when the operation finishes
      # [:absinthe, :subscription, :publish, :start], # when a subscription starts
      # [:absinthe, :subscription, :publish, :stop], # when a subscription finishes
      # [:absinthe, :resolve, :field, :start], # when field resolution starts
      # [:absinthe, :resolve, :field, :stop], # when field resolution finishes
      # [:absinthe, :middleware, :batch, :start], # when the batch processing starts
      # [:absinthe, :middleware, :batch, :stop], # when the batch processing finishes

      #
      # ecto
      # https://hexdocs.pm/ecto/Ecto.Repo.html#module-telemetry-events
      #
      [:d4, :repo, :query],

      # app
      [:d4_web, :graphql, :stop],
      [:d4, :repo, :black_word],
      [:d4, :repo, :black_email],
      [:d4, :authorization],
      [:d4_pay, :request, :stop],
      [:d4_pay, :request, :exception],
      # [:d4, :pglistener, :ok],
      [:d4, :pglistener, :error],
      [:d4_web, :rate_limit]
      # [:d4, :pg, :d4error]
    ]

    events =
      if Application.get_env(:d4, :prod?),
        do: events ++ [[:d4_web, :presences]],
        else: events

    unless Application.get_env(:d4, :test?) do
      :telemetry.attach_many("d4-telemetry", events, &D4Web.Telemetry.handle_event/4, nil)
    end
  end

  # #
  # #
  # #
  # def handle_event([:phoenix, :router_dispatch, :stop], %{duration: time}, %{conn: conn}, _) do
  #   path = conn.request_path

  #   if path not in ["/api/graphiql", "/favicon.ico", "/ping", "/sw.js", "/manifest.json"] do
  #     level  = if conn.status >= 500,
  #       do: :error,
  #     else: :info

  #     log level, "REQUEST", %{
  #       status: conn.status,
  #       path: path,
  #       method: conn.method,
  #       duration: parse_time(time)
  #     }
  #   end
  # end

  def handle_event([:phoenix, :router_dispatch, :stop], %{duration: time}, %{conn: conn}, _) do
    path = conn.request_path

    if String.starts_with?(path, "/wh/") do
      level =
        if conn.status == 200 || conn.status == 201,
          do: :info,
          else: :error

      log(level, "REQUEST", %{
        status: conn.status,
        path: path,
        method: conn.method,
        duration: parse_time(time)
      })
    end
  end

  #
  #
  #
  def handle_event(
        [:phoenix, :router_dispatch, :exception],
        _,
        %{kind: kind, reason: reason, stacktrace: stacktrace},
        _
      ) do
    log(:error, "REQUEST", %{
      exception: Exception.format(kind, reason, stacktrace)
    })
  end

  #
  # measurements:
  #  %{start: 10, stop: 10, time: 2}
  # metadata
  #  %{resolution: resolution, identifier: identifier, path: path}
  # TODO: distinguish cs_errors (warning) from errors (error)
  #
  def handle_event([:d4_web, :graphql, :stop], %{duration: time}, metadata, _config) do
    errors = %{}
    level = :info

    {level, errors} =
      if Enum.any?(metadata.errors || []),
        do: {:error, Map.merge(errors, %{errors: metadata.errors})},
        else: {level, errors}

    {level, errors} =
      if match?(%{ok: false}, metadata.value),
        do: {:warning, Map.merge(errors, %{cs_errors: metadata.value.errors})},
        else: {level, errors}

    data =
      Map.merge(errors, %{
        identifier: metadata.identifier,
        path: metadata.path,
        duration: parse_time(time)
      })

    log(level, "GRAPHQL", data)
  end

  #
  # measurements:
  #   %{decode_time: 4000, query_time: 6335000, queue_time: 74000, total_time: 6413000}
  # metadata:
  #   %{params: [1], source: "users", query: "the sql query string"}
  #
  #
  def handle_event([:d4, :repo, :query], measurements, metadata, _config) do
    query_time = measurements[:query_time] || 0
    queue_time = measurements[:queue_time] || 0

    if query_time + queue_time > @slow do
      idle_time = measurements[:idle_time] || 0

      log(:warning, "QUERY", %{
        slow: true,
        source: metadata.source,
        duration: parse_time(query_time),
        idle: parse_time(idle_time),
        queue: parse_time(queue_time),
        db: parse_time(query_time),
        sql: metadata.query,
        params: metadata.params
      })
    end
  end

  #
  #
  #
  def handle_event([:d4, :repo, :black_word], _, metadata, _config) do
    log(:warning, "BLACK_WORD", %{
      word: metadata.word,
      text: metadata.text
    })
  end

  #
  #
  #
  def handle_event([:d4, :repo, :black_email], _, metadata, _config) do
    log(:warning, "BLACK_EMAIL", %{email: metadata.email})
  end

  #
  #
  #
  def handle_event([:d4, :authorization], _, %{struct: struct, action: action}, _config) do
    log(:error, "AUTHORIZATION", %{
      action: action,
      struct: extract_struct(struct)
    })
  end

  #
  # measurements: %{total, joins, leaves}
  #
  def handle_event([:d4_web, :presences], measurements, _, _) do
    log(:info, "PRESENCES", measurements)
  end

  #
  # D4 PAY (MERCADO PAGO)
  #
  def handle_event(
        [:d4_pay, :request, :stop],
        %{duration: duration},
        %{response: response} = meta,
        _
      ) do
    sc = response.status_code
    kd = if sc in [200, 201], do: :info, else: :error
    # bd = if sc not in [200, 201], do: response.body, else: nil

    log(kd, "D4PAY_REQUEST", %{
      duration: parse_time(duration),
      path: meta[:path],
      status_code: sc
    })
  end

  def handle_event([:d4_pay, :request, :exception], %{duration: duration}, meta, _) do
    log(:error, "D4PAY_REQUEST", %{
      path: meta[:path],
      duration: parse_time(duration),
      exception: Exception.message(meta.reason),
      params: meta.params
    })
  end

  #
  # D4_PGLISTENER
  #
  def handle_event([:d4, :pglistener, :ok], _, meta, _) do
    log(:info, "D4_PGLISTENER", meta)
  end

  def handle_event([:d4, :pglistener, :error], _, meta, _) do
    log(:error, "D4_PGLISTENER", meta)
  end

  #
  # D4_WEB RATE_LIMIT
  #
  def handle_event([:d4_web, :rate_limit], _, %{key: {path, _}, limit: limit, name: name}, _) do
    log(:warning, "RATE_LIMIT", %{path: path, limit: limit, name: name})
  end

  def handle_event([:d4_web, :rate_limit], _, %{key: key, limit: limit, name: name}, _) do
    key = inspect(key)
    log(:warning, "RATE_LIMIT", %{key: key, limit: limit, name: name})
  end

  #
  # Handle no handler
  #
  def handle_event(event, measurements, metadata, _config) do
    log(:warning, "TELEMETRY", %{
      message: "No telemetry handler for #{inspect(event)} with #{inspect(measurements)} and metadata #{inspect(Map.keys(metadata))}"
    })
  end

  #
  #
  #
  defp log(level, what, telemetry) do
    Logger.log(level, fn -> {what, telemetry: telemetry} end)
  end

  defp parse_time(diff) do
    us = System.convert_time_unit(diff || 0, :native, :microsecond)
    div(us, 1000)
  end

  defp extract_struct(%_{id: id} = struct) do
    "#{D4.Helpers.get_type_name(struct)}:#{id}"
  end

  defp extract_struct(%_{} = struct) do
    "#{D4.Helpers.get_type_name(struct)}"
  end

  defp extract_struct(_) do
    ""
  end
end

Thank you very much for the tips and the book. I’ve never needed to remotely debug an Elixir/Erlang application; I’m going to read this book today. Thanks!

1 Like

Likely unrelated, but what’s the provisioned storage size for the DB?

Years ago, I ran into a situation where some flavors of RDS backed by EBS would limit disk IOPS based on size. We had an app that didn’t use a ton of data but was very read-heavy; it ran into similar weird “spikes” when a temporary load increase exhausted the available I/O.

2 Likes

Likely-unrelated disclaimer as well (kinda reminds me of this :slightly_smiling_face:), but what do you keep in the cache?

I use it as middleware for GraphQL (Absinthe) that keeps query results in cache for fast responses. For example, with “query usersSearch() {}”, a user does not need to receive a fully up-to-date list of registered profiles, so we keep that data cached for 30 minutes.

usersSearch can receive a list of filters, so the cache is per filter set and per logged-in user.

I have been working on this API since 2010 (initially in Rails); I rewrote everything in Elixir in 2015 or 2016. From 2010 until today, each month has set a new record in the number of accesses.

This API has been greatly refined over time, and today it has around 700 local tests. It’s been many months since the backend had any glitches or bugs, so I find what’s happening very strange.

2 Likes

@maxguzenski quick questions:

  • How many containers are running in the cluster when this happens? (With a pool of 100 and 10 running containers, that results in 1000 connections.)
  • What is the scale-up criterion? If it scales too fast it will put pressure on the connections to the DB (especially if they’re over SSL), which can cause some timeouts.
  • What is the connect timeout?
  • There are also some configs around idle DB connections (timeout to kill if idle) that can cause issues.
  • Depending on how heavily the app accesses the database, I would use a smaller connection pool (100 is too much IMO, without knowing more).
  • Is it running on ECS on EC2 or Fargate? EC2 may give you some hints about the OS (network buffers, open files, etc.) using sysctl/netstat and similar tools.

Maybe you already got past it.

The next questions would be around the Erlang VM, using Recon and some Erlang VM inspection tooling.

Cheers

Lots of great points here already, but not much on the OS level.

questions

  • any info on what is the bottleneck? storage? network? kernel?
  • is there a load balancer in between you and the users?
  • are there any errors reported in kernel logs at this time?
  • any sysctl or other tunables set for ulimit, max open files, …
  • are you running into tcp ephemeral port exhaustion? this is surprisingly common when running behind load balancers; you should monitor total ports, and what state they are in, at runtime, both in the kernel and in the elixir app (see the sketch just below this list).
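On the elixir side, a sketch of what to sample from a remote iex shell; hitting the VM’s port limit looks exactly like sudden mass connection failures:

# open ports vs. the VM limit (raise the limit with the +Q emulator flag)
{:erlang.system_info(:port_count), :erlang.system_info(:port_limit)}

# count just the TCP sockets among all open ports
:erlang.ports()
|> Enum.count(&(:erlang.port_info(&1, :name) == {:name, ~c"tcp_inet"}))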

I would check the networking side first, before looking at storage. A reasonable hypothesis is this:

  • you are running out of free tcp sockets for inbound connections and/or queued connections in the kernel, and they’re not being recycled fast enough by the OS
  • this could cause both the DB pool being unable to spawn new connections, and phoenix being unable to accept new user connections
  • user requests start piling up in process mailboxes, causing memory to balloon
  • everything rapidly turns to custard

TLDR

You need to check/fix all of: ulimits as seen by the beam.smp process; kernel backlog & accept queues; and phoenix acceptors, backlogs, and max requests.

  • check cat /proc/<pid>/limits for the elixir beam.smp pid to ensure ulimits are ok (fix in the systemd unit file, or via ulimit if not a systemd service)
  • check ss -s to see how many “active” connections there are at runtime
  • save full output of ss -tan somewhere for review later
  • check sysctl net.ipv4.ip_local_port_range and increase it sysctl -w net.ipv4.ip_local_port_range="4096 65535" to give more headroom
  • check ss -lnt to see the accept-queue usage on the listening socket phoenix is handling connections on
  • adjust phoenix to handle more active connections if required (a gen_tcp backlog of 1024 is a reasonable number; see the sketch below)
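For the backlog/acceptor knobs, a sketch of where they live when using the Bandit adapter mentioned upthread; the option names are Bandit/ThousandIsland’s, and the values are illustrative, not tuned recommendations:

config :d4, D4Web.Endpoint,
  http: [
    ip: {0, 0, 0, 0},
    port: 4000,
    thousand_island_options: [
      num_acceptors: 200,                  # more parallel acceptor processes
      transport_options: [backlog: 1024]   # passed through to :gen_tcp.listen/2
    ]
  ]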

Longer

NB: many caveats and handwavy inaccuracies, to keep it simple.

Incoming network connections arrive in the kernel, and are queued in several places.

  • the first kernel queue is the backlog queue, until the tcp handshake is completed (syn, then syn-ack reply, final ack from client)
  • the next kernel queue is the accept queue, waiting for your listening app to accept the next connection. The kernel will buffer 1.5x the maximum configured connections on behalf of your app. This default is either 128 or 4096 depending on the OS. Note that 1.5 * 4096 is 6144, which is awfully close to 5500. Could be a coincidence of course!
  • the final queue is the gen_tcp acceptors in the phoenix app; if there are no free ones (busy with existing requests), then we “push back”: requests start piling up in the buffer, then the accept queue itself overflows, then the backlog queue, until new requests for both receive and send get rejected.

If you are using the default port range, you’ll have around 28000 available sockets behind a NAT. If each user creates 2 tcp connections in, and a further temporary 1 or 2 out to the DB or elsewhere, then 5500 active users brings you pretty close to maxing that out. Possibly another coincidence!
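As a quick check of that arithmetic (assuming Linux’s default ephemeral range of 32768..60999):

# usable ephemeral ports behind one NAT'd address
60_999 - 32_768 + 1    # => 28_232
# sockets at peak, if each of 5_500 users holds ~2 inbound plus 2 outbound
5_500 * (2 + 2)        # => 22_000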

You may see this info in AWS LB metrics, or in whatever grafana or similar server metrics you collect too.

Probably something like this:

### what's the configured ephemeral port range
# sysctl net.ipv4.ip_local_port_range
### how many sockets are open atm? one of these commands should work for you
### they all give slightly different info
# netstat -nat
# ss -lnt
# ss -tan | awk '{print $4}' | cut -d':' -f2 | sort | uniq -c | sort -n

I wrote a bit about this in the past so start with that. Some details on tcp networking in general:

background

finding appropriate tunables / settings

can’t I just query how many ephemeral ports are left?

No. That complexity can wait for another day.

3 Likes

@posilva @dch

We believe we have fixed the problem. Monday will be the real test (as it is the day of the week with the most hits).

Problem summary:

Under high demand (more than 40k requests/minute), the app completely froze from one moment to the next, which could only be resolved with a complete docker container restart. RAM usage grew exponentially in a few seconds and the logs stopped being reported (Logger dropped them, but what could be seen a few seconds before they stopped was a large number of database connection errors).

Even requests that didn’t touch the database timed out, which prevented me from seeing any telemetry generated by the application (and I wasn’t using any external service to collect BEAM data).

The application runs in Docker on a c6i.4xlarge (only 1 instance) connecting to AWS Aurora (r6.4xlarge).

I immediately scaled the application to a c6i.12xlarge, and the database as well, hoping to buy time to investigate the problem. But the problem remained the same.

Initially what helped was increasing the db connection pool (ecto) from 200 to 2000, but even then the application needed to be restarted every 30 minutes.

It wasn’t an OS-level problem.

Solution:
A good database engineer improved some Aurora queries and configurations. This didn’t solve the problem entirely, but it increased the time the application stayed responsive.

A rate limit was implemented to prevent request abuse from some users. This helped the application run much longer (+/- 1 hour).
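For context, a minimal sketch of what such a plug can look like. The module below is hypothetical (the thread only shows a D4Web.RateLimit child and a [:d4_web, :rate_limit] telemetry event); it assumes a public named ETS table created at startup:

defmodule D4Web.RateLimitPlug do
  @moduledoc "Fixed-window per-path/per-IP rate limiting (illustrative only)."
  import Plug.Conn

  @limit 100             # max requests...
  @window 60_000         # ...per 60-second window
  @table :d4_rate_limit  # :ets.new(:d4_rate_limit, [:named_table, :public]) at startup

  def init(opts), do: opts

  def call(conn, _opts) do
    window = div(System.monotonic_time(:millisecond), @window)
    key = {conn.request_path, conn.remote_ip, window}

    # Atomically bump this caller's counter for the current window.
    count = :ets.update_counter(@table, key, 1, {key, 0})

    if count > @limit do
      :telemetry.execute([:d4_web, :rate_limit], %{}, %{key: key, limit: @limit, name: :fixed_window})
      conn |> send_resp(429, "Too Many Requests") |> halt()
    else
      conn
    end
  end
end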

But what really resolved it was reverting absinthe (GraphQL) from version 1.7.5 to 1.6.8, which had already been well tested in the past. Version 1.7 had been in use for a couple of months, but I believe something happens with absinthe under intense use that does not happen in version 1.6.x.

Now let’s wait for Monday, and if the issue is really solved, I will do performance tests with absinthe (or stay with version 1.6.8).

Thank you very much to everyone who helped!

4 Likes

@maxguzenski let us know if it fixed the issue. Maybe there are some N+1 query problems with the GraphQL dataloaders :thinking: . I am just curious

I rolled it back to version 1.6.8 of absinthe and 1.0.11 of dataloader, and everything just worked.
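The rollback amounts to pinning these two versions in mix.exs (a sketch; the rest of the deps list is elided):

defp deps do
  [
    {:absinthe, "1.6.8"},
    {:dataloader, "1.0.11"},
    # ...
  ]
end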

For now I haven’t had time to investigate further what causes the performance issue with the latest version of absinthe/dataloader.

There are no n+1 queries.

1 Like

@maxguzenski were you able to do any further investigation into the issues you were seeing with Absinthe 1.7.x? My team and I are currently facing issues that sound very similar to the ones you were having.