Too many processes error

chungwong · May 6, 2019, 11:55pm

From time to time I get a strange error in production.

I am not sure which system limit it refers to. The server runs Debian 9 with Nginx.

Has anyone had a similar experience?


[error] Ranch protocol :error of listener MyWeb.Endpoint.HTTP (cowboy_protocol) terminated
** (exit) :system_limit
[error] Too many processes

[error] Ranch protocol #PID<0.15000.5164> of listener MyWeb.Endpoint.HTTP (cowboy_protocol) terminated
** (exit) exited in: Phoenix.Endpoint.CowboyWebSocket.resume()
    ** (EXIT) an exception was raised:
        ** (SystemLimitError) a system limit has been reached
            :erlang.spawn_opt(:proc_lib, :init_p, [#PID<0.15000.5164>, [], :gen, :init_it, [:gen_server, #PID<0.15000.5164>, #PID<0.15000.5164>, Phoenix.Channel.Server, {%Phoenix.Socket{assigns: %{user_id: nil}, channel: MyWeb.FrontendChannel, channel_pid: nil, endpoint: <MyWeb.Endpoint, handler: MyWeb.FrontendSocket, id: nil, join_ref: "178", joined: false, private: %{log_handle_in: :debug, log_join: :info}, pubsub_server: My.PubSub, ref: nil, serializer: Phoenix.Transports.V2.WebSocketSerializer, topic: "frontend:lobby", transport: Phoenix.Transports.WebSocket, transport_name: :websocket, transport_pid: #PID<0.15000.5164>, vsn: "2.0.0"}, %{}, #PID<0.15000.5164>, #Reference<0.1744046310.3617325059.238298>}, []]], [:link])
            (stdlib) proc_lib.erl:344: :proc_lib.start_link/5
            (phoenix) lib/phoenix/channel/server.ex:22: anonymous fn/2 in Phoenix.Channel.Server.join/2
            (my) lib/my_web/endpoint.ex:1: MyWeb.Endpoint.instrument/4
            (phoenix) lib/phoenix/socket/transport.ex:269: Phoenix.Socket.Transport.do_dispatch/3
            (phoenix) lib/phoenix/transports/websocket.ex:123: Phoenix.Transports.WebSocket.ws_handle/3
            (phoenix) lib/phoenix/endpoint/cowboy_websocket.ex:77: Phoenix.Endpoint.CowboyWebSocket.websocket_handle/3
            (cowboy) /home/sydneytools/my/deps/cowboy/src/cowboy_websocket.erl:588: :cowboy_websocket.handler_call/7

benwilson512 · May 7, 2019, 12:17am

This is referring to the default VM limit. Can you tell us more about your application? Do you spawn processes explicitly yourself? Do you handle a very large number of websocket connections?

cmkarlsson · May 7, 2019, 12:18am

There is a limit on how many processes can exist at the same time in the BEAM. The default value is 262144. If you go over this limit you will get the “Too many processes” error message.

You can raise the limit with the +P NUM switch (e.g. iex --erl '+P 134217727')

On the other hand, you may want to investigate why you have such a large number of processes, as it is quite hard to reach the limit even with the default value.
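A quick way to see how close a node is to that limit is to ask the VM directly. This is a minimal sketch that can be pasted into an IEx or remote console session; it uses only :erlang.system_info/1, and the percentage formatting is purely illustrative.

```elixir
# Compare the number of live processes with the VM-wide process limit.
count = :erlang.system_info(:process_count)
limit = :erlang.system_info(:process_limit)

percent = Float.round(count / limit * 100, 2)
IO.puts("processes: #{count}/#{limit} (#{percent}% of the limit)")
```

Sampling this every so often shows whether the count sits flat or keeps growing, which is usually the first hint of a leak.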

chungwong · May 7, 2019, 4:06am

I don’t use websocket connections much.
There is only one websocket URL:

wss://domain.com.au/frontend-socket/websocket?vsn=2.0.0

presence.ex

defmodule MyWeb.Presence do
  use Phoenix.Presence,
    otp_app: :My,
    pubsub_server: My.PubSub

  def fetch(_topic, _entries) do
    %{}
  end
end

frontend_socket.ex

defmodule MyWeb.FrontendSocket do
  use Phoenix.Socket
  use Absinthe.Phoenix.Socket, schema: MyWeb.Graphql.Schema

  channel("frontend:*", MyWeb.FrontendChannel)

  transport(:websocket, Phoenix.Transports.WebSocket)
  def connect(params, socket) do
    {:ok, assign(socket, :user_id, params["user_id"])}
  end

  def id(_socket), do: nil
end

frontend_channel.ex

defmodule MyWeb.FrontendChannel do
  use Phoenix.Channel

  alias MyWeb.Presence

  def join("frontend:lobby", _message, socket) do
    {:ok, %{}, socket}
  end

  def handle_info(:after_join, socket) do
    Presence.track(socket, socket.assigns.user_id, %{
      device: "browser",
      online_at: DateTime.utc_now()
    })

    push(socket, "presence_state", Presence.list(socket))
    {:noreply, socket}
  end

  def handle_in("new_version", %{"body" => body}, socket) do
    {:noreply, body, socket}
  end
end

I also get this warning every few minutes:
[warn] Ignoring unmatched topic "frontend:lobby" in MyWeb.FrontendSocket

Not sure if this is related.

benwilson512 · May 7, 2019, 12:48pm

Do you explicitly spawn any processes?

Do you have a sense of how many connected users you have?

Also: Elixir, Phoenix versions?

chungwong · May 8, 2019, 12:24am

Elixir 1.6.5 and Phoenix 1.3.2

I checked the number of active users in Google Analytics: on average we have no more than 2000 users per hour.
There were also occasions with 4-5k users per hour when I didn’t see the error at all.

I have two identical servers hosting the same application for load balancing, and both of them were getting the “Too Many Processes” error.

"explicitly spawn any processes": does that include GenServer and Task.async? I do use GenServer and Task.async, but only occasionally, nothing major.

Is there a way for me to debug the problem? For example, can I start a remote_console on the instance and monitor how many processes are spawned?

benwilson512 · May 8, 2019, 2:51am

Yes, that includes GenServer and Task.async. Can you describe or show the code that starts these, and what they do?

As for debugging: exactly this. If you’re using releases you can just run bin/my_app remote_console. From there, I usually have the observer_cli Hex package installed so I can run :observer_cli.start() and take a look around.
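If observer_cli is not installed, a rough census can be taken from the same remote console with the standard library alone. The sketch below groups live processes by their entry point so that a runaway process type stands out; ProcCensus is a hypothetical helper name, not part of any library.

```elixir
defmodule ProcCensus do
  @moduledoc "Hypothetical helper: counts live processes by their initial call."

  # Returns the `n` most common initial calls as {mfa, count} tuples.
  def top(n \\ 10) do
    Process.list()
    |> Enum.reduce(%{}, fn pid, acc ->
      Map.update(acc, initial_call(pid), 1, &(&1 + 1))
    end)
    |> Enum.sort_by(fn {_call, count} -> -count end)
    |> Enum.take(n)
  end

  # Processes started through proc_lib (GenServer, Supervisor, Task) store
  # their real entry point in the process dictionary under :"$initial_call".
  defp initial_call(pid) do
    case Process.info(pid, [:dictionary, :initial_call]) do
      nil ->
        # The process exited between Process.list/0 and Process.info/2.
        :dead

      info ->
        case List.keyfind(info[:dictionary], :"$initial_call", 0) do
          {_key, mfa} -> mfa
          nil -> info[:initial_call]
        end
    end
  end
end

ProcCensus.top() |> Enum.each(&IO.inspect/1)
```

Running it twice a few minutes apart and comparing the counts shows which kind of process is accumulating.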

chungwong · May 8, 2019, 6:39am

These are the top 4 types of processes I found.
The supervisor:Elixir.Task.Supervisor/1 entries look abnormal to me.

|System     | Count/Limit           | System Switch             | Status                | Memory Info          | Size                     |
|Proc Count | 1806/262144           | Smp Support               | true                  | Allocted Mem         | 214.1133 MB     | 100.0% | 
|Port Count | 515/65536             | Multi Scheduling          | enabled               | Use Mem              | 134.0921 MB     | 62.63% |
|Atom Count | 46732/1048576         | Logical Processors        | 8                     | Unuse Mem            | 80.0212 MB      | 37.37% |

|No | Pid        |     Memory   |Name or Initial Call                  |           Reductions| MsgQueue |Current Function                 
|19 |<0.26478.0> |   66.3359 KB |cowboy_protocol:init/4                |                 4876| 0        |cowboy_websocket:handler_loop/4  |
|610|<0.2935.0>  |    6.8125 KB |ranch_acceptor:loop/3                 |                76428| 0        |prim_inet:accept0/2              |
|719|<0.28352.0> |    5.7109 KB |supervisor:Elixir.Task.Supervisor/1   |                  158| 0        |gen_server:loop/7                |
|908|<0.29139.0> |    2.8828 KB |Elixir.Phoenix.Channel.Server:init/1  |                   72| 0        |gen_server:loop/7                |

I have only 3 functions that use Task.Supervisor.
Do I have to destroy the tasks manually?

  @impl true
  @doc """
  Retrieves products from Neo4j. Waits at most 2s for the results; if the query takes longer,
  returns an empty list [] as a dummy value.
  """
  def handle_call({:get_suggestions, product}, _from, state) do
    products =
      query = """
      # trimmed
      """
      try do
        {:ok, pid} = Task.Supervisor.start_link()

        Task.Supervisor.async(pid, fn ->
          # query Neo4j database
          Bolt.query!(Bolt.conn(), query)
          |> Repo.all()
        end)
        |> Task.yield(2000)
      rescue
        _ -> []
      catch
        _ -> []
      end
    {:reply, products, state}
  end

  @impl true
  @doc """
  Insert entries to Neo4j
  """
  def handle_cast({:upsert_products, products}, state) do
    product_map = get_product_map()

    query = """
    # trimmed
    """
    {:ok, pid} = Task.Supervisor.start_link()

    Task.Supervisor.async(pid, fn ->
      Bolt.query!(Bolt.conn(), query, %{products: filter_product_keys(products)})
    end)
    |> Task.yield()

    {:noreply, state}
  end

  @impl true
  @doc """
  Insert entries to Neo4j
  """
  def handle_cast({:upsert_orders, ids, opts}, state) do
     # trimmed codes
     # trimmed codes

     {:ok, pid} = Task.Supervisor.start_link()

     Task.Supervisor.async(pid, fn ->
        orders =
          from(
            o in order_query,
            where: o.id in ^ids
          )
          |> Repo.all()
          Bolt.query!(Bolt.conn(), query, %{orders: filter_order_keys(orders)})
      end)
    {:noreply, state}
  end

idi527 · May 8, 2019, 6:49am

Sorry for going off-topic, but I just wanted to note that instead of wrapping your task in a try block, you can use a non-linked task with async_nolink. I’m also not sure try actually does what you want: it can’t catch exits from crashed linked processes, so you’d need to be trapping exits to keep a crashing task from bringing down the caller.

Now for your actual problem: you seem to be starting a task supervisor for every neo4j query and never stopping it. You can probably start it only once and reference it in the tasks by its registered name.
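The two suggestions above can be sketched together as follows. This is only an illustration under assumptions: Demo.TaskSupervisor and Demo.Suggestions are made-up names, the anonymous functions stand in for the real Neo4j query, and the supervisor is started inline here although in an application it would be a child in the supervision tree.

```elixir
# Start the task supervisor once, under a registered name (hypothetical name).
{:ok, _sup} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

defmodule Demo.Suggestions do
  # Runs `fun` in a task that is NOT linked to the caller and waits at most
  # `timeout` ms. Returns `default` if the task is too slow or crashes.
  def fetch(fun, timeout \\ 2_000, default \\ []) do
    task = Task.Supervisor.async_nolink(Demo.TaskSupervisor, fun)

    # Task.yield/2 returns nil on timeout; Task.shutdown/1 then stops the task.
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> result
      _timeout_or_exit -> default
    end
  end
end

# A fast task, a task that misses the deadline, and a task that crashes.
fast = Demo.Suggestions.fetch(fn -> [:a, :b] end)
slow = Demo.Suggestions.fetch(fn -> Process.sleep(500) && [:late] end, 50)
crashed = Demo.Suggestions.fetch(fn -> raise "db down" end)

IO.inspect({fast, slow, crashed})
```

The crashing task still logs its error, but the caller only sees the default value, so no try block is needed around the task.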

benwilson512 · May 8, 2019, 11:37am

This is your problem. Tasks do not need to be cleaned up; they terminate when they’re done. However, you’re also spawning a dedicated task supervisor every time, and those DO NOT terminate on their own. You should have one task supervisor in your supervision tree and then spawn all the tasks under that instead of spawning an infinite number of task supervisors.
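The difference is easy to observe in a scratch script. The sketch below first repeats the per-call pattern from the question and then uses a single named supervisor started under a small supervision tree; Demo.SharedTaskSupervisor is a hypothetical name, and the process-count deltas are approximate because other processes on the node come and go.

```elixir
count = fn -> :erlang.system_info(:process_count) end

# Pattern from the question: a new supervisor per call, never stopped.
before_leak = count.()

leaked =
  for _ <- 1..100 do
    {:ok, sup} = Task.Supervisor.start_link()
    sup |> Task.Supervisor.async(fn -> :done end) |> Task.await()
    sup
  end

Process.sleep(50) # give finished tasks a moment to exit
IO.puts("per-call supervisors: #{count.() - before_leak} processes left behind")

# Suggested fix: one named supervisor in a supervision tree, reused by all tasks.
children = [{Task.Supervisor, name: Demo.SharedTaskSupervisor}]
{:ok, _tree} = Supervisor.start_link(children, strategy: :one_for_one)

before_shared = count.()

for _ <- 1..100 do
  Demo.SharedTaskSupervisor
  |> Task.Supervisor.async(fn -> :done end)
  |> Task.await()
end

Process.sleep(50)
IO.puts("shared supervisor: #{count.() - before_shared} processes left behind")
```

In the first half every one of the 100 supervisors is still alive after its task has finished, which is the kind of slow accumulation that eventually hits the process limit.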

alvises · May 8, 2019, 11:57am

Are you using Task because you don’t want a Bolt driver exception to kill your GenServer?
You can use Bolt.query (without the bang !) which returns {:ok, _} or {:error, _} without the need of spawning processes to isolate the query.

(I don’t know Bolt, but I’ve just found this documentation: https://hexdocs.pm/bolt_sips/Bolt.Sips.html#query/2)

NobbZ · May 8, 2019, 1:21pm

No, this looks more like delegating work to externally spawned processes to not block the GenServer longer than necessary.

alvises · May 8, 2019, 1:41pm

In the handle_call and the first handle_cast the GenServer is blocked, waiting for the result with Task.yield (which, if I remember correctly, times out after 5 seconds).

NobbZ · May 8, 2019, 1:43pm

Oh yes, I haven’t seen the yields…

chungwong · May 8, 2019, 11:03pm

Thanks all for the help, and @benwilson512 is as helpful as always, in addition to Absinthe.

@idi527 and @alvises are right: I do not want any errors from these “bonus” features to crash the main application.
Bolt.query/2 does not handle the :nxdomain error (wrong IP/domain or the like).

I do have Process.flag(:trap_exit, true) in my GenServer to prevent errors from crashing my main application.
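For reference, this is roughly what trapping exits does in a GenServer: when a linked process dies, the server receives an {:EXIT, pid, reason} message in handle_info/2 instead of crashing. The module below is a self-contained sketch with a hypothetical name (Demo.Trapper); it does not reproduce the application code from this thread.

```elixir
defmodule Demo.Trapper do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, [])
  def crash_linked(server), do: GenServer.call(server, :crash_linked)
  def exits(server), do: GenServer.call(server, :exits)

  @impl true
  def init(_arg) do
    # Turn exit signals from linked processes into ordinary messages.
    Process.flag(:trap_exit, true)
    {:ok, []}
  end

  @impl true
  def handle_call(:crash_linked, _from, exits) do
    # A linked process that dies abnormally right away.
    spawn_link(fn -> exit(:boom) end)
    {:reply, :ok, exits}
  end

  def handle_call(:exits, _from, exits), do: {:reply, exits, exits}

  @impl true
  def handle_info({:EXIT, _pid, reason}, exits) do
    # Without trap_exit this signal would have terminated the server.
    {:noreply, [reason | exits]}
  end
end

{:ok, server} = Demo.Trapper.start_link()
:ok = Demo.Trapper.crash_linked(server)
Process.sleep(50) # let the exit signal arrive
recorded = Demo.Trapper.exits(server)
IO.inspect(recorded)
```

Note that trapping exits keeps the server alive but does nothing about processes that never exit, such as supervisors started on every call.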

alvises · May 9, 2019, 10:36am

Does Bolt.query/2 spawn a process linked to the genserver? Can’t you just catch the exception in a try without spawning the task?
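If the failure does happen in the calling process, a try can cover it, but the clauses matter: rescue handles raised exceptions, a bare catch clause handles only throws, and exits need an explicit catch :exit clause. A minimal sketch, with Demo.Safe as a hypothetical wrapper and exit(:nxdomain) standing in for whatever the driver actually does:

```elixir
defmodule Demo.Safe do
  # Runs `fun` in the current process and turns any failure into a tagged tuple.
  def run(fun) do
    try do
      {:ok, fun.()}
    rescue
      exception -> {:error, {:raised, exception}}
    catch
      :throw, value -> {:error, {:thrown, value}}
      :exit, reason -> {:error, {:exit, reason}}
    end
  end
end

ok = Demo.Safe.run(fn -> 1 + 1 end)
raised = Demo.Safe.run(fn -> raise ArgumentError, "bad host" end)
exited = Demo.Safe.run(fn -> exit(:nxdomain) end)
thrown = Demo.Safe.run(fn -> throw(:oops) end)

IO.inspect([ok, raised, exited, thrown])
```

This only covers failures raised in the same process; a crash in a separate linked process arrives as an exit signal and cannot be caught this way.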