How Can I Overflow the Large Binary Space?

Based on this talk from @rvirding and in preparation for my upcoming ElixirConf talk, I’ve been purposefully trying to run into large binary space overflow problems. My hope is that doing so would help me show ways to diagnose it and, to some extent, treat it. Sadly, I can’t seem to expose the problem.

I tried code like this:

defmodule MemoryLeak do
  def send_binaries(0), do: Process.sleep(:infinity)
  def send_binaries(limit) do
    pid = spawn(__MODULE__, :receive_and_leak, [ ])
    big_binary = :crypto.strong_rand_bytes(1024 * 1024)
    send(pid, big_binary)
    send_binaries(limit - 1)
  end

  def receive_and_leak do
    receive do
      big_binary -> big_binary |> byte_size |> IO.puts
    end
    receive_and_leak()
  end
end

MemoryLeak.send_binaries(100_000)

I had hoped that the receiving processes wouldn’t garbage collect since they never do anything else, but I have seen memory dropping while it runs.
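One way to watch what is actually happening while that loop runs is to poll the VM-wide binary allocation from a helper process. A rough sketch (the BinWatcher name is only for illustration):

defmodule BinWatcher do
  # Print the total bytes currently allocated for binaries, once a second.
  def watch do
    IO.puts("binary space: #{:erlang.memory(:binary)} bytes")
    Process.sleep(1_000)
    watch()
  end
end

spawn(&BinWatcher.watch/0)

That at least makes it obvious whether the binary space is growing or being reclaimed.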

Does anyone know of some code I can use to see the issue?

1 Like

I think you need to keep the binary in the message queue (i.e. send it, but do not receive it in the spawned process). But that may just overflow the queue; a sketch of that variant follows the snippet below. Another approach would be to pass the binary to the second invocation of receive_and_leak, i.e.

def receive_and_leak do
  receive do
    big_binary -> big_binary |> byte_size |> IO.puts
  end
  receive_and_leak(big_binary)
end

def receive_and_leak(old_binary) do
  receive do
    big_binary -> big_binary |> byte_size |> IO.puts
  end
  receive_and_leak(old_binary)
end
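For the first idea (sending without ever receiving), a minimal sketch, with names of my own choosing, would be:

def send_and_never_receive(0), do: Process.sleep(:infinity)
def send_and_never_receive(limit) do
  # The spawned process only sleeps, so the message (and the binary it
  # carries) stays pinned in its queue and cannot be reclaimed.
  pid = spawn(fn -> Process.sleep(:infinity) end)
  send(pid, :crypto.strong_rand_bytes(1024 * 1024))
  send_and_never_receive(limit - 1)
end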

As far as I know, there is no way to “sleep” a process such that it does not do garbage collection.

1 Like

IIRC, you might see it if you slice out a small piece of the binary in the receiving process and store it in your state. Insidious, that one.
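Something like this tweak to the original receive_and_leak would exercise that (a sketch; the slice size is arbitrary):

def receive_and_leak(refs \\ []) do
  receive do
    big_binary ->
      # binary_part/3 returns a sub-binary that keeps a reference to the
      # whole 1 MB binary alive; :binary.copy/1 on the slice would release it.
      slice = binary_part(big_binary, 0, 10)
      receive_and_leak([slice | refs])
  end
end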

2 Likes

One way I know how to do this is to use 2 nodes. The first node will send large binary messages to the second. Start up a few thousand processes on each node, “linked” together so that each process on node A knows about a process on node B. Now create a 1 MB binary on node A which all the processes send to their respective processes on node B. Eventually node B will crash.

The reason is that even though there is only one binary on node A, every time it is sent it results in a new copy on node B, and eventually you run out of binary space.

3 Likes

It seems to me that big_binary has a chance to be garbage collected. When you spawn receive_and_leak, it waits for you to send it the binary; the current process does that and then moves on to the next tail-recursive call of send_binaries/1. At this point the current process no longer has access to big_binary, but the spawned process does. As soon as it runs its receive block, though, it prints the size and then loops, losing access to big_binary and allowing garbage collection. I’d say you should make the receive return big_binary and pipe it back into receive_and_leak by adding a default parameter of nil. That will mean it’s blocked waiting for something that will never come, but it still has access to big_binary in its parameter…
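In code, that tweak would look something like this (a sketch):

def receive_and_leak(_held \\ nil) do
  big_binary =
    receive do
      bin ->
        bin |> byte_size |> IO.puts
        bin
    end

  # The recursive call keeps big_binary referenced forever, since nothing
  # else will ever arrive in the mailbox.
  receive_and_leak(big_binary)
end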

I don’t think @bbense’s example would work, since he didn’t grab big_binary (and I believe the scope won’t leak it out of the receive) before passing it on to the next iteration of receive_and_leak. But this is the same idea he mentioned.

1 Like

I tried creating a file called shared.ex containing:

defmodule Shared do
  def local(node) do
    link = Node.spawn_link(node, &remote/0)
    receive do
      {:forward, big_binary} ->
        send(link, big_binary)
        Process.sleep(:infinity)
    end
  end

  def remote do
    receive do
      big_binary ->
        IO.puts "Received #{byte_size(big_binary)}."
        Process.sleep(:infinity)
    end
  end
end

Then I started a node in one shell with:

$ elixir --sname bar@localhost -r shared.ex --no-halt

In another shell I launched a script with this command:

$ elixir --sname foo@localhost -r shared.ex memory_leak.exs 

Here’s the memory_leak.exs script:

node = :bar@localhost
Node.connect(node)
big_binary = :crypto.strong_rand_bytes(1024 * 1024)

Stream.repeatedly(fn ->
  spawn_link(fn ->
    Shared.local(node)
  end)
end)
|> Enum.take(10_000)
|> Enum.each(fn pid ->
  send(pid, {:forward, big_binary})
end)
Process.sleep(:infinity)

This didn’t work at all; one node held steady at 45 MB while the other stopped at 71 MB.

I still can’t seem to trigger this problem… which is kind of encouraging! :slight_smile:

1 Like

This module seemed to do it.

defmodule Bins do
  @moduledoc """
  Start 2 nodes, a sender node and a receiver node.
  Binaries are sent from sender to receiver node.
  """

  @doc "Start the run."
  def start(rnode, count, size) do
    bin = :erlang.list_to_binary(:lists.duplicate(size, 42))
    :rpc.call(rnode, Bins, :receivers, [count])
    senders(rnode, count, bin)
  end

  @doc "Start the senders."
  def senders(rnode, count, bin) when count > 0 do
    spawn(fn () -> sender(rnode, count, bin) end)
    senders(rnode, count - 1, bin)
  end
  def senders(_rnode, 0, _bin) do :ok end

  # Sender process.
  defp sender(rnode, number, bin) do
    sender_loop(rnode, receiver_name(number), bin)
  end

  defp sender_loop(rnode, name, bin) do
    send({name,rnode}, bin)
    sender_loop(rnode, name, bin)
  end

  @doc "Start the receivers."
  def receivers(count) when count > 0 do
    spawn(fn () -> receiver(count) end)
    receivers(count - 1)
  end
  def receivers(0) do :ok end

  # Receiver process.
  defp receiver(number) do
    Process.register(self(), receiver_name(number))
    receiver_loop()
  end

  defp receiver_loop() do
    receive do
      _msg -> receiver_loop()
    end
  end

  defp receiver_name(number) do
    :erlang.list_to_atom(:erlang.integer_to_list(number))
  end

end

Compile it, and run 2 distributed nodes, I call them s and r. On the sender node s do

Bins.start(:r@renat, 200000, 100000)

and you can see that the binary space on r keeps growing until the node crashes. The reason is that the receiver processes never create any data of their own, so they never garbage collect, which means the binaries are never reclaimed. On my machine the receiver node grew to about 60 GB before it crashed.
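One way to confirm that (and to claw the space back by hand) is to force a collection on every process from a shell on the receiver node; a rough sketch:

# Run in an IEx shell on the receiver node.
before = :erlang.memory(:binary)
Enum.each(Process.list(), &:erlang.garbage_collect/1)
IO.puts("binary space: #{before} -> #{:erlang.memory(:binary)} bytes")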

My machine is called renat which means that the nodes get the names :r@renat and :s@renat.

1 Like

Thanks so much for your help!