Memory leaking with long running processes

Hello.
I have an application in production under some load and it has a memory leak.
I’m pretty new to the Erlang world, but I tried to debug it using the recon library, and it looks like the problem is with binaries.
recon_alloc:memory(allocated_types) shows me that my binary_alloc and eheap_alloc increase all the time.
The application is quite simple.
At boot time my application loads some data from the DB and starts a process (GenServer) with some state (not big) for each record. At any moment there might be from 1 to 10 such long-running processes. I use Phoenix endpoints to listen for HTTP requests, and each request goes through all these long-lived processes with a payload in the format: %{"segments" => [%{"some_var" => "some bin data"}]}. I suspect that these binaries are not garbage collected, because they may still have references from those long-lived processes.
Might that be the reason? And how do I solve it?
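
For reference, this is roughly what I check from an attached IEx shell (assuming recon is in the deps):

# Allocator usage broken down per allocator type; in my case binary_alloc and
# eheap_alloc keep growing.
:recon_alloc.memory(:allocated_types)

# Totals actually used vs. allocated by the VM, for comparison.
:recon_alloc.memory(:used)
:recon_alloc.memory(:allocated)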

Elixir 1.4, Erlang 16.3

Are you sure? AFAIK Elixir can’t support OTP 16, since maps were only introduced in OTP 17.

The README of the 1.4.0 tag of the GitHub repository clearly states:

(Elixir requires Erlang 18.0 or later)

Also I think I’ve seen some kind of matrix once which described which version of Erlang was supported by which version of Elixir, but I can’t find it right now.


Aside from that, you can easily create space leaks (not memory leaks) when creating substrings from strings that are longer than 64 bytes. Therefore it is considered good practice to :binary.copy/1 strings you have extracted and want to keep around for longer than a few moments.

As a rule of thumb I do copy every string I pass into another function.
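
A minimal sketch of the kind of copy I mean (the module and function names here are made up):

defmodule Extract do
  # Pull a small slice out of a large payload. Without the copy, the returned
  # sub-binary would keep the whole payload alive for as long as it is referenced.
  def slice(<<_header::binary-size(16), interesting::binary-size(8), _rest::binary>>) do
    :binary.copy(interesting)
  end
end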

1 Like

My bad. I forgot that I include the Erlang VM in the build; it is 19.3.
So, do you suggest copying all strings in my map and then passing it further?

1 Like

Do you mean “another process”?

Since I do not know what the source of the strings in the map actually is, I can’t tell for sure. But if they are smaller parts of some huge string, copying might help.

But to be really sure, we would need to see much more code.

Similar problems can occur when you steadily eat from an input binary and then hold a reference to the very last sub-binary somewhere.
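
Roughly what that looks like, with a made-up chunk parser:

defmodule Eater do
  # Walk a large binary chunk by chunk. The value returned at the end is a
  # sub-binary, so whoever stores it also pins the whole original input.
  def last_chunk(bin) when byte_size(bin) <= 4, do: bin
  def last_chunk(<<_chunk::binary-size(4), rest::binary>>), do: last_chunk(rest)

  # Safer variant: detach the final slice from the big input before keeping it.
  def last_chunk_copied(bin), do: bin |> last_chunk() |> :binary.copy()
end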

And since this is a really complex matter, I pulled in 3 links, which I found randomly:

I read the GC 19.0 article myself about a year ago, and it helped me a lot to understand how garbage collection and the BEAM work. But I’m not sure anymore whether the binary heap was explained well there, so I included the other 2 links as well.

1 Like

Nope.

But of course, I think I have to explain my rule of thumb a bit better.

I copy strings and binaries explicitly when I have extracted them from larger ones, there is a chance that they are a “subslice” of the original binary pointing into the same region of memory, and I pass that substring either into another function that I have not written myself[1] or into a recursion that keeps the substring around for a while[2].

1: Perhaps the other library creates some GenServer and uses my string in its state, and will therefore block that memory area for a long time.
2: Well, maybe I have a GenServer which will hog that string for a while in its state :wink:

I do this much more often in functions that are public than in internal ones.

If I am sure a string will be discarded quickly, I do not copy it.
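
To illustrate footnote 2, a sketch with a made-up server that keeps a fragment in its state:

defmodule FragmentServer do
  use GenServer

  def init(_), do: {:ok, %{last_seen: nil}}

  def handle_cast({:remember, fragment}, state) do
    # `fragment` may be a sub-binary of a large request body; copying it here
    # means this long-lived state no longer pins the whole body in memory.
    {:noreply, %{state | last_seen: :binary.copy(fragment)}}
  end
end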

1 Like

That might indeed be the reason. A simple fix could be to hibernate each process after it does the work on the segment. So if it’s a GenServer, you can add :hibernate at the end of each tuple you return from handle_* callbacks (e.g. {:noreply, new_state, :hibernate}). This will trigger a GC of the GenServer, and thus the references to the binaries will be released.
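
For a handle_call it would look roughly like this (handle_request is just a placeholder for the real work):

def handle_call(request, _from, state) do
  reply = handle_request(request, state)  # placeholder for the real work
  # The extra :hibernate element makes the process garbage-collect and
  # hibernate right after the reply is sent.
  {:reply, reply, state, :hibernate}
end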

Hibernation will cause some performance penalty, but whether that’s significant or not depends on the load on the process being hibernated.

Note that you can also use :recon.bin_leak/1 to try to find out the offending processes.
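
For example, from an attached shell, something like:

# Garbage-collect all processes, then report the ones that released the most
# binary references; those are good candidates for the leak.
:recon.bin_leak(10)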

3 Likes

@NobbZ

In short, I receive JSON, do some minor manipulation (in the same process) and then pass it to all long-running processes:

Experiment.accept_segments?(pid, segments)

Where segments is something like: %{"browser" => "Explorer", "country" => "EN", "device" => "desktop", "transferdownloads" => 0, "transferuploads" => 0, "useraccounttype" => "free", "version" => "11.0"}
Actually it is part of the params that I have in the Phoenix controller.
And then

def accept_segments?(pid, segments) do
  GenServer.call(pid, {:check_segments, segments})
end

def handle_call({:check_segments, segments}, _caller, state) do
  {:reply, compare_rules(segments, state.rules), state}
end

defp compare_rules(_, []), do: true
defp compare_rules(segments, _) when segments == %{}, do: false
defp compare_rules(segments, _) when is_nil(segments), do: false
defp compare_rules(segments, rules) when map_size(segments) < length(rules), do: false
defp compare_rules(segments, rules) do
  result =
    Enum.map(rules, fn r ->
      with given_value = Map.get(segments, String.downcase(r.parameter)),
           {:ok, casted_val} <- cast_to_type(r.type, r.value) do
        apply(Kernel, String.to_atom(r.operator), [given_value, casted_val])
      else
        _err -> false
      end
    end)
    |> Enum.dedup()

  [true] == result
end

I thought about :hibernate, but all my processes should be very fast (I need to respond as fast as possible). Maybe I don’t have a huge load, so it could be an option for me; I have about 300 requests per second.

You could store the rules in ETS. That would remove the possibility of leaks in the first place, and remove the GenServer bottleneck.
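
A rough sketch of that approach (the table and function names are made up, and compare_rules stands in for the existing comparison logic):

defmodule Rules do
  @table :experiment_rules

  # Create the table once at application start; any process can read from it.
  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
  end

  def put(experiment_id, rules), do: :ets.insert(@table, {experiment_id, rules})

  # Callers evaluate the rules in their own process, so no long-lived process
  # ever holds a reference to the request binaries.
  def accept_segments?(experiment_id, segments) do
    case :ets.lookup(@table, experiment_id) do
      [{^experiment_id, rules}] -> compare_rules(segments, rules)
      [] -> false
    end
  end

  defp compare_rules(_segments, _rules), do: true  # stand-in for the real comparison
end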

My advice is to simply give it a try and see how it fares. Adding hibernation is fairly trivial (just append :hibernate to each return tuple). And with hibernation in place you can quickly check whether the problem is fixed (if not, then the leak likely happens someplace else). If the hibernation is fast enough for you, then cool. If not, you’ll need to think of another way.

Thanks!
Looks like hibernation helped. I will monitor it for a couple of days to be sure, but it looks promising.
Anyway, now I know more about GC :smile:

1 Like

It might be faster to just do a GC, :erlang.garbage_collect(), as hibernating does more work because it first packs the process together and then unpacks it on the next message. I don’t know if there is an Elixir call to this function.
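
A minimal example of calling it directly from Elixir (the Erlang module is reachable as :erlang):

# Force an immediate GC of the calling process:
:erlang.garbage_collect()

# A pid can also be passed to collect another process:
:erlang.garbage_collect(self())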

2 Likes