Got "Killed" message in iex, followed by exit to system prompt (no IEx restart). How/why?

Hi,

I ran 2 concurrent GenServer pipelines to process a few million records overnight. These pipelines are meant to be always running (after processing, take a break, then check for more records, and repeat).

Coming back to my iex session in the morning, I see some expected log output, followed by

Killed

Followed by my system shell input prompt: in other words, IEx had exited. The Killed message doesn’t appear in the logs, so it looks like it originated from an IO.puts call or similar. In addition, this doesn’t come from my application code.

There are no crashes or unhandled errors, and IEx didn’t restart or give any other indication of something being “wrong”.

Does anybody have any info that would help me look into this? As you can imagine, searching for “Killed” along with any Elixir/Erlang/iex terms yields many results, but none that are relevant…

Is that on a Linux server? Perhaps it was killed for using too much memory: when the Linux OOM killer SIGKILLs a foreground process, the shell prints “Killed” and drops you back to the prompt, which would match what you saw.

Hi @david_ex, in these circumstances you generally need to look at the logs of whatever is managing your running Elixir release, e.g. systemd, K8s, or whatever.

Based on your described situation I agree with @dom, it probably used all available memory and was killed by the OS.
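If you want to confirm that from inside the VM next time, something along these lines can make the growth visible in your logs before the OS steps in (just a minimal sketch; the MemoryLogger name and the one-minute interval are made up):

defmodule MemoryLogger do
  use GenServer

  require Logger

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(_opts) do
    schedule()
    {:ok, nil}
  end

  @impl GenServer
  def handle_info(:report, state) do
    # :erlang.memory/0 reports bytes per category (:total, :processes, :binary, :ets, ...)
    Logger.info("BEAM memory: #{inspect(:erlang.memory())}")
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :report, 60_000)
end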

Thanks @benwilson512 and @dom!

It looks like that is indeed the problem: running the pipelines while looking at the load charts in Observer, I can see that the “processes” memory steadily inches upwards…

As I’m doing string manipulation within the pipeline (essentially parsing text and returning lists of text fragments), it looks like I need to learn more about binary management and garbage collection (e.g. Memory leaking with long running processes). So if you’ve got any pointers there (or more generally on debugging and figuring out what is taking up memory), I’m all ears!

Thanks again for pointing me in the right direction, it probably would have taken me a while to figure out the OS was killing the IEx session…

Glad we could help. Do note that if it’s the processes memory you see increasing and not the binary memory, then it isn’t a binary memory leak, it’s just a good old-fashioned issue with holding on to too much stuff. Without knowing more about how your GenServer works or the code it’s running, though, it’s hard to suggest more than that.
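One quick way to see which processes are the heavy ones, and whether they are dragging big refc binaries around, is something like this in IEx (a sketch; the variable names are ad hoc):

top_by_memory =
  Process.list()
  |> Enum.map(fn pid ->
    case Process.info(pid, [:registered_name, :memory, :binary]) do
      # the process may have exited between Process.list/0 and Process.info/2
      nil ->
        nil

      info ->
        # :binary is a list of {id, size, refcount} tuples for the off-heap
        # binaries this process currently references
        binary_bytes = info[:binary] |> Enum.map(fn {_id, size, _refc} -> size end) |> Enum.sum()
        {pid, info[:registered_name], info[:memory], binary_bytes}
    end
  end)
  |> Enum.reject(&is_nil/1)
  |> Enum.sort_by(fn {_pid, _name, memory, _binary_bytes} -> memory end, :desc)
  |> Enum.take(5)

If :memory keeps growing while the binary totals stay flat, it’s process state/heap rather than leaked sub-binaries.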


Yeah, it’s the “processes” memory usage that increases.

The GenStage pipeline is pretty straightforward:

  1. Producer loads batches of records from DB
  2. A ConsumerSupervisor consumes these records and starts an EventProcessor process for each (a rough sketch of this wiring follows the list)
  3. The event processor processes the event (parsing text, etc.), then sends a notification event to the process managing the entire pipeline. Upon receiving an ack message from the managing process, the event processor shuts down.
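For reference, step 2 is wired up roughly like this (a reconstructed sketch, not the exact code; the Producer module name and the max_demand value are placeholders). ConsumerSupervisor appends each event to the child’s start arguments, which is what the start_link(args, event) clause below is for:

defmodule IngestionEngine.Pipeline.EventConsumer do
  use ConsumerSupervisor

  def start_link(args), do: ConsumerSupervisor.start_link(__MODULE__, args)

  @impl ConsumerSupervisor
  def init(args) do
    children = [
      %{
        id: IngestionEngine.Pipeline.EventProcessor,
        start: {IngestionEngine.Pipeline.EventProcessor, :start_link, [args]},
        # :transient so a :normal exit after the ack doesn't restart the child
        restart: :transient
      }
    ]

    ConsumerSupervisor.init(children,
      strategy: :one_for_one,
      subscribe_to: [{IngestionEngine.Pipeline.Producer, max_demand: 50}]
    )
  end
end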

It appears the event processor is where I have some issues, process-wise, as Observer is reporting 746MB in “memory” which is just nuts. The event handling module is quite straightforward:

defmodule IngestionEngine.Pipeline.EventProcessor do
  use GenServer

  require Logger

  @ack_timeout 6_000

  def start_link(args, event), do: start_link(Keyword.put(args, :event, event))

  def start_link(args) do
    GenServer.start_link(__MODULE__, args)
  end

  @impl GenServer
  def init(args) do
    state =
      args
      |> Keyword.take([:report_to, :handler, :event])
      |> Enum.into(%{})
      |> Map.put(:ack_await_ref, nil)

    {:ok, state, {:continue, :handle_event}}
  end

  @impl GenServer
  def handle_continue(:handle_event, state), do: handle_event(state)

  defp handle_event(%{handler: handler, event: event} = state) do
    result = handler.(event)

    case Map.get(state, :report_to) do
      nil ->
        stop(state)

      report_to ->
        send(report_to, {:event_processed, self(), %{event: event, result: result}})
        ack_await_ref = Process.send_after(self(), :ack_not_received, @ack_timeout)

        {:noreply, Map.put(state, :ack_await_ref, ack_await_ref)}
    end
  end

  @impl GenServer
  def handle_info(:retry, state) do
    state
    |> cancel_timer()
    |> handle_event()
  end

  def handle_info(:ack_event_processed, state) do
    stop(state)
  end

  def handle_info(:ack_not_received = reason, state) do
    stop(state, reason)
  end

  def handle_info(msg, state) do
    Logger.warn("Unhandled info message: #{inspect(msg)}")
    {:noreply, state}
  end

  defp stop(state, reason \\ :normal) do
    {:stop, reason, cancel_timer(state)}
  end

  defp cancel_timer(%{ack_await_ref: nil} = state), do: state

  defp cancel_timer(%{ack_await_ref: ref} = state) when is_reference(ref) do
    Process.cancel_timer(ref)
    Map.put(state, :ack_await_ref, nil)
  end
end

So I believe the issue is likely in the handler function: Observer tells me the problem processes are spending a lot of time in erlang:re_run_trap (which seems to be regex-related, presumably the internal continuation re:run uses when a long-running match yields back to the scheduler, as I haven’t found much info on it) as well as Elixir.String:do_replace, so I’m probably doing something dumb in there… I’ll probably start by adding a send_after in the event processor to crash and dump the current state if the process runs for too long, to see if I can reproduce the issue with a given set of data.
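That last idea would look roughly like this, added to the EventProcessor (a sketch; the 5-minute timeout and the :watchdog message name are arbitrary):

# in init/1, alongside the existing setup:
#   Process.send_after(self(), :watchdog, :timer.minutes(5))

# placed above the catch-all handle_info/2 clause:
def handle_info(:watchdog, state) do
  # dump a truncated view of the state and stop abnormally so it shows up in the logs
  Logger.error("EventProcessor still running after 5 minutes, state: #{inspect(state, limit: 50)}")
  {:stop, :watchdog_timeout, state}
end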

On the BEAM, if you take a substring out of a big string (e.g. user-inputted JSON), in the naive case the substring is a slice into the big string, so the big string is kept around and not GC’d even if you have no references to it other than the substrings.
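For example (illustration only):

big   = String.duplicate("x", 10_000_000) <> "abc"
# binary_part/3 returns a sub-binary that still points into `big`, so the whole
# ~10 MB binary stays alive for as long as `slice` is referenced anywhere
slice = binary_part(big, byte_size(big) - 3, 3)
# :binary.copy/1 makes an independent 3-byte binary, letting `big` be collected
safe  = :binary.copy(slice)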

Yeah, if you’re building a string-based routing process you need to either manually call :erlang.garbage_collect/0 at some strategic point, or consider using hibernate_after if there are predictable periods of downtime.
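Concretely, either of these (sketches; the values are arbitrary):

# ask the VM to hibernate the GenServer after 15 s without messages; hibernation
# runs a full-sweep GC and compacts the heap, dropping unused binary references
GenServer.start_link(__MODULE__, args, hibernate_after: 15_000)

# or force a collection at a strategic point, e.g. after finishing a batch
:erlang.garbage_collect()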

It turns out that although I’m doing a bunch of string manipulation, that wasn’t the root cause (as @benwilson512 pointed out, since it wasn’t the binary memory that kept increasing): those strings aren’t kept around, as once the result is obtained it is stored and the process/task ends.

Instead, there were specific substrings (some as short as 4 chars) that made certain parser combinators I wrote go haywire. I haven’t yet had the chance to do a full root-cause analysis to pinpoint the origin of the problem within the parsers, but at least I’ve got it solved (by tweaking the parsers) and the memory usage now remains stable.

Did that involve using :binary.copy?

Nope. Using :binary.copy was my initial attempt to resolve the issue but it didn’t change anything (since the string references weren’t the underlying issue). I reverted that change, and my code now runs as expected since I’ve solved the real problem.

Having now looked into the issue fully, my problem was that an unexpected behavior from the parser combinator lib caused my code to recurse infinitely…

Well, if you get to the bottom of it I’d be very curious. :)

Nothing really exciting: I’m parsing text to do named entity recognition (https://en.wikipedia.org/wiki/Named-entity_recognition). To do so, I’ve written parsers for the entities I’m interested in, using https://hexdocs.pm/combine/Combine.html.

As it turns out, one of the entities I’m interested in looks like “AB26”: the AB string followed by an integer. And that’s how I defined the parser in Combine: an “AB” string followed by an integer (https://hexdocs.pm/combine/Combine.Parsers.Text.html#integer/1).

These entity parsers essentially go through the text to process as follows: check for an entity match at the start of the string. If there’s a match, remove the matched substring from the main string, then parse the rest of the main string. If there’s no match, remove the first token (e.g. up to the next space) from the string and parse the rest of the string.

Now for the specific problem: it turns out that Combine’s integer will parse “0123” and return 123. So the combined parser would parse “AB012 foo bar” and report that “AB12” had been processed in the string. My code would then “remove” “AB12” from the string’s beginning (i.e. do nothing, since the string doesn’t start with that), and would then “continue” by parsing exactly the same string again. And of course it would find a parser match, remove nothing, and do the same thing again, over and over…
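In simplified form, the loop looked something like this (parse_entity/1 stands in for the actual Combine parsers and is made up):

def consume(""), do: []

def consume(text) do
  case parse_entity(text) do
    # the parser reports the matched entity as text, e.g. "AB12" for "AB012 foo bar"
    {:ok, matched, entity} ->
      # String.replace_prefix/3 returns `text` unchanged when the prefix doesn't
      # match, so the input never shrinks and this clause keeps recursing on the
      # same string, accumulating entities and growing memory indefinitely
      rest = String.replace_prefix(text, matched, "")
      [entity | consume(rest)]

    :no_match ->
      # no entity at the start: drop the first whitespace-delimited token and go on
      case String.split(text, " ", parts: 2) do
        [_token, rest] -> consume(rest)
        [_token] -> []
      end
  end
end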
