Parse this string

chasers · February 19, 2020, 7:10pm

I have not done a ton of string parsing. I could be parsing potentially A LOT of these. Would appreciate any ideas or direction here!

Here is the string:

"START RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69 Version: $LATEST\n2020-02-19T17:32:52.353Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting metadata\n2020-02-19T17:32:52.364Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting projects\n2020-02-19T17:32:52.401Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting Logflare sources\nEND RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\nREPORT RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\tDuration: 174.83 ms\tBilled Duration: 200 ms\tMemory Size: 1024 MB\tMax Memory Used: 84 MB\t\n"

I would like to turn it into this:

%{
  request_id: "4d0ff57e-4022-4bfd-8689-a69e39f80f69",
  lines: [
    %{timestamp: "2020-02-19T17:32:52.353Z", level: "INFO", message: "Getting metadata"},
    %{timestamp: "2020-02-19T17:32:52.364Z", level: "INFO", message: "Getting projects"},
    %{
      timestamp: "2020-02-19T17:32:52.401Z",
      level: "INFO",
      message: "Getting Logflare sources"
    }
  ],
  report: %{
    "billed_duration_ms" => 200,
    "duration_ms" => 175,
    "max_memory_used_mb" => 84,
    "memory_size_mb" => 1024
  }
}

Here is what I have that is working (with some caveats):

  def parse_lambda_message(message) do
    message = String.split(message, "\n")

    report =
      message
      |> Enum.find(fn x -> String.contains?(x, "REPORT") end)
      |> parse_lambda_report()

    request_id =
      message
      |> Enum.find(fn x -> String.contains?(x, "START") end)
      |> parse_request_id()

    lines =
      message
      |> Enum.filter(fn x -> !String.match?(x, ~r"START|REPORT|END") end)
      |> parse_lambda_lines()

    %{request_id: request_id, report: report, lines: lines}
  end

  def parse_lambda_lines(lines) do
    Enum.map(lines, fn x -> parse_lambda_line(x) end)
  end

  def parse_lambda_line(line) do
    line = String.split(line, "\t")
    timestamp = Enum.at(line, 0)
    level = Enum.at(line, 2)
    message = Enum.at(line, 3)

    %{timestamp: timestamp, level: level, message: message}
  end

  def parse_request_id(start) do
    String.split(start, " ")
    |> Enum.at(2)
  end

  def parse_lambda_report(report) do
    report
    |> String.split("\t")
    |> Enum.drop_while(fn x -> String.contains?(x, "RequestId:") == true end)
    |> Enum.map(fn x ->
      case String.split(x, ":", trim: true) do
        [k, v] ->
          [v, kind] =
            String.trim(v)
            |> String.split(" ")

          key =
            "#{k}_#{kind}"
            |> String.downcase()
            |> String.replace(" ", "_")

          value =
            case Float.parse(v) do
              {float, _rem} -> Kernel.round(float)
              :error -> v
            end

          {key, value}

        _ ->
          {"key", "value"}
      end
    end)
    |> Enum.into(%{})
    |> Map.drop(["key"])
    |> Map.put("parse_status", "success")
  end

Caveats:

I have a "" from splitting the main string (String.split(message, "\n") that I can’t seem to match on (got get rid of it) for some reason.
A line could be anything from console.log (javascript, or console.whatever in some other language) so I could have \ns which can exist in there. What I really need to do is split on a timestamp and then parse the lines from that.

bottlenecked · February 19, 2020, 8:29pm

Hi there! If the timestamps have always the same format and length and you’re not worried about them appearing inside a message, you could define 2 function heads, the first pattern matching on the timestamp <> rest and the second on the first character <> rest of the string. That would give you two different states: either just met a timestamp or anywhere else, and work from there.

Although to be honest if you have any control of the incoming log line, surrounding the message with single quotes (and escaping any single quotes inside the message) it would help make your parsing more explicit and less brittle

chasers · February 19, 2020, 8:44pm

Thanks! Yeah unfortunately I don’t have control over the message.

pdgonzalez872 · February 19, 2020, 8:47pm

@chasers what about something like the below? Naive approach, but may help you get there.
Would be helpful if you had more of examples of these.

defmodule Feb20Test do
  @moduledoc """
  Source: https://elixirforum.com/t/parse-this-string/29252

  Human friendly input:
  START RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69 Version: $LATEST
  2020-02-19T17:32:52.353Z        4d0ff57e-4022-4bfd-8689-a69e39f80f69    INFO    Getting metadata
  2020-02-19T17:32:52.364Z        4d0ff57e-4022-4bfd-8689-a69e39f80f69    INFO    Getting projects
  2020-02-19T17:32:52.401Z        4d0ff57e-4022-4bfd-8689-a69e39f80f69    INFO    Getting Logflare sources
  END RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69
  REPORT RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69  Duration: 174.83 ms     Billed Duration: 200 ms Memory Size: 1024 MB    Max Memory Used: 84 MB

  Would be helpful if you had more of examples of these.
  """

  use ExUnit.Case

  defmodule Solution do
    def parse(input) do
      %{original: input, newline_split: String.split(input, "\n", trim: true)}
      |> get_request_id()
      |> get_lines()
      |> get_report()
    end

    defp get_request_id(%{newline_split: [input | _]} = state) do
      [_, _, request_id, _, _] = String.split(input, " ")
      Map.put(state, :request_id, request_id)
    end

    defp get_lines(%{newline_split: input} = state) do
      lines =
        input
        |> Enum.filter(fn el -> String.contains?(el, "INFO") end)
        |> Enum.map(fn el ->
          [timestamp, _, level, message] = String.split(el, "\t")

          %{
            timestamp: timestamp,
            level: level,
            message: message
          }
        end)

      Map.put(state, :lines, lines)
    end

    defp get_report(%{newline_split: input} = state) do
      [
        _,
        "Duration: " <> duration_ms,
        "Billed Duration: " <> billed_duration_ms,
        "Memory Size: " <> memory_size_mb,
        "Max Memory Used: " <> max_memory_used_mb
      ] = input |> Enum.at(-1) |> String.split("\t", trim: true)

      report = %{
        # I'm not going to parse this more than converting it to string. If you
        # want to do that, you can add a couple of functions that do that
        # below. They are staying as strings for now. If you will deal with
        # Floats, I suggest you look at the Decimal library and use that.
        "billed_duration_ms" => remove_metric(billed_duration_ms, "ms"),
        "duration_ms" => remove_metric(duration_ms, "ms"),
        "max_memory_used_mb" => remove_metric(max_memory_used_mb, "MB"),
        "memory_size_mb" => remove_metric(memory_size_mb, "MB")
      }

      Map.put(state, :report, report)
    end

    defp remove_metric(input, to_replace) do
      input
      |> String.replace(" ", "")
      |> String.replace(to_replace, "")
    end
  end

  test "parses text as expected" do
    input =
      "START RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69 Version: $LATEST\n2020-02-19T17:32:52.353Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting metadata\n2020-02-19T17:32:52.364Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting projects\n2020-02-19T17:32:52.401Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting Logflare sources\nEND RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\nREPORT RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\tDuration: 174.83 ms\tBilled Duration: 200 ms\tMemory Size: 1024 MB\tMax Memory Used: 84 MB\t\n"

    expected = %{
      request_id: "4d0ff57e-4022-4bfd-8689-a69e39f80f69",
      lines: [
        %{timestamp: "2020-02-19T17:32:52.353Z", level: "INFO", message: "Getting metadata"},
        %{timestamp: "2020-02-19T17:32:52.364Z", level: "INFO", message: "Getting projects"},
        %{
          timestamp: "2020-02-19T17:32:52.401Z",
          level: "INFO",
          message: "Getting Logflare sources"
        }
      ],
      report: %{
        "billed_duration_ms" => "200",
        "duration_ms" => "174.83",
        "max_memory_used_mb" => "84",
        "memory_size_mb" => "1024"
      }
    }

    result = Solution.parse(input)

    assert result.request_id == "4d0ff57e-4022-4bfd-8689-a69e39f80f69"
    assert Enum.sort(result.lines) == Enum.sort(expected.lines)
    assert result.report == expected.report
    # tests pass
  end
end

trarbr · February 19, 2020, 9:20pm

Here’s a “solution” using NimbleParsec. I recently used that for parsing server logs from Counter Strike: Global Offensive (CS:GO log parser), and I really liked that approach to building parsers. It looks weird at first, but there are some good talks on YouTube, and the docs are pretty good as well.

The solution is incomplete because it does not allow loglines to contain linebreaks - so that will be an exercise left for the reader (can probably be accomplished with lookahead_not)

defmodule LogParser do
  import NimbleParsec

  # Example: 4d0ff57e-4022-4bfd-8689-a69e39f80f69
  uuid = ascii_string([?0..?9, ?a..?z, ?-], min: 1)

  # Example: 2020-02-19T17:32:52.353Z
  timestamp = ascii_string([?0..?9, ?-, ?:, ?., ?T, ?Z], 24)

  # Example: START RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69 Version: $LATEST\n
  start =
    ignore(string("START RequestId: "))
    |> concat(uuid)
    |> ignore(string(" Version: $LATEST\n"))

  # Example: 2020-02-19T17:32:52.353Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting metadata\n
  logline =
    timestamp
    |> ignore(string("\t"))
    |> concat(ignore(uuid))
    |> ignore(string("\t"))
    |> ascii_string([?A..?Z], min: 1)
    |> ignore(string("\t"))
    |> utf8_string([{:not, ?\n}], min: 1)
    |> ignore(string("\n"))

  # TODO: logline until \n{timestamp} END
  loglines =
    logline
    |> repeat()

  # Example: END RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69
  end_ =
    ignore(string("END RequestId: "))
    |> concat(ignore(uuid))
    |> ignore(string("\n"))

  # Example: \nREPORT RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\tDuration: 174.83 ms\tBilled Duration: 200 ms\tMemory Size: 1024 MB\tMax Memory Used: 84 MB\t\n
  report =
    ignore(string("REPORT RequestId: "))
    |> concat(ignore(uuid))
    |> ignore(string("\tDuration: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" ms\tBilled Duration: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" ms\tMemory Size: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" MB\tMax Memory Used: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" MB\t\n"))

  parser =
    start
    |> concat(loglines)
    |> concat(end_)
    |> concat(report)

  defparsec(:parse, parser)

  def test() do
    {:ok,
     [
       "4d0ff57e-4022-4bfd-8689-a69e39f80f69",
       "2020-02-19T17:32:52.353Z",
       "INFO",
       "Getting metadata",
       "2020-02-19T17:32:52.364Z",
       "INFO",
       "Getting projects",
       "2020-02-19T17:32:52.401Z",
       "INFO",
       "Getting Logflare sources",
       "174.83",
       "200",
       "1024",
       "84"
     ], _leftovers, _, _,
     _} =
      parse(
        "START RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69 Version: $LATEST\n2020-02-19T17:32:52.353Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting metadata\n2020-02-19T17:32:52.364Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting projects\n2020-02-19T17:32:52.401Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting Logflare sources\nEND RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\nREPORT RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\tDuration: 174.83 ms\tBilled Duration: 200 ms\tMemory Size: 1024 MB\tMax Memory Used: 84 MB\t\n"
      )
  end
end

chasers · February 19, 2020, 10:57pm

@pdgonzalez872 I like the idea of a feux state. Here is one that handles line breaks in a line…

message = "START RequestId: c7842ad7-aaa3-475f-a0e3-ba38c3e07fc6 Version: $LATEST\n2020-02-19T21:48:36.769Z\tc7842ad7-aaa3-475f-a0e3-ba38c3e07fc6\tINFO\tGetting drains\n2020-02-19T21:48:36.839Z\tc7842ad7-aaa3-475f-a0e3-ba38c3e07fc6\tINFO\tLogging map\n2020-02-19T21:48:36.841Z\tc7842ad7-aaa3-475f-a0e3-ba38c3e07fc6\tINFO\tMap {\n  'a string' => \"value associated with 'a string'\",\n  {} => 'value associated with keyObj',\n  [Function: keyFunc] => 'value associated with keyFunc'\n}\n2020-02-19T21:48:36.841Z\tc7842ad7-aaa3-475f-a0e3-ba38c3e07fc6\tINFO\tGetting metadata\nEND RequestId: c7842ad7-aaa3-475f-a0e3-ba38c3e07fc6\nREPORT RequestId: c7842ad7-aaa3-475f-a0e3-ba38c3e07fc6\tDuration: 142.26 ms\tBilled Duration: 200 ms\tMemory Size: 1024 MB\tMax Memory Used: 82 MB\t\n"

  def parse_lambda_message(message) do
    n_split_message = String.split(message, "\n")
    [_, latest] = String.split(message, "LATEST\n")
    [lines, _] = String.split(latest, "\nEND")

    %{
      n_split_message: n_split_message,
      lines_string: lines,
      og_message: message,
      request_id: "",
      report: %{},
      lines: []
    }
    |> parse_lambda_report()
    |> parse_request_id()
    |> parse_lambda_lines()
    |> Map.drop([:n_split_message, :lines_string, :og_message])
  end

  def parse_lambda_lines(%{lines_string: lines_string} = state) do
    regex =
      ~r"((?:((?:16|17|18|19|20|21)\d\d)-(0[1-9]|10|11|12)-([0-3]\d))|(?:((?:16|17|18|19|20|21)\d\d)-([0-3]\d\d)))T(([01]\d|2[0-4])(?:\:([0-5]\d)(?::([0-5]\d|60)(\.\d{1,9})?)?)?)?Z"

    lines =
      Regex.split(regex, lines_string, include_captures: true, trim: true)
      |> Enum.map(fn x -> String.split(x, "\t", trim: true) end)

    lines =
      for [n] <- lines,
          [_request_id, level, message] <- lines,
          do: %{timestamp: n, level: level, message: message}

    %{state | lines: lines}
  end

  def parse_lambda_line(line) do
    line = String.split(line, "\t")
    timestamp = Enum.at(line, 0)
    level = Enum.at(line, 2)
    message = Enum.at(line, 3)

    %{timestamp: timestamp, level: level, message: message}
  end

  def parse_request_id(%{n_split_message: n_split_message} = state) do
    request_id =
      n_split_message
      |> Enum.find(fn x -> String.contains?(x, "START") end)
      |> String.split(" ")
      |> Enum.at(2)

    %{state | request_id: request_id}
  end

  def parse_lambda_report(%{n_split_message: n_split_message} = state) do
    report =
      n_split_message
      |> Enum.find(fn x -> String.contains?(x, "REPORT") end)
      |> String.split("\t")
      |> Enum.drop_while(fn x -> String.contains?(x, "RequestId:") == true end)
      |> Enum.map(fn x ->
        case String.split(x, ":", trim: true) do
          [k, v] ->
            [v, kind] =
              String.trim(v)
              |> String.split(" ")

            key =
              "#{k}_#{kind}"
              |> String.downcase()
              |> String.replace(" ", "_")

            value =
              case Float.parse(v) do
                {float, _rem} -> Kernel.round(float)
                :error -> raise("Error parsing floats")
              end

            {key, value}

          _ ->
            {"key", "value"}
        end
      end)
      |> Enum.into(%{})
      |> Map.drop(["key"])

    %{state | report: report}
  end

chasers · February 19, 2020, 10:59pm

I definitely feel like I should be using NimbleParsec but I wasn’t sure how long it would take me to figure out.

I think I’m iterating through a bunch of lists too much for how often this fn is going to run.

ityonemo · February 20, 2020, 6:26am

I think NimbleParsec is 150% the tool you want eventually. Caveat: It is very difficult to figure out and it’s sparsely documented.

However, I’m starting to figure out some best practices for myself. If you want some examples, here is an example NimbleParsec for this reddit post:

gist.github.com

https://gist.github.com/ityonemo/c1366f28cfd71978d25f080f0694e5af

CalcParsec.ex

defmodule CalcParsec do
  import NimbleParsec

  @from_string %{"one" => 1, "two" => 2, "three" => 3, "four" => 4,
  "five" => 5, "six" => 6, "seven" => 7, "eight" => 8, "nine" => 9,
  "ten" => 10}
  @num_strs Map.keys(@from_string)

  number_list = @num_strs |> Enum.map(&string/1)

This file has been truncated. show original

I am using a ‘keep no context and substitute string terms directly into the growing list of tokens’

By contrast, here is me using nimble parsec to parse zig code. Since context is extremely relevant, I generate a elixir struct and ninja the struct in for the context. I also aggressively check certain checkpoints to make sure they are populating the context correctly:

https://github.com/ityonemo/zigler/blob/long_nifs_revisited/lib/zigler/parser.ex (note the Mix.env==test fences: https://github.com/ityonemo/zigler/blob/long_nifs_revisited/lib/zigler/parser.ex#L284-L289)

And here are the tests:

github.com

ityonemo/zigler/blob/long_nifs_revisited/test/unit/parser/function_header_test.exs

defmodule ZiglerTest.Parser.FunctionHeaderTest do
  # these tests make sure that the parser can correctly identify docstrings.
  use ExUnit.Case, async: true

  alias Zigler.Parser
  alias Zigler.Parser.{Nif, Resource, ResourceCleanup}

  @moduletag :parser
  @moduletag :function

  describe "the parameter parser" do
    test "correctly parses a basic identifier parameter" do
      assert {:ok, ["i64"], _, _, _, _} = Parser.parse_parameter("foo: i64")
    end

    test "correctly parses a namespaced identifier parameter" do
      assert {:ok, ["beam.env"], _, _, _, _} = Parser.parse_parameter("foo: beam.env")
    end

    test "correctly parses an ErlNifEnv identifier parameter" do

This file has been truncated. show original

github.com

ityonemo/zigler/blob/long_nifs_revisited/test/unit/parser/docstring_test.exs

defmodule ZiglerTest.Parser.DocstringTest do
  # these tests make sure that the parser can correctly identify docstrings.
  use ExUnit.Case, async: true

  alias Zigler.Parser
  alias Zigler.Parser.Nif
  alias Zigler.Parser.Resource

  @moduletag :parser
  @moduletag :docstring

  describe "the docstring_line parser" do
    test "turns a single doc line into a document string" do
      assert {:ok, _, _, %Parser{local: local}, _, _} = Parser.parse_docstring("""
      /// foo
      """)
      assert {:doc, "foo"} == local
    end

    test "turns a single doc line with spaces into a document string" do

This file has been truncated. show original

there’s a lot of little gotchas that you’re going to have to deal with, like how tokens are accumulated backwards in the token list.

good luck!

chasers · February 20, 2020, 3:25pm

Thanks for all the help everyone! I’m going to roll with the naive approach for the short term while I wrap my head around NimbleParsec. We’ve got some in our codebase already so I definitely need to do that.

trarbr · February 20, 2020, 8:08pm

Well, I couldn’t help myself so I implemented the handling of multiple lines per messages

defmodule LogParser do
  import NimbleParsec

  def parse(input) do
    {:ok, [result], _, _, _, _} = do_parse(input)
    {:ok, result}
  end

  def test() do
    {:ok,
     %{
       lines: [
         %{
           message: "Getting metadata",
           severity: "INFO",
           timestamp: "2020-02-19T17:32:52.353Z"
         },
         %{
           message: "Getting projects",
           severity: "INFO",
           timestamp: "2020-02-19T17:32:52.364Z"
         },
         %{
           message: "Getting Logflare sources\nOh see, it handles more than one line per message",
           severity: "INFO",
           timestamp: "2020-02-19T17:32:52.401Z"
         }
       ],
       report: %{
         "billed_duration_ms" => "174.83",
         "duration_ms" => "200",
         "max_memory_used_mb" => "1024",
         "memory_size_mb" => "84"
       },
       request_id: "4d0ff57e-4022-4bfd-8689-a69e39f80f69"
     }} == parse(test_input())
  end

  def test_input() do
    "START RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69 Version: $LATEST\n2020-02-19T17:32:52.353Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting metadata\n2020-02-19T17:32:52.364Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting projects\n2020-02-19T17:32:52.401Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting Logflare sources\nOh see, it handles more than one line per message\nEND RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\nREPORT RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\tDuration: 174.83 ms\tBilled Duration: 200 ms\tMemory Size: 1024 MB\tMax Memory Used: 84 MB\t\n"
  end

  # Example: 4d0ff57e-4022-4bfd-8689-a69e39f80f69
  uuid = ascii_string([?0..?9, ?a..?z, ?-], min: 1)

  # Example: 2020-02-19T17:32:52.353Z
  timestamp = ascii_string([?0..?9, ?-, ?:, ?., ?T, ?Z], 24)

  # Example: START RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69 Version: $LATEST\n
  start =
    ignore(string("START RequestId: "))
    |> concat(uuid)
    |> ignore(string(" Version: $LATEST\n"))

  # Example: END RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69
  end_ =
    ignore(string("END RequestId: "))
    |> concat(ignore(uuid))
    |> ignore(string("\n"))

  # Example: Getting metadata\n
  message_line =
    lookahead_not(choice([timestamp, end_]))
    |> optional(utf8_string([{:not, ?\n}], min: 1))
    |> ignore(string("\n"))

  # Example: It also\nworks with\nseveral lines
  message =
    message_line
    |> repeat()
    |> reduce({Enum, :join, ["\n"]})

  # Example: 2020-02-19T17:32:52.353Z\t4d0ff57e-4022-4bfd-8689-a69e39f80f69\tINFO\tGetting metadata\n
  logline =
    timestamp
    |> ignore(string("\t"))
    |> concat(ignore(uuid))
    |> ignore(string("\t"))
    |> ascii_string([?A..?Z], min: 1)
    |> ignore(string("\t"))
    |> concat(message)
    |> reduce({:to_logline, []})

  defp to_logline([ts, severity, message]) do
    %{
      timestamp: ts,
      severity: severity,
      message: message
    }
  end

  loglines =
    logline
    |> repeat()
    |> reduce({:to_loglines, []})

  defp to_loglines(loglines), do: loglines

  # Example: \nREPORT RequestId: 4d0ff57e-4022-4bfd-8689-a69e39f80f69\tDuration: 174.83 ms\tBilled Duration: 200 ms\tMemory Size: 1024 MB\tMax Memory Used: 84 MB\t\n
  report =
    ignore(string("REPORT RequestId: "))
    |> concat(ignore(uuid))
    |> ignore(string("\tDuration: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" ms\tBilled Duration: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" ms\tMemory Size: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" MB\tMax Memory Used: "))
    |> ascii_string([?0..?9, ?.], min: 1)
    |> ignore(string(" MB\t\n"))
    |> reduce({:to_report, []})

  defp to_report([duration, billed_duration, memory_size, max_memory_used]) do
    %{
      "billed_duration_ms" => duration,
      "duration_ms" => billed_duration,
      "max_memory_used_mb" => memory_size,
      "memory_size_mb" => max_memory_used
    }
  end

  parser =
    start
    |> concat(loglines)
    |> concat(end_)
    |> concat(report)
    |> reduce({:to_result, []})

  defp to_result([uuid, lines, report]) do
    %{
      request_id: uuid,
      lines: lines,
      report: report
    }
  end

  defparsecp(:do_parse, parser)
end

When you get the time to sink your teeth into NimbleParsec, I can recommend @sasajuric’s talk on parser combinators. He does not use NimbleParsec, but implements a parser combinator from scratch, which uses many of the same principles as NimbleParsec. It helped me understand how to use NimbleParsec.

chasers · February 20, 2020, 8:46pm

This is awesome! Thanks so much!!

pdgonzalez872 · February 20, 2020, 8:54pm

So cool @trarbr! I have a question: I had started benchmarking some implementations yesterday and added your solution to it. Here are the results: https://github.com/pdgonzalez872/elixir_forum_parse_strings_feb_2020

Not sure why, but the Nimble solution didn’t perform as I was expecting. Is this something where the difference starts to show in massive scale?

Here is the benchmark output:

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Number of Available Cores: 12
Available memory: 16 GB
Elixir 1.9.1
Erlang 22.0.7

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 28 s

Benchmarking filter_map...
Benchmarking for_comprehension...
Benchmarking reduce...
Benchmarking trarbr_nimble...

Name                        ips        average  deviation         median         99th %
reduce                  94.25 K       10.61 μs    ±88.53%          10 μs          24 μs
for_comprehension       93.15 K       10.74 μs    ±84.56%          10 μs          22 μs
filter_map              90.56 K       11.04 μs    ±65.85%          10 μs          24 μs
trarbr_nimble           67.90 K       14.73 μs    ±54.39%          14 μs          40 μs

Comparison:
reduce                  94.25 K
for_comprehension       93.15 K - 1.01x slower +0.125 μs
filter_map              90.56 K - 1.04x slower +0.43 μs
trarbr_nimble           67.90 K - 1.39x slower +4.12 μs

Edit: here is the output on Linux box with the Elixir 1.10.1:

Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Number of Available Cores: 12
Available memory: 15.35 GB
Elixir 1.10.1
Erlang 22.2.7

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 28 s

Benchmarking filter_map...
Benchmarking for_comprehension...
Benchmarking reduce...
Benchmarking trarbr_nimble...

Name                        ips        average  deviation         median         99th %
filter_map              75.34 K       13.27 μs    ±52.28%       12.82 μs       24.33 μs
for_comprehension       74.21 K       13.48 μs    ±29.33%       12.99 μs       27.54 μs
reduce                  73.76 K       13.56 μs    ±56.27%       13.02 μs       27.46 μs
trarbr_nimble           61.16 K       16.35 μs    ±34.07%       15.40 μs       40.63 μs

Comparison: 
filter_map              75.34 K
for_comprehension       74.21 K - 1.02x slower +0.20 μs
reduce                  73.76 K - 1.02x slower +0.28 μs
trarbr_nimble           61.16 K - 1.23x slower +3.08 μs

trarbr · February 20, 2020, 9:08pm

To be honest, I have no idea But there are definitely things to try to improve the performance, e.g.:

The uuid parser handles strings of arbitrary size, and checks that all chars are of the expected type. If we assume that data is always going to be OK, we can just grab 36 bytes - that will be faster. How to grab 36 bytes in NimbleParsec? I don’t know, tbh.
Timestamp parsing could do the same - just grab 24 bytes (or maybe this won’t work since we use it in a lookahead_not, but it can still be optimised I’m sure)
Instead of parsing the severity level (INFO, WARN, etc) with ascii_string([?A..?Z], min: 1) maybe do it with choice([string("INFO"), string("WARN")]), that should (I’m speculating here) be faster.

There are probably more, but I can’t think of those right now. I’m sure people with more experience with NimbleParsec than me will have good advice to give.

trarbr · February 21, 2020, 7:49am

I just tried making those three changes locally, and they did improve the benched performance quite a bit. I’m not sure I’d trust the benchmark 100% since it’s very little work per iteration, and I’m seeing a lot of deviation on my machine (usually 200% - 300%, sometimes 1000% - probably because it’s busy with a lot of other things, CPU throttling, etc). But all solutions do consistently bench as approx. equally fast, here’s my latest run:

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.9.4
Erlang 22.0.7

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 28 s

Benchmarking filter_map...
Benchmarking for_comprehension...
Benchmarking reduce...
Benchmarking trarbr_nimble...

Name                        ips        average  deviation         median         99th %
trarbr_nimble           31.87 K       31.38 μs   ±300.49%       23.90 μs      146.47 μs
reduce                  26.86 K       37.22 μs   ±231.36%       27.90 μs      182.90 μs
filter_map              26.02 K       38.43 μs   ±233.55%       28.90 μs      198.90 μs
for_comprehension       25.99 K       38.48 μs   ±361.59%       28.90 μs      178.90 μs

Comparison:
trarbr_nimble           31.87 K
reduce                  26.86 K - 1.19x slower +5.84 μs
filter_map              26.02 K - 1.22x slower +7.04 μs
for_comprehension       25.99 K - 1.23x slower +7.10 μs

PR here: https://github.com/pdgonzalez872/elixir_forum_parse_strings_feb_2020/pull/1

pdgonzalez872 · February 21, 2020, 3:33pm

Sweet! Thanks! I merged that and then organized it so the modules have their own files. If you (or anyone) take another stab at it, please add a file to the lib and adjust the lib/benchmark.exs with the new addition.

Here is the repo: https://github.com/pdgonzalez872/elixir_forum_parse_strings_feb_2020

I created an issue to keep track of the benchmarks: https://github.com/pdgonzalez872/elixir_forum_parse_strings_feb_2020/issues/3

Seems like the elixir version does impact the results. I haven’t experience any fluctuation as @trarbr mentioned, but my machine was not under heavy load at the time of running the benchmarks.

This is fun! Keep the PRs coming