How to properly parse a list of lines, with 'look-ahead' functionality?

jeroenbourgois · October 25, 2017, 9:33am

I have been struggling with this for a long time. It may not even be strictly Elixir-related, but Elixir is the language I am implementing it in, so this forum seemed the best place to start my quest. I am quite sure that for many of you my question is trivial.

The essence is just that I need to grasp how to tackle this parsing problem. Note that I prefer to do it all with Elixir code and would like to stay away from yecc and leex for now.

Here it goes: I have a rather simple log file with some lines in it, that I want to parse. The log has information about a list of our web projects that we ‘scan’ every day for issues or when we launch them, as an extra check. I succeeded in the first step which is transforming the raw text lines into something a bit more meaningful. This is an example of the raw log:

[ ] URL: www.example.com
[ ] Started: Mon May 1 22:20:01 2017

[ ] robots.txt file found
[!] robots.txt exposes too much
[ ] correct utf8 meta tag found
[ ] correct og meta tags found
[!] missing correct doctype

[ ] Server info
 |  Server: Apache
 |  Version: 2.2
 |  Lang: PHP
 |  php-fpm: yes
[!] server exposes too much
 |  exposing via response headers

I was able to transform the above into the following on a first pass:

[
  {:site, "www.example.com"},
  {:started, _parsed_erlang_date_},
  {:block, "robots.txt file found"},
  {:warn, "robots.txt exposes too much"},
  {:block, "correct utf8 meta tag found"},
  {:block, "correct og meta tags found"},
  {:warn, "missing correct doctype"},
  {:pipe, "Server: Apache"},
  {:pipe, "Version: 2.2"},
  {:pipe, "Lang: PHP"},
  {:pipe, "php-fpm: yes"},
  {:warn, "server exposes too much"},
  {:pipe, "exposing via response headers"}
]

This looks pretty good and more structured, but now I am stuck… As you can derive from the raw log, some information spans multiple lines, and I would like to group those lines into a struct instead of having them as separate entries in the resulting list.

For example, the ‘Server info’ block should become a single {:server, {_server_props_or_info_lines}} entry. And the last warning should be combined with the line after it, which contains extra detail about that warning.

And that is where I get stuck. I think it is a programming paradigm that I need to grasp (I am new to FP and have been doing imperative programming for 12+ years), and that it is not tied to Elixir at all. I hope somebody can guide me through this, but bear in mind: I have no CS master's degree, so I do not know much about parsers and lexers. It is just that the thing I am trying to do seems so trivial, and it frustrates me that I cannot get it right.

Thank you, anyone, in advance!

idi527 · October 25, 2017, 1:13pm

You mean something like this?

defmodule LogParser do

  @site "[ ] URL: "
  @server "[ ] Server info"
  @started "[ ] Started: "

  @block "[ ] "
  @warning "[!] "
  @pipe " |  "

  @spec parse(binary, [{atom, binary}]) :: [{atom, binary}]
  def parse(<<>>, result), do: :lists.reverse(result)
  def parse(<<@site, rest::bytes>>, acc) do
    parse_line(rest, :site, "", acc)
  end
  def parse(<<@started, rest::bytes>>, acc) do
    parse_line(rest, :started, "", acc)
  end
  def parse(<<@server, rest::bytes>>, acc) do
    parse_line(rest, :server, "", acc)
  end
  def parse(<<@block, rest::bytes>>, acc) do
    parse_line(rest, :block, "", acc)
  end
  def parse(<<@warning, rest::bytes>>, acc) do
    parse_line(rest, :warning, "", acc)
  end
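  # A pipe line continues the previous entry: pop that entry off the
  # accumulator and keep appending to its text, separated by a newline.
  def parse(<<@pipe, rest::bytes>>, [{prev_tag, prev_info} | acc]) do
    parse_line(rest, prev_tag, <<prev_info::bytes, ?\n>>, acc)
  end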
  def parse(<<_other::8, rest::bytes>>, acc) do
    parse(rest, acc)
  end

  def parse_line(<<?\n, rest::bytes>>, atom, inner_acc, outer_acc) do
    parse(rest, [{atom, inner_acc} | outer_acc])
  end
  def parse_line(<<char::8, rest::bytes>>, atom, inner_acc, outer_acc) do
    parse_line(rest, atom, <<inner_acc::bytes, char::8>>, outer_acc)
  end
  def parse_line(other, atom, inner_acc, outer_acc) do
    parse(other, [{atom, inner_acc} | outer_acc])
  end
end

With your log it produces

[site: "www.example.com", started: "Mon May 1 22:20:01 2017",
 block: "robots.txt file found", warning: "robots.txt exposes too much",
 block: "correct utf8 meta tag found", block: "correct og meta tags found",
 warning: "missing correct doctype",
 server: "\nServer: Apache\nVersion: 2.2\nLang: PHP\nphp-fpm: yes",
 warning: "server exposes too much\nexposing via response headers"]

You can collect lines starting with pipes into a list or a struct (in my example I concatenate them onto the previous block).

idi527 · October 25, 2017, 1:55pm

My code probably would not be particularly illuminating for you …

For a better example you might want to look into how Poison or Msgpax do decoding. I think both of them try to do it in a single pass.

This is the latest version I’ve got:

iex(21)> LogParser.parse log, []
[site: "www.example.com", started: "Mon May 1 22:20:01 2017",
 block: "robots.txt file found", warning: "robots.txt exposes too much",
 block: "correct utf8 meta tag found", block: "correct og meta tags found",
 warning: "missing correct doctype",
 server: ["info", "Server: Apache", "Version: 2.2", "Lang: PHP",
  "php-fpm: yes"],
 warning: ["server exposes too much", "exposing via response headers"]]

from

defmodule LogParser do

  @site "[ ] URL: "
  @server "[ ] Server "
  @started "[ ] Started: "

  @block "[ ] "
  @warning "[!] "
  @pipe " |  "

  @spec parse(binary, [{atom, binary | [binary]}]) :: [{atom, binary | [binary]}]
  def parse(<<>>, result), do: :lists.reverse(result)
  def parse(<<@site, rest::bytes>>, acc) do
    parse_line(rest, :site, "", acc)
  end
  def parse(<<@started, rest::bytes>>, acc) do
    parse_line(rest, :started, "", acc)
  end
  def parse(<<@server, rest::bytes>>, acc) do
    parse_line(rest, :server, "", acc)
  end
  def parse(<<@block, rest::bytes>>, acc) do
    parse_line(rest, :block, "", acc)
  end
  def parse(<<@warning, rest::bytes>>, acc) do
    parse_line(rest, :warning, "", acc)
  end
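  # A pipe line continues the previous entry: pop it and pass its current
  # value along as pipe_acc so the new line can be appended to it.
  def parse(<<@pipe, rest::bytes>>, [{prev_tag, prev_info} | acc]) do
    parse_line(rest, prev_tag, "", prev_info, acc)
  end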
  def parse(<<_other::8, rest::bytes>>, acc) do
    parse(rest, acc)
  end

  def parse_line(data, tag, inner_acc, pipe_acc \\ nil, outer_acc)
  def parse_line(<<?\n, rest::bytes>>, tag, inner_acc, pipe_acc, outer_acc) do
    parse(rest, [{tag, collect_inner_acc(inner_acc, pipe_acc)} | outer_acc])
  end
  def parse_line(<<char::8, rest::bytes>>, tag, inner_acc, pipe_acc, outer_acc) do
    parse_line(rest, tag, <<inner_acc::bytes, char::8>>, pipe_acc, outer_acc)
  end
  def parse_line(other, tag, inner_acc, pipe_acc, outer_acc) do
    parse(other, [{tag, collect_inner_acc(inner_acc, pipe_acc)} | outer_acc])
  end

  def collect_inner_acc(inner_acc, nil), do: inner_acc
  def collect_inner_acc(inner_acc, pipe_acc) when is_binary(pipe_acc) do
    [pipe_acc, inner_acc]
  end
  def collect_inner_acc(inner_acc, pipe_acc) when is_list(pipe_acc) do
    pipe_acc ++ [inner_acc]
  end
end

josevalim · October 25, 2017, 2:33pm

FWIW, I really liked your approach. Another idea is to use File.stream!/3 to get something that emits the file line by line and then do an Enum.reduce over it. The logic at the end will be very similar to yours, except that you leave the job of moving to the next line to the stream.
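
A minimal sketch of that idea, not a definitive implementation. The module name LogLines, the three-element tuple shape and the file path in the usage note are assumptions made for illustration; only File.stream!, Enum.reduce and binary prefix matching come from the language itself. Instead of looking ahead, the reducer looks back: a pipe line is attached to the most recent entry in the accumulator.

```elixir
defmodule LogLines do
  # Hypothetical module: folds any enumerable of raw lines into
  # {tag, text, details} tuples, where details holds the pipe lines.
  def parse(lines) do
    lines
    |> Enum.reduce([], &step/2)
    # Details and entries were built by prepending, so restore their order.
    |> Enum.map(fn {tag, text, details} -> {tag, text, Enum.reverse(details)} end)
    |> Enum.reverse()
  end

  defp step(line, acc) do
    case {String.trim_trailing(line), acc} do
      {"[ ] URL: " <> url, _} ->
        [{:site, url, []} | acc]

      {"[ ] Started: " <> date, _} ->
        [{:started, date, []} | acc]

      {"[ ] " <> text, _} ->
        [{:block, text, []} | acc]

      {"[!] " <> text, _} ->
        [{:warn, text, []} | acc]

      # A pipe line belongs to the most recent entry.
      {" |  " <> detail, [{tag, text, details} | rest]} ->
        [{tag, text, [detail | details]} | rest]

      # Blank or unrecognised lines are skipped.
      _ ->
        acc
    end
  end
end
```

With the sample log saved to a file (the path "scan.log" is hypothetical), File.stream!("scan.log") |> LogLines.parse() would return entries such as {:block, "Server info", ["Server: Apache", "Version: 2.2", "Lang: PHP", "php-fpm: yes"]}. The date is left as a string here; parsing it would be a separate step.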

OvermindDL1 · October 25, 2017, 4:07pm

I’ve also made an in-Elixir parsing library (ExSpirit); you write the parser in plain Elixir. Here is one that parses your format, spanning multiple lines for a body if necessary, and converts the funky datetime format into a normal Elixir NaiveDateTime struct:

defmodule LogSpiritTesting do
  @moduledoc """
  Documentation for LogSpiritTesting.
  """

  use ExSpirit.Parser, text: true

  @testlog """
  [ ] URL: www.example.com
  [ ] Started: Mon May 1 22:20:01 2017
  
  [ ] robots.txt file found
  [!] robots.txt exposes too much
  [ ] correct utf8 meta tag found
  [ ] correct og meta tags found
  [!] missing correct doctype
  
  [ ] Server info
   |  Server: Apache
   |  Version: 2.2
   |  Lang: PHP
   |  php-fpm: yes
  [!] server exposes too much
   |  exposing via response headers
  """

  @doc """

  ## Examples

      iex> LogSpiritTesting.test_parse()
      [
         site: "www.example.com",
         started: ~N[2017-05-01 22:20:01],
         block: "robots.txt file found",
         warn: "robots.txt exposes too much",
         block: "correct utf8 meta tag found",
         block: "correct og meta tags found",
         warn: "missing correct doctype",
         block: "Server info",
         pipe: "Server: Apache",
         pipe: "Version: 2.2",
         pipe: "Lang: PHP",
         pipe: "php-fpm: yes",
         warn: "server exposes too much",
         pipe: "exposing via response headers"
       ]

  """
  def test_parse(input \\ @testlog) do
    case parse(input, repeat(parse_entry())) do
      %{error: nil, result: result} -> result
      %{error: error} -> raise error
    end
  end

  defrule parse_type(alt([
    lit("[ ] URL: ") |> success(:site),
    lit("[ ] Started: ") |> success(:started),
    lit("[ ] ") |> success(:block),
    lit("[!] ") |> success(:warn),
    lit(" |  ") |> success(:pipe),
  ]))

  defrule parse_body(seq([
    chars(-?\n, 0),
    alt([
      seq([char(?\n), lookahead_not(parse_type()) |> parse_body()]),
      ignore(char(?\n)),
      success([]),
    ]),
  ])), pipe_result_into: :erlang.iolist_to_binary() |> String.trim()

  defrule parse_weirddatetimeformat(seq([
    chars(-?\s), ignore(char(?\s)), # Day of week name
    chars(-?\s), ignore(char(?\s)), # Month name
    uint(), ignore(char(?\s)),       # Day of month I guess??
    chars(-?\s), ignore(char(?\s)), # Time
    uint(),                          # Year
  ])), pipe_result_into: (case do [_day_of_week, month_name, month_day, time_str, year] ->
    time = Time.from_iso8601!(time_str)

    month =
      case month_name do
        "Jan" -> 1
        "Feb" -> 2
        "Mar" -> 3
        "Apr" -> 4
        "May" -> 5
        "Jun" -> 6
        "Jul" -> 7
        "Aug" -> 8
        "Sep" -> 9
        "Oct" -> 10
        "Nov" -> 11
        "Dec" -> 12
      end

    %NaiveDateTime{
      year: year,
      month: month,
      day: month_day,
      hour: time.hour,
      minute: time.minute,
      second: time.second,
    }
  end)

  defrule parse_entry(seq([
    parse_type(),
    parse_body(),
  ])), pipe_result_into: List.to_tuple() |> (case do
    {:started, datetime_string} -> {:started, parse(datetime_string, parse_weirddatetimeformat()).result}
    result -> result
  end)
end

Not necessarily the best way to do it but I whipped it up in a couple of minutes and it works. ^.^;
You can also get error information with details about why it failed and more too.

EDIT: Oh wait, the output format you showed is what you ‘have’, not what you ‘want’. You should always show the final output you ‘want’ in addition to what you ‘have’. ^.^
If you do, I can adapt the Spirit parser to produce that too, if you are curious?
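
Whatever the target shape turns out to be, a flat tagged list like the one above can also be regrouped afterwards in plain Elixir. The following is only a sketch under assumptions: the module name LogGroup and the {tag, text, details} result shape are made up for illustration and are not part of ExSpirit. It shows the ‘look-ahead’ step explicitly: for each entry, Enum.split_while/2 peels off the :pipe entries that directly follow it.

```elixir
defmodule LogGroup do
  # Hypothetical module: nests each run of {:pipe, _} entries under the
  # entry that precedes it, producing {tag, text, details} tuples.
  def group([]), do: []

  def group([{tag, text} | rest]) do
    # Look ahead: take every :pipe entry that directly follows this one.
    {pipes, rest} = Enum.split_while(rest, &match?({:pipe, _}, &1))
    details = Enum.map(pipes, fn {:pipe, detail} -> detail end)
    [{tag, text, details} | group(rest)]
  end
end
```

Applied to the flat list from the doctest, LogGroup.group/1 would turn the four pipe entries after block: "Server info" into the details list of that block. A list that starts with a :pipe entry has nothing to attach it to, so this sketch simply keeps it as its own {:pipe, text, details} entry.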

jeroenbourgois · October 26, 2017, 2:04pm

Thanks everybody! Amazing, so fast! I will look into it, I already see I can learn a lot!