Tagging data with NimbleParsec based on a lookahead

John-Goff · February 20, 2021, 11:06pm

Hey everyone,

I’m trying to write a parser for todo.txt using NimbleParsec, it’s my first time writing a parser as well as first time being exposed to the idea of parser combinators, so I’m not really sure how to express what I want. At the moment, I have a pretty basic parser which correctly does done state as well as priority of a task. I’m trying to parse dates next.

The todo.txt specification says that two dates can be provided, a start date and an end date. A line item with both a start and end date might look like this

x 2021-01-02 2021-01-01 Make a new years resolution

If two dates are provided, the end date must come first followed by the start date. If only one date is given, it should be the start date. Right now, I have my dates defined like this

  date =
    integer(4)
    |> ignore(string("-"))
    |> integer(2)
    |> ignore(string("-"))
    |> integer(2)

  start_date =
    date
    |> post_traverse({TodoTex.ParserHelper, :map_date, [:end]})
    |> lookahead_not(date)

  end_date =
    date
    |> post_traverse({TodoTex.ParserHelper, :map_date, [:end]})

the function ParserHelper.map_date turns the three integers into a %Date{} struct, as well as adding :start or :end respectively. It looks like this:

  def map_date(_rest, results, context, _line, _offset, tag) do
    {[{:date, tag, apply(Date, :new!, Enum.reverse(results))}], context}
  end

and finally my parser is created like so:

  defparsec(
    :todo,
    optional(done)
    |> optional(ignore(string(" ")))
    |> optional(priority)
    |> optional(ignore(string(" ")))
    |> optional(choice([start_date, end_date]))
    |> optional(ignore(string(" ")))
    |> optional(end_date)
    |> optional(ignore(string(" "))),
    debug: true
  )

When I run this parser with the string with two dates, both of them are tagged with :end. If I run the parser on a string with only one date it’s tagged with :end as well. How can I get the date to be correctly tagged as start if it’s either the only date or the last date given? Thanks.

kip · February 21, 2021, 12:23am

John-Goff:

  start_date =
    date
    |> post_traverse({TodoTex.ParserHelper, :map_date, [:end]})
    |> lookahead_not(date)

  end_date =
    date
    |> post_traverse({TodoTex.ParserHelper, :map_date, [:end]})

Looks like you’re tagging both of these with :end which might be part of the issue.

In a more general case, this kind of challenge is better suited to an approach of parsing the longest match first, then shorter matches. This way you avoid having to do lookahead and you leverage the parser combinators automatic backtracking. Making everything optional also makes it harder to understand the intent and harder to debug. Using your example, I would think about something like this:

  import ParserHelpers

  defparsec(
    :todo,
    optional(done())
    |> ignore_whitespace() 
    |> choice([
      end_date_and_start_date(),
      start_date()
    ])
  )

  # In parser_helpers.ex
  defmodule ParserHelpers do
    def end_date_and_start_date do
      date()
      |> ignore_whitespace()
      |> date()
    end

    def start_date do
      date()
    end

    def ignore_whitespace do
      ascii_char([?\s, ?\t])
      |> ignore()
      |> repeat()
    end
  end

John-Goff · February 21, 2021, 12:30am

You’re totally right, I copy pasted some code there and forgot to change that Thanks for the spot! When I change that to have the correct tag, I now get the start tag correctly applied when I only have one date, but when I have two dates the start tag and end tag are applied to the wrong dates. So progress, but still not correct. I’ll look into the approach you suggested, thanks!

kip · February 21, 2021, 12:34am

The reason is that in your line optional(choice([start_date, end_date])), the start_date will match and be tagged with :start

Then later on you have optional(end_date) which will be tagged with :end.

Overall your code is expected the dates to be in the order of start_date end_date and that sounds like what you are seeing.

John-Goff · February 21, 2021, 12:59am

Thanks, I definitely think I understand parsers slightly more now. I tried to modify the code I had quickly to fix the ordering issue you pointed out, but I ended up refactoring it as you suggested, parsing both dates if both are available and only one otherwise. My final (for now) code looks like this, for anyone interested:

defmodule TodoTex.ParserHelper do
  @moduledoc false

  import NimbleParsec

  def map_priority(pri), do: {:priority, <<pri>>}

  def map_date(_rest, results, context, _line, _offset, tag) do
    {[{:date, tag, apply(Date, :new!, Enum.reverse(results))}], context}
  end

  def map_end_and_start_date(_rest, results, context, _line, _offset) do
    [start_day, start_month, start_year, end_day, end_month, end_year] = results

    {[
       {:date, :start, Date.new!(start_year, start_month, start_day)},
       {:date, :end, Date.new!(end_year, end_month, end_day)}
     ], context}
  end

  def ignore_whitespace(combinator \\ empty()) do
    combinator
    |> ignore(repeat(ascii_char([?\s, ?\t])))
  end

  def done(combinator \\ empty()), do: combinator |> ascii_char([?x]) |> replace({:done, true})

  def priority(combinator \\ empty()) do
    combinator
    |> ignore(string("("))
    |> ascii_char([?A..?Z])
    |> ignore(string(")"))
    |> map({:map_priority, []})
  end

  def date(combinator \\ empty()) do
    combinator
    |> integer(4)
    |> ignore(string("-"))
    |> integer(2)
    |> ignore(string("-"))
    |> integer(2)
  end

  def start_date(combinator \\ empty()),
    do: combinator |> date() |> post_traverse({:map_date, [:start]})

  def end_and_start_date(combinator \\ empty()) do
    combinator
    |> date()
    |> ignore_whitespace()
    |> date()
    |> post_traverse({:map_end_and_start_date, []})
  end
end

defmodule TodoTex.TodoParser do
  import NimbleParsec
  import TodoTex.ParserHelper

  defparsec(
    :todo,
    optional(done())
    |> ignore_whitespace()
    |> optional(priority())
    |> ignore_whitespace()
    |> optional(choice([end_and_start_date(), start_date()]))
    |> ignore_whitespace()
  )
end

kip · February 21, 2021, 1:10am

I think the intent is much clearer in your refactored code and I expect maintenance will be easier in the future as a result.