requesting critiques of my first NimbleParsec attempt

markquezada · March 17, 2022, 9:35am

Hi,

I’m trying to parse plain text durations like 5 minutes into structured data e.g., the integer 300 representing the number seconds for the string “5 minutes”. This can get complicated as there are various forms like 5 mins or 5mins., 5M etc.

Also, it needs to handle ranges and fractions like 1 hour to 1 1/2 hours. In this case, I only need to pick one value from the range so I’m taking the larger value. E.g., for this example, I should get back the integer 5400 representing 1.5 hours, in seconds.

In my research as I tried to figure out a way to do this, I stumbled across parsecs and NimbleParsec. Here’s my first attempt:

defmodule DurationParser do
  import NimbleParsec

  approx =
    choice([
      string("about"),
      string("up to")
    ])

  fraction =
    integer(min: 1)
    |> ignore(string("/"))
    |> integer(min: 1)
    |> reduce(:to_fraction)
    |> tag(:fraction)

  minute =
    choice([
      string("minutes"),
      string("minute"),
      string("mins"),
      string("min"),
      string("m"),
      string("M")
    ])
    |> optional(ignore(string(".")))
    |> unwrap_and_tag(:minutes)

  hour =
    choice([
      string("hours"),
      string("hour"),
      string("H")
    ])
    |> optional(ignore(string(".")))
    |> unwrap_and_tag(:hours)

  duration =
    optional(ignore(approx))
    |> optional(ignore(string(" ")))
    |> integer(min: 1)
    |> optional(ignore(string(" ")))
    |> optional(fraction)
    |> tag(:quantity)
    |> optional(ignore(string(" ")))
    |> choice([
      minute,
      hour
    ])
    |> reduce(:to_duration)
    |> tag(:duration)

  range =
    choice([
      string("-"),
      string("to")
    ])

  defparsec(
    :do_parse,
    duration
    |> optional(ignore(repeat(string(" "))))
    |> optional(repeat(duration))
    |> optional(ignore(range))
    |> optional(duration)
    |> optional(repeat(duration))
    |> eos()
  )

  def parse(str) do
    with {:ok, rest, _args, _context, _line, _offset} <- do_parse(str) do
      {:ok, unwrap_duration(rest)}
    end
  end

  defp unwrap_duration({:duration, [duration]}), do: duration

  defp unwrap_duration(duration) when is_list(duration) do
    duration
    |> List.last()
    |> unwrap_duration
  end

  defp to_fraction([numerator, denominator]) do
    numerator / denominator
  end

  defp to_duration(quantity: [quantity], hours: _) do
    quantity * 60 * 60
  end

  defp to_duration(quantity: [quantity], minutes: _) do
    quantity * 60
  end

  defp to_duration(quantity: [quantity, {:fraction, [fraction]}], hours: _) do
    (quantity + fraction) * 60 * 60
  end

  defp to_duration(quantity: [quantity, {:fraction, [fraction]}], seconds: _) do
    (quantity + fraction) * 60
  end
end

Currently, it passes these test cases:

defmodule DurationParserTest do
  use ExUnit.Case, async: true

  test "1 hour" do
    str = "1 hour"
    assert {:ok, 60*60} == DurationParser.parse(str)
  end

  test "2 hours" do
    str = "2 hours"
    assert {:ok, 2*60*60} == DurationParser.parse(str)
  end

  test "1 minute" do
    str = "1 minute"
    assert {:ok, 60} == DurationParser.parse(str)
  end

  test "15 minutes" do
    str = "15 minutes"
    assert {:ok, 15*60} == DurationParser.parse(str)
  end

  test "1 1/2 hours" do
    str = "1 1/2 hours"
    assert {:ok, 1.5*60*60} == DurationParser.parse(str)
  end

  test "about 20 mins." do
    str = "about 20 mins."
    assert {:ok, 20*60} == DurationParser.parse(str)
  end

  test "up to 2 1/2 hours" do
    str = "up to 2 1/2 hours"
    assert {:ok, 2.5*60*60} == DurationParser.parse(str)
  end

  test "1 hour to 1 1/2 hours" do
    str = "1 hour to 1 1/2 hours"
    assert {:ok, 1.5*60*60} == DurationParser.parse(str)
  end

  test "40 minutes to 1 hour" do
    str = "40 minutes to 1 hour"
    assert {:ok, 60*60} == DurationParser.parse(str)
  end
end

That said… it feels pretty inelegant. What are some ways I can make this better?

In particular, some rough edges I see:

I don’t know if my use of variables makes sense or if I should break those into a helper module for reusability/composability.
the handoff of parse to do_parse
how i’m reducing the entire parsing to a single integer using reduce and do_parse seems like it’s the wrong way to go about that
listing literal minute/hour matches seems brittle (e.g., it’s not case insensitive so it will fail on “Minutes”)
pattern matching on the tags to further parse data

Said differently… what’s the idiomatic approach for using defparsec but returning a single value that isn’t related to how the data is being parsed.

LostKobrakai · March 17, 2022, 10:00am

That’s not the goal of a parser, so I’m not sure you actually want that. I’ve never written a parser for anything production, but my approach would be parsing the text to tokens like {:hour, [5]} or {:hour, [1, {:fraction, 1, 2}]}, discarding anything you don’t need and when parsing is done reduce this to seconds. This clearly splits up the domain of the parser (make sense of text) from what your ultimate goal is (getting a number of seconds).

Your input format also feels very loosely structured and you don’t seem to need a way for telling users what’s wrong in how their text is structured. This is much of the reason why NimbleParsec returns all the stuff it does. Imagine an error message in SQL or in a HTML validity checker. It will tell you not only what’s wrong and why, but at best even where it’s wrong. The fact that you don’t need that is not wrong at all, but at the same time I think it’s fine to acknowledge that those things are very necessary pieces for others.

I’d probably approach that differently. Try to let the parser parse a full word – any word – and then use any of the transformation API (map, reduce, …_traverse) to check if their downcases value matches one of your units.

markquezada · March 17, 2022, 7:46pm

Correct, in the sense that the two concerns are separate. But ultimately, the point of parsing it is to get it into a format that can be converted into seconds. Agree that this format {:hour, [1, {:fraction, 1, 2}]} is ultimately the end goal for the parser so that it can be converted to seconds as a subsequent step.

Agree! That’s where I felt this exploration was leading me in the end.

Correct. I’m trying to use this to smartly parse some user generated content that does not conform to a specific grammar per se. It seems parser combinators may be more obviously useful with structured data that conforms to a specified grammar. In this case, having this work some or most of the time is better than not having any structured data at all.

This is what I wanted to do but I couldn’t figure out how to do it. I’ll look through the docs again.

Thanks for the feedback!