Hi,
I’m trying to parse plain-text durations like “5 minutes” into structured data, e.g., the integer 300 representing the number of seconds for the string “5 minutes”. This can get complicated because there are various forms, like “5 mins”, “5mins.”, “5M”, etc.
Also, it needs to handle ranges and fractions like “1 hour to 1 1/2 hours”. In this case, I only need to pick one value from the range, so I’m taking the larger value. E.g., for this example, I should get back the integer 5400, representing 1.5 hours in seconds.
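To make the target concrete, here is a minimal sketch of the arithmetic I expect for that example. `DurationMath` and `to_seconds/3` are hypothetical names used only for illustration; they are not part of my parser.

```elixir
defmodule DurationMath do
  # Hypothetical helper, for illustration only.
  @seconds_per_unit %{minutes: 60, hours: 3600}

  # Converts a whole number plus a {numerator, denominator} fraction
  # of a unit into a whole number of seconds.
  def to_seconds(whole, {numerator, denominator}, unit) do
    round((whole + numerator / denominator) * Map.fetch!(@seconds_per_unit, unit))
  end
end

# "1 hour to 1 1/2 hours": convert both ends, then keep the larger one.
low = DurationMath.to_seconds(1, {0, 1}, :hours)
high = DurationMath.to_seconds(1, {1, 2}, :hours)
IO.inspect(max(low, high)) # prints 5400
```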
While researching how to do this, I stumbled across parsecs and NimbleParsec. Here’s my first attempt:
    defmodule DurationParser do
      import NimbleParsec

      # Leading qualifiers such as "about" are matched and then discarded.
      approx =
        choice([
          string("about"),
          string("up to")
        ])

      # "1/2" is reduced to the float 0.5 by to_fraction/1.
      fraction =
        integer(min: 1)
        |> ignore(string("/"))
        |> integer(min: 1)
        |> reduce(:to_fraction)
        |> tag(:fraction)

      # choice/1 takes the first alternative that matches, so longer
      # literals are listed before their prefixes.
      minute =
        choice([
          string("minutes"),
          string("minute"),
          string("mins"),
          string("min"),
          string("m"),
          string("M")
        ])
        |> optional(ignore(string(".")))
        |> unwrap_and_tag(:minutes)

      hour =
        choice([
          string("hours"),
          string("hour"),
          string("H")
        ])
        |> optional(ignore(string(".")))
        |> unwrap_and_tag(:hours)

      # [approx] <integer> [<fraction>] <unit>, reduced to seconds by to_duration/1.
      duration =
        optional(ignore(approx))
        |> optional(ignore(string(" ")))
        |> integer(min: 1)
        |> optional(ignore(string(" ")))
        |> optional(fraction)
        |> tag(:quantity)
        |> optional(ignore(string(" ")))
        |> choice([
          minute,
          hour
        ])
        |> reduce(:to_duration)
        |> tag(:duration)

      range =
        choice([
          string("-"),
          string("to")
        ])

      defparsec(
        :do_parse,
        duration
        |> optional(ignore(repeat(string(" "))))
        |> optional(repeat(duration))
        |> optional(ignore(range))
        |> optional(duration)
        |> optional(repeat(duration))
        |> eos()
      )

      def parse(str) do
        # do_parse/1 returns {:ok, results, rest, context, line, offset} on success;
        # an {:error, ...} tuple falls through the `with` unchanged.
        with {:ok, results, _rest, _context, _line, _offset} <- do_parse(str) do
          {:ok, unwrap_duration(results)}
        end
      end

      defp unwrap_duration({:duration, [duration]}), do: duration

      # Takes the last parsed duration, which is the upper bound only when
      # the range is written in ascending order.
      defp unwrap_duration(duration) when is_list(duration) do
        duration
        |> List.last()
        |> unwrap_duration()
      end

      defp to_fraction([numerator, denominator]) do
        numerator / denominator
      end

      defp to_duration(quantity: [quantity], hours: _) do
        quantity * 60 * 60
      end

      defp to_duration(quantity: [quantity], minutes: _) do
        quantity * 60
      end

      defp to_duration(quantity: [quantity, {:fraction, [fraction]}], hours: _) do
        (quantity + fraction) * 60 * 60
      end

      defp to_duration(quantity: [quantity, {:fraction, [fraction]}], minutes: _) do
        (quantity + fraction) * 60
      end
    end
Currently, it passes these test cases:
    defmodule DurationParserTest do
      use ExUnit.Case, async: true

      test "1 hour" do
        str = "1 hour"
        assert {:ok, 60 * 60} == DurationParser.parse(str)
      end

      test "2 hours" do
        str = "2 hours"
        assert {:ok, 2 * 60 * 60} == DurationParser.parse(str)
      end

      test "1 minute" do
        str = "1 minute"
        assert {:ok, 60} == DurationParser.parse(str)
      end

      test "15 minutes" do
        str = "15 minutes"
        assert {:ok, 15 * 60} == DurationParser.parse(str)
      end

      test "1 1/2 hours" do
        str = "1 1/2 hours"
        assert {:ok, 1.5 * 60 * 60} == DurationParser.parse(str)
      end

      test "about 20 mins." do
        str = "about 20 mins."
        assert {:ok, 20 * 60} == DurationParser.parse(str)
      end

      test "up to 2 1/2 hours" do
        str = "up to 2 1/2 hours"
        assert {:ok, 2.5 * 60 * 60} == DurationParser.parse(str)
      end

      test "1 hour to 1 1/2 hours" do
        str = "1 hour to 1 1/2 hours"
        assert {:ok, 1.5 * 60 * 60} == DurationParser.parse(str)
      end

      test "40 minutes to 1 hour" do
        str = "40 minutes to 1 hour"
        assert {:ok, 60 * 60} == DurationParser.parse(str)
      end
    end
That said… it feels pretty inelegant. What are some ways I can make this better?
In particular, some rough edges I see:
- I don’t know if my use of variables makes sense or if I should break those into a helper module for reusability/composability.
- the handoff of parse to do_parse
- how I’m reducing the entire parse to a single integer using reduce and do_parse seems like the wrong way to go about it
- listing literal minute/hour matches seems brittle (e.g., the matching is case-sensitive, so it fails on “Minutes”)
- pattern matching on the tags to further parse data
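For the case-sensitivity point, one option I can think of is normalizing the input before it reaches the parser, so the literal lists only need lowercase forms. A minimal sketch, assuming a hypothetical `DurationInput.normalize/1` helper that is not in my code above:

```elixir
defmodule DurationInput do
  # Hypothetical pre-processing step, for illustration only: lowercases
  # the input, trims the ends and collapses runs of whitespace to one space.
  def normalize(input) do
    input
    |> String.downcase()
    |> String.trim()
    |> String.replace(~r/\s+/, " ")
  end
end

IO.inspect(DurationInput.normalize("  5   Minutes ")) # prints "5 minutes"
```

If I went this way, the “M” and “H” literals would have to become “m” and “h”, since uppercase input would never reach the parser.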
Said differently… what’s the idiomatic approach for using defparsec but returning a single value that isn’t tied to how the data is being parsed?
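One direction I’ve been considering, as a rough sketch rather than a definitive answer: let each combinator emit seconds directly (`replace` for the units, `reduce` per duration), keep the generated parser private with `defparsecp`, and have the public function only pick the value. `DurationSketch`, its private helpers and the reduced set of literals are hypothetical, and the script assumes the nimble_parsec package can be fetched with `Mix.install/1`.

```elixir
# Assumption: run as a script with network access to fetch the dependency.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule DurationSketch do
  import NimbleParsec

  # Each unit emits its size in seconds instead of the matched text.
  minute =
    choice([string("minutes"), string("minute"), string("mins"), string("min")])
    |> replace(60)

  hour = choice([string("hours"), string("hour")]) |> replace(3600)

  space = ignore(ascii_string([?\s], min: 1))

  # "1/2" becomes the float 0.5.
  fraction =
    integer(min: 1)
    |> ignore(string("/"))
    |> integer(min: 1)
    |> reduce(:divide)

  # <integer> [<fraction>] <unit>, reduced to whole seconds on the spot.
  duration =
    integer(min: 1)
    |> optional(space |> concat(fraction))
    |> concat(space)
    |> choice([minute, hour])
    |> reduce(:to_seconds)

  separator = ignore(choice([string(" to "), string(" - "), string("-")]))

  # The generated function stays private; it returns a flat list of seconds.
  defparsecp(
    :durations,
    duration
    |> optional(separator |> concat(duration))
    |> eos()
  )

  # Public entry point: parsing and choosing a value are separate steps.
  def parse(input) do
    case durations(input) do
      {:ok, seconds, _rest, _context, _line, _offset} -> {:ok, Enum.max(seconds)}
      {:error, reason, _rest, _context, _line, _offset} -> {:error, reason}
    end
  end

  defp divide([numerator, denominator]), do: numerator / denominator

  defp to_seconds([whole, unit]), do: whole * unit
  defp to_seconds([whole, fraction, unit]), do: round((whole + fraction) * unit)
end

IO.inspect(DurationSketch.parse("1 hour to 1 1/2 hours")) # prints {:ok, 5400}
```

Here the parser output is already a list of integers, so `parse/1` needs no tag matching, and taking `Enum.max/1` picks the larger end of a range even when it is written in descending order.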