Playing with NimbleParsec — Some beginner questions

darraghenright · June 7, 2020, 6:42pm

Hi there!

I should preface this by saying I’m an absolute beginner to parser combinators and I’m pretty much an Elixir hobbyist so far.

In any case, I’ve very recently become interested in exploring them since watching Saša Jurić’s Parsing from first principles video and I’ve just started scratching the surface with nimble_parsec.

The introductory datetime example got me thinking about how one might go about parsing integers with additional constraints. As a standalone example, how would one go about parsing a “valid month” value, e.g. an integer in the range 1..12 with an optional leading 0?

Strings like "1", "01" and "12" would be valid, returning 1, 1 and 12 respectively.

Conversely, strings like "0", "00", "001" and "13" are invalid.

I am not sure what an idiomatic approach is here, so I suspect my various attempts so far have been very naive. The following is my best attempt, where I accept any integer of one or two digits and then validate it with post_traverse:

defmodule MonthParser do
  import NimbleParsec

  def valid_month?(_rest, [n] = args, context, _line, _offset) when n >= 1 and n <= 12,
    do: {args, context}

  def valid_month?(_rest, [n], _context, _line, _offset),
    do: {:error, "Invalid month: #{n}"}

  month =
    integer(min: 1, max: 2)
    |> post_traverse(:valid_month?)
    |> eos()

  defparsec :month, month
end

This seems satisfactory enough:

iex(1)> MonthParser.month "0"
{:error, "Invalid month: 0", "", %{}, {1, 0}, 1}
iex(2)> MonthParser.month "00"
{:error, "Invalid month: 0", "", %{}, {1, 0}, 2}
iex(3)> MonthParser.month "01"
{:ok, [1], "", %{}, {1, 0}, 2}
iex(4)> MonthParser.month "1" 
{:ok, [1], "", %{}, {1, 0}, 1}
iex(5)> MonthParser.month "12"
{:ok, [12], "", %{}, {1, 0}, 2}
iex(6)> MonthParser.month "13"
{:error, "Invalid month: 13", "", %{}, {1, 0}, 2}

However, in the spirit of education I’d love to learn about better solutions. This being Elixir, I assume there’s a far more elegant and succinct solution.

Additionally, I am wondering in this example if range validation might be better somewhere else — in other words, maybe worrying about the validity of the values comes later? I am imagining a more complex scenario where a full date is being parsed, where the validity of the date depends on the month.

Thanks!

kip · June 7, 2020, 6:50pm

That’s the approach I generally use.

In some cases, like the one you outline, I may opt for a more explicit expression of validity like:

month = 
  choice([
    string("12"),
    string("11"),
    ...
    string("1")
  ])
  |> map({String, :to_integer, []})
  |> label("must be resolved to be from 1 to 12")

dbern · June 7, 2020, 10:06pm

If you want some examples of NimbleParsec and date time parsing you can look at TaxJar’s date_time_parser.

I helped make it. I won’t say it’s a shining example of parsing well, but it works.

darraghenright · June 8, 2020, 11:48am

Thanks! That’s a good suggestion: be as explicit as possible, which is obviously a lot clearer and a bit more declarative.

Just wondering, are the strings defined in descending numeric order for a reason?

darraghenright · June 8, 2020, 11:49am

Very cool! Thanks for sharing, looks like there’s a lot of good material to learn from in here.

ityonemo · June 8, 2020, 3:29pm

A few notes:

Overall, this looks like a fantastic start.
By convention, functions that end in ? should return boolean values.
I would do range validation inside your NimbleParsec module, as you have it here. Say you want to build a complex analysis tool: you want to quit early instead of processing the whole file and then hunting for problems, because at that point you have the contextual information that what you are parsing is a “month value”.
I typically don’t like to expose the complex NimbleParsec return value as part of the public API of my modules. I usually wrap it in another function that simplifies the output.
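
A minimal sketch of the last two notes, written as a standalone script. The module name Month and the functions parse/1, valid?/1 and check_month are hypothetical, chosen only for illustration. It assumes a recent NimbleParsec release (1.4 here), where a post_traverse callback returns {rest, args, context}; the code earlier in this thread uses the older {args, context} shape. Mix.install needs Elixir 1.12 or later.

```elixir
Mix.install([{:nimble_parsec, "~> 1.4"}])

defmodule Month do
  import NimbleParsec

  # Private parser: one or two digits, range-checked, then end of input.
  defparsecp :do_parse,
             integer(min: 1, max: 2)
             |> post_traverse(:check_month)
             |> eos()

  # Not named with a trailing ?, because it does not return a boolean.
  defp check_month(rest, [n] = args, context, _line, _offset) when n in 1..12,
    do: {rest, args, context}

  defp check_month(_rest, [n], _context, _line, _offset),
    do: {:error, "invalid month: #{n}"}

  # Public API: hides the six-element NimbleParsec tuple behind tagged tuples.
  def parse(input) when is_binary(input) do
    case do_parse(input) do
      {:ok, [month], "", _context, _line, _offset} -> {:ok, month}
      {:error, reason, _rest, _context, _line, _offset} -> {:error, reason}
    end
  end

  # A function ending in ? that really does return a boolean.
  def valid?(input), do: match?({:ok, _}, parse(input))
end
```

Callers then only pattern match on {:ok, month} or {:error, reason}, and the line and offset details stay inside the module unless you decide to surface them.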

I can’t say I’m an expert (some of these suggestions I’m about to give are very much my own), and I’ve only really been using nimbleparsec for a few months. If you’d like a more complex example, here are some of the highlights of a relatively complex parser (it parses zig code), note that these are things that may only apply to more complex situations:

All throughout you’ll see “ignore” blocks. This parser is strictly for analysis, so my preference is to throw away intermediate parser results, and only use the context as an accumulator. I could have dropped structured parser results into the result stream, but some of my results necessarily, contextually cross “sub-parser” boundaries, so I felt this was the better choice. If you have an analysis only parser, you can also drop your results into the result stream, which may be a better choice if your results don’t depend on context between sub-parses; if you have a parser that mutates the text block, then you almost certainly should use the context to store document metadata/analysis.
I like to ninja in a structured datatype for context, instead of an unstructured map. This helps organize my thoughts and keep a sane accounting of what’s going on inside of nimbleparsec, which can spiral into complexity. https://github.com/ityonemo/zigler/blob/master/lib/zigler/parser.ex#L75
I like to make intermediate parsecs where I can shoot valid and invalid forms in tests. If you’re building more complex parsers, this is vital, or else you’re gonna have a bad time debugging as the complexity starts to spiral.
https://github.com/ityonemo/zigler/blob/master/lib/zigler/parser.ex#L150
Don’t be afraid to raise in your parsers. Naked NimbleParsec errors aren’t necessarily helpful, and inside your parsers you have more contextual information.
https://github.com/ityonemo/zigler/blob/master/lib/zigler/parser.ex#L186
Along those same lines, I like to do intermediate validations early, as soon as the contextual information is available:
https://github.com/ityonemo/zigler/blob/master/lib/zigler/parser.ex#L276
As I said before, wrap your NimbleParsec parser in a module API function. In the case of this module, I return a structured datatype, or raise.
https://github.com/ityonemo/zigler/blob/master/lib/zigler/parser.ex#L431
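
As a rough illustration of validating early and returning a structured datatype or raising, here is a sketch for the full-date case raised at the top of the thread, where the validity of the day depends on the month. It is not taken from the zigler parser. The module name IsoDate and the functions parse!/1 and to_date are hypothetical, the input format is assumed to be YYYY-MM-DD, and the {rest, args, context} return shape assumes a recent NimbleParsec release (1.4 here).

```elixir
Mix.install([{:nimble_parsec, "~> 1.4"}])

defmodule IsoDate do
  import NimbleParsec

  date =
    integer(4)
    |> ignore(string("-"))
    |> integer(2)
    |> ignore(string("-"))
    |> integer(2)
    # Validate as soon as year, month and day are all available.
    |> post_traverse(:to_date)
    |> eos()

  defparsecp :do_parse, date

  # Results arrive in reverse order, so the day comes first.
  defp to_date(rest, [day, month, year], context, _line, _offset) do
    case Date.new(year, month, day) do
      {:ok, date} -> {rest, [date], context}
      {:error, reason} -> {:error, "invalid date (#{reason})"}
    end
  end

  # Public API: returns a %Date{} or raises with the input in the message.
  def parse!(input) when is_binary(input) do
    case do_parse(input) do
      {:ok, [date], "", _context, _line, _offset} ->
        date

      {:error, reason, rest, _context, _line, _offset} ->
        raise ArgumentError,
              "cannot parse #{inspect(input)}: #{reason}, remaining: #{inspect(rest)}"
    end
  end
end
```

Delegating to Date.new/3 keeps the calendar rules (month lengths, leap years) out of the grammar, while the error is still reported from inside the parser.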

kip · June 8, 2020, 6:54pm

It’s not in descending numeric order, it’s in descending string order. If we don’t try the two-digit alternatives first, like string("12"), then string("1") would match, and the next character, “2”, would be a parse error. When parsing strings like this, it’s important to try the longest strings first for that reason.
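
A small sketch of that ordering effect, reduced to two alternatives. The module and parser names are hypothetical, and the script assumes NimbleParsec 1.4 installed through Mix.install (Elixir 1.12 or later). choice/2 commits to the first alternative that matches and does not go back to try another one when a later combinator fails.

```elixir
Mix.install([{:nimble_parsec, "~> 1.4"}])

defmodule OrderDemo do
  import NimbleParsec

  # Shorter alternative first: on "12", string("1") matches, then eos() fails on "2".
  defparsec :short_first, choice([string("1"), string("12")]) |> eos()

  # Longer alternative first: "12" is tried before its own prefix "1".
  defparsec :long_first, choice([string("12"), string("1")]) |> eos()
end

# Returns an error tuple, because "2" is left over after the match.
IO.inspect(OrderDemo.short_first("12"))

# Both inputs succeed once the longer string is listed first.
IO.inspect(OrderDemo.long_first("12"))
IO.inspect(OrderDemo.long_first("1"))
```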

darraghenright · June 16, 2020, 4:29pm

I forgot to come back again and say thanks to everyone for their replies since my last visit.

@kip: Excellent point about string length. It seems so obvious in retrospect, and of course this would be a factor when including comparisons for leading-zero values; i.e. string("01") should come before string("1") (or string("9") for that matter)!

@ityonemo: Point taken about functions that end with ?. I’m usually more vigilant about that convention (I promise). Thanks for all the reference material, super helpful stuff.