How to use NimbleParsec to split text with a separator sequence?

linusdm · December 27, 2022, 1:08pm

I’m trying to parse a file with nimble_parsec but I got stuck quite early. The input is a file with many lines, and the content is divided with separators. A simplified example:

some text, may also contain character "=", which is part of the separator
can be any number of lines long until the first separator line
==========
next block of text
==========
and another one
==========

I can’t come up with a good combination of building blocks to get to the output:

["some text, may also contain character \"=\", which is...", "next block of text", "and another one"]

I’m trying this:

defmodule MyParser do
  import NimbleParsec

  separator = string("==========")
  block = utf8_string([], min: 1) |> lookahead(separator)

  defparsec(:parse, block |> times(min: 1))
end

But this yields an error:

{:error, "expected string \"==========\"", "", %{}, {8, 210}, 210}

I think the key part of my problem is that the utf8_string parser is eagerly matching beyond the first separator. Fiddling with the accepted codepoints and excluding = does help when the text blocks do not contain the =, but that’s not how I want to use it (the text should be able to contain =, only a newline with a succession of = characters should break the text up).
I’m probably using lookahead wrong here, but I’m out of ideas.

kip · December 27, 2022, 7:55pm

First thing to note is that your separator is really \n==========\n, that is, it includes the surrounding newlines. My instinct would be to reach for String.split/2, for example:

iex(1)> String.split("""
...(1)> some text, may also contain character "=", which is part of the separator
...(1)> can be any number of lines long until the first separator line
...(1)> ==========
...(1)> next block of text
...(1)> ==========
...(1)> and another one
...(1)> ==========
...(1)> """, ~r/\n=*\n/)
["some text, may also contain character \"=\", which is part of the separator\ncan be any number of lines long until the first separator line",
 "next block of text", "and another one", ""]
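A small aside on the regex approach, as a sketch rather than a recommendation: a pattern like \n=*\n also matches an empty line, because =* accepts zero characters. Assuming a separator line always contains at least one =, =+ avoids that, and the trim: true option of String.split/3 drops the trailing empty string. The sample text below is made up for illustration.

```elixir
text = """
alpha = beta

second paragraph of the first block
==========
next block of text
==========
and another one
==========
"""

# Assumption: a separator is a whole line made of one or more "=" characters.
separator = ~r/\n=+\n/

# trim: true removes the empty string left after the final separator.
blocks = String.split(text, separator, trim: true)

IO.inspect(blocks)
# ["alpha = beta\n\nsecond paragraph of the first block",
#  "next block of text", "and another one"]
```

With \n=*\n instead, the blank line inside the first block would be treated as a separator too, giving four blocks.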

If you’re doing this as an exercise in NimbleParsec, here are a few thoughts:

the separator includes the surrounding newlines, so it’s actually "\n==========\n"
logically a block is made up of one or more lines
a line is a sequence of characters bounded by a newline

Using these ideas we can build a parser which I think does what you want. I suspect this is slower than the String.split/2 version above.

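defmodule MyParser do
  import NimbleParsec

  # The separator is a full line of ten "=", including both newlines.
  separator = string("\n==========\n")

  # A line is any run of characters other than a newline (possibly empty).
  line = repeat(utf8_char(not: ?\n))

  # A block is a line followed by more lines, stopping before a separator.
  block =
    line
    |> repeat(lookahead_not(separator) |> ascii_char([?\n]) |> concat(line))
    |> reduce({List, :to_string, []})

  defparsec(:parse, repeat(block |> ignore(separator)))
end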
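For anyone who wants to try this out, here is a minimal sketch of calling the generated function. It assumes the file is run as a script with network access so that Mix.install/1 can fetch nimble_parsec; the module name BlockParser and the sample input are made up for illustration, while the combinators are the ones from the post above.

```elixir
# Run as a script, e.g. `elixir blocks.exs`; Mix.install/1 fetches the dependency.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule BlockParser do
  # Hypothetical module name; the combinators are taken from the post above.
  import NimbleParsec

  separator = string("\n==========\n")
  line = repeat(utf8_char([{:not, ?\n}]))

  block =
    line
    |> repeat(lookahead_not(separator) |> ascii_char([?\n]) |> concat(line))
    |> reduce({List, :to_string, []})

  defparsec(:parse, repeat(block |> ignore(separator)))
end

input = """
some text with "=" inside
second line
==========
next block of text
==========
and another one
==========
"""

# A defparsec function returns
# {:ok, results, rest, context, position, byte_offset} on success.
{:ok, blocks, rest, _context, _position, _offset} = BlockParser.parse(input)

IO.inspect(blocks)
IO.inspect(rest)
```

Each block comes back as one binary because of the reduce step, and rest holds whatever input was not consumed, which should be empty here since the input ends with a separator.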

kip · December 27, 2022, 10:00pm

I’ve added a gist for future reference with some comments (and also handling the case where the last block is not followed by a separator).
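The gist itself is not reproduced here. As a rough sketch of one way the trailing block could be handled (not necessarily how the gist does it), the parser can accept one more block after the separator-terminated ones and then require the end of input. The module name TrailingBlockParser is hypothetical, and the script assumes network access for Mix.install/1.

```elixir
# Run as a script; Mix.install/1 fetches the dependency.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule TrailingBlockParser do
  # Hypothetical module; a variation on the combinators above, not the gist.
  import NimbleParsec

  separator = string("\n==========\n")
  line = repeat(utf8_char([{:not, ?\n}]))

  block =
    line
    |> repeat(lookahead_not(separator) |> ascii_char([?\n]) |> concat(line))
    |> reduce({List, :to_string, []})

  # Separator-terminated blocks first, then one final block (which may be
  # empty), and eos() to insist that the whole input was consumed.
  defparsec(
    :parse,
    repeat(block |> ignore(separator))
    |> concat(block)
    |> eos()
  )
end

# The last block "b" has no separator after it.
{:ok, blocks, "", _context, _position, _offset} =
  TrailingBlockParser.parse("a\n==========\nb")

IO.inspect(blocks)
```

Since a block is allowed to be empty in this sketch, input that does end with a separator produces a trailing "" in the result, much like String.split/2 without trim: true.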

linusdm · December 28, 2022, 5:10pm

Thanks for taking the time Kip!

Yes, I should’ve noted that this is a challenge I’ve set myself. It’s the first step in parsing a larger file (mostly line-based, an artifact of an old Progress/OpenEdge program I’m trying to replace), where each block is parsed into its smaller parts. Since the file is mostly line-based, splitting and other String utils are also well suited, but I’d still like to try the other approach, and compare (and learn).

I would not have guessed how to stitch those combinators together… especially the reducer step. But it makes sense. I’ll have to sleep on this to really get it. Also, with the last block, you were one step ahead of me…

I’m still not sure where parser-combinators really shine, but I’m finding it very interesting!

Thanks again! I’ll update when I have new updates/challenges.