How to parse an ATX Heading (MarkDown) with NimbleParsec?

nimbleparsec

#1

Just as an exercise and learn more about NimbleParsec, I started trying to parse a few sections of the CommonMark specification with NimbleParsec v0.5, some sections are easier (e.g. thematic breaks) than others, but I still have some issues trying to complete the “ATX Heading” section, according to the spec:

An ATX heading consists of a string of characters, parsed as inline content, between an opening sequence of 1–6 unescaped # characters and an optional closing sequence of any number of unescaped # characters. The opening sequence of # characters must be followed by a space or by the end of line. The optional closing sequence of #s must be preceded by a space and may be followed by spaces only. The opening # character may be indented 0-3 spaces. The raw contents of the heading are stripped of leading and trailing spaces before being parsed as inline content. The heading level is equal to the number of # characters in the opening sequence.

What I have so far is this:

defmodule ATXHeading do
  import NimbleParsec

  @space 0x0020
  @tab 0x009

  spacechar = utf8_char([@space, @tab])
  sp = optional(times(spacechar, min: 1))

  non_indented_space =
    [@space]
    |> utf8_char()
    |> times(max: 3)

  atx_start =
    non_indented_space
    |> optional()
    |> ignore()
    |> ascii_char([?#])
    |> times(min: 1, max: 6)
    |> reduce(:length)
    |> unwrap_and_tag(:level)

  atx_end =
    [@space]
    |> utf8_char()
    |> times(ascii_char([?#]), min: 1)
    |> concat(sp)
    |> choice([
      ascii_char([?\n]),
      eos()
    ])

  heading =
    atx_start
    |> choice([
      [?\n] |> ascii_char() |> ignore(),
      [@space]
      |> utf8_char()
      |> ignore()
      # NOTE: `lookahead_not` seems to work with string(" #"), but it
      # should also accept `atx_end`, right?
      |> repeat([not: ?\n] |> utf8_char() |> lookahead_not(string(" #")))
      # Take the next character in case lookahead(combinator) is found and ignore `atx_end`
      |> optional([not: ?\n, not: ?\s] |> utf8_char() |> ignore(atx_end))
      |> optional(ascii_char([?\n]))
      |> reduce({List, :to_string, []})
      # TODO: After trim, the result should be parsed as inline
      # content
      |> reduce(:trim)
    ])
    |> tag(:heading)

  defp trim([string]), do: String.trim(string)

  @doc """
  Parses ATX Headings.

  ## Examples

      iex> ~S(### foo) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]
      iex> ~S(###      foo     ) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]
      iex> ~S(###      foo    #### ) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]
      iex> ~S(### foo \\###) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo ###"]]
      iex> ~S(### foo #\\##) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo ###"]]
      iex> ~S(# foo \\#) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 1}, "foo #"]]
      iex> ~S(## foo ### b) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 2}, "foo ### b"]]

  """

  # defparsec(:heading, heading)
  # NOTE: This is just to see if `atx_end` is parsed correctly
  defparsec(:heading, choice([heading, atx_end]))
end

The problem with the current implementation is that I haven’t found a way to use the combinator atx_end with lookahead_not (but atx_end works with ignore and also is parsed correctly if I run something like ATXHeading.heading(" ######### \n"). In the meantime I used lookahead_not(string(" #")) and it works but does not cover all the cases of the specification. So, at the moment seems that I can’t comply with this part of the spec:

The optional closing sequence of #s must be preceded by a space and may be followed by spaces only

Is this behavior a bug on lookahead_not or am I missing something here? Do you have any recommendation on how to parse an ATX Heading with NimbleParsec?


#2

It looks like what you want to do is made more complex because of the fact that the sequence of # in the beginning must match the sequence of # at the end. That requires context-sensitive features, which are not possible on nimble_parsec as it currently works.

However, you can “cheat” your way out of context-sensitivity by relying on the fact that headers can’t go more than 6 levels deep. That way, you only need to handle 6 possibilities and it can be made to work with nimble_parsec.

For an example (which doesn’t handle markup inside the header), look at the following code:

defmodule CommonMark do
  import NimbleParsec

  ignored_whitespace =
    ascii_string([?\s], min: 1)
    |> ignore()
    |> optional()

  header_char = utf8_char(not: ?\n)

  headers =
    for n <- 6..1 do
      prefix = String.duplicate("#", n)
      suffix = " " <> prefix

      content =
        lookahead_not(string(suffix))
        |> concat(header_char)
        |> times(min: 1)
        |> reduce({List, :to_string, []})

      ignore(optional(ascii_string([?\s], max: 3)))
      |> string(prefix)
      |> concat(ignored_whitespace)
      |> concat(content)
      |> optional(ignore(string(suffix)))
      |> post_traverse(:tag_with_level)
    end

  @doc false
  def tag_with_level(_rest, [content, prefix] = _args, context, _line, _offset) do
    result = [{:header, [level: byte_size(prefix)], content}]
    {result, context}
  end

  defparsec(
    :atx_header,
    choice(headers)
  )
end

Example output:

iex(50)> CommonMark.atx_header("## abc ")
{:ok, [{:header, [level: 2], "abc "}], "", %{}, {1, 0}, 7}
iex(51)> CommonMark.atx_header("## abc #")
{:ok, [{:header, [level: 2], "abc #"}], "", %{}, {1, 0}, 8}
iex(52)> CommonMark.atx_header("# abc #")
{:ok, [{:header, [level: 1], "abc"}], "", %{}, {1, 0}, 7}
iex(53)> CommonMark.atx_header("## abc #")
{:ok, [{:header, [level: 2], "abc #"}], "", %{}, {1, 0}, 8}
iex(54)> CommonMark.atx_header("# abc")
{:ok, [{:header, [level: 1], "abc"}], "", %{}, {1, 0}, 5}
iex(55)> CommonMark.atx_header("## abc")
{:ok, [{:header, [level: 2], "abc"}], "", %{}, {1, 0}, 6}
iex(56)> CommonMark.atx_header("### abc")
{:ok, [{:header, [level: 3], "abc"}], "", %{}, {1, 0}, 7}
iex(57)> CommonMark.atx_header("### abc ###")
{:ok, [{:header, [level: 3], "abc"}], "", %{}, {1, 0}, 11}
iex(58)> CommonMark.atx_header("### abc ##")
{:ok, [{:header, [level: 3], "abc ##"}], "", %{}, {1, 0}, 10}
iex(59)> CommonMark.atx_header("# abc ##")
{:ok, [{:header, [level: 1], "abc"}], "#", %{}, {1, 0}, 7}
iex(60)> CommonMark.atx_header("####### abc")
{:ok, [{:header, [level: 6], "# abc"}], "", %{}, {1, 0}, 11}

There are some warts, but you get the idea. You have to ensure that the ### is followed by a space, that a new line or an oef() comes afterwards and stuff like that. Regarding the handling of markup inside a header, it might be better to do that in another path.

But I hope this can help you move forward.

EDIT: There are similar tricks for other context-sensitive features of markdown, which you can emulate with a simpler context-free grammar. For example, code delimited by a variable number of backticks.


#3

On re-reading the spec, I’ve just noticed that the closing sequence of hashes doesn’t need to be the same sequence as the opening sequence. Which means I haven’t undertood your problem at all. Sorry.


#4

@tmbb First of all, thanks for your feedback.

The specification does not mention this, I mean, the sequence of # at the end does not need to match with the sequence of # at the beginning, for example:

# foo ######
## foo #####
### foo ####
#### foo ###
##### foo ##
###### foo #

Is a valid input and it should produce something like this (if you transform the result into HTML for example):

<h1>foo</h1>
<h2>foo</h2>
<h3>foo</h3>
<h4>foo</h4>
<h5>foo</h5>
<h6>foo</h6>

But, if you provide a string like this: "### foo ### b\n" it should produce <h3>foo ### b</h3>\n as the result. That’s why I need to be sure that the optional closing sequence of #s must be preceded by a space and may be followed by spaces only (until I find the end of the string or the end of the line).


#5

I don’t really know how to help you, except by showing you the code. You can find a working implementation here. It was written in about 15 mins and has no tests whatsoever.

It turns out the syntax is actually context-free, which is great. The way to implement such things in NimbleParsec (which is basically a PEG parser now) is to just transcribe into code the verbal description of what it should do using the available combinations.

The description in the CommonMark spec is not very easy, unfortunately, because it’s mostly a bunch of examples.


#6

I’ve spent some time adding functionality to my markdown parser linked above. You might wish to reuse some of the rules.

Even more interestingly, I’ve build a test suite from the examples of the spec (there is a function to build that automatically if you eant to regenerate the test suite). The tests are grouped by section (sections are parsed using my AtxHeader parser) and all sections are skipped by default, but you can delete the :skip tag as you implement the functionality.

You might find that test suite useful.


#7

@tmbb Hey, thanks for following this, I’ll definitely check out your project, TBH I haven’t continue with this recently, but, I just created a project on GitHub and pushed the code that I have so far, you can find my early implementation here:

From your description we followed similar paths:

  • For the unit tests, I got the code from https://github.com/commonmark/CommonMark and then after running: python3 test/spec_tests.py --dump-tests > spec_tests.json I got the examples from the spec and their result as HTML, those are my test fixtures which are decoded with Jason.
  • The project includes a dummy CLI module, for now that module just read a file and dump the result as HTML to stdout. This could also read the input from stdin in the future, but that’s not important now.
  • I added a first formatter (HTML), but my initial idea was that the AST followed almost the same structure as the DTD from the spec.
  • I haven’t checked NimbleParsec code in detail yet, but my doubt still remains about why lookahead_not does not seem to play nicely with:
  atx_end =
    [@space]
    |> utf8_char()
    |> times(ascii_char([?#]), min: 1)
    |> concat(sp)
    |> choice([
      ascii_char([?\n]),
      eos()
    ])

But it works perfectly with a fixed string like string(" #").

Anyway, I’ll try to continue this weekend with Aaron and I’ll report back if I have any progress.


#8

That’s cool too. The only asvantage of my test suite is that it sits the tests into sections (by parsing the headers in the spec). That way you can skip some sections as you implement them. Unfortunately, it wasn’t as useful as I expected, because tests in a given section often require paraers from other sections


#9

I have no idea what a DTD is, but it’s probably a good idea