As an exercise to learn more about NimbleParsec, I started parsing a few sections of the CommonMark specification with NimbleParsec v0.5. Some sections are easier than others (e.g. thematic breaks), but I am still having trouble completing the “ATX Heading” section. According to the spec:
An ATX heading consists of a string of characters, parsed as inline content, between an opening sequence of 1–6 unescaped # characters and an optional closing sequence of any number of unescaped # characters. The opening sequence of # characters must be followed by a space or by the end of line. The optional closing sequence of #s must be preceded by a space and may be followed by spaces only. The opening # character may be indented 0-3 spaces. The raw contents of the heading are stripped of leading and trailing spaces before being parsed as inline content. The heading level is equal to the number of # characters in the opening sequence.
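To have something to compare the parser output against, the rules quoted above can be sketched as a plain-Elixir reference function. This is only a minimal sketch: `ATXReference` and `parse/1` are hypothetical names, it uses `Regex` rather than NimbleParsec, it follows the quoted wording literally (a space, not a tab, after the opening sequence), and it leaves backslash escapes untouched because those belong to inline parsing.

```elixir
defmodule ATXReference do
  # Hypothetical cross-check helper, not part of NimbleParsec or the parser below.
  # Returns {:ok, level, raw_content} for an ATX heading line, :error otherwise.
  def parse(line) do
    line = String.trim_trailing(line, "\n")

    # 0-3 spaces of indentation, 1-6 "#", then either end of line
    # or a space followed by the raw content.
    case Regex.run(~r/\A {0,3}([#]{1,6})(?: (.*))?\z/, line) do
      nil ->
        :error

      [_, hashes] ->
        {:ok, String.length(hashes), ""}

      [_, hashes, raw] ->
        content =
          raw
          # Closing sequence: "#"s preceded by a space (or starting the
          # content) and followed by spaces only.
          |> String.replace(~r/(?:\A| )[#]+ *\z/, "")
          |> String.trim(" ")

        {:ok, String.length(hashes), content}
    end
  end
end
```

For example, `ATXReference.parse("### foo #### ")` gives `{:ok, 3, "foo"}`, while `ATXReference.parse("## foo ### b")` keeps the inner `###` and gives `{:ok, 2, "foo ### b"}`.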
What I have so far is this:
defmodule ATXHeading do
  import NimbleParsec

  @space 0x0020
  @tab 0x009

  spacechar = utf8_char([@space, @tab])
  sp = optional(times(spacechar, min: 1))

  non_indented_space =
    [@space]
    |> utf8_char()
    |> times(max: 3)

  atx_start =
    non_indented_space
    |> optional()
    |> ignore()
    |> ascii_char([?#])
    |> times(min: 1, max: 6)
    |> reduce(:length)
    |> unwrap_and_tag(:level)

  atx_end =
    [@space]
    |> utf8_char()
    |> times(ascii_char([?#]), min: 1)
    |> concat(sp)
    |> choice([
      ascii_char([?\n]),
      eos()
    ])

  heading =
    atx_start
    |> choice([
      [?\n] |> ascii_char() |> ignore(),
      [@space]
      |> utf8_char()
      |> ignore()
      # NOTE: `lookahead_not` seems to work with string(" #"), but it
      # should also accept `atx_end`, right?
      |> repeat([not: ?\n] |> utf8_char() |> lookahead_not(string(" #")))
      # Take the next character in case lookahead(combinator) is found and ignore `atx_end`
      |> optional([not: ?\n, not: ?\s] |> utf8_char() |> ignore(atx_end))
      |> optional(ascii_char([?\n]))
      |> reduce({List, :to_string, []})
      # TODO: After trim, the result should be parsed as inline
      # content
      |> reduce(:trim)
    ])
    |> tag(:heading)

  defp trim([string]), do: String.trim(string)

  @doc """
  Parses ATX Headings.

  ## Examples

      iex> ~S(### foo) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]

      iex> ~S(### foo ) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]

      iex> ~S(### foo #### ) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]

      iex> ~S(### foo \\###) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo ###"]]

      iex> ~S(### foo #\\##) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo ###"]]

      iex> ~S(# foo \\#) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 1}, "foo #"]]

      iex> ~S(## foo ### b) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 2}, "foo ### b"]]
  """
  # defparsec(:heading, heading)
  # NOTE: This is just to see if `atx_end` is parsed correctly
  defparsec(:heading, choice([heading, atx_end]))
end
The problem with the current implementation is that I haven’t found a way to use the atx_end combinator with lookahead_not (although atx_end works with ignore and is also parsed correctly if I run something like ATXHeading.heading(" ######### \n")). In the meantime I used lookahead_not(string(" #")), which works but does not cover all the cases in the specification. So at the moment it seems that I can’t comply with this part of the spec:
The optional closing sequence of #s must be preceded by a space and may be followed by spaces only
Is this behavior a bug in lookahead_not, or am I missing something here? Do you have any recommendations on how to parse an ATX heading with NimbleParsec?
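For reference, this is the shape I would expect to work: a sketch that runs the negative lookahead before each content character instead of after it, with the full closing sequence as the lookahead. Everything here is an assumption for illustration. `ATXClosingSketch`, `closing` and `content` are hypothetical names, the script pins NimbleParsec `~> 1.0` through `Mix.install` rather than the v0.5 I am using, and I have not confirmed that v0.5 behaves the same way.

```elixir
# Assumption: run as a standalone script with network access so that
# Mix.install can fetch NimbleParsec (pinned to 1.x here, not v0.5).
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule ATXClosingSketch do
  import NimbleParsec

  # Closing sequence: one space, one or more "#", optional spaces,
  # then a newline or the end of the input.
  closing =
    ascii_char([?\s])
    |> times(ascii_char([?#]), min: 1)
    |> repeat(ascii_char([?\s]))
    |> choice([ascii_char([?\n]), eos()])

  # A content character is any non-newline character at a position
  # where the closing sequence does NOT start.
  content_char =
    lookahead_not(closing)
    |> utf8_char(not: ?\n)

  # Heading content only (the opening sequence is assumed to be
  # consumed already), followed by an optional, ignored closing sequence.
  defparsec :content,
            repeat(content_char)
            |> reduce({List, :to_string, []})
            |> ignore(optional(closing))
end
```

With this layout the lookahead decides whether to consume the space itself, so no extra step is needed to pick up the last character before the closing sequence. Trimming and backslash handling are left out on purpose.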