milmazz

How to parse an ATX Heading (Markdown) with NimbleParsec?

As an exercise to learn more about NimbleParsec, I started parsing a few sections of the CommonMark specification with NimbleParsec v0.5. Some sections are easier than others (e.g. thematic breaks), but I still have some issues completing the “ATX Heading” section. According to the spec:

An ATX heading consists of a string of characters, parsed as inline content, between an opening sequence of 1–6 unescaped # characters and an optional closing sequence of any number of unescaped # characters. The opening sequence of # characters must be followed by a space or by the end of line. The optional closing sequence of #s must be preceded by a space and may be followed by spaces only. The opening # character may be indented 0-3 spaces. The raw contents of the heading are stripped of leading and trailing spaces before being parsed as inline content. The heading level is equal to the number of # characters in the opening sequence.

What I have so far is this:

defmodule ATXHeading do
  import NimbleParsec

  @space 0x0020
  @tab 0x009

  spacechar = utf8_char([@space, @tab])
  sp = optional(times(spacechar, min: 1))

  non_indented_space =
    [@space]
    |> utf8_char()
    |> times(max: 3)

  atx_start =
    non_indented_space
    |> optional()
    |> ignore()
    |> ascii_char([?#])
    |> times(min: 1, max: 6)
    |> reduce(:length)
    |> unwrap_and_tag(:level)

  atx_end =
    [@space]
    |> utf8_char()
    |> times(ascii_char([?#]), min: 1)
    |> concat(sp)
    |> choice([
      ascii_char([?\n]),
      eos()
    ])

  heading =
    atx_start
    |> choice([
      [?\n] |> ascii_char() |> ignore(),
      [@space]
      |> utf8_char()
      |> ignore()
      # NOTE: `lookahead_not` seems to work with string(" #"), but it
      # should also accept `atx_end`, right?
      |> repeat([not: ?\n] |> utf8_char() |> lookahead_not(string(" #")))
      # Take the next character in case lookahead(combinator) is found and ignore `atx_end`
      |> optional([not: ?\n, not: ?\s] |> utf8_char() |> ignore(atx_end))
      |> optional(ascii_char([?\n]))
      |> reduce({List, :to_string, []})
      # TODO: After trim, the result should be parsed as inline
      # content
      |> reduce(:trim)
    ])
    |> tag(:heading)

  defp trim([string]), do: String.trim(string)

  @doc """
  Parses ATX Headings.

  ## Examples

      iex> ~S(### foo) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]
      iex> ~S(###      foo     ) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]
      iex> ~S(###      foo    #### ) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo"]]
      iex> ~S(### foo \\###) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo ###"]]
      iex> ~S(### foo #\\##) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 3}, "foo ###"]]
      iex> ~S(# foo \\#) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 1}, "foo #"]]
      iex> ~S(## foo ### b) |> ATXHeading.heading() |> elem(1)
      [heading: [{:level, 2}, "foo ### b"]]

  """

  # defparsec(:heading, heading)
  # NOTE: This is just to see if `atx_end` is parsed correctly
  defparsec(:heading, choice([heading, atx_end]))
end

The problem with the current implementation is that I haven’t found a way to use the combinator atx_end with lookahead_not, even though atx_end works with ignore and is parsed correctly if I run something like ATXHeading.heading(" ######### \n"). In the meantime I used lookahead_not(string(" #")), which works but does not cover all the cases in the specification. So at the moment it seems that I can’t comply with this part of the spec:

The optional closing sequence of #s must be preceded by a space and may be followed by spaces only

Is this behavior a bug in lookahead_not, or am I missing something here? Do you have any recommendations on how to parse an ATX Heading with NimbleParsec?

#nimbleparsec

2019-01-09 19:53:07 UTC

Most Liked

tmbb

It looks like what you want to do is more complex because the sequence of # at the beginning must match the sequence of # at the end. That requires context-sensitive features, which are not possible in nimble_parsec as it currently works.

However, you can “cheat” your way out of context-sensitivity by relying on the fact that headers can’t go more than 6 levels deep. That way, you only need to handle 6 possibilities and it can be made to work with nimble_parsec.

For an example (which doesn’t handle markup inside the header), look at the following code:

defmodule CommonMark do
  import NimbleParsec

  ignored_whitespace =
    ascii_string([?\s], min: 1)
    |> ignore()
    |> optional()

  header_char = utf8_char(not: ?\n)

  headers =
    for n <- 6..1 do
      prefix = String.duplicate("#", n)
      suffix = " " <> prefix

      content =
        lookahead_not(string(suffix))
        |> concat(header_char)
        |> times(min: 1)
        |> reduce({List, :to_string, []})

      ignore(optional(ascii_string([?\s], max: 3)))
      |> string(prefix)
      |> concat(ignored_whitespace)
      |> concat(content)
      |> optional(ignore(string(suffix)))
      |> post_traverse(:tag_with_level)
    end

  @doc false
  def tag_with_level(_rest, [content, prefix] = _args, context, _line, _offset) do
    result = [{:header, [level: byte_size(prefix)], content}]
    {result, context}
  end

  defparsec(
    :atx_header,
    choice(headers)
  )
end

Example output:

iex(50)> CommonMark.atx_header("## abc ")
{:ok, [{:header, [level: 2], "abc "}], "", %{}, {1, 0}, 7}
iex(51)> CommonMark.atx_header("## abc #")
{:ok, [{:header, [level: 2], "abc #"}], "", %{}, {1, 0}, 8}
iex(52)> CommonMark.atx_header("# abc #")
{:ok, [{:header, [level: 1], "abc"}], "", %{}, {1, 0}, 7}
iex(53)> CommonMark.atx_header("## abc #")
{:ok, [{:header, [level: 2], "abc #"}], "", %{}, {1, 0}, 8}
iex(54)> CommonMark.atx_header("# abc")
{:ok, [{:header, [level: 1], "abc"}], "", %{}, {1, 0}, 5}
iex(55)> CommonMark.atx_header("## abc")
{:ok, [{:header, [level: 2], "abc"}], "", %{}, {1, 0}, 6}
iex(56)> CommonMark.atx_header("### abc")
{:ok, [{:header, [level: 3], "abc"}], "", %{}, {1, 0}, 7}
iex(57)> CommonMark.atx_header("### abc ###")
{:ok, [{:header, [level: 3], "abc"}], "", %{}, {1, 0}, 11}
iex(58)> CommonMark.atx_header("### abc ##")
{:ok, [{:header, [level: 3], "abc ##"}], "", %{}, {1, 0}, 10}
iex(59)> CommonMark.atx_header("# abc ##")
{:ok, [{:header, [level: 1], "abc"}], "#", %{}, {1, 0}, 7}
iex(60)> CommonMark.atx_header("####### abc")
{:ok, [{:header, [level: 6], "# abc"}], "", %{}, {1, 0}, 11}

There are some warts, but you get the idea. You have to ensure that the ### is followed by a space, that a new line or an eos() comes afterwards, and so on. Regarding the handling of markup inside a header, it might be better to do that in another pass.
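The check described above (hashes preceded by a space, followed only by spaces, then a new line or the end of input) can be written once as a combinator and reused in both lookahead_not and ignore. What follows is only a minimal sketch, assuming NimbleParsec 1.x (fetched here with Mix.install) in which lookahead_not accepts an arbitrary combinator. The module name ATXClosingSketch and the names closing, content_char and opening are made up for illustration. Empty headings and backslash escapes are not handled.

```elixir
# Assumption: run as a script so that Mix.install/1 can fetch NimbleParsec 1.x.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule ATXClosingSketch do
  import NimbleParsec

  # Closing sequence: spaces, a run of #, optional spaces,
  # then a new line or the end of input.
  closing =
    ascii_string([?\s], min: 1)
    |> ascii_string([?#], min: 1)
    |> optional(ascii_string([?\s], min: 1))
    |> choice([ascii_char([?\n]), eos()])

  # Any character except a new line, as long as a closing
  # sequence does not start at the current position.
  content_char =
    lookahead_not(closing)
    |> utf8_char(not: ?\n)

  content =
    times(content_char, min: 1)
    |> reduce({List, :to_string, []})
    |> map({String, :trim, []})

  # Up to 3 spaces of indentation, then 1 to 6 # characters.
  opening =
    ignore(optional(ascii_string([?\s], min: 1, max: 3)))
    |> concat(
      ascii_string([?#], min: 1, max: 6)
      |> map({String, :length, []})
      |> unwrap_and_tag(:level)
    )

  heading =
    opening
    |> ignore(ascii_string([?\s], min: 1))
    |> concat(content)
    |> ignore(choice([closing, ascii_char([?\n]), eos()]))
    |> tag(:heading)

  defparsec(:heading, heading)
end
```

With this sketch, ATXClosingSketch.heading("### foo ### b\n") is expected to keep "### b" in the content, because the run of # is followed by something other than spaces.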

But I hope this can help you move forward.

EDIT: There are similar tricks for other context-sensitive features of markdown, which you can emulate with a simpler context-free grammar. For example, code delimited by a variable number of backticks.

Post #2

tmbb

On re-reading the spec, I’ve just noticed that the closing sequence of hashes doesn’t need to be the same as the opening sequence, which means I hadn’t understood your problem at all. Sorry.

Post #3

milmazz

@tmbb First of all, thanks for your feedback.

The specification does not require this: the sequence of # at the end does not need to match the sequence of # at the beginning. For example:

# foo ######
## foo #####
### foo ####
#### foo ###
##### foo ##
###### foo #

is valid input, and it should produce something like this (for example, if you transform the result into HTML):

<h1>foo</h1>
<h2>foo</h2>
<h3>foo</h3>
<h4>foo</h4>
<h5>foo</h5>
<h6>foo</h6>

But if you provide a string like "### foo ### b\n", it should produce <h3>foo ### b</h3>\n as the result. That’s why I need to ensure that the optional closing sequence of #s is preceded by a space and followed by spaces only (up to the end of the line or the end of the string).
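The expected results above can be captured in a small reference function to compare a parser’s output against. This is a sketch in plain Elixir, not a NimbleParsec solution: the module ATXReference and its parse/1 function are hypothetical names, it handles a single line only, and it does not process backslash escapes or inline content.

```elixir
defmodule ATXReference do
  # Hypothetical reference helper for checking parser output.
  # Returns {:ok, level, raw_content} for an ATX heading line, or :error.
  def parse(line) do
    line = String.trim_trailing(line, "\n")

    # Count the indentation, then the opening run of # characters.
    unindented = String.trim_leading(line, " ")
    indent = byte_size(line) - byte_size(unindented)
    rest = String.trim_leading(unindented, "#")
    level = byte_size(unindented) - byte_size(rest)

    cond do
      indent > 3 -> :error
      level < 1 or level > 6 -> :error
      # The opening sequence must be followed by a space or the end of the line.
      rest != "" and not String.starts_with?(rest, " ") -> :error
      true -> {:ok, level, rest |> drop_closing() |> String.trim()}
    end
  end

  # Drops a closing sequence: a run of # preceded by a space
  # and followed by spaces only.
  defp drop_closing(rest) do
    trimmed = String.trim_trailing(rest, " ")
    without_hashes = String.trim_trailing(trimmed, "#")

    cond do
      # No trailing run of #, so there is nothing to drop.
      without_hashes == trimmed -> rest
      # The run of # is preceded by a space: it is a closing sequence.
      String.ends_with?(without_hashes, " ") -> without_hashes
      # Otherwise the # characters belong to the content, e.g. "foo#".
      true -> rest
    end
  end
end

IO.inspect(ATXReference.parse("# foo ######"))
IO.inspect(ATXReference.parse("### foo ### b\n"))
```

The two calls at the end correspond to the first line of the list above and to the "### foo ### b\n" case. In the second one the run of # stays in the content because it is followed by "b" rather than by spaces only.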

Post #4
