Earmark - Elixir's Markdown Converter

Earmark’s v1.4 just got released.

Normally I do not bother the forum with releases. However, 1.4 finally (if I may say so) exposes the parsed Markdown as an AST via Earmark.as_ast. This is a prerequisite for quite a few issues I had to turn down, so I prefer to make a little bit of noise about it here.

Thanks! I was using Earmark.parse to convert MD to plaintext, but the Earmark.as_ast is a much more convenient and simpler API to do the same job.
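
For readers wondering what the Markdown-to-plaintext job looks like on top of the AST, here is a minimal sketch. It assumes the AST is a list of binaries and {tag, attrs, children} tuples (plus {:comment, [], lines} for comments, as described later in this thread). PlainText is a hypothetical module name, not part of Earmark, and the AST below is written by hand rather than produced by Earmark.as_ast.

```elixir
defmodule PlainText do
  # Hypothetical helper: concatenates all text found in an Earmark-style AST.
  def from_ast(ast) when is_list(ast), do: ast |> Enum.map(&from_ast/1) |> Enum.join()
  def from_ast(text) when is_binary(text), do: text
  # Comments carry no visible text, so they are dropped.
  def from_ast({:comment, _attrs, _lines}), do: ""
  def from_ast({_tag, _attrs, children}), do: from_ast(children)
end

# Hand-written AST in the shape assumed for "Hello *world*".
ast = [{"p", [], ["Hello ", {"em", [], ["world"]}]}]
IO.puts(PlainText.from_ast(ast))
```

Block-level nodes are simply concatenated here; a real converter would probably want to put newlines between paragraphs.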

Please be aware that Earmark.parse is private now.

BTW

1.4.0 2019/09/05

Wow this is great news, thanks for the update here, AST is going to be incredibly useful!

I’m using Earmark in my project and I’m planning to implement a few custom markups. For example, auto-link specific characters ( @ mentions, # issues, etc.) and support emojis. See almightycouch/gitgud#44.

The plugin mechanism has been deprecated in 1.4, and working with the AST should work just fine for implementing custom markups. I see that #27 should provide a way to render the AST as HTML.

When writing custom markups, say @ mentions for example, should I walk the entire AST searching for a @ character in the content of each tag (the third element of the tuple), ignore code elements, and build up the new AST from there?

Will a future version of Earmark provide a generic way to do this kind of thing? I think that most custom markups will need the same mechanism for traversing the AST and skipping content in inline code and code blocks.

Also, if I want to replace a markup with my own implementation, for example to support syntax highlighting in code blocks, should I simply replace each pre AST element with my own element?

I am really happy that folks are starting to use the AST.

I would walk the AST. I am planning to release such an AstWalker in a future version; right now I am busy with exposing an AstToHtmlTransformer in v1.4.1 (as you spotted correctly). The ETA is the middle of this week, but I am job hunting right now, so that is not certain yet.
Walking the AST yourself should not be a very difficult task, have a look at things like Macro.prewalk or look at my Traverse lib to get some ideas.
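
As a rough idea of what such a walk could look like, here is a small sketch in the spirit of Macro.prewalk. AstWalk is a hypothetical module (it is not the planned AstWalker and not the Traverse lib), and it assumes the {tag, attrs, children} node shape with binaries as text nodes. The callback sees a node first, then the walk descends into the children of whatever the callback returned; "code" and "pre" subtrees are left untouched.

```elixir
defmodule AstWalk do
  # Subtrees under these tags are returned as-is (assumption: tags are strings).
  @skip ["code", "pre"]

  # A callback may return a list of nodes for a text node; flatten merges it in.
  def prewalk(ast, fun) when is_list(ast) do
    ast |> Enum.map(&prewalk(&1, fun)) |> List.flatten()
  end

  def prewalk({tag, _attrs, _children} = node, _fun) when tag in @skip, do: node

  def prewalk(node, fun) do
    case fun.(node) do
      {tag, attrs, children} -> {tag, attrs, prewalk(children, fun)}
      other -> other
    end
  end
end

ast = [{"p", [], ["hi ", {"code", [], ["hi"]}]}]

upcased =
  AstWalk.prewalk(ast, fn
    text when is_binary(text) -> String.upcase(text)
    node -> node
  end)

IO.inspect(upcased)
```

Only the text outside the code element is changed by the callback in this example.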

Be careful if you want to use the to-be-released transformer, though: if you change the AST’s type it will break. Please remember that we are still in experimentation mode.

Summary:

  • yes #277 will allow you to just change the AST and get your HTML for free iff you do not change the AST’s type

  • Walking the AST will eventually be facilitated by tools in Earmark or maybe an associated lib, let us not forget that every project using ex_doc will pull in all of Earmark’s code, so tools shall probably go elsewhere.

  • Yes absolutely change e.g. {"pre", [], [whatever]} to {"post", atts, whateverelse} just watch out for the correct type.

Please keep me updated if you have any problem or question, either here or open a ticket in Earmark.
The more feedback I get, the faster the AST API will converge.
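
To illustrate the last point of the summary, here is a sketch of swapping a code block node for a custom one while keeping the {tag, attrs, children} type. The input shape {"pre", attrs, [{"code", attrs, [source]}]} is an assumption about what Earmark emits for code blocks, Highlight is a hypothetical module, and wrapping each line in a span is only a stand-in for a real highlighter.

```elixir
defmodule Highlight do
  def transform(ast) when is_list(ast), do: Enum.map(ast, &transform/1)

  # Replace the code block's single text child with "highlighted" children.
  def transform({"pre", pre_attrs, [{"code", code_attrs, [source]}]}) when is_binary(source) do
    {"pre", [{"class", "highlight"} | pre_attrs], [{"code", code_attrs, lines(source)}]}
  end

  def transform({tag, attrs, children}), do: {tag, attrs, transform(children)}
  def transform(text) when is_binary(text), do: text

  # Stand-in highlighter: one span per source line, newlines kept as text nodes.
  defp lines(source) do
    source
    |> String.split("\n")
    |> Enum.map(&{"span", [{"class", "line"}], [&1]})
    |> Enum.intersperse("\n")
  end
end

ast = [{"pre", [], [{"code", [{"class", "elixir"}], ["a = 1\nb = 2"]}]}]
IO.inspect(Highlight.transform(ast))
```

Every node that comes out is still a binary or a three-element tuple, which is the "correct type" the summary asks for.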

:+1:

Sure, this could be implemented in a separate package. Maybe a module such as AstWalker will suffice to cover most cases.

I was wondering if the current implementation was working with the AST internally when rendering HTML. Are Earmark.Options reflected in the AST or are they only applied to the rendered HTML?

In the meantime, can I use Floki.raw_html/2 to render the AST?

The internal implementation does not work with the AST yet, but it is a clear goal, and also the reason why the Transformer will stay inside Earmark, as it will be used for as_html eventually.

There are some subtle differences from Floki, which was the inspiration for the AST. It would be great if Floki.raw_html/2 worked; please let me know.

One thing I am almost sure would break Floki are comments, as Floki has a different shape for comment nodes, I did not see a reason to not have a more uniform type, so instead of {:comment, ...} Earmark produces {:comment, [], ...}
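
If comments do turn out to be a problem, a small adapter could bridge the two shapes before rendering. This is only a sketch: FlokiCompat is a hypothetical module, it assumes the third element of Earmark's comment node is a list of text lines, and joining them with a newline is a guess at the intended content.

```elixir
defmodule FlokiCompat do
  def convert(ast) when is_list(ast), do: Enum.map(ast, &convert/1)

  # Earmark-style {:comment, [], lines} becomes Floki-style {:comment, text}.
  def convert({:comment, _attrs, lines}), do: {:comment, Enum.join(lines, "\n")}

  def convert({tag, attrs, children}), do: {tag, attrs, convert(children)}
  def convert(text) when is_binary(text), do: text
end

ast = [{:comment, [], [" first line", "second line "]}, {"p", [], ["text"]}]
IO.inspect(FlokiCompat.convert(ast))
```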

I apologize but I am too stupid to quote your questions in my answer :blush:

And finally, concerning the Options.
Most options are needed for the AST rendering (e.g. pure_links:); however, all options concerning the inline rendering are ignored (e.g. smartypants:). And yes, the Transformer will take those into account.

I guess I can be more precise in the doc in the next version (or maybe I get a nice PR :smirk:)

Will give it a try asap.

Just select the text you want to quote in the comment you want and click the “Quote” popover link.

Thanks so much

I’ve tested with different inputs and Floki.raw_html/2 definitely works great (with basic typography tags, lists, code blocks) so far.

Great news, thx

I started experimenting with the AST. Pretty straightforward for what I want to do:

defmodule GitGud.Web.Markdown do
  @moduledoc """
  Conveniences for rendering Markdown.
  """

  @doc """
  Renders a Markdown formatted `content` to HTML.
  """
  @spec markdown(binary | nil) :: binary | nil
  def markdown(nil), do: nil
  def markdown(content) do
    case Earmark.as_ast(content) do
      {:ok, ast, _warnings} ->
        ast
        |> transform_ast()
        |> Floki.raw_html()
    end
  end

  #
  # Helpers
  #

  defp transform_ast(ast) do
    ast
    |> Enum.map(&transform_ast_node/1)
    |> List.flatten()
  end

  defp transform_ast_node({tag, _attrs, _ast} = node) when tag in ["code"], do: node
  defp transform_ast_node({tag, attrs, ast}) do
    {tag, attrs, transform_ast(ast)}
  end

  defp transform_ast_node(content) when is_binary(content) do
    # Short names may contain digits, underscores, "+" and "-" (e.g. :100:, :heart_eyes:, :+1:).
    content = Regex.replace(~r/:([a-z0-9_\+\-]+):/, content, &emojify_short_name/2)
    auto_link(content, Regex.scan(~r/#[0-9]+|@[a-zA-Z0-9_-]+|[a-f0-9]{7}/, content, return: :index))
  end

  defp emojify_short_name(match, short_name) do
    if emoji = Exmoji.from_short_name(short_name),
     do: Exmoji.EmojiChar.render(emoji),
   else: match
  end

  defp auto_link(content, []), do: content
  defp auto_link(content, indexes) do
    {content, rest, _offset} =
      Enum.reduce(List.flatten(indexes), {[], content, 0}, fn {idx, len}, {acc, rest, offset} ->
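        # Regex.scan/3 with return: :index yields byte offsets, so split on bytes;
        # String.split_at/2 counts graphemes and drifts after multi-byte characters
        # such as the emojis inserted above.
        {head, rest} = :erlang.split_binary(rest, idx - offset)
        {link, rest} =
          case :erlang.split_binary(rest, len) do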
            {"#" <> number, rest} ->
              {{"a", [], ["##{number}"]}, rest} # TODO
            {"@" <> login, rest} ->
              {{"a", [{"class", "has-text-black"}], ["@#{login}"]}, rest} # TODO
            {hash, rest} ->
              {{"a", [], [{"code", [{"class", "has-text-link"}], [hash]}]}, rest} # TODO
          end
        {acc ++ [head, link], rest, idx+len}
      end)
    List.flatten(content, [rest])
  end
end

I am afraid of what will happen with comments, as I deliberately chose to diverge from Floki's decision here.

But I can change this, given the value that feeding the AST back into Floki.raw_html/1 might provide, so please keep the great information flow coming.

Using Floki.raw_html/1 is just a means to an end until Earmark provides its own AST -> HTML function. I would not worry much about compatibility between the two. Perhaps, if the only divergence is for comments, it might be easier to just change that…

In my experimental implementation, parsing the text content of each node works fine but I’m wondering if this is the way to go.

Basically, each time I encounter the text content of an AST node (ignoring "code" tags), I

  1. use a regex to replace :emoji: with Unicode emojis.
  2. use a regex to match #N (issue reference), @USER (user mention) and a 7-character hex string such as fffffff (commit hash).

defp transform_ast_node(content) when is_binary(content) do
  content = Regex.replace(~r/:([a-z0-9_\+\-]+):/, content, &emojify_short_name/2)
  auto_link(content, Regex.scan(~r/#[0-9]+|@[a-zA-Z0-9_-]+|[a-f0-9]{7}/, content, return: :index))
end

Now 1) is replacing text directly while 2) injects new nodes (links) into the AST.
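
The node-injecting step 2) can also be sketched without index arithmetic, by letting Regex.split/3 keep the matches (include_captures: true) and turning each match into a link node. This swaps in a different technique from the auto_link/2 function above; AutoLink is a hypothetical module, only issue references and mentions are handled, and the links carry no attributes because the real targets are not known here.

```elixir
defmodule AutoLink do
  # Issue references (#N) and user mentions (@USER), as in the regex above.
  defp pattern, do: ~r/#[0-9]+|@[a-zA-Z0-9_-]+/
  # The same alternatives, anchored, to tell matches apart from plain text.
  defp whole_match, do: ~r/\A(?:#[0-9]+|@[a-zA-Z0-9_-]+)\z/

  # Returns a list of binaries (untouched text) and {"a", attrs, children} nodes.
  def split(content) do
    pattern()
    |> Regex.split(content, include_captures: true, trim: true)
    |> Enum.map(&to_node/1)
  end

  defp to_node(part) do
    if Regex.match?(whole_match(), part), do: {"a", [], [part]}, else: part
  end
end

IO.inspect(AutoLink.split("see #12 and @jo"))
```

Text between matches never contains a match itself, so checking each part against the anchored pattern is enough to classify it.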

I have more complex use-cases that will inject new nodes into the AST, but sometimes I will need to mark/flag these as “already parsed” for further processing. This is something that already happens with "code" tags. I just want further processing to skip these parts because they are already in their “final state”.

Hmm it might then be a good idea to apply the robustness principle if not too costly.

Maybe the transformer should accept also extended tuples in the AST, e.g.

    {:code, [], children, _}

That would allow AST transformers to leave their annotations.

However, it will fence us in concerning later extensions of the AST format. An alternative would be to allow only

    {:code, [], children, {:meta, _}}
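
Until a transformer accepts such extended tuples, annotations can be dropped again right before rendering. A minimal sketch, assuming the annotation is always the fourth tuple element; Annotations is a hypothetical module name.

```elixir
defmodule Annotations do
  def strip(ast) when is_list(ast), do: Enum.map(ast, &strip/1)

  # Extended node: forget the annotation, keep walking the children.
  def strip({tag, attrs, children, _annotation}), do: {tag, attrs, strip(children)}

  def strip({tag, attrs, children}), do: {tag, attrs, strip(children)}
  def strip(text) when is_binary(text), do: text
end

ast = [{"p", [], ["x", {:code, [], ["y"], {:meta, :final}}]}]
IO.inspect(Annotations.strip(ast))
```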

RELEASE NOTES for the latest Earmark Release :slight_smile:

version 1.4.1 2019/09/24

Slight change in the API, after some reflection on what an internal extension might look like.
I prefer to keep the cool name meta: for myself :blush:

So the acceptable AST would rather be extended by a map, where the custom: key shall be reserved for 3rd party applications

      {tag, atts, children, %{custom: ...}}

c.f. https://github.com/pragdave/earmark/issues/288
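
With that shape, a later processing pass could skip nodes that an earlier pass marked as finished. The sketch below assumes a hypothetical convention of storing final: true under the custom: key; nothing in Earmark defines that key's contents, and Passes is just an illustrative module name.

```elixir
defmodule Passes do
  # Applies `fun` to every text node outside of nodes flagged as final.
  def map_text(ast, fun) when is_list(ast), do: Enum.map(ast, &map_text(&1, fun))

  # Flagged by an earlier pass: leave the whole subtree alone.
  def map_text({_tag, _attrs, _children, %{custom: %{final: true}}} = node, _fun), do: node

  def map_text({tag, attrs, children, meta}, fun),
    do: {tag, attrs, map_text(children, fun), meta}

  def map_text({tag, attrs, children}, fun), do: {tag, attrs, map_text(children, fun)}
  def map_text(text, fun) when is_binary(text), do: fun.(text)
end

ast = [{"p", [], ["hi ", {"a", [], ["@jo"], %{custom: %{final: true}}}]}]
IO.inspect(Passes.map_text(ast, &String.upcase/1))
```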