Earmark - Elixir's Markdown Converter

RobertDober · September 5, 2019, 7:10am

Earmark is a pure-Elixir Markdown converter.

It is intended to be used as a library (just call Earmark.as_html), but can also be used as a command-line tool (run mix escript.build first).

Output generation is pluggable.

Table of Contents

Options

Earmark.Cli.Implementation

Earmark.Options

Earmark.Options.make_options/1

Earmark.Options.relative_filename/2

Earmark.Options.with_postprocessor/2

Earmark.Internal

Earmark.Internal.as_ast!/2

Earmark.Internal.from_file!/2

Earmark.Internal.include/2

Earmark.Transform

Structure Conserving Transformers

Postprocessors and Convenience Functions

Structure Modifying Transformers

Earmark.Restructure.walk_and_modify_ast/4

Earmark.Restructure.split_by_regex/3

Contributing

Author

stefanchrobot · September 5, 2019, 8:36am

Thanks! I was using Earmark.parse to convert MD to plaintext, but Earmark.as_ast is a much simpler and more convenient API for the same job.

RobertDober · September 5, 2019, 8:51am

Please be aware that Earmark.parse is private now.

RobertDober · September 6, 2019, 5:57am

BTW

1.4.0 2019/09/05

edisonywh · September 10, 2019, 1:52am

Wow this is great news, thanks for the update here, AST is going to be incredibly useful!

MarioFlach · September 15, 2019, 7:59pm

I’m using Earmark in my project and I’m planning to implement a few custom markups. For example, auto-link specific characters ( @ mentions, # issues, etc.) and support emojis. See almightycouch/gitgud#44.

The plugin mechanism has been deprecated in 1.4, and working with the AST should work just fine for implementing custom markups. I see that #27 should provide a way to render the AST as HTML.

When writing custom markups, say @ mentions for example, should I walk the entire AST searching for an @ character in the content of each tag (the third element of the tuple), ignore code elements, and build up the new AST from there?

Will a future version of Earmark provide a generic way to do this kind of thing? I think that most custom markups will need the same mechanism for traversing the AST and skipping content in inline code and code blocks.

Also, if I want to replace a markup with my own implementation, for example to support syntax highlighting in code blocks, should I simply replace each pre AST element with my own element?

RobertDober · September 15, 2019, 8:15pm

I am really happy that folks are starting to use the AST.

I would walk the AST. I am planning to release such an AstWalker in a future version. Right now I am busy exposing an AstToHtmlTransformer (v1.4.1, ETA middle of this week, as you spotted correctly), but I am also job hunting at the moment, so that date is not certain yet.
Walking the AST yourself should not be a very difficult task. Have a look at things like Macro.prewalk, or look at my Traverse lib to get some ideas.
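
For readers who want a starting point, here is a minimal sketch of such a walk. It is not part of Earmark: the module name AstWalkSketch and the function map_text/3 are made up for illustration, and it assumes the AST is a list of binaries and {tag, attrs, children} triples as discussed in this thread.

```elixir
defmodule AstWalkSketch do
  # Hypothetical helper, not part of Earmark: applies `fun` to every text
  # (binary) node, leaving everything below a tag listed in `skip` untouched.
  def map_text(ast, fun, skip \\ ["code"]) when is_list(ast) do
    Enum.flat_map(ast, &walk_node(&1, fun, skip))
  end

  # `fun` may return a single node or a list of nodes (e.g. text plus links).
  defp walk_node(text, fun, _skip) when is_binary(text) do
    List.wrap(fun.(text))
  end

  defp walk_node({tag, attrs, children}, fun, skip) do
    if tag in skip do
      [{tag, attrs, children}]
    else
      [{tag, attrs, map_text(children, fun, skip)}]
    end
  end
end

ast = [{"p", [], ["hi ", {"code", [], ["hi"]}]}]
AstWalkSketch.map_text(ast, &String.upcase/1)
# => [{"p", [], ["HI ", {"code", [], ["hi"]}]}]
```

Because the callback may return a list, the same walk covers both plain text replacement and the injection of new nodes.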

Be careful if you want to use the to-be-released transformer, though: if you change the AST’s type it will break. Please remember that we are still in experimentation mode.

Summary:

Yes, #277 will allow you to just change the AST and get your HTML for free, iff you do not change the AST’s type.
Walking the AST will eventually be facilitated by tools in Earmark or maybe in an associated lib. Let us not forget that every project using ex_doc pulls in all of Earmark’s code, so such tools should probably go elsewhere.
Yes, absolutely, change e.g. {"pre", [], [whatever]} to {"post", atts, whateverelse}; just watch out for the correct type.
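
As an illustration of the last point, a type-preserving replacement of pre elements might look like the sketch below. The module name PreRewriteSketch and the "highlight" class are hypothetical; the only assumption about Earmark is the {tag, attrs, children} node shape mentioned above.

```elixir
defmodule PreRewriteSketch do
  # Hypothetical example: wraps every "pre" element in a "div" carrying a
  # class that a syntax highlighter could target. Every node stays a
  # {tag, attrs, children} triple, so the AST keeps its type.
  def rewrite(ast) when is_list(ast), do: Enum.map(ast, &rewrite_node/1)

  defp rewrite_node({"pre", attrs, children}) do
    {"div", [{"class", "highlight"}], [{"pre", attrs, children}]}
  end

  defp rewrite_node({tag, attrs, children}), do: {tag, attrs, rewrite(children)}
  defp rewrite_node(text) when is_binary(text), do: text
end

PreRewriteSketch.rewrite([{"pre", [], [{"code", [], ["x = 1"]}]}])
# => [{"div", [{"class", "highlight"}], [{"pre", [], [{"code", [], ["x = 1"]}]}]}]
```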

Please keep me updated if you have any problem or question, either here or open a ticket in Earmark.
The more feedback I get, the faster the AST API will converge.

MarioFlach · September 15, 2019, 8:37pm

Sure, this could be implemented in a separated package. Maybe a module such as AstWalker will suffice to cover most cases.

I was wondering if the current implementation was working with the AST internally when rendering HTML. Are Earmark.Options reflected in the AST or are they only applied to the rendered HTML?

In the meantime, can I use Floki.raw_html/2 to render the AST?

RobertDober · September 15, 2019, 8:38pm

The internal implementation does not work with the AST yet, but it is a clear goal, and also the reason why the Transformer will stay inside Earmark, as it will be used for as_html eventually.

There are some subtle differences from Floki, which was the inspiration for the AST. It would be great if Floki.raw_html/2 worked. Please let me know.

One thing I am almost sure would break Floki is comments, as Floki has a different shape for comment nodes. I did not see a reason not to have a more uniform type, so instead of {:comment, ...} Earmark produces {:comment, [], ...}.
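
If comments do turn out to be a problem, one could convert them before rendering. The sketch below is only an illustration: the module name FlokiCompatSketch is made up, and it assumes that Earmark comment nodes carry a list of binaries as their third element and that Floki expects {:comment, text}.

```elixir
defmodule FlokiCompatSketch do
  # Assumption: Earmark comment nodes look like {:comment, [], lines} with
  # `lines` a list of binaries, while Floki expects {:comment, text}.
  def to_floki(ast) when is_list(ast), do: Enum.map(ast, &convert/1)

  defp convert({:comment, _attrs, lines}), do: {:comment, Enum.join(lines, "\n")}
  defp convert({tag, attrs, children}), do: {tag, attrs, to_floki(children)}
  defp convert(text) when is_binary(text), do: text
end

FlokiCompatSketch.to_floki([{:comment, [], [" note "]}, {"p", [], ["x"]}])
# => [{:comment, " note "}, {"p", [], ["x"]}]
```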

I apologize but I am too stupid to quote your questions in my answer

RobertDober · September 15, 2019, 8:46pm

And finally, concerning the Options.
Most options are needed for the AST rendering, e.g. pure_links:. However, all options concerning the inline rendering are ignored, e.g. smartypants:. And yes, the Transformer will take those into account.

I guess I can be more precise in the doc in the next version (or maybe I get a nice PR).

MarioFlach · September 15, 2019, 8:59pm

Will give it a try asap.

Just select the text you want to quote in the comment you want and click the “Quote” popover link.

RobertDober · September 16, 2019, 6:35am

Thanks so much

MarioFlach · September 18, 2019, 8:33am

I’ve tested with different inputs and Floki.raw_html/2 definitely works great (with basic typography tags, lists, code blocks) so far.

RobertDober · September 18, 2019, 9:08am

Great news, thx

MarioFlach · September 19, 2019, 8:01am

I started experimenting with the AST. Pretty straightforward for what I want to do:

defmodule GitGud.Web.Markdown do
  @moduledoc """
  Conveniences for rendering Markdown.
  """

  @doc """
  Renders a Markdown formatted `content` to HTML.
  """
  @spec markdown(binary | nil) :: binary | nil
  def markdown(nil), do: nil
  def markdown(content) do
    case Earmark.as_ast(content) do
      {:ok, ast, _warnings} ->
        ast
        |> transform_ast()
        |> Floki.raw_html()
    end
  end

  #
  # Helpers
  #

  defp transform_ast(ast) do
    ast
    |> Enum.map(&transform_ast_node/1)
    |> List.flatten()
  end

  defp transform_ast_node({tag, _attrs, _ast} = node) when tag in ["code"], do: node
  defp transform_ast_node({tag, attrs, ast}) do
    {tag, attrs, transform_ast(ast)}
  end

  defp transform_ast_node(content) when is_binary(content) do
    content = Regex.replace(~r/:([a-z0-9_+-]+):/, content, &emojify_short_name/2)
    auto_link(content, Regex.scan(~r/#[0-9]+|@[a-zA-Z0-9_-]+|[a-f0-9]{7}/, content, return: :index))
  end

  defp emojify_short_name(match, short_name) do
    if emoji = Exmoji.from_short_name(short_name) do
      Exmoji.EmojiChar.render(emoji)
    else
      match
    end
  end

  defp auto_link(content, []), do: content
  defp auto_link(content, indexes) do
    # Regex.scan/3 with `return: :index` yields byte offsets, so slice with
    # binary_part/3 rather than String.split_at/2 (which counts graphemes and
    # drifts as soon as the text contains emojis or other non-ASCII characters).
    {nodes, offset} =
      Enum.reduce(List.flatten(indexes), {[], 0}, fn {idx, len}, {acc, offset} ->
        head = binary_part(content, offset, idx - offset)
        link =
          case binary_part(content, idx, len) do
            "#" <> number ->
              {"a", [], ["##{number}"]} # TODO
            "@" <> login ->
              {"a", [{"class", "has-text-black"}], ["@#{login}"]} # TODO
            hash ->
              {"a", [], [{"code", [{"class", "has-text-link"}], [hash]}]} # TODO
          end
        {acc ++ [head, link], idx + len}
      end)
    # Append whatever text follows the last match.
    nodes ++ [binary_part(content, offset, byte_size(content) - offset)]
  end
end

RobertDober · September 19, 2019, 9:35am

I am afraid of what will happen with comments, as I deliberately chose to diverge from Floki's decision here.

But I can change this, given the value that feeding the AST back into Floki.raw_html/1 might provide, so please keep the great flow of information coming.

MarioFlach · September 19, 2019, 2:30pm

Using Floki.raw_html/1 is just a means to an end until Earmark provides its own AST -> HTML function. I would not bother much about compatibility between the two. Perhaps, if the only divergence is for comments, it might be easier to just change that…

In my experimental implementation, parsing the text content of each node works fine, but I’m wondering if this is the way to go.

Basically, each time I encounter the text content of an AST node (ignoring "code" tags), I

1) use a regex to replace :emoji: with Unicode emojis.
2) use a regex to match #N (issue reference), @USER (user mention) and ffffff (commit hash).

defp transform_ast_node(content) when is_binary(content) do
  content = Regex.replace(~r/:([a-z0-9_+-]+):/, content, &emojify_short_name/2)
  auto_link(content, Regex.scan(~r/#[0-9]+|@[a-zA-Z0-9_-]+|[a-f0-9]{7}/, content, return: :index))
end

Now 1) is replacing text directly while 2) injects new nodes (links) into the AST.

I have more complex use cases that will inject new nodes into the AST, and sometimes I will need to mark/flag these as “already parsed” for further processing. This is something that already happens with "code" tags. I just want further processing to skip these parts because they are already in their “final state”.

RobertDober · September 19, 2019, 2:52pm

Hmm it might then be a good idea to apply the robustness principle if not too costly.

Maybe the transformer should accept also extended tuples in the AST, e.g.

    {:code, [], children, _}

That would allow AST transformers to leave their annotations.

However, it will fence us in concerning later extensions of the AST format. An alternative would be to allow only

    {:code, [], children, {:meta, _}}

RobertDober · September 24, 2019, 8:20am

RELEASE NOTES for the latest Earmark Release

version 1.4.1 2019/09/24

277 Expose an AST to HTML Transformer
While it should be faster to call to_ast|>transform, it cannot be used instead of as_html yet,
as the API is not yet stable and some subtle differences in the output need to be addressed.
278 Implementing better GFM Table support
Because of compatibility issues we use a new option, gfm_tables, defaulting to false for this.
With this option Earmark will implement its own table extension and GFM tables at the same
time.
279 Languages in code blocks were limited to alphanum names, thus excluding, e.g., C#
281 Urls in links were URL encoded, which is actually a bug
It is the markdown author’s responsibility to url encode her urls; if she did so correctly,
we double encoded the url before this fix.
282 Always create a <tbody> in tables
Although strictly speaking a <tbody> is only needed when there is a <thead>, semantic
HTML suggests the presence of <tbody> anyway.
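
The double encoding described in 281 is easy to reproduce with the standard library. The snippet below is only an illustration using URI.encode/1; whether Earmark used this exact function is an assumption, and the URL is made up.

```elixir
# An author-supplied URL that is not yet encoded (hypothetical example).
url = "https://example.com/a b"

# Encoding once is what the markdown author is expected to do herself.
once = URI.encode(url)
# => "https://example.com/a%20b"

# Encoding the already encoded URL again escapes the "%" itself.
twice = URI.encode(once)
# => "https://example.com/a%2520b"
```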

RobertDober · September 24, 2019, 8:31am

Slight change in the API, after some reflection on what an internal extension might look like.
I prefer to keep the cool name meta: for myself.

So the acceptable AST would rather be extended by a map, where the custom: key shall be reserved for 3rd party applications:

      {tag, atts, children, %{custom: ...}}

c.f. https://github.com/pragdave/earmark/issues/288
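
To make the proposal concrete, here is a rough sketch of how a third-party pass could use such a fourth element to flag nodes as “already parsed”. Nothing in it is Earmark API: the module name CustomMetaSketch, its functions, and the final: key inside custom: are all hypothetical.

```elixir
defmodule CustomMetaSketch do
  # Assumption: nodes are {tag, atts, children} or, as proposed above,
  # {tag, atts, children, %{custom: ...}}. The :final key is hypothetical.

  # Marks a node as finished so that later passes leave it alone.
  def mark_final({tag, atts, children}) do
    {tag, atts, children, %{custom: %{final: true}}}
  end

  # Applies `fun` to text nodes, skipping any subtree marked as final.
  def map_text(ast, fun) when is_list(ast), do: Enum.map(ast, &walk(&1, fun))

  defp walk({_, _, _, %{custom: %{final: true}}} = node, _fun), do: node
  defp walk({tag, atts, children, meta}, fun), do: {tag, atts, map_text(children, fun), meta}
  defp walk({tag, atts, children}, fun), do: {tag, atts, map_text(children, fun)}
  defp walk(text, fun) when is_binary(text), do: fun.(text)

  # Drops the fourth element again, e.g. before handing the AST to a
  # renderer that only understands triples.
  def strip_meta(ast) when is_list(ast), do: Enum.map(ast, &strip/1)

  defp strip({tag, atts, children, _meta}), do: {tag, atts, strip_meta(children)}
  defp strip({tag, atts, children}), do: {tag, atts, strip_meta(children)}
  defp strip(text) when is_binary(text), do: text
end

link = CustomMetaSketch.mark_final({"a", [], ["@bob"]})
ast = CustomMetaSketch.map_text([{"p", [], ["hi ", link]}], &String.upcase/1)
# => [{"p", [], ["HI ", {"a", [], ["@bob"], %{custom: %{final: true}}}]}]
CustomMetaSketch.strip_meta(ast)
# => [{"p", [], ["HI ", {"a", [], ["@bob"]}]}]
```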