Earmark - Possible to extract all text nodes from a Markdown file?

Is it possible to extract all the raw text nodes from Markdown files using Earmark?

I’ve been using the built-in map_ast_with function without much luck; I seem to miss a lot of text.

  defp markdown_to_text(markdown) do
    {:ok, ast, _} = Earmark.as_ast(markdown)
    fun = fn node, acc ->
      case node do
        [content] when is_binary(content) ->
          { node, [acc | content] }
        _ ->
          { node, acc }
      end
    end

    { ast, text } = Earmark.Transform.map_ast_with(ast, [], fun)
    Enum.join(text)
  end

How about this?

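  # Parses the Markdown and returns its text content as one trimmed string.
  def extract_text_from_markdown(md) when is_binary(md) do
    {:ok, ast, _} = Earmark.as_ast(md)
    String.trim(extract_text_from_ast(ast, ""))
  end

  # Walks the AST depth-first, appending every text leaf to the accumulator.
  defp extract_text_from_ast(ast, result) when is_list(ast) and is_binary(result) do
    Enum.reduce(ast, result, fn
      # Element node: recurse into its children.
      {_html_tag, _atts, children, _m}, acc ->
        extract_text_from_ast(children, acc)

      # Text leaf: append it, separated by a space.
      text_leaf, acc when is_binary(text_leaf) ->
        acc <> " " <> text_leaf
    end)
  end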

Example:

iex(46)> md_string = "# A Header\n\n**bold**\n\n+ li 1\n+ li 2"
"# A Header\n\n**bold**\n\n+ li 1\n+ li 2"
iex(47)> Earmark.as_ast!(md_string)
[
  {"h1", [], ["A Header"], %{}},
  {"p", [], [{"strong", [], ["bold"], %{}}], %{}},
  {"ul", [], [{"li", [], ["li 1"], %{}}, {"li", [], ["li 2"], %{}}], %{}}
]
iex(48)> extract_text_from_markdown(md_string)
"A Header bold li 1 li 2"
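
If you would rather get the text nodes back as a list and join them only at the end, the same recursion can be written with Enum.flat_map. This is only a rough sketch: the TextNodes module and its function names are made up for illustration and are not part of Earmark, and the AST is hard-coded from the output above so the snippet runs without the library.

```elixir
defmodule TextNodes do
  # Hypothetical helper, not an Earmark API. It assumes the AST is a list whose
  # elements are either binaries or {tag, attrs, children, meta} tuples.
  def collect(ast) when is_list(ast) do
    Enum.flat_map(ast, fn
      # Element node: recurse into its children.
      {_tag, _attrs, children, _meta} -> collect(children)
      # Text leaf: keep it as a one-element list.
      text when is_binary(text) -> [text]
    end)
  end

  # Joins the collected leaves with single spaces.
  def to_text(ast), do: ast |> collect() |> Enum.join(" ")
end

# AST copied from the Earmark.as_ast!/1 output shown above.
ast = [
  {"h1", [], ["A Header"], %{}},
  {"p", [], [{"strong", [], ["bold"], %{}}], %{}},
  {"ul", [], [{"li", [], ["li 1"], %{}}, {"li", [], ["li 2"], %{}}], %{}}
]

IO.inspect(TextNodes.collect(ast))
# ["A Header", "bold", "li 1", "li 2"]
IO.puts(TextNodes.to_text(ast))
# A Header bold li 1 li 2
```

Keeping the leaves in a list means there is no leading space to trim, and you can choose the separator (or filter out whitespace-only nodes) before joining.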

That works perfectly, thanks @knoebber!

I had forgotten you could do pattern matching on anonymous functions like this; it makes reducers so much easier to read.
