Excerpt implementation for Floki

Hello there! I’ve been working on an Excerpt implementation on top of Floki that greedily includes tags: once the length limit is reached, the text of the final node is still included in full. As a next step I’d love to be able to preserve the tags themselves, but this is a good starting point; any advice on that is welcome.

Let me know if there is anything I could improve, or if I’m likely to run into performance issues by using recursion for this. It’s also worth noting this is an adaptation of the DeepText search that comes with Floki, changed for my own needs.

defmodule PayPip.Text do
  @moduledoc """
  PayPip.Text is a strategy for getting text nodes from an HTML tree using a deep
  search limited to a certain length, plus the full content of the final node. It
  concatenates every text node into an accumulator until that limit is reached.
  """

  @type html_tree :: tuple | list

  @doc """
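  Gets text up to `max_length` characters, plus the rest of the final text node.

  Returns `{text, overshoot}`, where `overshoot` is how far the text runs past
  `max_length` (negative when the text is shorter). Separators are not counted.

  ## Examples
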
      iex> PayPip.Text.get([{"a", [], ["The meaning of life is...", {"strong", [], ["something else"]}] }], 5)
      {"The meaning of life is...", 20}
      iex> PayPip.Text.get([{"a", [], ["The meaning of life is...", {"strong", [], ["something else"]}] }], 140, " ")
      {"The meaning of life is... something else", -101}
  """
  @spec get(html_tree, number, binary) :: {binary, number}
  def get(html_tree, max_length \\ 255, sep \\ "") do
    {text, remaining} = get_text(html_tree, {"", max_length}, sep)
    # `remaining` counts down from max_length; flip the sign once to report the overshoot
    {text, -remaining}
  end

  # initial piece of text
  defp get_text(text, {"", remaining}, _sep) when is_binary(text) do
    {text, remaining - String.length(text)}
  end
  # all other text (the separator is not counted towards the length)
  defp get_text(text, {acc, remaining}, sep) when is_binary(text) do
    {acc <> sep <> text, remaining - String.length(text)}
  end
  # walk a list of nodes, stopping once the running budget is used up
  defp get_text(nodes, acc, sep) when is_list(nodes) do
    Enum.reduce_while(nodes, acc, fn child, {_text, remaining} = state ->
      if remaining > 0 do
        {:cont, get_text(child, state, sep)}
      else
        {:halt, state}
      end
    end)
  end
  # ignore comments
  defp get_text({:comment, _}, {acc, max_length}, _), do: {acc, max_length}
  # turn BR tags into new lines
  defp get_text({"br", _, _}, {acc, max_length}, _), do: {acc <> "\n", max_length - 1}
  # process children
  defp get_text({_, _, nodes}, acc, sep) do
    get_text(nodes, acc, sep)
  end
end
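
On the tag-preserving step mentioned above, one possible direction is to truncate the tree itself instead of flattening it to a string. What follows is only a rough sketch under assumptions: `ExcerptSketch` and `truncate/2` are made-up names (not part of Floki), and it only handles text nodes, comments and `{tag, attrs, children}` elements.

```elixir
defmodule ExcerptSketch do
  # Hypothetical module, not part of Floki or PayPip.Text.
  # Returns {truncated_tree, remaining}. `remaining` goes negative when the
  # final text node overshoots the limit, because text nodes are kept whole.

  # Lists of nodes: keep children until the budget is used up.
  def truncate(nodes, remaining) when is_list(nodes) do
    {kept, remaining} =
      Enum.reduce_while(nodes, {[], remaining}, fn child, {kept, remaining} ->
        if remaining > 0 do
          {child, remaining} = truncate(child, remaining)
          {:cont, {[child | kept], remaining}}
        else
          {:halt, {kept, remaining}}
        end
      end)

    {Enum.reverse(kept), remaining}
  end

  # Text nodes are included in full and charged against the budget.
  def truncate(text, remaining) when is_binary(text) do
    {text, remaining - String.length(text)}
  end

  # Comments are kept as-is and cost nothing.
  def truncate({:comment, _} = comment, remaining), do: {comment, remaining}

  # Elements keep their tag and attributes; only their children are truncated.
  def truncate({tag, attrs, children}, remaining) when is_list(children) do
    {children, remaining} = truncate(children, remaining)
    {{tag, attrs, children}, remaining}
  end
end

tree = [{"p", [], ["Hello ", {"strong", [], ["world"]}, " and more"]}]
{excerpt, _remaining} = ExcerptSketch.truncate(tree, 8)
# excerpt keeps the <p> and <strong> wrappers and drops " and more"
```

Since the result is still a Floki-style tree, it should be possible to render it back to markup with `Floki.raw_html/1`, though I have not tried this against real documents.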