HTML to text?

Hi,

Does anyone know a library or an easy way to convert HTML to text (keeping the structure)?

Important: The readability should be maintained.

thanks

This looks promising (as far as options for keeping the formatting):


Thanks. Sorry for the perhaps obvious question, but how can I use Node tools server-side in a Phoenix controller?

The easiest way is to just call the shell command, but there are lots of helper libraries out there that make things easier. I’m partial to erlexec when I need a longer-running process that is not a Port program.
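As a minimal sketch of the "just call the shell command" approach, the standard library's System.cmd/3 runs an external program and returns its output together with the exit status. The module name ShellExample and the helper run/2 are hypothetical, used only for illustration; the same helper could be called from a controller action with the Node tool's executable and arguments.

```elixir
defmodule ShellExample do
  # Hypothetical helper: runs an external program and returns its output.
  # System.cmd/3 returns {output, exit_status}.
  def run(program, args) do
    case System.cmd(program, args, stderr_to_stdout: true) do
      {output, 0} -> {:ok, output}
      {output, status} -> {:error, {status, output}}
    end
  end
end

# On a Unix-like system this returns {:ok, "hello\n"}
ShellExample.run("echo", ["hello"])
```

Note that System.cmd/3 blocks until the program exits and raises if the executable cannot be found, so for long-running processes a library such as erlexec is likely the better fit.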


For anyone looking for the same…

I now use the lynx browser with the dump option. The node solution worked too, but was too slow and the results were not so good compared to lynx.

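def html2text(html) do
  # Write the HTML to a randomly named temp file so lynx can read it
  random = Integer.to_string(:rand.uniform(10_000_000))
  file_in = Path.join(System.tmp_dir!(), random <> ".html")
  File.write!(file_in, html)

  try do
    # -dump makes lynx print the rendered page as plain text
    result = Porcelain.shell("lynx -dump " <> file_in)
    result.out
  after
    # Remove the temp file even if lynx fails
    File.rm(file_in)
  end
end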

If you need something simpler, this just replaces br tags and the closing p, div, li and h* tags with line breaks (\r\n) and prefixes each li with a hyphen. Then it strips all remaining tags (requires Phoenix and PhoenixHtmlSanitizer).

def sanitize(text) do
  {:safe, result} =
    text
    # Prefix each list item with a hyphen
    |> String.replace(~r/<li>/, "- ")
    # Turn <br> and closing block-level tags into line breaks
    |> String.replace(~r/<br\s*\/?>|<\/\s?(p|li|div|h\d)>/, "\r\n")
    # Strip all remaining tags
    |> PhoenixHtmlSanitizer.Helpers.sanitize(:strip_tags)

  result
end

Another way would be to use pandex, an Elixir wrapper for Pandoc.

iex> Pandex.html_to_plain "<h1 id=\"title\">Title</h1>\n<h2 id=\"list\">List</h2>\n<ul>\n<li>one</li>\n<li>two</li>\n<li>three</li>\n</ul>\n"
{:ok, "\n\nTITLE\n\n\nList\n\n-   one\n-   two\n-   three\n\n"}

For that you need to have pandoc installed.
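Since the wrapper depends on an external binary, it may be worth checking for it before converting. A rough sketch using the standard library's System.find_executable/1; the module name PandocCheck and the function available?/1 are hypothetical:

```elixir
defmodule PandocCheck do
  # Hypothetical helper: reports whether an executable is on the PATH.
  # System.find_executable/1 returns the full path, or nil if not found.
  def available?(program \\ "pandoc") do
    System.find_executable(program) != nil
  end
end

if PandocCheck.available?() do
  IO.puts("pandoc found")
else
  IO.puts("pandoc is not installed or not on the PATH")
end
```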
