Hi,
Does anyone know of a lib or an easy way to convert HTML to text (keeping the structure)?
Important: The readability should be maintained.
thanks
This looks promising (as far as options for keeping the formatting):
Thanks. Sorry for the perhaps obvious question, but how can I use Node tools server-side in a Phoenix controller?
The easiest way is to just call the shell command, but there are lots of helper libraries out there that make a lot of things easier. I'm partial to erlexec when I need a longer-running process, if it is not a Port program.
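To make "just call the shell command" concrete, here is a minimal sketch using `System.cmd/3` from the Elixir standard library. The module name `ExternalTool` and its `run/2` function are made up for illustration and are not part of Phoenix or erlexec; the sketch assumes the tool is on the `PATH` and finishes quickly.

```elixir
defmodule ExternalTool do
  @moduledoc "Hypothetical helper for running a command-line tool from Elixir."

  # Runs `cmd` with `args` and returns {:ok, output} or {:error, reason}.
  def run(cmd, args) when is_binary(cmd) and is_list(args) do
    case System.find_executable(cmd) do
      nil ->
        {:error, :not_found}

      path ->
        # Arguments are passed as a list, so no shell is involved and
        # nothing needs to be quoted or escaped.
        case System.cmd(path, args, stderr_to_stdout: true) do
          {output, 0} -> {:ok, output}
          {output, status} -> {:error, {status, output}}
        end
    end
  end
end
```

A controller action could then call something like `ExternalTool.run("echo", ["hello"])` and pattern match on the result. The call blocks until the tool exits, so for slow tools it is probably worth running it outside the request process.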
For anyone looking for the same…
I now use the lynx browser with the dump option. The node solution worked too, but was too slow and the results were not so good compared to lynx.
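def html2text(html) do
  # Random suffix so concurrent requests do not overwrite each other's file.
  random = Integer.to_string(:rand.uniform(10_000_000))
  file_in = Path.join(System.tmp_dir!(), random <> ".html")
  File.write!(file_in, html)

  # Porcelain.shell/1 runs the command through the system shell.
  result = Porcelain.shell("lynx -dump " <> file_in)

  # Remove the temporary file once lynx has read it.
  File.rm(file_in)
  result.out
end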
If you need something simpler, this inserts a line break (\n\r) in place of br tags and closing p, div, li and h* tags, and puts a hyphen in front of each li item. Then it strips all remaining tags (requires PhoenixHtmlSanitizer).
def sanitize(text) do
  {:safe, result} =
    text
    # Prefix each list item with a hyphen.
    |> String.replace(~r/<li>/, "- ")
    # Turn br tags and closing block tags into line breaks.
    |> String.replace(
      ~r/<\/?\s?br>|<\/\s?p>|<\/\s?li>|<\/\s?div>|<\/\s?h.>/,
      "\n\r"
    )
    |> PhoenixHtmlSanitizer.Helpers.sanitize(:strip_tags)

  result
end
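If you want to try the same idea without the sanitizer dependency, a rough sketch with plain regexes could look like this. The `PlainText` module is hypothetical. It uses a plain "\n" as the line break and also accepts `<br/>` and `<li>` tags with attributes. It does not decode HTML entities or handle malformed markup, so treat it as a quick approximation rather than a real sanitizer.

```elixir
defmodule PlainText do
  @moduledoc "Hypothetical regex-only HTML to text helper."

  def from_html(html) when is_binary(html) do
    html
    # Prefix each list item with a hyphen.
    |> String.replace(~r/<li[^>]*>/i, "- ")
    # br tags and closing block-level tags become line breaks.
    |> String.replace(~r/<br\s*\/?>|<\/(p|div|li|h[1-6])\s*>/i, "\n")
    # Anything that still looks like a tag is dropped.
    |> String.replace(~r/<[^>]*>/, "")
    |> String.trim()
  end
end
```

For example, `PlainText.from_html("<h1>Title</h1><ul><li>one</li><li>two</li></ul>")` gives `"Title\n- one\n- two"`.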
Another way would be to use Pandex, an Elixir wrapper for Pandoc:
iex> Pandex.html_to_plain "<h1 id=\"title\">Title</h1>\n<h2 id=\"list\">List</h2>\n<ul>\n<li>one</li>\n<li>two</li>\n<li>three</li>\n</ul>\n"
{:ok, "\n\nTITLE\n\n\nList\n\n- one\n- two\n- three\n\n"}
For that you need to have pandoc installed.