HTML parsing in Elixir?

Anyone using Floki or something similar for parsing HTML? How does it compare to something like Python's lxml or bs4 in terms of features/performance?

Elixir seems like a perfect choice for making requests and downloading content, but I'm not sure whether it's the right choice for the actual processing.

2 Likes

I’m using Floki quite heavily. I deal with HTML parsing all day for work. 99% of our parsing was done in Ruby with Nokogiri before moving to Elixir (so I’ve not been overly concerned with speed).

TL;DR - Floki has a very simple feature set compared to some libraries in Ruby and Python, but it has made me realize I don't need a lot of abstractions, and it has kept my code very clean and easy to maintain so far.

As far as Floki goes, the feature set is pretty light compared to other parsing libraries in Ruby and Python: basically it parses the HTML and lets you traverse nodes and run searches. At first this was a concern coming from a library with many more features, but the further I got into Elixir for parsing, the simpler the code became compared to my Ruby/Nokogiri equivalent. I now realize I don't need any real 'sugar' or abstractions over the basics that Floki provides. Having said that, I've made some helper functions to simplify things like grabbing the text from the first result of an element search, and those do help a bit. I think it makes sense to keep that sort of thing out of the core Floki project, though. It made me step back and think about my use case more, and that helped me trim a lot of bloat in some of my complex parsing tasks.
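To give a rough idea of what I mean, here's a minimal sketch of that kind of helper. The module and the `first_text/2` name are mine for illustration, not part of Floki's API:

defmodule ParseHelpers do
  # Text of the first element matching `selector`, or nil if nothing matched.
  def first_text(html_tree, selector) do
    case Floki.find(html_tree, selector) do
      [first | _rest] -> first |> Floki.text() |> String.trim()
      [] -> nil
    end
  end
end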

One concrete example from last week was converting a Ruby-based parser that was a class of about 300 lines to an Elixir module of fewer than 60 lines. I know LoC doesn't mean much, but the code is now much easier to maintain, and the flow is easier to follow. I've not done any performance comparisons, and I'm not sure I need to - simply reducing complexity in the code is worth it for me... and I can finally test specific parts of the code much more easily. A lot of that can be chalked up to piping smaller functions and using some recursion rather than Ruby's for loops and so on. Granted, I now realize things in the Ruby class could have been written a lot better in hindsight, but I still prefer the Elixir approach to dealing with HTML and processing data in general.
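For a sense of the shape that takes (this is not the actual parser - the selectors and function names are made up for illustration), the Elixir version ends up as a pipeline of small, individually testable functions:

# Hypothetical shape of such a parser: each pipeline step is a small
# function that can be tested in isolation.
def parse(html_body) do
  html_body
  |> Floki.find("table.results tr")   # selector is illustrative
  |> Enum.map(&extract_row/1)
  |> Enum.reject(&is_nil/1)
end

defp extract_row(row) do
  case Floki.find(row, "td") do
    [name, size | _] -> %{name: Floki.text(name), size: Floki.text(size)}
    _ -> nil
  end
end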

7 Likes

I’m using Floki for “finding things” in a large XML set. Seems to work really well for that, although it’s not really intended to be a true XML library. It was the easiest to “get going” of all the XML parsing packages I looked at “that day”. That said, I don’t know that I’d use it for really detail-oriented XML parsing going forward, particularly if the structure is ‘intricate’.

…Paul

Floki is really nice to use, and it runs really fast as well. I use a GenServer and Redis to queue up the different URLs I need to scrape and get fantastic speeds (although it does peg the CPU at 100%).
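For illustration, here's a minimal sketch of that queueing pattern, with the Redis backing swapped for an in-memory Erlang :queue so it stays self-contained (the module and function names are hypothetical, not from my actual code):

defmodule ScrapeQueue do
  use GenServer

  # Holds a simple in-memory queue of URLs to scrape.
  def start_link(urls) do
    GenServer.start_link(__MODULE__, urls, name: __MODULE__)
  end

  # Workers call pop/0 to grab the next URL; push/1 adds new ones.
  def pop, do: GenServer.call(__MODULE__, :pop)
  def push(url), do: GenServer.cast(__MODULE__, {:push, url})

  @impl true
  def init(urls), do: {:ok, :queue.from_list(urls)}

  @impl true
  def handle_call(:pop, _from, queue) do
    case :queue.out(queue) do
      {{:value, url}, rest} -> {:reply, {:ok, url}, rest}
      {:empty, queue} -> {:reply, :empty, queue}
    end
  end

  @impl true
  def handle_cast({:push, url}, queue) do
    {:noreply, :queue.in(url, queue)}
  end
end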

The pipe operator in Elixir lends itself really well to Floki; you find yourself writing code that's concise yet correct. Strange!

Check out how I use it for different sites here: https://github.com/sergiotapia/magnetissimo/tree/master/lib/parsers

It’s really easy to read and write Floki parsing code:

def torrent_links(html_body) do
  html_body
  # find all anchors inside the listing cells
  |> Floki.find("td.tone_1_pad a")
  # pull out their href attributes as strings
  |> Floki.attribute("href")
  # keep only links to detail pages
  |> Enum.filter(fn url -> String.contains?(url, "/files/details/") end)
  # turn the relative paths into absolute URLs
  |> Enum.map(fn url -> "https://www.demonoid.ooo" <> url end)
end
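Called with a page's HTML body, it returns a flat list of absolute detail-page URLs, along these lines (output purely illustrative):

iex> torrent_links(html_body)
["https://www.demonoid.ooo/files/details/...", ...]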
4 Likes

Very nice example @sergio. If preferred, that could even be shortened a bit using the capture syntax, although it's arguable whether the readability improves or not…

def torrent_links(html_body) do
  html_body
  |> Floki.find("td.tone_1_pad a")
  |> Floki.attribute("href")
  |> Enum.filter(&String.contains?(&1, "/files/details/"))
  |> Enum.map(&("https://www.demonoid.ooo" <> &1))
end
1 Like