Web scraping tools

mischov · September 23, 2019, 10:44pm

@oltarasenko A couple of bits of feedback about common mistakes I see people make when using Floki.

Recommend a safer HTML parser than Floki+mochiweb_html.

I know it’s nice not to have to start an Elixir article with “and then install Rust”, but web scraping is exactly the situation where you do want an HTML5-compliant parser: you don’t know how well formed the HTML will be, and mochiweb_html (Floki’s default parser) can parse malformed parts of the HTML incorrectly, potentially dropping them silently.

At the very least use the html5ever parser with Floki, though you may have trouble getting it to compile since html5ever_elixir hasn’t been updated for nine months despite an outstanding need to upgrade Rustler so that it works with more recent versions of Erlang/OTP.
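For reference, switching Floki to html5ever is a small configuration change. A minimal sketch, assuming the html5ever package compiles in your environment; the version requirements below are placeholders, not recommendations, so check Hex for ones that match your Erlang/OTP:

```elixir
# mix.exs: add html5ever alongside Floki
# (both version requirements are assumptions for illustration)
defp deps do
  [
    {:floki, "~> 0.23"},
    {:html5ever, "~> 0.7"}
  ]
end

# config/config.exs: have Floki parse with html5ever instead of mochiweb_html
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```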

Better yet, use Meeseeks instead of Floki because it will by default provide you an HTML5 compliant parser based on html5ever that does compile on the latest versions of Erlang/OTP.
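A minimal sketch of the same kind of selection with Meeseeks, assuming it is run as a standalone script; the HTML snippet and the version requirement are made up for illustration:

```elixir
# Assumption: standalone script; the version requirement is a guess.
Mix.install([{:meeseeks, "~> 0.17"}])

import Meeseeks.CSS

# Hypothetical page fragment, standing in for a scraped response body
html = ~s(<article class="blog_post"><h1>Hello</h1></article>)

# Meeseeks.parse/1 uses the html5ever-based parser by default
document = Meeseeks.parse(html)

title =
  document
  |> Meeseeks.one(css("article.blog_post h1"))
  |> Meeseeks.text()

IO.puts(title)
```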

Floki’s mochiweb_html parser has a place, mainly in situations where you are dealing with known, well-formed HTML and you don’t need the weight of an HTML5 compliant parser (like when you’re testing your Phoenix endpoints), but people should know the risk they’re taking if they use it for web scraping.

Stop parsing each page four times.

When you run response.body |> Floki.find(...), you’re really running the equivalent of response.body |> Floki.parse() |> Floki.find(...) which means your four Floki.finds are parsing the whole document four times.

Instead, try parsed_body = Floki.parse(response.body) then parsed_body |> Floki.find(...).
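A minimal sketch of parsing once and querying the parsed tree several times. It assumes a standalone script and a recent Floki release, where, as far as I know, parsing is exposed as Floki.parse_document/1 rather than the older Floki.parse/1; the HTML string stands in for response.body and is made up:

```elixir
# Assumption: standalone script; the version requirement is a guess.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical document standing in for response.body
html =
  ~s(<html><body><article class="blog_post">) <>
    ~s(<h1>Hello</h1><p class="subheading">by Jane</p>) <>
    ~s(</article></body></html>)

# Parse the document a single time...
{:ok, document} = Floki.parse_document(html)

# ...then run every selector against the already-parsed tree
title =
  document
  |> Floki.find("article.blog_post h1:first-child")
  |> Floki.text()

author =
  document
  |> Floki.find("article.blog_post p.subheading")
  |> Floki.text()

IO.inspect({title, author})
```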

Don’t select over the whole document when you don’t need to.

Three of your selectors are: "article.blog_post h1:first-child", "article.blog_post p.subheading" and "article.blog_post". That means you’re selecting the same article.blog_post three times, then making sub-selections two of those times. Instead, try something like:
```
parsed_body = Floki.parse(response.body)
# Select the article once, then make sub-selections within it
blog_post = Floki.find(parsed_body, "article.blog_post")

title =
  blog_post
  |> Floki.find("h1:first-child")
  |> Floki.text()

author =
  blog_post
  |> Floki.find("p.subheading")
...
```
Doing that means that instead of walking the whole document each time you want to make a sub-selection, you just walk the portion you’re interested in. In this case, where there is only one of the thing you’re making sub-selections on, it’s probably not a huge difference, but in cases where you’re sub-selecting over a list of items it can add up.
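The list case can be sketched as follows: select the items once, then sub-select inside each item rather than against the whole document. This assumes a standalone script and a recent Floki with Floki.parse_document/1; the two-article HTML is made up for illustration:

```elixir
# Assumption: standalone script; the version requirement is a guess.
Mix.install([{:floki, "~> 0.36"}])

# Hypothetical listing page with two posts
html =
  ~s(<div><article class="blog_post"><h1>First</h1></article>) <>
    ~s(<article class="blog_post"><h1>Second</h1></article></div>)

{:ok, document} = Floki.parse_document(html)

titles =
  document
  |> Floki.find("article.blog_post")
  |> Enum.map(fn post ->
    # Wrap the single node in a list so find/2 walks only this article's subtree
    [post] |> Floki.find("h1") |> Floki.text()
  end)

IO.inspect(titles)
```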