Web scraping tools

I want to try my hand at web scraping. What tools/libraries do I need to use? I’m hoping to turn this into something professional, so don’t hold back. Thanks.

3 Likes

I think the choice depends on whether you need to evaluate JS or not.

If you do, then in Elixir there is Hound [0]. I haven’t used it, though, since I usually use Selenium with Python. Actually, I usually try to avoid evaluating JS entirely by figuring out what APIs the page calls and calling them myself.

If you don’t, then an HTML parser like Floki [1] will do. You might be interested in reading the source code of magnetissimo [2] to see how a scraper works with it.

[0] https://github.com/HashNuke/hound
[1] https://github.com/philss/floki
[2] https://github.com/sergiotapia/magnetissimo
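To give a taste of the Floki route, here is a minimal sketch: parse a document, select nodes, and pull out text and attribute values. The HTML and the selectors are made up for illustration; Floki’s version is an assumption, check hex.pm.

```elixir
# Minimal Floki usage sketch (illustrative HTML and selectors).
Mix.install([{:floki, "~> 0.36"}])

html = """
<ul class="results">
  <li><a href="/item/1">Item One</a></li>
  <li><a href="/item/2">Item Two</a></li>
</ul>
"""

{:ok, doc} = Floki.parse_document(html)

# Collect the link texts and their href attributes.
titles = doc |> Floki.find("ul.results li a") |> Enum.map(&Floki.text/1)
hrefs = doc |> Floki.find("ul.results li a") |> Floki.attribute("href")

IO.inspect(titles) # ["Item One", "Item Two"]
IO.inspect(hrefs)  # ["/item/1", "/item/2"]
```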

4 Likes

You can try to combine httpoison, floki and poolboy.

There is a nice article here http://www.akitaonrails.com/2015/11/18/ex-manga-downloader-an-exercise-with-elixir.
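A rough sketch of how the HTTPoison + Floki part of that combination fits together (the poolboy pooling is omitted for brevity; in a real scraper `scrape/1` would run inside a pool worker so concurrent requests stay bounded; versions, URL, and selector are assumptions):

```elixir
Mix.install([{:httpoison, "~> 2.2"}, {:floki, "~> 0.36"}])

defmodule PageScraper do
  # Pure extraction step: parse the body and collect all link hrefs.
  def extract_links(body) do
    {:ok, doc} = Floki.parse_document(body)
    doc |> Floki.find("a") |> Floki.attribute("href")
  end

  # Fetch + extract. With poolboy, this is the function you would
  # check out a worker to run.
  def scrape(url) do
    case HTTPoison.get(url, [], follow_redirect: true) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        {:ok, extract_links(body)}

      {:ok, %HTTPoison.Response{status_code: status}} ->
        {:error, {:http_status, status}}

      {:error, %HTTPoison.Error{reason: reason}} ->
        {:error, reason}
    end
  end
end
```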

2 Likes

There’s also https://github.com/mischov/meeseeks if you don’t want to use Floki for whatever reason

2 Likes

@_russellb: My list:

  1. Floki - extracting data from HTML pages
  2. HTTPoison - fetching files (including HTML pages)
  3. Poison - decoding JSON, for example when you have JSON in an element attribute value
  4. NimbleCSV - some pages let you fetch data as CSV files, for example exporting search results to CSV
  5. ExVCR - a response-recording library for HTTPoison - useful in TDD (Test-Driven Development)
  6. Retry - just to retry :slight_smile:
  7. Hound - fallback for HTTPoison and Floki on pages using JavaScript frameworks/libraries
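For point 3, here is a small sketch of pulling JSON out of an attribute with Floki and decoding it with Poison. The markup is invented: some JS-heavy pages embed their initial state as JSON in a data attribute like this.

```elixir
Mix.install([{:floki, "~> 0.36"}, {:poison, "~> 5.0"}])

# Hypothetical page that embeds its state as JSON in an attribute.
html = ~s(<div id="app" data-state='{"page":1,"results":["a","b"]}'></div>)

{:ok, doc} = Floki.parse_document(html)

# Floki extracts the attribute value; Poison decodes it.
[json] = doc |> Floki.find("#app") |> Floki.attribute("data-state")
{:ok, state} = Poison.decode(json)

IO.inspect(state["results"]) # ["a", "b"]
```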

There is also Meeseeks (a really good replacement for Floki), but it requires Rust, which in my case (Funtoo Linux) I need to compile from source, and that takes a long time on my old laptop.

10 Likes

I’ve done a lot of scraping work on the Node side. PhantomJS and Node’s NightmareJS are potent tools.

This thread introduced me to Hound. It can pair with PhantomJS and looks well done. I’d start there. You’re going to run into JS-rendered HTML, and it’s rarely beneficial to deconstruct the site’s API.

3 Likes

This is a good list.

It is worth noting that while Floki defaults to using :mochiweb_html (an Erlang library) to parse HTML, it gives you the option to use the Rust library html5ever instead and I strongly suggest using that unless you have a compelling reason not to. :mochiweb_html is not HTML5 spec compliant and malformed HTML that renders correctly in your browser might not be parsed correctly by :mochiweb_html.

The Rust compilation time is annoying, but I think compilation should only occur when you update your Meeseeks or (for Floki) html5ever_elixir dependency or need to fetch dependencies for the first time, so it shouldn’t be a problem most of the time.
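For reference, switching Floki over to html5ever is a one-line config change once the dependency is in place (the version numbers here are assumptions; check hex.pm for current ones):

```elixir
# mix.exs deps — add html5ever alongside floki
{:floki, "~> 0.36"},
{:html5ever, "~> 0.15"}

# config/config.exs — tell Floki which parser to use
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```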

I have a small benchmark comparing Meeseeks and Floki performance if such things might matter.

4 Likes

@mischov: 100% agree, but I’m using this not for HTML5 pages, but for old and really badly written ASP.NET pages…

3 Likes

@Eiji If it works for your case, good.

I just want people to be aware that, while :mochiweb_html isn’t bad and will work well in many cases including some malformed ones, it won’t always behave like a browser when parsing HTML.

For instance, note the differences when parsing this very malformed HTML.

iex> html = "<p =a>One<a <p>Something</p>Else"
iex> :mochiweb_html.parse(html)
{"p", [{"=", "="}, {"a", "a"}], ["One", {"a", [{"<p", "<p"}], ["Something"]}]}

iex> Html5ever.parse(html)
{:ok,
 [{"html", [],
   [{"head", [], []},
    {"body", [],
     [{"p", [{"=a", ""}], ["One", {"a", [{"<p", ""}], ["Something"]}]},
      {"a", [{"<p", ""}], ["Else"]}]}]}]}

That’s a pretty extreme example, but it shows the kind of differences that are possible.

4 Likes

I just saw https://github.com/nietaki/crawlie on github. It seems quite neat, maybe check it out? =)

1 Like

You might want to check out excrawl (https://github.com/mlankenau/excrawl). It is a slim DSL on top of Floki that produces a map/list structure from HTML, as defined by the DSL.

1 Like

Does floki with either html5ever or mochiweb_html normalize the HTML?

I’ve used nokogiri on the Ruby side and one of my problems with it is that it normalizes the HTML.

@veverkap
I am not exactly sure what you mean by normalizing the HTML, but both html5ever and mochiweb_html attempt to coerce malformed HTML into valid HTML, which means the result of parsing might not be exactly the HTML you passed in (though generally in a good way).

If that doesn’t answer your question, could you give a more concrete example?
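To make that concrete, here is the kind of “correction” you see when round-tripping malformed HTML through Floki with its default :mochiweb_html parser (version is an assumption):

```elixir
Mix.install([{:floki, "~> 0.36"}])

# An unclosed <p> comes back closed after a parse/serialize round trip,
# so the output is valid HTML but not byte-identical to the input.
{:ok, doc} = Floki.parse_document("<p>One")
IO.puts(Floki.raw_html(doc))
# => <p>One</p>
```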

1 Like

That’s exactly what I mean.

We have a tool that we use to scrape HTML pages. When we tried to use Mechanize (which uses Nokogiri under the hood), it would correct the HTML, which sometimes gave different results.

You answered my question, thanks!

I also started a little scraping framework modeled on Python’s Scrapy. It’s at https://github.com/sntran/scrapex

It uses HTTPoison for making HTTP requests and Floki for parsing HTML. I tried not to bring a whole headless browser in, so it does not handle SPAs.

2 Likes

@_russellb - I’ve been driving NightmareJS via ports (using Porcelain) and scraping at a pretty large scale with Elixir since last summer. There are more details in some of my previous threads if you search.

The only downside with Nightmare, or any JS-capable solution, is the CPU/memory it needs. If you only need plain HTML, HTTPoison (or HTTPotion, raw hackney, etc.) works fine in my experience. Unfortunately, I don’t have that luxury, since I’m dealing with single-page-app, JS-required systems.

As for the heavy resource load of headless browsing: I have a distributed Elixir/Erlang setup with dedicated scraping nodes that I can bring up dynamically and dispatch work to from a master node. Everything runs headless on Linux boxes (some in our own datacenter and some on AWS spot instances).

For parsing the results, I use Floki like most people do. Speed is never an issue here for me so I’ve not bothered with the other parsers out there.

FWIW, we came from a lot of legacy Ruby/Mechanize code. The Elixir stuff is much, much easier to maintain and far more stable. Nightmare sometimes acts up and doesn’t release resources, but I handle that with Elixir-based cleanup tasks that run every few minutes and kill zombie processes that get missed. Porcelain helps a lot with the port logic, but it isn’t 100%. I also keep all crawling off my main logic node, so I can afford to lose a crawl and just bounce a node without sweating it. I’d recommend something similar if you need large-scale throughput: if I don’t get a reply back from a crawling node within five minutes, I assume the job was lost and send it to another node.
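That “assume lost after five minutes” pattern can be sketched locally with Task (the real setup dispatches to remote nodes over distributed Erlang; the crawl job here is a stub function, and the names are made up):

```elixir
defmodule Dispatcher do
  # Run a crawl job; if it doesn't reply within the timeout, kill it
  # and report :timeout so the caller can re-dispatch to another node.
  def crawl_with_timeout(fun, timeout_ms) do
    task = Task.async(fun)

    case Task.yield(task, timeout_ms) || Task.shutdown(task) do
      {:ok, result} -> {:ok, result}
      _ -> {:error, :timeout}
    end
  end
end

# Fast job completes; slow job is treated as lost.
{:ok, :page} = Dispatcher.crawl_with_timeout(fn -> :page end, 1_000)

{:error, :timeout} =
  Dispatcher.crawl_with_timeout(fn -> Process.sleep(500) && :page end, 50)
```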

Good luck!

5 Likes

I’ve been using Scrapy for a while and it is a nice, mature framework. I write XPath selectors to extract data from HTML and use Scrapy’s built-in system for following links and building pipelines. There are some things that could be improved, though. You need to write your own spider management process, and when you have many spiders in a project this gets complex. Also, these days the same crawled URL can contain frequently updated data. Scrapy is designed to work through a batch of URLs and store the data somewhere, but the web works differently now: I need to know when a field on a site has been updated. It seems like things like these could be managed very well in Elixir. Using XPath selectors (or the Rust implementation mentioned above) is easy; some HTML-diff logic on top of that, to listen for updates, would be nice to have instead of how Scrapy functions.

Seems like a nice start, taking some of Scrapy’s strengths and implementing them in Elixir.

I wanted to add this tiny library to our discussion list: https://github.com/oltarasenko/crawly. You might find it useful for the web scraping case.

I think it’s quite well documented and could be used as a good starting point. The relevant tutorial is here:

Crawly Tutorial

2 Likes

Web scraping is simple. IMHO there are two types of sites:

  1. Server-side rendered pages
    easy to scrape with curl or any HTTP client (+ HTML parser)
  2. JavaScript-rendered pages
    A. You can use a headless browser / browser automation, but it will be slow (+ HTML parser)
    B. Do a “little bit” of reverse engineering on their Web API (FASTER)

Important points:

  • Make sure your scraper supports proxy usage
  • If your target site has anti-scraper/crawler/bot measures (e.g. it notices your bot following pagination 1->2->3 and so on) and blocks your IP, you can use an IP rotation service like geosurf.com or luminati.io
  • In some countries and on some sites, web scraping is prohibited

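A sketch of option 2B combined with the proxy point: call the JSON endpoint behind the page directly with HTTPoison. The endpoint, proxy host, and versions are all made up for illustration; the proxy tuple is passed through to hackney.

```elixir
Mix.install([{:httpoison, "~> 2.2"}, {:poison, "~> 5.0"}])

defmodule ApiClient do
  # Fetch one page of results from the (hypothetical) JSON API that the
  # site's own JavaScript calls, optionally through a proxy.
  def fetch_page(url, page, proxy \\ nil) do
    # The :proxy option is forwarded to hackney; drop it if unused.
    opts = if proxy, do: [proxy: proxy], else: []

    with {:ok, %HTTPoison.Response{status_code: 200, body: body}} <-
           HTTPoison.get("#{url}?page=#{page}", [{"accept", "application/json"}], opts),
         {:ok, data} <- Poison.decode(body) do
      {:ok, data}
    end
  end
end

# Usage (not run here; the endpoint is fictional):
# ApiClient.fetch_page("https://example.com/api/search", 1, {"proxy.example.com", 8080})
```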
2 Likes