Web scraping tools

Thanks for your comment.

I would not say that web scraping is simple. Everything depends on the scale. For very small sites you can probably get away with curl or wget. But I am talking about cases when you need to scrape millions of pages, and at that scale things are way more complex, as you have to solve concurrency and resource management problems. Also, finding a good strategy for crawling a million pages is a challenge in itself (consider, for example, cases where URLs are generated dynamically)!

To answer your comments:

  1. Server-Side Rendered page
    easy to scrape with curl or any HTTP Client (+ HTML Parser)

This will not scale. You will also have to avoid visiting pages twice and filter out duplicates.
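
As a tiny illustration of the de-duplication part, here is a sketch using only the standard library; the VisitedUrls module name is made up, and at millions of pages you would swap the in-memory set for something persistent or probabilistic (e.g. a Bloom filter):

defmodule VisitedUrls do
  # Tracks already-seen URLs in a MapSet held by an Agent.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> MapSet.new() end, name: __MODULE__)
  end

  # Returns true only the first time a URL is seen, so callers can
  # skip scheduling requests for URLs they have already visited.
  def new?(url) do
    Agent.get_and_update(__MODULE__, fn seen ->
      if MapSet.member?(seen, url) do
        {false, seen}
      else
        {true, MapSet.put(seen, url)}
      end
    end)
  end
end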

  2. Javascript Rendered page
    A. You can use Headless / Browser Automation, but it will be slow (+ HTML Parser)

In most cases, you can find out how a web page (e.g. a product page) fetches its data from an internal API, so you usually don't need Selenium.
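
For example, once you have spotted the JSON endpoint in the browser's network tab, a plain HTTP client is enough. A minimal sketch, assuming HTTPoison and Jason as dependencies (the endpoint URL is a placeholder):

defmodule ProductApi do
  # Calls the JSON endpoint that the product page itself fetches,
  # instead of rendering the page in a headless browser.
  def fetch_product(id) do
    url = "https://example.com/api/v1/products/#{id}"

    case HTTPoison.get(url, [{"accept", "application/json"}]) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        {:ok, Jason.decode!(body)}

      {:ok, %HTTPoison.Response{status_code: status}} ->
        {:error, {:unexpected_status, status}}

      {:error, error} ->
        {:error, error}
    end
  end
end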

B. Do a “little bit” of Reverse Engineering on their Web API (FASTER)

Unfortunately, this often does not work. Most websites do not have a public API, and those that do would not provide full and up-to-date data. Even worse, some of the APIs are just horrible and can't be used for data extraction.

Important Points:

  • Make sure your scraper supports Proxy Usage

It does
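
At the HTTP-client level (assuming an HTTPoison-based fetcher) it boils down to passing the :proxy option through to hackney; host, port and credentials below are placeholders:

# Plain proxy:
HTTPoison.get("https://example.com/some/page", [], proxy: {"proxy.example.com", 8080})

# Authenticated proxy:
HTTPoison.get("https://example.com/some/page", [],
  proxy: {"proxy.example.com", 8080},
  proxy_auth: {"user", "password"}
)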

  • If your target site has anti-scraper / crawler / bot protection (e.g. it notices your bot following pagination, 1->2->3 and so on) and it blocks your IP, you can use an IP rotation service like geosurf.com or luminati.io

You're right. But please take into account that proxy management is a complex, stand-alone task. There are some quite advanced systems that allow overcoming bans with proxies, and I was developing one of them in the past.
Also, nowadays it's sometimes not enough just to perform the request through another proxy, as the most advanced anti-bot systems also perform 3-4 levels of request fingerprint analysis. In this regard, I would suggest looking at Crawlera.
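
To illustrate just the rotation part (ban detection and fingerprinting are the genuinely hard bits, and this does not touch them), a naive round-robin proxy pool might look like this; the module name and proxy list are made up:

defmodule ProxyPool do
  # Naive round-robin over a fixed proxy list. Real systems also track
  # bans, response codes and per-proxy request rates.
  use Agent

  @proxies [
    {"proxy1.example.com", 8080},
    {"proxy2.example.com", 8080},
    {"proxy3.example.com", 8080}
  ]

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> 0 end, name: __MODULE__)
  end

  # Returns the next proxy, wrapping around at the end of the list.
  def next do
    index = Agent.get_and_update(__MODULE__, fn i -> {i, i + 1} end)
    Enum.at(@proxies, rem(index, length(@proxies)))
  end
end

# Usage: HTTPoison.get(url, headers, proxy: ProxyPool.next())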

  • In some countries / on some sites, web scraping is prohibited

Well… is it correct to assume that the internet is also prohibited in those countries? Please take into account that no search engine can work without web scraping, and I can't imagine the web without search these days. (But that's just an opinion.)

4 Likes

Your scraper application depends on you (the logic, structure, and so on). I just wanted to describe the possible ways to do scraping,

so you can choose the one that fits your target site.

Web scraping is simple

I'd say this is a very generalized statement that should be cleared up for future readers bumping into your post. Not to be disrespectful, but saying something like that without some real meat behind it could get a new person into trouble.

But trust me - if you find yourself scraping 5-10 million jobs a day, it quickly becomes “not simple”. The premise of crawling/scraping is not complex, but I can assure you that sustaining it for 7-8 years on end and returning data in a timely fashion to paying customers is not easy at all.

Some more tips from my view, having done this for so long (without a single legal issue):

  • Try to form a personal relationship with an IP provider - yes you can use the publicly available providers you mention above and there are tons, but none of those will scale to the numbers I needed to hit in any reasonably economic way. Easier said than done, but ask around - exhaust friends in SEO and marketing fields.
  • Start slow. If you only have like 100 IPs to work with, don’t touch a target more than 1 time per hour with the same IP to start unless you plan to treat it like a “smash and grab”.
  • Link your IPs to user agents somehow. Meaning if you pull IP #1, go off to hit a site, and randomly grab a UA string to roll with it - make sure the next time you show up with that IP, you come with the same UA string (see the sketch after this list).
  • Have some controls for measuring data quality over time. This may need to be manual in your case. You’d be amazed at how many big sites now start throwing you trash data that looks correct at a glance.
  • When you think you have a crawl script ironed out - be sure you toss it at a tool like the EFF Panopticlick (and others) to look for obvious fingerprinting you may have not plugged. https://panopticlick.eff.org/
  • Make sure, if using a headless browser, you're plugging all the massive, truck-sized holes they all expose… Things like mocking the navigator.platform (and it better match the UA string) are major gaps I see all the time.
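
A minimal sketch of the IP-to-user-agent pinning mentioned above, using a deterministic hash so the same IP always gets the same UA string (module name and UA list are illustrative):

defmodule UserAgents do
  # Pins each proxy IP to one user-agent string by hashing the IP,
  # so repeat visits from that IP always present the same UA.
  @user_agents [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
  ]

  def for_ip(ip) do
    index = :erlang.phash2(ip, length(@user_agents))
    Enum.at(@user_agents, index)
  end
end

# headers = [{"user-agent", UserAgents.for_ip("203.0.113.7")}]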
13 Likes

https://scrapy.org/

They also have Splash, which is their own HTML/JavaScript renderer.

I think it’s straight up professional and industrial.

I've seen Floki and I've used BeautifulSoup. Scrapy is straight up the tool you want if you want to scrape the web - you can scrape 90% or more of the websites out there with it.

I did web scraping professionally for one startup stint, and I've been doing it on the side for side projects. I also got paid to scrape LinkedIn and it didn't go anywhere; LinkedIn is pretty dang hard unless you're willing to put a lot of man-hours into it.

I am willing to put my reputation on that tool. It's really, really good. It's Python, though.

2 Likes

Extremely valuable comment, thank you for it!

Can you clarify on that? I’ve heard sites like Amazon deliberately give you wrong prices if they detect a bot, is that true for them and others? Or what kind of trash data?

Additionally, how do you even test your bot against tools like Panopticlick at all? Have your bot GET their root page and click the “Go” button? Is that what you meant, or do they (or others) have a dedicated bot testing toolkit?

1 Like

Can you clarify on that? I’ve heard sites like Amazon deliberately give you wrong prices if they detect a bot, is that true for them and others? Or what kind of trash data?

Yes. I don't want to speak for specific sites - but I can tell you many 'popular' sites will start delivering results that are not ordered correctly (think in terms of SERP rank data, where order matters to buyers), e-commerce sites will start throwing bogus prices, and so on. The only real way to protect against and test for this is to do manual A/B comparisons. The other thing to consider is that we're fully immersed in a world of personalized content - so even when you're not being thrown bogus data, your clients may think your results are wrong, because when they manually compare they are viewing personalized content.

Additionally, how do you even test your bot against tools like Panopticlick at all? Have your bot GET their root page and click the “Go” button? Is that what you meant, or do they (or others) have a dedicated bot testing toolkit?

I should have been more specific on that. I mostly reused the same boilerplate headless crawl scripts and would include the site-specific nav/logic separately. That boilerplate held all of my 'pre-crawl' setup to plug gaps, like setting a legit navigator.platform and making sure navigator.webdriver returns false, etc. (there are quite a few of these you need to cover).

Anyhow, I would traditionally include that initial setup logic and then have a custom script that would navigate the Panopticlick site, run the full checks, screenshot the results, and study them later - just to make sure I wasn't missing something obvious. So yes, click the Go button, load the full results, and screenshot the page.
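
A rough Elixir sketch of that kind of check, assuming Wallaby with a configured headless Chrome driver; the button selector is a placeholder you would look up on the page:

defmodule FingerprintCheck do
  import Wallaby.Browser
  alias Wallaby.Query

  def run do
    {:ok, session} = Wallaby.start_session()

    session =
      session
      |> visit("https://panopticlick.eff.org/")
      # Placeholder selector - inspect the page for the real "test me" button.
      |> click(Query.css("#start-test-button"))

    # Crude wait for the checks to finish; a real script would poll
    # for a results element instead of sleeping.
    Process.sleep(30_000)

    take_screenshot(session)
    Wallaby.end_session(session)
  end
end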

Aside from that there are many other “how private is my browser” checks out there that test for hardware-level info that may be worth mocking on some targets.

This reminds me of the fun I had with geolocation/gps coords - again, depending on what you are going for. Most common case for that effort was a well-known map site. Some hints for geolocation - and this may be outdated, but it was always important to mock both navigator.geolocation.getCurrentPosition as well as navigator.geolocation.watchPosition to spoof lat/long. A little bit of noise in the coords.accuracy attribute went a long way here :+1:

1 Like

Geolocated results are pretty interesting too. I had a long battle with that in a previous job. You can try to use IPs from the destination country, but that is not always easy. I remember getting interesting results when the IP was geolocated on a country border.

Finally, we made it work by mocking geolocation properties as you said. It was especially useful for geolocated results in regions of the same country.

I think the scraping concept is easy to understand, but there are a ton of variables to keep in mind, depending on the stuff you are scraping. Some of them make the process really hard to implement.

1 Like

Hey people. Just wanted to announce that I have written an article on the Erlang Solutions blog about using Crawly: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html

Also I have made a couple of sample/tutorial projects to get started with: https://github.com/oltarasenko/crawly-spider-example
https://github.com/oltarasenko/products-advisor
https://github.com/oltarasenko/crawly-cars

Hopefully they will be interesting for people doing web scraping.

3 Likes

@oltarasenko A couple of bits of feedback on common mistakes I see people make when using Floki.

  1. Recommend a safer HTML parser than Floki+mochiweb_html.

    I know it's nice to not have to start an Elixir article with “and then install Rust”, but web scraping is exactly the situation where you do want an HTML5-compliant parser, because you don't know how well-formed the HTML will be, and mochiweb_html (Floki's default parser) can incorrectly parse parts of the HTML if it's malformed (and potentially just drop those parts silently).

    At the very least, use the html5ever parser with Floki (see the config sketch after this list), though you may have trouble getting it to compile, since html5ever_elixir hasn't been updated in nine months despite an outstanding need to upgrade Rustler so that it works with more recent versions of Erlang/OTP.

    Better yet, use Meeseeks instead of Floki because it will by default provide you an HTML5 compliant parser based on html5ever that does compile on the latest versions of Erlang/OTP.

    Floki’s mochiweb_html parser has a place, mainly in situations where you are dealing with known, well-formed HTML and you don’t need the weight of an HTML5 compliant parser (like when you’re testing your Phoenix endpoints), but people should know the risk they’re taking if they use it for web scraping.

  2. Stop parsing each page four times.

    When you run response.body |> Floki.find(...), you’re really running the equivalent of response.body |> Floki.parse() |> Floki.find(...) which means your four Floki.finds are parsing the whole document four times.

    Instead, try parsed_body = Floki.parse(response.body) then parsed_body |> Floki.find(...).

  3. Don’t select over the whole document when you don’t need to.

    Three of your selectors are: "article.blog_post h1:first-child", "article.blog_post p.subheading" and "article.blog_post". That means you’re selecting the same article.blog_post three times, then making sub-selections two of those times. Instead, try something like:

    parsed_body = Floki.parse(response.body)
    blog_post = Floki.find(parsed_body, "article.blog_post")
    
    title =
      blog_post
      |> Floki.find("h1:first_child")
      |> Floki.text
    
    author = 
      blog_post
      |> Floki.find("p.subheading")
    ...
    

    Doing that means that instead of walking the whole document each time you want to make a sub-selection, you just walk the portion you're interested in. In this case, when there is only one of the thing you're making sub-selections on, it's probably not a huge difference, but in cases where you're sub-selecting over a list of items it can add up.
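
Going back to point 1, switching Floki to the html5ever parser is a small change. A sketch of the setup, assuming recent package versions (adjust the version requirements to whatever resolves for you):

# mix.exs - add html5ever alongside floki (version requirements are illustrative)
defp deps do
  [
    {:floki, "~> 0.23"},
    {:html5ever, "~> 0.7"}
  ]
end

# config/config.exs - point Floki at the html5ever-backed parser
config :floki, :html_parser, Floki.HTMLParser.Html5ever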

6 Likes

Here is a version of your crawler that uses Meeseeks and fixes all of the above problems.

defmodule Esl do
  @behaviour Crawly.Spider

  import Meeseeks.CSS

  @impl Crawly.Spider
  def base_url() do
    "https://www.erlang-solutions.com"
  end

  @impl Crawly.Spider
  def init() do
    [
      start_urls: ["https://www.erlang-solutions.com/blog.html"]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse the response body as html, though Meeseeks comes with an
    # XML parser too if you want to parse the blog posts from the 
    # RSS feed instead
    parsed_body = Meeseeks.parse(response.body, :html)

    # Get new urls to follow
    urls =
      parsed_body
      |> Meeseeks.all(css("a.more"))
      |> Enum.map(&Meeseeks.attr(&1, "href"))

    # Convert urls into requests
    requests =
      Enum.map(urls, fn url ->
        url
        |> build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    # Extract item from a page, e.g.
    # https://www.erlang-solutions.com/blog/introducing-telemetry.html
    # 
    # Find the post using `Meeseeks.one`, which will stop after
    # the first match rather than looking through the rest of the
    # document for something it will never find more of.
    post = Meeseeks.one(parsed_body, css("article.blog_post"))

    title =
      post
      |> Meeseeks.one(css("h1:first-child"))
      |> Meeseeks.own_text()

    # `Meeseeks.own_text` trims automatically and only gets the text
    # from the selected element.
    author =
      post
      |> Meeseeks.one(css("p.subheading"))
      |> Meeseeks.own_text()

    # `Meeseeks.text` will get the combined text from the element and 
    # all of its descendants
    text = Meeseeks.text(post)

    %Crawly.ParsedItem{
      :requests => requests,
      :items => [
        %{
          title: title, 
          author: author, 
          text: text, 
          url: response.request_url
        }
      ]
    }
  end

  def build_absolute_url(url, request_url) do
    URI.merge(request_url, url) |> to_string()
  end
end
3 Likes

Hey @mischov!

Thanks for the feedback. Actually, I was aware that parsing would happen 3 times, but decided not to do anything about it for now. At this point I think it's better to update the examples so they are not misleading! Thanks for the hint.

I don't have a problem with Rust. Parsing should be done by low-level, fast languages! Basically, Scrapy uses C-based parsers (lxml) under the hood to parse pages. We should do the same.

Q: I am currently looking for an XPath library to use in Crawly, but at this point nothing works properly when it comes to HTML pages. Are you aware of any options for trying XPath selectors?

Again, huge thanks for the comments!

1 Like

Ok. Looks like Meeseeks also has XPath support. I will check that.
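
For reference, a quick sketch of what the XPath version of the title extraction might look like (the selector is illustrative and assumes an exact class match; parsed_body is the result of Meeseeks.parse/2 as in the example above):

import Meeseeks.XPath

title =
  parsed_body
  |> Meeseeks.one(xpath("//article[@class='blog_post']/h1"))
  |> Meeseeks.own_text()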

2 Likes

I want to be clear that just because it’s Rust doesn’t mean it’ll be faster than the mochiweb_html parser all the time. My observation is that it can be faster, particularly for large input, but the mochiweb parser does a less thorough job and can be faster on small input, particularly when you take into account NIF overhead. That said, using Rust does seem to have a very positive impact memory-wise in most cases.

The really big reason I advocate Rust is because it lets us use html5ever, which means we're getting an HTML5-compliant parser that will give us very similar results to a browser when parsing HTML, and that means it generally handles malformed content in a more desirable fashion than mochiweb_html.

I will try to get a benchmark together this evening based on the code in your post.

3 Likes

@oltarasenko Took me a little longer than planned, but I put together the promised benchmark, which compares the HTML parsing and extraction code from your post against a version using the optimizations I suggested and a Meeseeks translation of the optimized version.

Name                        ips        average  deviation         median         99th %
Meeseeks                 225.19        4.44 ms     ±3.55%        4.39 ms        5.02 ms
Floki optimized          166.03        6.02 ms     ±7.36%        5.96 ms        7.82 ms
Floki unoptimized         55.46       18.03 ms     ±6.26%       17.77 ms       22.34 ms

Comparison:
Meeseeks                 225.19
Floki optimized          166.03 - 1.36x slower +1.58 ms
Floki unoptimized         55.46 - 4.06x slower +13.59 ms

Memory usage statistics:

Name                 Memory usage
Meeseeks                  0.42 MB
Floki optimized           6.25 MB - 14.98x memory usage +5.84 MB
Floki unoptimized        20.90 MB - 50.04x memory usage +20.48 MB

You can see the source along with some notes in the benchmark repo.

5 Likes

Hey @mischov. Thanks for sharing those.

In the meantime, I have tried using Meeseeks for XPath-related extractors. It looks quite great. Thanks for the work you're doing!

1 Like

Holy cow, only 420kb memory usage. Quite amazing.

2 Likes

For web scraping, you can try this tool https://app.scrape.works/create-project to explore more scraping information.

1 Like

I wonder if it's built with Elixir? I could not find any information regarding how the framework works.

By the way, here I have written another article about using Crawly with TensorFlow, which shows how to organize a machine learning project: https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html

Hey people,

I want to announce that I have released a new version of Crawly. Version 0.8.0 contains a few important features:

  • retries support
  • browser rendering (for JavaScript-based websites).

Hopefully you will find it useful for your needs!

4 Likes