Crawler - easy web crawling / scraping powered by GenStage

Hi all,

Finally, after a month of hard work (with the occasional “oops, it’s 5AM already?!”), I’m happy to announce that Crawler has reached v1.0. It hasn’t been put into production use yet though, so please help test and use it, and report any issues and feedback. :slight_smile:

Check it out here: GitHub - fredwu/crawler: A high performance web crawler / scraper in Elixir (https://github.com/fredwu/crawler).

14 Likes

I will try to use it to crawl data :grin:

1 Like

Very neat.

Do you respect (or provide the option to respect) robots.txt?

Also, can you compare Crawler to Crawlie?

3 Likes

I only just discovered Crawlie a few days ago, but from a quick glance at the implementations, I think the main differences are 1) scope, and 2) flow control.

In terms of scope, Crawler offers saving to disk for offline use, rate limiting, and a bunch of hooks for swapping in your own logic. I noticed Crawlie leverages Erlang’s pqueue library to provide priority queueing, whereas Crawler simply uses Erlang’s built-in FIFO queue.

In terms of flow control, Crawlie uses Flow, whereas Crawler uses its own GenStage implementation that offers worker pooling and rate limiting. I believe the latter offers a bit more fine-grained control, which is especially useful when crawling sources that might enforce rate limits on the server end.
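To make the flow control point concrete, here’s a rough sketch of tuning the worker pool and rate limiting via options. The option names (:workers, :interval, :timeout, :max_depths) are the ones Crawler.crawl/2 reports back in its option map; the values and URL below are just illustrative:

# Illustrative values only: interval paces fetches so you stay under
# server-side rate limits, workers sizes the GenStage worker pool.
Crawler.crawl("http://example.com", [
  workers: 5,       # number of pooled workers
  interval: 500,    # pause between fetches, for rate limiting
  timeout: 10_000,  # per-request timeout in ms
  max_depths: 3     # how many levels of links to follow
])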

These are just quick observations, so please correct me if I’m wrong.

I’d love people to try both and give both projects some love and feedback. :slight_smile:

3 Likes

Hi guys and @fredwu :smile:. I’m trying to get Crawler started, but I’m having trouble setting the filter parameter. Can you please give me a clue? Thanks.

1 Like

What sort of issues are you having?

2 Likes

I have a module:

defmodule CrawlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  def filter(url, opt \\ []) do
    String.match?(url, ~r/example\.com/)
  end
end

It’s a hardcoded filter for testing.
I call Crawler like:

Crawler.crawl("http://example.com", [save_to: "/path/to/save", url_filter: CrawlFilter])

In this case an error is raised: (MatchError) no match of right hand side value: true

I’m new to Elixir and sorry if my question is stupid.

1 Like

Damn, :grin: I need to return a tuple instead of a boolean. It’s fixed.
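For anyone else who hits this, the corrected filter looks something like the sketch below. I’m inferring the {:ok, boolean} return shape from the MatchError above, so double-check it against the Crawler.Fetcher.UrlFilter.Spec docs:

defmodule CrawlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Wrap the boolean in an {:ok, _} tuple; the caller pattern
  # matches on a tuple, hence the earlier MatchError.
  def filter(url, _opts \\ []) do
    {:ok, String.match?(url, ~r/example\.com/)}
  end
end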

1 Like

I’m glad it’s worked out for you. :slight_smile:

2 Likes

This is a very useful library. Congrats on your work! :+1:

2 Likes

Thank you, @fredwu, for creating this library and for writing clear documentation. The high-level architecture diagram was particularly useful for a beginner like me to get a good overview.

I’m planning on trying out your library for my first real Elixir project.

One thing I’m wondering about is whether it would be feasible to assign different IP addresses to each crawler via some kind of proxy or VPN service, such as Tor, TorGuard or NordVPN.

When scraping, I want to respect each site’s robots.txt. I’m also thinking about using SchedEx to trigger scraping at night-time (low-traffic hours), to be mindful of the target sites’ performance.
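For the scheduling part, here’s the rough shape I have in mind, assuming SchedEx’s run_every takes a function plus a crontab expression (worth double-checking against its docs); the URL, save path, and schedule below are placeholders:

# Placeholder URL/path; kick off a crawl at 2am every night.
SchedEx.run_every(
  fn -> Crawler.crawl("http://example.com", save_to: "/path/to/save") end,
  "0 2 * * *"
)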

1 Like

Am I using this right? I continually get “Fetch failed ‘not_fetched_yet?’, …” for each resource on a page.

Interactive Elixir (1.7.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Crawler.crawl("http://elixir-lang.org", max_depths: 2)
{:ok,
 %{
   assets: [],
   depth: 0,
   encode_uri: false,
   html_tag: "a",
   interval: 0,
   max_depths: 2,
   modifier: Crawler.Fetcher.Modifier,
   parser: Crawler.Parser,
   queue: #PID<0.256.0>,
   retrier: Crawler.Fetcher.Retrier,
   save_to: nil,
   scraper: Crawler.Scraper,
   timeout: 5000,
   url: "http://elixir-lang.org",
   url_filter: Crawler.Fetcher.UrlFilter,
   user_agent: "Crawler/1.0.0 (https://github.com/fredwu/crawler)",
   workers: 10
 }}
iex(2)> 
23:18:58.487 [debug] "Fetch failed 'not_fetched_yet?', with opts: %{assets: [], content_type: \"text/html\", depth: 1, encode_uri: false, headers: [{\"Server\", \"GitHub.com\"}, {\"Content-Type\", \"text/html; charset=utf-8\"}, {\"Last-Modified\", \"Wed, 05 Sep 2018 18:30:34 GMT\"}, {\"ETag\", \"\\\"5b9020ca-4e80\\\"\"}, {\"Access-Control-Allow-Origin\", \"*\"}, {\"Expires\", \"Fri, 07 Sep 2018 02:57:17 GMT\"}, {\"Cache-Control\", \"max-age=600\"}, {\"X-GitHub-Request-Id\", \"CBCE:36A8:66F1A1:8881F9:5B91E6B4\"}, {\"Content-Length\", \"20096\"}, {\"Accept-Ranges\", \"bytes\"}, {\"Date\", \"Fri, 07 Sep 2018 03:18:58 GMT\"}, {\"Via\", \"1.1 varnish\"}, {\"Age\", \"0\"}, {\"Connection\", \"keep-alive\"}, {\"X-Served-By\", \"cache-cmh8820-CMH\"}, {\"X-Cache\", \"MISS\"}, {\"X-Cache-Hits\", \"0\"}, {\"X-Timer\", \"S1536290339.583748,VS0,VE24\"}, {\"Vary\", \"Accept-Encoding\"}, {\"X-Fastly-Request-ID\", \"6be5015e3bd8a7dbc5292078f32e40a48f6fe0ce\"}], html_tag: \"a\", interval: 0, max_depths: 2, modifier: Crawler.Fetcher.Modifier, parser: Crawler.Parser, queue: #PID<0.256.0>, referrer_url: \"http://elixir-lang.org\", retrier: Crawler.Fetcher.Retrier, save_to: nil, scraper: Crawler.Scraper, timeout: 5000, url: \"http://elixir-lang.org\", url_filter: Crawler.Fetcher.UrlFilter, user_agent: \"Crawler/1.0.0 (https://github.com/fredwu/crawler)\", workers: 10}."

Any suggestions?
Michael

1 Like

Wow, it’s been… how many years? I have just picked up Crawler again after a looong hiatus. Just released v1.2.0 with memory usage improvements. :slight_smile:

3 Likes