Announcing Crawler v1.0.0 - easy web crawling / scraping powered by GenStage

Hi all,

Finally, after a month of hard work (with the occasional “oops it’s 5AM already?!”), I’m happy to announce that Crawler has reached v1.0. It has not been put in production use yet though, so please help test and use it, and report issues and feedback. :slight_smile:

Check it out here: https://github.com/fredwu/crawler

13 Likes

I will try to use it to crawl data :grin:

1 Like

Very neat.

Do you respect (or provide the option to respect) robots.txt?

Also, can you compare Crawler to Crawlie?

3 Likes

I only just discovered Crawlie a few days ago, but from a quick glance at the implementations, I think the main differences are 1) scope, and 2) flow control.

In terms of scope, Crawler offers save to disk / offline, rate limiting, and a bunch of hooks to swap in your own logic. I noticed Crawlie leverages Erlang’s pqueue library to provide priority queueing, whereas Crawler simply uses Erlang’s built-in FIFO queue.

In terms of flow control, Crawlie uses Flow, whereas Crawler uses its own GenStage implementation that offers worker pooling and rate limiting. I believe the latter offers a bit more fine-grained control, which is especially useful for crawling sources that might have rate limits on the server side.
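For example, the worker pooling and rate limiting are exposed as plain options on `Crawler.crawl/2`. A rough sketch using the option names from Crawler’s defaults (the exact meaning of `interval`, presumably a per-fetch delay in milliseconds, is my assumption):

```elixir
# Hedged sketch: option names come from Crawler's default options;
# the semantics of `interval` as an inter-request delay is an assumption.
Crawler.crawl("http://example.com",
  workers: 3,        # size of the worker pool
  interval: 1_000,   # pause between fetches, to stay under server-side rate limits
  max_depths: 2,     # how many link levels deep to follow
  timeout: 5_000     # per-request timeout in milliseconds
)
```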

These are just quick observations, so please correct me if I’m wrong.

I’d love people to try both and give both projects some love and feedback. :slight_smile:

3 Likes

Hi guys and @fredwu :smile: . I’m trying to get Crawler started, but I’m having trouble setting the filter parameter. Can you please give me a clue? Thanks.

1 Like

What sort of issues are you having?

2 Likes

I have a module:

defmodule CrawlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  def filter(url, opts \\ []) do
    String.match?(url, ~r/example\.com/)
  end
end

It’s a hardcoded filter for testing.
I call Crawler like this:

Crawler.crawl("http://example.com", [save_to: "/path/to/save", url_filter: CrawlFilter])

In this case an error is raised: (MatchError) no match of right hand side value: true

I’m new to Elixir and sorry if my question is stupid.

1 Like

Damn :grin: I needed to return a tuple instead of a boolean. It’s fixed now.
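For anyone else who hits this: the filter callback apparently needs to return a tagged tuple, so wrapping the boolean in `{:ok, _}` makes the earlier example work (this is my reading of the fix described above, not the library author’s reference implementation):

```elixir
defmodule CrawlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Return {:ok, boolean} rather than a bare boolean -- Crawler
  # pattern matches on the tuple, which caused the earlier MatchError.
  def filter(url, _opts \\ []) do
    {:ok, String.match?(url, ~r/example\.com/)}
  end
end
```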

1 Like

I’m glad it’s worked out for you. :slight_smile:

2 Likes

This is a very useful library. Congrats on your work! :+1:

2 Likes

Thank you, @fredwu, for creating this library and for writing clear documentation. The high-level architecture diagram was particularly useful for a beginner like me to get a good overview.

I’m planning on trying out your library for my first real Elixir project.

One thing I’m wondering about is whether it would be feasible to assign different IP addresses to each crawler via some kind of proxy or VPN service, such as Tor, TorGuard or NordVPN.

When scraping, I want to respect each site’s robots.txt. I’m also thinking about using SchedEx to trigger scraping at night-time (low-traffic hours), to be mindful of the target sites’ performance.
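Something like this is what I have in mind for the scheduling part (just a sketch: it assumes the SchedEx dependency and its `run_every` crontab-style API, and the URL and path are placeholders):

```elixir
# Hypothetical night-time crawl schedule -- untested sketch.
# Assumes SchedEx is added as a dependency and accepts a zero-arity
# function plus a crontab expression.
SchedEx.run_every(
  fn -> Crawler.crawl("http://example.com", save_to: "/tmp/crawls") end,
  "0 3 * * *"  # every day at 03:00, during low-traffic hours
)
```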

1 Like

Am I using this right? I continually get “Fetch failed ‘not_fetched_yet?’, …” for each resource on a page.

Interactive Elixir (1.7.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Crawler.crawl("http://elixir-lang.org", max_depths: 2)
{:ok,
 %{
   assets: [],
   depth: 0,
   encode_uri: false,
   html_tag: "a",
   interval: 0,
   max_depths: 2,
   modifier: Crawler.Fetcher.Modifier,
   parser: Crawler.Parser,
   queue: #PID<0.256.0>,
   retrier: Crawler.Fetcher.Retrier,
   save_to: nil,
   scraper: Crawler.Scraper,
   timeout: 5000,
   url: "http://elixir-lang.org",
   url_filter: Crawler.Fetcher.UrlFilter,
   user_agent: "Crawler/1.0.0 (https://github.com/fredwu/crawler)",
   workers: 10
 }}
iex(2)> 
23:18:58.487 [debug] "Fetch failed 'not_fetched_yet?', with opts: %{assets: [], content_type: \"text/html\", depth: 1, encode_uri: false, headers: [{\"Server\", \"GitHub.com\"}, {\"Content-Type\", \"text/html; charset=utf-8\"}, {\"Last-Modified\", \"Wed, 05 Sep 2018 18:30:34 GMT\"}, {\"ETag\", \"\\\"5b9020ca-4e80\\\"\"}, {\"Access-Control-Allow-Origin\", \"*\"}, {\"Expires\", \"Fri, 07 Sep 2018 02:57:17 GMT\"}, {\"Cache-Control\", \"max-age=600\"}, {\"X-GitHub-Request-Id\", \"CBCE:36A8:66F1A1:8881F9:5B91E6B4\"}, {\"Content-Length\", \"20096\"}, {\"Accept-Ranges\", \"bytes\"}, {\"Date\", \"Fri, 07 Sep 2018 03:18:58 GMT\"}, {\"Via\", \"1.1 varnish\"}, {\"Age\", \"0\"}, {\"Connection\", \"keep-alive\"}, {\"X-Served-By\", \"cache-cmh8820-CMH\"}, {\"X-Cache\", \"MISS\"}, {\"X-Cache-Hits\", \"0\"}, {\"X-Timer\", \"S1536290339.583748,VS0,VE24\"}, {\"Vary\", \"Accept-Encoding\"}, {\"X-Fastly-Request-ID\", \"6be5015e3bd8a7dbc5292078f32e40a48f6fe0ce\"}], html_tag: \"a\", interval: 0, max_depths: 2, modifier: Crawler.Fetcher.Modifier, parser: Crawler.Parser, queue: #PID<0.256.0>, referrer_url: \"http://elixir-lang.org\", retrier: Crawler.Fetcher.Retrier, save_to: nil, scraper: Crawler.Scraper, timeout: 5000, url: \"http://elixir-lang.org\", url_filter: Crawler.Fetcher.UrlFilter, user_agent: \"Crawler/1.0.0 (https://github.com/fredwu/crawler)\", workers: 10}."

Any suggestions?
Michael

1 Like