Crawly - A high-level web crawling & scraping framework for Elixir

Dear Elixir community,

After a year of development, bug fixes, and improvements, we are proud to share the release of Crawly 0.10.0 with you.

Check out the source code here: https://github.com/elixir-crawly/crawly
Check out our experimental, pre-alpha visual UI here: http://18.216.221.122/ and play with job scheduling there.

We have dedicated a lot of time and knowledge to building a fast and feature-rich web scraping framework. To be absolutely honest, I have to say that I took a lot of ideas from another popular web scraping framework (Scrapy, Python), as I have previously worked with the Scrapy core team.

We have some reported production usages of Crawly (some of them with really long-running crawls); however, we are still working towards a stable version (hopefully it will happen within a couple of releases).

To describe how Crawly differs from other known Elixir scraping frameworks, I will list the features which I believe make it stand out (a minimal spider example follows the list):

  1. Documentation - we have spent an enormous amount of time and effort building great, clear, versioned documentation!
  2. Rate limiting
  3. Robots.txt support
  4. Request and item validators
  5. Automatic duplicate filtering
  6. Automatic cookie management (allows you to get past login pages and cookie-based regional filtering)
  7. Browser rendering (with the help of Splash)
  8. Retry support
  9. Proxy support
  10. HTTP API
  11. Visual job management dashboard which allows operating multiple Crawly nodes at the same time (experimental): see it deployed on a demo EC2 micro instance: http://18.216.221.122/
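
To give a feel for the API, here is a minimal spider sketch in the spirit of our getting-started guide. The site, CSS selectors, and item fields below are purely illustrative placeholders; a real spider will look different:

```elixir
defmodule BlogSpider do
  use Crawly.Spider

  # Requests outside this domain are dropped by the domain filter.
  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  # Entry points for the crawl.
  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/blog"]]

  # Called for every fetched page: return extracted items and follow-up requests.
  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    requests =
      document
      |> Floki.find("a.next-page")
      |> Floki.attribute("href")
      |> Crawly.Utils.build_absolute_urls(base_url())
      |> Crawly.Utils.requests_from_urls()

    items = [
      %{
        title: document |> Floki.find("h1") |> Floki.text(),
        url: response.request_url
      }
    ]

    %{items: items, requests: requests}
  end
end
```

A crawl is then started with `Crawly.Engine.start_spider(BlogSpider)`; rate limiting, deduplication, retries and item validation are handled by the configured middlewares and pipelines.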

We hope it will be useful for you!

If you have a suggestion or a production use case you’re happy for us to share, please get in touch.

58 Likes

Congrats and thank you for launching this for the community.

1 Like

Hi,

Is Crawly the right tool to create something like Inoreader?

1 Like

If it’s about extracting data from multiple RSS feeds - yes.

I can also refer you to my talk from November 2019, which explains how scraping can be used in general: https://www.youtube.com/watch?v=ovSQGlkakAQ

6 Likes

Congrats on the release.

I still think you should at the very least be suggesting that people use Floki's html5ever parser. Scraping is exactly the context in which the mochiweb_html parser not being HTML5-compliant is most problematic.
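
For reference, switching Floki over is roughly one config line plus an extra dependency; something along these lines (the dependency version requirement is just a guess on my part):

```elixir
# config/config.exs
# Assumes {:html5ever, "~> 0.8"} has been added to deps in mix.exs
# (the version requirement is illustrative).
import Config

config :floki, :html_parser, Floki.HTMLParser.Html5ever
```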

1 Like

The use of Floki in the docs is only for demonstration purposes, since Crawly does not dictate how your data is parsed. It's faster to get up and running without having to install Rust, after all.

1 Like

Faster, yes, but in my experience people tend towards doing what they see in examples, particularly when the drawbacks of that approach aren’t discussed.

2 Likes

We’re trying to show different examples. E.g. in one of the most recent articles I was using the Rust-based library Meeseeks.

But indeed, I tend to use Floki in examples, as it’s easier to start this way.

Regarding the comment about drawbacks - I would say the HTML parser is definitely a thing to discuss and highlight. But what we’re also trying to do is bring the scraping experience to the level of a complete process: one where you can build a spider, schedule a job, and validate (QA) the results visually.

1 Like

Good choice. :slight_smile:

Discussing and highlighting parsers is all I was suggesting. Most examples, articles, etc, in the Elixir ecosystem don’t, and so I don’t think people get a chance to learn about potential pitfalls and trade-offs.

It’s great to see the UI and also that you’ve been working hard to be a good citizen by limiting impact and respecting robots.txt. Keep up the good work.

2 Likes

That’s not that slow or complex either. Going to https://rustup.rs and installing the Rust toolchain has honestly been the most painless tech installation experience I have ever had, even on Windows (which can’t be said for the C/C++ toolchain; it’s a nightmare to get several Ruby and Elixir projects to compile their native dependencies without at least one library refusing to compile).

I wish every other language adopted such a model of seamless and “just works” installation that literally takes a minute.

1 Like

I don’t mean to sound unethical, but I’d only be interested in such a library/platform if I could opt out of respecting robots.txt (meaning: not respecting it). There are a number of business-mediator and governmental websites that haven’t touched their robots.txt files in 10+ years, yet many businesses need the publicly available data and scrape them daily regardless.

So, is there a way to ignore robots.txt?

2 Likes

It looks like robots.txt is handled by a middleware, so it’s likely that you can run without that middleware if you want (though I haven’t confirmed that this is true).
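
If that’s the case, I’d guess you could just configure the middleware list without it; something like the sketch below (untested on my side, and the other entries are only my guess at the defaults):

```elixir
# config/config.exs
# An untested sketch: a middleware pipeline with Crawly.Middlewares.RobotsTxt
# deliberately left out (check the Crawly docs for the actual default list).
import Config

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ]
```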

1 Like

Yes, sure. You can ignore it. Obviously no one should recommend it in a tutorial, right :slight_smile: ?

2 Likes

Yes, true! Thanks for the hint!

Yes, true. I also did not have problems installing Rust. However, it still feels like it’s better to use something that does not require extra tooling, for the sake of the getting-started guide?

LOL :003: Good point. Thank you!

1 Like

50/50. I actually agree with you, but in a world where people just assume you already have Node.js installed in order to do anything, I would guess that also assuming they have Rust is not a stretch. :slight_smile: (especially having in mind how brainlessly easy it is to install)

It was a minor remark anyway, ignore it. It was mostly aimed at the tutorial being Windows-friendly, with which Rust helps a lot. Nothing else.

2 Likes

Yeah, but on the other hand it always makes sense to lower the learning curve :slight_smile:

2 Likes

I also wanted to add that I am looking for new production usages of Crawly or CrawlyUI. If you want to try the toolset I am suggesting, I will be happy to help. Just write here, or open an issue on GitHub!

3 Likes