Crawly - A high-level web crawling & scraping framework for Elixir

oltarasenko · May 25, 2020, 11:48am

Dear Elixir community,

After a year of development, bug fixes, and improvements, we are proud to share the release of Crawly 0.10.0 with you.

Check out the source code here: GitHub - elixir-crawly/crawly: Crawly, a high-level web crawling & scraping framework for Elixir.
Check out our experimental, pre-alpha visual UI here: http://18.216.221.122/ and play with job scheduling there.

We have dedicated a lot of time and knowledge to building a fast and feature-rich web scraping framework. To be absolutely honest, I have to say that I took a lot of ideas from another popular web scraping framework (Scrapy, written in Python), as I have previously worked with the Scrapy core team.

We have some reported production usages of Crawly (some of them with really long-running crawls). However, we have yet to reach a stable version (hopefully that will happen within a couple of releases).

To describe how Crawly differs from other known Elixir scraping frameworks, I will list the Crawly features which I believe make it stand out:

Documentation - we have spent an enormous amount of time and effort to build great and clear versioned documentation!
Rate limiting
Robots.txt support
Request and item validators
Automatic duplicate filtering
Automatic cookie management (allows bypassing login pages and cookie-based regional filtering)
Browser rendering (with the help of Splash)
Retry support
Proxy support
HTTP API
Visual job management dashboard which allows operating multiple Crawly nodes at the same time (experimental): see it deployed on a demo EC2 micro instance: http://18.216.221.122/

We hope it will be useful for you!

If you have a suggestion or a production use case you’re happy for us to share, please get in touch.

dgreiss · May 25, 2020, 6:30pm

Congrats and thank you for launching this for the community.

obsidienne · May 26, 2020, 6:11am

Hi,

Is Crawly the right tool to create something like Inoreader?

oltarasenko · May 26, 2020, 9:21am

If it’s about extracting data from multiple RSS feeds - yes.

oltarasenko · May 26, 2020, 9:37am

I can also refer you to my talk from November 2019, which explains how scraping can be used in general: https://www.youtube.com/watch?v=ovSQGlkakAQ

mischov · May 26, 2020, 7:42pm

Congrats on the release.

I still think you should at the very least be suggesting that people use Floki’s html5ever parser. Scraping is exactly the context where the mochiweb_html parser not being HTML5 compliant is most problematic.
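
For anyone who wants to try that, here is a minimal sketch of switching Floki to the html5ever parser. It assumes the `:html_parser` config key and the `Floki.HTMLParser.Html5ever` module described in Floki's README; the version requirements are placeholders I picked for illustration, and a Rust toolchain may be needed to compile the NIF.

```elixir
# mix.exs (fragment): add the html5ever package next to Floki.
# The version requirements below are placeholders, not recommendations.
defp deps do
  [
    {:floki, "~> 0.26"},
    {:html5ever, "~> 0.8"}
  ]
end

# config/config.exs (separate file): make Floki parse with html5ever
# instead of the default mochiweb_html parser.
import Config

config :floki, :html_parser, Floki.HTMLParser.Html5ever
```

With this in place, existing `Floki.parse_document/1` calls should keep working unchanged; only the underlying parser differs.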

Brainiac · May 27, 2020, 8:15am

The use of Floki in the docs is only for demonstrative purposes, since Crawly does not dictate how your data is parsed. It’s faster to get up and running without having to install Rust, after all.

mischov · May 27, 2020, 8:41am

Faster, yes, but in my experience people tend towards doing what they see in examples, particularly when the drawbacks of that approach aren’t discussed.

oltarasenko · May 27, 2020, 1:04pm

We’re trying to show different examples. E.g. in one of the most recent articles I was using the Rust-based library called Meeseeks.

But indeed, I tend to use Floki in examples, as it’s easier to start this way.

Regarding the comment about drawbacks - I would say the HTML parser is definitely a thing to discuss and highlight. But what we’re also trying to do is to turn scraping into a complete process, where you can build a spider, schedule a job, and validate (QA) the results in a visual way.

mischov · May 27, 2020, 3:31pm

Good choice.

Discussing and highlighting parsers is all I was suggesting. Most examples, articles, etc, in the Elixir ecosystem don’t, and so I don’t think people get a chance to learn about potential pitfalls and trade-offs.

It’s great to see the UI and also that you’ve been working hard to be a good citizen by limiting impact and respecting robots.txt. Keep up the good work.

dimitarvp · May 27, 2020, 7:27pm

That’s not that slow or complex either. Going to https://rustup.rs and installing the Rust toolchain has honestly been the most painless installation experience I have ever had, even on Windows (which can’t be said for the C/C++ toolchain; it’s a nightmare to get several Ruby and Elixir projects to compile their native dependencies without at least one library refusing to compile).

I wish every other language adopted such a model of seamless and “just works” installation that literally takes a minute.

dimitarvp · May 27, 2020, 7:29pm

I don’t mean to sound unethical, but I’d only be interested in such a library / platform if I could opt out of respecting robots.txt. There are a number of business mediator and governmental websites that haven’t touched their robots.txt files in 10+ years, but many businesses need the publicly available data and scrape them daily regardless.

So, is there a way to ignore robots.txt?

mischov · May 27, 2020, 7:33pm

It looks like robots.txt is handled by a middleware, so it’s likely that you can run without that middleware if you want (though I haven’t confirmed that this is true).
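
A rough, unconfirmed sketch of what that could look like, assuming Crawly reads its pipeline from the `middlewares` key of the `:crawly` application config and that the middleware modules are named as in the Crawly docs:

```elixir
# config/config.exs
import Config

# Assumption: Crawly runs only the middlewares listed here, in order.
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    # Crawly.Middlewares.RobotsTxt is deliberately left out,
    # so robots.txt rules would not be applied to requests.
    Crawly.Middlewares.UserAgent
  ]
```

Re-adding `Crawly.Middlewares.RobotsTxt` to the list should restore the polite default.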

oltarasenko · May 28, 2020, 11:31am

Yes, sure. You can ignore it. Obviously no one should recommend it in a tutorial, right?

oltarasenko · May 28, 2020, 11:32am

Yes, true! Thanks for the hint!

oltarasenko · May 28, 2020, 11:33am

Yes, true. I also did not have problems installing Rust. However, for the sake of a getting-started guide, doesn’t it still feel better to use something which does not require installing anything extra?

dimitarvp · May 28, 2020, 11:43am

LOL Good point. Thank you!

dimitarvp · May 28, 2020, 11:45am

50/50. I actually agree with you, but in a world where people just assume you already have Node.js installed in order to do anything, I would guess that also assuming they have Rust is not a stretch (especially bearing in mind how brainlessly easy it is to install).

It was a minor remark anyway, ignore it. It was mostly aimed at the tutorial being Windows-friendly, which Rust helps with a lot. Nothing else.

oltarasenko · May 28, 2020, 12:03pm

Yeah, but on the other hand it always makes sense to lower the learning curve.

oltarasenko · May 29, 2020, 12:35pm

I also wanted to add that I am looking for new production usages of Crawly or CrawlyUI. If you want to try the toolset I am suggesting, I will be happy to help. Just write here, or via issues on GitHub!