We have dedicated a lot of time and knowledge to building a fast and feature-rich web scraping framework. To be absolutely honest, I have to say that I took a lot of ideas from another popular web scraping framework (Scrapy, written in Python), as I have previously worked with the Scrapy core team.
We have some reported production usages of Crawly (some of them with really long-running crawls). However, we have yet to reach a stable version (hopefully that will happen in a couple of releases).
To describe how Crawly differs from other known Elixir scraping frameworks, I will list the Crawly features which I believe make it stand out:
Documentation - we have spent an enormous amount of time and effort to build great and clear versioned documentation!
Request and item validators
Automatic duplicate filtering
Automatic cookie management (allows bypassing login pages and cookie-based regional filtering)
Browser rendering (with the help of Splash)
Visual job management dashboard that can operate multiple Crawly nodes at the same time (experimental). See it deployed on a demo EC2 micro instance: http://18.104.22.168/
We hope it will be useful for you!
If you have a suggestion or a production use case you’re happy for us to share, please get in touch.
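For readers wondering how the validators, duplicate filtering, cookie management and browser rendering listed above are switched on: in Crawly they are configured as middlewares, pipelines and a fetcher. The fragment below is only a sketch. The module names are given as I recall them from the Crawly documentation, and the user agent string, field names, output folder and Splash URL are made-up placeholders, so please check everything against the version you have installed.

```elixir
# config/config.exs (illustrative sketch, not an official configuration)
import Config

config :crawly,
  # Browser rendering through a Splash instance (URL is a placeholder).
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},
  middlewares: [
    # Drop requests that leave the spider's base domain.
    Crawly.Middlewares.DomainFilter,
    # Drop requests for URLs that were already scheduled.
    Crawly.Middlewares.UniqueRequest,
    # Carry cookies between requests (name as I recall it; verify).
    Crawly.Middlewares.AutoCookiesManager,
    # Skip URLs disallowed by the site's robots.txt.
    Crawly.Middlewares.RobotsTxt,
    # Placeholder user agent string.
    {Crawly.Middlewares.UserAgent, user_agents: ["ExampleBot/1.0"]}
  ],
  pipelines: [
    # Item validation: drop items missing these (hypothetical) fields.
    {Crawly.Pipelines.Validate, fields: [:url, :title]},
    # Duplicate filtering keyed on a (hypothetical) item field.
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```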
I still think you should be at the very least suggesting people use Floki’s html5ever parser. Scraping is exactly the context when the mochiweb_html parser not being HTML5 compliant is most problematic.
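For anyone who wants to try that suggestion, switching Floki to the html5ever parser is, as far as I know, a one-line configuration change plus an extra dependency. This is a sketch rather than a complete setup: you also need to add the html5ever package to the deps in mix.exs, and depending on the package version it may need the Rust toolchain to compile.

```elixir
# config/config.exs
import Config

# Tell Floki to parse with html5ever instead of the default mochiweb_html.
# Requires the :html5ever package to be listed in mix.exs deps.
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```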
We’re trying to show different examples. E.g. in one of the most recent articles I used the Rust-based library called Meeseeks.
But indeed, I tend to use Floki in examples, as it’s easier to get started that way.
Regarding the comment about drawbacks: I would say the HTML parser is definitely a thing to discuss and highlight. But what we’re also trying to do is to bring the scraping experience to the level of a process, where you can build a spider, schedule a job and validate (QA) the results in a visual way.
Discussing and highlighting parsers is all I was suggesting. Most examples, articles, etc, in the Elixir ecosystem don’t, and so I don’t think people get a chance to learn about potential pitfalls and trade-offs.
It’s great to see the UI and also that you’ve been working hard to be a good citizen by limiting impact and respecting robots.txt. Keep up the good work.
That’s not that slow or complex either. Going to https://rustup.rs and installing the Rust toolchain has honestly been the most painless installation experience I have ever had, even on Windows (which can’t be said for the C/C++ toolchain; it’s a nightmare to get several Ruby and Elixir projects to compile their native dependencies without at least one library refusing to compile).
I wish every other language adopted such a model of seamless and “just works” installation that literally takes a minute.
I don’t mean to sound unethical, but I’d only be interested in such a library / platform if I could opt out of respecting robots.txt. There are a number of business mediator and governmental websites that haven’t touched their robots.txt files in 10+ years, but many businesses need the publicly available data and scrape them daily regardless.
50/50. I actually agree with you, but in a world where people just assume you already have Node.js installed in order to do anything, I would guess that also assuming they have Rust is not a stretch (especially bearing in mind how brainlessly easy it is to install).
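As far as I understand Crawly's design, robots.txt handling lives in a middleware, so opting out should just be a matter of leaving that middleware out of the list. The fragment below is a sketch under that assumption; the module names are as I recall them from the Crawly documentation and the user agent string is a placeholder.

```elixir
# config/config.exs (sketch: same idea as a normal setup, minus robots.txt checks)
import Config

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    # Crawly.Middlewares.RobotsTxt is deliberately not listed here,
    # so requests are not checked against the site's robots.txt.
    {Crawly.Middlewares.UserAgent, user_agents: ["ExampleBot/1.0"]}
  ]
```

Whether ignoring robots.txt is acceptable for a given site is of course up to whoever runs the crawl.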
Was a minor remark anyway, ignore it. It was mostly aimed at the tutorial being Windows-friendly, with which Rust helps a lot. Nothing else.
I also wanted to add that I am looking for new production usages of Crawly or CrawlyUI. If you want to try the toolset I am suggesting, I will be happy to help. Just write here, or via issues on GitHub!