My first web scraper using Crawly

I’m happy to get code critique.

Coding with Crawly was a very smooth experience: no unpleasant surprises, and everything worked as expected. The Domo framework and the new dbg macro hugely contributed to the experience.

I coded with TDD — tests are here — and it was even a little easier than Python/Scrapy, which Crawly seems to be modelled on.

The spider produces a 60MB JSON-lines file, which I currently check in by hand to our Datasets repo. Its sources are one API call and roughly 500 web pages. The output "simply" captures the source structure of the Statutes along with their content, leaving semantic additions to later pipeline stages:

{"number":"1","name":"Courts, Oregon Rules of Civil Procedure","kind":"volume","chapter_range":["1","55"]}
{"number":"2","name":"Business Organizations, Commercial Code","kind":"volume","chapter_range":["56","88"]}
{"number":"3","name":"Landlord-Tenant, Domestic Relations, Probate","kind":"volume","chapter_range":["90","130"]}
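Since each line is an independent JSON object, later pipeline stages can process the file line by line rather than loading all 60MB at once. A minimal sketch of decoding one record, assuming OTP 27's built-in :json module (Jason works the same way; the field names are taken from the sample output above):

```elixir
# Decode a single JSON-lines record into an Elixir map with string keys.
decode_line = fn line ->
  line
  |> String.trim()
  |> :json.decode()
end

line =
  ~s({"number":"1","name":"Courts, Oregon Rules of Civil Procedure","kind":"volume","chapter_range":["1","55"]})

record = decode_line.(line)
# record["kind"] => "volume"
# record["chapter_range"] => ["1", "55"]
```

For the full file, the same function composes with lazy streaming, e.g. `File.stream!("statutes.jsonl") |> Stream.map(decode_line)` (filename hypothetical), so the whole file never needs to be in memory.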


  • Your Crawlers.String module's functions can easily be folded into the Util module with no loss of expressiveness; same for RegexCase. IMO that's too much micro-separation.
  • It's also not a good idea to have non-namespaced modules — e.g. rename Util to Crawlers.Util. Same for RegexCase.
  • In general, your source tree seems too deep for such a small project. Maybe you're future-proofing because it could grow to many more files? If so, cool; if not, it feels unnecessary.
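To illustrate the namespacing point: module names on the BEAM live in a single flat namespace, so a top-level Util can collide with a Util defined by any dependency. Prefixing with the app's namespace avoids that, and the small string/regex helpers can live together in one module. A sketch (the squish function is a made-up example, not from the actual repo):

```elixir
# Instead of top-level `Util` and `RegexCase` modules, one
# app-namespaced module holding the small helpers:
defmodule Crawlers.Util do
  @doc "Collapse runs of whitespace into single spaces."
  def squish(s) when is_binary(s) do
    s |> String.split() |> Enum.join(" ")
  end
end
```

Usage: `Crawlers.Util.squish("  Oregon   Revised  Statutes ")` returns `"Oregon Revised Statutes"`.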

Thanks for the feedback!

And yeah, it's definitely future-proofing in case I switch to Elixir. My Python repo has a dozen crawlers and growing:


This is somewhat of a tangent but I wanted to say I really appreciate that you build in public and post about it here on the forum. Great job and please keep doing it! :heart: