I have a list of hundreds of different domains that I am hunting across for certain links. I got Crawly working on one, but Crawly's config expects a single base_url, e.g. https://www.foo.com.
Any suggestions?
You can define base_url as a function and query your DB or some other state container (a GenServer, perhaps) to fetch the next base URL to crawl.
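Untested sketch of that idea (the DomainQueue Agent and its API are illustrative, not part of Crawly):

```elixir
defmodule DomainQueue do
  # Illustrative state container holding the domains still to crawl.
  use Agent

  def start_link(domains),
    do: Agent.start_link(fn -> domains end, name: __MODULE__)

  # Domain currently being crawled.
  def current, do: Agent.get(__MODULE__, &List.first/1)

  # Drop the current domain and move on to the next one.
  def advance, do: Agent.update(__MODULE__, fn [_done | rest] -> rest end)
end

defmodule MySpider do
  use Crawly.Spider

  # base_url/0 is just a callback, so it can ask the queue which
  # domain is active instead of returning a hard-coded string.
  @impl Crawly.Spider
  def base_url(), do: DomainQueue.current()

  @impl Crawly.Spider
  def init(), do: [start_urls: [DomainQueue.current()]]

  @impl Crawly.Spider
  def parse_item(response) do
    # ... extract items and follow-up requests as usual ...
    %Crawly.ParsedItem{items: [], requests: []}
  end
end
```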
Thank you for the reply.
I spent an extra hour looking at the Crawly code and the existing middleware and pipelines and decided to simply hack up my own. I made my own DomainsFilter based on their DomainFilter, as well as my own WriteToFile that takes in a :filename to allow append mode, so I can run the crawler and update a single CSV with the results. I also needed to stop crawling a domain once the needle I was hunting for was found, and a simple Process.put(hostname, true)
in my spider was good enough to abort further crawling of that domain. Sadly, Crawly doesn't pass the state to the process_item callback, which would have made things slightly easier, since middleware doesn't see the items found, only pipelines do.