I have a list of hundreds of different domains that I am hunting across for certain links. I got Crawly working on one, but Crawly's config expects a single base_url, e.g. https://www.foo.com.
Any suggestions?
You can define base_url as a function and query your DB or some other state container (a GenServer, perhaps) to fetch the next base URL to crawl.
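Untested sketch of that idea (the DomainQueue Agent and its API are illustrative, not part of Crawly):

```elixir
defmodule DomainQueue do
  # Illustrative state container holding the domains still to crawl.
  use Agent

  def start_link(domains),
    do: Agent.start_link(fn -> domains end, name: __MODULE__)

  # Domain currently being crawled.
  def current, do: Agent.get(__MODULE__, &List.first/1)

  # Drop the current domain and move on to the next one.
  def advance, do: Agent.update(__MODULE__, fn [_done | rest] -> rest end)
end

defmodule MySpider do
  use Crawly.Spider

  # base_url/0 is just a callback, so it can ask the queue which
  # domain is active instead of returning a hard-coded string.
  @impl Crawly.Spider
  def base_url(), do: DomainQueue.current()

  @impl Crawly.Spider
  def init(), do: [start_urls: [DomainQueue.current()]]

  @impl Crawly.Spider
  def parse_item(response) do
    # ... extract items and follow-up requests as usual ...
    %Crawly.ParsedItem{items: [], requests: []}
  end
end
```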
Thank you for the reply.
I spent an extra hour looking at the Crawly code and the existing middleware and pipelines and decided to simply hack up my own. I made my own DomainsFilter based on their DomainFilter, as well as my own WriteToFile that takes in a :filename to allow append mode, so I can run the crawler and update a single CSV with the results. I also needed to stop crawling a domain once the needle I was hunting for was found, and a simple Process.put(hostname, true)
in my spider was good enough to abort further crawling of that domain. Sadly, Crawly doesn't pass the state to the process_item callback, which would have made things slightly easier, since middleware doesn't see the items found, only pipelines do.