As far as I can see, it's not possible to pass meta information from `init` into `parse_item`.
As I see it, you have two approaches at this point:
- You need to be able to identify the product by information from the website (usually that would be an SKU). Once you have this id inside your `parse_item`, it will be possible to update the record.
- As you have suggested, you can use the `Crawly.Request` object and insert meta there, for example:
```elixir
iex> response = Crawly.fetch("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", [headers: [id: "a897fe39b1053632"]])
iex> response.request
%HTTPoison.Request{
  method: :get,
  url: "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  headers: [{"User-Agent", "Crawly Bot"}, {:id, "a897fe39b1053632"}],
  body: "",
  params: %{},
  options: []
}
```
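And here is a rough sketch (my assumption, not verified against every Crawly version) of reading that injected id back inside `parse_item/1`, relying on the HTTPoison response keeping a reference to the request it was made with, as in the snippet above:

```elixir
@impl Crawly.Spider
def parse_item(response) do
  # Assumes the response is an HTTPoison.Response that still carries its
  # originating request, so the custom :id header is available here.
  id =
    Enum.find_value(response.request.headers, fn
      {:id, value} -> value
      _other -> nil
    end)

  # ...use `id` to look up and update the matching record...

  %Crawly.ParsedItem{items: [], requests: []}
end
```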
I got the spider to work locally in my Phoenix app, but when I deploy it I get this error:
```
No spiders found to auto-load: %MatchError{term: {:error, :enoent}}
```
All the files are there. Do I have to add something to my Dockerfile?
I think you can add more requests while the spider is running by using `Crawly.RequestsStorage.store/2`. My challenge was similar to yours: every week I have to go through a long list of URLs (20k+) to update their status. At first I started my spider by listing all the records from the db, but that kept the db connection open for too long, so I started looking into doing some sort of pagination.
I came up with this solution:
```elixir
# PluginProphet.AppSpider.ex
# (@ets_table and @per_page are module attributes defined elsewhere in this spider)

@impl Crawly.Spider
def init do
  # set up an ETS table to keep track of the pagination
  :ets.new(@ets_table, [:set, :public, :named_table])
  :ets.insert(@ets_table, {:current_page, 1})

  # format the initial urls
  urls =
    Apps.list_apps(1, @per_page)
    |> Enum.map(&get_app_url/1)

  [start_urls: urls]
end

@impl Crawly.Spider
def parse_item(response) do
  should_add_more_requests?()
  # ...
  # do some work
  # ...
end

defp should_add_more_requests?() do
  # grab the current requests
  {:requests, requests} = Crawly.RequestsStorage.requests(PluginProphet.AppSpider)

  # check if we need to add more requests
  if Enum.count(requests) < 2 do
    [{:current_page, current_page}] = :ets.lookup(@ets_table, :current_page)
    next_page = current_page + 1
    :ets.insert(@ets_table, {:current_page, next_page})

    new_requests =
      Apps.list_apps(next_page, @per_page)
      |> Enum.map(&get_app_url/1)
      |> Enum.map(&Crawly.Request.new/1)

    Crawly.RequestsStorage.store(PluginProphet.AppSpider, new_requests)
  end
end
```
How do I wipe the spider’s memory and run it again?
I would like to run the spider on a list of URLs from my db. The way the spider parses items could add a repeated item to the list of URLs to crawl, so I turned on `Crawly.Middlewares.UniqueRequest`. That works fine the first time. However, when I try to rerun the spider, it logs a bunch of these:
```
...
[debug] Dropping request: [some_url], as it's already processed
[debug] Dropping request: [some_url], as it's already processed
...
```
I've tried deleting the `dets_simple_storage` file, and I tried to force-wipe the spider's RequestsStorage on init with `Crawly.RequestsStorage.store(PluginProphet.AppSpider, [])`, but neither worked. The only thing that works is turning off `Crawly.Middlewares.UniqueRequest`.
Is there a way to wipe the spider’s memory?
Hi Rico!
The request middlewares use the `RequestsStorageWorker` process to store that information. The process itself is supposed to die when the spider finishes. Isn't that the case for you?
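For example, something like this should give you a clean start between runs (a rough sketch, assuming the standard `Crawly.Engine` start/stop API):

```elixir
# Stopping the spider lets its per-spider processes (requests storage,
# middleware state) terminate, so the next run starts with a clean
# request history.
Crawly.Engine.stop_spider(PluginProphet.AppSpider)

# ...later, start a fresh run with fresh per-spider state:
Crawly.Engine.start_spider(PluginProphet.AppSpider)
```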
Thanks, indeed it works as you described. In dev I was starting and stopping the spider manually and never gave it a chance to stop properly.
In any case, after more consideration, I’ve decided to keep it off.
I'm using Crawly in a Phoenix application. I see that Crawly automatically looks for spiders in `./spiders`, but I normally put all my backend modules in `./lib/my_app`. Is there a way to change the default folder from `./spiders` to `./lib/my_app/spiders`?
Thanks again for Crawly.
I'm hoping to contribute back to it as soon as I figure out how to better display the docs graphs, both on Hex and in the IDE.