As far as I can see, it's not possible to pass meta information from `init` into `parse_item`.
As I see it, you have two approaches at this point:
- You need to be able to identify the product by information from the website (usually that would be an SKU). Once you have this id inside your `parse_item`, it will be possible to update the record.
- As you have suggested, you can use the `Crawly.Request` object and insert meta there, for example:
```elixir
iex> response = Crawly.fetch("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", [headers: [id: "a897fe39b1053632"]])
iex> response.request
%HTTPoison.Request{
  method: :get,
  url: "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  headers: [{"User-Agent", "Crawly Bot"}, {:id, "a897fe39b1053632"}],
  body: "",
  params: %{},
  options: []
}
```
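And here is a rough sketch (my assumption, not verified against every Crawly version) of reading that injected id back inside `parse_item/1`, relying on the HTTPoison response keeping a reference to the request it was made with, as in the snippet above:

```elixir
@impl Crawly.Spider
def parse_item(response) do
  # Assumes the response is an HTTPoison.Response that still carries its
  # originating request, so the custom :id header is available here.
  id =
    Enum.find_value(response.request.headers, fn
      {:id, value} -> value
      _other -> nil
    end)

  # ...use `id` to look up and update the matching record...

  %Crawly.ParsedItem{items: [], requests: []}
end
```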
I got the spider to work locally in my Phoenix app, but when I deploy it I get this error:
```
No spiders found to auto-load: %MatchError{term: {:error, :enoent}}
```
All the files are there. Do I have to add something to my Dockerfile?
I think you can add more requests while the spider is running by using `Crawly.RequestsStorage.store/2`. My challenge was similar to yours: every week I have to go through a long list of URLs (20k+) to update their status. At first I started my spider by listing all the records from the db, but that kept the db connection open for too long, so I started looking into doing some sort of pagination.
I came up with this solution:
```elixir
# PluginProphet.AppSpider.ex
# (@ets_table and @per_page are module attributes defined elsewhere in this spider)

@impl Crawly.Spider
def init do
  # set up an ETS table to keep track of the pagination
  :ets.new(@ets_table, [:set, :public, :named_table])
  :ets.insert(@ets_table, {:current_page, 1})

  # format the initial urls
  urls =
    Apps.list_apps(1, @per_page)
    |> Enum.map(&get_app_url/1)

  [start_urls: urls]
end

@impl Crawly.Spider
def parse_item(response) do
  should_add_more_requests?()
  # ...
  # do some work
  # ...
end

defp should_add_more_requests?() do
  # grab the current requests
  {:requests, requests} = Crawly.RequestsStorage.requests(PluginProphet.AppSpider)

  # check if we need to add more requests
  if Enum.count(requests) < 2 do
    [{:current_page, current_page}] = :ets.lookup(@ets_table, :current_page)
    next_page = current_page + 1
    :ets.insert(@ets_table, {:current_page, next_page})

    new_requests =
      Apps.list_apps(next_page, @per_page)
      |> Enum.map(&get_app_url/1)
      |> Enum.map(&Crawly.Request.new/1)

    Crawly.RequestsStorage.store(PluginProphet.AppSpider, new_requests)
  end
end
```
How do I wipe the spider’s memory and run it again?
I would like to run the spider on a list of URLs from my db. The way the spider parses items could add a repeated item to the list of URLs to crawl, so I turned on `Crawly.Middlewares.UniqueRequest`. That works fine the first time. However, when I try to rerun the spider, it logs a bunch of these:
```
...
[debug] Dropping request: [some_url], as it's already processed
[debug] Dropping request: [some_url], as it's already processed
...
```
I've tried deleting the `dets_simple_storage` file, and I tried to force-wipe the spider's RequestsStorage on init with `Crawly.RequestsStorage.store(PluginProphet.AppSpider, [])`, but neither worked. The only thing that works is turning off `Crawly.Middlewares.UniqueRequest`.
Is there a way to wipe the spider’s memory?
Hi Rico!
The request middlewares use the `RequestsStorageWorker` process to store that information. The process itself is supposed to die when the spider finishes. Isn't that the case for you?
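For example, something like this should give you a clean start between runs (a rough sketch, assuming the standard `Crawly.Engine` start/stop API):

```elixir
# Stopping the spider lets its per-spider processes (requests storage,
# middleware state) terminate, so the next run starts with a clean
# request history.
Crawly.Engine.stop_spider(PluginProphet.AppSpider)

# ...later, start a fresh run with fresh per-spider state:
Crawly.Engine.start_spider(PluginProphet.AppSpider)
```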
Thanks, indeed it works as you described. In dev I was starting and stopping the spider manually and never gave it a chance to stop properly.
In any case, after more consideration, I’ve decided to keep it off.
I'm using Crawly in a Phoenix application. I see that Crawly automatically looks for spiders in `./spiders`, but I normally put all my backend modules in `./lib/my_app`. Is there a way to change the default folder from `./spiders` to `./lib/my_app/spiders`?
Thanks again for Crawly.
I'm hoping to contribute back to it as soon as I figure out how to better display the docs graphs, both on Hex and in the IDE.