Crawly - A high-level web crawling & scraping framework for Elixir

What JSON output does Crawly support? With some sites, I like Scrapy’s default style of a JSON object per page. But with other sites, I want to create a single JSON tree representing the site. This is possible with Scrapy with a few tricks. How hard would this be to do with Crawly?

Hey @dogweather,

We support JL (JSON Lines) and CSV output formats, with one line per item.

At the moment we don’t have support for creating a single root object with all items inside. As far as I can see, it might be complex for large crawls.
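
If you really need a single tree, one option would be to post-process the JL output after the crawl finishes. This is just a sketch, not a built-in Crawly feature; the "items.jl" path and the "items" root key are placeholders, and it assumes the jason library is available:

# Read the JL file Crawly wrote, decode each line, and wrap
# everything in one root object written out as a single JSON document.
items =
  "items.jl"
  |> File.stream!()
  |> Stream.map(&Jason.decode!/1)
  |> Enum.to_list()

File.write!("items.json", Jason.encode!(%{"items" => items}))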

Also, just in case, check out our experimental UI. It still has quite basic styling, but we’re migrating some parts to LiveView, so it will get better soon!


Sounds very nice. I replied to your other comment, but here’s a shortened example of a scrape where I get just one object. The real result has info for 20,000 or so web pages:

{
  "date_accessed": "2019-03-21",
  "chapters": [
    {
      "kind": "Chapter",
      "db_id": "36",
      "number": "101",
      "name": "Oregon Health Authority, Public Employees' Benefit Board",
      "url": "https://secure.sos.state.or.us/oard/displayChapterRules.action?selectedChapter=36",
      "divisions": [
        {
          "kind": "Division",
          "db_id": "1",
          "number": "1",
          "name": "Procedural Rules",
          "url": "https://secure.sos.state.or.us/oard/displayDivisionRules.action?selectedDivision=1",
          "rules": [
            {
              "kind": "Rule",
              "number": "101-001-0000",
              "name": "Notice of Proposed Rule Changes",
              "url": "https://secure.sos.state.or.us/oard/view.action?ruleNumber=101-001-0000",
              "authority": [
                "ORS 243.061 - 243.302"
              ],
              "implements": [
                "ORS 183.310 - 183.550",
                "192.660",
                "243.061 - 243.302",
                "292.05"
              ],
              "history": "PEBB 2-2009, f. 7-29-09, cert. ef. 8-1-09<br>PEBB 1-2009(Temp), f. &amp; cert. ef. 2-24-09 thru 8-22-09<br>PEBB 1-2004, f. &amp; cert. ef. 7-2-04<br>PEBB 1-1999, f. 12-8-99, cert. ef. 1-1-00",
              }
            ]
          }
        ]
      }
    ]
  }

As part of the project’s development, I have decided to create a short cookbook of scraping recipes. If you’re using Crawly, or doing scraping in general, these articles might be useful for you (I am including Medium friend links, so everyone can read them):


20 claps for this article on Medium from me :wink:

Thanks for your excellent work :slight_smile:


Would you have a link to working Crawly demo code? I tried several examples, including the one in the README, but I don’t get any output from it. It also had a bug or two that I fixed to get it to compile. I’m ready to try Crawly out, but haven’t seen it work yet.

Sorry to say it, but I don’t have time (mostly due to an ongoing war in my country :() to work on Crawly.


Hey people,

I’ve had a bit of time to work on Crawly recently, and as a result have made a new release and written a new article about it :slight_smile:

Hopefully you will find it useful.


Crawly rocks. I use it to scrape laws & statutes. I’m on the path to moving to it from Python’s Scrapy.


Thanks for your work in updating your article! I’m starting to work on Crawly and will use your article as a reference.

Do you have any plans to write an article about downloading files (e.g., PDFs) with Crawly? That would be really helpful for me, and I suspect for many others…

Sending you and your countrymen all the best! Thank you in advance!

Thanks for the kind words! It would be interesting to give it a try. Do you have a particular website in mind to check?

Adding a few new articles:

  1. Effortlessly Extract Data from Websites with Crawly YML - Erlang Solutions

  2. Web scraping with Elixir and Crawly - rendering JavaScript in 2023


Hi,

One limitation I see in Crawly is the inability to submit URLs after a spider has started. We have a list of URLs stored in a DB and need to crawl them in batches, up to some limit.

Hi, I’ve been getting my head around Crawly for the past couple of days. Before that, I was stubbing my toes on major skill issues with Crawlee JS on the Apify platform.

My app has 20k URLs that it’s tracking. Once a week I would like to ping each one to get the latest info. Am I able to pass more information along with start_urls or start_requests? I would like to pass along the DB id of the related URL so that it’s easier to handle in the parse_item function.

I’m getting these debug messages. Is it a version mismatch?

Rebuilding...
[debug] Could not classify module Elixir.Hex.API as spider: %UndefinedFunctionError{module: Hex.API, function: :module_info, arity: 1, reason: nil, message: nil}
[debug] Could not classify module Elixir.Hex.API.Auth as spider: %UndefinedFunctionError{module: Hex.API.Auth, function: :module_info, arity: 1, reason: nil, message: nil}
[debug] Could not classify module Elixir.Hex.API.Key as spider: %UndefinedFunctionError{module: Hex.API.Key, function: :module_info, arity: 1, reason: nil, message: nil}
[debug] Could not classify module Elixir.Hex.API.Key.Organization as spider: %UndefinedFunctionError{module: Hex.API.Key.Organization, function: :module_info, arity: 1, reason: nil, message: nil}
[debug] Could not classify module Elixir.Hex.API.Package as spider: %UndefinedFunctionError{module: Hex.API.Package, function: :module_info, arity: 1, reason: nil, message: nil}
[debug] Could not classify module Elixir.Hex.API.Package.Owner as spider: %UndefinedFunctionError{module: Hex.API.Package.Owner, function: :module_info, arity: 1, reason: nil, message: nil}

Here is the deps section of my mix.exs file:

 defp deps do
    [
      {:bcrypt_elixir, "~> 3.0"},
      {:phoenix, "~> 1.7.14"},
      {:phoenix_ecto, "~> 4.5"},
      {:ecto_sql, "~> 3.10"},
      {:postgrex, ">= 0.0.0"},
      {:phoenix_html, "~> 4.1"},
      {:phoenix_live_reload, "~> 1.2", only: :dev},
      # TODO bump on release to {:phoenix_live_view, "~> 1.0.0"},
      {:phoenix_live_view, "~> 1.0.0-rc.1", override: true},
      # {:floki, ">= 0.30.0", only: :test},
      {:floki, ">= 0.30.0"},
      {:phoenix_live_dashboard, "~> 0.8.3"},
      {:esbuild, "~> 0.8", runtime: Mix.env() == :dev},
      {:tailwind, "~> 0.2", runtime: Mix.env() == :dev},
      {:heroicons,
       github: "tailwindlabs/heroicons",
       tag: "v2.1.1",
       sparse: "optimized",
       app: false,
       compile: false,
       depth: 1},
      {:swoosh, "~> 1.5"},
      {:finch, "~> 0.13"},
      {:telemetry_metrics, "~> 1.0"},
      {:telemetry_poller, "~> 1.0"},
      {:gettext, "~> 0.20"},
      {:jason, "~> 1.2"},
      {:dns_cluster, "~> 0.1.1"},
      {:bandit, "~> 1.5"},
      {:httpoison, "~> 2.2"},
      {:dotenv, "~> 3.1"},
      {:csv, "~> 3.2"},
      {:crawly, "~> 0.17.2"}
    ]
  end

No. Crawly scans the project’s modules in order to find spiders. Those are just debug messages for modules that are not spiders.
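
For reference, a module gets classified as a spider when it implements the Crawly.Spider behaviour; everything else (like the Hex.API modules above) just produces those debug lines. A minimal skeleton looks roughly like this (the module name and URLs are placeholders):

defmodule MyApp.ExampleSpider do
  use Crawly.Spider

  # The site this spider is allowed to crawl (placeholder URL).
  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  # Where the crawl starts.
  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/"]]

  # Extract items and follow-up requests from each response.
  @impl Crawly.Spider
  def parse_item(_response) do
    %Crawly.ParsedItem{items: [], requests: []}
  end
end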


Sorry I don’t understand that. Could you explain what you’re trying to do?

Thanks for taking the time; let me try to explain.

I’m trying to start the spider with a bunch of items from my database, and when parsing the responses I want to write back to those records. It would be convenient for me to be able to pass the record itself as metadata, so that inside parse_item/1 I can use that record’s information.

def init do
  requests =
    Apps.list_apps()
    |> Enum.map(fn app ->
      Crawly.Request.new("https://#{app.bubble_id}.bubbleapps.io", [], [], meta: %{app: app})
    end)

  [start_requests: requests]
end

def parse_item(response) do
  app = response.meta.app
  # do something with response.body based on response.meta.app
end

Originally I thought I could (ab)use the middlewares argument of Crawly.Request.new/4, but that info would not be available in the response, correct?
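
In the meantime, the workaround I’m leaning towards is to re-derive the record from the URL inside parse_item/1 instead of attaching it to the request. This is a rough sketch under assumptions: it relies on response.request_url holding the original URL, and Apps.get_app_by_bubble_id/1 is a hypothetical helper that loads the record again.

def parse_item(response) do
  # Recover which app this response belongs to from its URL
  # (assumes the bubble_id is the first label of the host).
  %URI{host: host} = URI.parse(response.request_url)
  bubble_id = host |> String.split(".") |> List.first()

  # Hypothetical context function that re-loads the DB record.
  app = Apps.get_app_by_bubble_id(bubble_id)

  # ... extract data from response.body and update `app` here ...

  %Crawly.ParsedItem{items: [], requests: []}
end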