What's the best way to scrape a webpage in 2024?

Hello,

I'm working on a project that needs to scrape web pages from time to time, and some of them require a “visit from a browser” to prevent scraping, e.g., many websites use Cloudflare protection:

iex(19)> Req.get("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7")
{:ok,
 %Req.Response{
   status: 403,
   ...

What’s the best way to solve this task with Elixir in 2024?

I can think of three possible ways:

It looks like Wallaby is my best bet, but I’m curious if people have any comments or other suggestions here?


For simple stuff you could try Hop (GitHub: seanmor5/hop), a tiny web crawling framework for Elixir.

There is also a nice DockYard article: Web Crawling with Hop, Mighty, and Instructor.
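For a taste of it, here is a minimal Hop sketch; the Hop.new/1 and Hop.stream/1 calls and the {url, response, state} tuple shape are my reading of Hop's README, so treat them as assumptions:

# Crawl a site and print every URL visited (API shape assumed from Hop's README)
"https://books.toscrape.com"
|> Hop.new()
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("visited: #{url}")
end)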


It might be enough to send proper HTTP headers and use the right User-Agent.
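Something along these lines, a sketch that assumes the 403 is purely header-based (the header values are illustrative, not a known-good bypass):

# Same request as above, but with browser-like headers
Req.get("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7",
  headers: [
    {"user-agent",
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"},
    {"accept", "text/html,application/xhtml+xml"},
    {"accept-language", "en-US,en;q=0.9"}
  ]
)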

A lot of websites are nice enough these days to provide an internal API for their JavaScript that already returns nicely structured JSON.
Search boxes and auto-complete endpoints are also great and often already return all the data I need.
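For example, something like this; the endpoint and parameter are hypothetical, you'd find the real ones in the browser's network tab:

# Call the site's internal search endpoint directly; Req decodes JSON
# bodies automatically, so resp.body is already a map/list, not HTML.
{:ok, resp} = Req.get("https://example.com/api/search", params: [q: "chemistry"])
resp.body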

Wallaby is probably easier to use, but here is a fork with the newest CDP protocol versions; it works fine.
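If you end up with Wallaby, a minimal session sketch (assumes chromedriver is installed and Wallaby is configured with driver: Wallaby.Chrome):

# Start a browser session, load the page, and grab the rendered HTML
{:ok, _} = Application.ensure_all_started(:wallaby)
{:ok, session} = Wallaby.start_session()

html =
  session
  |> Wallaby.Browser.visit("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7")
  |> Wallaby.Browser.page_source()

Wallaby.end_session(session)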


I did the exact same request with curl and it works. I think you are just missing some headers, or you spammed too much and got temporarily banned.
