What's the best way to scrape a webpage in 2024?

Hello,

I'm working on a project that needs to scrape web pages from time to time, and some of them require a “visit from a browser” to prevent scraping, e.g., many websites use Cloudflare protection:

iex(19)> Req.get("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7")
{:ok,
 %Req.Response{
   status: 403,
   ...

What’s the best way to solve this task with Elixir in 2024?

I can think of three possible ways:

It looks like Wallaby is my best bet, but I’m curious if people have any comments or other suggestions here?


For simple stuff you could try Hop (GitHub: seanmor5/hop), a tiny web crawling framework for Elixir.

There is also a nice DockYard article: Web Crawling with Hop, Mighty, and Instructor.
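For a taste of it, here is a minimal Hop sketch; the Hop.new/1 and Hop.stream/1 calls and the {url, response, state} tuple shape are my reading of Hop's README, so treat them as assumptions:

# Crawl a site and print every URL visited (API shape assumed from Hop's README)
"https://books.toscrape.com"
|> Hop.new()
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("visited: #{url}")
end)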


It might be enough to send proper HTTP headers and use the right User-Agent.
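Something along these lines, a sketch that assumes the 403 is purely header-based (the header values are illustrative, not a known-good bypass):

# Same request as above, but with browser-like headers
Req.get("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7",
  headers: [
    {"user-agent",
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"},
    {"accept", "text/html,application/xhtml+xml"},
    {"accept-language", "en-US,en;q=0.9"}
  ]
)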

A lot of websites are nice enough these days to provide an internal API for their JavaScript that already returns nicely structured JSON.
Search boxes and auto-complete endpoints are also great and often already return all the data I need.
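For example, something like this; the endpoint and parameter are hypothetical, you'd find the real ones in the browser's network tab:

# Call the site's internal search endpoint directly; Req decodes JSON
# bodies automatically, so resp.body is already a map/list, not HTML.
{:ok, resp} = Req.get("https://example.com/api/search", params: [q: "chemistry"])
resp.body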

Wallaby is probably easier to use, but here is a fork with the newest CDP protocol versions; it works fine.
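If you end up with Wallaby, a minimal session sketch (assumes chromedriver is installed and Wallaby is configured with driver: Wallaby.Chrome):

# Start a browser session, load the page, and grab the rendered HTML
{:ok, _} = Application.ensure_all_started(:wallaby)
{:ok, session} = Wallaby.start_session()

html =
  session
  |> Wallaby.Browser.visit("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7")
  |> Wallaby.Browser.page_source()

Wallaby.end_session(session)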


I did the exact same request with curl and it works. I think you are just missing some headers, or you spammed too much and got temporarily banned.
