Scrape network requests from external website

berts-4865 · October 18, 2022, 3:08am

I’m looking for a way to scrape a media file from an external website.

Rather than inspecting the returned html, I wanted to try and capture the network requests instead.
Similar to the way you open Chrome and look at the network tab.

I’ve looked at the docs for HTTPoison, Finch, and Crawly, and as far as I can tell, they only return the HTML.

Can someone point me in the right direction or list potential libraries to manage this?

Thanks

benwilson512 · October 18, 2022, 3:17am

Hi @berts-4865

If you open the network tab and then refresh the page, the very first request IS just the HTML, same as with HTTPoison / Finch / etc.

What’s happening from there though is that Chrome parses that HTML and runs all of the associated JavaScript, and that kicks off the fetching of a bunch of other resources like CSS, JS, images, etc.

HTTPoison / Finch / etc. can download images and so forth too, but you have to do what Chrome does and parse the HTML to go look for those resources. As long as the resources are mentioned in the HTML itself (like an <img> tag) then you’re good to go.
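
As a rough illustration of "parse the HTML to go look for those resources", here is a minimal sketch using Floki. The `ResourceLinks` module name, the selector list, the sample HTML, and the `example.com` URLs are all made up for the example; a real page may reference its media through other tags or attributes.

```elixir
Mix.install([{:floki, "~> 0.36"}])

defmodule ResourceLinks do
  # Hypothetical helper: CSS selectors paired with the attribute that
  # holds the resource URL. Extend this list for the page being scraped.
  @sources [
    {"link[href]", "href"},
    {"script[src]", "src"},
    {"img[src]", "src"},
    {"audio[src]", "src"},
    {"video[src]", "src"},
    {"source[src]", "src"}
  ]

  # Returns absolute URLs for every resource referenced in the HTML.
  def extract(html, base_url) do
    {:ok, document} = Floki.parse_document(html)

    @sources
    |> Enum.flat_map(fn {selector, attr} ->
      document |> Floki.find(selector) |> Floki.attribute(attr)
    end)
    # Resolve relative paths against the page URL, as a browser would.
    |> Enum.map(&(base_url |> URI.merge(&1) |> URI.to_string()))
    |> Enum.uniq()
  end
end

# Sample page, invented for the example.
html = """
<html>
  <head><link rel="stylesheet" href="/css/app.css"></head>
  <body>
    <img src="images/cover.jpg">
    <video src="https://cdn.example.com/clip.mp4"></video>
  </body>
</html>
"""

html
|> ResourceLinks.extract("https://example.com/posts/1")
|> Enum.each(&IO.puts/1)
```

Anything that only shows up after JavaScript runs will not appear in this list, which is the limitation described next.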

Where you’re going to run into trouble is modern JS-heavy websites. If the site is basically just JS-based then you aren’t gonna find links to the content in the HTML; you basically have to run the JavaScript. At that point you’re in full-blown “headless browser” land.

berts-4865 · October 18, 2022, 3:42am

Thanks Ben. Great explanation.

I’m going to play with Polly.JS to generate the HAR file I wanted.
But scraping the HTML from a simple GET request might just be easier.
The scrape is simple, so hopefully won’t break often.

The last thing I want to do is run a headless browser for this job.

Thanks

benwilson512 · October 18, 2022, 3:43am

Ah yeah, if it’s a simple scrape I’d just pull the page down with Finch, run the HTML through Floki (philss/floki on GitHub), a simple HTML parser that lets you search for nodes using CSS selectors, to find whatever you’re looking for inside, and make another request to fetch it.
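
A minimal sketch of that flow, under some assumptions: the page URL, the `"audio source"` selector, the `ScrapeFinch` pool name, the `SimpleScrape` module, and the output filename are all placeholders, and the media URL is assumed to sit in a `src` attribute. Adjust the selector and attribute to match the actual page.

```elixir
Mix.install([{:finch, "~> 0.18"}, {:floki, "~> 0.36"}, {:castore, "~> 1.0"}])

defmodule SimpleScrape do
  # Plain GET through the Finch pool started below (ScrapeFinch is a made-up name).
  def fetch(url) do
    case Finch.build(:get, url) |> Finch.request(ScrapeFinch) do
      {:ok, %Finch.Response{status: 200, body: body}} -> {:ok, body}
      {:ok, %Finch.Response{status: status}} -> {:error, {:unexpected_status, status}}
      {:error, reason} -> {:error, reason}
    end
  end

  # Finds the first node matching `selector` and returns its `src`
  # attribute as an absolute URL, resolved against the page URL.
  def media_url(html, page_url, selector) do
    with {:ok, document} <- Floki.parse_document(html),
         [src | _] <- document |> Floki.find(selector) |> Floki.attribute("src") do
      {:ok, page_url |> URI.merge(src) |> URI.to_string()}
    else
      _ -> {:error, :media_not_found}
    end
  end
end

{:ok, _pid} = Finch.start_link(name: ScrapeFinch)

# Placeholder URL; replace with the page that embeds the media file.
page_url = "https://example.com/episodes/1"

with {:ok, html} <- SimpleScrape.fetch(page_url),
     {:ok, media_url} <- SimpleScrape.media_url(html, page_url, "audio source"),
     {:ok, media} <- SimpleScrape.fetch(media_url) do
  File.write!("media.bin", media)
  IO.puts("Saved #{byte_size(media)} bytes from #{media_url}")
else
  {:error, reason} -> IO.inspect(reason, label: "scrape failed")
end
```

Any non-200 response is treated as an error here, so a page or media file served behind a redirect would need extra handling.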

Hope this helps!