I’m looking for a way to scrape a media file from an external website.
Rather than inspecting the returned HTML, I wanted to try capturing the network requests instead, similar to the way you open Chrome and look at the Network tab.
I’ve looked at the docs for HTTPoison, Finch, and Crawly, and as far as I can tell, they only return HTML.
Can someone point me in the right direction or list potential libraries to manage this?
If you open the network tab and then refresh the page, the very first request IS just the HTML, same as with HTTPoison / Finch / etc.
HTTPoison / Finch / etc. can download images and so forth too, but you have to do what Chrome does and parse the HTML to go look for those resources. As long as the resources are mentioned in the HTML itself (like an <img> tag) then you’re good to go.
Thanks Ben. Great explanation.
I’m going to play with Polly.JS to generate the HAR file I wanted.
But scraping the HTML from a simple GET request might just be easier.
The scrape is simple, so hopefully it won’t break often.
The last thing I want to do is run a headless browser for this job.
Ah yeah, if it’s a simple scrape I’d just pull the page down with Finch, run the HTML through Floki (philss/floki on GitHub, a simple HTML parser that lets you search for nodes using CSS selectors) to find whatever you’re looking for inside, and make another request to fetch it.
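In case it helps, here is a rough sketch of that fetch, parse, fetch-again flow. Treat it as a starting point rather than a finished scraper: the module name `MediaScrape`, the pool name `MediaScrapeFinch`, the CSS selector, the dependency versions, and the idea that the media URL lives in a `src` attribute are all assumptions for illustration, so adjust them to whatever the real page looks like. Note that Finch does not follow redirects on its own, so this sketch treats anything other than a 200 as an error.

```elixir
# Assumed dependency versions; pin whatever your project already uses.
Mix.install([{:finch, "~> 0.18"}, {:floki, "~> 0.36"}])

defmodule MediaScrape do
  # Hypothetical helper module, not part of Finch or Floki.

  # Returns absolute URLs for every `src` attribute found on elements
  # matching `selector`. Relative URLs are resolved against `base_url`,
  # which must be an absolute URL.
  def media_urls(html, base_url, selector \\ "img, video, audio, source") do
    {:ok, document} = Floki.parse_document(html)

    document
    |> Floki.find(selector)
    |> Floki.attribute("src")
    |> Enum.map(fn src -> base_url |> URI.merge(src) |> URI.to_string() end)
    |> Enum.uniq()
  end

  # Plain GET through a Finch pool that has already been started
  # under the name given in `finch`.
  def get(url, finch \\ MediaScrapeFinch) do
    case Finch.build(:get, url) |> Finch.request(finch) do
      {:ok, %Finch.Response{status: 200, body: body}} -> {:ok, body}
      {:ok, %Finch.Response{status: status}} -> {:error, {:http_status, status}}
      {:error, reason} -> {:error, reason}
    end
  end

  # Fetches the page, picks the first media URL found in its HTML,
  # downloads it, and writes the bytes to `dest_path`.
  def download_first(page_url, dest_path, finch \\ MediaScrapeFinch) do
    with {:ok, html} <- get(page_url, finch),
         [media_url | _] <- media_urls(html, page_url),
         {:ok, bytes} <- get(media_url, finch) do
      File.write(dest_path, bytes)
    else
      [] -> {:error, :no_media_found}
      {:error, reason} -> {:error, reason}
    end
  end
end

# Usage (placeholder URL and file name):
#
#   {:ok, _pid} = Finch.start_link(name: MediaScrapeFinch)
#   MediaScrape.download_first("https://example.com/some-page", "media.bin")
```

This only finds resources that are mentioned in the HTML of that first response. If the page adds the media element later with JavaScript, it won’t show up here.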
Hope this helps!