I'm making a web scraper to download zip files where the filename does not appear in the URL

htazewell · October 18, 2018, 4:59pm

I’ve been stuck trying to figure this out for a few weeks now. I am using a headless browser (phantomjs) to download a zip file. I have tried using HTTPoison; however, the URL does not contain the filename. In fact, every attempt to download this zip file using HTTPoison returns a 302 redirect. I was wondering if there is a way to download files in a headless browser by using either the Content-Disposition header or some other AJAX property, since the filename is not part of the URL. Any ideas or insight would be helpful! Thank you.

OvermindDL1 · October 18, 2018, 5:07pm

You need to follow redirects then. You can do that by passing the options follow_redirect: true, max_redirect: 20 (with whatever values you want), or you can follow them manually.

I’m not sure why there would be a point to use a headless browser, should just use HTTPoison.
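For the "follow them manually" route, here is a minimal sketch of a redirect follower that also reports loops. The module name `RedirectFollower` and the injected `fetch` function are assumptions for illustration, not part of HTTPoison; `fetch` is any one-argument function that returns `{:ok, response}` with `status_code` and `headers` fields, which is the shape HTTPoison responses have.

```elixir
defmodule RedirectFollower do
  @redirect_codes [301, 302, 303, 307, 308]

  # Follows redirects by hand. Returns {:ok, response, visited_urls}
  # or an error tuple that includes the chain of URLs visited so far.
  def follow(url, fetch, max \\ 20, seen \\ [])

  def follow(_url, _fetch, 0, seen) do
    {:error, {:too_many_redirects, Enum.reverse(seen)}}
  end

  def follow(url, fetch, max, seen) do
    if url in seen do
      # The server sent us back to a URL we already requested.
      {:error, {:redirect_loop, Enum.reverse([url | seen])}}
    else
      case fetch.(url) do
        {:ok, %{status_code: code, headers: headers}} when code in @redirect_codes ->
          case location(headers) do
            nil ->
              {:error, :missing_location}

            loc ->
              # Location may be relative, so resolve it against the current URL.
              next = url |> URI.merge(loc) |> URI.to_string()
              follow(next, fetch, max - 1, [url | seen])
          end

        {:ok, response} ->
          {:ok, response, Enum.reverse([url | seen])}

        {:error, reason} ->
          {:error, reason}
      end
    end
  end

  # Header names are case-insensitive, so compare in lowercase.
  defp location(headers) do
    Enum.find_value(headers, fn {k, v} ->
      if String.downcase(k) == "location", do: v
    end)
  end
end
```

With HTTPoison as a dependency, something like `RedirectFollower.follow(url, &HTTPoison.get/1)` should work, since HTTPoison does not follow redirects unless asked to. The returned URL chain shows exactly where a loop starts, which the built-in follow_redirect option does not tell you.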

htazewell · October 18, 2018, 6:20pm

Thank you for your response! When I tried using follow_redirect: true, I found myself stuck in a loop where I was being redirected to the same URL. The reason I’m trying to do this in a headless browser is that doing this manually is time-consuming, and it would be better to have this run overnight without any need for user input. I was thinking that if there was an Elixir library that allows you to inspect the HTTP traffic, then I could identify the file that I need.

OvermindDL1 · October 18, 2018, 6:29pm

Uh… that’s a server bug, like big-time server bug… o.O

Well HTTPoison is the thing for that, plus processing the HTML through Meeseeks or something if you need to do that.
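On the filename question: when the name is not in the URL, servers usually send it in the Content-Disposition response header. A rough sketch of pulling it out of a response's header list follows. The module name `DownloadName` and the "download.zip" fallback are made up for illustration, and the encoded `filename*=` form is not handled.

```elixir
defmodule DownloadName do
  # Returns the filename from a Content-Disposition header,
  # or `default` when the header or the filename is missing.
  def from_headers(headers, default \\ "download.zip") do
    value =
      Enum.find_value(headers, fn {k, v} ->
        if String.downcase(k) == "content-disposition", do: v
      end)

    case value && Regex.run(~r/filename="?([^";]+)"?/i, value) do
      # Path.basename/1 drops any directory part the server may have sent.
      [_, name] -> Path.basename(String.trim(name))
      _ -> default
    end
  end
end
```

The `headers` argument is the list of `{name, value}` tuples found in the `headers` field of an HTTPoison response, so the result can be used directly as the path for `File.write!/2` with the response body.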

htazewell · October 18, 2018, 6:38pm

See that’s exactly what I was thinking as far as the server bug goes. My guess is that it was possibly set up that way to prevent web scraping and if that’s the case then there really isn’t much that can be done. I will definitely look into meeseeks, thanks for the suggestion!

htazewell · October 18, 2018, 8:24pm

One last question if you have a minute… there’s a button that, when clicked, triggers an HTTP POST which causes the document of interest to be downloaded. This is the behavior in a regular browser. However, when I use the headless browser it’s as if that HTTP POST never happens. Do you have any suggestions for further diagnosing this issue?

sribe · October 18, 2018, 8:43pm

My only suggestion is that as long as you’re using a headless browser you’ll never know what is actually happening. You need to download the document in a regular browser, then look at the network activity in your browser’s dev tools to see what actually happened in terms of redirects etc, then use an HTTP client lib (HTTPoison is the usual choice) to follow the same path.

It’s perfectly possible that the site you’re hitting uses a combination of redirects AND cookies. Advertising & tracking sometimes complicate the heck out of things. I’ve seen a single page load result in a series of 8 redirects, each of which added some damned tracking/advertising info until the request was finally answered.
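If cookies are set along the redirect chain, each request has to send back what the earlier responses set. A minimal sketch of carrying them forward, assuming a plain name-to-value map is enough: the `CookieJar` module is hypothetical, and it deliberately ignores cookie attributes such as Domain, Path and expiry.

```elixir
defmodule CookieJar do
  # Merges every Set-Cookie header of a response into a name => value map.
  def update(jar, headers) do
    headers
    |> Enum.filter(fn {k, _v} -> String.downcase(k) == "set-cookie" end)
    |> Enum.reduce(jar, fn {_k, v}, acc ->
      # Only the first "name=value" segment matters here; the rest are attributes.
      [pair | _attrs] = String.split(v, ";")

      case String.split(pair, "=", parts: 2) do
        [name, value] -> Map.put(acc, String.trim(name), String.trim(value))
        _ -> acc
      end
    end)
  end

  # Builds the Cookie request header to send with the next request.
  def to_header(jar) do
    cookies =
      jar
      |> Enum.sort()
      |> Enum.map_join("; ", fn {k, v} -> "#{k}=#{v}" end)

    {"Cookie", cookies}
  end
end
```

After each response, call `CookieJar.update(jar, response.headers)` and add `CookieJar.to_header(jar)` to the headers of the next request in the chain.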

htazewell · October 18, 2018, 8:54pm

I really appreciate your response, thank you for the insights. Although I’m not closer to the solution, your feedback has validated my approach and has provided a much needed sanity check. If I end up finding the solution I will update this post!

Phillipp · October 19, 2018, 8:15am

I can only recommend mirroring the exact same request in HTTPoison as in the browser, including all cookies (just copy & paste them from Chrome DevTools) AND headers. If that works, remove cookies and headers step by step until it stops working. I do a lot of web scraping and have already seen many strategies to prevent scraping. Sites often look for a valid User-Agent, for example.
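The "remove step by step" part can be automated. Below is a small sketch, assuming you supply a `still_works` function that performs the request with the given headers and returns true when the download still succeeds. Both `HeaderMinimizer` and `still_works` are hypothetical names, not part of any library.

```elixir
defmodule HeaderMinimizer do
  # Tries dropping each header in turn and keeps it out
  # whenever the request still works without it.
  def minimize(headers, still_works) do
    Enum.reduce(headers, headers, fn header, kept ->
      candidate = List.delete(kept, header)
      if still_works.(candidate), do: candidate, else: kept
    end)
  end
end
```

With HTTPoison, `still_works` could be something like `fn hs -> match?({:ok, %{status_code: 200}}, HTTPoison.get(url, hs)) end`, where `url` is the download URL copied from the browser. Each check makes a real request, so keep the copied header list short and be gentle with the server.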

htazewell · October 19, 2018, 12:13pm

Thank you for the suggestion! At this point every different approach is helpful.

mtarnovan · October 24, 2018, 3:34pm

I’ve seen some strange behavior myself when trying to scrape some sites that were behind application level firewalls. Some of these firewalls do a sort of finger-printing of the browser / detecting of headless browsers, sometimes by timing operations on <canvas> objects, possibly by other means too. The way they are behaving is they redirect to an endpoint that serves an obfuscated js that redirects to the final endpoint only if the non-headless check succeeds. My advice would be to use an intercepting proxy like https://www.charlesproxy.com/ to see exactly what’s going on over the wire.

fahri.nurul · September 21, 2019, 7:02am

Hope this helps :