Scraping Google search results?

Context:

I am trying to make Google search requests dynamically. However, it seems Google doesn’t offer an API for this, so my next approach is to use HTTPoison to run a Google search via query parameters and parse the results.

Problem:

However, I am running into a problem: Google doesn’t recognize my HTTPoison request as a legitimate one.

Question:

  • Have any of you guys/gals worked with Google search?
  • How can I send a legitimate request via HTTPoison?

I can’t tell you exactly what you need to include, but here is a good way to find out:

  1. Make a manual request via your browser and record it using the devtools. Make sure you are in incognito mode so you don’t have any session data.

  2. Recreate the request, including all the headers and parameters, using an HTTP tool, e.g. Postman.

  3. If step 2 works, start removing headers you think are not needed. The goal is to end up with only the stuff you actually need.

  4. Recreate the request using HTTPoison (a minimal sketch follows this list). It’s mostly just headers, and in some cases you might also need some cookies.
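To make step 4 concrete, here is a minimal sketch, assuming the headers that survive step 3 are User-Agent, Accept, and Accept-Language (your own list from Postman may well differ). The module and function names, and the header values, are just for illustration:

```elixir
defmodule GoogleSearch do
  # Example headers only; copy the exact values you recorded in devtools/Postman.
  @headers [
    {"User-Agent",
     "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"},
    {"Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},
    {"Accept-Language", "en-US,en;q=0.9"}
  ]

  # Runs a search and returns the raw HTML body on success.
  def search(query) do
    case HTTPoison.get("https://www.google.com/search", @headers,
           params: [q: query],
           follow_redirect: true
         ) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} -> {:ok, body}
      {:ok, %HTTPoison.Response{status_code: status}} -> {:error, {:unexpected_status, status}}
      {:error, %HTTPoison.Error{reason: reason}} -> {:error, reason}
    end
  end
end
```

From there you can feed the HTML into a parser such as Floki to pull out the result links. Be aware that Google may still answer with a captcha or consent page instead of results, which is exactly the kind of thing steps 1–3 will reveal.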

The approach is pretty simple. It’s basically just emulating a real person. Have a look at headers like User-Agent, Accept, etc. Some sites try to be fancy and use those to detect crawlers.

If, for some reason, you need a cookie, then figure out where and how the cookie is set. This means you might end up with an extra request to obtain the cookie, which you then use for the following requests.
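A rough sketch of that two-request flow with HTTPoison, assuming the cookie is set by a plain GET to the homepage (where it is actually set is exactly what you need to find out in devtools); the module name and URLs are only placeholders:

```elixir
defmodule CookieFlow do
  @user_agent {"User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"}

  # Request the page that sets the cookie and collect every Set-Cookie header
  # into a single "name=value; name=value" string suitable for a Cookie header.
  def fetch_cookies(url) do
    with {:ok, %HTTPoison.Response{headers: headers}} <- HTTPoison.get(url, [@user_agent]) do
      cookies =
        headers
        |> Enum.filter(fn {name, _value} -> String.downcase(name) == "set-cookie" end)
        |> Enum.map(fn {_name, value} -> value |> String.split(";") |> hd() end)
        |> Enum.join("; ")

      {:ok, cookies}
    end
  end

  # Reuse the collected cookies on the follow-up request.
  def get_with_cookies(url, cookies, params \\ []) do
    HTTPoison.get(url, [@user_agent, {"Cookie", cookies}], params: params)
  end
end

# Usage, e.g.:
# {:ok, cookies} = CookieFlow.fetch_cookies("https://www.google.com/")
# {:ok, %HTTPoison.Response{body: body}} =
#   CookieFlow.get_with_cookies("https://www.google.com/search", cookies, q: "elixir")
```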


> Recreate the request, including all the headers and parameters, using an HTTP tool, e.g. Postman.

Or you can “copy as cURL” the request (in most browsers) and do the same with cURL.
Also, startpage.com returns Google results; maybe that is easier to scrape? (Just an idea, something to try.)

Or import the cURL command into Postman :smiley: Fiddling around with a cURL command that contains headers and cookies is pretty awful :smiley:


Just to warn you about some future problems you probably don’t expect, let me list them below:

  1. This goes against Google’s TOS. You can end up in a legally shaky position if you do it. I’m not saying Google will actually come after you, but you have to be aware that, from Google’s point of view, this is not valid usage of the service.

  2. They don’t just declare this a violation and leave it at that; they actively detect and block you from scraping search results. Google builds profiles of browsers/users with cookies and also IP addresses, records these, does some matching behind the scenes, and figures out that a suspicious browser is coming from a suspicious IP address. Then it does several things: first, it presents you with a captcha to make sure “you are not a robot”. It can also block you even if you keep solving the captcha, effectively cutting the IP address off from Google’s services. You will have to rotate IP addresses on a regular basis and build a “human-like” browsing pattern to avoid detection. You probably want to slow things down (a tiny sketch of that follows at the end of this post), visit other sites too, make your profile look like a real user, etc.

This is still doable, especially if you want to do it at small scale for your own usage, but at large scale it becomes expensive and risky, and you may not want to do it.
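On the “slow things down” point, here is a trivially simple way to space requests out in Elixir; the delay bounds are arbitrary and purely illustrative:

```elixir
defmodule Throttle do
  # Sleep for a random interval before each request so calls don't arrive
  # in a tight, machine-like burst. Bounds are arbitrary example values.
  def paced_get(url, headers, min_ms \\ 2_000, max_ms \\ 10_000) do
    Process.sleep(min_ms + :rand.uniform(max_ms - min_ms))
    HTTPoison.get(url, headers)
  end
end
```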


As @hubertlepicki has said, you have to mimic user behaviour to make it work. HTTPoison is not a good option for this job; tools like Nightmare or Puppeteer are better choices.

You have to keep in mind things like the browser’s User-Agent, user typing speed, captchas, continuous layout changes, privacy popups, etc.

Anyway, you haven’t said much about what you are trying to do. Depending on the volume of requests, it can be relatively easy, or hard and expensive.

I feared this would be the case, since they don’t offer a public search API. I’ll keep this in mind.

Gotcha. I was able to make the Google request work using Postman, and I noticed that the request generated a cookie and a token. I wasn’t sure how I could create that token and cookie myself.