22Apr/0919
Scraping Google SERP
Google SERP scraping solves many SEO problems. For example, you can monitor website rankings and scrape content from top-ranking websites. SERP scraping is just a part of SEO life ;} So, let's scrape, my friends!
Simple Google SERP scraper
This nice little PHP script uses the getPage function that you can find here.
<?php
$result = getPage(
    '[proxy IP]:[port]',
    'http://www.google.com/search?q=twitter',
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    5);

if (empty($result['ERR'])) {
    // TODO: check there is no captcha
    // preg_match("/sorry.google.com/", $result['EXE']);

    preg_match_all("(<h3 class=r><a href=\"(.*)\".*>(.*)</a></h3>)siU", $result['EXE'], $matches);

    for ($i = 0; $i < count($matches[2]); $i++) {
        $matches[2][$i] = strip_tags($matches[2][$i]);
    }

    // Job's done!
    // $matches[1] array contains all URLs, and
    // $matches[2] array contains all anchors
    // …
} else {
    // WTF? Problems?
    // ...
}
?>
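The TODO in the script leaves the captcha check open. Here is a minimal sketch of one way to do it, assuming (as the commented-out preg_match hints) that Google's block page mentions sorry.google.com somewhere in the returned HTML. The helper name looksLikeCaptcha is made up for this example and is not part of getPage.

```php
<?php
// Hypothetical helper: returns true when the fetched HTML appears to be
// Google's block/captcha page. Assumption: that page references
// "sorry.google.com", as the commented-out preg_match in the script suggests.
function looksLikeCaptcha(string $html): bool
{
    // A case-insensitive substring test is enough here, and it avoids the
    // unescaped dots of a regex like /sorry.google.com/.
    return stripos($html, 'sorry.google.com') !== false;
}
```

In the scraper you would call it on $result['EXE'] before running preg_match_all, and switch proxy when it returns true.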
5 scraper improvement tips
- Use as many proxies as you can, because Google doesn't like scrapers. You're stupid if you send a hundred requests from one IP address. Make a list of proxies and pick a random one each time you scrape Google.
- Use anonymous proxies. No need to explain why.
- Get keywords from a database or file and build the URL on the fly. The urlencode function will help you.
- Be natural, give Google a rest. Use sleep and rand functions, something like sleep(rand($x, $y)).
- Use multi cURL if you like, but use it wisely.
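The first, third and fourth tips can be sketched in a few lines. This is only an illustration: the function names (pickProxy, buildSearchUrl, politePause) are invented for this example, and any proxy addresses you pass in are up to you.

```php
<?php
// Tip 1: pick a random proxy from a list for each request.
// Assumes $proxies is a non-empty array of "ip:port" strings.
function pickProxy(array $proxies): string
{
    return $proxies[array_rand($proxies)];
}

// Tip 3: build the search URL from a keyword on the fly.
// urlencode() escapes spaces and other special characters.
function buildSearchUrl(string $keyword): string
{
    return 'http://www.google.com/search?q=' . urlencode($keyword);
}

// Tip 4: sleep for a random number of seconds between $min and $max,
// and return how long the pause was.
function politePause(int $min, int $max): int
{
    $seconds = rand($min, $max);
    sleep($seconds);
    return $seconds;
}
```

A loop over your keywords would then call pickProxy() and buildSearchUrl() for each one, pass both to getPage, and finish each round with politePause($x, $y).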
See ya!
May 9th, 2009 - 00:35
Thanks – I found this really useful – I’ve been looking around for days for a decent solution for SERP Scraping, and I’ve seen software that costs over $100 that does the same exact thing that your code does – thanks a TON – this code has given me all sorts of evil ideas LOL
July 12th, 2009 - 12:25
Hi! I find your script to be VERY useful and best of all it's free
but I can’t seem to make it work. It only shows a blank page. (I’ve already placed the code for the getpage function.)
please help me.
July 12th, 2009 - 19:46
Jan, copy print_r($result); after $result = getPage… See what data you get.
And it’s always a good idea to try another proxy.
The code works, trust me ;] Have a nice day
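Note that the script as posted never echoes anything, so a blank page is what you get unless you print $result or $matches yourself. A small sketch of the print_r advice follows. The $result array is a stand-in so the snippet runs without a proxy, assuming only the 'ERR' and 'EXE' keys the scraper reads; the error text is an example value.

```php
<?php
// Stand-in for what getPage() returns, so this runs on its own.
// Assumption: the real array has at least the 'ERR' and 'EXE' keys.
$result = [
    'ERR' => "couldn't connect to host", // example error text
    'EXE' => '',                         // page HTML would be here
];

// print_r() with true as second argument returns the dump as a string
// instead of printing it, which is handy for logging.
$dump = print_r($result, true);
echo $dump;
```

If 'ERR' is not empty, the proxy is the usual suspect. If 'EXE' holds HTML but $matches stays empty, the page markup probably does not match the pattern.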
August 13th, 2009 - 10:26
Just as info, you do not need sleeps or similar.
The trick is to change the IP for each new search term. You can browse all result pages for a keyword (max 10 pages/1000 hits in Google) without changing IP or risking a ban.
But as soon as you change the keyword you need to change your IP as well.
If you don’t do that, Google will add a captcha before you can continue.
If you continue, Google will temporarily block the IP for a few hours.
If you continue again, you can end up with a permanent block, and that’s something you want to avoid at all costs.
greets
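The rule from the comment above (one IP per keyword, reused for all result pages of that keyword) could be sketched like this. It only illustrates the commenter's suggestion and says nothing about how Google actually behaves; assignProxies is a made-up name.

```php
<?php
// Give every keyword its own proxy and keep that proxy for all result
// pages of the keyword. Assumes $proxies is non-empty.
function assignProxies(array $keywords, array $proxies): array
{
    $proxies = array_values($proxies); // make sure indexes are 0..n-1
    $plan = [];
    foreach (array_values($keywords) as $i => $keyword) {
        // Round-robin through the list, so consecutive keywords get
        // different proxies as long as there are at least two of them.
        $plan[$keyword] = $proxies[$i % count($proxies)];
    }
    return $plan;
}
```

With fewer proxies than keywords the list wraps around, so an IP comes back after a while; whether that gap is long enough is something you have to test yourself.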
August 13th, 2009 - 18:57
Valuable info, thanks a lot!
August 18th, 2009 - 04:10
I feel lame for asking… but do you have any recommendations as to where to get proxies? All the ones I got with a Google search come back with ‘couldn’t connect to host’ (and I’ve tried about 50 so far).
August 18th, 2009 - 23:52
Hey Brad!
You probably tried to use “public proxies”.
Those are a waste of time.
Most are incredibly slow, overused and short-lived.
If you want good results and a reliable system you’ll have to invest a few bucks.
I’ve been using Cloakfish (www.cloakfish.com) for a while.
Cloakfish is cheap; you can scrape a lot for little money.
The downside is lower performance.
Also check the site I linked, seo-proxies.com (www.seo-proxies.com). They specialize in scraping and similar jobs and offer PHP scripts that do most of the work (you only have to add the code from this blog and you’re done).
hope that helps
November 20th, 2009 - 05:55
First:
Thanks for the article, it’s interesting.
I found another one that specializes in Google scraping:
http://google-scraper.squabbel.com
If you don’t allow other URLs here just remove it please.
But that one goes into much more detail and it has a much more advanced PHP project included.
It can filter advertisements and it scrapes all Google pages. (It mainly aims at large-scale scraping.)
Well thanks again for your nice site
December 15th, 2009 - 16:43
In your preg_match_all, could you explain what the siU bit at the end does?
December 16th, 2009 - 09:50
anon, plz look here
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
Try to experiment if you can’t get it. Remove/add modifiers one by one and see what happens.
HTH.
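For the impatient, here is a small experiment along those lines. It runs the scraper's pattern on a made-up HTML sample and drops one modifier at a time. The sample and variable names are invented for the demo, and ~ is used as the delimiter instead of the parentheses in the script, which does not change the pattern itself.

```php
<?php
// Two fake results: the first has an uppercase tag and a newline in the anchor.
$html = "<H3 class=r><a href=\"http://a.example/\">First\nresult</a></h3>"
      . "<h3 class=r><a href=\"http://b.example/\">Second</a></h3>";

$pattern = '~<h3 class=r><a href="(.*)".*>(.*)</a></h3>~';

// s: dot also matches newlines, i: case-insensitive, U: .* becomes lazy.
$all = preg_match_all($pattern . 'siU', $html, $m);

// Without U the first (.*) is greedy and swallows both results at once.
$greedy = preg_match_all($pattern . 'si', $html, $g);

// Without s the newline inside the first anchor breaks that match.
$noDotall = preg_match_all($pattern . 'iU', $html, $d);

// Without i the uppercase <H3 no longer matches.
$noCaseless = preg_match_all($pattern . 'sU', $html, $c);

echo "siU: $all, si: $greedy, iU: $noDotall, sU: $noCaseless\n";
```

Only the full siU combination finds both results with clean URLs in $m[1].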
February 9th, 2010 - 10:56
Search Engine Optimization is a passion. You got it right in this article
May 12th, 2010 - 23:25
Useful content. I created a similar program in VB.NET that scrapes different search engines, and Google blocked me before I read your post. I have now implemented IP changing with a list of proxy addresses, but I have yet to test it since I put the Google searches into sleep mode. My question: since I run large keyword lists and subsequently create 1000s of URLs, how long should I let my threads sleep between Google fetches, given that I change IPs but still want to be careful?
May 12th, 2010 - 23:26
oops sorry forgot to include my next question:
I am limited to 1000 results on each keyword. How do you suggest getting more results for some searches like blogs, etc.?
May 14th, 2010 - 21:58
Joe,
Rotate proxies, make sure they are anonymous and use multi curl.
Do you really need more than 1K results/keyword?
Yes, scraping blog search is a good idea.
May 19th, 2010 - 18:41
Thanks seozero. What exactly do u mean by multi curl?
And what sort of performance is considered efficient? Right now my app is very slow, and it has a lot to do with the number of query URLs I create: the user can select AND/OR/DATE/LANGUAGE etc., and building the combination queries was hard, so I just created a query for each option. I would also like to hear input on using threads to speed up the process.
May 20th, 2010 - 08:53
“What exactly do u mean by multi curl?”
I mean http://www.php.net/manual/en/function.curl-multi-init.php
Not sure if it’s useful for you, because you use vb. I can post smth on php if you want.
“what sort of performance is considered to be efficient” -totally up to you.
“very slow” – Slow how? I guess your goal is to get data (and save it?).
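For PHP readers, a rough sketch of what multi cURL looks like is below. It is not the getPage function from this blog: fetchAll is a hypothetical helper that downloads several URLs in parallel with the curl_multi_* functions and returns the bodies. Proxy and user-agent options are left out to keep it short.

```php
<?php
// Hypothetical helper: fetch several URLs in parallel.
// Returns an array of response bodies with the same keys as $urls.
function fetchAll(array $urls, int $timeout = 10): array
{
    if ($urls === []) {
        return []; // nothing to do
    }

    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,     // return body instead of printing it
            CURLOPT_TIMEOUT        => $timeout, // seconds per transfer
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 1.0); // wait for activity, up to one second
        }
    } while ($running && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $ch) {
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        // On PHP versions before 8.0, also call curl_close($ch) here.
    }
    return $bodies;
}
```

Parallel requests from a single IP are exactly what gets you blocked, so this only makes sense together with proxy rotation.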
May 19th, 2010 - 23:49
Thanks for the script. I’m getting the blank page too. What do you mean by “copy print_r($result); after $result = getPage… See what data you get.”?
July 7th, 2010 - 16:04
hi
do you know if there is much difference in quantity or quality of the data returned via screen scraping versus data received from the google search api?
July 9th, 2010 - 14:27
Don’t know, but I vote for API