Scraping Google SERP

Google SERP scraping solves many SEO problems. For example, you can monitor website rankings and scrape content from top-ranking websites. SERP scraping is just a part of SEO life ;} So, let’s scrape, my friends!

Simple Google SERP scraper

This nice little PHP script uses the getPage() function, which you can find here.
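
In case that link goes stale: here is a minimal sketch of what a cURL-based getPage() might look like. The parameter order and the ['EXE'] / ['ERR'] / ['INF'] return keys are assumptions inferred from the calling code below, not the original implementation.

<?php
// Minimal sketch of a cURL-based getPage(); parameter meanings and the
// EXE/ERR/INF return keys are assumed from how the scraper below uses them
function getPage($proxy, $url, $referer, $agent, $header, $timeout) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);        // '[IP]:[port]'
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_HEADER, $header);      // include headers in output?
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);    // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    // follow redirects

    $result = array();
    $result['EXE'] = curl_exec($ch);    // the page HTML
    $result['INF'] = curl_getinfo($ch); // transfer info (http_code, times, ...)
    $result['ERR'] = curl_error($ch);   // empty string on success

    curl_close($ch);
    return $result;
}
?>

And now the scraper itself: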

<?php
$url = 'http://www.google.com/search?hl=en&as_q=buy+viagra&as_epq=&as_oq=&as_eq=&lr=&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images&num=100';
$result = getPage(
    '[proxy IP]:[port]',       // proxy to route the request through
    $url,
    'http://www.google.com/',  // referer
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8', // user agent
    1,                         // header flag (see the getPage() sketch above)
    5);                        // timeout in seconds
 
if (empty($result['ERR'])) {
 
    // TODO: check that Google didn't serve a captcha page
    // (suspected bots get redirected to sorry.google.com):
    // preg_match("/sorry.google.com/", $result['EXE']);
 
    // Pull the URL (capture 1) and anchor text (capture 2) out of each
    // <h3 class="r"><a href="...">...</a></h3> result block
    preg_match_all('@<h3\s*class="r">\s*<a[^<>]*href="([^<>]*)"[^<>]*>(.*)</a>\s*</h3>@siU',
        $result['EXE'], $matches);
 
    // Anchor text can contain markup (e.g. <em> around matched terms), so strip it
    for ($i = 0; $i < count($matches[2]); $i++) {
        $matches[2][$i] = strip_tags($matches[2][$i]);
    }
 
    // Job’s done!
    // $matches[1] array contains all URLs, and 
    // $matches[2] array contains all anchors
    // …
} else {
    // WTF? Problems? Check $result['ERR'] and try another proxy.
    // ...
}
?>

5 scraper improvement tips

  1. Use as many proxies as you can, because Google doesn’t like scrapers. Sending a hundred requests from one IP address is asking for a ban. Make a list of proxies and pick a random one each time you scrape Google, as sketched below.
  2. Use anonymous proxies. No need to explain why.
  3. Get keywords from a database or file and build the search URL on the fly. The urlencode() function will help you.
  4. Be natural, give Google a rest. Use the sleep() and rand() functions, something like sleep(rand($x, $y)).
  5. Use multi cURL if you like, but use it wisely.
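
Here is a rough sketch of tips 1, 3 and 4 combined. The $proxies list and the keywords.txt file are made-up examples; getPage() is the same helper as above.

<?php
// Hypothetical illustration of tips 1, 3 and 4 (proxy list and keyword
// file are made-up names)
$proxies  = array('1.2.3.4:8080', '5.6.7.8:3128');       // tip 1: proxy pool
$keywords = file('keywords.txt', FILE_IGNORE_NEW_LINES); // tip 3: keyword source
$agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8';

foreach ($keywords as $keyword) {
    // Tip 3: build the search URL on the fly
    $url = 'http://www.google.com/search?hl=en&num=100&q=' . urlencode($keyword);

    // Tip 1: pick a random proxy for every request
    $proxy = $proxies[array_rand($proxies)];

    $result = getPage($proxy, $url, 'http://www.google.com/', $agent, 1, 5);
    // ... parse $result['EXE'] as shown above ...

    // Tip 4: be natural, give Google a rest
    sleep(rand(10, 30));
}
?>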

See ya!

40 Responses to Scraping Google SERP

  1. Thanks – I found this really useful – I’ve been looking around for days for a decent solution for SERP Scraping, and I’ve seen software that costs over $100 that does the same exact thing that your code does – thanks a TON – this code has given me all sorts of evil ideas LOL

  2. Hi! I find your script VERY useful, and best of all it’s free 🙂

    but I can’t seem to make it work. It only shows a blank page. (I’ve already included the code for the getPage function.)

    Please help me.

  3. Jan, add print_r($result); right after the $result = getPage(...) call and see what data you get.

    And it’s always a good idea to try another proxy.

    The code works, trust me ;] Have a nice day

  4. Just as info, you do not need sleeps or anything similar.
    The trick is to change the IP for each new search term: you can browse all the result pages of one keyword (max 10 pages / 1,000 hits in Google) without changing IP and risking a ban.
    But as soon as you change the keyword, you need to change your IP as well.
    If you don’t do that, Google will show a captcha before you can continue.
    If you continue anyway, Google will temporarily block the IP for a few hours.
    If you keep going after that, you can end up with a permanent block, and that’s something you want to avoid at all costs.
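
    A rough sketch of that scheme (the $keywords, $proxies and $agent variables are made-up; getPage() is from the post above):

    <?php
    // One fresh proxy/IP per keyword, reused across that keyword's result pages
    foreach ($keywords as $keyword) {
        $proxy = $proxies[array_rand($proxies)]; // new IP for the new term
        for ($start = 0; $start < 1000; $start += 100) { // result pages 1..10
            $url = 'http://www.google.com/search?hl=en&num=100&start=' . $start
                 . '&q=' . urlencode($keyword);
            $result = getPage($proxy, $url, 'http://www.google.com/', $agent, 1, 5);
            // ... parse $result['EXE'] ...
        }
    }
    ?>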

    greets

  5. Valuable info, thanks a lot!

  6. I feel lame for asking… but do you have any recommendations as to where to get proxies? All the ones I found via a Google search come back with ‘couldn’t connect to host’ (and I’ve tried about 50 so far).

  7. Hey Brad!
    You probably tried to use “public proxies”.
    Those are a waste of time.
    Most are incredibly slow, overused and short-lived.

    If you want good results and a reliable system, you’ll have to invest a few bucks.
    I’ve been using Cloakfish (www.cloakfish.com) for a while.
    Cloakfish is cheap, so you can scrape a lot for small bucks.
    The downside is lower performance.

    Also check the site I linked: that’s seo-proxies.com (www.seo-proxies.com). They specialize in scraping and similar jobs and offer PHP scripts that do most of the work (you only have to add the code from this blog and you’re done).

    hope that helps

  8. First:
    Thanks for the article, it’s interesting.

    I found another one specializing in Google scraping:
    http://google-scraper.squabbel.com

    If you don’t allow other URLs here just remove it please.
    But that one goes into much more detail, and it includes a much more advanced PHP project.

    It can filter advertisements and it scrapes all Google result pages. (It mainly aims at large-scale scraping.)

    Well thanks again for your nice site 😉

  9. In your preg_match_all, could you explain what the siU bit at the end does?

  10. anon, plz look here:
    http://php.net/manual/en/reference.pcre.pattern.modifiers.php

    In short: s lets the dot match newlines, i makes the pattern case-insensitive, and U makes quantifiers ungreedy by default.

    Try to experiment if you can’t get it. Remove/add the modifiers one by one and see what happens.

    HTH.

  11. Useful content. I created a similar program in VB.NET that scrapes different search engines, and Google had blocked me prior to reading your post. I have now implemented changing my IP and using a list of proxy addresses, but I have yet to test it since I put the Google searches into sleep mode. My question is: since I run large keyword lists and subsequently create thousands of URLs, how long would be appropriate to let my threads sleep between Google fetches, given that I change the IPs but still want to be careful?

  12. oops, sorry, forgot to include my next question:
    I am limited to 1,000 results for each keyword; how do you suggest getting more results for some searches, like blogs, etc.?

  13. Joe,

    Rotate proxies, make sure they are anonymous, and use multi cURL.

    Do you really need more than 1K results per keyword?

    Yes, scraping blog search is a good idea.

    • Thanks seozero. What exactly do you mean by multi cURL?
      And what sort of performance is considered efficient? Right now my app is very slow, and that has a lot to do with the number of query URLs I create: the user can select AND/OR/DATE/LANGUAGE etc., and since creating the combination queries was hard, I just created a query for each option. I would also like to hear input on using threads to speed up the process.
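
    • Multi cURL means PHP’s curl_multi_* functions, which run several requests in parallel instead of one after another. A minimal sketch (the $urls list is a made-up example; per-handle proxy and user-agent options are left out):

      <?php
      // Fetch several URLs in parallel with curl_multi
      $urls = array('http://example.com/a', 'http://example.com/b');
      $mh = curl_multi_init();
      $handles = array();

      foreach ($urls as $url) {
          $ch = curl_init($url);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // keep the body in memory
          curl_multi_add_handle($mh, $ch);
          $handles[] = $ch;
      }

      // Drive all transfers until every one has finished
      do {
          curl_multi_exec($mh, $running);
          curl_multi_select($mh); // wait for activity instead of busy-looping
      } while ($running > 0);

      foreach ($handles as $ch) {
          $html = curl_multi_getcontent($ch); // response body
          // ... parse $html as in the script above ...
          curl_multi_remove_handle($mh, $ch);
          curl_close($ch);
      }
      curl_multi_close($mh);
      ?>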

  14. Thanks for the script. I’m getting the blank page too. What do you mean by “add print_r($result); right after the $result = getPage(...) call and see what data you get”?

  15. hi

    do you know if there is much difference in quantity or quality of the data returned via screen scraping versus data received from the google search api?

  16. Hi,

    I’ve implemented this script and it seems to work, but I can’t determine my ranking position from the following array:

    [INF] => Array
    (
    [url] => http://www.google.com/search?q=Chanhassen Fitness Center
    [http_code] => 0
    [header_size] => 0
    [request_size] => 421
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.004516
    [namelookup_time] => 0.000102
    [connect_time] => 0.00022
    [pretransfer_time] => 0.004365
    [size_upload] => 0
    [size_download] => 36339
    [speed_download] => 8046722
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 0.00438
    [redirect_time] => 0
    )

    Am I missing something? I’m trying to determine what the Google page rank position is for the keyword phrase: Chanhassen Family Center. I have to provide reporting to my client Life Time Fitness and for each page I host for them, they simply want a # to represent the Google Page Rank.

    Maybe what I am asking for is not what the nature of this script is intended for, and in that case, does anyone have an idea of where I might find what I am looking for?

    Any and all help is greatly appreciated!

    Thanks in advance!

    Rick
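
    • Rick, the http_code of 0 in your [INF] dump suggests the request never completed normally, so check $result['ERR'] and try another proxy first. Once the scrape works, you can read the position out of the $matches[1] array from the script above; a rough sketch ('yourdomain.com' is a placeholder):

      <?php
      // Results arrive in ranking order, so the array index gives the position
      $position = 0; // 0 = not found in the scraped results
      foreach ($matches[1] as $i => $url) {
          if (strpos($url, 'yourdomain.com') !== false) {
              $position = $i + 1;
              break;
          }
      }
      ?>

      Note that this is the SERP position for a keyword, not Google PageRank; those are different things.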

  17. Is it possible to use this script without a proxy?

  18. Works!

    It takes like 10 seconds before it returns anything! Isn’t that considered a long time?

  19. I’m only able to get 10 results at a time, and not 100 even if I set &num=100 in the URL.

    Any ideas?

    • The params have changed. I have to update this post.

      • Can you explain a bit more so that we might try to make the changes to the script ourselves? When I view the results, it seems that looking for the URL following the h3 class should be enough, but it isn’t.

        Also, it seems the DOM would be a more efficient, though longer, way of doing it, but I kinda like how there is just one line of regex that does all the magic. That said, these long regex expressions can be a little cryptic, though still manageable, more or less.

        • I updated the URL and regex. It should scrape the “buy viagra” top 100 ;}

          There are many methods available. Choose one you like and improve it. Perhaps URL + regex is exactly what you need at the moment, but later you can move on to Python or Java + threads.

        • Forgot to tell you that Google can return different markup. Run a few tests with various proxies and user agents.

  20. I use Google Analytics to monitor my website. I think it’s the best free service from Google’s side. You can see every single stat with it. Your post is really informative, thanks a lot for sharing with all.

  21. We use both Google Analytics and Fengui.com to track both the traffic and the users on the webpage, combined with A/B testing to boost our ROI by

  22. ZoomRank works for us – accurate and very reliable!

  23. If you don’t like using regular expressions to get at the results, you can work with the DOM in a CSS-selector manner by using the ‘PHP Simple HTML DOM Parser’:

    http://simplehtmldom.sourceforge.net/
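
    For example, a rough sketch of the same extraction with that parser (assuming the h3.r markup targeted by the regex above, and that simple_html_dom.php sits next to the script):

    <?php
    // Sketch: CSS-style extraction with PHP Simple HTML DOM Parser
    include('simple_html_dom.php');

    $html = str_get_html($result['EXE']); // parse the fetched SERP HTML

    foreach ($html->find('h3.r a') as $link) {
        $url    = $link->href;      // result URL
        $anchor = $link->plaintext; // anchor text, tags already stripped
        // ...
    }
    ?>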

  24. But a better idea would be to scrape scroogle.org. They don’t change the HTML, and I bet you don’t need proxies to scrape them.

    But I still have no idea what scraping Google is for. Is it for mapping keyword -> sites and deciding where to put the effort to get a linkback? Is that it?

    Or for the 301 redirection trick?

    Please tell me why… I’m burning my frontal lobe here.

  25. I need to scrape only 50 results. Can anyone help me? I don’t want to fetch the whole result set; it’s taking too long to scrape…

  26. I’ll get straight to the point. I’m looking for someone who can make my website rank on the first page of Google SERPs using my keywords.
    I’ve tried whitehat SEO, only to find it’s a big rip-off, or too slow.
    The major deciding factor in finding someone like you is that I’m currently running a large-scale Google and Facebook PPC campaign that is being maliciously attacked by one or maybe two of my competitors; this has been proven by my site analytics, with thousands of IP addresses and host names. More than likely done through a proxy server?

    I’ve decided that if you can’t beat them, join them. Now, I don’t want to attack their campaigns.

    What I do want is to find someone who can use similar techniques or better to get my site onto the first page of Google SERPs using my product-specific keywords and phrases.
    Two of my keywords have more than 16,000,000 searches a month: “Interview” & “Job Search”, as well as “resume”.
    I’m not technical, so I would need someone who knows and understands how to do this.
    If you can or want to do this, I will pay cash$$ for results. Time is of the essence, as I’m launching a TV commercial in the next couple of days and have other great branding campaigns happening. I’m in Los Angeles, California, and you may contact me at 1-818-530-3259 anytime, or send me your phone number and I’ll call you.

    Hopefully we can establish a mutually rewarding business relationship.

    Regards,
    Ben Best
    818-530-3259

  27. If you’re not free to do it, can you refer someone who can?

    • Ben,

      What you are looking for is a link builder. There are many out there who can do this for you. We are one of them. Send me an email at bjorn@devenia.com if you want to know more. On our contact page you will find our phone numbers if you prefer to call. Pick the one closest to you.

  28. For scraping Google a lot of additional things are required; maybe you’d like to include these PHP projects: google-scraper.squabbel.com (in-depth know-how) and http://google-rank-checker.squabbel.com (contains a huge PHP project for scraping Google).

    Your site and articles are all really well written; I hope you find my information useful as well.

  29. Wow! Very useful tip and script. I will try to use it now. Thanks for this.

  30. We can no longer get the SERP HTML from http://www.google.com/search?q=term because Google returns a 302 Moved redirect page. Any idea how to get around this?

    To illustrate, hit the URL from your example with curl in a terminal.
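
    • One thing to try: make sure your getPage() sets CURLOPT_FOLLOWLOCATION so cURL follows the redirect, then check where you landed; if it is sorry.google.com, that IP is flagged and you need a different proxy. A rough sketch:

      curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);                // follow the 302
      $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);         // final URL after redirects
      $flagged = (strpos($final, 'sorry.google.com') !== false);  // captcha page?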
