Scraping Yahoo SERP

A Yahoo SERP scraper is a little harder to implement than a Google SERP scraper. The Yahoo guys are mad about redirects (former blackhats?): result links are wrapped in redirect URLs, so you have to clean them up after scraping. But nothing can stop you from scraping 😉

Scraper code example

First time here? Read about scraping websites with PHP cURL under proxy first. You will find the getPage source code there.

<?php
$result = getPage(
    '[proxy IP]:[port]', // get a proxy from somewhere
    'http://search.yahoo.com/search?p=apple',
    'http://www.yahoo.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    5);
 
if (empty($result['ERR'])) {
    preg_match_all('(<h3><a class.*href="(.*)".*>(.*)</a>)siU',
        $result['EXE'], $matches);
 
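    foreach ($matches[1] as $i => $url) {
        // decode url
        $url = urldecode($url);
        // get rid of rds.yahoo.com redirect: the real target follows "**".
        // If there is no "**http..." part, keep the decoded URL as it is
        // instead of reading a missing match (which raised an error before).
        if (preg_match('/\*\*(https?:\/\/.*)$/si', $url, $target)) {
            $url = $target[1];
        }
        $matches[1][$i] = $url;
    }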
 
    // strip tags
    for ($i = 0; $i < count($matches[2]); $i++) {
        $matches[2][$i] = strip_tags($matches[2][$i]);
    }
 
    // Job’s done!
    // $matches[1] contains URLs 
    // $matches[2] contains anchors
    // …
} else {
    // Something went wrong... 
}
?>

P.S.: Some URLs can still be unreadable (http://rdre1.yahoo.com/click?u=http://feedpoint.net…). Don’t panic 🙂 There’s a workaround.
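
One possible way to handle both kinds of wrapped links is sketched below. This is not the workaround referred to above, only a minimal illustration: cleanYahooUrl is a hypothetical helper name, and the assumption that the click-tracker links carry the real target in their `u` query parameter comes solely from the truncated example URL shown here.

```php
<?php
// Hypothetical helper (not part of the original scraper): unwrap a Yahoo
// redirect URL and return the target it points to.
function cleanYahooUrl(string $url): string
{
    // Case 1: rds.yahoo.com style. After decoding, the real target
    // follows a "**" marker at the end of the redirect URL.
    $decoded = urldecode($url);
    if (preg_match('/\*\*(https?:\/\/.*)$/si', $decoded, $m)) {
        return $m[1];
    }

    // Case 2 (assumption): click-tracker style, where the target is
    // passed in the "u" query parameter. parse_str() decodes the value.
    $query = parse_url($url, PHP_URL_QUERY);
    if (is_string($query)) {
        parse_str($query, $params);
        if (!empty($params['u']) && is_string($params['u'])) {
            return $params['u'];
        }
    }

    // Nothing recognisable to unwrap: hand the URL back unchanged.
    return $url;
}
```

Inside the scraper loop, `$matches[1][$i] = cleanYahooUrl($matches[1][$i]);` could then stand in for the separate decode and strip steps. Links that match neither pattern pass through untouched, so they can still be inspected by hand.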


Take care

3 thoughts on “Scraping Yahoo SERP”

  1. I'm starting to like your site; nice free code you have.

    Public proxies will make the process really slow; you need private ones if you want good results.
    Also, Google and Yahoo tend to detect shared proxies quickly and ban them from scraping.

    If you have about 30 IPs, you can scrape all day long without a ban.

    Just my few cents.

  2. What is the workaround for the unreadable backlinks?

    I get an error on the line that removes the redirects. Any possible solution?
