Scraping Bing SERP

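Bing is no exception when it comes to scraping: fetch the results page through a proxy, then pull the result URLs and anchor texts out of the HTML with a regular expression.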

$result = getPage(
    '[proxy IP]:[port]', // get a proxy from somewhere
    'http://www.bing.com/search?q=twitter',
    'http://www.bing.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    5);
 
if (empty($result['ERR'])) {
 
    preg_match_all(
        '(<div class="sb_tlst">.*<h3>.*<a href="(.*)".*>(.*)</a>.*</h3>.*</div>)siU',
        $result['EXE'], $matches);
 
    for ($i = 0; $i < count($matches[2]); $i++) {
        $matches[2][$i] = strip_tags($matches[2][$i]);
    }
 
    // Job's done!
    // $matches[1] array contains all URLs, and
    // $matches[2] array contains all anchors
    // ...
} else {
    // WTF? Problems?
    // ...
}

Grab the getPage() function from the post "Scraping websites with PHP cURL under proxy".
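
To try the extraction step without hitting Bing at all, the regex can be wrapped in a small function and run against a hand-written snippet. This is only a sketch: extractBingResults() is a hypothetical helper name, and the sample HTML is made up to match the sb_tlst markup the regex above expects, so it may not reflect what Bing serves today.

```php
<?php
// Hypothetical helper: wraps the regex from the snippet above so it can be
// tested offline. Returns a list of ['url' => ..., 'anchor' => ...] pairs.
function extractBingResults(string $html): array
{
    // Same pattern as above; "(" and ")" act as the regex delimiters.
    $pattern = '(<div class="sb_tlst">.*<h3>.*<a href="(.*)".*>(.*)</a>.*</h3>.*</div>)siU';

    $results = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $results[] = [
                'url'    => $m[1],
                // Anchors may contain markup such as <strong>, so strip it.
                'anchor' => trim(strip_tags($m[2])),
            ];
        }
    }
    return $results;
}

// Made-up sample markup (an assumption, not captured from Bing).
$sample = '<div class="sb_tlst"><h3><a href="http://twitter.com/" onmousedown="x">'
        . '<strong>Twitter</strong></a></h3></div>'
        . '<div class="sb_tlst"><h3><a href="http://en.wikipedia.org/wiki/Twitter">'
        . 'Twitter - Wikipedia</a></h3></div>';

print_r(extractBingResults($sample));
```

With real data you would pass $result['EXE'] instead of $sample. If the function returns an empty array on a page that clearly has results, the markup has probably changed and the pattern needs updating.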

21 thoughts on “Scraping Bing SERP”

  1. Hi, Thanks for the tutorial code. If I want to put the call in a loop to scrape more than the 1st page results, is there a way of throttling the calls to appear more natural?

  2. Hi, Has bing.com changed the HTML of their results? The regex does not seem to be working anymore.

  3. Hi seozero, Have you had any luck scraping eBay search results? Since they removed their XML feeds I’ve been looking for a way to scrape search results, especially for sold items to see trends etc. Can you work your scraping magic on those? Thanks!

  4. Hi seozero, Thanks for your reply.

    I’m part of the eBay developer network and use their sales API quite a bit.

    The problem is the market API is not free, see http://developer.ebay.com/programs/marketdata/ and the free version you mentioned above only returns a summary.

    Therefore I thought I’d just hit the eBay listings themselves!

  5. I'll share mine with you when I'm done with it 🙂

    When you get a chance, can you drop me an email? I'd like to show you something that you can use with your scraped content 🙂

  6. Has anyone managed to scrape the first 100 Bing results in a single page:
    http://www.bing.com/search?q=twitter&count=100

    I get the following in return:
    HTTP/1.1 200 OK
    Cache-Control: no-cache
    Date: Tue, 16 Mar 2010 09:50:41 GMT
    Content-Length: 0
    Connection: keep-alive
    Set-Cookie: OVR=flt=0&flt2=0&DomainVertical=0&Cashback=cbtest4&MSCorp=kievfinal&GeoPerf=0&Release=osf1; domain=.bing.com; path=/

    If you have figured it out, please send an email to tonixx AT gmail.com

    Thx!

  7. Thanks for this. I’ve modified it a bit to allow for searching through pages.

    $bing_url = 'http://www.bing.com/search?q=' . urlencode($keyword) . '&first=';

    for ($page = 0; $page < 9; $page++) {
        // Grab website
        $curl = curl_init();
        if ($page == 0) {
            curl_setopt($curl, CURLOPT_URL, $bing_url . '1');
        } else {
            curl_setopt($curl, CURLOPT_URL, $bing_url . $page . 1);
        }
        // ...
    }

    I didn't use the getPage() function, but building $bing_url and passing either $bing_url . '1' (first=1) or $bing_url . $page . 1 (first=11, 21, etc.) allows you to parse the following pages.
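
The offset scheme from the last comment (first=1, 11, 21, ...) can be sketched as a small URL builder. bingPageUrls() is a hypothetical name, and the default of 10 results per page is an assumption taken from the offsets in that comment, not something Bing guarantees.

```php
<?php
// Hypothetical helper: builds one search URL per results page using the
// "first" offset parameter described in the comment above.
function bingPageUrls(string $keyword, int $pages, int $perPage = 10): array
{
    $base = 'http://www.bing.com/search?q=' . urlencode($keyword) . '&first=';

    $urls = [];
    for ($page = 0; $page < $pages; $page++) {
        // Page 0 starts at result 1, page 1 at result 11, page 2 at 21, ...
        $urls[] = $base . ($page * $perPage + 1);
    }
    return $urls;
}

print_r(bingPageUrls('twitter', 3));
```

Each URL can then be handed to getPage() in a loop. Putting something like sleep(rand(2, 6)) between requests is one simple way to space the calls out; the delay range here is an arbitrary choice.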
