How to build scrapers that matter. Code reuse.

Party time! Some scraper building tips for you guys. Hope you can benefit from them and save some time for wild drunk orgies. OK-OK, for families and friends.

Scrapers are common. Grab a website or page, extract the data, save it. See the logical parts? Good. You got the idea.

Building a SERP proscraper

Pro: How many search engines do you know?
Noob: About 20.
Pro: How many of them do you scrape?
Noob: Five. Google, Yahoo, Bing, Ask and Cuil.
Pro: And how many scrapers do you have?
Noob: Five. [It’s a pain in the ass to build them.]
Pro: Five?! I have only one server-side scraper for all engines. And I can extend the scraper in no time.
Noob: Teach me! Do you have an ebook? [I’ll be your pet. I’ll suck your balls… Why not?]
Pro: [F*ck off!] Learn design patterns and quit reading gay forums.
Noob: I’ll do what you say. I want to be a blackhat.
Pro: [Hate noobs…] Read carefully. You’ll be a SERP scraper hero soon.

A closer look at SERP scrapers

Haven’t opened the Noob’s links yet? Do it now. Can you identify the common parts and the parts that vary? The getPage part is common, but the extraction parts vary. In other words, there is a one-to-one relationship between a search engine and its parsing-extraction part, and a one-to-many relationship between the getPage part and the search engines. What does that mean? It means you should separate the getPage part from the extraction part.
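Here is a minimal sketch of that separation. The names scrapeSerp, $getPage and $extractCites are hypothetical and not part of the scraper built below, and the <cite> markup belongs to an imaginary engine. The point is only that the fetch step and the extraction step are handed in separately, so either one can change without touching the other.

```php
<?php
// Hypothetical helper: wires a shared fetch step to an engine-specific
// extraction step. None of these names come from the original scraper.
function scrapeSerp($url, callable $getPage, callable $extract) {
    $page = $getPage($url);      // common part: the same for every engine
    if ($page === false) {
        return array();          // fetch failed, nothing to extract
    }
    return $extract($page);      // varying part: one per engine
}

// Common fetcher. file_get_contents() can read URLs when allow_url_fopen is on.
$getPage = function ($url) {
    return @file_get_contents($url);
};

// Extractor for an imaginary engine that wraps result URLs in <cite> tags
// (an assumption made for illustration only).
$extractCites = function ($page) {
    preg_match_all('~<cite>(.*?)</cite>~s', $page, $matches);
    return $matches[1];
};
```

Calling scrapeSerp($url, $getPage, $extractCites) runs both steps. Swapping in another extractor is all it takes to support another engine.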

Coding SERP Parser Factory

The Factory pattern is the best choice for coding SERP parsers by all means. You pass the type of search engine to the factory, and based on that type the SERP parser factory creates the concrete SERP parser. When you need a new search engine scraper, you create a concrete SERP parser class and add two lines to the SERP parser factory. Checkmate!

Here are some code snippets for you. Now you have a head start on noobs.

iSerpParser interface (i_serp_parser.php)

interface iSerpParser {
    public function parse($serp);
}

Google SERP parser – example of concrete SERP parser (google_serp_parser.php)

include_once('i_serp_parser.php');
 
class GoogleSerpParser implements iSerpParser {
 
    public function parse($serp) {
       $serpData = array(); // extracted results are collected here
       // necessary code …
       return $serpData;
    }
 
    // other methods …
}

Yahoo SERP parser – example of concrete SERP parser (yahoo_serp_parser.php)

include_once('i_serp_parser.php');
 
class YahooSerpParser implements iSerpParser {
 
    public function parse($serp) {
       $serpData = array(); // extracted results are collected here
       // necessary code …
       return $serpData;
    }
 
    // other methods …
}
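What goes into parse() depends entirely on the engine's markup. As a rough sketch only, here is a parser for an imaginary engine whose results look like <div class="result"><a href="…">Title</a></div>. The class name ExampleSerpParser and that markup are assumptions. Real engines use different markup and change it often. The sketch relies on PHP's DOM extension (DOMDocument and DOMXPath).

```php
<?php
interface iSerpParser {
    public function parse($serp);
}

// Hypothetical parser for an imaginary engine (not Google or Yahoo markup).
class ExampleSerpParser implements iSerpParser {

    public function parse($serp) {
        $serpData = array();
        if ($serp === '') {
            return $serpData;   // loadHTML() rejects an empty string
        }

        // Real-world HTML is rarely valid, so collect libxml warnings
        // internally instead of printing them.
        libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($serp);
        libxml_clear_errors();

        // One row per result link: its URL and its anchor text.
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//div[@class="result"]/a') as $link) {
            $serpData[] = array(
                'url'   => $link->getAttribute('href'),
                'title' => trim($link->textContent),
            );
        }
        return $serpData;
    }
}
```

For a real engine, only the XPath expression and the fields you pull out should need to change.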

And finally, the SerpParserFactory (serp_parser_factory.php)

include_once('google_serp_parser.php');
include_once('yahoo_serp_parser.php');
// include concrete SERP parsers …
 
class SerpParserFactory {
    public static function createSerpParser($searchEngineName) {
        $serpParser = NULL;
 
        // strict comparison, so a non-string value can never match an engine name
        if ($searchEngineName === 'google') {
            $serpParser = new GoogleSerpParser();
        } elseif ($searchEngineName === 'yahoo') {
            $serpParser = new YahooSerpParser();
        }
        // create concrete SERP parser and assign to $serpParser …
        return $serpParser;
    }
}
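To make the "add two lines" claim concrete, here is a self-contained sketch of adding one more engine. BingSerpParser is a hypothetical class with its extraction code left out, and the Google parser is stubbed so the block runs on its own.

```php
<?php
interface iSerpParser {
    public function parse($serp);
}

// Stub standing in for the real Google parser.
class GoogleSerpParser implements iSerpParser {
    public function parse($serp) {
        return array();
    }
}

// Step 1: the new concrete parser (hypothetical, extraction omitted).
class BingSerpParser implements iSerpParser {
    public function parse($serp) {
        $serpData = array();
        // Bing-specific extraction would go here
        return $serpData;
    }
}

class SerpParserFactory {
    public static function createSerpParser($searchEngineName) {
        $serpParser = NULL;

        if ($searchEngineName === 'google') {
            $serpParser = new GoogleSerpParser();
        } elseif ($searchEngineName === 'bing') {   // Step 2: the two lines
            $serpParser = new BingSerpParser();     // added to the factory
        }
        return $serpParser;   // still NULL for an unknown engine
    }
}
```

In the multi-file layout used here you would also add an include_once line for the new parser file at the top of serp_parser_factory.php.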

The SERP proscraper coding is almost done. Find out which engine to use, get the page, parse it and save the data (serp_scraper.php)

include_once('serp_parser_factory.php');
 
// get engine name from the command line, database …
$seName = getSearchEngine();
 
// getPage…
$page = getPage();
 
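// create SERP parser (the factory returns NULL for an unknown engine)
$serpParser = SerpParserFactory::createSerpParser($seName);
if ($serpParser === NULL) {
    exit("Unsupported search engine: $seName\n");
}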
 
// parse SERP
$data = $serpParser->parse($page);
 
// save data …
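The save step is left open above. One possible way to fill it, purely as an assumption, is to dump the parsed rows to a JSON file. The function name saveSerpData and the choice of JSON are mine. A database or CSV file would plug in the same way.

```php
<?php
// Hypothetical save step: writes the parsed rows to a JSON file.
// Returns true on success, false if encoding or writing failed.
function saveSerpData(array $data, $path) {
    $json = json_encode($data);
    if ($json === false) {
        return false;
    }
    return file_put_contents($path, $json) !== false;
}
```

In serp_scraper.php this would be called as saveSerpData($data, 'serp.json'), where the file name is again just an example.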

Aftercoding chat

Pro: Now you’re prepared for SERP scraping battle.
Noob: I can add as many engines as I want. F*king easy! I only have to code a SERP parsing class. Thanks Pro! Could you share more blackhat tips? [Teach me and I …]
Pro: Dude, it’s just the beginning. You can scrape thousands of articles if you switch your stupid WH/BH mindset to a coding mindset. But don’t go out to sea before you’ve learned how to swim. Study coding and design patterns: Five common PHP design patterns, Recommended PHP reading list, Head First Design Patterns (aff).
Noob: Thanks man!

7 thoughts on “How to build scrapers that matter. Code reuse.”

  1. Hey zero, Do you have a method to scrape google local business results and see where a listing appears in a) the 7-pack or b) the full listings for certain search terms? WP

  2. It’s interesting to look at the code of how it is done, I used to be a software engineer, so I can understand what the code is doing. But just like the comment by “Metal Briefcase” I would also say that I won’t be using this method, but it is still fun to look at =D

    Till then,

    Jean

  3. this is really gay!

    Wtf do you pretend to do and what your code is differs like pussy from peniz. oop and arrogantly proud on that gay stuff a noob programmmer should already know. Awful
