Scraping article directories by exploiting search functionality

Search is the weak point of thousands of article directories. EzineArticles, ArticlesBase and a hundred other websites will soon become your pets for scraping.

Solution Overview

You need a pair of scrapers. The first uses the search functionality: it grabs the search results, URLs and titles, and saves them to the database. The second takes the URLs (and titles) from the database and scrapes the articles.

Why two scrapers instead of one? Because running multiple instances of both scrapers, i.e. scraping different websites in parallel, reduces the risk of getting caught while still letting you pull out many articles. The directories are happy because you're whoring them politely, and you're happy because you get many articles fast.

Two scrapers for hundreds of directories might sound like a challenging coding task. I recommend writing the bots so that they pull all website-related data from the database. Don't lean on files and command-line arguments too much, because it's a pain in the ass. Set one design constraint, '100% code reuse', and never violate it. You will reuse this bitch over and over again without changing a single line of code. It pays off: when you want to scrape another directory, you simply add one database record and run the scripts. Got it? OK, it's time to get your hands dirty.
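
As a minimal sketch of that constraint (assuming a SQLite database and the articles_websites table described below; the function name and snake_case column names are my own), loading everything site-specific from the database could look like this:

    import sqlite3

    def load_website(website_name, db_path="scraper.db"):
        """Pull all site-specific data from the database, so the scraper
        code itself never changes when a new directory is added."""
        db = sqlite3.connect(db_path)
        return db.execute(
            "SELECT website_id, search_url, next_page_regex, "
            "titles_urls_regex, article_regex "
            "FROM articles_websites WHERE website_name = ?",
            (website_name,)).fetchone()

Adding a new directory then really is just one INSERT plus another run of the scripts.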

Searching Directories

Don’t type just “article website” or “article directory”. Target a group of websites or platforms, because you need a single solution for multiple websites.

Select two or three websites; that's enough to start with. Later you will screw them all.

Creating Database

The structure is pretty self-evident, but some fields will be explained later; a minimal SQL sketch follows the table lists.

articles_websites table:

  1. Website Id – auto increment primary key;
  2. Website name – domain is OK;
  3. Search URL;
  4. Next Page regex;
  5. Titles URLs regex;
  6. Article regex.

search_results table:

  1. Article URL – unique index;
  2. Article Title;
  3. Website name or website Id – foreign key articles_websites.[Website Id].

articles table:

  1. Article Id – auto increment primary key;
  2. Article Title;
  3. Article text;
  4. Website name or website Id – foreign key articles_websites.[Website Id].
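
If you go with SQLite (any SQL database works), the three tables above might be created roughly like this; the snake_case column names are my own rendering of the fields listed above:

    import sqlite3

    conn = sqlite3.connect("scraper.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS articles_websites (
        website_id        INTEGER PRIMARY KEY AUTOINCREMENT,
        website_name      TEXT,    -- domain is OK
        search_url        TEXT,    -- with %s in place of the keywords
        next_page_regex   TEXT,
        titles_urls_regex TEXT,
        article_regex     TEXT
    );
    CREATE TABLE IF NOT EXISTS search_results (
        article_url   TEXT UNIQUE,
        article_title TEXT,
        website_id    INTEGER REFERENCES articles_websites(website_id)
    );
    CREATE TABLE IF NOT EXISTS articles (
        article_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        article_title TEXT,
        article_text  TEXT,
        website_id    INTEGER REFERENCES articles_websites(website_id)
    );
    """)
    conn.commit()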

Whoring Search

Take each selected directory and fill in the articles_websites table:

  1. Find out what the search URL looks like. It should be something like http://www.website.com/search=keywords. Save it to articles_websites.[Search URL] without 'keywords', or replace 'keywords' with %s (depending on how you build the URL).
  2. Write a regular expression that extracts the "Next page" URL from the search results page. Search results are usually paginated, with a Next/NEXT/>/etc. button. Play with the search and keywords if you don't see the button. Save the regex to articles_websites.[Next Page regex].
  3. Write a regular expression that extracts all Title+URL pairs (the search results) from the search results page. This regex is usually trickier than the previous one, but it's 100% feasible. Save it to articles_websites.[Titles URLs regex].
  4. Open any article and write a regular expression that extracts the article content. Save it to articles_websites.[Article regex].

Steps 1-4 are everything I need to do to claim another victim. Guess how many directories you can add in one hour? The "suck boobs and let scrapers suck content" principle in action :twisted: An example record for a made-up directory is sketched below.
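
For illustration, here is what one articles_websites record might look like. The domain and the HTML patterns the regexes match are pure assumptions; you have to write your own regexes against the real markup of each directory.

    import sqlite3

    conn = sqlite3.connect("scraper.db")
    conn.execute(
        "INSERT INTO articles_websites "
        "(website_name, search_url, next_page_regex, titles_urls_regex, article_regex) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            "example-directory.com",                                   # hypothetical site
            "http://www.example-directory.com/search?q=%s",            # %s -> keywords
            r'<a[^>]+href="([^"]+)"[^>]*>\s*Next\s*</a>',              # "Next page" URL
            r'<h3 class="result"><a href="([^"]+)">([^<]+)</a></h3>',  # URL+Title pairs
            r'<div class="article-body">(.*?)</div>',                  # article content
        ),
    )
    conn.commit()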

Coding Search Results Scraper

Input: website name and keyword(s).

  1. Load website’s data from database.
  2. For each keyword(s) build URL replacing %s and scrap the search results page.
  3. Extract Next Page URL.
  4. Extract all Titles+URLs; save them to the search_results table.
  5. Timeout.
  6. While Next Page URL not empty repeat 2-6.

Output: Titles+URLs saved in the search_results table.
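
A rough Python sketch of those steps, assuming the schema sketched earlier and plain urllib (swap in whatever HTTP client and proxy handling you prefer):

    import re, sqlite3, time, urllib.parse, urllib.request

    def scrape_search(website_name, keywords, delay=10):
        """Search results scraper: fills the search_results table."""
        db = sqlite3.connect("scraper.db")
        website_id, search_url, next_page_re, titles_urls_re = db.execute(
            "SELECT website_id, search_url, next_page_regex, titles_urls_regex "
            "FROM articles_websites WHERE website_name = ?",
            (website_name,)).fetchone()

        for kw in keywords:
            url = search_url % urllib.parse.quote_plus(kw)   # step 2: build the URL
            while url:
                html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
                next_page = re.search(next_page_re, html)    # step 3: Next Page URL
                # Step 4: save every Title+URL; the UNIQUE index skips duplicates.
                for link, title in re.findall(titles_urls_re, html):
                    db.execute("INSERT OR IGNORE INTO search_results VALUES (?, ?, ?)",
                               (link, title, website_id))
                db.commit()
                time.sleep(delay)                            # step 5: timeout
                # Step 6: follow pagination until there is no Next Page URL.
                url = urllib.parse.urljoin(url, next_page.group(1)) if next_page else None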

Coding Articles Scraper

Input: website name.

  1. Get the Title+URL pairs from the search_results table where the website name is [website name].
  2. For each Title+URL pair, scrape the URL.
  3. On success, extract the article with [Article regex] and save Title+Article+Website Name (or Id) to the articles table, then delete the URL+Title record from the search_results table.
  4. Timeout.

Output: Articles saved in the articles table.
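
And a matching sketch of the articles scraper, under the same assumptions:

    import re, sqlite3, time, urllib.request

    def scrape_articles(website_name, delay=10):
        """Articles scraper: drains search_results into the articles table."""
        db = sqlite3.connect("scraper.db")
        website_id, article_re = db.execute(
            "SELECT website_id, article_regex FROM articles_websites "
            "WHERE website_name = ?", (website_name,)).fetchone()

        rows = db.execute("SELECT article_url, article_title FROM search_results "
                          "WHERE website_id = ?", (website_id,)).fetchall()
        for url, title in rows:
            try:
                html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
            except Exception:
                continue                          # leave the record for a later retry
            match = re.search(article_re, html, re.S)
            if match:
                db.execute("INSERT INTO articles (article_title, article_text, website_id) "
                           "VALUES (?, ?, ?)", (title, match.group(1), website_id))
                db.execute("DELETE FROM search_results WHERE article_url = ?", (url,))
                db.commit()
            time.sleep(delay)                     # timeout between requests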

Implementation Tips

  1. Rotate or pick random proxies while scraping (a sketch follows this list).
  2. Always time out, and time out wisely. The point is to run several instances of the scrapers and scrape different websites in parallel, not to hammer one website every second.
  3. Strip scripts and tags from the article before saving/posting.
  4. Find the optimal scraping rate and schedule the scraping.
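
A small sketch of tips 1 and 2, with placeholder proxy addresses and a randomized delay; the numbers are arbitrary, tune them per website:

    import random, time, urllib.request

    PROXIES = ["http://203.0.113.10:8080", "http://198.51.100.23:3128"]  # placeholders

    def fetch_via_random_proxy(url, min_delay=5, max_delay=20):
        """Fetch a URL through a randomly chosen proxy, then sleep a random interval."""
        proxy = random.choice(PROXIES)
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        html = opener.open(url, timeout=30).read()
        time.sleep(random.uniform(min_delay, max_delay))   # randomized politeness delay
        return html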

Happy coding!

14 Responses to Scraping article directories by exploiting search functionality

  1. The easiest way is to find the sitemap, scrape all of its URLs into the database, then read each link and save the page to the database. I have done it many times. If there is an IP restriction, you can spoof the IP address and scrape it.

  2. One more tip: you can fake the IP address using headers. Hope it is useful. Maybe, if the admins allow, I will write a guest post.

  3. I really like the regex-in-the-database stuff. Nice. I have to get on board with this 100% code reuse stuff. Thanks for sharing.

  4. One more thing… you can scrape article links from the search engines and then scrape those too 🙂

  5. Wow, you are opening my eyes to new opportunities. I definitely have to get on board and get some article scraping going. This could be very powerful.

  6. Do you re-write/combine etc. the articles once scraped?

  7. @winalot, it depends on the website. I rewrite/spin/outsource only if I need some quality and uniqueness.

  8. @serial number, sitemap scraping is another alternative. But what I like about exploiting search is that I get only keyword-related articles.

  9. OT: So are you scraping and storing Geocities pages now before they get de-indexed?

  10. @serozero, you can also get the full list of URLs from a search engine, e.g. site:ezinearticles.com.

  11. @Winalot, no. My mistake :}

  12. @serial number, orly? 🙂

  13. Wow, even combining and/or rewriting scraped articles is also a great idea. Not only can you scrape a ton of content, but you can even make the search engines think it is all unique.

  14. Try setting up and using TOR (http://www.torproject.org/) with your scrapers instead of hunting for public proxies that actually work.

    It's slow, but it works fine for scraping Google/Yahoo etc.
