From Zero To SEO Achieving High Rankings Through Coding

20 Apr 09

Scraping websites with PHP cURL under proxy

Scraping websites with PHP cURL is damn easy. Just do it the right way - use a proxy. Here is a simple function that does the job.

Simple PHP cURL scraper

<?php
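function getPage($proxy, $url, $referer, $agent, $header, $timeout) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);          // 1 = keep response headers in the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the page instead of printing it
    curl_setopt($ch, CURLOPT_PROXY, $proxy);            // "host:port"
    // Forces a CONNECT tunnel; some HTTP proxies refuse CONNECT for plain-HTTP targets.
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds, connect phase only
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
 
    $result = array();
    $result['EXE'] = curl_exec($ch);    // page content, or false on failure
    $result['INF'] = curl_getinfo($ch); // HTTP code, timings, sizes
    $result['ERR'] = curl_error($ch);   // empty string when no error occurred
 
    curl_close($ch);
 
    return $result;
}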
?>
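
When the function is called with $header set to 1, the headers and the body come back glued together in $result['EXE']. Here is a minimal sketch of pulling them apart, assuming the 'header_size' value reported by curl_getinfo() covers every header block received. The helper name splitResponse is made up for this example and is not part of cURL.

```php
<?php
/**
 * Split a getPage()-style result into raw headers and body.
 * Assumes $result['EXE'] was fetched with CURLOPT_HEADER enabled and
 * $result['INF'] came from curl_getinfo() on the same handle.
 * splitResponse is a hypothetical helper name.
 */
function splitResponse(array $result): array
{
    // curl_exec() returns false on failure, so there is nothing to split.
    if (!is_string($result['EXE'] ?? null)) {
        return array('', '');
    }

    // Number of bytes taken up by the headers, as reported by cURL.
    $headerSize = (int) ($result['INF']['header_size'] ?? 0);

    return array(
        (string) substr($result['EXE'], 0, $headerSize), // raw headers
        (string) substr($result['EXE'], $headerSize),    // page body
    );
}
```

Usage would look like list($headers, $body) = splitResponse($result); after a getPage() call.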

PHP cURL functions used

  • curl_init - initializes a cURL session.
  • curl_setopt - sets an option for a cURL transfer.
  • curl_exec - performs a cURL session.
  • curl_getinfo - gets information about the last transfer.
  • curl_error - returns a string containing the last error for the current session.
  • curl_close - closes a cURL session.

curl_setopt options used

  • CURLOPT_URL - the URL to scrape.
  • CURLOPT_HEADER - include the response headers in the output? Use 1 to include, 0 to exclude.
  • CURLOPT_RETURNTRANSFER - return the transfer as a string or print it directly? Use 1, i.e. return.
  • CURLOPT_PROXY - the HTTP proxy to send the request through, as host:port.
  • CURLOPT_HTTPPROXYTUNNEL - tunnel through the given HTTP proxy with a CONNECT request? Use 1, i.e. tunnel.
  • CURLOPT_CONNECTTIMEOUT - the number of seconds to wait while trying to connect.
  • CURLOPT_REFERER - the contents of the "Referer:" header to be used in a HTTP request.
  • CURLOPT_USERAGENT - the contents of the "User-Agent:" header to be used in a HTTP request.
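
If the proxy asks for a login, the proxy options above can be extended with CURLOPT_PROXYUSERPWD, which takes a "user:password" string. The sketch below only builds the options array. The function name buildProxyOptions and the credentials are assumptions for illustration, and CURLPROXY_HTTP assumes a plain HTTP proxy.

```php
<?php
/**
 * Build the proxy-related cURL options for one request.
 * buildProxyOptions is a hypothetical helper; $user and $pass are
 * placeholder credentials. Leave them null for proxies without a login.
 */
function buildProxyOptions(string $proxy, ?string $user = null, ?string $pass = null): array
{
    $options = array(
        CURLOPT_PROXY     => $proxy,         // "host:port"
        CURLOPT_PROXYTYPE => CURLPROXY_HTTP, // assumption: plain HTTP proxy
    );

    if ($user !== null) {
        // Credentials are passed to the proxy as "user:password".
        $options[CURLOPT_PROXYUSERPWD] = $user . ':' . (string) $pass;
    }

    return $options;
}
```

The array can be applied in one go with curl_setopt_array($ch, buildProxyOptions('[proxy IP]:[port]', 'user', 'secret')); in place of the single CURLOPT_PROXY call.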

Scraper usage

<?php
$result = getPage(
    '[proxy IP]:[port]', // use valid proxy
    'http://www.google.com/search?q=twitter',
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    5);
 
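if (empty($result['ERR'])) {
    // Transfer finished. Check $result['INF']['http_code'] before parsing:
    // a captcha or block page still arrives without a cURL error.
    // ...
} else {
    // cURL-level failure: dead proxy, timeout or other network problems.
    // ...
}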
?>
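
Free proxies die all the time, so one call per proxy rarely cuts it. Below is a rough sketch of rotating through a list round-robin until one attempt comes back without a cURL error. The name fetchWithRotation, the $fetch callback and the 'PROXY' key are made up for this example. The callback is assumed to take a proxy string and return an array shaped like the getPage() result.

```php
<?php
/**
 * Try proxies in round-robin order until one attempt returns without a
 * cURL error. fetchWithRotation and $fetch are hypothetical names: $fetch
 * takes a proxy string and returns an array with an 'ERR' key, the same
 * shape getPage() returns.
 */
function fetchWithRotation(array $proxies, callable $fetch, int $maxAttempts = 5): ?array
{
    $proxies = array_values($proxies); // make sure keys are 0..n-1
    $count = count($proxies);

    for ($i = 0; $count > 0 && $i < $maxAttempts; $i++) {
        $proxy = $proxies[$i % $count]; // wrap around the list
        $result = $fetch($proxy);

        if (empty($result['ERR'])) {
            $result['PROXY'] = $proxy;  // remember which proxy worked
            return $result;
        }
    }

    return null; // no proxies given, or every attempt failed
}
```

With the function from this post, the callback could be a closure such as function ($proxy) use ($url, $referer, $agent) { return getPage($proxy, $url, $referer, $agent, 1, 5); }. Passing the fetcher in as a callback also lets you test the rotation without touching the network.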

P.S.: Enable the cURL extension in php.ini if required; without it, curl_init() is an undefined function.

Comments (15) Trackbacks (0)
  1. Hey man!
    What can I say… I have tried it!!!

  2. Thanks for the info. I am building an online tool now that has a lot of useful info in a single location, versus most tools that have various menus, etc.

    I am curious what you can get away with ‘ethically’ in terms of scraping.

  3. I will let you know how it goes. We’ll see what I can get away with LOL
    For what I do in research on a particular client, I think the tool set will be extremely useful to others.

  4. What if I have proxies w/ login?

  5. Hi,
    I’m using the proxies at seo-proxies.com for web scraping.
    The great thing: they have a PHP API and PHP source code examples!

    So you don’t have to think about proxy user authentication (usernames, passwords) or how to change and manage your proxies.
    All you do is take their free PHP source code, add your userid and password to it, and then add a few lines to specify what to do.
    The proxy works from the start, and changing the proxy IP is just a single PHP line.

    It’s really great how easy programming web tools can be!

    greets,
    Henk

  6. I’ve signed up for a free trial on seo-proxies.com.
    Can you give me an example of how to use the API now?

    I want it to automatically change my IP and scrape Google for a keyword in specific languages.

    thanks !

  7. Thanks, this has been very helpful…
    except for some reason I can’t get *any* proxy to work.
    Every time I just get curl error: couldn’t connect to host. No further info is provided and it’s frustrating,
    so I thought I’d vent here. ARRGGHH!!!!! Thanks.

  8. @Tony, try to find some live proxies. Double check all params. BTW, what are you scraping? Is it a server-side or client-side scraper?

  9. I am trying this and keep getting Proxy CONNECT aborted. I believe it’s error 56. Do you know what that means? Googling around for it has been useless, and I know the proxy is good as I just used a proxy checker and am randomly choosing many proxies.

    Thanks!

  10. @Berto, are you picking connect/forward proxies? Or HTTP?

  11. Hi there.. thx for the useful info.. I’m a noob actually and this helped me a lot.. but I have one problem.. I need to round-robin my proxies, how can I implement that?
    any help is very appreciated.. :)

  12. There are some links about the legality of scraping that i’ve been putting together here: http://www.legalcasual.com/questions/1/automated-site-scraping

  13. couldn’t connect to host

    $result = getPage(
    'li56-17.members.linode.com:3128', // use valid proxy
    'http://www.google.com/search?q=twitter',
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    5);

    Thanks :)

