Scraping websites with PHP cURL under proxy

Scraping websites with PHP cURL is damn easy. Just do it the right way – use a proxy. Here is a simple function that does the job.

Simple PHP cURL scraper

<?php
/**
 * Fetch a page through an HTTP proxy with cURL.
 *
 * @param string $proxy   proxy as "IP:port"
 * @param string $url     the URL to scrape
 * @param string $referer "Referer:" header to send
 * @param string $agent   "User-Agent:" header to send
 * @param int    $header  1 to include response headers in the output, 0 to exclude
 * @param int    $timeout connection timeout in seconds
 * @return array 'EXE' => response, 'INF' => transfer info, 'ERR' => error string
 */
function getPage($proxy, $url, $referer, $agent, $header, $timeout) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);

    $result['EXE'] = curl_exec($ch);    // response body (false on failure)
    $result['INF'] = curl_getinfo($ch); // transfer metadata (HTTP code, timings, ...)
    $result['ERR'] = curl_error($ch);   // last error message, '' on success

    curl_close($ch);

    return $result;
}
?>
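
If your proxy requires a login, cURL can send the credentials with CURLOPT_PROXYUSERPWD. A minimal sketch with the proxy address and credentials as placeholders (forcing CURLAUTH_BASIC is my assumption; most proxies use Basic auth):

<?php
// Sketch: fetch a page through a proxy that requires authentication.
$ch = curl_init('http://www.google.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, '[proxy IP]:[port]');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, '[username]:[password]');
curl_setopt($ch, CURLOPT_PROXYAUTH, CURLAUTH_BASIC); // assumption: Basic auth
$page = curl_exec($ch);
curl_close($ch);
?>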

PHP cURL functions used

  • curl_init – initializes a cURL session.
  • curl_setopt – sets an option for a cURL transfer.
  • curl_exec – performs a cURL session.
  • curl_getinfo – gets information about the last transfer.
  • curl_error – returns a string containing the last error for the current session.
  • curl_close – closes a cURL session.

curl_setopt options used

  • CURLOPT_URL – the URL to scrape.
  • CURLOPT_HEADER – include the response headers in the output? Use 1 to include, 0 to exclude.
  • CURLOPT_RETURNTRANSFER – return the transfer as a string instead of printing it directly? Use 1, i.e. return.
  • CURLOPT_PROXY – the HTTP proxy to tunnel requests through.
  • CURLOPT_HTTPPROXYTUNNEL – tunnel through the given HTTP proxy? Use 1, i.e. tunnel (but see the note after this list).
  • CURLOPT_CONNECTTIMEOUT – the number of seconds to wait while trying to connect.
  • CURLOPT_REFERER – the “Referer:” header to be used in the HTTP request.
  • CURLOPT_USERAGENT – the “User-Agent:” header to be used in the HTTP request.
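
A note on CURLOPT_HTTPPROXYTUNNEL, referenced above: with this option set to 1, cURL asks the proxy for a CONNECT tunnel instead of simply forwarding the request. Many public HTTP proxies only allow CONNECT to port 443, so plain http:// URLs can fail with errors like “Proxy CONNECT aborted” or “Received HTTP code 403 from proxy after CONNECT” (both come up in the comments below). If you hit those, the non-tunneled variant is worth trying; a sketch (the proxy address is a placeholder):

<?php
// Sketch: proxy the request without a CONNECT tunnel, which many
// HTTP proxies refuse for plain http:// URLs.
$ch = curl_init('http://www.google.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, '[proxy IP]:[port]');
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 0); // forward instead of tunneling
$page = curl_exec($ch);
curl_close($ch);
?>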

Scraper usage

<?php
$result = getPage(
    '[proxy IP]:[port]', // use a valid proxy
    'http://www.google.com/search?q=twitter', // URL to scrape
    'http://www.google.com/', // referer
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8', // user agent
    1,  // include response headers
    5); // 5-second connection timeout

if (empty($result['ERR'])) {
    // Job's done! Parse, save, etc.
    // ...
} else {
    // WTF? Captcha or network problems?
    // ...
}
?>
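
The 'INF' element is useful for telling a block from a network failure: curl_getinfo() returns, among other things, the HTTP status code of the response. A small sketch (the 200/403/503 policy below is just one plausible choice):

<?php
// Sketch: act on the HTTP status code from the transfer info.
$code = $result['INF']['http_code'];

if ($code == 200) {
    // Got the page - parse away.
} elseif ($code == 403 || $code == 503) {
    // Likely blocked or captcha'd - rotate to another proxy.
} else {
    // Network trouble or odd response; $result['ERR'] has the details.
}
?>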

P.S.: Enable the cURL extension in php.ini if it isn't already active.
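
If you are not sure whether the extension is active, a quick runtime check saves head-scratching:

<?php
// Bail out early if the cURL extension is missing.
if (!function_exists('curl_init')) {
    die('cURL extension not available - enable it in php.ini.');
}
?>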

25 Responses to Scraping websites with PHP cURL under proxy

  1. Thanks for the info. I am building an online tool now that has a lot of useful info in a single location, versus most tools that have various menus, etc.

    I am curious what you can get away with ‘ethically’ in terms of scraping.

  2. I will let you know how it goes. We’ll see what I can get away with LOL
    For what I do in research on a particular client, I think the tool set will be extremely useful to others.

  3. What if I have proxies w/ login?

  4. Hi,
    I’m using the proxies at seo-proxies.com for web scraping.
    The great things: they have a PHP API and PHP source code examples!

    So you don't have to think about proxy user authentication (usernames, passwords) or how to change and manage your proxies.
    All you do is take their free PHP source code, add your user ID and password into it, and then add a few lines to specify what to do.
    The proxy works right from the beginning; changing the proxy IP is just a single PHP line.

    It's really great how easy programming web tools can be!

    greets,
    Henk

  5. I’ve signed up for a free trial on seo-proxies.com.
    Can you give me an example of how to use the API now?

    I want it to automatically change my IP and scrape Google for a keyword in specific languages.

    thanks !

  6. Thanks, this has been very helpful…
    except for some reason I can’t get *any* proxy to work.
    Every time I just get curl error: couldn’t connect to host. No further info is provided and it’s frustrating,
    so I thought I’d vent here. ARRGGHH!!!!! Thanks.

  7. @Tony, try to find some live proxies. Double-check all the params. BTW, what are you scraping? Is it a server-side or client-side scraper?

  8. I am trying this and keep getting Proxy CONNECT aborted. I believe it’s error 56. Do you know what that means? Googling around for it has been useless, and I know the proxy is good as I just used a proxy checker and am randomly choosing many proxies.

    Thanks!

  9. @Berto, are you picking connect/forward proxies? Or HTTP?

  10. Hi there, thanks for the useful info. I'm a noob actually and this helped me a lot, but I have one problem: I need to round-robin my proxies. How can I implement that?
    Any help is very appreciated. :)
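
    A simple round-robin keeps the proxies in an array and cycles through them by request count; a minimal sketch built on the getPage() function above (the proxy addresses are placeholders):

    <?php
    // Placeholder proxy pool - replace with live proxies.
    $proxies = array('1.2.3.4:8080', '5.6.7.8:3128', '9.10.11.12:80');

    $urls = array(
        'http://www.google.com/search?q=twitter',
        'http://www.google.com/search?q=facebook',
    );

    foreach ($urls as $i => $url) {
        $proxy = $proxies[$i % count($proxies)]; // next proxy in rotation
        $result = getPage($proxy, $url, 'http://www.google.com/',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
            1, 5);
        // ... parse $result['EXE'], check $result['ERR'] ...
    }
    ?>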

  11. There are some links about the legality of scraping that i’ve been putting together here: http://www.legalcasual.com/questions/1/automated-site-scraping

  12. couldn’t connect to host

    $result = getPage(
        'li56-17.members.linode.com:3128', // use valid proxy
        'http://www.google.com/search?q=twitter',
        'http://www.google.com/',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
        1,
        5);

    Thanks :)

  13. Hi all,

    I tried this code, but I got an error message like this. Can anyone help me?

    Warning: curl_setopt() [function.curl-setopt]: CURLOPT_FOLLOWLOCATION cannot be activated when in safe_mode or an open_basedir is set in /home/wwwlawy/public_html/mails/gmail/libgmailer.php on line 42

    Warning: curl_setopt() [function.curl-setopt]: CURLOPT_FOLLOWLOCATION cannot be activated when in safe_mode or an open_basedir is set in /home/wwwlawy/public_html/mails/gmail/libgmailer.php on line 77

    No contacts found

  14. How can I do this to search for emails on Google?

  15. Yeah, I’ve used this post several times to help clients use our proxies, and it seems to work out well.

  16. Can you provide some examples concerning the multithreading issue?
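
    PHP's cURL binding has no threads, but the curl_multi API runs several transfers in parallel in one process; a minimal sketch (proxy options omitted for brevity, but the same curl_setopt() calls from getPage() apply per handle):

    <?php
    // Sketch: fetch several URLs in parallel with the curl_multi API.
    $urls = array('http://www.example.com/', 'http://www.example.org/');

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch); // register with the multi handle
        $handles[] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        usleep(10000); // avoid a busy loop
    } while ($running > 0);

    $pages = array();
    foreach ($handles as $ch) {
        $pages[] = curl_multi_getcontent($ch); // body, thanks to RETURNTRANSFER
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    ?>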

  17. Hello.
    Thanks for the post.
    I have a problem running that code:
    error 404.
    I have a proxy from http://www.xroxy.com/proxy-country-DE.htm and it works when I put it into the settings in Firefox.
    Why doesn't it work when I do it like this?
    $options = array(
        CURLOPT_FAILONERROR => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => 25,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_USERAGENT => 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)',
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_POST => 0,
        CURLOPT_COOKIEJAR => $cookiesFileName,
        CURLOPT_COOKIEFILE => $cookiesFileName,
        CURLOPT_PROXY => '62.157.186.16:80',
        CURLOPT_HTTPPROXYTUNNEL => 1,
    );
    $URL = 'http://hanibal.hopto.org/';
    $c = curl_init($URL);
    curl_setopt_array($c, $options);
    $result = curl_exec($c);

  18. Thanks man! Saved my life.

  19. Hey, I've tested it that way with many different proxies, but I always get the “Received HTTP code 403 from proxy after CONNECT” error. Any ideas what to do?

  20. That’s good information in this blog, thanks for sharing.
    I’m currently using the Google scraper on google-scraper.squabbel.com (a bit customized). I think I found that link actually on one of your blogs a while ago!

    I need to scrape about 1 million keywords per month with it, do you know how many proxies I will need?
    I’m using a “guru” license at seo-proxies at this time, not sure if it’s enough to keep going or if I need to upgrade even higher.

    This is an important project, that’s why I’d really appreciate an answer.

  21. When using proxies (in my case for a proxy tester) my script just keeps running as soon as I have more than 5 cURL handles in a loop, even with the multi cURL handler… can't figure out why that is?!
