20 Apr 2009
Scraping websites with PHP cURL under proxy
Scraping websites with PHP cURL is damn easy. Just do it the right way - use a proxy. Here is a simple function that does the job.
Simple PHP cURL scraper
<?php
// Fetch $url through the HTTP proxy $proxy ("host:port").
// Returns the response, the transfer info and the error string.
function getPage($proxy, $url, $referer, $agent, $header, $timeout) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);

    $result = array();
    $result['EXE'] = curl_exec($ch);    // response (false on failure)
    $result['INF'] = curl_getinfo($ch); // transfer details, e.g. http_code
    $result['ERR'] = curl_error($ch);   // empty string when no error occurred
    curl_close($ch);

    return $result;
}
?>
PHP cURL functions used
- curl_init - initializes a cURL session.
- curl_setopt - sets an option for a cURL transfer.
- curl_exec - performs a cURL session.
- curl_getinfo - gets information about the last transfer.
- curl_error - returns a string containing the last error for the current session.
- curl_close - closes a cURL session.
curl_setopt options used
- CURLOPT_URL - the URL to scrape.
- CURLOPT_HEADER - include/exclude the response header in the output?
- CURLOPT_RETURNTRANSFER - return the transfer as a string or output it directly? Use 1, i.e. return.
- CURLOPT_PROXY - the HTTP proxy to tunnel requests through.
- CURLOPT_HTTPPROXYTUNNEL - tunnel through a given HTTP proxy? Use 1, i.e. tunnel.
- CURLOPT_CONNECTTIMEOUT - the number of seconds to wait while trying to connect.
- CURLOPT_REFERER - the "Referer:" header to be used in an HTTP request.
- CURLOPT_USERAGENT - the "User-Agent:" header to be used in an HTTP request.
Scraper usage
<?php
$result = getPage(
    '[proxy IP]:[port]', // use valid proxy
    'http://www.google.com/search?q=twitter',
    'http://www.google.com/',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
    1,
    5);

if (empty($result['ERR'])) {
    // Job's done! Parse, save, etc.
    // ...
} else {
    // WTF? Captcha or network problems?
    // ...
}
?>
P.S.: Activate cURL in php.ini if required.
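An empty ERR only means cURL itself reported no failure; a captcha or block page can still arrive as an ordinary response with a non-200 status code. Here is a minimal sketch of a stricter check, assuming the $result array returned by getPage() above. The helper name isUsableResult() is hypothetical, and treating only HTTP 200 with a non-empty body as success is an assumption, not a rule.

```php
<?php
// Hypothetical helper: decide whether a getPage() result is worth parsing.
// Assumes the array layout used by getPage(): 'EXE', 'INF' and 'ERR' keys.
function isUsableResult(array $result) {
    // cURL-level failure (network, proxy, timeout).
    if (!empty($result['ERR'])) {
        return false;
    }
    // curl_exec() returns false on failure; an empty body is not useful either.
    if (!isset($result['EXE']) || $result['EXE'] === false || $result['EXE'] === '') {
        return false;
    }
    // curl_getinfo() reports the last HTTP status under 'http_code'.
    $status = isset($result['INF']['http_code']) ? (int) $result['INF']['http_code'] : 0;
    return $status === 200;
}
```

In the usage example, this check could stand in for the plain empty($result['ERR']) test.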
April 20th, 2009 - 18:09
Hey man!
What to say… I have tried it!!!
April 30th, 2009 - 09:11
Thanks for the info. I am building an online tool now that has a lot of useful info in a single location versus most tools that have various menus, etc.
I am curious what you can get away with ‘ethically’ in terms of scraping.
April 30th, 2009 - 11:53
What ethics?
May 27th, 2010 - 12:57
lol.. like your style “your da’ man”..lol
April 30th, 2009 - 21:32
I will let you know how it goes. We’ll see what I can get away with LOL
For what I do in research on a particular client, I think the tool set will be extremely useful to others.
June 23rd, 2009 - 05:49
What if I have proxies w/ login?
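For proxies that require a login, cURL has a CURLOPT_PROXYUSERPWD option that takes the credentials as "user:password". Here is a minimal sketch, assuming an HTTP proxy with basic authentication. proxyCredentials() and getPageWithLogin() are hypothetical names, and the percent-encoding step assumes libcurl URL-decodes this value, as its documentation describes.

```php
<?php
// Hypothetical helper: build the "user:password" string for CURLOPT_PROXYUSERPWD.
// Assumption: libcurl URL-decodes this value, so special characters such as
// ':' are percent-encoded here.
function proxyCredentials($user, $password) {
    return rawurlencode($user) . ':' . rawurlencode($password);
}

// Hypothetical variant of getPage() for a proxy that asks for a login.
function getPageWithLogin($proxy, $user, $password, $url, $timeout) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_PROXYUSERPWD, proxyCredentials($user, $password));
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Same result layout as getPage(); elements are evaluated in order.
    $result = array(
        'EXE' => curl_exec($ch),
        'INF' => curl_getinfo($ch),
        'ERR' => curl_error($ch),
    );
    curl_close($ch);
    return $result;
}
```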
August 13th, 2009 - 10:16
Hi,
I’m using the proxies at seo-proxies.com for web scraping.
The great thing: they have a PHP API and PHP source code examples!
So you don’t have to think about proxy user authentication (usernames, passwords) or how to change and manage your proxies.
All you do is take their free PHP source code, add your user ID and password to it, and then add a few lines to specify what to do.
The proxy works from the start, and changing the proxy IP is just a single line of PHP.
It’s really great how easy programming web tools can be!
greets,
Henk
August 18th, 2009 - 19:16
I’ve signed up for a free trial on seo-proxies.com.
Can you give me an example of how to use the API now?
I want it to automatically change my IP and scrape Google for a keyword in specific languages.
thanks !
October 15th, 2009 - 09:49
Thanks, this has been very helpful…
except for some reason I can’t get *any* proxy to work.
Every time I just get curl error: couldn’t connect to host. No further info is provided and it’s frustrating,
so I thought I’d vent here. ARRGGHH!!!!! Thanks.
October 16th, 2009 - 09:37
@Tony, try to find some live proxies. Double-check all params. BTW, what are you scraping? Is it a server-side or client-side scraper?
October 27th, 2009 - 07:55
I am trying this and keep getting Proxy CONNECT aborted. I believe it’s error 56. Do you know what that means? Googling around for it has been useless, and I know the proxy is good as I just used a proxy checker and am randomly choosing many proxies.
Thanks!
October 28th, 2009 - 09:18
@Berto, are you picking connect/forward proxies? Or HTTP?
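The two failures reported above map onto libcurl error numbers, which curl_errno() returns alongside the curl_error() text: 7 is "couldn't connect" and 56 is a receive failure. Here is a small sketch that turns the number into a hint. curlErrorHint() is a hypothetical helper, and the suggested causes are common ones rather than a diagnosis.

```php
<?php
// Hypothetical helper: map a libcurl error number (from curl_errno())
// to a short hint. The numbers are libcurl's documented CURLE_* values.
function curlErrorHint($errno) {
    $hints = array(
        0  => 'no cURL error; check the HTTP status instead',
        5  => 'proxy host name could not be resolved; check the proxy address',
        7  => 'could not connect; the proxy may be dead or the port wrong',
        28 => 'timed out; try a larger timeout or another proxy',
        56 => 'receive failure; the proxy may have refused the CONNECT tunnel',
    );
    return isset($hints[$errno]) ? $hints[$errno] : 'unlisted cURL error ' . $errno;
}
```

If a proxy only does plain HTTP forwarding, setting CURLOPT_HTTPPROXYTUNNEL to 0 for http:// URLs is worth trying, since the tunnel option makes cURL send a CONNECT request first.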
November 24th, 2009 - 09:22
Hi there.. thx for the useful info.. I’m a noob actually and this helped me a lot.. but I have one problem.. I need to round-robin my proxies. How can I implement that?
any help is very appreciated..
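Round-robin here just means walking a list of proxies in order and starting over after the last one. Here is a minimal sketch, assuming the proxies sit in a plain array. nextProxy() is a hypothetical helper, and the addresses are placeholders from the documentation range, not working proxies.

```php
<?php
// Hypothetical helper: return the next proxy from $proxies and advance
// $index, wrapping back to the start after the last entry.
function nextProxy(array $proxies, &$index) {
    if (count($proxies) === 0) {
        throw new InvalidArgumentException('proxy list is empty');
    }
    $proxy = $proxies[$index % count($proxies)];
    $index = ($index + 1) % count($proxies);
    return $proxy;
}

// Placeholder addresses, not real proxies.
$proxies = array('192.0.2.10:3128', '192.0.2.11:8080', '192.0.2.12:3128');
$index = 0;
$order = array();
for ($i = 0; $i < 4; $i++) {
    // In a scraper, pass this value as the first argument of getPage().
    $order[] = nextProxy($proxies, $index);
}
// The fourth pick wraps around to the first proxy again.
```

Keeping the position in a caller-owned $index makes it easy to store between runs, for example in a file or a database row.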
November 25th, 2009 - 18:50
There are some links about the legality of scraping that i’ve been putting together here: http://www.legalcasual.com/questions/1/automated-site-scraping
March 12th, 2010 - 03:09
couldn’t connect to host
$result = getPage(
'li56-17.members.linode.com:3128', // use valid proxy
'http://www.google.com/search?q=twitter',
'http://www.google.com/',
'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
1,
5);
Thanks