Why remove stopwords

You may be shocked to learn that up to 70 percent of the words in your articles can be stopwords, i.e. only 30 percent of the words drive search engine traffic.


Stopwords definition

Stopwords are common words that carry less meaning than keywords. Search engines usually strip stopwords from a keyword phrase to return the most relevant results: a query like "how to remove the stopwords", for example, is effectively treated as "remove stopwords". In other words, stopwords drive much less traffic than keywords.

So what? Stopwords are a part of human language and there's nothing you can do about that. True, but a high stopword density can make your content look less important to search engines.

Take the two paragraphs above and strip the stopwords out: the text shrinks from 66 words to 31. Roughly half of the words are stopwords, i.e. half of the text is not really important to search engines.
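
If you are curious about your own posts, here is a minimal PHP sketch that estimates stopword density. It assumes a stopword list saved as stopwords.txt (one stopword per line, like the lists mentioned below) and a post saved as post.txt; the file names and the simple word splitting are only illustrative.

<?php

// Rough stopword density check: what share of the words in post.txt
// also appear in stopwords.txt (one stopword per line)?
$stop_words = array_map('strtolower', array_map('trim', file('stopwords.txt')));
$words = preg_split('/\W+/', strtolower(file_get_contents('post.txt')), -1, PREG_SPLIT_NO_EMPTY);

$hits = 0;
foreach ($words as $word) {
	if (in_array($word, $stop_words)) {
		$hits++;
	}
}

printf("%d of %d words are stopwords (%.0f%%)\n",
	$hits, count($words), 100 * $hits / max(1, count($words)));
?>

Save it as, say, density.php and run it with php density.php.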

Who should care about stopwords?

If you aren't afraid to experiment and have time for it, replace some (not all) stopwords with yummy keywords before submitting a post. This may help you get more search engine traffic to your blog.

People who scrape content for doorway pages may also be interested. However, I'm not sure Google likes a stopword-free keyword mess, so you should probably keep some.

Stopwords lists

There are two stopword lists from trusted websites that you can use: Link Assistant and SEO Book. You can find more online or make your own list.
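
For reference, the first few lines of a stopwords.txt file (one stopword per line, which is the format the scripts below expect) might look like this:

a
an
and
are
as
at
be
by
of
the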

Stopword removal script (Perl)

Perl is the best choice for eating text like this. If you know Perl regular expressions, difficult tasks become much easier.

How to run?

1. Create a stopword list (stopwords.txt) – one stopword per line.

2. Save a post as a text file (post.txt). Use ASCII, not Unicode.

3. Make sure the script is executable (chmod +x stopwords_eater.pl).

4. Run it (./stopwords_eater.pl post.txt stopwords.txt out.txt).

#!/usr/bin/perl
use strict;
use warnings;

# Expect exactly three arguments: post file, stopword list, output file.
if (@ARGV != 3) {
	die "Usage: text file, stop words file, output file.\n";
}

open my $post_fh, '<', $ARGV[0] or die "$! $ARGV[0]!\n";
open my $stpw_fh, '<', $ARGV[1] or die "$! $ARGV[1]!\n";
open my $out_fh,  '>', $ARGV[2] or die "$! $ARGV[2]!\n";

# Slurp the whole post into a single string.
my $post;
{
	local $/ = undef;
	$post = <$post_fh>;
}

# Remove each stopword as a whole word, case-insensitively.
# \Q...\E escapes any regex metacharacters in the stopword.
foreach my $line (<$stpw_fh>) {
	chomp($line);
	next unless length $line;
	$post =~ s/\b\Q$line\E\b//gi;
}

# Strip digits and common punctuation.
$post =~ s/\d//g;
$post =~ s/[?;:!,.'"]//g;

print {$out_fh} $post;

close $post_fh;
close $stpw_fh;
close $out_fh;

Stopword removal script (PHP)

The PHP script does exactly the same job. It uses the preg_replace() function to apply Perl-compatible regular expressions.

<?php

// Expect exactly three arguments: post file, stopword list, output file.
if (count($argv) != 4) {
	echo("Usage: text file, stop words file, output file.\n");
	exit;
}

if (!file_exists($argv[1])) {
	exit("Unable to open file $argv[1]!\n");
}

if (!file_exists($argv[2])) {
	exit("Unable to open file $argv[2]!\n");
}

$post = file_get_contents($argv[1]);
$stop_words = file($argv[2]);

// Remove each stopword as a whole word, case-insensitively.
// preg_quote() escapes any regex metacharacters in the stopword.
foreach ($stop_words as $word) {
	$word = rtrim($word);
	if ($word === '') {
		continue;
	}
	$word = preg_quote($word, '/');
	$post = preg_replace("/\b$word\b/i", "", $post);
}

// Strip digits and common punctuation.
$post = preg_replace("/\d/", "", $post);
$post = preg_replace("/[?;:!,.'\"]/", "", $post);

$output = fopen($argv[3], 'w') or
	exit("Unable to open file $argv[3]!\n");
fwrite($output, $post);
fclose($output);
?>

How to run?

I hope you are familiar with PHP and can run the script from the command line. If not, the steps below should get you going.
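
The steps mirror the Perl version (the script file name here is only an example):

1. Save the script as a file (stopwords_eater.php).

2. Run it with the PHP command-line interpreter (php stopwords_eater.php post.txt stopwords.txt out.txt).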

Enjoy!

18 thoughts on “Why remove stopwords”

  1. I often wonder whether stop words make a difference. For instance, if Google ignores stop words in a user's search, then what happens if I, as a user, search for "The Matrix"? I will get a load of pages about the maths formula "matrix".
    I think you have to make a choice based on how your page title reads. It's important that it makes the user want to click it …

    my two cents ..

  2. azwan,

    I'm not saying that you shouldn't use stopwords. The more stopwords there are in your content, the less important it looks to search engines. If you replace some stopwords with normal words, you can get more traffic.

  3. I finished writing an SEO tool for my site and found this script idea very useful. I created my own stop word file (stopwords.txt) and wrote a little function to clean my keywords of stop words:
    function del_stop_words($kw) {
        $kw = array_map('strtolower', array_diff($kw, array("")));
        $sw = explode("\r\n", file_get_contents('stopwords.txt'));
        return array_values(array_diff($kw, $sw));
    }

    I use array_map to make all my values lower case. I need explode("\r\n") because file_get_contents keeps the line breaks, so without it I got "string" with a line break stuck to it instead of just "string". And array_diff cleans my $kw.

    enjoy dudes!

  4. Could someone help me run either of the above scripts?
    I want to eliminate stop words in French.
    Could you clearly tell me how to run a PHP or a Perl script?

  5. When I try to run this line
    ./script post.txt stopwords.txt out.txt
    this is the error that I get:
    bash: ./script: No such file or directory
    Could anybody tell me why I am getting this error?

  6. Good script! Now I need to process a post.txt with Unicode charset. What do I need to change in the script?
