Today I decided to implement a StopWords filter in C# that would filter out certain words from a search engine query. I wanted something to filter out common words like "a", "I", "to", "the", and "how" from search queries, since in most cases these words don't really help produce the most accurate search results and instead just create more unnecessary ones.
Keep in mind, there's no be-all and end-all list of stop words for every case; ultimately you have to decide for yourself what's best for the application (and its users) when determining which stop words to include and which to exclude. However, published lists of common stop words are a good starting point for creating your own StopWords list.
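Since the right list varies from one application to the next, one option is to keep it in a plain text file rather than hard-coding it. Here's a minimal sketch of that idea. It is not part of the SearchHelper class shown later: the StopWordList class, its Parse and Load methods, and the one-word-per-line file format with # comment lines are all assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StopWordList
{
    // Hypothetical helper: builds a stop-word set from lines of text,
    // one word per line. Blank lines and lines starting with '#' are skipped.
    public static HashSet<string> Parse(IEnumerable<string> lines)
    {
        IEnumerable<string> words = lines
            .Select(line => line.Trim())
            .Where(line => line.Length > 0 && line[0] != '#')
            .Select(line => line.ToLowerInvariant());

        return new HashSet<string>(words, StringComparer.Ordinal);
    }

    // Hypothetical helper: loads the list from a text file,
    // for example a "stopwords.txt" shipped with the application.
    public static HashSet<string> Load(string path)
    {
        return Parse(File.ReadLines(path));
    }
}
```

With something like this, adding or removing a stop word means editing a text file instead of recompiling, which could help while you're still tuning the list for your users.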
I ultimately narrowed my StopWords list down to some of the more common words that I felt wouldn't interfere too much with a searcher's intent:
"a", "about", "actually", "after", "also", "am", "an", "and", "any", "are", "as", "at", "be", "because", "but", "by",
"could", "do", "each", "either", "en", "for", "from", "has", "have", "how", "i", "if", "in", "is", "it", "its", "just", "of", "or", "so", "some", "such", "that", "the", "their", "these", "thing", "this", "to", "too", "very", "was", "we", "well", "what", "when", "where", "who", "will", "with", "you", "your"
Once I had figured out my StopWords list, I created a SearchHelper class in C# to clean search query words before sending them to the database to return search results. Below is the SearchHelper.cs C# class (download available: see attached .cs file below):
using System;
using System.Collections.Generic;

public class SearchHelper
{
    // Stop words to drop from search queries. The set is built once,
    // instead of on every call, and gives fast lookups.
    private static readonly HashSet<string> stopWords = new HashSet<string>
    {
        "a", "about", "actually", "after", "also", "am", "an", "and", "any", "are", "as", "at", "be", "because", "but", "by",
        "could", "do", "each", "either", "en", "for", "from", "has", "have", "how",
        "i", "if", "in", "is", "it", "its", "just", "of", "or", "so", "some", "such", "that",
        "the", "their", "these", "thing", "this", "to", "too", "very", "was", "we", "well", "what", "when", "where",
        "who", "will", "with", "you", "your"
    };

    /// <summary>
    /// Removes stop words from the specified search string.
    /// </summary>
    public static string CleanSearchedWords(string searchedWords)
    {
        // Nothing to clean for a null or blank query.
        if (string.IsNullOrWhiteSpace(searchedWords))
            return string.Empty;

        // Strip special characters from the query.
        searchedWords = searchedWords
            .Replace("\\", string.Empty)
            .Replace("|", string.Empty)
            .Replace("(", string.Empty)
            .Replace(")", string.Empty)
            .Replace("[", string.Empty)
            .Replace("]", string.Empty)
            .Replace("*", string.Empty)
            .Replace("?", string.Empty)
            .Replace("}", string.Empty)
            .Replace("{", string.Empty)
            .Replace("^", string.Empty)
            .Replace("+", string.Empty);

        // Transform the search string into an array of words.
        char[] wordSeparators = new char[] { ' ', '\t', '\n', '\r', ',', ';', '.', '!', '?', '-', '"', '\'' };
        string[] words = searchedWords.Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries);

        // Keep lowercased words that are longer than one character and are not stop words.
        List<string> keptWords = new List<string>();
        foreach (string rawWord in words)
        {
            string word = rawWord.ToLowerInvariant();
            if (word.Length > 1 && !stopWords.Contains(word))
                keptWords.Add(word);
        }

        // Join with single spaces so the result has no trailing space.
        return string.Join(" ", keptWords);
    }
}
That's it. Now, in your search results page code, you can call SearchHelper.CleanSearchedWords(searchWordsHere) to clean the searched words string. It's pretty simple, but it works well for filtering common words out of a search query.
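One thing to watch for: a query made up entirely of stop words, such as "what is this", comes back with nothing left to search on. Here's a rough sketch of one way to handle that, assuming you'd rather search on the original text than on nothing. The SearchQueryExample class and its CleanOrFallback method are hypothetical names of my own, and the filter is passed in as a delegate so the sketch stands on its own.

```csharp
using System;

public static class SearchQueryExample
{
    // Hypothetical wrapper: "clean" is any stop-word filter, for example
    // SearchHelper.CleanSearchedWords. If cleaning removes every word,
    // fall back to the trimmed original query instead of an empty one.
    public static string CleanOrFallback(string query, Func<string, string> clean)
    {
        // A null or blank query has nothing to clean or fall back to.
        if (string.IsNullOrWhiteSpace(query))
            return string.Empty;

        string cleaned = (clean(query) ?? string.Empty).Trim();
        return cleaned.Length > 0 ? cleaned : query.Trim();
    }
}
```

On the search results page, the call might then look like SearchQueryExample.CleanOrFallback(userInput, SearchHelper.CleanSearchedWords), where userInput is whatever the visitor typed. Whether falling back to the raw query is the right behavior depends on your application. Showing a "please be more specific" message is another reasonable choice.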