Tuesday, September 19, 2006

Removing SQL Server Full Text Noise Words from a Search String

Remove SQL Full Text Noise Words

I'm not going to get into the specifics on constructing full text queries (at least not yet). In this article I'm going to tell you the full text noise words and provide a function for stripping them from queries so you don't get an error.

Have you ever gotten this error when executing a search against your full text index:
Microsoft OLE DB Provider for SQL Server error '80040e14' 

Execution of a full-text operation failed. A clause of the query contained only ignored words.
SQL Server Full Text Indexing provides support for sophisticated word and phrase searches. The full text index stores information about words and their location within a given column, This information is used to quickly complete full text queries that search for rows with particular words or combinations of words.

The noise words (assuming you did a standard install of SQL Server 2000) are located at:
\Program Files\Microsoft SQL Server\MSSQL\FTDATA\SQLServer\Config
In that directory, you'll see a bunch of files named noise with different extensions. The extension indicates the language. The English noise files is noise.eng. If you open that file with a text editor, you can see all the noise words (and symbols, like the $ for example). A search containing any of these words or symbols will cause an error.

There are a couple of options here for preventing that error. One is to remove the noise word from the noise file, then rebuild your full text index. I did this for a site where I aggregated video games for price comparison. I needed visitors to be able to search on Playstation 2. Well, the number 2 is in the noise file, so I took it out.

Even if you decide to take some words out of the noise file, you'll still need to prep your search string by removing any noise words that are left. The easiest way to do that is a function. When someone does a search, pass their search phrase into the function and then return the phrase with the noise words stripped out. There are several ways to do this, but here's the method I went with:
Function PrepSearchString(sOriginalQuery)
strNoiseWords = "1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | $ | ! | @ | # | $ | % | ^ | & | * | ( | ) | - | _ | + |
= | [ | ] | { | } | about | after | all | also | an | and
| another | any | are | as | at | be | because | been | before | being | between |
both | but | by | came | can | come | could | did | do | does | each | else | for |
from | get | got | has | had | he | have | her | here | him | himself | his | how |
if | in | into | is | it | its | just | like | make | many | me | might | more |
most | much | must | my | never | now | of | on | only | or | other | our | out |
over | re | said | same | see | should | since | so | some | still | such | take |
than | that | the | their | them | then | there | these | they | this | those |
through | to | too | under | up | use | very | want | was | way | we | well | were |
what | when | where | which | while | who | will | with | would | you | your | a |
b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v |
w | x | y | z"

arrNoiseWord = Split(strNoiseWords,"|")

For i = 0 to UBound(arrNoiseWord)
sOriginalQuery = " "&LCase(sOriginalQuery)&" "
sOriginalQuery = Replace(sOriginalQuery," "&Trim(arrNoiseWord(i))&" "," ")
Next
PrepSearchString = Trim(sOriginalQuery)
End Function
How it works:

  • I took all the noise words and put them in one long pipe delimited string
  • Split the string to make an array
  • Loop through the array looking for noise words
  • When I find one, replace it with a space

    Some other considerations:

    In the function above, I added an extra space before and after the pipe. This was to allow the long string to break and fit onto the page. To account for that, I use the Trim command referring to an item in the array. If you build the string without the spaces, you won't need the Trim (though it won't hurt anything).

    I took the original query and put a space on either end. This is to account for a noise word being at either the beginning or the end of the search phrase.

    I also use LCase to put the original query into all lower case. I could do a vbTextCompare, but it was easier for just make the term all lower case.

    Lastly, I trim the original query when assigning it to go back to account for a space being at either end (when the noise word started or ended the query)

    Once you have the noise word free search string, you can build your query. Be aware: if the search query contains noise words only, this returns an empty string. You should for that in your code.
  • 3 comments:

    Anonymous said...
    This comment has been removed by a blog administrator.
    Jester said...

    Yes it does - change

    Function PrepSearchString(sOriginalQuery)

    to

    Function PrepSearchString(strOriginalQuery)

    Unknown said...

    Thanks for the post. Is it safe to remove the numbers 1-9 from the noise file? I need users to be able to search for things like "district 2".

    Thanks,
    Jason