Friday, August 5, 2011

Lucene Analyzer

There are several analyzers available in Lucene:

WhitespaceAnalyzer: It splits text into tokens on whitespace characters. It makes no other attempt to normalize the tokens; in particular, it doesn't lowercase them.

SimpleAnalyzer: It splits text into tokens at non-letter characters and lowercases each token. Because digits and punctuation are non-letters, they are discarded rather than kept in the tokens.

StopAnalyzer: It acts almost the same as SimpleAnalyzer, except that it also removes common words (stop words). By default, it removes common English words (the, a, etc.).

StandardAnalyzer: It is the most sophisticated core analyzer, with logic to identify certain kinds of tokens such as company names, email addresses, and hostnames. It lowercases each token and removes stop words and punctuation.

KeywordAnalyzer: It treats the entire text as a single token.
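
To make the differences concrete, here is a rough sketch in plain Java that imitates the tokenizing rules described above for four of the analyzers. It does not call Lucene at all: the class name AnalyzerSketch, its methods, and the short stop list are all made up for illustration, and the real analyzers differ in detail (for example, Lucene's default English stop set is longer). StandardAnalyzer is left out because its grammar is too involved to imitate in a few lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics the described behavior, not the Lucene API.
public class AnalyzerSketch {

    // Hypothetical, abbreviated stop list (an assumption for this sketch).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    // WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split at every run of non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // StopAnalyzer-like: same as simple(), minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    // KeywordAnalyzer-like: the whole input is one token.
    static List<String> keyword(String text) {
        return List.of(text);
    }

    public static void main(String[] args) {
        String text = "The quick-brown Fox jumped over 2 lazy dogs";
        System.out.println(whitespace(text)); // [The, quick-brown, Fox, jumped, over, 2, lazy, dogs]
        System.out.println(simple(text));     // [the, quick, brown, fox, jumped, over, lazy, dogs]
        System.out.println(stop(text));       // [quick, brown, fox, jumped, over, lazy, dogs]
        System.out.println(keyword(text));    // [The quick-brown Fox jumped over 2 lazy dogs]
    }
}
```

Note how the digit "2" and the hyphen survive only in the whitespace and keyword variants, while the simple and stop variants drop them and lowercase everything.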
