There are several analyzers available in Lucene:
WhitespaceAnalyzer: It splits text into tokens on whitespace characters. It does not lowercase or otherwise normalize the tokens.
SimpleAnalyzer: It splits tokens at non-letter characters and lowercases each token. Because only letters are kept, it discards numeric characters, punctuation, and every other non-letter character.
StopAnalyzer: It behaves almost the same as SimpleAnalyzer, except that it also removes common words (stop words). By default, it removes common English words (the, a, etc.).
StandardAnalyzer: It is the most sophisticated core analyzer. It contains a good deal of logic for identifying certain kinds of tokens, such as company names, email addresses, and hostnames. It lowercases each token and removes stop words and punctuation.
KeywordAnalyzer: It treats the entire text as a single token.
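
To make the differences concrete, here is a rough plain-Java imitation of the splitting rules described above. It does not call Lucene at all: the class name `AnalyzerSketch`, its methods, and the short stop-word list are hypothetical stand-ins chosen for illustration, and the real analyzers differ in detail. StandardAnalyzer is left out because its grammar-based tokenization is too involved for a short sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: approximates the behaviour described in the list above
// with plain JDK string handling. This is NOT Lucene's implementation.
class AnalyzerSketch {
    // Hypothetical, abbreviated stop-word list (assumption); Lucene ships its own.
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "and", "of", "to", "in", "is");

    // WhitespaceAnalyzer-like: split on whitespace, leave tokens unchanged.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split at non-letter characters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // StopAnalyzer-like: same as simple(), minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    // KeywordAnalyzer-like: the whole input is a single token.
    static List<String> keyword(String text) {
        return List.of(text);
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox is 42";
        System.out.println(whitespace(text)); // [The, Quick-Brown, fox, is, 42]
        System.out.println(simple(text));     // [the, quick, brown, fox, is]
        System.out.println(stop(text));       // [quick, brown, fox]
        System.out.println(keyword(text));    // [The Quick-Brown fox is 42]
    }
}
```

Note how the number 42 and the hyphen survive only in the whitespace and keyword variants, while the stop-word variant additionally drops "the" and "is".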
Friday, August 5, 2011