This is the first part in a series of blog posts about Apache Lucene, based on my practical experience. I have been working with Lucene for the last year. This post mainly covers Lucene indexing and searching, along with some lesser-known facts about the index and performance.
Apache Lucene is a text-based search framework. It creates an index from the data to be searched, and the index can then be queried for that data. It is a very fast and efficient search framework that gives your application search-engine capabilities. It can also be used as a base for building a highly efficient data analysis application.
Index
The main part of Apache Lucene is its index. An index can live in different kinds of directories, such as a file-system-based directory (FSDirectory), a RAM-based directory (RAMDirectory) or an NIO-based directory (NIOFSDirectory). When you create an index, you add the results that should be found for certain search query terms. The results are known as Documents and the search query terms are known as Terms. So, when an index is created, each document is stored with the terms that point to it.
A document in the index is made up of fields. Each field can be indexed, stored in its original form, or both (this is configurable).
Documents and Terms
For example:
Document: The Girl and Boy.
Terms: Girl, Boy
Here, we indexed the document “The Girl and Boy” with the terms Girl and Boy. So, when the search query “Girl” is fired on the Lucene index, Lucene matches it against the indexed terms and finds the Term “Girl”, which points to the document “The Girl and Boy”.
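The term-to-document relationship can be sketched in a few lines of Python. This is only a toy model to illustrate the idea, not Lucene's API: the `ToyIndex` class and its methods are hypothetical names made up for this example, and the terms are passed in by hand instead of being produced by an analyzer.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical toy inverted index: each term points to document ids."""

    def __init__(self):
        self.documents = []               # doc id -> original text
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add_document(self, text, terms):
        """Store the document and make every given term point to it."""
        doc_id = len(self.documents)
        self.documents.append(text)
        for term in terms:
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, term):
        """Return the documents that the term points to."""
        return [self.documents[i] for i in sorted(self.postings.get(term, ()))]


index = ToyIndex()
index.add_document("The Girl and Boy.", ["Girl", "Boy"])
index.add_document("The Boy and Dog.", ["Boy", "Dog"])

print(index.search("Girl"))  # ['The Girl and Boy.']
print(index.search("The"))   # [] because "The" was never indexed as a term
```

Searching is then a lookup from term to documents, which is why it stays fast even when there are many documents.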
Tokenizers and Filters
One point left unexplained is how “The” and “and” were kept from being indexed as terms of “The Girl and Boy”. The answer is that Lucene provides many tokenizers and token filters to control what gets indexed. A tokenizer splits the text into tokens; KeywordTokenizer, WhitespaceTokenizer and CharTokenizer are a few of these. A token filter then modifies or drops tokens; a stop-word filter such as StopFilter is what removes common words like “The” and “and”. These classes can easily be extended to create something more specific to your application.
There are also different analyzers, which process the data before indexing and searching. An analyzer basically transforms the data into a specific form, for example by lowercasing it. It is highly recommended to use the same analyzer for indexing and searching, so that query terms are transformed in the same way as the indexed terms.
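The chain of a tokenizer followed by filters can be sketched as below. This is a rough Python illustration of the concept, not Lucene code: the function names and the two-word stop list are assumptions made for this example.

```python
import re

# Assumed stop list for this example only; real stop lists are much longer.
STOP_WORDS = frozenset({"the", "and"})


def tokenize(text):
    """Tokenizer step: split the text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase_filter(tokens):
    """Filter step: normalize every token to lower case."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Filter step: drop tokens that are on the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Analyzer: tokenizer followed by the filters, in a fixed order."""
    return stop_filter(lowercase_filter(tokenize(text)))


print(analyze("The Girl and Boy."))  # ['girl', 'boy']
print(analyze("GIRL"))               # ['girl']
```

Because the query “GIRL” goes through the same `analyze` function as the document, it ends up as the same term, `girl`, that was indexed. With different analyzers on the two sides the terms might not line up.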
NGramTokenizer (Quite useful)
It is used for fuzzy matching. It breaks the input into consecutive, overlapping pieces of a specific size. This helps in matching when the search query contains spelling mistakes.
For example, with a size of 3, “Friend” produces the terms “Fri”, “rie”, “ien” and “end”. If the same tokenizer is used for indexing and searching, the search query is broken up in the same way before searching.
If the search query is “Frend”, it is broken into “Fre”, “ren” and “end”, so “end” still matches. The more terms match, the more prominent the result is.
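Here is a small Python sketch of the n-gram idea, assuming a fixed gram size of 3. The helper names `ngrams` and `shared_grams` are made up for this illustration and are not part of Lucene.

```python
def ngrams(text, n=3):
    """Break the text into consecutive, overlapping pieces of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shared_grams(a, b, n=3):
    """Return the n-grams that both inputs have in common."""
    return sorted(set(ngrams(a, n)) & set(ngrams(b, n)))


print(ngrams("Friend"))                 # ['Fri', 'rie', 'ien', 'end']
print(ngrams("Frend"))                  # ['Fre', 'ren', 'end']
print(shared_grams("Friend", "Frend"))  # ['end']
```

A misspelled query still shares at least one gram with the indexed word, so the document can be found, and a query with more grams in common would rank higher.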
Lucene also supports phonetic searching. You just need to index the data using a phonetic algorithm such as Soundex through an analyzer.
Score
When a query is fired on a Lucene index, it may return more than one Document. The documents are ranked by something known as the Lucene score. The score depends on many things.
These include whether the match is exact or fuzzy and how closely the document matches the query terms. It also depends on one more thing: boost.
Boost
Document boost
A document can be boosted, which means it can be made more or less important while searching. Documents can be prioritized at indexing time, which sets a permanent boost.
Query boost
A query can also be boosted, so the documents matched by the boosted query get a greater score. Mostly, queries on specific fields are boosted.
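To see how the two kinds of boost can change the ranking, here is a deliberately simplified Python sketch. The scoring rule (sum of the boosts of the matching query terms, multiplied by the document boost) is an assumption chosen for this illustration and is not Lucene's actual scoring formula. The document names and boost values are also made up.

```python
def score(doc_terms, query, doc_boost=1.0):
    """Toy score: document boost times the summed boosts of matching terms.

    `query` maps each query term to its boost. This is NOT Lucene's formula.
    """
    return doc_boost * sum(boost for term, boost in query.items() if term in doc_terms)


def rank(docs, query):
    """Return document names ordered from highest to lowest toy score."""
    return sorted(docs, key=lambda name: score(docs[name][0], query, docs[name][1]),
                  reverse=True)


# name -> (terms, document boost set at indexing time)
docs = {
    "doc_a": ({"girl", "boy"}, 1.0),
    "doc_b": ({"girl", "dog"}, 2.0),  # this document was boosted
}

# Both documents match "girl"; the document boost puts doc_b first.
print(rank(docs, {"girl": 1.0}))              # ['doc_b', 'doc_a']

# Boosting the query term "boy" pushes doc_a back to the top.
print(rank(docs, {"girl": 1.0, "boy": 3.0}))  # ['doc_a', 'doc_b']
```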
Query
A Lucene query is the way to search the Lucene index. A query runs against the fields of the index. It can be configured to do an exact match or a fuzzy match. Lucene queries support the boolean operators AND, OR and NOT. The Apache Lucene API provides classes to build queries, so you don’t need to write the query string yourself.
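Conceptually, combining query terms comes down to set operations on the documents each term points to. The Python sketch below illustrates that idea with a hand-made term-to-document table. It is not Lucene's query API, and the helper names are hypothetical.

```python
# Hand-made postings for this example: term -> ids of documents containing it.
postings = {
    "girl": {0, 1},
    "boy": {0, 2},
    "dog": {2},
}


def docs_for(term):
    """Return the ids of the documents a term points to (empty if unknown)."""
    return set(postings.get(term, ()))


def query_and(a, b):
    """Documents that contain both terms."""
    return docs_for(a) & docs_for(b)


def query_or(a, b):
    """Documents that contain at least one of the terms."""
    return docs_for(a) | docs_for(b)


def query_not(a, b):
    """Documents that contain the first term but not the second."""
    return docs_for(a) - docs_for(b)


print(sorted(query_and("girl", "boy")))  # [0]
print(sorted(query_or("girl", "boy")))   # [0, 1, 2]
print(sorted(query_not("girl", "boy")))  # [1]
```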
Performance
Search performance can increase several times if a RAMDirectory is used to load the index. The whole index is loaded into memory, so very little disk I/O is needed during a search, which increases performance.
I have also seen that using a Linux machine improves performance a lot, which I attribute to Linux handling paging better.
Multiple indexes
Lucene also supports reading from multiple indexes at a time. There is a MultiSearcher in Lucene which wraps several searchers, one per index, and queries all the indexes at the same time.
Index Generation
Lucene index creation is a continuous process of creating small indexes called segments and then merging them. Optimizing a Lucene index merges all its segments into a single one. This improves performance a lot: instead of searching each segment separately and then merging the results, the whole index is searched at once.
I will be writing more about Apache Lucene in the future.
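The difference between searching several segments and searching one merged index can be sketched in Python as follows. This is a simplified model, not how Lucene stores segments: each segment is just a dictionary from term to document ids, and the document ids are assumed to be unique across segments.

```python
def search_segments(segments, term):
    """Search every segment separately, then combine the results."""
    hits = set()
    for segment in segments:
        hits.update(segment.get(term, ()))
    return sorted(hits)


def merge_segments(segments):
    """Merge all segments into one term -> sorted doc ids mapping."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged.setdefault(term, set()).update(doc_ids)
    return {term: sorted(ids) for term, ids in merged.items()}


# Two small segments, as if created by two separate batches of documents.
segments = [
    {"girl": [0], "boy": [0, 1]},
    {"boy": [2], "dog": [2, 3]},
]

merged = merge_segments(segments)

print(search_segments(segments, "boy"))  # [0, 1, 2], one lookup per segment
print(merged["boy"])                     # [0, 1, 2], a single lookup
```

Both approaches return the same documents. After the merge, the per-segment loop and the step that combines the results are no longer needed at search time.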





3 Comments
I think your NGramTokenizer example is wrong. Friend has 3-grams: Fri, rie, ien, end – no “Fre”
Afair, Lucene automatically uses Levenshtein/Hamming distance when creating fuzzy queries.
Looking forward to next in series.
Cheers,
Soren
nice one, it’s always good to follow someone who is working on this… thanks for writing.
i have bookmarked it at http://www.webyam.com/bookmarks.php?tag=lucene hoping more posts will come
This would surely be worth reading for those who are working on Lucene.. keep it up!!!