Crude Java and Agile thoughts
Category Archives: Lucene
Maven Lucene Plugin is an open source Maven plugin for Apache Lucene developed by Xebia India IT Architects Ltd. The project is hosted on SourceForge and can be found here. It is released under the GNU General Public License (GPL).
maven-lucene-plugin creates a Lucene index from a file source. The structure of the Lucene index (fields, analyzers, indexLocation, fileSourceLocation, store, etc.) can be configured in a configuration file, lucene.xml.
lucene.xml contains all the information about the Lucene index and the data source from which the index is created. The Maven Lucene Plugin looks for the lucene.xml file in the project root directory (next to pom.xml) and creates the Lucene index from the file source named in lucene.xml, based on the index configuration provided.
The first version of Maven Lucene Plugin has been released. The plugin is an open source project hosted at SourceForge. It can create indexes from a file data source, and the index can be configured by specifying elements in a file, lucene.xml. It also provides a Maven dependency, maven lucene search, which offers utility methods for working with the created index.
The plugin lets you use the strong capabilities of Apache Lucene with very limited or no knowledge of Lucene's technical internals. The complete documentation on using the Maven Lucene Plugin can be found here.
The plugin is available in the Central Maven Repository.
JSpider-tool is a set of utilities built on top of the JSpider application. JSpider is an open source product written in Java and is available under the LGPL. JSpider-tool can be used to perform basic crawling tasks. JSpider, along with its sources, can be downloaded from here. After extracting it, jspider-tool is found as a utility in the bin folder.
Functionality available with JSpider-tool:
- Can print the headers sent by a web server
- Can display information about a web resource
- Can display the content of a web resource
- Can download a certain file from a web server to a local file
- Can find all links to other resources in a certain page
- Can find all e-mail addresses mentioned in a web page
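To give a feel for what the last two items involve, here is a minimal sketch in plain Java. It is not JSpider's implementation and does not use its API: the class name PageScanner, its methods, and the simplified regular expressions are all assumptions made for illustration, and a real crawler would use a proper HTML parser instead.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JSpider: scans an HTML string with regexes. */
class PageScanner {

    // Simplified pattern: the quoted value of an href attribute.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Simplified pattern: something@domain.tld (not a full RFC 5322 matcher).
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns the distinct link targets in order of appearance, skipping mailto: links. */
    public static Set<String> findLinks(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(1);
            if (!target.toLowerCase().startsWith("mailto:")) {
                links.add(target);
            }
        }
        return links;
    }

    /** Returns the distinct e-mail addresses in order of appearance. */
    public static Set<String> findEmails(String html) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(html);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }
}
```

Calling PageScanner.findLinks(html) and PageScanner.findEmails(html) on the text of a downloaded page returns each link or address once, in the order it first appears.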
The ternary search tree (TST) is a 3-way tree. It finds all keys having a given prefix, suffix, or infix. It even finds those keys that closely match a given pattern. You can easily search the tree for partial matches. In addition, you can implement near-match functions, which give you the ability to suggest alternatives for misspelled words.
A TST stores key-value pairs, where keys are strings and values are objects. TST keys are stored and retrieved in sorted order, regardless of the order in which they are inserted into the tree. In addition, TSTs use memory efficiently to store large quantities of data. Best of all, the ternary search tree is lightning fast. The tremendous flexibility of TSTs provides ample opportunity for programming creatively.
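Searching a ternary search tree is an example of a divide and conquer algorithm: each comparison narrows the search to one of three subtrees. Structurally, it is a k-ary tree with k=3.
It is a 3-way tree in which every node’s left subtree has keys less than the node’s key, every middle subtree has keys equal to the node’s key, and every right subtree has keys greater than the node’s key. In practice each node holds a single character, so these comparisons are made one character of the search key at a time.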
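The structure described above can be sketched in Java as follows. This is a minimal illustration, not a production implementation: the class name TernarySearchTree and its methods put, get and keysWithPrefix are names chosen for this example, keys are assumed to be non-empty strings, and a null value is used to mean "no key ends here".

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal ternary search tree mapping non-empty string keys to values. */
class TernarySearchTree<V> {

    /** Each node holds one character of a key and three child links. */
    private static final class Node<V> {
        final char ch;
        Node<V> left, mid, right;
        V value; // non-null only on the node where a key ends

        Node(char ch) { this.ch = ch; }
    }

    private Node<V> root;

    /** Inserts the key, or overwrites its value if it is already present. */
    public void put(String key, V value) {
        if (key == null || key.isEmpty()) {
            throw new IllegalArgumentException("key must be non-empty");
        }
        root = put(root, key, 0, value);
    }

    private Node<V> put(Node<V> node, String key, int i, V value) {
        char c = key.charAt(i);
        if (node == null) node = new Node<>(c);
        if (c < node.ch) node.left = put(node.left, key, i, value);
        else if (c > node.ch) node.right = put(node.right, key, i, value);
        else if (i < key.length() - 1) node.mid = put(node.mid, key, i + 1, value);
        else node.value = value;
        return node;
    }

    /** Returns the value stored for the key, or null if the key is absent. */
    public V get(String key) {
        if (key == null || key.isEmpty()) return null;
        Node<V> node = find(key);
        return node == null ? null : node.value;
    }

    /** Walks down to the node matching the last character of s, or returns null. */
    private Node<V> find(String s) {
        Node<V> node = root;
        int i = 0;
        while (node != null) {
            char c = s.charAt(i);
            if (c < node.ch) node = node.left;
            else if (c > node.ch) node = node.right;
            else if (i < s.length() - 1) { node = node.mid; i++; }
            else return node;
        }
        return null;
    }

    /** Returns every key that starts with the prefix, in sorted order. */
    public List<String> keysWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        if (prefix == null || prefix.isEmpty()) {
            collect(root, new StringBuilder(), out);
            return out;
        }
        Node<V> node = find(prefix);
        if (node == null) return out;
        if (node.value != null) out.add(prefix);
        collect(node.mid, new StringBuilder(prefix), out);
        return out;
    }

    /** In-order walk: left subtree, this character and its middle subtree, right subtree. */
    private void collect(Node<V> node, StringBuilder path, List<String> out) {
        if (node == null) return;
        collect(node.left, path, out);
        path.append(node.ch);
        if (node.value != null) out.add(path.toString());
        collect(node.mid, path, out);
        path.deleteCharAt(path.length() - 1);
        collect(node.right, path, out);
    }
}
```

Because the traversal visits the left, middle and right subtrees in that order, keysWithPrefix returns keys sorted regardless of insertion order, which matches the behaviour described earlier. Suffix, infix and near-match searches are left out of this sketch.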
This is the first part in a series of blog posts about Apache Lucene, based on my practical experience. I have been working with Lucene for the past year. This post mainly covers Lucene indexing and searching, along with some lesser-known facts about the index and performance.
Apache Lucene is a text-based search framework. It creates an index of the data to be searched, and the index can then be queried for data. It's a very fast and efficient search framework which gives your application search-engine-like capability. It can also be used as a base to create a highly efficient data analysis application.
The main part of Apache Lucene is its index. An index can be file-system based, RAM based, NIO based, and so on. When you create an index, you add the results that should be found for certain search query terms. The results are known as Documents and the search query terms are known as Terms. So, when an index is created, the documents are specified along with the terms pointing to them.
The index contains fields, which hold either the indexed form of the data or the original data itself (it is configurable whether to only index the data or to store it too).
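The idea of terms pointing to documents, and of fields that are indexed, stored, or both, can be illustrated with a toy inverted index in plain Java. This is only a conceptual sketch and deliberately does not use Lucene's API: the class ToyInvertedIndex, its methods, and the simple lower-casing tokenizer are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of an index: terms point to documents, fields may be indexed and/or stored. */
class ToyInvertedIndex {

    // "field:term" -> ids of the documents that contain the term in that field
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // document id -> original values of the fields that were marked as stored
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    /** Creates an empty document and returns its id. */
    public int addDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    /** Adds a field to a document, indexing its terms and/or storing its original text. */
    public void addField(int docId, String field, String text, boolean indexed, boolean stored) {
        if (stored) {
            storedFields.get(docId).put(field, text);
        }
        if (indexed) {
            // Very simple analysis step: lower-case, split on anything that is not a letter or digit.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    /** Returns the ids of the documents whose given field contains the term. */
    public SortedSet<Integer> search(String field, String term) {
        SortedSet<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(hits);
    }

    /** Returns the original text of a stored field, or null if the field was not stored. */
    public String storedValue(int docId, String field) {
        return storedFields.get(docId).get(field);
    }
}
```

A field added with indexed=true and stored=false can be found by search but its original text cannot be read back through storedValue, which is the trade-off between only indexing the data and storing it too.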
Documents and Terms