Jump-start Lucene: Indexing and Search

This post explains what Apache Lucene is, how it works, and some of its useful features. It should help anyone who wants to get started with Lucene quickly, as well as anyone who has used Lucene before and would like to benefit from our experience with Lucene v2.4.1.

Introduction

Apache Lucene is an open-source, Java-based indexing and searching technology. It suits any application that requires full-text search, especially a cross-platform one. Lucene works by first indexing the data to be searched and then using the index for searching.

[Figure: Lucene diagram]

Lucene Indexing Basics

Indexing is a process of converting text data into a format that facilitates rapid searching. Lucene stores the data in the form of an inverted index. An analogy is an index at the end of a book; the index points to the location of topics that appear in the book.

“Terms” are created from the input and point to the “Documents” that contain them. The “Terms” are indexed; any search is performed on the “Terms”, and the related “Documents” are fetched.
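The term-to-document mapping can be sketched in a few lines of plain Java. This is only an illustration of the inverted-index idea, not Lucene's implementation; the class name InvertedIndexSketch and its methods are hypothetical, and the tokenization (lowercase, split on non-alphanumeric characters) is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Plain-Java illustration of an inverted index; not Lucene's implementation. */
final class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text, splits it on non-alphanumeric characters and records each term. */
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the term (empty if there are none). */
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Lucene is a search library");
        index.addDocument(1, "Full-text search with an inverted index");
        System.out.println(index.search("search")); // [0, 1]
        System.out.println(index.search("lucene")); // [0]
    }
}
```

A search never scans the documents themselves; it only looks up the term and follows the pointers, which is what makes the lookup fast.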

A Lucene index is built over an implementation of the Directory class. Lucene supports multiple Directory implementations:

  • RAMDirectory: The complete index is kept in memory.
  • FSDirectory: The complete index is stored on the file system. Some part of the index is kept in cache.
  • NIOFSDirectory: An FSDirectory variant that uses java.nio so that multiple threads can read the same file concurrently.

The “Terms” created during indexing are governed by “Analyzers”. Let's see what they are.

Analysis

The input to be indexed is analyzed to extract the terms on which searching can be done, for example by extracting the words, removing common words, and changing words to lower case. Analyzers are used both at indexing time and at search time. It is highly recommended to use the same analyzer at index creation and at search time, so that the tokens created for searching are created in the same way as the tokens that were indexed.

Analyzers internally use Tokenizers and Filters. An analyzer takes text as input and returns a stream of tokens.
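The tokenizer-plus-filters chain can be pictured with a small plain-Java sketch. It does not use Lucene's Analyzer classes; AnalyzerSketch, its analyze method and its stop-word list are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java illustration of an analysis chain; not Lucene's Analyzer API. */
final class AnalyzerSketch {
    // Assumed stop-word list, taken from the examples mentioned in this post
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenizer step: split on anything that is not a letter or a digit
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            // Filter step 1: lowercase the token
            String token = raw.toLowerCase(Locale.ROOT);
            // Filter step 2: drop empty tokens and stop words
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [quick, fox, jumps, lucene, index]
        System.out.println(analyze("The Quick Fox jumps to the Lucene index"));
    }
}
```

Running the indexed text and the query text through the same analyze method is what keeps the tokens comparable: "LUCENE" in a query and "Lucene" in a document both become "lucene".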

Lucene provides some out-of-the-box analyzers that come in handy. In addition, custom analyzers can be created. Let's have a look at some of the Analyzers/Tokenizers/Filters provided by Lucene.

Useful Analyzers and Filters

SimpleAnalyzer

Tokenizes the string into a set of words and converts them to lower case.

StandardAnalyzer

Tokenizes the string into a set of words, identifying acronyms, email addresses, host names, etc., converts them to lower case, and discards the basic English stop words (a, an, the, to). It does not stem words; stemming needs a separate filter or analyzer.

WhitespaceAnalyzer

Tokenizes the data on white space.

NGramTokenFilter (quite useful for fuzzy searches)

Tokenizes the input into n-grams of the given size. It is used for fuzzy matching, which helps find matches when the search text contains spelling mistakes. For example, the input text "Lucene" with an n-gram size of 3 yields the tokens "Luc", "uce", "cen" and "ene". If the user enters "Tucene" while searching, the word "Lucene" can still be shown as a suggestion, because "Tucene" shares the tokens "uce", "cen" and "ene", which point to the Document "Lucene" in the index.
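The "Lucene" versus "Tucene" example can be reproduced with a short plain-Java sketch. It only illustrates n-gram generation and overlap; it is not the Lucene filter, and the names NGramSketch, ngrams and shared are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of n-gram tokens; not Lucene's NGramTokenFilter. */
final class NGramSketch {
    /** Returns every substring of the given size, sliding one character at a time. */
    static List<String> ngrams(String text, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + size <= text.length(); i++) {
            grams.add(text.substring(i, i + size));
        }
        return grams;
    }

    /** Returns the n-grams of the first text that also occur in the second text. */
    static Set<String> shared(String first, String second, int size) {
        Set<String> common = new LinkedHashSet<>(ngrams(first, size));
        common.retainAll(ngrams(second, size));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Lucene", 3));           // [Luc, uce, cen, ene]
        System.out.println(shared("Tucene", "Lucene", 3)); // [uce, cen, ene]
    }
}
```

Three of the four trigrams survive the typo, so a query built from the trigrams of "Tucene" still reaches the document that contains "Lucene".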

EdgeNGramTokenFilter

Tokenizes the input into n-grams of the given size and, in addition, lets you specify the side of the input (front or back) from which the n-grams are generated.
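A minimal plain-Java sketch of edge n-grams follows. The class EdgeNGramSketch, the Side enum and the minSize/maxSize parameters are assumptions made for illustration and are not taken from the Lucene filter itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of edge n-grams; not Lucene's EdgeNGramTokenFilter. */
final class EdgeNGramSketch {
    enum Side { FRONT, BACK }

    /** Returns the n-grams of sizes minSize..maxSize anchored at one edge of the text. */
    static List<String> edgeNGrams(String text, Side side, int minSize, int maxSize) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxSize, text.length());
        for (int size = Math.max(1, minSize); size <= upper; size++) {
            grams.add(side == Side.FRONT
                    ? text.substring(0, size)
                    : text.substring(text.length() - size));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("Lucene", Side.FRONT, 1, 3)); // [L, Lu, Luc]
        System.out.println(edgeNGrams("Lucene", Side.BACK, 1, 3));  // [e, ne, ene]
    }
}
```

Front-edge n-grams are a common fit for prefix matching such as type-ahead suggestions, since every prefix of the word becomes a searchable token.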

LengthFilter

Filters out the tokens that are too short or too long from the stream of tokens.

Lucene can also be used to perform phonetic searching (based on the sound of human speech). The data needs to be indexed with a phonetic algorithm such as Soundex, applied through a custom analyzer.
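To make the phonetic idea concrete, here is a sketch of the classic American Soundex encoding in plain Java. This post does not say which phonetic implementation was used, so treat this as one possible encoder; the class name SoundexSketch is hypothetical, and in Lucene the encoding would be applied inside a custom analyzer's filter rather than called directly.

```java
import java.util.Locale;

/** Plain-Java sketch of American Soundex; one possible phonetic encoder. */
final class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char previous = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters that share a code
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != previous) {
                out.append(code);
            }
            previous = code; // a vowel resets the previous code to '0'
        }
        while (out.length() < 4) {
            out.append('0'); // pad to one letter plus three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert")); // R163
        System.out.println(soundex("Rupert")); // R163
        System.out.println(soundex("Lucene")); // L250
        System.out.println(soundex("Lusene")); // L250
    }
}
```

Indexing the code instead of (or alongside) the original word lets similar-sounding spellings such as "Lucene" and "Lusene" match the same term.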

Adding data to an index

A Lucene index consists of “Documents”, which contain “Fields”.

Field

A Field represents a piece of data that is queried or retrieved in a search. The Field class encapsulates a field name and its value. Lucene provides options to specify whether a field is indexed, whether it is analyzed, and whether its value is stored. A Field contains the Terms that are used for querying. Each Field carries semantics about how it is created and stored (tokenized, untokenized, raw data, compressed, etc.).

Document

A Document is a collection of Fields. Terms point to Documents. Data that needs to be retrieved through searching must be stored in the form of Documents. We'll talk more about Terms when we talk about searching.

IndexWriter

A class that either creates or maintains an index. Its constructor accepts a Boolean that determines whether a new index is created or whether an existing index is opened.

Changes made to the index are initially buffered in memory and periodically flushed to the index directory. IndexWriter exposes several settings that control how changes are buffered in memory and written to disk. Changes made to the index are not visible to an IndexReader until the commit or close method of IndexWriter is called. IndexWriter creates a lock file for the directory to prevent index corruption by simultaneous index updates.
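The buffer-then-commit behaviour can be modelled with a toy plain-Java class. This is a conceptual sketch only: SnapshotSketch and its methods are hypothetical, documents are reduced to strings, and nothing here reflects how Lucene actually lays out its buffers or files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of buffered writes and commit visibility; not Lucene's IndexWriter. */
final class SnapshotSketch {
    private final List<String> buffered = new ArrayList<>();  // uncommitted changes
    private List<String> committed = Collections.emptyList(); // what new readers see

    /** Buffers a document; readers cannot see it yet. */
    void addDocument(String doc) {
        buffered.add(doc);
    }

    /** Publishes the buffered documents as a new immutable snapshot. */
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(buffered);
        buffered.clear();
        committed = Collections.unmodifiableList(next);
    }

    /** A "reader" here is simply the snapshot that was committed when it was opened. */
    List<String> openReader() {
        return committed;
    }

    public static void main(String[] args) {
        SnapshotSketch writer = new SnapshotSketch();
        writer.addDocument("doc-1");
        List<String> before = writer.openReader();
        writer.commit();
        List<String> after = writer.openReader();
        System.out.println(before); // []
        System.out.println(after);  // [doc-1]
    }
}
```

The reader opened before the commit keeps seeing the old, empty snapshot; only a reader opened afterwards sees the new document.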

The indexes created by Lucene can be viewed and modified using Luke, a useful open source tool for viewing indexes.

Code Snippet for adding data to index

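import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

//Step 1. Instantiate the directory, analyzer and index writer.
//Create an instance of Directory where the index files will be stored
Directory fsDirectory = FSDirectory.getDirectory(indexDirectoryPath);
//Create an instance of the analyzer, which will be used to tokenize the input data
Analyzer standardAnalyzer = new StandardAnalyzer();
//Create an instance of IndexWriter; true creates a new index, false opens an existing one
boolean create = true;
IndexWriter indexWriter = new IndexWriter(fsDirectory, standardAnalyzer, create, IndexWriter.MaxFieldLength.UNLIMITED);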
//Step 2. Prepare the data for indexing.
String name = "Test Name";
String emailId = "test@test.com";
String subject = "test subject";
//Step 3. Wrap the data in the Fields and add them to a Document.
Field nameField = new Field("name",name,Field.Store.YES,Field.Index.NOT_ANALYZED);
Field emailIdField = new Field("emailId",emailId,Field.Store.NO,Field.Index.NOT_ANALYZED);
Field subjectField = new Field("subject",subject,Field.Store.YES,Field.Index.ANALYZED);
Document doc = new Document();
// Add these fields to a Lucene Document
doc.add(nameField);
doc.add(emailIdField);
doc.add(subjectField);

//Step 4. Add this document to Lucene Index.
indexWriter.addDocument(doc);
//Step 5. Optimize Lucene Index.
indexWriter.optimize();
//Step 6. Close Lucene Index.
indexWriter.close();

Document Deletion

Lucene provides the IndexReader abstract class, which contains methods for deleting documents from an index. Lucene internally refers to documents by document numbers, which can change as documents are added to or deleted from the index. The document number is used to access a document in the index. An IndexReader always sees the snapshot of the index as of the time it was opened. Any later changes to the index are not visible until the IndexReader is reopened.

// Delete documents from the index
IndexReader indexReader = IndexReader.open(indexDirectory);
indexReader.deleteDocuments(new Term("name","Lucene"));
//Close the associated index files and save the deletions to disk
indexReader.close();

Lucene Index Optimization

Index optimization merges all the segments present in an index into a single segment. Optimizing an index is recommended because it improves search response time. Since merging is expensive, optimize once, just before closing the IndexWriter.
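Conceptually, merging segments means folding several small term-to-documents maps into one. The plain-Java sketch below shows that idea under a simplifying assumption that document ids are already unique across segments; SegmentMergeSketch is a hypothetical name, and Lucene's real segment files and merge logic are far more involved.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of merging index segments; not Lucene's merge implementation. */
final class SegmentMergeSketch {
    /** Folds the postings of every segment into a single sorted map. */
    static SortedMap<String, SortedSet<Integer>> merge(
            List<Map<String, SortedSet<Integer>>> segments) {
        SortedMap<String, SortedSet<Integer>> merged = new TreeMap<>();
        for (Map<String, SortedSet<Integer>> segment : segments) {
            for (Map.Entry<String, SortedSet<Integer>> entry : segment.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> first = new HashMap<>();
        first.put("lucene", new TreeSet<>(Arrays.asList(0, 1)));
        Map<String, SortedSet<Integer>> second = new HashMap<>();
        second.put("lucene", new TreeSet<>(Arrays.asList(2)));
        second.put("search", new TreeSet<>(Arrays.asList(2)));
        // Prints {lucene=[0, 1, 2], search=[2]}
        System.out.println(merge(Arrays.asList(first, second)));
    }
}
```

Before the merge, a lookup for "lucene" has to consult both segments; afterwards a single lookup answers it, which is the intuition behind the faster search response.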

Lucene Searching and Retrieval

Searching is the process of finding the documents that match the input text. The input text is processed through the chain of analyzers to generate the set of terms to be queried, and those terms are used in the query. Lucene scores each match using a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model, to determine how relevant a given Document is to the user's query.
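The Vector Space Model part of the score rests on term frequency (tf) and inverse document frequency (idf). The plain-Java sketch below uses the square root of the term frequency and 1 + ln(numDocs / (docFreq + 1)), which is our understanding of the shapes used by Lucene's default similarity; treat the exact formulas as an assumption. Lucene's full score also includes coordination, normalization and boost factors that are left out here, and the name TfIdfSketch is hypothetical.

```java
/** Simplified tf-idf scoring for one term in one document; not Lucene's full formula. */
final class TfIdfSketch {
    /** Term frequency factor: grows with repeated occurrences, but sub-linearly. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency factor: rarer terms get a higher weight. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Assumed simplified per-term score: tf * idf^2, ignoring norms and boosts. */
    static double score(int freq, int docFreq, int numDocs) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf;
    }

    public static void main(String[] args) {
        double rare = score(1, 1, 1000);     // term found in 1 of 1000 documents
        double common = score(1, 500, 1000); // term found in 500 of 1000 documents
        System.out.println(rare > common);   // true
    }
}
```

A match on a rare term therefore contributes more to the relevance of a document than a match on a term that appears in half the index.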

Core Lucene Search classes

Term

A Term is composed of two elements: the name of a field and the text of a word within that field. A query is broken into terms and operators.

Query

Query is an abstract base class. Searching for a specified word or phrase involves wrapping it in a term, adding the term to a query object, and passing this query object to IndexSearcher's search method.

Lucene comes with various types of concrete query implementations, such as TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, SpanQuery, etc.

Searcher

Searcher is an abstract base class that has various overloaded search methods. The search method returns an ordered collection of documents ranked by computed score. Lucene calculates a score for each document that matches a given query. You can limit the results to the top N documents by passing that number to IndexSearcher's search method.

IndexSearcher is thread-safe; a single instance can be used by multiple threads concurrently.

Searcher.explain(Query query, int doc) functionality is quite useful in debugging why a particular score is returned.

Code Snippet for searching using TermQuery & BooleanQuery

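import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

//Step 1. Instantiate the searcher
Searcher indexSearcher = new IndexSearcher(indexDirectory);
//Step 2a. Create a query using TermQuery
Query termQuery = new TermQuery(new Term("name","lucene"));
//Step 2b. Create a query using BooleanQuery; both clauses must match.
//A TermQuery matches a single indexed term exactly, so each clause holds one term;
//a multi-word phrase on an analyzed field needs a PhraseQuery instead.
Query query1 = new TermQuery(new Term("name","lucene"));
Query query2 = new TermQuery(new Term("subject","search"));
BooleanQuery query = new BooleanQuery();
query.add(query1,BooleanClause.Occur.MUST);
query.add(query2,BooleanClause.Occur.MUST);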
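What the MUST clauses mean for the result set can be shown with plain set operations. This is a conceptual sketch, not Lucene's BooleanQuery implementation; BooleanClauseSketch and its methods are hypothetical, and the sets stand for the document ids matched by each term.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Set-based illustration of boolean clauses; not Lucene's BooleanQuery. */
final class BooleanClauseSketch {
    /** Both clauses required: keep the documents matched by both terms. */
    static SortedSet<Integer> must(Set<Integer> a, Set<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** Either clause is enough: keep the documents matched by at least one term. */
    static SortedSet<Integer> should(Set<Integer> a, Set<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** Exclusion: drop the documents matched by the prohibited term. */
    static SortedSet<Integer> mustNot(Set<Integer> matches, Set<Integer> prohibited) {
        SortedSet<Integer> result = new TreeSet<>(matches);
        result.removeAll(prohibited);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> nameMatches = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> subjectMatches = new TreeSet<>(Arrays.asList(2, 3, 4));
        System.out.println(must(nameMatches, subjectMatches));    // [2, 3]
        System.out.println(should(nameMatches, subjectMatches));  // [1, 2, 3, 4]
        System.out.println(mustNot(nameMatches, subjectMatches)); // [1]
    }
}
```

With two MUST clauses, as in the snippet above, only documents matched by both terms are returned.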

Lucene also supports searching across multiple indexes. To do so, instantiate a MultiSearcher, passing it an array of searchers that point to the different indexes.

//Instantiate the multi-searcher
Searcher index1Searcher = new IndexSearcher(index1Directory);
Searcher index2Searcher = new IndexSearcher(index2Directory);
//Pass both searchers to the MultiSearcher
Searcher[] searchers = new Searcher[] { index1Searcher, index2Searcher };
Searcher multiSearcher = new MultiSearcher(searchers);
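The essence of searching several indexes is merging the per-index hits into one ranked list. The plain-Java sketch below illustrates that merge step only; MultiSearchSketch and its Hit class are hypothetical, and it assumes scores from different indexes are directly comparable, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of merging hits from several indexes; not Lucene's MultiSearcher. */
final class MultiSearchSketch {
    static final class Hit {
        final String index; // which index the hit came from
        final int doc;      // document number within that index
        final double score; // relevance score, assumed comparable across indexes

        Hit(String index, int doc, double score) {
            this.index = index;
            this.doc = doc;
            this.score = score;
        }

        @Override
        public String toString() {
            return index + ":" + doc;
        }
    }

    /** Merges the hit lists and returns the n highest-scoring hits. */
    static List<Hit> mergeTop(int n, List<List<Hit>> perIndexHits) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> first = Arrays.asList(new Hit("index1", 7, 0.9), new Hit("index1", 3, 0.4));
        List<Hit> second = Arrays.asList(new Hit("index2", 5, 0.7));
        System.out.println(mergeTop(2, Arrays.asList(first, second))); // [index1:7, index2:5]
    }
}
```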

ScoreDoc

A simple pointer to a document contained in the search results. This encapsulates the position of a document in the index and the score computed by Lucene.

TopDocs

Encapsulates the total number of search results and an array of ScoreDocs.

//Step 3. Execute the search query and retrieve the top 10 matching documents.
TopDocs topDocs = indexSearcher.search(query,10);
//Step 4. Retrieve the data from the top 10 documents.
ScoreDoc[] scoreDocsArray = topDocs.scoreDocs;
for(ScoreDoc scoredoc: scoreDocsArray){
     //Retrieve the matched document and show relevant details
     Document doc = indexSearcher.doc(scoredoc.doc);
     System.out.println("name["+ doc.getField("name").stringValue()+"]");
     System.out.println("subject["+ doc.getField("subject").stringValue()+"]");
}

Score Boosting

Lucene returns the search results with a score. The default boost factor is 1. Search results can be influenced by “boosting” at more than one level:

Index time boost

Lucene supports boosting documents while indexing, by calling document.setBoost() before a document is added to the index. It also supports boosting fields while indexing, by calling field.setBoost() before the field is added to the document.

Query time term boost

Lucene supports boosting at search time, by calling Query.setBoost() to set a boost on a query clause.
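The effect of a boost on ranking can be pictured as a multiplier on the score. The plain-Java sketch below treats every boost as a plain factor, which is a deliberate simplification of how Lucene folds boosts into its scoring; BoostSketch and boostedScore are hypothetical names.

```java
/** Simplified view of boosts as score multipliers; not Lucene's scoring code. */
final class BoostSketch {
    /** Multiplies the raw score by each boost; a boost of 1.0 leaves it unchanged. */
    static double boostedScore(double rawScore, double documentBoost,
                               double fieldBoost, double queryBoost) {
        return rawScore * documentBoost * fieldBoost * queryBoost;
    }

    public static void main(String[] args) {
        // A better raw match with default boosts
        double plain = boostedScore(0.8, 1.0, 1.0, 1.0);
        // A weaker raw match whose document was boosted by 2 at index time
        double boosted = boostedScore(0.5, 2.0, 1.0, 1.0);
        System.out.println(boosted > plain); // true
    }
}
```

In this example the boosted document overtakes a document with a better raw match, which is exactly how boosting influences the order of the search results.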

Conclusion

In our experience, Lucene has proved successful for building complex full-text search capabilities.
