What is Lucene?
Lucene is a high-performance, scalable Information Retrieval (IR) library, originally created by Doug Cutting. It provides indexing and searching features to applications. It suits nearly any application that requires full-text search, especially cross-platform ones.
Lucene is a mature, open-source project implemented in Java, available for free download. Moreover, it’s a member of the popular Apache Jakarta family of projects, licensed under the Apache Software License. One of the key factors behind Lucene’s popularity and success is its simplicity.
A lot of attention has been paid to keeping the indexing and search API simple. In fact, we don’t need in-depth knowledge of how Lucene’s indexing and retrieval work in order to start using it.
The core of the logical architecture is the concept of a document that contains fields of text. This flexibility allows Lucene’s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others, can all be indexed as long as their textual information can be extracted.
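To make the document/field idea concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not Lucene's API: the class `ToyFieldIndex` and its `add` and `search` methods are hypothetical names invented for illustration, and the tokenization is deliberately naive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: this is NOT Lucene's API, just the document/field idea.
class ToyFieldIndex {
    // Maps "field:term" to the ids of the documents that contain the term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<Map<String, String>> documents = new ArrayList<>();

    // A document is just a set of named text fields; the original file format no longer matters.
    int add(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Naive tokenization: lowercase, then split on non-word characters.
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.isEmpty()) continue;
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of the documents whose given field contains the term.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        // The text could come from a PDF, an HTML page, or a Word file; only the extracted text is indexed.
        index.add(Map.of("title", "Lucene in Action", "contents", "Text extracted from a PDF"));
        index.add(Map.of("title", "Search basics", "contents", "Lucene indexes extracted text"));
        System.out.println(index.search("title", "lucene"));
        System.out.println(index.search("contents", "lucene"));
    }
}
```

The point of the sketch is that the index only ever sees field names and text, which is why the source file format does not matter once the text has been extracted.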
Some of Lucene’s features:
- ranked search, that is, the best-matching results are returned first;
- multiple query types: phrase queries, wildcard queries, proximity queries, range queries and more;
- fielded search (e.g., title, author, contents);
- date-range searching;
- sorting by field;
- multiple-index search with merged results;
- simultaneous update and searching;
- index-compatible implementations available in multiple programming languages.
However, we have to be aware that Lucene by itself is not a ready-to-use application like a file-search program, a web crawler, or a web site search engine. Lucene is a software toolkit, not a full-featured search application.
But how can we exploit Lucene’s capabilities to build an efficient question answering system?
In fact, this tool can be very useful for the passage/document retrieval task: Lucene can retrieve relevant snippets of text for every question.
One simple approach is to create two indexes, one at paragraph level and one at document level. The paragraph index is more precise in terms of relevant text, and is preferred for snippet extraction. However, if the answer is not found in the paragraph index, the document index is used instead. Using the Lucene query engine, we extract a ranked list of snippets for every question.
The algorithm for answer extraction can be based on the Lucene scores. The paragraph with the highest Lucene score is very likely to contain the correct answer (although this is not guaranteed).
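The paragraph-first strategy with a document-level fallback can be sketched as follows. This is a plain Java illustration under stated assumptions, not Lucene code: `TwoLevelRetrieval`, `Hit`, `search`, and `retrieve` are hypothetical names, the score is a simple count of shared terms rather than Lucene's scoring, and "answer not found" is interpreted as "the paragraph-level search returned no hits".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TwoLevelRetrieval {
    // A retrieved snippet with a toy relevance score (NOT Lucene's scoring formula).
    static final class Hit {
        final String text;
        final int score;

        Hit(String text, int score) {
            this.text = text;
            this.score = score;
        }
    }

    // Naive tokenization: lowercase, split on non-word characters, drop empty tokens.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        result.remove("");
        return result;
    }

    // Ranks units by how many distinct query terms they contain; units with no match are dropped.
    static List<Hit> search(List<String> units, String query) {
        Set<String> queryTerms = terms(query);
        List<Hit> hits = new ArrayList<>();
        for (String unit : units) {
            Set<String> shared = terms(unit);
            shared.retainAll(queryTerms);
            if (!shared.isEmpty()) hits.add(new Hit(unit, shared.size()));
        }
        hits.sort((a, b) -> Integer.compare(b.score, a.score)); // highest score first
        return hits;
    }

    // Query the paragraph-level collection first; fall back to document level if it returns nothing.
    static List<Hit> retrieve(List<String> paragraphs, List<String> documents, String question) {
        List<Hit> hits = search(paragraphs, question);
        return hits.isEmpty() ? search(documents, question) : hits;
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of("Doug Cutting created Lucene.", "Java is a programming language.");
        List<String> documents = List.of("A long document about Apache projects and full-text search.");
        for (Hit hit : retrieve(paragraphs, documents, "Who created Lucene?")) {
            System.out.println(hit.score + " " + hit.text);
        }
    }
}
```

In a real system both collections would be Lucene indexes built from the same corpus at different granularities; the fallback logic itself stays this simple.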
Of course, there are refinement filters, which increase the Lucene score in the following cases:
- the paragraph contains the focus;
- the paragraph contains some of the named entities (the increase is directly proportional to the number of these named entities);
- if the expected answer type is Person, Organization, etc., we try to identify named entities of that type in the extracted paragraphs (and increase the Lucene score according to how many are found);
- if the question type is Definition, we prefer answers in definition form (which can be identified by a grammar).
After all criteria are applied, each paragraph has a final score, and the paragraph with the highest score is chosen.
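The re-scoring step can be sketched as below. This is only an illustration in plain Java: the class `AnswerRescoring`, the `Candidate` fields, and especially the bonus weights are assumptions made up for the example, since the text does not say how large each increase is. The features themselves (focus present, entity counts, definition form) are assumed to be computed by earlier steps.

```java
import java.util.List;

class AnswerRescoring {
    // Features of one retrieved paragraph; all field names are hypothetical.
    static final class Candidate {
        final String text;
        final double luceneScore;      // score returned by the retrieval step
        final boolean containsFocus;   // the paragraph mentions the question focus
        final int namedEntities;       // number of relevant named entities found in the paragraph
        final int answerTypeEntities;  // number of entities of the expected answer type (Person, ...)
        final boolean definitionForm;  // the paragraph matches a definition pattern

        Candidate(String text, double luceneScore, boolean containsFocus,
                  int namedEntities, int answerTypeEntities, boolean definitionForm) {
            this.text = text;
            this.luceneScore = luceneScore;
            this.containsFocus = containsFocus;
            this.namedEntities = namedEntities;
            this.answerTypeEntities = answerTypeEntities;
            this.definitionForm = definitionForm;
        }
    }

    // Illustrative weights only; these values are assumptions, not taken from the source.
    static final double FOCUS_BONUS = 0.5;
    static final double ENTITY_BONUS = 0.25;
    static final double ANSWER_TYPE_BONUS = 0.25;
    static final double DEFINITION_BONUS = 0.5;

    // Starts from the retrieval score and adds one bonus per refinement criterion.
    static double finalScore(Candidate c, boolean definitionQuestion) {
        double score = c.luceneScore;
        if (c.containsFocus) score += FOCUS_BONUS;
        score += ENTITY_BONUS * c.namedEntities;           // proportional to the entity count
        score += ANSWER_TYPE_BONUS * c.answerTypeEntities; // proportional to the typed-entity count
        if (definitionQuestion && c.definitionForm) score += DEFINITION_BONUS;
        return score;
    }

    // Returns the candidate with the highest final score, or null if the list is empty.
    static Candidate best(List<Candidate> candidates, boolean definitionQuestion) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            double score = finalScore(c, definitionQuestion);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Candidate plain = new Candidate("Top retrieval hit", 1.0, false, 0, 0, false);
        Candidate rich = new Candidate("Hit with focus and entities", 0.75, true, 1, 1, false);
        System.out.println(best(List.of(plain, rich), false).text);
    }
}
```

An additive bonus is only one possible design; a multiplicative boost would also fit the description, and the right weights would have to be tuned on real questions.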
References: Lucene in Action, Lucene website, Question Answering on English and Romanian Languages