Research in Question Answering (QA) systems has been driven forward by the Text Retrieval Conference (TREC) series since 1999. Almost all QA systems fielded at TREC employ some passage retrieval technique to reduce the relevant document set to a manageable number of passages. The following algorithms are useful to be aware of:
The algorithm presented by [Light et al., J. of Natural Language Engineering, Special Issue on QA, 2001] simply counts the number of terms a passage has in common with the query. Each sentence is treated as a separate passage. This algorithm represents the simplest passage retrieval technique and serves as a good baseline for comparison.
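A minimal sketch of this baseline (the tokenization and function names are my own, not from the paper):

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def word_overlap_score(query, sentence):
    """Number of distinct query terms that also appear in the sentence."""
    return len(set(tokenize(query)) & set(tokenize(sentence)))

def rank_sentences(query, sentences):
    """Treat each sentence as a passage; rank by term overlap, best first."""
    return sorted(sentences, key=lambda s: word_overlap_score(query, s),
                  reverse=True)
```

Despite its simplicity, this kind of overlap count is a surprisingly strong reference point for evaluating more elaborate scoring functions.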
Sliding Window scored with Okapi BM25
The Okapi BM25 weighting scheme [Robertson et al., TREC 4] represents the state of the art in document retrieval. It is based on the probabilistic retrieval framework developed by Robertson and colleagues in the 1970s. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationships between the query terms within a document. A simple passage retrieval algorithm based on a sliding window scored with BM25 could serve as another baseline for comparison.
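One way such a baseline might look, sketched below with the standard BM25 term weight and default parameters k1=1.2, b=0.75; the document-frequency table and collection size would normally come from the full collection and are passed in here as assumptions:

```python
import math

def bm25_weight(term, passage_tokens, df, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one passage."""
    tf = passage_tokens.count(term)
    if tf == 0:
        return 0.0
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(passage_tokens) / avg_len))
    return idf * norm

def best_window(query_terms, doc_tokens, df, n_docs, window=20):
    """Slide a fixed-size window over the document; keep the top BM25 window."""
    avg_len = window  # fixed-size windows, so the average length is the window size
    best, best_score = None, -1.0
    for i in range(max(1, len(doc_tokens) - window + 1)):
        w = doc_tokens[i:i + window]
        score = sum(bm25_weight(t, w, df.get(t, 1), n_docs, avg_len)
                    for t in query_terms)
        if score > best_score:
            best, best_score = w, score
    return best, best_score
```

Since all windows share the same length, the length-normalization term is effectively constant; it matters once windows of varying size are compared.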
The MultiText algorithm [Clarke et al., TREC 9] is a density-based passage retrieval algorithm that favours short passages containing many terms with high idf values. Each passage window in the algorithm starts and ends with a query term, and its score is based on the number of query terms in the passage as well as the window size. Once the highest scoring passage has been identified, our implementation creates a new window of the required length around the center point of the original passage.
The algorithm also incorporates a unique technique for arbitrary passage retrieval. The technique efficiently locates high-scoring passages, where the score of a passage is based on its length and the weights of the terms occurring within it. Passage boundaries are determined by the query, and can start and end at any term position. If a document ranking is required, the score of a document is computed by combining the scores of the passages it contains.
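The core idea can be sketched as follows; the scoring formula here is illustrative (idf mass of covered terms penalized by window length), not Clarke et al.'s exact function:

```python
import math

def multitext_passages(query_terms, doc_tokens, idf):
    """Find the best window that starts and ends at a query-term position.

    query_terms: set of query terms; idf: dict of term -> idf weight.
    Returns ((start, end), score) for the highest-scoring window.
    """
    positions = [i for i, t in enumerate(doc_tokens) if t in query_terms]
    best, best_score = None, -1.0
    for a_i, a in enumerate(positions):
        for b in positions[a_i:]:
            window = doc_tokens[a:b + 1]
            covered = set(window) & set(query_terms)
            # Density score: favour short windows covering high-idf terms.
            score = sum(idf.get(t, 1.0) for t in covered) / math.log(len(window) + 1)
            if score > best_score:
                best, best_score = (a, b), score
    return best, best_score
```

Restricting candidate boundaries to query-term positions is what keeps the search over arbitrary-length passages tractable.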
IBM’s passage retrieval algorithm [Ittycheriah et al., TREC 9] computes a series of distance measures for the passage. The “matching words measure” sums the idf values of words that appear in both the query and the passage. The “thesaurus match measure” sums the idf values of words in the query whose WordNet synonyms appear in the passage. The “mis-match words measure” sums the idf values of words that appear in the query and not in the passage. The “dispersion measure” counts the number of words in the passage between matching query terms, and the “cluster words measure” counts the number of words that occur adjacently in both the question and the passage. These various measures are linearly combined to give the final score for a passage.
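Two of these measures and their linear combination might be sketched as below; the combination weights are placeholders, not the trained values from Ittycheriah et al.:

```python
def matching_words_measure(query_terms, passage_terms, idf):
    """Sum of idf values of words appearing in both query and passage."""
    return sum(idf.get(t, 0.0) for t in set(query_terms) & set(passage_terms))

def mismatch_words_measure(query_terms, passage_terms, idf):
    """Sum of idf values of query words that are missing from the passage."""
    return sum(idf.get(t, 0.0) for t in set(query_terms) - set(passage_terms))

def combined_score(query_terms, passage_terms, idf, w_match=1.0, w_mismatch=0.5):
    """Linear combination of the measures (placeholder weights)."""
    return (w_match * matching_words_measure(query_terms, passage_terms, idf)
            - w_mismatch * mismatch_words_measure(query_terms, passage_terms, idf))
```

The thesaurus, dispersion, and cluster measures would enter the same linear combination with their own weights.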
SiteQ’s passage retrieval algorithm [Lee et al., TREC 10] computes the score of an n-sentence passage by summing the weights of the individual sentences. Sentences are weighted based on query term density. This algorithm weights query terms based on their part of speech.
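A sketch of the summation structure, with the part-of-speech-based term weights represented as a precomputed dictionary (an assumption; the paper derives them from POS tags):

```python
def sentence_weight(query_terms, sentence_tokens, term_weight):
    """Density-style weight: sum of query-term weights found in the sentence."""
    return sum(term_weight.get(t, 0.0) for t in sentence_tokens
               if t in query_terms)

def passage_score(query_terms, sentences, term_weight):
    """Score an n-sentence passage as the sum of its sentence weights."""
    return sum(sentence_weight(query_terms, s, term_weight) for s in sentences)
```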
Alicante’s passage retrieval algorithm [Llopis and Vicedo, CLEF 2001] computes the non-length normalized cosine similarity between query terms and the passage. It takes into account the number of appearances of a term in the passage and in the query, along with their idf values.
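One plausible reading of this similarity, as the dot product of idf-weighted term-frequency vectors with no division by vector norms (the exact formulation may differ from the paper):

```python
def alicante_similarity(query_counts, passage_counts, idf):
    """Non-length-normalized cosine: dot product of tf*idf vectors.

    query_counts / passage_counts: dicts of term -> number of appearances.
    """
    return sum(query_counts[t] * passage_counts.get(t, 0) * idf.get(t, 1.0) ** 2
               for t in query_counts)
```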
ISI’s passage retrieval algorithm [Hovy et al., TREC 10] ranks sentences based on their similarity to the question by weighting various features: exact match of proper names, match of query terms, and match of stemmed words. Their passage scoring function includes a term whose sole purpose is to offset scoring performed by the answer extractor.
[Tellex et al., SIGIR 2003] designed a passage retrieval algorithm that combines the results from their implemented collection of algorithms.
They implemented a simple voting scheme that scores each passage based on its initial rank and on the number of answers the other algorithms returned from the same document.
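A sketch of such a voting scheme, with a reciprocal-rank score and one vote per passage drawn from the same document (the exact weighting is an assumption, not Tellex et al.'s formula):

```python
from collections import Counter

def vote(passage_lists):
    """Merge ranked lists from several algorithms into one ranking.

    passage_lists: one ranked list per algorithm of (passage_id, doc_id) pairs.
    """
    # Count how many returned passages, across all algorithms, came
    # from each document.
    doc_votes = Counter(doc for plist in passage_lists for _, doc in plist)
    scored = []
    for plist in passage_lists:
        for rank, (pid, doc) in enumerate(plist):
            rank_score = 1.0 / (rank + 1)   # higher for earlier ranks
            doc_score = doc_votes[doc] - 1  # support from the other lists
            scored.append((rank_score + doc_score, pid, doc))
    scored.sort(reverse=True)
    return [pid for _, pid, _ in scored]
```

Passages from documents that several algorithms agree on rise to the top, even when each individual algorithm ranked them modestly.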