Why choose Passage Retrieval (PR) over Information Retrieval (IR)? Which one is the best painkiller?
Information is essential. However, if we can’t find it, it might as well not exist. Even if we have all the documents properly stored and indexed, our effort is wasted unless we have reliable and fast mechanisms to look up these documents and find the required information.
But first, let’s start with some basics:
Information Retrieval (IR) fundamentally aims to satisfy information needs. It is the process of automatically locating and retrieving fragments of information (documents, files, passages, metadata) relevant to the user’s need from a body of information. For IR, it does not really matter whether you search for documents, passages, or images. What is important is that, unlike in data retrieval, the information need is rather vague and the retrieval system has to guess what the user really needs.
The user’s information need is presented to the IR system as a query which usually consists of a string of words. Then, the IR system uses a matching mechanism to decide how closely a document is related to the query [see Xiaoyong Liu]. Hence, we can say that the goal of an IR system is to retrieve all the documents which will, hopefully, satisfy the user’s query while retrieving as few non-relevant documents as possible [see Ribeiro-Neto].
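To make the idea of a matching mechanism more concrete, here is a minimal sketch in Python. It assumes one common choice, TF-IDF weighting with cosine similarity, which is only one of many possible matching functions and is not taken from the references above. The names tokenize and rank_documents are hypothetical helpers made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return (doc_index, score) pairs, best match first.

    Assumption: documents and query are compared with TF-IDF weights
    and cosine similarity. Real IR engines use more refined scoring.
    """
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    def vectorize(tokens):
        # Smoothed IDF, so terms unseen in the collection do not divide by zero.
        tf = Counter(tokens)
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [(i, cosine(query_vec, vectorize(tokens)))
              for i, tokens in enumerate(doc_tokens)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "lucene adds indexing and searching",
    "aspirin is a painkiller",
    "passage retrieval returns short excerpts",
]
ranking = rank_documents("painkiller aspirin", docs)
```

Documents that share no term with the query get a score of zero, so a system could cut them off to keep non-relevant results out of the answer list.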
An example of an Information Retrieval library is Lucene, which allows you to add indexing and searching capabilities to your applications.
Following IR systems, some research also focused on modeling portions of a document (“passages”). With the fast growth of information resources and the huge number of requests submitted to IR systems, one of the most challenging and important tasks is to retrieve the excerpts most closely related to the user’s question. Thus, we can eliminate the burden of reading lots of irrelevant documents just to reach the desired answer.
In fact, Passage Retrieval (PR) is concerned with finding blocks of text (passages) that give users better evidence than the full document text. This can be especially useful when the documents are long or span different subject areas [Callan, 1994]. This approach is interesting because evidence shows that users usually prefer passage-sized answers over whole documents, since a passage provides context around the answer [Lin, 2003].
Currently, data is digitally distributed and represented in different formats. Therefore, searching within large collections of data, and hence acquiring information, is becoming a big headache. As a result, reliable, distributed and high-performance information systems are required to process and extract information efficiently.
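As a rough illustration of the passage idea, the sketch below cuts a long text into fixed-size word windows and returns the window that shares the most terms with the query. Both the windowing scheme (the size and stride parameters) and the term-overlap score are assumptions chosen for simplicity, and split_into_passages and best_passage are hypothetical names, not part of any library.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_passages(text, size=30, stride=15):
    """Cut the text into windows of `size` words, moving `stride` words each step.

    Assumption: passages are fixed-size word windows. Paragraph or
    sentence boundaries would be an equally valid choice.
    """
    words = text.split()
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return passages


def best_passage(query, text, size=30, stride=15):
    """Return (passage, score) for the window with the most distinct query terms."""
    query_terms = set(tokenize(query))
    best, best_score = None, -1
    for passage in split_into_passages(text, size, stride):
        score = len(query_terms & set(tokenize(passage)))
        if score > best_score:  # on ties, the earliest passage wins
            best, best_score = passage, score
    return best, best_score


text = (
    "Lucene indexes documents quickly and reliably. "
    "Aspirin is a common painkiller for headaches. "
    "Passage retrieval returns short excerpts."
)
passage, score = best_passage("passage retrieval", text, size=5, stride=5)
```

With a stride smaller than the window size the passages overlap, which lowers the risk of splitting an answer across two windows at the cost of scoring more candidates.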
Which kind of information retrieval system suits you better?
Thanks, Itman, for the help with this post!