In this paper, four selection strategies for evolutionary optimization of information retrieval (IR) in a question answering system are compared. The IR index has been enhanced with linguistic features to improve the retrieval of potential answer passages. A genetic algorithm is then applied to optimize the selection of features when querying the IR database. In all cases, the evolutionary algorithm optimizes the IR settings, with fitness scores significantly above the baseline scores.
Archive for the ‘Methodology’ Category
Posted in information retrieval, Methodology, passage retrieval, question and answering, trec, tagged algorithm, alicante, bm25, Clarke, clef, hovy, ibm, information, isi, Ittycheriah, lee, Light, llopis, mitre, multitext, okapi, okapi bm25, PR, Robertson, siteq, sliding window, tellex, trec, vicedo, voting on April 19, 2010
Research in Question Answering (QA) systems has been driven by the Text Retrieval Conference (TREC) series since 1999. Almost all QA systems fielded at TREC employ some passage retrieval technique to reduce the relevant document set to a manageable number of passages. Here are a bunch of algorithms it might be useful to be aware of:
The algorithm presented by [Light et al., Natural Language Engineering, Special Issue on QA, 2001] simply counts the number of terms a passage has in common with the query. Each sentence is treated as a separate passage. This algorithm represents the simplest passage retrieval technique and serves as a good baseline for comparison.
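As a rough illustration, here is a minimal sketch of this kind of word-overlap baseline in Java; the whitespace tokenization and lower-casing are my own simplifications, not details taken from the paper:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of a Light et al.-style baseline: score each sentence
 *  (treated as a passage) by the number of terms it shares with the query. */
public class WordOverlapBaseline {

    public static int score(String sentence, String query) {
        Set<String> queryTerms = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        Set<String> sentenceTerms = new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\s+")));
        sentenceTerms.retainAll(queryTerms);   // keep only the terms in common
        return sentenceTerms.size();           // overlap count = passage score
    }

    public static void main(String[] args) {
        String query = "Who invented the telephone";
        System.out.println(score("Alexander Graham Bell invented the telephone in 1876", query));
    }
}
```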
Sliding Window scored with Okapi BM25
The Okapi BM25 weighting scheme [Robertson et al., TREC 4] represents the state of the art in document retrieval. It is based on the probabilistic retrieval framework developed by Robertson and Spärck Jones in the 1970s. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document. A simple passage retrieval algorithm based on a sliding window scored with BM25 could serve as another baseline for comparison.
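A minimal sketch of such a sliding-window scorer is shown below, using the standard BM25 term weight idf(t) * tf*(k1+1) / (tf + k1*(1 - b + b*len/avglen)); the window size, the idf table and the common default parameters k1 = 1.2, b = 0.75 are assumptions of this sketch rather than values from the cited papers:

```java
import java.util.*;

/** Sketch of a sliding-window passage scorer using the Okapi BM25 term weight.
 *  The idf values and window size are placeholders; a real system would take
 *  them from the document collection statistics. */
public class Bm25Window {

    static final double K1 = 1.2, B = 0.75;   // common BM25 defaults, assumed here

    /** BM25 contribution of a single term inside a window. */
    static double bm25(double tf, double idf, double windowLen, double avgLen) {
        return idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * windowLen / avgLen));
    }

    /** Slide a fixed-size window over the document and keep the best BM25 score. */
    static double bestWindowScore(List<String> docTerms, Set<String> queryTerms,
                                  Map<String, Double> idf, int windowSize, double avgLen) {
        double best = 0;
        for (int start = 0; start + windowSize <= docTerms.size(); start++) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : docTerms.subList(start, start + windowSize)) {
                if (queryTerms.contains(t)) tf.merge(t, 1, Integer::sum);  // term frequency inside the window
            }
            double score = 0;
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                score += bm25(e.getValue(), idf.getOrDefault(e.getKey(), 0.0), windowSize, avgLen);
            }
            best = Math.max(best, score);
        }
        return best;
    }
}
```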
The MultiText algorithm [Clarke et al., TREC 9] is a density-based passage retrieval algorithm that favours short passages containing many terms with high idf values. Each passage window in the algorithm starts and ends with a query term, and its score is based on the number of query terms in the passage as well as the window size. Once the highest scoring passage has been identified, our implementation creates a new window of the required length around the center point of the original passage.
It incorporates a unique technique for arbitrary passage retrieval. The technique efficiently locates high-scoring passages, where the score of a passage is based on its length and the weights of the terms occurring within it. Passage boundaries are determined by the query and can start and end at any term position. If a document ranking is required, the score of a document is computed by combining the scores of the passages it contains.
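The sketch below is a heavily simplified illustration of this density idea, scoring candidate windows that start and end on query-term positions by their idf mass per unit length; the real MultiText cover-density formula is more involved, so treat this only as a gist:

```java
import java.util.*;

/** Hedged simplification of a MultiText-style density scorer: candidate windows
 *  start and end at query-term positions, and short windows packed with
 *  high-idf query terms score best. */
public class DensityPassageScorer {

    static double scoreBestWindow(List<String> docTerms, Map<String, Double> queryIdf) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docTerms.size(); i++) {
            if (queryIdf.containsKey(docTerms.get(i))) hits.add(i);   // positions of query terms
        }
        double best = 0;
        for (int i = 0; i < hits.size(); i++) {
            double idfSum = 0;
            for (int j = i; j < hits.size(); j++) {
                idfSum += queryIdf.get(docTerms.get(hits.get(j)));
                int windowLen = hits.get(j) - hits.get(i) + 1;    // window starts and ends on query terms
                best = Math.max(best, idfSum / windowLen);        // density: idf mass per unit length
            }
        }
        return best;
    }
}
```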
IBM’s passage retrieval algorithm [Ittycheriah et al., TREC 9] computes a series of distance measures for the passage. The “matching words measure” sums the idf values of words that appear in both the query and the passage. The “thesaurus match measure” sums the idf values of words in the query whose WordNet synonyms appear in the passage. The “mis-match words measure” sums the idf values of words that appear in the query and not in the passage. The “dispersion measure” counts the number of words in the passage between matching query terms, and the “cluster words measure” counts the number of words that occur adjacently in both the question and the passage. These various measures are linearly combined to give the final score for a passage.
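For illustration, a sketch of two of these measures and their linear combination follows; only the matching-words and mis-match measures are shown, and the weights are placeholders, not the ones IBM actually used:

```java
import java.util.*;

/** Sketch of IBM-style passage scoring: several distance measures are computed
 *  and combined linearly. Only two measures are shown here. */
public class LinearCombinationScorer {

    static double score(Set<String> queryTerms, Set<String> passageTerms, Map<String, Double> idf) {
        double matching = 0, mismatch = 0;
        for (String q : queryTerms) {
            if (passageTerms.contains(q)) matching += idf.getOrDefault(q, 0.0);  // matching words measure
            else mismatch += idf.getOrDefault(q, 0.0);                           // mis-match words measure
        }
        double w1 = 1.0, w2 = 0.5;                 // placeholder weights
        return w1 * matching - w2 * mismatch;      // linear combination of the measures
    }
}
```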
SiteQ’s passage retrieval algorithm [Lee et al., TREC 10] computes the score of an n-sentence passage by summing the weights of the individual sentences. Sentences are weighted based on query term density. This algorithm weights query terms based on their part of speech.
Alicante’s passage retrieval algorithm [Llopis and Vicedo, CLEF 2001] computes the non-length normalized cosine similarity between query terms and the passage. It takes into account the number of appearances of a term in the passage and in the query, along with their idf values.
ISI’s passage retrieval algorithm [Hovy et al., TREC 10] ranks sentences based on their similarity to the question by weighting various features: exact match of proper names, match of query terms, and match of stemmed words. Their passage-scoring function includes a term whose sole purpose is to offset scoring performed by the answer extractor.
[Tellex et al., SIGIR 2003] designed a passage retrieval algorithm by combining the results from the collection of algorithms they implemented.
A simple voting scheme was implemented that scored each passage based on its initial rank and on the number of answers the other algorithms returned from the same document.
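A minimal sketch of that kind of vote combination might look like the following; the reciprocal-rank weighting is an assumption of mine, not the scheme reported in the paper:

```java
import java.util.*;

/** Sketch of the simple voting idea: a passage gets credit from its own rank
 *  and from how often the other algorithms picked a passage from the same document. */
public class VotingCombiner {

    /** rankedDocIds: for each algorithm, the document ids of its ranked passages. */
    static Map<String, Double> vote(List<List<String>> rankedDocIds) {
        Map<String, Double> votes = new HashMap<>();
        for (List<String> ranking : rankedDocIds) {
            for (int rank = 0; rank < ranking.size(); rank++) {
                // higher-ranked passages contribute more; every occurrence of the
                // same document across algorithms adds to its score
                votes.merge(ranking.get(rank), 1.0 / (rank + 1), Double::sum);
            }
        }
        return votes;
    }
}
```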
“… semantic networks suffer from an inherent semantic ambiguity. For example, we were unable to differentiate individuals from concepts in the resulting concept maps. Moreover, due to the direct translation of written sentences into concept map sentences, various terms were used to express synonyms, resulting in further ambiguity …”
Hummm…. interesting (or not).
We have seen that there are many alternatives when designing an Information Retrieval (IR) system. But how do we know which of these techniques are effective?
To measure IR effectiveness in the standard way, we need an input test collection consisting of three things:
- A document collection
- A test suite of information needs, expressible as queries
- A set of relevance judgments, standardly a binary assessment of either relevant or non-relevant for each query-document pair.
As a rule of thumb, 50 information needs has usually been found to be a sufficient minimum.
The evaluation of an IR system can focus on two topics: performance evaluation and evaluation of the retrieved information.
- Performance evaluation is concerned with response time and storage space; it evaluates the search mechanism itself.
- The evaluation of the retrieved information does not care about an exact answer definition. The documents are ranked by relevance, and the concern is the effectiveness of retrieval with respect to that relevance.
But how to measure relevance?
Relevance is a formal criterion for the automatic selection of documents. It is a property judged by the user with respect to his or her information need. A document is relevant if it addresses the stated information need, not merely because it happens to contain all the words in the query. To evaluate a system, we therefore require an overt expression of an information need, against which returned documents can be judged relevant or non-relevant.
Standard Text Collections
- Cross Language Evaluation Forum (CLEF). This evaluation series has concentrated on European languages and cross-language IR.
- Text Retrieval Conference (TREC). Large IR test bed evaluation series.
Evaluation of retrieval sets
The two most frequent and basic measures for IR effectiveness are precision and recall. Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of relevant documents that are retrieved.
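As a concrete (if trivial) example, these set-based measures can be computed directly from the retrieved and relevant document sets; the document ids below are made up:

```java
import java.util.*;

/** Precision and recall over a retrieved set, as defined above:
 *  precision = |relevant AND retrieved| / |retrieved|,
 *  recall    = |relevant AND retrieved| / |relevant|. */
public class SetEvaluation {

    static double[] precisionRecall(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);   // relevant documents that were actually retrieved
        double precision = retrieved.isEmpty() ? 0 : (double) hits.size() / retrieved.size();
        double recall    = relevant.isEmpty()  ? 0 : (double) hits.size() / relevant.size();
        return new double[]{precision, recall};
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant  = new HashSet<>(Arrays.asList("d1", "d3", "d7"));
        double[] pr = precisionRecall(retrieved, relevant);
        System.out.printf("precision=%.2f recall=%.2f%n", pr[0], pr[1]);  // 0.50 and 0.67
    }
}
```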
References: Introduction to Information Retrieval
- Rapid and continuous delivery. Iterative development with small iterations (one week) is encouraged and progressively aligns the business needs with the business goals. Working software is the measure of the project's progress, so at any time we have an operational prototype running.
- Fast response to changes, as we use adaptive and flexible methods. This allows the project's direction to be redefined based on completed work, not on speculation and prediction. Since some of the project topics are not well researched and state-of-the-art tools and methodologies are being used, the requirements could easily change and we have to be cautious.
- Collaboration in self-organizing, cross-functional teams. This is important as each team member is an expert in one domain area.
- Continuous cooperation between business people and developers. Collaboration between experts and stakeholders, whose permanent feedback has to be taken into account.
Posted in information retrieval, Methodology, ontology, passage retrieval, prymas, tagged name-entity, penn treebank, pos-tagging, stemming, stopwords, stopwords removal, tagger, triples on April 6, 2010
It works something like this:
We need to build a system that is capable of automatically identifying highly relevant triples (pairs of concepts connected by a relation) over concepts from an existing ontology, by extracting relevant verbs and their grammatical arguments from a domain-specific text collection and computing the corresponding relations through a combination of linguistic and statistical processing.
The main goal in this step is to create the triples.
So, starting with the sentence “Certain department stores are further classified as discount department stores”, we will perform some text operations in order to tag the sentence and create the triples.
1.1. POS-tagging
By performing POS-tagging, for instance with the Stanford POS tagger, we are able to find adjectives, verbs, nouns, pronouns, adverbs, and determiners. We convert a stream of characters into a stream of words. They will be useful for recognizing the actions in the sentences, who performs them, where and when (a minimal tagging sketch follows the tag list below).
Output: Certain_JJ department_NN stores_NNS are_VBP further_JJ classified_VBN as_IN discount_NN department_NN stores_NNS
Alphabetical list of part-of-speech tags used in the Penn Treebank Project:
- JJ – Adjective
- NN – Noun, singular or mass
- NNS – Noun, plural
- VBP – Verb, non-3rd person singular present
- VBN – Verb, past participle
- IN – Preposition or subordinating conjunction
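Here is the promised minimal tagging sketch using the Stanford tagger's Java API; the model file path is a placeholder for whichever .tagger model ships with the release you download:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

/** Minimal sketch of tagging the example sentence with the Stanford POS tagger.
 *  The model path below is a placeholder. */
public class TagExample {
    public static void main(String[] args) throws Exception {
        MaxentTagger tagger = new MaxentTagger("path/to/your-model.tagger"); // placeholder path
        String sentence = "Certain department stores are further classified as discount department stores";
        // tagString returns word_TAG pairs, e.g. "stores_NNS are_VBP ..."
        System.out.println(tagger.tagString(sentence));
    }
}
```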
1.2. Stopwords removal
Stopwords are common words that carry less important meaning than keywords.
With this step, we are able to remove noise from the sentence. Words that are too frequent across the documents are usually not good discriminators. Accordingly, the words “further” and “as” are deleted from the original sentence.
Output: “Certain department stores are classified discount department stores”.
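A minimal sketch of this filtering step is below; the two-word stopword set only covers this example, whereas a real system would use a full list such as the one bundled with Lucene:

```java
import java.util.*;

/** Sketch of the stopword-removal step: drop words that appear in a stopword list. */
public class StopwordRemoval {
    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("further", "as"));  // toy list for this example
        String sentence = "Certain department stores are further classified as discount department stores";
        StringBuilder filtered = new StringBuilder();
        for (String word : sentence.split("\\s+")) {
            if (!stopwords.contains(word.toLowerCase())) filtered.append(word).append(' ');
        }
        System.out.println(filtered.toString().trim());
        // Certain department stores are classified discount department stores
    }
}
```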
1.3. Verb Stemming and translation to the to_be/to_have/to_do model.
We assume that all the actions in the world can be mapped into the verbs to_be, to_have and to_do.
So, we stem the verbs and then convert them into the verbs to_be, to_have and to_do.
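A hedged sketch of what such a mapping table could look like follows; the stems and their buckets are made up for illustration and are not the project's actual rules:

```java
import java.util.*;

/** Hedged sketch of the verb-normalisation step: after stemming, each verb is
 *  mapped onto one of to_be, to_have or to_do. The stem list is an illustrative fragment. */
public class VerbNormaliser {

    static final Map<String, String> VERB_MODEL = new HashMap<>();
    static {
        VERB_MODEL.put("classifi", "to_be");   // "are classified as X" -> to_be relation (assumed rule)
        VERB_MODEL.put("be",       "to_be");
        VERB_MODEL.put("own",      "to_have");
        VERB_MODEL.put("go",       "to_do");
    }

    static String normalise(String verbStem) {
        return VERB_MODEL.getOrDefault(verbStem, "to_do");  // default bucket for unknown actions (assumption)
    }
}
```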
1.4. Named-Entity Recognizer.
This is a very important step, as we can identify “department store” not simply as two words, but as one entity with meaning. In fact, using WordNet, we are able to classify “department store” as a retail establishment, among other relations.
2. Ontology Creation and Instantiation
2.1 Ontology Network Creation
An ontology is a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to describe the domain.
In this step we construct an initial infrastructure for the ontology network. We build a simple network, able to map the most generic relationships in the world. We also seek to provide a classification of entities in all spheres of being.
2.2. Ontology Instantiation
In this phase we instantiate the triples within the ontology network. Then, we can perform some inferences about the entities instantiated.
3. Questions to the System
We assume that we only have 4 types of questions:
- What/Which – Concept, definition, object query
- Who – Person Query
- Where – Location query
- When – Temporal query
The same approach used in step 1 is applied here, so the type of question is mapped into a triple like this one:
type_of_question(Object1, Object2, Verb)
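For illustration, here is a minimal sketch of classifying a question into one of these four types before building the triple; the class and method names are mine:

```java
/** Sketch of mapping the four supported question types onto the
 *  type_of_question(Object1, Object2, Verb) pattern. */
public class QuestionTriple {

    enum QuestionType { WHAT, WHO, WHERE, WHEN }

    static QuestionType classify(String question) {
        String q = question.trim().toLowerCase();
        if (q.startsWith("who"))   return QuestionType.WHO;    // person query
        if (q.startsWith("where")) return QuestionType.WHERE;  // location query
        if (q.startsWith("when"))  return QuestionType.WHEN;   // temporal query
        return QuestionType.WHAT;                               // concept, definition, object query
    }

    public static void main(String[] args) {
        // e.g. "When was the Eiffel tower constructed?" -> WHEN(eiffel_tower, ?, to_be)
        System.out.println(classify("When was the Eiffel tower constructed?"));
    }
}
```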
4. Answer Retrieval
The answering system is based on a typical question and answering architecture embedded within an information retrieval system. However, we expect to improve the system's efficiency by using the ontology network as a pre-search step.
Posted in information retrieval, Methodology, tools, tagged apache, api, compass, doug cutting, eb-eye, information, IR, jakarta, liferay, lucene, nutch, open-source, pangaea, solr on April 1, 2010
What is Lucene?
Lucene is a high-performance, scalable Information Retrieval (IR) library, created originally by Doug Cutting. It provides indexing and searching features to applications. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Lucene is a mature, open-source project implemented in Java, available for free download. Moreover, it’s a member of the popular Apache Jakarta family of projects, licensed under the Apache Software License. One of the key factors behind Lucene’s popularity and success is its simplicity.
We can see that a lot of attention is paid to the indexing and search API. In fact, we don't need in-depth knowledge about how Lucene's indexing and retrieval work in order to start using it.
The core logical architecture is the concept of a document which contains fields of text. This flexibility allows Lucene‘s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others, can all be indexed so long as their textual information can be extracted.
Now, some features about Lucene:
- ranked search, that is, the most fitting results are returned first;
- multiple query types: phrase queries, wildcard queries, proximity queries, range queries and more;
- field search (e.g., title, author, contents);
- date-range searching;
- sorting by field;
- multiple-index search with merged results;
- simultaneous update and searching;
- implementations in multiple programming languages available, that are index-compatible;
However, we have to be aware that Lucene, by itself, is not a ready-to-use application like a file-search program, a web crawler, or a web site search engine. Lucene is a software toolkit, not a full-featured search application.
A list of Lucene-Based Projects:
- Nutch (Open Source Web Search Engine);
- Solr (Fully-featured search server);
- Compass (Java Search Engine Framework);
- Liferay (Enterprise portal);
- PANGAEA (Digital data library and a data publisher for earth system science);
- EB-eye (search engine that provides easy and uniform access to the biological data resources)
In fact, this tool can be very useful for the passage/document retrieval task. Lucene can support retrieval of relevant snippets of text for every question.
One simple approach is to create two indexes, one at paragraph level and one at document level. The paragraph index is more precise in terms of relevant text and is preferred for snippet extraction. However, if the answer is not found in the paragraph index, results from the document index are returned instead. Using the Lucene query engine, we extract a ranked list of snippets for every question.
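A minimal sketch of the paragraph-level half of this idea with the Lucene 3.x API is shown below; the field names, the paragraph/document split and the top-10 cut-off are placeholders of this sketch, not a prescribed setup:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

/** Minimal sketch of the two-index idea with the Lucene 3.x API: one index holds
 *  whole documents, the other holds individual paragraphs; questions are run
 *  against the paragraph index first. Field names are placeholders. */
public class TwoLevelRetriever {

    static void indexParagraph(IndexWriter writer, String docId, String paragraph) throws Exception {
        Document doc = new Document();
        doc.add(new Field("docId", docId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", paragraph, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    static void search(Directory paragraphIndex, String question) throws Exception {
        IndexSearcher searcher = new IndexSearcher(paragraphIndex);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                                             new StandardAnalyzer(Version.LUCENE_30));
        Query query = parser.parse(question);
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {   // ranked list of snippets
            Document doc = searcher.doc(hit.doc);
            System.out.println(hit.score + "\t" + doc.get("contents"));
        }
        searcher.close();
    }
}
```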
The algorithm for answer extraction can be based on the Lucene scores. It is very likely that the paragraph with the highest Lucene score contains the correct answer (although this is not guaranteed).
Of course, there are refinement filters, which increase the Lucene score in the following cases:
- the paragraph contains the question focus;
- the paragraph contains some of the named entities (proportionally to the number of these named entities);
- if the question answer type is Person, Organization, etc., we try to identify these types of named entities in the extracted paragraphs (and increase the Lucene score according to how many are found);
- if the question type is Definition, then we prefer answers with a definition form (which could be identified by this grammar).
After applying all the criteria, the paragraphs are awarded a score and the paragraph with the highest score is chosen.
Posted in knowledge management, Methodology, passage retrieval, prymas, tagged Entity Recognition, GATE, machine learning, POS, semantic network, Steeming, Tokenization, UIMA, unstructured knowledge on March 23, 2010
Usually X marks the spot, but the path from unstructured knowledge to a reliable and efficient knowledge database of facts isn't straightforward. Even when the knowledge is already assembled in a machine-optimal representation, recovering that information as an English natural-language answer isn't trivial.
Nowadays, the amount of information that companies deal with is overwhelming. Being able to convert unstructured information into a fact representation that can be stored and manipulated through data warehouse techniques is striking gold, as we are indeed generating knowledge.
With this in mind, the key concept is the development of a knowledge management system, like a semantic network, embedded within an ontology. Setting up an ontology hierarchy will allow the scope of the project to be changed easily and will increase its flexibility. For instance, we could construct a general ontology for mapping the global state of the world, e.g. generic relations such as “a son has a mother and a father”, and within this generic ontology we could have more specific ontologies mapping the business logic, such as “cars have 4 tires”.
After mapping the semantic network, the representation allows knowledge to be created, inferred, explored, and retrieved easily and efficiently. Besides, the network also allows contradictions to be detected. This feature is very important, as the source documents are written in natural language, and detecting such occurrences is essential.
But what is the easiest and most efficient way to query the knowledge database? Perhaps using English natural language questions, such as: “When was the Eiffel tower constructed?”. Such questions let users query the system with only a small learning curve. This is a valuable feature, as we want a system that can be used not only by technicians but also by a wide range of users without a technical background.
Yes, but how to deal with the knowledge retrieval?
If we annotate the unstructured source text (tokenization, part-of-speech tagging, entity recognition, stemming) using frameworks like GATE or UIMA, we can extract the relevant actions from each phrase. Then, for each action, we can attach the passage in the text where the action is stated. When an answer is retrieved from the actions mapped into the semantic network, the supporting passages are shown with an associated ranking. The ranking can be tuned as the user indicates whether an answer was useful or not, which introduces an interesting element of machine learning into the knowledge management system. A toy sketch of this feedback loop follows.
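In this sketch of the feedback loop, each passage's rank score is simply nudged by user votes; all class and method names here are my own, not part of GATE or UIMA:

```java
import java.util.*;

/** Toy sketch of tuning passage ranking from user feedback: each time a user
 *  marks an answer passage as useful (or not), its score is nudged up (or down).
 *  This only illustrates the idea, not the project's actual model. */
public class FeedbackRanker {

    private final Map<String, Double> passageScores = new HashMap<>();

    void recordFeedback(String passageId, boolean useful) {
        passageScores.merge(passageId, useful ? 1.0 : -1.0, Double::sum);
    }

    /** Re-rank candidate passages by their accumulated feedback score. */
    List<String> rank(Collection<String> candidatePassageIds) {
        List<String> ranked = new ArrayList<>(candidatePassageIds);
        ranked.sort(Comparator.comparingDouble(
                (String id) -> -passageScores.getOrDefault(id, 0.0)));
        return ranked;
    }
}
```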
Any clue to the whereabouts of the lost treasure and how to turn things into gold?
How do we represent knowledge in a way that can be efficiently manipulated by a machine program? How do we formally represent the domain of a problem? How do we achieve intelligent behavior? These are the million-dollar questions…
The key problem is to find a representation, and a supporting system, that makes inferences within constraints appropriate to the problem at hand. On the one hand, the more expressive the architecture, the easier and more compact the representation of knowledge. On the other hand, more expressive representations make it harder to derive inferences automatically.
In our work project, we first tried to map the domain of the problem into a relational database design.
For instance, analyzing the text “John Smith goes to London, next week” we can split this sentence into:
| John Smith | Go | London | Next week |
Table 01: “John Smith goes to London, next week” sentence splitting
However, this model is very feeble and cumbersome, as it does not allow complex sentences to be mapped. Moreover, the size of the database would grow very fast, decreasing the efficiency of the question & answering mechanism. Another important weakness is the inference system: we have to process the existing facts in the knowledge database more than once to incrementally infer new facts.
Knowing all these weaknesses, we investigated the topic further.
The leading representation techniques are frames, rules, tagging, and semantic networks. Some experts think the best way to represent knowledge is the same way it is represented in the human mind. Accordingly, frame structures are well suited for the representation of schematic knowledge and stereotypical cognitive patterns. One of the most expressive and comprehensively described knowledge representation paradigms along the lines of semantic networks is MultiNet. SNePS also provides a good point of view on Knowledge Representation and Reasoning.
From this point, a semantic network assembled inside an ontology representation seems the best way to attack the problem. Any other ideas on how to approach this?