It works something like this:
We need to build a system that can automatically identify highly relevant triples (pairs of concepts connected by a relation) over
concepts from an existing ontology. It does this by extracting relevant verbs and their grammatical arguments from a domain-specific text collection
and computing the corresponding relations through a combination of linguistic and statistical processing.
1. Tagger.
The main goal in this step is to create the triples.
Starting with the sentence “Certain department stores are further classified as discount department stores”, we perform a series of text operations to tag the sentence and create the triples.
1.1. POS-tagging
By performing POS-tagging, for instance with the Stanford POS tagger, we convert a stream of characters into a stream of words and identify adjectives, verbs, nouns, pronouns, adverbs, and determiners. These tags help us recognize the actions in a sentence, who performs them, and where and when.
Output: Certain_JJ department_NN stores_NNS are_VBP further_JJ classified_VBN as_IN discount_NN department_NN stores_NNS
Part-of-speech tags from the Penn Treebank Project that appear in this output:
- JJ – Adjective
- NN – Noun, singular or mass
- NNS – Noun, plural
- VBP – Verb, non-3rd person singular present
- VBN – Verb, past participle
- IN – Preposition or subordinating conjunction
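As a minimal sketch, assuming the tagger output is available as a plain string of word_TAG tokens like the one above, we could split it back into (word, tag) pairs and pick out the verbs and nouns. The names parse_tagged and TAGGED are illustrative and not part of any tagger's API.

```python
# Example tagger output, copied from the text above.
TAGGED = (
    "Certain_JJ department_NN stores_NNS are_VBP further_JJ "
    "classified_VBN as_IN discount_NN department_NN stores_NNS"
)


def parse_tagged(tagged):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        # rpartition splits on the last underscore, so words that
        # themselves contain underscores are kept intact.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(TAGGED)

# Penn Treebank verb tags start with "VB", noun tags with "NN".
verbs = [word for word, tag in pairs if tag.startswith("VB")]
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(verbs)  # ['are', 'classified']
print(nouns)  # ['department', 'stores', 'discount', 'department', 'stores']
```

Filtering on the tag prefix keeps the sketch independent of which verb or noun subtype the tagger assigned.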
1.2. Stopwords removal
Stopwords are common words that carry less meaning than keywords.
This step removes noise from the sentence. Words that appear too frequently across the documents are usually poor discriminators. In our example, the words “further” and “as” are deleted from the original sentence.
Output: “Certain department stores are classified discount department stores”.
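A minimal sketch of this step follows. The stopword set is an assumption: it holds only the two words removed in the example, while a real list would be much larger and usually derived from word frequencies in the collection.

```python
# Hypothetical stopword list: only the two words removed in the example.
STOPWORDS = {"further", "as"}


def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Drop every word whose lowercase form is in the stopword set."""
    kept = [word for word in sentence.split() if word.lower() not in stopwords]
    return " ".join(kept)


sentence = (
    "Certain department stores are further classified as "
    "discount department stores"
)
print(remove_stopwords(sentence))
# Certain department stores are classified discount department stores
```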
1.3. Verb stemming and translation to the to_be/to_have/to_do model
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.
We assume that all the actions in the world can be mapped onto the verbs to_be, to_have, and to_do.
So, we stem each verb and then map it to one of to_be, to_have, or to_do.
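The following sketch illustrates the idea under strong assumptions. The suffix stripper is a toy stand-in for a real stemmer, and the IRREGULAR and CANONICAL tables are hypothetical: the text does not say which verbs map to which of the three canonical verbs, so the entries below (including the to_do fallback) are illustrative guesses only.

```python
# Toy suffix list; longer suffixes are tried first.
SUFFIXES = ("ied", "ies", "ing", "ed", "es", "s")

# Hypothetical table of irregular forms -> base form.
IRREGULAR = {
    "am": "be", "is": "be", "are": "be", "was": "be", "were": "be",
    "has": "have", "had": "have",
    "does": "do", "did": "do", "done": "do",
}

# Hypothetical table of stems -> canonical verb (an assumption, not from the text).
CANONICAL = {
    "be": "to_be",
    "classif": "to_be",   # "classified as" read as class membership
    "have": "to_have",
    "own": "to_have",
    "do": "to_do",
}


def naive_stem(verb):
    """Strip one common inflectional suffix, keeping at least 3 characters."""
    verb = verb.lower()
    for suffix in SUFFIXES:
        if verb.endswith(suffix) and len(verb) - len(suffix) >= 3:
            return verb[: -len(suffix)]
    return verb


def to_canonical(verb):
    """Map a verb to to_be, to_have, or to_do (to_do is the assumed fallback)."""
    base = IRREGULAR.get(verb.lower()) or naive_stem(verb)
    return CANONICAL.get(base, "to_do")


for verb in ("are", "classified", "owns", "sells"):
    print(verb, "->", to_canonical(verb))
```

In practice the lookup tables would need to come from a lexical resource or be curated by hand for the domain.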
1.4. Named-entity recognizer
This is a very important step, because it lets us identify “department store” not simply as two words but as a single entity with its own meaning. In fact, using WordNet, we can also classify “department store” as a retail establishment and find other relations.
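To illustrate the merging of multi-word entities, here is a minimal sketch that uses a small hand-written lexicon instead of WordNet. The ENTITIES table, the singular helper, and the greedy longest-match strategy are all assumptions made for this example.

```python
# Hypothetical lexicon: multi-word entity -> broader concept.
# Hand-written for illustration, not taken from WordNet.
ENTITIES = {
    ("discount", "department", "store"): "department store",
    ("department", "store"): "retail establishment",
}


def singular(word):
    """Very rough singularization: lowercase and drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def merge_entities(tokens, lexicon=ENTITIES):
    """Greedily replace the longest known word sequence with one entity token."""
    max_len = max(len(key) for key in lexicon)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(singular(t) for t in tokens[i:i + n])
            if candidate in lexicon:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No multi-word entity starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "Certain department stores are classified discount department stores".split()
print(merge_entities(tokens))
# ['Certain', 'department store', 'are', 'classified', 'discount department store']
```

Trying the longest span first keeps “discount department store” from being split into “discount” plus “department store”.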
2. Ontology Creation and Instantiation
2.1. Ontology Network Creation
An ontology is a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to describe the domain.
In this step we construct an initial infrastructure for the ontology network. We build a simple network, able to map the most generic relationships in the world. We also seek to provide a classification of entities in all spheres of being.
2.2. Ontology Instantiation
In this phase we instantiate the triples within the ontology network. We can then draw inferences about the instantiated entities.
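As a sketch of what instantiation and inference could look like, the class below stores triples in memory and follows to_be links transitively. The Ontology class, its methods, and the two example triples are assumptions for illustration; in particular, the direction of the to_be relation (subclass to superclass) is our reading of the example sentence.

```python
from collections import defaultdict


class Ontology:
    """Minimal in-memory triple store (illustrative only)."""

    def __init__(self):
        # (subject, relation) -> set of objects
        self._objects = defaultdict(set)

    def add(self, subject, relation, obj):
        """Instantiate one triple in the network."""
        self._objects[(subject, relation)].add(obj)

    def ancestors(self, concept):
        """Return every concept reachable from `concept` via to_be links."""
        seen, stack = set(), [concept]
        while stack:
            current = stack.pop()
            for parent in self._objects.get((current, "to_be"), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


onto = Ontology()
# Hand-written triples based on the running example.
onto.add("discount department store", "to_be", "department store")
onto.add("department store", "to_be", "retail establishment")

# Inferred: a discount department store is also a retail establishment.
print(sorted(onto.ancestors("discount department store")))
# ['department store', 'retail establishment']
```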
3. Questions to the System
We assume that there are only four types of questions:
- What/Which – Concept, definition, object query
- Who – Person Query
- Where – Location query
- When – Temporal query
The same approach used in step 1 is applied here, so each question is mapped into a triple like this one:
type_of_question(Object1, Object2, Verb)
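A minimal sketch of this mapping follows. It classifies the question by its first word and leaves the extraction of the objects and verb to the step 1 pipeline, so they are passed in by hand here. The QuestionTriple type, the type labels, and the use of None for the unknown slot are assumptions.

```python
from typing import NamedTuple, Optional

# The four question types listed above, keyed by question word.
QUESTION_TYPES = {
    "what": "concept",
    "which": "concept",
    "who": "person",
    "where": "location",
    "when": "temporal",
}


class QuestionTriple(NamedTuple):
    """type_of_question(Object1, Object2, Verb); None marks the unknown slot."""
    qtype: str
    object1: Optional[str]
    object2: Optional[str]
    verb: str


def classify_question(question):
    """Return the question type based on the first word, or None if unsupported."""
    words = question.strip().split()
    if not words:
        return None
    return QUESTION_TYPES.get(words[0].lower().rstrip("?,"))


def build_query(question, object1, object2, verb):
    """Combine the question type with objects and verb found by the tagger."""
    qtype = classify_question(question)
    if qtype is None:
        raise ValueError("unsupported question type: %r" % question)
    return QuestionTriple(qtype, object1, object2, verb)


query = build_query(
    "What is a discount department store?",
    "discount department store", None, "to_be",
)
print(query)
```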
4. Answer Retrieval
The answering component follows a typical question answering system embedded within an information retrieval system. However, we expect to improve its efficiency by using the ontology network as a pre-search step.
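One possible reading of the pre-search idea is sketched below: try to answer directly from the instantiated triples, and fall back to document retrieval only when no triple matches. This is an assumption about how the two parts fit together, and the keyword match is a placeholder for a real retrieval system. The function name and the sample data are hypothetical.

```python
def answer(subject, verb, triples, documents):
    """Return (source, results): ontology hits first, keyword retrieval otherwise."""
    # Pre-search: look for triples that match the question's subject and verb.
    hits = sorted(o for s, v, o in triples if s == subject and v == verb)
    if hits:
        return ("ontology", hits)
    # Fallback: naive keyword match standing in for the retrieval system.
    matches = [doc for doc in documents if subject.lower() in doc.lower()]
    return ("retrieval", matches)


# Hypothetical sample data for illustration.
triples = [
    ("discount department store", "to_be", "department store"),
    ("department store", "to_be", "retail establishment"),
]
documents = [
    "Certain department stores are further classified as discount department stores",
    "A supermarket sells food and household goods",
]

print(answer("department store", "to_be", triples, documents))
print(answer("supermarket", "to_be", triples, documents))
```

The first call is answered from the triples alone; the second has no matching triple and falls through to the document search.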