Archive for the ‘Text Extraction’ Category
Extracting Meaning from Millions of Pages
Posted in information retrieval, natural language processing, Text Extraction, tagged semantic web, textrunner on July 12, 2010 | Leave a Comment »
A software engine that pulls together facts by combing through more than 500 million Web pages has been developed by researchers at the University of Washington. The tool extracts information from billions of lines of text by analyzing basic relationships between words. The University of Washington project represents a scaling up of an existing technology [...]
Predicate-Argument EXtractor (PAX)
Posted in natural language processing, Text Extraction, Uncategorized, tagged GATE, PAX, Predicate-Argument, triple, triple extraction, triple extraction from sentences on May 20, 2010 | 5 Comments »
Triple extraction (Subject, Predicate, Object) is a good method for translating free-form sentences into knowledge. The GATE Predicate-Argument Extractor component (PAX) could be very useful for this task. PAX is a GATE component for extracting predicate-argument structures (PAS). PASs are used in various contexts to represent relations within a sentence structure. Different “semantic” parsers extract [...]
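To make the (Subject, Predicate, Object) idea concrete, here is a minimal sketch of how extracted triples might be stored and queried in plain Java. It does not use PAX or GATE: the class `TripleSketch`, its methods, and the two triples are hypothetical, written by hand from the sentences in the excerpt above rather than produced by an extractor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the PAX or GATE API.
final class TripleSketch {

    // One (subject, predicate, object) statement as an immutable value.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "(" + subject + ", " + predicate + ", " + object + ")";
        }
    }

    private final List<Triple> triples = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // Returns every object asserted for the given subject and predicate.
    List<String> objectsOf(String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (Triple t : triples) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                result.add(t.object);
            }
        }
        return result;
    }

    // Hand-written triples for two sentences from the excerpt (assumed, not extracted).
    static TripleSketch example() {
        TripleSketch kb = new TripleSketch();
        kb.add("PAX", "extracts", "predicate-argument structures");
        kb.add("PAX", "is", "a GATE component");
        return kb;
    }

    public static void main(String[] args) {
        TripleSketch kb = example();
        System.out.println(kb.objectsOf("PAX", "extracts"));
    }
}
```

Once sentences are reduced to triples like these, simple questions ("what does PAX extract?") become lookups instead of text searches.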
Tools: JGraphT
Posted in natural language processing, Text Extraction, tools, tagged JGraphT, stanford parser, Treebank, triple extraction on May 18, 2010 | 1 Comment »
The Stanford Parser just returns a list of dependencies between word tokens. To manipulate the dependencies, we will almost certainly want to put them in a graph data structure. We are going to try this using JGraphT. JGraphT is a free Java graph library that provides mathematical graph-theory objects and algorithms. JGraphT supports various types [...]
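As a rough illustration of why a graph structure helps, the sketch below stores a dependency list as a directed, labelled graph. It deliberately uses only `java.util` collections as a stand-in for JGraphT so that it runs without extra libraries. The class `DependencyGraphSketch` is hypothetical, and the dependencies for "The dog chased the cat" are written by hand in the style of Stanford typed dependencies, not taken from parser output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a graph library: a directed graph with labelled edges.
final class DependencyGraphSketch {

    // One labelled edge: governor --relation--> dependent.
    static final class Edge {
        final String relation;
        final String governor;
        final String dependent;

        Edge(String relation, String governor, String dependent) {
            this.relation = relation;
            this.governor = governor;
            this.dependent = dependent;
        }
    }

    // Outgoing edges per token; tokens carry their position to stay unique.
    private final Map<String, List<Edge>> outgoing = new LinkedHashMap<>();

    void addDependency(String relation, String governor, String dependent) {
        outgoing.computeIfAbsent(governor, k -> new ArrayList<>())
                .add(new Edge(relation, governor, dependent));
        // Register the dependent as a vertex even if it has no outgoing edges.
        outgoing.computeIfAbsent(dependent, k -> new ArrayList<>());
    }

    // Returns the dependents of a token that are attached by the given relation.
    List<String> dependents(String governor, String relation) {
        List<String> result = new ArrayList<>();
        for (Edge e : outgoing.getOrDefault(governor, new ArrayList<>())) {
            if (e.relation.equals(relation)) {
                result.add(e.dependent);
            }
        }
        return result;
    }

    // Hand-written dependencies for "The dog chased the cat" (assumed).
    static DependencyGraphSketch example() {
        DependencyGraphSketch g = new DependencyGraphSketch();
        g.addDependency("det", "dog-2", "The-1");
        g.addDependency("nsubj", "chased-3", "dog-2");
        g.addDependency("det", "cat-5", "the-4");
        g.addDependency("dobj", "chased-3", "cat-5");
        return g;
    }

    public static void main(String[] args) {
        DependencyGraphSketch g = example();
        System.out.println("subject: " + g.dependents("chased-3", "nsubj"));
        System.out.println("object:  " + g.dependents("chased-3", "dobj"));
    }
}
```

With the dependencies in a graph, finding the subject and object of a verb is a matter of following its `nsubj` and `dobj` edges, which is the starting point for triple extraction.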
Cypher Natural Language to RDF/SPARQL transcoder – Wanted! Dead or Alive!
Posted in ontology, Text Annotation, Text Extraction, tools, tagged Cyper, Help, NLP to RDF, RDF, SPARQL on April 29, 2010 | 1 Comment »
I’m still looking for the holy grail of converting natural language to RDF. Today I was reading some interesting material about the Cypher Natural Language to RDF/SPARQL transcoder. Cypher is an AI program that generates the .rdf (RDF graph) and .serql (SeRQL query) representations of plain language input, allowing users to speak plain language to update [...]
Jape Rules
Posted in knowledge representation, Text Annotation, Text Extraction, tools, tagged annotation, fst, GATE, JAPE, parser, shallow parser on April 22, 2010 | Leave a Comment »
JAPE is the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite state transducer that operates over annotations based on regular expressions. Thus it is useful for pattern-matching, semantic extraction, and many other operations over syntactic trees such as those produced by natural language [...]
Porter & Stemming
Posted in information retrieval, Text Extraction, tagged IR, porter, stemm, stemming, te on April 21, 2010 | Leave a Comment »
Stemming is the task of reducing words to their root form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The algorithm has been a long-standing problem in [...]
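The point that a stem need not be a valid root can be seen with a deliberately naive suffix stripper. This is not the Porter algorithm, only a toy sketch: the class `SuffixStemSketch`, its suffix list, and the minimum stem length of three characters are all assumptions chosen for illustration.

```java
// Toy suffix stripper; NOT the Porter stemmer, just an illustration.
final class SuffixStemSketch {

    // Suffixes are tried in order; the first one that matches is removed.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words are left alone.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Related forms collapse to the same stem "argu", which is not a word.
        System.out.println(stem("argued"));
        System.out.println(stem("arguing"));
        System.out.println(stem("argues"));
    }
}
```

All three forms map to `argu`, which is enough for retrieval purposes even though it is not a dictionary word. A real stemmer such as Porter's applies many more rules and conditions than this.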
Representing the Structure of Sentences
Posted in knowledge representation, natural language processing, Text Annotation, Text Extraction, tagged information extraction, knowledge representation, NLP, semantic network, unstructured data on April 20, 2010 | Leave a Comment »
We can use two types of grammars to represent the structure of sentences in natural languages: constituency grammars and dependency grammars. Constituency grammars describe a phrase-structure syntax. In a dependency structure, pairs of words are related by a grammatical link called a dependency. In dependency grammars the syntactic analysis of text is based on dependencies between [...]
Documents Collection – Wanted dead or alive!
Posted in information retrieval, question and answering, Text Extraction, trec, tagged IR, q&a, trec on April 20, 2010 | 5 Comments »
Does anyone know of a large, free document corpus for testing question answering and information retrieval systems? Something TREC-like! Unfortunately, the TREC document collection requires a password to download. In fact, it would be nice if I could get a document set on one specific topic (e.g. retail business). Moreover, it would [...]
Swiss Army Tools: Java Speech recognizers
Posted in Text Extraction, tools, tagged java, skype4java, speech, speech recognizer, sphinx on April 19, 2010 | Leave a Comment »
Here are two good Java speech recognition tools/libraries. Sphinx-4: Sphinx-4 is a state-of-the-art speech recognition system written entirely in Java. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California [...]
