Archive for the ‘Text Extraction’ Category

A software engine that pulls together facts by combing through more than 500 million Web pages has been developed by researchers at the University of Washington. The tool extracts information from billions of lines of text by analyzing basic relationships between words. The University of Washington project represents a scaling up of an existing technology [...]

Triple extraction (Subject, Predicate, Object) is a good way to translate free-form sentences into knowledge. The GATE Predicate-Argument Extractor Component (PAX) could be very useful for this task. PAX is a GATE component for extracting predicate-argument structures (PAS). PASs are used in various contexts to represent relations within a sentence structure. Different “semantic” parsers extract [...]

The Stanford Parser just returns a list of dependencies between word tokens. To manipulate the dependencies, we will almost certainly want to put them in a graph data structure. We are going to try this using JGraphT. JGraphT is a free Java graph library that provides mathematical graph-theory objects and algorithms. JGraphT supports various types [...]
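
To make the idea concrete, here is a minimal sketch of loading typed dependencies into a directed, labeled graph. It makes several assumptions: the dependencies are hand-written for the sentence "The dog chased the cat" rather than produced by the Stanford Parser, tokens are named in the `word-index` style, and the class `DependencyGraph` with its methods is hypothetical. To stay runnable without extra libraries it uses a plain adjacency map; in a JGraphT-based version, one of JGraphT's directed graph classes would take the place of the map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a graph library: a directed graph whose
// edges run from a governor token to a dependent token and carry the
// grammatical relation as a label.
class DependencyGraph {

    // One outgoing edge: the relation label and the dependent token.
    static final class Edge {
        final String relation;
        final String dependent;

        Edge(String relation, String dependent) {
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    // Adjacency map: governor token -> outgoing labeled edges.
    private final Map<String, List<Edge>> outgoing = new LinkedHashMap<>();

    // Adds both tokens as vertices and one labeled edge between them.
    public void addDependency(String relation, String governor, String dependent) {
        outgoing.computeIfAbsent(governor, k -> new ArrayList<>())
                .add(new Edge(relation, dependent));
        outgoing.computeIfAbsent(dependent, k -> new ArrayList<>());
    }

    // Returns the dependents attached to a governor by the given relation.
    public List<String> dependentsOf(String governor, String relation) {
        List<String> result = new ArrayList<>();
        for (Edge e : outgoing.getOrDefault(governor, List.of())) {
            if (e.relation.equals(relation)) {
                result.add(e.dependent);
            }
        }
        return result;
    }

    public int vertexCount() {
        return outgoing.size();
    }

    public static void main(String[] args) {
        // Hand-written dependencies for "The dog chased the cat" (assumed input).
        DependencyGraph g = new DependencyGraph();
        g.addDependency("det", "dog-2", "The-1");
        g.addDependency("nsubj", "chased-3", "dog-2");
        g.addDependency("det", "cat-5", "the-4");
        g.addDependency("dobj", "chased-3", "cat-5");

        System.out.println("subject of chased-3: " + g.dependentsOf("chased-3", "nsubj"));
        System.out.println("object of chased-3: " + g.dependentsOf("chased-3", "dobj"));
    }
}
```

Once the dependencies are in a graph, questions such as "what is the subject of this verb?" become simple edge lookups, and a graph library adds traversal and path algorithms on top.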

I’m still looking for the holy grail of converting natural language to RDF. Today I was reading some interesting material about the Cypher natural language to RDF/SPARQL transcoder. Cypher is an AI program that generates the .rdf (RDF graph) and .serql (SeRQL query) representations of plain language input, allowing users to speak plain language to update [...]

JAPE is the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite state transducer that operates over annotations based on regular expressions. Thus it is useful for pattern-matching, semantic extraction, and many other operations over syntactic trees such as those produced by natural language [...]

Stemming is the task of reducing words to their root form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Stemming has been a long-standing problem in [...]
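
As a small illustration of related words mapping to the same stem, here is a deliberately naive suffix stripper. It is a sketch under stated assumptions, not the Porter algorithm or any other published stemmer: the class name `NaiveStemmer`, the suffix list, and the minimum stem length of three characters are all made up for the example.

```java
import java.util.Locale;

// Hypothetical, deliberately naive stemmer: strips the first matching
// suffix from a fixed list. Not the Porter algorithm.
class NaiveStemmer {

    // Checked in order, so "es" is tried before the shorter "s" and "e".
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s", "e"};

    // Never strip a suffix if fewer than this many characters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            boolean longEnough = w.length() - suffix.length() >= MIN_STEM_LENGTH;
            if (w.endsWith(suffix) && longEnough) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w; // no suffix applied; the word is its own stem
    }

    public static void main(String[] args) {
        for (String word : new String[] {"argue", "argued", "argues", "arguing"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

All four forms of "argue" reduce to "argu", which is not a valid English word but still works as a shared key for the related forms. The minimum-length guard keeps short words such as "red" from being cut down to a single letter.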

We can use two types of grammars to represent the structure of sentences in natural languages: constituency grammars and dependency grammars. Constituency grammars describe a phrase-structure syntax. In dependency grammars, pairs of words are related by a grammatical link called a dependency. In dependency grammars the syntactic analysis of text is based on dependencies between [...]

Does anyone know of a large, free document corpus for testing Question Answering and Information Retrieval systems? Something TREC-like! Unfortunately, the TREC document collection requires a password to download. In fact, it would be nice if I could get a document set on one specific topic (e.g. retail business). Moreover, it would [...]

Here are two good Java speech recognition tools/libraries: Sphinx-4. Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java programming language. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California [...]
